1. 20 Jun, 2018 3 commits
  2. 06 Jun, 2018 1 commit
  3. 02 Jun, 2018 7 commits
  4. 31 May, 2018 1 commit
  5. 17 May, 2018 2 commits
  6. 16 May, 2018 1 commit
  7. 10 May, 2018 2 commits
    • D
      iomap: Use FUA for pure data O_DSYNC DIO writes · 3460cac1
      Dave Chinner committed
      If we are doing direct IO writes with datasync semantics, we often
      have to flush metadata changes along with the data write. However,
      if we are overwriting existing data, there are no metadata changes
      that we need to flush. In this case, optimising the IO by using
      FUA write makes sense.
      
      We know from the IOMAP_F_DIRTY flag whether a specific inode
      requires a metadata flush - this is currently used by DAX to ensure
      extent modifications are stable in page fault operations. For direct
      IO writes, we can use it to determine if we need to flush metadata
      or not once the data is on disk.
      
      Hence if we have been returned a mapped extent that is not new and
      the IO mapping is not dirty, then we can use a FUA write to provide
      datasync semantics. This allows us to short-cut the
      generic_write_sync() call in IO completion and hence avoid
      unnecessary operations. This makes pure direct IO data write
      behaviour identical to the way block devices use REQ_FUA to provide
      datasync semantics.
      
      On a FUA enabled device, a synchronous direct IO write workload
      (sequential 4k overwrites in 32MB file) had the following results:
      
      # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
      
      kernel		time	write()s	write iops	Write b/w
      ------		----	--------	----------	---------
      (no dsync)	 4s	2173/s		2173		8.5MB/s
      vanilla		22s	 370/s		 750		1.4MB/s
      patched		19s	 420/s		 420		1.6MB/s
      
      The patched code clearly doesn't send cache flushes anymore, but
      instead uses FUA (confirmed via blktrace), and performance improves
      a bit as a result. However, the benefits will be higher on workloads
      that mix O_DSYNC overwrites with other write IO as we won't be
      flushing the entire device cache on every DSYNC overwrite IO
      anymore.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      3460cac1
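The decision described in the commit above can be pictured with a small userspace model. This is only a sketch: the `sk_` names, the flag values and the `dev_has_fua` parameter are hypothetical stand-ins, not the kernel's definitions. It shows the stated condition: a datasync write may use FUA only when the extent is already mapped, is not newly allocated, and the mapping carries no dirty metadata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's iomap types and flags. */
enum sk_iomap_type { SK_IOMAP_HOLE, SK_IOMAP_UNWRITTEN, SK_IOMAP_MAPPED };
#define SK_IOMAP_F_NEW   (1u << 0)  /* blocks were newly allocated */
#define SK_IOMAP_F_DIRTY (1u << 1)  /* metadata still needs a flush */

struct sk_iomap {
    enum sk_iomap_type type;
    unsigned int flags;
};

/*
 * Return true when a datasync direct write to this mapping could be made
 * stable with a FUA write alone, i.e. without a later cache flush.
 */
static bool sk_dio_can_use_fua(const struct sk_iomap *map, bool dsync,
                               bool dev_has_fua)
{
    assert(map != NULL);
    if (!dsync || !dev_has_fua)
        return false;
    /* Holes and unwritten extents need metadata updates on completion. */
    if (map->type != SK_IOMAP_MAPPED)
        return false;
    /* New or dirty mappings still have metadata to flush. */
    return (map->flags & (SK_IOMAP_F_NEW | SK_IOMAP_F_DIRTY)) == 0;
}
```

A pure overwrite (mapped, no flags set) passes the check. Any write that allocates blocks or leaves dirty metadata falls back to the full sync path.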
    • D
      iomap: iomap_dio_rw() handles all sync writes · 4f8ff44b
      Dave Chinner committed
      Currently iomap_dio_rw() only handles (data)sync write completions
      for AIO. This means we can't optimise non-AIO IO to minimise device
      flushes as we can't tell the caller whether a flush is required or
      not.
      
      To solve this problem and enable further optimisations, make
      iomap_dio_rw responsible for data sync behaviour for all IO, not
      just AIO.
      
      In doing so, the sync operation is now accounted as part of the DIO
      IO by inode_dio_end(), hence post-IO data stability updates will no
      longer race against operations that serialise via inode_dio_wait()
      such as truncate or hole punch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      4f8ff44b
  8. 29 Jan, 2018 1 commit
  9. 09 Jan, 2018 1 commit
    • D
      iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Darrick J. Wong committed
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      5a9d929d
  10. 04 Nov, 2017 1 commit
  11. 17 Oct, 2017 1 commit
    • E
      fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Eryu Guan committed
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for the iomap based direct
      write path, but placed it before the dio->end_io() call, which
      re-introduced the bug fixed by commit c771c14b ("iomap: invalidate
      page caches should be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure the next buffered read sees the final real allocations, not
      unwritten extents. I also add some comments about why end_io() should
      go first in case we get it wrong again in the future.
      
      Note that, there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() in generic_file_direct_write() call, but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, also be consistent with iomap_dio_complete().
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      5e25c269
  12. 12 Oct, 2017 1 commit
    • A
      iomap_dio_actor(): fix iov_iter bugs · cfe057f7
      Al Viro committed
      1) Ignoring return value from iov_iter_zero() is wrong
      for iovec-backed case as well as for pipes - it can fail.
      
      2) Failure to fault destination pages in 25Mb into a 50Mb iovec
      should not act as if nothing in the area had been read, never mind
      that the first 25Mb might have *already* been read by that point.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cfe057f7
  13. 02 Oct, 2017 2 commits
  14. 27 Sep, 2017 1 commit
    • C
      iomap_dio_rw: Allocate AIO completion queue before submitting dio · 546e7be8
      Chandan Rajendra committed
      Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
      machine can cause the following NULL pointer dereference,
      
      .queue_work_on+0x4c/0x80
      .iomap_dio_bio_end_io+0xbc/0x1f0
      .bio_endio+0x118/0x1f0
      .blk_update_request+0xd0/0x470
      .blk_mq_end_request+0x24/0xc0
      .lo_complete_rq+0x40/0xe0
      .__blk_mq_complete_request_remote+0x28/0x40
      .flush_smp_call_function_queue+0xc4/0x1e0
      .smp_ipi_demux_relaxed+0x8c/0x100
      .icp_hv_ipi_action+0x54/0xa0
      .__handle_irq_event_percpu+0x84/0x2c0
      .handle_irq_event_percpu+0x28/0x80
      .handle_percpu_irq+0x78/0xc0
      .generic_handle_irq+0x40/0x70
      .__do_irq+0x88/0x200
      .call_do_irq+0x14/0x24
      .do_IRQ+0x84/0x130
      
      This occurs due to the following sequence of events,
      
      1. Allocate dio for Direct I/O write.
      2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
         - Assume that we have submitted at least one bio. Hence iomap_dio->ref value
           will be >= 2.
         - If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
           break out of the loop and since the 'ret' value is a negative number we
           end up not allocating memory for super_block->s_dio_done_wq.
      3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
         submitted and here the code ends up dereferencing the NULL pointer stored
         at super_block->s_dio_done_wq.
      
      This commit fixes the bug by allocating memory for
      super_block->s_dio_done_wq before iomap_apply() is invoked.
      Reported-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      546e7be8
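The ordering problem in the commit above can be modelled in a few lines of userspace C. The `sk_` names are hypothetical, and a boolean stands in for the `super_block->s_dio_done_wq` pointer. The sketch sets up the completion queue before the submission loop, so bios submitted before a mid-loop -ENOSPC still find it when they complete.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the flag stands in for sb->s_dio_done_wq != NULL. */
struct sk_super {
    bool done_wq_ready;
};

/* A completion needs the queue; false models the NULL dereference. */
static bool sk_bio_end_io(const struct sk_super *sb)
{
    return sb->done_wq_ready;
}

/*
 * Submit up to nr_extents bios, failing with -ENOSPC at index fail_at
 * (pass -1 for no failure). Counts completions that found the queue.
 */
static int sk_dio_write(struct sk_super *sb, int nr_extents, int fail_at,
                        int *completed_ok)
{
    int submitted = 0, ret = 0, i;

    assert(sb != NULL && completed_ok != NULL);

    /* The fix: make the queue available before anything is submitted. */
    sb->done_wq_ready = true;

    for (i = 0; i < nr_extents; i++) {
        if (i == fail_at) {
            ret = -ENOSPC;
            break;
        }
        submitted++;
    }

    /* Every bio that was submitted completes, even on the error path. */
    *completed_ok = 0;
    for (i = 0; i < submitted; i++)
        if (sk_bio_end_io(sb))
            (*completed_ok)++;

    return ret;
}
```

If the queue setup were moved after the loop and skipped on a negative `ret`, the one bio submitted before the failure would complete against a missing queue, which is the sequence the commit message describes.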
  15. 25 Sep, 2017 1 commit
    • L
      fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Lukas Czerner committed
      Currently when mixing buffered reads and asynchronous direct writes it
      is possible to end up with the situation where we have stale data in the
      page cache while the new data is already written to disk. This is
      permanent until the affected pages are flushed away. Despite the fact
      that mixing buffered and direct IO is ill-advised, it does pose a threat
      to data integrity, is unexpected and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context in the case that there are mapped pages to be found in
      the inode. Later before the completion in dio_complete() invalidate
      the pages in question. This ensures that after the completion the pages
      in the written area are either unmapped, or populated with up-to-date
      data. Also do the same for the iomap case which uses
      iomap_dio_complete() instead.
      
      This has a side effect of deferring the completion to a process context
      for every AIO DIO that happens on inode that has pages mapped. However
      since the consensus is that this is ill-advised practice the performance
      implication should not be a problem.
      
      This was based on proposal from Jeff Moyer, thanks!
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      332391a9
  16. 02 Sep, 2017 1 commit
  17. 24 Aug, 2017 1 commit
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig committed
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      74d46992
  18. 12 Aug, 2017 1 commit
  19. 14 Jul, 2017 1 commit
  20. 03 Jul, 2017 1 commit
  21. 28 Jun, 2017 1 commit
  22. 20 Jun, 2017 1 commit
  23. 09 Jun, 2017 1 commit
  24. 09 May, 2017 1 commit
  25. 04 May, 2017 1 commit
    • A
      fs: fix data invalidation in the cleancache during direct IO · 55635ba7
      Andrey Ryabinin committed
      Patch series "Properly invalidate data in the cleancache", v2.
      
      We've noticed that after direct IO write, buffered read sometimes gets
      stale data which is coming from the cleancache.  The reason for this is
      that some direct write hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero, so we may not invalidate
      data in the cleancache.
      
      Another odd thing is that we check only for ->nrpages and don't check
      for ->nrexceptional, but invalidate_inode_pages2[_range] also
      invalidates exceptional entries as well.  So we invalidate exceptional
      entries only if ->nrpages != 0? This doesn't feel right.
      
       - Patch 1 fixes direct IO writes by removing ->nrpages check.
       - Patch 2 fixes similar case in invalidate_bdev().
           Note: I only fixed conditional cleancache_invalidate_inode() here.
             Do we also need to add ->nrexceptional check into invalidate_bdev()?
      
       - Patches 3-4: some optimizations.
      
      This patch (of 4):
      
      Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero.  This can't be right,
      because invalidate_inode_pages2[_range]() also invalidates data in the
      cleancache via cleancache_invalidate_inode() call.  So if page cache is
      empty but there is some data in the cleancache, buffered read after
      direct IO write would get stale data from the cleancache.
      
      Also it doesn't feel right to check only for ->nrpages because
      invalidate_inode_pages2[_range] invalidates exceptional entries as well.
      
      Fix this by calling invalidate_inode_pages2[_range]() regardless of
      nrpages state.
      
      Note: nfs, cifs and 9p don't need a similar fix because they never call
      cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
      they are not affected by this bug.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55635ba7
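The effect of the `->nrpages` check in the commit above can be illustrated with a toy model. Everything here is hypothetical: the `sk_` names and the two counters only mimic a mapping whose page cache and cleancache are tracked separately. The sketch compares the conditional invalidation with the unconditional one when the page cache is empty but the cleancache still holds data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an address space with a cleancache behind it. */
struct sk_mapping {
    unsigned long nrpages;          /* pages in the page cache */
    unsigned long cleancache_pages; /* pages held only by the cleancache */
};

/* Models invalidate_inode_pages2(): drops page cache and cleancache data. */
static void sk_invalidate(struct sk_mapping *m)
{
    assert(m != NULL);
    m->nrpages = 0;
    m->cleancache_pages = 0;
}

/* Old behaviour: invalidation skipped when the page cache looks empty. */
static void sk_after_dio_write_old(struct sk_mapping *m)
{
    if (m->nrpages)
        sk_invalidate(m);
}

/* Fixed behaviour: invalidate regardless of nrpages. */
static void sk_after_dio_write_fixed(struct sk_mapping *m)
{
    sk_invalidate(m);
}

/* A later buffered read can be stale if the cleancache kept old data. */
static bool sk_buffered_read_may_be_stale(const struct sk_mapping *m)
{
    return m->cleancache_pages != 0;
}
```

With `nrpages == 0` and a non-empty cleancache, the old path leaves the stale copy in place, while the fixed path clears it.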
  26. 26 Apr, 2017 2 commits
    • D
      filesystem-dax: convert to dax_direct_access() · cccbce67
      Dan Williams committed
      Now that a dax_device is plumbed through all dax-capable drivers we can
      switch from block_device_operations to dax_operations for invoking
      ->direct_access.
      
      This also lets us kill off some usages of struct blk_dax_ctl on the way
      to its eventual removal.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      cccbce67
    • C
      iomap_dio_rw: Prevent reading file data beyond iomap_dio->i_size · a008c31c
      Chandan Rajendra committed
      On a ppc64 machine executing overlayfs/019 with xfs as the lower and
      upper filesystem causes the following call trace,
      
      WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
      Modules linked in:
      CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
      task: c000000631314880 task.stack: c0000003915d4000
      NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
      REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
      MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
        CR: 24004284  XER: 00000000
      CFAR: c0000000006f7190 SOFTE: 1
      GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
      GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
      GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
      GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
      GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
      GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
      GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
      GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
      NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
      LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
      Call Trace:
      [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
      [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
      [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
      [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
      [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
      [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
      [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
      [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
      [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
      Instruction dump:
      78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
      2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378
      
      The above problem can also be recreated on a regular xfs filesystem
      using the command,
      
      $ fsstress -d /mnt -l 1000 -n 1000 -p 1000
      
      The reason for the call trace is,
      1. When 'reserving' blocks for delayed allocation, XFS reserves more
         blocks (i.e. past file's current EOF) than required. This is done
         because XFS assumes that userspace might write more data and hence
         'reserving' more blocks might lead to the file's new data being
         stored contiguously on disk.
      2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
         then cover the prealloc-ed EOF blocks in addition to the regular blocks.
      3. When flushing the dirty blocks to disk, we only flush data till the
         file's EOF. But before writing out the dirty data, we allocate blocks
         on the disk for holding the file's new data. This allocation includes
         the blocks that are part of the 'prealloc EOF blocks'.
      4. Later, when the last reference to the inode is being closed, XFS frees the
         unused 'prealloc EOF blocks' in xfs_inactive().
      
      In step 3 above, when allocating space on disk for the delayed allocation
      range, the space allocator might sometimes allocate fewer blocks than
      required. If such an allocation ends right at the current EOF of the
      file, we will not be able to clear the "delayed allocation" flag for the
      'prealloc EOF blocks', since we won't have dirty buffer heads associated
      with that range of the file.
      
      In such a situation if a Direct I/O read operation is performed on file
      range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
      range [X, Y] and invalidate page cache for that range (Refer to
      iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
      the extent items (which are still cached in memory) for the file
      range. When doing so we are not supposed to get an extent item with
      IOMAP_DELALLOC flag set, since the previous "flush" operation should
      have converted any delayed allocation data in the range [X, Y]. Hence we
      end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().
      
      This commit fixes the bug by preventing the read operation from going
      beyond iomap_dio->i_size.
      Reported-by: Santhosh G <santhog4@linux.vnet.ibm.com>
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      a008c31c
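The idea of keeping a direct read from going beyond `iomap_dio->i_size` can be sketched as a length clamp. This is an illustration only: `sk_dio_read_len` is a hypothetical helper, and the commit itself may stop the mapping loop rather than compute a clamped length. The arithmetic shows which part of a request [pos, pos + count) lies at or below EOF.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of bytes of a read starting at pos that
 * fall within a file of size i_size. Bytes past EOF are never mapped,
 * so stale delalloc extents beyond EOF are not encountered.
 */
static long long sk_dio_read_len(long long pos, long long count,
                                 long long i_size)
{
    assert(pos >= 0 && count >= 0 && i_size >= 0);

    if (pos >= i_size)
        return 0;               /* read starts at or past EOF */
    if (count > i_size - pos)
        return i_size - pos;    /* trim the tail that crosses EOF */
    return count;
}
```

For the range [X, Y] in the commit message, with X < EOF and Y > EOF, only the bytes from X up to EOF would be looked up.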
  27. 07 Mar, 2017 1 commit
    • E
      iomap: invalidate page caches should be after iomap_dio_complete() in direct write · c771c14b
      Eryu Guan committed
      After XFS switching to iomap based DIO (commit acdda3aa ("xfs:
      use iomap_dio_rw")), I started to notice dio29/dio30 tests failures
      from LTP run on ppc64 hosts, and they can be reproduced on x86_64
      hosts with 512B/1k block size XFS too.
      
      dio29	diotest3 -b 65536 -n 100 -i 1000 -o 1024000
      dio30	diotest6 -b 65536 -n 100 -i 1000 -o 1024000
      
      The failure message is like:
      bufcmp: offset 0: Expected: 0x62, got 0x0
      diotest03    1  TPASS  :  Read with Direct IO, Write without
      diotest03    2  TFAIL  :  diotest3.c:142: comparsion failed; child=98 offset=1425408
      diotest03    3  TFAIL  :  diotest3.c:194: Write Direct-child 98 failed
      
      Direct write wrote 0x62 but buffer read got zero. This is because,
      when doing direct write to a hole or preallocated file, we
      invalidate the page caches before converting the extent from
      unwritten state to normal state, which is done by
      iomap_dio_complete(), thus leave a window for other buffer reader to
      cache the unwritten state extent.
      
      Consider this case, with sub-page blocksize XFS, two processes are
      direct writing to different blocksize-aligned regions (say 512B) of
      the same preallocated file, and reading the region back via buffered
      I/O to compare contents.
      
      process A, region [0,512]		process B, region [512,1024]
      xfs_file_write_iter
       xfs_file_aio_dio_write
        iomap_dio_rw
         iomap_apply
         invalidate_inode_pages2_range
         					xfs_file_write_iter
      				 	xfs_file_aio_dio_write
      					  iomap_dio_rw
      					   iomap_apply
      					   invalidate_inode_pages2_range
      					   iomap_dio_complete
      					xfs_file_read_iter
      					 xfs_file_buffered_aio_read
      					  generic_file_read_iter
      					   do_generic_file_read
      					    <readahead fills pagecache with 0>
         iomap_dio_complete
      xfs_file_read_iter
       <read gets 0 from pagecache>
      
      Process A first invalidates page caches. At this point the
      underlying extent is still in unwritten state (iomap_dio_complete
      not called yet), and process B finishes its direct write and populates
      page caches via readahead, which caches zeros in the page for region A.
      Process A then reads zeros from the page cache instead of the actual
      data.
      
      Fix it by invalidating page caches after converting unwritten extent
      to make sure we read content from disk after extent state changed,
      as what we did before switching to iomap based dio.
      
      Also introduce a new 'start' variable to save the original write
      offset (iomap_dio_complete() updates iocb->ki_pos), and an 'err'
      variable for the cache invalidation result, because we can't reuse
      'ret' anymore.
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      c771c14b
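The race in the commit above comes down to the order of two steps on the writer's side. The following single-threaded model uses hypothetical `sk_` names, and a call placed between the two steps stands in for the concurrent buffered reader. It compares invalidating before the unwritten extent conversion with invalidating after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-byte "file": extent state, on-disk data, page cache. */
struct sk_file {
    bool unwritten;   /* extent still marked unwritten */
    int disk_byte;    /* what the direct write put on disk */
    bool cached;      /* page cache holds a copy */
    int cache_byte;
};

static void sk_invalidate(struct sk_file *f) { f->cached = false; }
static void sk_convert_extent(struct sk_file *f) { f->unwritten = false; }

/* Buffered read: an unwritten extent reads back as zeros and is cached. */
static int sk_buffered_read(struct sk_file *f)
{
    if (!f->cached) {
        f->cache_byte = f->unwritten ? 0 : f->disk_byte;
        f->cached = true;
    }
    return f->cache_byte;
}

/*
 * Direct write followed by a buffered read-back. A racing reader runs
 * between the two completion steps in both orderings.
 */
static int sk_dio_write_then_read(struct sk_file *f, int byte,
                                  bool convert_first)
{
    assert(f != NULL);
    f->disk_byte = byte;              /* data reaches the disk */
    if (convert_first) {
        sk_convert_extent(f);         /* fixed order: convert ... */
        (void)sk_buffered_read(f);    /* racing reader */
        sk_invalidate(f);             /* ... then invalidate */
    } else {
        sk_invalidate(f);             /* old order: invalidate ... */
        (void)sk_buffered_read(f);    /* racing reader caches zeros */
        sk_convert_extent(f);         /* ... then convert */
    }
    return sk_buffered_read(f);
}
```

In the old order the racing reader repopulates the cache from an extent that is still unwritten, so the read-back returns zero. In the fixed order the invalidation happens last, and the read-back comes from the converted extent.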
  28. 02 Mar, 2017 1 commit