1. 04 7月, 2018 3 次提交
  2. 21 6月, 2018 1 次提交
  3. 20 6月, 2018 4 次提交
  4. 06 6月, 2018 1 次提交
  5. 02 6月, 2018 7 次提交
  6. 31 5月, 2018 1 次提交
  7. 17 5月, 2018 2 次提交
  8. 16 5月, 2018 1 次提交
  9. 10 5月, 2018 2 次提交
    • D
      iomap: Use FUA for pure data O_DSYNC DIO writes · 3460cac1
      Dave Chinner 提交于
      If we are doing direct IO writes with datasync semantics, we often
      have to flush metadata changes along with the data write. However,
      if we are overwriting existing data, there are no metadata changes
      that we need to flush. In this case, optimising the IO by using
      FUA write makes sense.
      
      We know from the IOMAP_F_DIRTY flag as to whether a specific inode
      requires a metadata flush - this is currently used by DAX to ensure
      extent modification as stable in page fault operations. For direct
      IO writes, we can use it to determine if we need to flush metadata
      or not once the data is on disk.
      
      Hence if we have been returned a mapped extent that is not new and
      the IO mapping is not dirty, then we can use a FUA write to provide
      datasync semantics. This allows us to short-cut the
      generic_write_sync() call in IO completion and hence avoid
      unnecessary operations. This makes pure direct IO data write
      behaviour identical to the way block devices use REQ_FUA to provide
      datasync semantics.
      
      On a FUA enabled device, a synchronous direct IO write workload
      (sequential 4k overwrites in 32MB file) had the following results:
      
      # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
      
      kernel		time	write()s	write iops	Write b/w
      ------		----	--------	----------	---------
      (no dsync)	 4s	2173/s		2173		8.5MB/s
      vanilla		22s	 370/s		 750		1.4MB/s
      patched		19s	 420/s		 420		1.6MB/s
      
      The patched code clearly doesn't send cache flushes anymore, but
      instead uses FUA (confirmed via blktrace), and performance improves
      a bit as a result. However, the benefits will be higher on workloads
      that mix O_DSYNC overwrites with other write IO as we won't be
      flushing the entire device cache on every DSYNC overwrite IO
      anymore.
      Signed-Off-By: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      3460cac1
    • D
      iomap: iomap_dio_rw() handles all sync writes · 4f8ff44b
      Dave Chinner 提交于
      Currently iomap_dio_rw() only handles (data)sync write completions
      for AIO. This means we can't optimised non-AIO IO to minimise device
      flushes as we can't tell the caller whether a flush is required or
      not.
      
      To solve this problem and enable further optimisations, make
      iomap_dio_rw responsible for data sync behaviour for all IO, not
      just AIO.
      
      In doing so, the sync operation is now accounted as part of the DIO
      IO by inode_dio_end(), hence post-IO data stability updates will no
      long race against operations that serialise via inode_dio_wait()
      such as truncate or hole punch.
      Signed-Off-By: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4f8ff44b
  10. 29 1月, 2018 1 次提交
  11. 09 1月, 2018 1 次提交
    • D
      iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Darrick J. Wong 提交于
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      5a9d929d
  12. 04 11月, 2017 1 次提交
  13. 17 10月, 2017 1 次提交
    • E
      fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Eryu Guan 提交于
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
      path, but before the dio->end_io() call, and it re-introdued the bug
      fixed by commit c771c14b ("iomap: invalidate page caches should
      be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure next buffer read reads the final real allocations not unwritten
      extents. I also add some comments about why should end_io() go first
      in case we get it wrong again in the future.
      
      Note that, there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() in generic_file_direct_write() call, but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, also be consistent with iomap_dio_complete().
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      5e25c269
  14. 12 10月, 2017 1 次提交
    • A
      iomap_dio_actor(): fix iov_iter bugs · cfe057f7
      Al Viro 提交于
      1) Ignoring return value from iov_iter_zero() is wrong
      for iovec-backed case as well as for pipes - it can fail.
      
      2) Failure to fault destination pages in 25Mb into a 50Mb iovec
      should not act as if nothing in the area had been read, nevermind
      that the first 25Mb might have *already* been read by that point.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cfe057f7
  15. 02 10月, 2017 2 次提交
  16. 27 9月, 2017 1 次提交
    • C
      iomap_dio_rw: Allocate AIO completion queue before submitting dio · 546e7be8
      Chandan Rajendra 提交于
      Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
      machine can cause the following NULL pointer dereference,
      
      .queue_work_on+0x4c/0x80
      .iomap_dio_bio_end_io+0xbc/0x1f0
      .bio_endio+0x118/0x1f0
      .blk_update_request+0xd0/0x470
      .blk_mq_end_request+0x24/0xc0
      .lo_complete_rq+0x40/0xe0
      .__blk_mq_complete_request_remote+0x28/0x40
      .flush_smp_call_function_queue+0xc4/0x1e0
      .smp_ipi_demux_relaxed+0x8c/0x100
      .icp_hv_ipi_action+0x54/0xa0
      .__handle_irq_event_percpu+0x84/0x2c0
      .handle_irq_event_percpu+0x28/0x80
      .handle_percpu_irq+0x78/0xc0
      .generic_handle_irq+0x40/0x70
      .__do_irq+0x88/0x200
      .call_do_irq+0x14/0x24
      .do_IRQ+0x84/0x130
      
      This occurs due to the following sequence of events,
      
      1. Allocate dio for Direct I/O write.
      2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
         - Assume that we have submitted atleast one bio. Hence iomap_dio->ref value
           will be >= 2.
         - If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
           break out of the loop and since the 'ret' value is a negative number we
           end up not allocating memory for super_block->s_dio_done_wq.
      3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
         submitted and here the code ends up dereferencing the NULL pointer stored
         at super_block->s_dio_done_wq.
      
      This commit fixes the bug by allocating memory for
      super_block->s_dio_done_wq before iomap_apply() is invoked.
      Reported-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NEryu Guan <eguan@redhat.com>
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      546e7be8
  17. 25 9月, 2017 1 次提交
    • L
      fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Lukas Czerner 提交于
      Currently when mixing buffered reads and asynchronous direct writes it
      is possible to end up with the situation where we have stale data in the
      page cache while the new data is already written to disk. This is
      permanent until the affected pages are flushed away. Despite the fact
      that mixing buffered and direct IO is ill-advised it does pose a thread
      for a data integrity, is unexpected and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context in the case that there are mapped pages to be found in
      the inode. Later before the completion in dio_complete() invalidate
      the pages in question. This ensures that after the completion the pages
      in the written area are either unmapped, or populated with up-to-date
      data. Also do the same for the iomap case which uses
      iomap_dio_complete() instead.
      
      This has a side effect of deferring the completion to a process context
      for every AIO DIO that happens on inode that has pages mapped. However
      since the consensus is that this is ill-advised practice the performance
      implication should not be a problem.
      
      This was based on proposal from Jeff Moyer, thanks!
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      332391a9
  18. 02 9月, 2017 1 次提交
  19. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  20. 12 8月, 2017 1 次提交
  21. 14 7月, 2017 1 次提交
  22. 03 7月, 2017 1 次提交
  23. 28 6月, 2017 1 次提交
  24. 20 6月, 2017 1 次提交
  25. 09 6月, 2017 1 次提交
  26. 09 5月, 2017 1 次提交