  1. 29 Sep 2018, 1 commit
    • iomap: set page dirty after partial delalloc on mkwrite · 561295a3
      Authored by Brian Foster
      The iomap page fault mechanism currently dirties the associated page
      after the full block range of the page has been allocated. This
      leaves the page susceptible to delayed allocations without ever
      being set dirty on sub-page block sized filesystems.
      
      For example, consider a page fault on a page with one preexisting
      real (non-delalloc) block allocated in the middle of the page. The
      first iomap_apply() iteration performs delayed allocation on the
      range up to the preexisting block, the next iteration finds the
      preexisting block, and the last iteration attempts to perform
      delayed allocation on the range after the preexisting block to the
      end of the page. If the first allocation succeeds and the final
      allocation fails with -ENOSPC, iomap_apply() returns the error and
      iomap_page_mkwrite() fails to dirty the page having already
      performed partial delayed allocation. This eventually results in the
      page being invalidated without ever converting the delayed
      allocation to real blocks.
      
      This problem is reliably reproduced by generic/083 on XFS on ppc64
      systems (64k page size, 4k block size). It results in leaked
      delalloc blocks on inode reclaim, which triggers an assert failure
      in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
      
      Move the set_page_dirty() call from iomap_page_mkwrite() to the
      actor callback, similar to how the buffer head implementation works.
      The actor callback is called iff ->iomap_begin() returns success, so
      this ensures the page is dirtied as soon as possible after an
      allocation.
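
      Roughly, the resulting shape is as follows. This is only a sketch with
      the body of the actor elided; the actor name and signature follow the
      iomap actor convention of the time, not the exact patch:

      static loff_t
      iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
                      void *data, struct iomap *iomap)
      {
              struct page *page = data;

              /* ... per-block setup for the range just mapped elided ... */

              /*
               * Dirty the page once per successfully mapped range, instead
               * of once in iomap_page_mkwrite() after every range on the
               * page has been allocated.
               */
              set_page_dirty(page);
              return length;
      }
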
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 14 Aug 2018, 1 commit
  3. 12 Aug 2018, 1 commit
  4. 03 Aug 2018, 1 commit
    • fs: fix iomap_bmap position calculation · 79b3dbe4
      Authored by Eric Sandeen
      The position calculation in iomap_bmap() shifts bno the wrong way,
      so we don't progress properly and end up re-mapping block zero
      over and over, yielding an unchanging physical block range as the
      logical block advances:
      
      # filefrag -Be file
       ext:   logical_offset:     physical_offset: length:   expected: flags:
         0:      0..       0:      21..        21:      1:             merged
         1:      1..       1:      21..        21:      1:         22: merged
      Discontinuity: Block 1 is at 21 (was 22)
         2:      2..       2:      21..        21:      1:         22: merged
      Discontinuity: Block 2 is at 21 (was 22)
         3:      3..       3:      21..        21:      1:         22: merged
      
      This breaks the FIBMAP interface for anyone using it (XFS), which
      in turn breaks LILO, zipl, etc.
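
      The gist of the fix, as a before/after sketch (variable names match the
      surrounding iomap_bmap() code; the rest of the function is elided): bno
      is a logical block number, so turning it into a byte position needs a
      left shift by the block size bits.

      loff_t pos;

      pos = bno >> inode->i_blkbits;  /* before: wrong way, collapses to 0  */
      pos = bno << inode->i_blkbits;  /* after: block number -> byte offset */
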
      Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
      Fixes: 89eb1906 ("iomap: add an iomap-based bmap implementation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 12 Jul 2018, 1 commit
    • iomap: add support for sub-pagesize buffered I/O without buffer heads · 9dc55f13
      Authored by Christoph Hellwig
      After already supporting a simple implementation of buffered writes for
      the blocksize == PAGE_SIZE case in the last commit, this adds full support
      even for smaller block sizes. There are three bits of per-block
      information in the buffer_head structure that really matter for the iomap
      read and write path:
      
       - uptodate status (BH_uptodate)
       - marked as currently under read I/O (BH_Async_Read)
       - marked as currently under write I/O (BH_Async_Write)
      
      Instead of having new per-block structures this now adds a per-page
      structure called struct iomap_page to track this information in a slightly
      different form:
      
       - a bitmap for the per-block uptodate status.  For worst case of a 64k
         page size system this bitmap needs to contain 128 bits.  For the
         typical 4k page size case it only needs 8 bits, although we still
         need a full unsigned long due to the way the atomic bitmap API works.
       - two atomic_t counters are used to track the outstanding read and write
         counts
      
      There is quite a bit of boilerplate code as the buffered I/O path uses
      various helper methods, but the actual code is very straightforward.
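
      A sketch of the structure described above (the field layout here is
      approximate; see the upstream commit for the exact definition):

      /*
       * Per-page state for sub-page block tracking. The bitmap is sized for
       * the worst case of 512-byte blocks: 128 bits on a 64k-page system,
       * 8 bits (still one unsigned long) with 4k pages.
       */
      struct iomap_page {
              atomic_t        read_count;     /* outstanding read I/O  */
              atomic_t        write_count;    /* outstanding write I/O */
              DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
      };
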
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  6. 04 Jul 2018, 3 commits
  7. 21 Jun 2018, 1 commit
  8. 20 Jun 2018, 4 commits
  9. 06 Jun 2018, 1 commit
  10. 02 Jun 2018, 7 commits
  11. 31 May 2018, 1 commit
  12. 17 May 2018, 2 commits
  13. 16 May 2018, 1 commit
  14. 10 May 2018, 2 commits
    • iomap: Use FUA for pure data O_DSYNC DIO writes · 3460cac1
      Authored by Dave Chinner
      If we are doing direct IO writes with datasync semantics, we often
      have to flush metadata changes along with the data write. However,
      if we are overwriting existing data, there are no metadata changes
      that we need to flush. In this case, optimising the IO by using a
      FUA write makes sense.
      
      We know from the IOMAP_F_DIRTY flag whether a specific inode
      requires a metadata flush - this is currently used by DAX to ensure
      extent modifications are stable in page fault operations. For direct
      IO writes, we can use it to determine if we need to flush metadata
      or not once the data is on disk.
      
      Hence if we have been returned a mapped extent that is not new and
      the IO mapping is not dirty, then we can use a FUA write to provide
      datasync semantics. This allows us to short-cut the
      generic_write_sync() call in IO completion and hence avoid
      unnecessary operations. This makes pure direct IO data write
      behaviour identical to the way block devices use REQ_FUA to provide
      datasync semantics.
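
      A simplified sketch of the resulting check in the direct IO write path
      (the IOMAP_F_* flags, blk_queue_fua() and REQ_FUA are real kernel names;
      the structure of the check is condensed here):

      bool use_fua = false;

      /*
       * A pure overwrite of an already mapped, clean extent needs no
       * metadata flush, so if the device supports FUA we can skip the
       * post-IO cache flush entirely.
       */
      if (!(iomap->flags & (IOMAP_F_NEW | IOMAP_F_DIRTY)) &&
          iomap->type == IOMAP_MAPPED &&
          blk_queue_fua(bdev_get_queue(iomap->bdev)))
              use_fua = true;

      bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
      if (use_fua)
              bio->bi_opf |= REQ_FUA;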
      
      On a FUA enabled device, a synchronous direct IO write workload
      (sequential 4k overwrites in 32MB file) had the following results:
      
      # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
      
      kernel		time	write()s	write iops	Write b/w
      ------		----	--------	----------	---------
      (no dsync)	 4s	2173/s		2173		8.5MB/s
      vanilla		22s	 370/s		 750		1.4MB/s
      patched		19s	 420/s		 420		1.6MB/s
      
      The patched code clearly doesn't send cache flushes anymore, but
      instead uses FUA (confirmed via blktrace), and performance improves
      a bit as a result. However, the benefits will be higher on workloads
      that mix O_DSYNC overwrites with other write IO as we won't be
      flushing the entire device cache on every DSYNC overwrite IO
      anymore.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • iomap: iomap_dio_rw() handles all sync writes · 4f8ff44b
      Authored by Dave Chinner
      Currently iomap_dio_rw() only handles (data)sync write completions
      for AIO. This means we can't optimise non-AIO IO to minimise device
      flushes as we can't tell the caller whether a flush is required or
      not.
      
      To solve this problem and enable further optimisations, make
      iomap_dio_rw responsible for data sync behaviour for all IO, not
      just AIO.
      
      In doing so, the sync operation is now accounted as part of the DIO
      IO by inode_dio_end(), hence post-IO data stability updates will no
      longer race against operations that serialise via inode_dio_wait()
      such as truncate or hole punch.
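
      A minimal sketch of the idea, assuming the IOMAP_DIO_NEED_SYNC dio flag
      introduced for this purpose (simplified, error handling elided):

      /* in iomap_dio_complete(), before inode_dio_end() drops the DIO count */
      if (!dio->error && (dio->flags & IOMAP_DIO_NEED_SYNC))
              ret = generic_write_sync(iocb, ret);
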
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  15. 29 Jan 2018, 1 commit
  16. 09 Jan 2018, 1 commit
    • iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Authored by Darrick J. Wong
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
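
      A condensed sketch of the reporting path (the real helper also prints
      the offending path and device; only the core calls are shown here):

      /* called when post-dio invalidate_inode_pages2_range() fails */
      static void dio_warn_stale_pagecache(struct file *filp)
      {
              struct inode *inode = file_inode(filp);

              /* surface the corruption to userspace on the next fsync() */
              errseq_set(&inode->i_mapping->wb_err, -EIO);

              /* ratelimited breadcrumb instead of the old WARN_ON */
              pr_crit_ratelimited("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n");
      }
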
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  17. 04 Nov 2017, 1 commit
  18. 17 Oct 2017, 1 commit
    • fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Authored by Eryu Guan
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for the iomap based direct write
      path, but before the dio->end_io() call, and it re-introduced the bug
      fixed by commit c771c14b ("iomap: invalidate page caches should
      be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure the next buffered read reads the final real allocations, not unwritten
      extents. I also add some comments about why end_io() should go first
      in case we get it wrong again in the future.
      
      Note that there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() call in generic_file_direct_write(), but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, and to stay consistent with iomap_dio_complete().
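
      A heavily simplified sketch of the corrected ordering in
      iomap_dio_complete() ('offset' and the dio fields are illustrative;
      the end_io arguments follow the iomap code of this era):

      /*
       * 1) Let the filesystem finish the write first - dio->end_io() is
       *    where e.g. XFS converts unwritten extents to real allocations.
       */
      if (dio->end_io)
              ret = dio->end_io(iocb, dio->size, dio->flags);

      /*
       * 2) Only now is it safe to toss the page cache over the written
       *    range: a following buffered read maps the final blocks.
       */
      invalidate_inode_pages2_range(inode->i_mapping,
                      offset >> PAGE_SHIFT,
                      (offset + dio->size - 1) >> PAGE_SHIFT);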
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
  19. 12 Oct 2017, 1 commit
    • iomap_dio_actor(): fix iov_iter bugs · cfe057f7
      Authored by Al Viro
      1) Ignoring the return value from iov_iter_zero() is wrong
      for the iovec-backed case as well as for pipes - it can fail.
      
      2) Failure to fault in destination pages 25MB into a 50MB iovec
      should not act as if nothing in the area had been read, never mind
      that the first 25MB might have *already* been read by that point.
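
      For the first point, a small sketch of the hole/unwritten branch of the
      dio actor ('iter' and 'dio' are illustrative names here): account only
      what iov_iter_zero() actually managed to zero, since it can come up short.

      length = iov_iter_zero(length, iter);
      dio->size += length;
      return length;
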
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 02 Oct 2017, 2 commits
  21. 27 Sep 2017, 1 commit
    • iomap_dio_rw: Allocate AIO completion queue before submitting dio · 546e7be8
      Authored by Chandan Rajendra
      Executing the xfs/104 test in a loop on a Linux v4.13 kernel on a ppc64
      machine can cause the following NULL pointer dereference:
      
      .queue_work_on+0x4c/0x80
      .iomap_dio_bio_end_io+0xbc/0x1f0
      .bio_endio+0x118/0x1f0
      .blk_update_request+0xd0/0x470
      .blk_mq_end_request+0x24/0xc0
      .lo_complete_rq+0x40/0xe0
      .__blk_mq_complete_request_remote+0x28/0x40
      .flush_smp_call_function_queue+0xc4/0x1e0
      .smp_ipi_demux_relaxed+0x8c/0x100
      .icp_hv_ipi_action+0x54/0xa0
      .__handle_irq_event_percpu+0x84/0x2c0
      .handle_irq_event_percpu+0x28/0x80
      .handle_percpu_irq+0x78/0xc0
      .generic_handle_irq+0x40/0x70
      .__do_irq+0x88/0x200
      .call_do_irq+0x14/0x24
      .do_IRQ+0x84/0x130
      
      This occurs due to the following sequence of events:
      
      1. Allocate dio for Direct I/O write.
      2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
         - Assume that we have submitted at least one bio. Hence the iomap_dio->ref value
           will be >= 2.
         - If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
           break out of the loop and since the 'ret' value is a negative number we
           end up not allocating memory for super_block->s_dio_done_wq.
      3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
         submitted and here the code ends up dereferencing the NULL pointer stored
         at super_block->s_dio_done_wq.
      
      This commit fixes the bug by allocating memory for
      super_block->s_dio_done_wq before iomap_apply() is invoked.
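
      A simplified sketch of that ordering (sb_init_dio_done_wq() is the
      existing helper behind super_block->s_dio_done_wq; the exact condition
      in the patch is condensed here):

      /* in iomap_dio_rw(), before the iomap_apply() loop submits any bio */
      if (iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb)) {
              ret = sb_init_dio_done_wq(inode->i_sb);
              if (ret < 0)
                      goto out_free_dio;
      }
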
      Reported-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  22. 25 Sep 2017, 1 commit
    • fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Authored by Lukas Czerner
      Currently when mixing buffered reads and asynchronous direct writes it
      is possible to end up with the situation where we have stale data in the
      page cache while the new data is already written to disk. This is
      permanent until the affected pages are flushed away. Despite the fact
      that mixing buffered and direct IO is ill-advised, it does pose a threat
      to data integrity, is unexpected and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context in the case that there are mapped pages to be found in
      the inode. Later, before the completion in dio_complete(), invalidate
      the pages in question. This ensures that after the completion the pages
      in the written area are either unmapped, or populated with up-to-date
      data. Also do the same for the iomap case which uses
      iomap_dio_complete() instead.
      
      This has a side effect of deferring the completion to a process context
      for every AIO DIO that happens on an inode that has pages mapped. However,
      since the consensus is that this is an ill-advised practice, the performance
      implication should not be a problem.
      
      This was based on a proposal from Jeff Moyer, thanks!
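
      A sketch of the completion-side decision in fs/direct-io.c (field names
      follow that file, but the logic is trimmed down):

      bool defer_completion = false;

      /*
       * Punt an async write completion to a workqueue whenever the inode
       * still has pages in the page cache, so dio_complete() can run the
       * invalidation from process context.
       */
      if (dio->result)
              defer_completion = dio->defer_completion ||
                                 (dio->op == REQ_OP_WRITE &&
                                  dio->inode->i_mapping->nrpages);
      if (defer_completion) {
              INIT_WORK(&dio->complete_work, dio_aio_complete_work);
              queue_work(dio->inode->i_sb->s_dio_done_wq,
                         &dio->complete_work);
      }
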
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 02 Sep 2017, 1 commit
  24. 24 Aug 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Authored by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want the request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
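
      What this looks like at a call site, roughly (bi_disk, bi_partno and the
      bio_set_dev() helper are the names introduced around this change):

      /* old: associate the bio with a block_device */
      bio->bi_bdev = bdev;

      /* new: carry the gendisk plus a partition index instead */
      bio->bi_disk = bdev->bd_disk;
      bio->bi_partno = bdev->bd_partno;

      /* or, for callers, via the helper added for this purpose */
      bio_set_dev(bio, bdev);
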
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. 12 Aug 2017, 1 commit
  26. 14 Jul 2017, 1 commit