1. 13 12月, 2019 1 次提交
    • D
      iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents · ac3ec5a4
      Dave Chinner 提交于
      [ Upstream commit 0929d8580071c6a1cec1a7916a8f674c243ceee1 ]
      
      When we write into an unwritten extent via direct IO, we dirty
      metadata on IO completion to convert the unwritten extent to
      written. However, when we do the FUA optimisation checks, the inode
      may be clean and so we issue a FUA write into the unwritten extent.
      This means we then bypass the generic_write_sync() call after
      unwritten extent conversion has ben done and we don't force the
      modified metadata to stable storage.
      
      This violates O_DSYNC semantics. The window of exposure is a single
      IO, as the next DIO write will see the inode has dirty metadata and
      hence will not use the FUA optimisation. Calling
      generic_write_sync() after completion of the second IO will also
      sync the first write and it's metadata.
      
      Fix this by avoiding the FUA optimisation when writing to unwritten
      extents.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ac3ec5a4
  2. 14 3月, 2019 2 次提交
  3. 26 1月, 2019 1 次提交
    • E
      iomap: don't search past page end in iomap_is_partially_uptodate · c9dcb871
      Eric Sandeen 提交于
      [ Upstream commit 3cc31fa65d85610574c0f6a474e89f4c419923d5 ]
      
      iomap_is_partially_uptodate() is intended to check wither blocks within
      the selected range of a not-uptodate page are uptodate; if the range we
      care about is up to date, it's an optimization.
      
      However, the iomap implementation continues to check all blocks up to
      from+count, which is beyond the page, and can even be well beyond the
      iop->uptodate bitmap.
      
      I think the worst that will happen is that we may eventually find a zero
      bit and return "not partially uptodate" when it would have otherwise
      returned true, and skip the optimization.  Still, it's clearly an invalid
      memory access that must be fixed.
      
      So: fix this by limiting the search to within the page as is done in the
      non-iomap variant, block_is_partially_uptodate().
      
      Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k
      page system:
      
       BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
       Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c9dcb871
  4. 29 12月, 2018 1 次提交
    • D
      iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()" · 38d072a4
      Dave Chinner 提交于
      [ Upstream commit a837eca2412051628c0529768c9bc4f3580b040e ]
      
      This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.
      
      The reverted commit added page reference counting to iomap page
      structures that are used to track block size < page size state. This
      was supposed to align the code with page migration page accounting
      assumptions, but what it has done instead is break XFS filesystems.
      Every fstests run I've done on sub-page block size XFS filesystems
      has since picking up this commit 2 days ago has failed with bad page
      state errors such as:
      
      # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
      ....
      SECTION       -- xfs
      FSTYP         -- xfs (debug)
      PLATFORM      -- Linux/x86_64 test1 4.20.0-rc6-dgc+
      MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
      MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
      
      generic/038 454s ...
       run fstests generic/038 at 2018-12-20 18:43:05
       XFS (sdc): Unmounting Filesystem
       XFS (sdc): Mounting V5 Filesystem
       XFS (sdc): Ending clean mount
       BUG: Bad page state in process kswapd0  pfn:3a7fa
       page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
       flags: 0xfffffc0000000()
       raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
       raw: 0000000000000001 0000000000000000 00000000ffffffff
       page dumped because: non-NULL mapping
       CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
       Call Trace:
        dump_stack+0x67/0x90
        bad_page.cold.116+0x8a/0xbd
        free_pcppages_bulk+0x4bf/0x6a0
        free_unref_page_list+0x10f/0x1f0
        shrink_page_list+0x49d/0xf50
        shrink_inactive_list+0x19d/0x3b0
        shrink_node_memcg.constprop.77+0x398/0x690
        ? shrink_slab.constprop.81+0x278/0x3f0
        shrink_node+0x7a/0x2f0
        kswapd+0x34b/0x6d0
        ? node_reclaim+0x240/0x240
        kthread+0x11f/0x140
        ? __kthread_bind_mask+0x60/0x60
        ret_from_fork+0x24/0x30
       Disabling lock debugging due to kernel taint
      ....
      
      The failures are from anyway that frees pages and empties the
      per-cpu page magazines, so it's not a predictable failure or an easy
      to debug failure.
      
      generic/038 is a reliable reproducer of this problem - it has a 9 in
      10 failure rate on one of my test machines. Failure on other
      machines have been at random points in fstests runs but every run
      has ended up tripping this problem. Hence generic/038 was used to
      bisect the failure because it was the most reliable failure.
      
      It is too close to the 4.20 release (not to mention holidays) to
      try to diagnose, fix and test the underlying cause of the problem,
      so reverting the commit is the only option we have right now. The
      revert has been tested against a current tot 4.20-rc7+ kernel across
      multiple machines running sub-page block size XFs filesystems and
      none of the bad page state failures have been seen.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Brian Foster <bfoster@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      38d072a4
  5. 20 12月, 2018 1 次提交
  6. 29 9月, 2018 1 次提交
    • B
      iomap: set page dirty after partial delalloc on mkwrite · 561295a3
      Brian Foster 提交于
      The iomap page fault mechanism currently dirties the associated page
      after the full block range of the page has been allocated. This
      leaves the page susceptible to delayed allocations without ever
      being set dirty on sub-page block sized filesystems.
      
      For example, consider a page fault on a page with one preexisting
      real (non-delalloc) block allocated in the middle of the page. The
      first iomap_apply() iteration performs delayed allocation on the
      range up to the preexisting block, the next iteration finds the
      preexisting block, and the last iteration attempts to perform
      delayed allocation on the range after the prexisting block to the
      end of the page. If the first allocation succeeds and the final
      allocation fails with -ENOSPC, iomap_apply() returns the error and
      iomap_page_mkwrite() fails to dirty the page having already
      performed partial delayed allocation. This eventually results in the
      page being invalidated without ever converting the delayed
      allocation to real blocks.
      
      This problem is reliably reproduced by generic/083 on XFS on ppc64
      systems (64k page size, 4k block size). It results in leaked
      delalloc blocks on inode reclaim, which triggers an assert failure
      in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
      
      Move the set_page_dirty() call from iomap_page_mkwrite() to the
      actor callback, similar to how the buffer head implementation works.
      The actor callback is called iff ->iomap_begin() returns success, so
      ensures the page is dirtied as soon as possible after an allocation.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      561295a3
  7. 14 8月, 2018 1 次提交
  8. 12 8月, 2018 1 次提交
  9. 03 8月, 2018 1 次提交
    • E
      fs: fix iomap_bmap position calculation · 79b3dbe4
      Eric Sandeen 提交于
      The position calculation in iomap_bmap() shifts bno the wrong way,
      so we don't progress properly and end up re-mapping block zero
      over and over, yielding an unchanging physical block range as the
      logical block advances:
      
      # filefrag -Be file
       ext:   logical_offset:     physical_offset: length:   expected: flags:
         0:      0..       0:      21..        21:      1:             merged
         1:      1..       1:      21..        21:      1:         22: merged
      Discontinuity: Block 1 is at 21 (was 22)
         2:      2..       2:      21..        21:      1:         22: merged
      Discontinuity: Block 2 is at 21 (was 22)
         3:      3..       3:      21..        21:      1:         22: merged
      
      This breaks the FIBMAP interface for anyone using it (XFS), which
      in turn breaks LILO, zipl, etc.
      Bug-actually-spotted-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Fixes: 89eb1906 ("iomap: add an iomap-based bmap implementation")
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      79b3dbe4
  10. 12 7月, 2018 1 次提交
    • C
      iomap: add support for sub-pagesize buffered I/O without buffer heads · 9dc55f13
      Christoph Hellwig 提交于
      After already supporting a simple implementation of buffered writes for
      the blocksize == PAGE_SIZE case in the last commit this adds full support
      even for smaller block sizes.   There are three bits of per-block
      information in the buffer_head structure that really matter for the iomap
      read and write path:
      
       - uptodate status (BH_uptodate)
       - marked as currently under read I/O (BH_Async_Read)
       - marked as currently under write I/O (BH_Async_Write)
      
      Instead of having new per-block structures this now adds a per-page
      structure called struct iomap_page to track this information in a slightly
      different form:
      
       - a bitmap for the per-block uptodate status.  For worst case of a 64k
         page size system this bitmap needs to contain 128 bits.  For the
         typical 4k page size case it only needs 8 bits, although we still
         need a full unsigned long due to the way the atomic bitmap API works.
       - two atomic_t counters are used to track the outstanding read and write
         counts
      
      There is quite a bit of boilerplate code as the buffered I/O path uses
      various helper methods, but the actual code is very straight forward.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9dc55f13
  11. 04 7月, 2018 3 次提交
  12. 21 6月, 2018 1 次提交
  13. 20 6月, 2018 4 次提交
  14. 06 6月, 2018 1 次提交
  15. 02 6月, 2018 7 次提交
  16. 31 5月, 2018 1 次提交
  17. 17 5月, 2018 2 次提交
  18. 16 5月, 2018 1 次提交
  19. 10 5月, 2018 2 次提交
    • D
      iomap: Use FUA for pure data O_DSYNC DIO writes · 3460cac1
      Dave Chinner 提交于
      If we are doing direct IO writes with datasync semantics, we often
      have to flush metadata changes along with the data write. However,
      if we are overwriting existing data, there are no metadata changes
      that we need to flush. In this case, optimising the IO by using
      FUA write makes sense.
      
      We know from the IOMAP_F_DIRTY flag as to whether a specific inode
      requires a metadata flush - this is currently used by DAX to ensure
      extent modification as stable in page fault operations. For direct
      IO writes, we can use it to determine if we need to flush metadata
      or not once the data is on disk.
      
      Hence if we have been returned a mapped extent that is not new and
      the IO mapping is not dirty, then we can use a FUA write to provide
      datasync semantics. This allows us to short-cut the
      generic_write_sync() call in IO completion and hence avoid
      unnecessary operations. This makes pure direct IO data write
      behaviour identical to the way block devices use REQ_FUA to provide
      datasync semantics.
      
      On a FUA enabled device, a synchronous direct IO write workload
      (sequential 4k overwrites in 32MB file) had the following results:
      
      # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
      
      kernel		time	write()s	write iops	Write b/w
      ------		----	--------	----------	---------
      (no dsync)	 4s	2173/s		2173		8.5MB/s
      vanilla		22s	 370/s		 750		1.4MB/s
      patched		19s	 420/s		 420		1.6MB/s
      
      The patched code clearly doesn't send cache flushes anymore, but
      instead uses FUA (confirmed via blktrace), and performance improves
      a bit as a result. However, the benefits will be higher on workloads
      that mix O_DSYNC overwrites with other write IO as we won't be
      flushing the entire device cache on every DSYNC overwrite IO
      anymore.
      Signed-Off-By: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      3460cac1
    • D
      iomap: iomap_dio_rw() handles all sync writes · 4f8ff44b
      Dave Chinner 提交于
      Currently iomap_dio_rw() only handles (data)sync write completions
      for AIO. This means we can't optimised non-AIO IO to minimise device
      flushes as we can't tell the caller whether a flush is required or
      not.
      
      To solve this problem and enable further optimisations, make
      iomap_dio_rw responsible for data sync behaviour for all IO, not
      just AIO.
      
      In doing so, the sync operation is now accounted as part of the DIO
      IO by inode_dio_end(), hence post-IO data stability updates will no
      long race against operations that serialise via inode_dio_wait()
      such as truncate or hole punch.
      Signed-Off-By: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4f8ff44b
  20. 29 1月, 2018 1 次提交
  21. 09 1月, 2018 1 次提交
    • D
      iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Darrick J. Wong 提交于
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      5a9d929d
  22. 04 11月, 2017 1 次提交
  23. 17 10月, 2017 1 次提交
    • E
      fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Eryu Guan 提交于
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
      path, but before the dio->end_io() call, and it re-introdued the bug
      fixed by commit c771c14b ("iomap: invalidate page caches should
      be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure next buffer read reads the final real allocations not unwritten
      extents. I also add some comments about why should end_io() go first
      in case we get it wrong again in the future.
      
      Note that, there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() in generic_file_direct_write() call, but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, also be consistent with iomap_dio_complete().
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      5e25c269
  24. 12 10月, 2017 1 次提交
    • A
      iomap_dio_actor(): fix iov_iter bugs · cfe057f7
      Al Viro 提交于
      1) Ignoring return value from iov_iter_zero() is wrong
      for iovec-backed case as well as for pipes - it can fail.
      
      2) Failure to fault destination pages in 25Mb into a 50Mb iovec
      should not act as if nothing in the area had been read, nevermind
      that the first 25Mb might have *already* been read by that point.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cfe057f7
  25. 02 10月, 2017 2 次提交