1. 22 Dec, 2018 1 commit
    • iomap: don't search past page end in iomap_is_partially_uptodate · 3cc31fa6
      Eric Sandeen committed
      iomap_is_partially_uptodate() is intended to check whether blocks within
      the selected range of a not-uptodate page are uptodate; if the range we
      care about is up to date, it's an optimization.
      
      However, the iomap implementation continues to check all blocks up to
      from+count, which is beyond the page, and can even be well beyond the
      iop->uptodate bitmap.
      
      I think the worst that will happen is that we may eventually find a zero
      bit and return "not partially uptodate" when it would have otherwise
      returned true, and skip the optimization.  Still, it's clearly an invalid
      memory access that must be fixed.
      
      So: fix this by limiting the search to within the page as is done in the
      non-iomap variant, block_is_partially_uptodate().
      
      Zorro noticed this when KASAN went off for 512 byte blocks on a 64k
      page system:
      
       BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
       Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
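
The clamping described in this commit can be illustrated with a small userspace sketch. It is not the kernel code: the 64k page and 512-byte block sizes are taken from the KASAN report above, and the names `last_block_in_page` and `range_is_uptodate` are hypothetical. The point is only that the last block index is derived from a length clamped to the end of the page, so the scan can never index past the per-page bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 65536u /* assumed 64k page, as in the report */
#define SKETCH_BLOCK_BITS 9u    /* assumed 512-byte blocks */
#define SKETCH_BLOCKS_PER_PAGE (SKETCH_PAGE_SIZE >> SKETCH_BLOCK_BITS)

/* Index of the last block to inspect, never past the end of the page. */
static unsigned last_block_in_page(unsigned from, unsigned long count)
{
    assert(from < SKETCH_PAGE_SIZE && count > 0);
    unsigned long room = SKETCH_PAGE_SIZE - from;
    unsigned long len = count < room ? count : room; /* clamp to page end */
    return (unsigned)((from + len - 1) >> SKETCH_BLOCK_BITS);
}

/* True if every in-page block covered by [from, from + count) is uptodate. */
static bool range_is_uptodate(const bool uptodate[SKETCH_BLOCKS_PER_PAGE],
                              unsigned from, unsigned long count)
{
    unsigned first = from >> SKETCH_BLOCK_BITS;
    unsigned last = last_block_in_page(from, count);

    for (unsigned i = first; i <= last; i++)
        if (!uptodate[i])
            return false;
    return true;
}
```

Without the clamp, a request starting in the last block with a large count would compute a last index far beyond block 127, which is the out-of-bounds read that KASAN flagged.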
  2. 20 Dec, 2018 1 commit
    • iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()" · a837eca2
      Dave Chinner committed
      This reverts commit 61c6de66.
      
      The reverted commit added page reference counting to iomap page
      structures that are used to track block size < page size state. This
      was supposed to align the code with page migration page accounting
      assumptions, but what it has done instead is break XFS filesystems.
      Every fstests run I've done on sub-page block size XFS filesystems
      since picking up this commit 2 days ago has failed with bad page
      state errors such as:
      
      # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
      ....
      SECTION       -- xfs
      FSTYP         -- xfs (debug)
      PLATFORM      -- Linux/x86_64 test1 4.20.0-rc6-dgc+
      MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
      MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
      
      generic/038 454s ...
       run fstests generic/038 at 2018-12-20 18:43:05
       XFS (sdc): Unmounting Filesystem
       XFS (sdc): Mounting V5 Filesystem
       XFS (sdc): Ending clean mount
       BUG: Bad page state in process kswapd0  pfn:3a7fa
       page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
       flags: 0xfffffc0000000()
       raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
       raw: 0000000000000001 0000000000000000 00000000ffffffff
       page dumped because: non-NULL mapping
       CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
       Call Trace:
        dump_stack+0x67/0x90
        bad_page.cold.116+0x8a/0xbd
        free_pcppages_bulk+0x4bf/0x6a0
        free_unref_page_list+0x10f/0x1f0
        shrink_page_list+0x49d/0xf50
        shrink_inactive_list+0x19d/0x3b0
        shrink_node_memcg.constprop.77+0x398/0x690
        ? shrink_slab.constprop.81+0x278/0x3f0
        shrink_node+0x7a/0x2f0
        kswapd+0x34b/0x6d0
        ? node_reclaim+0x240/0x240
        kthread+0x11f/0x140
        ? __kthread_bind_mask+0x60/0x60
        ret_from_fork+0x24/0x30
       Disabling lock debugging due to kernel taint
      ....
      
      The failures come from any path that frees pages and empties the
      per-cpu page magazines, so it's not a predictable failure or an
      easy one to debug.
      
      generic/038 is a reliable reproducer of this problem - it has a 9 in
      10 failure rate on one of my test machines. Failures on other
      machines have been at random points in fstests runs but every run
      has ended up tripping this problem. Hence generic/038 was used to
      bisect the failure because it was the most reliable failure.
      
      It is too close to the 4.20 release (not to mention holidays) to
      try to diagnose, fix and test the underlying cause of the problem,
      so reverting the commit is the only option we have right now. The
      revert has been tested against a current tot 4.20-rc7+ kernel across
      multiple machines running sub-page block size XFS filesystems and
      none of the bad page state failures have been seen.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 Dec, 2018 1 commit
    • fs/iomap.c: get/put the page in iomap_page_create/release() · 61c6de66
      Piotr Jaroszynski committed
      migrate_page_move_mapping() expects pages with private data set to have
      a page_count elevated by 1.  This is what used to happen for xfs through
      the buffer_heads code before the switch to iomap in commit 82cb1417
      ("xfs: add support for sub-pagesize writeback without buffer_heads").
      Not having the count elevated causes move_pages() to fail on memory
      mapped files coming from xfs.
      
      Make iomap compatible with the migrate_page_move_mapping() assumption by
      elevating the page count as part of iomap_page_create() and lowering it
      in iomap_page_release().
      
      It causes the move_pages() syscall to misbehave on memory mapped files
      from xfs.  It does not move any pages, which I suppose is "just" a
      perf issue, but it also ends up returning a positive number which is out
      of spec for the syscall.  Talking to Michal Hocko, it sounds like
      returning positive numbers might be a necessary update to move_pages()
      anyway though
      (https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).
      
      I only hit this in tests that verify that move_pages() actually moved
      the pages.  The test also got confused by the positive return from
      move_pages() (it got treated as a success as positive numbers were not
      expected and not handled) making it a bit harder to track down what's
      going on.
      
      Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
      Fixes: 82cb1417 ("xfs: add support for sub-pagesize writeback without buffer_heads")
      Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 05 Dec, 2018 1 commit
    • iomap: partially revert 4721a601 (simulated directio short read on EFAULT) · 8f67b5ad
      Darrick J. Wong committed
      In commit 4721a601, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.  This causes infinite
      splice() loops and assertion failures on generic/095 on overlayfs
      because xfs only permits total success or total failure of a directio
      operation.  The underlying issue in the pipe splice code has now been
      fixed by changing the pipe splice loop to avoid reading more data
      than there is space in the pipe.
      
      Therefore, it's no longer necessary to simulate the short directio, so
      remove the hack from iomap.
      
      Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
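
The splice-side fix mentioned in this commit (never read more data than the pipe has room for) can be sketched roughly as below. This is an illustration under stated assumptions, not the splice code itself: `struct pipe_sketch`, `clamp_read_len`, and the 4k page size are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12u /* assumed 4k pages */

/* Hypothetical view of a pipe: total buffer slots and slots in use. */
struct pipe_sketch {
    unsigned buffers;
    unsigned nrbufs;
};

/* Clamp a requested read so it cannot exceed the free space in the pipe. */
static size_t clamp_read_len(const struct pipe_sketch *pipe, size_t len)
{
    assert(pipe->nrbufs <= pipe->buffers);
    size_t space = (size_t)(pipe->buffers - pipe->nrbufs) << SKETCH_PAGE_SHIFT;
    return len < space ? len : space;
}
```

With the request bounded this way, the direct IO path should no longer run out of pipe buffers partway through, so there is no EFAULT to translate and no need for a simulated short read.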
  5. 22 Nov, 2018 4 commits
    • iomap: readpages doesn't zero page tail beyond EOF · 8c110d43
      Dave Chinner committed
      When we read the EOF page of the file via readpages, we need
      to zero the region beyond EOF that we either do not read or
      should not contain data so that mmap does not expose stale data to
      user applications.
      
      However, iomap_adjust_read_range() fails to detect EOF correctly,
      and so fsx on 1k block size filesystems fails very quickly with
      mapreads exposing data beyond EOF. There are two problems here.
      
      Firstly, when calculating the end block of the EOF byte, we have
      to reduce the size by one to keep a block aligned EOF from reporting
      a block too large. i.e. a size of 1024 bytes is 1 block, which in
      index terms is block 0. Therefore we have to calculate the end block
      from (isize - 1), not isize.
      
      The second bug is determining if the current page spans EOF, and so
      whether we need to split it into two halves, one for the IO, and the
      other for zeroing. Unfortunately, the code that checks whether
      we should split the block doesn't actually check if we span EOF, it
      just checks if the read spans the /offset in the page/ that EOF
      sits on. So it splits every read into two if EOF is not page
      aligned, regardless of whether we are reading the EOF block or not.
      
      Hence we need to restrict the "does the read span EOF" check to
      just the page that spans EOF, not every page we read.
      
      This patch results in correct EOF detection through readpages:
      
      xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
      iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
      iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
      iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
      
      As you can see, it now does full page reads until the last one which
      is split correctly at the block aligned EOF, reading 3072 bytes and
      zeroing the last 1024 bytes. The original version of the patch got
      this right, but it got another case wrong.
      
      The EOF crossing detection really needs to use the original length,
      as plen, while it starts at the end of the block, will be shortened
      as up-to-date blocks are found on the page. This means "orig_pos +
      plen" no longer points to the end of the page, and so will not
      correctly detect EOF crossing. Hence we have to use the length
      passed in to detect this partial page case:
      
      xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
      xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
      iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
      iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
      
      Here we see a trace where the first block on the EOF page is up to
      date, hence poff = 1024 bytes. The offset into the page of EOF is
      3072, so the range we want to read is 1024 - 3071, and the range we
      want to zero is 3072 - 4095. You can see this is split correctly
      now.
      
      This fixes the stale data beyond EOF problem that fsx quickly
      uncovers on 1k block size filesystems.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
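
The two EOF calculations in this commit can be worked through with a simplified userspace sketch. It is not the kernel's iomap_adjust_read_range(): it assumes 4k pages and 1k blocks as in the traces, ignores the per-block uptodate trimming entirely, and uses the hypothetical names `eof_block_in_page` and `read_len_in_page`. The end block is computed from `isize - 1`, and the read is split only on the page that actually holds the last byte of the file.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE  UINT64_C(4096) /* assumed 4k pages */
#define SK_BLOCK_BITS 10u            /* assumed 1k blocks */
#define SK_BLOCK_SIZE (1u << SK_BLOCK_BITS)

/* In-page index of the block holding the last byte of the file. */
static unsigned eof_block_in_page(uint64_t isize)
{
    assert(isize > 0);
    return (unsigned)(((isize - 1) & (SK_PAGE_SIZE - 1)) >> SK_BLOCK_BITS);
}

/*
 * Bytes to read from the page containing pos. Blocks in the page that lie
 * wholly beyond EOF are left out of the read so they can be zeroed instead.
 */
static unsigned read_len_in_page(uint64_t pos, uint64_t length, uint64_t isize)
{
    assert(length > 0 && isize > 0);
    unsigned poff = (unsigned)(pos & (SK_PAGE_SIZE - 1));
    uint64_t room = SK_PAGE_SIZE - poff;
    unsigned plen = (unsigned)(length < room ? length : room);
    unsigned first = poff >> SK_BLOCK_BITS;
    unsigned last = (poff + plen - 1) >> SK_BLOCK_BITS;
    uint64_t page_start = pos - poff;
    uint64_t eof_page_start = (isize - 1) & ~(SK_PAGE_SIZE - 1);

    /* Only the page that spans EOF is split; other pages read in full. */
    if (page_start == eof_page_start) {
        unsigned end = eof_block_in_page(isize);
        if (first <= end && last > end)
            plen -= (last - end) * SK_BLOCK_SIZE;
    }
    return plen;
}
```

Fed the values from the first trace (isize 420864), the page at pos 417792 yields 3072 bytes to read, while the preceding page at pos 413696 is still read in full even though the remaining length crosses EOF.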
    • iomap: dio data corruption and spurious errors when pipes fill · 4721a601
      Dave Chinner committed
      When doing direct IO to a pipe for do_splice_direct(), the pipe is
      trivial to fill up and overflow as it can only hold 16 pages. At
      this point bio_iov_iter_get_pages() then returns -EFAULT, and we
      abort the IO submission process. Unfortunately, iomap_dio_rw()
      propagates the error back up the stack.
      
      The error is converted from the EFAULT to EAGAIN in
      generic_file_splice_read() to tell the splice layers that the pipe
      is full. do_splice_direct() completely fails to handle EAGAIN errors
      (it aborts on error) and returns EAGAIN to the caller.
      
      copy_file_write() then completely fails to handle EAGAIN as well,
      and so returns EAGAIN to userspace, having failed to copy the data
      it was asked to.
      
      Avoid this whole steaming pile of fail by having iomap_dio_rw()
      silently swallow EFAULT errors and so do short reads.
      
      To make matters worse, iomap_dio_actor() has a stale data exposure
      bug when bio_iov_iter_get_pages() fails - it does not zero the tail
      block that may have been left uncovered by partial IO. Fix the error
      handling case to drop to the sub-block zeroing rather than
      immediately returning the -EFAULT error.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • iomap: sub-block dio needs to zeroout beyond EOF · b450672f
      Dave Chinner committed
      If we are doing sub-block dio that extends EOF, we need to zero
      the unused tail of the block to initialise the data in it. If we
      do not zero the tail of the block, then an immediate mmap read of
      the EOF block will expose stale data beyond EOF to userspace. Found
      with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
      
      Fix this by detecting if the end of the DIO write is beyond EOF
      and zeroing the tail if necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
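
How much of the block tail needs zeroing can be worked out with a little arithmetic. The helper below is a sketch, not the iomap code: `tail_zero_len` is a hypothetical name, and treating "write ends at or beyond the current EOF" as the trigger is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes to zero after a write ending at write_end, or 0 if none are needed.
 * block_size must be a power of two. isize is the file size before the write.
 */
static uint64_t tail_zero_len(uint64_t write_end, uint64_t isize,
                              uint64_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    if (write_end < isize)
        return 0; /* the rest of the block is existing file data */
    uint64_t pad = write_end & (block_size - 1);
    return pad ? block_size - pad : 0;
}
```

For example, a 512-byte write at offset 0 into an empty file on a 4k block size filesystem leaves 3584 bytes of the block to zero.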
    • iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents · 0929d858
      Dave Chinner committed
      When we write into an unwritten extent via direct IO, we dirty
      metadata on IO completion to convert the unwritten extent to
      written. However, when we do the FUA optimisation checks, the inode
      may be clean and so we issue a FUA write into the unwritten extent.
      This means we then bypass the generic_write_sync() call after
      unwritten extent conversion has been done and we don't force the
      modified metadata to stable storage.
      
      This violates O_DSYNC semantics. The window of exposure is a single
      IO, as the next DIO write will see the inode has dirty metadata and
      hence will not use the FUA optimisation. Calling
      generic_write_sync() after completion of the second IO will also
      sync the first write and its metadata.
      
      Fix this by avoiding the FUA optimisation when writing to unwritten
      extents.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
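
The per-extent eligibility check implied by this fix can be sketched as follows. The types and the function `can_use_fua` are hypothetical and only model the conditions named in these commit messages (a mapped, not new, not dirty extent); whether the device supports FUA is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical extent types for this sketch. */
enum extent_type_sketch { EXT_HOLE, EXT_DELALLOC, EXT_MAPPED, EXT_UNWRITTEN };

struct mapping_sketch {
    enum extent_type_sketch type;
    bool is_new;   /* blocks were just allocated */
    bool is_dirty; /* inode has metadata that still needs flushing */
};

/* True if a datasync write into this extent may rely on FUA alone. */
static bool can_use_fua(const struct mapping_sketch *m, bool dsync_write)
{
    assert(m);
    if (!dsync_write)
        return false;
    /* Unwritten extents are converted at IO completion, dirtying metadata. */
    if (m->type != EXT_MAPPED)
        return false;
    return !m->is_new && !m->is_dirty;
}
```

The key design point is that the extent type is checked before the dirty state: a clean inode says nothing about metadata that will only become dirty once the IO completes.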
  6. 27 Oct, 2018 1 commit
  7. 24 Oct, 2018 1 commit
  8. 18 Oct, 2018 1 commit
  9. 29 Sep, 2018 1 commit
    • iomap: set page dirty after partial delalloc on mkwrite · 561295a3
      Brian Foster committed
      The iomap page fault mechanism currently dirties the associated page
      after the full block range of the page has been allocated. This
      leaves the page susceptible to delayed allocations without ever
      being set dirty on sub-page block sized filesystems.
      
      For example, consider a page fault on a page with one preexisting
      real (non-delalloc) block allocated in the middle of the page. The
      first iomap_apply() iteration performs delayed allocation on the
      range up to the preexisting block, the next iteration finds the
      preexisting block, and the last iteration attempts to perform
      delayed allocation on the range after the preexisting block to the
      end of the page. If the first allocation succeeds and the final
      allocation fails with -ENOSPC, iomap_apply() returns the error and
      iomap_page_mkwrite() fails to dirty the page having already
      performed partial delayed allocation. This eventually results in the
      page being invalidated without ever converting the delayed
      allocation to real blocks.
      
      This problem is reliably reproduced by generic/083 on XFS on ppc64
      systems (64k page size, 4k block size). It results in leaked
      delalloc blocks on inode reclaim, which triggers an assert failure
      in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
      
      Move the set_page_dirty() call from iomap_page_mkwrite() to the
      actor callback, similar to how the buffer head implementation works.
      The actor callback is called iff ->iomap_begin() returns success, so
      ensures the page is dirtied as soon as possible after an allocation.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
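
The difference between dirtying the page once at the end and dirtying it from the actor can be seen in a toy model. Nothing here is kernel code: `page_dirty_after_fault` is a hypothetical function, and each array entry stands for the outcome of one mapping step over the page (0 for success, a negative errno for failure).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * results[i] is the outcome of mapping range i of the page.
 * Returns whether the page ends up dirty after the fault.
 */
static bool page_dirty_after_fault(const int *results, size_t n,
                                   bool dirty_in_actor)
{
    assert(results != NULL);
    bool dirty = false;

    for (size_t i = 0; i < n; i++) {
        if (results[i] < 0)
            return dirty; /* error aborts the walk; no actor call here */
        if (dirty_in_actor)
            dirty = true; /* actor runs after each successful mapping */
    }
    if (!dirty_in_actor)
        dirty = true;     /* old behaviour: dirty only after the full walk */
    return dirty;
}
```

With outcomes of success, success, -ENOSPC, the old placement leaves the page clean despite the earlier delayed allocation, while dirtying in the actor leaves it dirty so writeback can later convert the allocation.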
  10. 14 Aug, 2018 1 commit
  11. 12 Aug, 2018 1 commit
  12. 03 Aug, 2018 1 commit
    • fs: fix iomap_bmap position calculation · 79b3dbe4
      Eric Sandeen committed
      The position calculation in iomap_bmap() shifts bno the wrong way,
      so we don't progress properly and end up re-mapping block zero
      over and over, yielding an unchanging physical block range as the
      logical block advances:
      
      # filefrag -Be file
       ext:   logical_offset:     physical_offset: length:   expected: flags:
         0:      0..       0:      21..        21:      1:             merged
         1:      1..       1:      21..        21:      1:         22: merged
      Discontinuity: Block 1 is at 21 (was 22)
         2:      2..       2:      21..        21:      1:         22: merged
      Discontinuity: Block 2 is at 21 (was 22)
         3:      3..       3:      21..        21:      1:         22: merged
      
      This breaks the FIBMAP interface for anyone using it (XFS), which
      in turn breaks LILO, zipl, etc.
      Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
      Fixes: 89eb1906 ("iomap: add an iomap-based bmap implementation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
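
The effect of shifting the wrong way is easy to reproduce in isolation. The two helpers below are illustrative only (their names are made up for this example) and assume a block shift of 12, i.e. 4k blocks, when called as shown in the note.

```c
#include <assert.h>
#include <stdint.h>

/* Wrong direction: small block numbers all collapse to byte position 0. */
static uint64_t bmap_pos_buggy(uint64_t bno, unsigned blkshift)
{
    assert(blkshift < 64);
    return bno >> blkshift;
}

/* Intended conversion from a logical block number to a byte position. */
static uint64_t bmap_pos_fixed(uint64_t bno, unsigned blkshift)
{
    assert(blkshift < 64);
    return bno << blkshift;
}
```

With a shift of 12, blocks 0 through 3 all map to position 0 in the buggy version, which matches the filefrag output above where every logical block reports the same physical block.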
  13. 12 Jul, 2018 1 commit
    • iomap: add support for sub-pagesize buffered I/O without buffer heads · 9dc55f13
      Christoph Hellwig committed
      After already supporting a simple implementation of buffered writes for
      the blocksize == PAGE_SIZE case in the last commit this adds full support
      even for smaller block sizes.   There are three bits of per-block
      information in the buffer_head structure that really matter for the iomap
      read and write path:
      
       - uptodate status (BH_uptodate)
       - marked as currently under read I/O (BH_Async_Read)
       - marked as currently under write I/O (BH_Async_Write)
      
      Instead of having new per-block structures this now adds a per-page
      structure called struct iomap_page to track this information in a slightly
      different form:
      
       - a bitmap for the per-block uptodate status.  For worst case of a 64k
         page size system this bitmap needs to contain 128 bits.  For the
         typical 4k page size case it only needs 8 bits, although we still
         need a full unsigned long due to the way the atomic bitmap API works.
       - two atomic_t counters are used to track the outstanding read and write
         counts
      
      There is quite a bit of boilerplate code as the buffered I/O path uses
      various helper methods, but the actual code is very straightforward.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
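
A userspace approximation of the per-page structure described above might look like the following. It is a sketch only: `struct iomap_page_sketch`, the `SK_` macros, and the helper names are hypothetical, the 512-byte minimum block size is inferred from the 128-bit and 8-bit figures in the message, and the bit helpers here are plain (non-atomic) operations, unlike the atomic bitmap API the commit refers to.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_PAGE_BYTES 65536u /* assumed worst-case 64k page */
#define SK_MIN_BLOCK  512u   /* assumed smallest block size */
#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SK_UPTODATE_BITS(page_size) ((page_size) / SK_MIN_BLOCK)
#define SK_UPTODATE_LONGS(page_size) \
    ((SK_UPTODATE_BITS(page_size) + SK_BITS_PER_LONG - 1) / SK_BITS_PER_LONG)

/* Per-page state: two I/O counters plus one uptodate bit per block. */
struct iomap_page_sketch {
    atomic_int read_count;
    atomic_int write_count;
    unsigned long uptodate[SK_UPTODATE_LONGS(SK_PAGE_BYTES)];
};

/* Mark one block uptodate (not atomic in this sketch). */
static void set_block_uptodate(unsigned long *bitmap, unsigned block)
{
    assert(block < SK_UPTODATE_BITS(SK_PAGE_BYTES));
    bitmap[block / SK_BITS_PER_LONG] |= 1UL << (block % SK_BITS_PER_LONG);
}

/* Query one block's uptodate bit. */
static bool block_is_uptodate(const unsigned long *bitmap, unsigned block)
{
    assert(block < SK_UPTODATE_BITS(SK_PAGE_BYTES));
    return (bitmap[block / SK_BITS_PER_LONG] >>
            (block % SK_BITS_PER_LONG)) & 1UL;
}
```

The macros reproduce the sizes quoted in the message: 128 bits for a 64k page and 8 bits for a 4k page, with the 4k case still rounding up to one full unsigned long.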
  14. 04 Jul, 2018 3 commits
  15. 21 Jun, 2018 1 commit
  16. 20 Jun, 2018 4 commits
  17. 06 Jun, 2018 1 commit
  18. 02 Jun, 2018 7 commits
  19. 31 May, 2018 1 commit
  20. 17 May, 2018 2 commits
  21. 16 May, 2018 1 commit
  22. 10 May, 2018 2 commits
    • iomap: Use FUA for pure data O_DSYNC DIO writes · 3460cac1
      Dave Chinner committed
      If we are doing direct IO writes with datasync semantics, we often
      have to flush metadata changes along with the data write. However,
      if we are overwriting existing data, there are no metadata changes
      that we need to flush. In this case, optimising the IO by using
      FUA write makes sense.
      
      The IOMAP_F_DIRTY flag tells us whether a specific inode
      requires a metadata flush - this is currently used by DAX to ensure
      extent modifications are stable in page fault operations. For direct
      IO writes, we can use it to determine if we need to flush metadata
      or not once the data is on disk.
      
      Hence if we have been returned a mapped extent that is not new and
      the IO mapping is not dirty, then we can use a FUA write to provide
      datasync semantics. This allows us to short-cut the
      generic_write_sync() call in IO completion and hence avoid
      unnecessary operations. This makes pure direct IO data write
      behaviour identical to the way block devices use REQ_FUA to provide
      datasync semantics.
      
      On a FUA enabled device, a synchronous direct IO write workload
      (sequential 4k overwrites in 32MB file) had the following results:
      
      # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
      
      kernel		time	write()s	write iops	Write b/w
      ------		----	--------	----------	---------
      (no dsync)	 4s	2173/s		2173		8.5MB/s
      vanilla		22s	 370/s		 750		1.4MB/s
      patched		19s	 420/s		 420		1.6MB/s
      
      The patched code clearly doesn't send cache flushes anymore, but
      instead uses FUA (confirmed via blktrace), and performance improves
      a bit as a result. However, the benefits will be higher on workloads
      that mix O_DSYNC overwrites with other write IO as we won't be
      flushing the entire device cache on every DSYNC overwrite IO
      anymore.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
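
One way to picture the completion-time short-cut is as a flag that survives only if every write in the IO went out with FUA. The model below is a guess at the shape of that bookkeeping, not the actual iomap_dio implementation; `struct dio_sketch` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-IO state for a direct IO write. */
struct dio_sketch {
    bool dsync;   /* caller asked for datasync semantics */
    bool all_fua; /* every write submitted so far used FUA */
};

static void dio_start(struct dio_sketch *dio, bool dsync)
{
    assert(dio);
    dio->dsync = dsync;
    dio->all_fua = dsync; /* optimistic until a non-FUA write is seen */
}

/* Record one submitted write; a single non-FUA write forces a full sync. */
static void dio_submit(struct dio_sketch *dio, bool used_fua)
{
    assert(dio);
    if (!used_fua)
        dio->all_fua = false;
}

/* At completion: is a generic_write_sync()-style flush still required? */
static bool dio_needs_write_sync(const struct dio_sketch *dio)
{
    assert(dio);
    return dio->dsync && !dio->all_fua;
}
```

A pure overwrite workload keeps the flag set and skips the flush, whereas any write that touched new or dirty mappings falls back to the full sync path.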
    • iomap: iomap_dio_rw() handles all sync writes · 4f8ff44b
      Dave Chinner committed
      Currently iomap_dio_rw() only handles (data)sync write completions
      for AIO. This means we can't optimise non-AIO IO to minimise device
      flushes as we can't tell the caller whether a flush is required or
      not.
      
      To solve this problem and enable further optimisations, make
      iomap_dio_rw responsible for data sync behaviour for all IO, not
      just AIO.
      
      In doing so, the sync operation is now accounted as part of the DIO
      IO by inode_dio_end(), hence post-IO data stability updates will no
      longer race against operations that serialise via inode_dio_wait()
      such as truncate or hole punch.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  23. 29 Jan, 2018 1 commit
  24. 09 Jan, 2018 1 commit
    • iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Darrick J. Wong committed
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>