1. 22 1月, 2019 1 次提交
  2. 30 11月, 2018 1 次提交
    • M
      fs: fix lost error code in dio_complete · 41e817bc
      Maximilian Heyne 提交于
      commit e2592217 ("fs: simplify the
      generic_write_sync prototype") reworked callers of generic_write_sync(),
      and ended up dropping the error return for the directio path. Prior to
      that commit, in dio_complete(), an error would be bubbled up the stack,
      but after that commit, errors passed on to dio_complete were eaten up.
      
      This was reported on the list earlier, and a fix was proposed in
      https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
      never followed up with.  We recently hit this bug in our testing where
      fencing io errors, which were previously erroring out with EIO, were
      being returned as success operations after this commit.
      
      The fix proposed on the list earlier was a little short -- it would have
      still called generic_write_sync() in case `ret` already contained an
      error. This fix ensures generic_write_sync() is only called when there's
      no pending error in the write. Additionally, transferred is replaced
      with ret to bring this code in line with other callers.
      
      Fixes: e2592217 ("fs: simplify the generic_write_sync prototype")
      Reported-by: NRavi Nankani <rnankani@amazon.com>
      Signed-off-by: NMaximilian Heyne <mheyne@amazon.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      CC: Torsten Mehlan <tomeh@amazon.de>
      CC: Uwe Dannowski <uwed@amazon.de>
      CC: Amit Shah <aams@amazon.de>
      CC: David Woodhouse <dwmw@amazon.co.uk>
      CC: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      41e817bc
  3. 26 11月, 2018 1 次提交
    • J
      block: make blk_poll() take a parameter on whether to spin or not · 0a1b8b87
      Jens Axboe 提交于
      blk_poll() has always kept spinning until it found an IO. This is
      fine for SYNC polling, since we need to find one request we have
      pending, but in preparation for ASYNC polling it can be beneficial
      to just check if we have any entries available or not.
      
      Existing callers are converted to pass in 'spin == true', to retain
      the old behavior.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0a1b8b87
  4. 08 11月, 2018 1 次提交
  5. 24 10月, 2018 1 次提交
  6. 14 5月, 2018 1 次提交
  7. 06 4月, 2018 1 次提交
  8. 13 3月, 2018 2 次提交
  9. 27 2月, 2018 1 次提交
    • J
      direct-io: Fix sleep in atomic due to sync AIO · d9c10e5b
      Jan Kara 提交于
      Commit e864f395 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
      way for direct IO to become synchronous and thus trigger fsync from the
      IO completion handler. Then commit 9830f4be "fs: Use RWF_* flags for
      AIO operations" allowed these flags to be set for AIO as well. However
      that commit forgot to update the condition checking whether the IO
      completion handling should be defered to a workqueue and thus AIO DIO
      with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
      sleep in atomic.
      
      Fix the problem by checking directly iocb flags (the same way as it is
      done in dio_complete()) instead of checking all conditions that could
      lead to IO being synchronous.
      
      CC: Christoph Hellwig <hch@lst.de>
      CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
      CC: stable@vger.kernel.org
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Fixes: 9830f4beSigned-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d9c10e5b
  10. 09 1月, 2018 1 次提交
    • D
      iomap: report collisions between directio and buffered writes to userspace · 5a9d929d
      Darrick J. Wong 提交于
      If two programs simultaneously try to write to the same part of a file
      via direct IO and buffered IO, there's a chance that the post-diowrite
      pagecache invalidation will fail on the dirty page.  When this happens,
      the dio write succeeded, which means that the page cache is no longer
      coherent with the disk!
      
      Programs are not supposed to mix IO types and this is a clear case of
      data corruption, so store an EIO which will be reflected to userspace
      during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
      so that the developers have /some/ kind of breadcrumb to track down the
      offending program(s) and file(s) involved.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      5a9d929d
  11. 04 11月, 2017 1 次提交
  12. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  13. 17 10月, 2017 2 次提交
    • L
      fs: Avoid invalidation in interrupt context in dio_complete() · ffe51f01
      Lukas Czerner 提交于
      Currently we try to defer completion of async DIO to the process context
      in case there are any mapped pages associated with the inode so that we
      can invalidate the pages when the IO completes. However the check is racy
      and the pages can be mapped afterwards. If this happens we might end up
      calling invalidate_inode_pages2_range() in dio_complete() in interrupt
      context which could sleep. This can be reproduced by generic/451.
      
      Fix this by passing the information whether we can or can't invalidate
      to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
      for suggesting a fix.
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Reported-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NEryu Guan <eguan@redhat.com>
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ffe51f01
    • E
      fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Eryu Guan 提交于
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
      path, but before the dio->end_io() call, and it re-introdued the bug
      fixed by commit c771c14b ("iomap: invalidate page caches should
      be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure next buffer read reads the final real allocations not unwritten
      extents. I also add some comments about why should end_io() go first
      in case we get it wrong again in the future.
      
      Note that, there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() in generic_file_direct_write() call, but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, also be consistent with iomap_dio_complete().
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      5e25c269
  14. 11 10月, 2017 1 次提交
  15. 25 9月, 2017 1 次提交
    • L
      fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Lukas Czerner 提交于
      Currently when mixing buffered reads and asynchronous direct writes it
      is possible to end up with the situation where we have stale data in the
      page cache while the new data is already written to disk. This is
      permanent until the affected pages are flushed away. Despite the fact
      that mixing buffered and direct IO is ill-advised it does pose a thread
      for a data integrity, is unexpected and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context in the case that there are mapped pages to be found in
      the inode. Later before the completion in dio_complete() invalidate
      the pages in question. This ensures that after the completion the pages
      in the written area are either unmapped, or populated with up-to-date
      data. Also do the same for the iomap case which uses
      iomap_dio_complete() instead.
      
      This has a side effect of deferring the completion to a process context
      for every AIO DIO that happens on inode that has pages mapped. However
      since the consensus is that this is ill-advised practice the performance
      implication should not be a problem.
      
      This was based on proposal from Jeff Moyer, thanks!
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      332391a9
  16. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  17. 28 6月, 2017 1 次提交
  18. 20 6月, 2017 1 次提交
    • G
      block: return on congested block device · 03a07c92
      Goldwyn Rodrigues 提交于
      A new bio operation flag REQ_NOWAIT is introduced to identify bio's
      orignating from iocb with IOCB_NOWAIT. This flag indicates
      to return immediately if a request cannot be made instead
      of retrying.
      
      Stacked devices such as md (the ones with make_request_fn hooks)
      currently are not supported because it may block for housekeeping.
      For example, an md can have a part of the device suspended.
      For this reason, only request based devices are supported.
      In the future, this feature will be expanded to stacked devices
      by teaching them how to handle the REQ_NOWAIT flags.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03a07c92
  19. 09 6月, 2017 3 次提交
  20. 28 2月, 2017 1 次提交
  21. 11 1月, 2017 1 次提交
    • C
      do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned · dd545b52
      Chandan Rajendra 提交于
      The code currently uses sdio->blkbits to compute the number of blocks to
      be cleaned. However sdio->blkbits is derived from the logical block size
      of the underlying block device (Refer to the definition of
      do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
      fail when executed on an ext4 filesystem with 64k as the block size and
      when using a virtio based disk (having 512 byte as the logical block
      size) inside a kvm guest.
      
      This commit fixes the bug by using inode->i_blkbits to compute the
      number of blocks to be cleaned.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      
      Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
      to avoid issues with block size changes with IO in flight.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd545b52
  22. 30 11月, 2016 1 次提交
  23. 12 11月, 2016 1 次提交
  24. 05 11月, 2016 1 次提交
  25. 01 11月, 2016 1 次提交
  26. 04 10月, 2016 1 次提交
  27. 08 6月, 2016 2 次提交
  28. 28 5月, 2016 1 次提交
    • E
      direct-io: fix direct write stale data exposure from concurrent buffered read · 9ecd10b7
      Eryu Guan 提交于
      Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
      not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
      before calling get_block() callback), if it's a sparse file, direct
      writes fall back to buffered writes to avoid stale data exposure from
      concurrent buffered read.  But there're two cases that can result in
      stale data exposure are not correctly detected.
      
      1. The detection for "writing inside i_size" is not sufficient,
         writes can be treated as "extending writes" wrongly.  For example,
         direct write 1FSB (file system block) to a 1FSB sparse file on
         ext2/3/4, starting from offset 0, in this case it's writing inside
         i_size, but 'create' is non-zero, because 'block_in_file' and
         '(i_size_read(inode) >> blkbits' are both zero.
      
      2. Direct writes starting from or beyong i_size (not inside i_size)
         also could trigger block allocation and expose stale data.  For
         example, consider a sparse file with i_size of 2k, and a write to
         offset 2k or 3k into the file, with a filesystem block size of 4k.
         (Thanks to Jeff Moyer for pointing this case out in his review.)
      
      The first problem can be demostrated by running ltp-aiodio test ADSP045
      many times.  When testing on extN filesystems, I see test failures
      occasionally, buffered read could read non-zero (stale) data.
      
      ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
      
      dio_sparse    0  TINFO  :  Dirtying free blocks
      dio_sparse    0  TINFO  :  Starting I/O tests
      non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
      non-zero read at offset 0
      dio_sparse    0  TINFO  :  Killing childrens(s)
      dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally
      
      The second problem can also be reproduced easily by a hacked dio_sparse
      program, which accepts an option to specify the write offset.
      
      What we should really do is to disable block allocation for writes that
      could result in filling holes inside i_size.
      
      Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.comReviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ecd10b7
  29. 02 5月, 2016 4 次提交
  30. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  31. 05 3月, 2016 1 次提交
  32. 08 2月, 2016 1 次提交