1. 25 Sep 2017, 1 commit
    • fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Committed by Lukas Czerner
      Currently, when mixing buffered reads and asynchronous direct writes it
      is possible to end up in a situation where stale data sits in the
      page cache while the new data is already written to disk. This persists
      until the affected pages are flushed away. Even though mixing buffered
      and direct IO is ill-advised, it does pose a threat to data integrity,
      is unexpected, and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context when the inode has mapped pages. Then, before
      completing in dio_complete(), invalidate the pages in question. This
      ensures that after the completion the pages in the written range are
      either unmapped, or populated with up-to-date data. Do the same for the
      iomap case, which uses iomap_dio_complete() instead.
      
      This has the side effect of deferring the completion to a process
      context for every AIO DIO that happens on an inode with mapped pages.
      However, since the consensus is that this is an ill-advised practice,
      the performance implication should not be a problem.
      
      This was based on a proposal from Jeff Moyer, thanks!
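      
      A minimal sketch of the completion-side invalidation (illustrative only;
      names are simplified and this is not the actual dio_complete() or
      iomap_dio_complete() hunk):
      
      #include <linux/fs.h>
      #include <linux/pagemap.h>
      #include <linux/printk.h>
      
      /* Runs from process context once the deferred AIO DIO completion fires. */
      static void sketch_dio_invalidate_written_range(struct inode *inode,
                                                      loff_t offset, ssize_t written)
      {
              struct address_space *mapping = inode->i_mapping;
      
              /* Nothing to do unless the write succeeded and pages are cached. */
              if (written <= 0 || !mapping->nrpages)
                      return;
      
              /*
               * Drop cached pages covering [offset, offset + written); afterwards
               * the written range is either unmapped or will be re-read from
               * disk, i.e. populated with up-to-date data.
               */
              if (invalidate_inode_pages2_range(mapping, offset >> PAGE_SHIFT,
                                                (offset + written - 1) >> PAGE_SHIFT))
                      pr_warn("direct IO page cache invalidation failed\n");
      }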
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 24 Aug 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Committed by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
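      
      A minimal sketch of the new addressing model (the bi_disk/bi_partno field
      names follow this commit, but the helper itself is hypothetical, not the
      in-tree bio_set_dev()):
      
      #include <linux/blk_types.h>
      #include <linux/genhd.h>
      
      /*
       * A bio now records its destination as a gendisk plus a partition index
       * instead of a struct block_device pointer, so I/O can be built by code
       * that never opened a block device node.
       */
      static void sketch_bio_set_target(struct bio *bio, struct gendisk *disk,
                                        u8 partno)
      {
              bio->bi_disk = disk;      /* exactly one gendisk exists per block device */
              bio->bi_partno = partno;  /* used by generic_make_request() for remapping */
      }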
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 28 Jun 2017, 1 commit
  4. 20 Jun 2017, 1 commit
    • block: return on congested block device · 03a07c92
      Committed by Goldwyn Rodrigues
      A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
      originating from an iocb with IOCB_NOWAIT. The flag indicates that the
      request should fail immediately if it cannot be made, instead of
      retrying.
      
      Stacked devices such as md (the ones with make_request_fn hooks) are
      currently not supported because they may block for housekeeping; for
      example, an md device can have part of the device suspended. For this
      reason, only request-based devices are supported. In the future, this
      feature will be extended to stacked devices by teaching them how to
      handle the REQ_NOWAIT flag.
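      
      A rough sketch of how the intent travels (illustrative only; the actual
      failure path in generic_make_request additionally completes the bio with
      an -EAGAIN style error):
      
      #include <linux/blk_types.h>
      #include <linux/fs.h>
      
      /* Propagate the caller's IOCB_NOWAIT intent onto the bio, so the block
       * layer can fail it immediately instead of sleeping when the request
       * cannot be made. */
      static void sketch_bio_mark_nowait(const struct kiocb *iocb, struct bio *bio)
      {
              if (iocb->ki_flags & IOCB_NOWAIT)
                      bio->bi_opf |= REQ_NOWAIT;
      }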
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 09 Jun 2017, 3 commits
  6. 28 Feb 2017, 1 commit
  7. 11 Jan 2017, 1 commit
    • do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned · dd545b52
      Committed by Chandan Rajendra
      The code currently uses sdio->blkbits to compute the number of blocks to
      be cleaned. However, sdio->blkbits is derived from the logical block size
      of the underlying block device (refer to the definition of
      do_blockdev_direct_IO()). Due to this, the generic/299 test would rarely
      fail when executed on an ext4 filesystem with a 64k block size on a
      virtio-based disk (with a 512-byte logical block size) inside a KVM
      guest.
      
      This commit fixes the bug by using inode->i_blkbits to compute the
      number of blocks to be cleaned.
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      
      Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
      to avoid issues with block size changes with IO in flight.
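      
      A condensed sketch of the idea (hypothetical helper, not the actual
      do_direct_IO() hunk): the count is derived from the filesystem block
      size, and i_blkbits is sampled exactly once so a block size change with
      IO in flight cannot yield two different values within one calculation.
      
      #include <linux/fs.h>
      
      static unsigned long sketch_blocks_to_clean(struct inode *inode, loff_t len)
      {
              unsigned int i_blkbits = READ_ONCE(inode->i_blkbits); /* read once */
      
              /* Filesystem blocks, not device logical blocks (sdio->blkbits). */
              return len >> i_blkbits;
      }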
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 30 Nov 2016, 1 commit
  9. 12 Nov 2016, 1 commit
  10. 05 Nov 2016, 1 commit
  11. 01 Nov 2016, 1 commit
  12. 04 Oct 2016, 1 commit
  13. 08 Jun 2016, 2 commits
  14. 28 May 2016, 1 commit
    • direct-io: fix direct write stale data exposure from concurrent buffered read · 9ecd10b7
      Committed by Eryu Guan
      Currently, direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
      not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
      before calling the get_block() callback); for a sparse file, direct
      writes fall back to buffered writes to avoid stale data exposure from a
      concurrent buffered read.  But there are two cases that can result in
      stale data exposure and are not correctly detected.
      
      1. The detection for "writing inside i_size" is not sufficient;
         writes can wrongly be treated as "extending writes".  For example,
         a direct write of 1 FSB (filesystem block) to a 1 FSB sparse file on
         ext2/3/4, starting from offset 0: in this case it is writing inside
         i_size, but 'create' is non-zero, because 'block_in_file' and
         '(i_size_read(inode) >> blkbits)' are both zero.
      
      2. Direct writes starting from or beyond i_size (not inside i_size)
         can also trigger block allocation and expose stale data.  For
         example, consider a sparse file with an i_size of 2k, and a write to
         offset 2k or 3k into the file, with a filesystem block size of 4k.
         (Thanks to Jeff Moyer for pointing this case out in his review.)
      
      The first problem can be demonstrated by running the ltp-aiodio test
      ADSP045 many times.  When testing on extN filesystems, I occasionally
      see test failures where the buffered read returns non-zero (stale) data.
      
      ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
      
      dio_sparse    0  TINFO  :  Dirtying free blocks
      dio_sparse    0  TINFO  :  Starting I/O tests
      non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
      non-zero read at offset 0
      dio_sparse    0  TINFO  :  Killing childrens(s)
      dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally
      
      The second problem can also be reproduced easily by a hacked dio_sparse
      program, which accepts an option to specify the write offset.
      
      What we should really do is to disable block allocation for writes that
      could result in filling holes inside i_size.
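      
      A condensed sketch of the corrected check (hypothetical helper; in the
      tree this logic lives in get_more_blocks()):
      
      #include <linux/fs.h>
      
      /*
       * With DIO_SKIP_HOLES, block allocation must be disabled for any write
       * whose first block lies at or below the last block covered by i_size,
       * because such a write could fill a hole that a concurrent buffered read
       * still sees as inside the file.
       */
      static bool sketch_dio_may_allocate(struct inode *inode, sector_t fs_startblk,
                                          unsigned int i_blkbits)
      {
              loff_t i_size = i_size_read(inode);
      
              if (i_size && fs_startblk <= ((i_size - 1) >> i_blkbits))
                      return false;   /* caller falls back to buffered IO instead */
      
              return true;            /* genuinely extending write: allocation is safe */
      }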
      
      Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 02 May 2016, 4 commits
  16. 05 Apr 2016, 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with chunks bigger than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it is a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files;
      I've called spatch on them manually.
      
      The only adjustment after coccinelle is reverting the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 05 Mar 2016, 1 commit
  18. 08 Feb 2016, 1 commit
  19. 31 Jan 2016, 1 commit
    • block: fix use-after-free in dio_bio_complete · 7ddc971f
      Committed by Mike Krinkin
      KASAN reported the following error when I ran xfstests:
      
      [  701.826854] ==================================================================
      [  701.826864] BUG: KASAN: use-after-free in dio_bio_complete+0x41a/0x600 at addr ffff880080b95f94
      [  701.826870] Read of size 4 by task loop2/3874
      [  701.826879] page:ffffea000202e540 count:0 mapcount:0 mapping:          (null) index:0x0
      [  701.826890] flags: 0x100000000000000()
      [  701.826895] page dumped because: kasan: bad access detected
      [  701.826904] CPU: 3 PID: 3874 Comm: loop2 Tainted: G    B   W    L  4.5.0-rc1-next-20160129 #83
      [  701.826910] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013
      [  701.826917]  ffff88008fadf800 ffff88008fadf758 ffffffff81ca67bb 0000000041b58ab3
      [  701.826941]  ffffffff830d1e74 ffffffff81ca6724 ffff88008fadf748 ffffffff8161c05c
      [  701.826963]  0000000000000282 ffff88008fadf800 ffffed0010172bf2 ffffea000202e540
      [  701.826987] Call Trace:
      [  701.826997]  [<ffffffff81ca67bb>] dump_stack+0x97/0xdc
      [  701.827005]  [<ffffffff81ca6724>] ? _atomic_dec_and_lock+0xc4/0xc4
      [  701.827014]  [<ffffffff8161c05c>] ? __dump_page+0x32c/0x490
      [  701.827023]  [<ffffffff816b0d03>] kasan_report_error+0x5f3/0x8b0
      [  701.827033]  [<ffffffff817c302a>] ? dio_bio_complete+0x41a/0x600
      [  701.827040]  [<ffffffff816b1119>] __asan_report_load4_noabort+0x59/0x80
      [  701.827048]  [<ffffffff817c302a>] ? dio_bio_complete+0x41a/0x600
      [  701.827053]  [<ffffffff817c302a>] dio_bio_complete+0x41a/0x600
      [  701.827057]  [<ffffffff81bd19c8>] ? blk_queue_exit+0x108/0x270
      [  701.827060]  [<ffffffff817c32b0>] dio_bio_end_aio+0xa0/0x4d0
      [  701.827063]  [<ffffffff817c3210>] ? dio_bio_complete+0x600/0x600
      [  701.827067]  [<ffffffff81bd2806>] ? blk_account_io_completion+0x316/0x5d0
      [  701.827070]  [<ffffffff81bafe89>] bio_endio+0x79/0x200
      [  701.827074]  [<ffffffff81bd2c9f>] blk_update_request+0x1df/0xc50
      [  701.827078]  [<ffffffff81c02c27>] blk_mq_end_request+0x57/0x120
      [  701.827081]  [<ffffffff81c03670>] __blk_mq_complete_request+0x310/0x590
      [  701.827084]  [<ffffffff812348d8>] ? set_next_entity+0x2f8/0x2ed0
      [  701.827088]  [<ffffffff8124b34d>] ? put_prev_entity+0x22d/0x2a70
      [  701.827091]  [<ffffffff81c0394b>] blk_mq_complete_request+0x5b/0x80
      [  701.827094]  [<ffffffff821e2a33>] loop_queue_work+0x273/0x19d0
      [  701.827098]  [<ffffffff811f6578>] ? finish_task_switch+0x1c8/0x8e0
      [  701.827101]  [<ffffffff8129d058>] ? trace_hardirqs_on_caller+0x18/0x6c0
      [  701.827104]  [<ffffffff821e27c0>] ? lo_read_simple+0x890/0x890
      [  701.827108]  [<ffffffff8129dd60>] ? debug_check_no_locks_freed+0x350/0x350
      [  701.827111]  [<ffffffff811f63b0>] ? __hrtick_start+0x130/0x130
      [  701.827115]  [<ffffffff82a0c8f6>] ? __schedule+0x936/0x20b0
      [  701.827118]  [<ffffffff811dd6bd>] ? kthread_worker_fn+0x3ed/0x8d0
      [  701.827121]  [<ffffffff811dd4ed>] ? kthread_worker_fn+0x21d/0x8d0
      [  701.827125]  [<ffffffff8129d058>] ? trace_hardirqs_on_caller+0x18/0x6c0
      [  701.827128]  [<ffffffff811dd57f>] kthread_worker_fn+0x2af/0x8d0
      [  701.827132]  [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
      [  701.827135]  [<ffffffff82a1ea46>] ? _raw_spin_unlock_irqrestore+0x36/0x60
      [  701.827138]  [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
      [  701.827141]  [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
      [  701.827144]  [<ffffffff811dd00b>] kthread+0x24b/0x3a0
      [  701.827148]  [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
      [  701.827151]  [<ffffffff8129d70d>] ? trace_hardirqs_on+0xd/0x10
      [  701.827155]  [<ffffffff8116d41d>] ? do_group_exit+0xdd/0x350
      [  701.827158]  [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
      [  701.827161]  [<ffffffff82a1f52f>] ret_from_fork+0x3f/0x70
      [  701.827165]  [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
      [  701.827167] Memory state around the buggy address:
      [  701.827170]  ffff880080b95e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [  701.827172]  ffff880080b95f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [  701.827175] >ffff880080b95f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [  701.827177]                          ^
      [  701.827179]  ffff880080b96000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [  701.827182]  ffff880080b96080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [  701.827183] ==================================================================
      
      The problem is that bio_check_pages_dirty calls bio_put, so we must
      not access bio fields after bio_check_pages_dirty.
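      
      The fix pattern, sketched (simplified from dio_bio_complete(); bi_error is
      the era-appropriate error field, see commit 4246a0b6 below):
      
      #include <linux/bio.h>
      
      /* Everything needed from the bio is read into locals before the call
       * that may drop the last reference. */
      static int sketch_dio_bio_complete(struct bio *bio, bool should_dirty)
      {
              int err = bio->bi_error;  /* sample before the bio can be freed */
      
              if (should_dirty)
                      bio_check_pages_dirty(bio);  /* may call bio_put() internally */
              else
                      bio_put(bio);
      
              return err;  /* safe: no bio field is touched past this point */
      }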
      
      Fixes: 9b81c842 ("block: don't access bio->bi_error after bio_put()").
      Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  20. 23 Jan 2016, 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, add
      inode_foo(inode) wrappers defined as mutex_foo(&inode->i_mutex).
      
      Please use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
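      
      The shape of the wrappers, as described above (sketch; the real
      definitions live in include/linux/fs.h):
      
      #include <linux/fs.h>
      #include <linux/mutex.h>
      
      static inline void sketch_inode_lock(struct inode *inode)
      {
              mutex_lock(&inode->i_mutex);
      }
      
      static inline void sketch_inode_unlock(struct inode *inode)
      {
              mutex_unlock(&inode->i_mutex);
      }
      
      static inline int sketch_inode_trylock(struct inode *inode)
      {
              return mutex_trylock(&inode->i_mutex);
      }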
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 09 Dec 2015, 1 commit
  22. 01 Dec 2015, 1 commit
    • direct-io: Fix negative return from dio read beyond eof · 74cedf9b
      Committed by Jan Kara
      Assume a filesystem with 4KB blocks. When a file has a size of 1000 bytes
      and we issue a direct IO read at offset 1024, blockdev_direct_IO() reads
      the tail of the last block, and the logic for handling short DIO reads in
      dio_complete() results in a return value of -24 (1000 - 1024), which
      obviously confuses userspace.
      
      Fix the problem by bailing out early once we sample i_size and can
      reliably check that the direct IO read starts beyond i_size.
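      
      A condensed sketch of the early bailout (hypothetical helper; in the tree
      it is an inline check in do_blockdev_direct_IO() after i_size is sampled):
      
      #include <linux/fs.h>
      #include <linux/uio.h>
      
      /* A direct read starting at or past i_size has nothing to transfer, so
       * returning 0 for it avoids the negative result the short-read fixup in
       * dio_complete() would otherwise compute. */
      static bool sketch_dio_read_beyond_eof(struct inode *inode,
                                             struct iov_iter *iter, loff_t offset)
      {
              return iov_iter_rw(iter) == READ && offset >= i_size_read(inode);
      }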
      Reported-by: Avi Kivity <avi@scylladb.com>
      Fixes: 9fe55eea
      CC: stable@vger.kernel.org
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  23. 11 Nov 2015, 1 commit
  24. 08 Nov 2015, 1 commit
  25. 07 Nov 2015, 1 commit
  26. 24 Sep 2015, 1 commit
  27. 14 Aug 2015, 1 commit
  28. 12 Aug 2015, 1 commit
    • block: don't access bio->bi_error after bio_put() · 9b81c842
      Committed by Sasha Levin
      Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
      dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
      such as:
      
      [521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
      [521120.720638] Read of size 4 by task mount.ocfs2/9644
      [521120.721212] =============================================================================
      [521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
      [521120.722968] -----------------------------------------------------------------------------
      [521120.722968]
      [521120.723915] Disabling lock debugging due to kernel taint
      [521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
      [521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
      [521120.726037]
      [521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff  ...6............
      [521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
      [521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00  ................
      [521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff  ................
      [521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff  ...........6....
      [521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff  .s".....@..<....
      [521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00  ................
      [521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G    B           4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
      [521120.744274]  ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
      [521120.745025]  ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
      [521120.745908]  ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
      [521120.747063] Call Trace:
      [521120.747520] dump_stack (lib/dump_stack.c:52)
      [521120.748053] print_trailer (mm/slub.c:653)
      [521120.748582] object_err (mm/slub.c:660)
      [521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
      [521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
      [521120.753580] dio_bio_complete (fs/direct-io.c:478)
      [521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
      [521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
      [521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
      [521120.762993] generic_file_read_iter (mm/filemap.c:1738)
      [521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
      [521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
      [521120.772126] vfs_read (fs/read_write.c:454)
      [521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
      [521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
      [521120.777375] Memory state around the buggy address:
      [521120.778118]  ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.779211]  ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.781465]                          ^
      [521120.782083]  ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.783717]  ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [521120.784818] ==================================================================
      
      This patch fixes a few of those places that I caught while auditing the
      patch, but the original patch should be audited further for more
      occurrences of this issue, since I'm not too familiar with the code.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  29. 29 Jul 2015, 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Committed by Christoph Hellwig
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up and not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
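      
      A sketch of the resulting convention for a completion handler
      (illustrative; after this change bi_end_io takes only the bio and the
      error is read from the new field):
      
      #include <linux/bio.h>
      #include <linux/printk.h>
      
      static void sketch_endio(struct bio *bio)
      {
              if (bio->bi_error)  /* a full errno, not just "not uptodate" */
                      pr_err("I/O failed: %d\n", bio->bi_error);
      
              bio_put(bio);
      }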
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  30. 25 Apr 2015, 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Committed by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
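      
      A minimal sketch of the flag's effect (hypothetical wrapper; in the tree
      the check sits directly in do_blockdev_direct_IO() and its completion
      path):
      
      #include <linux/fs.h>
      
      /* Filesystems keep the i_dio_count based truncate protection, while
       * callers with nothing to truncate (the raw block device) pass
       * DIO_SKIP_DIO_COUNT and never touch the shared atomic. */
      static void sketch_dio_count_begin(struct inode *inode, unsigned int dio_flags)
      {
              if (!(dio_flags & DIO_SKIP_DIO_COUNT))
                      inode_dio_begin(inode); /* paired with inode_dio_end() at completion */
      }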
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  31. 12 Apr 2015, 1 commit
  32. 26 Mar 2015, 1 commit
  33. 14 Mar 2015, 1 commit
    • fs: split generic and aio kiocb · 04b2fa9f
      Committed by Christoph Hellwig
      Most callers in the kernel want to perform synchronous file I/O, but
      still have to bloat the stack with a full struct kiocb.  Split out
      the parts needed in filesystem code from those in the aio code, and
      only allocate on the stack what is needed to pass arguments down.  The
      aio code embeds the generic iocb in the one it allocates and can
      easily get back to it by using container_of.
      
      Also add a ->ki_complete method to struct kiocb; this is used to call
      into the aio code and thus removes the dependency on aio for filesystems
      implementing asynchronous operations.  It will also allow other callers
      to substitute their own completion callback.
      
      We also add a new ->ki_flags field to work around the nasty layering
      violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
      eventfd notification about ffs events").
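      
      A sketch of the resulting layout (types abbreviated and hypothetical; the
      real aio-side structure lives in fs/aio.c):
      
      #include <linux/fs.h>
      #include <linux/kernel.h>
      
      /* Abbreviated stand-in for the aio-side wrapper around the slim kiocb. */
      struct sketch_aio_kiocb {
              struct kiocb    common;   /* the generic iocb is embedded */
              /* ... aio-only bookkeeping (ctx, ring index, list linkage, ...) ... */
      };
      
      /* Completion hook wired into common.ki_complete by the aio code. */
      static void sketch_aio_complete(struct kiocb *iocb, long res, long res2)
      {
              struct sketch_aio_kiocb *req =
                      container_of(iocb, struct sketch_aio_kiocb, common);
      
              /* deliver (res, res2) to the aio ring via 'req' ... */
              (void)req;
      }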
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  34. 27 Sep 2014, 1 commit