1. 01 Jul, 2020 1 commit
  2. 29 Jun, 2020 1 commit
  3. 27 May, 2020 1 commit
  4. 21 May, 2020 1 commit
    • M
      dm: use DMDEBUG macros now that they use pr_debug variants · ac75b09f
      Mike Snitzer committed
      Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses
      pr_debug_ratelimited, clean up DM's 2 direct pr_debug callers to use
      them to get the benefit of consistent DM_FMT formatting of debugging
      messages.
      
      While doing so, dm-mpath.c:dm_report_EIO() was switched over to using
      DMDEBUG_LIMIT due to the potential for error handling floods in the IO
      completion path.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ac75b09f
  5. 19 May, 2020 1 commit
    • C
      blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference · ac7c5675
      Christoph Hellwig committed
      blk_mq_make_request currently needs to grab a q_usage_counter
      reference when allocating a request.  This is because the block layer
      grabs one before calling blk_mq_make_request, but also releases it as
      soon as blk_mq_make_request returns.  Remove the blk_queue_exit call
      after blk_mq_make_request returns, and instead let it consume the
      reference.  This works perfectly fine for the block layer caller, just
      device mapper needs an extra reference as the old problem still
      persists there.  Open code blk_queue_enter_live in device mapper,
      as there should be no other callers and this allows better documenting
      why we do a non-try get.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ac7c5675
  6. 15 May, 2020 1 commit
  7. 14 May, 2020 1 commit
    • S
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala committed
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manage encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
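
The contiguity rule described above (bios merged into one request must carry consecutive data unit numbers, so keeping the first one is enough) can be sketched in user-space C. This is an illustration only, not the blk-crypto code: it assumes a single 64-bit data unit number and a fixed data unit size, and the helper names `dun_is_contiguous` and `dun_for_offset` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: may a bio whose first data unit is next_dun be
 * merged behind data that starts at first_dun and spans `bytes` bytes?
 */
static bool dun_is_contiguous(uint64_t first_dun, uint64_t bytes,
                              uint32_t data_unit_size, uint64_t next_dun)
{
    /* Assumption: the existing data covers whole data units. */
    assert(data_unit_size != 0 && bytes % data_unit_size == 0);
    return first_dun + bytes / data_unit_size == next_dun;
}

/*
 * Because merged bios are contiguous, the data unit number of any block
 * can be derived from the first one and the byte offset into the request.
 */
static uint64_t dun_for_offset(uint64_t first_dun, uint64_t offset,
                               uint32_t data_unit_size)
{
    assert(data_unit_size != 0);
    return first_dun + offset / data_unit_size;
}
```

With 4096-byte data units, 8192 bytes starting at data unit 100 can only be followed by a bio that starts at data unit 102.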
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a892c8d5
  8. 25 Apr, 2020 1 commit
  9. 03 Apr, 2020 3 commits
    • M
      Revert "dm: always call blk_queue_split() in dm_process_bio()" · 120c9257
      Mike Snitzer committed
      This reverts commit effd58c9.
      
      blk_queue_split() is causing excessive IO splitting -- because
      blk_max_size_offset() depends on 'chunk_sectors' limit being set and
      if it isn't (as is the case for DM targets!) it falls back to
      splitting on a 'max_sectors' boundary regardless of offset.
      
      "Fix" this by reverting back to _not_ using blk_queue_split() in
      dm_process_bio() for normal IO (reads and writes).  Long-term fix is
      still TBD but it should focus on training blk_max_size_offset() to
      call into a DM provided hook (to call DM's max_io_len()).
      
      Test results from simple misaligned IO test on 4-way dm-striped device
      with chunksize of 128K and stripesize of 512K:
      
      xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev
      
      before this revert:
      
      253,0   21        1     0.000000000  2206  Q   R 224 + 4072 [xfs_io]
      253,0   21        2     0.000008267  2206  X   R 224 / 480 [xfs_io]
      253,0   21        3     0.000010530  2206  X   R 224 / 256 [xfs_io]
      253,0   21        4     0.000027022  2206  X   R 480 / 736 [xfs_io]
      253,0   21        5     0.000028751  2206  X   R 480 / 512 [xfs_io]
      253,0   21        6     0.000033323  2206  X   R 736 / 992 [xfs_io]
      253,0   21        7     0.000035130  2206  X   R 736 / 768 [xfs_io]
      253,0   21        8     0.000039146  2206  X   R 992 / 1248 [xfs_io]
      253,0   21        9     0.000040734  2206  X   R 992 / 1024 [xfs_io]
      253,0   21       10     0.000044694  2206  X   R 1248 / 1504 [xfs_io]
      253,0   21       11     0.000046422  2206  X   R 1248 / 1280 [xfs_io]
      253,0   21       12     0.000050376  2206  X   R 1504 / 1760 [xfs_io]
      253,0   21       13     0.000051974  2206  X   R 1504 / 1536 [xfs_io]
      253,0   21       14     0.000055881  2206  X   R 1760 / 2016 [xfs_io]
      253,0   21       15     0.000057462  2206  X   R 1760 / 1792 [xfs_io]
      253,0   21       16     0.000060999  2206  X   R 2016 / 2272 [xfs_io]
      253,0   21       17     0.000062489  2206  X   R 2016 / 2048 [xfs_io]
      253,0   21       18     0.000066133  2206  X   R 2272 / 2528 [xfs_io]
      253,0   21       19     0.000067507  2206  X   R 2272 / 2304 [xfs_io]
      253,0   21       20     0.000071136  2206  X   R 2528 / 2784 [xfs_io]
      253,0   21       21     0.000072764  2206  X   R 2528 / 2560 [xfs_io]
      253,0   21       22     0.000076185  2206  X   R 2784 / 3040 [xfs_io]
      253,0   21       23     0.000077486  2206  X   R 2784 / 2816 [xfs_io]
      253,0   21       24     0.000080885  2206  X   R 3040 / 3296 [xfs_io]
      253,0   21       25     0.000082316  2206  X   R 3040 / 3072 [xfs_io]
      253,0   21       26     0.000085788  2206  X   R 3296 / 3552 [xfs_io]
      253,0   21       27     0.000087096  2206  X   R 3296 / 3328 [xfs_io]
      253,0   21       28     0.000093469  2206  X   R 3552 / 3808 [xfs_io]
      253,0   21       29     0.000095186  2206  X   R 3552 / 3584 [xfs_io]
      253,0   21       30     0.000099228  2206  X   R 3808 / 4064 [xfs_io]
      253,0   21       31     0.000101062  2206  X   R 3808 / 3840 [xfs_io]
      253,0   21       32     0.000104956  2206  X   R 4064 / 4096 [xfs_io]
      253,0   21       33     0.001138823     0  C   R 4096 + 200 [0]
      
      after this revert:
      
      253,0   18        1     0.000000000  4430  Q   R 224 + 3896 [xfs_io]
      253,0   18        2     0.000018359  4430  X   R 224 / 256 [xfs_io]
      253,0   18        3     0.000028898  4430  X   R 256 / 512 [xfs_io]
      253,0   18        4     0.000033535  4430  X   R 512 / 768 [xfs_io]
      253,0   18        5     0.000065684  4430  X   R 768 / 1024 [xfs_io]
      253,0   18        6     0.000091695  4430  X   R 1024 / 1280 [xfs_io]
      253,0   18        7     0.000098494  4430  X   R 1280 / 1536 [xfs_io]
      253,0   18        8     0.000114069  4430  X   R 1536 / 1792 [xfs_io]
      253,0   18        9     0.000129483  4430  X   R 1792 / 2048 [xfs_io]
      253,0   18       10     0.000136759  4430  X   R 2048 / 2304 [xfs_io]
      253,0   18       11     0.000152412  4430  X   R 2304 / 2560 [xfs_io]
      253,0   18       12     0.000160758  4430  X   R 2560 / 2816 [xfs_io]
      253,0   18       13     0.000183385  4430  X   R 2816 / 3072 [xfs_io]
      253,0   18       14     0.000190797  4430  X   R 3072 / 3328 [xfs_io]
      253,0   18       15     0.000197667  4430  X   R 3328 / 3584 [xfs_io]
      253,0   18       16     0.000218751  4430  X   R 3584 / 3840 [xfs_io]
      253,0   18       17     0.000226005  4430  X   R 3840 / 4096 [xfs_io]
      253,0   18       18     0.000250404  4430  Q   R 4120 + 176 [xfs_io]
      253,0   18       19     0.000847708     0  C   R 4096 + 24 [0]
      253,0   18       20     0.000855783     0  C   R 4120 + 176 [0]
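
The difference between the two traces can be reproduced with a little sector arithmetic. The following is a user-space sketch, not kernel code: it assumes a 128K chunk equals 256 sectors of 512 bytes, and the function names are hypothetical. It compares splitting at a fixed maximum size regardless of offset with splitting at chunk boundaries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Split at a fixed maximum size, ignoring where the I/O starts. */
static unsigned count_fixed_pieces(uint64_t nr_sectors, uint64_t max_sectors)
{
    assert(max_sectors != 0);
    return (unsigned)((nr_sectors + max_sectors - 1) / max_sectors);
}

/* Split so that no piece crosses a chunk boundary. */
static unsigned count_aligned_pieces(uint64_t sector, uint64_t nr_sectors,
                                     uint64_t chunk_sectors)
{
    unsigned pieces = 0;

    assert(chunk_sectors != 0);
    while (nr_sectors > 0) {
        /* Sectors left before the next chunk boundary. */
        uint64_t to_boundary = chunk_sectors - (sector % chunk_sectors);
        uint64_t len = nr_sectors < to_boundary ? nr_sectors : to_boundary;

        sector += len;
        nr_sectors -= len;
        pieces++;
    }
    return pieces;
}

/* Does [sector, sector + len) span more than one chunk? */
static bool crosses_chunk(uint64_t sector, uint64_t len, uint64_t chunk_sectors)
{
    assert(len != 0 && chunk_sectors != 0);
    return sector / chunk_sectors != (sector + len - 1) / chunk_sectors;
}
```

For a read starting at sector 224, a fixed 256-sector split produces pieces such as 224..479 that straddle the boundary at 256 and must be split again, while the offset-aware split first issues the 32 sectors up to the boundary and then proceeds in whole chunks.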
      
      Fixes: effd58c9 ("dm: always call blk_queue_split() in dm_process_bio()")
      Cc: stable@vger.kernel.org
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Barry Marson <bmarson@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      120c9257
    • V
      dax: Move mandatory ->zero_page_range() check in alloc_dax() · 4e4ced93
      Vivek Goyal committed
      The zero_page_range() dax operation is mandatory for dax devices. Right now
      that check happens in the dax_zero_page_range() function. Dan thinks that's
      too late and it's better to do the check earlier in alloc_dax().
      
      I also modified alloc_dax() to return pointer with error code in it in
      case of failure. Right now it returns NULL and caller assumes failure
      happened due to -ENOMEM. But with this ->zero_page_range() check, I
      need to return -EINVAL instead.
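
Returning a "pointer with error code in it" follows the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idiom. The sketch below is a user-space imitation of that idiom, assuming error codes occupy the top 4095 values of the address space; `alloc_dax_sketch` and the struct names are hypothetical stand-ins, not the real dax API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space imitations of ERR_PTR(), IS_ERR() and PTR_ERR(). */
static inline void *err_ptr(long error) { return (void *)(intptr_t)error; }
static inline bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static inline long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

struct dax_ops_sketch {
    int (*zero_page_range)(void);   /* mandatory operation */
};

struct dax_dev_sketch {
    const struct dax_ops_sketch *ops;
};

/* Stub operation used only to build a valid ops table in examples. */
static int zero_noop(void) { return 0; }

/* Hypothetical allocator: rejects ops tables lacking the mandatory hook. */
static struct dax_dev_sketch *alloc_dax_sketch(const struct dax_ops_sketch *ops)
{
    struct dax_dev_sketch *dev;

    if (ops && !ops->zero_page_range)
        return err_ptr(-EINVAL);    /* caller can tell this from -ENOMEM */

    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return err_ptr(-ENOMEM);
    dev->ops = ops;
    return dev;
}
```

A caller checks the result with `is_err()` and reads the reason with `ptr_err()` instead of assuming that every failure means out of memory.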
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      4e4ced93
    • V
      dm,dax: Add dax zero_page_range operation · cdf6cdcd
      Vivek Goyal committed
      This patch adds support for dax zero_page_range operation to dm targets.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Link: https://lore.kernel.org/r/20200228163456.1587-5-vgoyal@redhat.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      cdf6cdcd
  10. 28 Mar, 2020 1 commit
    • C
      block: simplify queue allocation · 3d745ea5
      Christoph Hellwig committed
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3d745ea5
  11. 25 Mar, 2020 1 commit
  12. 04 Mar, 2020 1 commit
  13. 28 Feb, 2020 1 commit
    • M
      dm: report suspended device during destroy · adc0daad
      Mikulas Patocka committed
      The function dm_suspended returns true if the target is suspended.
      However, when the target is being suspended during unload, it returns
      false.
      
      An example where this is a problem: the test "!dm_suspended(wc->ti)" in
      writecache_writeback is not sufficient, because dm_suspended returns
      zero while writecache_suspend is in progress.  As is, without an
      enhanced dm_suspended, simply switching from flush_workqueue to
      drain_workqueue still emits warnings:
      workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
      workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
      workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
      workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
      workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries
      
      writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
      flushes the current work. However, the workqueue may re-queue itself and
      flush_workqueue doesn't wait for re-queued works to finish. Because of
      this - the function writecache_writeback continues execution after the
      device was suspended and then concurrently with writecache_dtr, causing
      a crash in writecache_writeback.
      
      We must use drain_workqueue - that waits until the work and all re-queued
      works finish.
      
      As a prereq for switching to drain_workqueue, this commit fixes
      dm_suspended to return true after the presuspend hook and before the
      postsuspend hook - just like during a normal suspend. It allows
      simplifying the dm-integrity and dm-writecache targets so that they
      don't have to maintain suspended flags on their own.
      
      With this change drain_workqueue() can be used effectively.  This
      change was tested with the lvm2 testsuite and cryptsetup testsuite and
      there are no regressions.
      
      Fixes: 48debafe ("dm: add writecache target")
      Cc: stable@vger.kernel.org # 4.18+
      Reported-by: Corey Marthaler <cmarthal@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      adc0daad
  14. 28 Jan, 2020 1 commit
  15. 13 Nov, 2019 3 commits
  16. 07 Nov, 2019 1 commit
  17. 23 Aug, 2019 1 commit
    • M
      dm: make dm_table_find_target return NULL · 123d87d5
      Mikulas Patocka committed
      Currently, if we pass a sector number that is too high to
      dm_table_find_target, it returns a zeroed dm_target structure and callers
      test if the structure is zeroed with the macro dm_target_is_valid.
      
      However, returning NULL is common practice to indicate errors.
      
      This patch refactors the dm code, so that dm_table_find_target returns
      NULL and its callers test the returned value for NULL. The macro
      dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
      extra zeroed target.
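
The calling convention after this refactoring can be sketched in user-space C as follows. This is a simplified illustration, not the DM table code: it uses a linear scan over an array rather than the real lookup structure, and `find_target`, `target_sketch` and `table_sketch` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct target_sketch {
    uint64_t begin;   /* first sector mapped by this target */
    uint64_t len;     /* number of sectors it covers */
};

struct table_sketch {
    size_t num_targets;
    const struct target_sketch *targets;   /* sorted by begin */
};

/*
 * Return the target that maps `sector`, or NULL when the sector lies
 * beyond the end of the table. No zeroed sentinel entry is needed.
 */
static const struct target_sketch *
find_target(const struct table_sketch *t, uint64_t sector)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->num_targets; i++) {
        const struct target_sketch *ti = &t->targets[i];

        if (sector >= ti->begin && sector - ti->begin < ti->len)
            return ti;
    }
    return NULL;
}
```

Callers then test the returned pointer directly, for example `if (!ti) return -EIO;`, instead of inspecting the structure's contents.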
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      123d87d5
  18. 12 Jul, 2019 1 commit
  19. 06 Jul, 2019 2 commits
  20. 22 May, 2019 1 commit
    • M
      dm: make sure to obey max_io_len_target_boundary · 51b86f9a
      Michael Lass committed
      Commit 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM
      target interface") incorrectly removed code from
      __send_changing_extent_only() that is required to impose a per-target IO
      boundary on IO that exceeds max_io_len_target_boundary().  Otherwise
      "special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
      where allowed.
      
      Fix this by restoring the max_io_len_target_boundary() limit in
      __send_changing_extent_only()
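
The restored limit amounts to clamping the length of an I/O to the sectors remaining in its target. A minimal user-space sketch, assuming the boundary is simply the end of the target (the struct and function names here are hypothetical, not the DM core's):

```c
#include <assert.h>
#include <stdint.h>

struct target_sketch {
    uint64_t begin;   /* first sector of the target */
    uint64_t len;     /* length of the target in sectors */
};

/* Sectors left between `sector` and the end of the target. */
static uint64_t sectors_to_target_end(const struct target_sketch *ti,
                                      uint64_t sector)
{
    assert(sector >= ti->begin && sector - ti->begin < ti->len);
    return ti->len - (sector - ti->begin);
}

/* Clamp a request so it never extends past the target it starts in. */
static uint64_t clamp_to_target(const struct target_sketch *ti,
                                uint64_t sector, uint64_t nr_sectors)
{
    uint64_t limit = sectors_to_target_end(ti, sector);

    return nr_sectors < limit ? nr_sectors : limit;
}
```

Without the clamp, a 300-sector discard starting 100 sectors before the end of a target would reach 200 sectors past it.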
      
      Fixes: 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
      Cc: stable@vger.kernel.org # 5.1+
      Signed-off-by: Michael Lass <bevan@bi-co.net>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      51b86f9a
  21. 21 May, 2019 1 commit
    • D
      dax: Arrange for dax_supported check to span multiple devices · 7bf7eac8
      Dan Williams committed
      Pankaj reports that starting with commit ad428cdb "dax: Check the
      end of the block-device capacity with dax_direct_access()" device-mapper
      no longer allows dax operation. This results from the stricter checks in
      __bdev_dax_supported() that validate that the start and end of a
      block-device map to the same 'pagemap' instance.
      
      Teach the dax-core and device-mapper to validate the 'pagemap' on a
      per-target basis. This is accomplished by refactoring the
      bdev_dax_supported() internals into generic_fsdax_supported() which
      takes a sector range to validate. Consequently generic_fsdax_supported()
      is suitable to be used in a device-mapper ->iterate_devices() callback.
      A new ->dax_supported() operation is added to allow composite devices to
      split and route upper-level bdev_dax_supported() requests.
      
      Fixes: ad428cdb ("dax: Check the end of the block-device...")
      Cc: <stable@vger.kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reported-by: Pankaj Gupta <pagupta@redhat.com>
      Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
      Tested-by: Pankaj Gupta <pagupta@redhat.com>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      7bf7eac8
  22. 16 May, 2019 1 commit
  23. 26 Apr, 2019 1 commit
  24. 05 Apr, 2019 1 commit
    • M
      dm: disable DISCARD if the underlying storage no longer supports it · bcb44433
      Mike Snitzer committed
      Some storage devices report support for discard commands like
      WRITE_SAME_16 with unmap, but reject discard commands sent to the
      storage device.  This is a clear storage firmware bug but it doesn't
      change the fact that should a program cause discards to be sent to a
      multipath device layered on this buggy storage, all paths can end up
      failed at the same time from the discards, causing possible I/O loss.
      
      The first discard to a path will fail with Illegal Request, Invalid
      field in cdb, e.g.:
       kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
       kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
       kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
       kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
       kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
      
      The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
      device disables its support for discard on this path, and because of the
      BLK_STS_TARGET error multipath fails the discard without failing any
      path or retrying down a different path.  But subsequent discards can
      cause path failures.  Any discards sent to the path which already failed
      a discard end up failing with EIO from blk_cloned_rq_check_limits with
      an "over max size limit" error since the discard limit was set to 0 by
      the sd driver for the path.  As the error is EIO, this now fails the
      path and multipath tries to send the discard down the next path.  This
      cycle continues as discards are sent until all paths fail.
      
      Fix this by training DM core to disable DISCARD if the underlying
      storage already did so.
      
      Also, fix branching in dm_done() and clone_endio() to reflect the
      mutually exclusive nature of the IO operations in question.
      
      Cc: stable@vger.kernel.org
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      bcb44433
  25. 02 Apr, 2019 1 commit
    • M
      dm: revert 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE") · 75ae1936
      Mikulas Patocka committed
      The limit was already incorporated to dm-crypt with commit 4e870e94
      ("dm crypt: fix error with too large bios"), so we don't need to apply
      it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
      wrong anyway because the variable ti->max_io_len is supposed to be in
      units of 512-byte sectors, not bytes.
      
      Reduction of the limit to 1048576 sectors could even cause data
      corruption in rare cases - suppose that we have a dm-striped device with
      stripe size 768MiB. The target will call dm_set_target_max_io_len with
      the value 1572864. The buggy code would reduce it to 1048576. Now, the
      dm-core will erroneously split the bios on 1048576-sector boundary
      instead of 1572864-sector boundary and pass these stripe-crossing bios
      to the striped target.
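
The numbers in this example can be checked with a short user-space sketch. It assumes 512-byte sectors as stated above; `bytes_to_sectors` and `crosses_boundary` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Convert a byte count to 512-byte sectors. */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
    assert(bytes % SECTOR_SIZE == 0);
    return bytes / SECTOR_SIZE;
}

/* Does [start, start + len) span more than one `boundary`-sized stripe? */
static bool crosses_boundary(uint64_t start, uint64_t len, uint64_t boundary)
{
    assert(len != 0 && boundary != 0);
    return start / boundary != (start + len - 1) / boundary;
}
```

A 768MiB stripe is 1572864 sectors. If bios are split every 1048576 sectors instead, the second piece covers sectors 1048576 to 2097151 and therefore straddles the stripe boundary at 1572864.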
      
      Cc: stable@vger.kernel.org # v4.16+
      Fixes: 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      75ae1936
  26. 06 Mar, 2019 2 commits
  27. 21 Feb, 2019 1 commit
  28. 20 Feb, 2019 1 commit
  29. 07 Feb, 2019 2 commits
    • M
      dm: don't use bio_trim() afterall · fa8db494
      Mike Snitzer committed
      bio_trim() has an early return, which makes it _not_ idempotent, if the
      offset is 0 and the bio's bi_size already matches the requested size.
      Prior to DM, all users of bio_trim() were fine with this.  But DM has
      exposed the fact that bio_trim()'s early return is incompatible with a
      cloned bio whose integrity payload must be trimmed via
      bio_integrity_trim().
      
      Fix this by reverting DM back to doing the equivalent of bio_trim() but
      in an idempotent manner (so bio_integrity_trim is always performed).
      
      Follow-on work is needed to assess what benefit bio_trim()'s early
      return is providing to its existing callers.
      Reported-by: Milan Broz <gmazyland@gmail.com>
      Fixes: 57c36519 ("dm: fix clone_bio() to trigger blk_recount_segments()")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      fa8db494
    • M
      dm: add memory barrier before waitqueue_active · 645efa84
      Mikulas Patocka committed
      Block core changes to switch bio-based IO accounting to be percpu had a
      side-effect of altering DM core to now rely on calling waitqueue_active
      (in both bio-based and request-based) to check if another task is in
      dm_wait_for_completion().
      
      A memory barrier is needed before calling waitqueue_active().  DM core
      doesn't piggyback on a preceding memory barrier so it must explicitly
      use its own.
      
      For more details on why using waitqueue_active() without a preceding
      barrier is unsafe, please see the comment before the waitqueue_active()
      definition in include/linux/wait.h.
      
      Add the missing memory barrier by switching to using wq_has_sleeper().
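
The ordering problem can be illustrated with C11 atomics in user space. This is an analogy under stated assumptions, not the DM or wait-queue implementation: a boolean flag stands in for "the wait queue is non-empty", an integer counter stands in for the in-flight I/O count, and all names are hypothetical. Each side writes its own variable, issues a full fence, and only then reads the other side's variable, which is the ordering wq_has_sleeper() provides on the waker side by issuing a full barrier before checking waitqueue_active().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct md_sketch {
    atomic_int in_flight;     /* outstanding I/Os (stand-in) */
    atomic_bool has_waiter;   /* stand-in for a non-empty wait queue */
};

/*
 * Completion side: publish the state change, fence, then check for a
 * waiter. Without the fence the load could be ordered before the
 * decrement becomes visible, and a wakeup could be missed.
 */
static bool end_io_should_wake(struct md_sketch *md)
{
    int prev = atomic_fetch_sub_explicit(&md->in_flight, 1,
                                         memory_order_relaxed);

    assert(prev > 0);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&md->has_waiter, memory_order_relaxed);
}

/*
 * Waiter side: announce the waiter, fence, then re-check the condition
 * before deciding to sleep.
 */
static bool waiter_must_sleep(struct md_sketch *md)
{
    atomic_store_explicit(&md->has_waiter, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&md->in_flight, memory_order_relaxed) != 0;
}
```

With both fences in place, at least one side is guaranteed to observe the other's write, so the waiter either sees the count reach zero or is seen by the completion path.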
      
      Fixes: 6f757231 ("dm: remove the pending IO accounting")
      Fixes: c4576aed ("dm: fix request-based dm's use of dm_wait_for_completion")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      645efa84
  30. 23 Jan, 2019 2 commits
  31. 22 Jan, 2019 2 commits
    • M
      dm: fix redundant IO accounting for bios that need splitting · a1e1cb72
      Mike Snitzer committed
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
      entry.  This repeated oscillation of the IO accounting, up then down,
      isn't ideal but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: Bryan Gurney <bgurney@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a1e1cb72
    • M
      dm: fix clone_bio() to trigger blk_recount_segments() · 57c36519
      Mike Snitzer committed
      DM's clone_bio() now benefits from using bio_trim() by fixing the fact
      that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does;
      which triggers blk_recount_segments() via bio_phys_segments().
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      57c36519