1. 26 Apr 2019, 1 commit
  2. 05 Apr 2019, 1 commit
    • dm: disable DISCARD if the underlying storage no longer supports it · bcb44433
      Mike Snitzer authored
      Some storage devices report support for discard commands like
      WRITE_SAME_16 with unmap, yet reject discard commands actually sent to
      them.  This is a clear storage firmware bug, but it doesn't change the
      fact that, should a program cause discards to be sent to a multipath
      device layered on this buggy storage, all paths can end up failed at
      the same time by the discards, causing possible I/O loss.
      
      The first discard to a path will fail with Illegal Request, Invalid
      field in cdb, e.g.:
       kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
       kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
       kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
       kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
       kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
      
      The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
      device disables its support for discard on this path, and because of the
      BLK_STS_TARGET error multipath fails the discard without failing any
      path or retrying down a different path.  But subsequent discards can
      cause path failures.  Any discard sent to a path which has already failed
      a discard ends up failing with EIO from blk_cloned_rq_check_limits with
      an "over max size limit" error since the discard limit was set to 0 by
      the sd driver for the path.  As the error is EIO, this now fails the
      path and multipath tries to send the discard down the next path.  This
      cycle continues as discards are sent until all paths fail.
      
      Fix this by training DM core to disable DISCARD if the underlying
      storage already did so.
      
      Also, fix branching in dm_done() and clone_endio() to reflect the
      mutually exclusive nature of the IO operations in question.
      
      Cc: stable@vger.kernel.org
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
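      A minimal sketch of the idea, assuming the fix is a small helper that
      zeroes the mapped device's discard limits once a discard is rejected
      (names and placement are illustrative, not the exact upstream hunk):

        /* Illustrative sketch -- invoked when a clone's discard fails with a
         * "not supported" style error, so DM stops issuing further discards. */
        static void disable_discard(struct mapped_device *md)
        {
                struct queue_limits *limits = &md->queue->limits;

                /* device doesn't really support DISCARD, disable it */
                limits->max_discard_sectors = 0;
                blk_queue_flag_clear(QUEUE_FLAG_DISCARD, md->queue);
        }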
  3. 02 Apr 2019, 1 commit
    • dm: revert 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE") · 75ae1936
      Mikulas Patocka authored
      The limit was already incorporated into dm-crypt with commit 4e870e94
      ("dm crypt: fix error with too large bios"), so we don't need to apply
      it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
      wrong anyway because the variable ti->max_io_len is supposed to be in
      units of 512-byte sectors, not in bytes.
      
      Reduction of the limit to 1048576 sectors could even cause data
      corruption in rare cases - suppose that we have a dm-striped device with
      stripe size 768MiB. The target will call dm_set_target_max_io_len with
      the value 1572864. The buggy code would reduce it to 1048576. Now, the
      dm core will erroneously split the bios on a 1048576-sector boundary
      instead of a 1572864-sector boundary and pass these stripe-crossing bios
      to the striped target.
      
      Cc: stable@vger.kernel.org # v4.16+
      Fixes: 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
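      A worked illustration of the unit mismatch, as a small standalone program
      (the numbers assume 4 KiB pages and BIO_MAX_PAGES == 256; this is not the
      reverted kernel code itself):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long stripe_bytes   = 768ULL << 20;      /* 768 MiB stripe size       */
                unsigned long long stripe_sectors = stripe_bytes >> 9; /* 1572864 512-byte sectors  */
                unsigned long long bogus_limit    = 256ULL * 4096;     /* BIO_MAX_PAGES * PAGE_SIZE
                                                                          = 1048576, a BYTE count   */

                /* The reverted code clamped max_io_len (in sectors) with a byte
                 * quantity, so 1572864 sectors was wrongly cut down to 1048576. */
                printf("stripe: %llu sectors, bogus clamp: %llu\n",
                       stripe_sectors, bogus_limit);
                return 0;
        }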
  4. 06 Mar 2019, 2 commits
  5. 21 Feb 2019, 1 commit
  6. 20 Feb 2019, 1 commit
  7. 07 Feb 2019, 2 commits
    • dm: don't use bio_trim() afterall · fa8db494
      Mike Snitzer authored
      bio_trim() has an early return, which makes it _not_ idempotent, if the
      offset is 0 and the bio's bi_size already matches the requested size.
      Prior to DM, all users of bio_trim() were fine with this.  But DM has
      exposed the fact that bio_trim()'s early return is incompatible with a
      cloned bio whose integrity payload must be trimmed via
      bio_integrity_trim().
      
      Fix this by reverting DM back to doing the equivalent of bio_trim() but
      in an idempotent manner (so bio_integrity_trim is always performed).
      
      Follow-on work is needed to assess what benefit bio_trim()'s early
      return is providing to its existing callers.
      Reported-by: Milan Broz <gmazyland@gmail.com>
      Fixes: 57c36519 ("dm: fix clone_bio() to trigger blk_recount_segments()")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
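      A sketch of the "equivalent of bio_trim(), but idempotent" clone setup
      described above (a simplified fragment, not the exact upstream diff):

        /* Always advance and clamp the clone -- no early return -- so the
         * integrity payload is trimmed unconditionally as well. */
        bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
        clone->bi_iter.bi_size = to_bytes(len);

        if (bio_integrity(clone))
                bio_integrity_trim(clone);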
    • dm: add memory barrier before waitqueue_active · 645efa84
      Mikulas Patocka authored
      Block core changes to switch bio-based IO accounting to be percpu had a
      side-effect of altering DM core to now rely on calling waitqueue_active
      (in both bio-based and request-based) to check if another task is in
      dm_wait_for_completion().
      
      A memory barrier is needed before calling waitqueue_active().  DM core
      doesn't piggyback on a preceding memory barrier so it must explicitly
      use its own.
      
      For more details on why using waitqueue_active() without a preceding
      barrier is unsafe, please see the comment before the waitqueue_active()
      definition in include/linux/wait.h.
      
      Add the missing memory barrier by switching to using wq_has_sleeper().
      
      Fixes: 6f757231 ("dm: remove the pending IO accounting")
      Fixes: c4576aed ("dm: fix request-based dm's use of dm_wait_for_completion")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
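      A sketch of the resulting wake-up pattern (simplified; the exact call
      sites are in DM's bio and request completion paths):

        /* wq_has_sleeper() issues the required smp_mb() before checking for
         * waiters; a bare waitqueue_active() does not. */
        if (unlikely(wq_has_sleeper(&md->wait)))
                wake_up(&md->wait);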
  8. 23 Jan 2019, 2 commits
  9. 22 Jan 2019, 2 commits
    • dm: fix redundant IO accounting for bios that need splitting · a1e1cb72
      Mike Snitzer authored
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
      entry.  This repeated up-then-down oscillation of the IO accounting
      isn't ideal, but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: Bryan Gurney <bgurney@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
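      A sketch of the accounting correction described above, as it would sit in
      DM's bio-splitting path right before the remainder is resubmitted
      ('part_stat_sub_sectors' is a hypothetical stand-in for the percpu
      stat-subtraction helper; the point is the arithmetic, not the name):

        /* The remaining ci.sector_count sectors will be re-accounted when the
         * bio re-enters dm_make_request() via generic_make_request(), so
         * subtract them from the stats start_io_acct() already charged. */
        part_stat_lock();
        part_stat_sub_sectors(&dm_disk(md)->part0,
                              op_stat_group(bio_op(bio)), ci.sector_count);
        part_stat_unlock();

        bio_chain(split, bio);
        generic_make_request(bio);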
    • dm: fix clone_bio() to trigger blk_recount_segments() · 57c36519
      Mike Snitzer authored
      Switch DM's clone_bio() to use bio_trim(), fixing the fact that
      clone_bio() wasn't clearing BIO_SEG_VALID the way bio_trim() does;
      clearing that flag triggers blk_recount_segments() via
      bio_phys_segments().
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
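      The gist of the change, sketched as a one-line fragment (trimming the
      clone with bio_trim() is what clears BIO_SEG_VALID):

        /* trim the clone to (sector, len); also invalidates the cached segment count */
        bio_trim(clone, sector - clone->bi_iter.bi_sector, len);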
  10. 20 Dec 2018, 1 commit
    • dm: don't reuse bio for flushes · dbe3ece1
      Jens Axboe authored
      DM currently has a statically allocated bio that it uses to issue empty
      flushes. It doesn't submit this bio, it just uses it for maintaining
      state while setting up clones. Multiple users can access this bio at the
      same time. This wasn't previously an issue, even if it was a bit iffy,
      but with the blkg associations it can become one.
      
      We set up the blkg association, then clone bios and submit, then remove
      the blkg association again. But since we can have multiple tasks doing
      this at the same time, against multiple blkgs, we can either lose
      references to a blkg or put it twice. The latter causes complaints on
      the percpu ref being <= 0 when released, and can cause use-after-free as
      well. Ming reports that xfstest generic/475 triggers this:
      
      ------------[ cut here ]------------
      percpu ref (blkg_release) <= 0 (0) after switching to atomic
      WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0
      
      Switch to just using an on-stack bio for this, and get rid of the
      embedded bio.
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Reported-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
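      A sketch of the on-stack flush bio described above, as it would appear in
      DM's bio-processing path (simplified from the idea in the commit message,
      not the full upstream hunk):

        if (bio->bi_opf & REQ_PREFLUSH) {
                struct bio flush_bio;

                /* On-stack bio used only as the basis for the clones; it is
                 * never submitted itself, so stack lifetime is sufficient. */
                bio_init(&flush_bio, NULL, 0);
                flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
                ci.bio = &flush_bio;
                ci.sector_count = 0;
                error = __send_empty_flush(&ci);
        }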
  11. 18 Dec 2018, 3 commits
  12. 11 Dec 2018, 2 commits
    • dm: fix request-based dm's use of dm_wait_for_completion · c4576aed
      Mike Snitzer authored
      The md->wait waitqueue is used by both bio-based and request-based DM.
      Commit dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for
      outstanding IO") lost sight of the requirement that
      dm_wait_for_completion() must work with all types of DM devices.
      
      Fix md_in_flight() to call the blk-mq or bio-based method accordingly.
      
      Fixes: dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
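      A sketch of the dispatch the commit describes (the blk-mq helper name is
      taken from the referenced commit's title; the bio-based helper name is
      illustrative):

        static bool md_in_flight(struct mapped_device *md)
        {
                /* request-based (blk-mq) and bio-based devices track
                 * outstanding IO differently */
                if (queue_is_mq(md->queue))
                        return blk_mq_queue_busy(md->queue);
                else
                        return md_in_flight_bios(md) != 0;
        }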
    • dm: fix inflight IO check · b7934ba4
      Jens Axboe authored
      After switching to percpu inflight counters, the inflight check
      is totally buggy. It's perfectly valid for some counters to be
      non-zero while having a total inflight IO count of 0, that's how
      these kinds of counters work (inc on one CPU, dec on another).
      Fix the md_in_flight() check to sum all the percpu counters, so it
      doesn't potentially return a false positive.
      
      While at it, remove the inflight read for IO completion. We don't
      need it, just wake anyone that's waiting for the IO count to drop
      to zero. The caller needs to re-check that value anyway when woken,
      which it does.
      
      Fixes: 6f757231 ("dm: remove the pending IO accounting")
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
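      A sketch of the percpu summation (assuming the in_flight counters live in
      the per-cpu partition stats, as in the series this fixes; the function
      and accessor names here are illustrative):

        static unsigned long md_sum_in_flight(struct mapped_device *md)
        {
                struct hd_struct *part = &dm_disk(md)->part0;
                unsigned long sum = 0;
                int cpu;

                /* an inc may land on one CPU and the matching dec on another,
                 * so only the total across all CPUs is meaningful */
                for_each_possible_cpu(cpu)
                        sum += part_stat_local_read_cpu(part, in_flight[0], cpu) +
                               part_stat_local_read_cpu(part, in_flight[1], cpu);

                return sum;
        }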
  13. 10 Dec 2018, 2 commits
  14. 08 Dec 2018, 2 commits
  15. 16 Nov 2018, 1 commit
  16. 26 Oct 2018, 1 commit
    • block: add a report_zones method · e76239a3
      Christoph Hellwig authored
      Dispatching a report zones command through the request queue is a major
      pain due to the command reply payload rewriting necessary. Given that
      blkdev_report_zones() is executing everything synchronously, implement
      report zones as a block device file operation instead, allowing major
      simplification of the code in many places.
      
      Since sd, null-blk, dm-linear and dm-flakey are the only block device
      drivers that expose zoned block devices, these drivers are modified to
      provide the device-side implementation of the report_zones() block
      device file operation.
      
      For device mapper, a new report_zones() target type operation is
      defined so that the upper block layer's blkdev_report_zones() calls can
      be propagated down to the underlying devices of the dm targets.
      Implementation for this new operation is added to the dm-linear and
      dm-flakey targets.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [Damien]
      * Changed method block_device argument to gendisk
      * Various bug fixes and improvements
      * Added support for null_blk, dm-linear and dm-flakey.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
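      A sketch of what a dm target's report_zones() hook looks like under this
      scheme, modeled on dm-linear (the exact argument lists are an assumption
      here, not quoted from the patch):

        static int linear_report_zones(struct dm_target *ti, sector_t sector,
                                       struct blk_zone *zones,
                                       unsigned int *nr_zones, gfp_t gfp_mask)
        {
                struct linear_c *lc = ti->private;
                int ret;

                /* query the backing device, then remap zone start/write-pointer
                 * sectors back into the dm device's sector space */
                ret = blkdev_report_zones(lc->dev->bdev,
                                          linear_map_sector(ti, sector),
                                          zones, nr_zones, gfp_mask);
                if (!ret)
                        dm_remap_zone_report(ti, lc->start, zones, nr_zones);
                return ret;
        }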
  17. 17 Oct 2018, 1 commit
  18. 11 Oct 2018, 2 commits
  19. 10 Oct 2018, 1 commit
    • dm: fix report zone remapping to account for partition offset · 9864cd5d
      Damien Le Moal authored
      If dm-linear or dm-flakey are layered on top of a partition of a zoned
      block device, remapping of the start sector and write pointer position
      of the zones reported by a report zones BIO must be modified to account
      for the target table entry mapping (start offset within the device and
      entry mapping with the dm device).  If the target's backing device is a
      partition of a whole disk, the start sector on the physical device of
      the partition must also be accounted for when modifying the zone
      information.  However, dm_remap_zone_report() was not considering this
      last case, resulting in incorrect zone information remapping with
      targets using disk partitions.
      
      Fix this by calculating the target backing device start sector using
      the position of the completed report zones BIO and the unchanged
      position and size of the original report zone BIO. With this value
      calculated, the start sector and write pointer position of the target
      zones can be correctly remapped.
      
      Fixes: 10999307 ("dm: introduce dm_remap_zone_report()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
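      A worked example of the missing term, with made-up numbers (a standalone
      illustration, not the kernel code): suppose dm-linear maps dm sector 0 to
      sector 2048 of /dev/sdb1, and /dev/sdb1 itself starts at sector 1050624
      of the whole disk, so zones reported by the disk carry both offsets.

        #include <stdio.h>

        int main(void)
        {
                unsigned long long part_start  = 1050624; /* partition start on the whole disk        */
                unsigned long long table_start = 2048;    /* dm table: start offset within /dev/sdb1  */
                unsigned long long ti_begin    = 0;       /* target's offset within the dm device     */
                unsigned long long disk_zone   = 1315840; /* zone start as reported by the disk       */

                /* Both the partition start and the table offset must be
                 * subtracted when remapping into the dm device's space. */
                unsigned long long dm_zone = disk_zone - part_start - table_start + ti_begin;
                printf("zone start seen through the dm device: %llu\n", dm_zone);
                return 0;
        }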
  20. 18 Jul 2018, 1 commit
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan authored
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  They are
      instead indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
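      A sketch of a caller after this change, per the commit text (the second
      argument to the accounting helpers is a REQ_OP; the surrounding DM
      context is abbreviated):

        /* start accounting for a bio-based mapped device */
        generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
                              &dm_disk(md)->part0);

        /* ...and on completion */
        generic_end_io_acct(md->queue, bio_op(bio), &dm_disk(md)->part0,
                            start_time);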
  21. 29 Jun 2018, 1 commit
    • dm: prevent DAX mounts if not supported · dbc62659
      Ross Zwisler authored
      Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
      flag is set on the device's request queue to decide whether or not the
      device supports filesystem DAX.  Really we should be using
      bdev_dax_supported() like filesystems do at mount time.  This performs
      other tests like checking to make sure the dax_direct_access() path works.
      
      We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
      any of the underlying devices do not support DAX.  This makes the handling
      of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
      in dm_table_set_restrictions().
      
      Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
      will ensure that filesystems built upon DM devices will only be able to
      mount with DAX if all underlying devices also support DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Fixes: commit 545ed20e ("dm: add infrastructure for DAX support")
      Cc: stable@vger.kernel.org
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
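      A sketch of the per-device check described above, written as an
      iterate-devices callback (a simplified fragment under that assumption,
      not the exact upstream hunk):

        /* returns nonzero only if this underlying device really supports DAX */
        static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
                                       sector_t start, sector_t len, void *data)
        {
                return bdev_dax_supported(dev->bdev, PAGE_SIZE);
        }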
  22. 23 Jun 2018, 1 commit
    • dm: use bio_split() when splitting out the already processed bio · f21c601a
      Mike Snitzer authored
      Use of bio_clone_bioset() is inefficient if there is no need to clone
      the original bio's bio_vec array.  Best to use the bio_clone_fast()
      variant.  Also, just using bio_advance() is only part of what is needed
      to properly set up the clone -- it doesn't account for the various
      bio_integrity() related work that also needs to be performed (see
      bio_split).
      
      Address both of these issues by switching from bio_clone_bioset() to
      bio_split().
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.15+, requires removal of '&' before md->queue->bio_split
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
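      A sketch of the switch described above (a simplified fragment; in DM this
      sits where the already-processed front of the bio is split off before the
      remainder is resubmitted):

        /* bio_split() fast-clones the already-processed front of the bio
         * (everything except the ci.sector_count sectors still to handle) and
         * also splits the integrity payload, which the old
         * bio_clone_bioset() + bio_advance() combination did not handle. */
        struct bio *b = bio_split(bio, bio_sectors(bio) - ci.sector_count,
                                  GFP_NOIO, &md->queue->bio_split);
        bio_chain(b, bio);
        generic_make_request(bio);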
  23. 08 Jun 2018, 1 commit
  24. 31 May 2018, 2 commits
  25. 23 May 2018, 1 commit
    • dax: Introduce a ->copy_to_iter dax operation · b3a9a0c3
      Dan Williams authored
      Similar to the ->copy_from_iter() operation, a platform may want to
      deploy an architecture or device specific routine for handling reads
      from a dax_device like /dev/pmemX. On x86 this routine will point to a
      machine check safe version of copy_to_iter(). For now, add the plumbing
      to device-mapper and the dax core.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
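      A sketch of the device-mapper plumbing for the new operation, mirroring
      the existing ->copy_from_iter path (the lookup helper and sector math are
      assumed from that path, not quoted from this patch):

        static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
                                          void *addr, size_t bytes, struct iov_iter *i)
        {
                struct mapped_device *md = dax_get_private(dax_dev);
                sector_t sector = pgoff * (PAGE_SIZE >> SECTOR_SHIFT);
                struct dm_target *ti;
                size_t ret = 0;
                int srcu_idx;

                ti = dm_dax_get_live_target(md, sector, &srcu_idx);
                if (!ti)
                        goto out;
                if (!ti->type->dax_copy_to_iter)
                        ret = copy_to_iter(addr, bytes, i);   /* default path */
                else
                        ret = ti->type->dax_copy_to_iter(ti, pgoff, addr, bytes, i);
        out:
                dm_put_live_table(md, srcu_idx);
                return ret;
        }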
  26. 01 May 2018, 1 commit
  27. 05 Apr 2018, 2 commits
    • dm: remove fmode_t argument from .prepare_ioctl hook · 5bd5e8d8
      Mike Snitzer authored
      Use the fmode_t that is passed to dm_blk_ioctl() rather than
      inconsistently (it varied across targets) dropping it on the floor by
      overriding it with the fmode_t stored in 'struct dm_dev'.
      
      None of the persistent reservation functions were using the fmode_t they
      got back from .prepare_ioctl, so remove that argument.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
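      The shape of the narrowed hook, per the commit title (a sketch of the
      target-type callback declaration; surrounding context omitted and the
      "before" form is an assumption):

        /* before: int (*prepare_ioctl)(struct dm_target *ti,
         *                              struct block_device **bdev, fmode_t *mode);
         * after, with the unused mode dropped: */
        typedef int (*dm_prepare_ioctl_fn)(struct dm_target *ti,
                                           struct block_device **bdev);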
    • dm: hold DM table for duration of ioctl rather than use blkdev_get · 971888c4
      Mike Snitzer authored
      Commit 519049af ("dm: use blkdev_get rather than bdgrab when issuing
      pass-through ioctl") inadvertantly introduced a regression relative to
      users of device cgroups that issue ioctls (e.g. libvirt).  Using
      blkdev_get() in DM's passthrough ioctl support implicitly introduced a
      cgroup permissions check that would fail unless care were taken to add
      all devices in the IO stack to the device cgroup.  E.g. rather than just
      adding the top-level DM multipath device to the cgroup all the
      underlying devices would need to be allowed.
      
      Fix this, to no longer require allowing all underlying devices, by
      simply holding the live DM table (which includes the table's original
      blkdev_get() reference on the block device that the ioctl will be issued
      to) for the duration of the ioctl.
      
      Also, bump the DM ioctl version so a user can know that their device
      cgroup allow workaround is no longer needed.
      Reported-by: Michal Privoznik <mprivozn@redhat.com>
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
      Cc: stable@vger.kernel.org # 4.16
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
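      A sketch of the resulting ioctl path (simplified; the prepare/unprepare
      helpers are assumed to take and release the live table reference as the
      commit message describes):

        static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
                                unsigned int cmd, unsigned long arg)
        {
                struct mapped_device *md = bdev->bd_disk->private_data;
                int r, srcu_idx;

                /* pins the live table (and its blkdev_get() reference on the
                 * target's block device) for the whole ioctl */
                r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
                if (r < 0)
                        goto out;

                r = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
        out:
                dm_unprepare_ioctl(md, srcu_idx);
                return r;
        }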
  28. 04 Apr 2018, 1 commit