1. 10 Oct, 2018 1 commit
    • dm: fix report zone remapping to account for partition offset · 9864cd5d
      Authored by Damien Le Moal
      If dm-linear or dm-flakey are layered on top of a partition of a zoned
      block device, remapping of the start sector and write pointer position
      of the zones reported by a report zones BIO must be modified to account
      for the target table entry mapping (start offset within the device and
      entry mapping with the dm device).  If the target's backing device is a
      partition of a whole disk, the start sector on the physical device of
      the partition must also be accounted for when modifying the zone
      information.  However, dm_remap_zone_report() was not considering this
      last case, resulting in incorrect zone information remapping with
      targets using disk partitions.
      
      Fix this by calculating the target backing device start sector using
      the position of the completed report zones BIO and the unchanged
      position and size of the original report zone BIO. With this value
      calculated, the start sector and write pointer position of the target
      zones can be correctly remapped.
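      
      As a rough sketch of the arithmetic (hedged; the names approximate the
      dm.c code rather than reproduce the patch verbatim), the partition
      offset is recovered from the two BIO positions and folded into each
      reported zone:
      
          /* The clone's sector advanced by the request size on completion, so
           * the backing-device start -- including any partition offset -- can
           * be derived from it; 'start' is the target's offset within the
           * backing device. */
          part_offset = clone->bi_iter.bi_sector + ti->begin
                        - (start + bio_end_sector(report_bio));
          /* Remap each reported zone into the dm device's sector space. */
          zone->start = zone->start + ti->begin - (start + part_offset);
          if (zone->cond != BLK_ZONE_COND_FULL)
                  zone->wp = zone->wp + ti->begin - (start + part_offset);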
      
      Fixes: 10999307 ("dm: introduce dm_remap_zone_report()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 18 Jul, 2018 1 commit
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Authored by Michael Callahan
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  It's now
      indexed by op_is_write().
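      
      A minimal sketch of the helper and a call site (illustrative; at this
      point the grouping reduces to the write bit):
      
          static inline int op_stat_group(unsigned int op)
          {
                  return op_is_write(op);
          }
      
          /* e.g. in a bio accounting path: */
          const int sgrp = op_stat_group(bio_op(bio));
          part_stat_inc(cpu, part, ios[sgrp]);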
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 29 Jun, 2018 1 commit
    • dm: prevent DAX mounts if not supported · dbc62659
      Authored by Ross Zwisler
      Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
      flag is set on the device's request queue to decide whether or not the
      device supports filesystem DAX.  Really we should be using
      bdev_dax_supported() like filesystems do at mount time.  This performs
      other tests like checking to make sure the dax_direct_access() path works.
      
      We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
      any of the underlying devices do not support DAX.  This makes the handling
      of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
      in dm_table_set_restrictions().
      
      Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
      will ensure that filesystems built upon DM devices will only be able to
      mount with DAX if all underlying devices also support DAX.
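      
      The iterate_devices callback then reduces to something like this (a
      sketch; bdev_dax_supported()'s exact signature has shifted between
      releases):
      
          static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
                                         sector_t start, sector_t len, void *data)
          {
                  /* Probe the whole dax_direct_access() path, not just the
                   * QUEUE_FLAG_DAX bit on the underlying request queue. */
                  return bdev_dax_supported(dev->bdev, PAGE_SIZE);
          }
      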
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Fixes: commit 545ed20e ("dm: add infrastructure for DAX support")
      Cc: stable@vger.kernel.org
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 23 Jun, 2018 1 commit
    • dm: use bio_split() when splitting out the already processed bio · f21c601a
      Authored by Mike Snitzer
      Use of bio_clone_bioset() is inefficient if there is no need to clone
      the original bio's bio_vec array.  Best to use the bio_clone_fast()
      variant.  Also, just using bio_advance() is only part of what is needed
      to properly set up the clone -- it doesn't account for the various
      bio_integrity() related work that also needs to be performed (see
      bio_split).
      
      Address both of these issues by switching from bio_clone_bioset() to
      bio_split().
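      
      In outline (a sketch, not the verbatim diff; note the stable backport
      caveat below about the '&'):
      
          /* Before (inefficient and incomplete):
           *   struct bio *b = bio_clone_bioset(bio, GFP_NOIO, &md->queue->bio_split);
           *   bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
           * After: bio_split() takes the fast clone path (no bio_vec copy) and
           * also does the bio_integrity() bookkeeping bio_advance() misses. */
          struct bio *b = bio_split(bio, bio_sectors(bio) - ci.sector_count,
                                    GFP_NOIO, &md->queue->bio_split);
          bio_chain(b, bio);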
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.15+, requires removal of '&' before md->queue->bio_split
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 08 Jun, 2018 1 commit
  6. 31 May, 2018 2 commits
  7. 23 May, 2018 1 commit
    • dax: Introduce a ->copy_to_iter dax operation · b3a9a0c3
      Authored by Dan Williams
      Similar to the ->copy_from_iter() operation, a platform may want to
      deploy an architecture or device specific routine for handling reads
      from a dax_device like /dev/pmemX. On x86 this routine will point to a
      machine check safe version of copy_to_iter(). For now, add the plumbing
      to device-mapper and the dax core.
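      
      Concretely, dax_operations grows a read-side mirror of the existing
      write-side hook (sketch of the shape):
      
          struct dax_operations {
                  /* ... */
                  size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *,
                                           size_t, struct iov_iter *);
                  /* New: arch/device specific read path, e.g. a machine check
                   * safe copy_to_iter() on x86. */
                  size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *,
                                         size_t, struct iov_iter *);
          };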
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  8. 01 May, 2018 1 commit
  9. 05 Apr, 2018 2 commits
    • dm: remove fmode_t argument from .prepare_ioctl hook · 5bd5e8d8
      Authored by Mike Snitzer
      Use the fmode_t that is passed to dm_blk_ioctl() rather than
      inconsistently (it varies across targets) dropping it on the floor by
      overriding it with the fmode_t stored in 'struct dm_dev'.
      
      All the persistent reservation functions weren't using the fmode_t they
      got back from .prepare_ioctl so remove them.
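      
      The hook's typedef change is roughly (sketch):
      
          /* Before:
           *   typedef int (*dm_prepare_ioctl_fn) (struct dm_target *ti,
           *                  struct block_device **bdev, fmode_t *mode);
           * After -- the fmode_t handed to dm_blk_ioctl() is used as-is: */
          typedef int (*dm_prepare_ioctl_fn) (struct dm_target *ti,
                                              struct block_device **bdev);
      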
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: hold DM table for duration of ioctl rather than use blkdev_get · 971888c4
      Authored by Mike Snitzer
      Commit 519049af ("dm: use blkdev_get rather than bdgrab when issuing
      pass-through ioctl") inadvertantly introduced a regression relative to
      users of device cgroups that issue ioctls (e.g. libvirt).  Using
      blkdev_get() in DM's passthrough ioctl support implicitly introduced a
      cgroup permissions check that would fail unless care were taken to add
      all devices in the IO stack to the device cgroup.  E.g. rather than just
      adding the top-level DM multipath device to the cgroup all the
      underlying devices would need to be allowed.
      
      Fix this, to no longer require allowing all underlying devices, by
      simply holding the live DM table (which includes the table's original
      blkdev_get() reference on the blockdevice that the ioctl will be issued
      to) for the duration of the ioctl.
      
      Also, bump the DM ioctl version so a user can know that their device
      cgroup allow workaround is no longer needed.
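      
      The resulting ioctl path looks roughly like this (sketch of the body of
      dm_blk_ioctl(); helper internals elided):
      
          int r, srcu_idx;
      
          /* Takes a reference on the live table (and thus on the table's
           * original blkdev_get() reference) instead of a fresh blkdev_get(). */
          r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
          if (r < 0)
                  goto out;
          r = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
      out:
          dm_unprepare_ioctl(md, srcu_idx);  /* drops the table reference */
      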
      Reported-by: Michal Privoznik <mprivozn@redhat.com>
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
      Cc: stable@vger.kernel.org # 4.16
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 04 Apr, 2018 2 commits
  11. 03 Apr, 2018 1 commit
  12. 30 Mar, 2018 1 commit
    • dm: fix dropped return code from dm_get_bdev_for_ioctl · da5dadb4
      Authored by Mike Snitzer
      dm_get_bdev_for_ioctl()'s return of 0 or 1 must be the result from
      prepare_ioctl (1 means the ioctl was issued to a partition, 0 means it
      wasn't).  Unfortunately commit 519049af ("dm: use blkdev_get rather
      than bdgrab when issuing pass-through ioctl") reused the variable 'r'
      to store the return from blkdev_get() that follows prepare_ioctl()
      -- thereby dropping prepare_ioctl()'s result on the floor.
      
      This can lead to an ioctl or persistent reservation being issued to a
      partition going unnoticed, which implies the extra permission check for
      CAP_SYS_RAWIO is skipped.
      
      Fix this by using a different variable to store blkdev_get()'s return.
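      
      In outline (sketch of the relevant fragment of dm_get_bdev_for_ioctl();
      '_dm_claim_ptr' as in the original commit):
      
          r = tgt->type->prepare_ioctl(tgt, bdev, mode);  /* 1 == partition */
          /* ... */
          /* Bug: 'r = blkdev_get(*bdev, *mode, _dm_claim_ptr);' clobbered r.
           * Fix: keep blkdev_get()'s result in its own variable. */
          ret = blkdev_get(*bdev, *mode, _dm_claim_ptr);
          if (ret < 0)
                  return ret;
          return r;  /* caller checks CAP_SYS_RAWIO when r == 1 */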
      
      Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
      Reported-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  13. 07 Mar, 2018 1 commit
  14. 01 Mar, 2018 1 commit
  15. 16 Feb, 2018 1 commit
    • dm: correctly handle chained bios in dec_pending() · 8dd601fa
      Authored by NeilBrown
      dec_pending() is given an error status (possibly 0) to be recorded
      against a bio.  It can be called several times on the one 'struct
      dm_io', and it is careful to only assign a non-zero error to
      io->status.  However, when it then assigns io->status to bio->bi_status,
      it is not careful and can overwrite a genuine error status with 0.
      
      This can happen when chained bios are in use.  If a bio is chained
      beneath the bio that this dm_io is handling, the child bio might
      complete and set bio->bi_status before the dm_io completes.
      
      This has been possible since chained bios were introduced in 3.14, and
      has become a lot easier to trigger with commit 18a25da8 ("dm: ensure
      bio submission follows a depth-first tree walk") as that commit caused
      dm to start using chained bios itself.
      
      A particular failure mode is that if a bio spans an 'error' target and a
      working target, the 'error' fragment will complete instantly and set the
      ->bi_status, and the other fragment will normally complete a little
      later, and will clear ->bi_status.
      
      The fix is simply to only assign io_error to bio->bi_status when
      io_error is not zero.
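      
      That is (sketch of the guard):
      
          /* done with normal IO or empty flush */
          if (io_error)
                  bio->bi_status = io_error;  /* never overwrite with 0 */
          bio_endio(bio);
      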
      Reported-and-tested-by: Milan Broz <gmazyland@gmail.com>
      Cc: stable@vger.kernel.org (v3.14+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  16. 30 Jan, 2018 1 commit
  17. 17 Jan, 2018 1 commit
  18. 15 Jan, 2018 1 commit
    • dm: fix incomplete request_queue initialization · c100ec49
      Authored by Mike Snitzer
      DM is no longer prone to having its request_queue be improperly
      initialized.
      
      Summary of changes:
      
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      
      A very welcome side-effect of these changes is that DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk()).
      2) call elv_register_queue() to get .request_fn request-based DM
      device's "iosched" exposed in syfs.
      
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      https://patchwork.kernel.org/patch/10067961/
      
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 07 Jan, 2018 1 commit
  20. 20 Dec, 2017 2 commits
    • dm: optimize bio-based NVMe IO submission · 978e51ba
      Authored by Mike Snitzer
      Upper level bio-based drivers that stack immediately on top of NVMe can
      leverage direct_make_request().  In addition, DM's NVMe bio-based support
      will initially only ever have one NVMe device that it submits IO to at a
      time.  There is no splitting needed.  Enhance DM core so that
      DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
      characteristics.
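      
      A sketch of the submission-path specialization (illustrative):
      
          /* One underlying NVMe device and no splitting needed, so the clone
           * can bypass generic_make_request()'s extra queueing via
           * direct_make_request(). */
          if (md->type == DM_TYPE_NVME_BIO_BASED)
                  ret = direct_make_request(clone);
          else
                  ret = generic_make_request(clone);
      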
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: introduce DM_TYPE_NVME_BIO_BASED · 22c11858
      Authored by Mike Snitzer
      If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
      none of the devices in the DM table support partial completions.  Also,
      the table has a single immutable target that doesn't require DM core to
      split bios.
      
      This will enable adding NVMe optimizations to bio-based DM.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  21. 18 Dec, 2017 1 commit
    • dm: simplify start of block stats accounting for bio-based · f3986374
      Authored by Mike Snitzer
      No apparent need to call generic_start_io_acct() until just before the IO
      is ready for submission.  start_io_acct() is the proper place to do this
      accounting -- it is also where DM accounts for pending IO and, if
      enabled, starts dm-stats accounting.
      
      Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
      This eliminates needing to take part_stat_lock() multiple times when
      starting an IO on bio-based devices.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  22. 17 Dec, 2017 5 commits
  23. 14 Dec, 2017 8 commits
    • dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() · ad3793fc
      Authored by Mike Snitzer
      Set the flag there rather than having DAX support be unique in being
      set based on table type in dm_setup_md_queue().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix __send_changing_extent_only() to send first bio and chain remainder · 3d7f4562
      Authored by Mike Snitzer
      __send_changing_extent_only() must follow the same pattern that was
      established with commit "dm: ensure bio submission follows a depth-first
      tree walk".  That is: submit first bio up to split boundary and then
      split the remainder to further submissions.
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs · 0776aa0e
      Authored by Mike Snitzer
      alloc_multiple_bios() assumes it can allocate the requested number of
      bios but until now there was no guarantee that the mempools would be
      accommodating.
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9
      Authored by Mike Snitzer
      Now that all of DM has been revised and/or verified to no longer require
      the use of BIOSET_NEED_RESCUER, the dm_offload code may be removed.
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: safely allocate multiple bioset bios · 318716dd
      Authored by Mike Snitzer
      DM targets can request multiple bios be sent to them by DM core (see:
      num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
      bios were allocated in an unsafe manner that could potentially exhaust
      the DM device's bioset -- in the face of multiple threads each trying to
      do multiple allocations from the same DM device's bioset.
      
      Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
      allocation strategy used by alloc_multiple_bios() models that used by
      dm-crypt.c:crypt_alloc_buffer().
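      
      The shape of that strategy is a two-pass allocation (hedged sketch;
      'alloc_clone'/'free_clone' stand in for the real helpers):
      
          for (try = 0; try < 2; try++) {
                  int bio_nr;
      
                  for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
                          /* First pass must not sleep on the bioset mempool. */
                          bio = alloc_clone(ci, ti, try ? GFP_NOIO : GFP_NOWAIT);
                          if (!bio)
                                  break;
                          bio_list_add(blist, bio);
                  }
                  if (bio_nr == num_bios)
                          return;
                  /* Partial set: release it and retry, this time allowed to
                   * block, so the batch can never exhaust the bioset midway. */
                  while ((bio = bio_list_pop(blist)))
                          free_clone(bio);
          }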
      
      Neil Brown initially proposed this fix but the implementation has been
      revised enough that it is inappropriate to attribute the entirety of it to
      him.
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove unused 'num_write_bios' target interface · f31c21e4
      Authored by NeilBrown
      No DM target provides num_write_bios and none has since dm-cache's
      brief use in 2013.
      
      Having the possibility of num_write_bios > 1 complicates bio
      allocation.  So remove the interface and assume there is only one bio
      needed.
      
      If a target ever needs more, it must provide a suitable bioset and
      do the allocation itself, based on its particular needs.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: ensure bio submission follows a depth-first tree walk · 18a25da8
      Authored by NeilBrown
      A dm device can, in general, represent a tree of targets, each of which
      handles a sub-range of the range of blocks handled by the parent.
      
      The bio sequencing managed by generic_make_request() requires that bios
      are generated and handled in a depth-first manner.  Each call to a
      make_request_fn() may submit bios to a single member device, and may
      submit bios for a reduced region of the same device as the
      make_request_fn.
      
      In particular, any bios submitted to member devices must be expected to
      be processed in order, so a later one must never wait for an earlier
      one.
      
      This ordering is usually achieved by using bio_split() to reduce a bio
      to a size that can be completely handled by one target, and resubmitting
      the remainder to the originating device. bio_queue_split() shows the
      canonical approach.
      
      dm doesn't follow this approach, largely because it has needed to split
      bios since long before bio_split() was available.  It currently can
      submit bios to separate targets within the one dm_make_request() call.
      Dependencies between these targets, as can happen with dm-snap, can
      cause deadlocks if either bio gets stuck behind the other in the queues
      managed by generic_make_request().  This requires the 'rescue'
      functionality provided by dm_offload_{start,end}.
      
      Some of this requirement can be removed by changing the order of bio
      submission to follow the canonical approach.  That is, if dm finds that
      it needs to split a bio, the remainder should be sent to
      generic_make_request() rather than being handled immediately.  This
      delays the handling until the first part is completely processed, so the
      deadlock problems do not occur.
      
      __split_and_process_bio() can be called both from dm_make_request() and
      from dm_wq_work().  When called from dm_wq_work() the current approach
      is perfectly satisfactory as each bio will be processed immediately.
      When called from dm_make_request(), current->bio_list will be non-NULL,
      and in this case it is best to create a separate "clone" bio for the
      remainder.
      
      When we use bio_clone_bioset() to split off the front part of a bio
      and chain the two together and submit the remainder to
      generic_make_request(), it is important that the newly allocated
      bio is used as the head to be processed immediately, and the original
      bio gets "bio_advance()"d and sent to generic_make_request() as the
      remainder.  Otherwise, if the newly allocated bio is used as the
      remainder, and if it then needs to be split again, then the next
      bio_clone_bioset() call will be made while holding a reference to a bio
      (result of the first clone) from the same bioset.  This can potentially
      exhaust the bioset mempool and result in a memory allocation deadlock.
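      
      Put together, the split path becomes roughly (sketch):
      
          /* The clone keeps the front part and is processed right away... */
          struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
          ci.io->bio = b;
          /* ...while the original is advanced past the split point, chained
           * to the clone, and resubmitted for later, depth-first handling. */
          bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
          bio_chain(b, bio);
          generic_make_request(bio);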
      
      Note that there is no race caused by reassigning ci.io->bio after already
      calling __map_bio().  This bio will only be dereferenced again after
      dec_pending() has found io->io_count to be zero, and this cannot happen
      before the dec_pending() call at the end of __split_and_process_bio().
      
      To provide the clone bio when splitting, we use q->bio_split.  This
      was previously being freed by bio-based dm to avoid having excess
      rescuer threads.  As bio_split bio sets no longer create rescuer
      threads, there is little cost and much gain from restoring the
      q->bio_split bio set.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix comment above dm_accept_partial_bio · c06b3e58
      Authored by NeilBrown
      Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
      bios.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  24. 11 Nov, 2017 2 commits
    • dm: small cleanup in dm_get_md() · 49de5769
      Authored by Mike Snitzer
      Makes dm_get_md() and dm_get_from_kobject() have similar code.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix race between dm_get_from_kobject() and __dm_destroy() · b9a41d21
      Authored by Hou Tao
      The following BUG_ON was hit when testing repeated creation and removal of
      DM devices:
      
          kernel BUG at drivers/md/dm.c:2919!
          CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
          Call Trace:
           [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
           [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
           [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
           [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
           [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
           [<ffffffff81199118>] seq_read+0x16f/0x325
           [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
           [<ffffffff8117b625>] __vfs_read+0x26/0x9d
           [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
           [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
           [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
           [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
           [<ffffffff8117c686>] SyS_read+0x4b/0x76
           [<ffffffff817b606e>] system_call_fastpath+0x12/0x71
      
      The bug can be easily triggered if an extra delay (e.g. 10ms) is added
      between the test of DMF_FREEING & DMF_DELETING and dm_get() in
      dm_get_from_kobject().
      
      To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
      dm_get() are done in an atomic way, so _minor_lock is used.
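      
      That is, roughly (sketch following the description above):
      
          struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
          {
                  struct mapped_device *md =
                          container_of(kobj, struct mapped_device, kobj_holder.kobj);
      
                  spin_lock(&_minor_lock);
                  /* Flag test and reference grab are one atomic step now. */
                  if (test_bit(DMF_FREEING, &md->flags) || dm_deleting_md(md)) {
                          md = NULL;
                          goto out;
                  }
                  dm_get(md);
          out:
                  spin_unlock(&_minor_lock);
                  return md;
          }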
      
      The other callers of dm_get() have also been checked to be OK: some
      callers invoke dm_get() under _minor_lock, some callers invoke it under
      _hash_lock, and dm_start_request() invokes it after increasing
      md->open_count.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>