1. 30 Oct, 2015 1 commit
    • M
      dm: initialize non-blk-mq queue data before queue is used · ad5f498f
      Mikulas Patocka authored
      Commit bfebd1cd ("dm: add full blk-mq
      support to request-based DM") moves the initialization of the fields
      backing_dev_info.congested_fn, backing_dev_info.congested_data and
      queuedata from the function dm_init_md_queue (that is called when the
      device is created) to dm_init_old_md_queue (that is called after the
      device type is determined).
      
      There is no locking when accessing these variables, thus it is possible
      for other parts of the kernel to briefly see this data in a transient
      state (e.g. queue->backing_dev_info.congested_fn initialized and
      md->queue->backing_dev_info.congested_data uninitialized, resulting in
      passing an incorrect parameter to the function dm_any_congested).
      
      This queue data is left initialized for blk-mq devices even though they
      don't use it.
      
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # v4.1+
      ad5f498f
  2. 14 Aug, 2015 2 commits
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet authored
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8ae12666
    • K
      block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Kent Overstreet authored
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54efd50b
  3. 12 Aug, 2015 1 commit
  4. 04 Aug, 2015 1 commit
  5. 29 Jul, 2015 1 commit
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig authored
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
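The idea of carrying the error inside the bio, so that it survives queueing and reaches a chained parent, can be sketched in user space. This is only an illustration under stated assumptions: `struct mock_bio`, its `parent` and `completed` fields, and `mock_bio_endio()` are hypothetical stand-ins and not the kernel's definitions; only the `bi_error` and `bi_end_io` names come from the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct bio (not the kernel type). */
struct mock_bio {
	int bi_error;                         /* 0 on success, negative errno on failure */
	struct mock_bio *parent;              /* chained parent, or NULL (assumed field) */
	void (*bi_end_io)(struct mock_bio *); /* completion callback, no error argument */
	int completed;                        /* bookkeeping for this sketch only */
};

/*
 * Complete a bio: store the error in the bio itself, hand it to the
 * chained parent if the parent has no error yet, then call the owner.
 */
void mock_bio_endio(struct mock_bio *bio, int error)
{
	if (error)
		bio->bi_error = error;
	bio->completed = 1;
	if (bio->parent && bio->parent->bi_error == 0)
		bio->parent->bi_error = bio->bi_error;
	if (bio->bi_end_io)
		bio->bi_end_io(bio);
}
```

Completing a child with `-ENOSPC` leaves that value in both the child and its parent, so a later successful completion of the parent does not lose it; a single flag such as BIO_UPTODATE could only have signalled a generic failure.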
  6. 13 Jul, 2015 1 commit
    • M
      dm: fix use after free crash due to incorrect cleanup sequence · b06075a9
      Mikulas Patocka authored
      Linux 4.2-rc1 commit 0f20972f ("dm: factor out a common
      cleanup_mapped_device()") moved common cleanup code to a separate
      function.  Unfortunately, that commit incorrectly changed the order of
      cleanup, so that it destroys the mapped_device's srcu structure
      'io_barrier' before destroying its workqueue.
      
      The function that is executed on the workqueue (dm_wq_work) uses the srcu
      structure, so it may access the structure after it is freed.  That results in a
      crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 0f20972f ("dm: factor out a common cleanup_mapped_device()")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b06075a9
  7. 09 Jul, 2015 1 commit
  8. 26 Jun, 2015 2 commits
  9. 18 Jun, 2015 1 commit
  10. 02 Jun, 2015 1 commit
    • T
      writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Tejun Heo authored
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeroing of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4452226e
  11. 30 May, 2015 4 commits
    • M
      dm: factor out a common cleanup_mapped_device() · 0f20972f
      Mike Snitzer authored
      Introduce a single common method for cleaning up a DM device's
      mapped_device.  No functional change, just eliminates duplication of
      delicate mapped_device cleanup code.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      0f20972f
    • M
      dm: cleanup methods that requeue requests · 2d76fff1
      Mike Snitzer authored
      More often than not a request that is requeued _is_ mapped (meaning the
      clone request is allocated and clone->q is initialized).  Rename
      dm_requeue_unmapped_original_request() to avoid potential confusion due
      to function name containing "unmapped".
      
      Also, remove dm_requeue_unmapped_request() since callers can easily call
      the dm_requeue_original_request() directly.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      2d76fff1
    • M
      dm: do not allocate any mempools for blk-mq request-based DM · cbc4e3c1
      Mike Snitzer authored
      Do not allocate the io_pool mempool for blk-mq request-based DM
      (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
      
      Also refine __bind_mempools() to have more precise awareness of which
      mempools each type of DM device uses -- avoids mempool churn when
      reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      cbc4e3c1
    • J
      dm: fix casting bug in dm_merge_bvec() · 1c220c69
      Joe Thornber authored
      dm_merge_bvec() was originally added in f6fccb ("dm: introduce
      merge_bvec_fn").  In that commit a value in sectors is converted to
      bytes using << 9, and then assigned to an int.  This code made
      assumptions about the value of BIO_MAX_SECTORS.
      
      A later commit 148e51 ("dm: improve documentation and code clarity in
      dm_merge_bvec") was meant to have no functional change but it removed
      the use of BIO_MAX_SECTORS in favor of using queue_max_sectors().  At
      this point the cast from sector_t to int resulted in a zero value.  The
      fallout being dm_merge_bvec() would only allow a single page to be added
      to a bio.
      
      This interim fix is minimal for the benefit of stable@ because the more
      comprehensive cleanup of passing a sector_t to all DM targets' merge
      function will impact quite a few DM targets.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.19+
      1c220c69
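The truncation described in this commit can be reproduced outside the kernel. The following is a minimal sketch, assuming a 64-bit `sector_t` and an `int` of at least 32 bits; `bytes_truncated()` and `bytes_clamped()` are hypothetical helper names, and the clamp to `INT_MAX` is only one plausible stop-gap, since the commit message does not spell out the exact fix.

```c
#include <limits.h>
#include <stdint.h>

typedef uint64_t sector_t;   /* assumption: 64-bit sector counts */
#define SECTOR_SHIFT 9       /* 512-byte sectors: bytes = sectors << 9 */

/*
 * Buggy pattern: convert sectors to bytes, then keep only the low
 * 32 bits, which is what survives when the result is stored in an
 * int-sized variable.
 */
uint32_t bytes_truncated(sector_t sectors)
{
	return (uint32_t)(sectors << SECTOR_SHIFT);
}

/* Stop-gap pattern: do the arithmetic in 64 bits and clamp to INT_MAX. */
int bytes_clamped(sector_t sectors)
{
	uint64_t bytes = sectors << SECTOR_SHIFT;

	return bytes > (uint64_t)INT_MAX ? INT_MAX : (int)bytes;
}
```

With 2^23 sectors the byte count is exactly 2^32, so the truncated version yields 0 while the clamped version yields `INT_MAX`; small values such as 8 sectors give 4096 from both.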
  12. 29 May, 2015 1 commit
    • M
      dm: fix false warning in free_rq_clone() for unmapped requests · e5d8de32
      Mike Snitzer authored
      When stacking a request-based dm device on a non blk-mq device, if the
      device-mapper target cannot map the request (error target is used,
      multipath target with all paths down, etc), the WARN_ON_ONCE() in
      free_rq_clone() will trigger when it shouldn't.
      
      The warning was added by commit aa6df8dd ("dm: fix free_rq_clone() NULL
      pointer when requeueing unmapped request").  But free_rq_clone() with
      clone->q == NULL is valid usage for the case where
      dm_kill_unmapped_request() initiates request cleanup.
      
      Fix this false warning by just removing the WARN_ON -- it only generated
      false positives and was never useful in catching the intended case
      (completing clone request not being mapped e.g. clone->q being NULL).
      
      Fixes: aa6df8dd ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e5d8de32
  13. 28 May, 2015 1 commit
  14. 27 May, 2015 1 commit
    • J
      dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED · 3a140755
      Junichi Nomura authored
      When stacking request-based DM on blk_mq device, request cloning and
      remapping are done in a single call to target's clone_and_map_rq().
      The clone is allocated and valid only if clone_and_map_rq() returns
      DM_MAPIO_REMAPPED.
      
      The "IS_ERR(clone)" check in map_request() does not cover all the
      !DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
      are not ready or unavailable, clone_and_map_rq() may return
      DM_MAPIO_REQUEUE without ever having established an ERR_PTR).  Fix this
      by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
      map_request().
      
      Without this fix, DM core may call setup_clone() for a NULL clone
      and oops like this:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
         IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
         ...
         CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
         ...
         Call Trace:
          [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
          [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
          [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
          [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
          [<ffffffff81071fdd>] kthread+0xe6/0xee
          [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
          [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
          [<ffffffff814c2d98>] ret_from_fork+0x58/0x90
          [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
      
      Fixes: e5863d9a ("dm: allocate requests in target when stacking on blk-mq devices")
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.0+
      3a140755
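Why an IS_ERR() test alone misses the requeue case can be shown with a small user-space mock. This is a sketch under assumptions: the `DM_MAPIO_*` values and the `IS_ERR()` macro are local stand-ins assumed to mirror the kernel's, and `mock_clone_and_map_rq()`, `clone_usable_old()` and `clone_usable_fixed()` are hypothetical names, not functions from dm.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; values are assumed to mirror the kernel's definitions. */
#define DM_MAPIO_SUBMITTED 0
#define DM_MAPIO_REMAPPED  1
#define DM_MAPIO_REQUEUE   2
#define MAX_ERRNO 4095
#define IS_ERR(p) ((uintptr_t)(p) >= (uintptr_t)-MAX_ERRNO)

struct mock_request { int id; };

/*
 * Hypothetical target hook: when no path is ready it asks for a requeue
 * and never sets up a clone, so *clone stays NULL (not an ERR_PTR).
 */
int mock_clone_and_map_rq(int paths_ready, struct mock_request *storage,
			  struct mock_request **clone)
{
	*clone = NULL;
	if (!paths_ready)
		return DM_MAPIO_REQUEUE;
	*clone = storage;
	return DM_MAPIO_REMAPPED;
}

/* Old check: rejects only ERR_PTR values, so a NULL clone slips through. */
int clone_usable_old(int r, struct mock_request *clone)
{
	(void)r;
	return !IS_ERR(clone);
}

/* Fixed check: the clone is trusted only when the hook says REMAPPED. */
int clone_usable_fixed(int r, struct mock_request *clone)
{
	(void)clone;
	return r == DM_MAPIO_REMAPPED;
}
```

For a requeue return with a NULL clone, the old check reports the clone as usable and the fixed check does not, which matches the NULL dereference in the oops above.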
  15. 26 May, 2015 1 commit
  16. 22 May, 2015 1 commit
    • C
      block, dm: don't copy bios for request clones · 5f1b670d
      Christoph Hellwig authored
      Currently dm-multipath has to clone the bios for every request sent
      to the lower devices, which wastes cpu cycles and ties down memory.
      
      This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
      to not complete bios attached to a request, which we set on clone
      requests similar to bios in a flush sequence.  With this change I/O
      errors on a path failure only get propagated to dm-multipath, which
      can then either resubmit the I/O or complete the bios on the original
      request.
      
      I've done some basic testing of this on a Linux target with ALUA support,
      and it survives path failures during I/O nicely.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5f1b670d
  17. 30 Apr, 2015 2 commits
    • M
      dm: fix free_rq_clone() NULL pointer when requeueing unmapped request · aa6df8dd
      Mike Snitzer authored
      Commit 02233342 ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
      using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
      before testing clone->q->mq_ops.  It was an oversight to discontinue
      that check for 1 of the 2 use-cases for free_rq_clone():
      1) free_rq_clone() called when an unmapped original request is requeued
      2) free_rq_clone() called in the request-based IO completion path
      
      The clone->q check made sense for case #1 but not for #2.  However, we
      cannot just reinstate the check as it'd mask a serious bug in the IO
      completion case #2 -- no in-flight request should have an uninitialized
      request_queue (basic block layer refcounting _should_ ensure this).
      
      The NULL pointer seen for case #1 is detailed here:
      https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html
      
      Fix this free_rq_clone() NULL pointer by simply checking if the
      mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
      blk-mq) rather than checking clone->q->mq_ops.  This avoids the need to
      dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
      uninitialized clone request is being completed.
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      aa6df8dd
    • C
      dm: only initialize the request_queue once · 3e6180f0
      Christoph Hellwig authored
      Commit bfebd1cd ("dm: add full blk-mq support to request-based DM")
      didn't properly account for the need to short-circuit re-initializing
      DM's blk-mq request_queue if it was already initialized.
      
      Otherwise, reloading a blk-mq request-based DM table (either manually
      or via multipathd) resulted in errors, see:
       https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html
      
      Fix is to only initialize the request_queue on the initial table load
      (when the mapped_device type is assigned).
      
      This is better than having dm_init_request_based_blk_mq_queue() return
      early if the queue was already initialized because it elevates the
      constraint to a more meaningful location in DM core.  As such the
      pre-existing early return in dm_init_request_based_queue() can now be
      removed.
      
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3e6180f0
  18. 16 Apr, 2015 10 commits
    • S
      dm verity: add error handling modes for corrupted blocks · 65ff5b7d
      Sami Tolvanen authored
      Add device specific modes to dm-verity to specify how corrupted
      blocks should be handled.  The following modes are defined:
      
        - DM_VERITY_MODE_EIO is the default behavior, where reading a
          corrupted block results in -EIO.
      
        - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
          not block the read.
      
        - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
          block is discovered.
      
      In addition, each mode sends a uevent to notify userspace of
      corruption and to allow further recovery actions.
      
      The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
      and other modes can be enabled with an additional parameter to
      the verity table.
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      65ff5b7d
    • M
      dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr · 17e149b8
      Mike Snitzer authored
      Request-based DM's blk-mq support defaults to off; but a user can easily
      change the default using the dm_mod.use_blk_mq module/boot option.
      
      Also, you can check what mode a given request-based DM device is using
      with: cat /sys/block/dm-X/dm/use_blk_mq
      
      This change enabled further cleanup and reduced work (e.g. the
      md->io_pool and md->rq_pool aren't created if using blk-mq).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      17e149b8
    • M
      dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq · 02233342
      Mike Snitzer authored
      dm_mq_queue_rq() is in atomic context so care must be taken to not
      sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
      and dm-mpath's call to blk_get_request().  In the future the bioset
      allocations will hopefully go away (by removing support for partial
      completions of bios in a cloned request).
      
      Also prepare for supporting DM blk-mq ontop of old-style request_fn
      device(s) if a new dm-mod 'use_blk_mq' parameter is set.  The kthread
      will still be used to queue work if blk-mq is used ontop of old-style
      request_fn device(s).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      02233342
    • M
      dm: add full blk-mq support to request-based DM · bfebd1cd
      Mike Snitzer authored
      Commit e5863d9a ("dm: allocate requests in target when stacking on
      blk-mq devices") served as the first step toward fully utilizing blk-mq
      in request-based DM -- it enabled stacking an old-style (request_fn)
      request_queue ontop of the underlying blk-mq device(s).  That first step
      didn't improve performance of DM multipath ontop of fast blk-mq devices
      (e.g. NVMe) because the top-level old-style request_queue was severely
      limited by the queue_lock.
      
      The second step offered here enables stacking a blk-mq request_queue
      ontop of the underlying blk-mq device(s).  This unlocks significant
      performance gains on fast blk-mq devices.  Keith Busch tested on his NVMe
      testbed and offered this really positive news:
      
       "Just providing a performance update. All my fio tests are getting
        roughly equal performance whether accessed through the raw block
        device or the multipath device mapper (~470k IOPS). I could only push
        ~20% of the raw iops through dm before this conversion, so this latest
        tree is looking really solid from a performance standpoint."
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Keith Busch <keith.busch@intel.com>
      bfebd1cd
    • M
      dm: impose configurable deadline for dm_request_fn's merge heuristic · 0ce65797
      Mike Snitzer authored
      Otherwise, for sequential workloads, the dm_request_fn can allow
      excessive request merging at the expense of increased service time.
      
      Add a per-device sysfs attribute to allow the user to control how long a
      request, that is a reasonable merge candidate, can be queued on the
      request queue.  The resolution of this request dispatch deadline is in
      microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline:
        echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline
      
      The dm_request_fn's merge heuristic and associated extra accounting is
      disabled by default (rq_based_seq_io_merge_deadline is 0).
      
      This sysfs attribute is not applicable to bio-based DM devices so it
      will only ever report 0 for them.
      
      By allowing a request to remain on the queue it will block other
      requests on the queue.  But introducing a short dequeue delay has proven
      very effective at enabling certain sequential IO workloads on really
      fast, yet IOPS constrained, devices to build up slightly larger IOs --
      yielding 90+% throughput improvements.  Having precise control over the
      time taken to wait for larger requests to build affords control beyond
      that of waiting for certain IO sizes to accumulate (which would require
      a deadline anyway).  This knob will only ever make sense with sequential
      IO workloads and the particular value used is storage configuration
      specific.
      
      Given the expected niche use-case for when this knob is useful it has
      been deemed acceptable to expose this relatively crude method for
      crafting optimal IO on specific storage -- especially given the solution
      is simple yet effective.  In the context of DM multipath, it is
      advisable to tune this sysfs attribute to a value that offers the best
      performance for the common case (e.g. if 4 paths are expected active,
      tune for that; if paths fail then performance may be slightly reduced).
      
      Alternatives were explored to have request-based DM autotune this value
      (e.g. if/when paths fail) but they were quickly deemed too fragile and
      complex to warrant further design and development time.  If this problem
      proves more common as faster storage emerges we'll have to look at
      elevating a generic solution into the block core.
      Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      0ce65797
    • M
      dm: don't start current request if it would've merged with the previous · de3ec86d
      Mike Snitzer authored
      Request-based DM's dm_request_fn() is so fast to pull requests off the
      queue that steps need to be taken to promote merging by avoiding request
      processing if it makes sense.
      
      If the current request would've merged with previous request let the
      current request stay on the queue longer.
      Suggested-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      de3ec86d
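The "would have merged with the previous request" test can be approximated as follows. This is a rough user-space sketch, not the code in dm_request_fn(): the struct layouts, `should_delay()` and `note_dispatch()` are hypothetical, and the extra condition that some I/O must already be in flight is an assumption added so that an idle queue never stalls.

```c
#include <stdint.h>

typedef uint64_t sector_t;   /* assumption: 64-bit sector numbers */

/* Hypothetical request summary: start sector, length, direction. */
struct mock_rq {
	sector_t pos;
	unsigned int sectors;
	int is_write;
};

/* Hypothetical per-device state remembered from the last dispatch. */
struct mock_queue_state {
	sector_t last_rq_pos;   /* sector just past the previous request */
	int last_rq_rw;         /* direction of the previous request */
	int in_flight;          /* requests dispatched (never decremented here) */
};

/*
 * Return 1 when rq starts exactly where the previous request ended, goes
 * in the same direction, and other I/O is in flight: a likely merge
 * candidate that may be worth leaving on the queue a little longer.
 */
int should_delay(const struct mock_queue_state *st, const struct mock_rq *rq)
{
	return st->in_flight > 0 &&
	       rq->pos == st->last_rq_pos &&
	       rq->is_write == st->last_rq_rw;
}

/* Record where a dispatched request ends and which way it went. */
void note_dispatch(struct mock_queue_state *st, const struct mock_rq *rq)
{
	st->last_rq_pos = rq->pos + rq->sectors;
	st->last_rq_rw = rq->is_write;
	st->in_flight++;
}
```

After dispatching a write covering sectors 0 to 7, a write starting at sector 8 is flagged for delay, while a write at sector 100 or a read at sector 8 is not.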
    • M
      dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms · d548b34b
      Mike Snitzer authored
      Commit 7eaceacc ("block: remove per-queue plugging") didn't justify
      DM's use of a 100ms delay; such an extended delay is a liability when
      there is reason to re-kick the queue.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d548b34b
    • M
      dm: don't schedule delayed run of the queue if nothing to do · 9d1deb83
      Mike Snitzer authored
      In request-based DM's dm_request_fn(), if blk_peek_request() returns
      NULL just return.  Avoids unnecessary blk_delay_queue().
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      9d1deb83
    • M
      dm: only run the queue on completion if congested or no requests pending · 9a0e609e
      Mike Snitzer authored
      On really fast storage it can be beneficial to delay running the
      request_queue to allow the elevator more opportunity to merge requests.
      
      Otherwise, it has been observed that requests are being sent to
      q->request_fn much quicker than is ideal on IOPS-bound backends.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      9a0e609e
    • M
      dm: remove request-based logic from make_request_fn wrapper · ff36ab34
      Mike Snitzer authored
      The old dm_request() method used for q->make_request_fn had a branch for
      request-based DM support but it isn't needed given that
      dm_init_request_based_queue() sets it to the standard blk_queue_bio()
      anyway.
      
      Cleanup dm_init_md_queue() to be DM device-type agnostic and have
      dm_setup_md_queue() properly finish queue setup based on DM device-type
      (bio-based vs request-based).
      
      A followup block patch can be made to remove the export for
      blk_queue_bio() now that DM no longer calls it directly.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ff36ab34
  19. 01 Apr, 2015 3 commits
  20. 24 Mar, 2015 1 commit
  21. 28 Feb, 2015 3 commits
    • M
      dm snapshot: suspend merging snapshot when doing exception handover · 09ee96b2
      Mikulas Patocka authored
      The "dm snapshot: suspend origin when doing exception handover" commit
      fixed an exception store handover bug associated with pending exceptions
      to the "snapshot-origin" target.
      
      However, a similar problem exists in snapshot merging.  When snapshot
      merging is in progress, we use the target "snapshot-merge" instead of
      "snapshot-origin".  Consequently, during exception store handover, we
      must find the snapshot-merge target and suspend its associated
      mapped_device.
      
      To avoid lockdep warnings, the target must be suspended and resumed
      without holding _origins_lock.
      
      Introduce a dm_hold() function that grabs a reference on a
      mapped_device, but unlike dm_get(), it doesn't crash if the device has
      the DMF_FREEING flag set; it returns an error in this case.
      
      In snapshot_resume() we grab the reference to the origin device using
      dm_hold() while holding _origins_lock (_origins_lock guarantees that the
      device won't disappear).  Then we release _origins_lock, suspend the
      device and grab _origins_lock again.
      
      NOTE to stable@ people:
      When backporting to kernels 3.18 and older, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      09ee96b2
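The difference between dm_get() and the new dm_hold() can be sketched with a mock reference counter. This is a single-threaded illustration under assumptions: `struct mock_md`, `mock_get()` and `mock_hold()` are hypothetical, the `-EBUSY` return value is a guess (the commit only says "an error"), and real code would need a lock so that the flag test and the increment happen atomically.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the parts of mapped_device used here. */
struct mock_md {
	int holders;   /* reference count */
	int freeing;   /* stands in for the DMF_FREEING flag */
};

/* dm_get()-like: the caller must know the device is not being freed. */
void mock_get(struct mock_md *md)
{
	assert(!md->freeing);   /* mirrors "crashes if DMF_FREEING is set" */
	md->holders++;
}

/* dm_hold()-like: refuse politely instead of crashing. */
int mock_hold(struct mock_md *md)
{
	if (md->freeing)
		return -EBUSY;  /* assumed error code */
	md->holders++;
	return 0;
}
```

A caller that cannot rule out a concurrent teardown checks the return value of the hold-style call and backs off on error, rather than relying on the device still being alive.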
    • M
      dm snapshot: suspend origin when doing exception handover · b735fede
      Mikulas Patocka authored
      In the function snapshot_resume we perform exception store handover.  If
      there is another active snapshot target, the exception store is moved
      from this target to the target that is being resumed.
      
      The problem is that if there is some pending exception, it will point to
      an incorrect exception store after that handover, causing a crash due to
      dm-snap-persistent.c:get_exception()'s BUG_ON.
      
      This bug can be triggered by repeatedly changing snapshot permissions
      with "lvchange -p r" and "lvchange -p rw" while there are writes on the
      associated origin device.
      
      To fix this bug, we must suspend the origin device when doing the
      exception store handover to make sure that there are no pending
      exceptions:
      - introduce _origin_hash that keeps track of dm_origin structures.
      - introduce functions __lookup_dm_origin, __insert_dm_origin and
        __remove_dm_origin that manipulate the origin hash.
      - modify snapshot_resume so that it calls dm_internal_suspend_fast() and
        dm_internal_resume_fast() on the origin device.
      
      NOTE to stable@ people:
      
      When backporting to kernels 3.12-3.18, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      
      When backporting to kernels older than 3.12, you need to pick functions
      dm_internal_suspend and dm_internal_resume from the commit
      fd2ed4d2.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      b735fede
    • M
      dm: hold suspend_lock while suspending device during device deletion · ab7c7bb6
      Mikulas Patocka authored
      __dm_destroy() must take the suspend_lock so that its presuspend and
      postsuspend calls do not race with an internal suspend.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      ab7c7bb6