1. 16 April 2015 (9 commits)
    • dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq · 02233342
      Committed by Mike Snitzer
      dm_mq_queue_rq() is in atomic context so care must be taken to not
      sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
      and dm-mpath's call to blk_get_request().  In the future the bioset
      allocations will hopefully go away (by removing support for partial
      completions of bios in a cloned request).
      
      Also prepare for supporting DM blk-mq on top of old-style request_fn
      device(s) if a new dm-mod 'use_blk_mq' parameter is set.  The kthread
      will still be used to queue work if blk-mq is used on top of old-style
      request_fn device(s).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
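      A minimal sketch of the constraint described above (illustrative only, not
      the actual dm-mpath code; the helper name and the blk_get_request() error
      handling convention of that kernel era are assumptions):

        #include <linux/blkdev.h>

        /*
         * Called from blk-mq's .queue_rq() path, i.e. atomic context: any
         * allocation here must not sleep, hence GFP_ATOMIC rather than
         * GFP_KERNEL.
         */
        static struct request *clone_rq_atomic(struct request_queue *q, int rw)
        {
            struct request *clone = blk_get_request(q, rw, GFP_ATOMIC);

            if (IS_ERR(clone))
                return NULL;    /* no memory right now; caller requeues later */
            return clone;
        }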
    • dm: add full blk-mq support to request-based DM · bfebd1cd
      Committed by Mike Snitzer
      Commit e5863d9a ("dm: allocate requests in target when stacking on
      blk-mq devices") served as the first step toward fully utilizing blk-mq
      in request-based DM -- it enabled stacking an old-style (request_fn)
      request_queue on top of the underlying blk-mq device(s).  That first step
      didn't improve performance of DM multipath on top of fast blk-mq devices
      (e.g. NVMe) because the top-level old-style request_queue was severely
      limited by the queue_lock.
      
      The second step offered here enables stacking a blk-mq request_queue
      on top of the underlying blk-mq device(s).  This unlocks significant
      performance gains on fast blk-mq devices; Keith Busch tested on his NVMe
      testbed and offered this really positive news:
      
       "Just providing a performance update. All my fio tests are getting
        roughly equal performance whether accessed through the raw block
        device or the multipath device mapper (~470k IOPS). I could only push
        ~20% of the raw iops through dm before this conversion, so this latest
        tree is looking really solid from a performance standpoint."
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Keith Busch <keith.busch@intel.com>
    • dm: impose configurable deadline for dm_request_fn's merge heuristic · 0ce65797
      Committed by Mike Snitzer
      Otherwise, for sequential workloads, the dm_request_fn can allow
      excessive request merging at the expense of increased service time.
      
      Add a per-device sysfs attribute to allow the user to control how long a
      request, that is a reasonable merge candidate, can be queued on the
      request queue.  The resolution of this request dispatch deadline is in
      microseconds (ranging from 1 to 100000 usecs).  For example, to set a 20us deadline:
        echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline
      
      The dm_request_fn's merge heuristic and associated extra accounting is
      disabled by default (rq_based_seq_io_merge_deadline is 0).
      
      This sysfs attribute is not applicable to bio-based DM devices so it
      will only ever report 0 for them.
      
      By allowing a request to remain on the queue it will block other
      requests on the queue.  But introducing a short dequeue delay has proven
      very effective at enabling certain sequential IO workloads on really
      fast, yet IOPS constrained, devices to build up slightly larger IOs --
      yielding 90+% throughput improvements.  Having precise control over the
      time taken to wait for larger requests to build affords control beyond
      that of waiting for certain IO sizes to accumulate (which would require
      a deadline anyway).  This knob will only ever make sense with sequential
      IO workloads and the particular value used is storage configuration
      specific.
      
      Given the expected niche use-case where this knob is useful, it has
      been deemed acceptable to expose this relatively crude method for
      crafting optimal IO on specific storage -- especially given the solution
      is simple yet effective.  In the context of DM multipath, it is
      advisable to tune this sysfs attribute to a value that offers the best
      performance for the common case (e.g. if 4 paths are expected active,
      tune for that; if paths fail then performance may be slightly reduced).
      
      Alternatives were explored to have request-based DM autotune this value
      (e.g. if/when paths fail) but they were quickly deemed too fragile and
      complex to warrant further design and development time.  If this problem
      proves more common as faster storage emerges we'll have to look at
      elevating a generic solution into the block core.
      Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm sysfs: introduce ability to add writable attributes · b898320d
      Committed by Mike Snitzer
      Add DM_ATTR_RW() macro and establish .store method in dm_sysfs_ops.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
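      A rough sketch of the shape such a macro and ops hookup can take
      (illustrative only; struct dm_sysfs_attr, dm_attr_show and dm_attr_store
      are DM-internal names assumed here, not guaranteed to match the commit):

        #define DM_ATTR_RW(_name) \
        static struct dm_sysfs_attr dm_attr_##_name = \
            __ATTR(_name, S_IRUGO | S_IWUSR, \
                   dm_attr_##_name##_show, dm_attr_##_name##_store)

        static const struct sysfs_ops dm_sysfs_ops = {
            .show  = dm_attr_show,
            .store = dm_attr_store,   /* route sysfs writes to the attribute */
        };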
    • dm: don't start current request if it would've merged with the previous · de3ec86d
      Committed by Mike Snitzer
      Request-based DM's dm_request_fn() pulls requests off the queue so
      quickly that steps need to be taken to promote merging by avoiding
      request processing when it makes sense.
      
      If the current request would've merged with the previous request, let
      the current request stay on the queue longer.
      Suggested-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
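      A sketch of the merge-candidate test (the field names on mapped_device are
      illustrative assumptions, not the exact members added by this commit):

        /*
         * The previously dispatched request ended at md->last_rq_pos and went
         * in direction md->last_rq_rw; if the request now at the head of the
         * queue continues exactly from there, it is a merge candidate, so
         * leave it queued a little longer instead of starting it immediately.
         */
        static bool looks_like_merge_candidate(struct mapped_device *md,
                                               struct request *rq)
        {
            return blk_rq_pos(rq) == md->last_rq_pos &&
                   rq_data_dir(rq) == md->last_rq_rw;
        }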
    • dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms · d548b34b
      Committed by Mike Snitzer
      Commit 7eaceacc ("block: remove per-queue plugging") didn't justify
      DM's use of a 100ms delay; such an extended delay is a liability when
      there is reason to re-kick the queue.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: don't schedule delayed run of the queue if nothing to do · 9d1deb83
      Committed by Mike Snitzer
      In request-based DM's dm_request_fn(), if blk_peek_request() returns
      NULL, just return.  This avoids an unnecessary blk_delay_queue().
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
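      A simplified sketch of the dispatch-loop change (not the full
      dm_request_fn(); the surrounding dispatch/delay logic is omitted):

        static void example_request_fn(struct request_queue *q)
        {
            struct request *rq;

            while (1) {
                rq = blk_peek_request(q);
                if (!rq)
                    return;    /* nothing queued: skip blk_delay_queue() */

                /* ... decide whether to dispatch or delay the queue ... */
                blk_start_request(rq);
            }
        }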
    • dm: only run the queue on completion if congested or no requests pending · 9a0e609e
      Committed by Mike Snitzer
      On really fast storage it can be beneficial to delay running the
      request_queue to allow the elevator more opportunity to merge requests.
      
      Otherwise, it has been observed that requests are sent to
      q->request_fn much more quickly than is ideal on IOPS-bound backends.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove request-based logic from make_request_fn wrapper · ff36ab34
      Committed by Mike Snitzer
      The old dm_request() method used for q->make_request_fn had a branch for
      request-based DM support but it isn't needed given that
      dm_init_request_based_queue() sets it to the standard blk_queue_bio()
      anyway.
      
      Clean up dm_init_md_queue() to be DM device-type agnostic and have
      dm_setup_md_queue() properly finish queue setup based on DM device-type
      (bio-based vs request-based).
      
      A follow-up block patch can be made to remove the export for
      blk_queue_bio() now that DM no longer calls it directly.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 01 April 2015 (9 commits)
  3. 24 March 2015 (1 commit)
  4. 21 March 2015 (1 commit)
  5. 28 February 2015 (4 commits)
    • dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME · e5db2980
      Committed by Darrick J. Wong
      Since it's possible for the discard and write same queue limits to
      change while the upper level command is being sliced and diced, fix up
      both of them to (a) reject IO if the special command is unsupported at
      the start of the function and (b) read the limits once, letting the
      commands error out on their own if the limits happen to change.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
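      A sketch of the "read the limit once" pattern (illustrative only; the
      function name and the use of max_discard_sectors as the sampled limit are
      assumptions, not the actual dm-io code):

        static int split_special_io(struct request_queue *q, sector_t remaining)
        {
            unsigned int max = q->limits.max_discard_sectors;    /* sampled once */

            if (!max)
                return -EOPNOTSUPP;    /* unsupported: reject up front */

            while (remaining) {
                unsigned int step = min_t(sector_t, remaining, max);

                /* ... issue one chunk of 'step' sectors ... */
                remaining -= step;
            }
            return 0;
        }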
    • dm snapshot: suspend merging snapshot when doing exception handover · 09ee96b2
      Committed by Mikulas Patocka
      The "dm snapshot: suspend origin when doing exception handover" commit
      fixed an exception store handover bug associated with pending exceptions
      to the "snapshot-origin" target.
      
      However, a similar problem exists in snapshot merging.  When snapshot
      merging is in progress, we use the target "snapshot-merge" instead of
      "snapshot-origin".  Consequently, during exception store handover, we
      must find the snapshot-merge target and suspend its associated
      mapped_device.
      
      To avoid lockdep warnings, the target must be suspended and resumed
      without holding _origins_lock.
      
      Introduce a dm_hold() function that grabs a reference on a
      mapped_device, but unlike dm_get(), it doesn't crash if the device has
      the DMF_FREEING flag set; it returns an error in that case.
      
      In snapshot_resume() we grab the reference to the origin device using
      dm_hold() while holding _origins_lock (_origins_lock guarantees that the
      device won't disappear).  Then we release _origins_lock, suspend the
      device and grab _origins_lock again.
      
      NOTE to stable@ people:
      When backporting to kernels 3.18 and older, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
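      Roughly, dm_hold() behaves like the following sketch (simplified; the
      exact -EBUSY return value is an assumption):

        int dm_hold(struct mapped_device *md)
        {
            spin_lock(&_minor_lock);
            if (test_bit(DMF_FREEING, &md->flags)) {
                spin_unlock(&_minor_lock);
                return -EBUSY;    /* device is being freed: refuse, don't BUG */
            }
            dm_get(md);           /* reference taken while the lock is held */
            spin_unlock(&_minor_lock);
            return 0;
        }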
    • dm snapshot: suspend origin when doing exception handover · b735fede
      Committed by Mikulas Patocka
      In the function snapshot_resume we perform exception store handover.  If
      there is another active snapshot target, the exception store is moved
      from this target to the target that is being resumed.
      
      The problem is that if there is some pending exception, it will point to
      an incorrect exception store after that handover, causing a crash due to
      dm-snap-persistent.c:get_exception()'s BUG_ON.
      
      This bug can be triggered by repeatedly changing snapshot permissions
      with "lvchange -p r" and "lvchange -p rw" while there are writes on the
      associated origin device.
      
      To fix this bug, we must suspend the origin device when doing the
      exception store handover to make sure that there are no pending
      exceptions:
      - introduce _origin_hash that keeps track of dm_origin structures.
      - introduce functions __lookup_dm_origin, __insert_dm_origin and
        __remove_dm_origin that manipulate the origin hash.
      - modify snapshot_resume so that it calls dm_internal_suspend_fast() and
        dm_internal_resume_fast() on the origin device.
      
      NOTE to stable@ people:
      
      When backporting to kernels 3.12-3.18, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      
      When backporting to kernels older than 3.12, you need to pick functions
      dm_internal_suspend and dm_internal_resume from the commit
      fd2ed4d2.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm: hold suspend_lock while suspending device during device deletion · ab7c7bb6
      Committed by Mikulas Patocka
      __dm_destroy() must take the suspend_lock so that its presuspend and
      postsuspend calls do not race with an internal suspend.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  6. 27 February 2015 (1 commit)
    • dm thin: fix to consistently zero-fill reads to unprovisioned blocks · 5f027a3b
      Committed by Joe Thornber
      It was always intended that a read to an unprovisioned block will return
      zeroes regardless of whether the pool is in read-only or read-write
      mode.  thin_bio_map() was inconsistent in its handling of such reads
      when the pool is in read-only mode; it now properly zero-fills the bios
      it returns in response to unprovisioned block reads.
      
      Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA
      and just allow the IO to be deferred to the worker, which will result in
      pool->process_bio() handling the IO (which already properly zero-fills
      reads to unprovisioned blocks).
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  7. 25 February 2015 (3 commits)
  8. 18 February 2015 (3 commits)
    • dm snapshot: fix a possible invalid memory access on unload · 22aa66a3
      Committed by Mikulas Patocka
      When the snapshot target is unloaded, snapshot_dtr() waits until
      pending_exceptions_count drops to zero.  Then, it destroys the snapshot.
      Therefore, the function that decrements pending_exceptions_count
      should not touch the snapshot structure after the decrement.
      
      pending_complete() calls free_pending_exception(), which decrements
      pending_exceptions_count, and then it performs up_write(&s->lock) and
      calls retry_origin_bios(), which dereferences s->origin.  These two
      memory accesses to the fields of the snapshot may touch the dm_snapshot
      structure after it is freed.
      
      This patch moves the call to free_pending_exception() to the end of
      pending_complete(), so that the snapshot will not be destroyed while
      pending_complete() is in progress.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm: fix a race condition in dm_get_md · 2bec1f4a
      Committed by Mikulas Patocka
      The function dm_get_md finds a device mapper device with a given dev_t,
      increases the reference count and returns the pointer.
      
      dm_get_md calls dm_find_md; dm_find_md takes _minor_lock, finds the
      device, tests that the device doesn't have the DMF_DELETING or DMF_FREEING
      flag, drops _minor_lock and returns a pointer to the device. dm_get_md then
      calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag;
      otherwise it increments the reference count.
      
      There is a possible race condition: after dm_find_md exits and before
      dm_get is called, no locks are held, so the device may disappear or the
      DMF_FREEING flag may be set, which results in a BUG.
      
      To fix this bug, we need to call dm_get while we hold _minor_lock. This
      patch renames dm_find_md to dm_get_md and changes it so that it calls
      dm_get while holding the lock.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
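      A simplified sketch of the fixed lookup (details such as minor-number
      validation are omitted); the key point is that dm_get() is now called
      while _minor_lock is still held:

        struct mapped_device *dm_get_md(dev_t dev)
        {
            struct mapped_device *md;

            spin_lock(&_minor_lock);
            md = idr_find(&_minor_idr, MINOR(dev));
            if (md && (test_bit(DMF_FREEING, &md->flags) || dm_deleting_md(md)))
                md = NULL;       /* being torn down: do not touch it */
            else if (md)
                dm_get(md);      /* take the reference under the lock */
            spin_unlock(&_minor_lock);

            return md;
        }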
    • md/raid5: Fix livelock when array is both resyncing and degraded. · 26ac1073
      Committed by NeilBrown
      Commit a7854487 ("md: When RAID5 is dirty, force reconstruct-write
      instead of read-modify-write") causes an RCW cycle to be forced even
      when the array is degraded.  A degraded array cannot support RCW, as
      that requires reading all data blocks, and one may be missing.
      
      Forcing an RCW when it is not possible causes a live-lock and the code
      spins, repeatedly deciding to do something that cannot succeed.
      
      So change the condition to only force RCW on non-degraded arrays.
      Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
      Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: a7854487
      Cc: stable@vger.kernel.org (v3.7+)
  9. 17 February 2015 (7 commits)
    • dm crypt: sort writes · b3c5fd30
      Committed by Mikulas Patocka
      Write requests are sorted in a red-black tree structure and are
      submitted in the sorted order.
      
      In theory the sorting should be performed by the underlying disk
      scheduler, however, in practice the disk scheduler only accepts and
      sorts a finite number of requests.  To allow the sorting of all
      requests, dm-crypt needs to implement its own sorting.
      
      The overhead associated with rbtree-based sorting is considered
      negligible so it is not used conditionally.  Even on SSDs, sorting can be
      beneficial since in-order request dispatch promotes lower latency IO
      completion to the upper layers.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
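      A sketch of the rbtree insertion keyed by starting sector (struct and
      field names are illustrative, not the exact dm-crypt ones); the write
      thread later walks the tree in order and submits the bios:

        struct pending_write {
            struct rb_node node;
            sector_t       sector;   /* sort key */
            struct bio     *bio;
        };

        static void insert_sorted(struct rb_root *root, struct pending_write *w)
        {
            struct rb_node **p = &root->rb_node, *parent = NULL;

            while (*p) {
                parent = *p;
                if (w->sector < rb_entry(parent, struct pending_write, node)->sector)
                    p = &(*p)->rb_left;
                else
                    p = &(*p)->rb_right;
            }
            rb_link_node(&w->node, parent, p);
            rb_insert_color(&w->node, root);
        }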
    • dm crypt: add 'submit_from_crypt_cpus' option · 0f5d8e6e
      Committed by Mikulas Patocka
      Make it possible to disable offloading writes by setting the optional
      'submit_from_crypt_cpus' table argument.
      
      There are some situations where offloading write bios from the
      encryption threads to a single thread degrades performance
      significantly.
      
      The default is to offload write bios to the same thread because it
      benefits CFQ to have writes submitted using the same IO context.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: offload writes to thread · dc267621
      Committed by Mikulas Patocka
      Submitting write bios directly in the encryption thread caused serious
      performance degradation.  On a multiprocessor machine, encryption requests
      finish in a different order than they were submitted.  Consequently, write
      requests would be submitted in a different order, which could severely
      degrade performance.
      
      Move the submission of write requests to a separate thread so that the
      requests can be sorted before submitting.  But this commit improves
      dm-crypt performance even without having dm-crypt perform request
      sorting (in particular it enables IO schedulers like CFQ to sort more
      effectively).
      
      Note: it is required that a previous commit ("dm crypt: don't allocate
      pages for a partial request") be applied before applying this patch.
      Otherwise, this commit could introduce a crash.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: remove unused io_pool and _crypt_io_pool · 94f5e024
      Committed by Mikulas Patocka
      The previous commit ("dm crypt: don't allocate pages for a partial
      request") stopped using the io_pool slab mempool and backing
      _crypt_io_pool kmem cache.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: avoid deadlock in mempools · 7145c241
      Committed by Mikulas Patocka
      Fix a theoretical deadlock introduced in the previous commit ("dm crypt:
      don't allocate pages for a partial request").
      
      The function crypt_alloc_buffer may be called concurrently.  If we allocate
      from the mempool concurrently, there is a possibility of deadlock.  For
      example, if we have a mempool of 256 pages and two processes, each wanting
      256 pages, allocate from the mempool concurrently, they may deadlock in a
      situation where both processes have allocated 128 pages and the mempool
      is exhausted.
      
      To avoid such a scenario we allocate the pages under a mutex.  In order
      not to degrade performance with excessive locking, we try non-blocking
      allocations without a mutex first and, if that fails, we fall back to
      blocking allocations with a mutex.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
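      A sketch of the allocation strategy (pool and lock names are illustrative
      assumptions; the real code also frees partial progress and retries rather
      than simply blocking): try a non-blocking allocation first, and only take
      the mutex for the blocking fallback, so at most one caller can be waiting
      on the mempool at a time:

        static struct page *alloc_one_page(struct crypt_config *cc, bool *locked)
        {
            struct page *page = mempool_alloc(cc->page_pool, GFP_NOWAIT);

            if (page)
                return page;    /* fast path: no lock taken */

            if (!*locked) {
                mutex_lock(&cc->bio_alloc_lock);    /* serialize blocking allocs */
                *locked = true;                     /* caller unlocks when done */
            }
            return mempool_alloc(cc->page_pool, GFP_NOIO);    /* may sleep */
        }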
    • dm crypt: don't allocate pages for a partial request · cf2f1abf
      Committed by Mikulas Patocka
      Change crypt_alloc_buffer so that it only ever allocates pages for a
      full request.  This is a prerequisite for the commit "dm crypt: offload
      writes to thread".
      
      This change simplifies the dm-crypt code at the expense of reduced
      throughput in low memory conditions (where allocation for a partial
      request is most useful).
      
      Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed
      to fix a theoretical deadlock.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: use unbound workqueue for request processing · f3396c58
      Committed by Mikulas Patocka
      Use an unbound workqueue by default so that work is automatically balanced
      between available CPUs.  The original behavior of encrypting on the same
      CPU that the IO was submitted on can still be enabled by setting the
      optional 'same_cpu_crypt' table argument.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
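      A sketch of the workqueue setup (flag combination approximated from the
      description above, not copied from dm-crypt):

        struct workqueue_struct *alloc_crypt_wq(bool same_cpu_crypt)
        {
            unsigned int flags = WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM;

            if (!same_cpu_crypt)
                flags |= WQ_UNBOUND;    /* let the scheduler balance the work */

            return alloc_workqueue("kcryptd", flags,
                                   same_cpu_crypt ? 1 : num_online_cpus());
        }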
  10. 16 February 2015 (2 commits)