1. 23 February 2016 (17 commits)
    • dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directly · 78ce23b5
      Committed by Mike Snitzer
      There isn't any need to support both old .request_fn and blk-mq paths
      in the blk-mq specific portion of __multipath_map().  Call
      blk_mq_alloc_request() directly rather than use blk_get_request().
      
      Similarly, call blk_mq_free_request(), rather than blk_put_request(), in
      multipath_release_clone().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
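      A hedged sketch of the pattern this commit describes (illustrative
      fragment, not the verbatim kernel diff; assumes the 4.5-era
      blk_mq_alloc_request(q, rw, flags) signature):

        #include <linux/blk-mq.h>

        /* allocate the clone straight from the blk-mq path ... */
        clone = blk_mq_alloc_request(bdev_get_queue(bdev), rq_data_dir(rq),
                                     BLK_MQ_REQ_NOWAIT);
        if (IS_ERR(clone))
                return DM_MAPIO_REQUEUE;  /* requeue on allocation failure */

        /* ... and release it with the matching blk-mq call */
        blk_mq_free_request(clone);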
    • dm mpath: cleanup 'struct dm_mpath_io' management code · 2eff1924
      Committed by Mike Snitzer
      Refactor and rename existing interfaces to be more specific and
      self-documenting.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io' · 8637a6bf
      Committed by Mike Snitzer
      Allow the multipath target to avoid making small allocations for each
      'struct dm_mpath_io' that is needed for each request.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
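      For context, a hedged sketch of the blk-mq pdu mechanism this relies
      on: .cmd_size reserves extra bytes behind every pre-allocated request,
      and blk_mq_rq_to_pdu() returns them (the struct name here is
      hypothetical):

        #include <linux/blk-mq.h>

        struct my_per_rq_data { int state; };   /* hypothetical pdu layout */

        /* at tag-set setup: reserve pdu space behind each request */
        set->cmd_size = sizeof(struct my_per_rq_data);

        /* in .queue_rq: no per-request allocation needed */
        struct my_per_rq_data *pdu = blk_mq_rq_to_pdu(rq);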
    • dm: allow immutable request-based targets to use blk-mq pdu · 591ddcfc
      Committed by Mike Snitzer
      This will allow DM multipath to use a portion of the blk-mq pdu space
      for target data (e.g. struct dm_mpath_io).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: rename target's per_bio_data_size to per_io_data_size · 30187e1d
      Committed by Mike Snitzer
      Request-based DM will also make use of per_bio_data_size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: distinguish old .request_fn (dm-old) vs dm-mq request-based DM · eca7ee6d
      Committed by Mike Snitzer
      Rename various methods to have either a "dm_old" or "dm_mq" prefix.
      Improve code comments to assist with understanding the duality of code
      that handles both "dm_old" and "dm_mq" cases.
      
      It is now much easier to quickly look at the code and _know_ that a
      given method is either 1) "dm_old" only, 2) "dm_mq" only, or 3) common
      to both.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove support for stacking dm-mq on .request_fn device(s) · c5248f79
      Committed by Mike Snitzer
      Remove all the fiddly code that propped up blk-mq request-queue
      support on top of .request_fn devices.

      Testing has proven this niche request-based dm-mq mode to be buggy
      when exercising fault tolerance with DM multipath, so there is no
      point trying to preserve it.

      This should improve the efficiency of the pure dm-mq code and make
      code maintenance less delicate.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix a couple locking issues with use of block interfaces · 818c5f3b
      Committed by Mike Snitzer
      old_stop_queue() was checking blk_queue_stopped() without holding the
      q->queue_lock.
      
      dm_requeue_original_request() needed to check blk_queue_stopped(), with
      q->queue_lock held, before calling blk_mq_kick_requeue_list().  And a
      side-effect of that change is start_queue() must also call
      blk_mq_kick_requeue_list().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
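      A sketch of the locking rule the fix enforces (illustrative fragment;
      in this era q->queue_lock is a pointer):

        unsigned long flags;
        bool stopped;

        spin_lock_irqsave(q->queue_lock, flags);
        stopped = blk_queue_stopped(q);
        spin_unlock_irqrestore(q->queue_lock, flags);

        if (!stopped)
                blk_mq_kick_requeue_list(q);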
    • dm: allocate blk_mq_tag_set rather than embed in mapped_device · 1c357a1e
      Committed by Mike Snitzer
      The blk_mq_tag_set is only needed for dm-mq support.  There is no
      point wasting space in 'struct mapped_device' for non-dm-mq devices.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
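      A minimal sketch of the allocate-instead-of-embed pattern (error
      handling abbreviated; the kzalloc return check is what Dan Carpenter's
      follow-up added):

        md->tag_set = kzalloc(sizeof(*md->tag_set), GFP_KERNEL);
        if (!md->tag_set)
                return -ENOMEM;

        /* fill in nr_hw_queues, queue_depth, cmd_size, ops, ...; then: */
        err = blk_mq_alloc_tag_set(md->tag_set);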
    • dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params · faad87df
      Committed by Mike Snitzer
      Allow user to change these values via module params or sysfs.
      
      'dm_mq_nr_hw_queues' defaults to 1 (max 32).
      
      'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too
      small under moderate sized workloads -- the dm-multipath device would
      continuously block waiting for tags (requests) to become available).
      The maximum is BLK_MQ_MAX_DEPTH (currently 10240).
      
      Keep in mind the total number of pre-allocated requests per
      request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth'
      (currently 2048).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
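      A sketch of how such knobs are typically wired up (parameter names
      from the commit; the exact types and permission bits are assumptions):

        static unsigned dm_mq_nr_hw_queues = 1;
        module_param(dm_mq_nr_hw_queues, uint, S_IRUGO | S_IWUSR);
        MODULE_PARM_DESC(dm_mq_nr_hw_queues,
                         "Number of hardware queues for dm-mq devices");

        static unsigned dm_mq_queue_depth = 2048;
        module_param(dm_mq_queue_depth, uint, S_IRUGO | S_IWUSR);
        MODULE_PARM_DESC(dm_mq_queue_depth, "Queue depth for dm-mq devices");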
    • dm: optimize dm_request_fn() · c91852ff
      Committed by Mike Snitzer
      DM multipath is the only request-based DM target -- which only supports
      tables with a single target that is immutable.  Leverage this fact in
      dm_request_fn().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: optimize dm_mq_queue_rq() · 16f12266
      Committed by Mike Snitzer
      DM multipath is the only dm-mq target.  But that aside, request-based DM
      only supports tables with a single target that is immutable.  Leverage
      this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
      the mapped_device when the table was made active.  This saves the need
      to even take the read-side of the SRCU via dm_{get,put}_live_table.
      
      If the active DM table does not have an immutable target (e.g. the
      "error" target was swapped in), then fall back to the slow path, where
      the target is looked up from the live table.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
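      A sketch of the fast/slow path split described above (locals
      simplified; approximates the 4.5-era dm_mq_queue_rq()):

        struct dm_target *ti = md->immutable_target;

        if (unlikely(!ti)) {
                /* slow path: e.g. the "error" target was swapped in */
                int srcu_idx;
                struct dm_table *map = dm_get_live_table(md, &srcu_idx);

                ti = dm_table_find_target(map, 0);
                dm_put_live_table(md, srcu_idx);
        }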
    • dm: set DM_TARGET_WILDCARD feature on "error" target · f083b09b
      Committed by Mike Snitzer
      The DM_TARGET_WILDCARD feature indicates that the "error" target may
      replace any target; even immutable targets.  This feature will be useful
      to preserve the ability to replace the "multipath" target even once it
      is formally converted over to having the DM_TARGET_IMMUTABLE feature.
      
      Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that
      .map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined
      in the target_type.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
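      A sketch of what the flag implies for the "error" target_type (hook
      names follow the commit message; other fields omitted):

        static struct target_type error_target = {
                .name             = "error",
                .features         = DM_TARGET_WILDCARD,
                .map              = io_err_map,
                .map_rq           = io_err_map_rq,
                .clone_and_map_rq = io_err_clone_and_map_rq,
                .release_clone_rq = io_err_release_clone_rq,
        };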
    • dm: cleanup dm_any_congested() · e522c039
      Committed by Mike Snitzer
      The request-based DM support for checking queue congestion doesn't
      require access to the live DM table.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove unused dm_get_rq_mapinfo() · ae6ad75e
      Committed by Mike Snitzer
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix excessive dm-mq context switching · 6acfe68b
      Committed by Mike Snitzer
      Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
      than if an underlying null_blk device were used directly.  One of the
      reasons for this drop in performance is that blk_insert_clone_request()
      was calling blk_mq_insert_request() with @async=true.  This forced the
      use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
      which ushered in ping-ponging between process context (fio in this case)
      and kblockd's kworker to submit the cloned request.  The ftrace
      function_graph tracer showed:
      
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
      
      Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
      _not_ use kblockd to submit the cloned requests isn't enough to
      eliminate the observed context switches.
      
      In addition to this dm-mq specific blk-core fix, there are 2 DM core
      fixes to dm-mq that (when paired with the blk-core fix) completely
      eliminate the observed context switching:
      
      1)  don't call blk_mq_run_hw_queues() in blk-mq request completion

          Motivated by the desire to reduce dm-mq overhead: punting to
          kblockd just increases context switches.

          In my testing against a really fast null_blk device there was no
          benefit to running blk_mq_run_hw_queues() on completion (and no
          other blk-mq driver does this).  So hopefully this change doesn't
          induce the need for yet another revert like commit 621739b0!
      
      2)  use blk_mq_complete_request() in dm_complete_request()
      
          blk_complete_request() doesn't offer the traditional q->mq_ops vs
          .request_fn branching pattern that other historic block interfaces
          do (e.g. blk_get_request).  Using blk_mq_complete_request() for
          blk-mq requests is important for performance.  It should be noted
          that, like blk_complete_request(), blk_mq_complete_request() doesn't
          natively handle partial completions -- but the request-based
          DM-multipath target does provide the required partial completion
          support by dm.c:end_clone_bio() triggering requeueing of the request
          via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
      
      dm-mq fix #2 is _much_ more important than #1 for eliminating the
      context switches.
      Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
      After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472
      
      With these changes multithreaded async read IOPs improved from ~950K
      to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
      IOPs of the underlying null_blk device for the same workload is ~1950K.
      
      Fixes: 7fb4898e ("block: add blk-mq support to blk_insert_cloned_request()")
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Cc: stable@vger.kernel.org # 4.1+
      Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
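      A sketch of dm-mq fix #2, the q->mq_ops vs .request_fn completion
      branch (simplified; tio_from_request() approximates dm.c's helper):

        static void dm_complete_request(struct request *rq, int error)
        {
                struct dm_rq_target_io *tio = tio_from_request(rq);

                tio->error = error;
                if (rq->q->mq_ops)
                        blk_mq_complete_request(rq, error);
                else
                        blk_complete_request(rq);
        }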
  2. 22 February 2016 (3 commits)
    • dm: fix sparse "unexpected unlock" warnings in ioctl code · 956a4025
      Committed by Mike Snitzer
      Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it
      do the dm_{get,put}_live_table() rather than split those operations.
      
      The dm_grab_bdev_for_ioctl() callers only care about the block_device
      associated with a singleton DM device, so there isn't any need to
      retain a reference to the live DM table.  It is sufficient to (see the
      sketch after this entry):
      1) dm_get_live_table()
      2) bdgrab() the bdev associated with the singleton table's target
      3) dm_put_live_table()
      4) bdput() the bdev
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
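      A sketch of that grab/put sequence (simplified; the helper that
      extracts the singleton target's bdev is hypothetical):

        int srcu_idx;
        struct block_device *bdev;
        struct dm_table *map = dm_get_live_table(md, &srcu_idx);

        bdev = get_singleton_target_bdev(map);   /* hypothetical helper */
        bdgrab(bdev);                            /* pin the bdev ... */
        dm_put_live_table(md, srcu_idx);         /* ... drop the table ref */

        /* ... issue the ioctl against bdev, then: */
        bdput(bdev);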
    • dm: do not return target from dm_get_live_table_for_ioctl() · 66482026
      Committed by Mike Snitzer
      None of the callers actually used the returned target.
      Also, just reuse the bdev pointer passed to dm_blk_ioctl().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths · 4328daa2
      Committed by Mike Snitzer
      Using request-based DM mpath configured with the following stacking
      (.request_fn DM mpath on top of scsi-mq paths):
      
      echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
      echo N > /sys/module/dm_mod/parameters/use_blk_mq
      
      'struct dm_rq_target_io' would leak if a request is requeued before a
      blk-mq clone is allocated (or fails to allocate).  free_rq_tio()
      wasn't being called.
      
      kmemleak reported:
      
      unreferenced object 0xffff8800b90b98c0 (size 112):
        comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
        hex dump (first 32 bytes):
          00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff  ..\,....@.......
          e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00  ...4............
        backtrace:
          [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0
          [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20
          [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170
          [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod]
          [<ffffffff812fbd78>] blk_peek_request+0x168/0x290
          [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod]
          [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40
          [<ffffffff812f9585>] blk_delay_work+0x25/0x40
          [<ffffffff81096fff>] process_one_work+0x14f/0x3d0
          [<ffffffff81097715>] worker_thread+0x125/0x4b0
          [<ffffffff8109ce88>] kthread+0xd8/0xf0
          [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      crash> struct -o dm_rq_target_io
      struct dm_rq_target_io {
          ...
      }
      SIZE: 112
      
      Fixes: e5863d9a ("dm: allocate requests in target when stacking on blk-mq devices")
      Cc: stable@vger.kernel.org # 4.0+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 25 January 2016 (2 commits)
  4. 21 January 2016 (1 commit)
  5. 14 January 2016 (4 commits)
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Committed by Dan Williams
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks, or otherwise
      onlining spares, to an array if the device has an incompatible
      integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 ("md: suspend i/o during runtime
      blk_integrity_unregister"), introduced an immediate hang regression.

      This policy of disallowing changes to the integrity profile once one
      has been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
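      A hedged sketch of the compatibility gate (the real check lives in
      md_integrity_add_rdev(); blk_integrity_compare() returns 0 when the
      two disks' profiles match):

        char b[BDEVNAME_SIZE];

        if (blk_get_integrity(mddev->gendisk) &&
            blk_integrity_compare(mddev->gendisk, rdev->bdev->bd_disk)) {
                pr_err("%s: incompatible integrity profile for %s\n",
                       mdname(mddev), bdevname(rdev->bdev, b));
                return -ENXIO;
        }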
    • raid5-cache: handle journal hotadd in quiesce · 16a43f6a
      Committed by Shaohua Li
      Handle journal hotadd in quiesce to avoid creating duplicated threads.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: add journal with array suspended · 87d4d916
      Committed by Shaohua Li
      Hot-adding a journal disk in recovery-thread context brings a lot of
      trouble, as IO could be running.  Unlike spare-disk hot-add, adding a
      journal disk with the array suspended makes more sense and the
      implementation is much easier.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: set MD_HAS_JOURNAL in correct places · a62ab49e
      Committed by Shaohua Li
      Set MD_HAS_JOURNAL when an array is loaded or the journal is
      initialized.  This avoids the flag being set too early during journal
      disk hot-add.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  6. 10 January 2016 (2 commits)
  7. 09 January 2016 (1 commit)
    • dm snapshot: fix hung bios when copy error occurs · 385277bf
      Committed by Mikulas Patocka
      When there is an error copying a chunk, dm-snapshot can incorrectly
      hold the associated bios indefinitely, resulting in hung IO.
      
      The function copy_callback sets pe->error if there was error copying the
      chunk, and then calls complete_exception.  complete_exception calls
      pending_complete on error, otherwise it calls commit_exception with
      commit_callback (and commit_callback calls complete_exception).
      
      The persistent exception store (dm-snap-persistent.c) assumes that calls
      to prepare_exception and commit_exception are paired.
      persistent_prepare_exception increases ps->pending_count and
      persistent_commit_exception decreases it.
      
      If there is a copy error, persistent_prepare_exception is called but
      persistent_commit_exception is not.  This results in the variable
      ps->pending_count never returning to zero and that causes some pending
      exceptions (and their associated bios) to be held forever.
      
      Fix this by unconditionally calling commit_exception regardless of
      whether the copy was successful.  A new "valid" parameter is added to
      commit_exception -- when the copy fails this parameter is set to zero so
      that the chunk that failed to copy (and all following chunks) is not
      recorded in the snapshot store.  Also, remove commit_callback now that
      it is merely a wrapper around pending_complete.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  8. 07 January 2016 (3 commits)
  9. 06 January 2016 (7 commits)
    • raid5: allow r5l_io_unit allocations to fail · 5036c390
      Committed by Christoph Hellwig
      And propagate the error up the stack so we can add the stripe
      to no_stripes_list and retry our log operation later.  This avoids
      blocking raid5d due to reclaim, and it allows us to get rid of the
      deadlock-prone GFP_NOFAIL allocation.
      
      shli: add missing mempool_destroy()
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
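      A sketch of the pattern: a non-waiting mempool allocation whose
      failure is propagated instead of relying on GFP_NOFAIL (the pool's
      field name is an assumption):

        struct r5l_io_unit *io;

        io = mempool_alloc(log->io_pool, GFP_ATOMIC);
        if (!io)
                return -ENOMEM; /* caller parks the stripe on
                                   no_stripes_list and retries later */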
    • raid5-cache: use a mempool for the metadata block · e8deb638
      Committed by Christoph Hellwig
      We only have a limited number in flight, so use a page-based mempool.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
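      A sketch using the page-pool flavor of the mempool API (the pool size
      is an assumption):

        struct page *page;

        log->meta_pool = mempool_create_page_pool(4, 0); /* 4 order-0 pages */
        if (!log->meta_pool)
                return -ENOMEM;

        page = mempool_alloc(log->meta_pool, GFP_NOIO);
        /* fill and submit the metadata block; on completion: */
        mempool_free(page, log->meta_pool);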
    • raid5-cache: use a bio_set · c38d29b3
      Committed by Christoph Hellwig
      This allows us to make guaranteed forward progress.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
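      A sketch of the bio_set usage (4.4-era bioset_create() signature; the
      pool size is an assumption):

        struct bio *bio;

        log->bs = bioset_create(16, 0);
        if (!log->bs)
                return -ENOMEM;

        /* allocations from a private bio_set can always make progress */
        bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, log->bs);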
    • raid5-cache: add journal hot add/remove support · f6b6ec5c
      Committed by Shaohua Li
      Add support for journal disk hot add/remove.  The md part is mostly
      trivial checks; the raid5 part is a little tricky.  For hot-remove, we
      can't wait for pending writes, as this is called from raid5d and the
      wait would cause a deadlock, so we simply fail the hot-remove.  A
      hot-remove retry can eventually succeed, since once the journal disk
      is faulty all pending writes will fail and finish.  For hot-add, since
      an array that supports a journal but lacks a journal disk will be
      marked read-only, we are safe to hot-add the journal without stopping
      IO (any IO should be read IO, while the journal only handles write
      IO).
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • drivers: md: use ktime_get_real_seconds() · 9ebc6ef1
      Committed by Deepa Dinamani
      The get_seconds() API is not y2038-safe on 32-bit systems and is
      deprecated.  Replace it with calls to the ktime_get_real_seconds() API
      instead, and change the mddev structure's timestamp types to time64_t
      accordingly.
      
      32 bit signed timestamps will overflow in the year 2038.
      
      Change the user interface mdu_array_info_s structure timestamps:
      ctime and utime values used in ioctls GET_ARRAY_INFO and
      SET_ARRAY_INFO to unsigned int. This will extend the field to last
      until the year 2106.
      The long term plan is to get rid of ctime and utime values in
      this structure as this information can be read from the on-disk
      meta data directly.
      
      Clamp the time64_t timestamps to positive values with a max of
      U32_MAX when returning from the GET_ARRAY_INFO ioctl, to accommodate
      the above change of the timestamps' data type to unsigned int.
      
      v0.90 on disk meta data uses u32 for maintaining time stamps.
      So this will also last until year 2106.
      Assumption is that the usage of v0.90 will be deprecated by
      year 2106.
      
      Timestamp fields in the on disk meta data for v1.0 version already
      use 64 bit data types. Remove the truncation of the bits while
      writing to or reading from these from the disk.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
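      A sketch of the two sides of the change (locals simplified; clamping
      happens only at the legacy ioctl boundary):

        mddev->utime = ktime_get_real_seconds();   /* was: get_seconds() */

        /* GET_ARRAY_INFO: squeeze a time64_t into the u32-sized field */
        info->ctime = clamp_t(time64_t, mddev->ctime, 0, U32_MAX);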
    • md: avoid warning for 32-bit sector_t · 3312c951
      Committed by Arnd Bergmann
      When CONFIG_LBDAF is not set, sector_t is only 32 bits wide, which
      means we cannot have devices with more than 2TB, and the code that
      is trying to handle compatibility support for large devices in
      md version 0.90 is meaningless but also causes a compile-time warning:
      
      drivers/md/md.c: In function 'super_90_load':
      drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
      drivers/md/md.c: In function 'super_90_rdev_size_change':
      drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]
      
      This adds a check for CONFIG_LBDAF to avoid even getting into this
      code path, and also adds an explicit cast to let the compiler know
      it doesn't have to warn about the truncation.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
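      A sketch of the guard plus explicit cast (the condition in
      super_90_load() is simplified here):

        if (IS_ENABLED(CONFIG_LBDAF) &&
            (u64)rdev->sectors >= (2ULL << 32) && sb->level >= 1)
                rdev->sectors = (sector_t)(2ULL << 32) - 2;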
    • raid5-cache: free meta_page earlier · ad66d445
      Committed by Christoph Hellwig
      Once the I/O has completed we don't need the meta page anymore.  As
      the io_units can live on for a long time, this reduces memory pressure
      a bit.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>