1. 30 1月, 2018 6 次提交
    • K
      dm mpath selector: more evenly distribute ties · f2042605
      Khazhismel Kumykov 提交于
      Move the last used path to the end of the list (least preferred) so that
      ties are more evenly distributed.
      
      For example, in case with three paths with one that is slower than
      others, the remaining two would be unevenly used if they tie. This is
      due to the rotation not being a truely fair distribution.
      
      Illustrated: paths a, b, c, 'c' has 1 outstanding IO, a and b are 'tied'
      Three possible rotations:
      (a, b, c) -> best path 'a'
      (b, c, a) -> best path 'b'
      (c, a, b) -> best path 'a'
      (a, b, c) -> best path 'a'
      (b, c, a) -> best path 'b'
      (c, a, b) -> best path 'a'
      ...
      
      So 'a' is used 2x more than 'b', although they should be used evenly.
      
      With this change, the most recently used path is always the least
      preferred, removing this bias resulting in even distribution.
      (a, b, c) -> best path 'a'
      (b, c, a) -> best path 'b'
      (c, a, b) -> best path 'a'
      (c, b, a) -> best path 'b'
      ...
      Signed-off-by: NKhazhismel Kumykov <khazhy@google.com>
      Reviewed-by: NMartin Wilck <mwilck@suse.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      f2042605
    • S
      dm unstripe: fix target length versus number of stripes size check · cc656619
      Scott Bauer 提交于
      Since the unstripe target takes a target length which is the
      size of *one* striped member we're trying to expose, not the
      total size of *all* the striped members, the check does not
      make sense and fails for some striped setups.
      
      For example, say we have a 4TB striped device:
      or 3907018496 sectors per underlying device:
      
      if (sector_div(width, uc->stripes)) :
         3907018496 / 2(num stripes)  == 1953509248
      
      tmp_len = width;
      if (sector_div(tmp_len, uc->chunk_size)) :
         1953509248 / 256(chunk size) == 7630895.5
         (fails)
      
      Fix this by removing the first check which isn't valid for unstriping.
      Signed-off-by: NScott Bauer <scott.bauer@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      cc656619
    • L
      dm thin: fix trailing semicolon in __remap_and_issue_shared_cell · bd6d1e0a
      Luis de Bethencourt 提交于
      The trailing semicolon is an empty statement that does no operation.
      Removing it since it doesn't do anything.
      Signed-off-by: NLuis de Bethencourt <luisbg@kernel.org>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      bd6d1e0a
    • M
      dm table: fix NVMe bio-based dm_table_determine_type() validation · eaa160ed
      Mike Snitzer 提交于
      The 'verify_rq_based:' code in dm_table_determine_type() was checking
      all devices in the DM table rather than only checking the data devices.
      Fix this by using the immutable target's iterate_devices method.
      
      Also, tweak the block of dm_table_determine_type() code that decides
      whether to upgrade from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED so
      that it makes sure the immutable_target doesn't support require
      splitting IOs.
      
      These changes have been verified to allow a "thin-pool" target whose
      data device is an NVMe device to be upgraded to DM_TYPE_NVME_BIO_BASED.
      Using the thin-pool in NVMe bio-based mode was verified to pass all the
      device-mapper-test-suite's "thin-provisioning" tests.
      
      Also verified that request-based DM multipath (with queue_mode "rq" and
      "mq") works as expected using the 'mptest' harness.
      
      Fixes: 22c11858 ("dm: introduce DM_TYPE_NVME_BIO_BASED")
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      eaa160ed
    • M
      dm: various cleanups to md->queue initialization code · c12c9a3c
      Mike Snitzer 提交于
      Also, add dm_sysfs_init() error handling to dm_create().
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c12c9a3c
    • M
      dm mpath: delay the retry of a request if the target responded as busy · ac514ffc
      Mike Snitzer 提交于
      Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
      multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
      This is beneficial to do if BLK_STS_RESOURCE is returned from the target
      (because target is busy).
      
      Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(),
      indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.
      
      For old .request_fn: use blk_delay_queue().
      
      bio-based multipath doesn't have feature parity with request-based for
      retryable error requeues; that is something that'll need fixing in the
      future.
      Suggested-by: NBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NBart Van Assche <bart.vanassche@wdc.com>
      [as interpreted from Bart's "... patch looks fine to me."]
      ac514ffc
  2. 18 1月, 2018 1 次提交
    • M
      blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei 提交于
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request.  Whereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      396eaf21
  3. 17 1月, 2018 19 次提交
  4. 16 1月, 2018 1 次提交
    • T
      raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Tomasz Majchrzak 提交于
      In order to provide data consistency with PPL for disks with write-back
      cache enabled all data has to be flushed to disks before next PPL
      entry. The disks to be flushed are marked in the bitmap. It's modified
      under a mutex and it's only read after PPL io unit is submitted.
      
      A limitation of 64 disks in the array has been introduced to keep data
      structures and implementation simple. RAID5 arrays with so many disks are
      not likely due to high risk of multiple disks failure. Such restriction
      should not be a real life limitation.
      
      With write-back cache disabled next PPL entry is submitted when data write
      for current one completes. Data flush defers next log submission so trigger
      it when there are no stripes for handling found.
      
      As PPL assures all data is flushed to disk at request completion, just
      acknowledge flush request when PPL is enabled.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  5. 15 1月, 2018 1 次提交
    • M
      dm: fix incomplete request_queue initialization · c100ec49
      Mike Snitzer 提交于
      DM is no longer prone to having its request_queue be improperly
      initialized.
      
      Summary of changes:
      
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      
      A very welcome side-effect of these changes is DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk()).
      2) call elv_register_queue() to get .request_fn request-based DM
      device's "iosched" exposed in syfs.
      
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      https://patchwork.kernel.org/patch/10067961/
      
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Tested-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c100ec49
  6. 11 1月, 2018 1 次提交
  7. 10 1月, 2018 1 次提交
  8. 09 1月, 2018 10 次提交
    • M
      bcache: fix writeback target calc on large devices · 616486ab
      Michael Lyle 提交于
      Bcache needs to scale the dirty data in the cache over the multiple
      backing disks in order to calculate writeback rates for each.
      The previous code did this by multiplying the target number of dirty
      sectors by the backing device size, and expected it to fit into a
      uint64_t; this blows up on relatively small backing devices.
      
      The new approach figures out the bdev's share in 16384ths of the overall
      cached data.  This is chosen to cope well when bdevs drastically vary in
      size and to ensure that bcache can cross the petabyte boundary for each
      backing device.
      
      This has been improved based on Tang Junhui's feedback to ensure that
      every device gets a share of dirty data, no matter how small it is
      compared to the total backing pool.
      
      The existing mechanism is very limited; this is purely a bug fix to
      remove limits on volume size.  However, there still needs to be change
      to make this "fair" over many volumes where some are idle.
      Reported-by: NJack Douglas <jack@douglastechnology.co.uk>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      616486ab
    • C
      bcache: fix misleading error message in bch_count_io_errors() · 5138ac67
      Coly Li 提交于
      Bcache only does recoverable I/O for read operations by calling
      cached_dev_read_error(). For write opertions there is no I/O recovery for
      failed requests.
      
      But in bch_count_io_errors() no matter read or write I/Os, before errors
      counter reaches io error limit, pr_err() always prints "IO error on %,
      recoverying". For write requests this information is misleading, because
      there is no I/O recovery at all.
      
      This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
      prints "recovering" by pr_err() when the bio direction is READ.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5138ac67
    • C
      bcache: reduce cache_set devices iteration by devices_max_used · 2831231d
      Coly Li 提交于
      Member devices of struct cache_set is used to reference all attached
      bcache devices to this cache set. If it is treated as array of pointers,
      size of devices[] is indicated by member nr_uuids of struct cache_set.
      
      nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(),
      	bucket_bytes(c) / sizeof(struct uuid_entry)
      Bucket size is determined by user space tool "make-bcache", by default it
      is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
      nr_uuids value is 4096 from the above calculation.
      
      Every time when bcache code iterates bcache devices of a cache set, all
      the 4096 pointers are checked even only 1 bcache device is attached to the
      cache set, that's a wast of time and unncessary.
      
      This patch adds a member devices_max_used to struct cache_set. Its value
      is 1 + the maximum used index of devices[] in a cache set. When iterating
      all valid bcache devices of a cache set, use c->devices_max_used in
      for-loop may reduce a lot of useless checking.
      
      Personally, my motivation of this patch is not for performance, I use it
      in bcache debugging, which helps me to narrow down the scape to check
      valid bcached devices of a cache set.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2831231d
    • Z
      bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct() · b40503ea
      Zhai Zhaoxuan 提交于
      The function cached_dev_make_request() and flash_dev_make_request() call
      generic_start_io_acct() with (struct bcache_device)->disk when they start a
      closure. Then the function bio_complete() calls generic_end_io_acct() with
      (struct search)->orig_bio->bi_disk when the closure has done.
      Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is
      called with a wrong device queue.
      
      It causes the "inflight" (in struct hd_struct) counter keep increasing
      without decreasing.
      
      This patch fix the problem by calling generic_end_io_acct() with
      (struct bcache_device)->disk.
      Signed-off-by: NZhai Zhaoxuan <kxuanobj@gmail.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b40503ea
    • K
      bcache: mark closure_sync() __sched · ce439bf7
      Kent Overstreet 提交于
      [edit by mlyle: include sched/debug.h to get __sched]
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ce439bf7
    • K
      bcache: Fix, improve efficiency of closure_sync() · e4bf7919
      Kent Overstreet 提交于
      Eliminates cases where sync can race and fail to complete / get stuck.
      Removes many status flags and simplifies entering-and-exiting closure
      sleeping behaviors.
      
      [mlyle: fixed conflicts due to changed return behavior in mainline.
      extended commit comment, and squashed down two commits that were mostly
      contradictory to get to this state.  Changed __set_current_state to
      set_current_state per Jens review comment]
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e4bf7919
    • M
      bcache: allow quick writeback when backing idle · b1092c9a
      Michael Lyle 提交于
      If the control system would wait for at least half a second, and there's
      been no reqs hitting the backing disk for awhile: use an alternate mode
      where we have at most one contiguous set of writebacks in flight at a
      time. (But don't otherwise delay).  If front-end IO appears, it will
      still be quick, as it will only have to contend with one real operation
      in flight.  But otherwise, we'll be sending data to the backing disk as
      quickly as it can accept it (with one op at a time).
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Acked-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b1092c9a
    • M
      bcache: writeback: properly order backing device IO · 6e6ccc67
      Michael Lyle 提交于
      Writeback keys are presently iterated and dispatched for writeback in
      order of the logical block address on the backing device.  Multiple may
      be, in parallel, read from the cache device and then written back
      (especially when there are contiguous I/O).
      
      However-- there was no guarantee with the existing code that the writes
      would be issued in LBA order, as the reads from the cache device are
      often re-ordered.  In turn, when writing back quickly, the backing disk
      often has to seek backwards-- this slows writeback and increases
      utilization.
      
      This patch introduces an ordering mechanism that guarantees that the
      original order of issue is maintained for the write portion of the I/O.
      Performance for writeback is significantly improved when there are
      multiple contiguous keys or high writeback rates.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Tested-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6e6ccc67
    • T
      bcache: fix wrong return value in bch_debug_init() · 539d39eb
      Tang Junhui 提交于
      in bch_debug_init(), ret is always 0, and the return value is useless,
      change it to return 0 if be success after calling debugfs_create_dir(),
      else return a non-zero value.
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      539d39eb
    • T
      bcache: segregate flash only volume write streams · 4eca1cb2
      Tang Junhui 提交于
      In such scenario that there are some flash only volumes
      , and some cached devices, when many tasks request these devices in
      writeback mode, the write IOs may fall to the same bucket as bellow:
      | cached data | flash data | cached data | cached data| flash data|
      then after writeback of these cached devices, the bucket would
      be like bellow bucket:
      | free | flash data | free | free | flash data |
      
      So, there are many free space in this bucket, but since data of flash
      only volumes still exists, so this bucket cannot be reclaimable,
      which would cause waste of bucket space.
      
      In this patch, we segregate flash only volume write streams from
      cached devices, so data from flash only volumes and cached devices
      can store in different buckets.
      
      Compare to v1 patch, this patch do not add a additionally open bucket
      list, and it is try best to segregate flash only volume write streams
      from cached devices, sectors of flash only volumes may still be mixed
      with dirty sectors of cached device, but the number is very small.
      
      [mlyle: fixed commit log formatting, permissions, line endings]
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4eca1cb2