1. 28 March 2017, 12 commits
    • blk-throttle: make bandwidth change smooth · 7394e31f
      Committed by Shaohua Li
      When cgroups all reach low limit, cgroups can dispatch more IO. This
      could make some cgroups dispatch more IO but others not, and even some
      cgroups could dispatch less IO than their low limit. For example, cg1
      low limit 10MB/s, cg2 low limit 80MB/s, and assume the disk's maximum
      bandwidth is 120MB/s for the workload. Their bps could be something like this:
      
      cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80
      
      At T1, all cgroups reach their low limit, so they can dispatch more IO later.
      Then cg1 dispatches more IO and cg2 has no room to dispatch enough IO. At
      T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches less IO
      than its low limit of 80MB/s, we downgrade the queue from LIMIT_MAX to
      LIMIT_LOW, and then all cgroups are throttled to their low limit (T3). cg2
      will have bandwidth below its low limit most of the time.
      
      The big problem here is that we don't know the maximum bandwidth of the
      workload, so we can't make a smart decision to avoid the situation. This
      patch makes cgroup bandwidth change smoothly. After the disk upgrades from
      LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth up to
      their max limit immediately. Their bandwidth limit is increased
      gradually to avoid the above situation. So the above example becomes
      something like:
      
      cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
      -> 45/75 -> 22/98
      
      In this way cgroup bandwidth stays above the low limit most of the
      time. This still doesn't fully utilize disk bandwidth, but that's
      the price we pay for sharing.
      
      Scale up is linear: the limit scales up by 1/2 of the .low limit every
      throtl_slice after the upgrade. The scale up stops once the adjusted limit
      hits the .max limit. Scale down is exponential: we cut the scale value in
      half if a cgroup doesn't hit its .low limit. If the scale reaches 0, we
      fully downgrade the queue to the LIMIT_LOW state.
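      A minimal sketch of that scaling arithmetic (illustrative only; the helper
      names are hypothetical, not the actual blk-throttle code). 'scale' starts
      at 0 on upgrade and grows by one per throtl_slice:

      #include <stdint.h>

      /* Effective limit while ramping up: low + scale * low/2, capped at max. */
      static uint64_t adjusted_limit(uint64_t low, uint64_t max, unsigned int scale)
      {
              uint64_t limit = low + (low * scale) / 2;

              return limit > max ? max : limit;
      }

      /* Exponential back-off: halve the scale when a cgroup misses its .low
       * limit; a scale of 0 means the queue fully drops back to LIMIT_LOW. */
      static unsigned int scale_down(unsigned int scale)
      {
              return scale / 2;
      }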
      
      Note this doesn't completely prevent a cgroup from running under its low
      limit. The best way to guarantee a cgroup doesn't run under its low limit
      is to set a max limit. For example, if we set cg1's max limit to 40, cg2
      will never run under its low limit.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7394e31f
    • blk-throttle: detect completed idle cgroup · aec24246
      Committed by Shaohua Li
      A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
      cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
      we can't move the state machine to a higher level and all cgroups will be
      throttled to their low limit, so we waste bandwidth. Detecting an idle
      cgroup is hard. This patch handles a simple case: a cgroup that doesn't
      dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
      the bandwidth.
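      A rough sketch of that special case (the field and function names are made
      up for illustration; the real throtl code tracks this differently):

      /* A cgroup whose IO count did not move since the last check is treated
       * as idle, and its low limit is ignored in the upgrade decision. */
      struct tg_idle_check {
              unsigned long long last_io_count;   /* IOs seen at the previous check */
              unsigned long long io_count;        /* IOs seen now */
              unsigned long long low_bps;         /* configured .low limit, 0 if unset */
      };

      static int low_limit_applies(const struct tg_idle_check *tg)
      {
              if (tg->io_count == tg->last_io_count)
                      return 0;                   /* idle: don't let it block upgrade */
              return tg->low_bps != 0;
      }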
      
      Please note this will be replaced with a more sophisticated algorithm
      later, but it demonstrates the idea of how we handle idle cgroups, so I
      leave it here.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aec24246
    • blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Committed by Shaohua Li
      The throtl_slice is 100ms by default. This is a long time for an SSD; a
      lot of IO can complete in it. To give cgroups smoother throughput, we
      choose a smaller value (20ms) for SSDs.
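      The selection could look roughly like this (a sketch; the constant and
      helper names are illustrative, presumably keyed off the queue's
      non-rotational flag in the real code):

      #define DFL_THROTL_SLICE_HD  100u   /* ms, rotational disks */
      #define DFL_THROTL_SLICE_SSD  20u   /* ms, SSDs */

      static unsigned int default_throtl_slice(int queue_is_nonrot)
      {
              return queue_is_nonrot ? DFL_THROTL_SLICE_SSD : DFL_THROTL_SLICE_HD;
      }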
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d61fcfa4
    • blk-throttle: make throtl_slice tunable · 297e3d85
      Committed by Shaohua Li
      throtl_slice is important for blk-throttling. It's called a slice
      internally, but it really is the time window over which blk-throttling
      samples data. blk-throttling makes decisions based on those samples. An
      example is bandwidth measurement: a cgroup's bandwidth is measured over
      the throtl_slice interval.
      
      A small throtl_slice means cgroups have smoother throughput but burn
      more CPU. The default value is 100ms, which is not appropriate for all
      disks: a fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
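      A simplified sketch of that window-based sampling (the structure and
      arithmetic are illustrative, not the kernel's actual bookkeeping):

      #include <stdint.h>

      struct bw_window {
              uint64_t bytes;          /* bytes dispatched in the current window */
              uint64_t window_start;   /* window start time, in ms */
              uint64_t last_bps;       /* bandwidth measured over the last window */
      };

      static void bw_sample(struct bw_window *w, uint64_t now_ms, uint64_t slice_ms)
      {
              if (now_ms - w->window_start >= slice_ms) {
                      /* bandwidth = bytes dispatched / elapsed window time */
                      w->last_bps = w->bytes * 1000 / (now_ms - w->window_start);
                      w->bytes = 0;
                      w->window_start = now_ms;
              }
      }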
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' better reflects what it is.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      297e3d85
    • blk-throttle: make sure expire time isn't too big · 06cceedc
      Committed by Shaohua Li
      A cgroup could be throttled to a limit, but when all cgroups cross the
      high limit, the queue enters a higher state and so the group should be
      throttled to a higher limit. It's possible the cgroup is sleeping because
      of throttling and the other cgroups don't dispatch IO any more. In this
      case, nobody can trigger the current downgrade/upgrade logic. To fix this
      issue, we could either set up a timer to wake up the cgroup when other
      cgroups are idle, or make sure this cgroup doesn't sleep too long. Setting
      up a timer means we must change the timer very frequently. This patch
      chooses the latter. Capping the cgroup sleep time doesn't change the
      cgroup's bps/iops, but it can make the cgroup wake up more frequently,
      which isn't a big issue because throtl_slice * 8 is already quite big.
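      As a minimal sketch (names are illustrative, not the actual throttle code):

      static unsigned long clamp_expire(unsigned long wait, unsigned long throtl_slice)
      {
              unsigned long max_wait = throtl_slice * 8;   /* cap the sleep time */

              return wait > max_wait ? max_wait : wait;
      }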
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      06cceedc
    • blk-throttle: add downgrade logic · 3f0abd80
      Committed by Shaohua Li
      When the queue state machine is in the LIMIT_MAX state but a cgroup stays
      below its low limit for some time, the queue should be downgraded to the
      lower state, since that cgroup's low limit isn't being met.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      3f0abd80
    • blk-throttle: add upgrade logic for LIMIT_LOW state · c79892c5
      Committed by Shaohua Li
      When the queue is in the LIMIT_LOW state and all cgroups with a low limit
      cross their bps/iops limits, we upgrade the queue's state to
      LIMIT_MAX. To determine whether a cgroup exceeds its limit, we check
      whether the cgroup has a pending request. Since the cgroup is throttled
      according to the limit, a pending request means the cgroup has reached
      the limit.
      
      If a cgroup has limits set for both read and write, we consider the
      combination of them for the upgrade decision. The reason is that read IO
      and write IO can interfere with each other: if we based the upgrade on
      one IO direction only, the other direction could be severely harmed.
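      A rough sketch of the combined check described above (all names here are
      hypothetical; the in-kernel logic is more involved):

      struct tg_state {
              unsigned long long read_bps, write_bps;           /* measured */
              unsigned long long low_read_bps, low_write_bps;   /* configured .low */
              int has_queued_bio;                               /* throttled, pending IO */
      };

      /* A cgroup counts as having reached its low limit if the combined
       * read+write bandwidth crosses the combined low limits, or if it still
       * has queued bios (meaning throttling is what is holding it back). */
      static int tg_reached_low_limit(const struct tg_state *tg)
      {
              unsigned long long bps = tg->read_bps + tg->write_bps;
              unsigned long long low = tg->low_read_bps + tg->low_write_bps;

              return (low && bps >= low) || tg->has_queued_bio;
      }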
      
      For a cgroup hierarchy, there are two cases. If a child has a lower low
      limit than its parent, the parent's low limit is meaningless: if the
      children's bps/iops cross the low limit, we can upgrade the queue state.
      In the other case, a child has a higher low limit than its parent, so the
      child's low limit is meaningless: as long as the parent's bps/iops (which
      is the sum of its children's bps/iops) cross the low limit, we can
      upgrade the queue state.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c79892c5
    • blk-throttle: configure bps/iops limit for cgroup in low limit · b22c417c
      Committed by Shaohua Li
      Each queue will have a state machine. Initially the queue is in the
      LIMIT_LOW state, which means all cgroups are throttled according to their
      low limit. After all cgroups with a low limit cross that limit, the queue
      state is upgraded to the LIMIT_MAX state.
      For the max limit, a cgroup uses the limit configured by the user.
      For the low limit, a cgroup uses the minimum of the low limit and the
      max limit configured by the user. If that minimum is 0, meaning the
      cgroup doesn't configure a low limit, we use the max limit to throttle
      the cgroup and the cgroup is immediately ready to upgrade to LIMIT_MAX.
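      As a small sketch of that selection (illustrative only):

      #include <stdint.h>

      /* Limit used while the queue is in LIMIT_LOW: min(low, max), or the max
       * limit when no low limit is configured (low_conf == 0). */
      static uint64_t low_state_limit(uint64_t low_conf, uint64_t max_conf)
      {
              if (!low_conf)
                      return max_conf;
              return low_conf < max_conf ? low_conf : max_conf;
      }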
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b22c417c
    • blk-throttle: add .low interface · cd5ab1b0
      Committed by Shaohua Li
      Add a low limit for cgroups and the corresponding cgroup interface. To be
      consistent with memcg, we allow users to configure a .low limit higher
      than the .max limit, but the internal logic always assumes the .low limit
      is lower than the .max limit. So we add extra bps/iops_conf fields in
      throtl_grp for the userspace configuration; the old bps/iops fields in
      throtl_grp hold the actual limits we use for throttling.
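      Roughly, the split looks like this (a sketch; the real throtl_grp layout
      and types may differ):

      enum { LIMIT_LOW, LIMIT_MAX, LIMIT_CNT };

      struct tg_limits {
              /* What the user configured via the cgroup files; .low may be
               * written higher than .max here. */
              unsigned long long bps_conf[2][LIMIT_CNT];    /* [READ/WRITE][limit] */
              unsigned int       iops_conf[2][LIMIT_CNT];
              /* What throttling actually uses, kept with .low <= .max. */
              unsigned long long bps[2][LIMIT_CNT];
              unsigned int       iops[2][LIMIT_CNT];
      };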
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cd5ab1b0
    • blk-throttle: add configure option for new .low interface · 327ffb9b
      Committed by Shaohua Li
      As discussed at LSF, add a config option for the interface and mark it
      as experimental, so people can try and test it.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      327ffb9b
    • blk-throttle: prepare support multiple limits · 9f626e37
      Committed by Shaohua Li
      We are going to support low/max limits; each cgroup will have two limits
      after that. This patch prepares for the multiple-limits change.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9f626e37
    • blk-throttle: use U64_MAX/UINT_MAX to replace -1 · 2ab5492d
      Committed by Shaohua Li
      Clean up the code to avoid using -1.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2ab5492d
  2. 25 March 2017, 3 commits
  3. 23 March 2017, 7 commits
  4. 22 March 2017, 6 commits
    • block: fix stacked driver stats init and free · a83b576c
      Committed by Jens Axboe
      If a driver allocates a queue for stacked usage, it currently does not
      get stats allocated. This causes the later init of, e.g., writeback
      throttling to blow up. Move the stats init to queue allocation instead.
      
      Additionally, allow unregistration of a NULL callback. This avoids
      having the caller check for that, fixing another oops on removal of a
      block device that doesn't have poll stats allocated.
      
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a83b576c
    • blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Committed by Omar Sandoval
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for either of the
      in-tree users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
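      A heavily reduced sketch of the callback-based scheme (names and types
      are illustrative, not the in-kernel blk-stat API): a user registers a
      bucketing function and a timer callback; completions are accounted into
      per-bucket accumulators, and the timer hands the buckets collected during
      the window back to the callback.

      struct stat_bucket {
              unsigned long long nr;         /* samples in this bucket */
              unsigned long long total_ns;   /* summed latency */
      };

      struct stat_cb {
              int nr_buckets;
              int (*bucket_fn)(int is_write, unsigned int bytes); /* -> bucket index */
              void (*timer_fn)(struct stat_cb *cb);               /* consumes cb->buckets */
              struct stat_bucket *buckets;   /* per-CPU in the real implementation */
      };

      static void stat_account(struct stat_cb *cb, int is_write,
                               unsigned int bytes, unsigned long long lat_ns)
      {
              int b = cb->bucket_fn(is_write, bytes);

              if (b >= 0 && b < cb->nr_buckets) {
                      cb->buckets[b].nr++;
                      cb->buckets[b].total_ns += lat_ns;
              }
      }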
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      34dbad5d
    • blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c · 4875253f
      Committed by Omar Sandoval
      This is an implementation detail that no one outside of blk-stat.c uses.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4875253f
    • blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Committed by Omar Sandoval
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fa2e39cb
    • block: remove extra calls to wbt_exit() · 0315b159
      Committed by Omar Sandoval
      We always call wbt_exit() from blk_release_queue(), so these are
      unnecessary.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0315b159
    • blk-stat: fix blk_stat_sum() if all samples are batched · 7d8d0014
      Committed by Omar Sandoval
      We need to flush the batch _before_ we check the number of samples,
      otherwise we'll miss all of the batched samples.
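      The shape of the fix, as a sketch (structure names are illustrative):

      struct rq_stat {
              unsigned long long nr_samples, sum;       /* flushed samples */
              unsigned long long nr_batch, batch_sum;   /* not yet flushed */
      };

      static void stat_flush_batch(struct rq_stat *s)
      {
              s->nr_samples += s->nr_batch;
              s->sum        += s->batch_sum;
              s->nr_batch = s->batch_sum = 0;
      }

      static unsigned long long stat_mean(struct rq_stat *s)
      {
              stat_flush_batch(s);   /* must run before the sample-count check */
              return s->nr_samples ? s->sum / s->nr_samples : 0;
      }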
      
      Fixes: cf43e6be ("block: add scalable completion tracking of requests")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7d8d0014
  5. 15 March 2017, 1 commit
  6. 13 March 2017, 1 commit
  7. 12 March 2017, 1 commit
    • blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      Committed by NeilBrown
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f5fe1b51
  8. 09 March 2017, 8 commits
    • blk: improve order of bio handling in generic_make_request() · 79bd9959
      Committed by NeilBrown
      To avoid recursion on the kernel stack when stacked block devices
      are in use, generic_make_request() will, when called recursively,
      queue new requests for later handling.  They will be handled when the
      make_request_fn for the current bio completes.
      
      If any bios are submitted by a make_request_fn, these will ultimately
      be handled sequentially.  If the handling of one of those generates
      further requests, they will be added to the end of the queue.
      
      This strict first-in-first-out behaviour can lead to deadlocks in
      various ways, normally because a request might need to wait for a
      previous request to the same device to complete.  This can happen when
      they share a mempool, and can happen due to interdependencies
      particular to the device.  Both md and dm have examples where this happens.
      
      These deadlocks can be eradicated by more selective ordering of bios:
      specifically, by handling them in depth-first order.  That is: when the
      handling of one bio generates one or more further bios, they are
      handled immediately after the parent, before any siblings of the
      parent.  That way, when generic_make_request() calls make_request_fn
      for some particular device, we can be certain that all previously
      submitted requests for that device have been completely handled and are
      not waiting for anything in the queue of requests maintained in
      generic_make_request().
      
      An easy way to achieve this would be to use a last-in-first-out stack
      instead of a queue.  However this will change the order of consecutive
      bios submitted by a make_request_fn, which could have unexpected consequences.
      Instead we take a slightly more complex approach.
      A fresh queue is created for each call to a make_request_fn.  After it
      completes, any bios for a different device are placed on the front of the
      main queue, followed by any bios for the same device, followed by all bios
      that were already on the queue before the make_request_fn was called.
      This provides the depth-first approach without reordering bios on the
      same level.
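      Schematically, the reordering step looks like this (a sketch using the
      bio_list helpers; bio_target_queue() is a hypothetical stand-in for
      looking up the queue a bio is destined for):

      /* After one make_request_fn call returns, split the bios it queued by
       * target device and splice them ahead of the pre-existing backlog:
       * lower-device bios first, then same-device bios, then the old queue. */
      static void reorder_after_make_request(struct bio_list *pending,
                                             struct bio_list *just_queued,
                                             struct request_queue *q)
      {
              struct bio_list lower, same;
              struct bio *bio;

              bio_list_init(&lower);
              bio_list_init(&same);
              while ((bio = bio_list_pop(just_queued)) != NULL) {
                      if (bio_target_queue(bio) == q)        /* hypothetical helper */
                              bio_list_add(&same, bio);
                      else
                              bio_list_add(&lower, bio);
              }
              bio_list_merge(just_queued, &lower);
              bio_list_merge(just_queued, &same);
              bio_list_merge(just_queued, pending);
      }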
      
      This, by itself, is not enough to remove all deadlocks.  It just makes
      it possible for drivers to take the extra step required themselves.
      
      To avoid deadlocks, drivers must never risk waiting for a request
      after submitting one to generic_make_request.  This includes never
      allocating from a mempool twice in one call to a make_request_fn.
      
      A common pattern in drivers is to call bio_split() in a loop, handling
      the first part and then looping around to possibly split the next part.
      Instead, a driver that finds it needs to split a bio should queue
      (with generic_make_request) the second part, handle the first part,
      and then return.  The new code in generic_make_request will ensure the
      requests to underlying bios are processed first, then the second bio
      that was split off.  If it splits again, the same process happens.  In
      each case one bio will be completely handled before the next one is attempted.
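      The recommended pattern, sketched with the usual bio_split()/bio_chain()
      helpers ('my_bio_set' and handle_one_part() stand in for driver-specific
      pieces):

      static void handle_one_part(struct bio *bio);   /* driver-specific work */

      static void my_make_request(struct bio *bio, unsigned int max_sectors,
                                  struct bio_set *my_bio_set)
      {
              if (bio_sectors(bio) > max_sectors) {
                      struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
                                                    my_bio_set);

                      bio_chain(split, bio);        /* remainder completes after split */
                      generic_make_request(bio);    /* queue the remainder for later */
                      bio = split;                  /* handle only the first part now */
              }
              handle_one_part(bio);
      }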
      
      With this in place, it should be possible to disable the
      punt_bios_to_recover() recovery thread for many block devices, and
      eventually it may be possible to remove it completely.
      
      Ref: http://www.spinics.net/lists/raid/msg54680.html
      Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
      Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      79bd9959
    • Revert "scsi, block: fix duplicate bdi name registration crashes" · c01228db
      Committed by Jan Kara
      This reverts commit 0dba1314. It causes leaking of device numbers for
      SCSI when SCSI registers multiple gendisks for one request_queue in
      succession. It can be easily reproduced using Omar's script [1] on a
      kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      Furthermore, the protection provided by this commit is no longer needed,
      as the problem it was fixing was also fixed by commit 165a5e22
      "block: Move bdi_unregister() to del_gendisk()".
      
      [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c01228db
    • block: Make del_gendisk() safer for disks without queues · 90f16fdd
      Committed by Jan Kara
      Commit 165a5e22 "block: Move bdi_unregister() to del_gendisk()"
      added a disk->queue dereference to del_gendisk(). Although del_gendisk()
      is not supposed to be called without disk->queue valid, and
      blk_unregister_queue() warns in that case, this change makes it oops
      instead. Return to the old, more robust behavior of just warning when
      del_gendisk() gets called for a gendisk with disk->queue being NULL.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      90f16fdd
    • block/sed: Fix opal user range check and unused variables · b0bfdfc2
      Committed by Jon Derrick
      Fixes the check that the opal user is within the valid range, and cleans
      up unused method variables.
      Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: Scott Bauer <scott.bauer@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b0bfdfc2
    • blk-mq: free hctx->cpumask in release handler of hctx's kobject · 01388df3
      Committed by Ming Lei
      It is obvious that hctx->cpumask is per-hctx and that the two share the
      same lifetime, so this patch moves the freeing of hctx->cpumask into the
      release handler of hctx's kobject.
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      01388df3
    • blk-mq: make lifetime consistent between hctx and its kobject · 6c8b232e
      Committed by Ming Lei
      This patch removes kobject_put() on hctx in __blk_mq_unregister_dev()
      and tries to keep the lifetime consistent between hctx and hctx's kobject.
      
      Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
      totally symmetrical, and the kobject's refcounter drops to zero exactly
      when the hctx is freed.
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6c8b232e
    • blk-mq: make lifetime consitent between q/ctx and its kobject · 7ea5fe31
      Committed by Ming Lei
      Currently, from the kobject view, both q->mq_kobj and ctx->kobj can
      be released during one cycle of blk_mq_register_dev() and
      blk_mq_unregister_dev(). Actually, a sw queue's lifetime is the same as
      its request queue's, which is covered by request_queue->kobj.
      
      So we don't need to call kobject_put() for the two kinds of kobjects
      in __blk_mq_unregister_dev(); instead we do that in the release handler
      of the request queue.
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7ea5fe31
    • blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() · 737f98cf
      Committed by Ming Lei
      Both q->mq_kobj and the sw queues' kobjects should be initialized only
      once, instead of doing that in each add_disk context.
      
      Also, this patch removes the clearing of ctx in blk_mq_init_cpu_queues(),
      because the percpu allocator zero-fills the allocated variables.
      
      This patch fixes one issue[1] reported by Omar.
      
      [1] kernel warning when doing unbind/bind on one scsi-mq device
      
      [   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
      [   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
      [   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
      [   19.350920] Workqueue: events_unbound async_run_entry_fn
      [   19.350920] Call Trace:
      [   19.350920]  dump_stack+0x63/0x83
      [   19.350920]  kobject_init+0x77/0x90
      [   19.350920]  blk_mq_register_dev+0x40/0x130
      [   19.350920]  blk_register_queue+0xb6/0x190
      [   19.350920]  device_add_disk+0x1ec/0x4b0
      [   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
      [   19.350920]  async_run_entry_fn+0x48/0x150
      [   19.350920]  process_one_work+0x1d0/0x480
      [   19.350920]  worker_thread+0x48/0x4e0
      [   19.350920]  kthread+0x101/0x140
      [   19.350920]  ? process_one_work+0x480/0x480
      [   19.350920]  ? kthread_create_on_node+0x60/0x60
      [   19.350920]  ret_from_fork+0x2c/0x40
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      737f98cf
  9. 03 March 2017, 1 commit