1. 09 Jun 2020, 1 commit
  2. 18 Mar 2020, 1 commit
  3. 15 Jan 2020, 6 commits
    • alinux: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 0fd4aa6d
      Xiaoguang Wang authored
      Currently in blk_throtl_bio(), if one bio exceeds its throtl_grp's bps
      or iops limit, the bio is queued on the throtl_grp's
      throtl_service_queue. The mm subsystem will then submit even more
      pages, although the underlying device cannot handle these io requests,
      and this makes a large amount of pages enter writeback prematurely;
      if some process later writes some of these pages, it will wait for a
      long time.
      
      I have done some tests: one process does buffered writes on a 1GB
      file, with the process's blkcg max bps limit set to 10MB/s, and I
      observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      This Writeback value is just too big: many bios have been queued on
      the throtl_grp's throtl_service_queue, and if a process tries to write
      the page of the last bio in this queue, it calls
      wait_on_page_writeback(page), which must wait for all the previous
      bios to finish and therefore takes a long time. We have also seen 120s
      hung task warnings on our servers.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we simply limit the throtl_service_queue's maximum
      number of queued bios; currently we derive the cap from the
      throtl_grp's bps limit or iops limit, and if a new bio would still
      exceed it, we just sleep for a while.
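      A minimal user-space sketch of that back-pressure rule (the constants
      and names are illustrative assumptions, not the patch's actual code):
      cap the queue at roughly one second's worth of throttled bios, so the
      submitter sleeps instead of letting Writeback grow to ~900MB.

          #include <stdbool.h>
          #include <stdio.h>

          /* Illustrative numbers matching the test above. */
          #define BPS_LIMIT (10UL * 1024 * 1024)   /* cgroup limit: 10MB/s */
          #define BIO_SIZE  (128UL * 1024)         /* assumed writeback bio size */

          /* Cap the queue at about one second's worth of throttled bios. */
          static bool submitter_should_sleep(unsigned long nr_queued)
          {
              return nr_queued >= BPS_LIMIT / BIO_SIZE;
          }

          int main(void)
          {
              printf("max queued bios: %lu\n", BPS_LIMIT / BIO_SIZE);  /* 80 */
              for (unsigned long n = 0; n <= 100; n += 50)
                  printf("nr_queued=%3lu -> %s\n", n,
                         submitter_should_sleep(n) ? "sleep" : "queue bio");
              return 0;
          }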
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      0fd4aa6d
    • alinux: block-throttle: add counters for completed io · 33ed5f09
      Jiufei Xue authored
      We now have counters for wait_time and service_time, but none for
      completed ios, so the average latency cannot be measured.
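      With such a counter the averages fall out directly; a small sketch
      (the struct and field names are assumptions for illustration):

          #include <stdio.h>

          struct tg_stats {
              unsigned long long service_time_ns;  /* summed post-throttle time */
              unsigned long long wait_time_ns;     /* summed throttle-queue time */
              unsigned long long completed_ios;    /* the counter this patch adds */
          };

          static double avg_ns(unsigned long long total, unsigned long long n)
          {
              return n ? (double)total / n : 0.0;
          }

          int main(void)
          {
              struct tg_stats s = { 12000000, 5000000, 100 };
              printf("avg service %.0fns, avg wait %.0fns\n",
                     avg_ns(s.service_time_ns, s.completed_ios),
                     avg_ns(s.wait_time_ns, s.completed_ios));
              return 0;
          }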
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      33ed5f09
    • alinux: block-throttle: code cleanup · b03ba65b
      Jiufei Xue authored
      Code cleanup: the seq_show handlers for the tg counters are identical,
      so consolidate them. No functional changes.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      b03ba65b
    • alinux: blk-throttle: add throttled io/bytes counter · 766cfe98
      Joseph Qi authored
      Add 2 interfaces to stat io throttle information:
        blkio.throttle.total_io_queued
        blkio.throttle.total_bytes_queued
      
      These interfaces are used to monitor throttled io/bytes and to analyze
      whether a delay is related to io throttling.
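      A hedged read-side sketch (the cgroup mount point and the files'
      per-device "major:minor value" layout are assumptions):

          #include <stdio.h>

          /* Dump one of the new counter files. */
          static void dump(const char *path)
          {
              char line[256];
              FILE *f = fopen(path, "r");
              if (!f) { perror(path); return; }
              while (fgets(line, sizeof(line), f))
                  printf("%s: %s", path, line);
              fclose(f);
          }

          int main(void)
          {
              dump("/sys/fs/cgroup/blkio/mygroup/blkio.throttle.total_io_queued");
              dump("/sys/fs/cgroup/blkio/mygroup/blkio.throttle.total_bytes_queued");
              return 0;
          }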
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      766cfe98
    • alinux: blk-throttle: fix tg NULL pointer dereference · bc0cc360
      Joseph Qi authored
      The io throttle stats code calls blkg_get at the beginning of
      throttling and blkg_put in the newly introduced bi_tg_end_io. This
      causes the blkg to be freed if end_io is called twice, as with
      dm-thin, which saves the original end_io first, calls its own end_io,
      and then calls the saved end_io. After that, accessing the blkg is
      invalid and finally BUGs:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      Now we introduce a new bio flag, BIO_THROTL_STATED, to make sure
      blkg_get/blkg_put are called only once for the same bio.
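      A user-space sketch of the guard pattern (the flag name follows the
      commit message; the struct and helpers are simplified stand-ins for
      blkg_get/blkg_put, not the kernel definitions):

          #include <stdio.h>

          #define BIO_THROTL_STATED (1u << 0)   /* the flag this patch adds */

          struct bio { unsigned int bi_flags; int blkg_refs; };

          static void throtl_stat_start(struct bio *bio)
          {
              if (!(bio->bi_flags & BIO_THROTL_STATED)) {
                  bio->blkg_refs++;                 /* stands in for blkg_get() */
                  bio->bi_flags |= BIO_THROTL_STATED;
              }
          }

          static void bi_tg_end_io(struct bio *bio)
          {
              if (bio->bi_flags & BIO_THROTL_STATED) {
                  bio->blkg_refs--;                 /* stands in for blkg_put() */
                  bio->bi_flags &= ~BIO_THROTL_STATED;
              }
          }

          int main(void)
          {
              struct bio b = { 0, 0 };
              throtl_stat_start(&b);
              bi_tg_end_io(&b);   /* dm-thin's overwritten end_io */
              bi_tg_end_io(&b);   /* the saved end_io runs again: now a no-op */
              printf("refs=%d (balanced)\n", b.blkg_refs);
              return 0;
          }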
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      bc0cc360
    • alinux: blk-throttle: support io delay stats · dc61ad52
      Joseph Qi authored
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      get per-cgroup io delay statistics.
      io_service_time is the time spent from leaving the throttle to io
      completion, while io_wait_time is the time spent on the throttle
      queue.
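      In timeline terms (a sketch with assumed timestamp names): for a bio
      that reaches the throttle at t0, is dispatched from the throttle queue
      at t1, and completes at t2, io_wait_time accumulates t1 - t0 and
      io_service_time accumulates t2 - t1. For example, t0=0us, t1=400us,
      t2=550us contributes 400us of wait time and 150us of service time.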
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      dc61ad52
  4. 27 Dec 2019, 2 commits
  5. 26 Jul 2019, 1 commit
  6. 01 Sep 2018, 1 commit
    • blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Dennis Zhou (Facebook) authored
      There is a very small chance that a bio gets caught up in a really
      unfortunate race between a task migration, a cgroup exiting, and
      itself trying to associate with a blkg. This is due to css offlining
      being performed after the css->refcnt is killed, which triggers
      removal of blkgs that reach a blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fall
      back to using the root_blkg.
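      A single-threaded user-space model of tryget with fallback
      (illustrative only; the kernel's tryget is an atomic inc-unless-zero
      on a percpu refcount):

          #include <stdbool.h>
          #include <stdio.h>

          struct blkg { int refcnt; };

          static struct blkg root_blkg = { 1 };

          /* Succeeds only while the refcount is still live, so a blkg
           * that is being torn down is never revived. */
          static bool blkg_tryget(struct blkg *g)
          {
              if (g->refcnt <= 0)
                  return false;
              g->refcnt++;
              return true;
          }

          static struct blkg *bio_associate(struct blkg *candidate)
          {
              if (candidate && blkg_tryget(candidate))
                  return candidate;
              blkg_tryget(&root_blkg);   /* fallback */
              return &root_blkg;
          }

          int main(void)
          {
              struct blkg dying = { 0 };   /* refcnt already hit zero */
              printf("%s\n", bio_associate(&dying) == &root_blkg
                     ? "fell back to root_blkg" : "used candidate");
              return 0;
          }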
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      31118850
  7. 10 Aug 2018, 1 commit
    • Blk-throttle: reduce tail io latency when iops limit is enforced · 991f61fe
      Liu Bo authored
      When an application's iops exceeds its cgroup's iops limit, it is
      throttled and the kernel sets a timer for dispatching, so the IO
      latency includes this delay.
      
      However, the dispatch delay calculated from the limit and the elapsed
      jiffies is suboptimal. Since the dispatch delay is only calculated
      once the application's iops reaches (iops limit + 1), the bio never
      needs to wait longer than the remaining time of the current slice.
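      A user-space sketch of the capping (the slice length, the per-slice
      accounting, and all names are illustrative assumptions, not the
      patch's code):

          #include <stdio.h>

          #define SLICE_MS   100   /* assumed slice length */
          #define IOPS_LIMIT 100

          /* elapsed_ms is assumed to be within the current slice. */
          static unsigned int dispatch_wait_ms(unsigned int ios_in_slice,
                                               unsigned int elapsed_ms)
          {
              /* time at which one more io becomes allowed by the limit */
              unsigned int wait = (ios_in_slice + 1) * 1000 / IOPS_LIMIT;
              wait = wait > elapsed_ms ? wait - elapsed_ms : 0;

              /* the fix: never wait past the end of the current slice */
              unsigned int remaining = SLICE_MS - elapsed_ms;
              return wait < remaining ? wait : remaining;
          }

          int main(void)
          {
              /* 11 ios already issued in the first 10ms of a slice */
              printf("capped wait: %ums\n", dispatch_wait_ms(11, 10));
              return 0;
          }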
      
      The difference can be proved by the following fio job and cgroup iops
      setting,
      -----
      $ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
      $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
      $ cat r2.job
      [global]
      name=fio-rand-read
      filename=/dev/nullb1
      rw=randread
      bs=4k
      direct=1
      numjobs=1
      time_based=1
      runtime=60
      group_reporting=1
      
      [file1]
      size=4G
      ioengine=libaio
      iodepth=1
      rate_iops=50000
      norandommap=1
      thinktime=4ms
      -----
      
      w/o patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
          slat (usec): min=10, max=336, avg=27.71, stdev=17.82
          clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
           lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
          clat percentiles (usec):
           |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
           | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
           | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
           | 99.99th=[28967]
      
      w/ patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
          slat (usec): min=10, max=155, avg=23.24, stdev=16.79
          clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
           lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
          clat percentiles (usec):
           |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
           | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
           | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
           | 99.99th=[12387]
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      991f61fe
  8. 09 Jul 2018, 2 commits
    • block: add bi_blkg to the bio for cgroups · 08e18eab
      Josef Bacik authored
      Currently io.low uses bi_cg_private to stash its private data for the
      blkg; however, other blkcg policies may want to use this as well.
      Since we can get the private data out of the blkg, move this to
      bi_blkg in the bio and make it generic, so that bio_associate_blkg()
      can be used to attach the blkg to the bio.
      
      Theoretically we could simply replace bi_css with this, since we can
      get to all the same information from the blkg; however, you have to
      look up the blkg, so for example wbc_init_bio() would have to look up
      and possibly allocate the blkg for the css it was trying to attach to
      the bio.  This could be problematic and result in us either not
      attaching the css to the bio at all, or falling back to the root
      blkcg if we are unable to allocate the corresponding blkg.
      
      So for now do this, and in the future if possible we could just replace
      the bi_css with bi_blkg and update the helpers to do the correct
      translation.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08e18eab
    • Block: blk-throttle: set low_valid immediately once one cgroup has io.low configured · 43ada787
      Liu Bo authored
      Once one cgroup has io.low configured, @low_valid becomes true and no
      other cgroup can ever switch it back, so set it immediately and stop
      scanning, as sketched below.
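      A sketch of the early exit (the loop shape and names are
      assumptions):

          #include <stdbool.h>
          #include <stdio.h>

          struct tg { bool has_low_limit; };

          /* Once any group has io.low configured, low_valid can never go
           * back to false, so break instead of scanning the rest. */
          static bool update_limit_valid(const struct tg *grps, int n)
          {
              bool low_valid = false;
              for (int i = 0; i < n; i++) {
                  if (grps[i].has_low_limit) {
                      low_valid = true;
                      break;   /* the early exit this patch adds */
                  }
              }
              return low_valid;
          }

          int main(void)
          {
              struct tg grps[] = { { false }, { true }, { false } };
              printf("low_valid=%d\n", update_limit_valid(grps, 3));
              return 0;
          }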
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      43ada787
  9. 31 May 2018, 2 commits
  10. 09 May 2018, 2 commits
  11. 20 Jan 2018, 1 commit
  12. 19 Jan 2018, 2 commits
  13. 21 Dec 2017, 1 commit
    • block-throttle: avoid double charge · 111be883
      Shaohua Li authored
      If a bio is throttled and then split, the bio could be resubmitted and
      enter the throttling again. This causes part of the bio to be charged
      multiple times. If the cgroup has an IO limit, the double charge
      significantly harms performance. Bio splits became quite common after
      the switch to arbitrary bio sizes.
      
      To fix this, we always set the BIO_THROTTLED flag when a bio is
      throttled. If the bio is cloned/split, we copy the flag to the new
      bio too to avoid a double charge. However, a cloned bio could be
      directed to a new disk, where keeping the flag would be a problem.
      The observation is that we always set a new disk for the bio in this
      case, so we can clear the flag in bio_set_dev().
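      A user-space sketch of the flag's life cycle (the struct and helpers
      are simplified stand-ins, not the kernel definitions):

          #include <stdio.h>

          #define BIO_THROTTLED (1u << 0)

          struct bio { unsigned int bi_flags; int bi_dev; };

          static void bio_clone_flags(struct bio *dst, const struct bio *src)
          {
              /* carry the flag across clone/split so the resubmitted
               * part is not charged against the cgroup a second time */
              dst->bi_flags |= src->bi_flags & BIO_THROTTLED;
          }

          static void bio_set_dev(struct bio *bio, int dev)
          {
              /* a redirected bio targets a new disk with its own budget,
               * so the previous charge no longer applies */
              bio->bi_flags &= ~BIO_THROTTLED;
              bio->bi_dev = dev;
          }

          int main(void)
          {
              struct bio parent = { BIO_THROTTLED, 1 }, split = { 0, 1 };
              bio_clone_flags(&split, &parent);
              printf("split throttled: %u\n", split.bi_flags & BIO_THROTTLED);
              bio_set_dev(&split, 2);   /* e.g. dm remaps it */
              printf("after set_dev:   %u\n", split.bi_flags & BIO_THROTTLED);
              return 0;
          }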
      
      This issue has existed for a long time; arbitrary bio sizes just make
      it worse, so this should go into stable at least since v4.2.
      
      V1 -> V2: do not add an extra field to the bio, based on discussion
      with Tejun
      
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      111be883
  14. 22 Nov 2017, 1 commit
    • treewide: setup_timer() -> timer_setup() · e99e88a9
      Kees Cook authored
      This converts all remaining cases of the old setup_timer() API into using
      timer_setup(), where the callback argument is the structure already
      holding the struct timer_list. These should have no behavioral changes,
      since they just change which pointer is passed into the callback with
      the same available pointers after conversion. It handles the following
      examples, in addition to some other variations.
      
      Casting from unsigned long:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, ptr);
      
      and forced object casts:
      
          void my_callback(struct something *ptr)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
      
      become:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      Direct function assignments:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          ptr->my_timer.function = my_callback;
      
      have a temporary cast added, along with converting the args:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
      
      And finally, callbacks without a data assignment:
      
          void my_callback(unsigned long data)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, 0);
      
      have their argument renamed to verify they're unused during conversion:
      
          void my_callback(struct timer_list *unused)
          {
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      The conversion is done with the following Coccinelle script:
      
      spatch --very-quiet --all-includes --include-headers \
      	-I ./arch/x86/include -I ./arch/x86/include/generated \
      	-I ./include -I ./arch/x86/include/uapi \
      	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
      	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
      	--dir . \
      	--cocci-file ~/src/data/timer_setup.cocci
      
      @fix_address_of@
      expression e;
      @@
      
       setup_timer(
      -&(e)
      +&e
       , ...)
      
      // Update any raw setup_timer() usages that have a NULL callback, but
      // would otherwise match change_timer_function_usage, since the latter
      // will update all function assignments done in the face of a NULL
      // function initialization in setup_timer().
      @change_timer_function_usage_NULL@
      expression _E;
      identifier _timer;
      type _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, NULL, _E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, &_E);
      +timer_setup(&_E._timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
      +timer_setup(&_E._timer, NULL, 0);
      )
      
      @change_timer_function_usage@
      expression _E;
      identifier _timer;
      struct timer_list _stl;
      identifier _callback;
      type _cast_func, _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
       _E->_timer@_stl.function = _callback;
      |
       _E->_timer@_stl.function = &_callback;
      |
       _E->_timer@_stl.function = (_cast_func)_callback;
      |
       _E->_timer@_stl.function = (_cast_func)&_callback;
      |
       _E._timer@_stl.function = _callback;
      |
       _E._timer@_stl.function = &_callback;
      |
       _E._timer@_stl.function = (_cast_func)_callback;
      |
       _E._timer@_stl.function = (_cast_func)&_callback;
      )
      
      // callback(unsigned long arg)
      @change_callback_handle_cast
       depends on change_timer_function_usage@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      (
      	... when != _origarg
      	_handletype *_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      )
       }
      
      // callback(unsigned long arg) without existing variable
      @change_callback_handle_cast_no_arg
       depends on change_timer_function_usage &&
                           !change_callback_handle_cast@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      +	_handletype *_origarg = from_timer(_origarg, t, _timer);
      +
      	... when != _origarg
      -	(_handletype *)_origarg
      +	_origarg
      	... when != _origarg
       }
      
      // Avoid already converted callbacks.
      @match_callback_converted
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
      	    !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       { ... }
      
      // callback(struct something *handle)
      @change_callback_handle_arg
       depends on change_timer_function_usage &&
      	    !match_callback_converted &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_handletype *_handle
      +struct timer_list *t
       )
       {
      +	_handletype *_handle = from_timer(_handle, t, _timer);
      	...
       }
      
      // If change_callback_handle_arg ran on an empty function, remove
      // the added handler.
      @unchange_callback_handle_arg
       depends on change_timer_function_usage &&
      	    change_callback_handle_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       {
      -	_handletype *_handle = from_timer(_handle, t, _timer);
       }
      
      // We only want to refactor the setup_timer() data argument if we've found
      // the matching callback. This undoes changes in change_timer_function_usage.
      @unchange_timer_function_usage
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg &&
      	    !change_callback_handle_arg@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type change_timer_function_usage._cast_data;
      @@
      
      (
      -timer_setup(&_E->_timer, _callback, 0);
      +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      |
      -timer_setup(&_E._timer, _callback, 0);
      +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      )
      
      // If we fixed a callback from a .function assignment, fix the
      // assignment cast now.
      @change_timer_function_assignment
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_func;
      typedef TIMER_FUNC_TYPE;
      @@
      
      (
       _E->_timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -&_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      )
      
      // Sometimes timer functions are called directly. Replace matched args.
      @change_timer_function_calls
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression _E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_data;
      @@
      
       _callback(
      (
      -(_cast_data)_E
      +&_E->_timer
      |
      -(_cast_data)&_E
      +&_E._timer
      |
      -_E
      +&_E->_timer
      )
       )
      
      // If a timer has been configured without a data argument, it can be
      // converted without regard to the callback argument, since it is unused.
      @match_timer_function_unused_data@
      expression _E;
      identifier _timer;
      identifier _callback;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, 0);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0L);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0UL);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0L);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0UL);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0L);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0UL);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0L);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0UL);
      +timer_setup(_timer, _callback, 0);
      )
      
      @change_callback_unused_data
       depends on match_timer_function_unused_data@
      identifier match_timer_function_unused_data._callback;
      type _origtype;
      identifier _origarg;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *unused
       )
       {
      	... when != _origarg
       }
      Signed-off-by: Kees Cook <keescook@chromium.org>
      e99e88a9
  15. 02 Nov 2017, 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the
      'GPL-2.0' SPDX license identifier.  The SPDX identifier is a legally
      binding shorthand, which can be used instead of the full boilerplate
      text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be
      applied to a file was done in a spreadsheet of side-by-side results of
      the output of two independent scanners (ScanCode & Windriver)
      producing SPDX tag:value files created by Philippe Ombredanne.
      Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  16. 11 Oct 2017, 1 commit
    • blk-throttle: fix null pointer dereference while throttling writeback IOs · 53cfdc10
      Jiufei Xue authored
      A null pointer dereference can occur when a blkcg is removed manually
      with writeback IOs in flight. This is caused by the following case:
      
      The writeback kworker submits the bio and sets bio->bi_cg_private to
      the tg in blk_throtl_assoc_bio.
      Then we remove the block cgroup manually; the blkg and the tg will be
      freed if there is no request in flight.
      When the submitted bio comes back, blk_throtl_bio_endio() fetches the
      tg, which has already been freed.
      
      Fix this by increasing the refcount of the blkg in
      blk_throtl_assoc_bio() so that the blkg will not be freed until
      bio_endio is called.
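      A user-space model of pinning the blkg across the bio's lifetime (the
      structs and helpers are simplified stand-ins for the kernel objects):

          #include <stdio.h>

          struct blkg { int refcnt; };
          struct bio  { struct blkg *bi_cg_private; };

          static void blkg_get(struct blkg *g) { g->refcnt++; }
          static void blkg_put(struct blkg *g) { g->refcnt--; }

          static void blk_throtl_assoc_bio(struct bio *bio, struct blkg *g)
          {
              bio->bi_cg_private = g;
              blkg_get(g);   /* the reference this patch adds */
          }

          static void blk_throtl_bio_endio(struct bio *bio)
          {
              /* safe even if the blkcg was removed in between: the ref
               * taken at association keeps the blkg alive until here */
              blkg_put(bio->bi_cg_private);
          }

          int main(void)
          {
              struct blkg g = { 1 };
              struct bio b;
              blk_throtl_assoc_bio(&b, &g);
              /* rmdir of the blkcg here drops its ref, refcnt stays > 0 */
              blk_throtl_bio_endio(&b);
              printf("refcnt=%d\n", g.refcnt);
              return 0;
          }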
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      53cfdc10
  17. 04 Oct 2017, 1 commit
    • blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Joseph Qi authored
      There is a case which leads to an io stall, described as follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      subtest1 and subtest2 each have 32 queued bios already.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still has 16 queued bios left;
      6) tg=/, try to schedule the next dispatch, but since disptime is now
      (updated in tg_update_disptime with wait=0), the pending timer is not
      actually scheduled;
      7) in total, throtl_upgrade_state has dispatched 32 queued bios, with
      32 left over; test1 and test2 each have 16 queued bios;
      8) throtl_pending_timer_fn sees the leftover bios but can do nothing,
      because throtl_select_dispatch returns 0 and test1/test2 have no
      pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force schedule dispatch if there are pending children.
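      A sketch of the forced scheduling (the counter and names are
      assumptions; the real check presumably lives in the upgrade path):

          #include <stdbool.h>
          #include <stdio.h>

          struct sq { int nr_queued_children; };

          /* After upgrading to max, schedule another dispatch round
           * whenever children still hold queued bios, even if the
           * parent's own pending state says there is nothing to do. */
          static bool need_force_dispatch(const struct sq *parent_sq)
          {
              return parent_sq->nr_queued_children > 0;
          }

          int main(void)
          {
              struct sq root = { 32 };   /* the 16+16 leftovers from the trace */
              if (need_force_dispatch(&root))
                  printf("force schedule next dispatch\n");
              return 0;
          }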
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f02fb76
  18. 24 Aug 2017, 1 commit
    • blk-throttle: cap discard request size · ea0ea2bc
      Shaohua Li authored
      A discard request is usually very big and easily uses up the whole
      bandwidth budget of a cgroup. The discard request size doesn't really
      reflect the amount of data written, so it doesn't make sense to
      account it against the bandwidth budget. Jens pointed out that
      treating the size as 0 doesn't make sense either, because a discard
      request does have a cost, but the actual cost is not easy to find.
      This patch simply accounts each discard as one sector.
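      The accounting rule in one function (a sketch; the enum and the
      helper name are illustrative, the one-sector charge is from the
      commit):

          #include <stdio.h>

          enum req_op { REQ_OP_READ, REQ_OP_WRITE, REQ_OP_DISCARD };

          /* Bytes charged against the cgroup's bps budget: a discard is
           * charged a nominal single sector (512 bytes). */
          static unsigned long long throtl_charge_bytes(enum req_op op,
                                                        unsigned long long bytes)
          {
              return op == REQ_OP_DISCARD ? 512 : bytes;
          }

          int main(void)
          {
              printf("1GB discard charged as %llu bytes\n",
                     throtl_charge_bytes(REQ_OP_DISCARD, 1ULL << 30));
              return 0;
          }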
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ea0ea2bc
  19. 29 Jul 2017, 2 commits
    • block: use standard blktrace API to output cgroup info for debug notes · 35fe6d76
      Shaohua Li authored
      Currently cfq/bfq/blk-throttle each output cgroup info in traces in
      their own way. Now that we have a standard blktrace API for this,
      convert them to use it.
      
      Note, this changes the behavior a little bit. cgroup info isn't
      output by default; we only do this with the 'blk_cgroup' option
      enabled. cgroup info isn't output as a string by default either; we
      only do this with the 'blk_cgname' option enabled. Also, cgroup info
      is output at a different position in the note string. I think these
      behavior changes aren't a big issue (we actually make the trace data
      shorter, which is good), since the blktrace note is solely for
      debugging.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      35fe6d76
    • block: always attach cgroup info into bio · 007cc56b
      Shaohua Li authored
      blkcg_bio_issue_check() already gets the blkcg for a BIO.
      bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
      operation. There is no reason not to attach the cgroup info to the
      bio in blkcg_bio_issue_check. This also makes blktrace output correct
      cgroup info.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      007cc56b
  20. 07 Jun 2017, 2 commits
    • blk-throttle: set default latency baseline for harddisk · 6679a90c
      Shaohua Li authored
      Hard disk IO latency varies a lot depending on spindle movement. The
      latency range can be from several microseconds to several
      milliseconds, so it's pretty hard to get the baseline latency used by
      io.low.
      
      We will use a different strategy here. The idea is to use only IO
      with spindle movement to determine whether a cgroup's IO is in a good
      state. For a hard disk, if the io latency is small (< 1ms), we ignore
      the IO; such IO is likely from sequential IO and is no help in
      determining whether a cgroup's IO is impacted by other cgroups. With
      this, we only account IO with big latency, and we can then choose a
      hardcoded baseline latency for hard disks (4ms, which is a typical IO
      latency with a seek). With all these settings, the io.low latency
      works for both HD and SSD.
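      The filter and baseline in a tiny sketch (the 1ms cutoff and 4ms
      baseline are from the commit; the names are assumptions):

          #include <stdio.h>

          #define HD_IGNORE_US   1000   /* < 1ms: likely sequential, ignore */
          #define HD_BASELINE_US 4000   /* typical seek latency, hardcoded  */

          /* Returns the latency sample to account for a rotational disk,
           * or 0 if the IO should be ignored. */
          static unsigned long hd_latency_sample(unsigned long lat_us)
          {
              return lat_us < HD_IGNORE_US ? 0 : lat_us;
          }

          int main(void)
          {
              printf("200us IO -> sample %lu (ignored)\n", hd_latency_sample(200));
              printf("6ms IO   -> sample %lu, baseline %u\n",
                     hd_latency_sample(6000), HD_BASELINE_US);
              return 0;
          }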
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6679a90c
    • blk-throttle: fix NULL pointer dereference in throtl_schedule_pending_timer · a41b816c
      Joseph Qi authored
      I have encountered a NULL pointer dereference in
      throtl_schedule_pending_timer:
        [  413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
        [  413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210
        [  413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
        [  413.735713] Oops: 0000 [#1] SMP
        ......
      
      This is caused by the following case:
        blk_throtl_bio
          throtl_schedule_next_dispatch  <= sq is the top-level one, without a parent
            throtl_schedule_pending_timer
              sq_to_tg(sq)->td->throtl_slice  <= sq_to_tg(sq) returns NULL
      
      Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always
      return a valid td.
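      A user-space model of why one helper crashes and the other doesn't
      (the real helpers use container_of; these structs are simplified
      illustrations):

          #include <stddef.h>
          #include <stdio.h>

          struct throtl_data { unsigned int throtl_slice; };
          struct throtl_grp  { struct throtl_data *td; };

          struct service_queue {
              struct throtl_grp  *parent_tg;   /* NULL for the top-level sq */
              struct throtl_data *td;
          };

          static struct throtl_grp *sq_to_tg(struct service_queue *sq)
          {
              return sq->parent_tg;            /* NULL at the top level */
          }

          static struct throtl_data *sq_to_td(struct service_queue *sq)
          {
              struct throtl_grp *tg = sq_to_tg(sq);
              return tg ? tg->td : sq->td;     /* always a valid td */
          }

          int main(void)
          {
              struct throtl_data td = { 100 };
              struct service_queue top = { NULL, &td };
              /* sq_to_tg(&top)->td->throtl_slice would dereference NULL: */
              printf("slice=%u\n", sq_to_td(&top)->throtl_slice);
              return 0;
          }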
      
      Fixes: 297e3d85 ("blk-throttle: make throtl_slice tunable")
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a41b816c
  21. 23 May 2017, 4 commits
  22. 20 Apr 2017, 1 commit
  23. 28 Mar 2017, 3 commits
    • blk-throttle: add latency target support · 53696b8d
      Shaohua Li authored
      One hard problem in adding the .low limit is detecting an idle
      cgroup. If one cgroup doesn't dispatch enough IO against its low
      limit, we must have a mechanism to determine whether other cgroups
      may dispatch more IO. We added the think time detection mechanism
      before, but it doesn't work for all workloads. Here we add a
      latency-based approach.
      
      We already have a mechanism to calculate the latency threshold for
      each IO size. For every IO dispatched from a cgroup, we compare its
      latency against its threshold and record the info. If most IO latency
      is below the threshold (75% in the code), the cgroup can be treated
      as idle and other cgroups can dispatch more IO.
      
      Currently this latency target check is only for SSDs, as we can't
      calculate the latency target for hard disks, and it only applies to
      cgroup leaf nodes so far.
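      The idle test in sketch form (the 75% ratio is from the commit; the
      counters and names are assumptions):

          #include <stdbool.h>
          #include <stdio.h>

          struct lat_stat { unsigned long below_threshold; unsigned long total; };

          /* A cgroup whose IOs mostly (>= 75%) complete under their
           * size-specific threshold is treated as meeting its target. */
          static bool tg_meets_latency_target(const struct lat_stat *s)
          {
              if (!s->total)
                  return true;    /* no samples: assume it is fine */
              return s->below_threshold * 100 >= s->total * 75;
          }

          int main(void)
          {
              struct lat_stat s = { 80, 100 };
              printf("meets target: %d\n", tg_meets_latency_target(&s));
              return 0;
          }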
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      53696b8d
    • blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li authored
      The user configures a latency target, but the latency threshold for
      each request size isn't fixed. For an SSD, IO latency highly depends
      on the request size. To calculate the latency threshold, we sample
      some data, e.g. the average latency for request sizes 4k, 8k, 16k,
      32k .. 1M. The latency threshold of each request size is the sampled
      latency (I'll call it the base latency) plus the latency target. For
      example, if the base latency for request size 4k is 80us and the user
      configures a latency target of 60us, the 4k latency threshold will be
      80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of the rounded-up IO
      sectors. If the IO size is bigger than 1M, it is accounted as 1M.
      Since the calculation rounds up, the base latency will be slightly
      smaller than the actual value. Also, if there hasn't been any IO
      dispatched for a specific IO size, we use the base latency of the
      next smaller IO size for it.
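      The bucketing rule as a sketch (the order-base-2-of-rounded-up-sectors
      computation follows the text above; the helper itself is
      illustrative):

          #include <stdio.h>

          #define SECTOR_SHIFT 9
          #define MAX_ORDER    11   /* 2^11 sectors = 1M */

          static unsigned int latency_bucket(unsigned long long bytes)
          {
              unsigned long long sectors = (bytes + 511) >> SECTOR_SHIFT;
              unsigned int order = 0;
              while ((1ULL << order) < sectors)   /* round up to 2^order */
                  order++;
              return order > MAX_ORDER ? MAX_ORDER : order;
          }

          int main(void)
          {
              printf("4k  -> bucket %u\n", latency_bucket(4096));    /* 3  */
              printf("12k -> bucket %u\n", latency_bucket(12288));   /* rounds up to 16k: 5 */
              printf("4M  -> bucket %u\n", latency_bucket(4 << 20)); /* capped: 11 */
              return 0;
          }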
      
      But we shouldn't sample data at just any time. The base latency is
      supposed to be the latency when the disk isn't congested, because we
      use the latency threshold to schedule IOs between cgroups; if the
      disk is congested, the latency is higher and using it for scheduling
      is meaningless. Hence we only do the sampling when block throttling
      is in the LOW limit, on the assumption that the disk isn't congested
      in that state. If the assumption isn't true, e.g. the low limit is
      too high, the calculated latency threshold will be higher too.
      
      Hard disks are completely different: latency depends on spindle seek
      instead of request size. Currently this feature is SSD-only; we could
      probably use a fixed threshold like 4ms for hard disks though.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b9147dd1
    • blk-throttle: add interface for per-cgroup target latency · ec80991d
      Shaohua Li authored
      Here we introduce a per-cgroup latency target. The target determines
      how much latency increase a cgroup can afford. We will use the target
      latency to calculate a threshold and use the threshold to schedule IO
      for cgroups. If a cgroup's bandwidth is below its low limit but its
      average latency is below the threshold, other cgroups can safely
      dispatch more IO, even if their bandwidth is higher than their low
      limits. On the other hand, if the first cgroup's latency is higher
      than the threshold, other cgroups are throttled to their low limits.
      So the target latency determines how we efficiently utilize free disk
      resources without sacrificing the workload's IO latency.
      
      For example, assume the 4k IO average latency is 50us when the disk
      isn't congested, and a cgroup sets the target latency to 30us. Then
      the cgroup can accept up to 50 + 30 = 80us IO latency. If the
      cgroup's average IO latency is 90us and its bandwidth is below its
      low limit, other cgroups are throttled to their low limits. If the
      cgroup's average IO latency is 60us, other cgroups are allowed to
      dispatch more IO. When other cgroups dispatch more IO, the first
      cgroup's IO latency will increase; if it increases to 81us, we then
      throttle the other cgroups.
      
      The user configures the interface in this way:
      echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low
      
      latency is in microseconds
      
      By default the latency target is 0, which means IO latency is
      guaranteed.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ec80991d