1. 18 2月, 2022 2 次提交
    • Y
      block, bfq: don't move oom_bfqq · 8410f709
      Yu Kuai 提交于
      Our test report a UAF:
      
      [ 2073.019181] ==================================================================
      [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
      [ 2073.019192]
      [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
      [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [ 2073.019200] Call trace:
      [ 2073.019203]  dump_backtrace+0x0/0x310
      [ 2073.019206]  show_stack+0x28/0x38
      [ 2073.019210]  dump_stack+0xec/0x15c
      [ 2073.019216]  print_address_description+0x68/0x2d0
      [ 2073.019220]  kasan_report+0x238/0x2f0
      [ 2073.019224]  __asan_store8+0x88/0xb0
      [ 2073.019229]  __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019233]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019236]  bfq_pd_offline+0x178/0x238
      [ 2073.019240]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019244]  bfq_exit_queue+0x128/0x178
      [ 2073.019249]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019252]  elevator_exit+0xc8/0xd0
      [ 2073.019256]  blk_exit_queue+0x50/0x88
      [ 2073.019259]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019267]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019274]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019278]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019282]  el0_svc_common+0xc8/0x320
      [ 2073.019287]  el0_svc_handler+0xf8/0x160
      [ 2073.019290]  el0_svc+0x10/0x218
      [ 2073.019291]
      [ 2073.019294] Allocated by task 14163:
      [ 2073.019301]  kasan_kmalloc+0xe0/0x190
      [ 2073.019305]  kmem_cache_alloc_node_trace+0x1cc/0x418
      [ 2073.019308]  bfq_pd_alloc+0x54/0x118
      [ 2073.019313]  blkcg_activate_policy+0x250/0x460
      [ 2073.019317]  bfq_create_group_hierarchy+0x38/0x110
      [ 2073.019321]  bfq_init_queue+0x6d0/0x948
      [ 2073.019325]  blk_mq_init_sched+0x1d8/0x390
      [ 2073.019330]  elevator_switch_mq+0x88/0x170
      [ 2073.019334]  elevator_switch+0x140/0x270
      [ 2073.019338]  elv_iosched_store+0x1a4/0x2a0
      [ 2073.019342]  queue_attr_store+0x90/0xe0
      [ 2073.019348]  sysfs_kf_write+0xa8/0xe8
      [ 2073.019351]  kernfs_fop_write+0x1f8/0x378
      [ 2073.019359]  __vfs_write+0xe0/0x360
      [ 2073.019363]  vfs_write+0xf0/0x270
      [ 2073.019367]  ksys_write+0xdc/0x1b8
      [ 2073.019371]  __arm64_sys_write+0x50/0x60
      [ 2073.019375]  el0_svc_common+0xc8/0x320
      [ 2073.019380]  el0_svc_handler+0xf8/0x160
      [ 2073.019383]  el0_svc+0x10/0x218
      [ 2073.019385]
      [ 2073.019387] Freed by task 72584:
      [ 2073.019391]  __kasan_slab_free+0x120/0x228
      [ 2073.019394]  kasan_slab_free+0x10/0x18
      [ 2073.019397]  kfree+0x94/0x368
      [ 2073.019400]  bfqg_put+0x64/0xb0
      [ 2073.019404]  bfqg_and_blkg_put+0x90/0xb0
      [ 2073.019408]  bfq_put_queue+0x220/0x228
      [ 2073.019413]  __bfq_put_async_bfqq+0x98/0x168
      [ 2073.019416]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019420]  bfq_pd_offline+0x178/0x238
      [ 2073.019424]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019429]  bfq_exit_queue+0x128/0x178
      [ 2073.019433]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019437]  elevator_exit+0xc8/0xd0
      [ 2073.019440]  blk_exit_queue+0x50/0x88
      [ 2073.019443]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019451]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019459]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019462]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019467]  el0_svc_common+0xc8/0x320
      [ 2073.019471]  el0_svc_handler+0xf8/0x160
      [ 2073.019474]  el0_svc+0x10/0x218
      [ 2073.019475]
      [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
       which belongs to the cache kmalloc-1024 of size 1024
      [ 2073.019484] The buggy address is located 552 bytes inside of
       1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
      [ 2073.019486] The buggy address belongs to the page:
      [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
      [ 2073.020123] flags: 0x7ffff0000008100(slab|head)
      [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
      [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
      [ 2073.020411] page dumped because: kasan: bad access detected
      [ 2073.020412]
      [ 2073.020414] Memory state around the buggy address:
      [ 2073.020420]  ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020424]  ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020430]                                   ^
      [ 2073.020434]  ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020438]  ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020439] ==================================================================
      
      The same problem exist in mainline as well.
      
      This is because oom_bfqq is moved to a non-root group, thus root_group
      is freed earlier.
      
      Thus fix the problem by don't move oom_bfqq.
      Signed-off-by: NYu Kuai <yukuai3@huawei.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NPaolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
      8410f709
    • Y
      block, bfq: avoid moving bfqq to it's parent bfqg · c5e4cb0f
      Yu Kuai 提交于
      Moving bfqq to it's parent bfqg is pointless.
      Signed-off-by: NYu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
      c5e4cb0f
  2. 20 10月, 2021 1 次提交
  3. 18 10月, 2021 1 次提交
  4. 17 10月, 2021 1 次提交
  5. 26 3月, 2021 1 次提交
    • P
      block, bfq: merge bursts of newly-created queues · 430a67f9
      Paolo Valente 提交于
      Many throughput-sensitive workloads are made of several parallel I/O
      flows, with all flows generated by the same application, or more
      generically by the same task (e.g., system boot). The most
      counterproductive action with these workloads is plugging I/O dispatch
      when one of the bfq_queues associated with these flows remains
      temporarily empty.
      
      To avoid this plugging, BFQ has been using a burst-handling mechanism
      for years now. This mechanism has proven effective for throughput, and
      not detrimental for service guarantees. This commit pushes this
      mechanism a little bit further, basing on the following two facts.
      
      First, all the I/O flows of a the same application or task contribute
      to the execution/completion of that common application or task. So the
      performance figures that matter are total throughput of the flows and
      task-wide I/O latency.  In particular, these flows do not need to be
      protected from each other, in terms of individual bandwidth or
      latency.
      
      Second, the above fact holds regardless of the number of flows.
      
      Putting these two facts together, this commits merges stably the
      bfq_queues associated with these I/O flows, i.e., with the processes
      that generate these IO/ flows, regardless of how many the involved
      processes are.
      
      To decide whether a set of bfq_queues is actually associated with the
      I/O flows of a common application or task, and to merge these queues
      stably, this commit operates as follows: given a bfq_queue, say Q2,
      currently being created, and the last bfq_queue, say Q1, created
      before Q2, Q2 is merged stably with Q1 if
      - very little time has elapsed since when Q1 was created
      - Q2 has the same ioprio as Q1
      - Q2 belongs to the same group as Q1
      
      Merging bfq_queues also reduces scheduling overhead. A fio test with
      ten random readers on /dev/nullb shows a throughput boost of 40%, with
      a quadcore. Since BFQ's execution time amounts to ~50% of the total
      per-request processing time, the above throughput boost implies that
      BFQ's overhead is reduced by more than 50%.
      Tested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>
      430a67f9
  6. 18 8月, 2020 1 次提交
    • D
      bfq: fix blkio cgroup leakage v4 · 2de791ab
      Dmitry Monakhov 提交于
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduce leak forbfq_group and blkcg_gq objects because of get/put
      imbalance.
      In fact whole idea of original commit is wrong because bfq_group entity
      can not dissapear under us because it is referenced by child bfq_queue's
      entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So parent entity can not disappear while child entity is in tree,
      and child entities already has proper protection.
      This patch revert commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because (Note2)
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq  will leak forever, see test-case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NDmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2de791ab
  7. 22 3月, 2020 4 次提交
    • P
      block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline · 4d38a87f
      Paolo Valente 提交于
      In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
      flush the rb tree that contains all idle entities belonging to the pd
      (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
      invoked before bfq_reparent_active_queues(). Yet the latter may happen
      to add some entities to the idle tree. It happens if, in some of the
      calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
      the queue to move is empty and gets expired.
      
      This commit simply reverses the invocation order between
      bfq_flush_idle_tree() and bfq_reparent_active_queues().
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4d38a87f
    • P
      block, bfq: make reparent_leaf_entity actually work only on leaf entities · 576682fa
      Paolo Valente 提交于
      bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
      entity represents just a bfq_queue in an entity tree). Yet, the input
      entity is guaranteed to always be a leaf entity only in two-level
      entity trees. In this respect, because of the error fixed by
      commit 14afc593 ("block, bfq: fix overwrite of bfq_group pointer
      in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
      to actually have only two levels. After the latter commit, this does not
      hold any longer.
      
      This commit fixes this problem by modifying
      bfq_reparent_leaf_entity(), so that it searches an active leaf entity
      down the path that stems from the input entity. Such a leaf entity is
      guaranteed to exist when bfq_reparent_leaf_entity() is invoked.
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      576682fa
    • P
      block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup · c8997736
      Paolo Valente 提交于
      A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
      goal of this put is to release a process reference to a bfq_queue. But
      process-reference releases may trigger also some extra operation, and,
      to this goal, are handled through bfq_release_process_ref(). So, turn
      the invocation of bfq_put_queue() into an invocation of
      bfq_release_process_ref().
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c8997736
    • P
      block, bfq: move forward the getting of an extra ref in bfq_bfqq_move · fd1bb3ae
      Paolo Valente 提交于
      Commit ecedd3d7 ("block, bfq: get extra ref to prevent a queue
      from being freed during a group move") gets an extra reference to a
      bfq_queue before possibly deactivating it (temporarily), in
      bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
      being reactivated in its new group.
      
      Yet, the bfq_queue may also be expired (i.e., its service may be
      stopped) before the bfq_queue is deactivated. And also an expiration
      may lead to a premature freeing. This commit fixes this issue by
      simply moving forward the getting of the extra reference already
      introduced by commit ecedd3d7 ("block, bfq: get extra ref to
      prevent a queue from being freed during a group move").
      
      Reported-by: cki-project@redhat.com
      Tested-by: cki-project@redhat.com
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fd1bb3ae
  8. 06 3月, 2020 1 次提交
    • C
      block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() · 14afc593
      Carlo Nonato 提交于
      The bfq_find_set_group() function takes as input a blkcg (which represents
      a cgroup) and retrieves the corresponding bfq_group, then it updates the
      bfq internal group hierarchy (see comments inside the function for why
      this is needed) and finally it returns the bfq_group.
      In the hierarchy update cycle, the pointer holding the correct bfq_group
      that has to be returned is mistakenly used to traverse the hierarchy
      bottom to top, meaning that in each iteration it gets overwritten with the
      parent of the current group. Since the update cycle stops at root's
      children (depth = 2), the overwrite becomes a problem only if the blkcg
      describes a cgroup at a hierarchy level deeper than that (depth > 2). In
      this case the root's child that happens to be also an ancestor of the
      correct bfq_group is returned. The main consequence is that processes
      contained in a cgroup at depth greater than 2 are wrongly placed in the
      group described above by BFQ.
      
      This commits fixes this problem by using a different bfq_group pointer in
      the update cycle in order to avoid the overwrite of the variable holding
      the original group reference.
      Reported-by: NKwon Je Oh <kwonje.oh2@gmail.com>
      Signed-off-by: NCarlo Nonato <carlo.nonato95@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      14afc593
  9. 03 2月, 2020 4 次提交
  10. 05 12月, 2019 1 次提交
    • H
      bfq-iosched: Ensure bio->bi_blkg is valid before using it · 08802ed6
      Hou Tao 提交于
      bio->bi_blkg will be NULL when the issue of the request
      has bypassed the block layer as shown in the following oops:
      
       Internal error: Oops: 96000005 [#1] SMP
       CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4
       Call trace:
        percpu_counter_add_batch+0x38/0x4c8
        bfqg_stats_update_legacy_io+0x9c/0x280
        bfq_insert_requests+0xbac/0x2190
        blk_mq_sched_insert_request+0x288/0x670
        blk_execute_rq_nowait+0x140/0x178
        blk_execute_rq+0x8c/0x140
        sg_io+0x604/0x9c0
        scsi_cmd_ioctl+0xe38/0x10a8
        scsi_cmd_blk_ioctl+0xac/0xe8
        sd_ioctl+0xe4/0x238
        blkdev_ioctl+0x590/0x20e0
        block_ioctl+0x60/0x98
        do_vfs_ioctl+0xe0/0x1b58
        ksys_ioctl+0x80/0xd8
        __arm64_sys_ioctl+0x40/0x78
        el0_svc_handler+0xc4/0x270
      
      so ensure its validity before using it.
      
      Fixes: fd41e603 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
      Signed-off-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      08802ed6
  11. 08 11月, 2019 2 次提交
    • T
      bfq-iosched: stop using blkg->stat_bytes and ->stat_ios · fd41e603
      Tejun Heo 提交于
      When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
      from blk-cgroup core to populate six stat knobs.  blk-cgroup core is
      moving away from blkg_rwstat to improve scalability and won't be able
      to support this usage.
      
      It isn't like the sharing gains all that much.  Let's break it out to
      dedicated rwstat counters which are updated when on cgroup1.  This
      makes use of bfqg_*rwstat*() helpers outside of
      CONFIG_BFQ_CGROUP_DEBUG.  Move them out.
      
      v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fd41e603
    • T
      bfq-iosched: relocate bfqg_*rwstat*() helpers · a557f1c7
      Tejun Heo 提交于
      Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG.  The next
      patch will use them from !DEBUG path and this makes it easy to move
      them out of the ifdef block.
      
      This is pure code reorganization.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a557f1c7
  12. 07 9月, 2019 2 次提交
  13. 29 8月, 2019 1 次提交
  14. 21 6月, 2019 5 次提交
  15. 10 6月, 2019 1 次提交
  16. 07 6月, 2019 1 次提交
    • A
      block, bfq: add weight symlink to the bfq.weight cgroup parameter · 19e9da9e
      Angelo Ruocco 提交于
      Many userspace tools and services use the proportional-share policy of
      the blkio/io cgroups controller. The CFQ I/O scheduler implemented
      this policy for the legacy block layer. To modify the weight of a
      group in case CFQ was in charge, the 'weight' parameter of the group
      must be modified. On the other hand, the BFQ I/O scheduler implements
      the same policy in blk-mq, but, with BFQ, the parameter to modify has
      a different name: bfq.weight (forced choice until legacy block was
      present, because two different policies cannot share a common parameter
      in cgroups).
      
      Due to CFQ legacy, most if not all userspace configurations still use
      the parameter 'weight', and for the moment do not seem likely to be
      changed. But, when CFQ went away with legacy block, such a parameter
      ceased to exist.
      
      So, a simple workaround has been proposed [1] to make all
      configurations work: add a symlink, named weight, to bfq.weight. This
      commit adds such a symlink.
      
      [1] https://lkml.org/lkml/2019/4/8/555Suggested-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      19e9da9e
  17. 01 5月, 2019 1 次提交
  18. 09 4月, 2019 1 次提交
  19. 01 4月, 2019 1 次提交
    • P
      block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab
      Paolo Valente 提交于
      To boost throughput with a set of processes doing interleaved I/O
      (i.e., a set of processes whose individual I/O is random, but whose
      merged cumulative I/O is sequential), BFQ merges the queues associated
      with these processes, i.e., redirects the I/O of these processes into a
      common, shared queue. In the shared queue, I/O requests are ordered by
      their position on the medium, thus sequential I/O gets dispatched to
      the device when the shared queue is served.
      
      Queue merging costs execution time, because, to detect which queues to
      merge, BFQ must maintain a list of the head I/O requests of active
      queues, ordered by request positions. Measurements showed that this
      costs about 10% of BFQ's total per-request processing time.
      
      Request processing time becomes more and more critical as the speed of
      the underlying storage device grows. Yet, fortunately, queue merging
      is basically useless on the very devices that are so fast to make
      request processing time critical. To reach a high throughput, these
      devices must have many requests queued at the same time. But, in this
      configuration, the internal scheduling algorithms of these devices do
      also the job of queue merging: they reorder requests so as to obtain
      as much as possible a sequential I/O pattern. As a consequence, with
      processes doing interleaved I/O, the throughput reached by one such
      device is likely to be the same, with and without queue merging.
      
      In view of this fact, this commit disables queue merging, and all
      related housekeeping, for non-rotational devices with internal
      queueing. The total, single-lock-protected, per-request processing
      time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
      (time measured with simple code instrumentation, and using the
      throughput-sync.sh script of the S suite [1], in performance-profiling
      mode). To put this result into context, the total,
      single-lock-protected, per-request execution time of the lightest I/O
      scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
      ~800 LOC, against ~10500 LOC for BFQ).
      
      Disabling merging provides a further, remarkable benefit in terms of
      throughput. Merging tends to make many workloads artificially more
      uneven, mainly because of shared queues remaining non empty for
      incomparably more time than normal queues. So, if, e.g., one of the
      queues in a set of merged queues has a higher weight than a normal
      queue, then the shared queue may inherit such a high weight and, by
      staying almost always active, may force BFQ to perform I/O plugging
      most of the time. This evidently makes it harder for BFQ to let the
      device reach a high throughput.
      
      As a practical example of this problem, and of the benefits of this
      commit, we measured again the throughput in the nasty scenario
      considered in previous commit messages: dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes. With
      this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
      PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
      of the other I/O schedulers. As such, this is also likely to be the
      maximum possible throughput reachable with this workload on this
      device, because I/O is mostly random, and the other schedulers
      basically just pass I/O requests to the drive as fast as possible.
      
      [1] https://github.com/Algodev-github/STested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NAlessio Masola <alessio.masola@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cacc5ab
  20. 08 12月, 2018 1 次提交
    • D
      blkcg: fix ref count issue with bio_blkcg() using task_css · 0fe061b9
      Dennis Zhou 提交于
      The bio_blkcg() function turns out to be inconsistent and consequently
      dangerous to use. The first part returns a blkcg where a reference is
      owned by the bio meaning it does not need to be rcu protected. However,
      the third case, the last line, is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      This can race against task migration and the cgroup dying. It is also
      semantically different as it must be called rcu protected and is
      susceptible to failure when trying to get a reference to it.
      
      This patch adds association ahead of calling bio_blkcg() rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg(). In blk-iolatency, association is
      moved above the bio_blkcg() call to ensure it will not return %NULL.
      
      BFQ uses the old bio_blkcg() function, but I do not want to address it
      in this series due to the complexity. I have created a private version
      documenting the inconsistency and noting not to use it.
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0fe061b9
  21. 16 11月, 2018 1 次提交
  22. 02 11月, 2018 1 次提交
  23. 22 9月, 2018 1 次提交
    • D
      blkcg: fix ref count issue with bio_blkcg using task_css · 27e6fa99
      Dennis Zhou (Facebook) 提交于
      The accessor function bio_blkcg either returns the blkcg associated with
      the bio or finds one in the current context. This can cause an issue
      when trying to associate a bio with a blkcg. Particularly, it's the
      third case that is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      As the above may race against task migration and the cgroup exiting, it
      is not always ok to take a reference on the blkcg returned from
      bio_blkcg.
      
      This patch adds association ahead of calling bio_blkcg rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg. blk_get_rl is modified as well to get
      a reference to the blkcg it may use and blk_put_rl will always put the
      reference back. Association is also moved above the bio_blkcg call to
      ensure it will not return NULL in blk-iolatency.
      
      BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
      to address this in this series. I've created a private version of the
      function with notes not to use it describing the flaw. Hopefully soon,
      that code can be cleaned up.
      Signed-off-by: NDennis Zhou <dennisszhou@gmail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27e6fa99
  24. 07 9月, 2018 1 次提交
  25. 17 8月, 2018 1 次提交
  26. 09 5月, 2018 1 次提交
  27. 09 1月, 2018 1 次提交