1. 19 Nov 2021, 1 commit
  2. 25 Oct 2021, 1 commit
    • blk-cgroup: synchronize blkg creation against policy deactivation · 0c9d338c
      Authored by Yu Kuai
      Our test reports a null pointer dereference:
      
      [  168.534653] ==================================================================
      [  168.535614] Disabling lock debugging due to kernel taint
      [  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  168.537274] #PF: supervisor read access in kernel mode
      [  168.537964] #PF: error_code(0x0000) - not-present page
      [  168.538667] PGD 0 P4D 0
      [  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
      [  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
      [  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
      [  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
      [  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
      [  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
      [  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
      [  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
      [  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
      [  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
      [  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
      [  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
      [  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
      [  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  168.555888] Call Trace:
      [  168.556221]  <TASK>
      [  168.556510]  blkg_create+0x1c0/0x8c0
      [  168.556989]  blkg_conf_prep+0x574/0x650
      [  168.557502]  ? stack_trace_save+0x99/0xd0
      [  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
      [  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
      [  168.559231]  ? kasan_set_track+0x29/0x40
      [  168.559758]  ? kasan_set_free_info+0x30/0x60
      [  168.560344]  ? tg_set_limit+0xae0/0xae0
      [  168.560853]  ? do_sys_openat2+0x33b/0x640
      [  168.561383]  ? do_sys_open+0xa2/0x100
      [  168.561877]  ? __x64_sys_open+0x4e/0x60
      [  168.562383]  ? __kasan_check_write+0x20/0x30
      [  168.562951]  ? copyin+0x48/0x70
      [  168.563390]  ? _copy_from_iter+0x234/0x9e0
      [  168.563948]  tg_set_conf_u64+0x17/0x20
      [  168.564467]  cgroup_file_write+0x1ad/0x380
      [  168.565014]  ? cgroup_file_poll+0x80/0x80
      [  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
      [  168.566165]  ? pgd_free+0x100/0x160
      [  168.566649]  kernfs_fop_write_iter+0x21d/0x340
      [  168.567246]  ? cgroup_file_poll+0x80/0x80
      [  168.567796]  new_sync_write+0x29f/0x3c0
      [  168.568314]  ? new_sync_read+0x410/0x410
      [  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
      [  168.569425]  ? copy_page_range+0x2b10/0x2b10
      [  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
      [  168.570622]  vfs_write+0x46e/0x630
      [  168.571091]  ksys_write+0xcd/0x1e0
      [  168.571563]  ? __x64_sys_read+0x60/0x60
      [  168.572081]  ? __kasan_check_write+0x20/0x30
      [  168.572659]  ? do_user_addr_fault+0x446/0xff0
      [  168.573264]  __x64_sys_write+0x46/0x60
      [  168.573774]  do_syscall_64+0x35/0x80
      [  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  168.574960] RIP: 0033:0x7fac74915130
      [  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
      [  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
      [  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
      [  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
      [  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
      [  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
      [  168.583757]  </TASK>
      [  168.584063] Modules linked in:
      [  168.584494] CR2: 0000000000000008
      [  168.584964] ---[ end trace 2475611ad0f77a1a ]---
      
      This happens because blkg_alloc() is called from blkg_conf_prep() without
      holding 'q->queue_lock', so the elevator can be exited before blkg_create()
      runs:
      
      thread 1                            thread 2
      blkg_conf_prep
       spin_lock_irq(&q->queue_lock);
       blkg_lookup_check -> return NULL
       spin_unlock_irq(&q->queue_lock);
      
       blkg_alloc
        blkcg_policy_enabled -> true
        pd = ->pd_alloc_fn
        blkg->pd[i] = pd
                                         blk_mq_exit_sched
                                          bfq_exit_queue
                                           blkcg_deactivate_policy
                                            spin_lock_irq(&q->queue_lock);
                                            __clear_bit(pol->plid, q->blkcg_pols);
                                            spin_unlock_irq(&q->queue_lock);
                                          q->elevator = NULL;
        spin_lock_irq(&q->queue_lock);
         blkg_create
          if (blkg->pd[i])
           ->pd_init_fn -> q->elevator is NULL
        spin_unlock_irq(&q->queue_lock);
      
      Because blkcg_deactivate_policy() requires the queue to be frozen, we can
      grab q_usage_counter to synchronize blkg_conf_prep() against
      blkcg_deactivate_policy().
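      
      A minimal sketch of that idea (the shape below is assumed from the
      description above, not the verbatim upstream diff; blk_queue_enter()
      bumps q_usage_counter and fails if the queue is dying):
      
      	static int blkg_conf_prep_sketch(struct request_queue *q)
      	{
      		int ret;
      
      		/* hold a q_usage_counter reference for the whole window */
      		ret = blk_queue_enter(q, 0);
      		if (ret)
      			return ret;
      
      		/*
      		 * ... look up / allocate / create the blkg under
      		 * q->queue_lock here; blkcg_deactivate_policy() freezes
      		 * the queue first, so it cannot run while the reference
      		 * is held ...
      		 */
      
      		blk_queue_exit(q);
      		return 0;
      	}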
      
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 18 Oct 2021, 4 commits
  4. 16 Sep 2021, 2 commits
    • blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pd · 858560b2
      Authored by Li Jinlin
      KASAN reports a use-after-free when doing fuzz testing:
      
      [693354.104835] ==================================================================
      [693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160
      [693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338
      
      [693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147
      [693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018
      [693354.105612] Call Trace:
      [693354.105621]  dump_stack+0xf1/0x19b
      [693354.105626]  ? show_regs_print_info+0x5/0x5
      [693354.105634]  ? printk+0x9c/0xc3
      [693354.105638]  ? cpumask_weight+0x1f/0x1f
      [693354.105648]  print_address_description+0x70/0x360
      [693354.105654]  kasan_report+0x1b2/0x330
      [693354.105659]  ? bfq_io_set_weight_legacy+0xd3/0x160
      [693354.105665]  ? bfq_io_set_weight_legacy+0xd3/0x160
      [693354.105670]  bfq_io_set_weight_legacy+0xd3/0x160
      [693354.105675]  ? bfq_cpd_init+0x20/0x20
      [693354.105683]  cgroup_file_write+0x3aa/0x510
      [693354.105693]  ? ___slab_alloc+0x507/0x540
      [693354.105698]  ? cgroup_file_poll+0x60/0x60
      [693354.105702]  ? 0xffffffff89600000
      [693354.105708]  ? usercopy_abort+0x90/0x90
      [693354.105716]  ? mutex_lock+0xef/0x180
      [693354.105726]  kernfs_fop_write+0x1ab/0x280
      [693354.105732]  ? cgroup_file_poll+0x60/0x60
      [693354.105738]  vfs_write+0xe7/0x230
      [693354.105744]  ksys_write+0xb0/0x140
      [693354.105749]  ? __ia32_sys_read+0x50/0x50
      [693354.105760]  do_syscall_64+0x112/0x370
      [693354.105766]  ? syscall_return_slowpath+0x260/0x260
      [693354.105772]  ? do_page_fault+0x9b/0x270
      [693354.105779]  ? prepare_exit_to_usermode+0xf9/0x1a0
      [693354.105784]  ? enter_from_user_mode+0x30/0x30
      [693354.105793]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      [693354.105875] Allocated by task 1453337:
      [693354.106001]  kasan_kmalloc+0xa0/0xd0
      [693354.106006]  kmem_cache_alloc_node_trace+0x108/0x220
      [693354.106010]  bfq_pd_alloc+0x96/0x120
      [693354.106015]  blkcg_activate_policy+0x1b7/0x2b0
      [693354.106020]  bfq_create_group_hierarchy+0x1e/0x80
      [693354.106026]  bfq_init_queue+0x678/0x8c0
      [693354.106031]  blk_mq_init_sched+0x1f8/0x460
      [693354.106037]  elevator_switch_mq+0xe1/0x240
      [693354.106041]  elevator_switch+0x25/0x40
      [693354.106045]  elv_iosched_store+0x1a1/0x230
      [693354.106049]  queue_attr_store+0x78/0xb0
      [693354.106053]  kernfs_fop_write+0x1ab/0x280
      [693354.106056]  vfs_write+0xe7/0x230
      [693354.106060]  ksys_write+0xb0/0x140
      [693354.106064]  do_syscall_64+0x112/0x370
      [693354.106069]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      [693354.106114] Freed by task 1453336:
      [693354.106225]  __kasan_slab_free+0x130/0x180
      [693354.106229]  kfree+0x90/0x1b0
      [693354.106233]  blkcg_deactivate_policy+0x12c/0x220
      [693354.106238]  bfq_exit_queue+0xf5/0x110
      [693354.106241]  blk_mq_exit_sched+0x104/0x130
      [693354.106245]  __elevator_exit+0x45/0x60
      [693354.106249]  elevator_switch_mq+0xd6/0x240
      [693354.106253]  elevator_switch+0x25/0x40
      [693354.106257]  elv_iosched_store+0x1a1/0x230
      [693354.106261]  queue_attr_store+0x78/0xb0
      [693354.106264]  kernfs_fop_write+0x1ab/0x280
      [693354.106268]  vfs_write+0xe7/0x230
      [693354.106271]  ksys_write+0xb0/0x140
      [693354.106275]  do_syscall_64+0x112/0x370
      [693354.106280]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      [693354.106329] The buggy address belongs to the object at ffff888be0a35580
                       which belongs to the cache kmalloc-1k of size 1024
      [693354.106736] The buggy address is located 228 bytes inside of
                       1024-byte region [ffff888be0a35580, ffff888be0a35980)
      [693354.107114] The buggy address belongs to the page:
      [693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0
      [693354.107606] flags: 0x17ffffc0008100(slab|head)
      [693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080
      [693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
      [693354.108278] page dumped because: kasan: bad access detected
      
      [693354.108511] Memory state around the buggy address:
      [693354.108671]  ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [693354.116396]  ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [693354.132421]                                                        ^
      [693354.140284]  ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [693354.147912]  ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [693354.155281] ==================================================================
      
      blkgs are protected by both the queue and blkcg locks, and holding
      either should stabilize them. However, the path that destroys blkg
      policy data is protected only by the queue lock in
      blkcg_activate_policy()/blkcg_deactivate_policy(). Another task can
      therefore look up the blkg policy data before it is destroyed and keep
      using it afterwards, which results in a use-after-free.
      
      CPU0                             CPU1
      blkcg_deactivate_policy
        spin_lock_irq(&q->queue_lock)
                                       bfq_io_set_weight_legacy
                                         spin_lock_irq(&blkcg->lock)
                                         blkg_to_bfqg(blkg)
                                           pd_to_bfqg(blkg->pd[pol->plid])
                                           ^^^^^^blkg->pd[pol->plid] != NULL
                                                 bfqg != NULL
        pol->pd_free_fn(blkg->pd[pol->plid])
          pd_to_bfqg(blkg->pd[pol->plid])
          bfqg_put(bfqg)
            kfree(bfqg)
        blkg->pd[pol->plid] = NULL
        spin_unlock_irq(q->queue_lock);
                                         bfq_group_set_weight(bfqg, val, 0)
                                           bfqg->entity.new_weight
                                           ^^^^^^trigger uaf here
                                         spin_unlock_irq(&blkcg->lock);
      
      Fix this by grabbing the matching blkcg lock before destroying the
      blkg policy data, as sketched below.
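      
      Hedged sketch of the corresponding change in blkcg_deactivate_policy()
      (the shape is assumed from the description, not the verbatim upstream
      diff): readers such as bfq_io_set_weight_legacy() hold blkcg->lock, so
      freeing and clearing blkg->pd[] must happen under that same lock.
      
      	list_for_each_entry(blkg, &q->blkg_list, q_node) {
      		struct blkcg *blkcg = blkg->blkcg;
      
      		/* pair with readers that dereference blkg->pd[] under blkcg->lock */
      		spin_lock(&blkcg->lock);
      		if (blkg->pd[pol->plid]) {
      			pol->pd_free_fn(blkg->pd[pol->plid]);
      			blkg->pd[pol->plid] = NULL;
      		}
      		spin_unlock(&blkcg->lock);
      	}
      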
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: fix memory leak in blk_iolatency_init · 6f5ddde4
      Authored by Yanfei Xu
      BUG: memory leak
      unreferenced object 0xffff888129acdb80 (size 96):
        comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s)
        hex dump (first 32 bytes):
          20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff   G...... ..)....
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline]
          [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline]
          [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724
          [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185
          [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566
          [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline]
          [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124
          [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344
          [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline]
          [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516
          [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline]
          [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline]
          [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline]
          [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
          [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      If blk_throtl_init() fails during queue init, blkcg_iolatency_exit()
      is never invoked for cleanup, which leaks the memory allocated by
      blk_iolatency_init(). Swapping the blk_throtl_init() and
      blk_iolatency_init() calls solves this.
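      
      A minimal sketch of the resulting ordering in blkcg_init_queue()
      (assumed shape; the error-label name is illustrative): initialize
      blk-throttle first, then blk-iolatency, and unwind blk-throttle if the
      iolatency setup fails, so neither allocation can leak on the error path.
      
      	ret = blk_throtl_init(q);
      	if (ret)
      		goto err_destroy_all;
      
      	ret = blk_iolatency_init(q);
      	if (ret) {
      		blk_throtl_exit(q);
      		goto err_destroy_all;
      	}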
      
      Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com
      Fixes: 19688d7f ("block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls")
      Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 24 Aug 2021, 1 commit
  6. 17 Aug 2021, 2 commits
  7. 10 Aug 2021, 1 commit
  8. 28 Jul 2021, 1 commit
    • cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync · c3df5fb5
      Authored by Tejun Heo
      0fa294fb ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
      cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
      context. However, rstat paths use u64_stats_sync to synchronize access to
      64bit stat counters on 32bit machines. u64_stats_sync is implemented using
      seq_lock and trying to read from an irq context can lead to A-A deadlock if
      the irq happens to interrupt the stat update.
      
      Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
      u64_stats_update_end_irqrestore() - in the update paths. Note that none of
      this matters on 64bit machines. All these are just for 32bit SMP setups.
      
      Note that the interface was introduced way back; its first and
      currently only use was recently added by 2d146aa3 ("mm: memcontrol:
      switch to rstat"). Stable tagging targets this commit.
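      
      Sketch of the update-path pattern (the field names below are
      assumptions, not a quote of the patch): on 32bit SMP the writer side of
      u64_stats_sync must run with irqs disabled if a flusher may read it from
      irq context, so use the irqsave variants.
      
      	unsigned long flags;
      
      	/* seqcount write section that an irq-context reader must not interrupt */
      	flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
      	rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
      	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
      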
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rik van Riel <riel@surriel.com>
      Fixes: 2d146aa3 ("mm: memcontrol: switch to rstat")
      Cc: stable@vger.kernel.org # v5.13+
  9. 07 Jul 2021, 1 commit
  10. 22 Jun 2021, 2 commits
  11. 24 May 2021, 1 commit
    • blkcg: drop CLONE_IO check in blkcg_can_attach() · b5f3352e
      Authored by Tejun Heo
      blkcg has always refused to attach if any of the member tasks has a shared
      io_context. The rationale was that io_contexts can be shared across
      different cgroups, making it impossible to define what the appropriate
      control behavior should be. However, this check causes more problems than it
      solves:
      
      * The check prevents controller enable and migrations but not CLONE_IO
        itself, which can lead to surprises as the outcome changes depending on
        the order of operations.
      
      * Sharing within a cgroup is fine but the check can't distinguish that. This
        leads to unnecessary conflicts with the recent CLONE_IO usage in io_uring.
      
      io_context sharing doesn't make any difference for rq_qos based controllers
      and the way it's used is safe as long as tasks aren't migrated dynamically
      which is the vast majority of use cases. While we can try to make the check
      more precise to avoid false positives, the added complexity doesn't seem
      worthwhile. Let's just drop blkcg_can_attach().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/YJrTvHbrRDbJjw+S@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 01 May 2021, 1 commit
    • cgroup: rstat: punt root-level optimization to individual controllers · dc26532a
      Authored by Johannes Weiner
      Current users of the rstat code can source root-level statistics from
      the native counters of their respective subsystem, allowing them to
      forego aggregation at the root level.  This optimization is currently
      implemented inside the generic rstat code, which doesn't track the root
      cgroup and doesn't invoke the subsystem flush callbacks on it.
      
      However, the memory controller cannot do this optimization, because
      cgroup1 breaks out memory specifically for the local level, including at
      the root level.  In preparation for the memory controller switching to
      rstat, move the optimization from rstat core to the controllers.
      
      Afterwards, rstat will always track the root cgroup for changes and
      invoke the subsystem callbacks on it; and it's up to the subsystem to
      special-case and skip aggregation of the root cgroup if it can source
      this information through other, cheaper means.
      
      This is the case for the io controller and the cgroup base stats.  In
      their respective flush callbacks, check whether the parent is the root
      cgroup, and if so, skip the unnecessary upward propagation.
      
      The extra cost of tracking the root cgroup is negligible: on stat
      changes, we actually remove a branch that checks for the root.  The
      queueing for a flush touches only per-cpu data, and only the first stat
      change since a flush requires a (per-cpu) lock.
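      
      A hedged sketch of the resulting controller-side pattern (the function
      name is hypothetical; only the shape is implied by the description
      above): the flush callback is now invoked for the root cgroup as well,
      and each controller skips the upward propagation itself when the parent
      is the root, since root-level totals can be sourced from cheaper global
      counters.
      
      	static void example_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
      	{
      		struct cgroup_subsys_state *parent = css->parent;
      
      		/* ... fold this CPU's percpu deltas into css's totals ... */
      
      		/* propagate upward only if the parent is not the root cgroup */
      		if (parent && parent->parent) {
      			/* ... add css's delta into the parent's totals ... */
      		}
      	}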
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 28 Jan 2021, 2 commits
  14. 27 Jan 2021, 1 commit
  15. 25 Jan 2021, 1 commit
  16. 02 Dec 2020, 3 commits
  17. 15 Nov 2020, 1 commit
  18. 26 Oct 2020, 2 commits
  19. 10 Sep 2020, 1 commit
    • blkcg: add plugging support for punt bio · 192f1c6b
      Authored by Xianting Tian
      The test and the explanation of the patch are below.
      
      Before test we added more debug code in blkg_async_bio_workfn():
      	int count = 0;
      	if (bios.head && bios.head->bi_next) {
      		need_plug = true;
      		blk_start_plug(&plug);
      	}
      	while ((bio = bio_list_pop(&bios))) {
      		/*io_punt is a sysctl user interface to control the print*/
      		if(io_punt) {
      			printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
      				current->comm, current->pid, bio->bi_iter.bi_sector,
      				(bio->bi_iter.bi_size)>>9, count++, need_plug);
      		}
      		submit_bio(bio);
      	}
      	if (need_plug)
      		blk_finish_plug(&plug);
      
      Steps needed to trigger *PUNT* io before testing:
      	mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
      	mount -t cgroup2 nodev /cgroup2
      	mkdir /cgroup2/cg3
      	echo "+io" > /cgroup2/cgroup.subtree_control
      	echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
      	echo $$ > /cgroup2/cg3/cgroup.procs
      
      Then use the dd command to test btrfs PUNT io in the current shell:
      	dd if=/dev/zero of=/btrfs/file bs=64K count=100000
      
      Test hardware environment as below:
      	[root@localhost btrfs]# lscpu
      	Architecture:          x86_64
      	CPU op-mode(s):        32-bit, 64-bit
      	Byte Order:            Little Endian
      	CPU(s):                32
      	On-line CPU(s) list:   0-31
      	Thread(s) per core:    2
      	Core(s) per socket:    8
      	Socket(s):             2
      	NUMA node(s):          2
      	Vendor ID:             GenuineIntel
      
      With the above debug code, test command and test environment, I ran the
      tests under 3 different system loads, generated by stress:
      1, Run 64 threads by command "stress -c 64 &"
      	[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
      	[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
      	[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
      	[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
      	[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
      	[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
      	... ...
      	[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
      	[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
      	[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
      	[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
      	[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
      	[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
      	[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1
      
      2, Run 32 threads by command "stress -c 32 &"
      	[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
      	[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
      	[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
      	[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
      	[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
      	[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
      	... ...
      	[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
      	[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
      	[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
      	[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
      	[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
      	[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
      	[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
      	[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1
      
      3, Don't run stress
      	[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
      	[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
      	[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
      	[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
      	[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
      	[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
      	[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
      	[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
      	[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
      	[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
      	[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
      	[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
      	[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
      	[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
      	[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
      	[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
      	[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
      	[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
      	[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
      	[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
      	[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0
      
      Analysis of the above 3 test results under different system loads:
      From the above tests, we can see that more and more continuous bios can
      be plugged as the system load increases. When running "stress -c 64 &",
      310 continuous bios are plugged; when running "stress -c 32 &", 260
      continuous bios are plugged; when stress is not run, at most 2
      continuous bios are plugged, and in most cases bio_list contains only a
      single bio.
      
      How to explain the above phenomenon:
      In submit_bio(), if the bio is a REQ_CGROUP_PUNT io, a work item is
      queued to the blkcg_punt_bio_wq workqueue, but when that work actually
      runs depends on the system load. When the system load is low, the work
      is scheduled quickly and the bios in bio_list are processed promptly in
      blkg_async_bio_workfn(), so there is little chance for the same io
      submit thread to add multiple continuous bios to bio_list before the
      work runs. This matches test "3" above.
      When the system load is high, there is some delay before the work can
      be scheduled, and the higher the load, the greater the delay. The same
      io submit thread therefore has more chance to add multiple continuous
      bios to bio_list, so when the work finally runs, blkg_async_bio_workfn()
      finds more continuous bios to process. This matches tests "1" and "2"
      above.
      
      According to the tests, io performance is improved with the patch,
      especially when the system load is higher. As a further optimization,
      the plug is used only when bio_list contains at least 2 bios, as in the
      sketch below.
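      
      Assumed final shape of the blkg_async_bio_workfn() loop (essentially the
      debug snippet above with the printk instrumentation removed; not a
      verbatim quote of the patch):
      
      	/* plug only if at least two bios were punted to this worker */
      	if (bios.head && bios.head->bi_next) {
      		need_plug = true;
      		blk_start_plug(&plug);
      	}
      	while ((bio = bio_list_pop(&bios)))
      		submit_bio(bio);
      	if (need_plug)
      		blk_finish_plug(&plug);
      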
      Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. 02 Sep 2020, 1 commit
    • blk-iocost: implement delay adjustment hysteresis · 5160a5a5
      Authored by Tejun Heo
      Currently, iocost syncs the delay duration to the outstanding debt amount,
      which seemed enough to protect the system from anon memory hogs. However,
      that was mostly because the delay calculation was using hweight_inuse,
      which quickly converges towards zero under debt, often punishing debtors
      overly harshly for longer than deserved.
      
      The previous patch fixed the delay calculation, and now the protection
      against anonymous memory hogs isn't enough, because the effect of delay is
      indirect and non-linear and a huge amount of future debt can accumulate
      abruptly while unthrottled.
      
      This patch implements delay hysteresis so that delay is decayed
      exponentially over time instead of getting cleared immediately as debt is
      paid off. While the overall behavior is similar to the blk-cgroup
      implementation used by blk-iolatency, a lot of the details are different and
      due to the empirical nature of the mechanism, it's challenging to adapt the
      mechanism for one controller without negatively impacting the other.
      
      As the delay is gradually decayed now, there's no point in running it from
      its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
      the dedicated hrtimer is removed.
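      
      As an illustration only (the decay step, factor, and function name below
      are assumptions, not the iocost implementation): the hysteresis amounts
      to decaying the per-group delay a little on each periodic timer run
      instead of zeroing it the moment the debt is repaid.
      
      	/* hypothetical decay helper called once per timer period */
      	static u64 decay_delay_ns(u64 delay_ns)
      	{
      		/* assumed ~25% exponential decay per period */
      		return delay_ns - delay_ns / 4;
      	}
      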
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 22 Aug 2020, 1 commit
  22. 18 Jul 2020, 2 commits
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Authored by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
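      
      A heavily simplified sketch of that fill loop (assumed shape; the stat
      copying is elided because the exact field mapping is not spelled out
      here): iterate the disks via the now-external disk_type and fold each
      disk's global counters into the root blkg's iostat before printing.
      
      	struct class_dev_iter iter;
      	struct device *dev;
      
      	class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
      	while ((dev = class_dev_iter_next(&iter))) {
      		struct gendisk *disk = dev_to_disk(dev);
      
      		/* ... copy the disk's read/write/discard byte and io
      		 * counts into the root cgroup's blkg_iostat ... */
      	}
      	class_dev_iter_exit(&iter);
      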
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: make iostat functions visible to stat printing · cd1fc4b9
      Authored by Boris Burkov
      Previously, the code which printed io.stat only needed access to the
      generic rstat flushing code, but since we plan to write some more
      specific code for preparing root cgroup stats, we need to manipulate
      iostat structs directly. Since declaring static functions ahead does not
      seem like common practice in this file, simply move the iostat functions
      up. We only plan to use blkg_iostat_set, but it seems better to keep them
      all together.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 09 Jul 2020, 1 commit
  24. 01 Jul 2020, 2 commits
  25. 29 Jun 2020, 4 commits