1. 18 7月, 2018 1 次提交
  2. 09 7月, 2018 3 次提交
    • J
      block: introduce blk-iolatency io controller · d7067512
      Josef Bacik 提交于
      Current IO controllers for the block layer are less than ideal for our
      use case.  The io.max controller is great at hard limiting, but it is
      not work conserving.  This patch introduces io.latency.  You provide a
      latency target for your group and we monitor the io in short windows to
      make sure we are not exceeding those latency targets.  This makes use of
      the rq-qos infrastructure and works much like the wbt stuff.  There are
      a few differences from wbt
      
       - It's bio based, so the latency covers the whole block layer in addition to
         the actual io.
       - We will throttle all IO types that comes in here if we need to.
       - We use the mean latency over the 100ms window.  This is because writes can
         be particularly fast, which could give us a false sense of the impact of
         other workloads on our protected workload.
       - By default there's no throttling, we set the queue_depth to INT_MAX so that
         we can have as many outstanding bio's as we're allowed to.  Only at
         throttle time do we pay attention to the actual queue depth.
       - We backcharge cgroups for root cg issued IO and induce artificial
         delays in order to deal with cases like metadata only or swap heavy
         workloads.
      
      In testing this has worked out relatively well.  Protected workloads
      will throttle noisy workloads down to 1 io at time if they are doing
      normal IO on their own, or induce up to a 1 second delay per syscall if
      they are doing a lot of root issued IO (metadata/swap IO).
      
      Our testing has revolved mostly around our production web servers where
      we have hhvm (the web server application) in a protected group and
      everything else in another group.  We see slightly higher requests per
      second (RPS) on the test tier vs the control tier, and much more stable
      RPS across all machines in the test tier vs the control tier.
      
      Another test we run is a slow memory allocator in the unprotected group.
      Before this would eventually push us into swap and cause the whole box
      to die and not recover at all.  With these patches we see slight RPS
      drops (usually 10-15%) before the memory consumer is properly killed and
      things recover within seconds.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d7067512
    • J
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik 提交于
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling in a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d09d8df3
    • J
      blk-cgroup: allow controllers to output their own stats · 903d23f0
      Josef Bacik 提交于
      blk-iolatency has a few stats that it would like to print out, and
      instead of adding a bunch of crap to the generic code just provide a
      helper so that controllers can add stuff to the stat line if they want
      to.
      
      Hide it behind a boot option since it changes the output of io.stat from
      normal, and these stats are only interesting to developers.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      903d23f0
  3. 19 4月, 2018 2 次提交
  4. 18 4月, 2018 1 次提交
  5. 17 3月, 2018 1 次提交
    • J
      blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir() · 4c699480
      Joseph Qi 提交于
      We've triggered a WARNING in blk_throtl_bio() when throttling writeback
      io, which complains blkg->refcnt is already 0 when calling blkg_get(),
      and then kernel crashes with invalid page request.
      After investigating this issue, we've found it is caused by a race
      between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
      below:
      
      writeback kworker               cgroup_rmdir
                                        cgroup_destroy_locked
                                          kill_css
                                            css_killed_ref_fn
                                              css_killed_work_fn
                                                offline_css
                                                  blkcg_css_offline
        blkcg_bio_issue_check
          rcu_read_lock
          blkg_lookup
                                                    spin_trylock(q->queue_lock)
                                                    blkg_destroy
                                                    spin_unlock(q->queue_lock)
          blk_throtl_bio
          spin_lock_irq(q->queue_lock)
          ...
          spin_unlock_irq(q->queue_lock)
        rcu_read_unlock
      
      Since rcu can only prevent blkg from releasing when it is being used,
      the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule
      blkg release.
      Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING.
      And then the corresponding blkg_put() will schedule blkg release again,
      which result in double free.
      This race is introduced by commit ae118896 ("blkcg: consolidate blkg
      creation in blkcg_bio_issue_check()"). Before this commit, it will
      lookup first and then try to lookup/create again with queue_lock. Since
      revive this logic is a bit drastic, so fix it by only offlining pd during
      blkcg_css_offline(), and move the rest destruction (especially
      blkg_put()) into blkcg_css_free(), which should be the right way as
      discussed.
      
      Fixes: ae118896 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
      Reported-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4c699480
  6. 27 2月, 2018 1 次提交
  7. 05 11月, 2017 1 次提交
  8. 10 10月, 2017 1 次提交
  9. 26 8月, 2017 1 次提交
  10. 02 6月, 2017 1 次提交
    • B
      block: Avoid that blk_exit_rl() triggers a use-after-free · b425e504
      Bart Van Assche 提交于
      Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
      essential that the memory allocated for struct request_queue
      stays around until all blk_exit_rl() calls have finished. Hence
      make blk_init_rl() take a reference on struct request_queue.
      
      This patch fixes the following crash:
      
      general protection fault: 0000 [#2] SMP
      CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      task: ffff88013a108040 task.stack: ffffc9000071c000
      RIP: 0010:free_request_size+0x1a/0x30
      RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
      RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
      RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
      RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
      R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
      R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
      FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
      Call Trace:
       mempool_destroy.part.10+0x21/0x40
       mempool_destroy+0xe/0x10
       blk_exit_rl+0x12/0x20
       blkg_free+0x4d/0xa0
       __blkg_release_rcu+0x59/0x170
       rcu_process_callbacks+0x260/0x4e0
       __do_softirq+0x116/0x250
       smpboot_thread_fn+0x123/0x1e0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
      
      Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b425e504
  11. 30 3月, 2017 2 次提交
    • T
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 457e490f
      Tahsin Erdogan 提交于
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Do struct blkcg_gq allocations outside the queue spinlock to allow
      blocking during memory allocations.
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      457e490f
    • J
      Revert "blkcg: allocate struct blkcg_gq outside request queue spinlock" · d708f0d5
      Jens Axboe 提交于
      I inadvertently applied the v5 version of this patch, whereas
      the agreed upon version was v5. Revert this one so we can apply
      the right one.
      
      This reverts commit 7fc6b87a.
      d708f0d5
  12. 29 3月, 2017 1 次提交
    • T
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 7fc6b87a
      Tahsin Erdogan 提交于
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Update blkg_create() function to temporarily drop the rcu and queue
      locks when it is allowed by gfp mask.
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7fc6b87a
  13. 02 3月, 2017 1 次提交
  14. 03 2月, 2017 1 次提交
  15. 02 2月, 2017 1 次提交
  16. 19 1月, 2017 1 次提交
  17. 18 1月, 2017 2 次提交
  18. 22 11月, 2016 1 次提交
  19. 30 9月, 2016 1 次提交
  20. 14 6月, 2016 1 次提交
  21. 10 2月, 2016 1 次提交
    • R
      block: fix module reference leak on put_disk() call for cgroups throttle · 39a169b6
      Roman Pen 提交于
      get_disk(),get_gendisk() calls have non explicit side effect: they
      increase the reference on the disk owner module.
      
      The following is the correct sequence how to get a disk reference and
      to put it:
      
          disk = get_gendisk(...);
      
          /* use disk */
      
          owner = disk->fops->owner;
          put_disk(disk);
          module_put(owner);
      
      fs/block_dev.c is aware of this required module_put() call, but f.e.
      blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
      a module reference.  To see a leakage in action cgroups throttle config
      can be used.  In the following script I'm removing throttle for /dev/ram0
      (actually this is NOP, because throttle was never set for this device):
      
          # lsmod | grep brd
          brd                     5175  0
          # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
              /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
          done
          # lsmod | grep brd
          brd                     5175  100
      
      Now brd module has 100 references.
      
      The issue is fixed by calling module_put() just right away put_disk().
      Signed-off-by: NRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      39a169b6
  22. 03 12月, 2015 1 次提交
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  23. 22 10月, 2015 1 次提交
    • T
      blkcg: don't create "io.stat" on the root cgroup · ca0752c5
      Tejun Heo 提交于
      The stat files on the root cgroup shows stats for the whole system and
      usually don't contain any information which isn't available through
      the usual system monitoring mechanisms.  Some controllers skip
      collecting these duplicate stats to optimize cases where cgroup isn't
      used and later try to emulate the result on demand.
      
      This leads to complexities and subtle differences in the information
      shown through different channels.  This is entirely unnecessary and
      cgroup v2 is dropping stat files which are duplicate from all
      controllers.  This patch removes "io.stat" from the root hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      ca0752c5
  24. 11 9月, 2015 1 次提交
    • T
      block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg · 6fe810bd
      Tejun Heo 提交于
      While making the root blkg unconditional, ec13b1d6 ("blkcg: always
      create the blkcg_gq for the root blkcg") removed the part which clears
      q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
      pointers dangling after blkg_destroy_all().  blk-throttle exit path
      performs blkg traversals and dereferences ->root_blkg and can lead to
      the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
       IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
       RIP: 0010:[<ffffffff81389746>]  [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       Call Trace:
        [<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
        [<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
        [<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
        [<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
        [<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
        [<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
        [<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
        [<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
       ...
      
      While the bug is a straigh-forward use-after-free bug, it is tricky to
      reproduce because blkg release is RCU protected and the rest of exit
      path usually finishes before RCU grace period.
      
      This patch fixes the bug by updating blkg_destro_all() to clear
      q->root_blkg and ->root_rl.blkg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: N"Richard W.M. Jones" <rjones@redhat.com>
      Reported-by: NJosh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
      Fixes: ec13b1d6 ("blkcg: always create the blkcg_gq for the root blkcg")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: NRichard W.M. Jones <rjones@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6fe810bd
  25. 19 8月, 2015 11 次提交
    • T
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Tejun Heo 提交于
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      69d7fde5
    • T
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo 提交于
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of cloes duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks rendering the two structural issues
      twisting the current interface.  The interface has to be updated in a
      significant anyway and this is a good chance to revamp it as a whole.
      This patch implements blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface consisted of 27 files is replaced with the
        following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2ee867dc
    • T
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Tejun Heo 提交于
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd165eb3
    • T
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo 提交于
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
      non-const pointer and to accommodate that const is dropped from @input
      too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      36aa9e5f
    • T
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo 提交于
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      880f50e2
    • T
      blkcg: rename subsystem name from blkio to io · c165b3e3
      Tejun Heo 提交于
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c165b3e3
    • T
      blkcg: refine error codes returned during blkcg configuration · 20386ce0
      Tejun Heo 提交于
      blkcg currently returns -EINVAL for most errors which can be pretty
      confusing given that the failure modes are quite varied.  Update the
      error returns so that
      
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      20386ce0
    • T
      blkcg: reduce stack usage of blkg_rwstat_recursive_sum() · 3a7faead
      Tejun Heo 提交于
      The recent percpu conversion of blkg_rwstat triggered the following
      warning in certain configurations.
      
       block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
      
      This is because blkg_rwstat now contains four percpu_counter which can
      be pretty big depending on debug options although it shouldn't be a
      problem in production configs.  This patch removes one of the two
      local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
      reduce stack usage.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835Signed-off-by: NJens Axboe <axboe@fb.com>
      3a7faead
    • T
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo 提交于
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bio's as IOs while
      cfq-iosched request's, which is more confusing than informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      77ea7338
    • T
      blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
      Tejun Heo 提交于
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      
      The existing usages are updated to maintain the current behaviors.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f12c74ca
    • T
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo 提交于
      blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
      per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
      it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
      per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
      This makes all blkcg stat counters per-cpu without complicating policy
      implmentations.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24bdb8ef