1. 19 Aug, 2015 2 commits
    • blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy() · 4c55f4f9
      Tejun Heo committed
      When a policy gets activated, it needs to allocate and install its
      policy data on all existing blkg's (blkcg_gq's).  Because blkg
      iteration is protected by a spinlock, it currently counts the total
      number of blkg's in the system, allocates the matching number of
      policy data on a list and installs them during a single iteration.
      
      This can be simplified by using speculative GFP_NOWAIT allocations
      while iterating and falling back to a preallocated policy data on
      failure.  If the preallocated one has already been consumed, it
      releases the lock, preallocates with GFP_KERNEL and then restarts the
      iteration.  This can be a bit more expensive than before but policy
      activation is a very cold path and shouldn't matter.
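
The allocate-speculatively, fall-back, restart loop described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct pd`, `struct blkg`, `struct nowait_sim`, `alloc_nowait()`, `alloc_blocking()` and `activate_policy()` are hypothetical names, the failing allocator is simulated, and no real locking is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a blkg and its per-policy data. */
struct pd { int owner; };
struct blkg { int id; struct pd *pd; };

/* Simulated non-blocking allocator: every fail_every-th call fails (0 = never). */
struct nowait_sim { unsigned calls; unsigned fail_every; };

static struct pd *alloc_nowait(struct nowait_sim *sim)
{
    sim->calls++;
    if (sim->fail_every && sim->calls % sim->fail_every == 0)
        return NULL;                /* models a GFP_NOWAIT failure */
    return malloc(sizeof(struct pd));
}

/* Models the blocking allocation done after the lock has been dropped. */
static struct pd *alloc_blocking(void)
{
    return malloc(sizeof(struct pd));
}

/* Installs a pd on every blkg; returns the number of restarts, or -1 on failure. */
static int activate_policy(struct blkg *blkgs, size_t n, struct nowait_sim *sim)
{
    struct pd *prealloc = NULL;
    int restarts = 0;
    size_t i;

restart:
    /* The real code would take the iteration lock here. */
    for (i = 0; i < n; i++) {
        struct pd *pd;

        if (blkgs[i].pd)
            continue;               /* already installed in an earlier pass */

        pd = alloc_nowait(sim);
        if (!pd) {                  /* fall back to the preallocated one */
            pd = prealloc;
            prealloc = NULL;
        }
        if (!pd) {
            /* Spare already consumed: drop the lock, preallocate, restart. */
            prealloc = alloc_blocking();
            if (!prealloc)
                return -1;
            restarts++;
            goto restart;
        }
        pd->owner = blkgs[i].id;
        blkgs[i].pd = pd;
    }
    free(prealloc);                 /* an unused spare is simply released */
    return restarts;
}
```

Each restart skips the entries that were already populated, so the extra cost is bounded by the number of allocation failures rather than by the number of blkg's.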
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl() · 401efbf8
      Tejun Heo committed
      Since ec13b1d6 ("blkcg: always create the blkcg_gq for the root
      blkcg"), a request_list always has its blkg associated.  Drop
      unnecessary rl->blkg NULL test from blk_put_rl().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 10 Jul, 2015 2 commits
    • blkcg: fix blkcg_policy_data allocation bug · 06b285bd
      Tejun Heo committed
      e48453c3 ("block, cgroup: implement policy-specific per-blkcg
      data") updated per-blkcg policy data to be dynamically allocated.
      When a policy is registered, its policy data aren't created.  Instead,
      when the policy is activated on a queue, the policy data are allocated
      if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
      This is buggy.  Consider the following scenario.
      
      1. A blkcg is created.  No blkg's attached yet.
      
      2. The policy is registered.  No policy data is allocated.
      
      3. The policy is activated on a queue.  As the above blkcg doesn't
         have any blkg's, it won't allocate the matching blkcg_policy_data.
      
      4. An IO is issued from the blkcg and blkg is created and the blkcg
         still doesn't have the matching policy data allocated.
      
      With cfq-iosched, this leads to an oops.
      
      It also doesn't free policy data on policy unregistration assuming
      that freeing of all policy data on blkcg destruction should take care
      of it; however, this also is incorrect.
      
      1. A blkcg has policy data.
      
      2. The policy gets unregistered but the policy data remains.
      
      3. Another policy gets registered on the same slot.
      
      4. Later, the new policy tries to allocate policy data on the previous
         blkcg but the slot is already occupied and gets skipped.  The
         policy ends up operating on the policy data of the previous policy.
      
      There's no reason to manage blkcg_policy_data lazily.  The reason we
      do lazy allocation of blkg's is that the number of all possible blkg's
      is the product of cgroups and block devices which can reach a
      surprising level.  blkcg_policy_data is constrained by the number of
      cgroups and shouldn't be a problem.
      
      This patch allocates blkcg_policy_data for all existing blkcg's on
      policy registration, frees it on unregistration, and removes
      blkcg_policy_data handling from the policy [de]activation paths.  As a
      result, blkcg_policy_data is created and removed together with the
      policy it belongs to, which fixes the problems described above.
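
The eager scheme can be illustrated with a small user-space model. It is a sketch, not the kernel code: `struct registry`, `struct cgrp`, `struct cpd`, `policy_register()` and `policy_unregister()` are hypothetical names, and a fixed array stands in for the list of all blkcg's. The point it shows is that a policy slot never carries data from a previous owner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_POLS    2
#define MAX_CGROUPS 4

/* Hypothetical per-cgroup policy data, tagged with the policy that owns it. */
struct cpd { int tag; };
struct cgrp { struct cpd *pd[MAX_POLS]; };

struct registry {
    struct cgrp cgroups[MAX_CGROUPS];
    size_t nr_cgroups;
    int slot_tag[MAX_POLS];         /* 0 = free slot, otherwise the policy tag */
};

/* Frees the policy data of every cgroup and releases the slot. */
static void policy_unregister(struct registry *r, int slot)
{
    size_t i;

    assert(slot >= 0 && slot < MAX_POLS);
    for (i = 0; i < r->nr_cgroups; i++) {
        free(r->cgroups[i].pd[slot]);
        r->cgroups[i].pd[slot] = NULL;
    }
    r->slot_tag[slot] = 0;
}

/* Claims a free slot and allocates data for all existing cgroups up front. */
static int policy_register(struct registry *r, int tag)
{
    int slot;
    size_t i;

    assert(tag != 0);
    for (slot = 0; slot < MAX_POLS; slot++)
        if (!r->slot_tag[slot])
            break;
    if (slot == MAX_POLS)
        return -1;                  /* no free slot */

    r->slot_tag[slot] = tag;
    for (i = 0; i < r->nr_cgroups; i++) {
        struct cpd *pd = malloc(sizeof(*pd));

        if (!pd) {                  /* roll back on allocation failure */
            policy_unregister(r, slot);
            return -1;
        }
        pd->tag = tag;
        r->cgroups[i].pd[slot] = pd;
    }
    return slot;
}
```

Because unregistration clears every slot entry, a later policy registered on the same slot always starts from freshly allocated data.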
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: implement all_blkcgs list · 7876f930
      Tejun Heo committed
      Add the all_blkcgs list, which goes through blkcg->all_blkcgs_node and is
      protected by blkcg_pol_mutex.  This will be used to fix
      blkcg_policy_data allocation bug.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 02 Jun, 2015 6 commits
    • writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea
      Tejun Heo committed
      A blkg (blkcg_gq) can be congested and decongested independently from
      other blkgs on the same request_queue.  Accordingly, for cgroup
      writeback support, the congestion status at bdi (backing_dev_info)
      should be split and updated separately from matching blkg's.
      
      This patch prepares by adding blkg->wb_congested and associating a
      blkg with its matching per-blkcg bdi_writeback_congested on creation.
      
      v2: Updated to associate bdi_writeback_congested instead of
          bdi_writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo committed
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: implement task_get_blkcg_css() · fd383c2d
      Tejun Heo committed
      Implement a wrapper around task_get_css() to acquire the blkcg css for
      a given task.  The wrapper is necessary for cgroup writeback support
      as there will be places outside blkcg proper trying to acquire
      blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: add blkcg_root_css · 496d5e75
      Tejun Heo committed
      Add global constant blkcg_root_css which points to &blkcg_root.css.
      This will be used by cgroup writeback support.  If blkcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      
      v2: The declarations moved to include/linux/blk-cgroup.h as suggested
          by Vivek.
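
For readers unfamiliar with the error-pointer convention, here is a user-space sketch of how a pointer constant can encode -EINVAL when a feature is compiled out. The helpers `err_ptr()`, `ptr_err()` and `is_err()` only mirror the idea behind the kernel's ERR_PTR() family; `struct css_like`, `root_css()` and the `enabled` flag are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses of the address space. */
static inline bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a css and the root instance. */
struct css_like { int unused; };
static struct css_like root_instance;

/* Models "root css" with the feature enabled or disabled at build time. */
static struct css_like *root_css(bool enabled)
{
    return enabled ? &root_instance : err_ptr(-EINVAL);
}
```

Callers that may run with the feature disabled can then test the constant with `is_err()` instead of dereferencing it.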
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h · efa7d1c7
      Tejun Heo committed
      The header file will be used more widely with the pending cgroup
      writeback support and the current set of dummy declarations aren't
      enough to handle different config combinations.  Update as follows.
      
      * Drop the struct cgroup declaration.  None of the dummy defs need it.
      
      * Define blkcg as an empty struct instead of just declaring it.
      
      * Wrap dummy function defs in CONFIG_BLOCK.  Some functions use block
        data types and none of them are to be used w/o block enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h · eea8f41c
      Tejun Heo committed
      cgroup aware writeback support will require exposing some of blkcg
      details.  In preparation, move block/blk-cgroup.h to
      include/linux/blk-cgroup.h.  This patch is pure file move.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 08 Sep, 2014 1 commit
    • blkcg: remove blkcg->id · f4da8072
      Tejun Heo committed
      blkcg->id is a unique id given to each blkcg; however, the
      cgroup_subsys_state which each blkcg embeds already has ->serial_nr
      which can be used for the same purpose.  Drop blkcg->id and replace
      its uses with blkcg->css.serial_nr.  Rename cfq_cgroup->blkcg_id to
      ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
      consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 23 Jun, 2014 2 commits
    • Revert "block: add __init to blkcg_policy_register" · d5bf0291
      Jens Axboe committed
      This reverts commit a2d445d4.
      
      The original commit is buggy, we do use the registration functions
      at runtime for modular builds.
    • blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t · a5049a8a
      Tejun Heo committed
      Hello,
      
      So, this patch should do.  Joe, Vivek, can one of you guys please
      verify that the oops goes away with this patch?
      
      Jens, the original thread can be read at
      
        http://thread.gmane.org/gmane.linux.kernel/1720729
      
      The fix converts blkg->refcnt from int to atomic_t.  It does add some
      overhead but it should be minute compared to everything else which is
      going on and the involved cacheline bouncing, so I think it's highly
      unlikely to cause any noticeable difference.  Also, the refcnt in
      question should be converted to a percpu_ref for blk-mq anyway, so the
      atomic_t is likely to go away pretty soon.
      
      Thanks.
      
      ------- 8< -------
      __blkg_release_rcu() may be invoked after the associated request_queue
      is released with an RCU grace period in between.  As such, the function
      and callbacks invoked from it must not dereference the associated
      request_queue.  This is clearly indicated in the comment above the
      function.
      
      Unfortunately, while trying to fix a different issue, 2a4fd070
      ("blkcg: move bulk of blkcg_gq release operations to the RCU
      callback") ignored this and added [un]locking of @blkg->q->queue_lock
      to __blkg_release_rcu().  This of course can cause oops as the
      request_queue may be long gone by the time this code gets executed.
      
        general protection fault: 0000 [#1] SMP
        CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
        Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
        task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
        RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
        RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
        RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
        RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
        RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
        R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
        R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
        Stack:
         ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
         ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
         ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
        Call Trace:
         [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
         [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
         [<ffffffff81091d81>] kthread+0xe1/0x100
         [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
        Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
        +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
        +b7
        RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
         RSP <ffff88085403fdf0>
      
      The request_queue locking was added because blkcg_gq->refcnt is an int
      protected with the queue lock and __blkg_release_rcu() needs to put
      the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
      dropping queue locking in the function.
      
      Given the general heavy weight of the current request_queue and blkcg
      operations, this is unlikely to cause any noticeable overhead.
      Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
      the near future, so whatever (most likely negligible) overhead it may
      add is temporary.
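
The idea of the fix, a reference count that can be dropped (including the parent's) without taking any external lock, can be sketched with C11 atomics. This is a user-space illustration only: `struct node`, `node_get()`, `node_put()`, `node_new()` and the `nodes_freed` counter are hypothetical, and the RCU deferral of the real release path is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object which pins its parent while it is alive. */
struct node {
    atomic_int refcnt;
    struct node *parent;
};

static int nodes_freed;             /* instrumentation for the example only */

/* Takes a reference; the caller must already hold one. */
static void node_get(struct node *n)
{
    int old = atomic_fetch_add(&n->refcnt, 1);

    assert(old > 0);
}

/* Drops a reference; frees the node and puts its parent on the last put. */
static void node_put(struct node *n)
{
    while (n) {
        struct node *parent = n->parent;

        if (atomic_fetch_sub(&n->refcnt, 1) != 1)
            break;                  /* other references remain */
        free(n);
        nodes_freed++;
        n = parent;                 /* release the reference held on the parent */
    }
}

/* Allocates a node with one reference and pins its parent, if any. */
static struct node *node_new(struct node *parent)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    atomic_init(&n->refcnt, 1);
    n->parent = parent;
    if (parent)
        node_get(parent);
    return n;
}
```

Since the counter itself is atomic, the release path never has to reach into an object, such as the request_queue in the commit, that may already be gone.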
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 11 Jun, 2014 1 commit
  7. 17 May, 2014 1 commit
    • cgroup: remove css_parent() · 5c9d535b
      Tejun Heo committed
      cgroup in general is moving towards using cgroup_subsys_state as the
      fundamental structural component and css_parent() was introduced to
      convert from using cgroup->parent to css->parent.  It was quite some
      time ago and we're moving forward with making css more prominent.
      
      This patch drops the trivial wrapper css_parent() and lets the users
      dereference css->parent.  While at it, explicitly mark fields of css
      which are public and immutable.
      
      v2: New usage from device_cgroup.c converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  8. 15 Mar, 2014 1 commit
  9. 12 Feb, 2014 1 commit
    • cgroup: remove cgroup->name · e61734c5
      Tejun Heo committed
      cgroup->name handling became quite complicated over time involving
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
  10. 08 Feb, 2014 1 commit
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo committed
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
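
One way to generate identifiers and names from a single list, so that controllers no longer initialize them by hand, is the X-macro pattern. The sketch below is an assumption-laden illustration rather than the kernel's actual macros: `SUBSYS_LIST`, `struct subsys`, `subsys_table`, `subsys_names` and `subsys_init_early()` are hypothetical; only the `_cgrp_id` and `_cgrp_subsys` suffixes come from the commit message.

```c
#include <assert.h>
#include <string.h>

/* Single source of truth: the userland-visible subsystem names. */
#define SUBSYS_LIST(X) X(cpu) X(memory) X(perf_event)

/* Generate the enum: cpu_cgrp_id, memory_cgrp_id, perf_event_cgrp_id. */
#define X_ENUM(name) name##_cgrp_id,
enum subsys_id { SUBSYS_LIST(X_ENUM) SUBSYS_COUNT };
#undef X_ENUM

struct subsys {
    int id;
    const char *name;
};

/* Generate one object per subsystem: cpu_cgrp_subsys, and so on. */
#define X_DEF(name) struct subsys name##_cgrp_subsys;
SUBSYS_LIST(X_DEF)
#undef X_DEF

/* Generate lookup tables indexed by the enum values. */
#define X_PTR(name) [name##_cgrp_id] = &name##_cgrp_subsys,
static struct subsys *const subsys_table[] = { SUBSYS_LIST(X_PTR) };
#undef X_PTR

#define X_NAME(name) [name##_cgrp_id] = #name,
static const char *const subsys_names[] = { SUBSYS_LIST(X_NAME) };
#undef X_NAME

/* Core fills in id and name, so a mismatch cannot occur. */
static void subsys_init_early(void)
{
    int i;

    for (i = 0; i < SUBSYS_COUNT; i++) {
        subsys_table[i]->id = i;
        subsys_table[i]->name = subsys_names[i];
    }
}
```

With the id and name derived from the same list entry, there is nothing left for a runtime consistency check to catch.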
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
  11. 21 Nov, 2013 1 commit
  12. 13 Nov, 2013 1 commit
    • block: Use u64_stats_init() to initialize seqcounts · 90d3839b
      Peter Zijlstra committed
      Now that seqcounts are lockdep enabled objects, we need to explicitly
      initialize runtime allocated seqcounts so that lockdep can track them.
      
      Without this patch, Fengguang was seeing:
      
        [    4.127282] INFO: trying to register non-static key.
        [    4.128027] the code is fine but needs lockdep annotation.
        [    4.128027] turning off the locking correctness validator.
        [    4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
        [    4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        [    ...     ]
        [    4.128027] Call Trace:
        [    4.128027]  [<7908e744>] ? console_unlock+0x353/0x380
        [    4.128027]  [<79dc7cf2>] dump_stack+0x48/0x60
        [    4.128027]  [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb
        [    4.128027]  [<7908a1c5>] lock_acquire+0x71/0x9a
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<794079aa>] blk_throtl_bio+0x1c3/0x485
        ...
      
      Use u64_stats_init() for all affected data structures, which initializes
      the seqcount.
      Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
      [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 09 Aug, 2013 5 commits
    • cgroup: make css_for_each_descendant() and friends include the origin css in the iteration · bd8815a6
      Tejun Heo committed
      Previously, all css descendant iterators didn't include the origin
      (root of subtree) css in the iteration.  The reasons were maintaining
      consistency with css_for_each_child() and that at the time of
      introduction more use cases needed skipping the origin anyway;
      however, given that css_is_descendant() considers self to be a
      descendant, omitting the origin css has become more confusing and
      looking at the accumulated use cases rather clearly indicates that
      including origin would result in simpler code overall.
      
      While this is a change which can easily lead to subtle bugs, cgroup
      API including the iterators has recently gone through major
      restructuring and no out-of-tree changes will be applicable without
      adjustments making this a relatively acceptable opportunity for this
      type of change.
      
      The conversions are mostly straight-forward.  If the iteration block
      had explicit origin handling before or after, it's moved inside the
      iteration.  If not, if (pos == origin) continue; is added.  Some
      conversions add extra reference get/put around origin handling by
      consolidating origin handling and the rest.  While the extra ref
      operations aren't strictly necessary, this shouldn't cause any
      noticeable difference.
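
The difference between the two conventions can be shown with a generic pre-order walk over a tree. This is a user-space sketch; `struct tnode`, `next_descendant_pre()`, the `for_each_descendant_pre()` macro and the two sum helpers are hypothetical and only mimic the shape of the css iterators. The second helper shows the `if (pos == origin) continue;` idiom mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node with parent, first-child and next-sibling links. */
struct tnode {
    struct tnode *parent, *first_child, *next_sibling;
    int value;
};

/* Pre-order successor inside the subtree of origin; NULL pos starts at origin. */
static struct tnode *next_descendant_pre(struct tnode *pos, struct tnode *origin)
{
    if (!pos)
        return origin;              /* the origin itself is visited first */
    if (pos->first_child)
        return pos->first_child;
    while (pos != origin) {         /* climb until a sibling is available */
        if (pos->next_sibling)
            return pos->next_sibling;
        pos = pos->parent;
    }
    return NULL;
}

#define for_each_descendant_pre(pos, origin)                      \
    for ((pos) = next_descendant_pre(NULL, (origin)); (pos);      \
         (pos) = next_descendant_pre((pos), (origin)))

/* Origin handling folded into the loop body. */
static int sum_subtree(struct tnode *origin)
{
    struct tnode *pos;
    int sum = 0;

    for_each_descendant_pre(pos, origin)
        sum += pos->value;
    return sum;
}

/* Users that want the old behaviour skip the origin explicitly. */
static int sum_strict_descendants(struct tnode *origin)
{
    struct tnode *pos;
    int sum = 0;

    for_each_descendant_pre(pos, origin) {
        if (pos == origin)
            continue;
        sum += pos->value;
    }
    return sum;
}
```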
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
    • cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup · 492eb21b
      Tejun Heo committed
      cgroup is currently in the process of transitioning to using css
      (cgroup_subsys_state) as the primary handle instead of cgroup in
      subsystem API.  For hierarchy iterators, this is beneficial because
      
      * In most cases, css is the only thing subsystems care about anyway.
      
      * On the planned unified hierarchy, iterations for different
        subsystems will need to skip over different subtrees of the
        hierarchy depending on which subsystems are enabled on each cgroup.
        Passing around css makes it unnecessary to explicitly specify the
        subsystem in question as css is intersection between cgroup and
        subsystem
      
      * For the planned unified hierarchy, css's would need to be created
        and destroyed dynamically independent from cgroup hierarchy.  Having
        cgroup core manage css iteration makes enforcing deref rules a lot
        easier.
      
      Most subsystem conversions are straight-forward.  Noteworthy changes
      are
      
      * blkio: cgroup_to_blkcg() is no longer used.  Removed.
      
      * freezer: cgroup_freezer() is no longer used.  Removed.
      
      * devices: cgroup_to_devcgroup() is no longer used.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
    • cgroup: add css_parent() · 63876986
      Tejun Heo committed
      Currently, controllers have to explicitly follow the cgroup hierarchy
      to find the parent of a given css.  cgroup is moving towards using
      cgroup_subsys_state as the main controller interface construct, so
      let's provide a way to climb the hierarchy using just csses.
      
      This patch implements css_parent() which, given a css, returns its
      parent.  The function is guaranteed to return a valid non-NULL parent
      css as long as the target css is not at the top of the hierarchy.
      
      freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
      are converted to use css_parent() instead of accessing cgroup->parent
      directly.
      
      * __parent_ca() is dropped from cpuacct and its usage is replaced with
        parent_ca().  The only difference between the two was NULL test on
        cgroup->parent which is now embedded in css_parent() making the
        distinction moot.  Note that eventually a css->parent field will be
        added to css and the NULL check in css_parent() will go away.
      
      This patch shouldn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: add/update accessors which obtain subsys specific data from css · a7c6d554
      Tejun Heo committed
      css (cgroup_subsys_state) is usually embedded in a subsys specific
      data structure.  Subsystems either use container_of() directly to cast
      from css to such data structure or have an accessor function wrapping
      such cast.  As cgroup as a whole is moving towards using css as the main
      interface handle, add and update such accessors to ease dealing with
      css's.
      
      All accessors explicitly handle NULL input and return NULL in those
      cases.  While this looks like an extra branch in the code, as all
      controllers specific data structures have css as the first field, the
      casting doesn't involve any offsetting and the compiler can trivially
      optimize out the branch.
      
      * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
        accessor.  Added.
      
      * memory, hugetlb and devices already had one but didn't explicitly
        handle NULL input.  Updated.
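
A NULL-tolerant accessor of this kind can be sketched as follows. The names `struct css_like`, `struct blkcg_like` and `css_to_blkcg_like()` are hypothetical, and the `container_of()` macro is a simplified user-space version without the kernel's type checking. Because the embedded member is the first field, the pointer adjustment is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space container_of(): member pointer to enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for cgroup_subsys_state. */
struct css_like {
    int refcnt;
};

/* Hypothetical controller data with the css embedded as the first field. */
struct blkcg_like {
    struct css_like css;
    int weight;
};

/* Accessor which explicitly maps NULL input to NULL output. */
static struct blkcg_like *css_to_blkcg_like(struct css_like *css)
{
    return css ? container_of(css, struct blkcg_like, css) : NULL;
}
```

With the member at offset zero, both arms of the conditional yield the same address as the input, which is why the commit message expects the compiler to drop the branch.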
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Tejun Heo committed
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp large portion of cgroup API, so, let's rename
      them so that they're less awkward.  Most per-controller usages of the
      accessors are localized in accessor wrappers and given the amount of
      scheduled changes, this isn't gonna add any noticeable headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  14. 15 May, 2013 3 commits
  15. 05 Mar, 2013 1 commit
    • cgroup: fix cgroup_path() vs rename() race · 65dff759
      Li Zefan committed
      rename() will change dentry->d_name. The result of this race can
      be worse than seeing a partially rewritten name: we might access
      a stale pointer, because rename() will re-allocate memory to hold
      a longer name.
      
      As accessing dentry->d_name must be protected by dentry->d_lock or
      the parent inode's i_mutex, while on the other hand cgroup_path() can
      be called with some irq-safe spinlocks held, we can't generate the
      cgroup path using dentry->d_name.
      
      Alternatively we make a copy of dentry->d_name and save it in
      cgrp->name when a cgroup is created, and update cgrp->name at
      rename().
      
      v5: use flexible array instead of zero-size array.
      v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
          - add cgroup_name() wrapper.
      v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
      v2: make cgrp->name RCU safe.
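
The copy-on-create, replace-on-rename approach with a flexible array member can be sketched in user space. The names `struct name_buf`, `struct grp`, `name_buf_alloc()` and `grp_rename()` are hypothetical; the sketch frees the old buffer immediately, whereas the commit defers the free with kfree_rcu() so that concurrent readers stay safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Name stored inline via a flexible array member (v5 of the patch). */
struct name_buf {
    size_t len;
    char name[];
};

/* Hypothetical group which owns a private copy of its name. */
struct grp {
    struct name_buf *name;
};

/* Allocates header and string in one block and copies the name in. */
static struct name_buf *name_buf_alloc(const char *src)
{
    size_t len = strlen(src);
    struct name_buf *nb = malloc(sizeof(*nb) + len + 1);

    if (!nb)
        return NULL;
    nb->len = len;
    memcpy(nb->name, src, len + 1); /* includes the terminating NUL */
    return nb;
}

/* Swaps in a freshly allocated name; the old one is never edited in place. */
static int grp_rename(struct grp *g, const char *new_name)
{
    struct name_buf *fresh = name_buf_alloc(new_name);
    struct name_buf *old;

    if (!fresh)
        return -1;
    old = g->name;
    g->name = fresh;                /* a kernel version would publish this under RCU */
    free(old);                      /* a kernel version would defer this free */
    return 0;
}
```

Readers therefore see either the complete old name or the complete new one, never a partially rewritten buffer.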
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  16. 10 Jan, 2013 6 commits
    • blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Tejun Heo committed
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make the blkg online test reliable.  This is necessary
      to properly handle the stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      16b3de66
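The two operations named in this commit, merging one stat into another and summing over a subtree, can be sketched roughly as below. All types and names here (`struct rwstat_sketch`, `struct group_sketch`, `rwstat_merge_sketch()`, `rwstat_recursive_sum_sketch()`) are hypothetical simplifications, locking is omitted, and skipping descendants that are not online is my reading of the "online test" the message mentions rather than a statement of the kernel's exact behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { RW_READ, RW_WRITE, RW_NR };

/* Hypothetical simplified read/write counter pair. */
struct rwstat_sketch {
    uint64_t cnt[RW_NR];
};

/* Hypothetical tree node standing in for a blkg with one policy's stats. */
struct group_sketch {
    struct rwstat_sketch stat;
    int online;
    struct group_sketch **children;
    size_t nr_children;
};

/* Add @from into @to, counter by counter (the "merge" operation). */
static void rwstat_merge_sketch(struct rwstat_sketch *to,
                                const struct rwstat_sketch *from)
{
    assert(to != NULL && from != NULL);
    for (int i = 0; i < RW_NR; i++)
        to->cnt[i] += from->cnt[i];
}

/* Walk every descendant of @g, adding the stats of the online ones. */
static void sum_descendants(struct rwstat_sketch *sum,
                            const struct group_sketch *g)
{
    for (size_t i = 0; i < g->nr_children; i++) {
        const struct group_sketch *c = g->children[i];

        if (c->online)
            rwstat_merge_sketch(sum, &c->stat);
        sum_descendants(sum, c);
    }
}

/* Stats of @root plus those of all its online descendants. */
static struct rwstat_sketch
rwstat_recursive_sum_sketch(const struct group_sketch *root)
{
    struct rwstat_sketch sum = root->stat;

    sum_descendants(&sum, root);
    return sum;
}
```

In the real code the caller holds the queue lock, so the `online` flag cannot change underneath the walk; this single-threaded sketch has no equivalent of that.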
    • T
      blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ · 4d5e80a7
      Tejun Heo committed
      Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
      summing up stats from multiple blkgs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      4d5e80a7
    • T
      blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Tejun Heo committed
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag is also toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      f427d909
    • T
      blkcg: add blkg_policy_data->plid · b276a876
      Tejun Heo committed
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      b276a876
    • T
      cfq-iosched: add leaf_weight · e71357e1
      Tejun Heo committed
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg compete against both the tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg node has a hidden child leaf node which contains all
      its tasks; leaf_weight is the weight of that leaf node and is handled
      the same as the weight of the child blkgs.
      
      This patch only adds the leaf_weight field and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently officially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - i.e. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      e71357e1
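The "hidden leaf node" picture lends itself to a small arithmetic sketch. The function below computes, in per mille, how much of a blkg's service its own tasks would receive if the hidden leaf competed with the child blkgs purely in proportion to weight. This only illustrates the intended semantics: as the message says, this patch does not use leaf_weight anywhere yet, and `leaf_share_permille()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Share (in per mille) of a group's service that goes to the group's own
 * tasks, modelled as a hidden leaf with weight @leaf_weight competing
 * against @n child groups with the given weights.
 */
static unsigned leaf_share_permille(unsigned leaf_weight,
                                    const unsigned *child_weights, size_t n)
{
    unsigned long total = leaf_weight;

    for (size_t i = 0; i < n; i++)
        total += child_weights[i];
    assert(total > 0);      /* a zero total would divide by zero */
    return (unsigned)(1000UL * leaf_weight / total);
}
```

For instance, with a leaf_weight of 500 and two children of weight 250 each, the group's own tasks would get half of the group's service under this model. A zero total is rejected with an assertion, in the spirit of the divide-by-zero the v2 note fixes.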
    • T
      blkcg: make blkcg_gq's hierarchical · 3c547865
      Tejun Heo committed
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist, i.e. given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
      
      This change can be visible to userland.  For example, issuing IO in a
      nested cgroup previously didn't affect the ancestors at all; now it
      will initialize all ancestor blkgs, and zero stats for the request_queue
      will always appear on them.  While this is userland visible, this
      shouldn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      3c547865
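The rule introduced here (a blkg is only created once all its ancestors exist, and each child holds a reference to its parent) can be sketched as follows. The types `struct cg_sketch` and `struct gq_sketch` and the function `gq_get_or_create()` are hypothetical, the sketch models a single request queue, and it uses recursion for brevity without claiming the kernel walks the hierarchy the same way.

```c
#include <assert.h>
#include <stdlib.h>

struct gq_sketch;

/* Hypothetical cgroup-like node; one queue only, so one group pointer. */
struct cg_sketch {
    struct cg_sketch *parent;   /* NULL for the root */
    struct gq_sketch *gq;       /* NULL until created */
};

/* Hypothetical stand-in for a blkg. */
struct gq_sketch {
    struct gq_sketch *parent;   /* NULL for the root group */
    int refcnt;                 /* starts at 1 for the owner */
};

/* Return the group for @cg, creating it and any missing ancestors. */
static struct gq_sketch *gq_get_or_create(struct cg_sketch *cg)
{
    struct gq_sketch *parent_gq = NULL;
    struct gq_sketch *gq;

    assert(cg != NULL);
    if (cg->gq)
        return cg->gq;

    /* Ancestors first, so a child never exists without its parent. */
    if (cg->parent) {
        parent_gq = gq_get_or_create(cg->parent);
        if (!parent_gq)
            return NULL;
    }

    gq = calloc(1, sizeof(*gq));
    if (!gq)
        return NULL;
    gq->refcnt = 1;
    gq->parent = parent_gq;
    if (parent_gq)
        parent_gq->refcnt++;    /* the child pins its parent */
    cg->gq = gq;
    return gq;
}
```

Creating the group for a leaf in a three-level chain therefore also creates the two groups above it, which matches the userland-visible effect described in the message: ancestors gain (zeroed) stats as soon as a nested cgroup issues IO.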
  17. 27 Jun, 2012 1 commit
    • T
      blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo committed
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup with a very
      low weight, putting a program which can issue a lot of random direct
      IOs there, and running sequential IO from a different cgroup.  As soon
      as the request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg,
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit hairier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description updated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accommodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a051661c
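The isolation argument in this commit can be illustrated with a toy model in which each group draws from its own bounded pool and the root pool is the fallback when no group is available (the v3 note). Everything here is hypothetical (`struct req_list_sketch`, `pick_rl()`, `rl_alloc()`), and the model ignores waiting, wakeups and congestion state entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounded request pool. */
struct req_list_sketch {
    unsigned in_flight;
    unsigned limit;
};

/* Hypothetical queue; the root pool is embedded in it. */
struct queue_sketch {
    struct req_list_sketch root_rl;
};

/* Hypothetical per-cgroup group with its own pool. */
struct grp_sketch {
    struct req_list_sketch rl;
};

/* Pick the pool for an IO; fall back to the root pool when the group
 * lookup failed (modelled here as @g == NULL). */
static struct req_list_sketch *pick_rl(struct queue_sketch *q,
                                       struct grp_sketch *g)
{
    assert(q != NULL);
    return g ? &g->rl : &q->root_rl;
}

/* Take one request from @rl.  Returns 0 on success, -1 if the pool is full. */
static int rl_alloc(struct req_list_sketch *rl)
{
    assert(rl != NULL);
    if (rl->in_flight >= rl->limit)
        return -1;
    rl->in_flight++;
    return 0;
}
```

With separate pools, a group that exhausts its own limit gets -1 while a second group can still allocate, which is the property the shared request_list lacked.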
  18. 25 Jun, 2012 1 commit
  19. 20 Apr, 2012 3 commits
    • T
      blkcg: use radix tree to index blkgs from blkcg · a637120e
      Tejun Heo committed
      blkg lookup is currently performed by traversing a linked list anchored
      at blkcg->blkg_list.  This is very unscalable and with blk-throttle
      enabled and enough request queues on the system, this can get very
      ugly quickly (blk-throttle performs a lookup on every bio submission).
      
      This patch makes blkcg use a radix tree to index blkgs, combined with
      a simple last-looked-up hint.  This is mostly identical to how icqs are
      indexed from ioc.
      
      Note that because __blkg_lookup() may be invoked without holding the
      queue lock, the hint is only updated from __blkg_lookup_create().  Due
      to cfq's cfqq caching, this makes hint updates overly lazy.  This will be
      improved with scheduled blkcg aware request allocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a637120e
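The "index plus last-looked-up hint" idea can be sketched as below. To keep the example short and self-contained, a fixed-size array indexed by a queue id stands in for the radix tree; that substitution, the types, and the `update_hint` parameter (which models the lookup-only path versus the lookup-and-create path) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES_SKETCH 16

/* Hypothetical stand-in for a blkg; only the owning queue id matters here. */
struct gq_sketch {
    int q_id;
};

/* Hypothetical per-blkcg index. */
struct blkcg_index_sketch {
    struct gq_sketch *slots[MAX_QUEUES_SKETCH]; /* stand-in for the radix tree */
    struct gq_sketch *hint;                     /* last looked-up group */
};

/*
 * Find the group for @q_id.  The hint is checked first; on a miss the
 * index is consulted.  The hint is refreshed only when @update_hint is
 * set, mirroring the note that the lockless lookup path leaves it alone.
 */
static struct gq_sketch *index_lookup(struct blkcg_index_sketch *ix,
                                      int q_id, int update_hint)
{
    struct gq_sketch *g;

    assert(ix != NULL);
    g = ix->hint;
    if (g && g->q_id == q_id)
        return g;                       /* fast path: hint hit */

    if (q_id < 0 || q_id >= MAX_QUEUES_SKETCH)
        return NULL;
    g = ix->slots[q_id];
    if (g && update_hint)
        ix->hint = g;
    return g;
}
```

Repeated lookups for the same queue hit the hint and skip the index, which is the common case when one cgroup keeps issuing IO to one device.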
    • T
      blkcg: collapse blkcg_policy_ops into blkcg_policy · f9fcc2d3
      Tejun Heo committed
      There's no reason to keep blkcg_policy_ops separate.  Collapse it into
      blkcg_policy.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f9fcc2d3
    • T
      blkcg: embed struct blkg_policy_data in policy specific data · f95a04af
      Tejun Heo committed
      Currently blkg_policy_data carries policy specific data as a char flex
      array instead of being embedded in policy specific data.  This was
      forced by oddities around blkg allocation which are all gone now.
      
      This patch makes blkg_policy_data embedded in policy specific data -
      throtl_grp and cfq_group so that it's more conventional and consistent
      with how io_cq is handled.
      
      * blkcg_policy->pdata_size is renamed to ->pd_size.
      
      * Functions which used to take void *pdata now take struct
        blkg_policy_data *pd.
      
      * blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg().
      
      * Dummy struct blkg_policy_data definition added.  Dummy
        pdata_to_blkg() definition was unused and inconsistent with the
        non-dummy version - correct dummy pd_to_blkg() added.
      
      * throtl and cfq updated accordingly.
      
      * As dummy blkg_to_pd/pd_to_blkg() are provided,
        blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd.  Moved outside
        ifdef block.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f95a04af
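Embedding the generic structure inside the policy-specific one is the usual `container_of` idiom, sketched below in plain C. The `_sketch` types and the `container_of_sketch()` macro are hypothetical user-space stand-ins, and the `bps_limit` field is invented purely to give the outer structure some payload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic per-policy data. */
struct blkg_policy_data_sketch {
    int refs;
};

/* Hypothetical policy-specific structure embedding the generic part. */
struct throtl_grp_sketch {
    struct blkg_policy_data_sketch pd;  /* embedded, not pointed to */
    unsigned long bps_limit;            /* invented payload field */
};

/* User-space equivalent of the container_of idiom. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the policy-specific structure from its embedded generic part. */
static struct throtl_grp_sketch *
pd_to_tg_sketch(struct blkg_policy_data_sketch *pd)
{
    struct throtl_grp_sketch *tg;

    if (!pd)
        return NULL;
    tg = container_of_sketch(pd, struct throtl_grp_sketch, pd);
    assert(&tg->pd == pd);              /* round-trip sanity check */
    return tg;
}
```

Generic code can then pass around `struct blkg_policy_data_sketch *` with full type information, and each policy converts back to its own type without a `void *` cast, which is the stated motivation for replacing the char flex array.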