1. 02 6月, 2015 1 次提交
  2. 08 9月, 2014 1 次提交
    • T
      blkcg: remove blkcg->id · f4da8072
      Tejun Heo 提交于
      blkcg->id is a unique id given to each blkcg; however, the
      cgroup_subsys_state which each blkcg embeds already has ->serial_nr
      which can be used for the same purpose.  Drop blkcg->id and replace
      its uses with blkcg->css.serial_nr.  Rename cfq_cgroup->blkcg_id to
      ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
      consistency.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f4da8072
  3. 16 7月, 2014 1 次提交
    • T
      blkcg: don't call into policy draining if root_blkg is already gone · 2a1b4cf2
      Tejun Heo 提交于
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skippping calling into policy draining if all the blkgs are
      already gone.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NShirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reported-by: NJet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: NShirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2a1b4cf2
  4. 15 7月, 2014 2 次提交
    • T
      cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() · 2cf669a5
      Tejun Heo 提交于
      Currently, cftypes added by cgroup_add_cftypes() are used for both the
      unified default hierarchy and legacy ones and subsystems can mark each
      file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
      appear only on one of them.  This is quite hairy and error-prone.
      Also, we may end up exposing interface files to the default hierarchy
      without thinking it through.
      
      cgroup_subsys will grow two separate cftype addition functions and
      apply each only on the hierarchies of the matching type.  This will
      allow organizing cftypes in a lot clearer way and encourage subsystems
      to scrutinize the interface which is being exposed in the new default
      hierarchy.
      
      In preparation, this patch adds cgroup_add_legacy_cftypes() which
      currently is a simple wrapper around cgroup_add_cftypes() and replaces
      all cgroup_add_cftypes() usages with it.
      
      While at it, this patch drops a completely spurious return from
      __hugetlb_cgroup_file_init().
      
      This patch doesn't introduce any functional differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      2cf669a5
    • T
      cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      5577964e
  5. 12 7月, 2014 1 次提交
    • T
      blkcg: don't call into policy draining if root_blkg is already gone · 0b462c89
      Tejun Heo 提交于
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skippping calling into policy draining if all the blkgs are
      already gone.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NShirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reported-by: NJet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: NShirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0b462c89
  6. 09 7月, 2014 1 次提交
    • T
      blkcg, memcg: make blkcg depend on memcg on the default hierarchy · 1ced953b
      Tejun Heo 提交于
      Currently, the blkio subsystem attributes all of writeback IOs to the
      root.  One of the issues is that there's no way to tell who originated
      a writeback IO from block layer.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      cgroup now has a mechanism to express such dependency -
      cgroup_subsys->depends_on.  This patch declares that blkcg depends on
      memcg so that memcg is enabled automatically on the default hierarchy
      when available.  Future changes will make blkcg map the memcg tag to
      find out the cgroup to blame for writeback IOs.
      
      As this means that a memcg may be made invisible, this patch also
      implements css_reset() for memcg which resets its basic
      configurations.  This implementation will probably need to be expanded
      to cover other states which are used in the default hierarchy.
      
      v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
          build failure.  Reported by kbuild test robot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      1ced953b
  7. 23 6月, 2014 2 次提交
    • J
      Revert "block: add __init to blkcg_policy_register" · d5bf0291
      Jens Axboe 提交于
      This reverts commit a2d445d4.
      
      The original commit is buggy, we do use the registration functions
      at runtime for modular builds.
      d5bf0291
    • T
      blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t · a5049a8a
      Tejun Heo 提交于
      Hello,
      
      So, this patch should do.  Joe, Vivek, can one of you guys please
      verify that the oops goes away with this patch?
      
      Jens, the original thread can be read at
      
        http://thread.gmane.org/gmane.linux.kernel/1720729
      
      The fix converts blkg->refcnt from int to atomic_t.  It does some
      overhead but it should be minute compared to everything else which is
      going on and the involved cacheline bouncing, so I think it's highly
      unlikely to cause any noticeable difference.  Also, the refcnt in
      question should be converted to a perpcu_ref for blk-mq anyway, so the
      atomic_t is likely to go away pretty soon anyway.
      
      Thanks.
      
      ------- 8< -------
      __blkg_release_rcu() may be invoked after the associated request_queue
      is released with a RCU grace period inbetween.  As such, the function
      and callbacks invoked from it must not dereference the associated
      request_queue.  This is clearly indicated in the comment above the
      function.
      
      Unfortunately, while trying to fix a different issue, 2a4fd070
      ("blkcg: move bulk of blkcg_gq release operations to the RCU
      callback") ignored this and added [un]locking of @blkg->q->queue_lock
      to __blkg_release_rcu().  This of course can cause oops as the
      request_queue may be long gone by the time this code gets executed.
      
        general protection fault: 0000 [#1] SMP
        CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
        Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
        task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
        RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
        RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
        RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
        RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
        RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
        R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
        R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
        Stack:
         ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
         ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
         ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
        Call Trace:
         [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
         [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
         [<ffffffff81091d81>] kthread+0xe1/0x100
         [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
        Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
        +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
        +b7
        RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
         RSP <ffff88085403fdf0>
      
      The request_queue locking was added because blkcg_gq->refcnt is an int
      protected with the queue lock and __blkg_release_rcu() needs to put
      the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
      dropping queue locking in the function.
      
      Given the general heavy weight of the current request_queue and blkcg
      operations, this is unlikely to cause any noticeable overhead.
      Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
      the near future, so whatever (most likely negligible) overhead it may
      add is temporary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a5049a8a
  8. 11 6月, 2014 1 次提交
  9. 14 5月, 2014 1 次提交
    • T
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo 提交于
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
  10. 06 5月, 2014 1 次提交
    • T
      blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Tejun Heo 提交于
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
      which nests above both the kernfs s_active protection and cgroup_mutex
      is added to synchronize cgroup file type operations as cgroup_mutex
      needed to be grabbed from some file operations and thus can't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(s_active#89);
      				 lock(cgroup_tree_mutex);
      				 lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
      36c38fb7
  11. 19 2月, 2014 1 次提交
  12. 13 2月, 2014 1 次提交
    • T
      cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Tejun Heo 提交于
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      924f0d9a
  13. 08 2月, 2014 2 次提交
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo 提交于
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NIngo Molnar <mingo@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
    • T
      cgroup: drop module support · 3ed80a62
      Tejun Heo 提交于
      With module supported dropped from net_prio, no controller is using
      cgroup module support.  None of actual resource controllers can be
      built as a module and we aren't gonna add new controllers which don't
      control resources.  This patch drops module support from cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly.
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3ed80a62
  14. 12 9月, 2013 1 次提交
    • T
      blkcg: relocate root_blkg setting and clearing · 577cee1e
      Tejun Heo 提交于
      Hello, Jens.
      
      The original thread can be read from
      
       http://thread.gmane.org/gmane.linux.kernel.cgroups/8937
      
      While it leads to oops, given that it only triggers under specific
      configurations which aren't common.  I don't think it's necessary to
      backport it through -stable and merging it during the coming merge
      window should be enough.
      
      Thanks!
      
      ----- 8< -----
      Currently, q->root_blkg and q->root_rl.blkg are set from
      blkcg_activate_policy() and cleared from blkg_destroy_all().  This
      doesn't necessarily coincide with the lifetime of the root blkcg_gq
      leading to the following oops when blkcg is enabled but no policy is
      activated because __blk_queue_next_rl() malfunctions expecting the
      root_blkg pointers to be set.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        PGD 60f7a9067 PUD 60f4c9067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        gsmi: Log Shutdown Reason 0x03
        Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
        CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
        Hardware name: Intel XXX
        task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
        RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        RSP: 0000:ffff88060d171818  EFLAGS: 00010096
        RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
        RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
        Stack:
         0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
         0000000000000082 0000000000000003 0000000000000000 0000000000000000
         ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
        Call Trace:
         [<ffffffff810c7848>] __wake_up+0x48/0x70
         [<ffffffff8131da53>] __blk_drain_queue+0x123/0x190
         [<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210
         [<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0
         [<ffffffff81418824>] scsi_remove_device+0x34/0x50
         [<ffffffff814189cb>] scsi_remove_target+0x16b/0x220
         [<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0
         [<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0
         [<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60
         [<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100
         [<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0
         [<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560
         [<ffffffff81573335>] netlink_unicast+0x145/0x200
         [<ffffffff815736f3>] netlink_sendmsg+0x303/0x410
         [<ffffffff81528196>] sock_sendmsg+0xa6/0xd0
         [<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff811ea899>] ? fget_light+0x99/0x160
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90
         [<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20
         [<ffffffff815fb642>] system_call_fastpath+0x16/0x1b
        Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49
      
      Fix it by moving r->root_blkg and q->root_rl.blkg setting to
      blkg_create() and clearing to blkg_destroy() so that they area
      initialized when a root blkg is created and cleared when destroyed.
      Reported-and-tested-by: NAnatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      577cee1e
  15. 09 8月, 2013 6 次提交
    • T
      cgroup: make css_for_each_descendant() and friends include the origin css in the iteration · bd8815a6
      Tejun Heo 提交于
      Previously, all css descendant iterators didn't include the origin
      (root of subtree) css in the iteration.  The reasons were maintaining
      consistency with css_for_each_child() and that at the time of
      introduction more use cases needed skipping the origin anyway;
      however, given that css_is_descendant() considers self to be a
      descendant, omitting the origin css has become more confusing and
      looking at the accumulated use cases rather clearly indicates that
      including origin would result in simpler code overall.
      
      While this is a change which can easily lead to subtle bugs, cgroup
      API including the iterators has recently gone through major
      restructuring and no out-of-tree changes will be applicable without
      adjustments making this a relatively acceptable opportunity for this
      type of change.
      
      The conversions are mostly straight-forward.  If the iteration block
      had explicit origin handling before or after, it's moved inside the
      iteration.  If not, if (pos == origin) continue; is added.  Some
      conversions add extra reference get/put around origin handling by
      consolidating origin handling and the rest.  While the extra ref
      operations aren't strictly necessary, this shouldn't cause any
      noticeable difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      bd8815a6
    • T
      cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup · d99c8727
      Tejun Heo 提交于
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support where css's will
      be created and destroyed dynamically but also helps cleaning up
      subsystem implementations as css is usually what they are interested
      in anyway.
      
      cgroup_taskset which is used by the subsystem attach methods is the
      last cgroup subsystem API which isn't using css as the handle.  Update
      cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
      cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
      
      The conversions are pretty mechanical.  One exception is
      cpuset::cgroup_cs(), which lost its last user and got removed.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      d99c8727
    • T
      cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup · 492eb21b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using css
      (cgroup_subsys_state) as the primary handle instead of cgroup in
      subsystem API.  For hierarchy iterators, this is beneficial because
      
      * In most cases, css is the only thing subsystems care about anyway.
      
      * On the planned unified hierarchy, iterations for different
        subsystems will need to skip over different subtrees of the
        hierarchy depending on which subsystems are enabled on each cgroup.
        Passing around css makes it unnecessary to explicitly specify the
        subsystem in question as css is intersection between cgroup and
        subsystem
      
      * For the planned unified hierarchy, css's would need to be created
        and destroyed dynamically independent from cgroup hierarchy.  Having
        cgroup core manage css iteration makes enforcing deref rules a lot
        easier.
      
      Most subsystem conversions are straight-forward.  Noteworthy changes
      are
      
      * blkio: cgroup_to_blkcg() is no longer used.  Removed.
      
      * freezer: cgroup_freezer() is no longer used.  Removed.
      
      * devices: cgroup_to_devcgroup() is no longer used.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      492eb21b
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in file methods · 182446d0
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup.
      Please see the previous commit which converts the subsystem methods
      for rationale.
      
      This patch converts all cftype file operations to take @css instead of
      @cgroup.  cftypes for the cgroup core files don't have their subsytem
      pointer set.  These will automatically use the dummy_css added by the
      previous patch and can be converted the same way.
      
      Most subsystem conversions are straight forwards but there are some
      interesting ones.
      
      * freezer: update_if_frozen() is also converted to take @css instead
        of @cgroup for consistency.  This will make the code look simpler
        too once iterators are converted to use css.
      
      * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
        vmpressure while mem_cgroup_from_cont() can be made static.
        Updated accordingly.
      
      * cpu: cgroup_tg() doesn't have any user left.  Removed.
      
      * cpuacct: cgroup_ca() doesn't have any user left.  Removed.
      
      * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
        Removed.
      
      * net_cls: cgrp_cls_state() doesn't have any user left.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      182446d0
    • T
      cgroup: add subsys backlink pointer to cftype · 2bb566cb
      Tejun Heo 提交于
      cgroup is transitioning to using css (cgroup_subsys_state) instead of
      cgroup as the primary subsystem handle.  The cgroupfs file interface
      will be converted to use css's which requires finding out the
      subsystem from cftype so that the matching css can be determined from
      the cgroup.
      
      This patch adds cftype->ss which points to the subsystem the file
      belongs to.  The field is initialized while a cftype is being
      registered.  This makes it unnecessary to explicitly specify the
      subsystem for other cftype handling functions.  @ss argument dropped
      from various cftype handling functions.
      
      This patch shouldn't introduce any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      2bb566cb
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods · eb95419b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup *
      in subsystem implementations for the following reasons.
      
      * With unified hierarchy, subsystems will be dynamically bound and
        unbound from cgroups and thus css's (cgroup_subsys_state) may be
        created and destroyed dynamically over the lifetime of a cgroup,
        which is different from the current state where all css's are
        allocated and destroyed together with the associated cgroup.  This
        in turn means that cgroup_css() should be synchronized and may
        return NULL, making it more cumbersome to use.
      
      * Differing levels of per-subsystem granularity in the unified
        hierarchy means that the task and descendant iterators should behave
        differently depending on the specific subsystem the iteration is
        being performed for.
      
      * In majority of the cases, subsystems only care about its part in the
        cgroup hierarchy - ie. the hierarchy of css's.  Subsystem methods
        often obtain the matching css pointer from the cgroup and don't
        bother with the cgroup pointer itself.  Passing around css fits
        much better.
      
      This patch converts all cgroup_subsys methods to take @css instead of
      @cgroup.  The conversions are mostly straight-forward.  A few
      noteworthy changes are
      
      * ->css_alloc() now takes css of the parent cgroup rather than the
        pointer to the new cgroup as the css for the new cgroup doesn't
        exist yet.  Knowing the parent css is enough for all the existing
        subsystems.
      
      * In kernel/cgroup.c::offline_css(), unnecessary open coded css
        dereference is replaced with local variable access.
      
      This patch shouldn't cause any behavior differences.
      
      v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
          with local variable @css as suggested by Li Zefan.
      
          Rebased on top of new for-3.12 which includes for-3.11-fixes so
          that ->css_free() invocation added by da0a12ca ("cgroup: fix a
          leak when percpu_ref_init() fails") is converted too.  Suggested
          by Li Zefan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      eb95419b
  16. 15 5月, 2013 5 次提交
    • T
      blk-throttle: implement proper hierarchy support · 9138125b
      Tejun Heo 提交于
      With the recent updates, blk-throttle is finally ready for proper
      hierarchy support.  Dispatching now honors service_queue->parent_sq
      and propagates correctly.  The only thing missing is setting
      ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
      hierarchy.
      
      This patch updates throtl_pd_init() such that service_queues form the
      same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
      As this concludes proper hierarchy support for blkcg, the shameful
      .broken_hierarchy tag is removed from blkio_subsys.
      
      v2: Updated blkio-controller.txt as suggested by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      9138125b
    • T
      blkcg: move bulk of blkcg_gq release operations to the RCU callback · 2a4fd070
      Tejun Heo 提交于
      Currently, when the last reference of a blkcg_gq is put, all then
      release operations sans the actual freeing happen directly in
      blkg_put().  As blkg_put() may be called under queue_lock, all
      pd_exit_fn()s may be too.  This makes it impossible for pd_exit_fn()s
      to use del_timer_sync() on timers which grab the queue_lock which is
      an irq-safe lock due to the deadlock possibility described in the
      comment on top of del_timer_sync().
      
      This can be easily avoided by perfoming the release operations in the
      RCU callback instead of directly from blkg_put().  This patch moves
      the blkcg_gq release operations to the RCU callback.
      
      As this leaves __blkg_release() with only call_rcu() invocation,
      blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
      call_rcu() invocation is now done directly from blkg_put() instead of
      going through __blkg_release() which is removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2a4fd070
    • T
      blkcg: invoke blkcg_policy->pd_init() after parent is linked · db613670
      Tejun Heo 提交于
      Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
      invoked in blkg_alloc() before the parent is linked.  This makes it
      difficult for policies to perform initializations which are dependent
      on the parent.
      
      This patch moves pd_init_fn() invocations to blkg_create() after the
      parent blkg is linked where the new blkg is fully initialized.  As
      this means that blkg_free() can't assume that pd's are initialized,
      pd_exit_fn() invocations are moved to __blkg_release().  This
      guarantees that pd_exit_fn() is also invoked with fully initialized
      blkgs with valid parent pointers.
      
      This will help implementing hierarchy support in blk-throttle.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      db613670
    • T
      blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h · dd4a4ffc
      Tejun Heo 提交于
      blk-throttle hierarchy support will make use of it.  Move
      blkg_for_each_descendant_pre() from block/blk-cgroup.c to
      block/blk-cgroup.h.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      dd4a4ffc
    • T
      blkcg: fix error return path in blkg_create() · 2423c9c3
      Tejun Heo 提交于
      In blkg_create(), after lookup of parent fails, the control jumps to
      error path with the error code encoded into @blkg.  The error path
      doesn't use @blkg for the return value.  It returns ERR_PTR(ret).
      Make lookup fail path set @ret instead of @blkg.
      
      Note that the parent lookup is guaranteed to succeed at that point and
      the condition check is purely for sanity and triggers WARN when fails.
      As such, I don't think it's necessary to mark it for -stable.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2423c9c3
  17. 09 4月, 2013 1 次提交
    • J
      blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Jun'ichi Nomura 提交于
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e5072664
  18. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  19. 10 1月, 2013 10 次提交
    • T
      blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock · 810ecfa7
      Tejun Heo 提交于
      Instead of holding blkcg->lock while walking ->blkg_list and executing
      prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
      executing prfill().  This makes prfill() implementations easier as
      stats are mostly protected by queue lock.
      
      This will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      810ecfa7
    • T
      blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Tejun Heo 提交于
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make blkg online test reliable.  This is necessary to
      properly handle stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      16b3de66
    • T
      blkcg: export __blkg_prfill_rwstat() · b50da39f
      Tejun Heo 提交于
      Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
      Export it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      b50da39f
    • T
      blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Tejun Heo 提交于
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag also is toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      f427d909
    • T
      blkcg: add blkg_policy_data->plid · b276a876
      Tejun Heo 提交于
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      b276a876
    • T
      cfq-iosched: add leaf_weight · e71357e1
      Tejun Heo 提交于
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg to compete against both tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg nodes have a hidden child leaf node which contains all
      its tasks and leaf_weight is the weight of the leaf node and handled
      the same as the weight of the child blkgs.
      
      This patch only adds leaf_weight fields and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently offcially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - ie. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      e71357e1
    • T
      blkcg: make blkcg_gq's hierarchical · 3c547865
      Tejun Heo 提交于
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist.  ie. Given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
      
      This change can be visible to userland.  e.g. while issuing IO in a
      nested cgroup didn't affect the ancestors at all, now it will
      initialize all ancestor blkgs and zero stats for the request_queue
      will always appear on them.  While this is userland visible, this
      shouldn't cause any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      3c547865
    • T
      blkcg: cosmetic updates to blkg_create() · 93e6d5d8
      Tejun Heo 提交于
      * Rename out_* labels to err_*.
      
      * Do ERR_PTR() conversion once in the error return path.
      
      This patch is cosmetic and to prepare for the hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      93e6d5d8
    • T
      blkcg: reorganize blkg_lookup_create() and friends · 86cde6b6
      Tejun Heo 提交于
      Reorganize such that
      
      * __blkg_lookup() takes bool param @update_hint to determine whether
        to update hint.
      
      * __blkg_lookup_create() no longer performs lookup before trying to
        create.  Renamed to blkg_create().
      
      * blkg_lookup_create() now performs lookup and then invokes
        blkg_create() if lookup fails.
      
      * root_blkg creation in blkcg_activate_policy() updated accordingly.
        Note that blkcg_activate_policy() no longer updates lookup hint if
        root_blkg already exists.
      
      Except for the last lookup hint bit which is immaterial, this is pure
      reorganization and doesn't introduce any visible behavior change.
      This is to prepare for proper hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      86cde6b6
    • T
      blkcg: fix minor bug in blkg_alloc() · 356d2e58
      Tejun Heo 提交于
      blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
      The latter test should have been on whether pol->pd_init_fn() exists.
      This doesn't cause actual problems because both blkcg policies
      implement pol->pd_init_fn().  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      356d2e58