1. 09 Jan, 2009 2 commits
  2. 07 Jan, 2009 1 commit
  3. 06 Jan, 2009 1 commit
  4. 05 Jan, 2009 1 commit
    • cgroups: fix a race between cgroup_clone and umount · 7b574b7b
      Li Zefan committed
      The race occurs when cgroup_clone() is called while the ns cgroup subsys is
      being unmounted: cgroup_clone() might access an invalid cgroup_fs, or
      kill_sb() might be called after cgroup_clone() has created a new dir in it.
      
      The BUG I triggered is BUG_ON(root->number_of_cgroups != 1);
      
        ------------[ cut here ]------------
        kernel BUG at kernel/cgroup.c:1093!
        invalid opcode: 0000 [#1] SMP
        ...
        Process umount (pid: 5177, ti=e411e000 task=e40c4670 task.ti=e411e000)
        ...
        Call Trace:
         [<c0493df7>] ? deactivate_super+0x3f/0x51
         [<c04a3600>] ? mntput_no_expire+0xb3/0xdd
         [<c04a3ab2>] ? sys_umount+0x265/0x2ac
         [<c04a3b06>] ? sys_oldumount+0xd/0xf
         [<c0403911>] ? sysenter_do_call+0x12/0x31
        ...
        EIP: [<c0456e76>] cgroup_kill_sb+0x23/0xe0 SS:ESP 0068:e411ef2c
        ---[ end trace c766c1be3bf944ac ]---
      
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 24 Dec, 2008 2 commits
  6. 16 Dec, 2008 1 commit
    • cgroups: fix a race between rmdir and remount · 307257cf
      Paul Menage committed
      When a cgroup is removed, it's unlinked from its parent's children list,
      but not actually freed until the last dentry on it is released (at which
      point cgrp->root->number_of_cgroups is decremented).
      
      Currently rebind_subsystems checks for the top cgroup's child list being
      empty in order to rebind subsystems into or out of a hierarchy - this can
      result in the set of subsystems bound to a hierarchy being changed while
      the hierarchy still has a removed-but-not-freed cgroup.
      
      The simplest fix for this is to forbid remounts that change the set of
      subsystems on a hierarchy that has removed-but-not-freed cgroups.  This
      bug can be reproduced via:
      
      mkdir /mnt/cg
      mount -t cgroup -o ns,freezer cgroup /mnt/cg
      mkdir /mnt/cg/foo
      sleep 1h < /mnt/cg/foo &
      rmdir /mnt/cg/foo
      mount -t cgroup -o remount,ns,devices,freezer cgroup /mnt/cg
      kill $!
      
      The above causes an oops only in -mm, not in mainline, but the bug can
      cause a memory leak in mainline (and even an oops).
      Signed-off-by: Paul Menage <menage@google.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
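The rule described in the rmdir/remount fix above (refuse a remount that changes the bound subsystems while removed-but-not-freed cgroups remain) can be sketched in user-space C. This is only an illustration under stated assumptions: `hier_sketch`, `old_check_allows_rebind` and `rebind_sketch` are hypothetical names, not kernel symbols, and the use of `-EBUSY` is an assumption about how such a refusal might be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a mounted cgroup hierarchy. */
struct hier_sketch {
	unsigned long subsys_bits; /* bitmask of subsystems currently bound */
	int number_of_cgroups;     /* root plus every cgroup not yet freed */
	int nr_children;           /* entries still linked on the root's child list */
};

/* The older test: only looks at the child list, so it misses cgroups
 * that were unlinked by rmdir but are still pinned by a dentry. */
static bool old_check_allows_rebind(const struct hier_sketch *h)
{
	return h->nr_children == 0;
}

/* Sketch of the stricter rule: a change to the subsystem set is refused
 * while anything other than the root cgroup is still counted. */
static int rebind_sketch(struct hier_sketch *h, unsigned long new_bits)
{
	if (new_bits == h->subsys_bits)
		return 0;              /* nothing would change */
	if (h->number_of_cgroups > 1)
		return -EBUSY;         /* removed-but-not-freed cgroups remain */
	h->subsys_bits = new_bits;
	return 0;
}
```

With one removed-but-not-freed cgroup (`number_of_cgroups == 2`, `nr_children == 0`), the child-list test would allow the rebind while `rebind_sketch()` refuses it until the count drops back to 1.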
  7. 20 Nov, 2008 2 commits
  8. 14 Nov, 2008 3 commits
  9. 07 Nov, 2008 1 commit
    • cgroups: fix invalid cgrp->dentry before cgroup has been completely removed · 24eb0899
      Li Zefan committed
      This fixes an oops when reading /proc/sched_debug.
      
      A cgroup won't be removed completely until cgroup_diput() has finished, so
      we shouldn't invalidate cgrp->dentry in cgroup_rmdir().  Otherwise, when a
      cgroup is being removed while cgroup_path() gets called, we may trigger a
      NULL dereference BUG.
      
      The bug can be reproduced as follows:
      
       # cat test.sh
       #!/bin/sh
       mount -t cgroup -o cpu xxx /mnt
       while :
       do
      	mkdir /mnt/sub
      	rmdir /mnt/sub
       done
       # ./test.sh &
       # cat /proc/sched_debug
      
      BUG: unable to handle kernel NULL pointer dereference at 00000038
      IP: [<c045a47f>] cgroup_path+0x39/0x90
      ...
      Call Trace:
       [<c0420344>] ? print_cfs_rq+0x6e/0x75d
       [<c0421160>] ? sched_debug_show+0x72d/0xc1e
      ...
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		[2.6.26.x, 2.6.27.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 27 Oct, 2008 1 commit
  11. 20 Oct, 2008 2 commits
    • cgroups: convert tasks file to use a seq_file with shared pid array · cc31edce
      Paul Menage committed
      Rather than pre-generating the entire text for the "tasks" file each
      time the file is opened, we instead just generate/update the array of
      process ids and use a seq_file to report these to userspace.  All open
      file handles on the same "tasks" file can share a pid array, which may
      be updated any time that no thread is actively reading the array.  By
      sharing the array, the potential for userspace to DoS the system by
      opening many handles on the same "tasks" file is removed.
      
      [Based on a patch by Lai Jiangshan, extended to use seq_file]
      Signed-off-by: Paul Menage <menage@google.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
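The shared pid array idea from the commit above (one array per "tasks" file, refreshed only while nobody is actively reading it) might look roughly like this in user-space C. It is a single-threaded sketch, not the kernel's seq_file code: `pid_snapshot`, `snapshot_refresh`, `read_begin` and `read_end` are hypothetical names, and a plain reader counter stands in for whatever locking the real implementation uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared snapshot: one per "tasks" file, not one per open handle. */
struct pid_snapshot {
	int *pids;     /* current array of process ids */
	size_t len;    /* number of valid entries in pids */
	int readers;   /* handles currently in the middle of a read */
};

static void read_begin(struct pid_snapshot *s) { s->readers++; }
static void read_end(struct pid_snapshot *s)   { s->readers--; }

/* Replace the shared array with a fresh copy of the live pid list, but only
 * if no reader is active. Returns 1 if refreshed, 0 if skipped because a
 * read is in progress, -1 on allocation failure. */
static int snapshot_refresh(struct pid_snapshot *s, const int *live, size_t n)
{
	int *copy;

	if (s->readers > 0)
		return 0;
	copy = malloc((n ? n : 1) * sizeof(*copy));
	if (!copy)
		return -1;
	if (n)
		memcpy(copy, live, n * sizeof(*copy));
	free(s->pids);
	s->pids = copy;
	s->len = n;
	return 1;
}
```

Because every open handle points at the same `pid_snapshot`, opening the file many times does not multiply the memory held, which is the denial-of-service concern the commit message mentions.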
    • cgroups: fix probable race with put_css_set[_taskexit] and find_css_set · 146aa1bd
      Lai Jiangshan committed
      put_css_set_taskexit may be called on one cpu while find_css_set is being
      called on another, and the following race can occur:
      
      put_css_set_taskexit side                    find_css_set side
      
                                              |
      atomic_dec_and_test(&kref->refcount)    |
          /* kref->refcount = 0 */            |
      ....................................................................
                                              |  read_lock(&css_set_lock)
                                              |  find_existing_css_set
                                              |  get_css_set
                                              |  read_unlock(&css_set_lock);
      ....................................................................
      __release_css_set                       |
      ....................................................................
                                              | /* use a released css_set */
                                              |
      
      [The same applies to put_css_set. But in the current code, all put_css_set
      calls are made inside the cgroup mutex critical region, as is find_css_set.]
      
      [akpm@linux-foundation.org: repair comments]
      [menage@google.com: eliminate race in css_set refcounting]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
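The race in the diagram above comes from dropping the last reference outside the lock that lookups hold. One common way to close such a window, shown here as a user-space sketch with C11 atomics and a pthread mutex, is to let the count reach zero only while holding the lookup lock. This is an assumed illustration of the general technique, not the patch itself: `set_sketch`, `dec_unless_one`, `get_if_linked` and `put_sketch` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stands in for the lock that protects lookups of existing sets. */
static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;

struct set_sketch {
	atomic_int refcount;
	bool linked;           /* still findable by lookups; guarded by set_lock */
};

/* Fast path: decrement the count unless that would drop the last reference.
 * Returns true if the decrement happened. */
static bool dec_unless_one(atomic_int *v)
{
	int cur = atomic_load(v);

	while (cur != 1) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur - 1))
			return true;
	}
	return false;
}

/* Lookup side: a reference is only taken under the lock, and only on an
 * object that has not been unlinked. */
static bool get_if_linked(struct set_sketch *s)
{
	bool ok;

	pthread_mutex_lock(&set_lock);
	ok = s->linked;
	if (ok)
		atomic_fetch_add(&s->refcount, 1);
	pthread_mutex_unlock(&set_lock);
	return ok;
}

/* Release side: the final 1 -> 0 transition happens under the same lock,
 * so a lookup can never pick up an object whose count already hit zero.
 * Returns true if the caller dropped the last reference and may free. */
static bool put_sketch(struct set_sketch *s)
{
	bool last;

	if (dec_unless_one(&s->refcount))
		return false;
	pthread_mutex_lock(&set_lock);
	last = (atomic_fetch_sub(&s->refcount, 1) == 1);
	if (last)
		s->linked = false;
	pthread_mutex_unlock(&set_lock);
	return last;
}
```

If a lookup takes a new reference between the failed fast path and the lock acquisition, the slow-path decrement sees a count above 1 and the object simply stays alive.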
  12. 17 Oct, 2008 1 commit
  13. 29 Sep, 2008 1 commit
    • mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh committed
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via the mm_owner_changed() callback,
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
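The note above that mm->owner may become NULL, and that mem_cgroup_from_task() and its callers must cope with that, can be illustrated with a small user-space sketch. All types and names here (`group_sketch`, `task_sketch`, `mm_sketch`, `group_from_task`, `charge_sketch`) are hypothetical, and skipping the charge when there is no owner is an assumption about one reasonable way to handle the case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct group_sketch { long charged; };
struct task_sketch  { struct group_sketch *group; };
struct mm_sketch    { struct task_sketch *owner; };  /* may be NULL after exit */

/* Lookup that tolerates a missing owner instead of dereferencing it. */
static struct group_sketch *group_from_task(const struct task_sketch *t)
{
	return t ? t->group : NULL;
}

/* Caller side: when no owner (and hence no group) is found, skip the charge.
 * Returns 1 if the pages were charged to a group, 0 if nothing was charged. */
static int charge_sketch(struct mm_sketch *mm, long pages)
{
	struct group_sketch *g = group_from_task(mm->owner);

	if (!g)
		return 0;
	g->charged += pages;
	return 1;
}
```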
  14. 31 Jul, 2008 3 commits
  15. 27 Jul, 2008 2 commits
  16. 26 Jul, 2008 9 commits
  17. 25 May, 2008 1 commit
    • cgroups: remove node_ prefix from ns subsystem · 5c02b575
      Cedric Le Goater committed
      This is a slight change in the namespace cgroup subsystem api.
      
      The change is that previously when cgroup_clone() was called (currently
      only from the unshare path in the ns_proxy cgroup), you'd get a new group
      named "node_$pid", whereas now you'll get a group named after just your pid.
      
      The only users who would notice it are those who are using the ns_proxy
      cgroup subsystem to auto-create cgroups when namespaces are unshared -
      something of an experimental feature, which I think really needs more
      complete container/namespace support in order to be useful.  I suspect the
      only users are Cedric and Serge, or maybe a few others on
      containers@lists.linux-foundation.org.  And in fact it would only be
      noticed by the users who make the assumption about how the name is
      generated, rather than getting it from the /proc/<pid>/cgroup file for
      the process in question.
      
      Whether the change is actually needed or not I'm fairly agnostic on, but I
      guess it is more elegant to just use the pid as the new group name rather
      than adding a fairly arbitrary "node_" prefix on the front.
      
      [menage@google.com: provided changelog]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Paul Menage" <menage@google.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 30 Apr, 2008 1 commit
  19. 29 Apr, 2008 5 commits
    • cgroups: add an owner to the mm_struct · cf475ad2
      Balbir Singh committed
      Remove the mem_cgroup member from mm_struct and instead add an owner.
      
      This approach was suggested by Paul Menage.  The advantage of this approach
      is that, once the mm->owner is known, the cgroup can be determined using
      the subsystem id.  It also allows several control groups that are
      virtually grouped by mm_struct to exist independently of the memory
      controller, i.e., without adding a mem_cgroup-like member to mm_struct
      for each controller.
      
      A new config option CONFIG_MM_OWNER is added and the memory resource
      controller selects this config option.
      
      This patch also adds cgroup callbacks to notify subsystems when mm->owner
      changes.  The mm_owner_changed callback is called with the task_lock() of
      the new task held and is called just prior to changing the mm->owner.
      
      I am indebted to Paul Menage for the several reviews of this patchset and
      helping me make it lighter and simpler.
      
      This patch was tested on a powerpc box; it was compiled with the
      MM_OWNER config option turned both on and off.
      
      After the thread group leader exits, it's moved to init_css_set by
      cgroup_exit(), thus all future charges from running threads would be
      redirected to the init_css_set's subsystem.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: introduce cft->read_seq() · 29486df3
      Serge E. Hallyn committed
      Introduce a read_seq() helper in cftype, which uses seq_file to print out
      lists.  Use it in the devices cgroup.  Also split devices.allow into two
      files, so now devices.deny and devices.allow are the ones to use to manipulate
      the whitelist, while devices.list outputs the cgroup's current whitelist.
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: remove the css_set linked-list · 28fd5dfc
      Li Zefan committed
      Now we can run through the hash table instead of running through the
      linked-list.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: simplify init_subsys() · e8d55fde
      Li Zefan committed
      We are at system boot and there is only 1 cgroup group (i.e., init_css_set), so
      we don't need to run through the css_set linked list.  Neither do we need to
      run through the task list, since no processes have been created yet.
      
      Also referring to a comment in cgroup.h:
      
      struct css_set
      {
      	...
      	/*
      	 * Set of subsystem states, one for each subsystem. This array
      	 * is immutable after creation apart from the init_css_set
      	 * during subsystem registration (at boot time).
      	 */
      	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
      }
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: use a hash table for css_set finding · 472b1053
      Li Zefan committed
      When we attach a process to a different cgroup, the css_set linked-list will
      be run through to find a suitable existing css_set to use.  This patch
      implements a hash table for better performance.
      
      The following benchmark was run:
      
      For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
      task in each, and then move an additional task through each cgroup in
      turn.
      
      Here are the test results:
      
      N	Loop	orig - Time(s)	hash - Time(s)
      ----------------------------------------------
      1	10000	1.201231728	1.196311177
      5	2000	1.065743872	1.040566424
      10	1000	0.991054735	0.986876440
      50	200	0.976554203	0.969608733
      100	100	0.998504680	0.969218270
      500	20	1.157347764	0.962602963
      1000	10	1.619521852	1.085140172
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
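The hash-table lookup described in the commit above replaces a walk over every css_set with a walk over one bucket. A minimal user-space sketch of that idea follows, assuming the key is the array of per-subsystem state pointers. The names (`set_entry`, `hash_template`, `set_insert`, `set_find`), the table size, and the pointer-mixing function are illustrative choices, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSUBSYS   4                  /* illustrative number of subsystems */
#define HASH_BITS 7
#define HASH_SIZE (1u << HASH_BITS)

/* Hypothetical stand-in for a css_set: one state pointer per subsystem. */
struct set_entry {
	const void *subsys[NSUBSYS];
	struct set_entry *next;          /* chain within one bucket */
};

static struct set_entry *table[HASH_SIZE];

/* Mix the pointer values into a bucket index (illustrative hash only). */
static unsigned hash_template(const void *const *subsys)
{
	uintptr_t sum = 0;

	for (size_t i = 0; i < NSUBSYS; i++)
		sum += (uintptr_t)subsys[i];
	sum = (sum >> 16) ^ sum;
	return (unsigned)(sum & (HASH_SIZE - 1));
}

/* Link an entry at the head of its bucket. */
static void set_insert(struct set_entry *e)
{
	unsigned b = hash_template(e->subsys);

	e->next = table[b];
	table[b] = e;
}

/* Find an existing entry with exactly these subsystem states, scanning
 * only the matching bucket rather than every entry. */
static struct set_entry *set_find(const void *const *subsys)
{
	struct set_entry *e;

	for (e = table[hash_template(subsys)]; e; e = e->next)
		if (memcmp(e->subsys, subsys, sizeof(e->subsys)) == 0)
			return e;
	return NULL;
}
```

The benchmark table in the commit message is consistent with this shape: the gain shows up mainly at large N, where a linear walk would otherwise visit many entries.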