1. 20 Oct, 2008 1 commit
    • cgroups: fix probable race with put_css_set[_taskexit] and find_css_set · 146aa1bd
      Committed by Lai Jiangshan
      put_css_set_taskexit may be called while find_css_set is running on
      another CPU, in which case the following race can occur:
      
      put_css_set_taskexit side                    find_css_set side
      
                                              |
      atomic_dec_and_test(&kref->refcount)    |
          /* kref->refcount = 0 */            |
      ....................................................................
                                              |  read_lock(&css_set_lock)
                                              |  find_existing_css_set
                                              |  get_css_set
                                              |  read_unlock(&css_set_lock);
      ....................................................................
      __release_css_set                       |
      ....................................................................
                                              | /* use a released css_set */
                                              |
      
      [put_css_set has the same problem. In the current code, however, all
      put_css_set calls are made inside the cgroup mutex critical section, as
      is find_css_set.]
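
One common way to close this kind of race is to let only non-final decrements run without the lock, and to perform the decrement that may reach zero while holding the same lock the lookup side takes. The following is a minimal userspace sketch of that idea, not the kernel patch itself: `struct obj`, `table_lock`, `obj_put` and `obj_find` are hypothetical stand-ins for css_set, css_set_lock, put_css_set and find_css_set, and a plain pthread mutex is assumed in place of the kernel's read/write lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for css_set: refcounted and found through a lookup. */
struct obj {
    atomic_int refcount;
    bool linked;    /* still visible to lookups; protected by table_lock */
    bool released;  /* set where the real code would free the object */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Decrement the counter unless it is exactly 1; true if it was decremented. */
static bool dec_unless_one(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 1) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old - 1))
            return true;
    }
    return false;
}

static void obj_put(struct obj *o)
{
    assert(atomic_load(&o->refcount) > 0);

    /* Fast path: not the last reference, so no lock is needed. */
    if (dec_unless_one(&o->refcount))
        return;

    /* Slow path: the count may drop to zero, so exclude lookups first. */
    pthread_mutex_lock(&table_lock);
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        o->linked = false;   /* unlink before anyone else can find it */
        o->released = true;
    }
    pthread_mutex_unlock(&table_lock);
}

/* Lookup side: a reference is only taken on objects that are still linked. */
static struct obj *obj_find(struct obj *candidate)
{
    struct obj *found = NULL;

    pthread_mutex_lock(&table_lock);
    if (candidate->linked) {
        atomic_fetch_add(&candidate->refcount, 1);
        found = candidate;
    }
    pthread_mutex_unlock(&table_lock);
    return found;
}
```

With this arrangement a lookup either runs before the final decrement (and keeps the object alive) or after the unlink (and does not find it), so it can never take a reference on an object that is already being released.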
      
      [akpm@linux-foundation.org: repair comments]
      [menage@google.com: eliminate race in css_set refcounting]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 Oct, 2008 1 commit
  3. 29 Sep, 2008 1 commit
    • mm owner: fix race between swapoff and exit · 31a78f23
      Committed by Balbir Singh
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via the mm_owner_changed() callback
      when no new owner is found, passing NULL as the new task.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
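
The two follow-up notes above can be illustrated together: the callback has to run while the old owner pointer is still valid, and any code that reads the owner afterwards has to tolerate NULL. This is a simplified sketch under those assumptions. `struct task`, `struct mm`, `owner_changed`, `update_next_owner` and `owner_id_or` are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct task { int id; };
struct mm   { struct task *owner; };

/* Hypothetical subsystem callback; it records what it was told. */
static struct task *last_old, *last_next;
static int callback_calls;

static void owner_changed(struct task *old, struct task *next)
{
    assert(old != NULL);    /* passing NULL as the old owner was the oops */
    last_old = old;
    last_next = next;       /* NULL means "no new owner was found" */
    callback_calls++;
}

/* Hand ownership to next, which may be NULL when no candidate exists. */
static void update_next_owner(struct mm *mm, struct task *next)
{
    struct task *old = mm->owner;

    owner_changed(old, next);   /* notify first, while old is still set */
    mm->owner = next;           /* only then overwrite, possibly with NULL */
}

/* Reader side: never dereference the owner without checking for NULL. */
static int owner_id_or(const struct mm *mm, int fallback)
{
    return mm->owner ? mm->owner->id : fallback;
}
```

A caller that finds no owner gets the fallback value instead of dereferencing a stale or NULL pointer.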
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 31 Jul, 2008 3 commits
  5. 27 Jul, 2008 2 commits
  6. 26 Jul, 2008 9 commits
  7. 25 May, 2008 1 commit
    • cgroups: remove node_ prefix from ns subsystem · 5c02b575
      Committed by Cedric Le Goater
      This is a slight change in the namespace cgroup subsystem api.
      
      The change is that previously, when cgroup_clone() was called (currently
      only from the unshare path in the ns_proxy cgroup), you'd get a new group
      named "node_$pid", whereas now you'll get a group named after just your pid.
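
As a small illustration of the naming change described above, the sketch below formats both variants of the group name. `clone_group_name` and its `legacy_prefix` flag are hypothetical and exist only to show the two formats side by side. The kernel code does not take such a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write the name of a cloned group into buf.
 * legacy_prefix = true  gives the old scheme, "node_<pid>"
 * legacy_prefix = false gives the new scheme, "<pid>"
 * Returns the value of snprintf(), i.e. the length of the full name.
 */
static int clone_group_name(char *buf, size_t len, int pid, bool legacy_prefix)
{
    assert(buf != NULL && len > 0);

    if (legacy_prefix)
        return snprintf(buf, len, "node_%d", pid);
    return snprintf(buf, len, "%d", pid);
}
```

Tools that hard-code either format would break on the other one, which is why reading the name from the /proc/<pid>/cgroups file is the more robust choice.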
      
      The only users who would notice it are those who are using the ns_proxy
      cgroup subsystem to auto-create cgroups when namespaces are unshared -
      something of an experimental feature, which I think really needs more
      complete container/namespace support in order to be useful.  I suspect the
      only users are Cedric and Serge, or maybe a few others on
      containers@lists.linux-foundation.org.  And in fact it would only be
      noticed by the users who make the assumption about how the name is
      generated, rather than getting it from the /proc/<pid>/cgroups file for
      the process in question.
      
      Whether the change is actually needed or not I'm fairly agnostic on, but I
      guess it is more elegant to just use the pid as the new group name rather
      than adding a fairly arbitrary "node_" prefix on the front.
      
      [menage@google.com: provided changelog]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Paul Menage" <menage@google.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 30 Apr, 2008 1 commit
  9. 29 Apr, 2008 15 commits
  10. 18 Apr, 2008 1 commit
    • cgroup: fix a race condition in manipulating tsk->cg_list · 0e04388f
      Committed by Li Zefan
      When I ran a test program that forks a large number of processes while
      running 'cat /cgroup/tasks' at the same time, I got the following oops:
      
        ------------[ cut here ]------------
        kernel BUG at lib/list_debug.c:72!
        invalid opcode: 0000 [#1] SMP
        Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72)
        ...
        Call Trace:
         [<c044a5f9>] ? cgroup_exit+0x55/0x94
         [<c0427acf>] ? do_exit+0x217/0x5ba
         [<c0427ed7>] ? do_group_exit+0x65/0x7c
         [<c0427efd>] ? sys_exit_group+0xf/0x11
         [<c0404842>] ? syscall_call+0x7/0xb
         [<c05e0000>] ? init_cyrix+0x2fa/0x479
        ...
        EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4
        ---[ end trace caffb7332252612b ]---
        Fixing recursive fault but reboot is needed!
      
      After digging into the code and debugging, I finally found a race
      condition:
      
      				do_exit()
      				  ->cgroup_exit()
      				    ->if (!list_empty(&tsk->cg_list))
      				        list_del(&tsk->cg_list);
      
        cgroup_iter_start()
          ->cgroup_enable_task_cg_list()
            ->list_add(&tsk->cg_list, ..);
      
      In this case the list entry is never deleted even though the process has
      exited.
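
The interleaving above can be ruled out if the iterator side refuses to link a task that has already started exiting, and both sides make their check and their list update under one lock. The sketch below shows that idea in userspace C. It is an illustration under assumptions, not the actual patch: the `exiting` flag, `list_lock`, `enable_task_cg_list` and `task_exit` are hypothetical, and the small list helpers only mimic the kernel's list API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del_init(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

struct task {
    bool exiting;               /* set once the task has begun to exit */
    struct list_head cg_list;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Iterator side: link the task only if it is alive and not yet linked. */
static void enable_task_cg_list(struct task *t, struct list_head *tasks)
{
    pthread_mutex_lock(&list_lock);
    if (!t->exiting && list_empty(&t->cg_list))
        list_add(&t->cg_list, tasks);
    pthread_mutex_unlock(&list_lock);
}

/* Exit side: mark the task as exiting and unlink it under the same lock. */
static void task_exit(struct task *t)
{
    pthread_mutex_lock(&list_lock);
    t->exiting = true;
    if (!list_empty(&t->cg_list))
        list_del_init(&t->cg_list);
    assert(list_empty(&t->cg_list));
    pthread_mutex_unlock(&list_lock);
}
```

Because the flag is set and tested under the same lock as the list operations, a task that has run its exit path can no longer be added to the list afterwards.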
      
      We got two bug reports in the past, which seem to be the same bug as
      this one:
      	http://lkml.org/lkml/2008/3/5/332
      	http://lkml.org/lkml/2007/10/17/224
      
      In fact, I sometimes got an oops in list_del and sometimes in list_add,
      and a slightly modified test program can trigger other oopses.
      
      The patch has been tested both on x86_32 and x86_64.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 11 Apr, 2008 1 commit
  12. 05 Apr, 2008 1 commit
    • cgroups: add cgroup support for enabling controllers at boot time · 8bab8dde
      Committed by Paul Menage
      The effects of cgroup_disable=foo are:
      
      - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
      - foo isn't visible as an individually mountable subsystem
      
      As a result there will only ever be one call to foo->create(), at init time;
      all processes will stay in this group, and the group will never be mounted on
      a visible hierarchy.  Any additional effects (e.g.  not allocating metadata)
      are up to the foo subsystem.
      
      This doesn't handle early_init subsystems (their "disabled" bit isn't set),
      but it could easily be extended to do so if any of the early_init systems
      wanted it - I think it would just involve some nastier parameter processing
      since it would occur before the command-line argument parser had been run.
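
To make the described behaviour concrete, here is a rough userspace sketch of how a cgroup_disable= argument could be matched against a table of subsystems and turned into a per-subsystem "disabled" bit. The table contents, `parse_cgroup_disable` and `subsys_mountable` are hypothetical, and accepting a comma-separated list of names is an assumption made for the example, not something the changelog states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct subsys {
    const char *name;
    bool disabled;
};

/* Hypothetical subsystem table, for illustration only. */
static struct subsys subsystems[] = {
    { "cpuset", false },
    { "memory", false },
    { "ns",     false },
};
#define NSUBSYS (sizeof(subsystems) / sizeof(subsystems[0]))

/* Mark every subsystem named in arg as disabled; return how many matched. */
static int parse_cgroup_disable(const char *arg)
{
    int matched = 0;

    assert(arg != NULL);
    while (*arg) {
        size_t len = strcspn(arg, ",");     /* length of the current token */

        for (size_t i = 0; i < NSUBSYS; i++) {
            /* Require an exact name match, not just a common prefix. */
            if (strlen(subsystems[i].name) == len &&
                strncmp(arg, subsystems[i].name, len) == 0) {
                subsystems[i].disabled = true;
                matched++;
            }
        }
        arg += len;
        if (*arg == ',')
            arg++;                          /* skip the separator */
    }
    return matched;
}

/* A disabled subsystem is neither auto-mounted nor individually mountable. */
static bool subsys_mountable(const char *name)
{
    for (size_t i = 0; i < NSUBSYS; i++)
        if (strcmp(subsystems[i].name, name) == 0)
            return !subsystems[i].disabled;
    return false;
}
```

Mount-time code would then consult `subsys_mountable()` before binding a subsystem to a hierarchy, which corresponds to the two effects listed at the top of this changelog.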
      
      Hugh said:
      
        Ballpark figures, I'm trying to get this question out rather than
        processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
        to the affected paths, booting with cgroup_disable=memory cuts that back to
        1% overhead (due to slightly bigger struct page).
      
        I'm no expert on distros, they may have no interest whatever in
        CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
        without it, or apply the cgroup_disable=memory patches.
      
      UnixBench's execl test results on x86_64 were

      == just after boot without mounting any cgroup fs.==
      mem_cgroup=off : Execl Throughput       43.0     3150.1      732.6
      mem_cgroup=on  : Execl Throughput       43.0     2932.6      682.0
      ==
      
      [lizf@cn.fujitsu.com: fix boot option parsing]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 31 Mar, 2008 1 commit
  14. 05 Mar, 2008 1 commit
  15. 24 Feb, 2008 1 commit