1. 27 10月, 2008 1 次提交
  2. 20 10月, 2008 2 次提交
    • P
      cgroups: convert tasks file to use a seq_file with shared pid array · cc31edce
      Paul Menage 提交于
      Rather than pre-generating the entire text for the "tasks" file each
      time the file is opened, we instead just generate/update the array of
      process ids and use a seq_file to report these to userspace.  All open
      file handles on the same "tasks" file can share a pid array, which may
      be updated any time that no thread is actively reading the array.  By
      sharing the array, the potential for userspace to DoS the system by
      opening many handles on the same "tasks" file is removed.
      
      [Based on a patch by Lai Jiangshan, extended to use seq_file]
      Signed-off-by: NPaul Menage <menage@google.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc31edce
    • L
      cgroups: fix probable race with put_css_set[_taskexit] and find_css_set · 146aa1bd
      Lai Jiangshan 提交于
      put_css_set_taskexit may be called when find_css_set is called on other
      cpu.  And the race will occur:
      
      put_css_set_taskexit side                    find_css_set side
      
                                              |
      atomic_dec_and_test(&kref->refcount)    |
          /* kref->refcount = 0 */            |
      ....................................................................
                                              |  read_lock(&css_set_lock)
                                              |  find_existing_css_set
                                              |  get_css_set
                                              |  read_unlock(&css_set_lock);
      ....................................................................
      __release_css_set                       |
      ....................................................................
                                              | /* use a released css_set */
                                              |
      
      [put_css_set is the same. But in the current code, all put_css_set are
      put into cgroup mutex critical region as the same as find_css_set.]
      
      [akpm@linux-foundation.org: repair comments]
      [menage@google.com: eliminate race in css_set refcounting]
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      146aa1bd
  3. 17 10月, 2008 1 次提交
  4. 29 9月, 2008 1 次提交
    • B
      mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh 提交于
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via mm_owner_changed callback(),
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31a78f23
  5. 31 7月, 2008 3 次提交
  6. 27 7月, 2008 2 次提交
  7. 26 7月, 2008 9 次提交
  8. 25 5月, 2008 1 次提交
    • C
      cgroups: remove node_ prefix_from ns subsystem · 5c02b575
      Cedric Le Goater 提交于
      This is a slight change in the namespace cgroup subsystem api.
      
      The change is that previously when cgroup_clone() was called (currently
      only from the unshare path in ns_proxy cgroup, you'd get a new group named
      "node_$pid" whereas now you'll get a group named after just your pid.)
      
      The only users who would notice it are those who are using the ns_proxy
      cgroup subsystem to auto-create cgroups when namespaces are unshared -
      something of an experimental feature, which I think really needs more
      complete container/namespace support in order to be useful.  I suspect the
      only users are Cedric and Serge, or maybe a few others on
      containers@lists.linux-foundation.org.  And in fact it would only be
      noticed by the users who make the assumption about how the name is
      generated, rather than getting it from the /proc/<pid>/cgroups file for
      the process in question.
      
      Whether the change is actually needed or not I'm fairly agnostic on, but I
      guess it is more elegant to just use the pid as the new group name rather
      than adding a fairly arbitrary "node_" prefix on the front.
      
      [menage@google.com: provided changelog]
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Cc: "Paul Menage" <menage@google.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c02b575
  9. 30 4月, 2008 1 次提交
  10. 29 4月, 2008 15 次提交
  11. 18 4月, 2008 1 次提交
    • L
      cgroup: fix a race condition in manipulating tsk->cg_list · 0e04388f
      Li Zefan 提交于
      When I ran a test program to fork mass processes and at the same time
      'cat /cgroup/tasks', I got the following oops:
      
        ------------[ cut here ]------------
        kernel BUG at lib/list_debug.c:72!
        invalid opcode: 0000 [#1] SMP
        Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72)
        ...
        Call Trace:
         [<c044a5f9>] ? cgroup_exit+0x55/0x94
         [<c0427acf>] ? do_exit+0x217/0x5ba
         [<c0427ed7>] ? do_group_exit+0.65/0x7c
         [<c0427efd>] ? sys_exit_group+0xf/0x11
         [<c0404842>] ? syscall_call+0x7/0xb
         [<c05e0000>] ? init_cyrix+0x2fa/0x479
        ...
        EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4
        ---[ end trace caffb7332252612b ]---
        Fixing recursive fault but reboot is needed!
      
      After digging into the code and debugging, I finlly found out a race
      situation:
      
      				do_exit()
      				  ->cgroup_exit()
      				    ->if (!list_empty(&tsk->cg_list))
      				        list_del(&tsk->cg_list);
      
        cgroup_iter_start()
          ->cgroup_enable_task_cg_list()
            ->list_add(&tsk->cg_list, ..);
      
      In this case the list won't be deleted though the process has exited.
      
      We got two bug reports in the past, which seem to be the same bug as
      this one:
      	http://lkml.org/lkml/2008/3/5/332
      	http://lkml.org/lkml/2007/10/17/224
      
      Actually sometimes I got oops on list_del, sometimes oops on list_add.
      And I can change my test program a bit to trigger other oops.
      
      The patch has been tested both on x86_32 and x86_64.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e04388f
  12. 11 4月, 2008 1 次提交
  13. 05 4月, 2008 1 次提交
    • P
      cgroups: add cgroup support for enabling controllers at boot time · 8bab8dde
      Paul Menage 提交于
      The effects of cgroup_disable=foo are:
      
      - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
      - foo isn't visible as an individually mountable subsystem
      
      As a result there will only ever be one call to foo->create(), at init time;
      all processes will stay in this group, and the group will never be mounted on
      a visible hierarchy.  Any additional effects (e.g.  not allocating metadata)
      are up to the foo subsystem.
      
      This doesn't handle early_init subsystems (their "disabled" bit isn't set be,
      but it could easily be extended to do so if any of the early_init systems
      wanted it - I think it would just involve some nastier parameter processing
      since it would occur before the command-line argument parser had been run.
      
      Hugh said:
      
        Ballpark figures, I'm trying to get this question out rather than
        processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
        to the affected paths, booting with cgroup_disable=memory cuts that back to
        1% overhead (due to slightly bigger struct page).
      
        I'm no expert on distros, they may have no interest whatever in
        CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
        without it, or apply the cgroup_disable=memory patches.
      
      Unix bench's execl test result on x86_64 was
      
      == just after boot without mounting any cgroup fs.==
      mem_cgorup=off : Execl Throughput       43.0     3150.1      732.6
      mem_cgroup=on  : Execl Throughput       43.0     2932.6      682.0
      ==
      
      [lizf@cn.fujitsu.com: fix boot option parsing]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bab8dde
  14. 31 3月, 2008 1 次提交