1. 03 11月, 2011 1 次提交
    • A
      memcg: replace ss->id_lock with a rwlock · c1e2ee2d
      Andrew Bresticker 提交于
      While back-porting Johannes Weiner's patch "mm: memcg-aware global
      reclaim" for an internal effort, we noticed a significant performance
      regression during page-reclaim heavy workloads due to high contention of
      the ss->id_lock.  This lock protects idr map, and serializes calls to
      idr_get_next() in css_get_next() (which is used during the memcg hierarchy
      walk).
      
      Since idr_get_next() is just doing a look up, we need only serialize it
      with respect to idr_remove()/idr_get_new().  By making the ss->id_lock a
      rwlock, contention is greatly reduced and performance improves.
      
      Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
      each core (one file + container per core) in parallel on a NUMA machine.
      Result is the time for the test to complete in 1 of the containers.
      Both kernels included Johannes' memcg-aware global reclaim patches.
      
      Before rwlock patch: 1710.778s
      After rwlock patch: 152.227s
      Signed-off-by: NAndrew Bresticker <abrestic@google.com>
      Cc: Paul Menage <menage@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1e2ee2d
  2. 09 7月, 2011 1 次提交
  3. 27 5月, 2011 2 次提交
    • D
      cgroup: remove the ns_cgroup · a77aea92
      Daniel Lezcano 提交于
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without falling in a exponential creation time
        * we may want to create a namespace without creating a cgroup
      
        The ns_cgroup was replaced by a compatibility flag 'clone_children',
        where a newly created cgroup will copy the parent cgroup values.
        The userspace has to manually create a cgroup and add a task to
        the 'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk warning users that the feature is planned for removal.  Since that
      time we have heard from XXX users who were affected by this.
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      Acked-by: NMatt Helsley <matthltc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a77aea92
    • B
      cgroups: add per-thread subsystem callbacks · f780bdb7
      Ben Blum 提交于
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, as replaced by
      this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
  4. 31 3月, 2011 1 次提交
  5. 16 2月, 2011 2 次提交
    • S
      perf: Add cgroup support · e5d1367f
      Stephane Eranian 提交于
      This kernel patch adds the ability to filter monitoring based on
      container groups (cgroups). This is for use in per-cpu mode only.
      
      The cgroup to monitor is passed as a file descriptor in the pid
      argument to the syscall. The file descriptor must be opened to
      the cgroup name in the cgroup filesystem. For instance, if the
      cgroup name is foo and cgroupfs is mounted in /cgroup, then the
      file descriptor is opened to /cgroup/foo. Cgroup mode is
      activated by passing PERF_FLAG_PID_CGROUP in the flags argument
      to the syscall.
      
      For instance to measure in cgroup foo on CPU1 assuming
      cgroupfs is mounted under /cgroup:
      
      struct perf_event_attr attr;
      int cgroup_fd, fd;
      
      cgroup_fd = open("/cgroup/foo", O_RDONLY);
      fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
      close(cgroup_fd);
      Signed-off-by: NStephane Eranian <eranian@google.com>
      [ added perf_cgroup_{exit,attach} ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e5d1367f
    • P
      cgroup: Fix cgroup_subsys::exit callback · d41d5a01
      Peter Zijlstra 提交于
      Make the ::exit method act like ::attach, it is after all very nearly
      the same thing.
      
      The bug had no effect on correctness - fixing it is an optimization for
      the scheduler. Also, later perf-cgroups patches rely on it.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Menage <menage@google.com>
      LKML-Reference: <1297160655.13327.92.camel@laptop>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d41d5a01
  6. 02 11月, 2010 1 次提交
  7. 28 10月, 2010 1 次提交
    • D
      cgroup: add clone_children control file · 97978e6d
      Daniel Lezcano 提交于
      The ns_cgroup is a control group interacting with the namespaces.  When a
      new namespace is created, a corresponding cgroup is automatically created
      too.  The cgroup name is the pid of the process who did 'unshare' or the
      child of 'clone'.
      
      This cgroup is tied with the namespace because it prevents a process to
      escape the control group and use the post_clone callback, so the child
      cgroup inherits the values of the parent cgroup.
      
      Unfortunately, the more we use this cgroup and the more we are facing
      problems with it:
      
      (1) when a process unshares, the cgroup name may conflict with a
          previous cgroup with the same pid, so unshare or clone return -EEXIST
      
      (2) the cgroup creation is out of control because there may have an
          application creating several namespaces where the system will
          automatically create several cgroups in his back and let them on the
          cgroupfs (eg.  a vrf based on the network namespace).
      
      (3) the mix of (1) and (2) force an administrator to regularly check
          and clean these cgroups.
      
      This patchset removes the ns_cgroup by adding a new flag to the cgroup and
      the cgroupfs mount option.  It enables the copy of the parent cgroup when
      a child cgroup is created.  We can then safely remove the ns_cgroup as
      this flag brings a compatibility.  We have now to manually create and add
      the task to a cgroup, which is consistent with the cgroup framework.
      
      This patch:
      
      Sent as an answer to a previous thread around the ns_cgroup.
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
      
      It adds a control file 'clone_children' for a cgroup.  This control file
      is a boolean specifying if the child cgroup should be a clone of the
      parent cgroup or not.  The default value is 'false'.
      
      This flag makes the child cgroup to call the post_clone callback of all
      the subsystem, if it is available.
      
      At present, the cpuset is the only one which had implemented the
      post_clone callback.
      
      The option can be set at mount time by specifying the 'clone_children'
      mount option.
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Acked-by: NPaul Menage <menage@google.com>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97978e6d
  8. 10 9月, 2010 1 次提交
  9. 05 9月, 2010 1 次提交
    • M
      cgroups: fix API thinko · 73457f0f
      Michael S. Tsirkin 提交于
      cgroup_attach_task_current_cg API that have upstream is backwards: we
      really need an API to attach to the cgroups from another process A to
      the current one.
      
      In our case (vhost), a priveledged user wants to attach it's task to cgroups
      from a less priveledged one, the API makes us run it in the other
      task's context, and this fails.
      
      So let's make the API generic and just pass in 'from' and 'to' tasks.
      Add an inline wrapper for cgroup_attach_task_current_cg to avoid
      breaking bisect.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      73457f0f
  10. 20 8月, 2010 1 次提交
  11. 28 7月, 2010 1 次提交
  12. 09 6月, 2010 1 次提交
    • P
      sched: Fix PROVE_RCU vs cpu_cgroup · dc61b1d6
      Peter Zijlstra 提交于
      PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
      typically holds rq->lock around the css rcu derefs but the generic
      cgroup code doesn't (and can't) know about that lock.
      
      Provide means to add extra checks to the css dereference and use that
      in the scheduler to annotate its users.
      
      The addition of rq->lock to these checks is correct because the
      cgroup_subsys::attach() method takes the rq->lock for each task it
      moves, therefore by holding that lock, we ensure the task is pinned to
      the current cgroup and the RCU derefence is valid.
      
      That leaves one genuine race in __sched_setscheduler() where we used
      task_group() without holding any of the required locks and thus raced
      with the cgroup code. Solve this by moving the check under the
      appropriate lock.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dc61b1d6
  13. 28 5月, 2010 1 次提交
  14. 05 5月, 2010 1 次提交
  15. 13 3月, 2010 7 次提交
    • K
      cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov 提交于
      Events should be removed after rmdir of cgroup directory, but before
      destroying subsystem state objects.  Let's take reference to cgroup
      directory dentry to do that.
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0a4db54
    • K
      cgroup: implement eventfd-based generic API for notifications · 0dea1168
      Kirill A. Shutemov 提交于
      This patchset introduces eventfd-based API for notifications in cgroups
      and implements memory notifications on top of it.
      
      It uses statistics in memory controler to track memory usage.
      
      Output of time(1) on building kernel on tmpfs:
      
      Root cgroup before changes:
      	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
      Non-root cgroup before changes:
      	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
      Root cgroup after changes (0 thresholds):
      	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
      Non-root cgroup after changes (0 thresholds):
      	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
      Root cgroup after changes (1 thresholds, never crossed):
      	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
      Non-root cgroup after changes (1 thresholds, never crossed):
      	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
      
      This patch:
      
      Introduce the write-only file "cgroup.event_control" in every cgroup.
      
      To register new notification handler you need:
      - create an eventfd;
      - open a control file to be monitored. Callbacks register_event() and
        unregister_event() must be defined for the control file;
      - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
        Interpretation of args is defined by control file implementation;
      
      eventfd will be woken up by control file implementation or when the
      cgroup is removed.
      
      To unregister notification handler just close eventfd.
      
      If you need notification functionality for a control file you have to
      implement callbacks register_event() and unregister_event() in the
      struct cftype.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0dea1168
    • B
      cgroups: subsystem module unloading · cf5d5941
      Ben Blum 提交于
      Provides support for unloading modular subsystems.
      
      This patch adds a new function cgroup_unload_subsys which is to be used
      for removing a loaded subsystem during module deletion.  Reference
      counting of the subsystems' modules is moved from once (at load time) to
      once per attached hierarchy (in parse_cgroupfs_options and
      rebind_subsystems) (i.e., 0 or 1).
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf5d5941
    • B
      cgroups: subsystem module loading interface · e6a1105b
      Ben Blum 提交于
      Add interface between cgroups subsystem management and module loading
      
      This patch implements rudimentary module-loading support for cgroups -
      namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
      module initcall, and a struct module pointer in struct cgroup_subsys.
      
      Several functions that might be wanted by modules have had EXPORT_SYMBOL
      added to them, but it's unclear exactly which functions want it and which
      won't.
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6a1105b
    • B
      cgroups: revamp subsys array · aae8aab4
      Ben Blum 提交于
      This patch series provides the ability for cgroup subsystems to be
      compiled as modules both within and outside the kernel tree.  This is
      mainly useful for classifiers and subsystems that hook into components
      that are already modules.  cls_cgroup and blkio-cgroup serve as the
      example use cases for this feature.
      
      It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
      which modular subsystems can use to register and depart during runtime.
      The net_cls classifier subsystem serves as the example for a subsystem
      which can be converted into a module using these changes.
      
      Patch #1 sets up the subsys[] array so its contents can be dynamic as
      modules appear and (eventually) disappear.  Iterations over the array are
      modified to handle when subsystems are absent, and the dynamic section of
      the array is protected by cgroup_mutex.
      
      Patch #2 implements an interface for modules to load subsystems, called
      cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
      pointer in struct cgroup_subsys.
      
      Patch #3 adds a mechanism for unloading modular subsystems, which includes
      a more advanced rework of the rudimentary reference counting introduced in
      patch 2.
      
      Patch #4 modifies the net_cls subsystem, which already had some module
      declarations, to be configurable as a module, which also serves as a
      simple proof-of-concept.
      
      Part of implementing patches 2 and 4 involved updating css pointers in
      each css_set when the module appears or leaves.  In doing this, it was
      discovered that css_sets always remain linked to the dummy cgroup,
      regardless of whether or not any subsystems are actually bound to it
      (i.e., not mounted on an actual hierarchy).  The subsystem loading and
      unloading code therefore should keep in mind the special cases where the
      added subsystem is the only one in the dummy cgroup (and therefore all
      css_sets need to be linked back into it) and where the removed subsys was
      the only one in the dummy cgroup (and therefore all css_sets should be
      unlinked from it) - however, as all css_sets always stay attached to the
      dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
      issue should also make sure these cases are addressed in the subsystem
      loading and unloading code.
      
      This patch:
      
      Make subsys[] able to be dynamically populated to support modular
      subsystems
      
      This patch reworks the way the subsys[] array is used so that subsystems
      can register themselves after boot time, and enables the internals of
      cgroups to be able to handle when subsystems are not present or may
      appear/disappear.
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aae8aab4
    • D
      cgroup: introduce coalesce css_get() and css_put() · d7b9fff7
      Daisuke Nishimura 提交于
      Current css_get() and css_put() increment/decrement css->refcnt one by
      one.
      
      This patch add a new function __css_get(), which takes "count" as a arg
      and increment the css->refcnt by "count".  And this patch also add a new
      arg("count") to __css_put() and change the function to decrement the
      css->refcnt by "count".
      
      These coalesce version of __css_get()/__css_put() will be used to improve
      performance of memcg's moving charge feature later, where instead of
      calling css_get()/css_put() repeatedly, these new functions will be used.
      
      No change is needed for current users of css_get()/css_put().
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NPaul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7b9fff7
    • D
      cgroup: introduce cancel_attach() · 2468c723
      Daisuke Nishimura 提交于
      Add cancel_attach() operation to struct cgroup_subsys.  cancel_attach()
      can be used when can_attach() operation prepares something for the subsys,
      but we should rollback what can_attach() operation has prepared if attach
      task fails after we've succeeded in can_attach().
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2468c723
  16. 04 3月, 2010 1 次提交
  17. 28 2月, 2010 1 次提交
  18. 25 2月, 2010 1 次提交
    • P
      sched: Use lockdep-based checking on rcu_dereference() · d11c563d
      Paul E. McKenney 提交于
      Update the rcu_dereference() usages to take advantage of the new
      lockdep-based checking.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
      [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d11c563d
  19. 02 10月, 2009 1 次提交
  20. 24 9月, 2009 5 次提交
    • B
      cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time · be367d09
      Ben Blum 提交于
      Alter the ss->can_attach and ss->attach functions to be able to deal with
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      to be added (for example, one that counts tasks within a group) this bit
      would need to be reworked a bit to tell the subsystem the right
      information.
      
      [hidave.darkstar@gmail.com: fix build]
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be367d09
    • B
      cgroups: change css_set freeing mechanism to be under RCU · c378369d
      Ben Blum 提交于
      Changes css_set freeing mechanism to be under RCU
      
      This is a prepatch for making the procs file writable. In order to free the
      old css_sets for each task to be moved as they're being moved, the freeing
      mechanism must be RCU-protected, or else we would have to have a call to
      synchronize_rcu() for each task before freeing its old css_set.
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c378369d
    • B
      cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces · 72a8cb30
      Ben Blum 提交于
      Previously there was the problem in which two processes from different pid
      namespaces reading the tasks or procs file could result in one process
      seeing results from the other's namespace.  Rather than one pidlist for
      each file in a cgroup, we now keep a list of pidlists keyed by namespace
      and file type (tasks versus procs) in which entries are placed on demand.
      Each pidlist has its own lock, and that the pidlists themselves are passed
      around in the seq_file's private pointer means we don't have to touch the
      cgroup or its master list except when creating and destroying entries.
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72a8cb30
    • B
      cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids · 102a775e
      Ben Blum 提交于
      struct cgroup used to have a bunch of fields for keeping track of the
      pidlist for the tasks file.  Those are now separated into a new struct
      cgroup_pidlist, of which two are had, one for procs and one for tasks.
      The way the seq_file operations are set up is changed so that just the
      pidlist struct gets passed around as the private data.
      
      Interface example: Suppose a multithreaded process has pid 1000 and other
      threads with ids 1001, 1002, 1003:
      $ cat tasks
      1000
      1001
      1002
      1003
      $ cat cgroup.procs
      1000
      $
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      102a775e
    • P
      cgroups: revert "cgroups: fix pid namespace bug" · 8f3ff208
      Paul Menage 提交于
      The following series adds a "cgroup.procs" file to each cgroup that
      reports unique tgids rather than pids, and allows all threads in a
      threadgroup to be atomically moved to a new cgroup.
      
      The subsystem "attach" interface is modified to support attaching whole
      threadgroups at a time, which could introduce potential problems if any
      subsystem were to need to access the old cgroup of every thread being
      moved.  The attach interface may need to be revised if this becomes the
      case.
      
      Also added is functionality for read/write locking all CLONE_THREAD
      fork()ing within a threadgroup, by means of an rwsem that lives in the
      sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
      with the sighand's atomic count.  This scheme should introduce no extra
      overhead in the fork path when there's no contention.
      
      The final patch reveals potential for a race when forking before a
      subsystem's attach function is called - one potential solution in case any
      subsystem has this problem is to hang on to the group's fork mutex through
      the attach() calls, though no subsystem yet demonstrates need for an
      extended critical section.
      
      This patch:
      
      Revert
      
      commit 096b7fe0
      Author:     Li Zefan <lizf@cn.fujitsu.com>
      AuthorDate: Wed Jul 29 15:04:04 2009 -0700
      Commit:     Linus Torvalds <torvalds@linux-foundation.org>
      CommitDate: Wed Jul 29 19:10:35 2009 -0700
      
          cgroups: fix pid namespace bug
      
      This is in preparation for some clashing cgroups changes that subsume the
      original commit's functionaliy.
      
      The original commit fixed a pid namespace bug which Ben Blum fixed
      independently (in the same way, but with different code) as part of a
      series of patches.  I played around with trying to reconcile Ben's patch
      series with Li's patch, but concluded that it was simpler to just revert
      Li's, given that Ben's patch series contained essentially the same fix.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f3ff208
  21. 30 7月, 2009 2 次提交
  22. 03 4月, 2009 6 次提交
    • L
      cgroups: add 'data' field to struct cgroup_scanner · bd1a8ab7
      Li Zefan 提交于
      We need to pass some data to test_task() or process_task() in some cases.
      Will be used later.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd1a8ab7
    • K
      memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki 提交于
      This patch tries to fix OOM Killer problems caused by hierarchy.
      Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
      kill a task in memcg.
      
      But, when hierarchy is used, it's broken and correct task cannot
      be killed. For example, in following cgroup
      
      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit
      All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
      groupA's 1Gbytes but OOM Killer just kills tasks in groupA.
      
      This patch provides makes the bad process be selected from all tasks
      under hierarchy. BTW, currently, oom_jiffies is updated against groupA
      in above case. oom_jiffies of tree should be updated.
      
      To see how oom_jiffies is used, please check mem_cgroup_oom_called()
      callers.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b7f569e
    • L
      cgroups: show correct file mode · 099fca32
      Li Zefan 提交于
      We have some read-only files and write-only files, but currently they are
      all set to 0644, which is counter-intuitive and cause trouble for some
      cgroup tools like libcgroup.
      
      This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
      own files' file mode, and for the most cases cft->mode can be default to 0
      and cgroup will figure out proper mode.
      Acked-by: NPaul Menage <menage@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      099fca32
    • K
      cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki 提交于
      In following situation, with memory subsystem,
      
      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty
      
      When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
      is triggered and the kernel walks tree under groupA. In this case,
      rmdir /groupA/04 fails with -EBUSY frequently because of temporal
      refcnt from the kernel.
      
      In general. cgroup can be rmdir'd if there are no children groups and
      no tasks. Frequent fails of rmdir() is not useful to users.
      (And the reason for -EBUSY is unknown to users.....in most cases)
      
      This patch tries to modify above behavior, by
      	- retries if css_refcnt is got by someone.
      	- add "return value" to pre_destroy() and allows subsystem to
      	  say "we're really busy!"
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec64f515
    • K
      cgroup: CSS ID support · 38460b48
      KAMEZAWA Hiroyuki 提交于
      Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
      
      This patch attaches unique ID to each css and provides following.
      
       - css_lookup(subsys, id)
         returns pointer to struct cgroup_subysys_state of id.
       - css_get_next(subsys, id, rootid, depth, foundid)
         returns the next css under "root" by scanning
      
      When cgroup_subsys->use_id is set, an id for css is maintained.
      
      The cgroup framework only parepares
      	- css_id of root css for subsys
      	- id is automatically attached at creation of css.
      	- id is *not* freed automatically. Because the cgroup framework
      	  don't know lifetime of cgroup_subsys_state.
      	  free_css_id() function is provided. This must be called by subsys.
      
      There are several reasons to develop this.
      	- Saving space .... For example, memcg's swap_cgroup is array of
      	  pointers to cgroup. But it is not necessary to be very fast.
      	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
      	  reduce much amount of memory usage.
      
      	- Scanning without lock.
      	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
      	  css under root can be written without locks.
      	  ex)
      	  do {
      		rcu_read_lock();
      		next = cgroup_get_next(subsys, id, root, &found);
      		/* check sanity of next here */
      		css_tryget();
      		rcu_read_unlock();
      		id = found + 1
      	 } while(...)
      
      Characteristics:
      	- Each css has unique ID under subsys.
      	- Lifetime of ID is controlled by subsys.
      	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
      	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
      
      Design Choices:
      	- scan-by-ID v.s. scan-by-tree-walk.
      	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
      	  by following kind of routine.
      	  scan -> rest a while(release a lock) -> conitunue from interrupted
      	  memcg's hierarchical reclaim does this.
      
      	- When subsys->use_id is set, # of css in the system is limited to
      	  65535.
      
      [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38460b48
    • G
      cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups · 313e924c
      Grzegorz Nosek 提交于
      The ns_proxy cgroup allows moving processes to child cgroups only one
      level deep at a time.  This commit relaxes this restriction and makes it
      possible to attach tasks directly to grandchild cgroups, e.g.:
      
      ($pid is in the root cgroup)
      echo $pid > /cgroup/CG1/CG2/tasks
      
      Previously this operation would fail with -EPERM and would have to be
      performed as two steps:
      echo $pid > /cgroup/CG1/tasks
      echo $pid > /cgroup/CG1/CG2/tasks
      
      Also, the target cgroup no longer needs to be empty to move a task there.
      Signed-off-by: NGrzegorz Nosek <root@localdomain.pl>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      313e924c