1. 14 Jan, 2011 1 commit
  2. 07 Jan, 2011 7 commits
    • fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin authored
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
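The flag-caching idea in this commit can be sketched in userspace C. This is a simplified model, not the kernel code: the flag names (`OP_HASH` and friends), the reduced callback signatures, and the helpers `needs_revalidate()`, `always_valid()` and `sample_ops` are assumptions made for illustration. Only the name `d_set_d_op` comes from the commit message.

```c
#include <stddef.h>

/* Hypothetical flag bits, one per commonly used dentry operation. */
#define OP_HASH       0x1u
#define OP_COMPARE    0x2u
#define OP_REVALIDATE 0x4u
#define OP_DELETE     0x8u

struct dentry;

/* Reduced, illustrative signatures; the real callbacks take more arguments. */
struct dentry_operations {
    int (*d_hash)(struct dentry *);
    int (*d_compare)(struct dentry *, const char *);
    int (*d_revalidate)(struct dentry *);
    int (*d_delete)(struct dentry *);
};

struct dentry {
    unsigned int d_flags;
    const struct dentry_operations *d_op;
};

/* Install the ops table and record in d_flags which callbacks it provides. */
static void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
    dentry->d_op = op;
    dentry->d_flags &= ~(OP_HASH | OP_COMPARE | OP_REVALIDATE | OP_DELETE);
    if (!op)
        return;
    if (op->d_hash)
        dentry->d_flags |= OP_HASH;
    if (op->d_compare)
        dentry->d_flags |= OP_COMPARE;
    if (op->d_revalidate)
        dentry->d_flags |= OP_REVALIDATE;
    if (op->d_delete)
        dentry->d_flags |= OP_DELETE;
}

/* Fast path: a single flag test replaces loading d_op and then the callback. */
static int needs_revalidate(const struct dentry *dentry)
{
    return (dentry->d_flags & OP_REVALIDATE) != 0;
}

/* Hypothetical sample ops table providing only d_revalidate. */
static int always_valid(struct dentry *d) { (void)d; return 1; }
static const struct dentry_operations sample_ops = { .d_revalidate = always_valid };
```

Because every assignment goes through the setter, the flags cannot drift out of sync with the ops table, which is presumably why the commit converts all direct `d_op` assignments with the sed script shown above.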
    • fs: dcache rationalise dget variants · dc0474be
      Nick Piggin authored
      dget_locked was a shortcut to avoid the lazy lru manipulation when we already
      held dcache_lock (lru manipulation was relatively cheap at that point).
      However, now that the lru lock is an innermost one, we never hold it at any
      caller, so the lock cost can now be avoided. We already have well working lazy
      dcache LRU, so it should be fine to defer LRU manipulations to scan time.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin authored
      dcache_lock no longer protects anything. Remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache scale subdirs · 2fd6b7f5
      Nick Piggin authored
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache scale dentry refcount · b7ab39f6
      Nick Piggin authored
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: change d_delete semantics · fe15ce44
      Nick Piggin authored
      Change d_delete from a dentry deletion notification to a dentry caching
      advice, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • cgroup fs: avoid switching ->d_op on live dentry · 5adcee1d
      Nick Piggin authored
      Switching d_op on a live dentry is racy in general, so avoid it. In this case
      it is a negative dentry, which is safer, but there are still concurrent ops
      which may be called on d_op in that case (eg. d_revalidate). So in general
      a filesystem may not do this. Fix cgroupfs so as not to do this.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
  3. 29 Oct, 2010 1 commit
  4. 28 Oct, 2010 3 commits
    • cgroups: add check for strcpy destination string overflow · f4a2589f
      Evgeny Kuznetsov authored
      Function "strcpy" is used without a check for the maximum allowed source
      string length and could overflow the destination string.  A check of the
      string length is added before using "strcpy".  The function now returns an
      error if the source string length exceeds the maximum.
      
      akpm: presently considered NotABug, but add the check for general
      future-safeness and robustness.
      Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
      Acked-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
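The length check described in this commit follows a common pattern, sketched below in userspace C. The struct, the function name `set_release_agent()`, the buffer size and the `-ENAMETOOLONG` return value are all assumptions for illustration; the commit message does not say which function or error code the patch uses.

```c
#include <errno.h>
#include <string.h>

#define AGENT_PATH_MAX 64   /* hypothetical destination buffer size */

struct release_agent {
    char path[AGENT_PATH_MAX];
};

/*
 * Copy src into the fixed-size buffer only if it fits, including the
 * terminating NUL. Returns 0 on success or a negative errno value.
 */
static int set_release_agent(struct release_agent *ra, const char *src)
{
    if (strlen(src) >= sizeof(ra->path))
        return -ENAMETOOLONG;   /* assumed error code */
    strcpy(ra->path, src);      /* safe: length was checked above */
    return 0;
}
```

The comparison uses `>=` rather than `>` so that a source string of exactly `AGENT_PATH_MAX` characters is rejected, since its terminating NUL would not fit.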
    • cgroup: make the mount options parsing more accurate · 32a8cf23
      Daniel Lezcano authored
      Current behavior:
      =================
      
      (1) When we mount a cgroup, we can specify the 'all' option which
          means to enable all the cgroup subsystems.  This is the default option
          when no option is specified.
      
      (2) If we want to mount a cgroup with a subset of the supported cgroup
          subsystems, we have to specify a subsystems name list for the mount
          option.
      
      (3) If we specify another option like 'noprefix' or 'release_agent',
          the current code also requires 'all' or a subsystem name to be
          specified.  Not critical, but a bit unfriendly, as we should assume (1)
          in this case.
      
      (4) Logically, the 'all' option is mutually exclusive with a subsystem
          name, but this is not detected.
      
      In other words:
       succeed : mount -t cgroup -o all,freezer cgroup /cgroup
      	=> is it 'all' or 'freezer' ?
       fails : mount -t cgroup -o noprefix cgroup /cgroup
      	=> succeed if we do '-o noprefix,all'
      
      The following patches consolidate the mount options check a bit.
      
      New behavior:
      =============
      
      (1) untouched
      (2) untouched
      (3) the 'all' option is the default when only options other than
          a subsystem name are specified
      (4) raises an error
      
      In other words:
       fails   : mount -t cgroup -o all,freezer cgroup /cgroup
       succeed : mount -t cgroup -o noprefix cgroup /cgroup
      
      For the sake of readability, the if ... then ... else ... if ...
      indentation when parsing the options has been changed to:
      if ... then
      	...
      	continue
      fi
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
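The new option rules (3) and (4) can be modelled with a small userspace parser. This is a sketch under stated assumptions: `struct mount_opts`, `parse_mount_opts()` and the treatment of every unrecognised token as a subsystem name are hypothetical simplifications, and only the 'all' and 'noprefix' options from the message are handled. It also uses the flat "if ... continue" shape the message describes.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

struct mount_opts {
    bool all;                  /* mount every subsystem */
    bool noprefix;
    unsigned int subsys_count; /* number of explicitly named subsystems */
};

/* Parse a comma-separated option string; returns 0 or a negative errno. */
static int parse_mount_opts(const char *data, struct mount_opts *o)
{
    const char *p = data;

    memset(o, 0, sizeof(*o));
    while (*p) {
        const char *tok = p;
        size_t len = strcspn(p, ",");

        p += len;
        if (*p == ',')
            p++;
        if (len == 0)
            continue;          /* skip empty tokens */
        if (len == 3 && !memcmp(tok, "all", 3)) {
            o->all = true;
            continue;
        }
        if (len == 8 && !memcmp(tok, "noprefix", 8)) {
            o->noprefix = true;
            continue;
        }
        o->subsys_count++;     /* assumption: anything else names a subsystem */
    }

    /* Rule (4): 'all' and an explicit subsystem name are mutually exclusive. */
    if (o->all && o->subsys_count)
        return -EINVAL;
    /* Rule (3): with no subsystem named, fall back to 'all'. */
    if (!o->subsys_count)
        o->all = true;
    return 0;
}
```

With this model, `"all,freezer"` is rejected while `"noprefix"` alone succeeds and implies 'all', matching the two mount examples under "New behavior".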
    • cgroup: add clone_children control file · 97978e6d
      Daniel Lezcano authored
      The ns_cgroup is a control group interacting with the namespaces.  When a
      new namespace is created, a corresponding cgroup is automatically created
      too.  The cgroup name is the pid of the process that called 'unshare' or of
      the child of 'clone'.

      This cgroup is tied to the namespace because it prevents a process from
      escaping the control group, and it uses the post_clone callback, so the
      child cgroup inherits the values of the parent cgroup.

      Unfortunately, the more we use this cgroup, the more problems we face
      with it:

      (1) when a process unshares, the cgroup name may conflict with a
          previous cgroup with the same pid, so unshare or clone return -EEXIST

      (2) the cgroup creation is out of control because an application may
          create several namespaces, in which case the system automatically
          creates several cgroups behind its back and leaves them on the
          cgroupfs (eg.  a vrf based on the network namespace).

      (3) the combination of (1) and (2) forces an administrator to regularly
          check and clean up these cgroups.

      This patchset removes the ns_cgroup by adding a new flag to the cgroup and
      the cgroupfs mount option.  It enables the copy of the parent cgroup when
      a child cgroup is created.  We can then safely remove the ns_cgroup as
      this flag provides compatibility.  We now have to manually create a cgroup
      and add the task to it, which is consistent with the cgroup framework.
      
      This patch:
      
      Sent as an answer to a previous thread around the ns_cgroup.
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
      
      It adds a control file 'clone_children' for a cgroup.  This control file
      is a boolean specifying if the child cgroup should be a clone of the
      parent cgroup or not.  The default value is 'false'.
      
      This flag makes the child cgroup call the post_clone callback of all
      the subsystems, if it is available.

      At present, cpuset is the only subsystem that has implemented the
      post_clone callback.
      
      The option can be set at mount time by specifying the 'clone_children'
      mount option.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Acked-by: Paul Menage <menage@google.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 26 Oct, 2010 1 commit
    • fs: do not assign default i_ino in new_inode · 85fe4025
      Christoph Hellwig authored
      Instead of always assigning an increasing inode number in new_inode
      move the call to assign it into those callers that actually need it.
      For now the set of callers that need it is estimated conservatively, that is
      the call is added to all filesystems that do not assign an i_ino
      by themselves.  For a few more filesystems we can avoid assigning
      any inode number given that they aren't user visible, and for others
      it could be done lazily when an inode number is actually needed,
      but that's left for later patches.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 05 Oct, 2010 2 commits
    • BKL: Remove BKL from cgroup · 38d018db
      Jan Blunck authored
      The BKL is only used in remount_fs and get_sb that are both protected by
      the superblock's s_umount rw_semaphore. Therefore it is safe to remove the
      BKL entirely.
      Signed-off-by: Jan Blunck <jblunck@infradead.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • BKL: Explicitly add BKL around get_sb/fill_super · db719222
      Jan Blunck authored
      This patch is a preparation necessary to remove the BKL from do_new_mount().
      It explicitly adds calls to lock_kernel()/unlock_kernel() around
      get_sb/fill_super operations for filesystems that still use the BKL.
      
      I've read through all the code formerly covered by the BKL inside
      do_kern_mount() and have satisfied myself that it doesn't need the BKL
      any more.
      
      do_kern_mount() is already called without the BKL when mounting the rootfs
      and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
      from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
      through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
      afs_mntpt_follow_link(). Both of the latter functions are actually the
      filesystem's follow_link inode operation. vfs_kern_mount() is calling the
      specified get_sb function and lets the filesystem do its job by calling
      the given fill_super function.

      Therefore I think it is safe to push down the BKL from the VFS to the
      low-level filesystems' get_sb/fill_super operations.
      
      [arnd: do not add the BKL to those file systems that already
             don't use it elsewhere]
      Signed-off-by: Jan Blunck <jblunck@infradead.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
  7. 10 Sep, 2010 1 commit
  8. 05 Sep, 2010 1 commit
    • cgroups: fix API thinko · 73457f0f
      Michael S. Tsirkin authored
      The cgroup_attach_task_current_cg API that we have upstream is backwards: we
      really need an API to attach to the cgroups from another process A to
      the current one.

      In our case (vhost), a privileged user wants to attach its task to cgroups
      from a less privileged one, the API makes us run it in the other
      task's context, and this fails.
      
      So let's make the API generic and just pass in 'from' and 'to' tasks.
      Add an inline wrapper for cgroup_attach_task_current_cg to avoid
      breaking bisect.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
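The shape of the API change, a generic 'from'/'to' function plus an inline wrapper that keeps the old entry point alive, can be illustrated with a toy model. Everything here is hypothetical: `struct task_model`, `current_task` (standing in for the kernel's `current`) and the function names are assumptions, and a single integer stands in for a task's full set of cgroups.

```c
/* Toy task: one integer stands in for the task's set of cgroups. */
struct task_model {
    int cgroup_id;
};

/* Stand-in for the kernel's `current` task pointer. */
static struct task_model *current_task;

/* Generic form: attach `to` to the cgroups that `from` belongs to. */
static int attach_task_all(const struct task_model *from, struct task_model *to)
{
    to->cgroup_id = from->cgroup_id;
    return 0;
}

/* Old entry point kept as a thin wrapper so existing callers still build. */
static inline int attach_task_current_cg(struct task_model *tsk)
{
    return attach_task_all(current_task, tsk);
}
```

In this model the vhost case from the message becomes `attach_task_all(owner, current_task)`, which runs in the privileged task's own context rather than in the other task's.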
  9. 20 Aug, 2010 1 commit
  10. 11 Aug, 2010 1 commit
  11. 06 Aug, 2010 1 commit
  12. 28 Jul, 2010 1 commit
  13. 05 Jun, 2010 1 commit
    • cgroups: alloc_css_id() increments hierarchy depth · 94b3dd0f
      Greg Thelen authored
      Child groups should have a greater depth than their parents.  Prior to
      this change, the parent would incorrectly report zero memory usage for
      child cgroups when use_hierarchy is enabled.
      
      test script:
        mount -t cgroup none /cgroups -o memory
        cd /cgroups
        mkdir cg1
      
        echo 1 > cg1/memory.use_hierarchy
        mkdir cg1/cg11
      
        echo $$ > cg1/cg11/tasks
        dd if=/dev/zero of=/tmp/foo bs=1M count=1
      
        echo
        echo CHILD
        grep cache cg1/cg11/memory.stat
      
        echo
        echo PARENT
        grep cache cg1/memory.stat
      
        echo $$ > tasks
        rmdir cg1/cg11 cg1
        cd /
        umount /cgroups
      
      Using fae9c791, a recent patch that changed alloc_css_id() depth computation,
      the parent incorrectly reports zero usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s
      
        CHILD
        cache 1048576
        total_cache 1048576
      
        PARENT
        cache 0
        total_cache 0
      
      With this patch, the parent correctly includes child usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s
      
        CHILD
        cache 1052672
        total_cache 1052672
      
        PARENT
        cache 0
        total_cache 1052672
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
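Why the depth matters for hierarchical accounting can be seen in a small model of an id with a per-depth ancestor stack. This is an illustrative sketch, not the kernel's data structure: `struct css_id_model`, the fixed `MAX_DEPTH` and the helper names are assumptions. The point is that the ancestor test indexes the stack by depth, so a child created with the same depth as its parent would fail the test.

```c
#include <stdbool.h>

#define MAX_DEPTH 8   /* hypothetical limit for this model */

struct css_id_model {
    unsigned short id;
    unsigned short depth;
    unsigned short stack[MAX_DEPTH]; /* stack[d] = id of the ancestor at depth d */
};

static void init_root(struct css_id_model *root, unsigned short id)
{
    root->id = id;
    root->depth = 0;
    root->stack[0] = id;
}

/* A child sits one level below its parent and inherits the parent's stack. */
static int init_child(struct css_id_model *child,
                      const struct css_id_model *parent, unsigned short id)
{
    unsigned short depth = parent->depth + 1;
    unsigned short i;

    if (depth >= MAX_DEPTH)
        return -1;
    for (i = 0; i <= parent->depth; i++)
        child->stack[i] = parent->stack[i];
    child->id = id;
    child->depth = depth;
    child->stack[depth] = id;
    return 0;
}

/* True if root is child itself or one of its ancestors. */
static bool is_ancestor(const struct css_id_model *child,
                        const struct css_id_model *root)
{
    if (child->depth < root->depth)
        return false;
    return child->stack[root->depth] == root->id;
}
```

In the test script's terms, `cg1/cg11` ends up at depth 2 with `cg1` recorded at `stack[1]`, so a walk that asks "is `cg11` under `cg1`?" succeeds.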
  14. 28 May, 2010 1 commit
  15. 12 May, 2010 2 commits
    • memcg: fix css_is_ancestor() RCU locking · 747388d7
      KAMEZAWA Hiroyuki authored
      Some callers (in memcontrol.c) call css_is_ancestor() without
      rcu_read_lock.  Because css_is_ancestor() has to access RCU protected
      data, it should be under rcu_read_lock().

      This makes css_is_ancestor() itself safely access the RCU protected
      area.  (At least, "root" can have refcnt==0 if it's not an ancestor of
      "child".  So, we need rcu_read_lock().)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix css_id() RCU locking for real · 7f0f1546
      KAMEZAWA Hiroyuki authored
      Commit ad4ba375 ("memcg: css_id() must be
      called under rcu_read_lock()") modifies memcontrol.c to fix an RCU check
      message.  But Andrew Morton pointed out that the fix doesn't seem sane
      and was just hiding lockdep messages.

      This patch does the proper thing.  Checking again, all the places
      accessing without rcu_read_lock that the commit fixed were intentional:
      all callers of css_id() hold a reference count on it.  So, it's not necessary
      to be under rcu_read_lock().

      Considering again, we can use rcu_dereference_check for css_id().  We know
      css->id is valid if css->refcnt > 0.  (css->id never changes and is freed
      after css->refcnt goes to 0.)

      This patch makes use of rcu_dereference_check() in css_id/depth and removes
      the unnecessary rcu_read_lock added by the commit.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 11 May, 2010 1 commit
    • sched, wait: Use wrapper functions · a93d2f17
      Changli Gao authored
      epoll should not touch flags in wait_queue_t. This patch introduces a new
      function __add_wait_queue_exclusive(), for users that use the wait queue
      as a LIFO queue.

      __add_wait_queue_tail_exclusive() is introduced too instead of
      add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
      it is a duplicate of __remove_wait_queue(), disliked by users, and with fewer
      users.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <containers@lists.linux-foundation.org>
      LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 05 May, 2010 2 commits
  18. 25 Mar, 2010 1 commit
  19. 16 Mar, 2010 1 commit
  20. 13 Mar, 2010 10 commits
    • cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov authored
      Events should be removed after rmdir of the cgroup directory, but before
      destroying subsystem state objects.  Let's take a reference to the cgroup
      directory dentry to do that.
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: fix race between userspace and kernelspace · 4ab78683
      Kirill A. Shutemov authored
      Notify userspace about cgroup removing only after rmdir of cgroup
      directory to avoid race between userspace and kernelspace.
      
      eventfds are used to notify about two types of event:
       - control file-specific, like crossing a memory threshold;
       - cgroup removal.

      To understand what really happened, userspace can check if the cgroup still
      exists.  To avoid a race between userspace and kernelspace we have to
      notify userspace about cgroup removal only after rmdir of the cgroup
      directory.
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: implement eventfd-based generic API for notifications · 0dea1168
      Kirill A. Shutemov authored
      This patchset introduces an eventfd-based API for notifications in cgroups
      and implements memory notifications on top of it.

      It uses statistics in the memory controller to track memory usage.
      
      Output of time(1) on building kernel on tmpfs:
      
      Root cgroup before changes:
      	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
      Non-root cgroup before changes:
      	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
      Root cgroup after changes (0 thresholds):
      	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
      Non-root cgroup after changes (0 thresholds):
      	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
      Root cgroup after changes (1 thresholds, never crossed):
      	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
      Non-root cgroup after changes (1 thresholds, never crossed):
      	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
      
      This patch:
      
      Introduce the write-only file "cgroup.event_control" in every cgroup.
      
      To register a new notification handler you need to:
      - create an eventfd;
      - open a control file to be monitored. Callbacks register_event() and
        unregister_event() must be defined for the control file;
      - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
        Interpretation of args is defined by the control file implementation.

      The eventfd will be woken up by the control file implementation or when
      the cgroup is removed.

      To unregister a notification handler just close the eventfd.
      
      If you need notification functionality for a control file you have to
      implement callbacks register_event() and unregister_event() in the
      struct cftype.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
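From userspace, the final registration step described above is a single write of the "<event_fd> <control_fd> <args>" line. The helper below is a minimal sketch of that step only; the name `register_cgroup_event()` and the 128-byte line buffer are assumptions, and the descriptors are expected to be opened by the caller.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Write the registration line "<event_fd> <control_fd> <args>" to an
 * already-open cgroup.event_control descriptor.
 * Returns 0 on success, -1 on formatting or write failure.
 */
static int register_cgroup_event(int event_control_fd, int event_fd,
                                 int control_fd, const char *args)
{
    char line[128];
    int n = snprintf(line, sizeof(line), "%d %d %s",
                     event_fd, control_fd, args);

    if (n < 0 || (size_t)n >= sizeof(line))
        return -1;      /* the line would have been truncated */
    if (write(event_control_fd, line, (size_t)n) != (ssize_t)n)
        return -1;
    return 0;
}
```

A caller would typically obtain `event_fd` from `eventfd(0, 0)` (declared in `<sys/eventfd.h>`), open the monitored control file and `cgroup.event_control` in the same cgroup directory, call the helper, and then block in `read()` on `event_fd` until a notification arrives. The meaning of `args` depends on the control file, as the message notes.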
    • cgroups: clean up cgroup_pidlist_find() a bit · b70cc5fd
      Li Zefan authored
      Don't call get_pid_ns() before we locate/alloc the ns.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: blkio subsystem as module · 67523c48
      Ben Blum authored
      Modify the Block I/O cgroup subsystem to be able to be built as a module.
      As the CFQ disk scheduler optionally depends on blk-cgroup, config options
      in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
      enhanced to support the new module dependency.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: subsystem module unloading · cf5d5941
      Ben Blum authored
      Provides support for unloading modular subsystems.
      
      This patch adds a new function cgroup_unload_subsys which is to be used
      for removing a loaded subsystem during module deletion.  Reference
      counting of the subsystems' modules is moved from once (at load time) to
      once per attached hierarchy (in parse_cgroupfs_options and
      rebind_subsystems) (i.e., 0 or 1).
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: subsystem module loading interface · e6a1105b
      Ben Blum authored
      Add interface between cgroups subsystem management and module loading
      
      This patch implements rudimentary module-loading support for cgroups -
      namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
      module initcall, and a struct module pointer in struct cgroup_subsys.
      
      Several functions that might be wanted by modules have had EXPORT_SYMBOL
      added to them, but it's unclear exactly which functions want it and which
      won't.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: revamp subsys array · aae8aab4
      Ben Blum authored
      This patch series provides the ability for cgroup subsystems to be
      compiled as modules both within and outside the kernel tree.  This is
      mainly useful for classifiers and subsystems that hook into components
      that are already modules.  cls_cgroup and blkio-cgroup serve as the
      example use cases for this feature.
      
      It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
      which modular subsystems can use to register and depart during runtime.
      The net_cls classifier subsystem serves as the example for a subsystem
      which can be converted into a module using these changes.
      
      Patch #1 sets up the subsys[] array so its contents can be dynamic as
      modules appear and (eventually) disappear.  Iterations over the array are
      modified to handle when subsystems are absent, and the dynamic section of
      the array is protected by cgroup_mutex.
      
      Patch #2 implements an interface for modules to load subsystems, called
      cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
      pointer in struct cgroup_subsys.
      
      Patch #3 adds a mechanism for unloading modular subsystems, which includes
      a more advanced rework of the rudimentary reference counting introduced in
      patch 2.
      
      Patch #4 modifies the net_cls subsystem, which already had some module
      declarations, to be configurable as a module, which also serves as a
      simple proof-of-concept.
      
      Part of implementing patches 2 and 4 involved updating css pointers in
      each css_set when the module appears or leaves.  In doing this, it was
      discovered that css_sets always remain linked to the dummy cgroup,
      regardless of whether or not any subsystems are actually bound to it
      (i.e., not mounted on an actual hierarchy).  The subsystem loading and
      unloading code therefore should keep in mind the special cases where the
      added subsystem is the only one in the dummy cgroup (and therefore all
      css_sets need to be linked back into it) and where the removed subsys was
      the only one in the dummy cgroup (and therefore all css_sets should be
      unlinked from it) - however, as all css_sets always stay attached to the
      dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
      issue should also make sure these cases are addressed in the subsystem
      loading and unloading code.
      
      This patch:
      
      Make subsys[] able to be dynamically populated to support modular
      subsystems
      
      This patch reworks the way the subsys[] array is used so that subsystems
      can register themselves after boot time, and enables the internals of
      cgroups to be able to handle when subsystems are not present or may
      appear/disappear.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: introduce coalesce css_get() and css_put() · d7b9fff7
      Daisuke Nishimura authored
      Current css_get() and css_put() increment/decrement css->refcnt one by
      one.
      
      This patch adds a new function __css_get(), which takes "count" as an arg
      and increments the css->refcnt by "count".  This patch also adds a new
      arg ("count") to __css_put() and changes the function to decrement the
      css->refcnt by "count".

      These coalesced versions of __css_get()/__css_put() will be used to improve
      performance of memcg's moving charge feature later, where instead of
      calling css_get()/css_put() repeatedly, these new functions will be used.
      
      No change is needed for current users of css_get()/css_put().
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
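The coalescing idea is that one atomic operation by "count" replaces "count" separate operations. The sketch below models it with C11 atomics; `struct css_model` and the `css_get_many()`/`css_put_many()` names are hypothetical, and the model ignores any special refcount states the real css->refcnt may have.

```c
#include <stdatomic.h>

/* Simplified stand-in for a css with a plain atomic reference count. */
struct css_model {
    atomic_int refcnt;
};

/* Take `count` references with a single atomic add. */
static void css_get_many(struct css_model *css, int count)
{
    atomic_fetch_add(&css->refcnt, count);
}

/* Drop `count` references with a single atomic subtract; returns the new value. */
static int css_put_many(struct css_model *css, int count)
{
    return atomic_fetch_sub(&css->refcnt, count) - count;
}

/* Existing single-reference callers are unchanged: they pass a count of 1. */
static void css_get_one(struct css_model *css) { css_get_many(css, 1); }
static int css_put_one(struct css_model *css) { return css_put_many(css, 1); }
```

A batch operation that moves, say, 32 charges can then take or drop its references in one step instead of looping 32 times over the single-reference helpers.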
    • cgroup: introduce cancel_attach() · 2468c723
      Daisuke Nishimura authored
      Add cancel_attach() operation to struct cgroup_subsys.  cancel_attach()
      can be used when the can_attach() operation prepares something for the
      subsys, but we should roll back what the can_attach() operation has
      prepared if attaching the task fails after we've succeeded in can_attach().
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
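The prepare/rollback pairing can be sketched as a loop over subsystems that undoes earlier preparations when a later step fails. This is a simplified userspace model: `struct subsys_model`, the reduced callback signatures, `attach_task()` and the sample callbacks are assumptions, and here the failure comes from a later subsystem's can_attach() rather than from any other part of the attach path.

```c
#include <stddef.h>

struct subsys_model {
    int (*can_attach)(struct subsys_model *ss);     /* may prepare state */
    void (*cancel_attach)(struct subsys_model *ss); /* undoes that state */
    int prepared;
};

/*
 * Ask every subsystem whether the attach may proceed. If one refuses,
 * roll back the subsystems that had already prepared, then report the error.
 */
static int attach_task(struct subsys_model **ss, size_t n)
{
    size_t i, failed = n;
    int ret = 0;

    for (i = 0; i < n; i++) {
        if (!ss[i]->can_attach)
            continue;
        ret = ss[i]->can_attach(ss[i]);
        if (ret) {
            failed = i;
            break;
        }
    }
    if (ret) {
        for (i = 0; i < failed; i++)
            if (ss[i]->cancel_attach)
                ss[i]->cancel_attach(ss[i]);
    }
    return ret;
}

/* Hypothetical sample callbacks. */
static int prepare_ok(struct subsys_model *ss) { ss->prepared = 1; return 0; }
static int prepare_fail(struct subsys_model *ss) { (void)ss; return -1; }
static void undo_prepare(struct subsys_model *ss) { ss->prepared = 0; }
```

The subsystem whose can_attach() failed is not cancelled, on the assumption that a failing can_attach() leaves nothing behind to undo.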