1. 05 6月, 2014 10 次提交
  2. 06 5月, 2014 1 次提交
  3. 19 4月, 2014 1 次提交
  4. 08 4月, 2014 2 次提交
  5. 04 4月, 2014 3 次提交
  6. 20 3月, 2014 2 次提交
  7. 13 3月, 2014 1 次提交
  8. 03 3月, 2014 1 次提交
  9. 12 2月, 2014 1 次提交
    • T
      cgroup: convert to kernfs · 2bd59d48
      Tejun Heo 提交于
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs is never designed to be misused like
      this and cgroup ends up jumping through various convoluted dancing to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Inidividual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes are
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used by handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask are
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot fengguang.wu@intel.com>
      2bd59d48
  10. 06 2月, 2014 1 次提交
    • L
      execve: use 'struct filename *' for executable name passing · c4ad8f98
      Linus Torvalds 提交于
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already free'd memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
      Reported-by: NIgor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
  11. 01 2月, 2014 1 次提交
  12. 28 1月, 2014 1 次提交
  13. 25 1月, 2014 1 次提交
  14. 24 1月, 2014 2 次提交
  15. 22 1月, 2014 2 次提交
    • S
      init/main.c: use memblock apis for early memory allocations · 098b081b
      Santosh Shilimkar 提交于
      Switch to memblock interfaces for early memory allocator instead of
      bootmem allocator.  No functional change in beahvior than what it is in
      current code from bootmem users points of view.
      
      Archs already converted to NO_BOOTMEM now directly use memblock
      interfaces instead of bootmem wrappers build on top of memblock.  And
      the archs which still uses bootmem, these new apis just fall back to
      exiting bootmem APIs.
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      098b081b
    • K
      mm: create a separate slab for page->ptl allocation · b35f1819
      Kirill A. Shutemov 提交于
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
      is 72 bytes.  For page->ptl they will be allocated from kmalloc-96 slab,
      so we loose 24 on each.  An average system can easily allocate few tens
      thousands of page->ptl and overhead is significant.
      
      Let's create a separate slab for page->ptl allocation to solve this.
      
      To make sure that it really works this time, some numbers from my test
      machine (just booted, no load):
      
      Before:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        kmalloc-96         31987  32190    128   30    1 : tunables  120   60    8 : slabdata   1073   1073     92
      After:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        page->ptl          27516  28143     72   53    1 : tunables  120   60    8 : slabdata    531    531      9
        kmalloc-96          3853   5280    128   30    1 : tunables  120   60    8 : slabdata    176    176      0
      
      Note that the patch is useful not only for debug case, but also for
      PREEMPT_RT, where spinlock_t is always bloated.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b35f1819
  16. 16 1月, 2014 1 次提交
  17. 11 12月, 2013 1 次提交
  18. 03 12月, 2013 1 次提交
  19. 02 12月, 2013 1 次提交
  20. 27 11月, 2013 1 次提交
  21. 23 11月, 2013 1 次提交
    • T
      cgroup, memcg: move cgroup_event implementation to memcg · 79bd9814
      Tejun Heo 提交于
      cgroup_event is way over-designed and tries to build a generic
      flexible event mechanism into cgroup - fully customizable event
      specification for each user of the interface.  This is utterly
      unnecessary and overboard especially in the light of the planned
      unified hierarchy as there's gonna be single agent.  Simply generating
      events at fixed points, or if that's too restrictive, configureable
      cadence or single set of configureable points should be enough.
      
      Thankfully, memcg is the only user and gets to keep it.  Replacing it
      with something simpler on sane_behavior is strongly recommended.
      
      This patch moves cgroup_event and "cgroup.event_control"
      implementation to mm/memcontrol.c.  Clearing of events on cgroup
      destruction is moved from cgroup_destroy_locked() to
      mem_cgroup_css_offline(), which shouldn't make any noticeable
      difference.
      
      cgroup_css() and __file_cft() are exported to enable the move;
      however, this will soon be reverted once the event code is updated to
      be memcg specific.
      
      Note that "cgroup.event_control" will now exist only on the hierarchy
      with memcg attached to it.  While this change is visible to userland,
      it is unlikely to be noticeable as the file has never been meaningful
      outside memcg.
      
      Aside from the above change, this is pure code relocation.
      
      v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
          poll.h inclusion moved from cgroup.c to memcontrol.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      79bd9814
  22. 21 11月, 2013 1 次提交
  23. 18 11月, 2013 1 次提交
  24. 15 11月, 2013 1 次提交
  25. 13 11月, 2013 1 次提交
    • P
      ./Makefile: export initial ramdisk compression config option · 1bf49dd4
      P J P 提交于
      Make menuconfig allows one to choose compression format of an initial
      ramdisk image.  But this choice does not result in duly compressed ramdisk
      image.  Because - $ make install - does not pass on the selected
      compression choice to the dracut(8) tool, which creates the initramfs
      file.  dracut(8) generates the image with the default compression, ie.
      gzip(1).
      
      This patch exports the selected compression option to a sub-shell
      environment, so that it could be used by dracut(8) tool to generate
      appropriately compressed initramfs images.
      
      There isn't a straightforward way to pass on options to dracut(8) via
      positional parameters.  Because it is indirectly invoked at the end of a $
      make install sequence.
      
       # make install
         -> arch/$arch/boot/Makefile
          -> arch/$arch/boot/install.sh
           -> /sbing/installkernel ...
            -> /sbin/new-kernel-pkg ...
             -> /sbin/dracut ...
      Signed-off-by: NP J P <ppandit@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bf49dd4