1. 20 10月, 2007 10 次提交
    • P
      pid namespaces: initialize the namespace's proc_mnt · 6f4e6433
      Pavel Emelyanov 提交于
      The namespace's proc_mnt must be kern_mount-ed to make this pointer always
      valid, independently of whether the user space mounted the proc or not.  This
      solves raced in proc_flush_task, etc.  with the proc_mnt switching from NULL
      to not-NULL.
      
      The initialization is done after the init's pid is created and hashed to make
      proc_get_sb() finr it and get for root inode.
      
      Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
      superblock holds the namespace we must explicitly break this circle to destroy
      all the stuff.  This is done after the init of the namespace dies.  Running a
      few steps forward - when init exits it will kill all its children, so no
      proc_mnt will be needed after its death.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f4e6433
    • P
      pid namespaces: make proc_flush_task() actually from entries from multiple namespaces · 130f77ec
      Pavel Emelyanov 提交于
      This means that proc_flush_task_mnt() is to be called for many proc mounts and
      with different ids, depending on the namespace this pid is to be flushed from.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      130f77ec
    • P
      pid namespaces: make proc have multiple superblocks - one for each namespace · 07543f5c
      Pavel Emelyanov 提交于
      Each pid namespace have to be visible through its own proc mount.  Thus we
      need to have per-namespace proc trees with their own superblocks.
      
      We cannot easily show different pid namespace via one global proc tree, since
      each pid refers to different tasks in different namespaces.  E.g.  pid 1
      refers to the init task in the initial namespace and to some other task when
      seeing from another namespace.  Moreover - pid, exisintg in one namespace may
      not exist in the other.
      
      This approach has one move advantage is that the tasks from the init namespace
      can see what tasks live in another namespace by reading entries from another
      proc tree.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07543f5c
    • P
      pid namespaces: helpers to find the task by its numerical ids · 198fe21b
      Pavel Emelyanov 提交于
      When searching the task by numerical id on may need to find it using global
      pid (as it is done now in kernel) or by its virtual id, e.g.  when sending a
      signal to a task from one namespace the sender will specify the task's virtual
      id and we should find the task by this value.
      
      [akpm@linux-foundation.org: fix gfs2 linkage]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      198fe21b
    • P
      pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees · 60347f67
      Pavel Emelyanov 提交于
      The first part is trivial - we just make the proc_flush_task() to operate on
      arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
      it.
      
      The other change is more tricky: I moved the proc_flush_task() call in
      release_task() higher to address the following problem.
      
      When flushing task from many proc trees we need to know the set of ids (not
      just one pid) to find the dentries' names to flush.  Thus we need to pass the
      task's pid to proc_flush_task() as struct pid is the only object that can
      provide all the pid numbers.  But after __exit_signal() task has detached all
      his pids and this information is lost.
      
      This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
      tree and keep them in hash (since pids are still alive before __exit_signal())
      till the next shrink, but since proc_flush_task() does not provide a 100%
      guarantee that the dentries will be flushed, this is OK to do so.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60347f67
    • P
      Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov 提交于
      When someone wants to deal with some other taks's namespaces it has to lock
      the task and then to get the desired namespace if the one exists.  This is
      slow on read-only paths and may be impossible in some cases.
      
      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces - when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there - the code is under write locked tasklist lock.
      
      On the other hand switching the namespace on task (daemonize) and releasing
      the namespace (after the last task exit) is rather rare operation and we
      can sacrifice its speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   / *
                     * work with the namespaces here
                     * e.g. get the reference on one of them
                     * /
           } / *
               * NULL task_nsproxy() means that this task is
               * almost dead (zombie)
               * /
           rcu_read_unlock();
      
      This patch has passed the review by Eric and Oleg :) and,
      of course, tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf7b708c
    • S
      pid namespaces: define and use task_active_pid_ns() wrapper · 2894d650
      Sukadev Bhattiprolu 提交于
      With multiple pid namespaces, a process is known by some pid_t in every
      ancestor pid namespace.  Every time the process forks, the child process also
      gets a pid_t in every ancestor pid namespace.
      
      While a process is visible in >=1 pid namespaces, it can see pid_t's in only
      one pid namespace.  We call this pid namespace it's "active pid namespace",
      and it is always the youngest pid namespace in which the process is known.
      
      This patch defines and uses a wrapper to find the active pid namespace of a
      process.  The implementation of the wrapper will be changed in when support
      for multiple pid namespaces are added.
      
      Changelog:
      	2.6.22-rc4-mm2-pidns1:
      	- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
      	  task_active_pid_ns() in child_reaper() since task->nsproxy
      	  can be NULL during task exit (so child_reaper() continues to
      	  use init_pid_ns).
      
      	  to implement child_reaper() since init_pid_ns.child_reaper to
      	  implement child_reaper() since tsk->nsproxy can be NULL during exit.
      
      	2.6.21-rc6-mm1:
      	- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
      	  process can have multiple pid namespaces.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: NPavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2894d650
    • P
      pid namespaces: round up the API · a47afb0f
      Pavel Emelianov 提交于
      The set of functions process_session, task_session, process_group and
      task_pgrp is confusing, as the names can be mixed with each other when looking
      at the code for a long time.
      
      The proposals are to
      * equip the functions that return the integer with _nr suffix to
        represent that fact,
      * and to make all functions work with task (not process) by making
        the common prefix of the same name.
      
      For monotony the routines signal_session() and set_signal_session() are
      replaced with task_session_nr() and set_task_session(), especially since they
      are only used with the explicit task->signal dereference.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47afb0f
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage 提交于
      Remove the filesystem support logic from the cpusets system and makes cpusets
      a cgroup subsystem
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8793d854
    • P
      Task Control Groups: add procfs interface · a424316c
      Paul Menage 提交于
      Add:
      
      /proc/cgroups - general system info
      
      /proc/*/cgroup - per-task cgroup membership info
      
      [a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a424316c
  2. 17 10月, 2007 8 次提交
  3. 15 10月, 2007 3 次提交
  4. 11 10月, 2007 5 次提交
    • P
      [NETNS]: Move some code into __init section when CONFIG_NET_NS=n · 4665079c
      Pavel Emelyanov 提交于
      With the net namespaces many code leaved the __init section,
      thus making the kernel occupy more memory than it did before.
      Since we have a config option that prohibits the namespace
      creation, the functions that initialize/finalize some netns
      stuff are simply not needed and can be freed after the boot.
      
      Currently, this is almost not noticeable, since few calls
      are no longer in __init, but when the namespaces will be
      merged it will be possible to free more code. I propose to
      use the __net_init, __net_exit and __net_initdata "attributes"
      for functions/variables that are not used if the CONFIG_NET_NS
      is not set to save more space in memory.
      
      The exiting functions cannot just reside in the __exit section,
      as noticed by David, since the init section will have
      references on it and the compilation will fail due to modpost
      checks. These references can exist, since the init namespace
      never dies and the exit callbacks are never called. So I
      introduce the __exit_refok attribute just like it is already
      done with the __init_refok.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4665079c
    • E
      [NET]: Fix race when opening a proc file while a network namespace is exiting. · 077130c0
      Eric W. Biederman 提交于
      The problem:  proc_net files remember which network namespace the are
      against but do not remember hold a reference count (as that would pin
      the network namespace).   So we currently have a small window where
      the reference count on a network namespace may be incremented when opening
      a /proc file when it has already gone to zero.
      
      To fix this introduce maybe_get_net and get_proc_net.
      
      maybe_get_net increments the network namespace reference count only if it is
      greater then zero, ensuring we don't increment a reference count after it
      has gone to zero.
      
      get_proc_net handles all of the magic to go from a proc inode to the network
      namespace instance and call maybe_get_net on it.
      
      PROC_NET the old accessor is removed so that we don't get confused and use
      the wrong helper function.
      
      Then I fix up the callers to use get_proc_net and handle the case case
      where get_proc_net returns NULL.  In that case I return -ENXIO because
      effectively the network namespace has already gone away so the files
      we are trying to access don't exist anymore.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Acked-by: NPaul E. McKenney <paulmck@us.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      077130c0
    • D
      [NETNS]: Fix export symbols. · 36ac3135
      Daniel Lezcano 提交于
      Add the appropriate EXPORT_SYMBOLS for proc_net_create,
      proc_net_fops_create and proc_net_remove to fix errors when
      compiling allmodconfig
      Signed-off-by: NMark Nelson <markn@au1.ibm.com>
      Acked-by: NBenjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36ac3135
    • D
      [NET]: Fix missed addition of fs/proc/proc_net.c · 3c12afe7
      David S. Miller 提交于
      My bad.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c12afe7
    • E
      [NET]: Make /proc/net per network namespace · 457c4cbc
      Eric W. Biederman 提交于
      This patch makes /proc/net per network namespace.  It modifies the global
      variables proc_net and proc_net_stat to be per network namespace.
      The proc_net file helpers are modified to take a network namespace argument,
      and all of their callers are fixed to pass &init_net for that argument.
      This ensures that all of the /proc/net files are only visible and
      usable in the initial network namespace until the code behind them
      has been updated to be handle multiple network namespaces.
      
      Making /proc/net per namespace is necessary as at least some files
      in /proc/net depend upon the set of network devices which is per
      network namespace, and even more files in /proc/net have contents
      that are relevant to a single network namespace.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      457c4cbc
  5. 10 10月, 2007 1 次提交
    • P
      Rework /proc/locks via seq_files and seq_list helpers · 7f8ada98
      Pavel Emelyanov 提交于
      Currently /proc/locks is shown with a proc_read function, but its behavior
      is rather complex as it has to manually handle current offset and buffer
      length.  On the other hand, files that show objects from lists can be
      easily reimplemented using the sequential files and the seq_list_XXX()
      helpers.
      
      This saves (as usually) 16 lines of code and more than 200 from
      the .text section.
      
      [akpm@linux-foundation.org: no externs in C]
      [akpm@linux-foundation.org: warning fixes]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      7f8ada98
  6. 12 9月, 2007 1 次提交
    • A
      Fix select on /proc files without ->poll · dd23aae4
      Alexey Dobriyan 提交于
      Taneli Vähäkangas <vahakang@cs.helsinki.fi> reported that commit
      786d7e16 aka "Fix rmmod/read/write races
      in /proc entries" broke SBCL + SLIME combo.
      
      The old code in do_select() used DEFAULT_POLLMASK, if couldn't find
      ->poll handler.  The new code makes ->poll always there and returns 0 by
      default, which is not correct.  Return DEFAULT_POLLMASK instead.
      
      Steps to reproduce:
      
      	install emacs, SBCL, SLIME
      	emacs
      	M-x slime	in *inferior-lisp* buffer
      	[watch it doing "Connecting to Swank on port X.."]
      
      Please, apply before 2.6.23.
      
      P.S.: why SBCL can't just read(2) /proc/cpuinfo is a mystery.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: T Taneli Vahakangas <vahakang@cs.helsinki.fi>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd23aae4
  7. 23 8月, 2007 1 次提交
  8. 01 8月, 2007 1 次提交
  9. 29 7月, 2007 1 次提交
  10. 22 7月, 2007 1 次提交
  11. 20 7月, 2007 4 次提交
  12. 18 7月, 2007 1 次提交
    • T
      kallsyms: make KSYM_NAME_LEN include space for trailing '\0' · 9281acea
      Tejun Heo 提交于
      KSYM_NAME_LEN is peculiar in that it does not include the space for the
      trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating
      buffer.  This is nonsense and error-prone.  Moreover, when the caller
      forgets that it's very likely to subtly bite back by corrupting the stack
      because the last position of the buffer is always cleared to zero.
      
      This patch increments KSYM_NAME_LEN by one and updates code accordingly.
      
      * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro
        is fixed.
      
      * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together,
        MODULE_NAME_LEN was treated as if it didn't include space for the
        trailing '\0'.  Fix it.
      Signed-off-by: NTejun Heo <htejun@gmail.com>
      Acked-by: NPaulo Marques <pmarques@grupopie.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9281acea
  13. 17 7月, 2007 3 次提交