1. 20 Nov, 2012 1 commit
    • proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman committed
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowed by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
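A minimal userspace sketch of the test this commit enables, assuming the namespace files live under /proc/<pid>/ns/: two processes share a namespace when the corresponding files report the same device and inode. The helper name `same_inode` is hypothetical and only wraps `stat()`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either stat() call fails.
 * (Hypothetical helper, not part of the kernel patch.)
 */
static int same_inode(const char *path_a, const char *path_b)
{
    struct stat a, b;

    assert(path_a != NULL && path_b != NULL);
    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

A caller might compare, for example, `same_inode("/proc/self/ns/uts", "/proc/1/ns/uts")`; the call returns -1 when the caller lacks permission to inspect the other process.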
  2. 19 Nov, 2012 5 commits
    • pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 · af4b8a83
      Eric W. Biederman committed
      Looking at pid_ns->nr_hashed is a bit simpler and it works for
      disjoint process trees that an unshare or a join of a pid_namespace
      may create.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Don't allow new processes in a dead pid namespace. · 5e1182de
      Eric W. Biederman committed
      Set nr_hashed to -1 just before we schedule the work to cleanup proc.
      Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
      fail.
      
      This guarantees that processes never enter a pid namespace after we
      have cleaned up the state to support processes in a pid namespace.
      
      Currently sending SIGKILL to all of the processes in a pid namespace as
      init exits gives us this guarantee but we need something a little
      stronger to support unsharing and joining a pid namespace.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
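The counter rule described in this commit can be modelled in userspace roughly as follows. This is a toy sketch, not kernel code: `toy_pid_ns`, `toy_hash_pid`, and `toy_unhash_pid` are hypothetical names, and the boolean flag merely stands in for the scheduled proc cleanup work. It assumes cleanup is triggered when the live count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the namespace state discussed above. */
struct toy_pid_ns {
    int nr_hashed;          /* live pids, or -1 once the namespace is dead */
    bool cleanup_scheduled; /* stands in for the queued proc cleanup work */
};

/* Try to add a pid; refuse if the namespace has already been torn down. */
static bool toy_hash_pid(struct toy_pid_ns *ns)
{
    assert(ns != NULL);
    if (ns->nr_hashed < 0)
        return false;
    ns->nr_hashed++;
    return true;
}

/* Remove a pid; when the last one goes, mark the namespace dead first,
 * then record that cleanup has been scheduled. */
static void toy_unhash_pid(struct toy_pid_ns *ns)
{
    assert(ns != NULL && ns->nr_hashed > 0);
    if (--ns->nr_hashed == 0) {
        ns->nr_hashed = -1;
        ns->cleanup_scheduled = true;
    }
}
```

Setting the counter to -1 before scheduling cleanup is what lets a later `toy_hash_pid()` call fail instead of reviving a namespace whose supporting state is gone.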
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman committed
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep and the work of
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman committed
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.
      
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Capture the user namespace and filter ns_last_pid · 49f4d8b9
      Eric W. Biederman committed
      - Capture the user namespace that creates the pid namespace
      - Use that user namespace to test if it is ok to write to
        /proc/sys/kernel/ns_last_pid.
      
      Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
      when destroying a pid_ns.  I have folded his patch into this one
      so that bisects will work properly.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  3. 15 Aug, 2012 1 commit
    • net ip6 flowlabel: Make owner a union of struct pid * and kuid_t · 4f82f457
      Eric W. Biederman committed
      Correct a long standing omission and use struct pid in the owner
      field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
      This guarantees we don't have issues when pid wraparound occurs.
      
      Use a kuid_t in the owner field of struct ip6_flowlabel when the
      share type is IPV6_FL_S_USER to add user namespace support.
      
      In /proc/net/ip6_flowlabel capture the current pid namespace when
      opening the file and release the pid namespace when the file is
      closed ensuring we print the pid owner value that is meaningful to
      the reader of the file.  Similarly use from_kuid_munged to print
      uid values that are meaningful to the reader of the file.
      
      This requires exporting pid_nr_ns so that ipv6 can continue to be built
      as a module.  Yoiks what silliness.
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
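The owner union described in this commit might look roughly like the following userspace model. All `toy_` names are hypothetical stand-ins (for struct pid, kuid_t, and the share types); the point is only that ownership is tested against a pid object rather than a reusable pid number.

```c
#include <assert.h>
#include <stddef.h>

enum toy_share { TOY_SHARE_PROCESS, TOY_SHARE_USER };

struct toy_pid { int nr; };                      /* stand-in for struct pid */
typedef struct { unsigned int val; } toy_kuid_t; /* stand-in for kuid_t */

struct toy_flowlabel {
    enum toy_share share;
    union {
        struct toy_pid *pid; /* valid when share == TOY_SHARE_PROCESS */
        toy_kuid_t uid;      /* valid when share == TOY_SHARE_USER */
    } owner;
};

/* Return 1 if the label belongs to the given process or user. */
static int toy_owned_by(const struct toy_flowlabel *fl,
                        const struct toy_pid *pid, toy_kuid_t uid)
{
    assert(fl != NULL);
    switch (fl->share) {
    case TOY_SHARE_PROCESS:
        /* Pointer identity: a recycled pid number is a different object. */
        return fl->owner.pid == pid;
    case TOY_SHARE_USER:
        return fl->owner.uid.val == uid.val;
    }
    return 0;
}
```

Because the process case compares object identity, a new process that happens to receive the same numeric pid after wraparound does not match the old owner.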
  4. 24 May, 2012 1 commit
  5. 14 Feb, 2012 1 commit
  6. 13 Jan, 2012 1 commit
  7. 31 Oct, 2011 1 commit
  8. 29 Sep, 2011 1 commit
    • rcu: Restore checks for blocking in RCU read-side critical sections · b3fbab05
      Paul E. McKenney committed
      Long ago, using TREE_RCU with PREEMPT would result in "scheduling
      while atomic" diagnostics if you blocked in an RCU read-side critical
      section.  However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
      this diagnostic.  This commit therefore adds a replacement diagnostic
      based on PROVE_RCU.
      
      Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
      used for things that have nothing to do with rcu_dereference(), rename
      lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
      argument that is a string indicating what is suspicious.  This third
      argument is passed in from a new third argument to rcu_lockdep_assert().
      Update all calls to rcu_lockdep_assert() to add an informative third
      argument.
      
      Also, add a pair of rcu_lockdep_assert() calls from within
      rcu_note_context_switch(), one complaining if a context switch occurs
      in an RCU-bh read-side critical section and another complaining if a
      context switch occurs in an RCU-sched read-side critical section.
      These are present only if the PROVE_RCU kernel parameter is enabled.
      
      Finally, fix some checkpatch whitespace complaints in lockdep.c.
      
      Again, you must enable PROVE_RCU to see these new diagnostics.  But you
      are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  9. 09 Jul, 2011 1 commit
  10. 19 Apr, 2011 1 commit
    • next_pidmap: fix overflow condition · c78193e9
      Linus Torvalds committed
      next_pidmap() just quietly accepted whatever 'last' pid that was passed
      in, which is not all that safe when one of the users is /proc.
      
      Admittedly the proc code should do some sanity checking on the range
      (and that will be the next commit), but that doesn't mean that the
      helper functions should just do that pidmap pointer arithmetic without
      checking the range of its arguments.
      
      So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
      doesn't really matter, the for-loop does check against the end of the
      pidmap array properly (it's only the actual pointer arithmetic overflow
      case we need to worry about, and going one bit beyond isn't going to
      overflow).
      
      [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
      Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
      Analyzed-by: Robert Święcki <robert@swiecki.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
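A userspace sketch of the range check discussed in this commit, under stated assumptions: the bitmap is a flat array, the limit is a small demo value, and an out-of-range `last` is simply rejected (the message says "clamp", and this text does not show whether the patch rejects or saturates). All `toy_`/`TOY_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_PID_MAX_LIMIT 1024u /* assumed small limit for the demo */
#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS (TOY_PID_MAX_LIMIT / TOY_BITS_PER_WORD)

struct toy_pidmap { unsigned long bits[TOY_WORDS]; };

/* Mark a pid as allocated in the toy bitmap. */
static void toy_set_pid(struct toy_pidmap *map, unsigned int pid)
{
    assert(map != NULL && pid < TOY_PID_MAX_LIMIT);
    map->bits[pid / TOY_BITS_PER_WORD] |= 1UL << (pid % TOY_BITS_PER_WORD);
}

/* Return the next allocated pid after 'last', or -1 if there is none.
 * The range check comes first, so no index is derived from a wild value. */
static int toy_next_pidmap(const struct toy_pidmap *map, unsigned int last)
{
    unsigned int pid;

    if (last >= TOY_PID_MAX_LIMIT)
        return -1;
    for (pid = last + 1; pid < TOY_PID_MAX_LIMIT; pid++) {
        if (map->bits[pid / TOY_BITS_PER_WORD] &
            (1UL << (pid % TOY_BITS_PER_WORD)))
            return (int)pid;
    }
    return -1;
}
```

Computing `last + 1` after the check is harmless here for the reason the message gives: the loop condition still bounds every access to the array.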
  11. 18 Mar, 2011 1 commit
  12. 20 Aug, 2010 2 commits
    • Add RCU check for find_task_by_vpid(). · 4221a991
      Tetsuo Handa committed
      find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to
      commit 3120438a "rcu: Disable lockdep checking in RCU list-traversal primitives",
      we are currently unable to catch "find_task_by_vpid() with tasklist_lock held
      but RCU lock not held" errors due to the RCU-lockdep checks being
      suppressed in the RCU variants of the struct list_head traversals.
      This commit therefore places an explicit check for being in an RCU
      read-side critical section in find_task_by_pid_ns().
      
        ===================================================
        [ INFO: suspicious rcu_dereference_check() usage. ]
        ---------------------------------------------------
        kernel/pid.c:386 invoked rcu_dereference_check() without protection!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 1
        1 lock held by rc.sysinit/1102:
         #0:  (tasklist_lock){.+.+..}, at: [<c1048340>] sys_setpgid+0x40/0x160
      
        stack backtrace:
        Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
        Call Trace:
         [<c105e714>] lockdep_rcu_dereference+0x94/0xb0
         [<c104b4cd>] find_task_by_pid_ns+0x6d/0x70
         [<c104b4e8>] find_task_by_vpid+0x18/0x20
         [<c1048347>] sys_setpgid+0x47/0x160
         [<c1002b50>] sysenter_do_call+0x12/0x36
      
      Commit updated to use a new rcu_lockdep_assert() exported API rather than
      the old internal __do_rcu_dereference().
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
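The idea of an explicit "am I inside a read-side critical section?" check can be illustrated with a small userspace model. This is only an analogy: the depth counter and every `toy_` name are hypothetical, and real RCU and lockdep work quite differently internally.

```c
#include <assert.h>
#include <stdio.h>

static int toy_rcu_depth;    /* nesting depth of the read-side section */
static int toy_rcu_warnings; /* number of failed checks so far */

static void toy_rcu_read_lock(void)
{
    toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
    assert(toy_rcu_depth > 0);
    toy_rcu_depth--;
}

/* Report a caller that is not inside a read-side section. */
static void toy_rcu_assert_held(const char *who)
{
    if (toy_rcu_depth == 0) {
        toy_rcu_warnings++;
        fprintf(stderr, "%s called without toy_rcu_read_lock()\n", who);
    }
}

/* Lookup stub: the check runs on entry, before any list is walked. */
static int toy_find_task(int nr)
{
    toy_rcu_assert_held("toy_find_task");
    return nr; /* the actual lookup is elided in this model */
}
```

Placing the check in the lookup function itself catches a caller that holds some other lock but never entered the read-side section, which is the case the commit message describes.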
    • rculist: avoid __rcu annotations · 67bdbffd
      Arnd Bergmann committed
      This avoids warnings from missing __rcu annotations
      in the rculist implementation, making it possible to
      use the same lists in both RCU and non-RCU cases.
      
      We can add rculist annotations later, together with
      lockdep support for rculist, which is missing as well,
      but that may involve changing all the users.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
  13. 11 Aug, 2010 2 commits
    • pids: alloc_pidmap: remove the unnecessary boundary checks · c52b0b91
      Oleg Nesterov committed
      alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
      inspect the first map->page twice.  This is correct, we want to find the
      unused bits < offset in this bitmap block.  Add the comment.
      
      But it doesn't make any sense to stop the find_next_offset() loop when we
      are looking into this map->page for the second time.  We have
      already checked the bits >= offset during the first attempt, it is fine to
      do this again, no matter if we succeed this time or not.
      
      Remove this hard-to-understand code.  It optimizes the very unlikely case
      when we are going to fail, but slows down the more likely case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pids: fix a race in pid generation that causes pids to be reused immediately · 5fdee8c4
      Salman committed
      A program that repeatedly forks and waits is susceptible to having the
      same pid repeated, especially when it competes with another instance of
      the same program.  This is really bad for the bash implementation.
      Furthermore, many shell scripts assume that pid numbers will not be used
      for some length of time.
      
      Race Description:
      
      A                                    B
      
      // pid == offset == n                // pid == offset == n + 1
      test_and_set_bit(offset, map->page)
                                           test_and_set_bit(offset, map->page);
                                           pid_ns->last_pid = pid;
      pid_ns->last_pid = pid;
                                           // pid == n + 1 is freed (wait())
      
                                           // Next fork()...
                                           last = pid_ns->last_pid; // == n
                                           pid = last + 1;
      
      Code to reproduce it (Running multiple instances is more effective):
      
      #include <errno.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      
      // The distance mod 32768 between two pids, where the first pid is expected
      // to be smaller than the second.
      int PidDistance(pid_t first, pid_t second) {
        return (second + 32768 - first) % 32768;
      }
      
      int main(int argc, char* argv[]) {
        int failed = 0;
        pid_t last_pid = 0;
        int i;
        printf("%zu\n", sizeof(pid_t));
        for (i = 0; i < 10000000; ++i) {
          if (i % 32768 == 0)
            printf("Iter: %d\n", i/32768);
          int child_exit_code = i % 256;
          pid_t pid = fork();
          if (pid == -1) {
            fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
            exit(1);
          }
          if (pid == 0) {
            // Child
            exit(child_exit_code);
          } else {
            // Parent
            if (i > 0) {
              int distance = PidDistance(last_pid, pid);
              if (distance == 0 || distance > 30000) {
                fprintf(stderr,
                        "Unexpected pid sequence: previous fork: pid=%d, "
                        "current fork: pid=%d for iteration=%d.\n",
                        last_pid, pid, i);
                failed = 1;
              }
            }
            last_pid = pid;
            int status;
            int reaped = wait(&status);
            if (reaped != pid) {
              fprintf(stderr,
                      "Wait return value: expected pid=%d, "
                      "got %d, iteration %d\n",
                      pid, reaped, i);
              failed = 1;
            } else if (WEXITSTATUS(status) != child_exit_code) {
              fprintf(stderr,
                      "Unexpected exit status %x, iteration %d\n",
                      WEXITSTATUS(status), i);
              failed = 1;
            }
          }
        }
        exit(failed);
      }
      
      Thanks to Ted Tso for the key ideas of this implementation.
      Signed-off-by: Salman Qazi <sqazi@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 28 May, 2010 1 commit
  15. 07 Mar, 2010 1 commit
  16. 04 Mar, 2010 1 commit
  17. 25 Feb, 2010 1 commit
    • sched: Use lockdep-based checking on rcu_dereference() · d11c563d
      Paul E. McKenney committed
      Update the rcu_dereference() usages to take advantage of the new
      lockdep-based checking.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
      [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 16 Dec, 2009 2 commits
  19. 22 Sep, 2009 1 commit
  20. 10 Jul, 2009 1 commit
  21. 30 Jun, 2009 1 commit
  22. 19 Jun, 2009 1 commit
    • pids: clean up find_task_by_pid variants · 17f98dcf
      Christoph Hellwig committed
      find_task_by_pid_type_ns is only used to implement find_task_by_vpid and
      find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument.
      So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use
      find_task_by_pid_ns to implement find_task_by_vpid.
      
      While we're at it also remove the exports for find_task_by_pid_ns and
      find_task_by_vpid - we don't have any modular callers left as the only
      modular caller of the old pre-pid-namespace find_task_by_pid (gfs2) was
      switched to pid_task which operates on a struct pid pointer instead of a
      pid_t.  Given the confusion about pid_t values vs namespace that's
      generally the better option anyway and I think we're better off restricting
      modules to do it that way.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 03 Apr, 2009 2 commits
    • pids: refactor vnr/nr_ns helpers to make them safe · 52ee2dfd
      Oleg Nesterov committed
      Imho, the safety rules for vnr/nr_ns helpers are horrible and buggy.
      
      task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.
      
      As for "special" pids, vnr/nr_ns helpers always need rcu.  However, if
      task != current, they are unsafe even under rcu lock, we can't trust
      task->group_leader without the special checks.
      
      And almost every helper has a callsite which needs a fix.
      
      Also, it is a bit annoying that the implementations of, say,
      task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".
      
      This patch introduces the new helper, __task_pid_nr_ns(), which is always
      safe to use, and turns all other helpers into the trivial wrappers.
      
      After this I'll send another patch which converts task_tgid_xxx() as well,
      they're a bit special.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp() · 2ae448ef
      Oleg Nesterov committed
      sys_wait4() does get_pid(task_pgrp(current)), this is not safe.  We can
      add rcu lock/unlock around it, but we already have get_task_pid() which can
      be improved to handle the special pids in a more reliable manner.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 09 Jan, 2009 1 commit
    • pid: generalize task_active_pid_ns · 61bce0f1
      Eric W. Biederman committed
      Currently task_active_pid_ns is not safe to call after a task becomes a
      zombie and exit_task_namespaces is called, as nsproxy becomes NULL.  By
      reading the pid namespace from the pid of the task we can trivially solve
      this problem at the cost of one extra memory read in what should be the
      same cacheline as we read the namespace from.
      
      When moving things around I have made task_active_pid_ns out of line
      because keeping it in pid_namespace.h would require adding includes of
      pid.h and sched.h that I don't think we want.
      
      This change does make task_active_pid_ns unsafe to call during
      copy_process until we attach a pid on the task_struct which seems to be a
      reasonable trade off.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Bastian Blank <bastian@waldi.eu.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
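The lifetime difference this commit relies on can be shown with a toy userspace model. The `toy_` structs below are hypothetical simplifications, not the kernel's definitions, and the old-style accessor returns NULL where the kernel would have dereferenced a NULL nsproxy.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pid_ns { int level; };
struct toy_pid { struct toy_pid_ns *ns; };
struct toy_nsproxy { struct toy_pid_ns *pid_ns; };

struct toy_task {
    struct toy_nsproxy *nsproxy; /* becomes NULL once the task has exited */
    struct toy_pid *pid;         /* attached after fork, kept while a zombie */
};

/* Old style: goes through nsproxy, so it has no answer for a zombie. */
static struct toy_pid_ns *toy_ns_via_nsproxy(const struct toy_task *t)
{
    assert(t != NULL);
    return t->nsproxy ? t->nsproxy->pid_ns : NULL;
}

/* New style: reads the namespace from the task's pid instead. */
static struct toy_pid_ns *toy_active_pid_ns(const struct toy_task *t)
{
    assert(t != NULL);
    return t->pid ? t->pid->ns : NULL;
}
```

The model also shows the trade-off the message mentions: before a pid is attached (`t->pid == NULL`, as during the early part of fork), the pid-based accessor has nothing to read.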
  25. 06 Jan, 2009 1 commit
  26. 26 Jul, 2008 2 commits
  27. 19 May, 2008 1 commit
  28. 30 Apr, 2008 4 commits