1. 30 Jan, 2016 1 commit
  2. 15 Jan, 2016 1 commit
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov committed
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d097056
  3. 25 Nov, 2015 1 commit
  4. 23 Jul, 2015 1 commit
  5. 17 Apr, 2015 1 commit
    • M
      fork: report pid reservation failure properly · 35f71bc0
      Michal Hocko committed
      copy_process will report any failure in alloc_pid as ENOMEM currently
      which is misleading because the pid allocation might fail not only when
      the memory is short but also when the pid space is consumed already.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect wrt. the documentation.  The POSIX
      man page also suggests returning EAGAIN when the process count limit is reached.
      
      This patch simply propagates error code from alloc_pid and makes sure we
      return -EAGAIN due to reservation failure.  This will make behavior of
      fork closer to both our documentation and POSIX.
      
      alloc_pid might also fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes) and there is no
      good error code which would match the documented ones. We have traditionally
      returned ENOMEM for this case, which is misleading as well, but as per
      Eric W. Biederman this behavior is documented in man pid_namespaces(7):
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new processes in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky so let's stick to
      ENOMEM for this case.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35f71bc0
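From userspace, the distinction drawn in commit 35f71bc0 can be acted on by inspecting errno after a failed fork(2). The following is a minimal sketch, not code from the patch; the helper name `describe_fork_failure` and the returned strings are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): map the errno left by a
 * failed fork(2) to the causes the commit message associates with it.
 */
static const char *describe_fork_failure(int err)
{
    assert(err > 0); /* errno values are positive */

    switch (err) {
    case EAGAIN:
        /* RLIMIT_NPROC, threads-max or pid_max was reached. */
        return "process or pid limit reached";
    case ENOMEM:
        /* Memory is short, or the pid namespace's init has exited. */
        return "out of memory or dead pid namespace";
    default:
        return strerror(err);
    }
}
```

A caller would typically invoke this as `describe_fork_failure(errno)` right after `fork()` returns -1, and might choose to retry later only in the EAGAIN case.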
  6. 11 Dec, 2014 1 commit
  7. 05 Dec, 2014 2 commits
  8. 01 Oct, 2013 1 commit
  9. 31 Aug, 2013 1 commit
    • E
      pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup · a6064885
      Eric W. Biederman committed
      Serge Hallyn <serge.hallyn@ubuntu.com> writes:
      
      > Since commit af4b8a83 it's been
      > possible to get into a situation where a pidns reaper is
      > <defunct>, reparented to host pid 1, but never reaped.  How to
      > reproduce this is documented at
      >
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
      > (and see
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
      > In short, run repeated starts of a container whose init is
      >
      > Process.exit(0);
      >
      > sysrq-t when such a task is playing zombie shows:
      >
      > [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
      > [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
      > [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
      > [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
      > [  131.132978] Call Trace:
      > [  131.132978]  [<ffffffff816f6159>] schedule+0x29/0x70
      > [  131.132978]  [<ffffffff81064591>] do_exit+0x6e1/0xa40
      > [  131.132978]  [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30
      > [  131.132978]  [<ffffffff8106496f>] do_group_exit+0x3f/0xa0
      > [  131.132978]  [<ffffffff810649e4>] SyS_exit_group+0x14/0x20
      > [  131.132978]  [<ffffffff8170102f>] tracesys+0xe1/0xe6
      >
      > Further debugging showed that every time this happened, zap_pid_ns_processes()
      > started with nr_hashed being 3, while we were expecting it to drop to 2.
      > Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
      > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
      > if nr_hashed hits 1.
      
      The issue is that when the task group leader of an init process exits
      before the other tasks of the init process, then when the init process
      finally exits it will be a secondary task sleeping in
      zap_pid_ns_processes, waiting to wake up when the number of hashed pids
      drops to two.  This case waits forever, as free_pid only sends a wake up
      when the number of hashed pids drops to 1.

      To correct this, the simple strategy of sending a possibly unnecessary
      wake up when the number of hashed pids drops to 2 is adopted.

      Sending one extraneous wake up is relatively harmless; at worst we
      waste a little cpu time in the rare case when a pid namespace
      approaches exiting.

      We can detect the case when the pid namespace drops to just two pids
      hashed race free in free_pid.
      
      Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
      without the tasklist_lock because it is guaranteed that
      detach_pid will be called on the child_reaper before it is freed, and
      detach_pid calls __change_pid, which calls free_pid, which takes the
      pidmap_lock.  __change_pid only calls free_pid if this is the
      last use of the pid.  For a thread that is not the thread group leader,
      the thread's pid will only ever have one user because a thread's pid
      is not allowed to be the pid of a process, of a process group or
      a session.  For a thread that is a thread group leader, all of
      the other threads of that process will be reaped before the
      thread group leader is allowed to be reaped, ensuring there will only
      be one user of the thread's pid as a process pid.  Furthermore,
      because the thread is the init process of a pid namespace, all of the
      other processes in the pid namespace will have also been already freed,
      so the pid will not be used as a session pid or
      a process group pid for any other running process.
      
      CC: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      a6064885
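The wake-up rule described in commit a6064885 can be modelled in userspace as follows. This is a simplified sketch under stated assumptions, not the kernel code: `struct pidns_model` and `free_pid_model` are hypothetical stand-ins, and only the `nr_hashed` counter is borrowed from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pid namespace; only the counter is modelled. */
struct pidns_model {
    int nr_hashed; /* number of pids currently hashed in the namespace */
};

/*
 * Model of freeing one pid. Returns true when the reaper should be woken.
 * Waking at 2 as well as 1 covers the case where the exited group leader
 * of init still holds a hashed pid while a secondary thread is waiting.
 */
static bool free_pid_model(struct pidns_model *ns)
{
    assert(ns->nr_hashed > 0);

    switch (--ns->nr_hashed) {
    case 2: /* possibly unnecessary wake-up, harmless if so */
    case 1: /* only the reaper is left */
        return true;
    default:
        return false;
    }
}
```

Starting from `nr_hashed == 3`, the first call already reports a wake-up, which is the situation that previously hung.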
  10. 04 Jul, 2013 2 commits
  11. 02 May, 2013 1 commit
  12. 01 May, 2013 2 commits
  13. 28 Feb, 2013 1 commit
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin committed
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones, which look like this:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
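To illustrate the three-argument iterator form that commit b67bfe0d moves to, here is a small userspace sketch. It is not the definition from linux/list.h: the `entry_or_null` helper, `struct item` and `sum_items` are hypothetical, and `__typeof__` is a GCC/Clang extension assumed to be available.

```c
#include <assert.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* Map a node pointer back to its containing object, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument form: no separate struct hlist_node cursor is needed. */
#define hlist_for_each_entry(pos, head, member)                              \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);     \
         pos;                                                                \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

/* Hypothetical payload type used only for this example. */
struct item {
    int value;
    struct hlist_node node;
};

/* Walk the list with the entry pointer alone and add up the values. */
static int sum_items(struct hlist_head *head)
{
    struct item *it;
    int sum = 0;

    assert(head != NULL);
    hlist_for_each_entry(it, head, node)
        sum += it->value;
    return sum;
}
```

With the older four-argument form, `sum_items` would also have needed a `struct hlist_node *` local that the loop body never touches, which is the declaration the semantic patch deletes.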
  14. 13 Feb, 2013 1 commit
  15. 26 Dec, 2012 1 commit
    • E
      pidns: Stop pid allocation when init dies · c876ad76
      Eric W. Biederman committed
      Oleg pointed out that in a pid namespace the sequence:
      - pid 1 becomes a zombie
      - setns(thepidns), fork,...
      - reaping pid 1.
      - The injected processes exiting.

      can lead to processes attempting to access their child reaper and
      instead following a stale pointer.
      
      That waitpid for init can return before all of the processes in
      the pid namespace have exited is also unfortunate.
      
      Avoid these problems by disabling the allocation of new pids in a pid
      namespace when init dies, instead of when the last process in a pid
      namespace is reaped.
      Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c876ad76
  16. 18 Dec, 2012 1 commit
  17. 06 Dec, 2012 1 commit
  18. 20 Nov, 2012 1 commit
    • E
      proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman committed
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowed by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      98f842e6
  19. 19 Nov, 2012 5 commits
    • E
      pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 · af4b8a83
      Eric W. Biederman committed
      Looking at pid_ns->nr_hashed is a bit simpler and it works for
      disjoint process trees that an unshare or a join of a pid_namespace
      may create.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      af4b8a83
    • E
      pidns: Don't allow new processes in a dead pid namespace. · 5e1182de
      Eric W. Biederman committed
      Set nr_hashed to -1 just before we schedule the work to cleanup proc.
      Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
      fail.

      This guarantees that processes never enter a pid namespace after we
      have cleaned up the state to support processes in a pid namespace.

      Currently sending SIGKILL to all of the processes in a pid namespace as
      init exits gives us this guarantee but we need something a little
      stronger to support unsharing and joining a pid namespace.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      5e1182de
    • E
      pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman committed
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.

      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep, and the work of
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      0a01f2cc
    • E
      pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman committed
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.

      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.

      So I have used task_active_pid_ns everywhere I can.

      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      17cf22c3
    • E
      pidns: Capture the user namespace and filter ns_last_pid · 49f4d8b9
      Eric W. Biederman committed
      - Capture the user namespace that creates the pid namespace
      - Use that user namespace to test if it is ok to write to
        /proc/sys/kernel/ns_last_pid.

      Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
      when destroying a pid_ns.  I have folded his patch into this one
      so that bisects will work properly.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      49f4d8b9
  20. 15 Aug, 2012 1 commit
    • E
      net ip6 flowlabel: Make owner a union of struct pid * and kuid_t · 4f82f457
      Eric W. Biederman committed
      Correct a long standing omission and use struct pid in the owner
      field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
      This guarantees we don't have issues when pid wraparound occurs.
      
      Use a kuid_t in the owner field of struct ip6_flowlabel when the
      share type is IPV6_FL_S_USER to add user namespace support.
      
      In /proc/net/ip6_flowlabel capture the current pid namespace when
      opening the file and release the pid namespace when the file is
      closed ensuring we print the pid owner value that is meaningful to
      the reader of the file.  Similarly use from_kuid_munged to print
      uid values that are meaningful to the reader of the file.

      This requires exporting pid_nr_ns so that ipv6 can continue to be built
      as a module.  Yoiks, what silliness.
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      4f82f457
  21. 24 May, 2012 1 commit
  22. 14 Feb, 2012 1 commit
  23. 13 Jan, 2012 1 commit
  24. 31 Oct, 2011 1 commit
  25. 29 Sep, 2011 1 commit
    • P
      rcu: Restore checks for blocking in RCU read-side critical sections · b3fbab05
      Paul E. McKenney committed
      Long ago, using TREE_RCU with PREEMPT would result in "scheduling
      while atomic" diagnostics if you blocked in an RCU read-side critical
      section.  However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
      this diagnostic.  This commit therefore adds a replacement diagnostic
      based on PROVE_RCU.
      
      Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
      used for things that have nothing to do with rcu_dereference(), rename
      lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
      argument that is a string indicating what is suspicious.  This third
      argument is passed in from a new third argument to rcu_lockdep_assert().
      Update all calls to rcu_lockdep_assert() to add an informative third
      argument.
      
      Also, add a pair of rcu_lockdep_assert() calls from within
      rcu_note_context_switch(), one complaining if a context switch occurs
      in an RCU-bh read-side critical section and another complaining if a
      context switch occurs in an RCU-sched read-side critical section.
      These are present only if the PROVE_RCU kernel parameter is enabled.
      
      Finally, fix some checkpatch whitespace complaints in lockdep.c.
      
      Again, you must enable PROVE_RCU to see these new diagnostics.  But you
      are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b3fbab05
  26. 09 Jul, 2011 1 commit
  27. 19 Apr, 2011 1 commit
    • L
      next_pidmap: fix overflow condition · c78193e9
      Linus Torvalds committed
      next_pidmap() just quietly accepted whatever 'last' pid that was passed
      in, which is not all that safe when one of the users is /proc.
      
      Admittedly the proc code should do some sanity checking on the range
      (and that will be the next commit), but that doesn't mean that the
      helper functions should just do that pidmap pointer arithmetic without
      checking the range of its arguments.
      
      So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
      doesn't really matter, the for-loop does check against the end of the
      pidmap array properly (it's only the actual pointer arithmetic overflow
      case we need to worry about, and going one bit beyond isn't going to
      overflow).
      
      [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
      Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
      Analyzed-by: Robert Święcki <robert@swiecki.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c78193e9
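The range check argued for in commit c78193e9 can be illustrated with a small bitmap scan. This is a userspace sketch under stated assumptions, not the kernel's next_pidmap(): `MODEL_PID_MAX_LIMIT` is a tiny hypothetical stand-in for PID_MAX_LIMIT, and the flat byte-array bitmap replaces the kernel's per-page pidmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Tiny stand-in for PID_MAX_LIMIT so the bitmap fits in a few bytes. */
#define MODEL_PID_MAX_LIMIT 64u

/*
 * Return the next in-use pid strictly after `last`, or -1 if none.
 * `map` must hold at least MODEL_PID_MAX_LIMIT bits.
 */
static int next_used_pid(const unsigned char *map, unsigned int last)
{
    unsigned int pid;

    assert(map != NULL);

    /* Reject out-of-range input before it is used to index the map. */
    if (last >= MODEL_PID_MAX_LIMIT)
        return -1;

    for (pid = last + 1; pid < MODEL_PID_MAX_LIMIT; pid++)
        if (map[pid / CHAR_BIT] & (1u << (pid % CHAR_BIT)))
            return (int)pid;
    return -1;
}
```

As in the commit message, computing `last + 1` after the check is harmless, because the loop condition still bounds every index that reaches the array.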
  28. 18 Mar, 2011 1 commit
  29. 20 Aug, 2010 2 commits
    • T
      Add RCU check for find_task_by_vpid(). · 4221a991
      Tetsuo Handa committed
      find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to
      commit 3120438a "rcu: Disable lockdep checking in RCU list-traversal primitives",
      we are currently unable to catch "find_task_by_vpid() with tasklist_lock held
      but RCU lock not held" errors due to the RCU-lockdep checks being
      suppressed in the RCU variants of the struct list_head traversals.
      This commit therefore places an explicit check for being in an RCU
      read-side critical section in find_task_by_pid_ns().
      
        ===================================================
        [ INFO: suspicious rcu_dereference_check() usage. ]
        ---------------------------------------------------
        kernel/pid.c:386 invoked rcu_dereference_check() without protection!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 1
        1 lock held by rc.sysinit/1102:
         #0:  (tasklist_lock){.+.+..}, at: [<c1048340>] sys_setpgid+0x40/0x160
      
        stack backtrace:
        Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
        Call Trace:
         [<c105e714>] lockdep_rcu_dereference+0x94/0xb0
         [<c104b4cd>] find_task_by_pid_ns+0x6d/0x70
         [<c104b4e8>] find_task_by_vpid+0x18/0x20
         [<c1048347>] sys_setpgid+0x47/0x160
         [<c1002b50>] sysenter_do_call+0x12/0x36
      
      Commit updated to use a new rcu_lockdep_assert() exported API rather than
      the old internal __do_rcu_dereference().
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      4221a991
    • A
      rculist: avoid __rcu annotations · 67bdbffd
      Arnd Bergmann committed
      This avoids warnings from missing __rcu annotations
      in the rculist implementation, making it possible to
      use the same lists in both RCU and non-RCU cases.
      
      We can add rculist annotations later, together with
      lockdep support for rculist, which is missing as well,
      but that may involve changing all the users.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      67bdbffd
  30. 11 Aug, 2010 2 commits
    • O
      pids: alloc_pidmap: remove the unnecessary boundary checks · c52b0b91
      Oleg Nesterov committed
      alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
      inspect the first map->page twice.  This is correct, we want to find the
      unused bits < offset in this bitmap block.  Add the comment.
      
      But it doesn't make any sense to stop the find_next_offset() loop when we
      are looking into this map->page for the second time.  We have
      already checked the bits >= offset during the first attempt, it is fine to
      do this again, no matter if we succeed this time or not.
      
      Remove this hard-to-understand code.  It optimizes the very unlikely case
      when we are going to fail, but slows down the more likely case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c52b0b91
    • S
      pids: fix a race in pid generation that causes pids to be reused immediately · 5fdee8c4
      Salman committed
      A program that repeatedly forks and waits is susceptible to having the
      same pid repeated, especially when it competes with another instance of
      the same program.  This is really bad for the bash implementation.
      Furthermore, many shell scripts assume that pid numbers will not be used
      for some length of time.
      
      Race Description:
      
      A                                    B
      
      // pid == offset == n                // pid == offset == n + 1
      test_and_set_bit(offset, map->page)
                                           test_and_set_bit(offset, map->page);
                                           pid_ns->last_pid = pid;
      pid_ns->last_pid = pid;
                                           // pid == n + 1 is freed (wait())
      
                                           // Next fork()...
                                           last = pid_ns->last_pid; // == n
                                           pid = last + 1;
      
      Code to reproduce it (Running multiple instances is more effective):
      
      #include <errno.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      
      // The distance mod 32768 between two pids, where the first pid is expected
      // to be smaller than the second.
      int PidDistance(pid_t first, pid_t second) {
        return (second + 32768 - first) % 32768;
      }
      
      int main(int argc, char* argv[]) {
        int failed = 0;
        pid_t last_pid = 0;
        int i;
        printf("%zu\n", sizeof(pid_t));
        for (i = 0; i < 10000000; ++i) {
          if (i % 32768 == 0)
            printf("Iter: %d\n", i/32768);
          int child_exit_code = i % 256;
          pid_t pid = fork();
          if (pid == -1) {
            fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
            exit(1);
          }
          if (pid == 0) {
            // Child
            exit(child_exit_code);
          } else {
            // Parent
            if (i > 0) {
              int distance = PidDistance(last_pid, pid);
              if (distance == 0 || distance > 30000) {
                fprintf(stderr,
                        "Unexpected pid sequence: previous fork: pid=%d, "
                        "current fork: pid=%d for iteration=%d.\n",
                        last_pid, pid, i);
                failed = 1;
              }
            }
            last_pid = pid;
            int status;
            int reaped = wait(&status);
            if (reaped != pid) {
              fprintf(stderr,
                      "Wait return value: expected pid=%d, "
                      "got %d, iteration %d\n",
                      pid, reaped, i);
              failed = 1;
            } else if (WEXITSTATUS(status) != child_exit_code) {
              fprintf(stderr,
                      "Unexpected exit status %x, iteration %d\n",
                      WEXITSTATUS(status), i);
              failed = 1;
            }
          }
        }
        exit(failed);
      }
      
      Thanks to Ted Tso for the key ideas of this implementation.
      Signed-off-by: Salman Qazi <sqazi@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fdee8c4
  31. 28 May, 2010 1 commit