1. 18 Nov, 2017 1 commit
    • pid: replace pid bitmap implementation with IDR API · 95846ecf
      Committed by Gargi Sharma
      Patch series "Replacing PID bitmap implementation with IDR API", v4.
      
      This series replaces kernel bitmap implementation of PID allocation with
      IDR API.  These patches are written to simplify the kernel by replacing
      custom code with calls to generic code.
      
      The following are the stats for pid and pid_namespace object files
      before and after the replacement.  There is a noteworthy change between
      the IDR and bitmap implementation.
      
      Before
         text       data        bss        dec        hex    filename
         8447       3894         64      12405       3075    kernel/pid.o
      After
         text       data        bss        dec        hex    filename
         3397        304          0       3701        e75    kernel/pid.o
      
      Before
         text       data        bss        dec        hex    filename
         5692       1842        192       7726       1e2e    kernel/pid_namespace.o
      After
         text       data        bss        dec        hex    filename
         2854        216         16       3086        c0e    kernel/pid_namespace.o
      
      The following are the stats for ps, pstree and calling readdir on /proc
      for 10,000 processes.
      
      ps:
              With IDR API    With bitmap
      real    0m1.479s        0m2.319s
      user    0m0.070s        0m0.060s
      sys     0m0.289s        0m0.516s
      
      pstree:
              With IDR API    With bitmap
      real    0m1.024s        0m1.794s
      user    0m0.348s        0m0.612s
      sys     0m0.184s        0m0.264s
      
      proc:
              With IDR API    With bitmap
      real    0m0.059s        0m0.074s
      user    0m0.000s        0m0.004s
      sys     0m0.016s        0m0.016s
      
      This patch (of 2):
      
      Replace the current bitmap implementation for Process ID allocation.
      Functions that are no longer required, for example, free_pidmap(),
      alloc_pidmap(), etc.  are removed.  The rest of the functions are
      modified to use the IDR API.  The change was made to make the PID
      allocation less complex by replacing custom code with calls to generic
      API.
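As a rough illustration of what the generic API provides, the following user-space C sketch models cyclic ID allocation over a fixed range, similar in spirit to the kernel's idr_alloc_cyclic(). The toy_idr type, the helper names, and the tiny TOY_MAX_ID limit are hypothetical; a fixed array stands in for the kernel's tree structure, so this is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ID 8            /* hypothetical, tiny "pid_max" for the demo */

/* Hypothetical stand-in for struct idr: one pointer slot per possible ID. */
struct toy_idr {
    void *slot[TOY_MAX_ID];
    int next;                   /* where the next cyclic search starts */
};

/*
 * Allocate the first free ID in [start, end), searching from idr->next and
 * wrapping around once. Returns the ID, or -1 if the range is full.
 */
static int toy_idr_alloc_cyclic(struct toy_idr *idr, void *ptr, int start, int end)
{
    assert(ptr != NULL && start >= 0 && start < end && end <= TOY_MAX_ID);

    int first = idr->next;
    if (first < start || first >= end)
        first = start;

    int span = end - start;
    for (int i = 0; i < span; i++) {
        int id = start + (first - start + i) % span;
        if (idr->slot[id] == NULL) {
            idr->slot[id] = ptr;
            idr->next = id + 1;
            return id;
        }
    }
    return -1;
}

/* Look up the pointer stored under id, or NULL if the ID is unused. */
static void *toy_idr_find(const struct toy_idr *idr, int id)
{
    return (id >= 0 && id < TOY_MAX_ID) ? idr->slot[id] : NULL;
}

/* Release an ID so a later allocation can reuse it. */
static void toy_idr_remove(struct toy_idr *idr, int id)
{
    if (id >= 0 && id < TOY_MAX_ID)
        idr->slot[id] = NULL;
}
```

Because the search starts after the most recently allocated ID, a freed ID is not reused immediately, which matches the usual expectation that PIDs increase until they wrap.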
      
      [gs051095@gmail.com: v6]
        Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
      [avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
        Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
      Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
      Signed-off-by: Gargi Sharma <gs051095@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 Aug, 2017 1 commit
    • pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Committed by Oleg Nesterov
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(), and rcu_read_lock() cannot help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: Troy Kensinger <tkensinger@google.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 03 Aug, 2017 1 commit
  4. 07 Jul, 2017 1 commit
    • mm: update callers to use HASH_ZERO flag · 3d375d78
      Committed by Pavel Tatashin
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  Where
      HASH_EARLY was used, such as in __pv_init_lock_hash, a zeroed hash
      table was already assumed, because memblock zeroes the memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 May, 2017 1 commit
    • pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid() · 8896c23d
      Committed by Kirill Tkhai
      alloc_pidmap() advances pid_namespace::last_pid.  When the first pid
      allocation fails, the next created process will have pid 2 and
      pid_ns_prepare_proc() won't be called.  So pid_namespace::proc_mnt will
      never be initialized (not to mention that there won't be a child
      reaper).
      
      I saw a crash stack for this case on kernel 3.10:
      
          BUG: unable to handle kernel NULL pointer dereference at (null)
          IP: proc_flush_task+0x8f/0x1b0
          Call Trace:
              release_task+0x3f/0x490
              wait_consider_task.part.10+0x7ff/0xb00
              do_wait+0x11f/0x280
              SyS_wait4+0x7d/0x110
      
      We could fix this either by restoring last_pid to 0 or by prohibiting
      further allocations.  A similar issue was fixed in Oleg Nesterov's commit
      314a8ad0 ("pidns: fix free_pid() to handle the first fork failure")
      by prohibiting allocation, so let's follow that approach and do the
      same.
      
      Link: http://lkml.kernel.org/r/149201021004.4863.6762095011554287922.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 02 Mar, 2017 1 commit
  7. 14 Jan, 2017 1 commit
    • locking/atomic, kref: Add KREF_INIT() · 1e24edca
      Committed by Peter Zijlstra
      Since we need to change the implementation, stop exposing internals.
      
      Provide KREF_INIT() to allow static initialization of struct kref.
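To illustrate why a static initializer macro helps hide internals, here is a minimal user-space sketch using C11 atomics. The toy_kref type, the TOY_KREF_INIT macro, and the toy_ns object are hypothetical stand-ins and do not reproduce the kernel's definitions; the point is only that callers initialize the counter without naming its internal field or type.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of struct kref; the kernel type differs. */
struct toy_kref {
    atomic_int refcount;
};

/* Static initializer: callers no longer spell out the internal field. */
#define TOY_KREF_INIT(n) { .refcount = (n) }

static void toy_kref_get(struct toy_kref *k)
{
    atomic_fetch_add(&k->refcount, 1);
}

/* Drop one reference; returns 1 when the last reference was released. */
static int toy_kref_put(struct toy_kref *k)
{
    int old = atomic_fetch_sub(&k->refcount, 1);
    assert(old > 0);
    return old == 1;
}

static int toy_kref_read(struct toy_kref *k)
{
    return atomic_load(&k->refcount);
}

/* A statically allocated object initialized without touching internals. */
struct toy_ns {
    struct toy_kref kref;
    int level;
};

static struct toy_ns toy_init_ns = {
    .kref = TOY_KREF_INIT(2),
    .level = 0,
};
```

If the counter's representation changes later, only the macro and the helpers need updating; objects such as toy_init_ns stay untouched.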
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 28 May, 2016 1 commit
    • remove lots of IS_ERR_VALUE abuses · 287980e4
      Committed by Arnd Bergmann
      Most users of IS_ERR_VALUE() in the kernel are wrong, as they
      pass an 'int' into a function that takes an 'unsigned long'
      argument. This happens to work because the type is sign-extended
      on 64-bit architectures before it gets converted into an
      unsigned type.
      
      However, anything that passes an 'unsigned short' or 'unsigned int'
      argument into IS_ERR_VALUE() is guaranteed to be broken, as are
      8-bit integers and types that are wider than 'unsigned long'.
      
      Andrzej Hajda has already fixed a lot of the worst abusers that
      were causing actual bugs, but it would be nice to prevent any
      users that are not passing 'unsigned long' arguments.
      
      This patch changes all users of IS_ERR_VALUE() that I could find
      on 32-bit ARM randconfig builds and x86 allmodconfig. For the
      moment, this doesn't change the definition of IS_ERR_VALUE()
      because there are probably still architecture specific users
      elsewhere.
      
      Almost all the warnings I got are for files that are better off
      using 'if (err)' or 'if (err < 0)'.
      The only legitimate user I could find that we get a warning for
      is the (32-bit only) freescale fman driver, so I did not remove
      the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
      For 9pfs, I just worked around one user whose calling conventions
      are so obscure that I did not dare change the behavior.
      
      I was using this definition for testing:
      
       #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
             unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
      
      which ends up making all 16-bit or wider types work correctly with
      the most plausible interpretation of what IS_ERR_VALUE() was supposed
      to return according to its users, but also causes a compile-time
      warning for any users that do not pass an 'unsigned long' argument.
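The sign-extension problem described above can be reproduced in user space. The sketch below uses a simplified model of the classic check (an explicit cast to 'unsigned long', without unlikely()); the macro body and helper names are assumptions made for illustration, not the kernel's exact definition.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Simplified model of the classic definition, without unlikely(). */
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* An 'int' error code is sign-extended first, so the check happens to work. */
static bool int_is_error(int err)
{
    return IS_ERR_VALUE(err);
}

/*
 * An 'unsigned int' is zero-extended. Where 'unsigned long' is wider than
 * 'unsigned int' (e.g. LP64), the value never reaches the error range.
 */
static bool uint_is_error(unsigned int err)
{
    return IS_ERR_VALUE(err);
}

/* True when 'unsigned long' is wider than 'unsigned int' on this target. */
static bool long_is_wider_than_int(void)
{
    return sizeof(unsigned long) > sizeof(unsigned int);
}
```

On a target where 'unsigned long' is 64 bits, uint_is_error((unsigned int)-22) is false even though the same bit pattern held in an 'int' is reported as an error, which is the silent breakage the message warns about.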
      
      I suggested this approach earlier this year, but back then we ended
      up deciding to just fix the users that are obviously broken. After
      the initial warning that caused me to get involved in the discussion
      (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
      asked me to send the whole thing again.
      
      [ Updated the 9p parts as per Al Viro  - Linus ]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.org/lkml/2016/1/7/363
      Link: https://lkml.org/lkml/2016/5/27/486
      Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 30 Jan, 2016 1 commit
  10. 15 Jan, 2016 1 commit
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Committed by Vladimir Davydov
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 25 Nov, 2015 1 commit
  12. 23 Jul, 2015 1 commit
  13. 17 Apr, 2015 1 commit
    • fork: report pid reservation failure properly · 35f71bc0
      Committed by Michal Hocko
      copy_process currently reports any failure in alloc_pid as ENOMEM,
      which is misleading because pid allocation might fail not only when
      memory is short but also when the pid space is already consumed.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect with respect to the
      documentation.  The POSIX man page also suggests returning EAGAIN when
      the process count limit is reached.
      
      This patch simply propagates error code from alloc_pid and makes sure we
      return -EAGAIN due to reservation failure.  This will make behavior of
      fork closer to both our documentation and POSIX.
      
      alloc_pid might also fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes), and there is no
      good error code that would match the documented ones.  We have
      traditionally returned ENOMEM for this case, which is misleading as
      well, but as per Eric W. Biederman this behavior is documented in
      pid_namespaces(7):
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new processes in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky so let's stick to
      ENOMEM for this case.
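The error mapping described in this message can be summarized with a small user-space model. The enum, its values, and both function names are hypothetical; only the errno choices (EAGAIN when the pid space is exhausted, ENOMEM for memory shortage and for a dead namespace reaper) come from the text above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reasons a pid reservation can fail in this toy model. */
enum toy_pid_failure {
    TOY_PID_OK,
    TOY_PID_SPACE_EXHAUSTED,   /* no free pid number left */
    TOY_PID_OUT_OF_MEMORY,     /* allocation of the pid object failed */
    TOY_PID_NS_REAPER_DEAD,    /* the namespace's init has terminated */
};

/* Map each failure to the negative errno the commit message describes. */
static int toy_alloc_pid_errno(enum toy_pid_failure why)
{
    switch (why) {
    case TOY_PID_OK:
        return 0;
    case TOY_PID_SPACE_EXHAUSTED:
        return -EAGAIN;
    case TOY_PID_OUT_OF_MEMORY:
    case TOY_PID_NS_REAPER_DEAD:
        return -ENOMEM;
    }
    return -ENOMEM;
}

/* The caller propagates the code instead of collapsing it to -ENOMEM. */
static int toy_copy_process(enum toy_pid_failure why)
{
    int err = toy_alloc_pid_errno(why);
    if (err < 0)
        return err;
    return 0;
}
```

A user-space caller of fork(2) can then treat EAGAIN as a limit that may clear later, while ENOMEM keeps its documented meanings.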
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 11 Dec, 2014 1 commit
  15. 05 Dec, 2014 2 commits
  16. 01 Oct, 2013 1 commit
  17. 31 Aug, 2013 1 commit
    • pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup · a6064885
      Committed by Eric W. Biederman
      Serge Hallyn <serge.hallyn@ubuntu.com> writes:
      
      > Since commit af4b8a83 it's been
      > possible to get into a situation where a pidns reaper is
      > <defunct>, reparented to host pid 1, but never reaped.  How to
      > reproduce this is documented at
      >
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
      > (and see
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
      > In short, run repeated starts of a container whose init is
      >
      > Process.exit(0);
      >
      > sysrq-t when such a task is playing zombie shows:
      >
      > [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
      > [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
      > [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
      > [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
      > [  131.132978] Call Trace:
      > [  131.132978]  [<ffffffff816f6159>] schedule+0x29/0x70
      > [  131.132978]  [<ffffffff81064591>] do_exit+0x6e1/0xa40
      > [  131.132978]  [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30
      > [  131.132978]  [<ffffffff8106496f>] do_group_exit+0x3f/0xa0
      > [  131.132978]  [<ffffffff810649e4>] SyS_exit_group+0x14/0x20
      > [  131.132978]  [<ffffffff8170102f>] tracesys+0xe1/0xe6
      >
      > Further debugging showed that every time this happened, zap_pid_ns_processes()
      > started with nr_hashed being 3, while we were expecting it to drop to 2.
      > Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
      > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
      > if nr_hashed hits 1.
      
      The issue arises when the task group leader of an init process exits
      before the other tasks of the init process.  When the init process
      finally exits, the exiting task is a secondary task that sleeps in
      zap_pid_ns_processes, waiting to be woken when the number of hashed
      pids drops to two.  It waits forever, because free_pid only sends a
      wakeup when the number of hashed pids drops to 1.
      
      To correct this, the simple strategy of sending a possibly unnecessary
      wakeup when the number of hashed pids drops to 2 is adopted.
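The adopted strategy can be modelled in user space as a counter with two wakeup thresholds. The struct, field names, and toy_free_pid() below are hypothetical; the sketch only mirrors the rule stated above (wake the reaper at 2 as well as at 1) plus cleanup at 0, and is not the kernel's free_pid().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the bookkeeping in a pid namespace. */
struct toy_pid_ns {
    int nr_hashed;          /* number of pids currently hashed */
    int reaper_wakeups;     /* how often the reaper was woken */
    bool cleanup_scheduled; /* set when the last pid goes away */
};

/*
 * Model of the accounting step when a pid is freed: wake the reaper when the
 * count drops to 2 or 1 (the wakeup at 2 may be unnecessary), and schedule
 * cleanup when it reaches 0.
 */
static void toy_free_pid(struct toy_pid_ns *ns)
{
    assert(ns->nr_hashed > 0);

    switch (--ns->nr_hashed) {
    case 2:
    case 1:
        ns->reaper_wakeups++;
        break;
    case 0:
        ns->cleanup_scheduled = true;
        break;
    default:
        break;
    }
}
```

A reaper waiting for the count to reach 2 is now woken at that point; a reaper waiting for 1 simply rechecks its condition and goes back to sleep.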
      
      Sending one extraneous wakeup is relatively harmless; at worst we
      waste a little cpu time in the rare case when a pid namespace
      approaches exiting.
      
      In free_pid we can detect, race free, the case when the pid namespace
      drops to just two hashed pids.
      
      Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
      without the tasklist_lock because it is guaranteed that detach_pid
      will be called on the child_reaper before it is freed, and
      detach_pid calls __change_pid, which calls free_pid, which takes the
      pidmap_lock.  __change_pid only calls free_pid if this is the
      last use of the pid.  For a thread that is not the thread group leader,
      the thread's pid will only ever have one user because a thread's pid
      is not allowed to be the pid of a process, of a process group, or of
      a session.  For a thread that is a thread group leader, all of
      the other threads of that process will be reaped before the
      thread group leader is allowed to be reaped, ensuring there will only
      be one user of the thread's pid as a process pid.  Furthermore,
      because the thread is the init process of a pid namespace, all of the
      other processes in the pid namespace will already have been freed,
      so the pid will not be used as a session pid or
      a process group pid for any other running process.
      
      CC: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  18. 04 Jul, 2013 2 commits
  19. 02 May, 2013 1 commit
  20. 01 May, 2013 2 commits
  21. 28 Feb, 2013 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones.  The list iterator looks like this:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
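A user-space sketch shows why the extra node cursor is unnecessary: the entry pointer alone can carry the iteration state. The toy_ types and macros below are hypothetical simplifications (single-linked nodes only, and the GCC/Clang __typeof__ extension is assumed); they are not the definitions from linux/list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical list types; the kernel's nodes carry more state. */
struct toy_hlist_node { struct toy_hlist_node *next; };
struct toy_hlist_head { struct toy_hlist_node *first; };

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe conversion from a node pointer to its containing entry. */
#define toy_entry_or_null(ptr, type, member) \
    ((ptr) ? toy_container_of(ptr, type, member) : NULL)

/* Three-argument form: 'pos' is the entry cursor; no node cursor is needed. */
#define toy_hlist_for_each_entry(pos, head, member)                          \
    for (pos = toy_entry_or_null((head)->first, __typeof__(*(pos)), member); \
         pos;                                                                \
         pos = toy_entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct toy_item {
    int value;
    struct toy_hlist_node link;
};

/* Push an entry at the front of the list. */
static void toy_hlist_add_head(struct toy_hlist_node *n, struct toy_hlist_head *h)
{
    n->next = h->first;
    h->first = n;
}

/* Sum all values using the node-free iterator. */
static int toy_sum(struct toy_hlist_head *head)
{
    struct toy_item *it;
    int sum = 0;

    toy_hlist_for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

The call site in toy_sum() now has the same shape as a list_for_each_entry() loop: one cursor, the head, and the member name.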
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 13 Feb, 2013 1 commit
  23. 26 Dec, 2012 1 commit
    • pidns: Stop pid allocation when init dies · c876ad76
      Committed by Eric W. Biederman
      Oleg pointed out that in a pid namespace the sequence
      - pid 1 becomes a zombie
      - setns(thepidns), fork,...
      - reaping pid 1.
      - The injected processes exiting.
      
      can lead to processes attempting to access their child reaper and
      instead following a stale pointer.
      
      That waitpid for init can return before all of the processes in
      the pid namespace have exited is also unfortunate.
      
      Avoid these problems by disabling the allocation of new pids in a pid
      namespace when init dies, instead of when the last process in a pid
      namespace is reaped.
      Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  24. 18 Dec, 2012 1 commit
  25. 06 Dec, 2012 1 commit
  26. 20 Nov, 2012 1 commit
    • proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Committed by Eric W. Biederman
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
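From user space, the test mentioned above is usually done by comparing the device and inode numbers of two namespace files. The helper names below are hypothetical and the example paths in the comment are illustrative; stat(2) and the st_dev/st_ino fields are standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Two stat results name the same object when device and inode both match. */
static bool toy_same_inode(const struct stat *a, const struct stat *b)
{
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/*
 * Compare two namespace files, e.g. "/proc/1234/ns/pid" and
 * "/proc/self/ns/pid". Returns 1 if same, 0 if different, -1 on error.
 */
static int toy_same_namespace(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return toy_same_inode(&a, &b) ? 1 : 0;
}
```

Comparing st_dev in addition to st_ino guards against equal inode numbers on different filesystems.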
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowed by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  27. 19 Nov, 2012 5 commits
    • pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 · af4b8a83
      Committed by Eric W. Biederman
      Looking at pid_ns->nr_hashed is a bit simpler and it works for
      disjoint process trees that an unshare or a join of a pid_namespace
      may create.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Don't allow new processes in a dead pid namespace. · 5e1182de
      Committed by Eric W. Biederman
      Set nr_hashed to -1 just before we schedule the work to cleanup proc.
      Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
      fail.
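The gate described above amounts to a sentinel value in the counter. The following user-space model uses hypothetical names, and the choice of -ENOMEM as the failure code is an assumption made for illustration; the message itself only says the attempt fails.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model; only the counter matters here. */
struct toy_gate_ns {
    int nr_hashed;   /* >= 0: live count, < 0: namespace is dead */
};

/* Mark the namespace dead just before its cleanup work is scheduled. */
static void toy_ns_mark_dead(struct toy_gate_ns *ns)
{
    ns->nr_hashed = -1;
}

/* Hash a new pid unless the namespace has already been torn down. */
static int toy_hash_new_pid(struct toy_gate_ns *ns)
{
    if (ns->nr_hashed < 0)
        return -ENOMEM;   /* assumed error code for this sketch */
    ns->nr_hashed++;
    return 0;
}
```

In the kernel the check and the increment would need to happen under the same lock as the teardown; the sketch leaves locking out.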
      
      This guarantees that processes never enter a pid namespace after we
      have cleaned up the state to support processes in a pid namespace.
      
      Currently sending SIGKILL to all of the processes in a pid namespace as
      init exits gives us this guarantee but we need something a little
      stronger to support unsharing and joining a pid namespace.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Committed by Eric W. Biederman
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep and the work of
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Committed by Eric W. Biederman
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.
      
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Capture the user namespace and filter ns_last_pid · 49f4d8b9
      Committed by Eric W. Biederman
      - Capture the user namespace that creates the pid namespace
      - Use that user namespace to test if it is ok to write to
        /proc/sys/kernel/ns_last_pid.
      
      Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
      when destroying a pid_ns.  I have folded his patch into this one
      so that bisects will work properly.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  28. 15 Aug, 2012 1 commit
    • net ip6 flowlabel: Make owner a union of struct pid * and kuid_t · 4f82f457
      Committed by Eric W. Biederman
      Correct a long standing omission and use struct pid in the owner
      field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
      This guarantees we don't have issues when pid wraparound occurs.
      
      Use a kuid_t in the owner field of struct ip6_flowlabel when the
      share type is IPV6_FL_S_USER to add user namespace support.
      
      In /proc/net/ip6_flowlabel capture the current pid namespace when
      opening the file and release the pid namespace when the file is
      closed, ensuring we print the pid owner value that is meaningful to
      the reader of the file.  Similarly use from_kuid_munged to print
      uid values that are meaningful to the reader of the file.
      
      This requires exporting pid_nr_ns so that ipv6 can continue to be built
      as a module.  Yoiks, what silliness.
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  29. 24 May, 2012 1 commit
  30. 14 Feb, 2012 1 commit
  31. 13 Jan, 2012 1 commit
  32. 31 Oct, 2011 1 commit
  33. 29 Sep, 2011 1 commit
    • rcu: Restore checks for blocking in RCU read-side critical sections · b3fbab05
      Committed by Paul E. McKenney
      Long ago, using TREE_RCU with PREEMPT would result in "scheduling
      while atomic" diagnostics if you blocked in an RCU read-side critical
      section.  However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
      this diagnostic.  This commit therefore adds a replacement diagnostic
      based on PROVE_RCU.
      
      Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
      used for things that have nothing to do with rcu_dereference(), rename
      lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
      argument that is a string indicating what is suspicious.  This third
      argument is passed in from a new third argument to rcu_lockdep_assert().
      Update all calls to rcu_lockdep_assert() to add an informative third
      argument.
      
      Also, add a pair of rcu_lockdep_assert() calls from within
      rcu_note_context_switch(), one complaining if a context switch occurs
      in an RCU-bh read-side critical section and another complaining if a
      context switch occurs in an RCU-sched read-side critical section.
      These are present only if the PROVE_RCU kernel parameter is enabled.
      
      Finally, fix some checkpatch whitespace complaints in lockdep.c.
      
      Again, you must enable PROVE_RCU to see these new diagnostics.  But you
      are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>