1. 21 7月, 2018 3 次提交
    • E
      pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman 提交于
      Everywhere except in the pid array we distinguish between a tasks pid and
      a tasks tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to it's fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      6883f81a
    • E
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman 提交于
      To access these fields the code always has to go to group leader so
      going to signal struct is no loss and is actually a fundamental simplification.
      
      This saves a little bit of memory by only allocating the pid pointer array
      once instead of once for every thread, and even better this removes a
      few potential races caused by the fact that group_leader can be changed
      by de_thread, while signal_struct can not.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      2c470475
    • E
      pids: Compute task_tgid using signal->leader_pid · 7a36094d
      Eric W. Biederman 提交于
      The cost is the the same and this removes the need
      to worry about complications that come from de_thread
      and group_leader changing.
      
      __task_pid_nr_ns has been updated to take advantage of this change.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7a36094d
  2. 12 4月, 2018 1 次提交
  3. 07 2月, 2018 1 次提交
  4. 17 1月, 2018 1 次提交
  5. 24 12月, 2017 1 次提交
    • E
      pid: Handle failure to allocate the first pid in a pid namespace · c0ee5549
      Eric W. Biederman 提交于
      With the replacement of the pid bitmap and hashtable with an idr in
      alloc_pid started occassionally failing when allocating the first pid
      in a pid namespace.  Things were not completely reset resulting in
      the first allocated pid getting the number 2 (not 1).  Which
      further resulted in ns->proc_mnt not getting set and eventually
      causing an oops in proc_flush_task.
      
      Oops: 0000 [#1] SMP
      CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
      RIP: 0010:proc_flush_task+0x8e/0x1b0
      RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
      RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
      RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
      RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
      R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
      R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
      FS:  00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
      DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Call Trace:
       ? release_task+0xaf/0x680
       release_task+0xd2/0x680
       ? wait_consider_task+0xb82/0xce0
       wait_consider_task+0xbe9/0xce0
       ? do_wait+0xe1/0x330
       do_wait+0x151/0x330
       kernel_wait4+0x8d/0x150
       ? task_stopped_code+0x50/0x50
       SYSC_wait4+0x95/0xa0
       ? rcu_read_lock_sched_held+0x6c/0x80
       ? syscall_trace_enter+0x2d7/0x340
       ? do_syscall_64+0x60/0x210
       do_syscall_64+0x60/0x210
       entry_SYSCALL64_slow_path+0x25/0x25
      RIP: 0033:0x7f7cb82603aa
      RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
       ORIG_RAX: 000000000000003d
      RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
      RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
      RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
      Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 <49> 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
      RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
      CR2: 0000000000000000
      ---[ end trace 53d67a6481059862 ]---
      
      Improve the quality of the implementation by resetting the place to
      start allocating pids on failure to allocate the first pid.
      
      As improving the quality of the implementation is the goal remove the now
      unnecesarry disable_pid_allocations call when we fail to mount proc.
      
      Fixes: 95846ecf ("pid: replace pid bitmap implementation with IDR API")
      Fixes: 8ef047aa ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c0ee5549
  6. 18 11月, 2017 2 次提交
  7. 22 8月, 2017 1 次提交
    • O
      pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Oleg Nesterov 提交于
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(), rcu_read_lock() can not help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: NTroy Kensinger <tkensinger@google.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
  8. 03 8月, 2017 1 次提交
  9. 07 7月, 2017 1 次提交
    • P
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin 提交于
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d375d78
  10. 09 5月, 2017 1 次提交
    • K
      pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid() · 8896c23d
      Kirill Tkhai 提交于
      alloc_pidmap() advances pid_namespace::last_pid.  When first pid
      allocation fails, then next created process will have pid 2 and
      pid_ns_prepare_proc() won't be called.  So, pid_namespace::proc_mnt will
      never be initialized (not to mention that there won't be a child
      reaper).
      
      I saw crash stack of such case on kernel 3.10:
      
          BUG: unable to handle kernel NULL pointer dereference at (null)
          IP: proc_flush_task+0x8f/0x1b0
          Call Trace:
              release_task+0x3f/0x490
              wait_consider_task.part.10+0x7ff/0xb00
              do_wait+0x11f/0x280
              SyS_wait4+0x7d/0x110
      
      We may fix this by restore of last_pid in 0 or by prohibiting of futher
      allocations.  Since there was a similar issue in Oleg Nesterov's commit
      314a8ad0 ("pidns: fix free_pid() to handle the first fork failure").
      and it was fixed via prohibiting allocation, let's follow this way, and
      do the same.
      
      Link: http://lkml.kernel.org/r/149201021004.4863.6762095011554287922.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8896c23d
  11. 02 3月, 2017 1 次提交
  12. 14 1月, 2017 1 次提交
    • P
      locking/atomic, kref: Add KREF_INIT() · 1e24edca
      Peter Zijlstra 提交于
      Since we need to change the implementation, stop exposing internals.
      
      Provide KREF_INIT() to allow static initialization of struct kref.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1e24edca
  13. 28 5月, 2016 1 次提交
    • A
      remove lots of IS_ERR_VALUE abuses · 287980e4
      Arnd Bergmann 提交于
      Most users of IS_ERR_VALUE() in the kernel are wrong, as they
      pass an 'int' into a function that takes an 'unsigned long'
      argument. This happens to work because the type is sign-extended
      on 64-bit architectures before it gets converted into an
      unsigned type.
      
      However, anything that passes an 'unsigned short' or 'unsigned int'
      argument into IS_ERR_VALUE() is guaranteed to be broken, as are
      8-bit integers and types that are wider than 'unsigned long'.
      
      Andrzej Hajda has already fixed a lot of the worst abusers that
      were causing actual bugs, but it would be nice to prevent any
      users that are not passing 'unsigned long' arguments.
      
      This patch changes all users of IS_ERR_VALUE() that I could find
      on 32-bit ARM randconfig builds and x86 allmodconfig. For the
      moment, this doesn't change the definition of IS_ERR_VALUE()
      because there are probably still architecture specific users
      elsewhere.
      
      Almost all the warnings I got are for files that are better off
      using 'if (err)' or 'if (err < 0)'.
      The only legitimate user I could find that we get a warning for
      is the (32-bit only) freescale fman driver, so I did not remove
      the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
      For 9pfs, I just worked around one user whose calling conventions
      are so obscure that I did not dare change the behavior.
      
      I was using this definition for testing:
      
       #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
             unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
      
      which ends up making all 16-bit or wider types work correctly with
      the most plausible interpretation of what IS_ERR_VALUE() was supposed
      to return according to its users, but also causes a compile-time
      warning for any users that do not pass an 'unsigned long' argument.
      
      I suggested this approach earlier this year, but back then we ended
      up deciding to just fix the users that are obviously broken. After
      the initial warning that caused me to get involved in the discussion
      (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
      asked me to send the whole thing again.
      
      [ Updated the 9p parts as per Al Viro  - Linus ]
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.org/lkml/2016/1/7/363
      Link: https://lkml.org/lkml/2016/5/27/486
      Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      287980e4
  14. 30 1月, 2016 1 次提交
  15. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  16. 25 11月, 2015 1 次提交
  17. 23 7月, 2015 1 次提交
  18. 17 4月, 2015 1 次提交
    • M
      fork: report pid reservation failure properly · 35f71bc0
      Michal Hocko 提交于
      copy_process will report any failure in alloc_pid as ENOMEM currently
      which is misleading because the pid allocation might fail not only when
      the memory is short but also when the pid space is consumed already.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect wrt.  documentation.  POSIX man
      page also suggest returing EAGAIN when the process count limit is reached.
      
      This patch simply propagates error code from alloc_pid and makes sure we
      return -EAGAIN due to reservation failure.  This will make behavior of
      fork closer to both our documentation and POSIX.
      
      alloc_pid might alsoo fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes) and there is no
      good error code which would match documented ones. We have traditionally
      returned ENOMEM for this case which is misleading as well but as per
      Eric W. Biederman this behavior is documented in man pid_namespaces(7)
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new processes in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky so let's stick to
      ENOMEM for this case.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35f71bc0
  19. 11 12月, 2014 1 次提交
  20. 05 12月, 2014 2 次提交
  21. 01 10月, 2013 1 次提交
  22. 31 8月, 2013 1 次提交
    • E
      pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup · a6064885
      Eric W. Biederman 提交于
      Serge Hallyn <serge.hallyn@ubuntu.com> writes:
      
      > Since commit af4b8a83 it's been
      > possible to get into a situation where a pidns reaper is
      > <defunct>, reparented to host pid 1, but never reaped.  How to
      > reproduce this is documented at
      >
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
      > (and see
      > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
      > In short, run repeated starts of a container whose init is
      >
      > Process.exit(0);
      >
      > sysrq-t when such a task is playing zombie shows:
      >
      > [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
      > [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
      > [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
      > [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
      > [  131.132978] Call Trace:
      > [  131.132978]  [<ffffffff816f6159>] schedule+0x29/0x70
      > [  131.132978]  [<ffffffff81064591>] do_exit+0x6e1/0xa40
      > [  131.132978]  [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30
      > [  131.132978]  [<ffffffff8106496f>] do_group_exit+0x3f/0xa0
      > [  131.132978]  [<ffffffff810649e4>] SyS_exit_group+0x14/0x20
      > [  131.132978]  [<ffffffff8170102f>] tracesys+0xe1/0xe6
      >
      > Further debugging showed that every time this happened, zap_pid_ns_processes()
      > started with nr_hashed being 3, while we were expecting it to drop to 2.
      > Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
      > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
      > if nr_hashed hits 1.
      
      The issue is that when the task group leader of an init process exits
      before other tasks of the init process when the init process finally
      exits it will be a secondary task sleeping in zap_pid_ns_processes and
      waiting to wake up when the number of hashed pids drops to two.  This
      case waits forever as free_pid only sends a wake up when the number of
      hashed pids drops to 1.
      
      To correct this the simple strategy of sending a possibly unncessary
      wake up when the number of hashed pids drops to 2 is adopted.
      
      Sending one extraneous wake up is relatively harmless, at worst we
      waste a little cpu time in the rare case when a pid namespace
      appropaches exiting.
      
      We can detect the case when the pid namespace drops to just two pids
      hashed race free in free_pid.
      
      Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
      without out the tasklist_lock because it is guaranteed that the
      detach_pid will be called on the child_reaper before it is freed and
      detach_pid calls __change_pid which calls free_pid which takes the
      pidmap_lock.  __change_pid only calls free_pid if this is the
      last use of the pid.  For a thread that is not the thread group leader
      the threads pid will only ever have one user because a threads pid
      is not allowed to be the pid of a process, of a process group or
      a session.  For a thread that is a thread group leader all of
      the other threads of that process will be reaped before it is allowed
      for the thread group leader to be reaped ensuring there will only
      be one user of the threads pid as a process pid.  Furthermore
      because the thread is the init process of a pid namespace all of the
      other processes in the pid namespace will have also been already freed
      leading to the fact that the pid will not be used as a session pid or
      a process group pid for any other running process.
      
      CC: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Tested-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Reported-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a6064885
  23. 04 7月, 2013 2 次提交
  24. 02 5月, 2013 1 次提交
  25. 01 5月, 2013 2 次提交
  26. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  27. 13 2月, 2013 1 次提交
  28. 26 12月, 2012 1 次提交
    • E
      pidns: Stop pid allocation when init dies · c876ad76
      Eric W. Biederman 提交于
      Oleg pointed out that in a pid namespace the sequence.
      - pid 1 becomes a zombie
      - setns(thepidns), fork,...
      - reaping pid 1.
      - The injected processes exiting.
      
      Can lead to processes attempting access their child reaper and
      instead following a stale pointer.
      
      That waitpid for init can return before all of the processes in
      the pid namespace have exited is also unfortunate.
      
      Avoid these problems by disabling the allocation of new pids in a pid
      namespace when init dies, instead of when the last process in a pid
      namespace is reaped.
      Pointed-out-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c876ad76
  29. 18 12月, 2012 1 次提交
  30. 06 12月, 2012 1 次提交
  31. 20 11月, 2012 1 次提交
    • E
      proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman 提交于
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowd by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      98f842e6
  32. 19 11月, 2012 3 次提交
    • E
      pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 · af4b8a83
      Eric W. Biederman 提交于
      Looking at pid_ns->nr_hashed is a bit simpler and it works for
      disjoint process trees that an unshare or a join of a pid_namespace
      may create.
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      af4b8a83
    • E
      pidns: Don't allow new processes in a dead pid namespace. · 5e1182de
      Eric W. Biederman 提交于
      Set nr_hashed to -1 just before we schedule the work to cleanup proc.
      Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
      fail.
      
      This guaranteees that processes never enter a pid namespaces after we
      have cleaned up the state to support processes in a pid namespace.
      
      Currently sending SIGKILL to all of the process in a pid namespace as
      init exists gives us this guarantee but we need something a little
      stronger to support unsharing and joining a pid namespace.
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      5e1182de
    • E
      pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman 提交于
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path detach pid is alwasy called with the
      rtnl_lock held free_pid is not allowed to sleep, so the work to
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      0a01f2cc