1. 04 Jul, 2015 1 commit
  2. 26 Jun, 2015 2 commits
    • fs, proc: introduce CONFIG_PROC_CHILDREN · 2e13ba54
      Committed by Iago López Galeiras
      Commit 81841161 ("fs, proc: introduce /proc/<pid>/task/<tid>/children
      entry") introduced the children entry for checkpoint restore and the
      file is only available on kernels configured with CONFIG_EXPERT and
      CONFIG_CHECKPOINT_RESTORE.
      
      This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
      because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
      But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
      
      However, the children proc file is useful outside of checkpoint restore.
      I would like to use it in rkt.  The rkt process exec() another program
      it does not control, and that other program will fork()+exec() a child
      process.  I would like to find the pid of the child process from an
      external tool without iterating in /proc over all processes to find
      which one has a parent pid equal to rkt.
      
      This commit introduces CONFIG_PROC_CHILDREN and makes
      CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
      /proc/<pid>/task/<tid>/children without needing to enable
      CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
      
      Alban tested that /proc/<pid>/task/<tid>/children is present when the
      kernel is configured with CONFIG_PROC_CHILDREN=y but without
      CONFIG_CHECKPOINT_RESTORE.
      
      Signed-off-by: Iago López Galeiras <iago@endocode.com>
      Tested-by: Alban Crequy <alban@endocode.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Djalal Harouni <djalal@endocode.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
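The use case described in this commit (finding a child's pid without scanning every entry in /proc) can be sketched in user space as follows. This is a minimal Python sketch, assuming a kernel built with CONFIG_PROC_CHILDREN=y and assuming the file holds a space-separated list of child pids; the helper names `parse_children` and `read_children` are hypothetical and are not part of rkt or the kernel.

```python
import os


def parse_children(text):
    """Parse the contents of a children file into a list of integer pids."""
    return [int(field) for field in text.split()]


def read_children(pid, tid=None):
    """Return the child pids listed for one thread of a process.

    Assumption: the kernel has CONFIG_PROC_CHILDREN=y. If the file cannot be
    opened (option disabled, process gone), an empty list is returned.
    """
    if tid is None:
        tid = pid  # the main thread's tid is equal to the pid
    path = "/proc/{}/task/{}/children".format(pid, tid)
    try:
        with open(path) as handle:
            return parse_children(handle.read())
    except OSError:
        return []


if __name__ == "__main__":
    # Children of the current process's main thread, if any.
    print(read_children(os.getpid()))
```

Returning an empty list when the file is missing is a simplification for the sketch; a real tool would probably want to distinguish "no children" from "kernel option not enabled".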
    • proc: fix PAGE_SIZE limit of /proc/$PID/cmdline · c2c0bb44
      Committed by Alexey Dobriyan
      /proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
      
      	$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
      
      However, command line size was never limited to PAGE_SIZE but to 128 KB,
      and relatively recently the limitation was removed altogether.
      
      People noticed and asked questions:
      http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
      
      The seq_file interface is not OK, because it kmallocs the whole output and
      open + read(, 1) + sleep will pin arbitrary amounts of kernel memory.  To
      avoid that, a limit must be imposed, which is incompatible with arbitrarily
      sized command lines.
      
      I apologize for the hairy code, but this is a direct consequence of the
      command line layout in memory and hacks to support things like "init [3]".
      
      The loops are "unrolled"; otherwise it is either macros which hide control
      flow or functions with 7-8 arguments with equal line count.
      
      There should be a real setproctitle(2) or something.
      
      [akpm@linux-foundation.org: fix a billion min() warnings]
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
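For readers who want to observe the behaviour from user space, here is a minimal Python sketch that reads /proc/$PID/cmdline, splits it into arguments, and reports whether the data is larger than one page. It assumes the usual layout of NUL-terminated arguments and uses `mmap.PAGESIZE` as a stand-in for the kernel's PAGE_SIZE; the helper names are hypothetical.

```python
import mmap


def split_cmdline(raw):
    """Split raw cmdline bytes into arguments (each argument ends with NUL)."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator of the last argument
    if not raw:
        return []
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid="self"):
    """Return the raw bytes of /proc/<pid>/cmdline."""
    with open("/proc/{}/cmdline".format(pid), "rb") as handle:
        return handle.read()


def exceeds_one_page(raw):
    """True if the data is longer than one page, which the old code never returned."""
    return len(raw) > mmap.PAGESIZE


if __name__ == "__main__":
    data = read_cmdline()
    print(len(split_cmdline(data)), "arguments;", "more than one page:", exceeds_one_page(data))
```

A process that has rewritten its own title (the "init [3]" case mentioned above) may not follow the plain NUL-separated layout, so the split is only a best-effort view.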
  3. 11 May, 2015 2 commits
    • don't pass nameidata to ->follow_link() · 6e77137b
      Committed by Al Viro
      its only use is getting passed to nd_jump_link(), which can obtain
      it from current->nameidata
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new ->follow_link() and ->put_link() calling conventions · 680baacb
      Committed by Al Viro
      a) instead of storing the symlink body (via nd_set_link()) and returning
      an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
      that opaque pointer (into void * passed by address by caller) and returns
      the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
      symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
      is ignored in all cases except the last one.
      
      Storing NULL for opaque pointer (or not storing it at all) means no call
      of ->put_link().
      
      b) the body used to be passed to ->put_link() implicitly (via nameidata).
      Now only the opaque pointer is.  In the cases when we used the symlink body
      to free stuff, ->follow_link() now should store it as opaque pointer in addition
      to returning it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 16 Apr, 2015 2 commits
  5. 12 Dec, 2014 1 commit
    • userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Committed by Eric W. Biederman
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current process's user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the setgroups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Processes with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
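The ordering rule in this commit (the setgroups knob may only be written before the gid_map of the user namespace is set) can be sketched as a small write plan. This is a minimal Python sketch under stated assumptions: the helper names are hypothetical, the mapping covers a single gid, and actually applying the plan requires sufficient privilege over the target user namespace.

```python
def userns_gid_setup_plan(pid, inside_gid, outside_gid, allow_setgroups=False):
    """Return the (path, data) writes in the order they have to be performed."""
    knob = "allow" if allow_setgroups else "deny"
    return [
        # The knob has to be written first: once gid_map is set, writes to
        # /proc/<pid>/setgroups are no longer permitted.
        ("/proc/{}/setgroups".format(pid), knob),
        # One mapping line: <gid inside ns> <gid outside ns> <length>.
        ("/proc/{}/gid_map".format(pid), "{} {} 1".format(inside_gid, outside_gid)),
    ]


def apply_plan(plan):
    """Perform the writes in order; raises OSError if the kernel refuses one."""
    for path, data in plan:
        with open(path, "w") as handle:
            handle.write(data)


if __name__ == "__main__":
    # Only print the plan; pid 1234 and gid 1000 are made-up example values.
    for path, data in userns_gid_setup_plan(1234, 0, 1000):
        print(path, "<-", data)
```

Keeping plan construction separate from `apply_plan` makes the ordering easy to inspect without touching /proc.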
  6. 11 Dec, 2014 1 commit
  7. 20 Nov, 2014 1 commit
  8. 10 Oct, 2014 1 commit
  9. 09 Oct, 2014 2 commits
  10. 19 Sep, 2014 2 commits
  11. 09 Aug, 2014 14 commits
  12. 05 Aug, 2014 2 commits
  13. 08 Apr, 2014 3 commits
  14. 20 Mar, 2014 1 commit
  15. 11 Mar, 2014 1 commit
  16. 24 Jan, 2014 4 commits
    • proc: fix ->f_pos overflows in first_tid() · 9f6e963f
      Committed by Oleg Nesterov
      1. proc_task_readdir()->first_tid() path truncates f_pos to int, this
         is wrong even on 64bit.
      
         We could check that f_pos < PID_MAX or even INT_MAX in
         proc_task_readdir(), but this patch simply checks for the potential
         overflow in first_tid(); this check is a nop on 64bit.  We do not care if
         it was negative and the new unsigned value is huge; all we need to
         ensure is that we never wrongly return !NULL.
      
      2. Remove the 2nd "nr != 0" check before get_nr_threads(),
         nr_threads == 0 is not distinguishable from !pid_task() above.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Sameer Nanda <snanda@chromium.org>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
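To make the truncation problem concrete, the following minimal Python sketch models the C integer conversions with explicit masking. It assumes a 32-bit int and a 64-bit f_pos, and takes the width of unsigned long as a parameter; the function names are hypothetical and this is a model of the idea, not the kernel's code.

```python
INT_BITS = 32
LOFF_BITS = 64  # f_pos is modelled as a 64-bit value


def truncate_to_int(value):
    """Model the conversion of a wide value to a 32-bit signed int."""
    low = value & ((1 << INT_BITS) - 1)
    if low >= 1 << (INT_BITS - 1):
        return low - (1 << INT_BITS)
    return low


def survives_unsigned_long(f_pos, long_bits):
    """True if f_pos keeps its bit pattern when stored in an unsigned long.

    With long_bits == 64 this is always true (the check is a nop); with
    long_bits == 32 it fails whenever the upper bits of f_pos are in use.
    """
    nr = f_pos & ((1 << long_bits) - 1)
    return nr == (f_pos & ((1 << LOFF_BITS) - 1))


if __name__ == "__main__":
    big = (1 << 32) + 3
    print(truncate_to_int(big))             # looks like a small, valid position
    print(survives_unsigned_long(big, 32))  # overflow is detected on 32-bit
    print(survives_unsigned_long(big, 64))  # nothing to detect on 64-bit
```

The first call shows why truncating to int is wrong even on 64-bit: a position far past the end of the thread list turns into a small index that looks valid.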
    • proc: don't (ab)use ->group_leader in proc_task_readdir() paths · d855a4b7
      Committed by Oleg Nesterov
      proc_task_readdir() does not really need "leader", first_tid() has to
      revalidate it anyway.  Just pass proc_pid(inode) to first_tid() instead,
      it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if
      necessary.
      
      The patch also extracts the "inode is dead" code from
      pid_delete_dentry(dentry) into the new trivial helper,
      proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT
      if this dir was removed.
      
      This is a bit racy, but the race is very unlikely and the getdents() after
      opendir() can see the empty "." + ".." dir only once.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Sameer Nanda <snanda@chromium.org>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: change first_tid() to use while_each_thread() rather than next_thread() · c986c14a
      Committed by Oleg Nesterov
      Rewrite the main loop to use while_each_thread() instead of
      next_thread().  We are going to fix or replace while_each_thread();
      next_thread() should be avoided whenever possible.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Sameer Nanda <snanda@chromium.org>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: fix the potential use-after-free in first_tid() · 940fe479
      Committed by Oleg Nesterov
      proc_task_readdir() verifies that the result of get_proc_task() is
      pid_alive() and thus its ->group_leader is fine too.  However this is not
      necessarily true after rcu_read_unlock(), we need to recheck this again
      after first_tid() does rcu_read_lock().  Otherwise
      leader->thread_group.next (used by next_thread()) can be invalid if the
      rcu grace period expires in between.
      
      The race is subtle and unlikely, but it is still possible afaics.  To
      simplify, let's ignore the "likely" case when tid != 0; f_version can be
      cleared by proc_task_operations->llseek().
      
      Suppose we have a main thread M and its subthread T.  Suppose that f_pos
      == 3, iow first_tid() should return T.  Now suppose that the following
      happens between rcu_read_unlock() and rcu_read_lock():
      
      	1. T execs and becomes the new leader. This removes M from
      	    ->thread_group but next_thread(M) is still T.
      
      	2. T creates another thread X which does exec as well, T
      	   goes away.
      
      	3. X creates another subthread, this increments nr_threads.
      
      	4. first_tid() does next_thread(M) and returns the already
      	   dead T.
      
      Note also that we need 2.  and 3.  only because of get_nr_threads() check,
      and this check was supposed to be optimization only.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Sameer Nanda <snanda@chromium.org>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>