1. 04 11月, 2016 1 次提交
  2. 01 9月, 2016 1 次提交
  3. 03 8月, 2016 1 次提交
  4. 29 7月, 2016 4 次提交
    • M
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj · 44a70ade
      Michal Hocko 提交于
      oom_score_adj is shared for the thread groups (via struct signal) but this
      is not sufficient to cover processes sharing mm (CLONE_VM without
      CLONE_SIGHAND) and so we can easily end up in a situation when some
      processes update their oom_score_adj and confuse the oom killer.  In the
      worst case some of those processes might hide from the oom killer
      altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
      would then pick up those eligible but won't be allowed to kill others
      sharing the same mm so the mm wouldn't release the mm and so the memory.
      
      It would be ideal to have the oom_score_adj per mm_struct because that is
      the natural entity OOM killer considers.  But this will not work because
      some programs are doing
      
      	vfork()
      	set_oom_adj()
      	exec()
      
      We can achieve the same though.  oom_score_adj write handler can set the
      oom_score_adj for all processes sharing the same mm if the task is not in
      the middle of vfork.  As a result all the processes will share the same
      oom_score_adj.  The current implementation is rather pessimistic and
      checks all the existing processes by default if there is more than 1
      holder of the mm but we do not have any reliable way to check for external
      users yet.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44a70ade
    • M
      proc, oom_adj: extract oom_score_adj setting into a helper · 1d5f0acb
      Michal Hocko 提交于
      Currently we have two proc interfaces to set oom_score_adj.  The legacy
      /proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
      specific handlers.  Big part of the logic is duplicated so extract the
      common code into __set_oom_adj helper.  Legacy knob still expects some
      details slightly different so make sure those are handled same way - e.g.
      the legacy mode ignores oom_score_adj_min and it warns about the usage.
      
      This patch shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-4-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d5f0acb
    • M
      proc, oom: drop bogus sighand lock · f913da59
      Michal Hocko 提交于
      Oleg has pointed out that can simplify both oom_adj_{read,write} and
      oom_score_adj_{read,write} even further and drop the sighand lock.  The
      main purpose of the lock was to protect p->signal from going away but this
      will not happen since ea6d290c ("signals: make task_struct->signal
      immutable/refcountable").
      
      The other role of the lock was to synchronize different writers,
      especially those with CAP_SYS_RESOURCE.  Introduce a mutex for this
      purpose.  Later patches will need this lock anyway.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/1466426628-15074-3-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f913da59
    • M
      proc, oom: drop bogus task_lock and mm check · d49fbf76
      Michal Hocko 提交于
      Series "Handle oom bypass more gracefully", V5
      
      The following 10 patches should put some order to very rare cases of mm
      shared between processes and make the paths which bypass the oom killer
      oom reapable and therefore much more reliable finally.  Even though mm
      shared outside of thread group is rare (either vforked tasks for a short
      period, use_mm by kernel threads or exotic thread model of
      clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them.  Not
      only it makes the current oom killer logic quite hard to follow and
      reason about it can lead to weird corner cases.  E.g.  it is possible to
      select an oom victim which shares the mm with unkillable process or
      bypass the oom killer even when other processes sharing the mm are still
      alive and other weird cases.
      
      Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
      This can be considered a bug fix with a low impact as nobody has noticed
      for years.
      
      Patch 2 drops sighand lock because it is not needed anymore as pointed
      by Oleg.
      
      Patch 3 is a clean up of oom_score_adj handling and a preparatory work
      for later patches.
      
      Patch 4 enforces oom_adj_score to be consistent between processes
      sharing the mm to behave consistently with the regular thread groups.
      This can be considered a user visible behavior change because one thread
      group updating oom_score_adj will affect others which share the same mm
      via clone(CLONE_VM).  I argue that this should be acceptable because we
      already have the same behavior for threads in the same thread group and
      sharing the mm without signal struct is just a different model of
      threading.  This is probably the most controversial part of the series,
      I would like to find some consensus here.  There were some suggestions
      to hook some counter/oom_score_adj into the mm_struct but I feel that
      this is not necessary right now and we can rely on proc handler +
      oom_kill_process to DTRT.  I can be convinced otherwise but I strongly
      think that whatever we do the userspace has to have a way to see the
      current oom priority as consistently as possible.
      
      Patch 5 makes sure that no vforked task is selected if it is sharing the
      mm with oom unkillable task.
      
      Patch 6 ensures that all user tasks sharing the mm are killed which in
      turn makes sure that all oom victims are oom reapable.
      
      Patch 7 guarantees that task_will_free_mem will always imply reapable
      bypass of the oom killer.
      
      Patch 8 is new in this version and it addresses an issue pointed out by
      0-day OOM report where an oom victim was reaped several times.
      
      Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
      task and hides it from the oom killer to move on when no progress can be
      made.  This will give an upper bound to how long an oom_reapable task
      can block the oom killer from selecting another victim if the oom_reaper
      is not able to reap the victim.
      
      Patch 10 tries to plug the (hopefully) last hole when we can still lock
      up when the oom victim is shared with oom unkillable tasks (kthreads and
      global init).  We just try to be best effort in that case and rather
      fallback to kill something else than risk a lockup.
      
      This patch (of 10):
      
      Both oom_adj_write and oom_score_adj_write are using task_lock, check for
      task->mm and fail if it is NULL.  This is not needed because the
      oom_score_adj is per signal struct so we do not need mm at all.  The code
      has been introduced by 3d5992d2 ("oom: add per-mm oom disable count")
      but we do not do per-mm oom disable since c9f01245 ("oom: remove
      oom_disable_count").
      
      The task->mm check is even not correct because the current thread might
      have exited but the thread group might be still alive - e.g.  thread group
      leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
      fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
      succeed.  This is unexpected at best.
      
      Remove the lock along with the check to fix the unexpected behavior and
      also because there is not real need for the lock in the first place.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d49fbf76
  5. 21 5月, 2016 1 次提交
    • J
      procfs: fix pthread cross-thread naming if !PR_DUMPABLE · 1b3044e3
      Janis Danisevskis 提交于
      The PR_DUMPABLE flag causes the pid related paths of the proc file
      system to be owned by ROOT.
      
      The implementation of pthread_set/getname_np however needs access to
      /proc/<pid>/task/<tid>/comm.  If PR_DUMPABLE is false this
      implementation is locked out.
      
      This patch installs a special permission function for the file "comm"
      that grants read and write access to all threads of the same group
      regardless of the ownership of the inode.  For all other threads the
      function falls back to the generic inode permission check.
      
      [akpm@linux-foundation.org: fix spello in comment]
      Signed-off-by: NJanis Danisevskis <jdanis@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Calvin Owens <calvinowens@fb.com>
      Cc: Jann Horn <jann@thejh.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b3044e3
  6. 10 5月, 2016 1 次提交
  7. 06 5月, 2016 1 次提交
    • M
      proc: prevent accessing /proc/<PID>/environ until it's ready · 8148a73c
      Mathias Krause 提交于
      If /proc/<PID>/environ gets read before the envp[] array is fully set up
      in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
      read more bytes than are actually written, as env_start will already be
      set but env_end will still be zero, making the range calculation
      underflow, allowing to read beyond the end of what has been written.
      
      Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
      zero.  It is, apparently, intentionally set last in create_*_tables().
      
      This bug was found by the PaX size_overflow plugin that detected the
      arithmetic underflow of 'this_len = env_end - (env_start + src)' when
      env_end is still zero.
      
      The expected consequence is that userland trying to access
      /proc/<PID>/environ of a not yet fully set up process may get
      inconsistent data as we're in the middle of copying in the environment
      variables.
      
      Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Pax Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8148a73c
  8. 03 5月, 2016 2 次提交
  9. 18 3月, 2016 3 次提交
  10. 21 1月, 2016 2 次提交
    • M
      proc read mm's {arg,env}_{start,end} with mmap semaphore taken. · a3b609ef
      Mateusz Guzik 提交于
      Only functions doing more than one read are modified.  Consumeres
      happened to deal with possibly changing data, but it does not seem like
      a good thing to rely on.
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anshuman Khandual <anshuman.linux@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3b609ef
    • J
      ptrace: use fsuid, fsgid, effective creds for fs access checks · caaee623
      Jann Horn 提交于
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: NJann Horn <jann@thejh.net>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caaee623
  11. 04 1月, 2016 1 次提交
  12. 31 12月, 2015 1 次提交
  13. 19 12月, 2015 1 次提交
  14. 09 12月, 2015 1 次提交
    • A
      replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Al Viro 提交于
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
      	* inode and dentry are passed separately
      	* might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
      	* when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; said that, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b255391
  15. 06 11月, 2015 1 次提交
  16. 01 10月, 2015 1 次提交
    • I
      fs/proc, core/debug: Don't expose absolute kernel addresses via wchan · b2f73922
      Ingo Molnar 提交于
      So the /proc/PID/stat 'wchan' field (the 30th field, which contains
      the absolute kernel address of the kernel function a task is blocked in)
      leaks absolute kernel addresses to unprivileged user-space:
      
              seq_put_decimal_ull(m, ' ', wchan);
      
      The absolute address might also leak via /proc/PID/wchan as well, if
      KALLSYMS is turned off or if the symbol lookup fails for some reason:
      
      static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
                                struct pid *pid, struct task_struct *task)
      {
              unsigned long wchan;
              char symname[KSYM_NAME_LEN];
      
              wchan = get_wchan(task);
      
              if (lookup_symbol_name(wchan, symname) < 0) {
                      if (!ptrace_may_access(task, PTRACE_MODE_READ))
                              return 0;
                      seq_printf(m, "%lu", wchan);
              } else {
                      seq_printf(m, "%s", symname);
              }
      
              return 0;
      }
      
      This isn't ideal, because for example it trivially leaks the KASLR offset
      to any local attacker:
      
        fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
        ffffffff8123b380
      
      Most real-life uses of wchan are symbolic:
      
        ps -eo pid:10,tid:10,wchan:30,comm
      
      and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:
      
        triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
        open("/proc/30833/wchan", O_RDONLY)     = 6
      
      There's one compatibility quirk here: procps relies on whether the
      absolute value is non-zero - and we can provide that functionality
      by outputing "0" or "1" depending on whether the task is blocked
      (whether there's a wchan address).
      
      These days there appears to be very little legitimate reason
      user-space would be interested in  the absolute address. The
      absolute address is mostly historic: from the days when we
      didn't have kallsyms and user-space procps had to do the
      decoding itself via the System.map.
      
      So this patch sets all numeric output to "0" or "1" and keeps only
      symbolic output, in /proc/PID/wchan.
      
      ( The absolute sleep address can generally still be profiled via
        perf, by tasks with sufficient privileges. )
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: kasan-dev <kasan-dev@googlegroups.com>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b2f73922
  17. 11 9月, 2015 2 次提交
    • A
      proc: convert to kstrto*()/kstrto*_from_user() · 774636e1
      Alexey Dobriyan 提交于
      Convert from manual allocation/copy_from_user/...  to kstrto*() family
      which were designed for exactly that.
      
      One case can not be converted to kstrto*_from_user() to make code even
      more simpler because of whitespace stripping, oh well...
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      774636e1
    • C
      procfs: always expose /proc/<pid>/map_files/ and make it readable · bdb4d100
      Calvin Owens 提交于
      Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
      only exposed if CONFIG_CHECKPOINT_RESTORE is set.
      
      Each mapped file region gets a symlink in /proc/<pid>/map_files/
      corresponding to the virtual address range at which it is mapped.  The
      symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
      to the backing file even if that backing file has been unlinked.
      
      Currently, files which are mapped, unlinked, and closed are impossible to
      stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
      functionality "hole".
      
      Not being able to stat() such files makes noticing and explicitly
      accounting for the space they use on the filesystem impossible.  You can
      work around this by summing up the space used by every file in the
      filesystem and subtracting that total from what statfs() tells you, but
      that obviously isn't great, and it becomes unworkable once your filesystem
      becomes large enough.
      
      This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
      adjusts the permissions enforced on it as follows:
      
      * proc_map_files_lookup()
      * proc_map_files_readdir()
      * map_files_d_revalidate()
      
      	Remove the CAP_SYS_ADMIN restriction, leaving only the current
      	restriction requiring PTRACE_MODE_READ. The information made
      	available to userspace by these three functions is already
      	available in /proc/PID/maps with MODE_READ, so I don't see any
      	reason to limit them any further (see below for more detail).
      
      * proc_map_files_follow_link()
      
      	This stub has been added, and requires that the user have
      	CAP_SYS_ADMIN in order to follow the links in map_files/,
      	since there was concern on LKML both about the potential for
      	bypassing permissions on ancestor directories in the path to
      	files pointed to, and about what happens with more exotic
      	memory mappings created by some drivers (ie dma-buf).
      
      In older versions of this patch, I changed every permission check in
      the four functions above to enforce MODE_ATTACH instead of MODE_READ.
      This was an oversight on my part, and after revisiting the discussion
      it seems that nobody was concerned about anything outside of what is
      made possible by ->follow_link(). So in this version, I've left the
      checks for PTRACE_MODE_READ as-is.
      
      [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
      Signed-off-by: NCalvin Owens <calvinowens@fb.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bdb4d100
  18. 18 7月, 2015 1 次提交
  19. 04 7月, 2015 1 次提交
  20. 26 6月, 2015 2 次提交
    • I
      fs, proc: introduce CONFIG_PROC_CHILDREN · 2e13ba54
      Iago López Galeiras 提交于
      Commit 81841161 ("fs, proc: introduce /proc/<pid>/task/<tid>/children
      entry") introduced the children entry for checkpoint restore and the
      file is only available on kernels configured with CONFIG_EXPERT and
      CONFIG_CHECKPOINT_RESTORE.
      
      This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
      because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
      But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
      
      However, the children proc file is useful outside of checkpoint restore.
      I would like to use it in rkt.  The rkt process exec() another program
      it does not control, and that other program will fork()+exec() a child
      process.  I would like to find the pid of the child process from an
      external tool without iterating in /proc over all processes to find
      which one has a parent pid equal to rkt.
      
      This commit introduces CONFIG_PROC_CHILDREN and makes
      CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
      /proc/<pid>/task/<tid>/children without needing to enable
      CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
      
      Alban tested that /proc/<pid>/task/<tid>/children is present when the
      kernel is configured with CONFIG_PROC_CHILDREN=y but without
      CONFIG_CHECKPOINT_RESTORE
      Signed-off-by: NIago López Galeiras <iago@endocode.com>
      Tested-by: NAlban Crequy <alban@endocode.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Djalal Harouni <djalal@endocode.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e13ba54
    • A
      proc: fix PAGE_SIZE limit of /proc/$PID/cmdline · c2c0bb44
      Alexey Dobriyan 提交于
      /proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
      
      	$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
      
      However, command line size was never limited to PAGE_SIZE but to 128 KB
      and relatively recently limitation was removed altogether.
      
      People noticed and ask questions:
      http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
      
      seq file interface is not OK, because it kmalloc's for whole output and
      open + read(, 1) + sleep will pin arbitrary amounts of kernel memory.  To
      not do that, limit must be imposed which is incompatible with arbitrary
      sized command lines.
      
      I apologize for hairy code, but this it direct consequence of command line
      layout in memory and hacks to support things like "init [3]".
      
      The loops are "unrolled" otherwise it is either macros which hide control
      flow or functions with 7-8 arguments with equal line count.
      
      There should be real setproctitle(2) or something.
      
      [akpm@linux-foundation.org: fix a billion min() warnings]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Tested-by: NJarod Wilson <jarod@redhat.com>
      Acked-by: NJarod Wilson <jarod@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2c0bb44
  21. 11 5月, 2015 2 次提交
    • A
      don't pass nameidata to ->follow_link() · 6e77137b
      Al Viro 提交于
      its only use is getting passed to nd_jump_link(), which can obtain
      it from current->nameidata
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6e77137b
    • A
      new ->follow_link() and ->put_link() calling conventions · 680baacb
      Al Viro 提交于
      a) instead of storing the symlink body (via nd_set_link()) and returning
      an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
      that opaque pointer (into void * passed by address by caller) and returns
      the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
      symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
      is ignored in all cases except the last one.
      
      Storing NULL for opaque pointer (or not storing it at all) means no call
      of ->put_link().
      
      b) the body used to be passed to ->put_link() implicitly (via nameidata).
      Now only the opaque pointer is.  In the cases when we used the symlink body
      to free stuff, ->follow_link() now should store it as opaque pointer in addition
      to returning it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      680baacb
  22. 16 4月, 2015 2 次提交
  23. 12 12月, 2014 1 次提交
    • E
      userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Eric W. Biederman 提交于
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
  24. 11 12月, 2014 1 次提交
  25. 20 11月, 2014 1 次提交
  26. 10 10月, 2014 1 次提交
  27. 09 10月, 2014 2 次提交
  28. 19 9月, 2014 1 次提交