1. 20 Jul, 2020 1 commit
  2. 10 Jun, 2020 3 commits
  3. 19 May, 2020 1 commit
  4. 25 Apr, 2020 2 commits
  5. 22 Apr, 2020 3 commits
    • A
    • A
      proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option · 24a71ce5
      Alexey Gladkov committed
      If the "hidepid=4" mount option is set, do not instantiate pids that
      we cannot ptrace. "hidepid=4" means that procfs should only contain
      pids that the caller can ptrace.
      Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      24a71ce5
    • A
      proc: allow to mount many instances of proc in one pid namespace · fa10fed3
      Alexey Gladkov committed
      This patch allows multiple procfs instances inside the same pid
      namespace. The aim here is lightweight sandboxes, and to allow
      that we have to modernize procfs internals.
      
      1) The main aim of this work is to have one supervisor for apps on
      embedded systems. Right now we have some lightweight sandbox support;
      however, if we create pid namespaces we have to manage all the
      processes inside them too, whereas our goal is to be able to run a bunch
      of apps, each one inside its own mount namespace, without being able to
      notice each other. We only want to use mount namespaces, and we want
      procfs to behave more like a real mount point.
      
      2) Linux Security Modules have multiple ptrace paths inside some
      subsystems, however inside procfs, the implementation does not guarantee
      that the ptrace() check which triggers the security_ptrace_check() hook
      will always run. We have the 'hidepid' mount option that can be used to
      force the ptrace_may_access() check inside has_pid_permissions() to run.
      The problem is that 'hidepid' is per pid namespace and not attached to
      the mount point: any remount or modification of 'hidepid' will propagate
      to all other procfs mounts.
      
      This also makes it hard to support the Yama LSM in desktop and user
      sessions. The Yama ptrace scope, which restricts ptrace and some other
      syscalls to be allowed only on inferiors, can be updated to have a
      per-task context, where the context will be inherited during fork(),
      clone() and preserved across execve(). If we support multiple private
      procfs instances, then we may force the ptrace_may_access() check on
      /proc/<pids>/ to always run inside those new procfs instances. This will
      allow specifying for user sessions whether procfs should be populated
      only with pids that the user can ptrace.
      
      By using Yama ptrace scope, some restricted users will only be able to see
      inferiors inside /proc; they won't even be able to see their other
      processes. Some software like Chromium, Firefox's crash handler, Wine
      and others already use Yama to restrict which processes can be
      ptraced. This change makes it possible to restrict
      /proc/<pids>/, but more importantly it gives desktop users a
      generic and usable way to specify which users should see all processes
      and which users cannot.
      
      Side notes:
      * This covers the lack of seccomp where it is not able to parse
      arguments, it is easy to install a seccomp filter on direct syscalls
      that operate on pids, however /proc/<pid>/ is a Linux ABI using
      filesystem syscalls. With this change LSMs should be able to analyze
      open/read/write/close...
      
      In the new patch set version I removed the 'newinstance' option
      as suggested by Eric W. Biederman.
      
      Selftest has been added to verify new behavior.
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      fa10fed3
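The per-mount behaviour described in this commit can be sketched from userspace as follows. This is a minimal illustration, not the selftest mentioned in the commit message: the helper names and the target directories are hypothetical, and it assumes a kernel that carries these patches, an existing target directory, and CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Mount a separate procfs instance at `target` with the given option
 * string (for example "hidepid=4"). Hypothetical helper for illustration.
 * Returns 0 on success, -1 on failure with errno set by mount(2).
 */
int mount_proc_instance(const char *target, const char *opts)
{
    assert(target != NULL && opts != NULL);

    if (mount("proc", target, "proc", 0, opts) == -1) {
        fprintf(stderr, "mount proc on %s (%s): %s\n",
                target, opts, strerror(errno));
        return -1;
    }
    return 0;
}

/*
 * Two instances in the same pid namespace with different policies:
 * one showing every pid, one restricted to pids the caller can ptrace.
 * With the per-mount options from this series, the second mount is
 * not expected to change the options of the first.
 */
int mount_two_instances(const char *all_dir, const char *restricted_dir)
{
    if (mount_proc_instance(all_dir, "hidepid=0") == -1)
        return -1;
    return mount_proc_instance(restricted_dir, "hidepid=4");
}
```

A caller would create both directories first and then invoke mount_two_instances() on them; on kernels without this series the option string may be rejected or applied to the whole pid namespace.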
  6. 16 Apr, 2020 1 commit
  7. 10 Apr, 2020 1 commit
    • E
      proc: Use a dedicated lock in struct pid · 63f818f4
      Eric W. Biederman committed
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that wait_pidfd.lock is taken under the tasklist
      lock, so it must always be taken with irqs disabled: tasklist_lock can be
      taken from interrupt context, and if wait_pidfd.lock was already taken this
      would create a lock order inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe and I think the code will eventually
      do that.  It was rightly pointed out by Christian that sharing the
      wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock of
      its own for struct pid things.  I have verified that lockdep sees all 3
      paths where we take the new pid->lock and lockdep does not complain.
      
      It is my current day dream that one day pid->lock can be used to guard the
      task lists as well and then the tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs disabled.
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
      Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
      Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
      Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      63f818f4
  8. 25 Mar, 2020 2 commits
  9. 25 Feb, 2020 1 commit
    • E
      proc: Use a list of inodes to flush from proc · 7bc3e6e5
      Eric W. Biederman committed
      Rework the flushing of proc to use a list of directory inodes that
      need to be flushed.
      
      The list is kept on struct pid not on struct task_struct, as there is
      a fixed connection between proc inodes and pids but at least for the
      case of de_thread the pid of a task_struct changes.
      
      This removes the dependency on proc_mnt, which allows different
      mounts of proc to have different mount options even in the same pid
      namespace, and this allows for the removal of proc_mnt, which will
      trivially allow the first mount of proc to honor its mount options.
      
      This flushing remains an optimization.  The functions
      pid_delete_dentry and pid_revalidate ensure that ordinary dcache
      management will not attempt to use dentries past the point their
      respective task has died.  When unused the shrinker will
      eventually be able to remove these dentries.
      
      There is a case in de_thread where proc_flush_pid can be
      called early for a given pid, which winds up being
      safe (if suboptimal) as this is just an optimization.
      
      Only pid directories are put on the list as the other
      per pid files are children of those directories and
      d_invalidate on the directory will get them as well.
      
      So that the pid can be used during flushing, its reference count is
      taken in release_task and dropped in proc_flush_pid.  Further, the call
      of proc_flush_pid is moved after the tasklist_lock is released in
      release_task so that it is certain that the pid has already been
      unhashed when flushing is taking place.  This removes a small race
      where a dentry could be recreated.
      
      As struct pid is supposed to be small and I need a per pid lock,
      I reuse the only lock that currently exists in struct pid, the
      wait_pidfd.lock.
      
      The net result is that this adds all of this functionality
      with just a little extra list management overhead and
      a single extra pointer in struct pid.
      
      v2: Initialize pid->inodes.  I somehow failed to get that
          initialization into the initial version of the patch.  A boot
          failure was reported by "kernel test robot <lkp@intel.com>", and
          failure to initialize that pid->inodes matches all of the reported
          symptoms.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      7bc3e6e5
  10. 20 Jan, 2020 1 commit
  11. 19 Jan, 2020 1 commit
  12. 14 Jan, 2020 1 commit
  13. 09 Dec, 2019 1 commit
  14. 17 Jul, 2019 2 commits
    • L
      /proc/<pid>/cmdline: add back the setproctitle() special case · d26d0cd9
      Linus Torvalds committed
      This makes the setproctitle() special case very explicit indeed, and
      handles it with a separate helper function entirely.  In the process, it
      re-instates the original semantics of simply stopping at the first NUL
      character when the original last NUL character is no longer there.
      
      [ The original semantics can still be seen in mm/util.c: get_cmdline()
        that is limited to a fixed-size buffer ]
      
      This makes the logic about when we use the string lengths etc much more
      obvious, and makes it easier to see what we do and what the two very
      different cases are.
      
      Note that even when we allow walking past the end of the argument array
      (because the setproctitle() might have overwritten and overflowed the
      original argv[] strings), we only allow it when it overflows into the
      environment region if it is immediately adjacent.
      
      [ Fixed for missing 'count' checks noted by Alexey Izbyshev ]
      
      Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
      Fixes: 5ab82718 ("fs/proc: simplify and clarify get_mm_cmdline() function")
      Cc: Jakub Jankowski <shasta@toxcorp.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d26d0cd9
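From the reader's side, /proc/<pid>/cmdline is a byte buffer of NUL-separated argument strings, and after setproctitle() the final NUL may be missing. The sketch below is a userspace illustration only, with hypothetical helper names; it is not the kernel's get_mm_cmdline() logic. It reads the file and counts the strings, treating an unterminated tail as one last argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to `size` bytes of /proc/<pid>/cmdline into `buf`.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_cmdline(pid_t pid, char *buf, size_t size)
{
    char path[64];
    ssize_t total = 0;
    int fd;

    assert(buf != NULL && size > 0);
    snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long)pid);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    while ((size_t)total < size) {
        ssize_t n = read(fd, buf + total, size - (size_t)total);
        if (n == -1) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break; /* end of file */
        total += n;
    }
    close(fd);
    return total;
}

/*
 * Count the NUL-separated strings in a cmdline buffer. A trailing run
 * of bytes without a final NUL still counts as one argument.
 */
size_t count_args(const char *buf, size_t len)
{
    size_t count = 0, pos = 0;

    while (pos < len) {
        const char *nul = memchr(buf + pos, '\0', len - pos);
        size_t n = nul ? (size_t)(nul - (buf + pos)) : len - pos;

        count++;
        pos += n + 1; /* skip the string and its terminator, if any */
    }
    return count;
}
```

Calling read_cmdline(getpid(), buf, sizeof(buf)) followed by count_args(buf, n) typically yields the process's argc, unless the process has rewritten its argument area.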
    • L
      /proc/<pid>/cmdline: remove all the special cases · 3d712546
      Linus Torvalds committed
      Start off with a clean slate that only reads exactly from arg_start to
      arg_end, without any oddities.  This simplifies the code and in the
      process removes the case that caused us to potentially leak an
      uninitialized byte from the temporary kernel buffer.
      
      Note that in order to start from scratch with an understandable base,
      this simplifies things _too_ much, and removes all the legacy logic to
      handle setproctitle() having changed the argument strings.
      
      We'll add back those special cases very differently in the next commit.
      
      Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
      Fixes: f5b65348 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d712546
  15. 13 Jul, 2019 3 commits
    • S
      oom: decouple mems_allowed from oom_unkillable_task · ac311a14
      Shakeel Butt committed
      Commit ef08e3b4 ("[PATCH] cpusets: confine oom_killer to
      mem_exclusive cpuset") introduces a heuristic where a potential
      oom-killer victim is skipped if the intersection of the potential victim
      and the current (the process that triggered the oom) is empty, based on
      the reasoning that killing such a victim most probably will not help the
      current allocating process.
      
      However the commit 7887a3da ("[PATCH] oom: cpuset hint") changed the
      heuristic to just decrease the oom_badness scores of such potential
      victims, based on the reasoning that the cpuset of such processes might
      have changed and previously they may have allocated memory on mems where
      the current allocating process can allocate from.
      
      Unintentionally 7887a3da ("[PATCH] oom: cpuset hint") introduced a
      side effect as the oom_badness is also exposed to the user space through
      /proc/[pid]/oom_score, so, readers with different cpusets can read
      different oom_score of the same process.
      
      Later, commit 6cf86ac6 ("oom: filter tasks not sharing the same
      cpuset") fixed the side effect introduced by 7887a3da by moving the
      cpuset intersection back to only oom-killer context and out of
      oom_badness.  However the combination of ab290adb ("oom: make
      oom_unkillable_task() helper function") and 26ebc984 ("oom:
      /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
      brought back the cpuset intersection check into the oom_badness
      calculation function.
      
      Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
      oom context is also doing cpuset/mempolicy intersection, which is quite
      wrong and is caught by syzkaller with the following report:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
        oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
        mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
        select_bad_process mm/oom_kill.c:374 [inline]
        out_of_memory mm/oom_kill.c:1088 [inline]
        out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
        mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
        mem_cgroup_oom mm/memcontrol.c:1905 [inline]
        try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
        mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
        mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
        do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
        do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
        wp_huge_pmd mm/memory.c:3793 [inline]
        __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
        handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
        do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
        __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
        do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
        page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
      RIP: 0033:0x400590
      Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
      8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
      00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
      RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
      RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
      RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
      R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
      Modules linked in:
      ---[ end trace a65689219582ffff ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      
      The fix is to decouple the cpuset/mempolicy intersection check from
      oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
      only done in the global oom context.
      
      [shakeelb@google.com: change function name and update comment]
        Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac311a14
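The commit message notes that oom_badness is visible to user space through /proc/[pid]/oom_score. A minimal sketch of reading that value is shown below; the helper names are hypothetical, and the exact scoring is kernel-version dependent, so the code only parses the number and does not interpret it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read /proc/<pid>/oom_score as a long.
 * Returns the score, or -1 if the file cannot be opened or parsed.
 */
long read_oom_score(pid_t pid)
{
    char path[64];
    long score = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/oom_score", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fscanf(f, "%ld", &score) != 1)
        score = -1;
    fclose(f);
    return score;
}

/* Convenience wrapper: the score of the calling process. */
long read_own_oom_score(void)
{
    return read_oom_score(getpid());
}
```

With the intersection check kept out of the oom_badness calculation, two readers in different cpusets would be expected to see the same number for the same target pid.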
    • S
      mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt committed
      oom_unkillable_task() can be called from three different contexts i.e.
      global OOM, memcg OOM and oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform task_in_mem_cgroup()
      check for global OOM and oom_score procfs interface, those contexts
      provide a NULL memcg and skip the task_in_mem_cgroup() check.  However
      for memcg OOM context, the oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ba749ee
    • K
      proc: use down_read_killable mmap_sem for /proc/pid/map_files · cd9e2bb8
      Konstantin Khlebnikov committed
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      It seems ->d_revalidate() could return any error (except ECHILD) to abort
      validation and pass error as result of lookup sequence.
      
      [akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
      Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd9e2bb8
  16. 27 Jun, 2019 1 commit
  17. 12 Jun, 2019 1 commit
    • A
      proc: Add /proc/<pid>/arch_status · 68bc30bb
      Aubrey Li committed
      Exposing architecture specific per process information is useful for
      various reasons. An example is the AVX512 usage on x86 which is important
      for task placement for power/performance optimizations.
      
      Adding this information to the existing /proc/pid/status file would be the
      obvious choice, but it has been agreed that an explicit arch_status file
      is better in separating the generic and architecture specific information.
      
      [ tglx: Massage changelog ]
      Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: peterz@infradead.org
      Cc: hpa@zytor.com
      Cc: ak@linux.intel.com
      Cc: tim.c.chen@linux.intel.com
      Cc: dave.hansen@intel.com
      Cc: arjan@linux.intel.com
      Cc: adobriyan@gmail.com
      Cc: aubrey.li@intel.com
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Linux API <linux-api@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey.li@linux.intel.com
      68bc30bb
  18. 15 May, 2019 1 commit
  19. 29 Apr, 2019 2 commits
    • P
      proc: prevent changes to overridden credentials · 35a196be
      Paul Moore committed
      Prevent userspace from changing the /proc/PID/attr values if the
      task's credentials are currently overridden.  This not only makes sense
      conceptually, it also prevents some really bizarre error cases caused
      when trying to commit credentials to a task with overridden
      credentials.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: "chengjian (D)" <cj.chengjian@huawei.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Acked-by: James Morris <james.morris@microsoft.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      35a196be
    • T
      proc: Simplify task stack retrieval · e988e5ec
      Thomas Gleixner committed
      Replace the indirection through struct stack_trace with an invocation of
      the storage array based interface.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: linux-mm@kvack.org
      Cc: David Rientjes <rientjes@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: iommu@lists.linux-foundation.org
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: dm-devel@redhat.com
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: intel-gfx@lists.freedesktop.org
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: dri-devel@lists.freedesktop.org
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190425094801.589304463@linutronix.de
      e988e5ec
  20. 15 Apr, 2019 1 commit
  21. 04 Apr, 2019 1 commit
    • S
      ptrace: Remove maxargs from task_current_syscall() · 631b7aba
      Steven Rostedt (Red Hat) committed
      task_current_syscall() has a single user that passes in 6 for maxargs, which
      is the maximum number of arguments that can be used to get system calls from
      syscall_get_arguments(). Instead of passing in a number of arguments to
      grab, just get 6 arguments. The args argument even specifies that it's an
      array of 6 items.
      
      This will also allow changing syscall_get_arguments() to not get a variable
      number of arguments, but always grab 6.
      
      Linus also suggested not passing in a bunch of arguments to
      task_current_syscall() but to instead pass in a pointer to a structure, and
      just fill the structure. struct seccomp_data has almost all the parameters
      that are needed except for the stack pointer (sp). As seccomp_data is part of
      uapi, and I'm afraid to change it, a new structure was created
      "syscall_info", which includes seccomp_data and adds the "sp" field.
      
      Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      631b7aba
  22. 13 Mar, 2019 2 commits
  23. 06 Mar, 2019 3 commits
    • A
      proc: use seq_puts() everywhere · 08b55775
      Alexey Dobriyan committed
      seq_printf() without format specifiers == faster seq_puts()
      
      Link: http://lkml.kernel.org/r/20190114200545.GC9680@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08b55775
    • Z
      867aaccf
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner committed
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from /proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument */
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
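The calling convention described above can be sketched from userspace as follows. This is an illustrative wrapper, not code from the patch: the helper names are hypothetical, the fallback syscall number 424 is the one quoted in this commit message (it is only used when the C library headers do not define SYS_pidfd_send_signal), and it assumes /proc is mounted and the kernel implements the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* number quoted in the commit message */
#endif

/*
 * Open /proc/<pid> as a directory fd, which this patch accepts as a
 * stable handle on the process. Returns the fd, or -1 on error.
 */
int open_proc_pidfd(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Send `sig` through the handle. A NULL siginfo_t makes the call behave
 * like kill(<positive-pid>, sig); flags must currently be 0.
 * Returns 0 on success, -1 on error with errno set.
 */
int send_signal_via_pidfd(int pidfd, int sig)
{
    assert(pidfd >= 0);
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}
```

A caller would open the handle while the target is known to be alive, keep it, and later pass it to send_signal_via_pidfd(); signal 0 can be used as a liveness probe without delivering anything.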
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does not currently operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signals to threads and
      process groups, appropriately named flags (e.g. PIDFD_TYPE_PGID and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this API is a property of the flags argument, not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that, given that the
      flags argument decides the scope of the signal instead of different types
      of fds, it might make sense to settle for either "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed a strong preference for the "pidfd_"
      prefix (cf. [13]) and other developers were less opinionated about the
      name, we should settle for "pidfd_" to avoid further bikeshedding.
      
      The "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      former because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
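
A small sketch of the zombie case follows. It is an illustration under assumptions, not code from the patch: probe_zombie() is a made-up helper, it uses waitid(2) with WNOWAIT to hold the child in the zombie state, and it sends the null signal (sig 0) so that nothing is actually delivered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older headers; 424 is the number stated in this commit. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif

/*
 * Hypothetical helper: fork a child that exits at once, wait until it is a
 * zombie without reaping it, then probe it through its /proc/<pid> fd.
 * Returns 0 if the zombie could be signaled, otherwise the errno seen.
 */
int probe_zombie(void)
{
        char path[64];
        siginfo_t wi;
        int fd, err = 0;
        pid_t pid = fork();

        if (pid < 0)
                return errno;
        if (pid == 0)
                _exit(0);

        /* The pid cannot be recycled before we reap the child. */
        snprintf(path, sizeof(path), "/proc/%d", (int)pid);
        fd = open(path, O_DIRECTORY | O_CLOEXEC);
        if (fd < 0) {
                err = errno;
                waitpid(pid, NULL, 0);
                return err;
        }

        /* WNOWAIT: block until the child has exited, but leave it a zombie. */
        if (waitid(P_PID, (id_t)pid, &wi, WEXITED | WNOWAIT) < 0)
                err = errno;
        else if (syscall(__NR_pidfd_send_signal, fd, 0, NULL, 0U) < 0)
                err = errno;

        close(fd);
        waitpid(pid, NULL, 0);  /* now reap the zombie */
        return err;
}
```

On a kernel that carries this patch, the expectation is a return value of 0, matching the statement that no special error is reported for zombies. On older kernels the helper would return ENOSYS.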
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear what values certain members of siginfo_t
      would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_any() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. [12]).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               errno = ENOSYS;
               return -1;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message to see
       * whether this has already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly so that struct task_struct does
             not need to be pinned (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle, not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [21])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscall just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscall to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety of reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
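
Two of the answers above (A-01 and A-09) can be checked from userspace with a short sketch. The helpers probe(), probe_reaped_child() and probe_opath_fd() are hypothetical names introduced for illustration. ESRCH for an exited process is what A-01 states; EBADF for an O_PATH fd is an assumption on my part, since the commit message only says such fds cannot be used.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older headers; 424 is the number stated in this commit. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif

/* Hypothetical helper: send the null signal through fd; return 0 or errno. */
static int probe(int fd)
{
        if (syscall(__NR_pidfd_send_signal, fd, 0, NULL, 0U) < 0)
                return errno;
        return 0;
}

/* A-01: signal a child that has already exited and been reaped. */
int probe_reaped_child(void)
{
        char path[64];
        int fd, err;
        pid_t pid = fork();

        if (pid < 0)
                return errno;
        if (pid == 0)
                _exit(0);

        /* Open the handle before reaping, while the pid is still ours. */
        snprintf(path, sizeof(path), "/proc/%d", (int)pid);
        fd = open(path, O_DIRECTORY | O_CLOEXEC);
        if (fd < 0) {
                err = errno;
                waitpid(pid, NULL, 0);
                return err;
        }

        waitpid(pid, NULL, 0);  /* reap: the process is now gone */
        err = probe(fd);        /* the fd is still valid, the target is not */
        close(fd);
        return err;
}

/* A-09: try to signal ourselves through an O_PATH descriptor. */
int probe_opath_fd(void)
{
        int fd, err;

        fd = open("/proc/self", O_PATH | O_DIRECTORY | O_CLOEXEC);
        if (fd < 0)
                return errno;
        err = probe(fd);
        close(fd);
        return err;
}
```

Even if the pid of the reaped child is reused in the meantime, the fd keeps referring to the old struct pid, so the signal cannot reach an unrelated process. On a kernel without this syscall both helpers would return ENOSYS.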
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: Tycho Andersen <tycho@tycho.ws>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Aleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  24. 22 Feb, 2019 1 commit
  25. 26 Jan, 2019 1 commit
  26. 09 Jan, 2019 1 commit
    • C
      procfs: add smack subdir to attrs · 6d9c939d
      Casey Schaufler committed on
      Back in 2007 I made what turned out to be a rather serious
      mistake in the implementation of the Smack security module.
      The SELinux module used an interface in /proc to manipulate
      the security context on processes. Rather than use a similar
      interface, I used the same interface. The AppArmor team did
      likewise. Now /proc/.../attr/current will tell you the
      security "context" of the process, but it will be different
      depending on the security module you're using.
      
      This patch provides a subdirectory in /proc/.../attr for
      Smack. Smack user space can use the "current" file in
      this subdirectory and never have to worry about getting
      SELinux attributes by mistake. Programs that use the
      old interface will continue to work (or fail, as the case
      may be) as before.
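
As a rough sketch of how Smack-aware user space might use the new subdirectory: the path /proc/self/attr/smack/current is an assumption derived from the description above (an "smack" subdirectory containing a "current" file), and read_attr() and read_smack_label() are hypothetical helper names, not an existing library API.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a /proc attr file into buf as a NUL-terminated
 * string. Returns the number of bytes read, or -1 on any failure.
 */
ssize_t read_attr(const char *path, char *buf, size_t len)
{
        ssize_t n;
        int fd;

        if (len == 0)
                return -1;

        fd = open(path, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;

        n = read(fd, buf, len - 1);
        close(fd);
        if (n < 0)
                return -1;

        buf[n] = '\0';
        return n;
}

/*
 * Read the Smack label of the calling process from the Smack-only file
 * (assumed path), so that another module's context is never returned.
 */
ssize_t read_smack_label(char *buf, size_t len)
{
        return read_attr("/proc/self/attr/smack/current", buf, len);
}
```

A program that gets -1 back can conclude that Smack's attribute is not available, instead of misreading another security module's context from the shared /proc/.../attr/current file.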
      
      The proposed S.A.R.A security module is dependent on
      the mechanism to create its own attr subdirectory.
      
      The original implementation is by Kees Cook.
      Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6d9c939d
  27. 05 Jan, 2019 1 commit