1. 24 Sep, 2022 2 commits
  2. 22 Aug, 2022 1 commit
  3. 07 Jun, 2022 1 commit
    • Z
      proc: Fix a dentry lock race between release_task and lookup · ad8b5995
      Zhihao Cheng committed
      hulk inclusion
      category: bugfix
      bugzilla: 186846, https://gitee.com/openeuler/kernel/issues/I5A80Q
      
      --------------------------------
      
      Commit 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      moved proc_flush_task() behind __exit_signal(). As a result, systemd
      can show high CPU usage for a long period while releasing a task when
      the following two processes run concurrently:
      
        systemd                                 ps
      kernel_waitid                 stat(/proc/tgid)
        do_wait                       filename_lookup
          wait_consider_task            lookup_fast
            release_task
              __exit_signal
                __unhash_process
                  detach_pid
                    __change_pid // remove task->pid_links
                                           d_revalidate -> pid_revalidate  // 0
                                           d_invalidate(/proc/tgid)
                                             shrink_dcache_parent(/proc/tgid)
                                               d_walk(/proc/tgid)
                                                 spin_lock_nested(/proc/tgid/fd)
                                                 // iterating opened fd
              proc_flush_pid                                    |
                 d_invalidate (/proc/tgid/fd)                   |
                    shrink_dcache_parent(/proc/tgid/fd)         |
                      shrink_dentry_list(subdirs)               ↓
                        shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock
      
      d_invalidate() removes the dentry from the hash first, so why does
      proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'?
      Because proc_pid_make_inode() adds proc inodes with hlist_add_head_rcu(),
      so the list is walked in reverse order of insertion. But proc should not
      add any inodes under '/proc/tgid' except '/proc/tgid/task/pid'. Fix it by
      adding an inode to 'pid->inodes' only if the inode is /proc/tgid or
      /proc/tgid/task/pid.
      
      Performance regression:
      Create 200 tasks, each of which opens one file 50,000 times. Kill all
      tasks when the number of opened files exceeds 10,000,000
      (cat /proc/sys/fs/file-nr).
      
      Before fix:
      $ time killall -wq aa
        real    4m40.946s   # During this period, we can see 'ps' and 'systemd'
      			taking high cpu usage.
      
      After fix:
      $ time killall -wq aa
        real    1m20.732s   # During this period, we can see 'systemd' taking
      			high cpu usage.
      
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      ad8b5995
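The ordering problem in this commit can be illustrated with a small userspace sketch. It is not kernel code: `struct node`, `push_head` and `should_track` are hypothetical names, and a plain singly linked list stands in for the hlist that hlist_add_head_rcu() maintains. Under those assumptions, it shows that head insertion makes a later-added entry ('/proc/tgid/fd') come before an earlier one ('/proc/tgid') during traversal, and how a filter in the spirit of the fix would keep only '/proc/tgid' and '/proc/tgid/task/pid' style entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an entry on pid->inodes. */
struct node {
    const char *path;
    struct node *next;
};

/* Insert at the head, which is what makes traversal order the reverse of
 * insertion order. */
static void push_head(struct node **head, struct node *n)
{
    n->next = *head;
    *head = n;
}

/*
 * Assumed filter in the spirit of the fix: track only "/proc/<tgid>" and
 * "/proc/<tgid>/task/<pid>". Paths are literal strings purely for
 * illustration; the kernel works on inodes, not path strings.
 */
static bool should_track(const char *path)
{
    const char *rest;

    if (strncmp(path, "/proc/", 6) != 0 || path[6] == '\0')
        return false;
    rest = strchr(path + 6, '/');
    if (rest == NULL)
        return true;                 /* "/proc/<tgid>" */
    if (strncmp(rest, "/task/", 6) != 0)
        return false;                /* e.g. "/proc/<tgid>/fd" */
    /* "/proc/<tgid>/task/<pid>" with nothing below it */
    return rest[6] != '\0' && strchr(rest + 6, '/') == NULL;
}
```

Pushing "/proc/100" and then "/proc/100/fd" leaves "/proc/100/fd" at the head of the list, which mirrors why the flush reached the child dentry before its parent.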
  4. 23 Feb, 2022 1 commit
  5. 30 Dec, 2021 1 commit
  6. 15 Oct, 2021 1 commit
  7. 06 Jul, 2021 1 commit
  8. 03 Jul, 2021 2 commits
  9. 15 Jun, 2021 1 commit
  10. 14 Apr, 2021 2 commits
  11. 29 Jan, 2021 1 commit
    • Y
      proc: fix ubsan warning in mem_lseek · 1bb26e86
      yangerkun committed
      hulk inclusion
      category: bugfix
      bugzilla: 47438
      CVE: NA
      ---------------------------
      
      UBSAN has reported an overflow in mem_lseek. This is fine when
      mem_open sets the file mode with FMODE_UNSIGNED_OFFSET (memory_lseek).
      However, other files that use mem_lseek for lseek may not have
      FMODE_UNSIGNED_OFFSET set (proc_kpagecount_operations/proc_pagemap_operations).
      Fix it by checking for overflow and FMODE_UNSIGNED_OFFSET.
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      
      ==================================================================
      UBSAN: Undefined behaviour in ../fs/proc/base.c:941:15
      signed integer overflow:
      4611686018427387904 + 4611686018427387904 cannot be represented in type 'long long int'
      CPU: 4 PID: 4762 Comm: syz-executor.1 Not tainted 4.4.189 #3
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      Call trace:
      [<ffffff90080a5f28>] dump_backtrace+0x0/0x590 arch/arm64/kernel/traps.c:91
      [<ffffff90080a64f0>] show_stack+0x38/0x60 arch/arm64/kernel/traps.c:234
      [<ffffff9008986a34>] __dump_stack lib/dump_stack.c:15 [inline]
      [<ffffff9008986a34>] dump_stack+0x128/0x184 lib/dump_stack.c:51
      [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166
      [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197
      [<ffffff9008a2da2c>] __ubsan_handle_add_overflow+0x4c/0x68 lib/ubsan.c:204
      [<ffffff900862b9f4>] mem_lseek+0x12c/0x130 fs/proc/base.c:941
      [<ffffff90084ef78c>] vfs_llseek fs/read_write.c:260 [inline]
      [<ffffff90084ef78c>] SYSC_lseek fs/read_write.c:285 [inline]
      [<ffffff90084ef78c>] SyS_lseek+0x164/0x1f0 fs/read_write.c:276
      [<ffffff9008093c80>] el0_svc_naked+0x30/0x34
      ==================================================================
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      (cherry picked from commit a422358aa04c53a08b215b8dcd6814d916ef5cf1)
      
      Conflicts:
      	fs/read_write.c
      Signed-off-by: Li Ming <limingming.li@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      1bb26e86
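The kind of check this commit describes can be sketched in userspace C. This is not the kernel's mem_lseek(): `seek_add_checked` and its `unsigned_offset` parameter (standing in for FMODE_UNSIGNED_OFFSET) are hypothetical, and the choice of error codes is an assumption. `__builtin_add_overflow` is a GCC/Clang builtin.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: compute pos + off for a SEEK_CUR-style seek without
 * triggering signed integer overflow.
 * Returns 0 and stores the new position in *out, or a negative errno.
 */
static int seek_add_checked(long long pos, long long off,
                            bool unsigned_offset, long long *out)
{
    long long sum;

    /* True when the signed addition does not fit in long long. */
    if (__builtin_add_overflow(pos, off, &sum)) {
        if (!unsigned_offset)
            return -EOVERFLOW;
        /* Offsets are treated as unsigned: wrap in unsigned arithmetic,
         * which is well defined, then reinterpret the bits. */
        sum = (long long)((unsigned long long)pos + (unsigned long long)off);
    } else if (sum < 0 && !unsigned_offset) {
        /* A negative position is only meaningful for unsigned offsets. */
        return -EINVAL;
    }

    *out = sum;
    return 0;
}
```

With the two operands from the UBSAN report (4611686018427387904 each), the sketch returns -EOVERFLOW for a file without the unsigned-offset mode and wraps for one with it.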
  12. 27 Jan, 2021 1 commit
    • E
      exec: Transform exec_update_mutex into a rw_semaphore · ab84d659
      Eric W. Biederman committed
      stable inclusion
      from stable-5.10.6
      commit ab7709b551de24e7bebf44946120e6740b1e28db
      bugzilla: 47418
      
      --------------------------------
      
      [ Upstream commit f7cfd871 ]
      
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      ab84d659
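The reader/writer idea behind this commit can be modelled in a few lines of C. This is a single-threaded admission model only, not a usable lock and not the kernel's rw_semaphore: `struct rw_model` and its helpers are hypothetical, and the real primitive adds sleeping, queuing and fairness that are left out here. It shows only the rule the commit relies on: readers never exclude each other, and the single writer (exec) excludes everyone.

```c
#include <stdbool.h>

/* Single-threaded model of reader/writer admission; not a real lock. */
struct rw_model {
    int readers;   /* number of current shared holders */
    bool writer;   /* true while the exclusive holder is inside */
};

/* A reader is admitted unless the writer holds the lock. */
static bool try_read(struct rw_model *l)
{
    if (l->writer)
        return false;
    l->readers++;
    return true;
}

static void end_read(struct rw_model *l)
{
    l->readers--;
}

/* The writer is admitted only when nobody else is inside. */
static bool try_write(struct rw_model *l)
{
    if (l->writer || l->readers > 0)
        return false;
    l->writer = true;
    return true;
}

static void end_write(struct rw_model *l)
{
    l->writer = false;
}
```

In this model, two callers that only need the process state to stay stable can both be admitted at once, whereas with a mutex the second would have had to wait for the first.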
  13. 03 Nov, 2020 1 commit
  14. 17 Oct, 2020 1 commit
  15. 14 Oct, 2020 1 commit
    • S
      mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary · 67197a4f
      Suren Baghdasaryan committed
      Currently __set_oom_adj loops through all processes in the system to keep
      oom_score_adj and oom_score_adj_min in sync between processes sharing
      their mm.  This is done for any task with more than one mm_users, which
      includes processes with multiple threads (sharing mm and signals).
      However for such processes the loop is unnecessary because their signal
      structure is shared as well.
      
      Android updates oom_score_adj whenever a task changes its role
      (background/foreground/...) or binds to/unbinds from a service, making it
      more/less important.  Such operations can happen frequently.  We noticed
      that updates to oom_score_adj became more expensive and after further
      investigation found out that the patch mentioned in "Fixes" introduced a
      regression.  Using Pixel 4 with a typical Android workload, write time to
      oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
      linearly depends on the number of multi-threaded processes running on the
      system.
      
      Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
      (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
      MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
      update should be synchronized between multiple processes.  To prevent
      races between clone() and __set_oom_adj(), when oom_score_adj of the
      process being cloned might be modified from userspace, we use
      oom_adj_mutex.  Its scope is changed to global.
      
      The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
      the case of vfork().  To prevent performance regressions of vfork(), we
      skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
      specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
      sharing the mm exits) is left out of this patch to keep it simple and
      because it is believed that this threading model is rare.  Should there
      ever be a need for optimizing that case as well, it can be done by hooking
      into the exit path, likely following the mm_update_next_owner pattern.
      
      With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
      quite rare, the regression is gone after the change is applied.
      
      [surenb@google.com: v3]
        Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
      
      Fixes: 44a70ade ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
      Reported-by: Tim Murray <timmurray@google.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: Adrian Reber <areber@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
      Debugged-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67197a4f
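The flag condition from this commit, (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK), can be written as a single mask test. The sketch below is a userspace illustration: `marks_multiprocess` and the `SK_` constants are hypothetical names, with the constants being local copies of the Linux clone(2) flag values so that the block needs no kernel headers.

```c
#include <stdbool.h>

/* Local copies of the Linux clone(2) flag values (assumed, for illustration). */
#define SK_CLONE_VM     0x00000100UL
#define SK_CLONE_VFORK  0x00004000UL
#define SK_CLONE_THREAD 0x00010000UL

/*
 * True when a clone with these flags shares the mm with a separate process:
 * CLONE_VM is set while CLONE_THREAD and CLONE_VFORK are both clear.
 */
static bool marks_multiprocess(unsigned long clone_flags)
{
    const unsigned long mask = SK_CLONE_VM | SK_CLONE_THREAD | SK_CLONE_VFORK;

    return (clone_flags & mask) == SK_CLONE_VM;
}
```

An ordinary thread (CLONE_VM | CLONE_THREAD) and a vfork() child (CLONE_VM | CLONE_VFORK) both evaluate to false, which is why the common cases no longer pay for the loop over all tasks.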
  16. 13 Aug, 2020 1 commit
    • Y
      mm, oom: make the calculation of oom badness more accurate · 9066e5cf
      Yafang Shao committed
      Recently we found an issue in our production environment: when memcg
      oom is triggered, the oom killer doesn't choose the process with the
      largest resident memory but chooses the first scanned process.  Note that
      all processes in this memcg have the same oom_score_adj, so the oom killer
      should choose the process with the largest resident memory.
      
      Below is part of the oom info, which is enough to analyze this issue.
      [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
      [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
      [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
      [...]
      [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
      [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
      [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
      [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
      [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
      [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
      [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
      [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
      [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
      [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
      [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
      [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
      [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
      [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
      [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
      [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
      [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
      [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
      [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
      [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
      [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
      [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
      [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
      [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      We can see that the first scanned process 5740 (pause) was killed, but
      its rss is only one page.  That is because, when we calculate the oom
      badness in oom_badness(), we always ignore negative points and convert
      all of these negative points to 1.  Since all the processes in the
      targeted memcg have the same oom_score_adj of -998, the points of
      these processes are all negative.  As a result, the first scanned
      process is killed.
      
      The oom_score_adj (-998) in this memcg is set by kubelet, because it is
      a Guaranteed pod, which has higher priority to avoid being killed
      by system oom.
      
      To fix this issue, we should make the calculation of oom points more
      accurate.  We can achieve this by converting chosen_point from 'unsigned
      long' to 'long'.
      
      [cai@lca.pw: reported an issue in the previous version]
      [mhocko@suse.com: fixed the issue reported by Cai]
      [mhocko@suse.com: add the comment in proc_oom_score()]
      [laoar.shao@gmail.com: v3]
        Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9066e5cf
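The effect described in this commit can be reproduced with a small userspace sketch. It approximates the badness formula from the description (resident, swap and page-table pages plus oom_score_adj scaled by total pages / 1000) and is not the kernel's oom_badness(); `struct task_sample`, `badness_signed`, `badness_clamped` and `pick_victim` are hypothetical names, and a 4 KiB page size is assumed when converting the log values to pages.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-task sample, all sizes in pages. */
struct task_sample {
    long rss_pages;
    long swap_pages;
    long pgtable_pages;
    int oom_score_adj;
};

/* Signed score: may legitimately be negative for a large negative adj. */
static long badness_signed(const struct task_sample *t, long totalpages)
{
    long points = t->rss_pages + t->swap_pages + t->pgtable_pages;

    points += (long)t->oom_score_adj * (totalpages / 1000);
    return points;
}

/* Old behaviour as described: every non-positive score collapses to 1. */
static long badness_clamped(const struct task_sample *t, long totalpages)
{
    long points = badness_signed(t, totalpages);

    return points > 0 ? points : 1;
}

/* Return the index of the highest-scoring task; ties keep the first one. */
static int pick_victim(const struct task_sample *tasks, size_t n,
                       long totalpages, int clamp)
{
    int chosen = -1;
    long chosen_points = LONG_MIN;

    for (size_t i = 0; i < n; i++) {
        long p = clamp ? badness_clamped(&tasks[i], totalpages)
                       : badness_signed(&tasks[i], totalpages);
        if (p > chosen_points) {
            chosen = (int)i;
            chosen_points = p;
        }
    }
    return chosen;
}
```

Fed with the `pause` and `data_sim` rows from the log and a 16777216 kB limit (4194304 pages), the clamped variant picks `pause` because both scores collapse to 1, while the signed variant picks `data_sim`.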
  17. 20 Jul, 2020 1 commit
  18. 10 Jun, 2020 3 commits
  19. 19 May, 2020 1 commit
  20. 25 Apr, 2020 2 commits
  21. 22 Apr, 2020 3 commits
    • A
    • A
      proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option · 24a71ce5
      Alexey Gladkov committed
      If "hidepid=4" mount option is set then do not instantiate pids that
      we can not ptrace. "hidepid=4" means that procfs should only contain
      pids that the caller can ptrace.
      Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      24a71ce5
    • A
      proc: allow to mount many instances of proc in one pid namespace · fa10fed3
      Alexey Gladkov committed
      This patch allows multiple procfs instances inside the
      same pid namespace. The aim here is lightweight sandboxes, and to allow
      that we have to modernize procfs internals.
      
      1) The main aim of this work is to have one supervisor for apps on
      embedded systems. Right now we have some lightweight sandbox support,
      however if we create pid namespaces we have to manage all the
      processes inside too, whereas our goal is to be able to run a bunch of
      apps, each one inside its own mount namespace, without being able to
      notice each other. We only want to use mount namespaces, and we want
      procfs to behave more like a real mount point.
      
      2) Linux Security Modules have multiple ptrace paths inside some
      subsystems, however inside procfs, the implementation does not guarantee
      that the ptrace() check which triggers the security_ptrace_check() hook
      will always run. We have the 'hidepid' mount option that can be used to
      force the ptrace_may_access() check inside has_pid_permissions() to run.
      The problem is that 'hidepid' is per pid namespace and not attached to
      the mount point, any remount or modification of 'hidepid' will propagate
      to all other procfs mounts.
      
      This also makes it hard to support Yama LSM in desktop and user
      sessions. The Yama ptrace scope, which restricts ptrace and some other
      syscalls to be allowed only on inferiors, can be updated to have a
      per-task context, where the context is inherited during fork() and
      clone() and preserved across execve(). If we support multiple private
      procfs instances, then we may force the ptrace_may_access() check on
      /proc/<pids>/ to always run inside those new procfs instances. This will
      make it possible to specify for user sessions whether procfs should be
      populated with pids that the user can ptrace or not.
      
      By using Yama ptrace scope, some restricted users will only be able to see
      inferiors inside /proc; they won't even be able to see their other
      processes. Some software like Chromium, Firefox's crash handler, Wine
      and others already use Yama to restrict which processes can be
      ptraced. This change makes it possible to restrict
      /proc/<pids>/, but more importantly it gives desktop users a
      generic and usable way to specify which users should see all processes
      and which users cannot.
      
      Side notes:
      * This covers the lack of seccomp where it is not able to parse
      arguments, it is easy to install a seccomp filter on direct syscalls
      that operate on pids, however /proc/<pid>/ is a Linux ABI using
      filesystem syscalls. With this change LSMs should be able to analyze
      open/read/write/close...
      
      In the new patch set version I removed the 'newinstance' option
      as suggested by Eric W. Biederman.
      
      Selftest has been added to verify new behavior.
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      fa10fed3
  22. 16 Apr, 2020 1 commit
  23. 10 Apr, 2020 1 commit
    • E
      proc: Use a dedicated lock in struct pid · 63f818f4
      Eric W. Biederman committed
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that because wait_pidfd.lock is taken under the
      tasklist_lock, it must always be taken with irqs disabled, as
      tasklist_lock can be taken from interrupt context and if wait_pidfd.lock
      was already taken this would create a lock order inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe and I think the code will eventually
      do that.  It was rightly pointed out by Christian that sharing the
      wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock of
      its own for struct pid things.  I have verified that lockdep sees all 3
      paths where we take the new pid->lock and lockdep does not complain.
      
      It is my current day dream that one day pid->lock can be used to guard the
      task lists as well and then the tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs disabled.
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
      Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
      Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
      Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      63f818f4
  24. 25 Mar, 2020 2 commits
  25. 25 Feb, 2020 1 commit
    • E
      proc: Use a list of inodes to flush from proc · 7bc3e6e5
      Eric W. Biederman committed
      Rework the flushing of proc to use a list of directory inodes that
      need to be flushed.
      
      The list is kept on struct pid not on struct task_struct, as there is
      a fixed connection between proc inodes and pids but at least for the
      case of de_thread the pid of a task_struct changes.
      
      This removes the dependency on proc_mnt, which allows different
      mounts of proc to have different mount options even in the same pid
      namespace, and this allows for the removal of proc_mnt, which will
      trivially allow the first mount of proc to honor its mount options.
      
      This flushing remains an optimization.  The functions
      pid_delete_dentry and pid_revalidate ensure that ordinary dcache
      management will not attempt to use dentries past the point their
      respective task has died.  When unused the shrinker will
      eventually be able to remove these dentries.
      
      There is a case in de_thread where proc_flush_pid can be
      called early for a given pid, which winds up being
      safe (if suboptimal) as this is just an optimization.
      
      Only pid directories are put on the list as the other
      per pid files are children of those directories and
      d_invalidate on the directory will get them as well.
      
      So that the pid can be used during flushing, its reference count is
      taken in release_task and dropped in proc_flush_pid.  Further, the call
      of proc_flush_pid is moved after the tasklist_lock is released in
      release_task so that it is certain that the pid has already been
      unhashed when flushing is taking place.  This removes a small race
      where a dentry could be recreated.
      
      As struct pid is supposed to be small and I need a per pid lock,
      I reuse the only lock that currently exists in struct pid, the
      wait_pidfd.lock.
      
      The net result is that this adds all of this functionality
      with just a little extra list management overhead and
      a single extra pointer in struct pid.
      
      v2: Initialize pid->inodes.  I somehow failed to get that
          initialization into the initial version of the patch.  A boot
          failure was reported by "kernel test robot <lkp@intel.com>", and
          failure to initialize that pid->inodes matches all of the reported
          symptoms.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      7bc3e6e5
  26. 20 Jan, 2020 1 commit
  27. 19 Jan, 2020 1 commit
  28. 14 Jan, 2020 1 commit
  29. 09 Dec, 2019 1 commit
  30. 17 Jul, 2019 2 commits
    • L
      /proc/<pid>/cmdline: add back the setproctitle() special case · d26d0cd9
      Linus Torvalds committed
      This makes the setproctitle() special case very explicit indeed, and
      handles it with a separate helper function entirely.  In the process, it
      re-instates the original semantics of simply stopping at the first NUL
      character when the original last NUL character is no longer there.
      
      [ The original semantics can still be seen in mm/util.c: get_cmdline()
        that is limited to a fixed-size buffer ]
      
      This makes the logic about when we use the string lengths etc much more
      obvious, and makes it easier to see what we do and what the two very
      different cases are.
      
      Note that even when we allow walking past the end of the argument array
      (because the setproctitle() might have overwritten and overflowed the
      original argv[] strings), we only allow it when it overflows into the
      environment region if it is immediately adjacent.
      
      [ Fixed for missing 'count' checks noted by Alexey Izbyshev ]
      
      Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
      Fixes: 5ab82718 ("fs/proc: simplify and clarify get_mm_cmdline() function")
      Cc: Jakub Jankowski <shasta@toxcorp.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d26d0cd9
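The two cases this commit distinguishes can be sketched over a flat buffer. This is a userspace model, not the kernel's get_mm_cmdline(): `cmdline_visible_len` is a hypothetical helper, offsets index into an ordinary array instead of another process's memory, and returning the length without the terminating NUL in the setproctitle() case is an assumption made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: return how many bytes starting at arg_start a reader
 * of /proc/<pid>/cmdline would see.
 */
static size_t cmdline_visible_len(const char *mem,
                                  size_t arg_start, size_t arg_end,
                                  size_t env_start, size_t env_end)
{
    const char *start = mem + arg_start;
    const char *nul;
    size_t limit;

    if (arg_start >= arg_end)
        return 0;

    /* Normal case: the final NUL of the argument area is intact, so the
     * whole [arg_start, arg_end) range is the command line. */
    if (mem[arg_end - 1] == '\0')
        return arg_end - arg_start;

    /* setproctitle() case: the final NUL was overwritten. Walking past
     * arg_end is allowed only into an immediately adjacent environment. */
    limit = (env_start == arg_end && env_end >= env_start) ? env_end : arg_end;

    /* Stop at the first NUL, or at the limit if there is none. */
    nul = memchr(start, '\0', limit - arg_start);
    return nul ? (size_t)(nul - start) : limit - arg_start;
}
```

For a buffer holding "longtitle" followed by a NUL and "X=1", with an original argument area of 4 bytes, the sketch reports 9 bytes when the environment starts right at arg_end and only 4 bytes when it does not.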
    • L
      /proc/<pid>/cmdline: remove all the special cases · 3d712546
      Linus Torvalds committed
      Start off with a clean slate that only reads exactly from arg_start to
      arg_end, without any oddities.  This simplifies the code and in the
      process removes the case that caused us to potentially leak an
      uninitialized byte from the temporary kernel buffer.
      
      Note that in order to start from scratch with an understandable base,
      this simplifies things _too_ much, and removes all the legacy logic to
      handle setproctitle() having changed the argument strings.
      
      We'll add back those special cases very differently in the next commit.
      
      Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
      Fixes: f5b65348 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d712546