1. 19 6月, 2023 1 次提交
    • S
      proc: allow pid_revalidate() during LOOKUP_RCU · 32dfe7c7
      Stephen Brennan 提交于
      mainline inclusion
      from mainline-v5.16-rc1
      commit da4d6b9c
      category: bugfix
      bugzilla: 188892, https://gitee.com/openeuler/kernel/issues/I7CWJ7
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=da4d6b9cf80ae5b0083f640133b85b68b53b6497
      
      ----------------------------------------
      
      Problem Description:
      
      When running running ~128 parallel instances of
      
        TZ=/etc/localtime ps -fe >/dev/null
      
      on a 128CPU machine, the %sys utilization reaches 97%, and perf shows
      the following code path as being responsible for heavy contention on the
      d_lockref spinlock:
      
            walk_component()
              lookup_fast()
                d_revalidate()
                  pid_revalidate() // returns -ECHILD
                unlazy_child()
                  lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
      
      The reason is that pid_revalidate() is triggering a drop from RCU to ref
      path walk mode.  All concurrent path lookups thus try to grab a
      reference to the dentry for /proc/, before re-executing pid_revalidate()
      and then stepping into the /proc/$pid directory.  Thus there is huge
      spinlock contention.
      
      This patch allows pid_revalidate() to execute in RCU mode, meaning that
      the path lookup can successfully enter the /proc/$pid directory while
      still in RCU mode.  Later on, the path lookup may still drop into ref
      mode, but the contention will be much reduced at this point.
      
      By applying this patch, %sys utilization falls to around 85% under the
      same workload, and the number of ps processes executed per unit time
      increases by 3x-4x.  Although this particular workload is a bit
      contrived, we have seen some large collections of eager monitoring
      scripts which produced similarly high %sys time due to contention in the
      /proc directory.
      
      As a result this patch, Al noted that several procfs methods which were
      only called in ref-walk mode could now be called from RCU mode.  To
      ensure that this patch is safe, I audited all the inode get_link and
      permission() implementations, as well as dentry d_revalidate()
      implementations, in fs/proc.  The purpose here is to ensure that they
      either are safe to call in RCU (i.e.  don't sleep) or correctly bail out
      of RCU mode if they don't support it.  My analysis shows that all
      at-risk procfs methods are safe to call under RCU, and thus this patch
      is safe.
      
      Procfs RCU-walk Analysis:
      
      This analysis is up-to-date with 5.15-rc3.  When called under RCU mode,
      these functions have arguments as follows:
      
      * get_link() receives a NULL dentry pointer when called in RCU mode.
      * permission() receives MAY_NOT_BLOCK in the mode parameter when called
        from RCU.
      * d_revalidate() receives LOOKUP_RCU in flags.
      
      For the following functions, either they are trivially RCU safe, or they
      explicitly bail at the beginning of the function when they run:
      
      proc_ns_get_link       (bails out)
      proc_get_link          (RCU safe)
      proc_pid_get_link      (bails out)
      map_files_d_revalidate (bails out)
      map_misc_d_revalidate  (bails out)
      proc_net_d_revalidate  (RCU safe)
      proc_sys_revalidate    (bails out, also not under /proc/$pid)
      tid_fd_revalidate      (bails out)
      proc_sys_permission    (not under /proc/$pid)
      
      The remainder of the functions require a bit more detail:
      
      * proc_fd_permission: RCU safe. All of the body of this function is
        under rcu_read_lock(), except generic_permission() which declares
        itself RCU safe in its documentation string.
      * proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
        and otherwise looks safe. The same is true of proc_thread_self_get_link.
      * proc_map_files_get_link: calls ns_capable, which calls capable(), and
        thus calls into the audit code (see note #1 below). The remainder is
        just a call to the trivially safe proc_pid_get_link().
      * proc_pid_permission: calls ptrace_may_access(), which appears RCU
        safe, although it does call into the "security_ptrace_access_check()"
        hook, which looks safe under smack and selinux. Just the audit code is
        of concern. Also uses get_task_struct() and put_task_struct(), see
        note #2 below.
      * proc_tid_comm_permission: Appears safe, though calls put_task_struct
        (see note #2 below).
      
      Note #1:
        Most of the concern of RCU safety has centered around the audit code.
        However, since b17ec22f ("selinux: slow_avc_audit has become
        non-blocking"), it's safe to call this code under RCU. So all of the
        above are safe by my estimation.
      
      Note #2: get_task_struct() and put_task_struct():
        The majority of get_task_struct() is under RCU read lock, and in any
        case it is a simple increment. But put_task_struct() is complex, given
        that it could at some point free the task struct, and this process has
        many steps which I couldn't manually verify. However, several other
        places call put_task_struct() under RCU, so it appears safe to use
        here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296)
      
      Patch description:
      
      pid_revalidate() drops from RCU into REF lookup mode.  When many threads
      are resolving paths within /proc in parallel, this can result in heavy
      spinlock contention on d_lockref as each thread tries to grab a
      reference to the /proc dentry (and drop it shortly thereafter).
      
      Investigation indicates that it is not necessary to drop RCU in
      pid_revalidate(), as no RCU data is modified and the function never
      sleeps.  So, remove the LOOKUP_RCU check.
      
      Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.comSigned-off-by: NStephen Brennan <stephen.s.brennan@oracle.com>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NLi Nan <linan122@huawei.com>
      (cherry picked from commit f2924f34)
      32dfe7c7
  2. 01 6月, 2023 1 次提交
    • H
      sched: fix performance degradation on lmbench · c6aaa310
      Hui Tang 提交于
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718
      
      --------------------------------
      
      There are worse performance with the 'Fixes'
      when running "./lat_ctx -P $SYNC_MAX -s 64 16".
      
      The 'Fixes' which allocates memory for p->prefer_cpus
      even if "prefer_cpus" not be set.
      
      Before the 'Fixes', only test "p->prefer_cpus",
      after, add test "!cpumask_empty(p->prefer_cpus)"
      which causing performance degradation.
      
      select_task_rq_fair
        ->set_task_select_cpus
          ->prefer_cpus_valid  ----  test cpumask_empty(p->prefer_cpus)
      
      Fixes: ebeb84ad ("cpuset: Introduce new interface for scheduler ...")
      Signed-off-by: NHui Tang <tanghui20@huawei.com>
      (cherry picked from commit d8f77f89)
      c6aaa310
  3. 19 5月, 2023 3 次提交
  4. 08 4月, 2023 1 次提交
  5. 25 11月, 2022 1 次提交
  6. 23 11月, 2022 1 次提交
  7. 11 11月, 2022 1 次提交
  8. 20 9月, 2022 1 次提交
  9. 07 6月, 2022 1 次提交
    • Z
      proc: Fix a dentry lock race between release_task and lookup · ad8b5995
      Zhihao Cheng 提交于
      hulk inclusion
      category: bugfix
      bugzilla: 186846, https://gitee.com/openeuler/kernel/issues/I5A80Q
      
      --------------------------------
      
      Commit 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      moved proc_flush_task() behind __exit_signal(). Then, process systemd
      can take long period high cpu usage during releasing task in following
      concurrent processes:
      
        systemd                                 ps
      kernel_waitid                 stat(/proc/tgid)
        do_wait                       filename_lookup
          wait_consider_task            lookup_fast
            release_task
              __exit_signal
                __unhash_process
                  detach_pid
                    __change_pid // remove task->pid_links
                                           d_revalidate -> pid_revalidate  // 0
                                           d_invalidate(/proc/tgid)
                                             shrink_dcache_parent(/proc/tgid)
                                               d_walk(/proc/tgid)
                                                 spin_lock_nested(/proc/tgid/fd)
                                                 // iterating opened fd
              proc_flush_pid                                    |
                 d_invalidate (/proc/tgid/fd)                   |
                    shrink_dcache_parent(/proc/tgid/fd)         |
                      shrink_dentry_list(subdirs)               ↓
                        shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock
      
      Function d_invalidate() will remove dentry from hash firstly, but why does
      proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'?
      That's because proc_pid_make_inode() adds proc inode in reverse order by
      invoking hlist_add_head_rcu(). But proc should not add any inodes under
      '/proc/tgid' except '/proc/tgid/task/pid', fix it by adding inode into
      'pid->inodes' only if the inode is /proc/tgid or /proc/tgid/task/pid.
      
      Performance regression:
      Create 200 tasks, each task open one file for 50,000 times. Kill all
      tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).
      
      Before fix:
      $ time killall -wq aa
        real    4m40.946s   # During this period, we can see 'ps' and 'systemd'
      			taking high cpu usage.
      
      After fix:
      $ time killall -wq aa
        real    1m20.732s   # During this period, we can see 'systemd' taking
      			high cpu usage.
      
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      ad8b5995
  10. 23 2月, 2022 1 次提交
  11. 30 12月, 2021 1 次提交
  12. 15 10月, 2021 1 次提交
  13. 06 7月, 2021 1 次提交
  14. 03 7月, 2021 2 次提交
  15. 15 6月, 2021 1 次提交
  16. 14 4月, 2021 2 次提交
  17. 29 1月, 2021 1 次提交
    • Y
      proc: fix ubsan warning in mem_lseek · 1bb26e86
      yangerkun 提交于
      hulk inclusion
      category: bugfix
      bugzilla: 47438
      CVE: NA
      ---------------------------
      
      UBSAN has reported a overflow with mem_lseek. And it's fine with
      mem_open set file mode with FMODE_UNSIGNED_OFFSET(memory_lseek).
      However, another file use mem_lseek do lseek can have not
      FMODE_UNSIGNED_OFFSET(proc_kpagecount_operations/proc_pagemap_operations),
      fix it by checking overflow and FMODE_UNSIGNED_OFFSET.
      Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
      
      ==================================================================
      UBSAN: Undefined behaviour in ../fs/proc/base.c:941:15
      signed integer overflow:
      4611686018427387904 + 4611686018427387904 cannot be represented in type 'long long int'
      CPU: 4 PID: 4762 Comm: syz-executor.1 Not tainted 4.4.189 #3
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      Call trace:
      [<ffffff90080a5f28>] dump_backtrace+0x0/0x590 arch/arm64/kernel/traps.c:91
      [<ffffff90080a64f0>] show_stack+0x38/0x60 arch/arm64/kernel/traps.c:234
      [<ffffff9008986a34>] __dump_stack lib/dump_stack.c:15 [inline]
      [<ffffff9008986a34>] dump_stack+0x128/0x184 lib/dump_stack.c:51
      [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166
      [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197
      [<ffffff9008a2da2c>] __ubsan_handle_add_overflow+0x4c/0x68 lib/ubsan.c:204
      [<ffffff900862b9f4>] mem_lseek+0x12c/0x130 fs/proc/base.c:941
      [<ffffff90084ef78c>] vfs_llseek fs/read_write.c:260 [inline]
      [<ffffff90084ef78c>] SYSC_lseek fs/read_write.c:285 [inline]
      [<ffffff90084ef78c>] SyS_lseek+0x164/0x1f0 fs/read_write.c:276
      [<ffffff9008093c80>] el0_svc_naked+0x30/0x34
      ==================================================================
      Signed-off-by: Nyangerkun <yangerkun@huawei.com>
      Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      (cherry picked from commit a422358aa04c53a08b215b8dcd6814d916ef5cf1)
      
      Conflicts:
      	fs/read_write.c
      Signed-off-by: NLi Ming <limingming.li@huawei.com>
      Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      1bb26e86
  18. 27 1月, 2021 1 次提交
    • E
      exec: Transform exec_update_mutex into a rw_semaphore · ab84d659
      Eric W. Biederman 提交于
      stable inclusion
      from stable-5.10.6
      commit ab7709b551de24e7bebf44946120e6740b1e28db
      bugzilla: 47418
      
      --------------------------------
      
      [ Upstream commit f7cfd871 ]
      
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occured to me that all of the
      users and possible users involved only wanted to state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.orgSigned-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      ab84d659
  19. 03 11月, 2020 1 次提交
  20. 17 10月, 2020 1 次提交
  21. 14 10月, 2020 1 次提交
    • S
      mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary · 67197a4f
      Suren Baghdasaryan 提交于
      Currently __set_oom_adj loops through all processes in the system to keep
      oom_score_adj and oom_score_adj_min in sync between processes sharing
      their mm.  This is done for any task with more that one mm_users, which
      includes processes with multiple threads (sharing mm and signals).
      However for such processes the loop is unnecessary because their signal
      structure is shared as well.
      
      Android updates oom_score_adj whenever a tasks changes its role
      (background/foreground/...) or binds to/unbinds from a service, making it
      more/less important.  Such operation can happen frequently.  We noticed
      that updates to oom_score_adj became more expensive and after further
      investigation found out that the patch mentioned in "Fixes" introduced a
      regression.  Using Pixel 4 with a typical Android workload, write time to
      oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
      linearly depends on the number of multi-threaded processes running on the
      system.
      
      Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
      (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
      MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
      update should be synchronized between multiple processes.  To prevent
      races between clone() and __set_oom_adj(), when oom_score_adj of the
      process being cloned might be modified from userspace, we use
      oom_adj_mutex.  Its scope is changed to global.
      
      The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
      the case of vfork().  To prevent performance regressions of vfork(), we
      skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
      specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
      sharing the mm exits) is left out of this patch to keep it simple and
      because it is believed that this threading model is rare.  Should there
      ever be a need for optimizing that case as well, it can be done by hooking
      into the exit path, likely following the mm_update_next_owner pattern.
      
      With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
      quite rare, the regression is gone after the change is applied.
      
      [surenb@google.com: v3]
        Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
      
      Fixes: 44a70ade ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
      Reported-by: NTim Murray <timmurray@google.com>
      Suggested-by: NMichal Hocko <mhocko@kernel.org>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: Adrian Reber <areber@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.comDebugged-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67197a4f
  22. 13 8月, 2020 1 次提交
    • Y
      mm, oom: make the calculation of oom badness more accurate · 9066e5cf
      Yafang Shao 提交于
      Recently we found an issue on our production environment that when memcg
      oom is triggered the oom killer doesn't chose the process with largest
      resident memory but chose the first scanned process.  Note that all
      processes in this memcg have the same oom_score_adj, so the oom killer
      should chose the process with largest resident memory.
      
      Bellow is part of the oom info, which is enough to analyze this issue.
      [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
      [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
      [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
      [...]
      [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
      [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
      [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
      [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
      [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
      [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
      [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
      [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
      [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
      [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
      [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
      [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
      [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
      [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
      [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
      [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
      [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
      [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
      [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
      [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
      [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
      [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
      [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
      [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      We can find that the first scanned process 5740 (pause) was killed, but
      its rss is only one page.  That is because, when we calculate the oom
      badness in oom_badness(), we always ignore the negtive point and convert
      all of these negtive points to 1.  Now as oom_score_adj of all the
      processes in this targeted memcg have the same value -998, the points of
      these processes are all negtive value.  As a result, the first scanned
      process will be killed.
      
      The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
      a Guaranteed pod, which has higher priority to prevent from being killed
      by system oom.
      
      To fix this issue, we should make the calculation of oom point more
      accurate.  We can achieve it by convert the chosen_point from 'unsigned
      long' to 'long'.
      
      [cai@lca.pw: reported a issue in the previous version]
      [mhocko@suse.com: fixed the issue reported by Cai]
      [mhocko@suse.com: add the comment in proc_oom_score()]
      [laoar.shao@gmail.com: v3]
        Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9066e5cf
  23. 20 7月, 2020 1 次提交
  24. 10 6月, 2020 3 次提交
  25. 19 5月, 2020 1 次提交
  26. 25 4月, 2020 2 次提交
  27. 22 4月, 2020 3 次提交
    • A
    • A
      proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option · 24a71ce5
      Alexey Gladkov 提交于
      If "hidepid=4" mount option is set then do not instantiate pids that
      we can not ptrace. "hidepid=4" means that procfs should only contain
      pids that the caller can ptrace.
      Signed-off-by: NDjalal Harouni <tixxdz@gmail.com>
      Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      24a71ce5
    • A
      proc: allow to mount many instances of proc in one pid namespace · fa10fed3
      Alexey Gladkov 提交于
      This patch allows to have multiple procfs instances inside the
      same pid namespace. The aim here is lightweight sandboxes, and to allow
      that we have to modernize procfs internals.
      
      1) The main aim of this work is to have on embedded systems one
      supervisor for apps. Right now we have some lightweight sandbox support,
      however if we create pid namespacess we have to manages all the
      processes inside too, where our goal is to be able to run a bunch of
      apps each one inside its own mount namespace without being able to
      notice each other. We only want to use mount namespaces, and we want
      procfs to behave more like a real mount point.
      
      2) Linux Security Modules have multiple ptrace paths inside some
      subsystems, however inside procfs, the implementation does not guarantee
      that the ptrace() check which triggers the security_ptrace_check() hook
      will always run. We have the 'hidepid' mount option that can be used to
      force the ptrace_may_access() check inside has_pid_permissions() to run.
      The problem is that 'hidepid' is per pid namespace and not attached to
      the mount point, any remount or modification of 'hidepid' will propagate
      to all other procfs mounts.
      
      This also does not allow to support Yama LSM easily in desktop and user
      sessions. Yama ptrace scope which restricts ptrace and some other
      syscalls to be allowed only on inferiors, can be updated to have a
      per-task context, where the context will be inherited during fork(),
      clone() and preserved across execve(). If we support multiple private
      procfs instances, then we may force the ptrace_may_access() on
      /proc/<pids>/ to always run inside that new procfs instances. This will
      allow to specifiy on user sessions if we should populate procfs with
      pids that the user can ptrace or not.
      
      By using Yama ptrace scope, some restricted users will only be able to see
      inferiors inside /proc, they won't even be able to see their other
      processes. Some software like Chromium, Firefox's crash handler, Wine
      and others are already using Yama to restrict which processes can be
      ptracable. With this change this will give the possibility to restrict
      /proc/<pids>/ but more importantly this will give desktop users a
      generic and usuable way to specifiy which users should see all processes
      and which users can not.
      
      Side notes:
      * This covers the lack of seccomp where it is not able to parse
      arguments, it is easy to install a seccomp filter on direct syscalls
      that operate on pids, however /proc/<pid>/ is a Linux ABI using
      filesystem syscalls. With this change LSMs should be able to analyze
      open/read/write/close...
      
      In the new patch set version I removed the 'newinstance' option
      as suggested by Eric W. Biederman.
      
      Selftest has been added to verify new behavior.
      Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      fa10fed3
  28. 16 4月, 2020 1 次提交
  29. 10 4月, 2020 1 次提交
    • E
      proc: Use a dedicated lock in struct pid · 63f818f4
      Eric W. Biederman 提交于
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that because wait_pidfd.lock is taken under the tasklist
      lock.  It must always be taken with irqs disabled as tasklist_lock can be
      taken from interrupt context and if wait_pidfd.lock was already taken this
      would create a lock order inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe and I think the code will eventually
      do that.  It was rightly pointed out by Christian that sharing the
      wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock of
      it's own for struct pid things.  I have verified that lockdep sees all 3
      paths where we take the new pid->lock and lockdep does not complain.
      
      It is my current day dream that one day pid->lock can be used to guard the
      task lists as well and then the tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs disabled.
      Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
      Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
      Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
      Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      63f818f4
  30. 25 3月, 2020 2 次提交