1. 26 Aug, 2017 1 commit
    • E
      fork: fix incorrect fput of ->exe_file causing use-after-free · 2b7e8665
      Committed by Eric Biggers
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced a new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      taken.
      
      This caused the struct file to later be freed prematurely.
      
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
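
      The reference imbalance described above can be modeled in userspace.
      This is only a sketch under stated assumptions: file_model, mm_model,
      mmput_model() and failed_fork_refcount() are hypothetical stand-ins,
      not kernel code, and the refcount is a plain integer.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical userspace stand-ins for struct file and mm_struct. */
      struct file_model { int refcount; };
      struct mm_model   { struct file_model *exe_file; };

      /* Models mmput(): drops the reference the mm is assumed to hold. */
      static void mmput_model(struct mm_model *mm)
      {
          if (mm->exe_file)
              mm->exe_file->refcount--;
          mm->exe_file = NULL;
      }

      /*
       * Models a fork that fails before a reference on exe_file is taken.
       * clear_in_init == 0: the copied pointer reaches the error path (bug).
       * clear_in_init == 1: the pointer is set to NULL first (the fix).
       * Returns the refcount of the parent's file after the failed fork.
       */
      static int failed_fork_refcount(int clear_in_init)
      {
          struct file_model exe = { .refcount = 1 };  /* held by the old mm */
          struct mm_model oldmm = { .exe_file = &exe };
          struct mm_model newmm;

          memcpy(&newmm, &oldmm, sizeof(newmm));      /* pointer copied, no get */
          if (clear_in_init)
              newmm.exe_file = NULL;                  /* assumed mm_init() step */

          /* The fork is killed here, before any reference is taken. */
          mmput_model(&newmm);                        /* error path */
          return exe.refcount;
      }
      ```

      In this model, failed_fork_refcount(0) leaves the count at 0 although
      the old mm still points at the file, while failed_fork_refcount(1)
      leaves it at 1.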
      
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *mmap_thread(void *_arg)
          {
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
              }
          }
      
          static void *fork_thread(void *_arg)
          {
              usleep(rand() % 10000);
              fork();
          }
      
          int main(void)
          {
              fork();
              fork();
              fork();
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      
      Google Bug Id: 64772007
      
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[v4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 11 Aug, 2017 1 commit
    • N
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Committed by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races caused by batching during reclamation were recently
      handled by Mel, and this patch set deals with the others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is not
      set by two threads, adding artificial latency in
      change_protection_range() and using sysctl to reduce
      kernel.numa_balancing_scan_delay_ms.
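
      The interleaving described above can be replayed deterministically in a
      small model. The message does not say how the patch closes the race, so
      the counter variant below is an assumption shown only for contrast;
      mm_flag, mm_count and both functions are hypothetical names.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical model: updaters A and B both batch TLB flushes. */
      struct mm_flag  { bool pending; };  /* boolean flag, as described above */
      struct mm_count { int pending; };   /* assumed alternative: a counter */

      /* What an observer sees after A has finished while B is still batching. */
      static bool flag_pending_while_b_runs(void)
      {
          struct mm_flag mm = { false };

          mm.pending = true;   /* A: set before changing PTEs */
          mm.pending = true;   /* B: set before changing PTEs */
          mm.pending = false;  /* A: flushed, clears the flag */
          return mm.pending;   /* B has not flushed yet */
      }

      /* Same interleaving, but each updater increments and decrements. */
      static bool count_pending_while_b_runs(void)
      {
          struct mm_count mm = { 0 };

          mm.pending++;        /* A starts */
          mm.pending++;        /* B starts */
          mm.pending--;        /* A done */
          return mm.pending > 0;
      }
      ```

      With the boolean, the observer is told that nothing is pending even
      though B still has unflushed changes; with the counter it is not.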
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 Jul, 2017 3 commits
  4. 07 Jul, 2017 1 commit
  5. 05 Jul, 2017 2 commits
  6. 23 May, 2017 1 commit
    • V
      kthread: Fix use-after-free if kthread fork fails · 4d6501dc
      Committed by Vegard Nossum
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
      
          kthread()
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.

      Debugged-by: Jamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. 14 May, 2017 1 commit
    • K
      pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes() · 3fd37226
      Committed by Kirill Tkhai
      Imagine we have a pid namespace and a task from its parent's pid_ns
      that has called setns() on the pid namespace. The task is doing fork()
      while the pid namespace's child reaper is dying. The two race as
      follows:
      
      Task from parent pid_ns             Child reaper
      copy_process()                      ..
        alloc_pid()                       ..
        ..                                zap_pid_ns_processes()
        ..                                  disable_pid_allocation()
        ..                                  read_lock(&tasklist_lock)
        ..                                  iterate over pids in pid_ns
        ..                                    kill tasks linked to pids
        ..                                  read_unlock(&tasklist_lock)
        write_lock_irq(&tasklist_lock);   ..
        attach_pid(p, PIDTYPE_PID);       ..
        ..                                ..
      
      So the newly created task p won't receive the SIGKILL signal,
      and the pid namespace will be left in a contradictory state.
      Only a manual kill will help there, but does userspace
      care about this? I suppose most users just inject
      a task into a pid namespace and wait for a SIGCHLD from it.
      
      The patch fixes the problem. It simply checks for
      (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
      We do it under the tasklist_lock, and can't skip
      PIDNS_HASH_ADDING as noted by Oleg:
      
      "zap_pid_ns_processes() does disable_pid_allocation()
      and then takes tasklist_lock to kill the whole namespace.
      Given that copy_process() checks PIDNS_HASH_ADDING
      under write_lock(tasklist) they can't race;
      if copy_process() takes this lock first, the new child will
      be killed, otherwise copy_process() can't miss
      the change in ->nr_hashed."
      
      If allocation is disabled, we just return -ENOMEM,
      as alloc_pid() does in such cases.
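
      The ordering argument quoted above can be sketched with a userspace
      model. Everything here is an illustrative assumption: pid_ns_model,
      HASH_ADDING_MODEL, attach_pid_model() and zap_model() are hypothetical
      names, and a pthread mutex stands in for tasklist_lock.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <pthread.h>
      #include <stdatomic.h>

      /* Hypothetical stand-in for the PIDNS_HASH_ADDING bit. */
      #define HASH_ADDING_MODEL (1U << 31)

      struct pid_ns_model {
          pthread_mutex_t lock;    /* stands in for tasklist_lock */
          atomic_uint nr_hashed;   /* attached count plus the "adding" bit */
          unsigned int nr_killed;  /* tasks seen by the zap pass */
      };

      static void ns_init_model(struct pid_ns_model *ns)
      {
          pthread_mutex_init(&ns->lock, NULL);
          atomic_init(&ns->nr_hashed, HASH_ADDING_MODEL);
          ns->nr_killed = 0;
      }

      /* Fork side: check the bit under the lock, then attach. */
      static int attach_pid_model(struct pid_ns_model *ns)
      {
          int ret = 0;

          pthread_mutex_lock(&ns->lock);
          if (!(atomic_load(&ns->nr_hashed) & HASH_ADDING_MODEL))
              ret = -ENOMEM;                  /* allocation was disabled */
          else
              atomic_fetch_add(&ns->nr_hashed, 1);
          pthread_mutex_unlock(&ns->lock);
          return ret;
      }

      /* Reaper side: disable allocation first, then take the lock and kill. */
      static void zap_model(struct pid_ns_model *ns)
      {
          atomic_fetch_and(&ns->nr_hashed, ~HASH_ADDING_MODEL);
          pthread_mutex_lock(&ns->lock);
          ns->nr_killed = atomic_load(&ns->nr_hashed);  /* all attached so far */
          pthread_mutex_unlock(&ns->lock);
      }
      ```

      A task attached before zap_model() takes the lock is counted in
      nr_killed; an attach attempted afterwards sees the cleared bit and
      returns -ENOMEM, so no task slips through unnoticed.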
      
      v2: Do not move disable_pid_allocation(), do not
      introduce a new variable in copy_process() and simplify
      the patch as suggested by Oleg Nesterov.
      Account for the problem with double irq enabling
      found by Eric W. Biederman.
      
      Fixes: c876ad76 ("pidns: Stop pid allocation when init dies")
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Ingo Molnar <mingo@kernel.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
      CC: Michal Hocko <mhocko@suse.com>
      CC: Andy Lutomirski <luto@kernel.org>
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      CC: Andrei Vagin <avagin@openvz.org>
      CC: Cyrill Gorcunov <gorcunov@openvz.org>
      CC: Serge Hallyn <serge@hallyn.com>
      Cc: stable@vger.kernel.org
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 09 May, 2017 2 commits
  9. 05 May, 2017 1 commit
  10. 19 Apr, 2017 1 commit
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Committed by Paul E. McKenney
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
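
      The difference between type safety and an existence guarantee can be
      illustrated with a tiny single-threaded pool. This is a sketch, not
      the slab allocator: conn, pool, conn_alloc(), conn_free() and
      conn_still_is() are hypothetical, and no RCU is involved.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical fixed pool: freed slots are reused for the same type. */
      struct conn { int key; bool in_use; };

      #define POOL_SIZE 2
      static struct conn pool[POOL_SIZE];

      static struct conn *conn_alloc(int key)
      {
          for (size_t i = 0; i < POOL_SIZE; i++) {
              if (!pool[i].in_use) {
                  pool[i].in_use = true;
                  pool[i].key = key;
                  return &pool[i];
              }
          }
          return NULL;
      }

      static void conn_free(struct conn *c)
      {
          c->in_use = false;
      }

      /*
       * A reader holding a possibly stale pointer may still dereference it,
       * because the memory is always a struct conn, but it must re-check
       * that the object is the one it originally looked up.
       */
      static bool conn_still_is(const struct conn *c, int expected_key)
      {
          return c->in_use && c->key == expected_key;
      }
      ```

      After conn_free() and a new conn_alloc(), the stale pointer refers to
      a valid struct conn with a different key, which is exactly the case a
      reader has to detect.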

      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: David Rientjes <rientjes@google.com>
  11. 04 Apr, 2017 1 commit
    • X
      sched/rtmutex/deadline: Fix a PI crash for deadline tasks · e96a7705
      Committed by Xunlei Pang
      A crash happened while I was playing with deadline PI rtmutex.
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
          PGD 232a75067 PUD 230947067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 1 PID: 10994 Comm: a.out Not tainted
      
          Call Trace:
          [<ffffffff810b658c>] enqueue_task+0x2c/0x80
          [<ffffffff810ba763>] activate_task+0x23/0x30
          [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
          [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
          [<ffffffff8164e783>] __schedule+0xd3/0x900
          [<ffffffff8164efd9>] schedule+0x29/0x70
          [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
          [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
          [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
          [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
          [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
          [<ffffffff810edd50>] SyS_futex+0x80/0x180
      
      This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
      are only protected by pi_lock when operating on pi waiters, while
      rt_mutex_get_top_task() accesses them with the rq lock held but
      without holding pi_lock.

      To tackle this, we introduce a new "pi_top_task" pointer
      cached in task_struct, and add a new rt_mutex_update_top_task()
      to update its value. It can be called by rt_mutex_setprio(),
      which holds both the owner's pi_lock and the rq lock. Thus "pi_top_task"
      can be safely accessed by enqueue_task_dl() under the rq lock.
      
      Originally-From: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 28 Mar, 2017 1 commit
    • T
      LSM: Revive security_task_alloc() hook and per "struct task_struct" security blob. · e4e55b47
      Committed by Tetsuo Handa
      We switched from "struct task_struct"->security to "struct cred"->security
      in Linux 2.6.29. But not all LSM modules were happy with that change.
      The TOMOYO LSM module is one example that wants a per "struct task_struct"
      security blob, because TOMOYO's security context is defined based on
      "struct task_struct" rather than "struct cred". The AppArmor LSM module is
      another example, because AppArmor is currently abusing the cred a little
      bit to store the change_hat and setexeccon info. Although the
      security_task_free() hook was revived in Linux 3.4 because the Yama LSM
      module wanted to release its per "struct task_struct" security blob, the
      security_task_alloc() hook and the "struct task_struct"->security field
      were not revived. Nowadays, we are getting proposals for lightweight LSM
      modules that want to use a per "struct task_struct" security blob.
      
      We already allow multiple concurrent LSM modules (up to one fully
      armored module which uses the "struct cred"->security field or exclusive
      hooks like security_xfrm_state_pol_flow_match(), plus an unlimited number
      of lightweight modules which use neither "struct cred"->security nor
      exclusive hooks) as long as they are built into the kernel. But this patch
      does not implement a variable-length "struct task_struct"->security field,
      which will become necessary when multiple LSM modules want to use the
      "struct task_struct"->security field. Although it won't be difficult to
      implement a variable-length "struct task_struct"->security field, let's
      think about it after this patch is merged.

      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Tested-by: Djalal Harouni <tixxdz@gmail.com>
      Acked-by: José Bollo <jobol@nonadev.net>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: José Bollo <jobol@nonadev.net>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
  13. 14 Mar, 2017 1 commit
    • H
      perf: Add PERF_RECORD_NAMESPACES to include namespaces related info · e4222673
      Committed by Hari Bathini
      With the advent of container technologies like Docker, which depend on
      namespaces for isolation, there is a need for tracing support for
      namespaces. This patch introduces a new PERF_RECORD_NAMESPACES event for
      recording namespace-related info. By recording info for every
      namespace, it is left to userspace to decide on the definition of a
      container and to trace containers by updating the perf tool accordingly.
      
      Each namespace has a combination of device and inode numbers. Though
      every namespace currently has the same device number, that may change in
      the future to avoid the need for a namespace of namespaces. Considering
      that possibility, record both device and inode numbers separately for
      each namespace.
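
      For illustration, userspace on Linux can observe the same pair of
      numbers by calling stat() on the entries under /proc/<pid>/ns/. The
      sketch below is not part of the patch; ns_id, read_ns_id() and
      same_ns() are hypothetical helper names.

      ```c
      #include <assert.h>
      #include <sys/stat.h>
      #include <sys/types.h>

      /* Hypothetical helper type: the pair that identifies a namespace. */
      struct ns_id { dev_t dev; ino_t ino; };

      /* Returns 0 and fills *out on success, -1 if the path cannot be stat'ed. */
      static int read_ns_id(const char *path, struct ns_id *out)
      {
          struct stat st;

          if (stat(path, &st) != 0)
              return -1;
          out->dev = st.st_dev;   /* device number of the namespace file */
          out->ino = st.st_ino;   /* inode number of the namespace file */
          return 0;
      }

      /* Two entries refer to the same namespace when both numbers match. */
      static int same_ns(const struct ns_id *a, const struct ns_id *b)
      {
          return a->dev == b->dev && a->ino == b->ino;
      }
      ```

      For example, read_ns_id("/proc/self/ns/pid", &id) fills in the pid
      namespace of the calling process on systems where that file exists.
      Comparing both fields rather than the inode alone matches the
      reasoning above about device numbers possibly diverging.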

      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  14. 08 Mar, 2017 1 commit
    • J
      livepatch: change to a per-task consistency model · d83a7cb3
      Committed by Josh Poimboeuf
      Change livepatch to use a basic per-task consistency model.  This is the
      foundation which will eventually enable us to patch those ~10% of
      security patches which change function or data semantics.  This is the
      biggest remaining piece needed to make livepatch more generally useful.
      
      This code stems from the design proposal made by Vojtech [1] in November
      2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
      consistency and syscall barrier switching combined with kpatch's stack
      trace switching.  There are also a number of fallback options which make
      it quite flexible.
      
      Patches are applied on a per-task basis, when the task is deemed safe to
      switch over.  When a patch is enabled, livepatch enters into a
      transition state where tasks are converging to the patched state.
      Usually this transition state can complete in a few seconds.  The same
      sequence occurs when a patch is disabled, except the tasks converge from
      the patched state to the unpatched state.
      
      An interrupt handler inherits the patched state of the task it
      interrupts.  The same is true for forked tasks: the child inherits the
      patched state of the parent.
      
      Livepatch uses several complementary approaches to determine when it's
      safe to patch tasks:
      
      1. The first and most effective approach is stack checking of sleeping
         tasks.  If no affected functions are on the stack of a given task,
         the task is patched.  In most cases this will patch most or all of
         the tasks on the first try.  Otherwise it'll keep trying
         periodically.  This option is only available if the architecture has
         reliable stacks (HAVE_RELIABLE_STACKTRACE).
      
      2. The second approach, if needed, is kernel exit switching.  A
         task is switched when it returns to user space from a system call, a
         user space IRQ, or a signal.  It's useful in the following cases:
      
         a) Patching I/O-bound user tasks which are sleeping on an affected
            function.  In this case you have to send SIGSTOP and SIGCONT to
            force it to exit the kernel and be patched.
         b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
            then it will get patched the next time it gets interrupted by an
            IRQ.
         c) In the future it could be useful for applying patches for
            architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
            this case you would have to signal most of the tasks on the
            system.  However this isn't supported yet because there's
            currently no way to patch kthreads without
            HAVE_RELIABLE_STACKTRACE.
      
      3. For idle "swapper" tasks, since they don't ever exit the kernel, they
         instead have a klp_update_patch_state() call in the idle loop which
         allows them to be patched before the CPU enters the idle state.
      
         (Note there's not yet such an approach for kthreads.)
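
      The per-task convergence and the fork inheritance described above can
      be sketched as a small model. It is an illustration only: the names
      patch_state_model, klp_task_model, fork_model() and
      try_complete_transition() are hypothetical, and the stack check is
      reduced to a single boolean.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      enum patch_state_model { UNPATCHED = 0, PATCHED = 1 };

      struct klp_task_model {
          enum patch_state_model state;
          bool on_affected_function;   /* stands in for the stack check */
      };

      /* A forked child starts in the patch state of its parent. */
      static struct klp_task_model fork_model(const struct klp_task_model *parent)
      {
          struct klp_task_model child = { parent->state, false };
          return child;
      }

      /*
       * One pass over all tasks: switch every task that is currently safe.
       * Returns true once every task has reached the target state.
       */
      static bool try_complete_transition(struct klp_task_model *tasks, size_t n,
                                          enum patch_state_model target)
      {
          bool done = true;

          for (size_t i = 0; i < n; i++) {
              if (tasks[i].state == target)
                  continue;
              if (tasks[i].on_affected_function)
                  done = false;             /* not safe yet, retry later */
              else
                  tasks[i].state = target;
          }
          return done;
      }
      ```

      Calling try_complete_transition() repeatedly mirrors the periodic
      retry: the transition stays open as long as one task is still
      sleeping on an affected function.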
      
      All the above approaches may be skipped by setting the 'immediate' flag
      in the 'klp_patch' struct, which will disable per-task consistency and
      patch all tasks immediately.  This can be useful if the patch doesn't
      change any function or data semantics.  Note that, even with this flag
      set, it's possible that some tasks may still be running with an old
      version of the function, until that function returns.
      
      There's also an 'immediate' flag in the 'klp_func' struct which allows
      you to specify that certain functions in the patch can be applied
      without per-task consistency.  This might be useful if you want to patch
      a common function like schedule(), and the function change doesn't need
      consistency but the rest of the patch does.
      
      For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
      must set patch->immediate which causes all tasks to be patched
      immediately.  This option should be used with care, only when the patch
      doesn't change any function or data semantics.
      
      In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
      may be allowed to use per-task consistency if we can come up with
      another way to patch kthreads.
      
      The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
      is in transition.  Only a single patch (the topmost patch on the stack)
      can be in transition at a given time.  A patch can remain in transition
      indefinitely, if any of the tasks are stuck in the initial patch state.
      
      A transition can be reversed and effectively canceled by writing the
      opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
      the transition is in progress.  Then all the tasks will attempt to
      converge back to the original patch state.
      
      [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Miroslav Benes <mbenes@suse.cz>
      Acked-by: Ingo Molnar <mingo@kernel.org>        # for the scheduler changes
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
  15. 03 Mar, 2017 1 commit
  16. 02 Mar, 2017 11 commits
  17. 28 Feb, 2017 1 commit
  18. 23 Feb, 2017 1 commit
  19. 20 Feb, 2017 1 commit
  20. 03 Feb, 2017 2 commits
    • P
      prctl: propagate has_child_subreaper flag to every descendant · 749860ce
      Committed by Pavel Tikhomirov
      If a process forks some children while it has the is_child_subreaper
      flag enabled, they inherit the has_child_subreaper flag (first
      group); children forked while is_child_subreaper is disabled do
      not inherit it (second group). So a child-subreaper does not reparent
      all of its descendants when their parents die. Having these two
      differently behaving groups can lead to confusion. It is also
      a problem for CRIU: when we restore a process tree we need to
      somehow determine which descendants belong to which group and,
      much harder, put them into exactly these groups.
      
      To simplify this we can propagate the has_child_subreaper
      flag on PR_SET_CHILD_SUBREAPER, walking all descendants of the
      child-subreaper to set the has_child_subreaper flag.
      
      In the common case where a process like systemd first sets itself to
      be a child-subreaper and only after that forks its services, we will
      have a zero-length list of descendants to walk. Testing with a binary
      subtree of 2^15 processes, prctl took < 0.007 sec and showed a close
      to linear dependency (~0.2 * n * usec) on lower numbers of processes.
      
      Moreover, I doubt anyone intentionally pre-forks the children which
      should reparent to init before becoming a subreaper, because some
      ancestor of ours might have had the is_child_subreaper flag while forking
      our sub-tree, so our children will all inherit the has_child_subreaper
      flag, and we have no way to influence it. And the only way to check that
      we have no has_child_subreaper flag is to create some children, kill them
      and see where they reparent to.
      
      Use the walk_process_tree() helper to walk the subtree, thanks to Oleg!
      Timing seems to be the same.
      
      Optimize:
      
      a) When a descendant already has the has_child_subreaper flag, its whole
      subtree already has it too.

      * For a) to hold, the has_child_subreaper inheritance needs to move under
      the same tasklist_lock as adding the task to its ->real_parent->children.
      Without that, a process can inherit a zero has_child_subreaper, then we
      set its parent's flag to 1 and check that the parent has no more children,
      and only after that is the child with the wrong flag added to the tree.

      * Also make this inheritance clearer by using real_parent instead of
      current: on clone(CLONE_PARENT), if current has is_child_subreaper
      and real_parent has neither is_child_subreaper nor has_child_subreaper,
      the child would have the has_child_subreaper flag set without actually
      having a subreaper among its ancestors.

      b) When some descendant is a child_reaper, its subtree is in a different
      pidns from us (the original child-subreaper), and processes from another
      pidns will never reparent to us.

      So we can skip the subtrees of (a) and (b) during the walk.
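
      The pruned walk can be sketched on a toy tree. This is a recursive
      userspace illustration under stated assumptions, not the kernel
      helper: proc_model, MAX_CHILDREN and propagate_subreaper() are
      hypothetical, and locking is left out entirely.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define MAX_CHILDREN 4

      /* Hypothetical process node. */
      struct proc_model {
          bool has_child_subreaper;
          bool is_pidns_reaper;    /* child reaper of another pid namespace */
          struct proc_model *children[MAX_CHILDREN];
          size_t nr_children;
      };

      /* Marks descendants of p and returns how many nodes were visited. */
      static size_t propagate_subreaper(struct proc_model *p)
      {
          size_t visited = 0;

          for (size_t i = 0; i < p->nr_children; i++) {
              struct proc_model *c = p->children[i];

              visited++;
              if (c->has_child_subreaper)   /* (a): subtree already marked */
                  continue;
              if (c->is_pidns_reaper)       /* (b): never reparents to us */
                  continue;
              c->has_child_subreaper = true;
              visited += propagate_subreaper(c);
          }
          return visited;
      }
      ```

      Nodes below an already marked child or below another namespace's
      child reaper are never visited, which is where the saving comes from.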
      
      v2: switch to walk_process_tree() general helper, move
      has_child_subreaper inheritance
      v3: remove csr_descendant leftover, change current to real_parent
      in has_child_subreaper inheritance
      v4: small commit message fix
      
      Fixes: ebec18a6 ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • O
      introduce the walk_process_tree() helper · 0f1b92cb
      Committed by Oleg Nesterov
      Add a new helper to walk the process tree; the next patch adds a user.
      Note that it visits the group leaders only. proc_visitor can do
      for_each_thread itself, or we can trivially extend walk_process_tree() to
      do this.

      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  21. 01 Feb, 2017 1 commit
  22. 28 Jan, 2017 1 commit
  23. 14 Jan, 2017 1 commit
    • P
      locking/mutex: Fix mutex handoff · e274795e
      Committed by Peter Zijlstra
      While reviewing the ww_mutex patches, I noticed that it was still
      possible to (incorrectly) succeed for (incorrect) code like:
      
      	mutex_lock(&a);
      	mutex_lock(&a);
      
      This was possible if the second mutex_lock() would block (as expected)
      but then receive a spurious wakeup. At that point it would find itself
      at the front of the queue, request a handoff and instantly claim
      ownership and continue, since owner would point to itself.
      
      Avoid this scenario and simplify the code by introducing a third low
      bit to signal handoff pickup. So once we request handoff, unlock
      clears the handoff bit and sets the pickup bit along with the new
      owner.
      
      This also removes the need for the .handoff argument to
      __mutex_trylock(), since that becomes superfluous with PICKUP.
      
      In order to guarantee enough low bits, ensure task_struct alignment is
      at least L1_CACHE_BYTES (which seems a good idea regardless).
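
      The role of alignment can be shown with a sketch of flag bits packed
      into the low bits of an owner pointer. The flag names, mtask_model and
      the helper functions are assumptions made for this example; the text
      above only states that a third low bit is introduced for pickup.

      ```c
      #include <assert.h>
      #include <stdalign.h>
      #include <stdint.h>

      /* Hypothetical flag layout in the three low bits of the owner word. */
      #define FLAG_WAITERS 0x1u
      #define FLAG_HANDOFF 0x2u
      #define FLAG_PICKUP  0x4u
      #define FLAG_MASK    0x7u

      /* Alignment of at least 8 keeps the three low pointer bits zero. */
      struct mtask_model { alignas(64) int id; };

      static uintptr_t owner_pack(const struct mtask_model *t, unsigned int flags)
      {
          assert(((uintptr_t)t & FLAG_MASK) == 0);  /* guaranteed by alignment */
          return (uintptr_t)t | (flags & FLAG_MASK);
      }

      static struct mtask_model *owner_task(uintptr_t owner)
      {
          return (struct mtask_model *)(owner & ~(uintptr_t)FLAG_MASK);
      }

      static unsigned int owner_flags(uintptr_t owner)
      {
          return (unsigned int)(owner & FLAG_MASK);
      }

      /* Unlock-side handoff: clear HANDOFF, set PICKUP, name the new owner. */
      static uintptr_t handoff_to(uintptr_t owner, const struct mtask_model *next)
      {
          unsigned int flags = (owner_flags(owner) & ~FLAG_HANDOFF) | FLAG_PICKUP;

          return owner_pack(next, flags);
      }
      ```

      Because the pickup bit is set only by the unlocking side, a waiter
      that merely finds its own pointer in the owner word cannot mistake
      that for a completed handoff in this model.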

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 9d659ae1 ("locking/mutex: Add lock handoff to avoid starvation")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 25 Dec, 2016 1 commit
  25. 13 Dec, 2016 1 commit