1. 21 5月, 2016 4 次提交
    • O
      userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Oleg Nesterov 提交于
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  Except handle_userfault() path doesn't
      need this, the caller must already have a reference.
      
      The patch adds the new trivial helper, mmget_not_zero(), it can have
      more users.
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2005e3f
    • T
      mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa 提交于
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct it was quite inefficient because
      each thread group was scanned num_threads times which can be a lot
      especially with processes with many threads.  Even though the OOM is
      extremely cold path it is always good to be as effective as possible
      when we are inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f44666b0
    • M
      mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko 提交于
      Tetsuo has properly noted that mmput slow path might get blocked waiting
      for another party (e.g.  exit_aio waits for an IO).  If that happens the
      oom_reaper would be put out of the way and will not be able to process
      next oom victim.  We should strive for making this context as reliable
      and independent on other subsystems as much as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • M
      mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Michal Hocko 提交于
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for oom reaped task
      but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the
      oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share signal struct).  If the mm had only small amount
      of memory which could be reaped then another task sharing the mm could
      be selected and that wouldn't help to move out from the oom situation.
      
      Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
      as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
      flag after __oom_reap_task is done with a task.  This will force the
      select_bad_process() to ignore all already oom reaped tasks as well as
      no such task is sacrificed for its parent.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb8a4b7f
  2. 12 5月, 2016 1 次提交
  3. 06 5月, 2016 3 次提交
  4. 05 5月, 2016 2 次提交
  5. 04 5月, 2016 1 次提交
    • A
      signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack() · c876eeab
      Andy Lutomirski 提交于
      If a signal stack is set up with SS_AUTODISARM, then the kernel
      inherently avoids incorrectly resetting the signal stack if signals
      recurse: the signal stack will be reset on the first signal
      delivery.  This means that we don't need check the stack pointer
      when delivering signals if SS_AUTODISARM is set.
      
      This will make segmented x86 programs more robust: currently there's
      a hole that could be triggered if ESP/RSP appears to point to the
      signal stack but actually doesn't due to a nonzero SS base.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/c46bee4654ca9e68c498462fd11746e2bd0d98c8.1462296606.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c876eeab
  6. 03 5月, 2016 1 次提交
    • S
      signals/sigaltstack: Implement SS_AUTODISARM flag · 2a742138
      Stas Sergeev 提交于
      This patch implements the SS_AUTODISARM flag that can be OR-ed with
      SS_ONSTACK when forming ss_flags.
      
      When this flag is set, sigaltstack will be disabled when entering
      the signal handler; more precisely, after saving sas to uc_stack.
      When leaving the signal handler, the sigaltstack is restored by
      uc_stack.
      
      When this flag is used, it is safe to switch from sighandler with
      swapcontext(). Without this flag, the subsequent signal will corrupt
      the state of the switched-away sighandler.
      
      To detect the support of this functionality, one can do:
      
        err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
        if (err && errno == EINVAL)
      	unsupported();
      Signed-off-by: NStas Sergeev <stsp@list.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2a742138
  7. 23 4月, 2016 2 次提交
    • F
      sched/fair: Correctly handle nohz ticks CPU load accounting · 1f41906a
      Frederic Weisbecker 提交于
      Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
      mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
      mode and we try to minimize the ticks as much as possible. The nohz
      subsystem uses a confusing terminology with the internal state
      "ts->tick_stopped" which is also available through its public interface
      with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
      reduced with the best effort rather than stopped. In the best case the
      tick can indeed be actually stopped but there is no guarantee about that.
      If a timer needs to fire one second later, a tick will fire while the
      CPU is in nohz mode and this is a very common scenario.
      
      Now this confusion happens to be a problem with CPU load updates:
      cpu_load_update_active() doesn't handle nohz ticks correctly because it
      assumes that ticks are completely stopped in nohz mode and that
      cpu_load_update_active() can't be called in dynticks mode. When that
      happens, the whole previous tickless load is ignored and the function
      just records the load for the current tick, ignoring potentially long
      idle periods behind.
      
      In order to solve this, we could account the current load for the
      previous nohz time but there is a risk that we account the load of a
      task that got freshly enqueued for the whole nohz period.
      
      So instead, lets record the dynticks load on nohz frame entry so we know
      what to record in case of nohz ticks, then use this record to account
      the tickless load on nohz ticks and nohz frame end.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1f41906a
    • F
      sched/fair: Gather CPU load functions under a more conventional namespace · cee1afce
      Frederic Weisbecker 提交于
      The CPU load update related functions have a weak naming convention
      currently, starting with update_cpu_load_*() which isn't ideal as
      "update" is a very generic concept.
      
      Since two of these functions are public already (and a third is to come)
      that's enough to introduce a more conventional naming scheme. So let's
      do the following rename instead:
      
      	update_cpu_load_*() -> cpu_load_update_*()
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cee1afce
  8. 13 4月, 2016 1 次提交
    • D
      sched/clock: Make local_clock()/cpu_clock() inline · 2c923e94
      Daniel Lezcano 提交于
      The local_clock/cpu_clock functions were changed to prevent a double
      identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK
      is set. That resulted in one line functions.
      
      As these functions are in all the cases one line functions and in the
      hot path, it is useful to specify them as static inline in order to
      give a strong hint to the compiler.
      
      After verification, it appears the compiler does not inline them
      without this hint. Change those functions to static inline.
      
      sched_clock_cpu() is called via the inlined local_clock()/cpu_clock()
      functions from sched.h. So any module code including sched.h will
      reference sched_clock_cpu(). Thus it must be exported with the
      EXPORT_SYMBOL_GPL macro.
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2c923e94
  9. 02 4月, 2016 1 次提交
  10. 01 4月, 2016 1 次提交
  11. 29 3月, 2016 1 次提交
  12. 26 3月, 2016 5 次提交
  13. 23 3月, 2016 2 次提交
    • H
      parisc,metag: Implement CONFIG_DEBUG_STACK_USAGE option · 6c31da34
      Helge Deller 提交于
      On parisc and metag the stack grows upwards, so for those we need to
      scan the stack downwards in order to calculate how much stack a process
      has used.
      
      Tested on a 64bit parisc kernel.
      Signed-off-by: NHelge Deller <deller@gmx.de>
      6c31da34
    • D
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov 提交于
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
  14. 18 3月, 2016 1 次提交
    • J
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz 提交于
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
  15. 11 3月, 2016 1 次提交
  16. 08 3月, 2016 1 次提交
  17. 02 3月, 2016 2 次提交
    • F
      sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Frederic Weisbecker 提交于
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Everytime a task is enqueued or dequeued, we evaluate the state of the
      tick dependency on top of the policy of the tasks in the runqueue, by
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      76d92ac3
    • F
      nohz: New tick dependency mask · d027d45d
      Frederic Weisbecker 提交于
      The tick dependency is evaluated on every IRQ and context switch. This
      consists is a batch of checks which determine whether it is safe to
      stop the tick or not. These checks are often split in many details:
      posix cpu timers, scheduler, sched clock, perf events.... each of which
      are made of smaller details: posix cpu timer involves checking process
      wide timers then thread wide timers. Perf involves checking freq events
      then more per cpu details.
      
      Checking these informations asynchronously every time we update the full
      dynticks state bring avoidable overhead and a messy layout.
      
      Let's introduce instead tick dependency masks: one for system wide
      dependency (unstable sched clock, freq based perf events), one for CPU
      wide dependency (sched, throttling perf events), and task/signal level
      dependencies (posix cpu timers). The subsystems are responsible
      for setting and clearing their dependency through a set of APIs that will
      take care of concurrent dependency mask modifications and kick targets
      to restart the relevant CPU tick whenever needed.
      
      This new dependency engine stays beside the old one until all subsystems
      having a tick dependency are converted to it.
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      d027d45d
  18. 29 2月, 2016 2 次提交
    • S
      sched/debug: Fix preempt_disable_ip recording for preempt_disable() · f904f582
      Sebastian Andrzej Siewior 提交于
      The preempt_disable() invokes preempt_count_add() which saves the caller
      in ->preempt_disable_ip. It uses CALLER_ADDR1 which does not look for
      its caller but for the parent of the caller. Which means we get the correct
      caller for something like spin_lock() unless the architectures inlines
      those invocations. It is always wrong for preempt_disable() or
      local_bh_disable().
      
      This patch makes the function get_lock_parent_ip() which tries
      CALLER_ADDR0,1,2 if the former is a locking function.
      This seems to record the preempt_disable() caller properly for
      preempt_disable() itself as well as for get_cpu_var() or
      local_bh_disable().
      
      Steven asked for the get_parent_ip() -> get_lock_parent_ip() rename.
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160226135456.GB18244@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f904f582
    • P
      sched/rt: Fix PI handling vs. sched_setscheduler() · ff77e468
      Peter Zijlstra 提交于
      Andrea Parri reported:
      
      > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
      > handled correctly:
      >
      >     T1 (prio = 20)
      >        lock(rtmutex);
      >
      >     T2 (prio = 20)
      >        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
      >
      >     T1 (prio = 20)
      >        sys_set_scheduler(prio = 0)
      >           [new_effective_prio == oldprio]
      >           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
      >
      > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
      > in particular, if we continue with
      >
      >    T1 (prio = 20)
      >       unlock(rtmutex)
      >          wakeup(T2)
      >          adjust_prio(T1)
      >             [prio != rt_mutex_getprio(T1)]
      >	    dequeue(T1)
      >	       rt_nr_boosted = (unsigned long)(-1)
      >	       ...
      >             T1 prio = 0
      >
      > then we end up leaving rt_nr_boosted in an "inconsistent" state.
      >
      > The simple program attached could reproduce the previous scenario; note
      > that, as a consequence of the presence of this state, the "assertion"
      >
      >     WARN_ON(!rt_nr_running && rt_nr_boosted)
      >
      > from dec_rt_group() may trigger.
      
      So normally we dequeue/enqueue tasks in sched_setscheduler(), which
      would ensure the accounting stays correct. However in the early PI path
      we fail to do so.
      
      So this was introduced at around v3.14, by:
      
        c365c292 ("sched: Consider pi boosting in setscheduler()")
      
      which fixed another problem exactly because that dequeue/enqueue, joy.
      
      Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
      preserve runqueue location with that option. This requires decoupling
      the on_rt_rq() state from being on the list.
      
      In order to allow for explicit movement during the SAVE/RESTORE,
      introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
      cases to preserve other invariants.
      
      Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
      things like sys_nice()/sys_sched_setaffinity() also do not reorder
      FIFO tasks (whereas they used to before this patch).
      Reported-by: NAndrea Parri <parri.andrea@gmail.com>
      Tested-by: NAndrea Parri <parri.andrea@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ff77e468
  19. 09 2月, 2016 1 次提交
    • M
      sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Mel Gorman 提交于
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken from a 48-core 2-socket
      machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
      single socket machine 8-core machine with Intel i7-3770 processors.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here, UDP_STREAM showed nothing intresting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat.  The
      gains on the 8-core machine were slightly different
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In constract, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result will depend on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses and while the variance data is not included,
      it's close to the noise. The UMA machine did not show anything particularly
      different
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling to
      alleviate the problem. For tracepoints, there is a simple warning as it's
      not safe to activate schedstats in the context when it's known the tracepoint
      may be wanted but is unavailable.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb251765
  20. 28 1月, 2016 1 次提交
  21. 21 1月, 2016 2 次提交
  22. 20 1月, 2016 1 次提交
    • W
      pipe: limit the per-user amount of pages allocated in pipes · 759c0114
      Willy Tarreau 提交于
      On no-so-small systems, it is possible for a single process to cause an
      OOM condition by filling large pipes with data that are never read. A
      typical process filling 4000 pipes with 1 MB of data will use 4 GB of
      memory. On small systems it may be tricky to set the pipe max size to
      prevent this from happening.
      
      This patch makes it possible to enforce a per-user soft limit above
      which new pipes will be limited to a single page, effectively limiting
      them to 4 kB each, as well as a hard limit above which no new pipes may
      be created for this user. This has the effect of protecting the system
      against memory abuse without hurting other users, and still allowing
      pipes to work correctly though with less data at once.
      
      The limit are controlled by two new sysctls : pipe-user-pages-soft, and
      pipe-user-pages-hard. Both may be disabled by setting them to zero. The
      default soft limit allows the default number of FDs per process (1024)
      to create pipes of the default size (64kB), thus reaching a limit of 64MB
      before starting to create only smaller pipes. With 256 processes limited
      to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
      1084 MB of memory allocated for a user. The hard limit is disabled by
      default to avoid breaking existing applications that make intensive use
      of pipes (eg: for splicing).
      
      Reported-by: socketpair@gmail.com
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      759c0114
  23. 11 1月, 2016 1 次提交
  24. 06 1月, 2016 2 次提交
    • J
      sched/core: Move sched_entity::avg into separate cache line · 5a107804
      Jiri Olsa 提交于
      The sched_entity::avg collides with read-mostly sched_entity data.
      
      The perf c2c tool showed many read HITM accesses across
      many CPUs for sched_entity's cfs_rq and my_q, while having
      at the same time tons of stores for avg.
      
      After placing sched_entity::avg into separate cache line,
      the perf bench sched pipe showed around 20 seconds speedup.
      
      NOTE I cut out all perf events except for cycles and
      instructions from following output.
      
      Before:
        $ perf stat -r 5 perf bench sched pipe -l 10000000
        # Running 'sched/pipe' benchmark:
        # Executed 10000000 pipe operations between two processes
      
             Total time: 270.348 [sec]
      
              27.034805 usecs/op
                  36989 ops/sec
         ...
      
           245,537,074,035      cycles                    #    1.433 GHz
           187,264,548,519      instructions              #    0.77  insns per cycle
      
             272.653840535 seconds time elapsed           ( +-  1.31% )
      
      After:
        $ perf stat -r 5 perf bench sched pipe -l 10000000
        # Running 'sched/pipe' benchmark:
        # Executed 10000000 pipe operations between two processes
      
             Total time: 251.076 [sec]
      
              25.107678 usecs/op
                  39828 ops/sec
        ...
      
           244,573,513,928      cycles                    #    1.572 GHz
           187,409,641,157      instructions              #    0.76  insns per cycle
      
             251.679315188 seconds time elapsed           ( +-  0.31% )
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449606239-28602-1-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5a107804
    • P
      sched/core: Fix unserialized r-m-w scribbling stuff · be958bdc
      Peter Zijlstra 提交于
      Some of the sched bitfieds (notably sched_reset_on_fork) can be set
      on other than current, this can cause the r-m-w to race with other
      updates.
      
      Since all the sched bits are serialized by scheduler locks, pull them
      in a separate word.
      Reported-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: hannes@cmpxchg.org
      Cc: mhocko@kernel.org
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      be958bdc