1. 01 7月, 2015 2 次提交
  2. 26 6月, 2015 13 次提交
    • R
      exit,stats: /* obey this comment */ · 51229b49
      Rik van Riel 提交于
      There is a helpful comment in do_exit() that states we sync the mm's RSS
      info before statistics gathering.
      
      The function that does the statistics gathering is called right above that
      comment.
      
      Change the code to obey the comment.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51229b49
    • R
      kernel/trace/blktrace.c: use strreplace() in do_blk_trace_setup() · ff14417c
      Rasmus Villemoes 提交于
      Part of the disassembly of do_blk_trace_setup:
      
          231b:       e8 00 00 00 00          callq  2320 <do_blk_trace_setup+0x50>
                              231c: R_X86_64_PC32     strlen+0xfffffffffffffffc
          2320:       eb 0a                   jmp    232c <do_blk_trace_setup+0x5c>
          2322:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
          2328:       48 83 c3 01             add    $0x1,%rbx
          232c:       48 39 d8                cmp    %rbx,%rax
          232f:       76 47                   jbe    2378 <do_blk_trace_setup+0xa8>
          2331:       41 80 3c 1c 2f          cmpb   $0x2f,(%r12,%rbx,1)
          2336:       75 f0                   jne    2328 <do_blk_trace_setup+0x58>
          2338:       41 c6 04 1c 5f          movb   $0x5f,(%r12,%rbx,1)
          233d:       4c 89 e7                mov    %r12,%rdi
          2340:       e8 00 00 00 00          callq  2345 <do_blk_trace_setup+0x75>
                              2341: R_X86_64_PC32     strlen+0xfffffffffffffffc
          2345:       eb e1                   jmp    2328 <do_blk_trace_setup+0x58>
      
      Yep, that's right: gcc isn't smart enough to realize that replacing '/' by
      '_' cannot change the strlen(), so we call it again and again (at least
      when a '/' is found).  Even if gcc were that smart, this construction
      would still loop over the string twice, once for the initial strlen() call
      and then the open-coded loop.
      
      Let's simply use strreplace() instead.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Liked-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff14417c
    • R
      kernel/trace/trace_events_filter.c: use strreplace() · 1bb56471
      Rasmus Villemoes 提交于
      There's no point in starting over every time we see a ','...
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bb56471
    • V
      check_syslog_permissions() cleanup · 3ea4331c
      Vasily Averin 提交于
      Patch fixes drawbacks in heck_syslog_permissions() noticed by AKPM:
      "from_file handling makes me cry.
      
      That's not a boolean - it's an enumerated value with two values
      currently defined.
      
      But the code in check_syslog_permissions() treats it as a boolean and
      also hardwires the knowledge that SYSLOG_FROM_PROC == 1 (or == `true`).
      
      And the name is wrong: it should be called from_proc to match
      SYSLOG_FROM_PROC."
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea4331c
    • V
      security_syslog() should be called once only · d194e5d6
      Vasily Averin 提交于
      The final version of commit 637241a9 ("kmsg: honor dmesg_restrict
      sysctl on /dev/kmsg") lost few hooks, as result security_syslog() are
      processed incorrectly:
      
      - open of /dev/kmsg checks syslog access permissions by using
        check_syslog_permissions() where security_syslog() is not called if
        dmesg_restrict is set.
      
      - syslog syscall and /proc/kmsg calls do_syslog() where security_syslog
        can be executed twice (inside check_syslog_permissions() and then
        directly in do_syslog())
      
      With this patch security_syslog() is called once only in all
      syslog-related operations regardless of dmesg_restrict value.
      
      Fixes: 637241a9 ("kmsg: honor dmesg_restrict sysctl on /dev/kmsg")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d194e5d6
    • T
      printk: implement support for extended console drivers · 6fe29354
      Tejun Heo 提交于
      printk log_buf keeps various metadata for each message including its
      sequence number and timestamp.  The metadata is currently available only
      through /dev/kmsg and stripped out before passed onto console drivers.  We
      want this metadata to be available to console drivers too so that console
      consumers can get full information including the metadata and dictionary,
      which among other things can be used to detect whether messages got lost
      in transit.
      
      This patch implements support for extended console drivers.  Consoles can
      indicate that they want extended messages by setting the new CON_EXTENDED
      flag and they'll be fed messages formatted the same way as /dev/kmsg.
      
       "<level>,<sequnum>,<timestamp>,<contflag>;<message text>\n"
      
      If extended consoles exist, in-kernel fragment assembly is disabled.  This
      ensures that all messages emitted to consoles have full metadata including
      sequence number.  The contflag carries enough information to reassemble
      the fragments from the reader side trivially.  Note that this only affects
      /dev/kmsg.  Regular console and /proc/kmsg outputs are not affected by
      this change.
      
      * Extended message formatting for console drivers is enabled iff there
        are registered extended consoles.
      
      * Comment describing /dev/kmsg message format updated to add missing
        contflag field and help distinguishing variable from verbatim terms.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fe29354
    • T
      printk: factor out message formatting from devkmsg_read() · 0a295e67
      Tejun Heo 提交于
      The extended message formatting used for /dev/kmsg will be used implement
      extended consoles.  Factor out msg_print_ext_header() and
      msg_print_ext_body() from devkmsg_read().
      
      This is pure restructuring.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a295e67
    • T
      printk: guard the amount written per line by devkmsg_read() · d43ff430
      Tejun Heo 提交于
      This patchset updates netconsole so that it can emit messages with the
      same header as used in /dev/kmsg which gives neconsole receiver full log
      information which enables things like structured logging and detection
      of lost messages.
      
      This patch (of 7):
      
      devkmsg_read() uses 8k buffer and assumes that the formatted output
      message won't overrun which seems safe given LOG_LINE_MAX, the current use
      of dict and the escaping method being used; however, we're planning to use
      devkmsg formatting wider and accounting for the buffer size properly isn't
      that complicated.
      
      This patch defines CONSOLE_EXT_LOG_MAX as 8192 and updates devkmsg_read()
      so that it limits output accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d43ff430
    • J
      clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Josh Triplett 提交于
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3033f14a
    • A
      prctl: more prctl(PR_SET_MM_*) checks · 4a00e9df
      Alexey Dobriyan 提交于
      Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
      consistent view of mm->arg_start et al fields, but not enough.  In
      particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
      PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
      but don't check that the start address is lower than the end address _at
      all_.
      
      Consolidate all consistency checks, so there will be no difference in
      the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
      
      The program below makes both ARGV and ENVP areas be reversed.  It makes
      /proc/$PID/cmdline show garbage (it doesn't oops by luck).
      
      #include <sys/mman.h>
      #include <sys/prctl.h>
      #include <unistd.h>
      
      enum {PAGE_SIZE=4096};
      
      int main(void)
      {
      	void *p;
      
      	p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      
      #define PR_SET_MM               35
      #define PR_SET_MM_ARG_START     8
      #define PR_SET_MM_ARG_END       9
      #define PR_SET_MM_ENV_START     10
      #define PR_SET_MM_ENV_END       11
      	prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ARG_END,   (unsigned long)p, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ENV_END,   (unsigned long)p, 0, 0);
      
      	pause();
      	return 0;
      }
      
      [akpm@linux-foundation.org: tidy code, tweak comment]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a00e9df
    • S
      tracing: Fix typo from "static inlin" to "static inline" · cc9e4bde
      Steven Rostedt (Red Hat) 提交于
      The trace.h header when called without CONFIG_EVENT_TRACING enabled
      (seldom done), will not compile because of a typo in the protocol
      of trace_event_enum_update().
      
      Cc: stable@vger.kernel.org # 4.1+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      cc9e4bde
    • S
      tracing/filter: Do not allow infix to exceed end of string · 6b88f44e
      Steven Rostedt (Red Hat) 提交于
      While debugging a WARN_ON() for filtering, I found that it is possible
      for the filter string to be referenced after its end. With the filter:
      
       # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
      
      The filter_parse() function can call infix_get_op() which calls
      infix_advance() that updates the infix filter pointers for the cnt
      and tail without checking if the filter is already at the end, which
      will put the cnt to zero and the tail beyond the end. The loop then calls
      infix_next() that has
      
      	ps->infix.cnt--;
      	return ps->infix.string[ps->infix.tail++];
      
      The cnt will now be below zero, and the tail that is returned is
      already passed the end of the filter string. So far the allocation
      of the filter string usually has some buffer that is zeroed out, but
      if the filter string is of the exact size of the allocated buffer
      there's no guarantee that the charater after the nul terminating
      character will be zero.
      
      Luckily, only root can write to the filter.
      
      Cc: stable@vger.kernel.org # 2.6.33+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      6b88f44e
    • S
      tracing/filter: Do not WARN on operand count going below zero · b4875bbe
      Steven Rostedt (Red Hat) 提交于
      When testing the fix for the trace filter, I could not come up with
      a scenario where the operand count goes below zero, so I added a
      WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
      that it can happen (although the filter would be wrong).
      
       # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
      
      That is, a single operation without any operands will hit the path
      where the WARN_ON_ONCE() can trigger. Although this is harmless,
      and the filter is reported as a error. But instead of spitting out
      a warning to the kernel dmesg, just fail nicely and report it via
      the proper channels.
      
      Link: http://lkml.kernel.org/r/558C6082.90608@oracle.comReported-by: NVince Weaver <vincent.weaver@maine.edu>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: stable@vger.kernel.org # 2.6.33+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b4875bbe
  3. 25 6月, 2015 3 次提交
    • J
      mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Johannes Weiner 提交于
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16e95196
    • C
      watchdog: add watchdog_cpumask sysctl to assist nohz · fe4ba3c3
      Chris Metcalf 提交于
      Change the default behavior of watchdog so it only runs on the
      housekeeping cores when nohz_full is enabled at build and boot time.
      Allow modifying the set of cores the watchdog is currently running on
      with a new kernel.watchdog_cpumask sysctl.
      
      In the current system, the watchdog subsystem runs a periodic timer that
      schedules the watchdog kthread to run.  However, nohz_full cores are
      designed to allow userspace application code running on those cores to
      have 100% access to the CPU.  So the watchdog system prevents the
      nohz_full application code from being able to run the way it wants to,
      thus the motivation to suppress the watchdog on nohz_full cores, which
      this patchset provides by default.
      
      However, if we disable the watchdog globally, then the housekeeping
      cores can't benefit from the watchdog functionality.  So we allow
      disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
      for more information.
      
      [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
      Signed-off-by: NChris Metcalf <cmetcalf@ezchip.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe4ba3c3
    • C
      smpboot: allow excluding cpus from the smpboot threads · b5242e98
      Chris Metcalf 提交于
      This patch series allows the watchdog to run by default only on the
      housekeeping cores when nohz_full is in effect; this seems to be a good
      compromise short of turning it off completely (since the nohz_full cores
      can't tolerate a watchdog).
      
      To provide customizability, we add /proc/sys/kernel/watchdog_cpumask so
      that the set of cores running the watchdog can be tuned to different
      values after bootup.
      
      To implement this customizability, we add a new
      smpboot_update_cpumask_percpu_thread() API to the smpboot_thread
      subsystem that lets us park or unpark "unwanted" threads.
      
      And now that threads can be parked for long periods of time, we tweak the
      /proc/<pid>/stat and /proc/<pid>/status code so parked threads aren't
      reported as running, which is otherwise confusing.
      
      This patch (of 3):
      
      This change allows some cores to be excluded from running the
      smp_hotplug_thread tasks.  The following commit to update
      kernel/watchdog.c to use this functionality is the motivating example, and
      more information on the motivation is provided there.
      
      A new smp_hotplug_thread field is introduced, "cpumask", which is cpumask
      field managed by the smpboot subsystem that indicates whether or not the
      given smp_hotplug_thread should run on that core; the cpumask is checked
      when deciding whether to unpark the thread.
      
      To limit the cpumask to less than cpu_possible, you must call
      smpboot_update_cpumask_percpu_thread() after registering.
      Signed-off-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5242e98
  4. 21 6月, 2015 1 次提交
  5. 20 6月, 2015 2 次提交
  6. 19 6月, 2015 19 次提交
    • T
      timer: Minimize nohz off overhead · 683be13a
      Thomas Gleixner 提交于
      If nohz is disabled on the kernel command line the [hr]timer code
      still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
      pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
      avoid the overhead.
      
      Before:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.73%  hog       [.] main
        15.36%  [kernel]  [k] _raw_spin_lock_irqsave
         9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.61%  [kernel]  [k] lock_timer_base.isra.38
         6.42%  [kernel]  [k] mod_timer
         3.90%  [kernel]  [k] detach_if_pending
         3.76%  [kernel]  [k] del_timer
         2.41%  [kernel]  [k] internal_add_timer
         1.39%  [kernel]  [k] __internal_add_timer
         0.76%  [kernel]  [k] timerfn
      
      We probably should have a cached value for nohz full in the per cpu
      bases as well to avoid the cpumask check. The base cache line is hot
      already, the cpumask not necessarily.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      683be13a
    • T
      timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner 提交于
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ
      migration is enabled, if the user did not disable it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      bc7a34b8
    • T
      timer: Stats: Simplify the flags handling · c74441a1
      Thomas Gleixner 提交于
      Simplify the handling of the flag storage for the timer statistics. No
      intermediate storage anymore. Just hand over the flags field.
      
      I left the printout of 'deferrable' for now because changing this
      would be an ABI update and I have no idea how strong people feel about
      that. OTOH, I wonder whether we should kill the whole timer stats
      stuff because all of that information can be retrieved via ftrace/perf
      as well.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c74441a1
    • T
      timer: Replace timer base by a cpu index · 0eeda71b
      Thomas Gleixner 提交于
      Instead of storing a pointer to the per cpu tvec_base we can simply
      cache a CPU index in the timer_list and use that to get hold of the
      correct per cpu tvec_base. This is only used in lock_timer_base() and
      the slightly larger code is peanuts versus the spinlock operation and
      the d-cache foot print of the timer wheel.
      
      Aside of that this allows to get rid of following nuisances:
      
       - boot_tvec_base
      
         That statically allocated 4k bss data is just kept around so the
         timer has a home when it gets statically initialized. It serves no
         other purpose.
      
         With the CPU index we assign the timer to CPU0 at static
         initialization time and therefor can avoid the whole boot_tvec_base
         dance.  That also simplifies the init code, which just can use the
         per cpu base.
      
         Before:
           text	   data	    bss	    dec	    hex	filename
          17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
         After:
           text	   data	    bss	    dec	    hex	filename
          17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o
      
       - Overloading the base pointer with various flags
      
         The CPU index has enough space to hold the flags (deferrable,
         irqsafe) so we can get rid of the extra masking and bit fiddling
         with the base pointer.
      
      As a benefit we reduce the size of struct timer_list on 64 bit
      machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
      which is a real win as we have tons of them embedded in other structs.
      
      This changes also the newly added deferrable printout of the timer
      start trace point to capture and print all timer->flags, which allows
      us to decode the target cpu of the timer as well.
      
      We might have used bitfields for this, but that would change the
      static initializers and the init function for no value to accomodate
      big endian bitfields.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Badhri Jagan Sridharan <Badhri@google.com>
      Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      0eeda71b
    • T
      timer: Use hlist for the timer wheel hash buckets · 1dabbcec
      Thomas Gleixner 提交于
      This reduces the size of struct tvec_base by 50% and results in
      slightly smaller code as well.
      
      Before:
         struct tvec_base: size: 8256, cachelines: 129
      
         text	   data	    bss	    dec	    hex	filename
        17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o
      
      After:
        struct tvec_base: 4160, cachelines: 65
      
         text	   data	    bss	    dec	    hex	filename
        17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1dabbcec
    • T
      timer: Remove FIFO "guarantee" · 1bd04bf6
      Thomas Gleixner 提交于
      The FIFO guarantee is only there if two timers are queued into the
      same bucket at the same jiffie on the same cpu:
      
       - The slack value depends on the delta between expiry and enqueue
         time, so the resulting expiry time can be different for timers
         which are queued in different jiffies.
      
       - Timers which are queued into the secondary array end up after a
         later queued timer which was queued into the primary array due to
         cascading.
      
       - Timers can end up on different cpus due to the NOHZ target moving
         around. Obviously there is no guarantee of expiry ordering between
         cpus.
      
      So anything which relies on FIFO behaviour of the timer wheel is
      broken already.
      
      This is a preparatory patch for converting the timer wheel to hlist
      which reduces the memory foot print of the wheel by 50%.
      
      It's a seperate patch so any (unlikely to happen) regression caused by
      this can be identified clearly.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: George Spelvin <linux@horizon.com>
      Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1bd04bf6
    • T
      timers: Sanitize catchup_timer_jiffies() usage · 3bb475a3
      Thomas Gleixner 提交于
      catchup_timer_jiffies() has been applied blindly to several functions
      without looking for possible better ways to do it.
      
      1) internal_add_timer()
      
         Move the update to base->all_timers before we actually insert the
         timer into the wheel.
      
      2) detach_if_pending()
      
         Again the update to base->all_timers allows us to explicitely do
         the timer_jiffies update in place, if this was the last timer which
         got removed.
      
      3) __run_timers()
      
         We only check on entry, which is silly, because base->timer_jiffies
         can be behind - especially on NOHZ kernels - and if there is a
         single deferrable timer somewhere between base->timer_jiffies and
         jiffies we expire it and then loop until base->timer_jiffies ==
         jiffies.
      
         Move it into the loop.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      3bb475a3
    • Z
      sched/deadline: Remove needless parameter in dl_runtime_exceeded() · 6fab5410
      Zhiqiang Zhang 提交于
      Sine commit 269ad801 ("sched/deadline: Avoid double-accounting in
      case of missed deadlines), parameter 'rq' is no longer used, so
      remove it.
      Signed-off-by: NZhiqiang Zhang <zhangzhiqiang.zhang@huawei.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <juri.lelli@gmail.com>
      Cc: <luca.abeni@unitn.it>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1434338120-43773-1-git-send-email-zhangzhiqiang.zhang@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6fab5410
    • W
      sched: Remove superfluous resetting of the p->dl_throttled flag · 6713c3aa
      Wanpeng Li 提交于
      Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going
      to be boosted) is superfluous, as the natural place to do so is in
      replenish_dl_entity().
      
      If the task was on the runqueue and it is boosted by a DL task, it will be enqueued
      back with ENQUEUE_REPLENISH flag set, which can guarantee that dl_throttled is
      reset in replenish_dl_entity().
      
      This patch drops the resetting of throttled status in function rt_mutex_setprio().
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6713c3aa
    • W
      sched/deadline: Drop duplicate init_sched_dl_class() declaration · 178a4d23
      Wanpeng Li 提交于
      There are two init_sched_dl_class() declarations, this patch drops
      the duplicate.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-5-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      178a4d23
    • W
      sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target · 9d514262
      Wanpeng Li 提交于
      This patch adds a check that prevents futile attempts to move DL tasks
      to a CPU with active tasks of equal or earlier deadline. The same
      behavior as commit 80e3d87b ("sched/rt: Reduce rq lock contention
      by eliminating locking of non-feasible target") for rt class.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-3-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9d514262
    • W
      sched/deadline: Make init_sched_dl_class() __init · a6c0e746
      Wanpeng Li 提交于
      It's a bootstrap function, make init_sched_dl_class() __init.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-2-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a6c0e746
    • W
      sched/deadline: Optimize pull_dl_task() · 8b5e770e
      Wanpeng Li 提交于
      pull_dl_task() uses pick_next_earliest_dl_task() to select a migration
      candidate; this is sub-optimal since the next earliest task -- as per
      the regular runqueue -- might not be migratable at all. This could
      result in iterating the entire runqueue looking for a task.
      
      Instead iterate the pushable queue -- this queue only contains tasks
      that have at least 2 cpus set in their cpus_allowed mask.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8b5e770e
    • P
      sched/preempt: Add static_key() to preempt_notifiers · 1cde2930
      Peter Zijlstra 提交于
      Avoid touching the curr->preempt_notifier cacheline when not needed.
      
      Provides a small improvement on pipe-bench:
      
        taskset 01 perf stat --repeat 10 -- perf bench sched pipe
      
      before:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12385.016204      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.34% )
               2,000,023      context-switches          #    0.161 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     175      page-faults               #    0.014 K/sec                    ( +-  0.26% )
          41,376,162,250      cycles                    #    3.341 GHz                      ( +-  0.11% )
          17,389,139,321      stalled-cycles-frontend   #   42.03% frontend cycles idle     ( +-  0.25% )
         <not supported>      stalled-cycles-backend
          68,788,588,003      instructions              #    1.66  insns per cycle
                                                        #    0.25  stalled cycles per insn  ( +-  0.02% )
          13,449,387,620      branches                  # 1085.940 M/sec                    ( +-  0.02% )
              20,880,690      branch-misses             #    0.16% of all branches          ( +-  0.98% )
      
            12.372646094 seconds time elapsed                                          ( +-  0.34% )
      
      after:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12180.936528      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.33% )
               2,000,077      context-switches          #    0.164 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     174      page-faults               #    0.014 K/sec                    ( +-  0.27% )
          40,691,545,577      cycles                    #    3.341 GHz                      ( +-  0.06% )
          16,446,333,371      stalled-cycles-frontend   #   40.42% frontend cycles idle     ( +-  0.18% )
         <not supported>      stalled-cycles-backend
          68,570,100,387      instructions              #    1.69  insns per cycle
                                                        #    0.24  stalled cycles per insn  ( +-  0.01% )
          13,389,740,014      branches                  # 1099.237 M/sec                    ( +-  0.01% )
              20,175,440      branch-misses             #    0.15% of all branches          ( +-  0.52% )
      
            12.169253010 seconds time elapsed                                          ( +-  0.33% )
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1cde2930
    • M
      sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration · d84525a8
      Mathieu Desnoyers 提交于
      preempt_notifier_unregister() documents:
      
        "This is safe to call from within a preemption notifier."
      
      However, both fire_sched_in_preempt_notifiers() and
      fire_sched_out_preempt_notifiers() are using hlist_for_each_entry(),
      which is not safe against entry removal during iteration.
      
      Inspection of the KVM code does not reveal any use of
      preempt_notifier_unregister() within the preempt notifiers.
      
      Therefore, fix the comment.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431881590-1456-1-git-send-email-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d84525a8
    • P
      sched/stop_machine: Fix deadlock between multiple stop_two_cpus() · b17718d0
      Peter Zijlstra 提交于
      Jiri reported a machine stuck in multi_cpu_stop() with
      migrate_swap_stop() as function and with the following src,dst cpu
      pairs: {11,  4} {13, 11} { 4, 13}
      
                              4       11      13
      
      cpuM: queue(4 ,13)
                              *Ma
      cpuN: queue(13,11)
                                      *N      Na
                              *M              Mb
      cpuO: queue(11, 4)
                              *O      Oa
                                      *Nb
                              *Ob
      
      Where *X denotes the cpu running the queueing of cpu-X and X[ab] denotes
      the first/second queued work.
      
      You'll observe the top of the workqueue for each cpu: 4,11,13 to be work
      from cpus: M, O, N resp. IOW. deadlock.
      
      Do away with the queueing trickery and introduce lg_double_lock() to
      lock both CPUs and fully serialize the stop_two_cpus() callers instead
      of the partial (and buggy) serialization we have now.
      Reported-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150605153023.GH19282@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b17718d0
    • S
      sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched · 82a0d276
      Srikar Dronamraju 提交于
      When CONFIG_SCHEDSTATS is enabled, /proc/<pid>/sched prints almost all
      sched statistics except sum_sleep_runtime. Since sum_sleep_runtime is
      a good info to collect, add this it to /proc/<pid>/sched.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-4-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      82a0d276
    • S
      sched/debug: Replace vruntime with wait_sum in /proc/sched_debug · c5f3ab1c
      Srikar Dronamraju 提交于
      Within runnable tasks in /proc/sched_debug, vruntime is printed twice,
      once as tree-key and again as exec-runtime.
      
      Since exec-runtime isnt populated in !CONFIG_SCHEDSTATS, use this field
      to print wait_sum.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c5f3ab1c
    • S
      sched/debug: Properly format runnable tasks in /proc/sched_debug · 33d6176e
      Srikar Dronamraju 提交于
      With !CONFIG_SCHEDSTATS, runnable tasks in /proc/sched_debug has too
      many columns than required. Fix this by printing appropriate columns.
      
      While at this, print sum_exec_runtime, since this information is
      available even in !CONFIG_SCHEDSTATS case.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      33d6176e