1. 11 July 2016, 1 commit
    • Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86" · 44530d58
      Ingo Molnar committed
      This reverts commit 2c95afc1.
      
      Stephane reported the following regression:
      
       > Since Andi added:
       >
       > commit 2c95afc1
       > Author: Andi Kleen <ak@linux.intel.com>
       > Date:   Thu Jun 9 06:14:38 2016 -0700
       >
       >    perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
       >
       > $ perf stat -e ref-cycles ls
       >   <not counted> ....
       >
       > fails systematically because the ref-cycles is now used by the
       > watchdog and given this is a system-wide pinned event, it monopolizes
       > the fixed counter 2 which is the only counter able to measure this event.
      
      Since the next merge window is near, fix the regression for now
      by reverting the commit.
      Reported-by: Stephane Eranian <eranian@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      44530d58
  2. 07 July 2016, 1 commit
    • perf/core: Fix pmu::filter_match for SW-led groups · 2c81a647
      Mark Rutland committed
      The following commit:
      
        66eb579e ("perf: allow for PMU-specific event filtering")
      
      added the pmu::filter_match() callback. This was intended to
      avoid HW constraints on events from resulting in extremely
      pessimistic scheduling.
      
      However, pmu::filter_match() is only called for the leader of each event
      group. When the leader is a SW event, we do not filter the groups, and
      may fail at pmu::add() time, and when this happens we'll give up on
      scheduling any event groups later in the list until they are rotated
      ahead of the failing group.
      
      This can result in extremely sub-optimal event scheduling behaviour,
      e.g. if running the following on a big.LITTLE platform:
      
      $ taskset -c 0 ./perf stat \
       -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
       -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
       ls
      
           <not counted>      context-switches                                              (0.00%)
           <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                      24      context-switches                                              (37.36%)
                57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)
      
      Here the 'a53' event group was always eligible to be scheduled, but
      the 'a57' group never eligible to be scheduled, as the task was always
      affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
      group was eligible, but the HW event failed at pmu::add() time,
      resulting in ctx_flexible_sched_in giving up on scheduling further
      groups with HW events.
      
      One way of avoiding this is to check pmu::filter_match() on siblings
      as well as the group leader. If any of these fail their
      pmu::filter_match() call, we must skip the entire group before
      attempting to add any events.
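
      As a rough, self-contained C sketch of that check (simplified stand-in
      types, not the kernel's struct perf_event or struct pmu definitions),
      the group-level filter could look like this:

        #include <stdbool.h>
        #include <stddef.h>

        struct event;

        struct pmu {
            /* Return true if this event can run on the current CPU. */
            bool (*filter_match)(const struct event *e);
        };

        struct event {
            const struct pmu *pmu;
            struct event *siblings;     /* simplified: array, not a list */
            size_t nr_siblings;
        };

        static bool event_filter_match(const struct event *e)
        {
            /* A PMU without a filter callback matches everything. */
            return !e->pmu->filter_match || e->pmu->filter_match(e);
        }

        /* Schedule the group only if the leader and every sibling match. */
        static bool group_filter_match(const struct event *leader)
        {
            if (!event_filter_match(leader))
                return false;
            for (size_t i = 0; i < leader->nr_siblings; i++)
                if (!event_filter_match(&leader->siblings[i]))
                    return false;
            return true;
        }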
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: 66eb579e ("perf: allow for PMU-specific event filtering")
      Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
      [ Small readability edits. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c81a647
  3. 29 June 2016, 3 commits
  4. 25 June 2016, 3 commits
    • Fix build break in fork.c when THREAD_SIZE < PAGE_SIZE · 9521d399
      Michael Ellerman committed
      Commit b235beea ("Clarify naming of thread info/stack allocators")
      breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
      
        kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
        kernel/fork.c:355:8: error: assignment from incompatible pointer type
          stack = alloc_thread_stack_node(tsk, node);
          ^
      
      Fix it by renaming free_stack() to free_thread_stack(), and updating the
      return type of alloc_thread_stack_node().
      
      Fixes: b235beea ("Clarify naming of thread info/stack allocators")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9521d399
    • oom, suspend: fix oom_reaper vs. oom_killer_disable race · 74070542
      Michal Hocko committed
      Tetsuo has reported the following potential oom_killer_disable vs.
      oom_reaper race:
      
       (1) freeze_processes() starts freezing user space threads.
       (2) Somebody (maybe a kernel thread) calls out_of_memory().
       (3) The OOM killer calls mark_oom_victim() on a user space thread
           P1 which is already in __refrigerator().
       (4) oom_killer_disable() sets oom_killer_disabled = true.
       (5) P1 leaves __refrigerator() and enters do_exit().
       (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
           exit_oom_victim(P1).
       (7) oom_killer_disable() returns while P1 has not yet finished.
       (8) P1 performs IO / interferes with the freezer.
      
      This situation is unfortunate.  We cannot move oom_killer_disable() after
      all the freezable kernel threads are frozen, because the OOM victim might
      depend on some of those kthreads to make forward progress in order to
      exit, so we could deadlock.  It is also far from trivial to teach the
      oom_reaper not to call exit_oom_victim(), because then we would lose the
      forward-progress guarantee of the OOM killer and oom_killer_disable(),
      since exit_mm->mmput might block and never call exit_oom_victim().

      It seems the easiest way forward is to work around this race by calling
      try_to_freeze_tasks() again after oom_killer_disable().  This makes sure
      that all tasks are either frozen or the call bails out.
      
      Fixes: 449d777d ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
      Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74070542
    • Clarify naming of thread info/stack allocators · b235beea
      Linus Torvalds committed
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      
      	ti = alloc_thread_info_node(tsk, node);
      	...
      	tsk->stack = ti;
      
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      
      The ia64 code that actually only has one single allocation (for all of
      task_struct, thread_info and kernel thread stack) now looks a bit odd,
      but since "tsk->stack" is actually not even used there, that oddity
      doesn't matter.  It would be a separate thing to clean that up; I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b235beea
  5. 24 June 2016, 6 commits
    • sched/core: Allow kthreads to fall back to online && !active cpus · feb245e3
      Tejun Heo committed
      During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
      online but not active.  A CPU_ONLINE callback may create or bind a
      kthread so that its cpus_allowed mask only allows the CPU which is
      being brought online.  The kthread may start executing before the CPU
      is made active and can end up in select_fallback_rq().
      
      In such cases, the expected behavior is selecting the CPU which is
      coming online; however, because select_fallback_rq() only chooses from
      active CPUs, it determines that the task doesn't have any viable CPU
      in its allowed mask and ends up overriding it to cpu_possible_mask.
      
      CPU_ONLINE callbacks should be able to put kthreads on the CPU which
      is coming online.  Update select_fallback_rq() so that it follows
      cpu_online() rather than cpu_active() for kthreads.
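
      A minimal, self-contained sketch of that selection rule (stand-in
      online/active bitmaps and a hypothetical pick_fallback_cpu() helper,
      not the kernel's select_fallback_rq()):

        #include <stdbool.h>

        #define NR_CPUS 8

        static bool cpu_online_map[NR_CPUS];
        static bool cpu_active_map[NR_CPUS];

        /*
         * Pick a fallback CPU from 'allowed'.  Kernel threads may use CPUs
         * that are online but not yet active (the CPU_ONLINE callback case);
         * user tasks must stick to active CPUs.
         */
        static int pick_fallback_cpu(const bool allowed[NR_CPUS], bool is_kthread)
        {
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!allowed[cpu])
                    continue;
                if (is_kthread ? cpu_online_map[cpu] : cpu_active_map[cpu])
                    return cpu;
            }
            return -1;    /* caller would fall back to a wider mask */
        }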
      Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      feb245e3
    • sched/fair: Do not announce throttled next buddy in dequeue_task_fair() · 754bd598
      Konstantin Khlebnikov committed
      The hierarchy could already be throttled at this point. A throttled next
      buddy could trigger a NULL pointer dereference in pick_next_task_fair().
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      754bd598
    • sched/fair: Initialize throttle_count for new task-groups lazily · 094f4691
      Konstantin Khlebnikov committed
      A cgroup created inside a throttled group must inherit the current
      throttle_count. A broken throttle_count allows throttled entries to be
      nominated as the next buddy, which later leads to a NULL pointer
      dereference in pick_next_task_fair().

      This patch initializes cfs_rq->throttle_count at the first enqueue:
      this laziness allows us to skip locking all runqueues at group creation.
      The lazy approach also allows skipping a full sub-tree scan when
      throttling the hierarchy (not in this patch).
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      094f4691
    • locking/static_key: Fix concurrent static_key_slow_inc() · 4c5ea0a9
      Paolo Bonzini committed
      The following scenario is possible:
      
          CPU 1                                   CPU 2
          static_key_slow_inc()
           atomic_inc_not_zero()
            -> key.enabled == 0, no increment
           jump_label_lock()
           atomic_inc_return()
            -> key.enabled == 1 now
                                                  static_key_slow_inc()
                                                   atomic_inc_not_zero()
                                                    -> key.enabled == 1, inc to 2
                                                   return
                                                  ** static key is wrong!
           jump_label_update()
           jump_label_unlock()
      
      Testing the static key at the point marked by (**) will follow the
      wrong path for jumps that have not been patched yet.  This can
      actually happen when creating many KVM virtual machines with userspace
      LAPIC emulation; just run several copies of the following program:
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>
      
          int main(void)
          {
              for (;;) {
                  int kvmfd = open("/dev/kvm", O_RDONLY);
                  int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
                  close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
                  close(vmfd);
                  close(kvmfd);
              }
              return 0;
          }
      
      Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
      The static key's purpose is to skip NULL pointer checks and indeed one
      of the processes eventually dereferences NULL.
      
      As explained in the commit that introduced the bug:
      
        706249c2 ("locking/static_keys: Rework update logic")
      
      jump_label_update() needs key.enabled to be true.  The solution adopted
      here is to temporarily make key.enabled == -1, and to go down the
      slow path when key.enabled <= 0.
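
      A minimal, self-contained C11 model of that scheme (an atomic int plus a
      mutex standing in for key.enabled and the jump-label machinery; not the
      kernel's actual static_key code):

        #include <pthread.h>
        #include <stdatomic.h>

        static atomic_int key_enabled;               /* models key.enabled  */
        static pthread_mutex_t jump_label_mutex = PTHREAD_MUTEX_INITIALIZER;

        static void patch_jump_sites(void) { /* models jump_label_update() */ }

        static void static_key_slow_inc_model(void)
        {
            int v = atomic_load(&key_enabled);

            /* Fast path: only valid once the key is fully enabled (> 0). */
            while (v > 0) {
                if (atomic_compare_exchange_weak(&key_enabled, &v, v + 1))
                    return;
            }

            pthread_mutex_lock(&jump_label_mutex);
            if (atomic_load(&key_enabled) == 0) {
                /* -1 marks "being enabled": others must take the slow path. */
                atomic_store(&key_enabled, -1);
                patch_jump_sites();
                atomic_store(&key_enabled, 1);
            } else {
                atomic_fetch_add(&key_enabled, 1);
            }
            pthread_mutex_unlock(&jump_label_mutex);
        }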
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v4.3+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 706249c2 ("locking/static_keys: Rework update logic")
      Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
      [ Small stylistic edits to the changelog and the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4c5ea0a9
    • cgroup: Disable IRQs while holding css_set_lock · 82d6489d
      Daniel Bristot de Oliveira committed
      While testing the deadline scheduler + cgroup setup I hit this
      warning.
      
      [  132.612935] ------------[ cut here ]------------
      [  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
      [  132.612952] Modules linked in: (a ton of modules...)
      [  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
      [  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
      [  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
      [  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
      [  132.612986] Call Trace:
      [  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
      [  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
      [  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
      [  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
      [  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
      [  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
      [  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
      [  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
      [  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
      [  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
      [  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
      [  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
      [  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
      [  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
      [  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
      [  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
      [  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
      [  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
      [  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
      [  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
      [  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
      [  132.613049] ---[ end trace f91934d162ce9977 ]---
      
      The warning is triggered by the spin_(lock|unlock)_bh(&css_set_lock)
      calls in interrupt context. Convert the spin_lock_bh calls to
      spin_lock_irq(save) to avoid this problem - and other problems that
      come with sharing a spinlock with an interrupt.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: cgroups@vger.kernel.org
      Cc: stable@vger.kernel.org # 4.5+
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      82d6489d
    • locking: avoid passing around 'thread_info' in mutex debugging code · 6720a305
      Linus Torvalds committed
      None of the code actually wants a thread_info, it all wants a
      task_struct, and it's just converting back and forth between the two
      ("ti->task" to get the task_struct from the thread_info, and
      "task_thread_info(task)" to go the other way).
      
      No semantic change.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6720a305
  6. 20 June 2016, 2 commits
    • tracing: Handle NULL formats in hold_module_trace_bprintk_format() · 70c8217a
      Steven Rostedt (Red Hat) committed
      If a task uses a non-constant string for the format parameter in
      trace_printk(), then the trace_printk_fmt variable is set to NULL. This
      variable is then saved in the __trace_printk_fmt section.
      
      The function hold_module_trace_bprintk_format() checks to see if duplicate
      formats are used by modules, and reuses them if so (saves them to the list
      if it is new). But this function calls lookup_format() that does a strcmp()
      to the value (which is now NULL) and can cause a kernel oops.
      
      This wasn't an issue till 3debb0a9 ("tracing: Fix trace_printk() to print
      when not using bprintk()") which added "__used" to the trace_printk_fmt
      variable, and before that, the kernel simply optimized it out (no NULL value
      was saved).
      
      The fix is simply to handle the NULL pointer in lookup_format() and have the
      caller ignore the value if it was NULL.
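
      In outline, the guard amounts to something like this self-contained
      sketch (a simplified, hypothetical lookup_format(); the real code walks
      the list of saved module formats):

        #include <stddef.h>
        #include <string.h>

        struct trace_bprintk_fmt {
            struct trace_bprintk_fmt *next;
            const char *fmt;
        };

        /*
         * Return the saved entry matching fmt, or NULL if it is not in the
         * list yet.  A NULL fmt (non-constant format string) never matches
         * anything, so the caller simply skips it.
         */
        static struct trace_bprintk_fmt *
        lookup_format(struct trace_bprintk_fmt *head, const char *fmt)
        {
            if (!fmt)
                return NULL;

            for (struct trace_bprintk_fmt *pos = head; pos; pos = pos->next)
                if (!strcmp(pos->fmt, fmt))
                    return pos;
            return NULL;
        }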
      
      Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com
      Reported-by: xingzhen <zhengjun.xing@intel.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Fixes: 3debb0a9 ("tracing: Fix trace_printk() to print when not using bprintk()")
      Cc: stable@vger.kernel.org # v3.5+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      70c8217a
    • sched/fair: Fix cfs_rq avg tracking underflow · 89741892
      Peter Zijlstra committed
      As per commit:
      
        b7fa30c9 ("sched/fair: Fix post_init_entity_util_avg() serialization")
      
      > the code generated from update_cfs_rq_load_avg():
      >
      > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      > 		removed_load = 1;
      > 	}
      >
      > turns into:
      >
      > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
      > ffffffff8108706b:       48 85 c0                test   %rax,%rax
      > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
      > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
      > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
      > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
      > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
      > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
      > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
      > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      >
      > Which you'll note ends up with sa->load_avg -= r in memory at
      > ffffffff8108707a.
      
      So I _should_ have looked at other unserialized users of ->load_avg,
      but alas. Luckily nikbor reported a similar /0 from task_h_load() which
      instantly triggered recollection of this here problem.
      
      Aside from the intermediate value hitting memory and causing problems,
      there's another problem: the underflow detection relies on the signed
      bit. This reduces the effective width of the variables, IOW it's
      effectively the same as having these variables be of signed type.
      
      This patch changes to a different means of unsigned underflow
      detection to not rely on the signed bit. This allows the variables to
      use the 'full' unsigned range. And it does so with explicit LOAD -
      STORE to ensure any intermediate value will never be visible in
      memory, allowing these unserialized loads.
      
      Note: GCC generates crap code for this, might warrant a look later.
      
      Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
             maybe we should do clamping on add too.
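
      The underflow handling described above amounts to a pattern like the
      following self-contained sketch (a simplified function model of the
      approach the patch describes; READ_ONCE/WRITE_ONCE are approximated
      here with volatile accesses):

        #include <stdint.h>

        #define READ_ONCE(x)      (*(volatile typeof(x) *)&(x))
        #define WRITE_ONCE(x, v)  (*(volatile typeof(x) *)&(x) = (v))

        /*
         * Subtract 'val' from '*ptr', clamping at zero without relying on
         * the sign bit: an unsigned subtraction underflowed iff the result
         * is larger than the original value.  A single explicit store keeps
         * any intermediate value from ever being visible in memory.
         */
        static void sub_positive_u64(uint64_t *ptr, uint64_t val)
        {
            uint64_t old = READ_ONCE(*ptr);
            uint64_t res = old - val;

            if (res > old)          /* underflow: clamp to 0 */
                res = 0;
            WRITE_ONCE(*ptr, res);
        }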
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: kernel@kyup.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      89741892
  7. 17 June 2016, 1 commit
    • cgroup: set css->id to -1 during init · 8fa3b8d6
      Tejun Heo committed
      If percpu_ref initialization fails during css_create(), the free path
      can end up trying to free css->id of zero.  As ID 0 is unused, it
      doesn't cause a critical breakage but it does trigger a warning
      message.  Fix it by setting css->id to -1 from init_and_link_css().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Wenwei Tao <ww.tao0320@gmail.com>
      Fixes: 01e58659 ("cgroup: release css->id after css_free")
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      8fa3b8d6
  8. 16 June 2016, 2 commits
  9. 15 June 2016, 1 commit
    • kernel/kcov: unproxify debugfs file's fops · df4565f9
      Nicolai Stange committed
      Since commit 49d200de ("debugfs: prevent access to removed files'
      private data"), a debugfs file's file_operations methods get proxied
      through lifetime aware wrappers.
      
      However, only a certain subset of the file_operations members is supported
      by debugfs and ->mmap isn't among them -- it appears to be NULL from the
      VFS layer's perspective.
      
      This behaviour breaks the /sys/kernel/debug/kcov file introduced
      concurrently with commit 5c9a8750 ("kernel: add kcov code coverage").
      
      Since that file never gets removed, there is no file removal race and thus,
      a lifetime checking proxy isn't needed.
      
      Avoid the proxying for /sys/kernel/debug/kcov by creating it via
      debugfs_create_file_unsafe() rather than debugfs_create_file().
      
      Fixes: 49d200de ("debugfs: prevent access to removed files' private data")
      Fixes: 5c9a8750 ("kernel: add kcov code coverage")
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      df4565f9
  10. 14 June 2016, 4 commits
    • kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w · 57675cb9
      Andrey Ryabinin committed
      Lengthy sysrq-w output may take a lot of time on a slow serial console.

      Currently we reset the NMI watchdog on the current CPU to avoid spurious
      lockup messages. Sometimes this doesn't work, since the softlockup
      watchdog might trigger on another CPU which is waiting for an IPI to
      proceed. We reset softlockup watchdogs on all CPUs, but we do this only
      after listing all tasks, and this may be too late on a busy system.

      So, reset the watchdogs on all CPUs earlier, in the
      for_each_process_thread() loop.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      57675cb9
    • sched/debug: Fix deadlock when enabling sched events · eda8dca5
      Josh Poimboeuf committed
      I see a hang when enabling sched events:
      
        echo 1 > /sys/kernel/debug/tracing/events/sched/enable
      
      The printk buffer shows:
      
        BUG: spinlock recursion on CPU#1, swapper/1/0
         lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
        ...
        Call Trace:
         <IRQ>  [<ffffffff8143d663>] dump_stack+0x85/0xc2
         [<ffffffff81115948>] spin_dump+0x78/0xc0
         [<ffffffff81115aea>] do_raw_spin_lock+0x11a/0x150
         [<ffffffff81891471>] _raw_spin_lock+0x61/0x80
         [<ffffffff810e5466>] ? try_to_wake_up+0x256/0x4e0
         [<ffffffff810e5466>] try_to_wake_up+0x256/0x4e0
         [<ffffffff81891a0a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
         [<ffffffff810e5705>] wake_up_process+0x15/0x20
         [<ffffffff810cebb4>] insert_work+0x84/0xc0
         [<ffffffff810ced7f>] __queue_work+0x18f/0x660
         [<ffffffff810cf9a6>] queue_work_on+0x46/0x90
         [<ffffffffa00cd95b>] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
         [<ffffffffa00cdac0>] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
         [<ffffffff814babcd>] soft_cursor+0x1ad/0x230
         [<ffffffff814ba379>] bit_cursor+0x649/0x680
         [<ffffffff814b9d30>] ? update_attr.isra.2+0x90/0x90
         [<ffffffff814b5e6a>] fbcon_cursor+0x14a/0x1c0
         [<ffffffff81555ef8>] hide_cursor+0x28/0x90
         [<ffffffff81558b6f>] vt_console_print+0x3bf/0x3f0
         [<ffffffff81122c63>] call_console_drivers.constprop.24+0x183/0x200
         [<ffffffff811241f4>] console_unlock+0x3d4/0x610
         [<ffffffff811247f5>] vprintk_emit+0x3c5/0x610
         [<ffffffff81124bc9>] vprintk_default+0x29/0x40
         [<ffffffff811e965b>] printk+0x57/0x73
         [<ffffffff810f7a9e>] enqueue_entity+0xc2e/0xc70
         [<ffffffff810f7b39>] enqueue_task_fair+0x59/0xab0
         [<ffffffff8106dcd9>] ? kvm_sched_clock_read+0x9/0x20
         [<ffffffff8103fb39>] ? sched_clock+0x9/0x10
         [<ffffffff810e3fcc>] activate_task+0x5c/0xa0
         [<ffffffff810e4514>] ttwu_do_activate+0x54/0xb0
         [<ffffffff810e5cea>] sched_ttwu_pending+0x7a/0xb0
         [<ffffffff810e5e51>] scheduler_ipi+0x61/0x170
         [<ffffffff81059e7f>] smp_trace_reschedule_interrupt+0x4f/0x2a0
         [<ffffffff81893ba6>] trace_reschedule_interrupt+0x96/0xa0
         <EOI>  [<ffffffff8106e0d6>] ? native_safe_halt+0x6/0x10
         [<ffffffff8110fb1d>] ? trace_hardirqs_on+0xd/0x10
         [<ffffffff81040ac0>] default_idle+0x20/0x1a0
         [<ffffffff8104147f>] arch_cpu_idle+0xf/0x20
         [<ffffffff81102f8f>] default_idle_call+0x2f/0x50
         [<ffffffff8110332e>] cpu_startup_entry+0x37e/0x450
         [<ffffffff8105af70>] start_secondary+0x160/0x1a0
      
      Note the hang only occurs when echoing the above from a physical serial
      console, not from an ssh session.
      
      The bug is caused by a deadlock where the task tries to grab the rq
      lock twice, because printk() calls aren't safe in scheduler code.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eda8dca5
    • perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 · 2c95afc1
      Andi Kleen committed
      The NMI watchdog uses either the fixed cycles or a generic cycles
      counter. This causes a lot of conflicts with users of the PMU who want
      to run a full group including the cycles fixed counter, for example
      the --topdown support recently added to perf stat. The code needs to
      fall back to not use groups, which can cause measurement inaccuracy
      due to multiplexing errors.
      
      This patch switches the NMI watchdog to use reference cycles
      on Intel systems.  This is actually more accurate than cycles,
      because cycles can tick faster than the measured CPU frequency
      due to Turbo mode.
      
      The ref cycles always tick at their frequency, or slower when
      the system is idling. That means the NMI watchdog can never
      expire too early, unlike with cycles.
      
      The reference cycles tick roughly at the frequency of the TSC,
      so the same period computation can be used.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Link: http://lkml.kernel.org/r/1465478079-19993-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c95afc1
    • sched/fair: Fix post_init_entity_util_avg() serialization · b7fa30c9
      Peter Zijlstra committed
      Chris Wilson reported a divide by 0 at:
      
       post_init_entity_util_avg():
      
       >    725	if (cfs_rq->avg.util_avg != 0) {
       >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
       > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
       >    728
       >    729		if (sa->util_avg > cap)
       >    730			sa->util_avg = cap;
       >    731	} else {
      
      Which, given the lack of serialization and the code generated from
      update_cfs_rq_load_avg(), is entirely possible:
      
      	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      		removed_load = 1;
      	}
      
      turns into:
      
        ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
        ffffffff8108706b:       48 85 c0                test   %rax,%rax
        ffffffff8108706e:       74 40                   je     ffffffff810870b0
        ffffffff81087070:       4c 89 f8                mov    %r15,%rax
        ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
        ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
        ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
        ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
        ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
        ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      
      Which you'll note ends up with 'sa->load_avg - r' in memory at
      ffffffff8108707a.
      
      By calling post_init_entity_util_avg() under rq->lock we're sure to be
      fully serialized against PELT updates and cannot observe intermediate
      state like this.
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 2b8c41da ("sched/fair: Initiate a new task's util avg to a bounded value")
      Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b7fa30c9
  11. 11 June 2016, 1 commit
  12. 10 June 2016, 1 commit
  13. 09 June 2016, 1 commit
    • futex: Calculate the futex key based on a tail page for file-based futexes · 077fa7ae
      Mel Gorman committed
      Mike Galbraith reported that the LTP test case futex_wake04 was broken
      by commit 65d8fc77 ("futex: Remove requirement for lock_page()
      in get_futex_key()").
      
      This test case uses futexes backed by hugetlbfs pages and so there is an
      associated inode with a futex stored on such pages. The problem is that
      the key is being calculated based on the head page index of the hugetlbfs
      page and not the tail page.
      
      Prior to the optimisation, the page lock was used to stabilise mappings
      and pin the inode if file-backed, which is overkill. If the page was a
      compound page, the head page was automatically looked up as part of the
      page lock operation, but the tail page index was used to calculate the
      futex key.
      
      After the optimisation, the compound head is looked up early and the page
      lock is only relied upon to identify truncated pages, special pages or a
      shmem page moving to swapcache. The head page is looked up because without
      the page lock, special care has to be taken to pin the inode correctly.
      However, the tail page is still required to calculate the futex key so
      this patch records the tail page.
      
      On vanilla 4.6, the output of the test case is:
      
      futex_wake04    0  TINFO  :  Hugepagesize 2097152
      futex_wake04    1  TFAIL  :  futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.
      
      With the patch applied
      
      futex_wake04    0  TINFO  :  Hugepagesize 2097152
      futex_wake04    1  TPASS  :  Hi hydra, thread2 awake!
      
      Fixes: 65d8fc77 ("futex: Remove requirement for lock_page() in get_futex_key()")
      Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      077fa7ae
  14. 08 June 2016, 6 commits
    • sched/debug: Fix 'schedstats=enable' cmdline option · 4698f88c
      Josh Poimboeuf committed
      The 'schedstats=enable' option doesn't work, and also produces the
      following warning during boot:
      
        WARNING: CPU: 0 PID: 0 at /home/jpoimboe/git/linux/kernel/jump_label.c:61 static_key_slow_inc+0x8c/0xa0
        static_key_slow_inc used before call to jump_label_init
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc1+ #25
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         0000000000000086 3ae3475a4bea95d4 ffffffff81e03da8 ffffffff8143fc83
         ffffffff81e03df8 0000000000000000 ffffffff81e03de8 ffffffff810b1ffb
         0000003d00000096 ffffffff823514d0 ffff88007ff197c8 0000000000000000
        Call Trace:
         [<ffffffff8143fc83>] dump_stack+0x85/0xc2
         [<ffffffff810b1ffb>] __warn+0xcb/0xf0
         [<ffffffff810b207f>] warn_slowpath_fmt+0x5f/0x80
         [<ffffffff811e9c0c>] static_key_slow_inc+0x8c/0xa0
         [<ffffffff810e07c6>] static_key_enable+0x16/0x40
         [<ffffffff8216d633>] setup_schedstats+0x29/0x94
         [<ffffffff82148a05>] unknown_bootoption+0x89/0x191
         [<ffffffff810d8617>] parse_args+0x297/0x4b0
         [<ffffffff82148d61>] start_kernel+0x1d8/0x4a9
         [<ffffffff8214897c>] ? set_init_arg+0x55/0x55
         [<ffffffff82148120>] ? early_idt_handler_array+0x120/0x120
         [<ffffffff821482db>] x86_64_start_reservations+0x2f/0x31
         [<ffffffff82148427>] x86_64_start_kernel+0x14a/0x16d
      
      The problem is that it tries to update the 'sched_schedstats' static key
      before jump labels have been initialized.
      
      Changing jump_label_init() to be called earlier before
      parse_early_param() wouldn't fix it: it would still fail trying to
      poke_text() because mm isn't yet initialized.
      
      Instead, just create a temporary '__sched_schedstats' variable which can
      be copied to the static key later during sched_init() after jump labels
      have been initialized.
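
      In outline, the approach looks like this self-contained sketch (plain
      booleans standing in for the static key and the early-param hook; apart
      from __sched_schedstats and setup_schedstats(), the names here are
      illustrative, not the exact kernel symbols):

        #include <stdbool.h>
        #include <string.h>

        static bool __sched_schedstats;      /* plain bool, safe before jump labels */
        static bool sched_schedstats_key;    /* stands in for the static key        */

        /* Early command-line parsing: only record the request. */
        static int setup_schedstats(const char *str)
        {
            if (!strcmp(str, "enable"))
                __sched_schedstats = true;
            else if (!strcmp(str, "disable"))
                __sched_schedstats = false;
            else
                return -1;
            return 0;
        }

        /* Called later from sched_init(), once jump labels are usable. */
        static void sched_init_schedstats_model(void)
        {
            sched_schedstats_key = __sched_schedstats;
        }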
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/453775fe3433bed65731a583e228ccea806d18cd.1465322027.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4698f88c
    • sched/debug: Fix /proc/sched_debug regression · 9c572591
      Josh Poimboeuf committed
      Commit:
      
        cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      
      ... introduced a bug when CONFIG_SCHEDSTATS is enabled and the
      runtime tunable is disabled (which is the default).
      
      The wait-time, sum-exec, and sum-sleep fields are missing from the
      /proc/sched_debug file in the runnable_tasks section.
      
      Fix it with a new schedstat_val() macro which returns the field value
      when schedstats is enabled and zero otherwise.  The macro works with
      both SCHEDSTATS and !SCHEDSTATS.  I put the macro in stats.h since it
      might end up being useful in other places.
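
      A self-contained sketch of such a macro (a stand-in config switch and
      runtime flag; the real stats.h definition differs in detail):

        #include <stdbool.h>
        #include <stdio.h>

        #define CONFIG_SCHEDSTATS 1          /* flip to 0 to model !SCHEDSTATS */

        static bool schedstat_enabled_flag;  /* models the runtime tunable */

        #if CONFIG_SCHEDSTATS
        /* Evaluates to the field when schedstats is on, 0 otherwise. */
        # define schedstat_val(field)  (schedstat_enabled_flag ? (field) : 0)
        #else
        # define schedstat_val(field)  0
        #endif

        struct sched_statistics { unsigned long long wait_sum; };

        int main(void)
        {
            struct sched_statistics st = { .wait_sum = 42 };

            /* Prints 0 while the tunable is off, 42 once it is enabled. */
            printf("wait_sum: %llu\n",
                   (unsigned long long)schedstat_val(st.wait_sum));
            return 0;
        }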
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/bcda7c2790cf2ccbe586a28c02dd7b6fe7749a2b.1464994423.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c572591
    • perf/core: Remove a redundant check · 62a92c8f
      Alexander Shishkin committed
      There is no way to end up in _free_event() with event::pmu being NULL.
      The latter is initialized in the event allocation path and remains set
      forever. In case of allocation failure, the error path doesn't use
      _free_event().
      
      Having the check, however, suggests that it is possible to have an
      event::pmu == NULL situation in _free_event(), and it confuses the robots.
      
      This patch gets rid of the check.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: eranian@google.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1465303455-26032-1-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62a92c8f
    • locking/qspinlock: Fix spin_unlock_wait() some more · 2c610022
      Peter Zijlstra committed
      While this prior commit:
      
        54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      
      ... fixes spin_is_locked() and spin_unlock_wait() for the usage
      in ipc/sem and netfilter, it does not in fact work right for the
      usage in task_work and futex.
      
      So while the 2 locks crossed problem:
      
      	spin_lock(A)		spin_lock(B)
      	if (!spin_is_locked(B)) spin_unlock_wait(A)
      	  foo()			foo();
      
      ... works with the smp_mb() injected by both spin_is_locked() and
      spin_unlock_wait(), this is not sufficient for:
      
      	flag = 1;
      	smp_mb();		spin_lock()
      	spin_unlock_wait()	if (!flag)
      				  // add to lockless list
      	// iterate lockless list
      
      ... because in this scenario, the store from spin_lock() can be delayed
      past the load of flag, uncrossing the variables and losing the
      guarantee.
      
      This patch reworks spin_is_locked() and spin_unlock_wait() to work in
      both cases by exploiting the observation that while the lock byte
      store can be delayed, the contender must have registered itself
      visibly in other state contained in the word.
      
      It also allows for architectures to override both functions, as PPC
      and ARM64 have an additional issue for which we currently have no
      generic solution.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Giovanni Gherdovich <ggherdovich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org # v4.2 and later
      Fixes: 54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c610022
    • perf/core: Fix crash due to account/unaccount_sb_event() inconsistency · a4f144eb
      David Carrillo-Cisneros committed
      unaccount_pmu_sb_event() did not check for attributes in event->attr
      before calling detach_sb_event(), while account_pmu_event() did.
      
      This caused a NULL pointer dereference in cgroup events that did not
      have any of the attributes checked by account_pmu_event().
      
      To trigger the bug just wait for a cgroup event to terminate, e.g.:
      
        $ mkdir /dev/cgroup/devices/test
        $ perf stat -e cycles -a -G test sleep 0
      
      ... see crash ...
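
      The symmetry between the two paths can be kept with a single shared
      predicate, roughly as in this self-contained sketch (hypothetical names
      and fields, not the kernel's account/unaccount code):

        #include <stdbool.h>

        struct event_attr {
            bool mmap, comm, task, context_switch;
            bool is_cgroup;              /* simplified stand-in */
        };

        /* One predicate decides whether an event is on the side-band list. */
        static bool needs_sb_list(const struct event_attr *attr)
        {
            return attr->mmap || attr->comm || attr->task ||
                   attr->context_switch || attr->is_cgroup;
        }

        static void attach_sb_event(struct event_attr *attr) { (void)attr; }
        static void detach_sb_event(struct event_attr *attr) { (void)attr; }

        static void account_sb_event(struct event_attr *attr)
        {
            if (needs_sb_list(attr))
                attach_sb_event(attr);
        }

        /* Must mirror account_sb_event(): never detach what was not attached. */
        static void unaccount_sb_event(struct event_attr *attr)
        {
            if (needs_sb_list(attr))
                detach_sb_event(attr);
        }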
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Reviewed-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zheng <zheng.z.yan@intel.com>
      Link: http://lkml.kernel.org/r/1464809585-66072-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a4f144eb
    • bpf, trace: use READ_ONCE for retrieving file ptr · 5b6c1b4d
      Daniel Borkmann committed
      In bpf_perf_event_read() and bpf_perf_event_output(), we must use
      READ_ONCE() for fetching the struct file pointer, which could get
      updated concurrently, so we must prevent the compiler from potential
      refetching.
      
      We already do this with tail calls for fetching the related bpf_prog,
      but not so on stored perf events. Semantics for both are the same
      with regards to updates.
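
      The pattern is essentially the following self-contained model (READ_ONCE
      is approximated here with a volatile access rather than the kernel's
      compiler.h definition; the array is a stand-in for the perf event map):

        #include <stddef.h>

        #define READ_ONCE(x)  (*(volatile typeof(x) *)&(x))

        struct file;                 /* opaque stand-in */

        struct event_array {
            struct file *ptrs[64];   /* may be updated concurrently */
        };

        /* Fetch the slot exactly once so the compiler cannot re-load it. */
        static struct file *get_event_file(struct event_array *arr, size_t idx)
        {
            struct file *file = READ_ONCE(arr->ptrs[idx]);

            if (!file)
                return NULL;
            return file;             /* every later use sees this one snapshot */
        }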
      
      Fixes: a43eec30 ("bpf: introduce bpf_perf_event_output() helper")
      Fixes: 35578d79 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b6c1b4d
  15. 03 June 2016, 6 commits
    • perf/abi: Change the errno for sampling event not supported in hardware · a1396555
      Vineet Gupta committed
      Change the return code for a sampling event that is not supported by the
      hardware from -ENOTSUPP to -EOPNOTSUPP.
      
      This allows userspace to identify this case specifically, instead of
      printing the catch-all error message it did previously.
      
      Technically this is an ABI change, but we think we can get away
      with it.
      
      Old behavior:
       -------
       | # perf record ls
       | Error:
       | The sys_perf_event_open() syscall returned with 524 (Unknown error 524)
       | for event (cycles:ppp).
       | /bin/dmesg may provide additional information.
       | No CONFIG_PERF_EVENTS=y kernel support configured?
      
      New behavior:
       -------
       | # perf record ls
       | Error:
       | PMU Hardware doesn't support sampling/overflow-interrupts.
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <acme@redhat.com>
      Cc: <linux-snps-arc@lists.infradead.org>
      Cc: <vincent.weaver@maine.edu>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Link: http://lkml.kernel.org/r/1462786660-2900-3-git-send-email-vgupta@synopsys.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1396555
    • perf/core: Fix implicitly enable dynamic interrupt throttle · ab7fdefb
      Kan Liang committed
      This patch fixes an issue which was introduced by commit:
      
        91a612ee ("perf/core: Fix dynamic interrupt throttle")
      
      ... which unconditionally sets the perf_sample_allowed_ns value
      to !0. But that could trigger a bug in the following corner case:
      
      The user can disable the dynamic interrupt throttle mechanism by setting
      perf_cpu_time_max_percent to 0. Then they change perf_event_max_sample_rate.
      For this case, the mechanism will be enabled implicitly, because
      perf_sample_allowed_ns becomes !0 - which is not what we want.
      
      This patch only updates perf_sample_allowed_ns when the dynamic
      interrupt throttle mechanism is enabled.
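
      In outline (a self-contained sketch with stand-in globals and
      arithmetic; the helper name is illustrative, not the kernel's perf
      sysctl handler):

        /* Stand-ins for the sysctls and the derived throttle budget. */
        static int perf_cpu_time_max_percent = 25;
        static int perf_event_max_sample_rate = 100000;
        static unsigned long long perf_sample_allowed_ns;

        static void update_sample_budget_model(void)
        {
            unsigned long long tmp = 1000000000ULL / perf_event_max_sample_rate;

            /*
             * Keep the throttle disabled (0) when the user turned it off by
             * setting perf_cpu_time_max_percent to 0; only compute a non-zero
             * budget when the mechanism is actually enabled.
             */
            if (perf_cpu_time_max_percent)
                tmp = tmp * perf_cpu_time_max_percent / 100;
            else
                tmp = 0;

            perf_sample_allowed_ns = tmp;
        }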
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1462260366-3160-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ab7fdefb
    • perf/core: Rename the perf_event_aux*() APIs to perf_event_sb*(), to separate them from AUX ring-buffer records · aab5b71e
      Peter Zijlstra committed
      
      There are now two different things called AUX in perf, the
      infrastructure to deliver the mmap/comm/task records and the
      AUX part in the mmap buffer (with associated AUX_RECORD).
      
      Since the former is internal, rename it to side-band to reduce
      the confusion factor.
      
      No change in functionality.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aab5b71e
    • perf/core: Optimize side-band event delivery · f2fb6bef
      Kan Liang committed
      The perf_event_aux() function iterates all PMUs and all events in
      their respective per-CPU contexts to find the events to deliver
      side-band records to.
      
      For example, the brk test case in lkp triggers many mmap() operations,
      which, if we're also running perf, results in many perf_event_aux()
      invocations.
      
      If we enable uncore PMU support (even when uncore events are not used),
      dozens of uncore PMUs will be iterated, which can significantly
      decrease brk_test's throughput.
      
      For example, the brk throughput:
      
        without uncore PMUs: 2647573 ops_per_sec
        with    uncore PMUs: 1768444 ops_per_sec
      
      ... a 33% reduction.
      
      To get at the per-CPU events that need side-band records, this patch
      puts these events on a per-CPU list, this avoids iterating the PMUs
      and any events that do not need side-band records.
      
      Per task events are unchanged to avoid extra overhead on the context
      switch paths.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: Huang, Ying <ying.huang@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2fb6bef
    • locking/ww_mutex: Report recursive ww_mutex locking early · 0422e83d
      Chris Wilson committed
      Recursive locking for ww_mutexes was originally conceived as an
      exception. However, it is heavily used by the DRM atomic modesetting
      code. Currently, the recursive deadlock is checked after we have queued
      up for a busy-spin and as we never release the lock, we spin until
      kicked, whereupon the deadlock is discovered and reported.
      
      A simple solution for the now common problem is to move the recursive
      deadlock discovery to the first action when taking the ww_mutex.
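
      Conceptually, the early check looks like this self-contained sketch
      (simplified stand-in types and a hypothetical slow path; not the
      kernel's ww_mutex implementation):

        #include <errno.h>
        #include <stddef.h>

        struct ww_ctx   { int dummy; };
        struct ww_mutex { struct ww_ctx *ctx; };  /* ctx of the current holder */

        static int slowpath_lock(struct ww_mutex *lock, struct ww_ctx *ctx);

        static int ww_mutex_lock_model(struct ww_mutex *lock, struct ww_ctx *ctx)
        {
            /*
             * Detect recursive locking up front, before any busy-spinning:
             * the same acquire context already holds this mutex, so report
             * it immediately instead of spinning until kicked.
             */
            if (ctx && lock->ctx == ctx)
                return -EALREADY;

            return slowpath_lock(lock, ctx);
        }

        static int slowpath_lock(struct ww_mutex *lock, struct ww_ctx *ctx)
        {
            lock->ctx = ctx;   /* placeholder for the real acquisition path */
            return 0;
        }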
      Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0422e83d
    • cpuidle: Do not access cpuidle_devices when !CONFIG_CPU_IDLE · 9bd616e3
      Catalin Marinas committed
      The cpuidle_devices per-CPU variable is only defined when CPU_IDLE is
      enabled. Commit c8cc7d4d ("sched/idle: Reorganize the idle loop")
      removed the #ifdef CONFIG_CPU_IDLE around cpuidle_idle_call() with the
      compiler optimising away __this_cpu_read(cpuidle_devices). However, with
      CONFIG_UBSAN && !CONFIG_CPU_IDLE, this optimisation no longer happens
      and the kernel fails to link since cpuidle_devices is not defined.
      
      This patch introduces an accessor function for the current CPU cpuidle
      device (returning NULL when !CONFIG_CPU_IDLE) and uses it in
      cpuidle_idle_call().
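
      A sketch of such an accessor (modeled on the description above; the
      helper name and the per-CPU lookup are approximations, not the exact
      cpuidle.h hunk):

        #include <stddef.h>

        struct cpuidle_device { int cpu; };

        #ifdef CONFIG_CPU_IDLE
        /* Stand-in for the per-CPU read of cpuidle_devices. */
        extern struct cpuidle_device *this_cpu_cpuidle_device(void);

        static inline struct cpuidle_device *cpuidle_get_device(void)
        {
            return this_cpu_cpuidle_device();
        }
        #else
        /* Without CPU_IDLE there is no device; callers see NULL and bail out. */
        static inline struct cpuidle_device *cpuidle_get_device(void)
        {
            return NULL;
        }
        #endif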
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      9bd616e3
  16. 01 June 2016, 1 commit