1. 10 Feb 2017 (2 commits)
  2. 31 Jan 2017 (1 commit)
    • tracing: Fix hwlat kthread migration · 79c6f448
      Steven Rostedt (VMware) authored
      The hwlat tracer creates a kernel thread at the start of the tracer. It is
      pinned to a single CPU and moves to the next CPU after each period of
      running. If the user modifies the migration thread's affinity, the thread
      keeps the user-supplied affinity and no longer migrates on its own.
      
      The original code created the thread at the first instance it was called,
      but it was later changed to destroy the thread after the tracer was
      finished, and to create it again on the next instance of the tracer.
      However, the code that initialized the affinity was only called on the
      initial instantiation of the tracer. On later instantiations the affinity
      was never re-initialized, so the saved affinity did not match that of the
      newly created thread. This made it appear that the user had modified the
      thread's affinity when they had not, and the thread failed to migrate again.
      
      Cc: stable@vger.kernel.org
      Fixes: 0330f7aa ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  3. 30 Jan 2017 (7 commits)
    • perf/core: Try parent PMU first when initializing a child event · 40999312
      Kan Liang authored
      perf incurs additional overhead when monitoring a task that
      frequently generates child tasks.
      
      perf_init_event() is one of the hotspots for the additional overhead:
      
      Currently, to find the PMU, it first looks up the event type in pmu_idr.
      But that lookup is not always successful, especially for the widely used
      PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events, so it has to take the
      slow path, which walks the whole PMU list.

      This becomes a big performance issue if the PMU list is long (e.g. a
      server with many uncore boxes) and the task frequently generates child
      tasks.
      
      A child event inherits from its parent event, so the child event should
      try its parent's PMU first.
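
      A sketch of the idea in perf_init_event(); perf_try_init_event() is the
      existing helper, and the surrounding locking and error handling are
      omitted:

        /* For inherited events, the parent's PMU is almost certainly the
         * right one: try it before the pmu_idr lookup and before the slow
         * walk over the full PMU list. */
        if (event->parent && event->parent->pmu) {
                pmu = event->parent->pmu;
                ret = perf_try_init_event(pmu, event);
                if (!ret)
                        goto unlock;
        }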
      
      Here is some data from the overhead test on a Broadwell server:
      
        perf record -e $TEST_EVENTS -- ./loop.sh 50000
      
        loop.sh
          start=$(date +%s%N)
          i=0
          while [ "$i" -le "$1" ]
          do
                  date > /dev/null
                  i=`expr $i + 1`
          done
          end=$(date +%s%N)
          elapsed=`expr $end - $start`
      
        Event#  Original elapsed time   Elapsed time with patch   delta
        ------  ---------------------   -----------------------   -------
        1       196,573,192,397         189,162,029,998           -3.77%
        2       257,567,753,013         241,620,788,683           -6.19%
        4       398,730,726,971         370,518,938,714           -7.08%
        8       824,983,761,120         740,702,489,329           -10.22%
        16      1,883,411,923,498       1,672,027,508,355         -11.22%
      
      ... which shows a nice performance improvement.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
      [ Tidied up the changelog and the code comment. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Optimize event rescheduling on active contexts · 487f05e1
      Alexander Shishkin authored
      When new events are added to an active context, we go and reschedule
      all cpu groups and all task groups in order to preserve the priority
      (cpu pinned, task pinned, cpu flexible, task flexible), but in
      reality we only need to reschedule groups of the same priority as
      that of the events being added, and below.
      
      This patch changes the behavior so that only groups that need to be
      rescheduled are rescheduled.
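
      Conceptually, the incoming event is classified first, and only classes
      at or below its priority are rescheduled. A simplified sketch (close to
      the idea, not necessarily the exact diff; EVENT_CPU is the new class
      flag):

        static enum event_type_t get_event_type(struct perf_event *event)
        {
                struct perf_event_context *ctx = event->ctx;
                enum event_type_t event_type;

                /* pinned beats flexible, cpu beats task */
                event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
                if (!ctx->task)
                        event_type |= EVENT_CPU;

                return event_type;
        }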
      Reported-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Don't re-schedule CPU flexible events needlessly · fe45bafb
      Alexander Shishkin authored
      In the sched-in path, we first remove a CPU's flexible events in order to
      give priority to the task's pinned events. However, this step can be safely
      skipped if the task doesn't have its own pinned events.
      
      This patch implements this skipping.
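
      The gist of the change in perf_event_context_sched_in(), sketched:

        /* Removing the CPU's flexible events is only needed to make room
         * for the task's pinned events; skip it when there are none. */
        if (!list_empty(&ctx->pinned_groups))
                cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);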
      Reported-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Remove perf_cpu_context::unique_pmu · 1fd7e416
      David Carrillo-Cisneros authored
      cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
      with shared pmus in order to avoid visiting the same cpuctx more than once
      in a for_each_pmu loop.
      
      cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts, since they
      have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
      hw contexts, this patch replaces cpuctx->unique_pmu with cpuctx->pmu there.
      
      The change above, together with the previous patch in this series, removed
      the remaining uses of cpuctx->unique_pmu, so we remove it altogether.
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Make cgroup switch visit only cpuctxs with cgroup events · 058fe1c0
      David Carrillo-Cisneros authored
      This patch follows from a conversation in CQM/CMT's last series about
      speeding up the context switch for cgroup events:
      
        https://patchwork.kernel.org/patch/9478617/
      
      This is a low-hanging fruit optimization. It replaces the iteration over
      the "pmus" list in cgroup switch by an iteration over a new list that
      contains only cpuctxs with at least one cgroup event.
      
      This is necessary because the number of PMUs has increased over the years;
      e.g. modern x86 server systems have well above 50 PMUs.

      Iterating over the full PMU list is unnecessary and can be costly in
      heavy cache contention scenarios.
      
      Below are some instrumentation measurements with 10, 50 and 90 percentiles
      of the total cost of context switch before and after this optimization for
      a simple array read/write microbenchmark.
      
        Contention
          Level    Nr events      Before (us)            After (us)       Median
        L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%)    Speedup
        --------------------------------------------------------------------------
        Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
        High  Low       1       (2.08, 4.56, 19.8)    (1.72, 2.20, 13.7)     51%
        High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%

        Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
        High  Low       2       (2.48, 5.28, 22.4)    (2.15, 3.69, 14.6)     30%
        High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%
      
      where:

        1 event type  = cycles
        2 event types = cycles,intel_cqm/llc_occupancy/

        Contention L2 Low:  workset < L2 cache size.
                      High: workset >> L2 cache size.
        Contention L3 Low:  workset of task on all sockets < L3 cache size.
                      High: workset of task on all sockets >> L3 cache size.

        Median Speedup is (50%ile Before - 50%ile After) / 50%ile Before
      
      Unsurprisingly, the benefits of this optimization decrease with the number
      of cpuctxs that have cgroup events; yet it is never detrimental.
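
      A sketch of the mechanism (simplified): each CPU keeps a list of cpuctxs
      that currently have at least one cgroup event, and perf_cgroup_switch()
      walks that list instead of every PMU:

        static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);

        static void perf_cgroup_switch(struct task_struct *task, int mode)
        {
                struct list_head *list = this_cpu_ptr(&cgrp_cpuctx_list);
                struct perf_cpu_context *cpuctx;

                /* only cpuctxs with at least one cgroup event are on the
                 * list, so an empty list makes the switch nearly free */
                list_for_each_entry(cpuctx, list, cgrp_cpuctx_entry) {
                        /* sched out the old cgroup, sched in the new one */
                }
        }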
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory · 0b3589be
      Peter Zijlstra authored
      Andres reported that MMAP2 records for anonymous memory always have
      their protection field 0.
      
      Turns out, someone daft put the prot/flags generation code in the file
      branch, leaving them unset for anonymous memory.
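
      The generation code itself looks roughly like the following; the fix
      moves it out of the file-only branch so anonymous mappings get it too
      (simplified from perf_event_mmap_event()):

        u32 prot = 0, flags = 0;

        if (vma->vm_flags & VM_READ)
                prot |= PROT_READ;
        if (vma->vm_flags & VM_WRITE)
                prot |= PROT_WRITE;
        if (vma->vm_flags & VM_EXEC)
                prot |= PROT_EXEC;

        if (vma->vm_flags & VM_MAYSHARE)
                flags = MAP_SHARED;
        else
                flags = MAP_PRIVATE;

        /* now computed before the if (file) branch, not inside it */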
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: anton@ozlabs.org
      Cc: namhyung@kernel.org
      Cc: stable@vger.kernel.org # v3.16+
      Fixes: f972eb63 ("perf: Pass protection and flags bits through mmap2 interface")
      Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix use-after-free bug · a76a82a3
      Peter Zijlstra authored
      Dmitry reported a KASAN use-after-free on event->group_leader.
      
      It turns out there's a hole in perf_remove_from_context() due to
      event_function_call() not calling its function when the task
      associated with the event is already dead.
      
      In this case the event will have been detached from the task, but the
      grouping will have been retained, such that group operations might
      still work properly while there are live child events etc.
      
      This does however mean that we can miss a perf_group_detach() call
      when the group decomposes, this in turn can then lead to
      use-after-free.
      
      Fix it by explicitly doing the group detach if it's still required.
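
      The shape of the fix in perf_remove_from_context(), sketched (simplified):

        event_function_call(event, __perf_remove_from_context, (void *)flags);

        /* event_function_call() can NO-OP when the task is already dead;
         * make sure the group detach still happens in that case. */
        raw_spin_lock_irq(&ctx->lock);
        if (!ctx->is_active && (flags & DETACH_GROUP))
                perf_group_detach(event);
        raw_spin_unlock_irq(&ctx->lock);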
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # v4.5+
      Cc: syzkaller <syzkaller@googlegroups.com>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 27 Jan 2017 (2 commits)
    • cgroup: don't online subsystems before cgroup_name/path() are operational · 07cd1294
      Tejun Heo authored
      While refactoring cgroup creation, a5bca215 ("cgroup: factor out
      cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
      before the new cgroup was associated with its kernfs_node.  This is fine
      for cgroup proper, but cgroup_name/path() depend on the associated
      kernfs_node, and if a subsystem makes the new cgroup_subsys_state
      visible, which it is allowed to do after onlining, this can lead to a
      NULL dereference.
      
      The current code performs cgroup creation and subsystem onlining in
      cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
      visible afterwards.  There's no reason to online the subsystems early
      and we can simply drop cgroup_apply_control_enable() call from
      cgroup_create() so that the subsystems are onlined and made visible at
      the same time.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: a5bca215 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
      Cc: stable@vger.kernel.org # v4.6+
    • sysctl: fix proc_doulongvec_ms_jiffies_minmax() · ff9f8a7c
      Eric Dumazet authored
      We perform the conversion between kernel jiffies and milliseconds only
      when exporting the kernel value to user space.

      We need to do the opposite conversion when the value is written by the
      user.

      This only matters when HZ != 1000.
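
      A minimal sketch of the intended symmetry:

        /* read: the kernel stores jiffies, user space sees milliseconds */
        user_val = jiffies_to_msecs(*valp);

        /* write: user space supplies milliseconds, store jiffies */
        *valp = msecs_to_jiffies(user_val);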
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 25 Jan 2017 (2 commits)
  6. 24 Jan 2017 (1 commit)
    • userns: Make ucounts lock irq-safe · 880a3854
      Nikolay Borisov authored
      The ucounts_lock is used to protect various parts of ucounts lifecycle
      management. However, those services can also be invoked when a pidns is
      being freed in an RCU callback (e.g. softirq context). This can lead to
      deadlocks. There were already efforts to prevent similar deadlocks in
      add7c65c ("pid: fix lockdep deadlock warning due to ucount_lock"),
      however they just moved the context from hardirq to softirq. Fix this
      issue once and for all by explicitly making the lock disable irqs
      altogether.
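
      Concretely, every ucounts_lock acquisition becomes irq-safe, e.g.:

        unsigned long flags;

        spin_lock_irqsave(&ucounts_lock, flags);
        /* ... lifetime/limit bookkeeping ... */
        spin_unlock_irqrestore(&ucounts_lock, flags);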
      
      Dmitry Vyukov <dvyukov@google.com> reported:
      
      > I've got the following deadlock report while running syzkaller fuzzer
      > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
      > device if it matters):
      >
      > =================================
      > [ INFO: inconsistent lock state ]
      > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
      > ---------------------------------
      > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      >  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
      > ./include/linux/spinlock.h:302
      >  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
      > put_ucounts+0x60/0x138 kernel/ucount.c:162
      > {SOFTIRQ-ON-W} state was registered at:
      > [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
      > [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<     inline     >] get_ucounts kernel/ucount.c:131
      > [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
      > [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
      > [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
      > [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
      > [<     inline     >] init_mount_tree fs/namespace.c:3199
      > [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
      > [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
      > [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
      > [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
      > irq event stamp: 2316924
      > hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2911
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last  enabled at (2316924): [<ffff200008210414>]
      > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
      > hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2900
      > hardirqs last disabled at (2316923): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last disabled at (2316923): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last disabled at (2316923): [<ffff20000820fe80>]
      > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
      > softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
      > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
      > softirqs last disabled at (2316913): [<     inline     >]
      > do_softirq_own_stack ./include/linux/interrupt.h:488
      > softirqs last disabled at (2316913): [<     inline     >]
      > invoke_softirq kernel/softirq.c:371
      > softirqs last disabled at (2316913): [<ffff20000811c994>]
      > irq_exit+0x264/0x308 kernel/softirq.c:405
      >
      > other info that might help us debug this:
      >  Possible unsafe locking scenario:
      >
      >        CPU0
      >        ----
      >   lock(ucounts_lock);
      >   <Interrupt>
      >     lock(ucounts_lock);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by swapper/2/0:
      >  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
      > kernel/rcu/rcu.h:108
      >  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2919
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      >  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
      > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
      >
      > stack backtrace:
      > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
      > Hardware name: Hardkernel ODROID-C2 (DT)
      > Call trace:
      > [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
      > [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
      > [<ffff2000088a99e0>] dump_stack+0x110/0x168
      > [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
      > kernel/locking/lockdep.c:2387
      > [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
      > [<     inline     >] valid_state kernel/locking/lockdep.c:2400
      > [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
      > [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
      > [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
      > [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
      > [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
      > [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
      > [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
      > [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
      > [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
      > [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
      > [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
      > [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
      > [<     inline     >] invoke_softirq kernel/softirq.c:371
      > [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
      > [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
      > [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
      > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
      > 7dc0:                                   ffff8000648d4b3c 0000000000000007
      > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
      > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
      > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
      > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
      > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
      > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
      > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
      > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
      > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
      > [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
      > [<     inline     >] arch_local_irq_restore
      > ./arch/arm64/include/asm/irqflags.h:81
      > [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
      > [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
      > [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
      > [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
      > [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
      > arch/arm64/kernel/smp.c:276
      > [<000000000279f1a4>] 0x279f1a4
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: add7c65c ("pid: fix lockdep deadlock warning due to ucount_lock")
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Link: https://www.spinics.net/lists/kernel/msg2426637.html
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  7. 20 Jan 2017 (1 commit)
  8. 19 Jan 2017 (1 commit)
    • bpf: don't trigger OOM killer under pressure with map alloc · d407bd25
      Daniel Borkmann authored
      This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
      that are to be used for map allocations. Using kmalloc() for very large
      allocations can cause excessive work within the page allocator, so i) fall
      back earlier to vmalloc() when the attempt is considered costly anyway,
      and, even more importantly, ii) don't trigger the OOM killer with any of
      the allocators.
      
      Since this is based on a user space request, for example, when creating
      maps with element pre-allocation, we really want such requests to fail
      instead of killing other user space processes.
      
      Also, don't spam the kernel log with warnings should any of the allocations
      fail under pressure. Given that, we can make backend selection in
      bpf_map_area_alloc() generic, and convert all maps over to use this API
      for spots with potentially large allocation requests.
      
      Note, replacing the one kmalloc_array() is fine as overflow checks happen
      earlier in htab_map_alloc(), since it must also protect the multiplication
      for vmalloc() should kmalloc_array() fail.
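
      A sketch of the two helpers as described above (close to the idea, not
      necessarily the exact code):

        void *bpf_map_area_alloc(size_t size)
        {
                /* avoid the OOM killer and allocation warnings; just fail */
                const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY;
                void *area;

                /* only try kmalloc() while the attempt is still cheap */
                if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                        area = kmalloc(size, GFP_USER | flags);
                        if (area != NULL)
                                return area;
                }

                return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
                                 PAGE_KERNEL);
        }

        void bpf_map_area_free(void *area)
        {
                kvfree(area);
        }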
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 18 Jan 2017 (2 commits)
  10. 17 Jan 2017 (1 commit)
    • bpf: rework prog_digest into prog_tag · f1f7714e
      Daniel Borkmann authored
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partly due to the admittedly
      suboptimal name of "prog_digest" in combination with its use of a SHA-1
      hash; thus, concerns about its security in terms of collision resistance
      were inevitably and rightfully raised with regard to its use cases.
      
      The intended use cases are debugging and introspection only, providing
      a stable "tag" over the instruction sequence that both kernel and user
      space can calculate independently. It is not usable at all for making
      security-relevant decisions. So collisions, where two different
      instruction sequences generate the same tag, can happen, but ideally at
      a rather low rate. The "tag" will be dumped in hex and is short enough
      to introspect in tracepoints or kallsyms output along with other data
      such as stack traces, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to make it
      obvious that it is not collision-free.
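
      Sketch of the truncation (the fp->tag naming is illustrative;
      BPF_TAG_SIZE is the new 64-bit size):

        #define BPF_TAG_SIZE 8   /* 64 bits: obviously not collision-free */

        /* keep only the leading bytes of the SHA-1 over the insn sequence */
        memcpy(fp->tag, digest, BPF_TAG_SIZE);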
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Jan 2017 (1 commit)
  12. 15 Jan 2017 (2 commits)
    • rcu: Narrow early boot window of illegal synchronous grace periods · 52d7e48b
      Paul E. McKenney authored
      The current preemptible RCU implementation goes through three phases
      during bootup.  In the first phase, there is only one CPU that is running
      with preemption disabled, so that a no-op is a synchronous grace period.
      In the second mid-boot phase, the scheduler is running, but RCU has
      not yet gotten its kthreads spawned (and, for expedited grace periods,
      workqueues are not yet running).  During this time, any attempt to do
      a synchronous grace period will hang the system (or complain bitterly,
      depending).  In the third and final phase, RCU is fully operational and
      everything works normally.
      
      This has been OK for some time, but synchronous grace periods have
      recently been showing up during the second mid-boot phase.  This code
      worked "by accident" for a while, but started failing as soon as
      expedited RCU grace periods switched over to workqueues in commit
      8b355e3b ("rcu: Drive expedited grace periods from workqueue").
      Note that the code was buggy even before this commit, as it was subject
      to failure on real-time systems that forced all expedited grace periods
      to run as normal grace periods (for example, using the rcu_normal ksysfs
      parameter).  The callchain from the failure case is as follows:
      
      early_amd_iommu_init()
      |-> acpi_put_table(ivrs_base);
      |-> acpi_tb_put_table(table_desc);
      |-> acpi_tb_invalidate_table(table_desc);
      |-> acpi_tb_release_table(...)
      |-> acpi_os_unmap_memory
      |-> acpi_os_unmap_iomem
      |-> acpi_os_map_cleanup
      |-> synchronize_rcu_expedited
      
      The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
      which caused the code to try using workqueues before they were
      initialized, which did not go well.
      
      This commit therefore reworks RCU to permit synchronous grace periods
      to proceed during this mid-boot phase.  This commit is therefore a
      fix to a regression introduced in v4.9, and is therefore being put
      forward post-merge-window in v4.10.
      
      This commit sets a flag from the existing rcu_scheduler_starting()
      function which causes all synchronous grace periods to take the expedited
      path.  The expedited path now checks this flag, using the requesting task
      to drive the expedited grace period forward during the mid-boot phase.
      Finally, this flag is updated by a core_initcall() function named
      rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.
      
      Note that this arrangement assumes that tasks are not sent POSIX signals
      (or anything similar) from the time that the first task is spawned
      through core_initcall() time.
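
      Sketched staging (simplified; the phase constants are illustrative, and
      rcu_exp_runtime_mode() is the core_initcall() named above):

        void __init rcu_scheduler_starting(void)
        {
                /* mid-boot: force synchronous GPs onto the expedited path,
                 * driven forward by the requesting task itself */
                rcu_scheduler_active = RCU_SCHEDULER_INIT;
        }

        static int __init rcu_exp_runtime_mode(void)
        {
                /* kthreads/workqueues are up: use the runtime codepaths */
                rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
                return 0;
        }
        core_initcall(rcu_exp_runtime_mode);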
      
      Fixes: 8b355e3b ("rcu: Drive expedited grace periods from workqueue")
      Reported-by: N"Zheng, Lv" <lv.zheng@intel.com>
      Reported-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NStan Kain <stan.kain@gmail.com>
      Tested-by: NIvan <waffolz@hotmail.com>
      Tested-by: NEmanuel Castelo <emanuel.castelo@gmail.com>
      Tested-by: NBruno Pesavento <bpesavento@infinito.it>
      Tested-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NFrederic Bezies <fredbezies@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.9.0-
    • rcu: Remove cond_resched() from Tiny synchronize_sched() · f466ae66
      Paul E. McKenney authored
      It is now legal to invoke synchronize_sched() at early boot, which causes
      Tiny RCU's synchronize_sched() to emit spurious splats.  This commit
      therefore removes the cond_resched() from Tiny RCU's synchronize_sched().
      
      Fixes: 8b355e3b ("rcu: Drive expedited grace periods from workqueue")
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # 4.9.0-
  13. 14 Jan 2017 (4 commits)
    • perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Jiri Olsa authored
      It's possible to set up PEBS events to get only errors and not
      any data, like on SNB-X (model 45) and IVB-EP (model 62)
      via 2 perf commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lockup, because the error path of
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
      for error PEBS interrupts, so in case you're getting ONLY
      errors you don't have a way to stop the event when it's over
      the max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
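
      Sketch of the error-path change in intel_pmu_drain_pebs_nhm()
      (simplified):

        if (error[bit]) {
                perf_log_lost_samples(event, error[bit]);

                /* account the interrupt so throttling can stop the
                 * event even when it produces only errors */
                if (perf_event_account_interrupt(event))
                        x86_pmu_stop(event, 0);
        }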
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race · 321027c1
      Peter Zijlstra authored
      Di Shen reported a race between two concurrent sys_perf_event_open()
      calls where both try to move the same pre-existing software group
      into a hardware context.
      
      The problem is exactly that described in commit:
      
        f63a8daa ("perf: Fix event->ctx locking")
      
      ... where, while we wait for a ctx->mutex acquisition, the event->ctx
      relation can have changed under us.
      
      That very same commit failed to recognise sys_perf_event_open() as an
      external access vector to the events and thereby didn't apply the
      established locking rules correctly.
      
      So while one sys_perf_event_open() call is stuck waiting on
      mutex_lock_double(), the other (which owns said locks) moves the group
      about. So by the time the former sys_perf_event_open() acquires the
      locks, the context we've acquired is stale (and possibly dead).
      
      Apply the established locking rules as per perf_event_ctx_lock_nested()
      to the mutex_lock_double() for the 'move_group' case. This obviously means
      we need to validate state after we acquire the locks.
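
      The shape of the revalidation, sketched (simplified; the real code also
      deals with refcounts and the no-group case):

        again:
                mutex_lock_double(&gctx->mutex, &ctx->mutex);

                /* the group may have moved while we waited for the locks */
                if (group_leader->ctx != gctx) {
                        mutex_unlock(&ctx->mutex);
                        mutex_unlock(&gctx->mutex);
                        gctx = READ_ONCE(group_leader->ctx);
                        goto again;
                }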
      
      Reported-by: Di Shen (Keen Lab)
      Tested-by: John Dias <joaodias@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Min Chong <mchong@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: f63a8daa ("perf: Fix event->ctx locking")
      Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix sys_perf_event_open() vs. hotplug · 63cae12b
      Peter Zijlstra authored
      There is a problem with installing an event in a task that is 'stuck' on
      an offline CPU.

      Blocked tasks are not disassociated from offlined CPUs; after all, a
      blocked task doesn't run and doesn't require a CPU etc.. Only on
      wakeup do we amend the situation and place the task on an available
      CPU.

      If we hit such a task with perf_install_in_context() we'll loop until
      either that task wakes up or the CPU comes back online; if the task
      waking depends on the event being installed, we're stuck.
      
      While looking into this issue, I also spotted another problem: if we
      hit a task with perf_install_in_context() that is in the middle of
      being migrated, that is, we observe the old CPU before sending the IPI
      but run the IPI (on the old CPU) while the task is already running on
      the new CPU, things also go sideways.
      
      Rework things to rely on task_curr() -- outside of rq->lock -- which
      is rather tricky. Imagine the following scenario where we're trying to
      install the first event into our task 't':
      
      CPU0            CPU1            CPU2
      
                      (current == t)
      
      t->perf_event_ctxp[] = ctx;
      smp_mb();
      cpu = task_cpu(t);
      
                      switch(t, n);
                                      migrate(t, 2);
                                      switch(p, t);
      
                                      ctx = t->perf_event_ctxp[]; // must not be NULL
      
      smp_function_call(cpu, ..);
      
                      generic_exec_single()
                        func();
                          spin_lock(ctx->lock);
                          if (task_curr(t)) // false
      
                          add_event_to_ctx();
                          spin_unlock(ctx->lock);
      
                                      perf_event_context_sched_in();
                                        spin_lock(ctx->lock);
                                        // sees event
      
      So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
      Because if CPU2's load of that variable were to observe NULL, it would
      not try to schedule the ctx and we'd have a task running without its
      counter, which would be 'bad'.
      
      As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
      first and not see the event yet, then CPU0 must observe task_curr()
      and retry. If the install happens first, then we must see the event on
      sched-in and all is well.
      
      I think we can translate the first part (until the 'must not be NULL')
      of the scenario to a litmus test like:
      
        C C-peterz
      
        {
        }
      
        P0(int *x, int *y)
        {
                int r1;
      
                WRITE_ONCE(*x, 1);
                smp_mb();
                r1 = READ_ONCE(*y);
        }
      
        P1(int *y, int *z)
        {
                WRITE_ONCE(*y, 1);
                smp_store_release(z, 1);
        }
      
        P2(int *x, int *z)
        {
                int r1;
                int r2;
      
                r1 = smp_load_acquire(z);
          smp_mb();
                r2 = READ_ONCE(*x);
        }
      
        exists
        (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
      
      Where:
        x is perf_event_ctxp[],
        y is our task's CPU, and
        z is our task being placed on the rq of CPU2.
      
      The P0 smp_mb() is the one added by this patch, ordering the store to
      perf_event_ctxp[] from find_get_context() and the load of task_cpu()
      in task_function_call().
      
      The smp_store_release/smp_load_acquire model the RCpc locking of the
      rq->lock and the smp_mb() of P2 is the context switch switching from
      whatever CPU2 was running to our task 't'.
      
      This litmus test evaluates into:
      
        Test C-peterz Allowed
        States 7
        0:r1=0; 2:r1=0; 2:r2=0;
        0:r1=0; 2:r1=0; 2:r2=1;
        0:r1=0; 2:r1=1; 2:r2=1;
        0:r1=1; 2:r1=0; 2:r2=0;
        0:r1=1; 2:r1=0; 2:r2=1;
        0:r1=1; 2:r1=1; 2:r2=0;
        0:r1=1; 2:r1=1; 2:r2=1;
        No
        Witnesses
        Positive: 0 Negative: 7
        Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
        Observation C-peterz Never 0 7
        Hash=e427f41d9146b2a5445101d3e2fcaa34
      
      And the strong and weak model agree.
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: jeremy.linton@arm.com
      Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kprobes, extable: Identify kprobes trampolines as kernel text area · 5b485629
      Masami Hiramatsu authored
      Improve __kernel_text_address()/kernel_text_address() to return
      true if the given address is on a kprobe's instruction slot
      trampoline.
      
      This can help stacktraces determine whether an address is in a
      text area or not.
      
      To implement this atomically in is_kprobe_*_slot(), also change
      the insn_cache page list to an RCU list.
      
      This changes timings a bit (it delays page freeing to the RCU garbage
      collection phase), but none of that is in the hot path.
      
      Note: this change can add small overhead to stack unwinders because
      it adds 2 additional checks to __kernel_text_address(). However, the
      impact should be very small, because the kprobe_insn_pages list has 1
      entry per 256 probes (on x86; on arm/arm64 it will be 1024 probes),
      and kprobe_optinsn_pages has 1 entry per 32 probes (on x86).
      In most use cases, the number of kprobe events may be less
      than 20, which means that is_kprobe_*_slot() will check just one entry.
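
      The resulting check, sketched (simplified; is_kprobe_insn_slot() and
      is_kprobe_optinsn_slot() are the helpers this patch introduces):

        int __kernel_text_address(unsigned long addr)
        {
                if (core_kernel_text(addr))
                        return 1;
                if (is_module_text_address(addr))
                        return 1;
                if (is_ftrace_trampoline(addr))
                        return 1;
                /* new: kprobe insn slot trampolines are kernel text too */
                if (is_kprobe_insn_slot(addr))
                        return 1;
                if (is_kprobe_optinsn_slot(addr))
                        return 1;
                return 0;
        }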
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/148388747896.6869.6354262871751682264.stgit@devbox
      [ Improved the changelog and coding style. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 12 Jan 2017 (2 commits)
  15. 11 Jan 2017 (4 commits)
    • nohz: Fix collision between tick and other hrtimers · 24b91e36
      Frederic Weisbecker authored
      When the tick is stopped and an interrupt occurs afterward, we check on
      that interrupt exit if the next tick needs to be rescheduled. If it
      doesn't need any update, we don't want to do anything.
      
      In order to check if the tick needs an update, we compare it against the
      clockevent device deadline. Now that's a problem because the clockevent
      device is at a lower level than the tick itself if it is implemented
      on top of hrtimer.
      
      Every hrtimer shares this clockevent device. So comparing the next tick
      deadline against the clockevent device deadline is wrong because the
      device may be programmed for another hrtimer whose deadline collides
      with the tick. As a result we may end up not reprogramming the tick
      accidentally.
      
      In a worst case scenario under full dynticks mode, the tick, which is
      supposed to fire at least every second, stops firing, leaving /proc/stat
      stalled:
      
            Task in a full dynticks CPU
            ----------------------------
      
            * hrtimer A is queued 2 seconds ahead
            * the tick is stopped, scheduled 1 second ahead
            * tick fires 1 second later
            * on tick exit, nohz schedules the tick 1 second ahead but sees
              that the clockevent device is already programmed to that
              deadline; fooled by hrtimer A, it doesn't reschedule the tick.
            * hrtimer A is cancelled before its deadline
            * tick never fires again until an interrupt happens...
      
      In order to fix this, store the next tick deadline to the tick_sched
      local structure and reuse that value later to check whether we need to
      reprogram the clock after an interrupt.
      
      On the other hand, ts->sleep_length still wants to know about the next
      clock event and not just the tick, so we want to improve the related
      comment to avoid confusion.
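
      Sketched (simplified; next_tick is the new tick_sched field described
      above):

        /* when (re)programming the tick: */
        ts->next_tick = tick;

        /* on interrupt exit, compare against the tick's own deadline,
         * not the shared clockevent device: */
        if (ts->tick_stopped && expires == ts->next_tick)
                return;  /* nothing to reprogram */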
      Reported-by: James Hartsock <hartsjc@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Jamie Iles authored
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      we don't clear SIGNAL_UNKILLABLE.
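
      The fix introduces a helper that replaces only the stop bits, so
      SIGNAL_UNKILLABLE survives; sketched:

        #define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
                                  SIGNAL_STOP_CONTINUED)

        static inline void signal_set_stop_flags(struct signal_struct *sig,
                                                 unsigned int flags)
        {
                /* replace only the stop bits; SIGNAL_UNKILLABLE survives */
                sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
        }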
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done} · f931ab47
      Dan Williams authored
      Both arch_add_memory() and arch_remove_memory() expect a single threaded
      context.
      
      For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
      not hold any locks over this check and branch:
      
          if (pgd_val(*pgd)) {
          	pud = (pud_t *)pgd_page_vaddr(*pgd);
          	paddr_last = phys_pud_init(pud, __pa(vaddr),
          				   __pa(vaddr_end),
          				   page_size_mask);
          	continue;
          }
      
          pud = alloc_low_page();
          paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
          			   page_size_mask);
      
      The result is that two threads calling devm_memremap_pages()
      simultaneously can end up colliding on pgd initialization.  This leads
      to crash signatures like the following where the loser of the race
      initializes the wrong pgd entry:
      
          BUG: unable to handle kernel paging request at ffff888ebfff0000
          IP: memcpy_erms+0x6/0x10
          PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
          Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
          task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
          RIP: memcpy_erms+0x6/0x10
          [..]
          Call Trace:
            ? pmem_do_bvec+0x205/0x370 [nd_pmem]
            ? blk_queue_enter+0x3a/0x280
            pmem_rw_page+0x38/0x80 [nd_pmem]
            bdev_read_page+0x84/0xb0
      
      Hold the standard memory hotplug mutex over calls to
      arch_{add,remove}_memory().
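
      In devm_memremap_pages() this looks roughly like:

        mem_hotplug_begin();
        error = arch_add_memory(nid, align_start, align_size, true);
        mem_hotplug_done();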
      
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf: do not use KMALLOC_SHIFT_MAX · 7984c27c
      Michal Hocko authored
      Commit 01b3f521 ("bpf: fix allocation warnings in bpf maps and
      integer overflow") has added checks for the maximum allocatable size.
      It (ab)used KMALLOC_SHIFT_MAX for that purpose.
      
      While this is not incorrect, it is not very clean, because we already
      have KMALLOC_MAX_SIZE for this very reason, so let's change both checks
      to use KMALLOC_MAX_SIZE instead.
      
      The original motivation for using KMALLOC_SHIFT_MAX was to work around
      an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
      but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
      will fit into MAX_ORDER".
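
      Schematically, the checks become:

        /* before: a limit derived from KMALLOC_SHIFT_MAX;
         * after: the constant that exists for exactly this purpose */
        if (attr->value_size > KMALLOC_MAX_SIZE)
                return ERR_PTR(-E2BIG);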
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 10 Jan 2017 (1 commit)
    • pid: fix lockdep deadlock warning due to ucount_lock · add7c65c
      Andrei Vagin authored
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
      ---------------------------------------------------------
      swapper/1/0 just changed the state of lock:
       (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (ucounts_lock){+.+...}
      and interrupts could create inverse lock ordering between them.
      other info that might help us debug this:
      Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(ucounts_lock);
                                     local_irq_disable();
                                     lock(&(&sighand->siglock)->rlock);
                                     lock(&(&tty->ctrl_lock)->rlock);
        <Interrupt>
          lock(&(&sighand->siglock)->rlock);
      
       *** DEADLOCK ***
      
      This patch removes a dependency between rlock and ucount_lock.
      
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  17. 04 Jan 2017 (1 commit)
    • audit: Fix sleep in atomic · be29d20f
      Jan Kara authored
      Audit tree code was happily adding new notification marks while holding
      spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
      lead to sleeping while holding a spinlock, deadlocks due to lock
      inversion, and probably other fun. Fix the problem by acquiring
      group->mark_mutex earlier.
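
      The fixed ordering, sketched (simplified from the audit tree code,
      with error handling trimmed):

        /* take the sleeping mutex first ... */
        mutex_lock(&audit_tree_group->mark_mutex);

        /* ... so the mark can be added via the _locked variant ... */
        if (fsnotify_add_mark_locked(entry, audit_tree_group, inode,
                                     NULL, 0)) {
                mutex_unlock(&audit_tree_group->mark_mutex);
                return -ENOSPC;
        }

        /* ... and only then take the spinlock for the hash update */
        spin_lock(&hash_lock);
        /* ... insert the chunk ... */
        spin_unlock(&hash_lock);
        mutex_unlock(&audit_tree_group->mark_mutex);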
      
      CC: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  18. 27 Dec 2016 (1 commit)
  19. 26 Dec 2016 (2 commits)
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner authored
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a seconds and a nanoseconds part of a time
      value need to be combined. For anything where the seconds argument is 0,
      this is pointless and can be replaced with a simple assignment.
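
      For example, the cleanup pattern applied throughout the tree:

        /* before: a pure-nanosecond value built via ktime_set() */
        kt = ktime_set(0, 10 * NSEC_PER_MSEC);

        /* after: the seconds part is 0, so a plain assignment suffices */
        kt = 10 * NSEC_PER_MSEC;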
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Get rid of the union · 2456e855
      Thomas Gleixner authored
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant on 32-bit machines. The Y2038 cleanup removed the
      timespec variant and switched everything to scalar nanoseconds. The
      union remained, but became completely pointless.

      Get rid of the union and just keep ktime_t as a simple typedef of type
      s64.
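
      After the change the type is simply:

        /* ktime_t is now a plain scalar nanosecond count */
        typedef s64 ktime_t;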
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  20. 25 Dec 2016 (2 commits)