1. 20 Feb, 2016 1 commit
    • tracing, kasan: Silence Kasan warning in check_stack of stack_tracer · 6e22c836
      Committed by Yang Shi
      When enabling stack trace via "echo 1 > /proc/sys/kernel/stack_tracer_enabled",
      the below KASAN warning is triggered:
      
      BUG: KASAN: stack-out-of-bounds in check_stack+0x344/0x848 at addr ffffffc0689ebab8
      Read of size 8 by task ksoftirqd/4/29
      page:ffffffbdc3a27ac0 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x0()
      page dumped because: kasan: bad access detected
      CPU: 4 PID: 29 Comm: ksoftirqd/4 Not tainted 4.5.0-rc1 #129
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffffffc000091300>] dump_backtrace+0x0/0x3a0
      [<ffffffc0000916c4>] show_stack+0x24/0x30
      [<ffffffc0009bbd78>] dump_stack+0xd8/0x168
      [<ffffffc000420bb0>] kasan_report_error+0x6a0/0x920
      [<ffffffc000421688>] kasan_report+0x70/0xb8
      [<ffffffc00041f7f0>] __asan_load8+0x60/0x78
      [<ffffffc0002e05c4>] check_stack+0x344/0x848
      [<ffffffc0002e0c8c>] stack_trace_call+0x1c4/0x370
      [<ffffffc0002af558>] ftrace_ops_no_ops+0x2c0/0x590
      [<ffffffc00009f25c>] ftrace_graph_call+0x0/0x14
      [<ffffffc0000881bc>] fpsimd_thread_switch+0x24/0x1e8
      [<ffffffc000089864>] __switch_to+0x34/0x218
      [<ffffffc0011e089c>] __schedule+0x3ac/0x15b8
      [<ffffffc0011e1f6c>] schedule+0x5c/0x178
      [<ffffffc0001632a8>] smpboot_thread_fn+0x350/0x960
      [<ffffffc00015b518>] kthread+0x1d8/0x2b0
      [<ffffffc0000874d0>] ret_from_fork+0x10/0x40
      Memory state around the buggy address:
       ffffffc0689eb980: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
       ffffffc0689eba00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
      >ffffffc0689eba80: 00 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00
                                              ^
       ffffffc0689ebb00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffffffc0689ebb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      The stack tracer traverses the whole kernel stack when saving the max stack
      trace. In doing so it may touch the stack red zones, which triggers the
      warning. So, just disable the instrumentation to silence the warning.
      
      Link: http://lkml.kernel.org/r/1455309960-18930-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      6e22c836
  2. 12 Feb, 2016 2 commits
  3. 11 Feb, 2016 2 commits
    • bpf: fix branch offset adjustment on backjumps after patching ctx expansion · a1b14d27
      Committed by Daniel Borkmann
      When ctx access is used, the kernel often needs to expand/rewrite
      instructions, so after that patching, branch offsets have to be
      adjusted for both forward and backward jumps in the new eBPF program,
      but for backward jumps it fails to account for the delta. For
      example, if the expansion happens exactly on the insn that sits at
      the jump target, the back jump offset is not fixed up.
      
      Analysis on what the check in adjust_branches() is currently doing:
      
        /* adjust offset of jmps if necessary */
        if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
        else if (i > pos && i + insn->off + 1 < pos)
          insn->off -= delta;
      
      First condition (forward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- i/insn            insns[1] <--- i/insn
        insns[2] <--- pos               insns[P] <--- pos
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- target_X          insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- target_X
                                        insns[5]
      
      First case is if we cross pos-boundary and the jump instruction was
      before pos. This is handled correctly. If i == pos, the jump we are
      currently checking is the patchlet itself that we just injected.
      Since such patchlets are self-contained and have no awareness of any
      insns before or after the patched one, the offset is correctly left
      unadjusted. Also, in the case of i + insn->off + 1 == pos in the
      second half of the condition, we jump to the newly patched
      instruction, so no offset adjustment is needed. That part is correct.
      
      Second condition (backward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- target_X          insns[1] <--- target_X
        insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- i/insn            insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- i/insn
                                        insns[5]
      
      Second interesting case is where we cross pos-boundary and the jump
      instruction was after pos. A backward jump with i == pos would be
      impossible and would indicate a bug somewhere in the patchlet, so the
      first test, i > pos, is okay by itself. However, i +
      insn->off + 1 < pos does not always work as intended to trigger the
      adjustment. It works when jump targets would be far off where the
      delta wouldn't matter. But, for example, where the fixed insn->off
      before pointed to pos (target_Y), it now points to pos + delta, so
      that additional room needs to be taken into account for the check.
      This means that i) both tests here need to be adjusted into pos + delta,
      and ii) for the second condition, the test needs to be <= as pos
      itself can be a target in the backjump, too.
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1b14d27
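The adjusted check described in points i) and ii) of the bpf commit above can be sketched in user space as follows. This is a minimal illustration, not the kernel code: struct insn and its is_jmp field are hypothetical simplifications, and the function assumes it runs on the already patched program, where the single insn at pos has grown by delta insns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified instruction: only what the check needs. */
struct insn {
	int is_jmp;	/* non-zero for a relative jump */
	int off;	/* jump target index = own index + off + 1 */
};

/*
 * Fix up jump offsets after the insn at 'pos' was replaced by a patchlet
 * that is 'delta' insns longer. 'insns' is the already patched program,
 * so every insn behind the patchlet has moved up by 'delta'.
 */
static void adjust_branches(struct insn *insns, size_t len, int pos, int delta)
{
	assert(pos >= 0 && delta >= 0);

	for (int i = 0; i < (int)len; i++) {
		struct insn *insn = &insns[i];

		if (!insn->is_jmp)
			continue;

		/* Forward jump from before pos to a target behind pos. */
		if (i < pos && i + insn->off + 1 > pos)
			insn->off += delta;
		/* Backward jump from behind the patchlet to pos or earlier. */
		else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
			insn->off -= delta;
	}
}
```

With pos = 2 and delta = 2, a back jump that used to sit at index 4 and target pos now sits at index 6 with its old offset of -3; the check above turns that into -5, which the original "< pos" test would have missed.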
    • workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
      Committed by Tejun Heo
      When looking up the pool_workqueue to use for an unbound workqueue,
      workqueue assumes that the target CPU is always bound to a valid NUMA
      node.  However, currently, when a CPU goes offline, the mapping is
      destroyed and cpu_to_node() returns NUMA_NO_NODE.
      
      This has always been broken but hasn't triggered often enough before
      874bbfe6 ("workqueue: make sure delayed work run in local cpu").
      After the commit, workqueue forcefully assigns the local CPU for
      delayed work items without explicit target CPU to fix a different
      issue.  This widens the window where a CPU can go offline while a
      delayed work item is pending, causing delayed work items to be
      dispatched with target CPU set to an already offlined CPU.  The resulting
      NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
      NULL pool_workqueue and thus crash.
      
      While 874bbfe6 has been reverted for a different reason making the
      bug less visible again, it can still happen.  Fix it by mapping
      NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
      This is a temporary workaround.  The long term solution is keeping CPU
      -> NODE mapping stable across CPU off/online cycles which is being
      worked on.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
      Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
      d6e022f1
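The NUMA_NO_NODE fallback described in the workqueue commit above might look roughly like this in a user-space sketch. The struct layout, the field names dfl_pwq and numa_pwq_tbl, and MAX_NODES are assumptions made for illustration; only unbound_pwq_by_node() and NUMA_NO_NODE are named in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NODES	4	/* hypothetical table size */

struct pool_workqueue {
	int id;
};

/* Hypothetical, trimmed-down stand-in for the workqueue structure. */
struct workqueue {
	struct pool_workqueue *dfl_pwq;			/* default pwq */
	struct pool_workqueue *numa_pwq_tbl[MAX_NODES];	/* per-node pwqs */
};

/* Look up the pwq for 'node', falling back to the default one. */
static struct pool_workqueue *unbound_pwq_by_node(struct workqueue *wq, int node)
{
	/* The CPU went offline and lost its node mapping: use the default. */
	if (node == NUMA_NO_NODE)
		return wq->dfl_pwq;

	assert(node >= 0 && node < MAX_NODES);
	return wq->numa_pwq_tbl[node];
}
```

Handling the special value inside the lookup helper means every caller gets a non-NULL pool_workqueue without having to know about offlined CPUs.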
  4. 10 Feb, 2016 3 commits
    • workqueue: implement "workqueue.debug_force_rr_cpu" debug feature · f303fccb
      Committed by Tejun Heo
      Workqueue used to guarantee local execution for work items queued
      without explicit target CPU.  The guarantee is gone now which can
      break some usages in subtle ways.  To flush out those cases, this
      patch implements a debug feature which forces round-robin CPU
      selection for all such work items.
      
      The debug feature defaults to off and can be enabled with a kernel
      parameter.  The default can be flipped with a debug config option.
      
      If you hit this commit during bisection, please refer to 041bd12e
      ("Revert "workqueue: make sure delayed work run in local cpu"") for
      more information and ping me.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f303fccb
    • workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs · ef557180
      Committed by Mike Galbraith
      WORK_CPU_UNBOUND work items queued to a bound workqueue always run
      locally.  This is a good thing normally, but not when the user has
      asked us to keep unbound work away from certain CPUs.  Round robin
      these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
      trumps performance.
      
      tj: Cosmetic and comment changes.  WARN_ON_ONCE() dropped from empty
          (wq_unbound_cpumask AND cpu_online_mask).  If we want that, it
          should be done when config changes.
      Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ef557180
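The round-robin selection over wq_unbound_cpumask described above can be illustrated with a small user-space sketch. It assumes at most 64 CPUs represented as a uint64_t bitmask; next_cpu_wrap(), select_unbound_cpu() and the last_cpu cursor are hypothetical names, and falling back to the local CPU when no allowed CPU is online is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: CPUs fit into one 64-bit mask */

/* Next set bit in 'mask' strictly after 'prev', wrapping; -1 if empty. */
static int next_cpu_wrap(uint64_t mask, int prev)
{
	assert(prev >= -1 && prev < NR_CPUS);

	for (int step = 1; step <= NR_CPUS; step++) {
		int cpu = (prev + step) % NR_CPUS;

		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}

/* Pick a CPU for a work item queued without an explicit target CPU. */
static int select_unbound_cpu(int local_cpu, uint64_t unbound_mask,
			      uint64_t online_mask, int *last_cpu)
{
	uint64_t allowed = unbound_mask & online_mask;

	assert(local_cpu >= 0 && local_cpu < NR_CPUS);

	/* The local CPU is allowed: keep the work local. */
	if (unbound_mask & (UINT64_C(1) << local_cpu))
		return local_cpu;

	/* Nothing usable in the mask: fall back to the local CPU. */
	if (!allowed)
		return local_cpu;

	/* Otherwise rotate through the allowed online CPUs. */
	*last_cpu = next_cpu_wrap(allowed, *last_cpu);
	return *last_cpu;
}
```

Keeping the cursor in caller-owned state (last_cpu) lets each caller, for example each CPU, rotate independently.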
    • Revert "workqueue: make sure delayed work run in local cpu" · 041bd12e
      Committed by Tejun Heo
      This reverts commit 874bbfe6.
      
      Workqueue used to implicitly guarantee that work items queued without
      explicit CPU specified are put on the local CPU.  Recent changes in
      timer broke the guarantee and led to vmstat breakage which was fixed
      by 176bed1d ("vmstat: explicitly schedule per-cpu work on the CPU
      we need it to run on").
      
      vmstat is the most likely to expose the issue and it's quite possible
      that there are other similar problems which are a lot more difficult
      to trigger.  As a preventive measure, 874bbfe6 ("workqueue: make
      sure delayed work run in local cpu") was applied to restore the local
      CPU guarantee.  Unfortunately, the change exposed a bug in timer code
      which got fixed by 22b886dd ("timers: Use proper base migration in
      add_timer_on()").  Due to code restructuring, the commit couldn't be
      backported beyond a certain point and stable kernels which only had
      874bbfe6 started crashing.
      
      The local CPU guarantee was accidental more than anything else and we
      want to get rid of it anyway.  As, with the vmstat case fixed,
      874bbfe6 is causing more problems than it's fixing, it has been
      decided to take the chance and officially break the guarantee by
      reverting the commit.  A debug feature will be added to force foreign
      CPU assignment to expose cases relying on the guarantee and fixes for
      the individual cases will be backported to stable as necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 874bbfe6 ("workqueue: make sure delayed work run in local cpu")
      Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
      Cc: stable@vger.kernel.org
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      041bd12e
  5. 09 Feb, 2016 1 commit
    • locking/lockdep: Fix stack trace caching logic · 8a5fd564
      Committed by Dmitry Vyukov
      check_prev_add() caches the saved stack trace in a static trace variable
      to avoid duplicate save_trace() calls in dependencies involving trylocks.
      But that caching logic contains a bug. We may not save the trace on the
      first iteration due to an early return from check_prev_add(). Then on the
      second iteration, when we actually need the trace, we don't save it
      because we think that we've already saved it.
      
      Let check_prev_add() itself control when stack is saved.
      
      There is another bug. The trace variable is protected by the graph lock.
      But we can temporarily release the graph lock during printing.
      
      Fix this by invalidating cached stack trace when we release graph lock.
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: glider@google.com
      Cc: kcc@google.com
      Cc: peter@hurleysoftware.com
      Cc: sasha.levin@oracle.com
      Link: http://lkml.kernel.org/r/1454593240-121647-1-git-send-email-dvyukov@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8a5fd564
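The two lockdep fixes above (let check_prev_add() decide when to save, and invalidate the cache when the graph lock is dropped) can be modelled with a toy sketch. Everything here apart from the names check_prev_add() and save_trace() is hypothetical: the skip and printed parameters stand in for the early return and for the printing path, and save_calls only exists to count the expensive saves.

```c
#include <assert.h>
#include <stddef.h>

static int save_calls;	/* counts expensive saves, for illustration only */

/* Stand-in for the real stack capture; returns non-zero on success. */
static int save_trace(void)
{
	save_calls++;
	return 1;
}

/*
 * 'skip' models the early return, 'printed' models a path that drops the
 * graph lock in order to print. Returns 0 on failure, 1 otherwise.
 */
static int check_prev_add(int skip, int printed, int *stack_saved)
{
	assert(stack_saved != NULL);

	if (skip)
		return 1;	/* nothing saved, flag left untouched */

	/* Save only if there is no valid cached trace. */
	if (!*stack_saved) {
		if (!save_trace())
			return 0;
		*stack_saved = 1;
	}

	/* The lock was released, so the cached trace may be stale. */
	if (printed)
		*stack_saved = 0;

	return 1;
}
```

Because the flag is only set at the point where the trace is really saved, an early return on the first iteration can no longer leave the caller believing a trace exists.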
  6. 06 Feb, 2016 1 commit
  7. 03 Feb, 2016 3 commits
    • modules: fix longstanding /proc/kallsyms vs module insertion race. · 8244062e
      Committed by Rusty Russell
      For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
      There's one full copy, marked SHF_ALLOC and laid out at the end of the
      module's init section.  There's also a cut-down version that only
      contains core symbols and strings, and lives in the module's core
      section.
      
      After module init (and before we free the module memory), we switch
      the mod->symtab, mod->num_symtab and mod->strtab to point to the core
      versions.  We do this under the module_mutex.
      
      However, kallsyms doesn't take the module_mutex: it uses
      preempt_disable() and rcu tricks to walk through the modules, because
      it's used in the oops path.  It's also used in /proc/kallsyms.
      There's nothing atomic about the change of these variables, so we can
      get the old (larger!) num_symtab and the new symtab pointer; in fact
      this is what I saw when trying to reproduce.
      
      By grouping these variables together, we can use a
      carefully-dereferenced pointer to ensure we always get one or the
      other (the free of the module init section is already done in an RCU
      callback, so that's safe).  We allocate the init one at the end of the
      module init section, and keep the core one inside the struct module
      itself (it could also have been allocated at the end of the module
      core, but that's probably overkill).
      Reported-by: Weilong Chen <chenweilong@huawei.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
      Cc: stable@kernel.org
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      8244062e
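The idea of grouping symtab, num_symtab and strtab behind one pointer, as described in the kallsyms commit above, can be sketched in user space with C11 atomics. This is only an illustration under assumptions: the kernel relies on RCU rather than C11 atomics, struct sym is a hypothetical stand-in for the ELF symbol type, and the struct and function names are chosen for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an ELF symbol: offset of the name in strtab. */
struct sym {
	unsigned int st_name;
};

/* The three values that must always be observed together. */
struct mod_kallsyms {
	const struct sym *symtab;
	unsigned int num_symtab;
	const char *strtab;
};

struct module {
	_Atomic(struct mod_kallsyms *) kallsyms;	/* view readers use */
	struct mod_kallsyms core_kallsyms;		/* lives in the module */
};

/* Writer: after init is done, publish the core tables in one store. */
static void switch_to_core_kallsyms(struct module *mod)
{
	atomic_store_explicit(&mod->kallsyms, &mod->core_kallsyms,
			      memory_order_release);
}

/* Reader: take one snapshot pointer and use only that snapshot. */
static const char *symbol_name(struct module *mod, unsigned int symnum)
{
	struct mod_kallsyms *k =
		atomic_load_explicit(&mod->kallsyms, memory_order_acquire);

	assert(k != NULL);
	if (symnum >= k->num_symtab)
		return NULL;
	return k->strtab + k->symtab[symnum].st_name;
}
```

A reader can no longer combine the old, larger num_symtab with the new symtab pointer, because both come from the same snapshot.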
    • module: wrapper for symbol name. · 2e7bac53
      Committed by Rusty Russell
      This trivial wrapper adds clarity and makes the following patch
      smaller.
      
      Cc: stable@kernel.org
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      2e7bac53
    • modules: fix modparam async_probe request · 4355efbd
      Committed by Luis R. Rodriguez
      Commit f2411da7 ("driver-core: add driver module
      asynchronous probe support") added async probe support,
      in two forms:
      
        * in-kernel driver specification annotation
        * generic async_probe module parameter (modprobe foo async_probe)
      
      To support the generic kernel parameter, parse_args() was
      extended via commit ecc86170 ("module: add extra
      argument for parse_params() callback"); however, commit
      f2411da7 failed to add the required argument.
      
      This then causes a crash whenever the generic async_probe
      module parameter is used. This was overlooked when the
      form in which in-kernel async probe support was reworked
      a bit... Fix this as originally intended.
      
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: stable@vger.kernel.org (4.2+)
      Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized]
      4355efbd
  8. 01 Feb, 2016 1 commit
  9. 30 Jan, 2016 4 commits
    • pid: Fix spelling in comments · 840d6fe7
      Committed by Zhen Lei
      Accidentally discovered this typo when I studied this module.
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Xinwei Hu <huxinwei@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1454119457-11272-1-git-send-email-thunder.leizhen@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      840d6fe7
    • devm_memremap_pages: fix vmem_altmap lifetime + alignment handling · eb7d78c9
      Committed by Dan Williams
      to_vmem_altmap() needs to return valid results until
      arch_remove_memory() completes.  It also needs to be valid for any pfn
      in a section regardless of whether that pfn maps to data.  This escape
      was a result of a bug in the unit test.
      
      The signature of this bug is that free_pagetable() fails to retrieve a
      vmem_altmap and goes off into the weeds:
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff811d2629>] get_pfnblock_flags_mask+0x49/0x60
       [..]
       Call Trace:
        [<ffffffff811d3477>] free_hot_cold_page+0x97/0x1d0
        [<ffffffff811d367a>] __free_pages+0x2a/0x40
        [<ffffffff8191e669>] free_pagetable+0x8c/0xd4
        [<ffffffff8191ef4e>] remove_pagetable+0x37a/0x808
        [<ffffffff8191b210>] vmemmap_free+0x10/0x20
      
      Fixes: 4b94ffdc ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      eb7d78c9
    • workqueue: skip flush dependency checks for legacy workqueues · 23d11a58
      Committed by Tejun Heo
      fca839c0 ("workqueue: warn if memory reclaim tries to flush
      !WQ_MEM_RECLAIM workqueue") implemented flush dependency warning which
      triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
      flush a !WQ_MEM_RECLAIM workqueue.
      
      This assumes that workqueues marked with WQ_MEM_RECLAIM sit in the memory
      reclaim path and that making them depend on something which may need more
      memory to make forward progress can lead to deadlocks.  Unfortunately,
      workqueues created with the legacy create*_workqueue() interface
      always have WQ_MEM_RECLAIM regardless of whether memory reclaim
      depends on them or not.  These spurious WQ_MEM_RECLAIM markings
      cause spurious triggering of the flush dependency checks.
      
        WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
        workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
        ...
        Workqueue: deferwq deferred_probe_work_func
        [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
        [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
        [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
        [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
        [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
        [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
        [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
        [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
        [<c00f728c>] (migrate_prep) from [<c00bfbc4>] (alloc_contig_range+0xd8/0x338)
        [<c00bfbc4>] (alloc_contig_range) from [<c00f8f18>] (cma_alloc+0xe0/0x1ac)
        [<c00f8f18>] (cma_alloc) from [<c001cac4>] (__alloc_from_contiguous+0x38/0xd8)
        [<c001cac4>] (__alloc_from_contiguous) from [<c001ceb4>] (__dma_alloc+0x240/0x278)
        [<c001ceb4>] (__dma_alloc) from [<c001cf78>] (arm_dma_alloc+0x54/0x5c)
        [<c001cf78>] (arm_dma_alloc) from [<c0355ea4>] (dmam_alloc_coherent+0xc0/0xec)
        [<c0355ea4>] (dmam_alloc_coherent) from [<c039cc4c>] (ahci_port_start+0x150/0x1dc)
        [<c039cc4c>] (ahci_port_start) from [<c0384734>] (ata_host_start.part.3+0xc8/0x1c8)
        [<c0384734>] (ata_host_start.part.3) from [<c03898dc>] (ata_host_activate+0x50/0x148)
        [<c03898dc>] (ata_host_activate) from [<c039d558>] (ahci_host_activate+0x44/0x114)
        [<c039d558>] (ahci_host_activate) from [<c039f05c>] (ahci_platform_init_host+0x1d8/0x3c8)
        [<c039f05c>] (ahci_platform_init_host) from [<c039e6bc>] (tegra_ahci_probe+0x448/0x4e8)
        [<c039e6bc>] (tegra_ahci_probe) from [<c0347058>] (platform_drv_probe+0x50/0xac)
        [<c0347058>] (platform_drv_probe) from [<c03458cc>] (driver_probe_device+0x214/0x2c0)
        [<c03458cc>] (driver_probe_device) from [<c0343cc0>] (bus_for_each_drv+0x60/0x94)
        [<c0343cc0>] (bus_for_each_drv) from [<c03455d8>] (__device_attach+0xb0/0x114)
        [<c03455d8>] (__device_attach) from [<c0344ab8>] (bus_probe_device+0x84/0x8c)
        [<c0344ab8>] (bus_probe_device) from [<c0344f48>] (deferred_probe_work_func+0x68/0x98)
        [<c0344f48>] (deferred_probe_work_func) from [<c003b738>] (process_one_work+0x120/0x3f8)
        [<c003b738>] (process_one_work) from [<c003ba48>] (worker_thread+0x38/0x55c)
        [<c003ba48>] (worker_thread) from [<c0040f14>] (kthread+0xdc/0xf4)
        [<c0040f14>] (kthread) from [<c000f778>] (ret_from_fork+0x14/0x3c)
      
      Fix it by marking workqueues created via create*_workqueue() with
      __WQ_LEGACY and disabling flush dependency checks on them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Thierry Reding <thierry.reding@gmail.com>
      Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
      Fixes: fca839c0 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
      23d11a58
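The effect of the __WQ_LEGACY marking on the flush dependency check can be sketched as a pure predicate. The flag bit values below are illustrative rather than the kernel's, flush_dependency_violation() is a hypothetical name, and the PF_MEMALLOC task case mentioned in the commit message is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are illustrative, not the kernel's actual bit positions. */
#define WQ_MEM_RECLAIM	(1u << 0)
#define __WQ_LEGACY	(1u << 1)

static_assert((WQ_MEM_RECLAIM & __WQ_LEGACY) == 0, "flags must not overlap");

/*
 * Should a work item running on a workqueue with 'flusher_flags' trigger
 * the warning when it flushes a workqueue with 'target_flags'?
 */
static bool flush_dependency_violation(unsigned int flusher_flags,
				       unsigned int target_flags)
{
	/* The target can make progress under memory pressure: fine. */
	if (target_flags & WQ_MEM_RECLAIM)
		return false;

	/* Warn only for a genuine WQ_MEM_RECLAIM flusher, not a legacy one. */
	return (flusher_flags & (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM;
}
```

Masking with both flags and comparing against WQ_MEM_RECLAIM alone expresses "reclaim-marked and not legacy" in a single test.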
    • tracing/stacktrace: Show entire trace if passed in function not found · 6ccd8371
      Committed by Steven Rostedt
      When a max stack trace is discovered, the stack dump is saved. In order to
      not record the overhead of the stack tracer, the ip of the traced function
      is looked for within the dump. The trace is started from the location of
      that function. But if for some reason the ip is not found, the entire stack
      trace is then truncated. That's not very useful. Instead, print everything
      if the ip of the traced function is not found within the trace.
      
      This issue showed up on s390.
      
      Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home
      
      Fixes: 72ac426a ("tracing: Clean up stack tracing and fix fentry updates")
      Cc: stable@vger.kernel.org # v4.3+
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      6ccd8371
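The behaviour described in the stack tracer commit above (start at the traced function's ip, or show everything if it is not found) comes down to a small search. The sketch below is a user-space illustration; trace_start_index() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first entry to report: the position of 'ip'
 * in the saved dump, or 0 (the whole trace) if 'ip' is not in it.
 */
static size_t trace_start_index(const unsigned long *entries, size_t nr,
				unsigned long ip)
{
	assert(entries != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (entries[i] == ip)
			return i;
	}

	/* ip not found: do not truncate, print everything. */
	return 0;
}
```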
  10. 29 Jan, 2016 13 commits
    • perf: Remove/simplify lockdep annotation · 5fa7c8ec
      Committed by Peter Zijlstra
      Now that the perf_event_ctx_lock_nested() call has moved from
      put_event() into perf_event_release_kernel() the first reason is no
      longer valid as that can no longer happen.
      
      The second reason seems to have been invalidated when Al Viro made fput()
      unconditionally async in the following commit:
      
        4a9d4b02 ("switch fput to task_work_add")
      
      such that munmap()->fput()->release()->perf_release() would no longer happen.
      
      Therefore, remove the annotation. This should increase the efficiency
      of lockdep coverage of perf locking.
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5fa7c8ec
    • perf: Synchronously clean up child events · c6e5b732
      Committed by Peter Zijlstra
      The orphan cleanup workqueue doesn't always catch orphans, for example,
      if they never schedule after they are orphaned. IOW, the event leak is
      still very real. It also wouldn't work for kernel counters.
      
      Doing it synchronously is a little hairy due to lock inversion issues,
      but is made to work.
      
      Patch based on work by Alexander Shishkin.
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c6e5b732
    • perf: Untangle 'owner' confusion · 60beda84
      Committed by Peter Zijlstra
      There are two concepts of owner wrt an event and they are conflated:
      
       - event::owner / event::owner_list,
         used by prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE).
      
       - the 'owner' of the event object, typically the file descriptor.
      
      Conflating the two causes trouble with
      scm_rights passing of file descriptors. Passing the event and then
      closing the creating task would render the event 'orphan' and would
      have it cleared out, which is unlikely to be what is expected.
      
      This patch untangles these two concepts by using PERF_EVENT_STATE_EXIT
      to denote the second type.
      Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60beda84
    • perf: Add flags argument to perf_remove_from_context() · 45a0e07a
      Committed by Peter Zijlstra
      In preparation for adding more options, convert the boolean argument
      into a flags word.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      45a0e07a
    • perf: Clean up sync_child_event() · 8ba289b8
      Committed by Peter Zijlstra
      sync_child_event() has outgrown its purpose; it does far too much.
      Bring it back to its named purpose.
      
      Rename __perf_event_exit_task() to perf_event_exit_event() to better
      reflect what it does and move the event->state assignment under the
      ctx->lock, like state changes ought to be.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8ba289b8
    • perf: Robustify event->owner usage and SMP ordering · f47c02c0
      Committed by Peter Zijlstra
      Use smp_store_release() to clear event->owner and
      lockless_dereference() to observe it. Further use READ_ONCE() for all
      lockless reads.
      
      This changes perf_remove_from_owner() to leave event->owner cleared.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f47c02c0
    • perf: Fix STATE_EXIT usage · 6e801e01
      Committed by Peter Zijlstra
      We should never attempt to enable a STATE_EXIT event.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e801e01
    • perf: Update locking order · 07c4a776
      Committed by Peter Zijlstra
      Update the locking order to note that ctx::lock nests inside of
      child_mutex, as per:
      
        perf_ioctl():                ctx::mutex
        -> perf_event_for_each():    event::child_mutex
          -> _perf_event_enable():   ctx::lock
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      07c4a776
    • P
      perf: Remove __free_event() · a0733e69
      Peter Zijlstra committed
      There is but a single caller, so remove the function. We already have
      _free_event(); the extra indirection is nonsensical.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a0733e69
    • A
      perf/bpf: Convert perf_event_array to use struct file · e03e7ee3
      Alexei Starovoitov committed
      Robustify refcounting.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e03e7ee3
    • P
      perf: Fix NULL deref · 828b6f0e
      Peter Zijlstra committed
      Dan reported:
      
        1229                  if (ctx->task == TASK_TOMBSTONE ||
        1230                      !atomic_inc_not_zero(&ctx->refcount)) {
        1231                          raw_spin_unlock(&ctx->lock);
        1232                          ctx = NULL;
                                      ^^^^^^^^^^
      ctx is NULL.
      
        1233                  }
        1234
        1235                  WARN_ON_ONCE(ctx->task != task);
                                           ^^^^^^^^^^^^^^^^^
      The patch adds a NULL dereference.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      828b6f0e
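The report above comes down to dereferencing `ctx` right after the failure branch has set it to NULL. A minimal userspace sketch of one way to restructure such a check, by moving the sanity test into an `else` branch, is shown below. `struct ctx_sketch`, `struct task_sketch`, `TOMBSTONE` and `lock_ctx_sketch()` are hypothetical stand-ins rather than the kernel's definitions, the spinlock and atomic refcount are omitted, and `assert()` stands in for `WARN_ON_ONCE()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task and context types. */
struct task_sketch { int pid; };
struct ctx_sketch {
	struct task_sketch *task;
	int refcount;
};

/* Sentinel marking a dead context, similar in spirit to TASK_TOMBSTONE. */
static struct task_sketch tombstone;
#define TOMBSTONE (&tombstone)

/*
 * Return ctx with an extra reference, or NULL when the context is dead
 * or has no references left. The sanity check lives in the else branch,
 * so it can never run on a NULL ctx.
 */
static struct ctx_sketch *lock_ctx_sketch(struct ctx_sketch *ctx,
					  struct task_sketch *task)
{
	if (ctx->task == TOMBSTONE || ctx->refcount == 0) {
		ctx = NULL;
	} else {
		ctx->refcount++;
		assert(ctx->task == task);	/* stand-in for WARN_ON_ONCE() */
	}
	return ctx;
}
```

Callers then test the return value for NULL before touching the context, which keeps the dead-context path free of dereferences.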
    • P
      perf: Fix race in perf_event_exit_task_context() · 6a3351b6
      Peter Zijlstra committed
      There is a race between perf_event_exit_task_context() and
      orphans_remove_work() which results in a use-after-free.
      
      We mark ctx->task with TASK_TOMBSTONE to indicate a context is
      'dead', under ctx->lock. After which point event_function_call()
      on any event of that context will NOP.
      
      A concurrent orphans_remove_work() will only hold ctx->mutex for
      the list iteration and not serialize against this. Therefore it's
      possible that orphans_remove_work()'s perf_remove_from_context()
      call will fail, but we'll continue to free the event, with the
      result of free'd memory still being on lists and everything.
      
      Once perf_event_exit_task_context() gets around to acquiring
      ctx->mutex it too will iterate the event list, encounter the
      already free'd event and proceed to free it _again_. This fails
      with the WARN in free_event().
      
      Plug the race by having perf_event_exit_task_context() hold
      ctx::mutex over the whole tear-down, thereby 'naturally'
      serializing against all other sites, including the orphan work.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: alexander.shishkin@linux.intel.com
      Cc: dsahern@gmail.com
      Cc: namhyung@kernel.org
      Link: http://lkml.kernel.org/r/20160125130954.GY6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6a3351b6
    • P
      perf: Fix orphan hole · 78cd2c74
      Peter Zijlstra committed
      We should set event->owner before we install the event,
      otherwise there is a hole where the target task can fork() and
      we'll not inherit the event because it thinks the event is
      orphaned.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      78cd2c74
  11. 28 Jan, 2016 1 commit
    • A
      PM: APM_EMULATION does not depend on PM · 993e9fe1
      Arnd Bergmann committed
      The APM emulation code does multiple things, and some of them depend on
      PM_SLEEP, while the battery management does not. However, selecting
      the symbol, as SHARPSL_PM does, causes a Kconfig warning:
      
      warning: (SHARPSL_PM && PMAC_APM_EMU) selects APM_EMULATION which has unmet direct dependencies (PM && SYS_SUPPORTS_APM_EMULATION)
      
      From all I can tell, this is completely harmless, and we can simply allow
      APM_EMULATION to be enabled here, even if PM is not.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      993e9fe1
  12. 27 Jan, 2016 2 commits
  13. 26 Jan, 2016 3 commits
    • A
      tick/sched: Hide unused oneshot timer code · 7809998a
      Arnd Bergmann committed
      A couple of functions in kernel/time/tick-sched.c are only
      relevant for oneshot timer mode, i.e. when hires-timers or
      nohz mode are enabled. If both are disabled, we get gcc warnings
      about them:
      
      kernel/time/tick-sched.c:98:16: warning: 'tick_init_jiffy_update' defined but not used [-Wunused-function]
       static ktime_t tick_init_jiffy_update(void)
                      ^
      kernel/time/tick-sched.c:112:13: warning: 'tick_sched_do_timer' defined but not used [-Wunused-function]
       static void tick_sched_do_timer(ktime_t now)
                   ^
      kernel/time/tick-sched.c:134:13: warning: 'tick_sched_handle' defined but not used [-Wunused-function]
       static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
                   ^
      
      This encloses the whole set of functions in an appropriate ifdef
      to avoid the warning and to make it clearer when they are used.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1453736525-1959191-1-git-send-email-arnd@arndb.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7809998a
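The approach in the tick/sched commit above, enclosing helpers in the same ifdef as their only callers, can be sketched in a few lines of standalone C. The guard condition and the names `oneshot_helper_sketch()` and `tick_sketch()` are assumptions made for illustration; the real guard in kernel/time/tick-sched.c should be taken from the tree itself.

```c
#include <assert.h>

/*
 * Helper that is only meaningful in oneshot mode. Because it sits inside
 * the same guard as its only caller, a build with both options disabled
 * never sees an unused static function. The condition below is an
 * assumed example, not copied from the kernel source.
 */
#if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
static int oneshot_helper_sketch(int now)
{
	return now + 1;
}
#endif

/* Always-built entry point; it reaches the helper only when it exists. */
static int tick_sketch(int now)
{
#if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
	return oneshot_helper_sketch(now);
#else
	return now;
#endif
}
```

Compiling this with `-Wall` and neither macro defined produces no warning for the helper, whereas moving the helper outside the guard would bring back `-Wunused-function`.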
    • M
      irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token · 530cbe10
      Marc Zyngier committed
      Let's take the (outlandish) example of an interrupt controller
      capable of handling both wired interrupts and PCI MSIs.
      
      With the current code, the PCI MSI domain is going to be tagged
      with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.
      
      Things get hairy when we start looking up the domain for a wired
      interrupt (typically when creating it based on some firmware
      information - DT or ACPI).
      
      In irq_create_fwspec_mapping(), we perform the lookup using
      DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
      us one chance out of two to end up with the wrong domain, and
      we try to configure a wired interrupt with the MSI domain.
      Everything grinds to a halt pretty quickly.
      
      What we really need to do is to start looking for a domain that
      would uniquely identify a wired interrupt domain, and only use
      DOMAIN_BUS_ANY as a fallback.
      
      In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
      token, which is going to be used exactly as described above.
      Of course, this depends on the irqchip to set up the domain
      bus_token, and nobody had to implement this so far.
      
      Only so far.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      530cbe10
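The lookup strategy described in the irqdomain commit above (try a token that uniquely identifies a wired domain, and only then fall back to the wildcard) can be illustrated with a small standalone model. Everything here is a hypothetical sketch: `struct domain_sketch`, the `BUS_*` enumerators, `find_match_sketch()` and `find_wired_sketch()` are invented names, and a flat array replaces the kernel's domain list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified bus tokens; BUS_ANY acts as a wildcard when used for lookup. */
enum bus_token_sketch { BUS_ANY = 0, BUS_WIRED, BUS_PCI_MSI };

struct domain_sketch {
	const void *fwnode;		/* firmware node the domain belongs to */
	enum bus_token_sketch token;	/* tag set by the irqchip driver */
};

/* Return the first domain for fwnode whose token matches; BUS_ANY matches all. */
static const struct domain_sketch *
find_match_sketch(const struct domain_sketch *d, size_t n,
		  const void *fwnode, enum bus_token_sketch token)
{
	for (size_t i = 0; i < n; i++) {
		if (d[i].fwnode != fwnode)
			continue;
		if (token == BUS_ANY || d[i].token == token)
			return &d[i];
	}
	return NULL;
}

/* Prefer an explicitly wired domain, fall back to the wildcard lookup. */
static const struct domain_sketch *
find_wired_sketch(const struct domain_sketch *d, size_t n, const void *fwnode)
{
	const struct domain_sketch *found;

	found = find_match_sketch(d, n, fwnode, BUS_WIRED);
	if (!found)
		found = find_match_sketch(d, n, fwnode, BUS_ANY);
	return found;
}
```

With an MSI domain and a wired domain registered for the same firmware node, a plain wildcard lookup returns whichever comes first, while `find_wired_sketch()` returns the wired one. Drivers that never set a token keep working through the fallback.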
    • T
      rtmutex: Make wait_lock irq safe · b4abf910
      Thomas Gleixner committed
      Sasha reported a lockdep splat about a potential deadlock between RCU boosting
      rtmutex and the posix timer it_lock.
      
      CPU0					CPU1
      
      rtmutex_lock(&rcu->rt_mutex)
        spin_lock(&rcu->rt_mutex.wait_lock)
      					local_irq_disable()
      					spin_lock(&timer->it_lock)
      					spin_lock(&rcu->rt_mutex.wait_lock)
      --> Interrupt
          spin_lock(&timer->it_lock)
      
      This is caused by the following code sequence on CPU1
      
           rcu_read_lock()
           x = lookup();
           if (x)
           	spin_lock_irqsave(&x->it_lock);
           rcu_read_unlock();
           return x;
      
      We could fix that in the posix timer code by keeping rcu read locked across
      the spinlocked and irq disabled section, but the above sequence is common and
      there is no reason not to support it.
      
      Taking rt_mutex.wait_lock irq safe prevents the deadlock.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      b4abf910
  14. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro committed
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5955102c
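The wrapper idea in the commit above, where inode_foo(inode) stands for mutex_foo(&inode->i_mutex), can be mimicked in userspace to show why it helps. The sketch below uses a pthread mutex in place of the kernel mutex; `struct inode_sketch` and the `_sketch`-suffixed functions are hypothetical analogues, not the kernel's definitions.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-in for an inode that embeds its own lock. */
struct inode_sketch {
	pthread_mutex_t i_mutex;
};

/* Callers go through these wrappers and never name i_mutex directly. */
static inline void inode_lock_sketch(struct inode_sketch *inode)
{
	pthread_mutex_lock(&inode->i_mutex);
}

static inline void inode_unlock_sketch(struct inode_sketch *inode)
{
	pthread_mutex_unlock(&inode->i_mutex);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static inline int inode_trylock_sketch(struct inode_sketch *inode)
{
	return pthread_mutex_trylock(&inode->i_mutex) == 0;
}
```

Because the field is touched only inside the wrappers, switching the lock type later (the commit message mentions a planned move to an rwsem) means editing these few functions rather than every call site.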
  15. 22 Jan, 2016 2 commits