1. 06 Apr, 2019 (9 commits)
  2. 03 Apr, 2019 (3 commits)
    • bpf: do not restore dst_reg when cur_state is freed · f5959dec
      Committed by Xu Yu
      commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.
      
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() frees the
      current verifier state. After push_stack() returns, dst_reg is restored
      if ptr_is_dst_reg is false. However, dst_reg is a member of cur_state
      and has therefore also been freed, so dereferencing it is a
      use-after-free. Fix this by testing the return value of push_stack()
      before restoring dst_reg.
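
      A minimal sketch of the fixed check in sanitize_ptr_alu(), following the
      upstream patch (simplified; the surrounding verifier code is omitted):

        ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true);
        /*
         * push_stack() returns NULL on allocation failure and has then
         * already freed cur_state. dst_reg points into that freed state,
         * so only restore it when push_stack() succeeded.
         */
        if (!ptr_is_dst_reg && ret)
                *dst_reg = tmp;
        return !ret ? -EFAULT : 0;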
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · a56aa02e
      Committed by Thomas Gleixner
      commit 206b92353c839c0b27a0b9bec24195f93fd6cf7a upstream.
      
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      parameter.
      
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
      forever whenever a bringup callback fails. Unfortunately this issue was
      not recognized when the CPU hotplug code was reworked, so the shortcoming
      just stayed in place.
      
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      
      As a consequence the teardown callbacks, which must be executed on the
      outgoing CPU within stop machine with interrupts disabled, are executed
      on the control CPU in interrupt-enabled, preemptible context, causing the
      kernel to crash and burn. The pre-state-machine code has a different,
      more subtle failure mode resulting in a less obvious use-after-free
      crash, because the control side frees resources which are still in use
      by the undead CPU.
      
      But this is not an x86-only problem. Any architecture which supports the
      SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just
      less likely to be triggered because in 99.99999% of the cases all bringup
      callbacks succeed.
      
      The easy solution of making HOTPLUG_CPU mandatory for SMP does not work on
      all architectures, as the following architectures have either no hotplug
      support at all, or not all of their subarchitectures support it:
      
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      
      Crashing the kernel in such a situation is not an acceptable state
      either.
      
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
      
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
      
       - the CPU stays there forever and is idle
      
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
      
       - Interrupts are not forced away from the CPU
      
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      
      This will expose issues which haven't been exposed before, or only seldom,
      because now the normally transient state of being non-active but online is
      a permanent state. In testing this already exposed an issue vs. work queues,
      where the vmstat code schedules work on the almost dead CPU, which ends up
      in an unbound workqueue and triggers 'preemptible context' warnings. This is
      not a problem of this change; it merely exposes an already existing issue.
      Still, this is better than crashing fully without a chance to debug it.
      
      This is mainly intended as a workaround for those architectures which do
      not support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
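
      The upstream patch gates the rollback on this, roughly (simplified from
      kernel/cpu.c; the rollback is only performed when this returns true,
      otherwise the CPU is parked in the 'dead' idle state described above):

        static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
        {
                if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
                        return true;
                /*
                 * Without CONFIG_HOTPLUG_CPU the low level takedown and the
                 * required synchronizations are not available, so the CPU
                 * has to stay in the state it reached.
                 */
                return st->state <= CPUHP_BRINGUP_CPU;
        }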
      
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • watchdog: Respect watchdog cpumask on CPU hotplug · 336f6b23
      Committed by Thomas Gleixner
      commit 7dd47617114921fdd8c095509e5e7b4373cc44a1 upstream.
      
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      
      The watchdog_enable/disable() functions are now called unconditionally from
      the hotplug callback, i.e. even on CPUs which are not in the watchdog
      cpumask. As a consequence the watchdog can become unstoppable.
      
      Only invoke them when the plugged CPU is in the watchdog cpumask.
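
      A sketch of the resulting hotplug callbacks, following the upstream patch
      to kernel/watchdog.c (simplified):

        int lockup_detector_online_cpu(unsigned int cpu)
        {
                /* Only start the watchdog on CPUs in the watchdog cpumask */
                if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                        watchdog_enable(cpu);
                return 0;
        }

        int lockup_detector_offline_cpu(unsigned int cpu)
        {
                /* ... and only stop it where it was actually started */
                if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                        watchdog_disable(cpu);
                return 0;
        }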
      
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 27 Mar, 2019 (2 commits)
  4. 24 Mar, 2019 (8 commits)
  5. 14 Mar, 2019 (5 commits)
    • bpf: Fix syscall's stackmap lookup potential deadlock · ae26a710
      Committed by Martin KaFai Lau
      [ Upstream commit 7c4cd051add3d00bbff008a133c936c515eaa8fe ]
      
      map_lookup_elem() used to not acquire a spinlock,
      in order to optimize for readers.

      That was true until commit 557c0c6e ("bpf: convert stackmap to pre-allocation").
      The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
      bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
      If that is the case, pcpu_freelist_push() saves this elem for reuse later.
      This push requires a spinlock.
      
      If a tracing bpf_prog gets run in the middle of the syscall's
      map_lookup_elem(stackmap), and this tracing bpf_prog calls
      bpf_get_stackid(stackmap), which also requires the same pcpu_freelist
      spinlock, it may end up in a deadlock situation, as reported by
      Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
      
      The situation is the same as the syscall's map_update_elem() which
      needs to acquire the pcpu_freelist's spinlock and could race
      with tracing bpf_prog.  Hence, this patch fixes it by protecting
      bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
      to prevent tracing bpf_prog from running.
      
      A later commit, f1a2e44a3aec ("bpf: add queue and stack maps"), also makes
      the syscall's map_lookup_elem acquire a spinlock and race with tracing
      bpf_prog similarly.
      Hence, this patch is forward looking and protects the majority
      of the map lookups.  bpf_map_offload_lookup_elem() is the exception
      since it is for network bpf_prog only (i.e. never called by tracing
      bpf_prog).
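
      A sketch of the protection added in the syscall lookup path, simplified
      from the upstream patch to kernel/bpf/syscall.c:

        /*
         * Prevent a tracing bpf_prog from running on this CPU while the
         * syscall side may need the pcpu_freelist spinlock.
         */
        preempt_disable();
        this_cpu_inc(bpf_prog_active);
        if (map->map_type == BPF_MAP_TYPE_STACK_TRACE)
                err = bpf_stackmap_copy(map, key, value);
        this_cpu_dec(bpf_prog_active);
        preempt_enable();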
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bpf: fix potential deadlock in bpf_prog_register · 3bbe6a42
      Committed by Alexei Starovoitov
      [ Upstream commit e16ec34039c701594d55d08a5aa49ee3e1abc821 ]
      
      Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
      [   13.007000] WARNING: possible circular locking dependency detected
      [   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
      [   13.008124] ------------------------------------------------------
      [   13.008624] test_progs/246 is trying to acquire lock:
      [   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
      [   13.009770]
      [   13.009770] but task is already holding lock:
      [   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.010877]
      [   13.010877] which lock already depends on the new lock.
      [   13.010877]
      [   13.011532]
      [   13.011532] the existing dependency chain (in reverse order) is:
      [   13.012129]
      [   13.012129] -> #4 (bpf_event_mutex){+.+.}:
      [   13.012582]        perf_event_query_prog_array+0x9b/0x130
      [   13.013016]        _perf_ioctl+0x3aa/0x830
      [   13.013354]        perf_ioctl+0x2e/0x50
      [   13.013668]        do_vfs_ioctl+0x8f/0x6a0
      [   13.014003]        ksys_ioctl+0x70/0x80
      [   13.014320]        __x64_sys_ioctl+0x16/0x20
      [   13.014668]        do_syscall_64+0x4a/0x180
      [   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.015469]
      [   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
      [   13.015910]        perf_event_init_cpu+0x5a/0x90
      [   13.016291]        perf_event_init+0x1b2/0x1de
      [   13.016654]        start_kernel+0x2b8/0x42a
      [   13.016995]        secondary_startup_64+0xa4/0xb0
      [   13.017382]
      [   13.017382] -> #2 (pmus_lock){+.+.}:
      [   13.017794]        perf_event_init_cpu+0x21/0x90
      [   13.018172]        cpuhp_invoke_callback+0xb3/0x960
      [   13.018573]        _cpu_up+0xa7/0x140
      [   13.018871]        do_cpu_up+0xa4/0xc0
      [   13.019178]        smp_init+0xcd/0xd2
      [   13.019483]        kernel_init_freeable+0x123/0x24f
      [   13.019878]        kernel_init+0xa/0x110
      [   13.020201]        ret_from_fork+0x24/0x30
      [   13.020541]
      [   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
      [   13.021051]        static_key_slow_inc+0xe/0x20
      [   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
      [   13.021891]        perf_trace_event_init+0x11f/0x250
      [   13.022297]        perf_trace_init+0x6b/0xa0
      [   13.022644]        perf_tp_event_init+0x25/0x40
      [   13.023011]        perf_try_init_event+0x6b/0x90
      [   13.023386]        perf_event_alloc+0x9a8/0xc40
      [   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
      [   13.024173]        do_syscall_64+0x4a/0x180
      [   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.024968]
      [   13.024968] -> #0 (tracepoints_mutex){+.+.}:
      [   13.025434]        __mutex_lock+0x86/0x970
      [   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
      [   13.026215]        bpf_probe_register+0x40/0x60
      [   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.027042]        __do_sys_bpf+0x94f/0x1a90
      [   13.027389]        do_syscall_64+0x4a/0x180
      [   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.028171]
      [   13.028171] other info that might help us debug this:
      [   13.028171]
      [   13.028807] Chain exists of:
      [   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
      [   13.028807]
      [   13.029666]  Possible unsafe locking scenario:
      [   13.029666]
      [   13.030140]        CPU0                    CPU1
      [   13.030510]        ----                    ----
      [   13.030875]   lock(bpf_event_mutex);
      [   13.031166]                                lock(&cpuctx_mutex);
      [   13.031645]                                lock(bpf_event_mutex);
      [   13.032135]   lock(tracepoints_mutex);
      [   13.032441]
      [   13.032441]  *** DEADLOCK ***
      [   13.032441]
      [   13.032911] 1 lock held by test_progs/246:
      [   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.033909]
      [   13.033909] stack backtrace:
      [   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
      [   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   13.035657] Call Trace:
      [   13.035859]  dump_stack+0x5f/0x8b
      [   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
      [   13.036526]  __lock_acquire+0x1158/0x1350
      [   13.036852]  ? lock_acquire+0x98/0x190
      [   13.037154]  lock_acquire+0x98/0x190
      [   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.037876]  __mutex_lock+0x86/0x970
      [   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.039028]  ? __mutex_lock+0x86/0x970
      [   13.039337]  ? __mutex_lock+0x24a/0x970
      [   13.039649]  ? bpf_probe_register+0x1d/0x60
      [   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
      [   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
      [   13.041325]  bpf_probe_register+0x40/0x60
      [   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.042068]  ? __might_fault+0x3e/0x90
      [   13.042374]  __do_sys_bpf+0x94f/0x1a90
      [   13.042678]  do_syscall_64+0x4a/0x180
      [   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.043382] RIP: 0033:0x7f23b10a07f9
      [   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
      [   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
      [   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
      [   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      [   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
      
      Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister(),
      there is no need to take bpf_event_mutex too.
      bpf_event_mutex protects modifications to the prog array used in kprobe/perf
      bpf progs; bpf_raw_tracepoints don't need to take this mutex.
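
      The fix simply drops the mutex from the raw tracepoint paths, roughly
      (following the upstream patch):

        int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
        {
                /*
                 * tracepoint_probe_register() takes tracepoints_mutex itself;
                 * bpf_event_mutex is not needed for raw tracepoints.
                 */
                return __bpf_probe_register(btp, prog);
        }

        int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
        {
                return tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
        }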
      
      Fixes: c4f6699d ("bpf: introduce BPF_RAW_TRACEPOINT")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bpf: fix lockdep false positive in percpu_freelist · e3bc64c9
      Committed by Alexei Starovoitov
      [ Upstream commit a89fac57b5d080771efd4d71feaae19877cf68f0 ]
      
      Lockdep warns about false positive:
      [   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
      [   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
      [   12.493275]  (&rq->lock){-.-.}
      [   12.493276]
      [   12.493276]
      [   12.493276] and interrupts could create inverse lock ordering between them.
      [   12.493276]
      [   12.494435]
      [   12.494435] other info that might help us debug this:
      [   12.494979]  Possible interrupt unsafe locking scenario:
      [   12.494979]
      [   12.495518]        CPU0                    CPU1
      [   12.495879]        ----                    ----
      [   12.496243]   lock(&head->lock);
      [   12.496502]                                local_irq_disable();
      [   12.496969]                                lock(&rq->lock);
      [   12.497431]                                lock(&head->lock);
      [   12.497890]   <Interrupt>
      [   12.498104]     lock(&rq->lock);
      [   12.498368]
      [   12.498368]  *** DEADLOCK ***
      [   12.498368]
      [   12.498837] 1 lock held by dd/276:
      [   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
      [   12.499747]
      [   12.499747] the shortest dependencies between 2nd lock and 1st lock:
      [   12.500389]  -> (&rq->lock){-.-.} {
      [   12.500669]     IN-HARDIRQ-W at:
      [   12.500934]                       _raw_spin_lock+0x2f/0x40
      [   12.501373]                       scheduler_tick+0x4c/0xf0
      [   12.501812]                       update_process_times+0x40/0x50
      [   12.502294]                       tick_periodic+0x27/0xb0
      [   12.502723]                       tick_handle_periodic+0x1f/0x60
      [   12.503203]                       timer_interrupt+0x11/0x20
      [   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
      [   12.504167]                       handle_irq_event_percpu+0x20/0x50
      [   12.504674]                       handle_irq_event+0x37/0x60
      [   12.505139]                       handle_level_irq+0xa7/0x120
      [   12.505601]                       handle_irq+0xa1/0x150
      [   12.506018]                       do_IRQ+0x77/0x140
      [   12.506411]                       ret_from_intr+0x0/0x1d
      [   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
      [   12.507362]                       __setup_irq+0x481/0x730
      [   12.507789]                       setup_irq+0x49/0x80
      [   12.508195]                       hpet_time_init+0x21/0x32
      [   12.508644]                       x86_late_time_init+0xb/0x16
      [   12.509106]                       start_kernel+0x390/0x42a
      [   12.509554]                       secondary_startup_64+0xa4/0xb0
      [   12.510034]     IN-SOFTIRQ-W at:
      [   12.510305]                       _raw_spin_lock+0x2f/0x40
      [   12.510772]                       try_to_wake_up+0x1c7/0x4e0
      [   12.511220]                       swake_up_locked+0x20/0x40
      [   12.511657]                       swake_up_one+0x1a/0x30
      [   12.512070]                       rcu_process_callbacks+0xc5/0x650
      [   12.512553]                       __do_softirq+0xe6/0x47b
      [   12.512978]                       irq_exit+0xc3/0xd0
      [   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
      [   12.513876]                       apic_timer_interrupt+0xf/0x20
      [   12.514343]                       default_idle+0x1c/0x170
      [   12.514765]                       do_idle+0x199/0x240
      [   12.515159]                       cpu_startup_entry+0x19/0x20
      [   12.515614]                       start_kernel+0x422/0x42a
      [   12.516045]                       secondary_startup_64+0xa4/0xb0
      [   12.516521]     INITIAL USE at:
      [   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
      [   12.517258]                      rq_attach_root+0x16/0xd0
      [   12.517685]                      sched_init+0x2f2/0x3eb
      [   12.518096]                      start_kernel+0x1fb/0x42a
      [   12.518525]                      secondary_startup_64+0xa4/0xb0
      [   12.518986]   }
      [   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
      [   12.519649]   ... acquired at:
      [   12.519892]    pcpu_freelist_pop+0x7b/0xd0
      [   12.520221]    bpf_get_stackid+0x1d2/0x4d0
      [   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.520887]
      [   12.521008] -> (&head->lock){+...} {
      [   12.521292]    HARDIRQ-ON-W at:
      [   12.521539]                     _raw_spin_lock+0x2f/0x40
      [   12.521950]                     pcpu_freelist_push+0x2a/0x40
      [   12.522396]                     bpf_get_stackid+0x494/0x4d0
      [   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
      [   12.523296]    INITIAL USE at:
      [   12.523537]                    _raw_spin_lock+0x2f/0x40
      [   12.523944]                    pcpu_freelist_populate+0xc0/0x120
      [   12.524417]                    htab_map_alloc+0x405/0x500
      [   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
      [   12.525253]                    do_syscall_64+0x4a/0x180
      [   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   12.526167]  }
      [   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
      [   12.526812]  ... acquired at:
      [   12.527047]    __lock_acquire+0x521/0x1350
      [   12.527371]    lock_acquire+0x98/0x190
      [   12.527680]    _raw_spin_lock+0x2f/0x40
      [   12.527994]    pcpu_freelist_push+0x2a/0x40
      [   12.528325]    bpf_get_stackid+0x494/0x4d0
      [   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.528970]
      [   12.529092]
      [   12.529092] stack backtrace:
      [   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
      [   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   12.530750] Call Trace:
      [   12.530948]  dump_stack+0x5f/0x8b
      [   12.531248]  check_usage_backwards+0x10c/0x120
      [   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
      [   12.531935]  ? mark_lock+0x382/0x560
      [   12.532229]  mark_lock+0x382/0x560
      [   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
      [   12.532928]  __lock_acquire+0x521/0x1350
      [   12.533271]  ? find_get_entry+0x17f/0x2e0
      [   12.533586]  ? find_get_entry+0x19c/0x2e0
      [   12.533902]  ? lock_acquire+0x98/0x190
      [   12.534196]  lock_acquire+0x98/0x190
      [   12.534482]  ? pcpu_freelist_push+0x2a/0x40
      [   12.534810]  _raw_spin_lock+0x2f/0x40
      [   12.535099]  ? pcpu_freelist_push+0x2a/0x40
      [   12.535432]  pcpu_freelist_push+0x2a/0x40
      [   12.535750]  bpf_get_stackid+0x494/0x4d0
      [   12.536062]  ___bpf_prog_run+0x8b4/0x11a0
      
      It has been explained why this is a false positive here:
      https://lkml.org/lkml/2018/7/25/756
      Recap:
      - stackmap uses pcpu_freelist
      - The lock in pcpu_freelist is a percpu lock
      - stackmap is only used by tracing bpf_prog
      - A tracing bpf_prog cannot be run if another bpf_prog
        has already been running (ensured by the percpu bpf_prog_active counter).
      
      Eric pointed out that this lockdep splat stops other
      legitimate lockdep splats in selftests/bpf/test_progs.c.
      
      Fix this by calling local_irq_save/restore for stackmap.
      
      Another false positive had also been worked around by calling
      local_irq_save in commit 89ad2fa3 ("bpf: fix lockdep splat").
      That commit added unnecessary irq_save/restore to the fast path of
      the bpf hash map. IRQs are already disabled at that point, since
      htab holds the per-bucket spin_lock with irqsave.

      Let's reduce the overhead for htab by introducing __pcpu_freelist_push/pop
      variants without irqsave, and convert pcpu_freelist_push/pop to irqsave
      for use elsewhere (right now only in stackmap).
      This stops the lockdep false positive in stackmap with a bit of acceptable overhead.
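
      A simplified sketch of the push side of that split, following the
      upstream patch to kernel/bpf/percpu_freelist.c:

        /*
         * Lock-only variant for callers which already run with IRQs
         * disabled, such as the htab fast path.
         */
        void __pcpu_freelist_push(struct pcpu_freelist *s,
                                  struct pcpu_freelist_node *node)
        {
                struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

                raw_spin_lock(&head->lock);
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
        }

        /*
         * IRQ-safe variant used from stackmap, where a tracing bpf_prog
         * could otherwise interrupt the push on the same CPU.
         */
        void pcpu_freelist_push(struct pcpu_freelist *s,
                                struct pcpu_freelist_node *node)
        {
                unsigned long flags;

                local_irq_save(flags);
                __pcpu_freelist_push(s, node);
                local_irq_restore(flags);
        }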
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • relay: check return of create_buf_file() properly · 232bd90c
      Committed by Greg Kroah-Hartman
      [ Upstream commit 2c1cf00eeacb784781cf1c9896b8af001246d339 ]
      
      If create_buf_file() returns an error, don't try to reference it later
      as a valid dentry pointer.
      
      This problem was exposed when debugfs started to return errors instead
      of just NULL for some calls when they do not succeed properly.
      
      Also, the check for WARN_ON(dentry) was just wrong :)
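
      A sketch of the corrected call site in relay_open_buf(), following the
      upstream patch to kernel/relay.c (simplified; variable names may differ
      by kernel version):

        dentry = chan->cb->create_buf_file(tmpname, chan->parent,
                                           S_IRUSR, buf,
                                           &chan->is_global);
        /* debugfs can now return ERR_PTR() values, not just NULL */
        if (IS_ERR(dentry))
                goto free_buf;
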
      Reported-by: Kees Cook <keescook@chromium.org>
      Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf core: Fix perf_proc_update_handler() bug · 6ec0698f
      Committed by Stephane Eranian
      [ Upstream commit 1a51c5da5acc6c188c917ba572eebac5f8793432 ]
      
      The perf_proc_update_handler() function handles the
      /proc/sys/kernel/perf_event_max_sample_rate sysctl variable. When PMU IRQ
      handler timing monitoring is disabled, i.e., when
      /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100, no
      modification of sysctl_perf_event_sample_rate is allowed, to prevent
      possible hangs from wrong values.
      
      The problem is that the test to prevent modification is made after the
      sysctl variable is modified in perf_proc_update_handler().
      
      You get an error:
      
        $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
        echo: write error: invalid argument
      
      But the value is still modified causing all sorts of inconsistencies:
      
        $ cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
      
      This patch fixes the problem by moving the parsing of the value until
      after the test.
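
      A sketch of the reordered handler, following the upstream patch
      (simplified):

        int perf_proc_update_handler(struct ctl_table *table, int write,
                                     void __user *buffer, size_t *lenp,
                                     loff_t *ppos)
        {
                int ret;
                int perf_cpu = sysctl_perf_cpu_time_max_percent;

                /* If throttling is disabled don't allow the write: */
                if (write && (perf_cpu == 100 || perf_cpu == 0))
                        return -EINVAL;

                /* Parse (and modify) the value only after the test above */
                ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
                if (ret || !write)
                        return ret;

                max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ);
                perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
                update_perf_cpu_limits();

                return 0;
        }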
      
      Committer testing:
      
        # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
        # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
        -bash: echo: write error: Invalid argument
        # cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
        #
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  6. 10 Mar, 2019 (2 commits)
  7. 06 Mar, 2019 (8 commits)
    • locking/rwsem: Fix (possible) missed wakeup · 9ad6216e
      Committed by Xie Yongji
      [ Upstream commit e158488be27b157802753a59b336142dc0eb0380 ]
      
      Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
      case), we must not rely on the wakeup being delayed. However, commit:
      
        e3851390 ("locking/rwsem: Rework zeroing reader waiter->task")
      
      relies on exactly that behaviour in that the wakeup must not happen
      until after we clear waiter->task.
      
      [ peterz: Added changelog. ]
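
      The fix holds a task reference across the store, so that the wakeup can
      be issued after clearing waiter->task (sketch, following the upstream
      patch to __rwsem_mark_wake()):

        get_task_struct(tsk);
        /*
         * Take the reference before setting the reader waiter to nil,
         * so that rwsem_down_read_failed() cannot race with do_exit().
         */
        smp_store_release(&waiter->task, NULL);
        /* Issue the wakeup strictly after the NULL store above. */
        wake_q_add(wake_q, tsk);
        put_task_struct(tsk);
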
      Signed-off-by: Xie Yongji <xieyongji@baidu.com>
      Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: e3851390 ("locking/rwsem: Rework zeroing reader waiter->task")
      Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • futex: Fix (possible) missed wakeup · 2368e6d3
      Committed by Peter Zijlstra
      [ Upstream commit b061c38bef43406df8e73c5be06cbfacad5ee6ad ]
      
      We must not rely on wake_q_add() to delay the wakeup; in particular
      commit:
      
        1d0dcb3a ("futex: Implement lockless wakeups")
      
      moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
      could result in futex_wait() waking before observing ->lock_ptr ==
      NULL and going back to sleep again.
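
      The fix follows the same pattern as the rwsem change above: take a task
      reference, publish ->lock_ptr == NULL, and only then queue the wakeup.
      A sketch of mark_wake_futex() after the patch:

        get_task_struct(p);
        __unqueue_futex(q);
        /*
         * The waiting task can free the futex_q as soon as
         * q->lock_ptr = NULL is written, without taking any locks.
         */
        smp_store_release(&q->lock_ptr, NULL);
        /* Queue the wakeup only after ->lock_ptr is visibly NULL. */
        wake_q_add(wake_q, p);
        put_task_struct(p);
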
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 1d0dcb3a ("futex: Implement lockless wakeups")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/wake_q: Fix wakeup ordering for wake_q · 653a1dbc
      Committed by Peter Zijlstra
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
      Notably, cmpxchg() does not provide ordering when it fails; however,
      wake_q_add() requires ordering in this specific case too. Without it,
      a concurrent wakeup could fail to observe our prior state.
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
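
      In code, the upstream fix provides that ordering explicitly before the
      (possibly failing) cmpxchg() in wake_q_add(), roughly:

        /*
         * Ensure a pending wakeup observes our prior state even when the
         * cmpxchg() fails, hence the explicit full barrier.
         */
        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
                return;
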
      Reported-by: Yongji Xie <elohimes@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/wait: Fix rcuwait_wake_up() ordering · 5024f0a2
      Committed by Prateek Sood
      [ Upstream commit 6dc080eeb2ba01973bfff0d79844d7a59e12542e ]
      
      For some peculiar reason rcuwait_wake_up() has the right barrier in
      the comment, but not in the code.
      
      This mistake has been observed to cause a deadlock in the following
      situation:
      
          P1					P2
      
          percpu_up_read()			percpu_down_write()
            rcu_sync_is_idle() // false
      					  rcu_sync_enter()
      					  ...
            __percpu_up_read()
      
      [S] ,-  __this_cpu_dec(*sem->read_count)
          |   smp_rmb();
      [L] |   task = rcu_dereference(w->task) // NULL
          |
          |				    [S]	    w->task = current
          |					    smp_mb();
          |				    [L]	    readers_active_check() // fail
          `-> <store happens here>
      
      Where the smp_rmb() (obviously) fails to constrain the store.
      
      [ peterz: Added changelog. ]
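
      The fix itself is a one-line barrier upgrade in rcuwait_wake_up(),
      roughly (following the upstream patch):

        rcu_read_lock();
        /*
         * A full barrier is required to order the prior store against
         * the load of w->task; smp_rmb() cannot constrain the store.
         */
        smp_mb(); /* pairs with set_current_state() */
        task = rcu_dereference(w->task);
        if (task)
                wake_up_process(task);
        rcu_read_unlock();
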
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8f95c90c ("sched/wait, RCU: Introduce rcuwait machinery")
      Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • genirq: Make sure the initial affinity is not empty · 17fab891
      Committed by Srinivas Ramana
      [ Upstream commit bddda606ec76550dd63592e32a6e87e7d32583f7 ]
      
      If all CPUs in the irq_default_affinity mask are offline when an interrupt
      is initialized then irq_setup_affinity() can set an empty affinity mask for
      a newly allocated interrupt.
      
      Fix this by falling back to cpu_online_mask in case the resulting affinity
      mask is zero.
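
      A sketch of the fallback in irq_setup_affinity(), following the upstream
      patch (simplified):

        cpumask_and(&mask, cpu_online_mask, set);
        if (cpumask_empty(&mask)) {
                /* All CPUs of irq_default_affinity are offline */
                cpumask_copy(&mask, cpu_online_mask);
        }
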
      Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • genirq/matrix: Improve target CPU selection for managed interrupts. · 765c30b3
      Committed by Long Li
      [ Upstream commit e8da8794a7fd9eef1ec9a07f0d4897c68581c72b ]
      
      On large systems with multiple devices of the same class (e.g. NVMe disks,
      using managed interrupts), the kernel can affinitize these interrupts to a
      small subset of CPUs instead of spreading them out evenly.
      
      irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
      of possible target CPUs which has the lowest number of interrupt vectors
      allocated.
      
      This is done by searching for the CPU with the highest number of available
      vectors. While this is correct for non-managed interrupts, it can select
      the wrong CPU for managed interrupts. Under certain constellations this
      results in affinitizing the managed interrupts of several devices to a
      single CPU in a set.
      
      The book keeping of available vectors works the following way:
      
       1) Non-managed interrupts:
      
          available is decremented when the interrupt is actually requested by
          the device driver and a vector is assigned. It's incremented when the
          interrupt and the vector are freed.
      
       2) Managed interrupts:
      
          Managed interrupts guarantee vector reservation when the MSI/MSI-X
          functionality of a device is enabled, which is achieved by reserving
          vectors in the bitmaps of the possible target CPUs. This reservation
          decrements the available count on each possible target CPU.
      
          When the interrupt is requested by the device driver then a vector is
          allocated from the reserved region. The operation is reversed when the
          interrupt is freed by the device driver. Neither of these operations
          affect the available count.
      
          The reservation persists up to the point where the MSI/MSI-X
          functionality is disabled, and only this operation increments the
          available count again.
      
      For non-managed interrupts the available count is the correct selection
      criterion because the guaranteed reservations need to be taken into
      account. Using the allocated counter could lead to a failing allocation in
      the following situation (total vector space of 10 assumed):
      
      		 CPU0	CPU1
       available:	    2	   0
       allocated:	    5	   3   <--- CPU1 is selected, but available space = 0
       managed reserved:  3	   7
      
       while available yields the correct result.
      
      For managed interrupts the available count is not the appropriate
      selection criterion because as explained above the available count is not
      affected by the actual vector allocation.
      
      The following example illustrates that. Total vector space of 10
      assumed. The starting point is:
      
      		 CPU0	CPU1
       available:	    5	   4
       allocated:	    2	   3
       managed reserved:  3	   3
      
       Allocating vectors for three non-managed interrupts will result in
       affinitizing the first two to CPU0 and the third one to CPU1 because the
       available count is adjusted with each allocation:
      
      		  CPU0	CPU1
       available:	     5	   4	<- Select CPU0 for 1st allocation
       --> allocated:	     3	   3
      
       available:	     4	   4	<- Select CPU0 for 2nd allocation
       --> allocated:	     4	   3
      
       available:	     3	   4	<- Select CPU1 for 3rd allocation
       --> allocated:	     4	   4
      
       But the allocation of three managed interrupts starting from the same
       point will affinitize all of them to CPU0 because the available count is
       not affected by the allocation (see above). So the end result is:
      
      		  CPU0	CPU1
       available:	     5	   4
       allocated:	     5	   3
      
      Introduce a "managed_allocated" field in struct cpumap to track the vector
      allocation for managed interrupts separately. Use this information to
      select the target CPU when a vector is allocated for a managed interrupt,
      which results in more evenly distributed vector assignments. The above
      example results in the following allocations:
      
      		 CPU0	CPU1
       managed_allocated: 0	   0	<- Select CPU0 for 1st allocation
       --> allocated:	    3	   3
      
       managed_allocated: 1	   0	<- Select CPU1 for 2nd allocation
       --> allocated:	    3	   4
      
       managed_allocated: 1	   1	<- Select CPU0 for 3rd allocation
       --> allocated:	    4	   4
      
      The allocation of non-managed interrupts is not affected by this change and
      is still evaluating the available count.
      
      The overall distribution of interrupt vectors for both types of interrupts
      might still not be perfectly even depending on the number of non-managed
      and managed interrupts in a system, but due to the reservation guarantee
      for managed interrupts this cannot be avoided.
      
      Expose the new field in debugfs as well.
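
      The resulting selection helper for managed interrupts looks roughly like
      this (close to the upstream kernel/irq/matrix.c code):

        static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
                                                         const struct cpumask *msk)
        {
                unsigned int cpu, best_cpu, allocated = UINT_MAX;
                struct cpumap *cm;

                best_cpu = UINT_MAX;

                for_each_cpu(cpu, msk) {
                        cm = per_cpu_ptr(m->maps, cpu);

                        /* Pick the online CPU with the fewest managed vectors */
                        if (!cm->online || cm->managed_allocated > allocated)
                                continue;

                        best_cpu = cpu;
                        allocated = cm->managed_allocated;
                }
                return best_cpu;
        }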
      
      [ tglx: Clarified the background of the problem in the changelog and
        	described it independent of NVME ]
      Signed-off-by: Long Li <longli@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • irq/matrix: Spread managed interrupts on allocation · 8cae7757
      Committed by Dou Liyang
      [ Upstream commit 76f99ae5b54d48430d1f0c5512a84da0ff9761e0 ]
      
      Linux spreads out non-managed interrupts across the possible target CPUs
      to avoid vector space exhaustion.
      
      Managed interrupts are treated differently, as for them the vectors are
      reserved (with guarantee) when the interrupt descriptors are initialized.
      
      When the interrupt is requested, a real vector is assigned. The assignment
      logic uses the first CPU in the affinity mask for assignment. If the
      interrupt has more than one CPU in the affinity mask, which happens when a
      multi-queue device has fewer queues than CPUs, then doing the same search
      as for non-managed interrupts makes sense, as it puts the interrupt on the
      least interrupt-plagued CPU. For single-CPU affine vectors that's obviously
      a NOOP.
      
      Restructure the matrix allocation code so it does the 'best CPU' search,
      add a sanity check for an empty affinity mask, and adapt the call site in
      the x86 vector management code.
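
      A condensed sketch of the restructured entry point. Note that
      alloc_managed_vector_on() below is a hypothetical stand-in for the
      unchanged allocation tail, not a real kernel function:

        int irq_matrix_alloc_managed(struct irq_matrix *m,
                                     const struct cpumask *msk,
                                     unsigned int *mapped_cpu)
        {
                unsigned int cpu;

                /* New sanity check: an empty mask cannot yield a target CPU */
                if (cpumask_empty(msk))
                        return -EINVAL;

                /* Same 'best CPU' search as for non-managed interrupts */
                cpu = matrix_find_best_cpu(m, msk);
                if (cpu == UINT_MAX)
                        return -ENOSPC;

                /* Hypothetical helper standing in for the unchanged tail */
                return alloc_managed_vector_on(m, cpu, mapped_cpu);
        }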
      
      [ tglx: Added the empty mask check to the core and improved change log ]
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • irq/matrix: Split out the CPU selection code into a helper · 2948b887
      Committed by Dou Liyang
      [ Upstream commit 8ffe4e61c06a48324cfd97f1199bb9838acce2f2 ]
      
      Linux finds the CPU which has the lowest vector allocation count to spread
      out the non-managed interrupts across the possible target CPUs, but does
      not do so for managed interrupts.
      
      Split out the CPU selection code into a helper function for reuse. No
      functional change.
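
      The helper, close to the upstream kernel/irq/matrix.c code:

        /* Find the best CPU in msk: online, with the most available vectors */
        static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                                 const struct cpumask *msk)
        {
                unsigned int cpu, best_cpu, maxavl = 0;
                struct cpumap *cm;

                best_cpu = UINT_MAX;

                for_each_cpu(cpu, msk) {
                        cm = per_cpu_ptr(m->maps, cpu);

                        if (!cm->online || cm->available <= maxavl)
                                continue;

                        best_cpu = cpu;
                        maxavl = cm->available;
                }
                return best_cpu;
        }
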
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liyang@163.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 27 Feb, 2019 (3 commits)