1. 16 April 2020, 25 commits
    • tracing: Remove open-coding of hist trigger var_ref management · 00265ca6
      Tom Zanussi authored
      commit de40f033d4e84e843d6a12266e3869015ea9097c upstream.
      
      Have create_var_ref() manage the hist trigger's var_ref list, rather
      than having similar code doing it in multiple places.  This cleans up
      the code and makes sure var_refs are always accounted properly.
      
      Also, document the var_ref-related functions to make their purpose
      clearer.
      
      Link: http://lkml.kernel.org/r/05ddae93ff514e66fc03897d6665231892939913.1545161087.git.tom.zanussi@linux.intel.com
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      00265ca6
    • tracing: Use hist trigger's var_ref array to destroy var_refs · 5ee9042a
      Tom Zanussi authored
      commit 656fe2ba85e81d00e4447bf77b8da2be3c47acb2 upstream.
      
      Since every var ref for a trigger has an entry in the var_ref[] array,
      use that to destroy the var_refs, instead of piecemeal via the field
      expressions.
      
      This allows us to avoid having to keep and treat differently separate
      lists for the action-related references, which future patches will
      remove.
      
      Link: http://lkml.kernel.org/r/fad1a164f0e257c158e70d6eadbf6c586e04b2a2.1545161087.git.tom.zanussi@linux.intel.com
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5ee9042a
    • tracing: trigger: Replace unneeded RCU-list traversals · 65362131
      Masami Hiramatsu authored
      commit aeed8aa3874dc15b9d82a6fe796fd7cfbb684448 upstream.
      
      With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
      when I ran ftracetest trigger testcases.
      
      -----
        # dmesg -c > /dev/null
        # ./ftracetest test.d/trigger
        ...
        # dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
        kernel/trace/trace_events_hist.c:6070
        kernel/trace/trace_events_hist.c:1760
        kernel/trace/trace_events_hist.c:5911
        kernel/trace/trace_events_trigger.c:504
        kernel/trace/trace_events_hist.c:1810
        kernel/trace/trace_events_hist.c:3158
        kernel/trace/trace_events_hist.c:3105
        kernel/trace/trace_events_hist.c:5518
        kernel/trace/trace_events_hist.c:5998
        kernel/trace/trace_events_hist.c:6019
        kernel/trace/trace_events_hist.c:6044
        kernel/trace/trace_events_trigger.c:1500
        kernel/trace/trace_events_trigger.c:1540
        kernel/trace/trace_events_trigger.c:539
        kernel/trace/trace_events_trigger.c:584
      -----
      
      I investigated those warnings and found that the RCU-list
      traversals in event trigger and hist didn't need to use
      RCU version because those were called only under event_mutex.
      
      I also checked other RCU-list traversals related to event
      trigger list, and found that most of them were called from
      event_hist_trigger_func() or hist_unregister_trigger() or
      register/unregister functions except for a few cases.
      
      Replace these unneeded RCU-list traversals with the normal list
      traversal macro, and add lockdep_assert_held() to check that
      event_mutex is held.
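
      A sketch of the conversion at one representative call site (details
      vary per site):

        /* Before: RCU traversal, even though we run under event_mutex */
        list_for_each_entry_rcu(data, &file->triggers, list) {
                ...
        }

        /* After: plain traversal plus a lockdep check */
        lockdep_assert_held(&event_mutex);
        list_for_each_entry(data, &file->triggers, list) {
                ...
        }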
      
      Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
      
      Cc: stable@vger.kernel.org
      Fixes: 30350d65 ("tracing: Add variable support to hist triggers")
      Reviewed-by: Tom Zanussi <zanussi@kernel.org>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      65362131
    • bpf, offload: Unlock on error in bpf_offload_dev_create() · 9d85d2db
      Dan Carpenter authored
      [ Upstream commit d0fbb51dfaa612f960519b798387be436e8f83c5 ]
      
      We need to drop the bpf_devs_lock on error before returning.
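
      A minimal sketch of the error-path pattern being fixed (the setup
      step is a placeholder, not the exact upstream code):

        down_write(&bpf_devs_lock);
        err = do_setup();                       /* hypothetical setup step */
        if (err) {
                up_write(&bpf_devs_lock);       /* the previously missing unlock */
                return ERR_PTR(err);
        }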
      
      Fixes: 9fd7c555 ("bpf: offload: aggregate offloads per-device")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Link: https://lore.kernel.org/bpf/20191104091536.GB31509@mwanda
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9d85d2db
    • signal: Allow cifs and drbd to receive their terminating signals · 2ead212c
      Eric W. Biederman authored
      [ Upstream commit 33da8e7c814f77310250bb54a9db36a44c5de784 ]
      
      My recent change to only use force_sig for synchronous events
      wound up breaking signal reception in cifs and drbd.  I had overlooked
      the fact that by default kthreads start out with all signals set to
      SIG_IGN.  So a change I thought was safe turned out to have made it
      impossible for those kernel threads to catch their signals.
      
      Reverting the work on force_sig is a bad idea because what the code
      was doing was very much a misuse of force_sig: the way force_sig
      ultimately allowed the signal to happen was to change the signal
      handler to SIG_DFL, which after the first signal allows userspace
      to send signals to these kernel threads.  At least for
      wake_ack_receiver in drbd that does not appear actively wrong.
      
      So correct this problem by adding allow_kernel_signal, which lets
      through signals whose siginfo reports they were sent by the kernel
      but does not allow userspace-generated signals, and update cifs and
      drbd to call allow_kernel_signal in an appropriate place so that
      their threads can receive this signal.
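
      A sketch of the resulting usage in a kernel thread (the exact call
      sites live in the cifs and drbd parts of the patch):

        /* in the kthread's setup path */
        allow_kernel_signal(SIGKILL);   /* kernel-sent SIGKILL now gets through;
                                         * userspace-sent signals stay ignored */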
      
      Fixing things this way ensures that userspace won't be able to send
      signals and cause problems, that it is clear which signals the
      threads are expecting to receive, and it guarantees that nothing
      else in the system will be affected.
      
      This change was partly inspired by similar cifs and drbd patches that
      added allow_signal.
      Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
      Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
      Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
      Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
      Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2ead212c
    • fork, memcg: alloc_thread_stack_node needs to set tsk->stack · df37416b
      Andrea Arcangeli authored
      [ Upstream commit 1bf4580e00a248a2c86269125390eb3648e1877c ]
      
      Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on
      memcg charge fail") corrected two instances, but there was a third
      instance of this bug.
      
      Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
      execute free_thread_stack() on a dangling pointer.
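
      A sketch of the fixed VMAP_STACK=n path in alloc_thread_stack_node()
      (shape per the upstream fix; details may differ across versions):

        struct page *page = alloc_pages_node(node, THREADINFO_GFP,
                                             THREAD_SIZE_ORDER);

        if (likely(page)) {
                tsk->stack = page_address(page);  /* was left unset before */
                return tsk->stack;                /* free_thread_stack(tsk) is
                                                   * now safe on charge failure */
        }
        return NULL;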
      
      Enterprise kernels are compiled with VMAP_STACK=y so this isn't
      critical, but custom VMAP_STACK=n builds should have some performance
      advantage, with the drawback of risking fork failure when compaction
      doesn't succeed.  So as long as VMAP_STACK=n is a supported option it's
      worth fixing it upstream.
      
      Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
      Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      df37416b
    • perf/ioctl: Add check for the sample_period value · 97b9eadb
      Ravi Bangoria authored
      [ Upstream commit 913a90bc5a3a06b1f04c337320e9aeee2328dd77 ]
      
      perf_event_open() limits the sample_period to 63 bits. See:
      
        0819b2e3 ("perf: Limit perf_event_attr::sample_period to 63 bits")
      
      Make ioctl() consistent with it.
      
      Also, on PowerPC, a negative sample_period could cause recursive
      PMIs leading to a hang (reported when running perf-fuzzer).
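
      A sketch of the added check in the PERF_EVENT_IOC_PERIOD path,
      mirroring the perf_event_open() limit (surrounding code elided):

        if (value & (1ULL << 63))       /* sample_period must fit in 63 bits */
                return -EINVAL;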
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: maddy@linux.vnet.ibm.com
      Cc: mpe@ellerman.id.au
      Fixes: 0819b2e3 ("perf: Limit perf_event_attr::sample_period to 63 bits")
      Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      97b9eadb
    • kdb: do a sanity check on the cpu in kdb_per_cpu() · 4583a31d
      Dan Carpenter authored
      [ Upstream commit b586627e10f57ee3aa8f0cfab0d6f7dc4ae63760 ]
      
      The "whichcpu" comes from argv[3].  The cpu_online() macro looks up the
      cpu in a bitmap of online cpus, but if the value is too high then it
      could read beyond the end of the bitmap and possibly Oops.
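
      A sketch of the added bounds check (the exact condition and return
      in the commit may differ slightly):

        if (whichcpu >= nr_cpu_ids || !cpu_online(whichcpu)) {
                /* reject out-of-range values before indexing the bitmap */
                return KDB_BADCPUNUM;
        }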
      
      Fixes: 5d5314d6 ("kdb: core for kgdb back end (1 of 2)")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4583a31d
    • perf/core: Fix the address filtering fix · 2a8f602a
      Alexander Shishkin authored
      [ Upstream commit 52a44f83fc2d64a5e74d5d685fad2fecc7b7a321 ]
      
      The following recent commit:
      
        c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset")
      
      changes the address filtering logic to communicate filter ranges to the PMU driver
      via a single address range object, instead of having the driver do the final bit of
      math.
      
      That change forgets to take into account kernel filters, which are not calculated
      the same way as DSO based filters.
      
      Fix that by passing the kernel filters the same way as file-based filters.
      This doesn't require any additional changes in the drivers.
      Reported-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset")
      Link: https://lkml.kernel.org/r/20190329091212.29870-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2a8f602a
    • bpf: Add missed newline in verifier verbose log · b04482dd
      Andrey Ignatov authored
      [ Upstream commit 1fbd20f8b77b366ea4aeb92ade72daa7f36a7e3b ]
      
      check_stack_access() that prints verbose log is used in
      adjust_ptr_min_max_vals() that prints its own verbose log and now they
      stick together, e.g.:
      
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
        size=1R2 stack pointer arithmetic goes out of range, prohibited for
        !root
      
      Add missing newline so that log is more readable:
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
        R2 stack pointer arithmetic goes out of range, prohibited for !root
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b04482dd
    • tick/sched: Annotate lockless access to last_jiffies_update · 7d87000e
      Eric Dumazet authored
      commit de95a991bb72e009f47e0c4bbc90fc5f594588d5 upstream.
      
      syzbot (KCSAN) reported a data-race in tick_do_update_jiffies64():
      
      BUG: KCSAN: data-race in tick_do_update_jiffies64 / tick_do_update_jiffies64
      
      write to 0xffffffff8603d008 of 8 bytes by interrupt on cpu 1:
       tick_do_update_jiffies64+0x100/0x250 kernel/time/tick-sched.c:73
       tick_sched_do_timer+0xd4/0xe0 kernel/time/tick-sched.c:138
       tick_sched_timer+0x43/0xe0 kernel/time/tick-sched.c:1292
       __run_hrtimer kernel/time/hrtimer.c:1514 [inline]
       __hrtimer_run_queues+0x274/0x5f0 kernel/time/hrtimer.c:1576
       hrtimer_interrupt+0x22a/0x480 kernel/time/hrtimer.c:1638
       local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1110 [inline]
       smp_apic_timer_interrupt+0xdc/0x280 arch/x86/kernel/apic/apic.c:1135
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
       arch_local_irq_restore arch/x86/include/asm/paravirt.h:756 [inline]
       kcsan_setup_watchpoint+0x1d4/0x460 kernel/kcsan/core.c:436
       check_access kernel/kcsan/core.c:466 [inline]
       __tsan_read1 kernel/kcsan/core.c:593 [inline]
       __tsan_read1+0xc2/0x100 kernel/kcsan/core.c:593
       kallsyms_expand_symbol.constprop.0+0x70/0x160 kernel/kallsyms.c:79
       kallsyms_lookup_name+0x7f/0x120 kernel/kallsyms.c:170
       insert_report_filterlist kernel/kcsan/debugfs.c:155 [inline]
       debugfs_write+0x14b/0x2d0 kernel/kcsan/debugfs.c:256
       full_proxy_write+0xbd/0x100 fs/debugfs/file.c:225
       __vfs_write+0x67/0xc0 fs/read_write.c:494
       vfs_write fs/read_write.c:558 [inline]
       vfs_write+0x18a/0x390 fs/read_write.c:542
       ksys_write+0xd5/0x1b0 fs/read_write.c:611
       __do_sys_write fs/read_write.c:623 [inline]
       __se_sys_write fs/read_write.c:620 [inline]
       __x64_sys_write+0x4c/0x60 fs/read_write.c:620
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      read to 0xffffffff8603d008 of 8 bytes by task 0 on cpu 0:
       tick_do_update_jiffies64+0x2b/0x250 kernel/time/tick-sched.c:62
       tick_nohz_update_jiffies kernel/time/tick-sched.c:505 [inline]
       tick_nohz_irq_enter kernel/time/tick-sched.c:1257 [inline]
       tick_irq_enter+0x139/0x1c0 kernel/time/tick-sched.c:1274
       irq_enter+0x4f/0x60 kernel/softirq.c:354
       entering_irq arch/x86/include/asm/apic.h:517 [inline]
       entering_ack_irq arch/x86/include/asm/apic.h:523 [inline]
       smp_apic_timer_interrupt+0x55/0x280 arch/x86/kernel/apic/apic.c:1133
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
       native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:60
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
       default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x1af/0x280 kernel/sched/idle.c:263
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
       rest_init+0xec/0xf6 init/main.c:452
       arch_call_rest_init+0x17/0x37
       start_kernel+0x838/0x85e init/main.c:786
       x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
       x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Use READ_ONCE() and WRITE_ONCE() to annotate this expected race.
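
      A simplified sketch of the annotation pattern:

        /* lockless reader on the tick_nohz_update_jiffies() path */
        delta = ktime_sub(now, READ_ONCE(last_jiffies_update));

        /* writer, serialized by jiffies_lock */
        WRITE_ONCE(last_jiffies_update,
                   ktime_add(last_jiffies_update, tick_period));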
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20191205045619.204946-1-edumazet@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7d87000e
    • bpf: Fix incorrect verifier simulation of ARSH under ALU32 · 7cb6fa32
      Daniel Borkmann authored
      commit 0af2ffc93a4b50948f9dad2786b7f1bd253bf0b9 upstream.
      
      Anatoly has been fuzzing with kBdysch harness and reported a hang in one
      of the outcomes:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (85) call bpf_get_socket_cookie#46
        1: R0_w=invP(id=0) R10=fp0
        1: (57) r0 &= 808464432
        2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        2: (14) w0 -= 810299440
        3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        3: (c4) w0 s>>= 1
        4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        4: (76) if w0 s>= 0x30303030 goto pc+216
        221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        221: (95) exit
        processed 6 insns (limit 1000000) [...]
      
      Taking a closer look, the program was xlated as follows:
      
        # ./bpftool p d x i 12
        0: (85) call bpf_get_socket_cookie#7800896
        1: (bf) r6 = r0
        2: (57) r6 &= 808464432
        3: (14) w6 -= 810299440
        4: (c4) w6 s>>= 1
        5: (76) if w6 s>= 0x30303030 goto pc+216
        6: (05) goto pc-1
        7: (05) goto pc-1
        8: (05) goto pc-1
        [...]
        220: (05) goto pc-1
        221: (05) goto pc-1
        222: (95) exit
      
      Meaning, the visible effect is very similar to f54c7898ed1c ("bpf: Fix
      precision tracking for unbounded scalars"), that is, the fall-through
      branch in the instruction 5 is considered to be never taken given the
      conclusion from the min/max bounds tracking in w6, and therefore the
      dead-code sanitation rewrites it as goto pc-1. However, real-life input
      disagrees with verification analysis since a soft-lockup was observed.
      
      The bug sits in the analysis of the ARSH. The definition is that we shift
      the target register value right by K bits through shifting in copies of
      its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
      register into 32 bit mode, same happens after simulating the operation.
      However, for the case of simulating the actual ARSH, we don't take the
      mode into account and act as if it's always 64 bit, but location of sign
      bit is different:
      
        dst_reg->smin_value >>= umin_val;
        dst_reg->smax_value >>= umin_val;
        dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
      
      Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
      for example return 0xffff. With the above ARSH simulation, we'd see the
      following results:
      
        [...]
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (57) r0 &= 808464432
          -> R0_runtime = 0x3030
        3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        3: (14) w0 -= 810299440
          -> R0_runtime = 0xcfb40000
        4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
                                    (0xffffffff)
        4: (c4) w0 s>>= 1
          -> R0_runtime = 0xe7da0000
        5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
                                    (0x67c00000)           (0x7ffbfff8)
        [...]
      
      In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
      0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
      is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
      retained in 32 bit mode. In insn 4, the umax was 0xffffffff, and changed into
      0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
      and means here that the simulation didn't retain the sign bit. With above
      logic, the updates happen on the 64 bit min/max bounds and given we coerced
      the register, the sign bits of the bounds are cleared as well, meaning, we
      need to force the simulation into s32 space for 32 bit alu mode.
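
      A sketch of the fix in adjust_scalar_min_max_vals(): in 32 bit alu
      mode, do the arithmetic shift on the sign-extended lower 32 bits
      rather than on the full 64 bit value:

        if (insn_bitness == 32) {
                dst_reg->smin_value = (u32)(((s32)dst_reg->smin_value) >> umin_val);
                dst_reg->smax_value = (u32)(((s32)dst_reg->smax_value) >> umin_val);
        } else {
                dst_reg->smin_value >>= umin_val;
                dst_reg->smax_value >>= umin_val;
        }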
      
      Verification after the fix below. We're first analyzing the fall-through branch
      on 32 bit signed >= test eventually leading to rejection of the program in this
      specific case:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (b7) r2 = 808464432
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (bf) r6 = r0
        3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
        3: (57) r6 &= 808464432
        4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        4: (14) w6 -= 810299440
        5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        5: (c4) w6 s>>= 1
        6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
                                                    (0x67c00000)          (0xfffbfff8)
        6: (76) if w6 s>= 0x30303030 goto pc+216
        7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
        7: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 8 insns (limit 1000000) [...]
      
      Fixes: 9cbe1f5a ("bpf/verifier: improve register value range tracking with ARSH")
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7cb6fa32
    • ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() · 1bf3421d
      Christian Brauner authored
      commit 6b3ad6649a4c75504edeba242d3fd36b3096a57f upstream.
      
      Commit 69f594a3 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
      introduced the ability to opt out of audit messages for accesses to various
      proc files since they are not violations of policy.  While doing so it
      somehow switched the check from ns_capable() to
      has_ns_capability{_noaudit}(). That means it switched from checking the
      subjective credentials of the task to using the objective credentials. This
      is wrong since ptrace_has_cap() is currently only used in
      ptrace_may_access(), and is used to check whether the calling task (subject)
      has the CAP_SYS_PTRACE capability in the provided user namespace to operate
      on the target task (object). According to the cred.h comments this would
      mean the subjective credentials of the calling task need to be used.
      This switches ptrace_has_cap() to use security_capable(). Because we only
      call ptrace_has_cap() in ptrace_may_access() and in there we already have a
      stable reference to the calling task's creds under rcu_read_lock() there's
      no need to go through another series of dereferences and rcu locking done
      in ns_capable{_noaudit}().
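
      A sketch of the reworked helper (shape per the upstream fix):

        static bool ptrace_has_cap(const struct cred *cred,
                                   struct user_namespace *ns, unsigned int mode)
        {
                int ret;

                if (mode & PTRACE_MODE_NOAUDIT)
                        ret = security_capable(cred, ns, CAP_SYS_PTRACE,
                                               CAP_OPT_NOAUDIT);
                else
                        ret = security_capable(cred, ns, CAP_SYS_PTRACE,
                                               CAP_OPT_NONE);

                return ret == 0;
        }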
      
      As one example where this might be particularly problematic, Jann pointed
      out that in combination with the upcoming IORING_OP_OPENAT feature, this
      bug might allow unprivileged users to bypass the capability checks while
      asynchronously opening files like /proc/*/mem, because the capability
      checks for this would be performed against kernel credentials.
      
      To illustrate on the former point about this being exploitable: When
      io_uring creates a new context it records the subjective credentials of the
      caller. Later on, when it starts to do work it creates a kernel thread and
      registers a callback. The callback runs with kernel creds for
      ktask->real_cred and ktask->cred. To prevent this from becoming a
      full-blown 0-day io_uring will call override_cred() and override
      ktask->cred with the subjective credentials of the creator of the io_uring
      instance. With ptrace_has_cap() currently looking at ktask->real_cred this
      override will be ineffective and the caller will be able to open arbitrary
      proc files as mentioned above.
      Luckily, this is currently not exploitable but will turn into a 0-day once
      IORING_OP_OPENAT{2} lands in v5.6. Fix it now!
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Fixes: 69f594a3 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1bf3421d
    • LSM: generalize flag passing to security_capable · 3ab3d636
      Micah Morton authored
      [ Upstream commit c1a85a00ea66cb6f0bd0f14e47c28c2b0999799f ]
      
      This patch provides a general mechanism for passing flags to the
      security_capable LSM hook. It replaces the specific 'audit' flag that is
      used to tell security_capable whether it should log an audit message for
      the given capability check. The reason for generalizing this flag
      passing is so we can add an additional flag that signifies whether
      security_capable is being called by a setid syscall (which is needed by
      the proposed SafeSetID LSM).
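
      A sketch of the generalized interface (flag names per the upstream
      commit):

        /* include/linux/security.h */
        #define CAP_OPT_NONE    0x0
        #define CAP_OPT_NOAUDIT BIT(1)   /* don't audit this capability check */
        #define CAP_OPT_INSETID BIT(2)   /* called from a setid syscall */

        int security_capable(const struct cred *cred, struct user_namespace *ns,
                             int cap, unsigned int opts);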
      Signed-off-by: Micah Morton <mortonm@chromium.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: James Morris <james.morris@microsoft.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3ab3d636
    • membarrier: Fix RCU locking bug caused by faulty merge · fe20cab8
      Peter Zijlstra authored
      mainline inclusion
      from mainline-5.4-rc2
      commit 73956fc07dd7b25d4a33ab3fdd6247c60d0b237c
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      The following commit:
      
        227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
      
      got fat fingered by me when merging it with other patches. It meant to move
      the RCU section out of the for loop but ended up doing it partially, leaving
      a superfluous rcu_read_lock() inside, causing havoc.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-tip-commits@vger.kernel.org
      Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
      Link: https://lkml.kernel.org/r/20191001085033.GP4519@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fe20cab8
    • sched/membarrier: Return -ENOMEM to userspace on memory allocation failure · 17eaf839
      Mathieu Desnoyers authored
      mainline inclusion
      from mainline-5.4-rc1
      commit c172e0a3e8e65a4c6fffec5bc4d6de08d6f894f7
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      Remove the IPI fallback code from membarrier to deal with very
      infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
      than GFP_NOWAIT, and relax the blocking guarantees for the expedited
      membarrier system call commands, allowing it to block if waiting for
      memory to be made available.
      
      In addition, now -ENOMEM can be returned to user-space if the cpumask
      memory allocation fails.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      17eaf839
    • sched/membarrier: Skip IPIs when mm->mm_users == 1 · c5d09e6c
      Mathieu Desnoyers authored
      mainline inclusion
      from mainline-5.4-rc1
      commit c6d68c1c4a4d6611fc0f8145d764226571d737ca
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      If there is only a single mm_user for the mm, the private expedited
      membarrier command can skip the IPIs, because only a single thread
      is using the mm.
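
      A sketch of the added fast path (condition per the commit;
      surrounding memory barriers elided):

        /* a single mm_user means no other CPU can be using this mm */
        if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
                return 0;       /* skip the IPI round-trip entirely */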
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c5d09e6c
    • sched/membarrier: Fix p->mm->membarrier_state racy load · d9913551
      Mathieu Desnoyers authored
      mainline inclusion
      from mainline-5.4-rc1
      commit 227a4aadc75ba22fcb6c4e1c078817b8cbaae4ce
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change. In order to take care of the case
      where a runqueue keeps executing the target mm without swapping to
      other mm, iterate over each runqueue and issue an IPI to copy the
      membarrier_state from the mm_struct into each runqueue that has the
      same mm whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clearing the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      
      [Cheng Jian: use task_rcu_dereference in sync_runqueues_membarrier_state]
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d9913551
    • sched: Clean up active_mm reference counting · 6674b9f9
      Peter Zijlstra authored
      mainline inclusion
      from mainline-5.4-rc1
      commit 139d025cda1da5484e7287b35c019fe1dcf9b650
      category: bugfix
      bugzilla: 28332 [preparation]
      CVE: NA
      
      -------------------------------------------------
      
      The current active_mm reference counting is confusing and sub-optimal.
      
      Rewrite the code to explicitly consider the 4 separate cases:
      
          user -> user
      
      	When switching between two user tasks, all we need to consider
      	is switch_mm().
      
          user -> kernel
      
      	When switching from a user task to a kernel task (which
      	doesn't have an associated mm) we retain the last mm in our
      	active_mm. Increment a reference count on active_mm.
      
        kernel -> kernel
      
      	When switching between kernel threads, all we need to do is
      	pass along the active_mm reference.
      
        kernel -> user
      
      	When switching between a kernel and user task, we must switch
      	from the last active_mm to the next mm, hoping of course that
      	these are the same. Decrement a reference on the active_mm.
      
      The code keeps a different order, because as you'll note, both 'to
      user' cases require switch_mm().
      
      And where the old code would increment/decrement for the 'kernel ->
      kernel' case, the new code observes this is a neutral operation and
      avoids touching the reference count.
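
      The restructured mm handling in context_switch() then looks roughly
      like this (from the commit):

        if (!next->mm) {                        /* to kernel */
                enter_lazy_tlb(prev->active_mm, next);
                next->active_mm = prev->active_mm;
                if (prev->mm)                   /* from user */
                        mmgrab(prev->active_mm);
        } else {                                /* to user */
                switch_mm_irqs_off(prev->active_mm, next->mm, next);
                if (!prev->mm) {                /* from kernel */
                        /* will mmdrop() in finish_task_switch() */
                        rq->prev_mm = prev->active_mm;
                        prev->active_mm = NULL;
                }
        }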
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: luto@kernel.org
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6674b9f9
    • sched/membarrier: Remove redundant check · d45700e8
      Mathieu Desnoyers authored
      mainline inclusion
      from mainline-5.4-rc1
      commit 09554009c0cad4cb2223dd943c813c9257c6883a
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      Checking that the number of threads is 1 is redundant with checking
      mm_users == 1.
      
      No change in functionality intended.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d45700e8
    • PM / hibernate: introduce system_in_hibernation · 08596eae
      Hongbo Yao authored
      hulk inclusion
      category: bugfix
      bugzilla: 26326
      CVE: NA
      
      -------------------------------------------------
      Introduce the boolean function system_in_hibernation(), which returns
      'true' while the system is carrying out hibernation.

      Some device drivers or syscore code need such a function to check
      whether the system is in the hibernation phase.
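
      A minimal sketch of what such a helper looks like (this is a
      hulk-internal interface; the backing state flag here is hypothetical):

        bool system_in_hibernation(void)
        {
                return in_hibernation;  /* hypothetical flag set around hibernate() */
        }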
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      08596eae
    • tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined · 35934cc1
      Steven Rostedt (VMware) authored
      commit b8299d362d0837ae39e87e9019ebe6b736e0f035 upstream.
      
      On some archs with some configurations, MCOUNT_INSN_SIZE is not defined, and
      this makes the stack tracer fail to compile. Just define it to zero in this
      case.
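
      The fix amounts to a fallback definition along these lines:

        #ifndef MCOUNT_INSN_SIZE
        /* zero is a safe fallback when the arch does not define it */
        # define MCOUNT_INSN_SIZE 0
        #endif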
      
      Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com
      
      Cc: stable@vger.kernel.org
      Fixes: 4df29712 ("tracing: Remove most or all of stack tracer stack size from stack_max_size")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      35934cc1
    • kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail · 9641ccb8
      Kaitao Cheng authored
      commit 50f9ad607ea891a9308e67b81f774c71736d1098 upstream.
      
      In the function, if register_trace_sched_migrate_task() returns an
      error, the sched_switch/sched_wakeup_new/sched_wakeup tracepoints
      won't be unregistered. That is why fail_deprobe_sched_switch was added.
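
      A sketch of the corrected unwinding in tracing_sched_register()
      (labels per the commit; earlier registrations elided):

        ret = register_trace_sched_migrate_task(probe_sched_migrate_task, NULL);
        if (ret)
                goto fail_deprobe_sched_switch;         /* was: return ret; */
        return ret;

       fail_deprobe_sched_switch:
        unregister_trace_sched_switch(probe_sched_switch, NULL);
       fail_deprobe_wake_new:
        unregister_trace_sched_wakeup_new(probe_sched_wakeup, NULL);
       fail_deprobe:
        unregister_trace_sched_wakeup(probe_sched_wakeup, NULL);
        return ret;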
      
      Link: http://lkml.kernel.org/r/20191231133530.2794-1-pilgrimtao@gmail.com
      
      Cc: stable@vger.kernel.org
      Fixes: 478142c3 ("tracing: do not grab lock in wakeup latency function tracing")
      Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9641ccb8
    • locking/spinlock/debug: Fix various data races · c2428b60
      Marco Elver authored
      [ Upstream commit 1a365e822372ba24c9da0822bc583894f6f3d821 ]
      
      This fixes various data races in spinlock_debug. By testing with KCSAN,
      it is observable that the console gets spammed with data races reports,
      suggesting these are extremely frequent.
      
      Example data race report:
      
        read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2:
         debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline]
         do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112
         __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
         _raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151
         spin_lock include/linux/spinlock.h:338 [inline]
         get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873
         get_partial_node mm/slub.c:1870 [inline]
        <snip>
      
        write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3:
         debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline]
         do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138
         __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline]
         _raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191
         spin_unlock_irqrestore include/linux/spinlock.h:393 [inline]
         free_debug_processing+0x1b3/0x210 mm/slub.c:1214
         __slab_free+0x292/0x400 mm/slub.c:2864
        <snip>
      
      As a side-effect, with KCSAN, this eventually locks up the console, most
      likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug ->
      KCSAN detects data race -> kcsan_print_report() -> printk lock ->
      deadlock.
      
      This fix will 1) avoid the data races, and 2) allow using lock debugging
      together with KCSAN.
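
      The fix marks the intentionally racy debug-field accesses, e.g.
      (simplified):

        static inline void debug_spin_lock_after(raw_spinlock_t *lock)
        {
                WRITE_ONCE(lock->owner_cpu, raw_smp_processor_id());
                WRITE_ONCE(lock->owner, current);
        }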
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c2428b60
    • bpf: Fix passing modified ctx to ld/abs/ind instruction · 0cc793ad
      Daniel Borkmann authored
      commit 6d4f151acf9a4f6fab09b615f246c717ddedcf0c upstream.
      
      Anatoly has been fuzzing with kBdysch harness and reported a KASAN
      slab oob in one of the outcomes:
      
        [...]
        [   77.359642] BUG: KASAN: slab-out-of-bounds in bpf_skb_load_helper_8_no_cache+0x71/0x130
        [   77.360463] Read of size 4 at addr ffff8880679bac68 by task bpf/406
        [   77.361119]
        [   77.361289] CPU: 2 PID: 406 Comm: bpf Not tainted 5.5.0-rc2-xfstests-00157-g2187f215eba #1
        [   77.362134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        [   77.362984] Call Trace:
        [   77.363249]  dump_stack+0x97/0xe0
        [   77.363603]  print_address_description.constprop.0+0x1d/0x220
        [   77.364251]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
        [   77.365030]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
        [   77.365860]  __kasan_report.cold+0x37/0x7b
        [   77.366365]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
        [   77.366940]  kasan_report+0xe/0x20
        [   77.367295]  bpf_skb_load_helper_8_no_cache+0x71/0x130
        [   77.367821]  ? bpf_skb_load_helper_8+0xf0/0xf0
        [   77.368278]  ? mark_lock+0xa3/0x9b0
        [   77.368641]  ? kvm_sched_clock_read+0x14/0x30
        [   77.369096]  ? sched_clock+0x5/0x10
        [   77.369460]  ? sched_clock_cpu+0x18/0x110
        [   77.369876]  ? bpf_skb_load_helper_8+0xf0/0xf0
        [   77.370330]  ___bpf_prog_run+0x16c0/0x28f0
        [   77.370755]  __bpf_prog_run32+0x83/0xc0
        [   77.371153]  ? __bpf_prog_run64+0xc0/0xc0
        [   77.371568]  ? match_held_lock+0x1b/0x230
        [   77.371984]  ? rcu_read_lock_held+0xa1/0xb0
        [   77.372416]  ? rcu_is_watching+0x34/0x50
        [   77.372826]  sk_filter_trim_cap+0x17c/0x4d0
        [   77.373259]  ? sock_kzfree_s+0x40/0x40
        [   77.373648]  ? __get_filter+0x150/0x150
        [   77.374059]  ? skb_copy_datagram_from_iter+0x80/0x280
        [   77.374581]  ? do_raw_spin_unlock+0xa5/0x140
        [   77.375025]  unix_dgram_sendmsg+0x33a/0xa70
        [   77.375459]  ? do_raw_spin_lock+0x1d0/0x1d0
        [   77.375893]  ? unix_peer_get+0xa0/0xa0
        [   77.376287]  ? __fget_light+0xa4/0xf0
        [   77.376670]  __sys_sendto+0x265/0x280
        [   77.377056]  ? __ia32_sys_getpeername+0x50/0x50
        [   77.377523]  ? lock_downgrade+0x350/0x350
        [   77.377940]  ? __sys_setsockopt+0x2a6/0x2c0
        [   77.378374]  ? sock_read_iter+0x240/0x240
        [   77.378789]  ? __sys_socketpair+0x22a/0x300
        [   77.379221]  ? __ia32_sys_socket+0x50/0x50
        [   77.379649]  ? mark_held_locks+0x1d/0x90
        [   77.380059]  ? trace_hardirqs_on_thunk+0x1a/0x1c
        [   77.380536]  __x64_sys_sendto+0x74/0x90
        [   77.380938]  do_syscall_64+0x68/0x2a0
        [   77.381324]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [   77.381878] RIP: 0033:0x44c070
        [...]
      
      After further debugging, it turns out that while for other helper functions
      we disallow passing a modified ctx, the special case of the ld/abs/ind
      instruction, which has similar semantics (except r6 being the ctx argument),
      is missing such a check. A modified ctx is impossible here, as
      bpf_skb_load_helper_8_no_cache() and others expect skb fields in their
      original position; hence, add check_ctx_reg() to reject any modified ctx.
      The issue was first introduced back
      in f1174f77 ("bpf/verifier: rework value tracking").
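
      The added check is essentially (simplified):

        /* ld/abs/ind implicitly uses r6 as its ctx argument */
        err = check_ctx_reg(env, &regs[BPF_REG_6], BPF_REG_6);
        if (err < 0)
                return err;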
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200106215157.3553-1-daniel@iogearbox.net
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0cc793ad
  2. 15 April 2020, 15 commits
    • ftrace: Avoid potential division by zero in function profiler · d7f598ab
      Wen Yang authored
      commit e31f7939c1c27faa5d0e3f14519eaf7c89e8a69d upstream.
      
      The ftrace_profile->counter is unsigned long and
      do_div truncates it to 32 bits, which means it can test
      non-zero and be truncated to zero for division.
      Fix this issue by using div64_ul() instead.
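
      The fix replaces the truncating division, e.g.:

        /* before: do_div() truncates the 64-bit divisor to 32 bits */
        do_div(stddev, rec->counter * (rec->counter - 1) * 1000);

        /* after: full 64-bit division */
        stddev = div64_ul(stddev, rec->counter * (rec->counter - 1) * 1000);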
      
      Link: http://lkml.kernel.org/r/20200103030248.14516-1-wenyang@linux.alibaba.com
      
      Cc: stable@vger.kernel.org
      Fixes: e330b3bc ("tracing: Show sample std dev in function profiling")
      Fixes: 34886c8b ("tracing: add average time in function to function profiler")
      Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d7f598ab
    • exit: panic before exit_mm() on global init exit · c954cefc
      chenqiwu authored
      commit 43cf75d96409a20ef06b756877a2e72b10a026fc upstream.
      
      Currently, when global init and all threads in its thread-group have exited
      we panic via:
      do_exit()
      -> exit_notify()
         -> forget_original_parent()
            -> find_child_reaper()
      This makes it hard to extract a usable coredump for global init from a
      kernel crashdump because by the time we panic exit_mm() will have already
      released global init's mm.
      This patch moves the panic further up, before exit_mm() is called. As was the
      case previously, we only panic when global init and all its threads in the
      thread-group have exited.
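
      A sketch of the moved check in do_exit() (placement per the commit
      message; surrounding code elided):

        group_dead = atomic_dec_and_test(&tsk->signal->live);
        if (group_dead && unlikely(is_global_init(tsk)))
                /* panic while global init's mm is still intact */
                panic("Attempted to kill init! exitcode=0x%08x\n",
                      tsk->signal->group_exit_code ?: (int)code);
        ...
        exit_mm();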
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      [christian.brauner@ubuntu.com: fix typo, rewrite commit message]
      Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c954cefc
    • tracing: Fix endianness bug in histogram trigger · 8ca91c0f
      Sven Schnelle authored
      commit fe6e096a5bbf73a142f09c72e7aa2835026eb1a3 upstream.
      
      At least on PA-RISC and s390 synthetic histogram triggers are failing
      selftests because trace_event_raw_event_synth() always writes a 64 bit
      value, but the reader expects a field->size sized value. On little endian
      machines this doesn't hurt, but on big endian this makes the reader always
      read zero values.
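
      A sketch of the endian-safe write, dispatching on field->size instead
      of always storing a u64 (simplified):

        u64 val = var_ref_vals[var_ref_idx + i];

        switch (field->size) {
        case 1:  *(u8 *)&entry->fields[n_u64]  = (u8)val;  break;
        case 2:  *(u16 *)&entry->fields[n_u64] = (u16)val; break;
        case 4:  *(u32 *)&entry->fields[n_u64] = (u32)val; break;
        default: entry->fields[n_u64] = val;               break;
        }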
      
      Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-4-svens@linux.ibm.com
      
      Cc: stable@vger.kernel.org
      Fixes: 4b147936 ("tracing: Add support for 'synthetic' events")
      Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8ca91c0f
    • tracing: Have the histogram compare functions convert to u64 first · 2d9148f6
      Steven Rostedt (VMware) authored
      commit 106f41f5a302cb1f36c7543fae6a05de12e96fa4 upstream.
      
      The compare functions of the histogram code would be specific for the size
      of the value being compared (byte, short, int, long long). It would
      reference the value from the array via the type of the compare, but the
      value was stored in a 64 bit number. This is fine for little endian
      machines, but for big endian machines, it would end up comparing zeros or
      all ones (depending on the sign) for anything but 64 bit numbers.
      
      To fix this, first dereference the value as a u64, then convert it to
      the type being compared.
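
      The fixed comparators read a full u64 first and then cast, roughly:

        #define DEFINE_TRACING_MAP_CMP_FN(type)                         \
        static int tracing_map_cmp_##type(void *val_a, void *val_b)    \
        {                                                               \
                type a = (type)(*(u64 *)val_a);                         \
                type b = (type)(*(u64 *)val_b);                         \
                                                                        \
                return (a > b) ? 1 : ((a < b) ? -1 : 0);                \
        }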
      
      Link: http://lkml.kernel.org/r/20191211103557.7bed6928@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Acked-by: Tom Zanussi <zanussi@kernel.org>
      Reported-by: Sven Schnelle <svens@stackframe.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2d9148f6
    • tracing: Avoid memory leak in process_system_preds() · 5497ad6d
      Keita Suzuki authored
      commit 79e65c27f09683fbb50c33acab395d0ddf5302d2 upstream.
      
      When failing in the allocation of filter_item, process_system_preds()
      goes to fail_mem, where the allocated filter is freed.
      
      However, this leads to memory leak of filter->filter_string and
      filter->prog, which is allocated before and in process_preds().
      This bug has been detected by kmemleak as well.
      
      Fix this by changing kfree to __free_filter (see the sketch after the
      kmemleak report below).
      
      unreferenced object 0xffff8880658007c0 (size 32):
        comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
        hex dump (first 32 bytes):
          63 6f 6d 6d 6f 6e 5f 70 69 64 20 20 3e 20 31 30  common_pid  > 10
          00 00 00 00 00 00 00 00 65 73 00 00 00 00 00 00  ........es......
        backtrace:
          [<0000000067441602>] kstrdup+0x2d/0x60
          [<00000000141cf7b7>] apply_subsystem_event_filter+0x378/0x932
          [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
          [<0000000072da2bee>] vfs_write+0xe1/0x240
          [<000000004f14f473>] ksys_write+0xb4/0x150
          [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
          [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      unreferenced object 0xffff888060c22d00 (size 64):
        comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 00 e8 d7 41 80 88 ff ff  ...........A....
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000b8c1b109>] process_preds+0x243/0x1820
          [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
          [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
          [<0000000072da2bee>] vfs_write+0xe1/0x240
          [<000000004f14f473>] ksys_write+0xb4/0x150
          [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
          [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      unreferenced object 0xffff888041d7e800 (size 512):
        comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
        hex dump (first 32 bytes):
          70 bc 85 97 ff ff ff ff 0a 00 00 00 00 00 00 00  p...............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000001e04af34>] process_preds+0x71a/0x1820
          [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
          [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
          [<0000000072da2bee>] vfs_write+0xe1/0x240
          [<000000004f14f473>] ksys_write+0xb4/0x150
          [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
          [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
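
      As a hedged sketch (surrounding context approximated), the one-line
      change on the error path looks like:

       fail_mem:
      -	kfree(filter);
      +	__free_filter(filter);	/* also frees filter_string and prog */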
      
      Link: http://lkml.kernel.org/r/20191211091258.11310-1-keitasuzuki.park@sslab.ics.keio.ac.jp
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 404a3add ("tracing: Only add filter list when needed")
      Signed-off-by: NKeita Suzuki <keitasuzuki.park@sslab.ics.keio.ac.jp>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5497ad6d
    • P
      tracing: Fix lock inversion in trace_event_enable_tgid_record() · bb9364dc
      Prateek Sood 提交于
      commit 3a53acf1d9bea11b57c1f6205e3fe73f9d8a3688 upstream.
      
             Task T2                             Task T3
      trace_options_core_write()            subsystem_open()
      
       mutex_lock(trace_types_lock)           mutex_lock(event_mutex)
      
       set_tracer_flag()
      
         trace_event_enable_tgid_record()       mutex_lock(trace_types_lock)
      
          mutex_lock(event_mutex)
      
      This gives a circular dependency deadlock between trace_types_lock and
      event_mutex. To fix this, invert the usage of trace_types_lock and
      event_mutex in trace_options_core_write(). This keeps the sequence
      of lock usage consistent, as sketched below.
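
      A hedged sketch of the corrected ordering in
      trace_options_core_write() (not the verbatim diff):

      mutex_lock(&event_mutex);
      mutex_lock(&trace_types_lock);

      ret = set_tracer_flag(tr, 1 << index, val);
      /* set_tracer_flag() may call trace_event_enable_tgid_record(),
       * which now runs with event_mutex already held, matching the
       * event_mutex -> trace_types_lock order used by subsystem_open() */

      mutex_unlock(&trace_types_lock);
      mutex_unlock(&event_mutex);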
      
      Link: http://lkml.kernel.org/r/0101016eef175e38-8ca71caf-a4eb-480d-a1e6-6f0bbc015495-000000@us-west-2.amazonses.com
      
      Cc: stable@vger.kernel.org
      Fixes: d914ba37 ("tracing: Add support for recording tgid of tasks")
      Signed-off-by: NPrateek Sood <prsood@codeaurora.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bb9364dc
    • S
      memcg: account security cred as well to kmemcg · 02d09444
      Shakeel Butt 提交于
      commit 84029fd04c201a4c7e0b07ba262664900f47c6f5 upstream.
      
      The cred_jar kmem_cache is already memcg accounted in the current kernel
      but cred->security is not.  Account cred->security to kmemcg.
      
      Recently we saw high root slab usage on our production systems and,
      on further inspection, found a buggy application leaking processes.
      Though that buggy application was contained within its memcg, we
      observed a couple of GiB of additional system memory overhead during
      that period. This overhead can adversely impact the isolation on the
      system.
      
      One source of high overhead we found was cred->security objects, which
      have a lifetime of at least the life of the process which allocated
      them.
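
      A hedged sketch of the change (upstream it is a one-liner in the LSM
      cred blob allocation in security/security.c; the backported context
      may differ): OR-ing __GFP_ACCOUNT into the allocation charges the
      cred->security blob to the allocating task's memcg.

      -	cred->security = kzalloc(blob_sizes.lbs_cred, gfp);
      +	cred->security = kzalloc(blob_sizes.lbs_cred,
      +				 gfp | __GFP_ACCOUNT);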
      
      Link: http://lkml.kernel.org/r/20191205223721.40034-1-shakeelb@google.com
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NChris Down <chris@chrisdown.name>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      02d09444
    • C
      taskstats: fix data-race · 4d5725cc
      Christian Brauner 提交于
      [ Upstream commit 0b8d616fb5a8ffa307b1d3af37f55c15dae14f28 ]
      
      When assigning and testing taskstats in taskstats_exit() there is a
      race between setting up and reading sig->stats when a thread group
      with more than one thread exits:
      
      write to 0xffff8881157bbe10 of 8 bytes by task 7951 on cpu 0:
       taskstats_tgid_alloc kernel/taskstats.c:567 [inline]
       taskstats_exit+0x6b7/0x717 kernel/taskstats.c:596
       do_exit+0x2c2/0x18e0 kernel/exit.c:864
       do_group_exit+0xb4/0x1c0 kernel/exit.c:983
       get_signal+0x2a2/0x1320 kernel/signal.c:2734
       do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
       exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
       prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
       do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      read to 0xffff8881157bbe10 of 8 bytes by task 7949 on cpu 1:
       taskstats_tgid_alloc kernel/taskstats.c:559 [inline]
       taskstats_exit+0xb2/0x717 kernel/taskstats.c:596
       do_exit+0x2c2/0x18e0 kernel/exit.c:864
       do_group_exit+0xb4/0x1c0 kernel/exit.c:983
       __do_sys_exit_group kernel/exit.c:994 [inline]
       __se_sys_exit_group kernel/exit.c:992 [inline]
       __x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by using smp_load_acquire() and smp_store_release().
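
      A condensed, hedged sketch of the acquire/release pairing (see
      kernel/taskstats.c for the real code; error handling abridged):

      static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
      {
              struct signal_struct *sig = tsk->signal;
              struct taskstats *stats_new, *stats;

              /* Pairs with the smp_store_release() below. */
              stats = smp_load_acquire(&sig->stats);
              if (stats || thread_group_empty(tsk))
                      return stats;

              stats_new = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
              if (!stats_new)
                      return NULL;

              spin_lock_irq(&tsk->sighand->siglock);
              stats = sig->stats;
              if (!stats) {
                      /* Publish only after the zeroed init is visible. */
                      smp_store_release(&sig->stats, stats_new);
                      stats = stats_new;
                      stats_new = NULL;
              }
              spin_unlock_irq(&tsk->sighand->siglock);

              if (stats_new)
                      kmem_cache_free(taskstats_cache, stats_new);

              return stats;
      }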
      
      Reported-by: syzbot+c5d03165a1bd1dead0c1@syzkaller.appspotmail.com
      Fixes: 34ec1234 ("taskstats: cleanup ->signal->stats allocation")
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Acked-by: NMarco Elver <elver@google.com>
      Reviewed-by: NWill Deacon <will@kernel.org>
      Reviewed-by: NAndrea Parri <parri.andrea@gmail.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Link: https://lore.kernel.org/r/20191009114809.8643-1-christian.brauner@ubuntu.com
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4d5725cc
    • A
      PM / hibernate: memory_bm_find_bit(): Tighten node optimisation · 7fd04a9f
      Andy Whitcroft 提交于
      [ Upstream commit da6043fe85eb5ec621e34a92540735dcebbea134 ]
      
      When looking for a bit by number we make use of the cached result from
      the preceding lookup to speed up operation.  First we check whether the
      requested pfn is within the cached zone and, if not, look up the new
      zone.  We then check whether the offset for that pfn falls within the
      existing cached node.  This check happens regardless of whether the
      node is within the zone we are now scanning, and with certain memory
      layouts it can falsely trigger, creating a temporary alias from the
      pfn to a different bit.  This leads the hibernation code to free
      memory which it never allocated, with the expected fallout.
      
      Ensure the zone we are scanning matches the cached zone before considering
      the cached node.
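
      A hedged sketch of the tightened check in memory_bm_find_bit()
      (kernel/power/snapshot.c; context approximated):

      -	if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn)
      +	if (zone == bm->cur.zone &&
      +	    ((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn)
      		goto node_found;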
      
      Deep thanks go to Andrea for many, many, many hours of hacking and testing
      that went into cornering this bug.
      Reported-by: NAndrea Righi <andrea.righi@canonical.com>
      Tested-by: NAndrea Righi <andrea.righi@canonical.com>
      Signed-off-by: NAndy Whitcroft <apw@canonical.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      7fd04a9f
    • V
      ptp: fix the race between the release of ptp_clock and cdev · a0582013
      Vladis Dronov 提交于
      [ Upstream commit a33121e5487b424339636b25c35d3a180eaa5f5e ]
      
      In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
      device is removed, closing this file leads to a race. This reproduces
      easily in a kvm virtual machine:
      
      ts# cat openptp0.c
      int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
      ts# uname -r
      5.5.0-rc3-46cf053e
      ts# cat /proc/cmdline
      ... slub_debug=FZP
      ts# modprobe ptp_kvm
      ts# ./openptp0 &
      [1] 670
      opened /dev/ptp0, sleeping 10s...
      ts# rmmod ptp_kvm
      ts# ls /dev/ptp*
      ls: cannot access '/dev/ptp*': No such file or directory
      ts# ...woken up
      [   48.010809] general protection fault: 0000 [#1] SMP
      [   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
      [   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
      [   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
      [   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
      [   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
      [   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
      [   48.019470] ...                                              ^^^ a slub poison
      [   48.023854] Call Trace:
      [   48.024050]  __fput+0x21f/0x240
      [   48.024288]  task_work_run+0x79/0x90
      [   48.024555]  do_exit+0x2af/0xab0
      [   48.024799]  ? vfs_write+0x16a/0x190
      [   48.025082]  do_group_exit+0x35/0x90
      [   48.025387]  __x64_sys_exit_group+0xf/0x10
      [   48.025737]  do_syscall_64+0x3d/0x130
      [   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   48.026479] RIP: 0033:0x7f53b12082f6
      [   48.026792] ...
      [   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
      [   48.045001] Fixing recursive fault but reboot is needed!
      
      This happens in:
      
      static void __fput(struct file *file)
      {   ...
          if (file->f_op->release)
              file->f_op->release(inode, file); <<< cdev is kfree'd here
          if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
                   !(mode & FMODE_PATH))) {
              cdev_put(inode->i_cdev); <<< cdev fields are accessed here
      
      Namely:
      
      __fput()
        posix_clock_release()
          kref_put(&clk->kref, delete_clock) <<< the last reference
            delete_clock()
              delete_ptp_clock()
                kfree(ptp) <<< cdev is embedded in ptp
        cdev_put
          module_put(p->owner) <<< *p is kfree'd, bang!
      
      Here cdev is embedded in posix_clock which is embedded in ptp_clock.
      The race happens because ptp_clock's lifetime is controlled by two
      refcounts: kref and cdev.kobj in posix_clock. This is wrong.
      
      Make ptp_clock's sysfs device a parent of the cdev with
      cdev_device_add(), which was created especially for such cases. This
      way the parent device with its ptp_clock is not released until all
      references to the cdev are released. This adds a requirement that a
      caller provide an initialized but not yet exposed struct device to
      posix_clock_register() instead of a simple dev_t.
      
      This approach was adopted from the commit 72139dfa2464 ("watchdog: Fix
      the race between the release of watchdog_core_data and cdev"). See
      details of the implementation in the commit 233ed09d ("chardev: add
      helper function to register char devs with a struct device").
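
      A hedged sketch of the resulting registration path (details
      approximated; see kernel/time/posix-clock.c):

      int posix_clock_register(struct posix_clock *clk, struct device *dev)
      {
              int err;

              init_rwsem(&clk->rwsem);

              cdev_init(&clk->cdev, &posix_clock_file_operations);
              clk->cdev.owner = clk->ops.owner;
              /* tie the cdev's lifetime to the parent device that embeds
               * (and eventually releases) the ptp_clock */
              err = cdev_device_add(&clk->cdev, dev);

              return err;
      }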
      
      Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
      Analyzed-by: NStephen Johnston <sjohnsto@redhat.com>
      Analyzed-by: NVern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: NVladis Dronov <vdronov@redhat.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a0582013
    • E
      hrtimer: Annotate lockless access to timer->state · 286a4f47
      Eric Dumazet 提交于
      commit 56144737e67329c9aaed15f942d46a6302e2e3d8 upstream.
      
      syzbot reported various data-races caused by hrtimer_is_queued()
      reading timer->state. A READ_ONCE() is required there to silence the
      warning.
      
      Also add the corresponding WRITE_ONCE() when timer->state is set.
      
      In remove_hrtimer() the hrtimer_is_queued() helper is open coded to avoid
      loading timer->state twice.
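
      A hedged sketch of the annotation pattern (the actual patch touches
      several sites in kernel/time/hrtimer.c):

      static inline bool hrtimer_is_queued(struct hrtimer *timer)
      {
              /* Pairs with the WRITE_ONCE() updates of timer->state and
               * marks the lockless read as intentional for KCSAN. */
              return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
      }

      /* writer side, under the base lock: */
      WRITE_ONCE(timer->state, newstate);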
      
      KCSAN reported these cases:
      
      BUG: KCSAN: data-race in __remove_hrtimer / tcp_pacing_check
      
      write to 0xffff8880b2a7d388 of 1 bytes by interrupt on cpu 0:
       __remove_hrtimer+0x52/0x130 kernel/time/hrtimer.c:991
       __run_hrtimer kernel/time/hrtimer.c:1496 [inline]
       __hrtimer_run_queues+0x250/0x600 kernel/time/hrtimer.c:1576
       hrtimer_run_softirq+0x10e/0x150 kernel/time/hrtimer.c:1593
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      read to 0xffff8880b2a7d388 of 1 bytes by task 24652 on cpu 1:
       tcp_pacing_check net/ipv4/tcp_output.c:2235 [inline]
       tcp_pacing_check+0xba/0x130 net/ipv4/tcp_output.c:2225
       tcp_xmit_retransmit_queue+0x32c/0x5a0 net/ipv4/tcp_output.c:3044
       tcp_xmit_recovery+0x7c/0x120 net/ipv4/tcp_input.c:3558
       tcp_ack+0x17b6/0x3170 net/ipv4/tcp_input.c:3717
       tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
       sk_backlog_rcv include/net/sock.h:945 [inline]
       __release_sock+0x135/0x1e0 net/core/sock.c:2435
       release_sock+0x61/0x160 net/core/sock.c:2951
       sk_stream_wait_memory+0x3d7/0x7c0 net/core/stream.c:145
       tcp_sendmsg_locked+0xb47/0x1f30 net/ipv4/tcp.c:1393
       tcp_sendmsg+0x39/0x60 net/ipv4/tcp.c:1434
       inet_sendmsg+0x6d/0x90 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
      
      BUG: KCSAN: data-race in __remove_hrtimer / __tcp_ack_snd_check
      
      write to 0xffff8880a3a65588 of 1 bytes by interrupt on cpu 0:
       __remove_hrtimer+0x52/0x130 kernel/time/hrtimer.c:991
       __run_hrtimer kernel/time/hrtimer.c:1496 [inline]
       __hrtimer_run_queues+0x250/0x600 kernel/time/hrtimer.c:1576
       hrtimer_run_softirq+0x10e/0x150 kernel/time/hrtimer.c:1593
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0xbb/0xe0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
      
      read to 0xffff8880a3a65588 of 1 bytes by task 22891 on cpu 1:
       __tcp_ack_snd_check+0x415/0x4f0 net/ipv4/tcp_input.c:5265
       tcp_ack_snd_check net/ipv4/tcp_input.c:5287 [inline]
       tcp_rcv_established+0x750/0xf50 net/ipv4/tcp_input.c:5708
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
       sk_backlog_rcv include/net/sock.h:945 [inline]
       __release_sock+0x135/0x1e0 net/core/sock.c:2435
       release_sock+0x61/0x160 net/core/sock.c:2951
       sk_stream_wait_memory+0x3d7/0x7c0 net/core/stream.c:145
       tcp_sendmsg_locked+0xb47/0x1f30 net/ipv4/tcp.c:1393
       tcp_sendmsg+0x39/0x60 net/ipv4/tcp.c:1434
       inet_sendmsg+0x6d/0x90 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       __sys_sendto+0x21f/0x320 net/socket.c:1952
       __do_sys_sendto net/socket.c:1964 [inline]
       __se_sys_sendto net/socket.c:1960 [inline]
       __x64_sys_sendto+0x89/0xb0 net/socket.c:1960
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24652 Comm: syz-executor.3 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      [ tglx: Added comments ]
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191106174804.74723-1-edumazet@google.com
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      286a4f47
    • J
      kernel: sysctl: make drop_caches write-only · 0db1f26c
      Johannes Weiner 提交于
      [ Upstream commit 204cb79ad42f015312a5bbd7012d09c93d9b46fb ]
      
      Currently, the drop_caches proc file and sysctl read back the last value
      written, suggesting this is somehow a stateful setting instead of a
      one-time command.  Make it write-only, like e.g.  compact_memory.
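
      A hedged sketch of the sysctl table change in kernel/sysctl.c
      (neighbouring field values approximated for this kernel series):

      {
      	.procname	= "drop_caches",
      	.data		= &sysctl_drop_caches,
      	.maxlen		= sizeof(int),
      -	.mode		= 0644,
      +	.mode		= 0200,		/* write-only */
      	.proc_handler	= drop_caches_sysctl_handler,
      	.extra1		= &one,
      	.extra2		= &four,
      },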
      
      While mitigating a VM problem at scale in our fleet, there was confusion
      about whether writing to this file will permanently switch the kernel into
      a non-caching mode.  This influences the decision making in a tense
      situation, where tens of people are trying to fix tens of thousands of
      affected machines: Do we need a rollback strategy?  What are the
      performance implications of operating in a non-caching state for several
      days?  It also caused confusion when the kernel team said we may need to
      write the file several times to make sure it's effective ("But it already
      reads back 3?").
      
      Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NChris Down <chris@chrisdown.name>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      0db1f26c
    • E
      dma-debug: add a schedule point in debug_dma_dump_mappings() · bf055536
      Eric Dumazet 提交于
      [ Upstream commit 9ff6aa027dbb98755f0265695354f2dd07c0d1ce ]
      
      debug_dma_dump_mappings() can take a lot of CPU cycles:
      
      lpk43:/# time wc -l /sys/kernel/debug/dma-api/dump
      163435 /sys/kernel/debug/dma-api/dump
      
      real	0m0.463s
      user	0m0.003s
      sys	0m0.459s
      
      Let's add a cond_resched() to avoid holding the CPU for too long.
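
      A hedged sketch of the loop with the added scheduling point
      (kernel/dma/debug.c; the dev_info() details are elided):

      for (idx = 0; idx < HASH_SIZE; idx++) {
              struct hash_bucket *bucket = &dma_entry_hash[idx];
              struct dma_debug_entry *entry;
              unsigned long flags;

              spin_lock_irqsave(&bucket->lock, flags);
              list_for_each_entry(entry, &bucket->list, list) {
                      if (!dev || dev == entry->dev)
                              dev_info(entry->dev, "...");  /* dump one entry */
              }
              spin_unlock_irqrestore(&bucket->lock, flags);

              cond_resched();         /* the added scheduling point */
      }
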
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Corentin Labbe <clabbe@baylibre.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bf055536
    • R
      cpufreq: Avoid leaving stale IRQ work items during CPU offline · fa6de88d
      Rafael J. Wysocki 提交于
      commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.
      
      The scheduler code calling cpufreq_update_util() may run during CPU
      offline on the target CPU after the IRQ work lists have been flushed
      for it, so the target CPU should be prevented from running code that
      may queue up an IRQ work item on it at that point.
      
      Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
      is set for at least one cpufreq policy in the system, because that
      allows the CPU going offline to run the utilization update callback
      of the cpufreq governor on behalf of another (online) CPU in some
      cases.
      
      If that happens, the cpufreq governor callback may queue up an IRQ
      work on the CPU running it, which is going offline, and the IRQ work
      may not be flushed after that point.  Moreover, that IRQ work cannot
      be flushed until the "offlining" CPU goes back online, so if any
      other CPU calls irq_work_sync() to wait for the completion of that
      IRQ work, it will have to wait until the "offlining" CPU is back
      online, and that may never happen.  In particular, a system-wide
      deadlock may occur during CPU online as a result of that.
      
      The failing scenario is as follows.  CPU0 is the boot CPU, so it
      creates a cpufreq policy and becomes the "leader" of it
      (policy->cpu).  It cannot go offline, because it is the boot CPU.
      Next, other CPUs join the cpufreq policy as they go online and they
      leave it when they go offline.  The last CPU to go offline, say CPU3,
      may queue up an IRQ work while running the governor callback on
      behalf of CPU0 after leaving the cpufreq policy because of the
      dvfs_possible_from_any_cpu effect described above.  Then, CPU0 is
      the only online CPU in the system and the stale IRQ work is still
      queued on CPU3.  When, say, CPU1 goes back online, it will run
      irq_work_sync() to wait for that IRQ work to complete and so it
      will wait for CPU3 to go back online (which may never happen even
      in principle), but (worse yet) CPU0 is waiting for CPU1 at that
      point too and a system-wide deadlock occurs.
      
      To address this problem, notice that CPUs which cannot run cpufreq
      utilization update code for themselves (for example, because they
      have left the cpufreq policies that they belonged to) should also be
      prevented from running that code on behalf of the other CPUs that
      belong to a cpufreq policy with dvfs_possible_from_any_cpu set. In
      that case the cpufreq_update_util_data pointer of the CPU running
      the code must be non-NULL, just like that of the CPU which is the
      target of the cpufreq utilization update in progress.
      
      Accordingly, change cpufreq_this_cpu_can_update() into a regular
      function in kernel/sched/cpufreq.c (instead of a static inline in a
      header file) and make it check the cpufreq_update_util_data pointer
      of the local CPU if dvfs_possible_from_any_cpu is set for the target
      cpufreq policy.
      
      Also update the schedutil governor to do the
      cpufreq_this_cpu_can_update() check in the non-fast-switch
      case too to avoid the stale IRQ work issues.
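
      A hedged sketch of the resulting check in kernel/sched/cpufreq.c:

      bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy)
      {
              return cpumask_test_cpu(smp_processor_id(), policy->cpus) ||
                      (policy->dvfs_possible_from_any_cpu &&
                       rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)));
      }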
      
      Fixes: 99d14d0e ("cpufreq: Process remote callbacks from any CPU if the platform permits")
      Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
      Reported-by: NAnson Huang <anson.huang@nxp.com>
      Tested-by: NAnson Huang <anson.huang@nxp.com>
      Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      fa6de88d
    • M
      tracing/kprobe: Check whether the non-suffixed symbol is notrace · 5464389e
      Masami Hiramatsu 提交于
      [ Upstream commit c7411a1a126f649be71526a36d4afac9e5aefa13 ]
      
      Check whether the non-suffixed symbol is notrace, since suffixed
      symbols are generated by the compilers for optimization. The notrace
      check might not work on these suffixed symbols because some of them
      are just partial code of the original function (e.g. cold-cache
      (unlikely) code is split out of the original function as
      FUNCTION.cold.XX).
      
      For example, without this fix,
        # echo p device_add.cold.67 > /sys/kernel/debug/tracing/kprobe_events
        sh: write error: Invalid argument
      
        # cat /sys/kernel/debug/tracing/error_log
        [  135.491035] trace_kprobe: error: Failed to register probe event
          Command: p device_add.cold.67
                     ^
        # dmesg | tail -n 1
        [  135.488599] trace_kprobe: Could not probe notrace function device_add.cold.67
      
      With this,
        # echo p device_add.cold.66 > /sys/kernel/debug/tracing/kprobe_events
        # cat /sys/kernel/debug/kprobes/list
        ffffffff81599de9  k  device_add.cold.66+0x0    [DISABLED]
      
      Actually, the kprobe blacklist already does a similar thing; see
      within_kprobe_blacklist(). A sketch of the check follows.
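
      A hedged sketch of the suffix handling in within_notrace_func()
      (kernel/trace/trace_kprobe.c; details approximated):

      static bool within_notrace_func(struct trace_kprobe *tk)
      {
              unsigned long addr = trace_kprobe_address(tk);
              char symname[KSYM_NAME_LEN], *p;

              if (!__within_notrace_func(addr))
                      return false;

              /* check if the address is on a compiler-suffixed symbol */
              if (!lookup_symbol_name(addr, symname)) {
                      p = strchr(symname, '.');
                      if (!p)
                              return true;
                      *p = '\0';      /* retry with the non-suffixed symbol */
                      addr = (unsigned long)kprobe_lookup_name(symname, 0);
                      if (addr)
                              return __within_notrace_func(addr);
              }

              return true;
      }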
      
      Link: http://lkml.kernel.org/r/157233790394.6706.18243942030937189679.stgit@devnote2
      
      Fixes: 45408c4f ("tracing: kprobes: Prohibit probing on notrace function")
      Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5464389e