1. 27 December 2019, 40 commits
• bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id() · eda04c10
  Peng Sun authored
      mainline inclusion
      from mainline-5.0
      commit 781e62823cb81b972dc8652c1827205cda2ac9ac
      category: bugfix
      bugzilla: 11101
      CVE: NA
      
      -------------------------------------------------
In bpf/syscall.c, bpf_map_get_fd_by_id() uses bpf_map_inc_not_zero()
to increase the refcount, bumping both map->refcnt and map->usercnt.
Consequently, if bpf_map_new_fd() fails, map->usercnt must be dropped
as well.
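A minimal sketch of the corrected error path (hedged; bpf_map_put_with_uref()
is the existing helper that drops both refcnt and usercnt):

    fd = bpf_map_new_fd(map, f_flags);
    if (fd < 0)
        /* previously bpf_map_put(map), which left usercnt elevated */
        bpf_map_put_with_uref(map);

    return fd;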
      
      Fixes: bd5f5f4e ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Signed-off-by: Peng Sun <sironhide0null@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
(cherry picked from commit 781e62823cb81b972dc8652c1827205cda2ac9ac)
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eda04c10
• relay: check return of create_buf_file() properly · 84e164fa
  Greg Kroah-Hartman authored
      [ Upstream commit 2c1cf00eeacb784781cf1c9896b8af001246d339 ]
      
      If create_buf_file() returns an error, don't try to reference it later
      as a valid dentry pointer.
      
      This problem was exposed when debugfs started to return errors instead
      of just NULL for some calls when they do not succeed properly.
      
      Also, the check for WARN_ON(dentry) was just wrong :)
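Sketched against kernel/relay.c, the corrected call site looks roughly like
this (dentry must be treated as a possible ERR_PTR(), not only as
NULL-or-valid):

    dentry = chan->cb->create_buf_file(tmpname, chan->parent,
                                       S_IRUSR, buf,
                                       &chan->is_global);
    if (IS_ERR(dentry))
        goto free_buf;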
Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      84e164fa
• perf core: Fix perf_proc_update_handler() bug · 2813654e
  Stephane Eranian authored
      [ Upstream commit 1a51c5da5acc6c188c917ba572eebac5f8793432 ]
      
The perf_proc_update_handler() handles the
/proc/sys/kernel/perf_event_max_sample_rate sysctl variable. When PMU
IRQ handler timing monitoring is disabled, i.e., when
/proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
no modification of sysctl_perf_event_sample_rate is allowed, to prevent
possible hangs from wrong values.
      
      The problem is that the test to prevent modification is made after the
      sysctl variable is modified in perf_proc_update_handler().
      
      You get an error:
      
        $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
        echo: write error: invalid argument
      
      But the value is still modified causing all sorts of inconsistencies:
      
        $ cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
      
      This patch fixes the problem by moving the parsing of the value after
      the test.
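The reordered handler, close to the upstream fix (the throttling check now
runs before proc_dointvec_minmax() can touch the sysctl variable):

    int perf_proc_update_handler(struct ctl_table *table, int write,
                                 void __user *buffer, size_t *lenp, loff_t *ppos)
    {
        int ret;
        int perf_cpu = sysctl_perf_cpu_time_max_percent;
        /*
         * If throttling is disabled don't allow the write:
         */
        if (write && (perf_cpu == 100 || perf_cpu == 0))
            return -EINVAL;

        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        if (ret || !write)
            return ret;

        max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ);
        perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
        update_perf_cpu_limits();

        return 0;
    }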
      
      Committer testing:
      
        # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
        # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
        -bash: echo: write error: Invalid argument
        # cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
        #
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2813654e
• bpf: fix sanitation rewrite in case of non-pointers · 223cd1fc
  Daniel Borkmann authored
      commit 3612af783cf52c74a031a2f11b82247b2599d3cd upstream.
      
      Marek reported that he saw an issue with the below snippet in that
timing measurements were off when loaded as unpriv while results
      were reasonable when loaded as privileged:
      
          [...]
          uint64_t a = bpf_ktime_get_ns();
          uint64_t b = bpf_ktime_get_ns();
          uint64_t delta = b - a;
          if ((int64_t)delta > 0) {
          [...]
      
Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"): fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER, since there too the masking rewrite
must be skipped (there is nothing to mask).
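A sketch of the added corner case in fixup_bpf_calls() (hedged; the
surrounding rewrite logic is unchanged):

    struct bpf_insn_aux_data *aux = &env->insn_aux_data[i + delta];

    /* Skip the masking rewrite for non-pointer ALU as well:
     * there is nothing to mask in that case.
     */
    if (!aux->alu_state ||
        aux->alu_state == BPF_ALU_NON_POINTER)
        continue;
    /* masking rewrite proceeds as before */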
      
      Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      223cd1fc
• tracing: Fix event filters and triggers to handle negative numbers · 96d5943c
  Pavel Tikhomirov authored
      commit 6a072128d262d2b98d31626906a96700d1fc11eb upstream.
      
When tracing the syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to the required errors.
But negative numbers do not work:
      
      [root@snorch sys_exit_read]# echo "ret == -1" > filter
      bash: echo: write error: Invalid argument
      [root@snorch sys_exit_read]# cat filter
      ret == -1
              ^
      parse_error: Invalid value (did you forget quotes)?
      
      Similar thing happens when setting triggers.
      
This is a regression in v4.17 introduced by the commit mentioned below;
testing without that commit shows no problem with negative numbers.
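A hedged sketch of the kind of change needed in the predicate parser
(helper name hypothetical; the point is to accept an optional leading '-'
wherever a numeric literal is recognized):

    /* hypothetical helper: does the token at position i start a number? */
    static bool is_number_start(const char *s, int i)
    {
        return isdigit(s[i]) || (s[i] == '-' && isdigit(s[i + 1]));
    }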
      
      Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com
      
      Cc: stable@vger.kernel.org
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      96d5943c
• perf: Paper over the hw.target problems · 03804742
  Alexander Shishkin authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006/11050
      CVE: NA
      --------------------------------------------------
      
[ Cheng Jian:
HULK-Syzkaller reported a problem which had already been reported to
mainline (lkml) by syzbot; this patch comes from the reply on lkml.
v1	https://lkml.org/lkml/2019/2/28/529
v2	https://lkml.org/lkml/2019/3/8/206
We merged v1 first, but it caused bugzilla #11050, because we also use
perf_remove_from_context() in perf_event_open() when we move events
from a SW context to a HW context, so we cannot destroy the event
there. v2 does not exhibit that warning.
It is the same as another patch at https://lkml.org/lkml/2019/3/8/536,
but clearer. ]
      
      First, we have a race between perf_event_release_kernel() and
      perf_free_event(), which happens when parent's event is released while the
      child's fork fails (because of a fatal signal, for example), that looks
      like this:
      
      cpu X                            cpu Y
      -----                            -----
                                       copy_process() error path
      perf_release(parent)             +->perf_event_free_task()
      +-> lock(child_ctx->mutex)       |  |
      +-> remove_from_context(child)   |  |
      +-> unlock(child_ctx->mutex)     |  |
      |                                |  +-> lock(child_ctx->mutex)
      |                                |  +-> unlock(child_ctx->mutex)
      |                                +-> free_task(child_task)
      +-> put_task_struct(child_task)
      
      Technically, we're still holding a reference to the task via
      parent->hw.target, that's not stopping free_task(), so we end up poking at
      free'd memory, as is pointed out by KASAN in the syzkaller report (see Link
      below). The straightforward fix is to drop the hw.target reference while
      the task is still around.
      
      Therein lies the second problem: the users of hw.target (uprobe) assume
      that it's around at ->destroy() callback time, where they use it for
      context. So, in order to not break the uprobe teardown and avoid leaking
      stuff, we need to call ->destroy() at the same time.
      
      This patch fixes the race and the subsequent fallout by doing both these
      things at remove_from_context time.
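A sketch of the idea (helper name hypothetical; the real patch folds this
into the remove-from-context path):

    static void perf_detach_target(struct perf_event *event)
    {
        if (!event->hw.target)
            return;

        /* uprobes use hw.target in ->destroy(), so run it now */
        if (event->destroy) {
            event->destroy(event);
            event->destroy = NULL;
        }

        put_task_struct(event->hw.target);
        event->hw.target = NULL;
    }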
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://syzkaller.appspot.com/bug?extid=a24c397a29ad22d86c98
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      03804742
• userfaultfd: use RCU to free the task struct when fork fails if MEMCG · 8eb04a7a
  Andrea Arcangeli authored
      euler inclusion
      category: bugfix
      bugzilla: 10989
      CVE: NA
      
      ------------------------------------------------
      
      MEMCG depends on the task structure not to be freed under
      rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
      mm->owner.
      
A better fix would be to avoid registering forked vmas in userfaultfd
contexts reported to the monitor, in case fork ends up failing.
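A sketch of the fork error path change, mirroring the mainline variant of
this fix (free the task through an RCU grace period only when MEMCG needs
it):

    static void __delayed_free_task(struct rcu_head *rhp)
    {
        struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

        free_task(tsk);
    }

    static __always_inline void delayed_free_task(struct task_struct *tsk)
    {
        if (IS_ENABLED(CONFIG_MEMCG))
            call_rcu(&tsk->rcu, __delayed_free_task);
        else
            free_task(tsk);
    }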
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8eb04a7a
• Revert "perf: Paper over the hw.target problems" · db93c085
  Cheng Jian authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006
      CVE: NA
      --------------------------------------------------
      
      This reverts commit b772baf9a14ab4975e8884a399a4e0bab2fb6bf9.
      
We merged the patch b772baf9a14a ("perf: Paper over the
hw.target problems") to resolve a use-after-free issue
(bugzilla #9513/#11006), but it causes new problems
(bugzilla #11050/#11049) in this version.

So just revert it.
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      db93c085
• posix-cpu-timers: Avoid undefined behaviour in timespec64_to_ns() · 2a3c5819
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 10876
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in ./include/linux/time64.h:120:27
      signed integer overflow:
      8243129037239968815 * 1000000000 cannot be represented in type 'long long int'
      CPU: 5 PID: 28854 Comm: syz-executor.1 Not tainted 4.19.24 #4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       timespec64_to_ns include/linux/time64.h:120 [inline]
       posix_cpu_timer_set+0x95a/0xb70 kernel/time/posix-cpu-timers.c:687
       do_timer_settime+0x198/0x2a0 kernel/time/posix-timers.c:892
       __do_sys_timer_settime kernel/time/posix-timers.c:918 [inline]
       __se_sys_timer_settime kernel/time/posix-timers.c:904 [inline]
       __x64_sys_timer_settime+0x18d/0x260 kernel/time/posix-timers.c:904
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462eb9
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f14e4127c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
      RAX: ffffffffffffffda RBX: 000000000073bfa0 RCX: 0000000000462eb9
      RDX: 0000000020000080 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f14e41286bc
      R13: 00000000004c54cc R14: 0000000000704278 R15: 00000000ffffffff
      ================================================================================
      
      It is because 'it_interval.tv_sec' is larger than 'KTIME_SEC_MAX' and
      'it_interval.tv_sec * NSEC_PER_SEC' overflows in 'timespec64_to_ns()'.
      
This patch checks whether 'it_interval.tv_sec' is larger than
'KTIME_SEC_MAX' and saturates the result if that is the case.
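One way to saturate, mirroring the later mainline fix in
include/linux/time64.h:

    static inline s64 timespec64_to_ns(const struct timespec64 *ts)
    {
        /* Prevent multiplication overflow: saturate at KTIME_MAX */
        if (ts->tv_sec >= KTIME_SEC_MAX)
            return KTIME_MAX;

        return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec;
    }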
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2a3c5819
• ntp: Avoid undefined behaviour in second_overflow() · d8df4fe5
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 11009
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in kernel/time/ntp.c:457:16
      signed integer overflow:
      9223372036854775807 + 500 cannot be represented in type 'long int'
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.25-dirty #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       second_overflow+0x403/0x540 kernel/time/ntp.c:457
       accumulate_nsecs_to_secs kernel/time/timekeeping.c:2002 [inline]
       logarithmic_accumulation kernel/time/timekeeping.c:2046 [inline]
       timekeeping_advance+0x2bb/0xec0 kernel/time/timekeeping.c:2114
       tick_do_update_jiffies64.part.2+0x1a0/0x350 kernel/time/tick-sched.c:97
       tick_do_update_jiffies64 kernel/time/tick-sched.c:1229 [inline]
       tick_nohz_update_jiffies kernel/time/tick-sched.c:499 [inline]
       tick_nohz_irq_enter kernel/time/tick-sched.c:1232 [inline]
       tick_irq_enter+0x1fd/0x240 kernel/time/tick-sched.c:1249
       irq_enter+0xc4/0x100 kernel/softirq.c:353
       entering_irq arch/x86/include/asm/apic.h:517 [inline]
       entering_ack_irq arch/x86/include/asm/apic.h:523 [inline]
       smp_apic_timer_interrupt+0x20/0x480 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
       </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
      Code: 01 f0 0f 82 bc fd ff ff 48 c7 c7 c0 21 b1 83 e8 a1 0a 02 ff e9 ab fd ff ff 4c 89 e7 e8 77 b6 a5 fe e9 6a ff ff ff 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
      RSP: 0018:ffff888106307d20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000007 RBX: dffffc0000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881062e4f1c
      RBP: 0000000000000003 R08: ffffed107c5dc77b R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff848c78a0
      R13: 0000000000000003 R14: 1ffff11020c60fae R15: 0000000000000000
       arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
       default_idle+0x24/0x2b0 arch/x86/kernel/process.c:561
       cpuidle_idle_call kernel/sched/idle.c:153 [inline]
       do_idle+0x2ca/0x420 kernel/sched/idle.c:262
       cpu_startup_entry+0xcb/0xe0 kernel/sched/idle.c:368
       start_secondary+0x421/0x570 arch/x86/kernel/smpboot.c:271
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      ================================================================================
      
It is because time_maxerror is set to 0x7FFFFFFFFFFFFFFF by the user. It
overflows when we add 'MAXFREQ / NSEC_PER_USEC' to it in
'second_overflow()'.

This patch adds a limit check and saturates 'time_maxerror' when the
user sets it.
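A hedged sketch of the clamp at the adjtimex() input point (placement
assumed; NTP_PHASE_LIMIT is the bound ntp.c already uses elsewhere):

    if (txc->modes & ADJ_MAXERROR)
        time_maxerror = clamp_t(long, txc->maxerror, 0, NTP_PHASE_LIMIT);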
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d8df4fe5
• bpf: fix lockdep false positive in stackmap · ec9d64f0
  Alexei Starovoitov authored
      mainline inclusion
      from mainline-5.0-rc8
      commit 3defaf2f15b2
      category: bugfix
      bugzilla: 10760
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   11.211460] ------------[ cut here ]------------
      [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
      [   11.213134] Modules linked in:
      [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
      [   11.223508] Call Trace:
      [   11.223705]  <IRQ>
      [   11.223874]  ? __local_bh_enable+0x7a/0x80
      [   11.224199]  up_read+0x1c/0xa0
      [   11.224446]  do_up_read+0x12/0x20
      [   11.224713]  irq_work_run_list+0x43/0x70
      [   11.225030]  irq_work_run+0x26/0x50
      [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
      [   11.225662]  irq_work_interrupt+0xf/0x20
      
since the rw_semaphore is released in a different task than the task that
locked the sema. That is expected behavior.
Fix the warning with up_read_non_owner() and an rwsem_release() annotation.
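Sketch of the two-sided annotation, close to the upstream fix: the irq_work
side releases the semaphore it does not own, and the NMI-safe side tells
lockdep the lock is handed off:

    static void do_up_read(struct irq_work *entry)
    {
        struct stack_map_irq_work *work;

        work = container_of(entry, struct stack_map_irq_work, irq_work);
        /* the sema was taken in another task context */
        up_read_non_owner(work->sem);
        work->sem = NULL;
    }

    /* at the queueing site: */
    work->sem = &current->mm->mmap_sem;
    irq_work_queue(&work->irq_work);
    /*
     * The irq_work will release the mmap_sem with up_read_non_owner().
     * The rwsem_release() is called here to release the lock from
     * lockdep's perspective.
     */
    rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);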
      
      Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ec9d64f0
• perf: Paper over the hw.target problems · e597bc6a
  Alexander Shishkin authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006
      CVE: NA
      --------------------------------------------------
      
[ Cheng Jian:
HULK-Syzkaller reported a problem which had already been reported
to mainline (lkml) by syzbot; this patch comes from the reply on lkml.
https://lkml.org/lkml/2019/2/28/529 ]
      
      First, we have a race between perf_event_release_kernel() and
      perf_free_event(), which happens when parent's event is released while the
      child's fork fails (because of a fatal signal, for example), that looks
      like this:
      
      cpu X                            cpu Y
      -----                            -----
                                       copy_process() error path
      perf_release(parent)             +->perf_event_free_task()
      +-> lock(child_ctx->mutex)       |  |
      +-> remove_from_context(child)   |  |
      +-> unlock(child_ctx->mutex)     |  |
      |                                |  +-> lock(child_ctx->mutex)
      |                                |  +-> unlock(child_ctx->mutex)
      |                                +-> free_task(child_task)
      +-> put_task_struct(child_task)
      
      Technically, we're still holding a reference to the task via
      parent->hw.target, that's not stopping free_task(), so we end up poking at
      free'd memory, as is pointed out by KASAN in the syzkaller report (see Link
      below). The straightforward fix is to drop the hw.target reference while
      the task is still around.
      
      Therein lies the second problem: the users of hw.target (uprobe) assume
      that it's around at ->destroy() callback time, where they use it for
      context. So, in order to not break the uprobe teardown and avoid leaking
      stuff, we need to call ->destroy() at the same time.
      
      This patch fixes the race and the subsequent fallout by doing both these
      things at remove_from_context time.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://syzkaller.appspot.com/bug?extid=a24c397a29ad22d86c98
Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e597bc6a
• sched/wake_q: Fix wakeup ordering for wake_q · 5f8c56f8
  Peter Zijlstra authored
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
Notably, cmpxchg() does not provide ordering when it fails; however,
wake_q_add() requires ordering in this specific case too. Without it,
it would be possible for the concurrent wakeup to not observe our
prior state. (See the sketch after the litmus test below.)
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
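The fix itself is small; sketched here to match the litmus test above:

    void wake_q_add(struct wake_q_head *head, struct task_struct *task)
    {
        struct wake_q_node *node = &task->wake_q;

        /*
         * A failed cmpxchg() provides no ordering, so pair the
         * prior state writes with an explicit full barrier.
         */
        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
            return;

        get_task_struct(task);

        /* the head is empty, or the tail points to us */
        *head->lastp = node;
        head->lastp = &node->next;
    }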
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5f8c56f8
• genirq: Make sure the initial affinity is not empty · edbb30e3
  Srinivas Ramana authored
      [ Upstream commit bddda606ec76550dd63592e32a6e87e7d32583f7 ]
      
      If all CPUs in the irq_default_affinity mask are offline when an interrupt
      is initialized then irq_setup_affinity() can set an empty affinity mask for
      a newly allocated interrupt.
      
      Fix this by falling back to cpu_online_mask in case the resulting affinity
      mask is zero.
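The fallback, sketched against irq_setup_affinity():

    cpumask_and(&mask, cpu_online_mask, set);
    if (cpumask_empty(&mask))
        cpumask_copy(&mask, cpu_online_mask);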
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      edbb30e3
• genirq/matrix: Improve target CPU selection for managed interrupts. · 55662b68
  Long Li authored
      [ Upstream commit e8da8794a7fd9eef1ec9a07f0d4897c68581c72b ]
      
      On large systems with multiple devices of the same class (e.g. NVMe disks,
      using managed interrupts), the kernel can affinitize these interrupts to a
      small subset of CPUs instead of spreading them out evenly.
      
      irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
      of possible target CPUs which has the lowest number of interrupt vectors
      allocated.
      
This is done by searching for the CPU with the highest number of available
vectors. While this is correct for non-managed interrupts, it can select
the wrong CPU for managed interrupts. Under certain constellations this
results in affinitizing the managed interrupts of several devices to a
single CPU in a set.
      
      The book keeping of available vectors works the following way:
      
       1) Non-managed interrupts:
      
          available is decremented when the interrupt is actually requested by
          the device driver and a vector is assigned. It's incremented when the
          interrupt and the vector are freed.
      
       2) Managed interrupts:
      
          Managed interrupts guarantee vector reservation when the MSI/MSI-X
          functionality of a device is enabled, which is achieved by reserving
          vectors in the bitmaps of the possible target CPUs. This reservation
          decrements the available count on each possible target CPU.
      
          When the interrupt is requested by the device driver then a vector is
          allocated from the reserved region. The operation is reversed when the
          interrupt is freed by the device driver. Neither of these operations
          affect the available count.
      
The reservation persists up to the point where the MSI/MSI-X
functionality is disabled, and only this operation increments the
available count again.
      
      For non-managed interrupts the available count is the correct selection
      criterion because the guaranteed reservations need to be taken into
      account. Using the allocated counter could lead to a failing allocation in
      the following situation (total vector space of 10 assumed):
      
      		 CPU0	CPU1
       available:	    2	   0
       allocated:	    5	   3   <--- CPU1 is selected, but available space = 0
       managed reserved:  3	   7
      
       while available yields the correct result.
      
      For managed interrupts the available count is not the appropriate
      selection criterion because as explained above the available count is not
      affected by the actual vector allocation.
      
      The following example illustrates that. Total vector space of 10
      assumed. The starting point is:
      
      		 CPU0	CPU1
       available:	    5	   4
       allocated:	    2	   3
       managed reserved:  3	   3
      
       Allocating vectors for three non-managed interrupts will result in
       affinitizing the first two to CPU0 and the third one to CPU1 because the
       available count is adjusted with each allocation:
      
      		  CPU0	CPU1
       available:	     5	   4	<- Select CPU0 for 1st allocation
       --> allocated:	     3	   3
      
       available:	     4	   4	<- Select CPU0 for 2nd allocation
       --> allocated:	     4	   3
      
       available:	     3	   4	<- Select CPU1 for 3rd allocation
       --> allocated:	     4	   4
      
       But the allocation of three managed interrupts starting from the same
       point will affinitize all of them to CPU0 because the available count is
       not affected by the allocation (see above). So the end result is:
      
      		  CPU0	CPU1
       available:	     5	   4
       allocated:	     5	   3
      
      Introduce a "managed_allocated" field in struct cpumap to track the vector
      allocation for managed interrupts separately. Use this information to
      select the target CPU when a vector is allocated for a managed interrupt,
      which results in more evenly distributed vector assignments. The above
      example results in the following allocations:
      
      		 CPU0	CPU1
       managed_allocated: 0	   0	<- Select CPU0 for 1st allocation
       --> allocated:	    3	   3
      
       managed_allocated: 1	   0	<- Select CPU1 for 2nd allocation
       --> allocated:	    3	   4
      
       managed_allocated: 1	   1	<- Select CPU0 for 3rd allocation
       --> allocated:	    4	   4
      
      The allocation of non-managed interrupts is not affected by this change and
      is still evaluating the available count.
      
      The overall distribution of interrupt vectors for both types of interrupts
      might still not be perfectly even depending on the number of non-managed
      and managed interrupts in a system, but due to the reservation guarantee
      for managed interrupts this cannot be avoided.
      
      Expose the new field in debugfs as well.
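A sketch of the new selection helper (the name and criterion match the
upstream patch; it keys off managed_allocated instead of available):

    static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
                                                     const struct cpumask *msk)
    {
        unsigned int cpu, best_cpu, allocated = UINT_MAX;
        struct cpumap *cm;

        best_cpu = UINT_MAX;
        for_each_cpu(cpu, msk) {
            cm = per_cpu_ptr(m->maps, cpu);

            if (!cm->online || cm->managed_allocated > allocated)
                continue;

            best_cpu = cpu;
            allocated = cm->managed_allocated;
        }
        return best_cpu;
    }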
      
      [ tglx: Clarified the background of the problem in the changelog and
        	described it independent of NVME ]
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      55662b68
• irq/matrix: Spread managed interrupts on allocation · 5fc68885
  Dou Liyang authored
      [ Upstream commit 76f99ae5b54d48430d1f0c5512a84da0ff9761e0 ]
      
Linux spreads out non-managed interrupts across the possible target CPUs
to avoid vector space exhaustion.
      
      Managed interrupts are treated differently, as for them the vectors are
      reserved (with guarantee) when the interrupt descriptors are initialized.
      
      When the interrupt is requested a real vector is assigned. The assignment
      logic uses the first CPU in the affinity mask for assignment. If the
      interrupt has more than one CPU in the affinity mask, which happens when a
      multi queue device has less queues than CPUs, then doing the same search as
      for non managed interrupts makes sense as it puts the interrupt on the
      least interrupt plagued CPU. For single CPU affine vectors that's obviously
      a NOOP.
      
Restructure the matrix allocation code so it does the 'best CPU' search,
add a sanity check for an empty affinity mask, and adapt the call site in
the x86 vector management code.
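Sketch of the entry of irq_matrix_alloc_managed() after the restructuring
(a fragment; the per-CPU bit allocation that follows is unchanged):

    if (cpumask_empty(msk))
        return -EINVAL;

    cpu = matrix_find_best_cpu(m, msk);
    if (cpu == UINT_MAX)
        return -ENOSPC;
    /* allocate a vector bit on the selected CPU as before */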
      
      [ tglx: Added the empty mask check to the core and improved change log ]
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5fc68885
• irq/matrix: Split out the CPU selection code into a helper · 916144b4
  Dou Liyang authored
      [ Upstream commit 8ffe4e61c06a48324cfd97f1199bb9838acce2f2 ]
      
      Linux finds the CPU which has the lowest vector allocation count to spread
      out the non managed interrupts across the possible target CPUs, but does
      not do so for managed interrupts.
      
      Split out the CPU selection code into a helper function for reuse. No
      functional change.
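The split-out helper, sketched (the name matches the upstream patch):

    static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                             const struct cpumask *msk)
    {
        unsigned int cpu, best_cpu, maxavl = 0;
        struct cpumap *cm;

        best_cpu = UINT_MAX;
        for_each_cpu(cpu, msk) {
            cm = per_cpu_ptr(m->maps, cpu);

            if (!cm->online || cm->available <= maxavl)
                continue;

            best_cpu = cpu;
            maxavl = cm->available;
        }
        return best_cpu;
    }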
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      916144b4
• timekeeping: Avoid undefined behaviour in 'ktime_get_with_offset()' · 85d6dbdf
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 10683
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in kernel/time/timekeeping.c:801:8
      signed integer overflow:
      500152103386 + 9223372036854775807 cannot be represented in type 'long long int'
      CPU: 6 PID: 13904 Comm: syz-executor.0 Not tainted 4.19.25 #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       ktime_get_with_offset+0x26a/0x2d0 kernel/time/timekeeping.c:801
       common_hrtimer_arm+0x14d/0x220 kernel/time/posix-timers.c:817
       common_timer_set+0x337/0x530 kernel/time/posix-timers.c:863
       do_timer_settime+0x198/0x290 kernel/time/posix-timers.c:892
       __do_sys_timer_settime kernel/time/posix-timers.c:918 [inline]
       __se_sys_timer_settime kernel/time/posix-timers.c:904 [inline]
       __x64_sys_timer_settime+0x18d/0x260 kernel/time/posix-timers.c:904
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462eb9
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f7968072c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9
      RDX: 00000000200000c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f79680736bc
      R13: 00000000004c54cc R14: 0000000000704278 R15: 00000000ffffffff
      ================================================================================
      
It is because the global variable 'offsets' is set to a very large but
still valid value. It overflows when we add 'tk->tkr_mono.base' and
'offsets'.

This patch uses 'ktime_add_safe()' to limit the result to 'KTIME_SEC_MAX'
when the addition overflows.
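For reference, ktime_add_safe() saturates like this (as in
kernel/time/hrtimer.c):

    ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs)
    {
        ktime_t res = ktime_add_unsafe(lhs, rhs);

        /*
         * We use KTIME_SEC_MAX here, the maximum timeout which we can
         * return to user space in a timespec:
         */
        if (res < 0 || res < lhs || res < rhs)
            res = ktime_set(KTIME_SEC_MAX, 0);

        return res;
    }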
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      85d6dbdf
• bpf, lpm: fix lookup bug in map_delete_elem · 5f549338
  Alban Crequy authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
trie_delete_elem() deleted an entry whenever the prefixlen matched, even
if the key itself did not match. This patch adds a check on matchlen.
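The added check, sketched in trie_delete_elem() after walking to the
candidate node:

    if (!node || node->prefixlen != key->prefixlen ||
        node->prefixlen != matchlen ||
        (node->flags & LPM_TREE_NODE_FLAG_IM)) {
        ret = -ENOENT;
        goto out;
    }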
      
      Reproducer:
      
      $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
      $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      key: 10 00 00 00 aa bb cc dd  value: 01
      Found 1 element
      $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
      $ echo $?
      0
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      Found 0 elements
      
      A similar reproducer is added in the selftests.
      
      Without the patch:
      
      $ sudo ./tools/testing/selftests/bpf/test_lpm_map
      test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
      Aborted
      
      With the patch: test_lpm_map runs without errors.
      
      Fixes: e454cf59 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
      Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5f549338
• bpf, lpm: make longest_prefix_match() faster · 7e35327e
  Eric Dumazet authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
      At LPC 2018 in Vancouver, Vlad Dumitrescu mentioned that longest_prefix_match()
      has a high cost [1].
      
      One reason for that cost is a loop handling one byte at a time.
      
      We can handle more bytes at a time, if enough attention is paid
      to endianness.
      
      I was able to remove ~55 % of longest_prefix_match() cpu costs.
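The fast path, sketched from the patch (8 bytes per step on 64-bit with
efficient unaligned access; the remaining bytes still go through the byte
loop):

    u32 limit = min(node->prefixlen, key->prefixlen);
    u32 prefixlen = 0, i = 0;

    if (trie->data_size >= 8) {
        u64 diff = be64_to_cpu(*(__be64 *)node->data ^
                               *(__be64 *)key->data);

        prefixlen = 64 - fls64(diff);
        if (prefixlen >= limit)
            return limit;
        if (diff)
            return prefixlen;
        i = 8;
    }
    /* byte-at-a-time comparison continues from offset i */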
      
[1] https://linuxplumbersconf.org/event/2/contributions/88/attachments/76/87/lpc-bpf-2018-shaping.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Dumitrescu <vladum@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7e35327e
• bpf: zero out build_id for BPF_STACK_BUILD_ID_IP · 1f98ce54
  Stanislav Fomichev authored
      [ Upstream commit 4af396ae4836c4ecab61e975b8e61270c551894d ]
      
When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset(),
make sure that the build_id field is empty. Since we are using a percpu
free list, there is a possibility that we might reuse a previous
bpf_stack_build_id with a non-zero build_id.
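The fix, sketched at the fallback site in stack_map_get_build_id_offset():

    /* fall back to ip and clear any stale build_id left over
     * from a recycled percpu element
     */
    id_offs[i].status = BPF_STACK_BUILD_ID_IP;
    id_offs[i].ip = ips[i];
    memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE);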
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f98ce54
• bpf: don't assume build-id length is always 20 bytes · 49facf53
  Stanislav Fomichev authored
      [ Upstream commit 0b698005a9d11c0e91141ec11a2c4918a129f703 ]
      
      Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
        * 128-bit (uuid)
        * 160-bit (sha1)
        * any length specified in ld --build-id=0xhexstring
      
      To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
      assume that build-id is somewhere in the range of 1 .. 20.
      Set the remaining bytes to zero.
      
      v2:
      * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
        we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
        this 'if' condition
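Sketched from the patch, the note parser now zero-pads short build-ids:

    if (nhdr->n_type == BPF_BUILD_ID &&
        nhdr->n_namesz == sizeof("GNU") &&
        nhdr->n_descsz > 0 &&
        nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
        memcpy(build_id,
               note_start + note_offs +
               ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
               nhdr->n_descsz);
        memset(build_id + nhdr->n_descsz, 0,
               BPF_BUILD_ID_SIZE - nhdr->n_descsz);
        return 0;
    }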
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49facf53
• bpf: fix panic in stack_map_get_build_id() on i386 and arm32 · 6a263990
  Song Liu authored
      [ Upstream commit beaf3d1901f4ea46fbd5c9d857227d99751de469 ]
      
As Naresh reported, test_stacktrace_build_id() causes a panic on i386 and
arm32 systems. This is caused by page_address() returning NULL in certain
cases.

This patch fixes the error by using kmap_atomic()/kunmap_atomic() instead
of page_address().
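The pattern of the fix, sketched:

    void *page_addr;

    /* page_address() may return NULL for highmem pages on 32-bit */
    page_addr = kmap_atomic(page);
    /* parse the ELF build-id note from the mapped page, as before */
    kunmap_atomic(page_addr);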
      
Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6a263990
• tracing: Fix number of entries in trace header · 2cc39ead
  Quentin Perret authored
      commit 9e7382153f80ba45a0bbcd540fb77d4b15f6e966 upstream.
      
      The following commit
      
        441dae8f ("tracing: Add support for display of tgid in trace output")
      
removed the call to print_event_info() from print_func_help_header_irq(),
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the original
behaviour.
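Sketch of the restored call (4.19-era signature):

    static void print_func_help_header_irq(struct trace_buffer *buf,
                                           struct seq_file *m,
                                           unsigned int flags)
    {
        print_event_info(buf, m);   /* restored: reports the entry counts */
        /* the '# irqs-off' column header lines follow unchanged */
    }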
      
Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com
Acked-by: Joel Fernandes <joelaf@google.com>
Cc: stable@vger.kernel.org
Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2cc39ead
• signal: Restore the stop PTRACE_EVENT_EXIT · fd839bdb
  Eric W. Biederman authored
      commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
      
In the middle of do_exit() there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);". That call places the process
in TASK_TRACED, aka "(TASK_WAKEKILL | __TASK_TRACED)", and waits for
the debugger to release the task or for SIGKILL to be delivered.

Skipping past dequeue_signal() when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced()
in ptrace_check_attach() to fail because it left a per-thread
SIGKILL pending, which is what fatal_signal_pending() tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
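The fix, sketched at the fatal-signal shortcut in get_signal() (the two
added lines are the sigdelset() and recalc_sigpending()):

    if (signal_group_exit(signal)) {
        ksig->info.si_signo = signr = SIGKILL;
        sigdelset(&current->pending.signal, SIGKILL);
        recalc_sigpending();
        goto fatal;
    }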
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fd839bdb
• tracing/uprobes: Fix output for multiple string arguments · 382322b4
  Andreas Ziegler authored
      commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.
      
      When printing multiple uprobe arguments as strings the output for the
      earlier arguments would also include all later string arguments.
      
      This is best explained in an example:
      
      Consider adding a uprobe to a function receiving two strings as
      parameters which is at offset 0xa0 in strlib.so and we want to print
      both parameters when the uprobe is hit (on x86_64):
      
      $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
          /sys/kernel/debug/tracing/uprobe_events
      
      When the function is called as func("foo", "bar") and we hit the probe,
      the trace file shows a line like the following:
      
        [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
      
      Note the extra "bar" printed as part of arg1. This behaviour stacks up
      for additional string arguments.
      
      The strings are stored in a dynamically growing part of the uprobe
      buffer by fetch_store_string() after copying them from userspace via
      strncpy_from_user(). The return value of strncpy_from_user() is then
      directly used as the required size for the string. However, this does
      not take the terminating null byte into account as the documentation
for strncpy_from_user() clearly states that it "[...] returns the
      length of the string (not including the trailing NUL)" even though the
      null byte will be copied to the destination.
      
      Therefore, subsequent calls to fetch_store_string() will overwrite
      the terminating null byte of the most recently fetched string with
      the first character of the current string, leading to the
      "accumulation" of strings in earlier arguments in the output.
      
      Fix this by incrementing the return value of strncpy_from_user() by
      one if we did not hit the maximum buffer size.
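The fix, sketched in fetch_store_string() (close to the upstream change):

    ret = strncpy_from_user(dst, src, maxlen);
    if (ret == maxlen)
        dst[--ret] = '\0';
    else if (ret >= 0)
        /*
         * Include the terminating null byte: it was copied by
         * strncpy_from_user() but is not accounted for in ret.
         */
        ret++;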
      
      Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 5baaa59e ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      382322b4
• perf/x86: Add check_period PMU callback · 745de283
  Jiri Olsa authored
      commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
If we create an 'almost' BTS event with precise_ip and callchains,
and then turn it into a BTS event via PERF_EVENT_IOC_PERIOD, it will
crash the perf_prepare_sample() function, because precise_ip events
are expected to come in with callchain data initialized, but that's
not the case for the intel_pmu_drain_bts_buffer() caller.
      
Add a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It denies the change
if the event would become a BTS event, and also applies the
limit_period check.
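The new callback on x86, sketched (matches the shape of the upstream patch):

    static int x86_pmu_check_period(struct perf_event *event, u64 value)
    {
        /* reject the change if the event would become BTS */
        if (x86_pmu.check_period && x86_pmu.check_period(event, value))
            return -EINVAL;

        /* and apply the limit_period check as well */
        if (value && x86_pmu.limit_period) {
            if (x86_pmu.limit_period(event, value) > value)
                return -EINVAL;
        }

        return 0;
    }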
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      745de283
• perf/core: Fix impossible ring-buffer sizes warning · b335ca4b
  Ingo Molnar authored
      commit 528871b456026e6127d95b1b2bd8e3a003dc1614 upstream.
      
      The following commit:
      
        9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      
      results in perf recording failures with larger mmap areas:
      
        root@skl:/tmp# perf record -g -a
        failed to mmap with 12 (Cannot allocate memory)
      
      The root cause is that the following condition is buggy:
      
      	if (order_base_2(size) >= MAX_ORDER)
      		goto fail;
      
      The problem is that @size is in bytes and MAX_ORDER is in pages,
      so the right test is:
      
      	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
      		goto fail;
      
      Fix it.
      Reported-by: N"Jin, Yao" <yao.jin@linux.intel.com>
      Bisected-by: NBorislav Petkov <bp@alien8.de>
      Analyzed-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b335ca4b
• sysctl: add procfs interface to disable/enable usermodhelper_affinity · 0092df08
  Xie XiuQi authored
      hulk inclusion
      category: feature
      feature: performance/latency
      upstream: never
      bugzilla: 2680,10641
      CVE: NA
      
The previous patch "kmod: run usermodehelpers only on cpus allowed for
kthreadd V2" allows setting the usermodehelper threads' affinity.

Add a procfs interface to disable/enable usermodehelper_affinity.
It is disabled by default.
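What such a knob typically looks like in the kernel sysctl table (a sketch
only; the variable name and limits are assumed from the description,
disabled by default):

    static int usermodehelper_affinity;    /* 0 = disabled (default) */

    {
        .procname     = "usermodehelper_affinity",
        .data         = &usermodehelper_affinity,
        .maxlen       = sizeof(int),
        .mode         = 0644,
        .proc_handler = proc_dointvec_minmax,
        .extra1       = &zero,
        .extra2       = &one,
    },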
      
      [XQ: backport patch from euleros 2.x:
        - kmod.c => umh.c
      ]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0092df08
• kmod: run usermodehelpers only on cpus allowed for kthreadd V2 · 672627d1
  Christoph Lameter authored
      hulk inclusion
      category: feature
      feature: performance/latency
      upstream: never
      bugzilla: 2680,10641
      CVE: NA
      
Isolate the usermodehelper kernel threads to other CPUs to avoid the
latency issue.

With this patch, the usermodehelper threads inherit kthreadd's
affinity.

For example, if you want to isolate usermodehelper to cpu 1:
1) taskset -cp 1 2   # bind the kthreadd task (pid = 2) to cpu 1
2) trigger the usermodehelper threads
      
      ---------------------------------------------
      
      usermodehelper() threads can currently run on all processors.  This is an
issue for low latency cores.  Spawning a new thread causes cpu holdoffs in
      the range of hundreds of microseconds to a few milliseconds.  Not good for
      cores on which processes run that need to react as fast as possible.
      
      kthreadd threads can be restricted using taskset to a limited set of
      processors.  Then the kernel thread pool will not fork processes on those
      anymore thereby protecting those processors from additional latencies.
      
      Make usermodehelper() threads obey the limitations that kthreadd is
      restricted to.  Kthreadd is not the parent of usermodehelper threads so we
      need to explicitly get the allowed processors for kthreadd.
      
      Before this patch there is no way to limit the cpus that usermodehelper
      can run on since the affinity is set when the thread is spawned to all
      processors.
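The core of the change, sketched (the enable check comes from the
companion sysctl patch above; 4.19-era field names assumed):

    /* in call_usermodehelper_exec_async(), before exec'ing the helper:
     * inherit kthreadd's affinity instead of all online CPUs
     */
    if (usermodehelper_affinity)
        set_cpus_allowed_ptr(current, &kthreadd_task->cpus_allowed);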
      
      [akpm@linux-foundation.org: set_cpus_allowed() doesn't exist when
      CONFIG_CPUMASK_OFFSTACK=y]
      [akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://patchwork.kernel.org/patch/3153671/
Reported-and-tested-by: Xiangyou Xie <xiexiangyou@huawei.com>
[
1) kmod.c => umh.c
2) ____call_usermodehelper => call_usermodehelper_exec_async
]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      672627d1
• pagecache: add switch to close the feature completely · 354b44e8
  zhongjiang authored
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      This patch controls enabling/disabling of the feature via
      /proc/sys/vm/cache_reclaim_enable, so all background reclaim
      threads can be shut down completely on user demand.
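      
      A hedged sketch of the sysctl hook (the backing variable name and the
      0/1 clamping are assumptions, not taken from the patch; only the
      procname comes from the commit text):
      
        /* entry in vm_table[] of kernel/sysctl.c; names assumed */
        {
                .procname       = "cache_reclaim_enable",
                .data           = &vm_cache_reclaim_enable,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
                .extra2         = &one,
        },
      
      Usage then follows the usual pattern:
      
        # echo 0 > /proc/sys/vm/cache_reclaim_enable    # stop the background threads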
      Signed-off-by: Nzhongjiang <zhongjiang@huawei.com>
      Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      354b44e8
    • Z
      pagecache: add Kconfig to enable/disable the feature · 862e2308
      zhongjiang 提交于
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      Just add a Kconfig option for the feature.
      Signed-off-by: Nzhongjiang <zhongjiang@huawei.com>
      Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      862e2308
    • A
      bpf: fix lockdep false positive in percpu_freelist · f72ddcb4
      Alexei Starovoitov 提交于
      mainline inclusion
      from mainline-5.0
      commit a89fac57b5d0
      category: bugfix
      bugzilla: 9352
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
      [   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
      [   12.493275]  (&rq->lock){-.-.}
      [   12.493276]
      [   12.493276]
      [   12.493276] and interrupts could create inverse lock ordering between them.
      [   12.493276]
      [   12.494435]
      [   12.494435] other info that might help us debug this:
      [   12.494979]  Possible interrupt unsafe locking scenario:
      [   12.494979]
      [   12.495518]        CPU0                    CPU1
      [   12.495879]        ----                    ----
      [   12.496243]   lock(&head->lock);
      [   12.496502]                                local_irq_disable();
      [   12.496969]                                lock(&rq->lock);
      [   12.497431]                                lock(&head->lock);
      [   12.497890]   <Interrupt>
      [   12.498104]     lock(&rq->lock);
      [   12.498368]
      [   12.498368]  *** DEADLOCK ***
      [   12.498368]
      [   12.498837] 1 lock held by dd/276:
      [   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
      [   12.499747]
      [   12.499747] the shortest dependencies between 2nd lock and 1st lock:
      [   12.500389]  -> (&rq->lock){-.-.} {
      [   12.500669]     IN-HARDIRQ-W at:
      [   12.500934]                       _raw_spin_lock+0x2f/0x40
      [   12.501373]                       scheduler_tick+0x4c/0xf0
      [   12.501812]                       update_process_times+0x40/0x50
      [   12.502294]                       tick_periodic+0x27/0xb0
      [   12.502723]                       tick_handle_periodic+0x1f/0x60
      [   12.503203]                       timer_interrupt+0x11/0x20
      [   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
      [   12.504167]                       handle_irq_event_percpu+0x20/0x50
      [   12.504674]                       handle_irq_event+0x37/0x60
      [   12.505139]                       handle_level_irq+0xa7/0x120
      [   12.505601]                       handle_irq+0xa1/0x150
      [   12.506018]                       do_IRQ+0x77/0x140
      [   12.506411]                       ret_from_intr+0x0/0x1d
      [   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
      [   12.507362]                       __setup_irq+0x481/0x730
      [   12.507789]                       setup_irq+0x49/0x80
      [   12.508195]                       hpet_time_init+0x21/0x32
      [   12.508644]                       x86_late_time_init+0xb/0x16
      [   12.509106]                       start_kernel+0x390/0x42a
      [   12.509554]                       secondary_startup_64+0xa4/0xb0
      [   12.510034]     IN-SOFTIRQ-W at:
      [   12.510305]                       _raw_spin_lock+0x2f/0x40
      [   12.510772]                       try_to_wake_up+0x1c7/0x4e0
      [   12.511220]                       swake_up_locked+0x20/0x40
      [   12.511657]                       swake_up_one+0x1a/0x30
      [   12.512070]                       rcu_process_callbacks+0xc5/0x650
      [   12.512553]                       __do_softirq+0xe6/0x47b
      [   12.512978]                       irq_exit+0xc3/0xd0
      [   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
      [   12.513876]                       apic_timer_interrupt+0xf/0x20
      [   12.514343]                       default_idle+0x1c/0x170
      [   12.514765]                       do_idle+0x199/0x240
      [   12.515159]                       cpu_startup_entry+0x19/0x20
      [   12.515614]                       start_kernel+0x422/0x42a
      [   12.516045]                       secondary_startup_64+0xa4/0xb0
      [   12.516521]     INITIAL USE at:
      [   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
      [   12.517258]                      rq_attach_root+0x16/0xd0
      [   12.517685]                      sched_init+0x2f2/0x3eb
      [   12.518096]                      start_kernel+0x1fb/0x42a
      [   12.518525]                      secondary_startup_64+0xa4/0xb0
      [   12.518986]   }
      [   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
      [   12.519649]   ... acquired at:
      [   12.519892]    pcpu_freelist_pop+0x7b/0xd0
      [   12.520221]    bpf_get_stackid+0x1d2/0x4d0
      [   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.520887]
      [   12.521008] -> (&head->lock){+...} {
      [   12.521292]    HARDIRQ-ON-W at:
      [   12.521539]                     _raw_spin_lock+0x2f/0x40
      [   12.521950]                     pcpu_freelist_push+0x2a/0x40
      [   12.522396]                     bpf_get_stackid+0x494/0x4d0
      [   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
      [   12.523296]    INITIAL USE at:
      [   12.523537]                    _raw_spin_lock+0x2f/0x40
      [   12.523944]                    pcpu_freelist_populate+0xc0/0x120
      [   12.524417]                    htab_map_alloc+0x405/0x500
      [   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
      [   12.525253]                    do_syscall_64+0x4a/0x180
      [   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   12.526167]  }
      [   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
      [   12.526812]  ... acquired at:
      [   12.527047]    __lock_acquire+0x521/0x1350
      [   12.527371]    lock_acquire+0x98/0x190
      [   12.527680]    _raw_spin_lock+0x2f/0x40
      [   12.527994]    pcpu_freelist_push+0x2a/0x40
      [   12.528325]    bpf_get_stackid+0x494/0x4d0
      [   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.528970]
      [   12.529092]
      [   12.529092] stack backtrace:
      [   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
      [   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   12.530750] Call Trace:
      [   12.530948]  dump_stack+0x5f/0x8b
      [   12.531248]  check_usage_backwards+0x10c/0x120
      [   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
      [   12.531935]  ? mark_lock+0x382/0x560
      [   12.532229]  mark_lock+0x382/0x560
      [   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
      [   12.532928]  __lock_acquire+0x521/0x1350
      [   12.533271]  ? find_get_entry+0x17f/0x2e0
      [   12.533586]  ? find_get_entry+0x19c/0x2e0
      [   12.533902]  ? lock_acquire+0x98/0x190
      [   12.534196]  lock_acquire+0x98/0x190
      [   12.534482]  ? pcpu_freelist_push+0x2a/0x40
      [   12.534810]  _raw_spin_lock+0x2f/0x40
      [   12.535099]  ? pcpu_freelist_push+0x2a/0x40
      [   12.535432]  pcpu_freelist_push+0x2a/0x40
      [   12.535750]  bpf_get_stackid+0x494/0x4d0
      [   12.536062]  ___bpf_prog_run+0x8b4/0x11a0
      
      It has been explained that this is a false positive here:
      https://lkml.org/lkml/2018/7/25/756
      Recap:
      - stackmap uses pcpu_freelist
      - The lock in pcpu_freelist is a percpu lock
      - stackmap is only used by tracing bpf_prog
      - A tracing bpf_prog cannot be run if another bpf_prog
        has already been running (ensured by the percpu bpf_prog_active counter).
      
      Eric pointed out that this lockdep splat stops other
      legit lockdep splats in selftests/bpf/test_progs.c.
      
      Fix this by calling local_irq_save/restore for stackmap.
      
      Another false positive had also been worked around by calling
      local_irq_save in commit 89ad2fa3 ("bpf: fix lockdep splat").
      That commit added unnecessary irq_save/restore to the fast path of
      the bpf hash map. irqs are already disabled at that point, since htab
      is holding the per-bucket spin_lock with irqsave.
      
      Let's reduce the overhead for htab by introducing __pcpu_freelist_push/pop
      functions w/o irqsave and converting pcpu_freelist_push/pop to irqsave
      variants to be used elsewhere (right now only in stackmap).
      This stops the lockdep false positive in stackmap with a bit of acceptable overhead.
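      
      A minimal sketch of the split, simplified from the change (the struct
      layout is abbreviated and the pop side is analogous):
      
        /* lock-only variant: callers such as htab already have irqs
         * disabled by the per-bucket spin_lock_irqsave */
        void __pcpu_freelist_push(struct pcpu_freelist *s,
                                  struct pcpu_freelist_node *node)
        {
                struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);
      
                raw_spin_lock(&head->lock);
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
        }
      
        /* irqsave variant for any-context callers (right now only stackmap) */
        void pcpu_freelist_push(struct pcpu_freelist *s,
                                struct pcpu_freelist_node *node)
        {
                unsigned long flags;
      
                local_irq_save(flags);
                __pcpu_freelist_push(s, node);
                local_irq_restore(flags);
        }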
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f72ddcb4
    • M
      bpf: Fix syscall's stackmap lookup potential deadlock · 2b8e56c9
      Martin KaFai Lau 提交于
      mainline inclusion
      from mainline-5.0
      commit 7c4cd051add3
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The map_lookup_elem path used to not acquire the spinlock,
      in order to optimize the reader side.
      
      It was true until commit 557c0c6e ("bpf: convert stackmap to pre-allocation")
      The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
      bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
      If that is the case, pcpu_freelist_push() saves this elem for reuse later.
      This push requires a spinlock.
      
      If a tracing bpf_prog got run in the middle of the syscall's
      map_lookup_elem(stackmap) and this tracing bpf_prog is calling
      bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
      spinlock, it may end up in a deadlock situation, as reported by
      Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
      
      The situation is the same as the syscall's map_update_elem() which
      needs to acquire the pcpu_freelist's spinlock and could race
      with tracing bpf_prog.  Hence, this patch fixes it by protecting
      bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
      to prevent tracing bpf_prog from running.
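      
      A minimal sketch of the protection, simplified (the real syscall code
      also covers the other map flavours):
      
        preempt_disable();
        this_cpu_inc(bpf_prog_active);  /* tracing bpf_progs test this counter */
        err = bpf_stackmap_copy(map, key, value);
        this_cpu_dec(bpf_prog_active);
        preempt_enable();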
      
      A later map_lookup_elem change, commit f1a2e44a3aec ("bpf: add queue and stack maps"),
      also acquires a spinlock and races with tracing bpf_prog similarly.
      Hence, this patch is forward looking and protects the majority
      of the map lookups.  bpf_map_offload_lookup_elem() is the exception
      since it is for network bpf_prog only (i.e. never called by tracing
      bpf_prog).
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2b8e56c9
    • P
      bpf: error handling when map_lookup_elem isn't supported · d7cb6182
      Prashant Bhole 提交于
      mainline inclusion
      from mainline-4.20
      commit 509db2833e0d
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The error value returned by map_lookup_elem doesn't differentiate
      whether the lookup failed because of an invalid key or because
      lookup is not supported.
      
      Let's add handling for the -EOPNOTSUPP return value of the map's
      map_lookup_elem() method, with the expectation that a map
      implementation should return -EOPNOTSUPP if lookup is not supported.
      
      The errno for bpf syscall for BPF_MAP_LOOKUP_ELEM command will be set
      to EOPNOTSUPP if map lookup is not supported.
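      
      A minimal sketch of the syscall-side handling, simplified (an ERR_PTR
      from the map is now propagated instead of being folded into -ENOENT):
      
        rcu_read_lock();
        ptr = map->ops->map_lookup_elem(map, key);
        if (IS_ERR(ptr)) {
                err = PTR_ERR(ptr);     /* e.g. -EOPNOTSUPP from the map */
        } else if (!ptr) {
                err = -ENOENT;
        } else {
                err = 0;
                memcpy(value, ptr, value_size);
        }
        rcu_read_unlock();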
      Signed-off-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d7cb6182
    • A
      bpf: fix potential deadlock in bpf_prog_register · d1c1cb28
      Alexei Starovoitov 提交于
      mainline inclusion
      from mainline-5.0
      commit e16ec34039c7
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
      [   13.007000] WARNING: possible circular locking dependency detected
      [   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
      [   13.008124] ------------------------------------------------------
      [   13.008624] test_progs/246 is trying to acquire lock:
      [   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
      [   13.009770]
      [   13.009770] but task is already holding lock:
      [   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.010877]
      [   13.010877] which lock already depends on the new lock.
      [   13.010877]
      [   13.011532]
      [   13.011532] the existing dependency chain (in reverse order) is:
      [   13.012129]
      [   13.012129] -> #4 (bpf_event_mutex){+.+.}:
      [   13.012582]        perf_event_query_prog_array+0x9b/0x130
      [   13.013016]        _perf_ioctl+0x3aa/0x830
      [   13.013354]        perf_ioctl+0x2e/0x50
      [   13.013668]        do_vfs_ioctl+0x8f/0x6a0
      [   13.014003]        ksys_ioctl+0x70/0x80
      [   13.014320]        __x64_sys_ioctl+0x16/0x20
      [   13.014668]        do_syscall_64+0x4a/0x180
      [   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.015469]
      [   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
      [   13.015910]        perf_event_init_cpu+0x5a/0x90
      [   13.016291]        perf_event_init+0x1b2/0x1de
      [   13.016654]        start_kernel+0x2b8/0x42a
      [   13.016995]        secondary_startup_64+0xa4/0xb0
      [   13.017382]
      [   13.017382] -> #2 (pmus_lock){+.+.}:
      [   13.017794]        perf_event_init_cpu+0x21/0x90
      [   13.018172]        cpuhp_invoke_callback+0xb3/0x960
      [   13.018573]        _cpu_up+0xa7/0x140
      [   13.018871]        do_cpu_up+0xa4/0xc0
      [   13.019178]        smp_init+0xcd/0xd2
      [   13.019483]        kernel_init_freeable+0x123/0x24f
      [   13.019878]        kernel_init+0xa/0x110
      [   13.020201]        ret_from_fork+0x24/0x30
      [   13.020541]
      [   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
      [   13.021051]        static_key_slow_inc+0xe/0x20
      [   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
      [   13.021891]        perf_trace_event_init+0x11f/0x250
      [   13.022297]        perf_trace_init+0x6b/0xa0
      [   13.022644]        perf_tp_event_init+0x25/0x40
      [   13.023011]        perf_try_init_event+0x6b/0x90
      [   13.023386]        perf_event_alloc+0x9a8/0xc40
      [   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
      [   13.024173]        do_syscall_64+0x4a/0x180
      [   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.024968]
      [   13.024968] -> #0 (tracepoints_mutex){+.+.}:
      [   13.025434]        __mutex_lock+0x86/0x970
      [   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
      [   13.026215]        bpf_probe_register+0x40/0x60
      [   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.027042]        __do_sys_bpf+0x94f/0x1a90
      [   13.027389]        do_syscall_64+0x4a/0x180
      [   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.028171]
      [   13.028171] other info that might help us debug this:
      [   13.028171]
      [   13.028807] Chain exists of:
      [   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
      [   13.028807]
      [   13.029666]  Possible unsafe locking scenario:
      [   13.029666]
      [   13.030140]        CPU0                    CPU1
      [   13.030510]        ----                    ----
      [   13.030875]   lock(bpf_event_mutex);
      [   13.031166]                                lock(&cpuctx_mutex);
      [   13.031645]                                lock(bpf_event_mutex);
      [   13.032135]   lock(tracepoints_mutex);
      [   13.032441]
      [   13.032441]  *** DEADLOCK ***
      [   13.032441]
      [   13.032911] 1 lock held by test_progs/246:
      [   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.033909]
      [   13.033909] stack backtrace:
      [   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
      [   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   13.035657] Call Trace:
      [   13.035859]  dump_stack+0x5f/0x8b
      [   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
      [   13.036526]  __lock_acquire+0x1158/0x1350
      [   13.036852]  ? lock_acquire+0x98/0x190
      [   13.037154]  lock_acquire+0x98/0x190
      [   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.037876]  __mutex_lock+0x86/0x970
      [   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.039028]  ? __mutex_lock+0x86/0x970
      [   13.039337]  ? __mutex_lock+0x24a/0x970
      [   13.039649]  ? bpf_probe_register+0x1d/0x60
      [   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
      [   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
      [   13.041325]  bpf_probe_register+0x40/0x60
      [   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.042068]  ? __might_fault+0x3e/0x90
      [   13.042374]  __do_sys_bpf+0x94f/0x1a90
      [   13.042678]  do_syscall_64+0x4a/0x180
      [   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.043382] RIP: 0033:0x7f23b10a07f9
      [   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
      [   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
      [   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
      [   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      [   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
      
      Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister(),
      there is no need to take bpf_event_mutex too.
      bpf_event_mutex protects modifications to the prog array used by kprobe/perf
      bpf progs; bpf_raw_tracepoints don't need to take this mutex.
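      
      A minimal before/after sketch of the register path (unregister is
      symmetric; simplified):
      
        int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
        {
        -       int err;
        -
        -       mutex_lock(&bpf_event_mutex);
        -       err = __bpf_probe_register(btp, prog);
        -       mutex_unlock(&bpf_event_mutex);
        -       return err;
        +       /* tracepoints_mutex inside tracepoint_probe_register() suffices */
        +       return __bpf_probe_register(btp, prog);
        }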
      
      Fixes: c4f6699d ("bpf: introduce BPF_RAW_TRACEPOINT")
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d1c1cb28
    • M
      bpf: support raw tracepoints in modules · 00d0f1b1
      Matt Mullins 提交于
      mainline inclusion
      from mainline-5.0
      commit a38d1107f937
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Distributions build drivers as modules, including network and filesystem
      drivers which export numerous tracepoints.  This enables
      bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
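      
      A minimal sketch of the resulting lookup order, simplified from the
      change (module walking and reference counting are condensed into the
      helper):
      
        struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name)
        {
                struct bpf_raw_event_map *btp = __start__bpf_raw_tp;
      
                /* built-in tracepoints first */
                for (; btp < __stop__bpf_raw_tp; btp++) {
                        if (!strcmp(btp->tp->name, name))
                                return btp;
                }
      
                /* then the __bpf_raw_tp sections of loaded modules */
                return bpf_get_raw_tracepoint_module(name);
        }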
      Signed-off-by: NMatt Mullins <mmullins@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      00d0f1b1
    • Z
      timekeeping: Fix ktime_add overflow in tk_set_wall_to_mono · 56ef5ae6
      Zhang Xiaoxu 提交于
      euler inclusion
      category: bugfix
      Bugzilla: 5380
      CVE: NA
      ----------------------------------------
      
      Syzkaller reported a UBSAN bug:
      UBSAN: Undefined behaviour in kernel/time/timekeeping.c:98:17
      signed integer overflow:
      8589935550743139462 + 2147483647000000000 cannot be represented
      in type 'long long int'
      
      Use ktime_add_safe() instead of ktime_add() in tk_set_wall_to_mono().
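      
      A minimal sketch of the change, assuming the offending addition is the
      offs_tai computation at the end of tk_set_wall_to_mono():
      
        -       tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tk->tai_offset, 0));
        +       tk->offs_tai = ktime_add_safe(tk->offs_real, ktime_set(tk->tai_offset, 0));
      
      ktime_add_safe() saturates the result instead of letting the signed
      addition wrap.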
      Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      56ef5ae6
    • Z
      tracing: fix incorrect tracer freeing when opening tracing pipe · 2a2e4325
      zhangyi (F) 提交于
      euler inclusion
      category: bugfix
      bugzilla: 9292
      CVE: NA
      ---------------------------
      
      Commit d716ff71 ("tracing: Remove taking of trace_types_lock in
      pipe files") uses the current tracer instead of a copy in
      tracing_open_pipe(), but forgot to remove the freeing statement in
      the error path.
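      
      A minimal sketch of the fix in tracing_open_pipe()'s error path,
      assuming the usual fail-label layout; iter->trace now points at the
      live tracer, so freeing it is wrong:
      
        fail:
        -       kfree(iter->trace);
                kfree(iter);
                __trace_array_put(tr);
                mutex_unlock(&trace_types_lock);
                return ret;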
      
      Fixes: d716ff71 ("tracing: Remove taking of trace_types_lock in pipe files")
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: NLi Bin <huawei.libin@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2a2e4325
    • S
      printk: move printk_safe macros to printk header · 21c0c13e
      Sergey Senozhatsky 提交于
      euler inclusion
      category: bugfix
      bugzilla: 9509
      CVE: NA
      -------------------------------------------------
      
      Make printk_safe_enter_irqsave()/etc macros available to the
      rest of the kernel.
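      
      A hedged usage sketch: with the macros visible through the printk
      header, a caller that must avoid recursing into printk's own locks
      can wrap its output like this:
      
        unsigned long flags;
      
        printk_safe_enter_irqsave(flags);
        pr_err("dumping state from a printk-unsafe context\n");
        printk_safe_exit_irqrestore(flags);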
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NHongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      21c0c13e