1. 27 December 2019, 40 commits
• bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id() · eda04c10
  Peng Sun authored
      mainline inclusion
      from mainline-5.0
      commit 781e62823cb81b972dc8652c1827205cda2ac9ac
      category: bugfix
      bugzilla: 11101
      CVE: NA
      
      -------------------------------------------------
In bpf/syscall.c, bpf_map_get_fd_by_id() uses bpf_map_inc_not_zero()
to increase the refcount, bumping both map->refcnt and map->usercnt.
Consequently, if bpf_map_new_fd() fails, map->usercnt must be dropped
as well.
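A minimal sketch of the corrected error path (hedged; bpf_map_put_with_uref()
is the existing helper that drops both refcnt and usercnt):

    fd = bpf_map_new_fd(map, f_flags);
    if (fd < 0)
        /* previously bpf_map_put(map), which left usercnt elevated */
        bpf_map_put_with_uref(map);

    return fd;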
      
      Fixes: bd5f5f4e ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Signed-off-by: Peng Sun <sironhide0null@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
(cherry picked from commit 781e62823cb81b972dc8652c1827205cda2ac9ac)
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eda04c10
• relay: check return of create_buf_file() properly · 84e164fa
  Greg Kroah-Hartman authored
      [ Upstream commit 2c1cf00eeacb784781cf1c9896b8af001246d339 ]
      
      If create_buf_file() returns an error, don't try to reference it later
      as a valid dentry pointer.
      
      This problem was exposed when debugfs started to return errors instead
      of just NULL for some calls when they do not succeed properly.
      
      Also, the check for WARN_ON(dentry) was just wrong :)
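Sketched against kernel/relay.c, the corrected call site looks roughly like
this (dentry must be treated as a possible ERR_PTR(), not only as
NULL-or-valid):

    dentry = chan->cb->create_buf_file(tmpname, chan->parent,
                                       S_IRUSR, buf,
                                       &chan->is_global);
    if (IS_ERR(dentry))
        goto free_buf;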
Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      84e164fa
• perf core: Fix perf_proc_update_handler() bug · 2813654e
  Stephane Eranian authored
      [ Upstream commit 1a51c5da5acc6c188c917ba572eebac5f8793432 ]
      
The perf_proc_update_handler() handles the
/proc/sys/kernel/perf_event_max_sample_rate sysctl variable. When PMU
IRQ handler timing monitoring is disabled, i.e., when
/proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
no modification of sysctl_perf_event_sample_rate is allowed, to prevent
possible hangs from wrong values.
      
      The problem is that the test to prevent modification is made after the
      sysctl variable is modified in perf_proc_update_handler().
      
      You get an error:
      
        $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
        echo: write error: invalid argument
      
      But the value is still modified causing all sorts of inconsistencies:
      
        $ cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
      
      This patch fixes the problem by moving the parsing of the value after
      the test.
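The reordered handler, close to the upstream fix (the throttling check now
runs before proc_dointvec_minmax() can touch the sysctl variable):

    int perf_proc_update_handler(struct ctl_table *table, int write,
                                 void __user *buffer, size_t *lenp, loff_t *ppos)
    {
        int ret;
        int perf_cpu = sysctl_perf_cpu_time_max_percent;
        /*
         * If throttling is disabled don't allow the write:
         */
        if (write && (perf_cpu == 100 || perf_cpu == 0))
            return -EINVAL;

        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        if (ret || !write)
            return ret;

        max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ);
        perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
        update_perf_cpu_limits();

        return 0;
    }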
      
      Committer testing:
      
        # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
        # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
        -bash: echo: write error: Invalid argument
        # cat /proc/sys/kernel/perf_event_max_sample_rate
        10001
        #
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2813654e
• bpf: fix sanitation rewrite in case of non-pointers · 223cd1fc
  Daniel Borkmann authored
      commit 3612af783cf52c74a031a2f11b82247b2599d3cd upstream.
      
      Marek reported that he saw an issue with the below snippet in that
timing measurements were off when loaded as unpriv while results
      were reasonable when loaded as privileged:
      
          [...]
          uint64_t a = bpf_ktime_get_ns();
          uint64_t b = bpf_ktime_get_ns();
          uint64_t delta = b - a;
          if ((int64_t)delta > 0) {
          [...]
      
Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"): fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER, since there too the masking rewrite
must be skipped (there is nothing to mask).
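A sketch of the added corner case in fixup_bpf_calls() (hedged; the
surrounding rewrite logic is unchanged):

    struct bpf_insn_aux_data *aux = &env->insn_aux_data[i + delta];

    /* Skip the masking rewrite for non-pointer ALU as well:
     * there is nothing to mask in that case.
     */
    if (!aux->alu_state ||
        aux->alu_state == BPF_ALU_NON_POINTER)
        continue;
    /* masking rewrite proceeds as before */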
      
      Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      223cd1fc
• tracing: Fix event filters and triggers to handle negative numbers · 96d5943c
  Pavel Tikhomirov authored
      commit 6a072128d262d2b98d31626906a96700d1fc11eb upstream.
      
When tracing the syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to the required errors.
But negative numbers do not work:
      
      [root@snorch sys_exit_read]# echo "ret == -1" > filter
      bash: echo: write error: Invalid argument
      [root@snorch sys_exit_read]# cat filter
      ret == -1
              ^
      parse_error: Invalid value (did you forget quotes)?
      
      Similar thing happens when setting triggers.
      
This is a regression in v4.17 introduced by the commit mentioned below;
testing without that commit shows no problem with negative numbers.
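A hedged sketch of the kind of change needed in the predicate parser
(helper name hypothetical; the point is to accept an optional leading '-'
wherever a numeric literal is recognized):

    /* hypothetical helper: does the token at position i start a number? */
    static bool is_number_start(const char *s, int i)
    {
        return isdigit(s[i]) || (s[i] == '-' && isdigit(s[i + 1]));
    }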
      
      Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com
      
      Cc: stable@vger.kernel.org
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      96d5943c
• perf: Paper over the hw.target problems · 03804742
  Alexander Shishkin authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006/11050
      CVE: NA
      --------------------------------------------------
      
[ Cheng Jian:
HULK-Syzkaller reported a problem which had already been reported to
mainline (lkml) by syzbot; this patch comes from the reply on lkml.
v1	https://lkml.org/lkml/2019/2/28/529
v2	https://lkml.org/lkml/2019/3/8/206
We merged v1 first, but it caused bugzilla #11050, because we also use
perf_remove_from_context() in perf_event_open() when we move events
from a SW context to a HW context, so we cannot destroy the event
there. v2 does not exhibit that warning.
It is the same as another patch at https://lkml.org/lkml/2019/3/8/536,
but clearer. ]
      
      First, we have a race between perf_event_release_kernel() and
      perf_free_event(), which happens when parent's event is released while the
      child's fork fails (because of a fatal signal, for example), that looks
      like this:
      
      cpu X                            cpu Y
      -----                            -----
                                       copy_process() error path
      perf_release(parent)             +->perf_event_free_task()
      +-> lock(child_ctx->mutex)       |  |
      +-> remove_from_context(child)   |  |
      +-> unlock(child_ctx->mutex)     |  |
      |                                |  +-> lock(child_ctx->mutex)
      |                                |  +-> unlock(child_ctx->mutex)
      |                                +-> free_task(child_task)
      +-> put_task_struct(child_task)
      
      Technically, we're still holding a reference to the task via
      parent->hw.target, that's not stopping free_task(), so we end up poking at
      free'd memory, as is pointed out by KASAN in the syzkaller report (see Link
      below). The straightforward fix is to drop the hw.target reference while
      the task is still around.
      
      Therein lies the second problem: the users of hw.target (uprobe) assume
      that it's around at ->destroy() callback time, where they use it for
      context. So, in order to not break the uprobe teardown and avoid leaking
      stuff, we need to call ->destroy() at the same time.
      
      This patch fixes the race and the subsequent fallout by doing both these
      things at remove_from_context time.
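A sketch of the idea (helper name hypothetical; the real patch folds this
into the remove-from-context path):

    static void perf_detach_target(struct perf_event *event)
    {
        if (!event->hw.target)
            return;

        /* uprobes use hw.target in ->destroy(), so run it now */
        if (event->destroy) {
            event->destroy(event);
            event->destroy = NULL;
        }

        put_task_struct(event->hw.target);
        event->hw.target = NULL;
    }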
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://syzkaller.appspot.com/bug?extid=a24c397a29ad22d86c98
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      03804742
• userfaultfd: use RCU to free the task struct when fork fails if MEMCG · 8eb04a7a
  Andrea Arcangeli authored
      euler inclusion
      category: bugfix
      bugzilla: 10989
      CVE: NA
      
      ------------------------------------------------
      
      MEMCG depends on the task structure not to be freed under
      rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
      mm->owner.
      
A better fix would be to avoid registering forked vmas in userfaultfd
contexts reported to the monitor, in case fork ends up failing.
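A sketch of the fork error path change, mirroring the mainline variant of
this fix (free the task through an RCU grace period only when MEMCG needs
it):

    static void __delayed_free_task(struct rcu_head *rhp)
    {
        struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

        free_task(tsk);
    }

    static __always_inline void delayed_free_task(struct task_struct *tsk)
    {
        if (IS_ENABLED(CONFIG_MEMCG))
            call_rcu(&tsk->rcu, __delayed_free_task);
        else
            free_task(tsk);
    }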
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8eb04a7a
• Revert "perf: Paper over the hw.target problems" · db93c085
  Cheng Jian authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006
      CVE: NA
      --------------------------------------------------
      
      This reverts commit b772baf9a14ab4975e8884a399a4e0bab2fb6bf9.
      
We merged the patch b772baf9a14a ("perf: Paper over the
hw.target problems") to resolve a use-after-free issue
(bugzilla #9513/#11006), but it causes new problems
(bugzilla #11050/#11049) in this version.

So just revert it.
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      db93c085
• posix-cpu-timers: Avoid undefined behaviour in timespec64_to_ns() · 2a3c5819
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 10876
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in ./include/linux/time64.h:120:27
      signed integer overflow:
      8243129037239968815 * 1000000000 cannot be represented in type 'long long int'
      CPU: 5 PID: 28854 Comm: syz-executor.1 Not tainted 4.19.24 #4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       timespec64_to_ns include/linux/time64.h:120 [inline]
       posix_cpu_timer_set+0x95a/0xb70 kernel/time/posix-cpu-timers.c:687
       do_timer_settime+0x198/0x2a0 kernel/time/posix-timers.c:892
       __do_sys_timer_settime kernel/time/posix-timers.c:918 [inline]
       __se_sys_timer_settime kernel/time/posix-timers.c:904 [inline]
       __x64_sys_timer_settime+0x18d/0x260 kernel/time/posix-timers.c:904
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462eb9
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f14e4127c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
      RAX: ffffffffffffffda RBX: 000000000073bfa0 RCX: 0000000000462eb9
      RDX: 0000000020000080 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f14e41286bc
      R13: 00000000004c54cc R14: 0000000000704278 R15: 00000000ffffffff
      ================================================================================
      
      It is because 'it_interval.tv_sec' is larger than 'KTIME_SEC_MAX' and
      'it_interval.tv_sec * NSEC_PER_SEC' overflows in 'timespec64_to_ns()'.
      
This patch checks whether 'it_interval.tv_sec' is larger than
'KTIME_SEC_MAX' and saturates the result if that is the case.
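One way to saturate, mirroring the later mainline fix in
include/linux/time64.h:

    static inline s64 timespec64_to_ns(const struct timespec64 *ts)
    {
        /* Prevent multiplication overflow: saturate at KTIME_MAX */
        if (ts->tv_sec >= KTIME_SEC_MAX)
            return KTIME_MAX;

        return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec;
    }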
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2a3c5819
• ntp: Avoid undefined behaviour in second_overflow() · d8df4fe5
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 11009
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in kernel/time/ntp.c:457:16
      signed integer overflow:
      9223372036854775807 + 500 cannot be represented in type 'long int'
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.25-dirty #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       second_overflow+0x403/0x540 kernel/time/ntp.c:457
       accumulate_nsecs_to_secs kernel/time/timekeeping.c:2002 [inline]
       logarithmic_accumulation kernel/time/timekeeping.c:2046 [inline]
       timekeeping_advance+0x2bb/0xec0 kernel/time/timekeeping.c:2114
       tick_do_update_jiffies64.part.2+0x1a0/0x350 kernel/time/tick-sched.c:97
       tick_do_update_jiffies64 kernel/time/tick-sched.c:1229 [inline]
       tick_nohz_update_jiffies kernel/time/tick-sched.c:499 [inline]
       tick_nohz_irq_enter kernel/time/tick-sched.c:1232 [inline]
       tick_irq_enter+0x1fd/0x240 kernel/time/tick-sched.c:1249
       irq_enter+0xc4/0x100 kernel/softirq.c:353
       entering_irq arch/x86/include/asm/apic.h:517 [inline]
       entering_ack_irq arch/x86/include/asm/apic.h:523 [inline]
       smp_apic_timer_interrupt+0x20/0x480 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
       </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
      Code: 01 f0 0f 82 bc fd ff ff 48 c7 c7 c0 21 b1 83 e8 a1 0a 02 ff e9 ab fd ff ff 4c 89 e7 e8 77 b6 a5 fe e9 6a ff ff ff 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
      RSP: 0018:ffff888106307d20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000007 RBX: dffffc0000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881062e4f1c
      RBP: 0000000000000003 R08: ffffed107c5dc77b R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff848c78a0
      R13: 0000000000000003 R14: 1ffff11020c60fae R15: 0000000000000000
       arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
       default_idle+0x24/0x2b0 arch/x86/kernel/process.c:561
       cpuidle_idle_call kernel/sched/idle.c:153 [inline]
       do_idle+0x2ca/0x420 kernel/sched/idle.c:262
       cpu_startup_entry+0xcb/0xe0 kernel/sched/idle.c:368
       start_secondary+0x421/0x570 arch/x86/kernel/smpboot.c:271
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      ================================================================================
      
It is because time_maxerror is set to 0x7FFFFFFFFFFFFFFF by the user. It
overflows when we add 'MAXFREQ / NSEC_PER_USEC' to it in
'second_overflow()'.

This patch adds a limit check and saturates 'time_maxerror' when the
user sets it.
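A hedged sketch of the clamp at the adjtimex() input point (placement
assumed; NTP_PHASE_LIMIT is the bound ntp.c already uses elsewhere):

    if (txc->modes & ADJ_MAXERROR)
        time_maxerror = clamp_t(long, txc->maxerror, 0, NTP_PHASE_LIMIT);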
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d8df4fe5
• bpf: fix lockdep false positive in stackmap · ec9d64f0
  Alexei Starovoitov authored
      mainline inclusion
      from mainline-5.0-rc8
      commit 3defaf2f15b2
      category: bugfix
      bugzilla: 10760
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   11.211460] ------------[ cut here ]------------
      [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
      [   11.213134] Modules linked in:
      [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
      [   11.223508] Call Trace:
      [   11.223705]  <IRQ>
      [   11.223874]  ? __local_bh_enable+0x7a/0x80
      [   11.224199]  up_read+0x1c/0xa0
      [   11.224446]  do_up_read+0x12/0x20
      [   11.224713]  irq_work_run_list+0x43/0x70
      [   11.225030]  irq_work_run+0x26/0x50
      [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
      [   11.225662]  irq_work_interrupt+0xf/0x20
      
since the rw_semaphore is released in a different task than the task that
locked the sema. That is expected behavior.
Fix the warning with up_read_non_owner() and an rwsem_release() annotation.
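Sketch of the two-sided annotation, close to the upstream fix: the irq_work
side releases the semaphore it does not own, and the NMI-safe side tells
lockdep the lock is handed off:

    static void do_up_read(struct irq_work *entry)
    {
        struct stack_map_irq_work *work;

        work = container_of(entry, struct stack_map_irq_work, irq_work);
        /* the sema was taken in another task context */
        up_read_non_owner(work->sem);
        work->sem = NULL;
    }

    /* at the queueing site: */
    work->sem = &current->mm->mmap_sem;
    irq_work_queue(&work->irq_work);
    /*
     * The irq_work will release the mmap_sem with up_read_non_owner().
     * The rwsem_release() is called here to release the lock from
     * lockdep's perspective.
     */
    rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);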
      
      Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ec9d64f0
• perf: Paper over the hw.target problems · e597bc6a
  Alexander Shishkin authored
      euler inclusion
      category: bugfix
      bugzilla: 9513/11006
      CVE: NA
      --------------------------------------------------
      
[ Cheng Jian:
HULK-Syzkaller reported a problem which had already been reported
to mainline (lkml) by syzbot; this patch comes from the reply on lkml.
https://lkml.org/lkml/2019/2/28/529 ]
      
      First, we have a race between perf_event_release_kernel() and
      perf_free_event(), which happens when parent's event is released while the
      child's fork fails (because of a fatal signal, for example), that looks
      like this:
      
      cpu X                            cpu Y
      -----                            -----
                                       copy_process() error path
      perf_release(parent)             +->perf_event_free_task()
      +-> lock(child_ctx->mutex)       |  |
      +-> remove_from_context(child)   |  |
      +-> unlock(child_ctx->mutex)     |  |
      |                                |  +-> lock(child_ctx->mutex)
      |                                |  +-> unlock(child_ctx->mutex)
      |                                +-> free_task(child_task)
      +-> put_task_struct(child_task)
      
      Technically, we're still holding a reference to the task via
      parent->hw.target, that's not stopping free_task(), so we end up poking at
      free'd memory, as is pointed out by KASAN in the syzkaller report (see Link
      below). The straightforward fix is to drop the hw.target reference while
      the task is still around.
      
      Therein lies the second problem: the users of hw.target (uprobe) assume
      that it's around at ->destroy() callback time, where they use it for
      context. So, in order to not break the uprobe teardown and avoid leaking
      stuff, we need to call ->destroy() at the same time.
      
      This patch fixes the race and the subsequent fallout by doing both these
      things at remove_from_context time.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://syzkaller.appspot.com/bug?extid=a24c397a29ad22d86c98
Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e597bc6a
• sched/wake_q: Fix wakeup ordering for wake_q · 5f8c56f8
  Peter Zijlstra authored
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
Notably, cmpxchg() does not provide ordering when it fails; however,
wake_q_add() requires ordering in this specific case too. Without it,
it would be possible for the concurrent wakeup to not observe our
prior state. (See the sketch after the litmus test below.)
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
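The fix itself is small; sketched here to match the litmus test above:

    void wake_q_add(struct wake_q_head *head, struct task_struct *task)
    {
        struct wake_q_node *node = &task->wake_q;

        /*
         * A failed cmpxchg() provides no ordering, so pair the
         * prior state writes with an explicit full barrier.
         */
        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
            return;

        get_task_struct(task);

        /* the head is empty, or the tail points to us */
        *head->lastp = node;
        head->lastp = &node->next;
    }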
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5f8c56f8
• genirq: Make sure the initial affinity is not empty · edbb30e3
  Srinivas Ramana authored
      [ Upstream commit bddda606ec76550dd63592e32a6e87e7d32583f7 ]
      
      If all CPUs in the irq_default_affinity mask are offline when an interrupt
      is initialized then irq_setup_affinity() can set an empty affinity mask for
      a newly allocated interrupt.
      
      Fix this by falling back to cpu_online_mask in case the resulting affinity
      mask is zero.
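The fallback, sketched against irq_setup_affinity():

    cpumask_and(&mask, cpu_online_mask, set);
    if (cpumask_empty(&mask))
        cpumask_copy(&mask, cpu_online_mask);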
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      edbb30e3
• genirq/matrix: Improve target CPU selection for managed interrupts. · 55662b68
  Long Li authored
      [ Upstream commit e8da8794a7fd9eef1ec9a07f0d4897c68581c72b ]
      
      On large systems with multiple devices of the same class (e.g. NVMe disks,
      using managed interrupts), the kernel can affinitize these interrupts to a
      small subset of CPUs instead of spreading them out evenly.
      
      irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
      of possible target CPUs which has the lowest number of interrupt vectors
      allocated.
      
This is done by searching for the CPU with the highest number of available
vectors. While this is correct for non-managed interrupts, it can select
the wrong CPU for managed interrupts. Under certain constellations this
results in affinitizing the managed interrupts of several devices to a
single CPU in a set.
      
      The book keeping of available vectors works the following way:
      
       1) Non-managed interrupts:
      
          available is decremented when the interrupt is actually requested by
          the device driver and a vector is assigned. It's incremented when the
          interrupt and the vector are freed.
      
       2) Managed interrupts:
      
          Managed interrupts guarantee vector reservation when the MSI/MSI-X
          functionality of a device is enabled, which is achieved by reserving
          vectors in the bitmaps of the possible target CPUs. This reservation
          decrements the available count on each possible target CPU.
      
          When the interrupt is requested by the device driver then a vector is
          allocated from the reserved region. The operation is reversed when the
          interrupt is freed by the device driver. Neither of these operations
          affect the available count.
      
The reservation persists up to the point where the MSI/MSI-X
functionality is disabled, and only this operation increments the
available count again.
      
      For non-managed interrupts the available count is the correct selection
      criterion because the guaranteed reservations need to be taken into
      account. Using the allocated counter could lead to a failing allocation in
      the following situation (total vector space of 10 assumed):
      
      		 CPU0	CPU1
       available:	    2	   0
       allocated:	    5	   3   <--- CPU1 is selected, but available space = 0
       managed reserved:  3	   7
      
       while available yields the correct result.
      
      For managed interrupts the available count is not the appropriate
      selection criterion because as explained above the available count is not
      affected by the actual vector allocation.
      
      The following example illustrates that. Total vector space of 10
      assumed. The starting point is:
      
      		 CPU0	CPU1
       available:	    5	   4
       allocated:	    2	   3
       managed reserved:  3	   3
      
       Allocating vectors for three non-managed interrupts will result in
       affinitizing the first two to CPU0 and the third one to CPU1 because the
       available count is adjusted with each allocation:
      
      		  CPU0	CPU1
       available:	     5	   4	<- Select CPU0 for 1st allocation
       --> allocated:	     3	   3
      
       available:	     4	   4	<- Select CPU0 for 2nd allocation
       --> allocated:	     4	   3
      
       available:	     3	   4	<- Select CPU1 for 3rd allocation
       --> allocated:	     4	   4
      
       But the allocation of three managed interrupts starting from the same
       point will affinitize all of them to CPU0 because the available count is
       not affected by the allocation (see above). So the end result is:
      
      		  CPU0	CPU1
       available:	     5	   4
       allocated:	     5	   3
      
      Introduce a "managed_allocated" field in struct cpumap to track the vector
      allocation for managed interrupts separately. Use this information to
      select the target CPU when a vector is allocated for a managed interrupt,
      which results in more evenly distributed vector assignments. The above
      example results in the following allocations:
      
      		 CPU0	CPU1
       managed_allocated: 0	   0	<- Select CPU0 for 1st allocation
       --> allocated:	    3	   3
      
       managed_allocated: 1	   0	<- Select CPU1 for 2nd allocation
       --> allocated:	    3	   4
      
       managed_allocated: 1	   1	<- Select CPU0 for 3rd allocation
       --> allocated:	    4	   4
      
      The allocation of non-managed interrupts is not affected by this change and
      is still evaluating the available count.
      
      The overall distribution of interrupt vectors for both types of interrupts
      might still not be perfectly even depending on the number of non-managed
      and managed interrupts in a system, but due to the reservation guarantee
      for managed interrupts this cannot be avoided.
      
      Expose the new field in debugfs as well.
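A sketch of the new selection helper (the name and criterion match the
upstream patch; it keys off managed_allocated instead of available):

    static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
                                                     const struct cpumask *msk)
    {
        unsigned int cpu, best_cpu, allocated = UINT_MAX;
        struct cpumap *cm;

        best_cpu = UINT_MAX;
        for_each_cpu(cpu, msk) {
            cm = per_cpu_ptr(m->maps, cpu);

            if (!cm->online || cm->managed_allocated > allocated)
                continue;

            best_cpu = cpu;
            allocated = cm->managed_allocated;
        }
        return best_cpu;
    }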
      
      [ tglx: Clarified the background of the problem in the changelog and
        	described it independent of NVME ]
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      55662b68
• irq/matrix: Spread managed interrupts on allocation · 5fc68885
  Dou Liyang authored
      [ Upstream commit 76f99ae5b54d48430d1f0c5512a84da0ff9761e0 ]
      
Linux spreads out non-managed interrupts across the possible target CPUs
to avoid vector space exhaustion.
      
      Managed interrupts are treated differently, as for them the vectors are
      reserved (with guarantee) when the interrupt descriptors are initialized.
      
      When the interrupt is requested a real vector is assigned. The assignment
      logic uses the first CPU in the affinity mask for assignment. If the
      interrupt has more than one CPU in the affinity mask, which happens when a
      multi queue device has less queues than CPUs, then doing the same search as
      for non managed interrupts makes sense as it puts the interrupt on the
      least interrupt plagued CPU. For single CPU affine vectors that's obviously
      a NOOP.
      
Restructure the matrix allocation code so it does the 'best CPU' search,
add a sanity check for an empty affinity mask, and adapt the call site in
the x86 vector management code.
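Sketch of the entry of irq_matrix_alloc_managed() after the restructuring
(a fragment; the per-CPU bit allocation that follows is unchanged):

    if (cpumask_empty(msk))
        return -EINVAL;

    cpu = matrix_find_best_cpu(m, msk);
    if (cpu == UINT_MAX)
        return -ENOSPC;
    /* allocate a vector bit on the selected CPU as before */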
      
      [ tglx: Added the empty mask check to the core and improved change log ]
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5fc68885
• irq/matrix: Split out the CPU selection code into a helper · 916144b4
  Dou Liyang authored
      [ Upstream commit 8ffe4e61c06a48324cfd97f1199bb9838acce2f2 ]
      
      Linux finds the CPU which has the lowest vector allocation count to spread
      out the non managed interrupts across the possible target CPUs, but does
      not do so for managed interrupts.
      
      Split out the CPU selection code into a helper function for reuse. No
      functional change.
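The split-out helper, sketched (the name matches the upstream patch):

    static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                             const struct cpumask *msk)
    {
        unsigned int cpu, best_cpu, maxavl = 0;
        struct cpumap *cm;

        best_cpu = UINT_MAX;
        for_each_cpu(cpu, msk) {
            cm = per_cpu_ptr(m->maps, cpu);

            if (!cm->online || cm->available <= maxavl)
                continue;

            best_cpu = cpu;
            maxavl = cm->available;
        }
        return best_cpu;
    }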
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      916144b4
• timekeeping: Avoid undefined behaviour in 'ktime_get_with_offset()' · 85d6dbdf
  Xiongfeng Wang authored
      euler inclusion
      category: feature
      Bugzilla: 10683
      CVE: N/A
      
      ----------------------------------------
      
When I ran the Syzkaller test suite, I got the following call trace.
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      
      ================================================================================
      UBSAN: Undefined behaviour in kernel/time/timekeeping.c:801:8
      signed integer overflow:
      500152103386 + 9223372036854775807 cannot be represented in type 'long long int'
      CPU: 6 PID: 13904 Comm: syz-executor.0 Not tainted 4.19.25 #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xca/0x13e lib/dump_stack.c:113
       ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
       handle_overflow+0x193/0x1e2 lib/ubsan.c:190
       ktime_get_with_offset+0x26a/0x2d0 kernel/time/timekeeping.c:801
       common_hrtimer_arm+0x14d/0x220 kernel/time/posix-timers.c:817
       common_timer_set+0x337/0x530 kernel/time/posix-timers.c:863
       do_timer_settime+0x198/0x290 kernel/time/posix-timers.c:892
       __do_sys_timer_settime kernel/time/posix-timers.c:918 [inline]
       __se_sys_timer_settime kernel/time/posix-timers.c:904 [inline]
       __x64_sys_timer_settime+0x18d/0x260 kernel/time/posix-timers.c:904
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462eb9
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f7968072c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9
      RDX: 00000000200000c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f79680736bc
      R13: 00000000004c54cc R14: 0000000000704278 R15: 00000000ffffffff
      ================================================================================
      
It is because the global variable 'offsets' is set to a very large but
still valid value. It overflows when we add 'tk->tkr_mono.base' and
'offsets'.

This patch uses 'ktime_add_safe()' to limit the result to 'KTIME_SEC_MAX'
when the addition overflows.
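For reference, ktime_add_safe() saturates like this (as in
kernel/time/hrtimer.c):

    ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs)
    {
        ktime_t res = ktime_add_unsafe(lhs, rhs);

        /*
         * We use KTIME_SEC_MAX here, the maximum timeout which we can
         * return to user space in a timespec:
         */
        if (res < 0 || res < lhs || res < rhs)
            res = ktime_set(KTIME_SEC_MAX, 0);

        return res;
    }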
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      85d6dbdf
• bpf, lpm: fix lookup bug in map_delete_elem · 5f549338
  Alban Crequy authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
trie_delete_elem() deleted an entry whenever the prefixlen matched, even
if the key itself did not match. This patch adds a check on matchlen.
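The added check, sketched in trie_delete_elem() after walking to the
candidate node:

    if (!node || node->prefixlen != key->prefixlen ||
        node->prefixlen != matchlen ||
        (node->flags & LPM_TREE_NODE_FLAG_IM)) {
        ret = -ENOENT;
        goto out;
    }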
      
      Reproducer:
      
      $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
      $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      key: 10 00 00 00 aa bb cc dd  value: 01
      Found 1 element
      $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
      $ echo $?
      0
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      Found 0 elements
      
      A similar reproducer is added in the selftests.
      
      Without the patch:
      
      $ sudo ./tools/testing/selftests/bpf/test_lpm_map
      test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
      Aborted
      
      With the patch: test_lpm_map runs without errors.
      
      Fixes: e454cf59 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
      Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5f549338
• bpf, lpm: make longest_prefix_match() faster · 7e35327e
  Eric Dumazet authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
      At LPC 2018 in Vancouver, Vlad Dumitrescu mentioned that longest_prefix_match()
      has a high cost [1].
      
      One reason for that cost is a loop handling one byte at a time.
      
      We can handle more bytes at a time, if enough attention is paid
      to endianness.
      
      I was able to remove ~55 % of longest_prefix_match() cpu costs.
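The fast path, sketched from the patch (8 bytes per step on 64-bit with
efficient unaligned access; the remaining bytes still go through the byte
loop):

    u32 limit = min(node->prefixlen, key->prefixlen);
    u32 prefixlen = 0, i = 0;

    if (trie->data_size >= 8) {
        u64 diff = be64_to_cpu(*(__be64 *)node->data ^
                               *(__be64 *)key->data);

        prefixlen = 64 - fls64(diff);
        if (prefixlen >= limit)
            return limit;
        if (diff)
            return prefixlen;
        i = 8;
    }
    /* byte-at-a-time comparison continues from offset i */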
      
[1] https://linuxplumbersconf.org/event/2/contributions/88/attachments/76/87/lpc-bpf-2018-shaping.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Dumitrescu <vladum@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7e35327e
• bpf: zero out build_id for BPF_STACK_BUILD_ID_IP · 1f98ce54
  Stanislav Fomichev authored
      [ Upstream commit 4af396ae4836c4ecab61e975b8e61270c551894d ]
      
When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset(),
make sure that the build_id field is empty. Since we are using a percpu
free list, there is a possibility that we might reuse a previous
bpf_stack_build_id with a non-zero build_id.
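The fix, sketched at the fallback site in stack_map_get_build_id_offset():

    /* fall back to ip and clear any stale build_id left over
     * from a recycled percpu element
     */
    id_offs[i].status = BPF_STACK_BUILD_ID_IP;
    id_offs[i].ip = ips[i];
    memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE);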
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f98ce54
• bpf: don't assume build-id length is always 20 bytes · 49facf53
  Stanislav Fomichev authored
      [ Upstream commit 0b698005a9d11c0e91141ec11a2c4918a129f703 ]
      
      Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
        * 128-bit (uuid)
        * 160-bit (sha1)
        * any length specified in ld --build-id=0xhexstring
      
      To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
      assume that build-id is somewhere in the range of 1 .. 20.
      Set the remaining bytes to zero.
      
      v2:
      * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
        we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
        this 'if' condition
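Sketched from the patch, the note parser now zero-pads short build-ids:

    if (nhdr->n_type == BPF_BUILD_ID &&
        nhdr->n_namesz == sizeof("GNU") &&
        nhdr->n_descsz > 0 &&
        nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
        memcpy(build_id,
               note_start + note_offs +
               ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
               nhdr->n_descsz);
        memset(build_id + nhdr->n_descsz, 0,
               BPF_BUILD_ID_SIZE - nhdr->n_descsz);
        return 0;
    }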
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49facf53
• bpf: fix panic in stack_map_get_build_id() on i386 and arm32 · 6a263990
  Song Liu authored
      [ Upstream commit beaf3d1901f4ea46fbd5c9d857227d99751de469 ]
      
As Naresh reported, test_stacktrace_build_id() causes a panic on i386 and
arm32 systems. This is caused by page_address() returning NULL in certain
cases.

This patch fixes the error by using kmap_atomic()/kunmap_atomic() instead
of page_address().
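The pattern of the fix, sketched:

    void *page_addr;

    /* page_address() may return NULL for highmem pages on 32-bit */
    page_addr = kmap_atomic(page);
    /* parse the ELF build-id note from the mapped page, as before */
    kunmap_atomic(page_addr);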
      
Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6a263990
• tracing: Fix number of entries in trace header · 2cc39ead
  Quentin Perret authored
      commit 9e7382153f80ba45a0bbcd540fb77d4b15f6e966 upstream.
      
      The following commit
      
        441dae8f ("tracing: Add support for display of tgid in trace output")
      
removed the call to print_event_info() from print_func_help_header_irq(),
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the original
behaviour.
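Sketch of the restored call (4.19-era signature):

    static void print_func_help_header_irq(struct trace_buffer *buf,
                                           struct seq_file *m,
                                           unsigned int flags)
    {
        print_event_info(buf, m);   /* restored: reports the entry counts */
        /* the '# irqs-off' column header lines follow unchanged */
    }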
      
Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com
Acked-by: Joel Fernandes <joelaf@google.com>
Cc: stable@vger.kernel.org
Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2cc39ead
• signal: Restore the stop PTRACE_EVENT_EXIT · fd839bdb
  Eric W. Biederman authored
      commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
      
In the middle of do_exit() there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);". That call places the process
in TASK_TRACED, aka "(TASK_WAKEKILL | __TASK_TRACED)", and waits for
the debugger to release the task or for SIGKILL to be delivered.

Skipping past dequeue_signal() when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced()
in ptrace_check_attach() to fail because it left a per-thread
SIGKILL pending, which is what fatal_signal_pending() tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
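The fix, sketched at the fatal-signal shortcut in get_signal() (the two
added lines are the sigdelset() and recalc_sigpending()):

    if (signal_group_exit(signal)) {
        ksig->info.si_signo = signr = SIGKILL;
        sigdelset(&current->pending.signal, SIGKILL);
        recalc_sigpending();
        goto fatal;
    }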
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fd839bdb
• tracing/uprobes: Fix output for multiple string arguments · 382322b4
  Andreas Ziegler authored
      commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.
      
      When printing multiple uprobe arguments as strings the output for the
      earlier arguments would also include all later string arguments.
      
      This is best explained in an example:
      
      Consider adding a uprobe to a function receiving two strings as
      parameters which is at offset 0xa0 in strlib.so and we want to print
      both parameters when the uprobe is hit (on x86_64):
      
      $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
          /sys/kernel/debug/tracing/uprobe_events
      
      When the function is called as func("foo", "bar") and we hit the probe,
      the trace file shows a line like the following:
      
        [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
      
      Note the extra "bar" printed as part of arg1. This behaviour stacks up
      for additional string arguments.
      
      The strings are stored in a dynamically growing part of the uprobe
      buffer by fetch_store_string() after copying them from userspace via
      strncpy_from_user(). The return value of strncpy_from_user() is then
      directly used as the required size for the string. However, this does
      not take the terminating null byte into account as the documentation
for strncpy_from_user() clearly states that it "[...] returns the
      length of the string (not including the trailing NUL)" even though the
      null byte will be copied to the destination.
      
      Therefore, subsequent calls to fetch_store_string() will overwrite
      the terminating null byte of the most recently fetched string with
      the first character of the current string, leading to the
      "accumulation" of strings in earlier arguments in the output.
      
      Fix this by incrementing the return value of strncpy_from_user() by
      one if we did not hit the maximum buffer size.
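The fix, sketched in fetch_store_string() (close to the upstream change):

    ret = strncpy_from_user(dst, src, maxlen);
    if (ret == maxlen)
        dst[--ret] = '\0';
    else if (ret >= 0)
        /*
         * Include the terminating null byte: it was copied by
         * strncpy_from_user() but is not accounted for in ret.
         */
        ret++;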
      
      Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 5baaa59e ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      382322b4
• perf/x86: Add check_period PMU callback · 745de283
  Jiri Olsa authored
      commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
If we create an 'almost' BTS event with precise_ip and callchains,
and then turn it into a BTS event via PERF_EVENT_IOC_PERIOD, it will
crash the perf_prepare_sample() function, because precise_ip events
are expected to come in with callchain data initialized, but that's
not the case for the intel_pmu_drain_bts_buffer() caller.
      
Add a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It denies the change
if the event would become a BTS event, and also applies the
limit_period check.
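The new callback on x86, sketched (matches the shape of the upstream patch):

    static int x86_pmu_check_period(struct perf_event *event, u64 value)
    {
        /* reject the change if the event would become BTS */
        if (x86_pmu.check_period && x86_pmu.check_period(event, value))
            return -EINVAL;

        /* and apply the limit_period check as well */
        if (value && x86_pmu.limit_period) {
            if (x86_pmu.limit_period(event, value) > value)
                return -EINVAL;
        }

        return 0;
    }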
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      745de283
• perf/core: Fix impossible ring-buffer sizes warning · b335ca4b
  Ingo Molnar authored
      commit 528871b456026e6127d95b1b2bd8e3a003dc1614 upstream.
      
      The following commit:
      
        9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      
      results in perf recording failures with larger mmap areas:
      
        root@skl:/tmp# perf record -g -a
        failed to mmap with 12 (Cannot allocate memory)
      
      The root cause is that the following condition is buggy:
      
      	if (order_base_2(size) >= MAX_ORDER)
      		goto fail;
      
      The problem is that @size is in bytes and MAX_ORDER is in pages,
      so the right test is:
      
      	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
      		goto fail;
      
      Fix it.
      Reported-by: N"Jin, Yao" <yao.jin@linux.intel.com>
      Bisected-by: NBorislav Petkov <bp@alien8.de>
      Analyzed-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b335ca4b
• sysctl: add procfs interface to disable/enable usermodhelper_affinity · 0092df08
  Xie XiuQi authored
      hulk inclusion
      category: feature
      feature: performance/latency
      upstream: never
      bugzilla: 2680,10641
      CVE: NA
      
The previous patch "kmod: run usermodehelpers only on cpus allowed for
kthreadd V2" allows setting the usermodehelper threads' affinity.

Add a procfs interface to disable/enable usermodehelper_affinity.
It is disabled by default.
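What such a knob typically looks like in the kernel sysctl table (a sketch
only; the variable name and limits are assumed from the description,
disabled by default):

    static int usermodehelper_affinity;    /* 0 = disabled (default) */

    {
        .procname     = "usermodehelper_affinity",
        .data         = &usermodehelper_affinity,
        .maxlen       = sizeof(int),
        .mode         = 0644,
        .proc_handler = proc_dointvec_minmax,
        .extra1       = &zero,
        .extra2       = &one,
    },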
      
      [XQ: backport patch from euleros 2.x:
        - kmod.c => umh.c
      ]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0092df08
• kmod: run usermodehelpers only on cpus allowed for kthreadd V2 · 672627d1
  Christoph Lameter authored
      hulk inclusion
      category: feature
      feature: performance/latency
      upstream: never
      bugzilla: 2680,10641
      CVE: NA
      
Isolate the usermodehelper kernel threads to other CPUs to avoid the
latency issue.

With this patch, the usermodehelper threads inherit kthreadd's
affinity.

For example, if you want to isolate usermodehelper to cpu 1:
1) taskset -cp 1 2   # bind the kthreadd task (pid = 2) to cpu 1
2) trigger the usermodehelper threads
      
      ---------------------------------------------
      
      usermodehelper() threads can currently run on all processors.  This is an
issue for low latency cores.  Spawning a new thread causes cpu holdoffs in
      the range of hundreds of microseconds to a few milliseconds.  Not good for
      cores on which processes run that need to react as fast as possible.
      
      kthreadd threads can be restricted using taskset to a limited set of
      processors.  Then the kernel thread pool will not fork processes on those
      anymore thereby protecting those processors from additional latencies.
      
      Make usermodehelper() threads obey the limitations that kthreadd is
      restricted to.  Kthreadd is not the parent of usermodehelper threads so we
      need to explicitly get the allowed processors for kthreadd.
      
      Before this patch there is no way to limit the cpus that usermodehelper
      can run on since the affinity is set when the thread is spawned to all
      processors.
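The core of the change, sketched (the enable check comes from the
companion sysctl patch above; 4.19-era field names assumed):

    /* in call_usermodehelper_exec_async(), before exec'ing the helper:
     * inherit kthreadd's affinity instead of all online CPUs
     */
    if (usermodehelper_affinity)
        set_cpus_allowed_ptr(current, &kthreadd_task->cpus_allowed);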
      
      [akpm@linux-foundation.org: set_cpus_allowed() doesn't exist when
      CONFIG_CPUMASK_OFFSTACK=y]
      [akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://patchwork.kernel.org/patch/3153671/
Reported-and-tested-by: Xiangyou Xie <xiexiangyou@huawei.com>
[
1) kmod.c => umh.c
2) ____call_usermodehelper => call_usermodehelper_exec_async
]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      672627d1
• pagecache: add switch to close the feature completely · 354b44e8
  zhongjiang authored
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      This patch controls enabling/disabling of the feature via
      /proc/sys/vm/cache_reclaim_enable, so all background reclaim
      threads can be shut down completely on user demand.
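      
      A hedged sketch of the sysctl hook (the backing variable name and the
      0/1 clamping are assumptions, not taken from the patch; only the
      procname comes from the commit text):
      
        /* entry in vm_table[] of kernel/sysctl.c; names assumed */
        {
                .procname       = "cache_reclaim_enable",
                .data           = &vm_cache_reclaim_enable,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
                .extra2         = &one,
        },
      
      Usage then follows the usual pattern:
      
        # echo 0 > /proc/sys/vm/cache_reclaim_enable    # stop the background threads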
      Signed-off-by: Nzhongjiang <zhongjiang@huawei.com>
      Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      354b44e8
    • Z
      pagecache: add Kconfig to enable/disable the feature · 862e2308
      zhongjiang 提交于
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      Just add a Kconfig option for the feature.
      Signed-off-by: Nzhongjiang <zhongjiang@huawei.com>
      Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      862e2308
    • A
      bpf: fix lockdep false positive in percpu_freelist · f72ddcb4
      Alexei Starovoitov 提交于
      mainline inclusion
      from mainline-5.0
      commit a89fac57b5d0
      category: bugfix
      bugzilla: 9352
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
      [   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
      [   12.493275]  (&rq->lock){-.-.}
      [   12.493276]
      [   12.493276]
      [   12.493276] and interrupts could create inverse lock ordering between them.
      [   12.493276]
      [   12.494435]
      [   12.494435] other info that might help us debug this:
      [   12.494979]  Possible interrupt unsafe locking scenario:
      [   12.494979]
      [   12.495518]        CPU0                    CPU1
      [   12.495879]        ----                    ----
      [   12.496243]   lock(&head->lock);
      [   12.496502]                                local_irq_disable();
      [   12.496969]                                lock(&rq->lock);
      [   12.497431]                                lock(&head->lock);
      [   12.497890]   <Interrupt>
      [   12.498104]     lock(&rq->lock);
      [   12.498368]
      [   12.498368]  *** DEADLOCK ***
      [   12.498368]
      [   12.498837] 1 lock held by dd/276:
      [   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
      [   12.499747]
      [   12.499747] the shortest dependencies between 2nd lock and 1st lock:
      [   12.500389]  -> (&rq->lock){-.-.} {
      [   12.500669]     IN-HARDIRQ-W at:
      [   12.500934]                       _raw_spin_lock+0x2f/0x40
      [   12.501373]                       scheduler_tick+0x4c/0xf0
      [   12.501812]                       update_process_times+0x40/0x50
      [   12.502294]                       tick_periodic+0x27/0xb0
      [   12.502723]                       tick_handle_periodic+0x1f/0x60
      [   12.503203]                       timer_interrupt+0x11/0x20
      [   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
      [   12.504167]                       handle_irq_event_percpu+0x20/0x50
      [   12.504674]                       handle_irq_event+0x37/0x60
      [   12.505139]                       handle_level_irq+0xa7/0x120
      [   12.505601]                       handle_irq+0xa1/0x150
      [   12.506018]                       do_IRQ+0x77/0x140
      [   12.506411]                       ret_from_intr+0x0/0x1d
      [   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
      [   12.507362]                       __setup_irq+0x481/0x730
      [   12.507789]                       setup_irq+0x49/0x80
      [   12.508195]                       hpet_time_init+0x21/0x32
      [   12.508644]                       x86_late_time_init+0xb/0x16
      [   12.509106]                       start_kernel+0x390/0x42a
      [   12.509554]                       secondary_startup_64+0xa4/0xb0
      [   12.510034]     IN-SOFTIRQ-W at:
      [   12.510305]                       _raw_spin_lock+0x2f/0x40
      [   12.510772]                       try_to_wake_up+0x1c7/0x4e0
      [   12.511220]                       swake_up_locked+0x20/0x40
      [   12.511657]                       swake_up_one+0x1a/0x30
      [   12.512070]                       rcu_process_callbacks+0xc5/0x650
      [   12.512553]                       __do_softirq+0xe6/0x47b
      [   12.512978]                       irq_exit+0xc3/0xd0
      [   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
      [   12.513876]                       apic_timer_interrupt+0xf/0x20
      [   12.514343]                       default_idle+0x1c/0x170
      [   12.514765]                       do_idle+0x199/0x240
      [   12.515159]                       cpu_startup_entry+0x19/0x20
      [   12.515614]                       start_kernel+0x422/0x42a
      [   12.516045]                       secondary_startup_64+0xa4/0xb0
      [   12.516521]     INITIAL USE at:
      [   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
      [   12.517258]                      rq_attach_root+0x16/0xd0
      [   12.517685]                      sched_init+0x2f2/0x3eb
      [   12.518096]                      start_kernel+0x1fb/0x42a
      [   12.518525]                      secondary_startup_64+0xa4/0xb0
      [   12.518986]   }
      [   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
      [   12.519649]   ... acquired at:
      [   12.519892]    pcpu_freelist_pop+0x7b/0xd0
      [   12.520221]    bpf_get_stackid+0x1d2/0x4d0
      [   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.520887]
      [   12.521008] -> (&head->lock){+...} {
      [   12.521292]    HARDIRQ-ON-W at:
      [   12.521539]                     _raw_spin_lock+0x2f/0x40
      [   12.521950]                     pcpu_freelist_push+0x2a/0x40
      [   12.522396]                     bpf_get_stackid+0x494/0x4d0
      [   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
      [   12.523296]    INITIAL USE at:
      [   12.523537]                    _raw_spin_lock+0x2f/0x40
      [   12.523944]                    pcpu_freelist_populate+0xc0/0x120
      [   12.524417]                    htab_map_alloc+0x405/0x500
      [   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
      [   12.525253]                    do_syscall_64+0x4a/0x180
      [   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   12.526167]  }
      [   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
      [   12.526812]  ... acquired at:
      [   12.527047]    __lock_acquire+0x521/0x1350
      [   12.527371]    lock_acquire+0x98/0x190
      [   12.527680]    _raw_spin_lock+0x2f/0x40
      [   12.527994]    pcpu_freelist_push+0x2a/0x40
      [   12.528325]    bpf_get_stackid+0x494/0x4d0
      [   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.528970]
      [   12.529092]
      [   12.529092] stack backtrace:
      [   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
      [   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   12.530750] Call Trace:
      [   12.530948]  dump_stack+0x5f/0x8b
      [   12.531248]  check_usage_backwards+0x10c/0x120
      [   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
      [   12.531935]  ? mark_lock+0x382/0x560
      [   12.532229]  mark_lock+0x382/0x560
      [   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
      [   12.532928]  __lock_acquire+0x521/0x1350
      [   12.533271]  ? find_get_entry+0x17f/0x2e0
      [   12.533586]  ? find_get_entry+0x19c/0x2e0
      [   12.533902]  ? lock_acquire+0x98/0x190
      [   12.534196]  lock_acquire+0x98/0x190
      [   12.534482]  ? pcpu_freelist_push+0x2a/0x40
      [   12.534810]  _raw_spin_lock+0x2f/0x40
      [   12.535099]  ? pcpu_freelist_push+0x2a/0x40
      [   12.535432]  pcpu_freelist_push+0x2a/0x40
      [   12.535750]  bpf_get_stackid+0x494/0x4d0
      [   12.536062]  ___bpf_prog_run+0x8b4/0x11a0
      
      It has been explained that this is a false positive here:
      https://lkml.org/lkml/2018/7/25/756
      Recap:
      - stackmap uses pcpu_freelist
      - The lock in pcpu_freelist is a percpu lock
      - stackmap is only used by tracing bpf_prog
      - A tracing bpf_prog cannot be run if another bpf_prog
        has already been running (ensured by the percpu bpf_prog_active counter).
      
      Eric pointed out that this lockdep splat stops other
      legit lockdep splats in selftests/bpf/test_progs.c.
      
      Fix this by calling local_irq_save/restore for stackmap.
      
      Another false positive had also been worked around by calling
      local_irq_save in commit 89ad2fa3 ("bpf: fix lockdep splat").
      That commit added unnecessary irq_save/restore to the fast path of
      the bpf hash map. irqs are already disabled at that point, since htab
      is holding the per-bucket spin_lock with irqsave.
      
      Let's reduce the overhead for htab by introducing __pcpu_freelist_push/pop
      functions w/o irqsave and converting pcpu_freelist_push/pop to irqsave
      variants to be used elsewhere (right now only in stackmap).
      This stops the lockdep false positive in stackmap with a bit of acceptable overhead.
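      
      A minimal sketch of the split, simplified from the change (the struct
      layout is abbreviated and the pop side is analogous):
      
        /* lock-only variant: callers such as htab already have irqs
         * disabled by the per-bucket spin_lock_irqsave */
        void __pcpu_freelist_push(struct pcpu_freelist *s,
                                  struct pcpu_freelist_node *node)
        {
                struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);
      
                raw_spin_lock(&head->lock);
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
        }
      
        /* irqsave variant for any-context callers (right now only stackmap) */
        void pcpu_freelist_push(struct pcpu_freelist *s,
                                struct pcpu_freelist_node *node)
        {
                unsigned long flags;
      
                local_irq_save(flags);
                __pcpu_freelist_push(s, node);
                local_irq_restore(flags);
        }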
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f72ddcb4
    • M
      bpf: Fix syscall's stackmap lookup potential deadlock · 2b8e56c9
      Martin KaFai Lau 提交于
      mainline inclusion
      from mainline-5.0
      commit 7c4cd051add3
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The map_lookup_elem path used to not acquire the spinlock,
      in order to optimize the reader side.
      
      It was true until commit 557c0c6e ("bpf: convert stackmap to pre-allocation")
      The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
      bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
      If that is the case, pcpu_freelist_push() saves this elem for reuse later.
      This push requires a spinlock.
      
      If a tracing bpf_prog got run in the middle of the syscall's
      map_lookup_elem(stackmap) and this tracing bpf_prog is calling
      bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
      spinlock, it may end up in a deadlock situation, as reported by
      Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
      
      The situation is the same as the syscall's map_update_elem() which
      needs to acquire the pcpu_freelist's spinlock and could race
      with tracing bpf_prog.  Hence, this patch fixes it by protecting
      bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
      to prevent tracing bpf_prog from running.
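      
      A minimal sketch of the protection, simplified (the real syscall code
      also covers the other map flavours):
      
        preempt_disable();
        this_cpu_inc(bpf_prog_active);  /* tracing bpf_progs test this counter */
        err = bpf_stackmap_copy(map, key, value);
        this_cpu_dec(bpf_prog_active);
        preempt_enable();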
      
      A later map_lookup_elem change, commit f1a2e44a3aec ("bpf: add queue and stack maps"),
      also acquires a spinlock and races with tracing bpf_prog similarly.
      Hence, this patch is forward looking and protects the majority
      of the map lookups.  bpf_map_offload_lookup_elem() is the exception
      since it is for network bpf_prog only (i.e. never called by tracing
      bpf_prog).
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2b8e56c9
    • P
      bpf: error handling when map_lookup_elem isn't supported · d7cb6182
      Prashant Bhole 提交于
      mainline inclusion
      from mainline-4.20
      commit 509db2833e0d
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The error value returned by map_lookup_elem doesn't differentiate
      whether the lookup failed because of an invalid key or because
      lookup is not supported.
      
      Let's add handling for the -EOPNOTSUPP return value of the map's
      map_lookup_elem() method, with the expectation that a map
      implementation should return -EOPNOTSUPP if lookup is not supported.
      
      The errno for bpf syscall for BPF_MAP_LOOKUP_ELEM command will be set
      to EOPNOTSUPP if map lookup is not supported.
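      
      A minimal sketch of the syscall-side handling, simplified (an ERR_PTR
      from the map is now propagated instead of being folded into -ENOENT):
      
        rcu_read_lock();
        ptr = map->ops->map_lookup_elem(map, key);
        if (IS_ERR(ptr)) {
                err = PTR_ERR(ptr);     /* e.g. -EOPNOTSUPP from the map */
        } else if (!ptr) {
                err = -ENOENT;
        } else {
                err = 0;
                memcpy(value, ptr, value_size);
        }
        rcu_read_unlock();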
      Signed-off-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d7cb6182
    • A
      bpf: fix potential deadlock in bpf_prog_register · d1c1cb28
      Alexei Starovoitov 提交于
      mainline inclusion
      from mainline-5.0
      commit e16ec34039c7
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
      [   13.007000] WARNING: possible circular locking dependency detected
      [   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
      [   13.008124] ------------------------------------------------------
      [   13.008624] test_progs/246 is trying to acquire lock:
      [   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
      [   13.009770]
      [   13.009770] but task is already holding lock:
      [   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.010877]
      [   13.010877] which lock already depends on the new lock.
      [   13.010877]
      [   13.011532]
      [   13.011532] the existing dependency chain (in reverse order) is:
      [   13.012129]
      [   13.012129] -> #4 (bpf_event_mutex){+.+.}:
      [   13.012582]        perf_event_query_prog_array+0x9b/0x130
      [   13.013016]        _perf_ioctl+0x3aa/0x830
      [   13.013354]        perf_ioctl+0x2e/0x50
      [   13.013668]        do_vfs_ioctl+0x8f/0x6a0
      [   13.014003]        ksys_ioctl+0x70/0x80
      [   13.014320]        __x64_sys_ioctl+0x16/0x20
      [   13.014668]        do_syscall_64+0x4a/0x180
      [   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.015469]
      [   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
      [   13.015910]        perf_event_init_cpu+0x5a/0x90
      [   13.016291]        perf_event_init+0x1b2/0x1de
      [   13.016654]        start_kernel+0x2b8/0x42a
      [   13.016995]        secondary_startup_64+0xa4/0xb0
      [   13.017382]
      [   13.017382] -> #2 (pmus_lock){+.+.}:
      [   13.017794]        perf_event_init_cpu+0x21/0x90
      [   13.018172]        cpuhp_invoke_callback+0xb3/0x960
      [   13.018573]        _cpu_up+0xa7/0x140
      [   13.018871]        do_cpu_up+0xa4/0xc0
      [   13.019178]        smp_init+0xcd/0xd2
      [   13.019483]        kernel_init_freeable+0x123/0x24f
      [   13.019878]        kernel_init+0xa/0x110
      [   13.020201]        ret_from_fork+0x24/0x30
      [   13.020541]
      [   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
      [   13.021051]        static_key_slow_inc+0xe/0x20
      [   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
      [   13.021891]        perf_trace_event_init+0x11f/0x250
      [   13.022297]        perf_trace_init+0x6b/0xa0
      [   13.022644]        perf_tp_event_init+0x25/0x40
      [   13.023011]        perf_try_init_event+0x6b/0x90
      [   13.023386]        perf_event_alloc+0x9a8/0xc40
      [   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
      [   13.024173]        do_syscall_64+0x4a/0x180
      [   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.024968]
      [   13.024968] -> #0 (tracepoints_mutex){+.+.}:
      [   13.025434]        __mutex_lock+0x86/0x970
      [   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
      [   13.026215]        bpf_probe_register+0x40/0x60
      [   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.027042]        __do_sys_bpf+0x94f/0x1a90
      [   13.027389]        do_syscall_64+0x4a/0x180
      [   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.028171]
      [   13.028171] other info that might help us debug this:
      [   13.028171]
      [   13.028807] Chain exists of:
      [   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
      [   13.028807]
      [   13.029666]  Possible unsafe locking scenario:
      [   13.029666]
      [   13.030140]        CPU0                    CPU1
      [   13.030510]        ----                    ----
      [   13.030875]   lock(bpf_event_mutex);
      [   13.031166]                                lock(&cpuctx_mutex);
      [   13.031645]                                lock(bpf_event_mutex);
      [   13.032135]   lock(tracepoints_mutex);
      [   13.032441]
      [   13.032441]  *** DEADLOCK ***
      [   13.032441]
      [   13.032911] 1 lock held by test_progs/246:
      [   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.033909]
      [   13.033909] stack backtrace:
      [   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
      [   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   13.035657] Call Trace:
      [   13.035859]  dump_stack+0x5f/0x8b
      [   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
      [   13.036526]  __lock_acquire+0x1158/0x1350
      [   13.036852]  ? lock_acquire+0x98/0x190
      [   13.037154]  lock_acquire+0x98/0x190
      [   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.037876]  __mutex_lock+0x86/0x970
      [   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.039028]  ? __mutex_lock+0x86/0x970
      [   13.039337]  ? __mutex_lock+0x24a/0x970
      [   13.039649]  ? bpf_probe_register+0x1d/0x60
      [   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
      [   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
      [   13.041325]  bpf_probe_register+0x40/0x60
      [   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.042068]  ? __might_fault+0x3e/0x90
      [   13.042374]  __do_sys_bpf+0x94f/0x1a90
      [   13.042678]  do_syscall_64+0x4a/0x180
      [   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.043382] RIP: 0033:0x7f23b10a07f9
      [   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
      [   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
      [   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
      [   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      [   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
      
      Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister(),
      there is no need to take bpf_event_mutex too.
      bpf_event_mutex protects modifications to the prog array used by kprobe/perf
      bpf progs; bpf_raw_tracepoints don't need to take this mutex.
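      
      A minimal before/after sketch of the register path (unregister is
      symmetric; simplified):
      
        int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
        {
        -       int err;
        -
        -       mutex_lock(&bpf_event_mutex);
        -       err = __bpf_probe_register(btp, prog);
        -       mutex_unlock(&bpf_event_mutex);
        -       return err;
        +       /* tracepoints_mutex inside tracepoint_probe_register() suffices */
        +       return __bpf_probe_register(btp, prog);
        }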
      
      Fixes: c4f6699d ("bpf: introduce BPF_RAW_TRACEPOINT")
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d1c1cb28
    • M
      bpf: support raw tracepoints in modules · 00d0f1b1
      Matt Mullins 提交于
      mainline inclusion
      from mainline-5.0
      commit a38d1107f937
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Distributions build drivers as modules, including network and filesystem
      drivers which export numerous tracepoints.  This enables
      bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
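      
      A minimal sketch of the resulting lookup order, simplified from the
      change (module walking and reference counting are condensed into the
      helper):
      
        struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name)
        {
                struct bpf_raw_event_map *btp = __start__bpf_raw_tp;
      
                /* built-in tracepoints first */
                for (; btp < __stop__bpf_raw_tp; btp++) {
                        if (!strcmp(btp->tp->name, name))
                                return btp;
                }
      
                /* then the __bpf_raw_tp sections of loaded modules */
                return bpf_get_raw_tracepoint_module(name);
        }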
      Signed-off-by: NMatt Mullins <mmullins@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      00d0f1b1
    • Z
      timekeeping: Fix ktime_add overflow in tk_set_wall_to_mono · 56ef5ae6
      Zhang Xiaoxu 提交于
      euler inclusion
      category: bugfix
      Bugzilla: 5380
      CVE: NA
      ----------------------------------------
      
      Syzkaller reported a UBSAN bug:
      UBSAN: Undefined behaviour in kernel/time/timekeeping.c:98:17
      signed integer overflow:
      8589935550743139462 + 2147483647000000000 cannot be represented
      in type 'long long int'
      
      Use ktime_add_safe() instead of ktime_add() in tk_set_wall_to_mono().
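      
      A minimal sketch of the change, assuming the offending addition is the
      offs_tai computation at the end of tk_set_wall_to_mono():
      
        -       tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tk->tai_offset, 0));
        +       tk->offs_tai = ktime_add_safe(tk->offs_real, ktime_set(tk->tai_offset, 0));
      
      ktime_add_safe() saturates the result instead of letting the signed
      addition wrap.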
      Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      56ef5ae6
    • Z
      tracing: fix incorrect tracer freeing when opening tracing pipe · 2a2e4325
      zhangyi (F) 提交于
      euler inclusion
      category: bugfix
      bugzilla: 9292
      CVE: NA
      ---------------------------
      
      Commit d716ff71 ("tracing: Remove taking of trace_types_lock in
      pipe files") uses the current tracer instead of a copy in
      tracing_open_pipe(), but forgot to remove the freeing statement in
      the error path.
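      
      A minimal sketch of the fix in tracing_open_pipe()'s error path,
      assuming the usual fail-label layout; iter->trace now points at the
      live tracer, so freeing it is wrong:
      
        fail:
        -       kfree(iter->trace);
                kfree(iter);
                __trace_array_put(tr);
                mutex_unlock(&trace_types_lock);
                return ret;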
      
      Fixes: d716ff71 ("tracing: Remove taking of trace_types_lock in pipe files")
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: NLi Bin <huawei.libin@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2a2e4325
    • S
      printk: move printk_safe macros to printk header · 21c0c13e
      Sergey Senozhatsky 提交于
      euler inclusion
      category: bugfix
      bugzilla: 9509
      CVE: NA
      -------------------------------------------------
      
      Make printk_safe_enter_irqsave()/etc macros available to the
      rest of the kernel.
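      
      A hedged usage sketch: with the macros visible through the printk
      header, a caller that must avoid recursing into printk's own locks
      can wrap its output like this:
      
        unsigned long flags;
      
        printk_safe_enter_irqsave(flags);
        pr_err("dumping state from a printk-unsafe context\n");
        printk_safe_exit_irqrestore(flags);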
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NHongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      21c0c13e