提交 · 2279f540ea7d05f22d2f0c4224319330228586bc · openeuler / Kernel

17 11月, 2020 2 次提交

sched/deadline: Fix priority inheritance with multiple scheduling classes · 2279f540

由 Juri Lelli 提交于 11月 17, 2020

Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]—

He also provided a simple reproducer creating the situation below:

 So the execution order of locking steps are the following
 (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
 are mutexes that are enabled * with priority inheritance.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |             Lock(M2)           |
 |               |                |
 |               |              Lock(M2)
 |               |                |
 |             Lock(M1)           |
 |             (!!bug triggered!) |

Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).

Problem is that boosted entities (Priority Inheritance) use static
DEADLINE parameters of the top priority waiter. However, there might be
cases where top waiter could be a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, top waiter static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().

Fix this by keeping track of the original donor and using its parameters
when a task is boosted.
Reported-by: NGlenn Elliott <glenn@aurora.tech>
Reported-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: NJuri Lelli <juri.lelli@redhat.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com

2279f540

sched: Fix data-race in wakeup · f97bb527

由 Peter Zijlstra 提交于 11月 17, 2020

Mel reported that on some ARM64 platforms loadavg goes bananas and
Will tracked it down to the following race:

  CPU0					CPU1

  schedule()
    prev->sched_contributes_to_load = X;
    deactivate_task(prev);

					try_to_wake_up()
					  if (p->on_rq &&) // false
					  if (smp_load_acquire(&p->on_cpu) && // true
					      ttwu_queue_wakelist())
					        p->sched_remote_wakeup = Y;

    smp_store_release(prev->on_cpu, 0);

where both p->sched_contributes_to_load and p->sched_remote_wakeup are
in the same word, and thus the stores X and Y race (and can clobber
one another's data).

Whereas prior to commit c6e7bd7a ("sched/core: Optimize ttwu()
spinning on p->on_cpu") the p->on_cpu handoff serialized access to
p->sched_remote_wakeup (just as it still does with
p->sched_contributes_to_load) that commit broke that by calling
ttwu_queue_wakelist() with p->on_cpu != 0.

However, due to

  p->XXX = X			ttwu()
  schedule()			  if (p->on_rq && ...) // false
    smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
    deactivate_task()		      ttwu_queue_wakelist())
      p->on_rq = 0;		        p->sched_remote_wakeup = Y;

We can be sure any 'current' store is complete and 'current' is
guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
the current flags word.

Note: while the observed failure was loadavg accounting gone wrong due
to ttwu() cobbering p->sched_contributes_to_load, the reverse problem
is also possible where schedule() clobbers p->sched_remote_wakeup,
this could result in enqueue_entity() wrecking ->vruntime and causing
scheduling artifacts.

Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: NMel Gorman <mgorman@techsingularity.net>
Debugged-by: NWill Deacon <will@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net

f97bb527

12 10月, 2020 2 次提交

mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged · 4aab2be0

由 Vijay Balakrishna 提交于 10月 10, 2020

When memory is hotplug added or removed the min_free_kbytes should be
recalculated based on what is expected by khugepaged.  Currently after
hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost.

This change restores min_free_kbytes as expected for THP consumers.

[vijayb@linux.microsoft.com: v5]
  Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com

Fixes: f000565a ("thp: set recommended min free kbytes")
Signed-off-by: NVijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Allen Pais <apais@microsoft.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4aab2be0

mm: validate inode in mapping_set_error() · 8b7b2eb1

由 Minchan Kim 提交于 10月 10, 2020

The swap address_space doesn't have host. Thus, it makes kernel crash once
swap write meets error. Fix it.

Fixes: 735e4ae5 ("vfs: track per-sb writeback errors and report them to syncfs")
Signed-off-by: NMinchan Kim <minchan@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NJeff Layton <jlayton@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201010000650.750063-1-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b7b2eb1

10 10月, 2020 1 次提交

genirq/irqdomain: Allow partial trimming of irq_data hierarchy · 55567976

由 Marc Zyngier 提交于 10月 06, 2020

It appears that some HW is ugly enough that not all the interrupts
connected to a particular interrupt controller end up with the same
hierarchy depth (some of them are terminated early). This leaves
the irqchip hacker with only two choices, both equally bad:

- create discrete domain chains, one for each "hierarchy depth",
  which is very hard to maintain

- create fake hierarchy levels for the shallow paths, leading
  to all kind of problems (what are the safe hwirq values for these
  fake levels?)

Implement the ability to cut short a single interrupt hierarchy
from a level marked as being disconnected by using the new
irq_domain_disconnect_hierarchy() helper.

The irqdomain allocation code will then perform the trimming
Signed-off-by: NMarc Zyngier <maz@kernel.org>

55567976

09 10月, 2020 3 次提交

lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables" · baffd723

由 Peter Zijlstra 提交于 10月 05, 2020

The thinking in commit:

  fddf9055 ("lockdep: Use raw_cpu_*() for per-cpu variables")

is flawed. While it is true that when we're migratable both CPUs will
have a 0 value, it doesn't hold that when we do get migrated in the
middle of a raw_cpu_op(), the old CPU will still have 0 by the time we
get around to reading it on the new CPU.

Luckily, the reason for that commit (s390 using preempt_disable()
instead of preempt_disable_notrace() in their percpu code), has since
been fixed by commit:

  1196f12a ("s390: don't trace preemption in percpu macros")

An audit of arch/*/include/asm/percpu*.h shows there are no other
architectures affected by this particular issue.

Fixes: fddf9055 ("lockdep: Use raw_cpu_*() for per-cpu variables")
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NIngo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201005095958.GJ2651@hirez.programming.kicks-ass.net

baffd723

lockdep: Fix lockdep recursion · 4d004099

由 Peter Zijlstra 提交于 10月 02, 2020

Steve reported that lockdep_assert*irq*(), when nested inside lockdep
itself, will trigger a false-positive.

One example is the stack-trace code, as called from inside lockdep,
triggering tracing, which in turn calls RCU, which then uses
lockdep_assert_irqs_disabled().

Fixes: a21ee605 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")
Reported-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

4d004099

lockdep: Fix usage_traceoverflow · 2bb8945b

由 Peter Zijlstra 提交于 9月 30, 2020

Basically print_lock_class_header()'s for loop is out of sync with the
the size of of ->usage_traces[].

Also clean things up a bit while at it, to avoid such mishaps in the future.

Fixes: 23870f12 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions")
Reported-by: NQian Cai <cai@redhat.com>
Debugged-by: NBoqun Feng <boqun.feng@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NIngo Molnar <mingo@kernel.org>
Tested-by: NQian Cai <cai@redhat.com>
Link: https://lkml.kernel.org/r/20200930094937.GE2651@hirez.programming.kicks-ass.net

2bb8945b

08 10月, 2020 1 次提交

locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc · 24a18772

由 Sebastian Andrzej Siewior 提交于 9月 24, 2020

ctags creates a warning:
|ctags: Warning: include/linux/seqlock.h:738: null expansion of name pattern "\2"

The DEFINE_SEQLOCK() macro is passed to ctags and being told to expect
an argument.

Add a dummy argument to keep ctags quiet.
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NWill Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200924154851.skmswuyj322yuz4g@linutronix.de

24a18772

07 10月, 2020 2 次提交

x86/mce: Recover from poison found while copying from user space · c0ab7ffc

由 Tony Luck 提交于 10月 06, 2020

Existing kernel code can only recover from a machine check on code that
is tagged in the exception table with a fault handling recovery path.

Add two new fields in the task structure to pass information from
machine check handler to the "task_work" that is queued to run before
the task returns to user mode:

+ mce_vaddr: will be initialized to the user virtual address of the fault
  in the case where the fault occurred in the kernel copying data from
  a user address.  This is so that kill_me_maybe() can provide that
  information to the user SIGBUS handler.

+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
  to determine if mce_vaddr is applicable to this error.

Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the task.

Use a new helper function to share common code between the "fault
in user mode" case and the "fault while copying from user" case.

New code paths will be activated by the next patch which sets
MCE_IN_KERNEL_COPYIN.
Suggested-by: NBorislav Petkov <bp@alien8.de>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com

c0ab7ffc

arm/arm64: xen: Fix to convert percpu address to gfn correctly · 5a067711

由 Masami Hiramatsu 提交于 10月 06, 2020

Use per_cpu_ptr_to_phys() instead of virt_to_phys() for per-cpu
address conversion.

In xen_starting_cpu(), per-cpu xen_vcpu_info address is converted
to gfn by virt_to_gfn() macro. However, since the virt_to_gfn(v)
assumes the given virtual address is in linear mapped kernel memory
area, it can not convert the per-cpu memory if it is allocated on
vmalloc area.

This depends on CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK.
If it is enabled, the first chunk of percpu memory is linear mapped.
In the other case, that is allocated from vmalloc area. Moreover,
if the first chunk of percpu has run out until allocating
xen_vcpu_info, it will be allocated on the 2nd chunk, which is
based on kernel memory or vmalloc memory (depends on
CONFIG_NEED_PER_CPU_KM).

Without this fix and kernel configured to use vmalloc area for
the percpu memory, the Dom0 kernel will fail to boot with following
errors.

[    0.466172] Xen: initializing cpu0
[    0.469601] ------------[ cut here ]------------
[    0.474295] WARNING: CPU: 0 PID: 1 at arch/arm64/xen/../../arm/xen/enlighten.c:153 xen_starting_cpu+0x160/0x180
[    0.484435] Modules linked in:
[    0.487565] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #4
[    0.493895] Hardware name: Socionext Developer Box (DT)
[    0.499194] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[    0.504836] pc : xen_starting_cpu+0x160/0x180
[    0.509263] lr : xen_starting_cpu+0xb0/0x180
[    0.513599] sp : ffff8000116cbb60
[    0.516984] x29: ffff8000116cbb60 x28: ffff80000abec000
[    0.522366] x27: 0000000000000000 x26: 0000000000000000
[    0.527754] x25: ffff80001156c000 x24: fffffdffbfcdb600
[    0.533129] x23: 0000000000000000 x22: 0000000000000000
[    0.538511] x21: ffff8000113a99c8 x20: ffff800010fe4f68
[    0.543892] x19: ffff8000113a9988 x18: 0000000000000010
[    0.549274] x17: 0000000094fe0f81 x16: 00000000deadbeef
[    0.554655] x15: ffffffffffffffff x14: 0720072007200720
[    0.560037] x13: 0720072007200720 x12: 0720072007200720
[    0.565418] x11: 0720072007200720 x10: 0720072007200720
[    0.570801] x9 : ffff8000100fbdc0 x8 : ffff800010715208
[    0.576182] x7 : 0000000000000054 x6 : ffff00001b790f00
[    0.581564] x5 : ffff800010bbf880 x4 : 0000000000000000
[    0.586945] x3 : 0000000000000000 x2 : ffff80000abec000
[    0.592327] x1 : 000000000000002f x0 : 0000800000000000
[    0.597716] Call trace:
[    0.600232]  xen_starting_cpu+0x160/0x180
[    0.604309]  cpuhp_invoke_callback+0xac/0x640
[    0.608736]  cpuhp_issue_call+0xf4/0x150
[    0.612728]  __cpuhp_setup_state_cpuslocked+0x128/0x2c8
[    0.618030]  __cpuhp_setup_state+0x84/0xf8
[    0.622192]  xen_guest_init+0x324/0x364
[    0.626097]  do_one_initcall+0x54/0x250
[    0.630003]  kernel_init_freeable+0x12c/0x2c8
[    0.634428]  kernel_init+0x1c/0x128
[    0.637988]  ret_from_fork+0x10/0x18
[    0.641635] ---[ end trace d95b5309a33f8b27 ]---
[    0.646337] ------------[ cut here ]------------
[    0.651005] kernel BUG at arch/arm64/xen/../../arm/xen/enlighten.c:158!
[    0.657697] Internal error: Oops - BUG: 0 [#1] SMP
[    0.662548] Modules linked in:
[    0.665676] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W         5.9.0-rc4+ #4
[    0.673398] Hardware name: Socionext Developer Box (DT)
[    0.678695] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--)
[    0.684338] pc : xen_starting_cpu+0x178/0x180
[    0.688765] lr : xen_starting_cpu+0x144/0x180
[    0.693188] sp : ffff8000116cbb60
[    0.696573] x29: ffff8000116cbb60 x28: ffff80000abec000
[    0.701955] x27: 0000000000000000 x26: 0000000000000000
[    0.707344] x25: ffff80001156c000 x24: fffffdffbfcdb600
[    0.712718] x23: 0000000000000000 x22: 0000000000000000
[    0.718107] x21: ffff8000113a99c8 x20: ffff800010fe4f68
[    0.723481] x19: ffff8000113a9988 x18: 0000000000000010
[    0.728863] x17: 0000000094fe0f81 x16: 00000000deadbeef
[    0.734245] x15: ffffffffffffffff x14: 0720072007200720
[    0.739626] x13: 0720072007200720 x12: 0720072007200720
[    0.745008] x11: 0720072007200720 x10: 0720072007200720
[    0.750390] x9 : ffff8000100fbdc0 x8 : ffff800010715208
[    0.755771] x7 : 0000000000000054 x6 : ffff00001b790f00
[    0.761153] x5 : ffff800010bbf880 x4 : 0000000000000000
[    0.766534] x3 : 0000000000000000 x2 : 00000000deadbeef
[    0.771916] x1 : 00000000deadbeef x0 : ffffffffffffffea
[    0.777304] Call trace:
[    0.779819]  xen_starting_cpu+0x178/0x180
[    0.783898]  cpuhp_invoke_callback+0xac/0x640
[    0.788325]  cpuhp_issue_call+0xf4/0x150
[    0.792317]  __cpuhp_setup_state_cpuslocked+0x128/0x2c8
[    0.797619]  __cpuhp_setup_state+0x84/0xf8
[    0.801779]  xen_guest_init+0x324/0x364
[    0.805683]  do_one_initcall+0x54/0x250
[    0.809590]  kernel_init_freeable+0x12c/0x2c8
[    0.814016]  kernel_init+0x1c/0x128
[    0.817583]  ret_from_fork+0x10/0x18
[    0.821226] Code: d0006980 f9427c00 cb000300 17ffffea (d4210000)
[    0.827415] ---[ end trace d95b5309a33f8b28 ]---
[    0.832076] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.839815] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: NStefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/160196697165.60224.17470743378683334995.stgit@devnote2Signed-off-by: NJuergen Gross <jgross@suse.com>

5a067711

06 10月, 2020 2 次提交

genirq/PM: Introduce IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND flag · 90428a8e

由 Maulik Shah 提交于 9月 28, 2020

An interrupt that is disabled/masked but set for wakeup may still need to
be able to wake up the system from sleep states like "suspend to RAM".

To that effect, introduce the IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND flag.
If the irqchip have this flag set, the irq PM code will enable/unmask
the irqs that are marked for wakeup, but that are in a disabled state.

On resume, such irqs will be restored back to their disabled state.
Suggested-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NMaulik Shah <mkshah@codeaurora.org>
[maz: commit message fix-up]
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Tested-by: NStephen Boyd <swboyd@chromium.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NDouglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/1601267524-20199-4-git-send-email-mkshah@codeaurora.org

90428a8e

x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() · ec6347bb

由 Dan Williams 提交于 10月 05, 2020

In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.

Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:

  On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
  >
  > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
  > >
  > > However now I see that copy_user_generic() works for the wrong reason.
  > > It works because the exception on the source address due to poison
  > > looks no different than a write fault on the user address to the
  > > caller, it's still just a short copy. So it makes copy_to_user() work
  > > for the wrong reason relative to the name.
  >
  > Right.
  >
  > And it won't work that way on other architectures. On x86, we have a
  > generic function that can take faults on either side, and we use it
  > for both cases (and for the "in_user" case too), but that's an
  > artifact of the architecture oddity.
  >
  > In fact, it's probably wrong even on x86 - because it can hide bugs -
  > but writing those things is painful enough that everybody prefers
  > having just one function.

Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().

Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.

One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.

 [ bp: Massage a bit. ]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com

ec6347bb

05 10月, 2020 1 次提交

rxrpc: Fix accept on a connection that need securing · 2d914c1b

由 David Howells 提交于 9月 30, 2020

When a new incoming call arrives at an userspace rxrpc socket on a new
connection that has a security class set, the code currently pushes it onto
the accept queue to hold a ref on it for the socket.  This doesn't work,
however, as recvmsg() pops it off, notices that it's in the SERVER_SECURING
state and discards the ref.  This means that the call runs out of refs too
early and the kernel oopses.

By contrast, a kernel rxrpc socket manually pre-charges the incoming call
pool with calls that already have user call IDs assigned, so they are ref'd
by the call tree on the socket.

Change the mode of operation for userspace rxrpc server sockets to work
like this too.  Although this is a UAPI change, server sockets aren't
currently functional.

Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: NDavid Howells <dhowells@redhat.com>

2d914c1b

03 10月, 2020 8 次提交

sched/debug: Add new tracepoint to track cpu_capacity · 51cf18c9

由 Vincent Donnefort 提交于 8月 28, 2020

rq->cpu_capacity is a key element in several scheduler parts, such as EAS
task placement and load balancing. Tracking this value enables testing
and/or debugging by a toolkit.
Signed-off-by: NVincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com

51cf18c9

net: introduce helper sendpage_ok() in include/linux/net.h · c381b079

由 Coly Li 提交于 10月 02, 2020

The original problem was from nvme-over-tcp code, who mistakenly uses
kernel_sendpage() to send pages allocated by __get_free_pages() without
__GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
tail pages, sending them by kernel_sendpage() may trigger a kernel panic
from a corrupted kernel heap, because these pages are incorrectly freed
in network stack as page_count 0 pages.

This patch introduces a helper sendpage_ok(), it returns true if the
checking page,
- is not slab page: PageSlab(page) is false.
- has page refcount: page_count(page) is not zero

All drivers who want to send page to remote end by kernel_sendpage()
may use this helper to check whether the page is OK. If the helper does
not return true, the driver should try other non sendpage method (e.g.
sock_no_sendpage()) to handle the page.
Signed-off-by: NColy Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c381b079

net: core: document two new elements of struct net_device · a93bdcb9

由 Mauro Carvalho Chehab 提交于 10月 02, 2020

As warned by "make htmldocs", there are two new struct elements
that aren't documented:

	../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device'
	../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device'

Fixes: 1fc70edb ("net: core: add nested_level variable in net_device")
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a93bdcb9

netlink: fix policy dump leak · a95bc734

由 Johannes Berg 提交于 10月 02, 2020

If userspace doesn't complete the policy dump, we leak the
allocated state. Fix this.

Fixes: d07dcf9a ("netlink: add infrastructure to expose policies to userspace")
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Reviewed-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a95bc734

static_call: Fix return type of static_call_init · 69e0ad37

由 Nathan Chancellor 提交于 9月 28, 2020

Functions that are passed to early_initcall should be of type
initcall_t, which expects a return type of int. This is not currently an
error but a patch in the Clang LTO series could change that in the
future.

Fixes: 9183c3f9 ("static_call: Add inline static call infrastructure")
Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/lkml/20200903203053.3411268-17-samitolvanen@google.com/

69e0ad37

net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible · b898ce7b

由 Saeed Mahameed 提交于 9月 11, 2020

In case of pci is offline reclaim_pages_cmd() will still try to call
the FW to release FW pages, cmd_exec() in this case will return a silent
success without actually calling the FW.

This is wrong and will cause page leaks, what we should do is to detect
pci offline or command interface un-available before tying to access the
FW and manually release the FW pages in the driver.

In this patch we share the code to check for FW command interface
availability and we call it in sensitive places e.g. reclaim_pages_cmd().

Alternative fix:
 1. Remove MLX5_CMD_OP_MANAGE_PAGES form mlx5_internal_err_ret_value,
    command success simulation list.
 2. Always Release FW pages even if cmd_exec fails in reclaim_pages_cmd().
Reviewed-by: NMoshe Shemesh <moshe@nvidia.com>
Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>

b898ce7b

net/mlx5: Avoid possible free of command entry while timeout comp handler · 50b2412b

由 Eran Ben Elisha 提交于 8月 04, 2020

Upon command completion timeout, driver simulates a forced command
completion. In a rare case where real interrupt for that command arrives
simultaneously, it might release the command entry while the forced
handler might still access it.

Fix that by adding an entry refcount, to track current amount of allowed
handlers. Command entry to be released only when this refcount is
decremented to zero.

Command refcount is always initialized to one. For callback commands,
command completion handler is the symmetric flow to decrement it. For
non-callback commands, it is wait_func().

Before ringing the doorbell, increment the refcount for the real completion
handler. Once the real completion handler is called, it will decrement it.

For callback commands, once the delayed work is scheduled, increment the
refcount. Upon callback command completion handler, we will try to cancel
the timeout callback. In case of success, we need to decrement the callback
refcount as it will never run.

In addition, gather the entry index free and the entry free into a one
flow for all command types release.

Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: NMoshe Shemesh <moshe@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>

50b2412b

mm: memcg/slab: fix slab statistics in !SMP configuration · be458311

由 Roman Gushchin 提交于 10月 01, 2020

Since commit ea426c2a ("mm: memcg: prepare for byte-sized vmstat
items") the write side of slab counters accepts a value in bytes and
converts it to pages.  It happens in __mod_node_page_state().

However a non-SMP version of __mod_node_page_state() doesn't perform
this conversion.  It leads to incorrect (unrealistically high) slab
counters values.  Fix this by adding a similar conversion to the non-SMP
version of __mod_node_page_state().
Signed-off-by: NRoman Gushchin <guro@fb.com>
Reported-and-tested-by: NBastian Bittorf <bb@npl.de>
Fixes: ea426c2a ("mm: memcg: prepare for byte-sized vmstat items")
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be458311

02 10月, 2020 1 次提交

pipe: remove pipe_wait() and fix wakeup race with splice · 472e5b05

由 Linus Torvalds 提交于 10月 01, 2020

The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.

The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].

[ more details about pipe_wait() scenario ]

The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/Reported-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

472e5b05

01 10月, 2020 3 次提交

debugobjects: Free per CPU pool after CPU unplug · 88451f2c

由 Zqiang 提交于 9月 08, 2020

If a CPU is offlined the debug objects per CPU pool is not cleaned up. If
the CPU is never onlined again then the objects in the pool are wasted.

Add a CPU hotplug callback which is invoked after the CPU is dead to free
the pool.

[ tglx: Massaged changelog and added comment about remote access safety ]
Signed-off-by: NZqiang <qiang.zhang@windriver.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20200908062709.11441-1-qiang.zhang@windriver.com

88451f2c

pipe: Fix memory leaks in create_pipe_files() · 8a018eb5

由 Qian Cai 提交于 10月 01, 2020

        Calling pipe2() with O_NOTIFICATION_PIPE could results in memory
leaks unless watch_queue_init() is successful.

        In case of watch_queue_init() failure in pipe2() we are left
with inode and pipe_inode_info instances that need to be freed.  That
failure exit has been introduced in commit c73be61c ("pipe: Add
general notification queue support") and its handling should've been
identical to nearby treatment of alloc_file_pseudo() failures - it
is dealing with the same situation.  As it is, the mainline kernel
leaks in that case.

        Another problem is that CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE
cases are treated differently (and the former leaks just pipe_inode_info,
the latter - both pipe_inode_info and inode).

        Fixed by providing a dummy wacth_queue_init() in !CONFIG_WATCH_QUEUE
case and by having failures of wacth_queue_init() handled the same way
we handle alloc_file_pseudo() ones.

Fixes: c73be61c ("pipe: Add general notification queue support")
Signed-off-by: NQian Cai <cai@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8a018eb5

arm64: permit ACPI core to map kernel memory used for table overrides · a509a66a

由 Ard Biesheuvel 提交于 9月 29, 2020

Jonathan reports that the strict policy for memory mapped by the
ACPI core breaks the use case of passing ACPI table overrides via
initramfs. This is due to the fact that the memory type used for
loading the initramfs in memory is not recognized as a memory type
that is typically used by firmware to pass firmware tables.

Since the purpose of the strict policy is to ensure that no AML or
other ACPI code can manipulate any memory that is used by the kernel
to keep its internal state or the state of user tasks, we can relax
the permission check, and allow mappings of memory that is reserved
and marked as NOMAP via memblock, and therefore not covered by the
linear mapping to begin with.

Fixes: 1583052d ("arm64/acpi: disallow AML memory opregions to access kernel memory")
Fixes: 325f5585 ("arm64/acpi: disallow writeable AML opregion mapping for EFI code regions")
Reported-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
Tested-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Link: https://lore.kernel.org/r/20200929132522.18067-1-ardb@kernel.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

a509a66a

30 9月, 2020 5 次提交

drm: drm_dsc.h: fix a kernel-doc markup · 27204b99

由 Mauro Carvalho Chehab 提交于 9月 30, 2020

As warned by Sphinx:

	./Documentation/gpu/drm-kms-helpers:305: ./include/drm/drm_dsc.h:587: WARNING: Unparseable C cross-reference: 'struct'
	Invalid C declaration: Expected identifier in nested name, got keyword: struct [error at 6]
	  struct
	  ------^

The markup for one struct is wrong, as struct is used twice.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/3d467022325e15bba8dcb13da8fb730099303266.1601467849.git.mchehab+huawei@kernel.org

27204b99

Partially revert "video: fbdev: amba-clcd: Retire elder CLCD driver" · 112c3523

由 Peter Collingbourne 提交于 9月 29, 2020

Also partially revert the follow-up change "drm: pl111: Absorb the
external register header".

This reverts the parts of commits
7e4e589d and
0fb81256 that touch paths outside
of drivers/gpu/drm/pl111.

The fbdev driver is used by Android's FVP configuration. Using the
DRM driver together with DRM's fbdev emulation results in a failure
to boot Android. The root cause is that Android's generic fbdev
userspace driver relies on the ability to set the pixel format via
FBIOPUT_VSCREENINFO, which is not supported by fbdev emulation.

There have been other less critical behavioral differences identified
between the fbdev driver and the DRM driver with fbdev emulation. The
DRM driver exposes different values for the panel's width, height and
refresh rate, and the DRM driver fails a FBIOPUT_VSCREENINFO syscall
with yres_virtual greater than the maximum supported value instead
of letting the syscall succeed and setting yres_virtual based on yres.
Signed-off-by: NPeter Collingbourne <pcc@google.com>
Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20200929195344.2219796-1-pcc@google.com

112c3523

efi: efivars: un-export efivars_sysfs_init() · 5d3c8617

由 Ard Biesheuvel 提交于 9月 23, 2020

efivars_sysfs_init() is only used locally in the source file that
defines it, so make it static and unexport it.
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>

5d3c8617

efi: pstore: move workqueue handling out of efivars · c9b51a2d

由 Ard Biesheuvel 提交于 9月 23, 2020

The worker thread that gets kicked off to sync the state of the
EFI variable list is only used by the EFI pstore implementation,
and is defined in its source file. So let's move its scheduling
there as well. Since our efivar_init() scan will bail on duplicate
entries, there is no need to disable the workqueue like we did
before, so we can run it unconditionally.
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>

c9b51a2d

efi: pstore: disentangle from deprecated efivars module · 232f4eb6

由 Ard Biesheuvel 提交于 9月 23, 2020

The EFI pstore implementation relies on the 'efivars' abstraction,
which encapsulates the EFI variable store in a way that can be
overridden by other backing stores, like the Google SMI one.

On top of that, the EFI pstore implementation also relies on the
efivars.ko module, which is a separate layer built on top of the
'efivars' abstraction that exposes the [deprecated] sysfs entries
for each variable that exists in the backing store.

Since the efivars.ko module is deprecated, and all users appear to
have moved to the efivarfs file system instead, let's prepare for
its removal, by removing EFI pstore's dependency on it.
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>

232f4eb6

29 9月, 2020 4 次提交

genetlink: add missing kdoc for validation flags · 3ddf9b43

由 Jakub Kicinski 提交于 9月 28, 2020

Validation flags are missing kdoc, add it.

Fixes: ef6243ac ("genetlink: optionally validate strictly/dumps")
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3ddf9b43

net: core: add nested_level variable in net_device · 1fc70edb

由 Taehee Yoo 提交于 9月 25, 2020

This patch is to add a new variable 'nested_level' into the net_device
structure.
This variable will be used as a parameter of spin_lock_nested() of
dev->addr_list_lock.

netif_addr_lock() can be called recursively so spin_lock_nested() is
used instead of spin_lock() and dev->lower_level is used as a parameter
of spin_lock_nested().
But, dev->lower_level value can be updated while it is being used.
So, lockdep would warn a possible deadlock scenario.

When a stacked interface is deleted, netif_{uc | mc}_sync() is
called recursively.
So, spin_lock_nested() is called recursively too.
At this moment, the dev->lower_level variable is used as a parameter of it.
dev->lower_level value is updated when interfaces are being unlinked/linked
immediately.
Thus, After unlinking, dev->lower_level shouldn't be a parameter of
spin_lock_nested().

    A (macvlan)
    |
    B (vlan)
    |
    C (bridge)
    |
    D (macvlan)
    |
    E (vlan)
    |
    F (bridge)

    A->lower_level : 6
    B->lower_level : 5
    C->lower_level : 4
    D->lower_level : 3
    E->lower_level : 2
    F->lower_level : 1

When an interface 'A' is removed, it releases resources.
At this moment, netif_addr_lock() would be called.
Then, netdev_upper_dev_unlink() is called recursively.
Then dev->lower_level is updated.
There is no problem.

But, when the bridge module is removed, 'C' and 'F' interfaces
are removed at once.
If 'F' is removed first, a lower_level value is like below.
    A->lower_level : 5
    B->lower_level : 4
    C->lower_level : 3
    D->lower_level : 2
    E->lower_level : 1
    F->lower_level : 1

Then, 'C' is removed. at this moment, netif_addr_lock() is called
recursively.
The ordering is like this.
C(3)->D(2)->E(1)->F(1)
At this moment, the lower_level value of 'E' and 'F' are the same.
So, lockdep warns a possible deadlock scenario.

In order to avoid this problem, a new variable 'nested_level' is added.
This value is the same as dev->lower_level - 1.
But this value is updated in rtnl_unlock().
So, this variable can be used as a parameter of spin_lock_nested() safely
in the rtnl context.

Test commands:
   ip link add br0 type bridge vlan_filtering 1
   ip link add vlan1 link br0 type vlan id 10
   ip link add macvlan2 link vlan1 type macvlan
   ip link add br3 type bridge vlan_filtering 1
   ip link set macvlan2 master br3
   ip link add vlan4 link br3 type vlan id 10
   ip link add macvlan5 link vlan4 type macvlan
   ip link add br6 type bridge vlan_filtering 1
   ip link set macvlan5 master br6
   ip link add vlan7 link br6 type vlan id 10
   ip link add macvlan8 link vlan7 type macvlan

   ip link set br0 up
   ip link set vlan1 up
   ip link set macvlan2 up
   ip link set br3 up
   ip link set vlan4 up
   ip link set macvlan5 up
   ip link set br6 up
   ip link set vlan7 up
   ip link set macvlan8 up
   modprobe -rv bridge

Splat looks like:
[   36.057436][  T744] WARNING: possible recursive locking detected
[   36.058848][  T744] 5.9.0-rc6+ #728 Not tainted
[   36.059959][  T744] --------------------------------------------
[   36.061391][  T744] ip/744 is trying to acquire lock:
[   36.062590][  T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30
[   36.064922][  T744]
[   36.064922][  T744] but task is already holding lock:
[   36.066626][  T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
[   36.068851][  T744]
[   36.068851][  T744] other info that might help us debug this:
[   36.070731][  T744]  Possible unsafe locking scenario:
[   36.070731][  T744]
[   36.072497][  T744]        CPU0
[   36.073238][  T744]        ----
[   36.074007][  T744]   lock(&vlan_netdev_addr_lock_key);
[   36.075290][  T744]   lock(&vlan_netdev_addr_lock_key);
[   36.076590][  T744]
[   36.076590][  T744]  *** DEADLOCK ***
[   36.076590][  T744]
[   36.078515][  T744]  May be due to missing lock nesting notation
[   36.078515][  T744]
[   36.080491][  T744] 3 locks held by ip/744:
[   36.081471][  T744]  #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490
[   36.083614][  T744]  #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
[   36.085942][  T744]  #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80
[   36.088400][  T744]
[   36.088400][  T744] stack backtrace:
[   36.089772][  T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728
[   36.091364][  T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   36.093630][  T744] Call Trace:
[   36.094416][  T744]  dump_stack+0x77/0x9b
[   36.095385][  T744]  __lock_acquire+0xbc3/0x1f40
[   36.096522][  T744]  lock_acquire+0xb4/0x3b0
[   36.097540][  T744]  ? dev_set_rx_mode+0x19/0x30
[   36.098657][  T744]  ? rtmsg_ifinfo+0x1f/0x30
[   36.099711][  T744]  ? __dev_notify_flags+0xa5/0xf0
[   36.100874][  T744]  ? rtnl_is_locked+0x11/0x20
[   36.101967][  T744]  ? __dev_set_promiscuity+0x7b/0x1a0
[   36.103230][  T744]  _raw_spin_lock_bh+0x38/0x70
[   36.104348][  T744]  ? dev_set_rx_mode+0x19/0x30
[   36.105461][  T744]  dev_set_rx_mode+0x19/0x30
[   36.106532][  T744]  dev_set_promiscuity+0x36/0x50
[   36.107692][  T744]  __dev_set_promiscuity+0x123/0x1a0
[   36.108929][  T744]  dev_set_promiscuity+0x1e/0x50
[   36.110093][  T744]  br_port_set_promisc+0x1f/0x40 [bridge]
[   36.111415][  T744]  br_manage_promisc+0x8b/0xe0 [bridge]
[   36.112728][  T744]  __dev_set_promiscuity+0x123/0x1a0
[   36.113967][  T744]  ? __hw_addr_sync_one+0x23/0x50
[   36.115135][  T744]  __dev_set_rx_mode+0x68/0x90
[   36.116249][  T744]  dev_uc_sync+0x70/0x80
[   36.117244][  T744]  dev_uc_add+0x50/0x60
[   36.118223][  T744]  macvlan_open+0x18e/0x1f0 [macvlan]
[   36.119470][  T744]  __dev_open+0xd6/0x170
[   36.120470][  T744]  __dev_change_flags+0x181/0x1d0
[   36.121644][  T744]  dev_change_flags+0x23/0x60
[   36.122741][  T744]  do_setlink+0x30a/0x11e0
[   36.123778][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.124929][  T744]  ? __nla_validate_parse.part.6+0x45/0x8e0
[   36.126309][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.127457][  T744]  __rtnl_newlink+0x546/0x8e0
[   36.128560][  T744]  ? lock_acquire+0xb4/0x3b0
[   36.129623][  T744]  ? deactivate_slab.isra.85+0x6a1/0x850
[   36.130946][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.132102][  T744]  ? lock_acquire+0xb4/0x3b0
[   36.133176][  T744]  ? is_bpf_text_address+0x5/0xe0
[   36.134364][  T744]  ? rtnl_newlink+0x2e/0x70
[   36.135445][  T744]  ? rcu_read_lock_sched_held+0x32/0x60
[   36.136771][  T744]  ? kmem_cache_alloc_trace+0x2d8/0x380
[   36.138070][  T744]  ? rtnl_newlink+0x2e/0x70
[   36.139164][  T744]  rtnl_newlink+0x47/0x70
[ ... ]

Fixes: 845e0ebb ("net: change addr_list_lock back to static key")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1fc70edb

net: core: introduce struct netdev_nested_priv for nested interface infrastructure · eff74233

由 Taehee Yoo 提交于 9月 25, 2020

Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.

In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eff74233

KVM: arm64: pmu: Make overflow handler NMI safe · 95e92e45

由 Julien Thierry 提交于 9月 24, 2020

kvm_vcpu_kick() is not NMI safe. When the overflow handler is called from
NMI context, defer waking the vcpu to an irq_work queue.

A vcpu can be freed while it's not running by kvm_destroy_vm(). Prevent
running the irq_work for a non-existent vcpu by calling irq_work_sync() on
the PMU destroy path.

[Alexandru E.: Added irq_work_sync()]
Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
Signed-off-by: NAlexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
Cc: kvm@vger.kernel.org
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200924110706.254996-6-alexandru.elisei@arm.comSigned-off-by: NWill Deacon <will@kernel.org>

95e92e45

28 9月, 2020 4 次提交

arm64: perf: Add support caps under sysfs · f5be3a61

由 Shaokun Zhang 提交于 9月 22, 2020

ARMv8.4-PMU introduces the PMMIR_EL1 registers and some new PMU events,
like STALL_SLOT etc, are related to it. Let's add a caps directory to
/sys/bus/event_source/devices/armv8_pmuv3_0/ and support slots from
PMMIR_EL1 registers in this entry. The user programs can get the slots
from sysfs directly.

/sys/bus/event_source/devices/armv8_pmuv3_0/caps/slots is exposed
under sysfs. Both ARMv8.4-PMU and STALL_SLOT event are implemented,
it returns the slots from PMMIR_EL1, otherwise it will return 0.
Signed-off-by: NShaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/1600754025-53535-1-git-send-email-zhangshaokun@hisilicon.comSigned-off-by: NWill Deacon <will@kernel.org>

f5be3a61

memstick: Skip allocating card when removing host · 62c59a87

由 Kai-Heng Feng 提交于 9月 25, 2020

After commit 6827ca57 ("memstick: rtsx_usb_ms: Support runtime power
management"), removing module rtsx_usb_ms will be stuck.

The deadlock is caused by powering on and powering off at the same time,
the former one is when memstick_check() is flushed, and the later is called
by memstick_remove_host().

Soe let's skip allocating card to prevent this issue.

Fixes: 6827ca57 ("memstick: rtsx_usb_ms: Support runtime power management")
Signed-off-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
Link: https://lore.kernel.org/r/20200925084952.13220-1-kai.heng.feng@canonical.com
Cc: stable@vger.kernel.org
Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>

62c59a87

mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3

由 Peter Xu 提交于 9月 25, 2020

This prepares for the future work to trigger early cow on pinned pages
during fork().

No functional change intended.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7a4830c3

mm: Introduce mm_struct.has_pinned · 008cfe44

由 Peter Xu 提交于 9月 25, 2020

(Commit message majorly collected from Jason Gunthorpe)

Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.

Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter.  For now, make it atomic_t but use it as
a boolean for simplicity.
Suggested-by: NJason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

008cfe44

27 9月, 2020 1 次提交

mm: don't rely on system state to detect hot-plug operations · f85086f9

由 Laurent Dufour 提交于 9月 25, 2020

In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state.  In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1].  So checking against the system state is not
enough.

The consequence is that on system with interleaved node's ranges like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done.  At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:

  $ ls -l /sys/devices/system/memory/memory21/node*
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation.  An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

  $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
        -m size=$MEM,slots=255,maxmem=4294967296k  \
        -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
        -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
        -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
        -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
        -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
        -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
        -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
        -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f85086f9

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功