- 17 11月, 2020 2 次提交
-
-
由 Juri Lelli 提交于
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: NGlenn Elliott <glenn@aurora.tech> Reported-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NJuri Lelli <juri.lelli@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
-
由 Peter Zijlstra 提交于
Mel reported that on some ARM64 platforms loadavg goes bananas and Will tracked it down to the following race: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data). Whereas prior to commit c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load) that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0. However, due to p->XXX = X ttwu() schedule() if (p->on_rq && ...) // false smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && deactivate_task() ttwu_queue_wakelist()) p->on_rq = 0; p->sched_remote_wakeup = Y; We can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word. Note: while the observed failure was loadavg accounting gone wrong due to ttwu() cobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup, this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts. Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: NMel Gorman <mgorman@techsingularity.net> Debugged-by: NWill Deacon <will@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
-
- 12 10月, 2020 2 次提交
-
-
由 Vijay Balakrishna 提交于
When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: f000565a ("thp: set recommended min free kbytes") Signed-off-by: NVijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
The swap address_space doesn't have host. Thus, it makes kernel crash once swap write meets error. Fix it. Fixes: 735e4ae5 ("vfs: track per-sb writeback errors and report them to syncfs") Signed-off-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NJeff Layton <jlayton@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Andres Freund <andres@anarazel.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201010000650.750063-1-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 10月, 2020 1 次提交
-
-
由 Marc Zyngier 提交于
It appears that some HW is ugly enough that not all the interrupts connected to a particular interrupt controller end up with the same hierarchy depth (some of them are terminated early). This leaves the irqchip hacker with only two choices, both equally bad: - create discrete domain chains, one for each "hierarchy depth", which is very hard to maintain - create fake hierarchy levels for the shallow paths, leading to all kind of problems (what are the safe hwirq values for these fake levels?) Implement the ability to cut short a single interrupt hierarchy from a level marked as being disconnected by using the new irq_domain_disconnect_hierarchy() helper. The irqdomain allocation code will then perform the trimming Signed-off-by: NMarc Zyngier <maz@kernel.org>
-
- 09 10月, 2020 3 次提交
-
-
由 Peter Zijlstra 提交于
The thinking in commit: fddf9055 ("lockdep: Use raw_cpu_*() for per-cpu variables") is flawed. While it is true that when we're migratable both CPUs will have a 0 value, it doesn't hold that when we do get migrated in the middle of a raw_cpu_op(), the old CPU will still have 0 by the time we get around to reading it on the new CPU. Luckily, the reason for that commit (s390 using preempt_disable() instead of preempt_disable_notrace() in their percpu code), has since been fixed by commit: 1196f12a ("s390: don't trace preemption in percpu macros") An audit of arch/*/include/asm/percpu*.h shows there are no other architectures affected by this particular issue. Fixes: fddf9055 ("lockdep: Use raw_cpu_*() for per-cpu variables") Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201005095958.GJ2651@hirez.programming.kicks-ass.net
-
由 Peter Zijlstra 提交于
Steve reported that lockdep_assert*irq*(), when nested inside lockdep itself, will trigger a false-positive. One example is the stack-trace code, as called from inside lockdep, triggering tracing, which in turn calls RCU, which then uses lockdep_assert_irqs_disabled(). Fixes: a21ee605 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables") Reported-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Basically print_lock_class_header()'s for loop is out of sync with the the size of of ->usage_traces[]. Also clean things up a bit while at it, to avoid such mishaps in the future. Fixes: 23870f12 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions") Reported-by: NQian Cai <cai@redhat.com> Debugged-by: NBoqun Feng <boqun.feng@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Tested-by: NQian Cai <cai@redhat.com> Link: https://lkml.kernel.org/r/20200930094937.GE2651@hirez.programming.kicks-ass.net
-
- 08 10月, 2020 1 次提交
-
-
ctags creates a warning: |ctags: Warning: include/linux/seqlock.h:738: null expansion of name pattern "\2" The DEFINE_SEQLOCK() macro is passed to ctags and being told to expect an argument. Add a dummy argument to keep ctags quiet. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NWill Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20200924154851.skmswuyj322yuz4g@linutronix.de
-
- 07 10月, 2020 2 次提交
-
-
由 Tony Luck 提交于
Existing kernel code can only recover from a machine check on code that is tagged in the exception table with a fault handling recovery path. Add two new fields in the task structure to pass information from machine check handler to the "task_work" that is queued to run before the task returns to user mode: + mce_vaddr: will be initialized to the user virtual address of the fault in the case where the fault occurred in the kernel copying data from a user address. This is so that kill_me_maybe() can provide that information to the user SIGBUS handler. + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe() to determine if mce_vaddr is applicable to this error. Add code to recover from a machine check while copying data from user space to the kernel. Action for this case is the same as if the user touched the poison directly; unmap the page and send a SIGBUS to the task. Use a new helper function to share common code between the "fault in user mode" case and the "fault while copying from user" case. New code paths will be activated by the next patch which sets MCE_IN_KERNEL_COPYIN. Suggested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
-
由 Masami Hiramatsu 提交于
Use per_cpu_ptr_to_phys() instead of virt_to_phys() for per-cpu address conversion. In xen_starting_cpu(), per-cpu xen_vcpu_info address is converted to gfn by virt_to_gfn() macro. However, since the virt_to_gfn(v) assumes the given virtual address is in linear mapped kernel memory area, it can not convert the per-cpu memory if it is allocated on vmalloc area. This depends on CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK. If it is enabled, the first chunk of percpu memory is linear mapped. In the other case, that is allocated from vmalloc area. Moreover, if the first chunk of percpu has run out until allocating xen_vcpu_info, it will be allocated on the 2nd chunk, which is based on kernel memory or vmalloc memory (depends on CONFIG_NEED_PER_CPU_KM). Without this fix and kernel configured to use vmalloc area for the percpu memory, the Dom0 kernel will fail to boot with following errors. [ 0.466172] Xen: initializing cpu0 [ 0.469601] ------------[ cut here ]------------ [ 0.474295] WARNING: CPU: 0 PID: 1 at arch/arm64/xen/../../arm/xen/enlighten.c:153 xen_starting_cpu+0x160/0x180 [ 0.484435] Modules linked in: [ 0.487565] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #4 [ 0.493895] Hardware name: Socionext Developer Box (DT) [ 0.499194] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--) [ 0.504836] pc : xen_starting_cpu+0x160/0x180 [ 0.509263] lr : xen_starting_cpu+0xb0/0x180 [ 0.513599] sp : ffff8000116cbb60 [ 0.516984] x29: ffff8000116cbb60 x28: ffff80000abec000 [ 0.522366] x27: 0000000000000000 x26: 0000000000000000 [ 0.527754] x25: ffff80001156c000 x24: fffffdffbfcdb600 [ 0.533129] x23: 0000000000000000 x22: 0000000000000000 [ 0.538511] x21: ffff8000113a99c8 x20: ffff800010fe4f68 [ 0.543892] x19: ffff8000113a9988 x18: 0000000000000010 [ 0.549274] x17: 0000000094fe0f81 x16: 00000000deadbeef [ 0.554655] x15: ffffffffffffffff x14: 0720072007200720 [ 0.560037] x13: 0720072007200720 x12: 0720072007200720 [ 0.565418] x11: 0720072007200720 x10: 0720072007200720 [ 0.570801] x9 : ffff8000100fbdc0 x8 : ffff800010715208 [ 0.576182] x7 : 0000000000000054 x6 : ffff00001b790f00 [ 0.581564] x5 : ffff800010bbf880 x4 : 0000000000000000 [ 0.586945] x3 : 0000000000000000 x2 : ffff80000abec000 [ 0.592327] x1 : 000000000000002f x0 : 0000800000000000 [ 0.597716] Call trace: [ 0.600232] xen_starting_cpu+0x160/0x180 [ 0.604309] cpuhp_invoke_callback+0xac/0x640 [ 0.608736] cpuhp_issue_call+0xf4/0x150 [ 0.612728] __cpuhp_setup_state_cpuslocked+0x128/0x2c8 [ 0.618030] __cpuhp_setup_state+0x84/0xf8 [ 0.622192] xen_guest_init+0x324/0x364 [ 0.626097] do_one_initcall+0x54/0x250 [ 0.630003] kernel_init_freeable+0x12c/0x2c8 [ 0.634428] kernel_init+0x1c/0x128 [ 0.637988] ret_from_fork+0x10/0x18 [ 0.641635] ---[ end trace d95b5309a33f8b27 ]--- [ 0.646337] ------------[ cut here ]------------ [ 0.651005] kernel BUG at arch/arm64/xen/../../arm/xen/enlighten.c:158! [ 0.657697] Internal error: Oops - BUG: 0 [#1] SMP [ 0.662548] Modules linked in: [ 0.665676] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.9.0-rc4+ #4 [ 0.673398] Hardware name: Socionext Developer Box (DT) [ 0.678695] pstate: 00000005 (nzcv daif -PAN -UAO BTYPE=--) [ 0.684338] pc : xen_starting_cpu+0x178/0x180 [ 0.688765] lr : xen_starting_cpu+0x144/0x180 [ 0.693188] sp : ffff8000116cbb60 [ 0.696573] x29: ffff8000116cbb60 x28: ffff80000abec000 [ 0.701955] x27: 0000000000000000 x26: 0000000000000000 [ 0.707344] x25: ffff80001156c000 x24: fffffdffbfcdb600 [ 0.712718] x23: 0000000000000000 x22: 0000000000000000 [ 0.718107] x21: ffff8000113a99c8 x20: ffff800010fe4f68 [ 0.723481] x19: ffff8000113a9988 x18: 0000000000000010 [ 0.728863] x17: 0000000094fe0f81 x16: 00000000deadbeef [ 0.734245] x15: ffffffffffffffff x14: 0720072007200720 [ 0.739626] x13: 0720072007200720 x12: 0720072007200720 [ 0.745008] x11: 0720072007200720 x10: 0720072007200720 [ 0.750390] x9 : ffff8000100fbdc0 x8 : ffff800010715208 [ 0.755771] x7 : 0000000000000054 x6 : ffff00001b790f00 [ 0.761153] x5 : ffff800010bbf880 x4 : 0000000000000000 [ 0.766534] x3 : 0000000000000000 x2 : 00000000deadbeef [ 0.771916] x1 : 00000000deadbeef x0 : ffffffffffffffea [ 0.777304] Call trace: [ 0.779819] xen_starting_cpu+0x178/0x180 [ 0.783898] cpuhp_invoke_callback+0xac/0x640 [ 0.788325] cpuhp_issue_call+0xf4/0x150 [ 0.792317] __cpuhp_setup_state_cpuslocked+0x128/0x2c8 [ 0.797619] __cpuhp_setup_state+0x84/0xf8 [ 0.801779] xen_guest_init+0x324/0x364 [ 0.805683] do_one_initcall+0x54/0x250 [ 0.809590] kernel_init_freeable+0x12c/0x2c8 [ 0.814016] kernel_init+0x1c/0x128 [ 0.817583] ret_from_fork+0x10/0x18 [ 0.821226] Code: d0006980 f9427c00 cb000300 17ffffea (d4210000) [ 0.827415] ---[ end trace d95b5309a33f8b28 ]--- [ 0.832076] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 0.839815] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org> Reviewed-by: NStefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/160196697165.60224.17470743378683334995.stgit@devnote2Signed-off-by: NJuergen Gross <jgross@suse.com>
-
- 06 10月, 2020 2 次提交
-
-
由 Maulik Shah 提交于
An interrupt that is disabled/masked but set for wakeup may still need to be able to wake up the system from sleep states like "suspend to RAM". To that effect, introduce the IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND flag. If the irqchip have this flag set, the irq PM code will enable/unmask the irqs that are marked for wakeup, but that are in a disabled state. On resume, such irqs will be restored back to their disabled state. Suggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NMaulik Shah <mkshah@codeaurora.org> [maz: commit message fix-up] Signed-off-by: NMarc Zyngier <maz@kernel.org> Tested-by: NStephen Boyd <swboyd@chromium.org> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NDouglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/1601267524-20199-4-git-send-email-mkshah@codeaurora.org
-
由 Dan Williams 提交于
In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. ] Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NTony Luck <tony.luck@intel.com> Acked-by: NMichael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
-
- 05 10月, 2020 1 次提交
-
-
由 David Howells 提交于
When a new incoming call arrives at an userspace rxrpc socket on a new connection that has a security class set, the code currently pushes it onto the accept queue to hold a ref on it for the socket. This doesn't work, however, as recvmsg() pops it off, notices that it's in the SERVER_SECURING state and discards the ref. This means that the call runs out of refs too early and the kernel oopses. By contrast, a kernel rxrpc socket manually pre-charges the incoming call pool with calls that already have user call IDs assigned, so they are ref'd by the call tree on the socket. Change the mode of operation for userspace rxrpc server sockets to work like this too. Although this is a UAPI change, server sockets aren't currently functional. Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
- 03 10月, 2020 8 次提交
-
-
由 Vincent Donnefort 提交于
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: NVincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
-
由 Coly Li 提交于
The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages. This patch introduces a helper sendpage_ok(), it returns true if the checking page, - is not slab page: PageSlab(page) is false. - has page refcount: page_count(page) is not zero All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page. Signed-off-by: NColy Li <colyli@suse.de> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Vlastimil Babka <vbabka@suse.com> Cc: stable@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mauro Carvalho Chehab 提交于
As warned by "make htmldocs", there are two new struct elements that aren't documented: ../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device' ../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device' Fixes: 1fc70edb ("net: core: add nested_level variable in net_device") Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Johannes Berg 提交于
If userspace doesn't complete the policy dump, we leak the allocated state. Fix this. Fixes: d07dcf9a ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nathan Chancellor 提交于
Functions that are passed to early_initcall should be of type initcall_t, which expects a return type of int. This is not currently an error but a patch in the Clang LTO series could change that in the future. Fixes: 9183c3f9 ("static_call: Add inline static call infrastructure") Signed-off-by: NNathan Chancellor <natechancellor@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/lkml/20200903203053.3411268-17-samitolvanen@google.com/
-
由 Saeed Mahameed 提交于
In case of pci is offline reclaim_pages_cmd() will still try to call the FW to release FW pages, cmd_exec() in this case will return a silent success without actually calling the FW. This is wrong and will cause page leaks, what we should do is to detect pci offline or command interface un-available before tying to access the FW and manually release the FW pages in the driver. In this patch we share the code to check for FW command interface availability and we call it in sensitive places e.g. reclaim_pages_cmd(). Alternative fix: 1. Remove MLX5_CMD_OP_MANAGE_PAGES form mlx5_internal_err_ret_value, command success simulation list. 2. Always Release FW pages even if cmd_exec fails in reclaim_pages_cmd(). Reviewed-by: NMoshe Shemesh <moshe@nvidia.com> Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>
-
由 Eran Ben Elisha 提交于
Upon command completion timeout, driver simulates a forced command completion. In a rare case where real interrupt for that command arrives simultaneously, it might release the command entry while the forced handler might still access it. Fix that by adding an entry refcount, to track current amount of allowed handlers. Command entry to be released only when this refcount is decremented to zero. Command refcount is always initialized to one. For callback commands, command completion handler is the symmetric flow to decrement it. For non-callback commands, it is wait_func(). Before ringing the doorbell, increment the refcount for the real completion handler. Once the real completion handler is called, it will decrement it. For callback commands, once the delayed work is scheduled, increment the refcount. Upon callback command completion handler, we will try to cancel the timeout callback. In case of success, we need to decrement the callback refcount as it will never run. In addition, gather the entry index free and the entry free into a one flow for all command types release. Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com> Reviewed-by: NMoshe Shemesh <moshe@mellanox.com> Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>
-
由 Roman Gushchin 提交于
Since commit ea426c2a ("mm: memcg: prepare for byte-sized vmstat items") the write side of slab counters accepts a value in bytes and converts it to pages. It happens in __mod_node_page_state(). However a non-SMP version of __mod_node_page_state() doesn't perform this conversion. It leads to incorrect (unrealistically high) slab counters values. Fix this by adding a similar conversion to the non-SMP version of __mod_node_page_state(). Signed-off-by: NRoman Gushchin <guro@fb.com> Reported-and-tested-by: NBastian Bittorf <bb@npl.de> Fixes: ea426c2a ("mm: memcg: prepare for byte-sized vmstat items") Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 10月, 2020 1 次提交
-
-
由 Linus Torvalds 提交于
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to happen, which depended on all pipe IO being entirely serialized by the pipe lock. So by checking the state you were waiting for, and then adding yourself to the wait queue before dropping the lock, you were guaranteed to see all the wakeups. Strictly speaking, the actual wakeups were not done under the lock, but the pipe_wait() model still worked, because since the waiter held the lock when checking whether it should sleep, it would always see the current state, and the wakeup was always done after updating the state. However, commit 0ddad21d ("pipe: use exclusive waits when reading or writing") split the single wait-queue into two, and in the process also made the "wait for event" code wait for _two_ wait queues, and that then showed a race with the wakers that were not serialized by the pipe lock. It's only splice that used that "pipe_wait()" model, so the problem wasn't obvious, but Josef Bacik reports: "I hit a hang with fstest btrfs/187, which does a btrfs send into /dev/null. This works by creating a pipe, the write side is given to the kernel to write into, and the read side is handed to a thread that splices into a file, in this case /dev/null. The box that was hung had the write side stuck here [pipe_write] and the read side stuck here [splice_from_pipe_next -> pipe_wait]. [ more details about pipe_wait() scenario ] The problem is we're doing the prepare_to_wait, which sets our state each time, however we can be woken up either with reads or writes. In the case above we race with the WRITER waking us up, and re-set our state to INTERRUPTIBLE, and thus never break out of schedule" Josef had a patch that avoided the issue in pipe_wait() by just making it set the state only once, but the deeper problem is that pipe_wait() depends on a level of synchonization by the pipe mutex that it really shouldn't. And the whole "wait for any pipe state change" model really isn't very good to begin with. So rather than trying to work around things in pipe_wait(), remove that legacy model of "wait for arbitrary pipe event" entirely, and actually create functions that wait for the pipe actually being readable or writable, and can do so without depending on the pipe lock serializing everything. Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing") Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/Reported-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-and-tested-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 10月, 2020 3 次提交
-
-
由 Zqiang 提交于
If a CPU is offlined the debug objects per CPU pool is not cleaned up. If the CPU is never onlined again then the objects in the pool are wasted. Add a CPU hotplug callback which is invoked after the CPU is dead to free the pool. [ tglx: Massaged changelog and added comment about remote access safety ] Signed-off-by: NZqiang <qiang.zhang@windriver.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20200908062709.11441-1-qiang.zhang@windriver.com
-
由 Qian Cai 提交于
Calling pipe2() with O_NOTIFICATION_PIPE could results in memory leaks unless watch_queue_init() is successful. In case of watch_queue_init() failure in pipe2() we are left with inode and pipe_inode_info instances that need to be freed. That failure exit has been introduced in commit c73be61c ("pipe: Add general notification queue support") and its handling should've been identical to nearby treatment of alloc_file_pseudo() failures - it is dealing with the same situation. As it is, the mainline kernel leaks in that case. Another problem is that CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE cases are treated differently (and the former leaks just pipe_inode_info, the latter - both pipe_inode_info and inode). Fixed by providing a dummy wacth_queue_init() in !CONFIG_WATCH_QUEUE case and by having failures of wacth_queue_init() handled the same way we handle alloc_file_pseudo() ones. Fixes: c73be61c ("pipe: Add general notification queue support") Signed-off-by: NQian Cai <cai@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Ard Biesheuvel 提交于
Jonathan reports that the strict policy for memory mapped by the ACPI core breaks the use case of passing ACPI table overrides via initramfs. This is due to the fact that the memory type used for loading the initramfs in memory is not recognized as a memory type that is typically used by firmware to pass firmware tables. Since the purpose of the strict policy is to ensure that no AML or other ACPI code can manipulate any memory that is used by the kernel to keep its internal state or the state of user tasks, we can relax the permission check, and allow mappings of memory that is reserved and marked as NOMAP via memblock, and therefore not covered by the linear mapping to begin with. Fixes: 1583052d ("arm64/acpi: disallow AML memory opregions to access kernel memory") Fixes: 325f5585 ("arm64/acpi: disallow writeable AML opregion mapping for EFI code regions") Reported-by: NJonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: NArd Biesheuvel <ardb@kernel.org> Tested-by: NJonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Link: https://lore.kernel.org/r/20200929132522.18067-1-ardb@kernel.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>
-
- 30 9月, 2020 5 次提交
-
-
由 Mauro Carvalho Chehab 提交于
As warned by Sphinx: ./Documentation/gpu/drm-kms-helpers:305: ./include/drm/drm_dsc.h:587: WARNING: Unparseable C cross-reference: 'struct' Invalid C declaration: Expected identifier in nested name, got keyword: struct [error at 6] struct ------^ The markup for one struct is wrong, as struct is used twice. Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/3d467022325e15bba8dcb13da8fb730099303266.1601467849.git.mchehab+huawei@kernel.org
-
由 Peter Collingbourne 提交于
Also partially revert the follow-up change "drm: pl111: Absorb the external register header". This reverts the parts of commits 7e4e589d and 0fb81256 that touch paths outside of drivers/gpu/drm/pl111. The fbdev driver is used by Android's FVP configuration. Using the DRM driver together with DRM's fbdev emulation results in a failure to boot Android. The root cause is that Android's generic fbdev userspace driver relies on the ability to set the pixel format via FBIOPUT_VSCREENINFO, which is not supported by fbdev emulation. There have been other less critical behavioral differences identified between the fbdev driver and the DRM driver with fbdev emulation. The DRM driver exposes different values for the panel's width, height and refresh rate, and the DRM driver fails a FBIOPUT_VSCREENINFO syscall with yres_virtual greater than the maximum supported value instead of letting the syscall succeed and setting yres_virtual based on yres. Signed-off-by: NPeter Collingbourne <pcc@google.com> Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20200929195344.2219796-1-pcc@google.com
-
由 Ard Biesheuvel 提交于
efivars_sysfs_init() is only used locally in the source file that defines it, so make it static and unexport it. Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
由 Ard Biesheuvel 提交于
The worker thread that gets kicked off to sync the state of the EFI variable list is only used by the EFI pstore implementation, and is defined in its source file. So let's move its scheduling there as well. Since our efivar_init() scan will bail on duplicate entries, there is no need to disable the workqueue like we did before, so we can run it unconditionally. Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
由 Ard Biesheuvel 提交于
The EFI pstore implementation relies on the 'efivars' abstraction, which encapsulates the EFI variable store in a way that can be overridden by other backing stores, like the Google SMI one. On top of that, the EFI pstore implementation also relies on the efivars.ko module, which is a separate layer built on top of the 'efivars' abstraction that exposes the [deprecated] sysfs entries for each variable that exists in the backing store. Since the efivars.ko module is deprecated, and all users appear to have moved to the efivarfs file system instead, let's prepare for its removal, by removing EFI pstore's dependency on it. Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
- 29 9月, 2020 4 次提交
-
-
由 Jakub Kicinski 提交于
Validation flags are missing kdoc, add it. Fixes: ef6243ac ("genetlink: optionally validate strictly/dumps") Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
This patch is to add a new variable 'nested_level' into the net_device structure. This variable will be used as a parameter of spin_lock_nested() of dev->addr_list_lock. netif_addr_lock() can be called recursively so spin_lock_nested() is used instead of spin_lock() and dev->lower_level is used as a parameter of spin_lock_nested(). But, dev->lower_level value can be updated while it is being used. So, lockdep would warn a possible deadlock scenario. When a stacked interface is deleted, netif_{uc | mc}_sync() is called recursively. So, spin_lock_nested() is called recursively too. At this moment, the dev->lower_level variable is used as a parameter of it. dev->lower_level value is updated when interfaces are being unlinked/linked immediately. Thus, After unlinking, dev->lower_level shouldn't be a parameter of spin_lock_nested(). A (macvlan) | B (vlan) | C (bridge) | D (macvlan) | E (vlan) | F (bridge) A->lower_level : 6 B->lower_level : 5 C->lower_level : 4 D->lower_level : 3 E->lower_level : 2 F->lower_level : 1 When an interface 'A' is removed, it releases resources. At this moment, netif_addr_lock() would be called. Then, netdev_upper_dev_unlink() is called recursively. Then dev->lower_level is updated. There is no problem. But, when the bridge module is removed, 'C' and 'F' interfaces are removed at once. If 'F' is removed first, a lower_level value is like below. A->lower_level : 5 B->lower_level : 4 C->lower_level : 3 D->lower_level : 2 E->lower_level : 1 F->lower_level : 1 Then, 'C' is removed. at this moment, netif_addr_lock() is called recursively. The ordering is like this. C(3)->D(2)->E(1)->F(1) At this moment, the lower_level value of 'E' and 'F' are the same. So, lockdep warns a possible deadlock scenario. In order to avoid this problem, a new variable 'nested_level' is added. This value is the same as dev->lower_level - 1. But this value is updated in rtnl_unlock(). So, this variable can be used as a parameter of spin_lock_nested() safely in the rtnl context. Test commands: ip link add br0 type bridge vlan_filtering 1 ip link add vlan1 link br0 type vlan id 10 ip link add macvlan2 link vlan1 type macvlan ip link add br3 type bridge vlan_filtering 1 ip link set macvlan2 master br3 ip link add vlan4 link br3 type vlan id 10 ip link add macvlan5 link vlan4 type macvlan ip link add br6 type bridge vlan_filtering 1 ip link set macvlan5 master br6 ip link add vlan7 link br6 type vlan id 10 ip link add macvlan8 link vlan7 type macvlan ip link set br0 up ip link set vlan1 up ip link set macvlan2 up ip link set br3 up ip link set vlan4 up ip link set macvlan5 up ip link set br6 up ip link set vlan7 up ip link set macvlan8 up modprobe -rv bridge Splat looks like: [ 36.057436][ T744] WARNING: possible recursive locking detected [ 36.058848][ T744] 5.9.0-rc6+ #728 Not tainted [ 36.059959][ T744] -------------------------------------------- [ 36.061391][ T744] ip/744 is trying to acquire lock: [ 36.062590][ T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30 [ 36.064922][ T744] [ 36.064922][ T744] but task is already holding lock: [ 36.066626][ T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.068851][ T744] [ 36.068851][ T744] other info that might help us debug this: [ 36.070731][ T744] Possible unsafe locking scenario: [ 36.070731][ T744] [ 36.072497][ T744] CPU0 [ 36.073238][ T744] ---- [ 36.074007][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.075290][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.076590][ T744] [ 36.076590][ T744] *** DEADLOCK *** [ 36.076590][ T744] [ 36.078515][ T744] May be due to missing lock nesting notation [ 36.078515][ T744] [ 36.080491][ T744] 3 locks held by ip/744: [ 36.081471][ T744] #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490 [ 36.083614][ T744] #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.085942][ T744] #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80 [ 36.088400][ T744] [ 36.088400][ T744] stack backtrace: [ 36.089772][ T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728 [ 36.091364][ T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 36.093630][ T744] Call Trace: [ 36.094416][ T744] dump_stack+0x77/0x9b [ 36.095385][ T744] __lock_acquire+0xbc3/0x1f40 [ 36.096522][ T744] lock_acquire+0xb4/0x3b0 [ 36.097540][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.098657][ T744] ? rtmsg_ifinfo+0x1f/0x30 [ 36.099711][ T744] ? __dev_notify_flags+0xa5/0xf0 [ 36.100874][ T744] ? rtnl_is_locked+0x11/0x20 [ 36.101967][ T744] ? __dev_set_promiscuity+0x7b/0x1a0 [ 36.103230][ T744] _raw_spin_lock_bh+0x38/0x70 [ 36.104348][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.105461][ T744] dev_set_rx_mode+0x19/0x30 [ 36.106532][ T744] dev_set_promiscuity+0x36/0x50 [ 36.107692][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.108929][ T744] dev_set_promiscuity+0x1e/0x50 [ 36.110093][ T744] br_port_set_promisc+0x1f/0x40 [bridge] [ 36.111415][ T744] br_manage_promisc+0x8b/0xe0 [bridge] [ 36.112728][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.113967][ T744] ? __hw_addr_sync_one+0x23/0x50 [ 36.115135][ T744] __dev_set_rx_mode+0x68/0x90 [ 36.116249][ T744] dev_uc_sync+0x70/0x80 [ 36.117244][ T744] dev_uc_add+0x50/0x60 [ 36.118223][ T744] macvlan_open+0x18e/0x1f0 [macvlan] [ 36.119470][ T744] __dev_open+0xd6/0x170 [ 36.120470][ T744] __dev_change_flags+0x181/0x1d0 [ 36.121644][ T744] dev_change_flags+0x23/0x60 [ 36.122741][ T744] do_setlink+0x30a/0x11e0 [ 36.123778][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.124929][ T744] ? __nla_validate_parse.part.6+0x45/0x8e0 [ 36.126309][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.127457][ T744] __rtnl_newlink+0x546/0x8e0 [ 36.128560][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.129623][ T744] ? deactivate_slab.isra.85+0x6a1/0x850 [ 36.130946][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.132102][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.133176][ T744] ? is_bpf_text_address+0x5/0xe0 [ 36.134364][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.135445][ T744] ? rcu_read_lock_sched_held+0x32/0x60 [ 36.136771][ T744] ? kmem_cache_alloc_trace+0x2d8/0x380 [ 36.138070][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.139164][ T744] rtnl_newlink+0x47/0x70 [ ... ] Fixes: 845e0ebb ("net: change addr_list_lock back to static key") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
Functions related to nested interface infrastructure such as netdev_walk_all_{ upper | lower }_dev() pass both private functions and "data" pointer to handle their own things. At this point, the data pointer type is void *. In order to make it easier to expand common variables and functions, this new netdev_nested_priv structure is added. In the following patch, a new member variable will be added into this struct to fix the lockdep issue. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Julien Thierry 提交于
kvm_vcpu_kick() is not NMI safe. When the overflow handler is called from NMI context, defer waking the vcpu to an irq_work queue. A vcpu can be freed while it's not running by kvm_destroy_vm(). Prevent running the irq_work for a non-existent vcpu by calling irq_work_sync() on the PMU destroy path. [Alexandru E.: Added irq_work_sync()] Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Signed-off-by: NAlexandru Elisei <alexandru.elisei@arm.com> Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox) Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Suzuki K Pouloze <suzuki.poulose@arm.com> Cc: kvm@vger.kernel.org Cc: kvmarm@lists.cs.columbia.edu Link: https://lore.kernel.org/r/20200924110706.254996-6-alexandru.elisei@arm.comSigned-off-by: NWill Deacon <will@kernel.org>
-
- 28 9月, 2020 4 次提交
-
-
由 Shaokun Zhang 提交于
ARMv8.4-PMU introduces the PMMIR_EL1 registers and some new PMU events, like STALL_SLOT etc, are related to it. Let's add a caps directory to /sys/bus/event_source/devices/armv8_pmuv3_0/ and support slots from PMMIR_EL1 registers in this entry. The user programs can get the slots from sysfs directly. /sys/bus/event_source/devices/armv8_pmuv3_0/caps/slots is exposed under sysfs. Both ARMv8.4-PMU and STALL_SLOT event are implemented, it returns the slots from PMMIR_EL1, otherwise it will return 0. Signed-off-by: NShaokun Zhang <zhangshaokun@hisilicon.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/1600754025-53535-1-git-send-email-zhangshaokun@hisilicon.comSigned-off-by: NWill Deacon <will@kernel.org>
-
由 Kai-Heng Feng 提交于
After commit 6827ca57 ("memstick: rtsx_usb_ms: Support runtime power management"), removing module rtsx_usb_ms will be stuck. The deadlock is caused by powering on and powering off at the same time, the former one is when memstick_check() is flushed, and the later is called by memstick_remove_host(). Soe let's skip allocating card to prevent this issue. Fixes: 6827ca57 ("memstick: rtsx_usb_ms: Support runtime power management") Signed-off-by: NKai-Heng Feng <kai.heng.feng@canonical.com> Link: https://lore.kernel.org/r/20200925084952.13220-1-kai.heng.feng@canonical.com Cc: stable@vger.kernel.org Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
-
由 Peter Xu 提交于
This prepares for the future work to trigger early cow on pinned pages during fork(). No functional change intended. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
(Commit message majorly collected from Jason Gunthorpe) Reduce the chance of false positive from page_maybe_dma_pinned() by keeping track if the mm_struct has ever been used with pin_user_pages(). This allows cases that might drive up the page ref_count to avoid any penalty from handling dma_pinned pages. Future work is planned, to provide a more sophisticated solution, likely to turn it into a real counter. For now, make it atomic_t but use it as a boolean for simplicity. Suggested-by: NJason Gunthorpe <jgg@ziepe.ca> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 9月, 2020 1 次提交
-
-
由 Laurent Dufour 提交于
In register_mem_sect_under_node() the system_state's value is checked to detect whether the call is made during boot time or during an hot-plug operation. Unfortunately, that check against SYSTEM_BOOTING is wrong because regular memory is registered at SYSTEM_SCHEDULING state. In addition, memory hot-plug operation can be triggered at this system state by the ACPI [1]. So checking against the system state is not enough. The consequence is that on system with interleaved node's ranges like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] This can be seen on PowerPC LPAR after multiple memory hot-plug and hot-unplug operations are done. At the next reboot the node's memory ranges can be interleaved and since the call to link_mem_sections() is made in topology_init() while the system is in the SYSTEM_SCHEDULING state, the node's id is not checked, and the sections registered to multiple nodes: $ ls -l /sys/devices/system/memory/memory21/node* total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 In that case, the system is able to boot but if later one of theses memory blocks is hot-unplugged and then hot-plugged, the sysfs inconsistency is detected and this is triggering a BUG_ON(): kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This patch addresses the root cause by not relying on the system_state value to detect whether the call is due to a hot-plug operation. An extra parameter is added to link_mem_sections() detailing whether the operation is due to a hot-plug operation. [1] According to Oscar Salvador, using this qemu command line, ACPI memory hotplug operations are raised at SYSTEM_SCHEDULING state: $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \ -m size=$MEM,slots=255,maxmem=4294967296k \ -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \ -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \ -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \ -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \ -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \ -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \ -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \ -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \ Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()") Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-