- 21 4月, 2017 2 次提交
-
-
由 Paul E. McKenney 提交于
Although Tree SRCU does reduce delays when there is at least one synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp() still waits for SRCU_INTERVAL before invoking callbacks. Since synchronize_srcu_expedited() now posts a callback and waits for that callback to do a wakeup, this destroys the expedited nature of synchronize_srcu_expedited(). This destruction became apparent to Marc Zyngier in the guise of a guest-OS bootup slowdown from five seconds to no fewer than forty seconds. This commit therefore invokes callbacks immediately at the end of the grace period when there is at least one synchronize_srcu_expedited() invocation pending. This brought Marc's guest-OS bootup times back into the realm of reason. Reported-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: NMarc Zyngier <marc.zyngier@arm.com>
-
由 Paul E. McKenney 提交于
Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2], however, there are workloads that could result in a high volume of concurrent invocations of call_srcu(), which with current SRCU would result in excessive lock contention on the srcu_struct structure's ->queue_lock, which protects SRCU's callback lists. This commit therefore moves SRCU to per-CPU callback lists, thus greatly reducing contention. Because a given SRCU instance no longer has a single centralized callback list, starting grace periods and invoking callbacks are both more complex than in the single-list Classic SRCU implementation. Starting grace periods and handling callbacks are now handled using an srcu_node tree that is in some ways similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is controlled by exactly the same Kconfig options and boot parameters that control the shape of the rcu_node tree). In addition, the old per-CPU srcu_array structure is now named srcu_data and contains an rcu_segcblist structure named ->srcu_cblist for its callbacks (and a spinlock to protect this). The srcu_struct gets an srcu_gp_seq that is used to associate callback segments with the corresponding completion-time grace-period number. These completion-time grace-period numbers are propagated up the srcu_node tree so that the grace-period workqueue handler can determine whether additional grace periods are needed on the one hand and where to look for callbacks that are ready to be invoked. The srcu_barrier() function must now wait on all instances of the per-CPU ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock, srcu_barrier() can remotely add the needed callbacks. In theory, it could also remotely start grace periods, but in practice doing so is complex and racy. And interestingly enough, it is never necessary for srcu_barrier() to start a grace period because srcu_barrier() only enqueues a callback when a callback is already present--and it turns out that a grace period has to have already been started for this pre-existing callback. Furthermore, it is only the callback that srcu_barrier() needs to wait on, not any particular grace period. Therefore, a new rcu_segcblist_entrain() function enqueues the srcu_barrier() function's callback into the same segment occupied by the last pre-existing callback in the list. The special case where all the pre-existing callbacks are on a different list (because they are in the process of being invoked) is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on the done-callbacks check that takes place after all callbacks are inovked. Note that the readers use the same algorithm as before. Note that there is a separate srcu_idx that tells the readers what counter to increment. This unfortunately cannot be combined with srcu_gp_seq because they need to be incremented at different times. This commit introduces some ugly #ifdefs in rcutorture. These will go away when I feel good enough about Tree SRCU to ditch Classic SRCU. Some crude performance comparisons, courtesy of a quickly hacked rcuperf asynchronous-grace-period capability: Callback Queuing Overhead ------------------------- # CPUS Classic SRCU Tree SRCU ------ ------------ --------- 2 0.349 us 0.342 us 16 31.66 us 0.4 us 41 --------- 0.417 us The times are the 90th percentiles, a statistic that was chosen to reject the overheads of the occasional srcu_barrier() call needed to avoid OOMing the test machine. The rcuperf test hangs when running Classic SRCU at 41 CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code and that statistics, this is a convincing demonstration of Tree SRCU's performance and scalability advantages. [1] https://lwn.net/Articles/309030/ [2] https://patchwork.kernel.org/patch/5108281/Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
-
- 19 4月, 2017 35 次提交
-
-
由 Paul E. McKenney 提交于
The TREE_SRCU rewrite is large and a bit on the non-simple side, so this commit helps reduce risk by allowing the old v4.11 SRCU algorithm to be selected using a new CLASSIC_SRCU Kconfig option that depends on RCU_EXPERT. The default is to use the new TREE_SRCU and TINY_SRCU algorithms, in order to help get these the testing that they need. However, if your users do not require the update-side scalability that is to be provided by TREE_SRCU, select RCU_EXPERT and then CLASSIC_SRCU to revert back to the old classic SRCU algorithm. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The srcu_torture_stats() function is adapted to the specific srcu_struct layout traditionally used by SRCU. This commit therefore adds support for Tiny SRCU. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
In response to automated complaints about modifications to SRCU increasing its size, this commit creates a tiny SRCU that is used in SMP=n && PREEMPT=n builds. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
SRCU's implementation of expedited grace periods has always assumed that the SRCU instance is idle when the expedited request arrives. This commit improves this a bit by maintaining a count of the number of outstanding expedited requests, thus allowing prior non-expedited grace periods accommodate these requests by shifting to expedited mode. However, any non-expedited wait already in progress will still wait for the full duration. Improved control of expedited grace periods is planned, but one step at a time. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
Updating ->srcu_state and ->srcu_gp_seq will lead to extremely complex race conditions given multiple callback queues, so this commit takes advantage of the two-bit state now available in rcu_seq counters to store the state in the bottom two bits of ->srcu_gp_seq. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit increases the number of reserved bits at the bottom of an rcu_seq grace-period counter from one to two, as will be needed to accommodate SRCU's three-state grace periods. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The expedited grace-period code contains several open-coded shifts know the format of an rcu_seq grace-period counter, which is not particularly good style. This commit therefore creates a new rcu_seq_ctr() function that extracts the counter portion of the counter, and an rcu_seq_state() function that extracts the low-order state bit. This commit prepares for SRCU callback parallelization, which will require two state bits. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit makes the num_rcu_lvl[] array external so that SRCU can make use of it for initializing its upcoming srcu_node tree. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit moves rcu_for_each_node_breadth_first(), rcu_for_each_nonleaf_node_breadth_first(), and rcu_for_each_leaf_node() from kernel/rcu/tree.h to kernel/rcu/rcu.h so that SRCU can access them. This commit is code-movement only. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The levelcnt[] array is identical to num_rcu_lvl[], so this commit removes levelcnt[]. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit moves the rcu_init_levelspread() function from kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it. This is another step towards enabling SRCU to create its own combining tree. This commit is code-movement only, give or take knock-on adjustments. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit moves the C preprocessor code that defines the default shape of the rcu_node combining tree to a new include/linux/rcu_node_tree.h file as a first step towards enabling SRCU to create its own combining tree, which in turn enables SRCU to implement per-CPU callback handling, thus avoiding contention on the lock currently guarding the single list of callbacks. Note that users of SRCU still need to know the size of the srcu_struct structure, hence include/linux rather than kernel/rcu. This commit is code-movement only. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit switches SRCU from custom-built callback queues to the new rcu_segcblist structure. This change associates grace-period sequence numbers with groups of callbacks, which will be needed for efficient processing of per-CPU callbacks. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit adds grace-period sequence numbers, which will be used to handle mid-boot grace periods and per-CPU callback lists. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The current SRCU grace-period processing might never reach the last portion of srcu_advance_batches(). This is OK given the current implementation, as the first portion, up to the try_check_zero() following the srcu_flip() is sufficient to drive grace periods forward. However, it has the unfortunate side-effect of making it impossible to determine when a given grace period has ended, and it will be necessary to efficiently trace ends of grace periods in order to efficiently handle per-CPU SRCU callback lists. This commit therefore adds states to the SRCU grace-period processing, so that the end of a given SRCU grace period is marked by the transition to the SRCU_STATE_DONE state. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit simplifies the SRCU state machine by pushing the srcu_advance_batches() idle-SRCU fastpath into the common case. This is done by giving srcu_reschedule() a delay parameter, which is zero in the call from srcu_advance_batches(). This commit is a step towards numbering callbacks in order to efficiently handle per-CPU callback lists. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Dmitry Vyukov 提交于
The rcu_seq_end() function increments seq signifying completion of a grace period, after that checks that the seq is even and wakes _synchronize_rcu_expedited(). The _synchronize_rcu_expedited() function uses wait_event() to wait for even seq. The problem is that wait_event() can return as soon as seq becomes even without waiting for the wakeup. In such case the warning in rcu_seq_end() can falsely fire if the next expedited grace period starts before the check. Check that seq has good value before incrementing it. Signed-off-by: NDmitry Vyukov <dvyukov@google.com> Cc: syzkaller@googlegroups.com Cc: linux-kernel@vger.kernel.org Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: josh@joshtriplett.org Cc: jiangshanlai@gmail.com Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> --- syzkaller-triggered warning: WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: events wait_rcu_exp_gp Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 panic+0x1fb/0x412 kernel/panic.c:179 __warn+0x1c4/0x1e0 kernel/panic.c:540 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 ---
-
由 Paul E. McKenney 提交于
Expedited grace periods use workqueue handlers that wake up the requesters, but there is no lock mediating this wakeup. Therefore, memory barriers are required to ensure that the handler's memory references are seen by all to occur before synchronize_*_expedited() returns to its caller. Possibly detected by syzkaller. Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h. This will allow SRCU to use these functions, which in turn will allow SRCU to move from a single global callback queue to a per-CPU callback queue. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit adds single-element dequeue functions to rcu_segcblist. These are less efficient than using the extract and insert functions, but allow more precise debugging code. These functions are thus expected to be used only in debug builds, for example, CONFIG_PROVE_RCU. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This commit checks for pre-scheduler state, and if that early in the boot process, synchronize_srcu() and friends are no-ops. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
This is primarily a code-movement commit in preparation for allowing SRCU to handle early-boot SRCU grace periods. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
RCU has only one multi-tail callback list, which is implemented via the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the rcu_data structure, and whose operations are open-code throughout the Tree RCU implementation. This has been more or less OK in the past, but upcoming callback-list optimizations in SRCU could really use a multi-tail callback list there as well. This commit therefore abstracts the multi-tail callback list handling into a new kernel/rcu/rcu_segcblist.h file, and uses this new API. The simple head-and-tail pointer callback list is also abstracted and applied everywhere except for the NOCB callback-offload lists. (Yes, the plan is to apply them there as well, but this commit is already bigger than would be good.) Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
If the RCU_EXPERT Kconfig option is not set (the default), then the RCU_FANOUT_LEAF Kconfig option will not be defined, which will cause the leaf-level rcu_node tree fanout to default to 32 on 32-bit systems and 64 on 64-bit systems. This can result in excessive lock contention. This commit therefore changes the computation of the leaf-level rcu_node tree fanout so that the result will be 16 unless an explicit Kconfig or kernel-boot setting says otherwise. Reported-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The rcu_all_qs() and rcu_note_context_switch() do a series of checks, taking various actions to supply RCU with quiescent states, depending on the outcomes of the various checks. This is a bit much for scheduling fastpaths, so this commit creates a separate ->rcu_urgent_qs field in the rcu_dynticks structure that acts as a global guard for these checks. Thus, in the common case, rcu_all_qs() and rcu_note_context_switch() check the ->rcu_urgent_qs field, find it false, and simply return. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
由 Paul E. McKenney 提交于
The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking that one of them still needs a quiescent state before doing an expensive atomic operation on the ->dynticks counter. However, this check reduces overhead only after a rare race condition, and increases complexity. This commit therefore removes the scan and the mechanism enabling the scan. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The rcu_qs_ctr variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The rcu_sched_qs_mask variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The current use of "RCU_TRACE(statement);" can cause odd bugs, especially where "statement" is a local-variable declaration, as it can leave a misplaced ";" in the source code. This commit therefore converts these to "RCU_TRACE(statement;)", which avoids the misplaced ";". Reported-by: NJosh Triplett <josh@joshtriplett.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The current use of "RCU_TRACE(statement);" can cause odd bugs, especially where "statement" is a local-variable declaration, as it can leave a misplaced ";" in the source code. This commit therefore converts these to "RCU_TRACE(statement;)", which avoids the misplaced ";". Reported-by: NJosh Triplett <josh@joshtriplett.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
The current use of "RCU_TRACE(statement);" can cause odd bugs, especially where "statement" is a local-variable declaration, as it can leave a misplaced ";" in the source code. This commit therefore converts these to "RCU_TRACE(statement;)", which avoids the misplaced ";". Reported-by: NJosh Triplett <josh@joshtriplett.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Paul E. McKenney 提交于
Users of SRCU are obliged to complete all grace-period activity before invoking cleanup_srcu_struct(). This means that all calls to either synchronize_srcu() or synchronize_srcu_expedited() must have returned, and all calls to call_srcu() must have returned, and the last call to call_srcu() must have been followed by a call to srcu_barrier(). Furthermore, the caller must have done something to prevent any further calls to synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu(). Therefore, if there has ever been an invocation of call_srcu() on the srcu_struct in question, the sequence of events must be as follows: 1. Prevent any further calls to call_srcu(). 2. Wait for any pre-existing call_srcu() invocations to return. 3. Invoke srcu_barrier(). 4. It is now safe to invoke cleanup_srcu_struct(). On the other hand, if there has ever been a call to synchronize_srcu() or synchronize_srcu_expedited(), the sequence of events must be as follows: 1. Prevent any further calls to synchronize_srcu() or synchronize_srcu_expedited(). 2. Wait for any pre-existing synchronize_srcu() or synchronize_srcu_expedited() invocations to return. 3. It is now safe to invoke cleanup_srcu_struct(). If there have been calls to all both types of functions (call_srcu() and either of synchronize_srcu() and synchronize_srcu_expedited()), then the caller must do the first three steps of the call_srcu() procedure above and the first two steps of the synchronize_s*() procedure above, and only then invoke cleanup_srcu_struct(). Note that cleanup_srcu_struct() does some probabilistic checks for the caller failing to follow these procedures, in which case cleanup_srcu_struct() does WARN_ON() and avoids freeing the per-CPU structures associated with the specified srcu_struct structure. Reported-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
-
由 Paul E. McKenney 提交于
The srcu_reschedule() function invokes rcu_batch_empty() on each of the four rcu_batch structures in the srcu_struct in question twice. Given that this check will also be needed in cleanup_srcu_struct(), this commit consolidates these four checks into a new rcu_all_batches_empty() function. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
-
由 Paul E. McKenney 提交于
Currently, IPIs are used to force other CPUs to invalidate their TLBs in response to a kernel virtual-memory mapping change. This works, but degrades both battery lifetime (for idle CPUs) and real-time response (for nohz_full CPUs), and in addition results in unnecessary IPIs due to the fact that CPUs executing in usermode are unaffected by stale kernel mappings. It would be better to cause a CPU executing in usermode to wait until it is entering kernel mode to do the flush, first to avoid interrupting usemode tasks and second to handle multiple flush requests with a single flush in the case of a long-running user task. This commit therefore reserves a bit at the bottom of the ->dynticks counter, which is checked upon exit from extended quiescent states. If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is invoked, which, if not supplied, is an empty single-pass do-while loop. If this bottom bit is set on -entry- to an extended quiescent state, then a WARN_ON_ONCE() triggers. This bottom bit may be set using a new rcu_eqs_special_set() function, which returns true if the bit was set, or false if the CPU turned out to not be in an extended quiescent state. Please note that this function refuses to set the bit for a non-nohz_full CPU when that CPU is executing in usermode because usermode execution is tracked by RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
-
- 11 3月, 2017 1 次提交
-
-
由 Thomas Gleixner 提交于
The purgatory code defines global variables which are referenced via a symbol lookup in the kexec code (core and arch). A recent commit addressing sparse warnings made these static and thereby broke kexec_file. Why did this happen? Simply because the whole machinery is undocumented and lacks any form of forward declarations. The variable names are unspecific and lack a prefix, so adding forward declarations creates shadow variables in the core code. Aside of that the code relies on magic constants and duplicate struct definitions with no way to ensure that these things stay in sync. The section placement of the purgatory variables happened by chance and not by design. Unbreak kexec and cleanup the mess: - Add proper forward declarations and document the usage - Use common struct definition - Use the proper common defines instead of magic constants - Add a purgatory_ prefix to have a proper name space - Use ARRAY_SIZE() instead of a homebrewn reimplementation - Add proper sections to the purgatory variables [ From Mike ] Fixes: 72042a8c ("x86/purgatory: Make functions and variables static") Reported-by: NMike Galbraith <<efault@gmx.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Tobin C. Harding" <me@tobin.cc> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 10 3月, 2017 2 次提交
-
-
由 Andrea Arcangeli 提交于
Patch series "userfaultfd non-cooperative further update for 4.11 merge window". Unfortunately I noticed one relevant bug in userfaultfd_exit while doing more testing. I've been doing testing before and this was also tested by kbuild bot and exercised by the selftest, but this bug never reproduced before. I dropped userfaultfd_exit as result. I dropped it because of implementation difficulty in receiving signals in __mmput and because I think -ENOSPC as result from the background UFFDIO_COPY should be enough already. Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit wasn't exercised by the selftest and when I tried to exercise it, after moving it to a more correct place in __mmput where it would make more sense and where the vma list is stable, it resulted in the event_wait_completion in D state. So then I added the second patch to be sure even if we call userfaultfd_event_wait_completion too late during task exit(), we won't risk to generate tasks in D state. The same check exists in handle_userfault() for the same reason, except it makes a difference there, while here is just a robustness check and it's run under WARN_ON_ONCE. While looking at the userfaultfd_event_wait_completion() function I looked back at its callers too while at it and I think it's not ok to stop executing dup_fctx on the fcs list because we relay on userfaultfd_event_wait_completion to execute userfaultfd_ctx_put(fctx->orig) which is paired against userfaultfd_ctx_get(fctx->orig) in dup_userfault just before list_add(fcs). This change only takes care of fctx->orig but this area also needs further review looking for similar problems in fctx->new. The only patch that is urgent is the first because it's an use after free during a SMP race condition that affects all processes if CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably impossible without SLUB poisoning enabled. This patch (of 3): I once reproduced this oops with the userfaultfd selftest, it's not easily reproducible and it requires SLUB poisoning to reproduce. general protection fault: 0000 [#1] SMP Modules linked in: CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014 task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000 RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202 RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600 RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000 R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700 FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: do_exit+0x297/0xd10 SyS_exit+0x17/0x20 tracesys+0xdd/0xe2 Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80 RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP <ffff8801f833fe80> ---[ end trace 9fecd6dcb442846a ]--- In the debugger I located the "mm" pointer in the stack and walking mm->mmap->vm_next through the end shows the vma->vm_next list is fully consistent and it is null terminated list as expected. So this has to be an SMP race condition where userfaultfd_exit was running while the vma list was being modified by another CPU. When userfaultfd_exit() run one of the ->vm_next pointers pointed to SLAB_POISON (RBX is the vma pointer and is 0x6b6b..). The reason is that it's not running in __mmput but while there are still other threads running and it's not holding the mmap_sem (it can't as it has to wait the even to be received by the manager). So this is an use after free that was happening for all processes. One more implementation problem aside from the race condition: userfaultfd_exit has really to check a flag in mm->flags before walking the vma or it's going to slowdown the exit() path for regular tasks. One more implementation problem: at that point signals can't be delivered so it would also create a task in D state if the manager doesn't read the event. The major design issue: it overall looks superfluous as the manager can check for -ENOSPC in the background transfer: if (mmget_not_zero(ctx->mm)) { [..] } else { return -ENOSPC; } It's safer to roll it back and re-introduce it later if at all. [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT] Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: overide||override While we are here, fix the doubled "address" in the touched line Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt. Also, fix the comment block style in the touched hunks in drivers/media/dvb-frontends/drx39xyj/drx_driver.h. Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-