1. 10 Jul, 2020 1 commit
  2. 28 Jun, 2020 2 commits
  3. 11 Jun, 2020 1 commit
    • x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned · 17fae129
      Tony Luck committed
      An interesting thing happened when a guest Linux instance took a machine
      check. The VMM unmapped the bad page from guest physical space and
      passed the machine check to the guest.
      
      Linux took all the normal actions to offline the page from the process
      that was using it. But then guest Linux crashed because it said there
      was a second machine check inside the kernel with this stack trace:
      
      do_memory_failure
          set_mce_nospec
               set_memory_uc
                    _set_memory_uc
                         change_page_attr_set_clr
                              cpa_flush
                                   clflush_cache_range_opt
      
      This was odd, because a CLFLUSH instruction shouldn't raise a machine
      check (it isn't consuming the data). Further investigation showed that
      the VMM had passed in another machine check because it appeared that the
      guest was accessing the bad page.
      
      Fix is to check the scope of the poison by checking the MCi_MISC register.
      If the entire page is affected, then unmap the page. If only part of the
      page is affected, then mark the page as uncacheable.
      
      This assumes that VMMs will do the logical thing and pass in the "whole
      page scope" via the MCi_MISC register (since they unmapped the entire
      page).
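
The decision described above can be sketched in a few lines of C. This is only an illustration, not the kernel's code: it assumes 4 KiB pages and that the "recoverable address LSB" sits in bits 5:0 of the MCi_MISC value, and every name in it (`SKETCH_PAGE_SHIFT`, `whole_page_affected()`, `choose_poison_action()`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration: 4 KiB pages, and the "recoverable address
 * LSB" held in bits 5:0 of the MCi_MISC value. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_MISC_ADDR_LSB(misc) ((unsigned int)((misc) & 0x3f))

enum poison_action {
    POISON_SET_UNCACHEABLE, /* only part of the page is affected */
    POISON_UNMAP_PAGE,      /* the whole page is affected */
};

/* True when the reported poison granularity covers at least one full page. */
static bool whole_page_affected(uint64_t misc)
{
    return SKETCH_MISC_ADDR_LSB(misc) >= SKETCH_PAGE_SHIFT;
}

/* Pick the action the commit message describes for each scope. */
static enum poison_action choose_poison_action(uint64_t misc)
{
    return whole_page_affected(misc) ? POISON_UNMAP_PAGE
                                     : POISON_SET_UNCACHEABLE;
}
```

With these assumptions, an LSB value of 6 (a 64-byte cache line) leads to the uncacheable mapping, while an LSB value of 12 or more leads to the page being unmapped.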
      
        [ bp: Adjust to x86/entry changes. ]
      
      Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
      Reported-by: Jue Wang <juew@google.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jue Wang <juew@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
      
      
  4. 05 Jun, 2020 1 commit
  5. 03 Jun, 2020 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      NeilBrown committed
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid the deadlock that happens when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently of which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal to) the
      free-run threshold for that bdi.  This ensures it will always be
      able to have some pages in flight, and so will not deadlock.
      
      In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by global threshold, but that is acceptable as it is
      only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
      it causes attention to be focussed only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - if PF_LOCAL_THROTTLE is set, doesn't delay at all unless the
         global and the local free-run thresholds are exceeded.
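
The rule in the list above can be modelled roughly as follows. This is a sketch under stated assumptions, not the writeback code itself: the `dirty_snapshot` struct, its fields and `may_be_throttled()` are hypothetical names, and the thresholds are taken as given inputs rather than computed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of dirty-page accounting; all values in pages. */
struct dirty_snapshot {
    unsigned long global_dirty;   /* dirty pages system-wide */
    unsigned long global_freerun; /* global free-run threshold */
    unsigned long bdi_dirty;      /* dirty pages on the target bdi */
    unsigned long bdi_freerun;    /* free-run threshold of the target bdi */
};

/* Returns true when the writer reaches the regular throttling logic. */
static bool may_be_throttled(const struct dirty_snapshot *s, bool local_throttle)
{
    /* Below the global free-run threshold nobody is delayed. */
    if (s->global_dirty <= s->global_freerun)
        return false;

    /* A task with the local-throttle flag is judged by its own bdi only. */
    if (local_throttle && s->bdi_dirty <= s->bdi_freerun)
        return false;

    /* Both thresholds exceeded (or no flag): normal throttling applies. */
    return true;
}
```

In this model a daemon writing to an idle "final" bdi keeps running even when the "client" bdi has pushed the global count over its threshold, which is the deadlock case the flag exists for.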
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 28 May, 2020 1 commit
    • sched: Replace rq::wake_list · a1488664
      Peter Zijlstra committed
      The recent commit: 90b5363a ("sched: Clean up scheduler_ipi()")
      got smp_call_function_single_async() subtly wrong. Even though it will
      return -EBUSY when trying to re-use a csd, that condition is not
      atomic and still requires external serialization.
      
      The change in ttwu_queue_remote() got this wrong.
      
      While on first reading ttwu_queue_remote() has an atomic test-and-set
      that appears to serialize the use, the matching 'release' is not in
      the right place to actually guarantee this serialization.
      
      The actual race is vs the sched_ttwu_pending() call in the idle loop;
      that can run the wakeup-list without consuming the CSD.
      
      Instead of trying to chain the lists, merge them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
  7. 19 May, 2020 2 commits
  8. 12 May, 2020 1 commit
  9. 28 Apr, 2020 3 commits
    • rcu-tasks: Split ->trc_reader_need_end · 276c4104
      Paul E. McKenney committed
      This commit splits ->trc_reader_need_end by using the rcu_special union.
      This change permits readers to check to see if a memory barrier is
      required without any added overhead in the common case where no such
      barrier is required.  This commit also adds the read-side checking.
      Later commits will add the machinery to properly set the new
      ->trc_reader_special.b.need_mb field.
      
      This commit also makes rcu_read_unlock_trace_special() tolerate nested
      read-side critical sections within interrupt and NMI handlers.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks · d5f177d3
      Paul E. McKenney committed
      Because RCU does not watch exception early-entry/late-exit, idle-loop,
      or CPU-hotplug execution, protection of tracing and BPF operations is
      needlessly complicated.  This commit therefore adds a variant of
      Tasks RCU that:
      
      o	Has explicit read-side markers to allow finite grace periods in
      	the face of in-kernel loops for PREEMPT=n builds.  These markers
      	are rcu_read_lock_trace() and rcu_read_unlock_trace().
      
      o	Protects code in the idle loop, exception entry/exit, and
      	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
      	similar to SRCU, but with lighter-weight readers.
      
      o	Avoids expensive read-side instructions, having overhead similar
      	to that of Preemptible RCU.
      
      There are of course downsides:
      
      o	The grace-period code can send IPIs to CPUs, even when those
      	CPUs are in the idle loop or in nohz_full userspace.  This is
      	mitigated by later commits.
      
      o	It is necessary to scan the full tasklist, much as for Tasks RCU.
      
      o	There is a single callback queue guarded by a single lock,
      	again, much as for Tasks RCU.  However, those early use cases
      	that request multiple grace periods in quick succession are
      	expected to do so from a single task, which makes the single
      	lock almost irrelevant.  If needed, multiple callback queues
      	can be provided using any number of schemes.
      
      Perhaps most important, this variant of RCU does not affect the vanilla
      flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
      readers can operate from idle, offline, and exception entry/exit in no
      way enables rcu_preempt and rcu_sched readers to do so.
      
      The memory ordering was outlined here:
      https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
      
      This effort benefited greatly from off-list discussions of BPF
      requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
      some of the on-list discussions are captured in the Link: tags below.
      In addition, KCSAN was quite helpful in finding some early bugs.
      
      Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
      Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
      [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
      [ paulmck: Fix locking issue reported by rcutorture. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field · f0bdf6d4
      Lai Jiangshan committed
      The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
      rcu_read_unlock_special() but never set to false.  This is not
      particularly useful, so this commit removes this field.
      
      The only possible justification for this field is to ease debugging
      of RCU deferred quiescent states, but the combination of the other
      ->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
      ->rcu_read_lock_nesting should cover debugging needs.  And if this last
      proves incorrect, this patch can always be reverted, along with the
      required setting of ->rcu_read_unlock_special.b.deferred_qs to false
      in rcu_preempt_deferred_qs_irqrestore().
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  10. 02 Apr, 2020 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Eric W. Biederman committed
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in my limited testing of a
      32bit exec_id, where it can take as little as 19s to exec 65536 times.
      That means it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded in wrapping the self_exec_id in 7
      days.  Even my slower timing is within the uptime of a typical server.
      This means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
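
The 14-day figure follows directly from the quoted timing, and the same arithmetic shows why a 64bit counter is out of reach. The helper below is a small sketch of that calculation; the function name and constants are illustrative, and the only inputs are the 19s per 65536 execs reported above.

```c
#include <assert.h>
#include <stdint.h>

#define EXECS_PER_BATCH_LOG2 16   /* 65536 execs per measured batch */
#define SECONDS_PER_BATCH    19.0 /* timing quoted in the commit message */
#define SECONDS_PER_DAY      86400.0

/* Days needed to wrap a counter of the given width at the quoted exec rate. */
static double days_to_wrap(unsigned int counter_bits)
{
    assert(counter_bits >= EXECS_PER_BATCH_LOG2 && counter_bits <= 64);

    /* Number of 65536-exec batches needed: 2^(bits - 16). */
    uint64_t batches = (uint64_t)1 << (counter_bits - EXECS_PER_BATCH_LOG2);

    return (double)batches * SECONDS_PER_BATCH / SECONDS_PER_DAY;
}
```

For 32 bits this is 65536 batches of 19s, or 1245184s, which is about 14.4 days. For 64 bits the same rate gives on the order of 10^10 days.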
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  This means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing this still remains exploitable
      after this change.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  11. 21 Mar, 2020 2 commits
    • lockdep: Add hrtimer context tracing bits · 40db1739
      Sebastian Andrzej Siewior committed
      Set current->irq_config = 1 for hrtimers which are not marked to expire in
      hard interrupt context during hrtimer_init(). These timers will expire in
      softirq context on PREEMPT_RT.
      
      Setting this allows lockdep to differentiate these timers. If a timer is
      marked to expire in hard interrupt context, then its expiry callback must
      not acquire a regular spinlock; it has to use a raw_spinlock instead.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
    • lockdep: Introduce wait-type checks · de8f5e4f
      Peter Zijlstra committed
      Extend lockdep to validate lock wait-type context.
      
      The current wait-types are:
      
      	LD_WAIT_FREE,		/* wait free, rcu etc.. */
      	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
      	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
      	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */
      
      Where lockdep validates that the current lock (the one being acquired)
      fits in the current wait-context (as generated by the held stack).
      
      This ensures that there is no attempt to acquire mutexes while holding
      spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
      other words, it's a fancier might_sleep().
      
      Obviously RCU made the entire ordeal more complex than a simple single
      value test because RCU can be acquired in (pretty much) any context and
      while it presents a context to nested locks it is not the same as it
      got acquired in.
      
      Therefore it's necessary to split the wait_type into two values, one
      representing the acquire (outer) and one representing the nested context
      (inner). For most 'normal' locks these two are the same.
      
      [ To make static initialization easier we have the rule that:
        .outer == INV means .outer == .inner; because INV == 0. ]
      
      It further means that it's required to find the minimal .inner of the held
      stack to compare against the outer of the new lock; because while 'normal'
      RCU presents a CONFIG type to nested locks, if it is taken while already
      holding a SPIN type it obviously doesn't relax the rules.
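
The outer/inner rule can be illustrated with a small stand-alone model. This is a sketch rather than lockdep's implementation: `struct lock_sketch` and `wait_context_ok()` are hypothetical names, the numeric order of the enum is assumed to follow the list above with INV == 0, and the starting context is assumed to be plain task context where sleeping is allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum wait_type {
    LD_WAIT_INV = 0, /* not set; for .outer it means "same as .inner" */
    LD_WAIT_FREE,
    LD_WAIT_SPIN,
    LD_WAIT_CONFIG,
    LD_WAIT_SLEEP,
};

/* Hypothetical per-lock description: how it is acquired, what it presents. */
struct lock_sketch {
    enum wait_type outer;
    enum wait_type inner;
};

/* Check whether 'next' may be acquired while the 'held' locks are held. */
static bool wait_context_ok(const struct lock_sketch *held, size_t nheld,
                            struct lock_sketch next)
{
    enum wait_type next_outer =
        next.outer != LD_WAIT_INV ? next.outer : next.inner;
    enum wait_type curr_inner = LD_WAIT_SLEEP; /* assumed: task context */

    if (next_outer == LD_WAIT_INV)
        return true; /* unconverted lock, ignored */

    /* The most restrictive (minimal) inner type on the held stack wins. */
    for (size_t i = 0; i < nheld; i++) {
        if (held[i].inner == LD_WAIT_INV)
            continue;
        if (held[i].inner < curr_inner)
            curr_inner = held[i].inner;
    }

    return next_outer <= curr_inner;
}
```

In this model a raw spinlock is `{ LD_WAIT_INV, LD_WAIT_SPIN }`, a spinlock is `{ LD_WAIT_INV, LD_WAIT_CONFIG }` and RCU is `{ LD_WAIT_FREE, LD_WAIT_CONFIG }`. Taking RCU under a raw spinlock is accepted, but a spinlock taken afterwards is still rejected because the minimal inner type on the stack remains SPIN.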
      
      Below is an example output generated by the trivial test code:
      
        raw_spin_lock(&foo);
        spin_lock(&bar);
        spin_unlock(&bar);
        raw_spin_unlock(&foo);
      
       [ BUG: Invalid wait context ]
       -----------------------------
       swapper/0/1 is trying to lock:
       ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
       other info that might help us debug this:
       1 lock held by swapper/0/1:
        #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
      
      The way to read it is to look at the new -{n:m} part in the lock
      description: -{3:3} for the attempted lock, and try to match that up to
      the held locks, which in this case is the single one: -{2:2}.
      
      This tells us that the lock being acquired requires a more relaxed
      environment than the one presented by the lock stack.
      
      Currently only the normal locks and RCU are converted, the rest of the
      lockdep users default to .inner = INV which is ignored. More conversions
      can be done when desired.
      
      The check for spinlock_t nesting is not enabled by default. It's a separate
      config option for now as there are known problems which are currently
      being addressed. The config option makes it possible to identify these
      problems and to verify that the solutions found are indeed solving them.
      
      The config switch will be removed and the checks will be permanently
      enabled once the vast majority of issues have been addressed.
      
      [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
      	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
      [ tglx: Add the config option ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
  12. 20 Mar, 2020 1 commit
  13. 24 Feb, 2020 2 commits
  14. 26 Jan, 2020 1 commit
  15. 25 Dec, 2019 1 commit
    • rseq: Unregister rseq for clone CLONE_VM · 463f550f
      Mathieu Desnoyers committed
      It has been reported by Google that rseq is not behaving properly
      with respect to clone when CLONE_VM is used without CLONE_THREAD.
      
      It keeps the prior thread's rseq TLS registered when the TLS of the
      thread has moved, so the kernel can corrupt the TLS of the parent.
      
      The approach of clearing the per task-struct rseq registration
      on clone with CLONE_THREAD flag is incomplete. It does not cover
      the use-case of clone with CLONE_VM set, but without CLONE_THREAD.
      
      Here is the rationale for unregistering rseq on clone with CLONE_VM
      flag set:
      
      1) CLONE_THREAD requires CLONE_SIGHAND, which requires CLONE_VM to be
         set. Therefore, just checking for CLONE_VM covers all CLONE_THREAD
         uses. There is no point in checking for both CLONE_THREAD and
         CLONE_VM.
      
      2) There is the possibility of an unlikely scenario where CLONE_SETTLS
         is used without CLONE_VM. In order to be an issue, it would require
         that the rseq TLS is in a shared memory area.
      
         I do not plan on adding CLONE_SETTLS to the set of clone flags which
         unregister RSEQ, because it would require that we also unregister RSEQ
         on set_thread_area(2) and arch_prctl(2) ARCH_SET_FS for completeness.
         So rather than doing a partial solution, it appears better to let
         user-space explicitly perform rseq unregistration across clone if
         needed in scenarios where CLONE_VM is not set.
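
Point 1 of the rationale can be checked with a few lines of C. This is a sketch under stated assumptions, not the kernel's fork path: the `SK_`-prefixed constants copy the values from the Linux UAPI header `<linux/sched.h>`, and both helper functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the Linux UAPI header <linux/sched.h>. */
#define SK_CLONE_VM      0x00000100UL
#define SK_CLONE_SIGHAND 0x00000800UL
#define SK_CLONE_THREAD  0x00010000UL

/* The flag dependencies quoted above: THREAD needs SIGHAND, SIGHAND needs VM. */
static bool clone_flags_consistent(unsigned long flags)
{
    if ((flags & SK_CLONE_THREAD) && !(flags & SK_CLONE_SIGHAND))
        return false;
    if ((flags & SK_CLONE_SIGHAND) && !(flags & SK_CLONE_VM))
        return false;
    return true;
}

/* Sketch of the rule: drop the rseq registration whenever the VM is shared. */
static bool rseq_unregister_on_clone(unsigned long flags)
{
    return (flags & SK_CLONE_VM) != 0;
}
```

Any consistent flag set that contains CLONE_THREAD also contains CLONE_VM, so testing CLONE_VM alone covers both the thread case and the CLONE_VM-without-CLONE_THREAD case, while a plain fork (no flags) keeps its registration.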
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191211161713.4490-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 05 Dec, 2019 1 commit
    • kcov: remote coverage support · eec028c9
      Andrey Konovalov committed
      Patch series "kcov: collect coverage from usb and vhost", v3.
      
      This patchset extends kcov to allow collecting coverage from background
      kernel threads.  This extension requires custom annotations for each of
      the places where coverage collection is desired.  This patchset
      implements this for hub events in the USB subsystem and for vhost
      workers.  See the first patch description for details about the kcov
      extension.  The other two patches apply this kcov extension to USB and
      vhost.
      
      Examples of other subsystems that might potentially benefit from this
      when custom annotations are added (the list is based on
      process_one_work() callers for bugs recently reported by syzbot):
      
      1. fs: writeback wb_workfn() worker,
      2. net: addrconf_dad_work()/addrconf_verify_work() workers,
      3. net: neigh_periodic_work() worker,
      4. net/p9: p9_write_work()/p9_read_work() workers,
      5. block: blk_mq_run_work_fn() worker.
      
      These patches have been used to enable coverage-guided USB fuzzing with
      syzkaller for the last few years, see the details here:
      
        https://github.com/google/syzkaller/blob/master/docs/linux/external_fuzzing_usb.md
      
      This patchset has been pushed to the public Linux kernel Gerrit
      instance:
      
        https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/1524
      
      This patch (of 3):
      
      Add background thread coverage collection ability to kcov.
      
      With KCOV_ENABLE coverage is collected only for syscalls that are issued
      from the current process.  With KCOV_REMOTE_ENABLE it's possible to
      collect coverage for arbitrary parts of the kernel code, provided that
      those parts are annotated with kcov_remote_start()/kcov_remote_stop().
      
      This allows collecting coverage from two types of kernel background
      threads: the global ones, which are spawned during kernel boot in a
      limited number of instances (e.g.  one USB hub_event() worker thread is
      spawned per USB HCD); and the local ones, which are spawned when a user
      interacts with some kernel interface (e.g.  vhost workers).
      
      To enable collecting coverage from a global background thread, a unique
      global handle must be assigned and passed to the corresponding
      kcov_remote_start() call.  Then a userspace process can pass a list of
      such handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field
      of the kcov_remote_arg struct.  This will attach the used kcov device to
      the code sections that are referenced by those handles.
      
      Since there might be many local background threads spawned from
      different userspace processes, we can't use a single global handle per
      annotation.  Instead, the userspace process passes a non-zero handle
      through the common_handle field of the kcov_remote_arg struct.  This
      common handle gets saved to the kcov_handle field in the current
      task_struct and needs to be passed to the newly spawned threads via
      custom annotations.  Those threads should in turn be annotated with
      kcov_remote_start()/kcov_remote_stop().
      
      Internally kcov stores handles as u64 integers.  The top byte of a
      handle is used to denote the id of a subsystem that this handle belongs
      to, and the lower 4 bytes are used to denote the id of a thread instance
      within that subsystem.  A reserved value 0 is used as a subsystem id for
      common handles as they don't belong to a particular subsystem.  The
      bytes 4-7 are currently reserved and must be zero.  In the future the
      number of bytes used for the subsystem or handle ids might be increased.
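
The handle layout described above can be sketched with a few helpers. The names below are illustrative and are not the kernel's own macros; the sketch also reads the reserved range as the bits between the 4-byte instance id and the top byte (bits 32 to 55), which is an interpretation of the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HANDLE_SUBSYS_SHIFT  56
#define HANDLE_SUBSYS_MASK   (UINT64_C(0xff) << HANDLE_SUBSYS_SHIFT)
#define HANDLE_INSTANCE_MASK UINT64_C(0xffffffff)
/* Everything that is neither subsystem id nor instance id must be zero. */
#define HANDLE_RESERVED_MASK (~(HANDLE_SUBSYS_MASK | HANDLE_INSTANCE_MASK))

/* Pack a subsystem id (top byte) and an instance id (low 4 bytes). */
static uint64_t make_handle(uint8_t subsys_id, uint32_t instance_id)
{
    return ((uint64_t)subsys_id << HANDLE_SUBSYS_SHIFT) | instance_id;
}

static uint8_t handle_subsys(uint64_t handle)
{
    return (uint8_t)(handle >> HANDLE_SUBSYS_SHIFT);
}

static uint32_t handle_instance(uint64_t handle)
{
    return (uint32_t)(handle & HANDLE_INSTANCE_MASK);
}

/* A handle is well formed only if the reserved bits are all zero. */
static bool handle_is_valid(uint64_t handle)
{
    return (handle & HANDLE_RESERVED_MASK) == 0;
}

/* Subsystem id 0 is reserved for common handles. */
static bool handle_is_common(uint64_t handle)
{
    return handle_subsys(handle) == 0;
}
```

For example, `make_handle(1, 7)` yields `0x0100000000000007`, while any value with a bit set in the reserved range fails `handle_is_valid()`.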
      
      When a particular userspace process collects coverage via a common
      handle, kcov will collect coverage for each code section that is
      annotated to use the common handle obtained as kcov_handle from the
      current task_struct.  However, non-common handles allow collecting
      coverage selectively from different subsystems.
      
      Link: http://lkml.kernel.org/r/e90e315426a384207edbec1d6aa89e43008e4caf.1572366574.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: David Windsor <dwindsor@gmail.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 20 Nov, 2019 2 commits
  18. 16 Nov, 2019 1 commit
  19. 13 Nov, 2019 1 commit
  20. 30 Oct, 2019 1 commit
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe committed
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the lifetime of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 29 Oct, 2019 3 commits
  22. 17 Oct, 2019 1 commit
    • arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled · 19c95f26
      Julien Thierry committed
      Preempting from IRQ-return means that the task has its PSTATE saved
      on the stack, which will get restored when the task is resumed and does
      the actual IRQ return.
      
      However, enabling some CPU features requires modifying the PSTATE. This
      means that, if a task was scheduled out during an IRQ-return before all
      CPU features are enabled, the task might restore a PSTATE that does not
      include the feature enablement changes once scheduled back in.
      
      * Task 1:
      
      PAN == 0 ---|                          |---------------
                  |                          |<- return from IRQ, PSTATE.PAN = 0
                  | <- IRQ                   |
                  +--------+ <- preempt()  +--
                                           ^
                                           |
                                           reschedule Task 1, PSTATE.PAN == 1
      * Init:
              --------------------+------------------------
                                  ^
                                  |
                                  enable_cpu_features
                                  set PSTATE.PAN on all CPUs
      
      Worse than this, since PSTATE is untouched when task switching is done,
      a task missing the new bits in PSTATE might affect another task, if both
      do direct calls to schedule() (outside of IRQ/exception contexts).
      
      Fix this by preventing preemption on IRQ-return until features are
      enabled on all CPUs.
      
      This way the only PSTATE values that are saved on the stack are from
      synchronous exceptions. These are expected to be fatal this early; the
      exception is BRK for WARN_ON(), but as this uses do_debug_exception(),
      which keeps IRQs masked, it shouldn't call schedule().
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      [james: Replaced a really cool hack, with an even simpler static key in C.
       expanded commit message with Julien's cover-letter ascii art]
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
  23. 25 Sep, 2019 1 commit
  24. 18 Sep, 2019 1 commit
  25. 07 Sep, 2019 1 commit
    • kernel.h: Add non_block_start/end() · 312364f3
      Daniel Vetter committed
      In some special cases we must not block, but there's not a spinlock,
      preempt-off, irqs-off or similar critical section already that arms the
      might_sleep() debug checks. Add a non_block_start/end() pair to annotate
      these.
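
The idea behind the annotation pair can be modelled in user space as a nesting counter that a debug check consults. This is only an illustrative sketch: the single global counter stands in for per-task state, `blocking_allowed()` is a hypothetical stand-in for the might_sleep() debug check, and none of this is the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a per-task nesting counter (a single global in this sketch). */
static int non_block_count;

/* Enter a section in which blocking is forbidden. */
static void non_block_start(void)
{
    non_block_count++;
}

/* Leave the section; an unbalanced call trips the assertion. */
static void non_block_end(void)
{
    assert(non_block_count > 0);
    non_block_count--;
}

/* Hypothetical debug check: true when blocking would be acceptable here. */
static bool blocking_allowed(void)
{
    return non_block_count == 0;
}
```

Using a counter rather than a flag lets annotated sections nest, so an inner non_block_end() does not re-enable blocking for an outer section that is still open.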
      
      This will be used in the oom paths of mmu-notifiers, where blocking is not
      allowed to make sure there's forward progress. Quoting Michal:
      
      "The notifier is called from quite a restricted context - oom_reaper -
      which shouldn't depend on any locks or sleepable conditionals. The code
      should be swift as well but we mostly do care about it to make a forward
      progress. Checking for sleepable context is the best thing we could come
      up with that would describe these demands at least partially."
      
      Peter also asked whether we want to catch spinlocks on top, but Michal
      said those are less of a problem because spinlocks can't have an indirect
      dependency upon the page allocator and hence close the loop with the oom
      reaper.
      
      Suggested by Michal Hocko.
      
      Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
      Acked-by: Christian König <christian.koenig@amd.com> (v1)
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  26. 28 Aug, 2019 3 commits
  27. 01 Aug, 2019 1 commit
  28. 25 Jul, 2019 2 commits