1. 02 December 2020, 1 commit
    • kernel: Implement selective syscall userspace redirection · 1446e1df
      Authored by Gabriel Krisman Bertazi
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but which need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
      
      The range [<offset>,<offset>+<length>) is a part of the process memory
      map that is allowed to by-pass the redirection code and dispatch
      syscalls directly, such that in fast paths a process doesn't need to
      disable the trap nor the kernel has to check the selector.  This is
      essential to return from SIGSYS to a blocked area without triggering
      another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that has a key switch for the mechanism. This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
      redirection without calling the kernel.
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
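
      A minimal userspace sketch, based only on the interface described above
      (the dispatcher region bounds here are hypothetical markers, and the
      final uapi constant names may differ from this changelog):

        #include <sys/prctl.h>          /* PR_* constants; older toolchains
                                           may need <linux/prctl.h> instead */

        extern char dispatch_start[], dispatch_end[];   /* hypothetical region */
        static volatile char selector;                  /* the key switch */

        static void setup_dispatch(void)
        {
                /* Syscalls issued from [dispatch_start, dispatch_end) always
                 * bypass the trap, so rt_sigreturn from the SIGSYS handler
                 * must be issued from inside this range. */
                prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                      (unsigned long)dispatch_start,
                      (unsigned long)(dispatch_end - dispatch_start),
                      &selector);
        }

        static void boundary_crossing(void)
        {
                selector = PR_SYS_DISPATCH_ON;    /* trap syscalls to SIGSYS */
                /* ... run code whose syscalls must be emulated ... */
                selector = PR_SYS_DISPATCH_OFF;   /* native syscalls again */
        }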
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  I avoided using seccomp,
      even though it duplicates some functionality, due to previous feedback
      that maybe it shouldn't mix with seccomp since it is not a security
      mechanism.  And obviously, this should never be considered a security
      mechanism, since any part of the program can by-pass it by using the
      syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead using only the direct dispatcher region to issue syscalls is
      pretty much irrelevant.  The overhead of using the selector is around
      40ns for a native (unredirected) syscall on my system, and it is (as
      expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
      1446e1df
  2. 17 October 2020, 1 commit
  3. 14 October 2020, 1 commit
  4. 07 October 2020, 1 commit
    • x86/mce: Recover from poison found while copying from user space · c0ab7ffc
      Authored by Tony Luck
      Existing kernel code can only recover from a machine check on code that
      is tagged in the exception table with a fault handling recovery path.
      
      Add two new fields to the task structure to pass information from the
      machine check handler to the "task_work" that is queued to run before
      the task returns to user mode:
      
      + mce_vaddr: will be initialized to the user virtual address of the fault
        in the case where the fault occurred in the kernel copying data from
        a user address.  This is so that kill_me_maybe() can provide that
        information to the user SIGBUS handler.
      
      + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
        to determine if mce_vaddr is applicable to this error.
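
      As a rough illustration, the additions to the task structure look like
      this (the exact types and config guard are an assumption drawn from the
      description above, not copied from the patch):

        struct task_struct {
                /* ... */
        #ifdef CONFIG_X86_MCE
                void __user     *mce_vaddr;     /* user address being copied
                                                   when the poison was hit   */
                __u64           mce_kflags;     /* snapshot of mce.kflags for
                                                   kill_me_maybe()           */
        #endif
                /* ... */
        };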
      
      Add code to recover from a machine check while copying data from user
      space to the kernel. Action for this case is the same as if the user
      touched the poison directly; unmap the page and send a SIGBUS to the task.
      
      Use a new helper function to share common code between the "fault
      in user mode" case and the "fault while copying from user" case.
      
      New code paths will be activated by the next patch which sets
      MCE_IN_KERNEL_COPYIN.
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
      c0ab7ffc
  5. 03 October 2020, 1 commit
  6. 01 October 2020, 1 commit
    • io_uring: don't rely on weak ->files references · 0f212204
      Authored by Jens Axboe
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
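
      One way to picture the per-task note (purely illustrative; the names and
      layout here are hypothetical, not the actual io_uring structures):

        /* Hypothetical per-task bookkeeping of io_uring usage. */
        struct io_uring_task_note {
                struct xarray   ctxs;           /* io_uring contexts this task
                                                   has submitted requests to  */
                atomic_t        inflight;       /* requests still in flight   */
        };

        /* On exec or exit, walk ->ctxs and cancel this task's requests, then
         * drop the files_struct reference. */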
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f212204
  7. 26 August 2020, 2 commits
  8. 06 August 2020, 2 commits
    • posix-cpu-timers: Provide mechanisms to defer timer handling to task_work · 1fb497dd
      Authored by Thomas Gleixner
      Running posix CPU timers in hard interrupt context has a few downsides:
      
       - For PREEMPT_RT it cannot work as the expiry code needs to take
         sighand lock, which is a 'sleeping spinlock' in RT. The original RT
         approach of offloading the posix CPU timer handling into a high
         priority thread was clumsy and provided no real benefit in general.
      
       - For fine grained accounting it's just wrong to run this in the context
         of the timer interrupt because that way process specific CPU time is
         accounted to the timer interrupt.
      
       - Long running timer interrupts caused by a large number of expiring
         timers which can be created and armed by unprivileged user space.
      
      There is no hard requirement to expire them in interrupt context.
      
      If the signal is targeted at the task itself then it won't be delivered
      before the task returns to user space anyway. If the signal is targeted at
      a supervisor process then it might be slightly delayed, but posix CPU
      timers are inaccurate anyway due to the fact that they are tied to the
      tick.
      
      Provide infrastructure to schedule task work which allows splitting the
      posix CPU timer code into a quick check in interrupt context and a thread
      context expiry and signal delivery function. This has to be enabled by
      architectures as it requires that the architecture specific KVM
      implementation handles pending task work before exiting to guest mode.
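
      A rough sketch of that split (function and field names here are
      illustrative, not the actual patch):

        /* Thread context: safe to take sighand lock and deliver signals. */
        static void posix_cpu_timers_work(struct callback_head *work)
        {
                handle_posix_cpu_timers(current);
        }

        /* Tick / hard interrupt context: only a cheap "anything due?" check,
         * then defer the heavy lifting until the task returns to user space
         * (or, with the architecture support mentioned above, before entering
         * guest mode). */
        static void run_posix_cpu_timers(void)
        {
                struct task_struct *tsk = current;

                if (!posix_cpu_timers_due(tsk))         /* quick check only */
                        return;

                task_work_add(tsk, &tsk->posix_cputimers_work.work, TWA_RESUME);
        }
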
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
      1fb497dd
    • locking/seqlock, headers: Untangle the spaghetti monster · 0cd39f46
      Authored by Peter Zijlstra
      By using lockdep_assert_*() from seqlock.h, the spaghetti monster
      attacked.
      
      Attack back by reducing seqlock.h dependencies from two key high level headers:
      
       - <linux/seqlock.h>:               -Remove <linux/ww_mutex.h>
       - <linux/time.h>:                  -Remove <linux/seqlock.h>
       - <linux/sched.h>:                 +Add    <linux/seqlock.h>
      
      The price was to add it to sched.h ...
      
      Core header fallout, we add direct header dependencies instead of gaining them
      parasitically from higher level headers:
      
       - <linux/dynamic_queue_limits.h>:  +Add <asm/bug.h>
       - <linux/hrtimer.h>:               +Add <linux/seqlock.h>
       - <linux/ktime.h>:                 +Add <asm/bug.h>
       - <linux/lockdep.h>:               +Add <linux/smp.h>
       - <linux/sched.h>:                 +Add <linux/seqlock.h>
       - <linux/videodev2.h>:             +Add <linux/kernel.h>
      
      Arch headers fallout:
      
       - PARISC: <asm/timex.h>:           +Add <asm/special_insns.h>
       - SH:     <asm/io.h>:              +Add <asm/page.h>
       - SPARC:  <asm/timer_64.h>:        +Add <uapi/asm/asi.h>
       - SPARC:  <asm/vvar.h>:            +Add <asm/processor.h>, <asm/barrier.h>
                                          -Remove <linux/seqlock.h>
       - X86:    <asm/fixmap.h>:          +Add <asm/pgtable_types.h>
                                          -Remove <asm/acpi.h>
      
      There's also a bunch of parasitic header dependency fallout in .c files, not listed
      separately.
      
      [ mingo: Extended the changelog, split up & fixed the original patch. ]
      Co-developed-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200804133438.GK2674@hirez.programming.kicks-ass.net
      0cd39f46
  9. 31 July 2020, 2 commits
  10. 29 July 2020, 2 commits
    • sched: tasks: Use sequence counter with associated spinlock · b7505861
      Authored by Ahmed S. Darwish
      A sequence counter write side critical section must be protected by some
      form of locking to serialize writers. A plain seqcount_t does not
      contain the information of which lock must be held when entering a write
      side critical section.
      
      Use the new seqcount_spinlock_t data type, which allows associating a
      spinlock with the sequence counter. This enables lockdep to verify that
      the spinlock used for writer serialization is held when the write side
      critical section is entered.
      
      If lockdep is disabled this lock association is compiled out and has
      neither storage size nor runtime overhead.
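
      A minimal usage sketch of the new type (the protected data here is
      illustrative):

        #include <linux/spinlock.h>
        #include <linux/seqlock.h>

        static DEFINE_SPINLOCK(stats_lock);
        static seqcount_spinlock_t stats_seq =
                SEQCNT_SPINLOCK_ZERO(stats_seq, &stats_lock);
        static u64 stats_value;

        /* Writer: lockdep can now verify that stats_lock is held here. */
        static void stats_update(u64 v)
        {
                spin_lock(&stats_lock);
                write_seqcount_begin(&stats_seq);
                stats_value = v;
                write_seqcount_end(&stats_seq);
                spin_unlock(&stats_lock);
        }

        /* Lockless reader: unchanged from plain seqcount_t usage. */
        static u64 stats_read(void)
        {
                unsigned int seq;
                u64 v;

                do {
                        seq = read_seqcount_begin(&stats_seq);
                        v = stats_value;
                } while (read_seqcount_retry(&stats_seq, seq));

                return v;
        }
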
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de
      b7505861
    • sched/uclamp: Add a new sysctl to control RT default boost value · 13685c4a
      Authored by Qais Yousef
      RT tasks by default run at the highest capacity/performance level. When
      uclamp is selected this default behavior is retained by enforcing the
      requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
      uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
      value.
      
      This is also referred to as 'the default boost value of RT tasks'.
      
      See commit 1a00d999 ("sched/uclamp: Set default clamps for RT tasks").
      
      On battery powered devices, it is desired to control this default
      (currently hardcoded) behavior at runtime to reduce energy consumed by
      RT tasks.
      
      For example, for a mobile device manufacturer where the big.LITTLE
      architecture is dominant, the performance of the little cores varies
      across SoCs, and on high end ones the big cores could be too power hungry.
      
      Given the diversity of SoCs, the new knob allows manufacturers to tune
      the best performance/power for RT tasks for the particular hardware they
      run on.
      
      They could opt to further tune the value when the user selects
      a different power saving mode or when the device is actively charging.
      
      The runtime aspect of it further helps in creating a single kernel image
      that can be run on multiple devices that require different tuning.
      
      Keep in mind that a lot of RT tasks in the system are created by the
      kernel. On Android for instance I can see over 50 RT tasks, only
      a handful of which are created by the Android framework.
      
      To let system admins and device integrators control the default behavior
      globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
      to change the default boost value of the RT tasks.
      
      I anticipate this to be mostly in the form of modifying the init script
      of a particular device.
      
      To avoid polluting the fast path with unnecessary code, the approach
      taken is to synchronously do the update by traversing all the existing
      tasks in the system. This could race with a concurrent fork(), which is
      dealt with by introducing a sched_post_fork() function that ensures the
      racing fork gets the right update applied.
      
      Tested on Juno-r2 in combination with the RT capacity awareness [1].
      By default an RT task will go to the highest capacity CPU and run at the
      maximum frequency, which is particularly energy inefficient on high end
      mobile devices because the biggest core[s] are 'huge' and power hungry.
      
      With this patch the RT task can be controlled to run anywhere by
      default, and doesn't cause the frequency to be maximum all the time.
      Yet any task that really needs to be boosted can easily escape this
      default behavior by modifying its requested uclamp.min value
      (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
      
      [1] 804d402f: ("sched/rt: Make RT capacity-aware")
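
      A hedged sketch of both knobs follows: the procfs path below is an
      assumption based on the sysctl name, and the struct layout mirrors the
      sched_setattr() uapi:

        /* Global default for RT tasks, e.g. from a device init script
         * (utilization clamps range from 0 to 1024):
         *
         *     echo 128 > /proc/sys/kernel/sched_util_clamp_min_rt_default
         */

        #define _GNU_SOURCE
        #include <stdint.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/sched.h>        /* SCHED_FLAG_UTIL_CLAMP_MIN, ...  */

        struct sched_attr {             /* layout from linux/sched/types.h */
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime, sched_deadline, sched_period;
                uint32_t sched_util_min, sched_util_max;
        };

        /* Per-task escape hatch: an RT task that really needs the boost can
         * raise its own requested uclamp.min, overriding the new default. */
        static int request_min_boost(uint32_t util_min)
        {
                struct sched_attr attr = {
                        .size           = sizeof(attr),
                        .sched_flags    = SCHED_FLAG_KEEP_ALL |
                                          SCHED_FLAG_UTIL_CLAMP_MIN,
                        .sched_util_min = util_min,     /* 0..1024 */
                };

                return syscall(SYS_sched_setattr, 0, &attr, 0);
        }
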
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
      13685c4a
  11. 28 July 2020, 1 commit
  12. 22 July 2020, 1 commit
  13. 10 July 2020, 1 commit
  14. 08 July 2020, 3 commits
  15. 04 July 2020, 1 commit
  16. 28 June 2020, 2 commits
  17. 15 June 2020, 2 commits
    • sched: Remove sched_set_*() return value · 8b700983
      Authored by Peter Zijlstra
      Ingo suggested that since the new sched_set_*() functions are
      implemented using the 'nocheck' variants, they really shouldn't ever
      fail, so remove the return value.
      
      Cc: axboe@kernel.dk
      Cc: daniel.lezcano@linaro.org
      Cc: sudeep.holla@arm.com
      Cc: airlied@redhat.com
      Cc: broonie@kernel.org
      Cc: paulmck@kernel.org
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      8b700983
    • sched: Provide sched_set_fifo() · 7318d4cc
      Authored by Peter Zijlstra
      SCHED_FIFO (or any static priority scheduler) is a broken scheduler
      model; it is fundamentally incapable of resource management, the one
      thing an OS is actually supposed to do.
      
      It is impossible to compose static priority workloads. One cannot take
      two well designed and functional static priority workloads and mash
      them together and still expect them to work.
      
      Therefore it doesn't make sense to expose the priority field; the
      kernel is fundamentally incapable of setting a sensible value, it
      needs systems knowledge that it doesn't have.
      
      Take away sched_setscheduler() / sched_setattr() from modules and
      replace them with:
      
        - sched_set_fifo(p); create a FIFO task (at prio 50)
        - sched_set_fifo_low(p); create a task higher than NORMAL,
      	which ends up being a FIFO task at prio 1.
        - sched_set_normal(p, nice); (re)set the task to normal
      
      This stops the proliferation of randomly chosen, and irrelevant, FIFO
      priorities that don't really mean anything anyway.
      
      The system administrator/integrator, whoever has insight into the
      actual system design and requirements (userspace), can set up
      appropriate priorities if and when needed.
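
      A typical driver-side conversion then looks roughly like this
      (illustrative only):

        #include <linux/sched.h>
        #include <uapi/linux/sched/types.h>     /* struct sched_param */

        /* Before: the module picks an arbitrary "magic" FIFO priority. */
        static void worker_set_prio_old(struct task_struct *p)
        {
                struct sched_param param = { .sched_priority = 42 };

                sched_setscheduler(p, SCHED_FIFO, &param);
        }

        /* After: express the intent only; the core owns the actual value. */
        static void worker_set_prio_new(struct task_struct *p)
        {
                sched_set_fifo(p);      /* or sched_set_fifo_low(p), or      */
                                        /* sched_set_normal(p, 0) to revert  */
        }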
      
      Cc: airlied@redhat.com
      Cc: alexander.deucher@amd.com
      Cc: awalls@md.metrocast.net
      Cc: axboe@kernel.dk
      Cc: broonie@kernel.org
      Cc: daniel.lezcano@linaro.org
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: herbert@gondor.apana.org.au
      Cc: hverkuil@xs4all.nl
      Cc: john.stultz@linaro.org
      Cc: nico@fluxnic.net
      Cc: paulmck@kernel.org
      Cc: rafael.j.wysocki@intel.com
      Cc: rmk+kernel@arm.linux.org.uk
      Cc: sudeep.holla@arm.com
      Cc: tglx@linutronix.de
      Cc: ulf.hansson@linaro.org
      Cc: wim@linux-watchdog.org
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
      7318d4cc
  18. 11 June 2020, 1 commit
    • x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned · 17fae129
      Authored by Tony Luck
      An interesting thing happened when a guest Linux instance took a machine
      check. The VMM unmapped the bad page from guest physical space and
      passed the machine check to the guest.
      
      Linux took all the normal actions to offline the page from the process
      that was using it. But then guest Linux crashed because it said there
      was a second machine check inside the kernel with this stack trace:
      
      do_memory_failure
          set_mce_nospec
               set_memory_uc
                    _set_memory_uc
                         change_page_attr_set_clr
                              cpa_flush
                                   clflush_cache_range_opt
      
      This was odd, because a CLFLUSH instruction shouldn't raise a machine
      check (it isn't consuming the data). Further investigation showed that
      the VMM had passed in another machine check because it appeared that the
      guest was accessing the bad page.
      
      Fix is to check the scope of the poison by checking the MCi_MISC register.
      If the entire page is affected, then unmap the page. If only part of the
      page is affected, then mark the page as uncacheable.
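
      A hedged sketch of that scope check (the SDM defines bits 5:0 of
      MCi_MISC as the recoverable-address LSB; the helper name is
      illustrative):

        /* True if the reported blast radius covers at least one whole page,
         * i.e. the recoverable address LSB is >= PAGE_SHIFT. */
        static bool mce_whole_page(u64 mci_misc)
        {
                return (mci_misc & 0x3f) >= PAGE_SHIFT;
        }

        /* whole page affected   -> unmap the page
         * partial page affected -> mark the page uncacheable instead */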
      
      This assumes that VMMs will do the logical thing and pass in the "whole
      page scope" via the MCi_MISC register (since they unmapped the entire
      page).
      
        [ bp: Adjust to x86/entry changes. ]
      
      Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
      Reported-by: Jue Wang <juew@google.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jue Wang <juew@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
      
      
      17fae129
  19. 05 June 2020, 1 commit
  20. 03 June 2020, 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      Authored by NeilBrown
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid deadlocks that happen when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently of which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal to) the
      free-run threshold for that bdi.  This ensures it will always be
      able to have some pages in flight, and so will not deadlock.
      
      In a steady state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by the global threshold, but that is acceptable as it
      is only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when the target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
      it causes attention to be focused only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
         global and the local free-run thresholds are exceeded.
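
      Roughly, the new exemption behaves like this (a sketch; names
      approximate the writeback code, and dirty_freerun_ceiling() is the
      midpoint of a threshold and its background threshold):

        /* A PF_LOCAL_THROTTLE task is not delayed while its target wb/bdi is
         * at or below its own free-run ceiling, so it can always keep some
         * writes in flight and the loop-back deadlock cannot form. */
        static bool local_throttle_exempt(unsigned long wb_dirty,
                                          unsigned long wb_thresh,
                                          unsigned long wb_bg_thresh)
        {
                if (!(current->flags & PF_LOCAL_THROTTLE))
                        return false;

                return wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);
        }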
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a37b0715
  21. 28 May 2020, 1 commit
    • sched: Replace rq::wake_list · a1488664
      Authored by Peter Zijlstra
      The recent commit: 90b5363a ("sched: Clean up scheduler_ipi()")
      got smp_call_function_single_async() subtly wrong. Even though it will
      return -EBUSY when trying to re-use a csd, that condition is not
      atomic and still requires external serialization.
      
      The change in ttwu_queue_remote() got this wrong.
      
      While on first reading ttwu_queue_remote() has an atomic test-and-set
      that appears to serialize the use, the matching 'release' is not in
      the right place to actually guarantee this serialization.
      
      The actual race is vs the sched_ttwu_pending() call in the idle loop;
      that can run the wakeup-list without consuming the CSD.
      
      Instead of trying to chain the lists, merge them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
      a1488664
  22. 19 May 2020, 2 commits
  23. 12 May 2020, 1 commit
  24. 28 April 2020, 3 commits
    • rcu-tasks: Split ->trc_reader_need_end · 276c4104
      Authored by Paul E. McKenney
      This commit splits ->trc_reader_need_end by using the rcu_special union.
      This change permits readers to check to see if a memory barrier is
      required without any added overhead in the common case where no such
      barrier is required.  This commit also adds the read-side checking.
      Later commits will add the machinery to properly set the new
      ->trc_reader_special.b.need_mb field.
      
      This commit also makes rcu_read_unlock_trace_special() tolerate nested
      read-side critical sections within interrupt and NMI handlers.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      276c4104
    • rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks · d5f177d3
      Authored by Paul E. McKenney
      Because RCU does not watch exception early-entry/late-exit, idle-loop,
      or CPU-hotplug execution, protection of tracing and BPF operations is
      needlessly complicated.  This commit therefore adds a variant of
      Tasks RCU that:
      
      o	Has explicit read-side markers to allow finite grace periods in
      	the face of in-kernel loops for PREEMPT=n builds.  These markers
      	are rcu_read_lock_trace() and rcu_read_unlock_trace().
      
      o	Protects code in the idle loop, exception entry/exit, and
      	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
      	similar to SRCU, but with lighter-weight readers.
      
      o	Avoids expensive read-side instruction, having overhead similar
      	to that of Preemptible RCU.
      
      There are of course downsides:
      
      o	The grace-period code can send IPIs to CPUs, even when those
      	CPUs are in the idle loop or in nohz_full userspace.  This is
      	mitigated by later commits.
      
      o	It is necessary to scan the full tasklist, much as for Tasks RCU.
      
      o	There is a single callback queue guarded by a single lock,
      	again, much as for Tasks RCU.  However, those early use cases
      	that request multiple grace periods in quick succession are
      	expected to do so from a single task, which makes the single
      	lock almost irrelevant.  If needed, multiple callback queues
      	can be provided using any number of schemes.
      
      Perhaps most important, this variant of RCU does not affect the vanilla
      flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
      readers can operate from idle, offline, and exception entry/exit in no
      way enables rcu_preempt and rcu_sched readers to do so.
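
      A minimal reader/updater sketch using the new markers (the hook object
      and its mutex are hypothetical):

        #include <linux/mutex.h>
        #include <linux/rcupdate_trace.h>
        #include <linux/slab.h>

        struct trace_hook {                     /* hypothetical hook object */
                void (*func)(void *data);
                void *data;
        };
        static struct trace_hook __rcu *active_hook;
        static DEFINE_MUTEX(hook_mutex);

        /* Reader: cheap, and legal from idle, exception entry/exit and
         * CPU-hotplug paths, unlike vanilla RCU. */
        static void call_active_hook(void)
        {
                struct trace_hook *h;

                rcu_read_lock_trace();
                h = rcu_dereference_raw(active_hook);
                if (h)
                        h->func(h->data);
                rcu_read_unlock_trace();
        }

        /* Updater: wait for all trace readers before freeing the old hook. */
        static void replace_active_hook(struct trace_hook *new)
        {
                struct trace_hook *old;

                mutex_lock(&hook_mutex);
                old = rcu_replace_pointer(active_hook, new,
                                          lockdep_is_held(&hook_mutex));
                mutex_unlock(&hook_mutex);

                synchronize_rcu_tasks_trace();
                kfree(old);
        }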
      
      The memory ordering was outlined here:
      https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
      
      This effort benefited greatly from off-list discussions of BPF
      requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
      some of the on-list discussions are captured in the Link: tags below.
      In addition, KCSAN was quite helpful in finding some early bugs.
      
      Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
      Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
      [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
      [ paulmck: Fix locking issue reported by rcutorture. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      d5f177d3
    • rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field · f0bdf6d4
      Authored by Lai Jiangshan
      The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
      rcu_read_unlock_special() but never set to false.  This is not
      particularly useful, so this commit removes this field.
      
      The only possible justification for this field is to ease debugging
      of RCU deferred quiescent states, but the combination of the other
      ->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
      ->rcu_read_lock_nesting should cover debugging needs.  And if this last
      proves incorrect, this patch can always be reverted, along with the
      required setting of ->rcu_read_unlock_special.b.deferred_qs to false
      in rcu_preempt_deferred_qs_irqrestore().
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      f0bdf6d4
  25. 02 April 2020, 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Authored by Eric W. Biederman
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in my limited testing of a
      32bit exec_id, where it can take as little as 19s to exec 65536 times.
      Which means that it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded in wrapping the self_exe_id in 7
      days.  Even my slower timing is within the uptime of a typical server.
      Which means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  Which means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing after this change this still
      remains exploitable.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
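
      The pairing then looks roughly like this (a sketch of the two sites;
      the surrounding code is elided):

        /* exec path: bump the 64bit exec generation, no lock held */
        WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);

        /* do_notify_parent(): if the parent has exec'd (and possibly changed
         * credentials) since this child was forked, fall back to SIGCHLD
         * instead of the originally requested exit signal. */
        if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
                sig = SIGCHLD;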
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      d1e7fd64
  26. 21 March 2020, 2 commits
    • lockdep: Add hrtimer context tracing bits · 40db1739
      Authored by Sebastian Andrzej Siewior
      Set current->irq_config = 1 for hrtimers which are not marked to expire in
      hard interrupt context during hrtimer_init(). These timers will expire in
      softirq context on PREEMPT_RT.
      
      Setting this allows lockdep to differentiate these timers. If a timer is
      marked to expire in hard interrupt context then the timer callback must
      not acquire a regular spinlock; only a raw_spinlock may be taken in the
      expiry callback.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
      40db1739
    • lockdep: Introduce wait-type checks · de8f5e4f
      Authored by Peter Zijlstra
      Extend lockdep to validate lock wait-type context.
      
      The current wait-types are:
      
      	LD_WAIT_FREE,		/* wait free, rcu etc.. */
      	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
      	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
      	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */
      
      Where lockdep validates that the current lock (the one being acquired)
      fits in the current wait-context (as generated by the held stack).
      
      This ensures that there is no attempt to acquire mutexes while holding
      spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
      other words, it's a fancier might_sleep().
      
      Obviously RCU made the entire ordeal more complex than a simple single
      value test because RCU can be acquired in (pretty much) any context and
      while it presents a context to nested locks it is not the same as it
      got acquired in.
      
      Therefore it's necessary to split the wait_type into two values, one
      representing the acquire (outer) and one representing the nested context
      (inner). For most 'normal' locks these two are the same.
      
      [ To make static initialization easier we have the rule that:
        .outer == INV means .outer == .inner; because INV == 0. ]
      
      It further means that it's required to find the minimal .inner of the held
      stack to compare against the outer of the new lock; because while 'normal'
      RCU presents a CONFIG type to nested locks, if it is taken while already
      holding a SPIN type it obviously doesn't relax the rules.
      
      Below is an example output generated by the trivial test code:
      
        raw_spin_lock(&foo);
        spin_lock(&bar);
        spin_unlock(&bar);
        raw_spin_unlock(&foo);
      
       [ BUG: Invalid wait context ]
       -----------------------------
       swapper/0/1 is trying to lock:
       ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
       other info that might help us debug this:
       1 lock held by swapper/0/1:
        #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
      
      The way to read it is to look at the new -{n,m} part in the lock
      description; -{3:3} for the attempted lock, and try and match that up to
      the held locks, which in this case is the one: -{2:2}.
      
      This tells that the acquiring lock requires a more relaxed environment than
      presented by the lock stack.
      
      Currently only the normal locks and RCU are converted, the rest of the
      lockdep users defaults to .inner = INV which is ignored. More conversions
      can be done when desired.
      
      The check for spinlock_t nesting is not enabled by default. It's a separate
      config option for now as there are known problems which are currently
      being addressed. The config option allows identifying these problems and
      verifying that the solutions found are indeed solving them.
      
      The config switch will be removed and the checks will be permanently
      enabled once the vast majority of issues have been addressed.
      
      [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
      	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
      [ tglx: Add the config option ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
      de8f5e4f
  27. 20 March 2020, 1 commit
  28. 24 February 2020, 1 commit
    • sched/pelt: Add a new runnable average signal · 9f683953
      Authored by Vincent Guittot
      Now that runnable_load_avg has been removed, we can replace it by a new
      signal that will highlight the runnable pressure on a cfs_rq. This signal
      tracks the waiting time of tasks on the rq and can help to better define
      the state of rqs.
      
      For now, only util_avg is used to define the state of a rq:
        A rq with more than around 80% utilization and more than 1 task is
        considered overloaded.

      But the util_avg signal of a rq can become temporarily low after a task
      has migrated onto another rq, which can bias the classification of the rq.
      
      When tasks compete for the same rq, their runnable average signal will be
      higher than util_avg as it will include the waiting time and we can use
      this signal to better classify cfs_rqs.
      
      The new runnable_avg will track the runnable time of a task, which simply
      adds the waiting time to the running time. The runnable_avg of a cfs_rq
      will be the /Sum of its se's runnable_avg, and the runnable_avg of a group
      entity will follow the one of the rq, similarly to util_avg.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net
      9f683953