1. 05 Apr, 2022 1 commit
  2. 31 Mar, 2022 1 commit
  3. 23 Mar, 2022 1 commit
  4. 18 Mar, 2022 1 commit
  5. 04 Mar, 2022 1 commit
    • signal, x86: Delay calling signals in atomic on RT enabled kernels · bf9ad37d
      Authored by Oleg Nesterov
      On x86_64 we must disable preemption before we enable interrupts
      for stack faults, int3 and debugging, because the current task is using
      a per CPU debug stack defined by the IST. If we schedule out, another task
      can come in and use the same stack and cause the stack to be corrupted
      and crash the kernel on return.
      
      When CONFIG_PREEMPT_RT is enabled, spinlock_t locks become sleeping, and
      one of these is the spin lock used in signal handling.
      
      Some of the debug code (int3) causes do_trap() to send a signal.
      This function acquires a spinlock_t that has been converted to a
      sleeping lock. If this happens, the corrupted-stack issue described
      above becomes possible.
      
      Instead of sending the signal right away, on PREEMPT_RT and x86 the
      signal information is stored in the task's task_struct and
      TIF_NOTIFY_RESUME is set. Then, on exit from the trap, the resume code
      sends the signal once preemption is enabled.
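
      A rough sketch of that delayed-send mechanism (the forced_info field and
      the ARCH_RT_DELAYS_SIGNAL_SEND guard follow the patch description; this
      is simplified and not the exact kernel code):

          #ifdef ARCH_RT_DELAYS_SIGNAL_SEND
          /* Trap context: too atomic to take the sleeping sighand lock, so
           * park the siginfo in the task and request notify-resume. */
          static void delay_signal(struct kernel_siginfo *info)
          {
                  current->forced_info = *info;
                  set_thread_flag(TIF_NOTIFY_RESUME);
          }

          /* Exit-to-user (notify-resume) path, preemption enabled again:
           * actually deliver the parked signal to the current task. */
          static void send_delayed_signal(void)
          {
                  if (unlikely(current->forced_info.si_signo)) {
                          force_sig_info(&current->forced_info);
                          current->forced_info.si_signo = 0;
                  }
          }
          #endif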
      
      [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to
        ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]
      [bigeasy: Add on 32bit as per Yang Shi, minor rewording. ]
      [ tglx: Use a config option ]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/Ygq5aBB/qMQw6aP5@linutronix.de
      bf9ad37d
  6. 01 Mar, 2022 2 commits
  7. 19 Feb, 2022 1 commit
    • sched/preempt: Add PREEMPT_DYNAMIC using static keys · 99cf983c
      Authored by Mark Rutland
      Where an architecture selects HAVE_STATIC_CALL but not
      HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
      which will either branch to a callee or return to the caller.
      
      On such architectures, a number of constraints can conspire to make
      those trampolines more complicated and potentially less useful than we'd
      like. For example:
      
      * Hardware and software control flow integrity schemes can require the
        addition of "landing pad" instructions (e.g. `BTI` for arm64), which
        will also be present at the "real" callee.
      
      * Limited branch ranges can require that trampolines generate or load an
        address into a register and perform an indirect branch (or at least
        have a slow path that does so). This loses some of the benefits of
        having a direct branch.
      
      * Interaction with SW CFI schemes can be complicated and fragile, e.g.
        requiring that we can recognise idiomatic codegen and remove
        indirections, at least until clang provides more helpful mechanisms
        for dealing with this.
      
      For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
      really only need to enable/disable specific preemption functions. We can
      achieve the same effect without a number of the pain points above by
      using static keys to fold early returns into the preemption functions
      themselves rather than in an out-of-line trampoline, effectively
      inlining the trampoline into the start of the function.
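
      In C terms, such a wrapper has roughly the following shape (a sketch
      modeled on the description above, not the exact kernel code):

          DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);

          int __sched dynamic_cond_resched(void)
          {
                  /* With the key disabled this folds into an early return,
                   * i.e. the NOP case in the disassembly below. */
                  if (!static_branch_unlikely(&sk_dynamic_cond_resched))
                          return 0;
                  return __cond_resched();
          }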
      
      For arm64, this results in good code generation. For example, the
      dynamic_cond_resched() wrapper looks as follows when enabled. When
      disabled, the first `B` is replaced with a `NOP`, resulting in an early
      return.
      
      | <dynamic_cond_resched>:
      |        bti     c
      |        b       <dynamic_cond_resched+0x10>     // or `nop`
      |        mov     w0, #0x0
      |        ret
      |        mrs     x0, sp_el0
      |        ldr     x0, [x0, #8]
      |        cbnz    x0, <dynamic_cond_resched+0x8>
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      ... compared to the regular form of the function:
      
      | <__cond_resched>:
      |        bti     c
      |        mrs     x0, sp_el0
      |        ldr     x1, [x0, #8]
      |        cbz     x1, <__cond_resched+0x18>
      |        mov     w0, #0x0
      |        ret
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      Any architecture which implements static keys should be able to use this
      to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
      calls. Since this is likely to have greater overhead than (inlined)
      static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
      HAVE_PREEMPT_DYNAMIC_CALL is selected.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
      99cf983c
  8. 15 Feb, 2022 1 commit
  9. 04 Feb, 2022 1 commit
    • Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Authored by Igor Pylypiv
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when the modprobe thread itself calls
      async_schedule(), but it does not work if the module dispatches its init
      code to a worker thread which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      the node where the device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async, we can remove
      the PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
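
      The resulting behaviour can be sketched as follows (simplified;
      the helper name is hypothetical, async_probe_requested is the existing
      per-module flag for "async module probe is requested"):

          /* Simplified sketch of the module-init tail after the revert. */
          static void wait_for_module_async_work(struct module *mod)
          {
                  /*
                   * Finish all async code before the module init sequence is
                   * done. Without PF_USED_ASYNC this wait is unconditional,
                   * unless the module asked for asynchronous probing.
                   */
                  if (!mod->async_probe_requested)
                          async_synchronize_full();
          }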
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67d6212a
  10. 20 Jan, 2022 1 commit
  11. 18 Jan, 2022 1 commit
  12. 08 Jan, 2022 1 commit
  13. 10 Dec, 2021 1 commit
    • kcsan: Add core support for a subset of weak memory modeling · 69562e49
      Authored by Marco Elver
      Add support for modeling a subset of weak memory, which will enable
      detection of a subset of data races due to missing memory barriers.
      
      KCSAN's approach to detecting missing memory barriers is based on
      modeling access reordering, and enabled if `CONFIG_KCSAN_WEAK_MEMORY=y`,
      which depends on `CONFIG_KCSAN_STRICT=y`. The feature can be enabled or
      disabled at boot and runtime via the `kcsan.weak_memory` boot parameter.
      
      Each memory access for which a watchpoint is set up is also selected
      for simulated reordering within the scope of its function (at most 1
      in-flight access).
      
      We are limited to modeling the effects of "buffering" (delaying the
      access), since the runtime cannot "prefetch" accesses (therefore no
      acquire modeling). Once an access has been selected for reordering, it
      is checked along every other access until the end of the function scope.
      If an appropriate memory barrier is encountered, the access will no
      longer be considered for reordering.
      
      When the result of a memory operation should be ordered by a barrier,
      KCSAN can then detect data races where the conflict only occurs as a
      result of a missing barrier due to reordering accesses.
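
      As an illustration (a hypothetical example, not taken from the commit),
      this is the kind of missing-barrier bug such reordering can expose: a
      value is published without a release barrier, so the plain store to the
      payload may be delayed past the store to the flag and race with the
      reader.

          static int shared_data;
          static int published;

          static void producer(void)
          {
                  shared_data = 42;
                  /* BUG: should be smp_store_release(&published, 1); */
                  WRITE_ONCE(published, 1);
          }

          static void consumer(void)
          {
                  if (smp_load_acquire(&published))
                          pr_info("data=%d\n", shared_data);
          }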
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      69562e49
  14. 04 Dec, 2021 2 commits
    • locking: Mark racy reads of owner->on_cpu · 4cf75fd4
      Authored by Marco Elver
      One of the more frequent data races reported by KCSAN is the racy read
      in mutex_spin_on_owner(), which is usually reported as "race of unknown
      origin" without showing the writer. This is due to the racing write
      occurring in kernel/sched. Locally enabling KCSAN in kernel/sched shows:
      
       | write (marked) to 0xffff97f205079934 of 4 bytes by task 316 on cpu 6:
       |  finish_task                kernel/sched/core.c:4632 [inline]
       |  finish_task_switch         kernel/sched/core.c:4848
       |  context_switch             kernel/sched/core.c:4975 [inline]
       |  __schedule                 kernel/sched/core.c:6253
       |  schedule                   kernel/sched/core.c:6326
       |  schedule_preempt_disabled  kernel/sched/core.c:6385
       |  __mutex_lock_common        kernel/locking/mutex.c:680
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | read to 0xffff97f205079934 of 4 bytes by task 322 on cpu 3:
       |  mutex_spin_on_owner        kernel/locking/mutex.c:370
       |  mutex_optimistic_spin      kernel/locking/mutex.c:480
       |  __mutex_lock_common        kernel/locking/mutex.c:610
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | value changed: 0x00000001 -> 0x00000000
      
      This race is clearly intentional, and the potential for miscompilation
      is slim due to surrounding barrier() and cpu_relax(), and the value
      being used as a boolean.
      
      Nevertheless, marking this reader would more clearly denote intent and
      make it obvious that concurrency is expected. Use READ_ONCE() to avoid
      having to reason about compiler optimizations now and in future.
      
      With the previous refactor, mark the read of owner->on_cpu in
      owner_on_cpu(), which immediately precedes the loop executing
      mutex_spin_on_owner().
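
      The marked read then looks roughly like this (a sketch of the helper
      named above, simplified):

          /* The racy read is annotated with READ_ONCE() to document that a
           * concurrent lockless write from the scheduler (finish_task()) is
           * expected; the value is only a heuristic boolean for spinning. */
          static inline bool owner_on_cpu(struct task_struct *owner)
          {
                  return READ_ONCE(owner->on_cpu) &&
                         !vcpu_is_preempted(task_cpu(owner));
          }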
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-3-wangkefeng.wang@huawei.com
      4cf75fd4
    • locking: Make owner_on_cpu() into <linux/sched.h> · c0bed69d
      Authored by Kefeng Wang
      Move owner_on_cpu() from kernel/locking/rwsem.c into
      include/linux/sched.h under CONFIG_SMP, then use it in the
      mutex/rwsem/rtmutex code to simplify it.
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-2-wangkefeng.wang@huawei.com
      c0bed69d
  15. 17 Nov, 2021 1 commit
    • sched/core: Forced idle accounting · 4feee7d1
      Authored by Josh Don
      Adds accounting for "forced idle" time, which is time where a cookie'd
      task forces its SMT sibling to idle, despite the presence of runnable
      tasks.
      
      Forced idle time is one means to measure the cost of enabling core
      scheduling (ie. the capacity lost due to the need to force idle).
      
      Forced idle time is attributed to the thread responsible for causing
      the forced idle.
      
      A few details:
       - Forced idle time is displayed via /proc/PID/sched. It also requires
         that schedstats is enabled.
       - Forced idle is only accounted when a sibling hyperthread is held
         idle despite the presence of runnable tasks. No time is charged if
         a sibling is idle but has no runnable tasks.
       - Tasks with 0 cookie are never charged forced idle.
       - For SMT > 2, we scale the amount of forced idle charged based on the
         number of forced idle siblings. Additionally, we split the time up and
         evenly charge it to all running tasks, as each is equally responsible
         for the forced idle.
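
      The SMT > 2 scaling rule from the last point amounts to arithmetic of
      roughly this shape (illustrative only, hypothetical helper, not the
      exact kernel code):

          /* Charge 'delta' ns of forced idle observed on a core, scaled by
           * how many siblings were forced idle and split evenly among the
           * running (cookie'd) tasks on the core. */
          static u64 forceidle_charge_per_task(u64 delta,
                                               unsigned int nr_forced_idle,
                                               unsigned int nr_running)
          {
                  if (!nr_forced_idle || !nr_running)
                          return 0;

                  delta *= nr_forced_idle;
                  return div_u64(delta, nr_running);
          }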
      Signed-off-by: Josh Don <joshdon@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
      4feee7d1
  16. 23 Oct, 2021 1 commit
    • sched: make task_struct->plug always defined · 599593a8
      Authored by Jens Axboe
      If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
      it generally available, so we don't break the compile:
      
      kernel/sched/core.c: In function ‘sched_submit_work’:
      kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
       6346 |                 blk_flush_plug(tsk->plug, true);
            |                                   ^~
      kernel/sched/core.c: In function ‘io_schedule_prepare’:
      kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
       8357 |         if (current->plug)
            |                    ^~
      kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
       8358 |                 blk_flush_plug(current->plug, true);
            |                                       ^~
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      599593a8
  17. 15 Oct, 2021 1 commit
  18. 14 Oct, 2021 1 commit
    • sched: Fill unconditional hole induced by sched_entity · 804bccba
      Authored by Kees Cook
      With struct sched_entity before the other sched entities, its alignment
      won't induce a struct hole. This saves 64 bytes in defconfig task_struct:
      
      Before:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              const struct sched_class  * sched_class;         /*   128     8 */
      
              /* XXX 56 bytes hole, try to pack */
      
              /* --- cacheline 3 boundary (192 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   192   448 */
              /* --- cacheline 10 boundary (640 bytes) --- */
              struct sched_rt_entity     rt;                   /*   640    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   688   224 */
              /* --- cacheline 14 boundary (896 bytes) was 16 bytes ago --- */
      
      After:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   128   448 */
              /* --- cacheline 9 boundary (576 bytes) --- */
              struct sched_rt_entity     rt;                   /*   576    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   624   224 */
              /* --- cacheline 13 boundary (832 bytes) was 16 bytes ago --- */
      
      Summary diff:
      -	/* size: 7040, cachelines: 110, members: 188 */
      +	/* size: 6976, cachelines: 109, members: 188 */
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210924025450.4138503-1-keescook@chromium.org
      804bccba
  19. 07 Oct, 2021 1 commit
    • coredump: Don't perform any cleanups before dumping core · 92307383
      Authored by Eric W. Biederman
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm remove code to
      test for a coredump in progress from mm_release.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      92307383
  20. 05 Oct, 2021 2 commits
    • sched: Introduce task block time in schedstats · 847fc0cd
      Authored by Yafang Shao
      Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
      there's no metric to show how long the task is in D state.  Once a task
      is in D state, it means the task is blocked in the kernel, for example
      the task may be waiting for a mutex. The D state is more frequent than
      iowait, and it is more critical than S state. So it is worth adding a
      metric to measure it.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
      847fc0cd
    • sched: Make struct sched_statistics independent of fair sched class · ceeadb83
      Authored by Yafang Shao
      If we want to use the schedstats facility to trace other sched classes,
      we should make it independent of the fair sched class. The struct
      sched_statistics holds the scheduler statistics of a task_struct or a
      task_group, so we can move it into struct task_struct and struct
      task_group to achieve the goal.
      
      After the patch, schedstats are organized as follows:
      
          struct task_struct {
             ...
             struct sched_entity se;
             struct sched_rt_entity rt;
             struct sched_dl_entity dl;
             ...
             struct sched_statistics stats;
             ...
         };
      
      Regarding the task group, schedstats is only supported for fair group
      sched, and a new struct sched_entity_stats is introduced, suggested by
      Peter -
      
          struct sched_entity_stats {
              struct sched_entity     se;
              struct sched_statistics stats;
          } __no_randomize_layout;
      
      Then with the se in a task_group, we can easily get the stats.
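
      A sketch of how the stats can then be reached from a sched_entity
      (modeled on the description above; the helper name is illustrative, and
      entity_is_task()/task_of() are the existing fair-sched helpers):

          /* For a group se the stats sit right after the se inside
           * sched_entity_stats; for a task se they live in the task_struct. */
          static inline struct sched_statistics *stats_of(struct sched_entity *se)
          {
          #ifdef CONFIG_FAIR_GROUP_SCHED
                  if (!entity_is_task(se))
                          return &container_of(se, struct sched_entity_stats,
                                               se)->stats;
          #endif
                  return &task_of(se)->stats;
          }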
      
      The sched_statistics members may be frequently modified when schedstats
      is enabled. In order to avoid impacting unrelated data that may share a
      cacheline with them, struct sched_statistics is defined as cacheline
      aligned.
      
      As this patch changes a core struct of the scheduler, I verified the
      performance impact on the scheduler with 'perf bench sched pipe', as
      suggested by Mel. Below is the result, in which all the values are in
      usecs/op.
                                        Before               After
            kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
            kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
      [These numbers differ slightly from the earlier version because my old
       test machine was destroyed, so I had to use a different test machine.]
      
      Almost no impact on the sched performance.
      
      No functional change.
      
      [lkp@intel.com: reported build failure in earlier version]
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
      ceeadb83
  21. 01 Oct, 2021 5 commits
  22. 30 Sep, 2021 1 commit
    • sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y · bcf9033e
      Authored by Ard Biesheuvel
      THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
      causes some issues on architectures that define raw_smp_processor_id()
      in terms of this field, due to the fact that #include'ing linux/sched.h
      to get at struct task_struct is problematic in terms of circular
      dependencies.
      
      Given that thread_info and task_struct are the same data structure
      anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
      access to the type definition of struct thread_info is sufficient to
      reference the CPU number of the current task.
      
      Note that this requires THREAD_INFO_IN_TASK's definition of the
      task_thread_info() helper to be updated, as task_cpu() takes a
      pointer-to-const, whereas task_thread_info() (which is used to generate
      lvalues as well), needs a non-const pointer. So make it a macro instead.
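
      The resulting arrangement looks roughly like this (a sketch following
      the description above, not the exact diff):

          /* With THREAD_INFO_IN_TASK=y the CPU number lives in thread_info
           * again, so raw_smp_processor_id() only needs the thread_info
           * layout; task_thread_info() is a macro so it can still produce an
           * lvalue while task_cpu() takes a pointer-to-const. */
          #define task_thread_info(task)  (&(task)->thread_info)

          static inline unsigned int task_cpu(const struct task_struct *p)
          {
                  return READ_ONCE(task_thread_info(p)->cpu);
          }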
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      bcf9033e
  23. 14 Sep, 2021 1 commit
    • x86/mce: Avoid infinite loop for copy from user recovery · 81065b35
      Authored by Tony Luck
      There are two cases for machine check recovery:
      
      1) The machine check was triggered by ring3 (application) code.
         This is the simpler case. The machine check handler simply queues
         work to be executed on return to user. That code unmaps the page
         from all users and arranges to send a SIGBUS to the task that
         triggered the poison.
      
      2) The machine check was triggered in kernel code that is covered by
         an exception table entry. In this case the machine check handler
         still queues a work entry to unmap the page, etc. but this will
         not be called right away because the #MC handler returns to the
         fix up code address in the exception table entry.
      
      Problems occur if the kernel triggers another machine check before the
      return to user processes the first queued work item.
      
      Specifically, the work is queued using the ->mce_kill_me callback
      structure in the task struct for the current thread. Attempting to queue
      a second work item using this same callback results in a loop in the
      linked list of work functions to call. So when the kernel does return to
      user, it enters an infinite loop processing the same entry for ever.
      
      There are some legitimate scenarios where the kernel may take a second
      machine check before returning to the user.
      
      1) Some code (e.g. futex) first tries a get_user() with page faults
         disabled. If this fails, the code retries with page faults enabled
         expecting that this will resolve the page fault.
      
      2) Copy from user code retries a copy in byte-at-time mode to check
         whether any additional bytes can be copied.
      
      On the other side of the fence are some bad drivers that do not check
      the return value from individual get_user() calls and may access
      multiple user addresses without noticing that some/all calls have
      failed.
      
      Fix by adding a counter (current->mce_count) to keep track of repeated
      machine checks before task_work() is called. The first machine check
      saves the address information and calls task_work_add(). Subsequent
      machine checks before that task_work callback is executed check that the
      address is in the same page as the first machine check (since the
      callback will offline exactly one page).
      
      Expected worst case is four machine checks before moving on (e.g. one
      user access with page faults disabled, then a repeat to the same address
      with page faults enabled ... repeat in copy tail bytes). Just in case
      there is some code that loops forever enforce a limit of 10.
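
      A simplified sketch of the counting scheme (based on the description
      above; not the exact kernel code):

          /* Count machine checks taken before the queued task_work runs. */
          static void queue_task_work(struct mce *m, char *msg,
                                      void (*func)(struct callback_head *))
          {
                  int count = ++current->mce_count;

                  /* First machine check: record the address, queue the work. */
                  if (count == 1) {
                          current->mce_addr = m->addr;
                          init_task_work(&current->mce_kill_me, func);
                          task_work_add(current, &current->mce_kill_me,
                                        TWA_RESUME);
                  }

                  /* Later machine checks must hit the same page as the first. */
                  if (count > 1 &&
                      (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
                          panic("Machine checks to different user pages");

                  /* Guard against code that retries forever. */
                  if (count > 10)
                          panic("Too many consecutive machine checks");
          }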
      
       [ bp: Massage commit message, drop noinstr, fix typo, extend panic
         messages. ]
      
      Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
      81065b35
  24. 28 Aug, 2021 1 commit
    • eventfd: Make signal recursion protection a task bit · b542e383
      Authored by Thomas Gleixner
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
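
      The per task bit then works roughly like this (a sketch; the bit name
      follows the commit, the signal path body is simplified):

          static inline bool eventfd_signal_allowed(void)
          {
                  return !current->in_eventfd_signal;
          }

          static __u64 eventfd_signal_sketch(struct eventfd_ctx *ctx, __u64 n)
          {
                  unsigned long flags;

                  if (WARN_ON_ONCE(!eventfd_signal_allowed()))
                          return 0;

                  spin_lock_irqsave(&ctx->wqh.lock, flags);
                  current->in_eventfd_signal = 1;  /* per task recursion guard */
                  if (ULLONG_MAX - ctx->count < n)
                          n = ULLONG_MAX - ctx->count;
                  ctx->count += n;
                  if (waitqueue_active(&ctx->wqh))
                          wake_up_locked_poll(&ctx->wqh, EPOLLIN);
                  current->in_eventfd_signal = 0;
                  spin_unlock_irqrestore(&ctx->wqh.lock, flags);

                  return n;
          }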
      
      No functional change for !RT kernels.
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
      b542e383
  25. 20 Aug, 2021 3 commits
  26. 17 Aug, 2021 4 commits
    • sched/core: Provide a scheduling point for RT locks · 6991436c
      Authored by Thomas Gleixner
      RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based
      on rtmutexes. Blocking on such a lock is similar to preemption with
      respect to:
      
       - I/O scheduling and worker handling, because these functions might block
         on another substituted lock, or come from a lock contention within these
         functions.
      
       - RCU considers this like a preemption, because the task might be in a read
         side critical section.
      
      Add a separate scheduling point for this, and hand a new scheduling mode
      argument to __schedule() which allows, along with separate mode masks, to
      handle this gracefully from within the scheduler, without proliferating that
      to other subsystems like RCU.
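
      The separate scheduling point then has roughly this shape (a sketch
      modeled on the description; the SM_RTLOCK_WAIT mode name is taken from
      the patch):

          /* PREEMPT_RT only: dedicated scheduling point for blocking on
           * rtlock-based spin/rwlocks, handing the new mode to __schedule(). */
          void __sched notrace schedule_rtlock(void)
          {
                  do {
                          preempt_disable();
                          __schedule(SM_RTLOCK_WAIT);
                          sched_preempt_enable_no_resched();
                  } while (need_resched());
          }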
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.372319055@linutronix.de
      6991436c
    • sched/wakeup: Prepare for RT sleeping spin/rwlocks · 5f220be2
      Authored by Thomas Gleixner
      Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
      preserving. Any wakeup which matches the state is valid.
      
      RT enabled kernels substitute them with 'sleeping' spinlocks. This
      creates an issue vs. task::__state.
      
      In order to block on the lock, the task has to overwrite task::__state and a
      consecutive wakeup issued by the unlocker sets the state back to
      TASK_RUNNING. As a consequence the task loses the state which was set
      before the lock acquire and also any regular wakeup targeted at the task
      while it is blocked on the lock.
      
      To handle this gracefully, add a 'saved_state' member to task_struct which
      is used in the following way:
      
       1) When a task blocks on a 'sleeping' spinlock, the current state is saved
          in task::saved_state before it is set to TASK_RTLOCK_WAIT.
      
       2) When the task unblocks and after acquiring the lock, it restores the saved
          state.
      
       3) When a regular wakeup happens for a task while it is blocked then the
          state change of that wakeup is redirected to operate on task::saved_state.
      
          This is also required when the task state is running because the task
          might have been woken up from the lock wait and has not yet restored
          the saved state.
      
      To make it complete, provide the necessary helpers to save and restore the
      saved state along with the necessary documentation how the RT lock blocking
      is supposed to work.
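
      A simplified sketch of such save/restore helpers (the real versions also
      carry lockdep and debug checks):

          /* Called with interrupts disabled, before blocking on a 'sleeping'
           * spin/rwlock. */
          #define current_save_and_set_rtlock_wait_state()                  \
                  do {                                                      \
                          raw_spin_lock(&current->pi_lock);                 \
                          current->saved_state = current->__state;          \
                          WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT);   \
                          raw_spin_unlock(&current->pi_lock);               \
                  } while (0)

          /* Called after the lock has been acquired: restore whatever state
           * the task had set before blocking. */
          #define current_restore_rtlock_saved_state()                      \
                  do {                                                      \
                          raw_spin_lock(&current->pi_lock);                 \
                          WRITE_ONCE(current->__state, current->saved_state); \
                          current->saved_state = TASK_RUNNING;              \
                          raw_spin_unlock(&current->pi_lock);               \
                  } while (0)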
      
      For non-RT kernels there is no functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.258751046@linutronix.de
      5f220be2
    • sched/wakeup: Reorganize the current::__state helpers · 85019c16
      Authored by Thomas Gleixner
      In order to avoid more duplicate implementations for the debug and
      non-debug variants of the state change macros, split the debug portion out
      and make that conditional on CONFIG_DEBUG_ATOMIC_SLEEP=y.
      Suggested-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.200898048@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      85019c16
    • sched/wakeup: Introduce the TASK_RTLOCK_WAIT state bit · cd781d0c
      Authored by Thomas Gleixner
      RT kernels have an extra quirk for try_to_wake_up() to handle task state
      preservation across periods of blocking on a 'sleeping' spin/rwlock.
      
      For this to function correctly and under all circumstances try_to_wake_up()
      must be able to identify whether the wakeup is lock related or not and
      whether the task is waiting for a lock or not.
      
      The original approach was to use a special wake_flag argument for
      try_to_wake_up() and just use TASK_UNINTERRUPTIBLE for the tasks wait state
      and the try_to_wake_up() state argument.
      
      This works in principle, but because try_to_wake_up() cannot determine
      whether the task is waiting for an RT lock wakeup or for a regular
      wakeup, it's suboptimal.
      
      RT kernels save the original task state when blocking on an RT lock and
      restore it when the lock has been acquired. Any non lock related wakeup is
      checked against the saved state and if it matches the saved state is set to
      running so that the wakeup is not lost when the state is restored.
      
      While the necessary logic for the wake_flag based solution is trivial, the
      downside is that any regular wakeup with TASK_UNINTERRUPTIBLE in the state
      argument set will wake the task despite the fact that it is still blocked
      on the lock. That's not a fatal problem as the lock wait has to deal with
      spurious wakeups anyway, but it introduces unnecessary latencies.
      
      Introduce the TASK_RTLOCK_WAIT state bit which will be set when a task
      blocks on an RT lock.
      
      The lock wakeup will use wake_up_state(TASK_RTLOCK_WAIT), so both the
      waiting state and the wakeup state are distinguishable, which avoids
      spurious wakeups and allows better analysis.
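
      Conceptually the two sides then look like this (illustrative sketch
      only; the real call sites live in the rtmutex slowpath):

          static void rtlock_block(void)
          {
                  /* Waiter side: block with the dedicated state bit. */
                  current_save_and_set_rtlock_wait_state();
                  schedule_rtlock();
                  current_restore_rtlock_saved_state();
          }

          static void rtlock_wake(struct task_struct *waiter)
          {
                  /* Unlocker side: only match lock waiters, so regular
                   * TASK_UNINTERRUPTIBLE wakeups no longer hit them. */
                  wake_up_state(waiter, TASK_RTLOCK_WAIT);
          }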
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.144989915@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cd781d0c
  27. 28 Jul, 2021 1 commit
  28. 17 Jul, 2021 1 commit
    • bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Authored by Andrii Nakryiko
      b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and
      unnecessary failure mode in which not all BPF programs might be executed
      if we fail to find an unused slot for cgroup storage, however unlikely it
      is. It might also
      not be so unlikely when/if we allow sleepable cgroup BPF programs in the
      future.
      
      Further, the way that cgroup storage is implemented as ambiently-available
      property during entire BPF program execution is a convenient way to pass extra
      information to BPF program and helpers without requiring user code to pass
      around extra arguments explicitly. So it would be good to have a generic
      solution that can allow implementing this without arbitrary restrictions.
      Ideally, such solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.
      
      This patch introduces such solution, bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
      macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
      
      Restoring old value of the pointer handles preemption, while bpf_run_ctx
      pointer being a property of current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of CPU. It would even allow CPU migration of BPF
      programs, even though it's not currently allowed by BPF infra.
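
      The save/restore pattern described above looks roughly like this (a
      sketch; names follow the commit description):

          struct bpf_run_ctx {};

          struct bpf_cg_run_ctx {
                  struct bpf_run_ctx run_ctx;
                  const struct bpf_prog_array_item *prog_item;
          };

          /* Install a new context for the duration of a BPF program run ... */
          static inline struct bpf_run_ctx *
          bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
          {
                  struct bpf_run_ctx *old_ctx = current->bpf_ctx;

                  current->bpf_ctx = new_ctx;
                  return old_ctx;
          }

          /* ... and restore the previous one afterwards, which is what makes
           * nested (preempting) program executions work. */
          static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
          {
                  current->bpf_ctx = old_ctx;
          }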
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
      c7603cfa