1. 18 March 2022, 1 commit
  2. 04 February 2022, 1 commit
    • Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Authored by Igor Pylypiv
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when the modprobe thread itself calls
      async_schedule(), but it does not work if the module dispatches its init
      code to a worker thread which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      the node where the device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async context, we can
      remove the PF_USED_ASYNC flag and make module init consistently invoke
      async_synchronize_full() unless an async module probe is requested.
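      A rough sketch of the resulting do_init_module() behaviour (assuming the
      async_probe_requested flag in struct module; not a verbatim diff):

      	/*
      	 * Always drain async work before module init completes, instead of
      	 * keying off the removed PF_USED_ASYNC thread flag.
      	 */
      	if (!mod->async_probe_requested)
      		async_synchronize_full();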
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 January 2022, 1 commit
  4. 18 January 2022, 1 commit
  5. 08 January 2022, 1 commit
  6. 10 December 2021, 1 commit
    • kcsan: Add core support for a subset of weak memory modeling · 69562e49
      Authored by Marco Elver
      Add support for modeling a subset of weak memory, which will enable
      detection of a subset of data races due to missing memory barriers.
      
      KCSAN's approach to detecting missing memory barriers is based on
      modeling access reordering, and enabled if `CONFIG_KCSAN_WEAK_MEMORY=y`,
      which depends on `CONFIG_KCSAN_STRICT=y`. The feature can be enabled or
      disabled at boot and runtime via the `kcsan.weak_memory` boot parameter.
      
      Each memory access for which a watchpoint is set up is also selected for
      simulated reordering within the scope of its function (at most one
      in-flight access).
      
      We are limited to modeling the effects of "buffering" (delaying the
      access), since the runtime cannot "prefetch" accesses (therefore no
      acquire modeling). Once an access has been selected for reordering, it
      is checked along every other access until the end of the function scope.
      If an appropriate memory barrier is encountered, the access will no
      longer be considered for reordering.
      
      When the result of a memory operation should be ordered by a barrier,
      KCSAN can then detect data races where the conflict only occurs as a
      result of a missing barrier due to reordering accesses.
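      As an illustration of the class of bug this targets, consider a flag/data
      pair published without barriers (a made-up example, not code from this
      patch); the race on 'data' can now be reported as caused by the missing
      smp_wmb()/smp_rmb() pair:

      	int data;
      	int ready;

      	void producer(void)
      	{
      		data = 42;
      		/* Missing smp_wmb(): the store to 'ready' may be reordered
      		 * before the store to 'data'. */
      		WRITE_ONCE(ready, 1);
      	}

      	int consumer(void)
      	{
      		if (READ_ONCE(ready))
      			return data;	/* may observe a stale value */
      		return -1;
      	}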
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  7. 04 December 2021, 2 commits
    • locking: Mark racy reads of owner->on_cpu · 4cf75fd4
      Authored by Marco Elver
      One of the more frequent data races reported by KCSAN is the racy read
      in mutex_spin_on_owner(), which is usually reported as "race of unknown
      origin" without showing the writer. This is due to the racing write
      occurring in kernel/sched. Locally enabling KCSAN in kernel/sched shows:
      
       | write (marked) to 0xffff97f205079934 of 4 bytes by task 316 on cpu 6:
       |  finish_task                kernel/sched/core.c:4632 [inline]
       |  finish_task_switch         kernel/sched/core.c:4848
       |  context_switch             kernel/sched/core.c:4975 [inline]
       |  __schedule                 kernel/sched/core.c:6253
       |  schedule                   kernel/sched/core.c:6326
       |  schedule_preempt_disabled  kernel/sched/core.c:6385
       |  __mutex_lock_common        kernel/locking/mutex.c:680
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | read to 0xffff97f205079934 of 4 bytes by task 322 on cpu 3:
       |  mutex_spin_on_owner        kernel/locking/mutex.c:370
       |  mutex_optimistic_spin      kernel/locking/mutex.c:480
       |  __mutex_lock_common        kernel/locking/mutex.c:610
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | value changed: 0x00000001 -> 0x00000000
      
      This race is clearly intentional, and the potential for miscompilation
      is slim due to surrounding barrier() and cpu_relax(), and the value
      being used as a boolean.
      
      Nevertheless, marking this reader would more clearly denote intent and
      make it obvious that concurrency is expected. Use READ_ONCE() to avoid
      having to reason about compiler optimizations now and in future.
      
      Building on the previous refactoring, mark the read of owner->on_cpu in
      owner_on_cpu(), which immediately precedes the loop executing
      mutex_spin_on_owner().
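      The marked read presumably ends up looking roughly like this (a sketch
      based on the description above, not the exact diff):

      	static inline bool owner_on_cpu(struct task_struct *owner)
      	{
      		/*
      		 * The owner can be rescheduled or preempted concurrently;
      		 * the racy read is intentional and marked to document that.
      		 */
      		return READ_ONCE(owner->on_cpu) &&
      		       !vcpu_is_preempted(task_cpu(owner));
      	}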
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-3-wangkefeng.wang@huawei.com
    • locking: Make owner_on_cpu() into <linux/sched.h> · c0bed69d
      Authored by Kefeng Wang
      Move owner_on_cpu() from kernel/locking/rwsem.c into
      include/linux/sched.h under CONFIG_SMP, then use it in the
      mutex/rwsem/rtmutex code to simplify it.
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-2-wangkefeng.wang@huawei.com
  8. 17 November 2021, 1 commit
    • sched/core: Forced idle accounting · 4feee7d1
      Authored by Josh Don
      Add accounting for "forced idle" time, which is time during which a
      cookie'd task forces its SMT sibling to idle despite the presence of
      runnable tasks.
      
      Forced idle time is one means to measure the cost of enabling core
      scheduling (ie. the capacity lost due to the need to force idle).
      
      Forced idle time is attributed to the thread responsible for causing
      the forced idle.
      
      A few details:
       - Forced idle time is displayed via /proc/PID/sched. It also requires
         that schedstats is enabled.
       - Forced idle is only accounted when a sibling hyperthread is held
         idle despite the presence of runnable tasks. No time is charged if
         a sibling is idle but has no runnable tasks.
       - Tasks with 0 cookie are never charged forced idle.
       - For SMT > 2, we scale the amount of forced idle charged based on the
         number of forced idle siblings. Additionally, we split the time up and
         evenly charge it to all running tasks, as each is equally responsible
         for the forced idle (see the sketch below).
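      A sketch of the charging rule described in the last point, with made-up
      names (the real kernel code differs):

      	/* Forced idle charged to each running task for a window of delta_ns,
      	 * given nr_forced_idle idle siblings and nr_running running tasks. */
      	static inline u64 forced_idle_charge(u64 delta_ns,
      					     unsigned int nr_forced_idle,
      					     unsigned int nr_running)
      	{
      		if (!nr_running)
      			return 0;
      		return div_u64(delta_ns * nr_forced_idle, nr_running);
      	}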
      Signed-off-by: Josh Don <joshdon@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
  9. 23 October 2021, 1 commit
    • sched: make task_struct->plug always defined · 599593a8
      Authored by Jens Axboe
      If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
      it generally available, so we don't break the compile:
      
      kernel/sched/core.c: In function ‘sched_submit_work’:
      kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
       6346 |                 blk_flush_plug(tsk->plug, true);
            |                                   ^~
      kernel/sched/core.c: In function ‘io_schedule_prepare’:
      kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
       8357 |         if (current->plug)
            |                    ^~
      kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
       8358 |                 blk_flush_plug(current->plug, true);
            |                                       ^~
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 15 October 2021, 1 commit
  11. 14 October 2021, 1 commit
    • sched: Fill unconditional hole induced by sched_entity · 804bccba
      Authored by Kees Cook
      With struct sched_entity before the other sched entities, its alignment
      won't induce a struct hole. This saves 64 bytes in defconfig task_struct:
      
      Before:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              const struct sched_class  * sched_class;         /*   128     8 */
      
              /* XXX 56 bytes hole, try to pack */
      
              /* --- cacheline 3 boundary (192 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   192   448 */
              /* --- cacheline 10 boundary (640 bytes) --- */
              struct sched_rt_entity     rt;                   /*   640    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   688   224 */
              /* --- cacheline 14 boundary (896 bytes) was 16 bytes ago --- */
      
      After:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   128   448 */
              /* --- cacheline 9 boundary (576 bytes) --- */
              struct sched_rt_entity     rt;                   /*   576    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   624   224 */
              /* --- cacheline 13 boundary (832 bytes) was 16 bytes ago --- */
      
      Summary diff:
      -	/* size: 7040, cachelines: 110, members: 188 */
      +	/* size: 6976, cachelines: 109, members: 188 */
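      The effect can also be reproduced with a small standalone program (a toy
      layout, not the real task_struct): placing the 64-byte-aligned member
      right after the small field removes the large hole.

      	#include <stdio.h>
      	#include <stddef.h>

      	struct aligned_blob { char buf[448]; } __attribute__((aligned(64)));

      	struct before {			/* mimics the old ordering */
      		char prefix[124];	/* everything up to rt_priority */
      		const void *sched_class;
      		struct aligned_blob se;	/* pushed out to the next 64-byte boundary */
      		char tail[40];
      	};

      	struct after {			/* aligned member first, no big hole */
      		char prefix[124];
      		struct aligned_blob se;
      		const void *sched_class;
      		char tail[40];
      	};

      	int main(void)
      	{
      		printf("before: %zu bytes, se at offset %zu\n",
      		       sizeof(struct before), offsetof(struct before, se));
      		printf("after:  %zu bytes, se at offset %zu\n",
      		       sizeof(struct after), offsetof(struct after, se));
      		return 0;
      	}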
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210924025450.4138503-1-keescook@chromium.org
  12. 07 October 2021, 1 commit
    • coredump: Don't perform any cleanups before dumping core · 92307383
      Authored by Eric W. Biederman
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm remove code to
      test for a coredump in progress from mm_release.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  13. 05 October 2021, 2 commits
    • sched: Introduce task block time in schedstats · 847fc0cd
      Authored by Yafang Shao
      Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
      there is no metric to show how long a task stays in D state.  Once a task
      is in D state, it is blocked in the kernel, for example it may be waiting
      for a mutex.  The D state is more frequent than iowait and more critical
      than the S state, so it is worth adding a metric to measure it.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
    • sched: Make struct sched_statistics independent of fair sched class · ceeadb83
      Authored by Yafang Shao
      If we want to use the schedstats facility to trace other sched classes, we
      should make it independent of fair sched class. The struct sched_statistics
      is the scheduler statistics of a task_struct or a task_group. So we can
      move it into struct task_struct and struct task_group to achieve the goal.
      
      After the patch, schedstats are organized as follows:
      
          struct task_struct {
             ...
             struct sched_entity se;
             struct sched_rt_entity rt;
             struct sched_dl_entity dl;
             ...
             struct sched_statistics stats;
             ...
         };
      
      Regarding the task group, schedstats is only supported for fair group
      sched, and a new struct sched_entity_stats is introduced, suggested by
      Peter -
      
          struct sched_entity_stats {
              struct sched_entity     se;
              struct sched_statistics stats;
          } __no_randomize_layout;
      
      Then with the se in a task_group, we can easily get the stats.
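      For example, a helper along these lines (a sketch; the exact name is an
      assumption) can recover the stats from an se regardless of whether it
      belongs to a task or a group:

      	static inline struct sched_statistics *
      	__schedstats_from_se(struct sched_entity *se)
      	{
      	#ifdef CONFIG_FAIR_GROUP_SCHED
      		/* Group se: stats sit right after the se in sched_entity_stats. */
      		if (!entity_is_task(se))
      			return &container_of(se, struct sched_entity_stats, se)->stats;
      	#endif
      		/* Task se: stats are a member of the owning task_struct. */
      		return &task_of(se)->stats;
      	}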
      
      The sched_statistics members may be frequently modified when schedstats
      is enabled; in order to avoid impacting unrelated data that may share a
      cacheline with them, struct sched_statistics is defined as cacheline
      aligned.
      
      As this patch changes a core scheduler struct, I verified the performance
      impact on the scheduler with 'perf bench sched pipe', as suggested by
      Mel. Below is the result; all values are in usecs/op.
                                        Before               After
            kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
            kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
      [These numbers differ slightly from the earlier version because my old
       test machine was destroyed, so I had to use a different test machine.]
      
      Almost no impact on the sched performance.
      
      No functional change.
      
      [lkp@intel.com: reported build failure in earlier version]
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
  14. 01 October 2021, 5 commits
  15. 30 September 2021, 1 commit
    • sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y · bcf9033e
      Authored by Ard Biesheuvel
      THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
      causes some issues on architectures that define raw_smp_processor_id()
      in terms of this field, due to the fact that #include'ing linux/sched.h
      to get at struct task_struct is problematic in terms of circular
      dependencies.
      
      Given that thread_info and task_struct are the same data structure
      anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
      access to the type definition of struct thread_info is sufficient to
      reference the CPU number of the current task.
      
      Note that this requires THREAD_INFO_IN_TASK's definition of the
      task_thread_info() helper to be updated, as task_cpu() takes a
      pointer-to-const, whereas task_thread_info() (which is used to generate
      lvalues as well), needs a non-const pointer. So make it a macro instead.
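      The resulting helper is then presumably a simple macro of this shape (a
      sketch, not the exact patch):

      	#ifdef CONFIG_THREAD_INFO_IN_TASK
      	/*
      	 * Must be a macro so it can yield an lvalue, while task_cpu() can
      	 * still take a pointer-to-const task_struct.
      	 */
      	# define task_thread_info(task)	(&(task)->thread_info)
      	#endif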
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
  16. 14 September 2021, 1 commit
    • x86/mce: Avoid infinite loop for copy from user recovery · 81065b35
      Authored by Tony Luck
      There are two cases for machine check recovery:
      
      1) The machine check was triggered by ring3 (application) code.
         This is the simpler case. The machine check handler simply queues
         work to be executed on return to user. That code unmaps the page
         from all users and arranges to send a SIGBUS to the task that
         triggered the poison.
      
      2) The machine check was triggered in kernel code that is covered by
         an exception table entry. In this case the machine check handler
         still queues a work entry to unmap the page, etc. but this will
         not be called right away because the #MC handler returns to the
         fix up code address in the exception table entry.
      
      Problems occur if the kernel triggers another machine check before the
      return to user processes the first queued work item.
      
      Specifically, the work is queued using the ->mce_kill_me callback
      structure in the task struct for the current thread. Attempting to queue
      a second work item using this same callback results in a loop in the
      linked list of work functions to call. So when the kernel does return to
      user, it enters an infinite loop processing the same entry for ever.
      
      There are some legitimate scenarios where the kernel may take a second
      machine check before returning to the user.
      
      1) Some code (e.g. futex) first tries a get_user() with page faults
         disabled. If this fails, the code retries with page faults enabled
         expecting that this will resolve the page fault.
      
      2) Copy from user code retries a copy in byte-at-time mode to check
         whether any additional bytes can be copied.
      
      On the other side of the fence are some bad drivers that do not check
      the return value from individual get_user() calls and may access
      multiple user addresses without noticing that some/all calls have
      failed.
      
      Fix this by adding a counter (current->mce_count) to keep track of
      repeated machine checks before task_work() is called. The first machine
      check saves the address information and calls task_work_add(). Subsequent
      machine checks arriving before that task_work callback is executed check
      that the address is in the same page as the first machine check (since
      the callback will offline exactly one page).
      
      The expected worst case is four machine checks before moving on (e.g. one
      user access with page faults disabled, then a repeat to the same address
      with page faults enabled ... repeat in copy tail bytes). Just in case
      there is some code that loops forever, enforce a limit of 10.
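      A rough sketch of the described logic (names and messages are
      approximate, not the exact patch):

      	static void queue_mce_task_work(struct mce *m)
      	{
      		current->mce_count++;

      		if (current->mce_count == 1) {
      			/* First #MC: save the address and queue the callback. */
      			current->mce_addr = m->addr;
      			task_work_add(current, &current->mce_kill_me, TWA_RESUME);
      			return;
      		}

      		/* Repeated #MC before the callback ran: must hit the same page,
      		 * since the callback will offline exactly one page. */
      		if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
      			mce_panic("Machine checks to different user pages", m, NULL);

      		/* Safety net against code retrying the faulting access forever. */
      		if (current->mce_count > 10)
      			mce_panic("Too many consecutive machine checks", m, NULL);
      	}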
      
       [ bp: Massage commit message, drop noinstr, fix typo, extend panic
         messages. ]
      
      Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
  17. 28 August 2021, 1 commit
    • eventfd: Make signal recursion protection a task bit · b542e383
      Authored by Thomas Gleixner
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
      
      No functional change for !RT kernels.
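      A minimal sketch of the per-task approach (the names are assumptions for
      illustration, based on the description above):

      	static inline bool eventfd_signal_allowed(void)
      	{
      		/* The recursion guard is a bit in task_struct rather than a
      		 * per-CPU counter, so it behaves the same on !RT and RT. */
      		return !current->in_eventfd_signal;
      	}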
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
  18. 20 August 2021, 3 commits
  19. 17 August 2021, 4 commits
    • sched/core: Provide a scheduling point for RT locks · 6991436c
      Authored by Thomas Gleixner
      RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based
      on rtmutexes. Blocking on such a lock is similar to preemption with
      respect to:
      
       - I/O scheduling and worker handling, because these functions might block
         on another substituted lock, or come from a lock contention within these
         functions.
      
       - RCU considers this like a preemption, because the task might be in a read
         side critical section.
      
      Add a separate scheduling point for this, and hand a new scheduling mode
      argument to __schedule() which allows, along with separate mode masks, to
      handle this gracefully from within the scheduler, without proliferating that
      to other subsystems like RCU.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.372319055@linutronix.de
    • sched/wakeup: Prepare for RT sleeping spin/rwlocks · 5f220be2
      Authored by Thomas Gleixner
      Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
      preserving. Any wakeup which matches the state is valid.
      
      RT enabled kernels substitute them with 'sleeping' spinlocks. This
      creates an issue vs. task::__state.
      
      In order to block on the lock, the task has to overwrite task::__state and a
      consecutive wakeup issued by the unlocker sets the state back to
      TASK_RUNNING. As a consequence the task loses the state which was set
      before the lock acquire and also any regular wakeup targeted at the task
      while it is blocked on the lock.
      
      To handle this gracefully, add a 'saved_state' member to task_struct which
      is used in the following way:
      
       1) When a task blocks on a 'sleeping' spinlock, the current state is saved
          in task::saved_state before it is set to TASK_RTLOCK_WAIT.
      
       2) When the task unblocks and after acquiring the lock, it restores the saved
          state.
      
       3) When a regular wakeup happens for a task while it is blocked then the
          state change of that wakeup is redirected to operate on task::saved_state.
      
          This is also required when the task state is running because the task
          might have been woken up from the lock wait and has not yet restored
          the saved state.
      
      To make it complete, provide the necessary helpers to save and restore the
      saved state along with the necessary documentation how the RT lock blocking
      is supposed to work.
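      On the lock-slowpath side the save/restore pattern presumably looks
      roughly like this (a sketch; the helper names are assumptions and the
      real code differs):

      	/* Blocking on a 'sleeping' spinlock: stash the current state ... */
      	current_save_and_set_rtlock_wait_state();

      	/* ... block until the lock is acquired; lock wakeups use
      	 * TASK_RTLOCK_WAIT, regular wakeups are redirected to saved_state ... */

      	/* ... then put the original state back once the lock is held. */
      	current_restore_rtlock_saved_state();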
      
      For non-RT kernels there is no functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.258751046@linutronix.de
    • sched/wakeup: Reorganize the current::__state helpers · 85019c16
      Authored by Thomas Gleixner
      In order to avoid more duplicate implementations for the debug and
      non-debug variants of the state change macros, split the debug portion out
      and make that conditional on CONFIG_DEBUG_ATOMIC_SLEEP=y.
      Suggested-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.200898048@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wakeup: Introduce the TASK_RTLOCK_WAIT state bit · cd781d0c
      Authored by Thomas Gleixner
      RT kernels have an extra quirk for try_to_wake_up() to handle task state
      preservation across periods of blocking on a 'sleeping' spin/rwlock.
      
      For this to function correctly and under all circumstances try_to_wake_up()
      must be able to identify whether the wakeup is lock related or not and
      whether the task is waiting for a lock or not.
      
      The original approach was to use a special wake_flag argument for
      try_to_wake_up() and just use TASK_UNINTERRUPTIBLE for the task's wait state
      and the try_to_wake_up() state argument.
      
      This works in principle, but it is suboptimal because try_to_wake_up()
      cannot determine whether the task is waiting for an RT lock wakeup or for
      a regular wakeup.
      
      RT kernels save the original task state when blocking on an RT lock and
      restore it when the lock has been acquired. Any non lock related wakeup is
      checked against the saved state and if it matches the saved state is set to
      running so that the wakeup is not lost when the state is restored.
      
      While the necessary logic for the wake_flag based solution is trivial, the
      downside is that any regular wakeup with TASK_UNINTERRUPTIBLE in the state
      argument set will wake the task despite the fact that it is still blocked
      on the lock. That's not a fatal problem as the lock wait has to deal with
      spurious wakeups anyway, but it introduces unnecessary latencies.
      
      Introduce the TASK_RTLOCK_WAIT state bit which will be set when a task
      blocks on an RT lock.
      
      The lock wakeup will use wake_up_state(TASK_RTLOCK_WAIT), so both the
      waiting state and the wakeup state are distinguishable, which avoids
      spurious wakeups and allows better analysis.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.144989915@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 28 July 2021, 1 commit
  21. 17 July 2021, 1 commit
    • bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Authored by Andrii Nakryiko
      b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and
      unnecessary failure mode in which not all BPF programs might be executed
      if we fail to find an unused slot for cgroup storage, however unlikely
      that is. It might also become less unlikely when/if we allow sleepable
      cgroup BPF programs in the future.
      
      Further, the way that cgroup storage is implemented as ambiently-available
      property during entire BPF program execution is a convenient way to pass extra
      information to BPF program and helpers without requiring user code to pass
      around extra arguments explicitly. So it would be good to have a generic
      solution that can allow implementing this without arbitrary restrictions.
      Ideally, such solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.
      
      This patch introduces such solution, bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
      macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
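      The two helpers presumably boil down to something like this (a sketch
      consistent with the description above):

      	static inline struct bpf_run_ctx *bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
      	{
      		struct bpf_run_ctx *old_ctx = current->bpf_ctx;

      		/* Remember the outer context so a nested run can restore it. */
      		current->bpf_ctx = new_ctx;
      		return old_ctx;
      	}

      	static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
      	{
      		current->bpf_ctx = old_ctx;
      	}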
      
      Restoring old value of the pointer handles preemption, while bpf_run_ctx
      pointer being a property of current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of CPU. It would even allow CPU migration of BPF
      programs, even though it's not currently allowed by BPF infra.
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
  22. 28 June 2021, 1 commit
    • Revert "signal: Allow tasks to cache one sigqueue struct" · b4b27b9e
      Authored by Linus Torvalds
      This reverts commits 4bad58eb (and
      399f8dd9, which tried to fix it).
      
      I do not believe these are correct, and I'm about to release 5.13, so am
      reverting them out of an abundance of caution.
      
      The locking is odd, and appears broken.
      
      On the allocation side (in __sigqueue_alloc()), the locking is somewhat
      straightforward: it depends on sighand->siglock.  Since one caller
      doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
      the case with no locks held.
      
      On the freeing side (in sigqueue_cache_or_free()), there is no locking
      at all, and the logic instead depends on 'current' being a single
      thread, and not able to race with itself.
      
      To make things more exciting, there's also the data race between freeing
      a signal and allocating one, which is handled by using WRITE_ONCE() and
      READ_ONCE(), and being mutually exclusive wrt the initial state (ie
      freeing will only free if the old state was NULL, while allocating will
      obviously only use the value if it was non-NULL, so only one or the
      other will actually act on the value).
      
      However, while the free->alloc paths do seem mutually exclusive thanks
      to just the data value dependency, it's not clear what the memory
      ordering constraints are on it.  Could writes from the previous
      allocation possibly be delayed and seen by the new allocation later,
      causing logical inconsistencies?
      
      So it's all very exciting and unusual.
      
      And in particular, it seems that the freeing side is incorrect in
      depending on "current" being single-threaded.  Yes, 'current' is a
      single thread, but in the presense of asynchronous events even a single
      thread can have data races.
      
      And such asynchronous events can and do happen, with interrupts causing
      signals to be flushed and thus free'd (for example - sending a
      SIGCONT/SIGSTOP can happen from interrupt context, and can flush
      previously queued process control signals).
      
      So regardless of all the other questions about the memory ordering and
      locking for this new cached allocation, the sigqueue_cache_or_free()
      assumptions seem to be fundamentally incorrect.
      
      It may be that people will show me the errors of my ways, and tell me
      why this is all safe after all.  We can reinstate it if so.  But my
      current belief is that the WRITE_ONCE() that sets the cached entry needs
      to be a smp_store_release(), and the READ_ONCE() that finds a cached
      entry needs to be a smp_load_acquire() to handle memory ordering
      correctly.
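      As a sketch of that suggested ordering (not the reverted code; the field
      name is assumed for illustration), the pairing would be:

      	/* Free side: publish the cached entry only after all writes to *q
      	 * are done, so a later allocation cannot observe stale contents. */
      	smp_store_release(&current->sigqueue_cache, q);

      	/* Alloc side: the acquire pairs with the release above. */
      	q = smp_load_acquire(&current->sigqueue_cache);
      	if (q) {
      		/* Safe to reuse: the initialization of *q is visible here. */
      	}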
      
      And the sequence in sigqueue_cache_or_free() would need to either use a
      lock or at least be interrupt-safe some way (perhaps by using something
      like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
      percpu operations it needs to be interrupt-safe).
      
      Fixes: 399f8dd9 ("signal: Prevent sigqueue caching after task got released")
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 18 June 2021, 3 commits
  24. 03 June 2021, 1 commit
    • sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling · 68d7a190
      Authored by Dietmar Eggemann
      The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
      unnecessary util_est updates uses the LSB of util_est.enqueued. It is
      exposed via _task_util_est() (and task_util_est()).
      
      Commit 92a801e5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
      mentions that the LSB is lost for util_est resolution but
      find_energy_efficient_cpu() checks if task_util_est() returns 0 to
      return prev_cpu early.
      
      _task_util_est() returns the max value of util_est.ewma and
      util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
      So task_util_est() returning the max of task_util() and
      _task_util_est() will never return 0 under the default
      SCHED_FEAT(UTIL_EST, true).
      
      To fix this use the MSB of util_est.enqueued instead and keep the flag
      util_est internal, i.e. don't export it via _task_util_est().
      
      The maximal possible util_avg value for a task is 1024 so the MSB of
      'unsigned int util_est.enqueued' isn't used to store a util value.
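      In other words, the flag presumably moves to the top bit, roughly as
      follows (a sketch, not the exact patch):

      	/* The LSB (0x1) collided with the util value; the MSB of the 32-bit
      	 * enqueued field is free because util_avg never exceeds 1024. */
      	#define UTIL_AVG_UNCHANGED	0x80000000

      	static inline unsigned long _task_util_est(struct task_struct *p)
      	{
      		struct util_est ue = READ_ONCE(p->se.avg.util_est);

      		/* Strip the internal flag so it is no longer exported. */
      		return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
      	}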
      
      As a caveat the code behind the util_est_se trace point has to filter
      UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
      be easy to do.
      
      This also fixes an issue reported by Xuewen Yan that util_est_update()
      only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
      
        last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
      
      Fixes: b89997aa ("sched/pelt: Fix task util_est update filtering")
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
      Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
  25. 13 May 2021, 1 commit
    • tick/nohz: Kick only _queued_ task whose tick dependency is updated · a1dfb631
      Authored by Marcelo Tosatti
      When the tick dependency of a task is updated, we want it to acknowledge
      the new state and restart the tick if needed. If the task is not
      running, we don't need to kick it because it will observe the new
      dependency upon scheduling in. But if the task is running, we may need
      to send an IPI to it so that it gets notified.
      
      Unfortunately we don't have the means to check if a task is running
      in a race free way. Checking p->on_cpu in a synchronized way against
      p->tick_dep_mask would imply adding a full barrier between
      prepare_task_switch() and tick_nohz_task_switch(), which we want to
      avoid in this fast-path.
      
      Therefore we blindly fire an IPI to the task's CPU.
      
      Meanwhile we can check if the task is queued on the CPU rq because
      p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its
      full barrier that precedes tick_nohz_task_switch(). And if the task is
      queued on a nohz_full CPU, it also has a fair chance of actually running,
      as the isolation constraints prescribe running single tasks on full
      dynticks CPUs.
      
      So use this as a trick to check if we can spare an IPI toward a
      non-running task.
      
      NOTE: For the ordering to be correct, it is assumed that we never
      deactivate a task while it is running, the only exception being the task
      deactivating itself while scheduling out.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210512232924.150322-9-frederic@kernel.org
  26. 12 May 2021, 2 commits