1. 03 Aug 2022, 1 commit
    • sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed · b6e8d40d
      Committed by Waiman Long
      With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
      that the cpuset will just use the effective CPUs of its parent. So
      cpuset_can_attach() can call task_can_attach() with an empty mask.
      This can lead to cpumask_any_and() returning nr_cpu_ids, causing the
      call to dl_bw_of() to crash due to a percpu access with an
      out-of-bounds CPU value. For example:
      
      	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
      	  :
      	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
      	  :
      	[80468.207946] Call Trace:
      	[80468.208947]  cpuset_can_attach+0xa0/0x140
      	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
      	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
      	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
      	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
      	[80468.213777]  new_sync_write+0x11f/0x1b0
      	[80468.214689]  vfs_write+0x1eb/0x280
      	[80468.215592]  ksys_write+0x5f/0xe0
      	[80468.216463]  do_syscall_64+0x5c/0x80
      	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
      is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
      to be used by tasks within the cpuset anyway.
      
      Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
      reflect the change. In addition, a check is added to task_can_attach()
      to guard against the possibility that cpumask_any_and() may return a
      value >= nr_cpu_ids.
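
      A minimal sketch of that guard as described above (simplified; the
      parameter name follows the renamed argument, surrounding code assumed):

	int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

	/* An empty mask (or no overlap with active CPUs) yields nr_cpu_ids. */
	if (unlikely(cpu >= nr_cpu_ids))
		return -EINVAL;

	ret = dl_cpu_busy(cpu, p);	/* dl_bw_of(cpu) is now a valid percpu access */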
      
      Fixes: 7f51412a ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
  2. 28 Jun 2022, 1 commit
  3. 12 May 2022, 2 commits
    • sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state · 31cae1ea
      Committed by Peter Zijlstra
      Currently ptrace_stop() / do_signal_stop() rely on the special states
      TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
      state exists only in task->__state and nowhere else.
      
      There's two spots of bother with this:
      
       - PREEMPT_RT has task->saved_state which complicates matters,
         meaning task_is_{traced,stopped}() needs to check an additional
         variable.
      
       - An alternative freezer implementation that itself relies on a
         special TASK state would lose TASK_TRACED/TASK_STOPPED and would
         result in misbehaviour.
      
      As such, add additional state to task->jobctl to track this state
      outside of task->__state.
      
      NOTE: this doesn't actually fix anything yet, just adds extra state.
      
      --EWB
        * didn't add an unnecessary newline in signal.h
        * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
          instead of in signal_wake_up_state.  This prevents the clearing
          of TASK_STOPPED and TASK_TRACED from getting lost.
        * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
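
      A minimal sketch of the extra state, assuming illustrative bit values
      (not copied from the patch):

	/* New jobctl bits mirroring the TASK_STOPPED / TASK_TRACED states: */
	#define JOBCTL_STOPPED_BIT	26
	#define JOBCTL_TRACED_BIT	27
	#define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
	#define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)

	/* Set alongside __state when entering the traced state, e.g.: */
	current->jobctl |= JOBCTL_TRACED;
	set_special_state(TASK_TRACED);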
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220421150654.757693825@infradead.org
      Tested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/20220505182645.497868-12-ebiederm@xmission.com
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • ptrace: Don't change __state · 2500ad1c
      Committed by Eric W. Biederman
      Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
      command is executing.
      
      Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
      implement a new jobctl flag TASK_PTRACE_FROZEN.  This new flag is set
      in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
      jobctl_unfreeze_task (when ptrace_stop remains asleep).
      
      In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
      when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
      when TASK_PTRACE_FROZEN is not set.  This has the same effect as
      changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
      TASK_KILLABLE go through signal_wake_up.
      
      Handle a ptrace_stop being called with a pending fatal signal.
      Previously it would have been handled by schedule simply failing to
      sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED, schedule
      will sleep with a fatal signal pending.  The code in signal_wake_up
      guarantees that the task will be woken by any fatal signal that
      comes after TASK_TRACED is set.
      
      Previously the __state value of __TASK_TRACED was changed to
      TASK_RUNNING when woken up, or back to TASK_TRACED when the task was
      left in ptrace_stop.  Now, when woken up, ptrace_stop clears
      JOBCTL_PTRACE_FROZEN, and when left sleeping, ptrace_unfreeze_traced
      clears JOBCTL_PTRACE_FROZEN.
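
      A hedged sketch of the resulting signal_wake_up() behaviour (combining
      the jobctl bits from the previous entry; not a verbatim copy of the
      patch):

	static inline void signal_wake_up(struct task_struct *t, bool fatal)
	{
		unsigned int state = 0;

		/* A fatal signal may wake a traced task unless ptrace froze it. */
		if (fatal && !(t->jobctl & JOBCTL_PTRACE_FROZEN)) {
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
			state = TASK_WAKEKILL | __TASK_TRACED;
		}
		signal_wake_up_state(t, state);
	}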
      Tested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 01 May 2022, 1 commit
  5. 29 Apr 2022, 1 commit
  6. 27 Apr 2022, 1 commit
    • x86/split_lock: Make life miserable for split lockers · b041b525
      Committed by Tony Luck
      In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
      said:
      
        It's simply wishful thinking that stuff gets fixed because of a
        WARN_ONCE(). This has never worked. The only thing which works is to
        make stuff fail hard or slow it down in a way which makes it annoying
        enough to users to complain.
      
      He was talking about WBINVD. But it made me think about how we use the
      split lock detection feature in Linux.
      
      Existing code has three options for applications:
      
       1) Don't enable split lock detection (allow arbitrary split locks)
       2) Warn once when a process uses split lock, but let the process
          keep running with split lock detection disabled
       3) Kill processes that use split locks
      
      Option 2 falls into the "wishful thinking" territory that Thomas warns does
      nothing. But option 3 might not be viable in a situation with legacy
      applications that need to run.
      
      Hence make option 2 much stricter to "slow it down in a way which makes
      it annoying".
      
      Primary reason for this change is to provide better quality of service to
      the rest of the applications running on the system. Internal testing shows
      that even with many processes splitting locks, performance for the rest of
      the system is much more responsive.
      
      The new "warn" mode operates like this.  When an application tries to
      execute a bus lock the #AC handler.
      
       1) Delays (interruptibly) 10 ms before moving to next step.
      
       2) Blocks (interruptibly) until it can get the semaphore
      	If interrupted, just return. Assume the signal will either
      	kill the task, or direct execution away from the instruction
      	that is trying to get the bus lock.
       3) Disables split lock detection for the current core
       4) Schedules a work queue to re-enable split lock detect in 2 jiffies
       5) Returns
      
      The work queue that re-enables split lock detection also releases the
      semaphore.
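
      A rough sketch of that flow (helper and variable names are assumptions,
      not the exact identifiers in the patch):

	static DEFINE_SEMAPHORE(buslock_sem);		/* one split-locker at a time */
	static DECLARE_DELAYED_WORK(sl_reenable, split_lock_reenable_workfn);

	static void split_lock_warn(unsigned long ip)
	{
		if (msleep_interruptible(10) > 0)	/* 1) penalty delay, interruptible */
			return;
		if (down_interruptible(&buslock_sem))	/* 2) serialize offenders */
			return;
		sld_update_msr(false);			/* 3) disable detection on this core */
		schedule_delayed_work(&sl_reenable, 2);	/* 4) re-enable in 2 jiffies */
	}						/* 5) return; the work re-enables and ups the semaphore */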
      
      There is a corner case where a CPU may be taken offline while split lock
      detection is disabled. A CPU hotplug handler handles this case.
      
      Old behaviour was to only print the split lock warning on the first
      occurrence of a split lock from a task. Preserve that by adding a flag to
      the task structure that suppresses subsequent split lock messages from that
      task.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com
  7. 22 Apr 2022, 1 commit
  8. 05 Apr 2022, 1 commit
  9. 31 Mar 2022, 1 commit
  10. 23 Mar 2022, 1 commit
  11. 18 Mar 2022, 1 commit
  12. 04 Mar 2022, 1 commit
    • signal, x86: Delay calling signals in atomic on RT enabled kernels · bf9ad37d
      Committed by Oleg Nesterov
      On x86_64 we must disable preemption before we enable interrupts
      for stack faults, int3 and debugging, because the current task is using
      a per CPU debug stack defined by the IST. If we schedule out, another task
      can come in and use the same stack and cause the stack to be corrupted
      and crash the kernel on return.
      
      When CONFIG_PREEMPT_RT is enabled, spinlock_t locks become sleeping, and
      one of these is the spin lock used in signal handling.
      
      Some of the debug code (int3) causes do_trap() to send a signal.
      This function takes a spinlock_t lock that has been converted to a
      sleeping lock. If this happens, the stack-corruption issue described
      above becomes possible.
      
      Instead of sending the signal right away, on PREEMPT_RT and x86
      the signal information is stored in the task's task_struct and
      TIF_NOTIFY_RESUME is set. Then, on exit from the trap, the signal resume
      code sends the signal once preemption is enabled.
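
      A minimal sketch of that deferral, assuming the field and flag names
      from the description (surrounding code assumed):

	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
		/* Can't take the now-sleeping sighand lock from this context. */
		current->forced_info = *info;		/* stash the kernel_siginfo */
		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
		return;					/* delivered on exit to user mode */
	}
	/* ... otherwise deliver immediately, as before ... */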
      
      [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to
        ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]
      [bigeasy: Add on 32bit as per Yang Shi, minor rewording. ]
      [ tglx: Use a config option ]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/Ygq5aBB/qMQw6aP5@linutronix.de
  13. 01 Mar 2022, 2 commits
  14. 19 Feb 2022, 1 commit
    • sched/preempt: Add PREEMPT_DYNAMIC using static keys · 99cf983c
      Committed by Mark Rutland
      Where an architecture selects HAVE_STATIC_CALL but not
      HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
      which will either branch to a callee or return to the caller.
      
      On such architectures, a number of constraints can conspire to make
      those trampolines more complicated and potentially less useful than we'd
      like. For example:
      
      * Hardware and software control flow integrity schemes can require the
        addition of "landing pad" instructions (e.g. `BTI` for arm64), which
        will also be present at the "real" callee.
      
      * Limited branch ranges can require that trampolines generate or load an
        address into a register and perform an indirect branch (or at least
        have a slow path that does so). This loses some of the benefits of
        having a direct branch.
      
       * Interaction with SW CFI schemes can be complicated and fragile, e.g.
         requiring that we can recognise idiomatic codegen and remove the
         indirections it introduces, at least until clang provides more helpful
         mechanisms for dealing with this.
      
      For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
      really only need to enable/disable specific preemption functions. We can
      achieve the same effect without a number of the pain points above by
      using static keys to fold early returns into the preemption functions
      themselves rather than in an out-of-line trampoline, effectively
      inlining the trampoline into the start of the function.
      
      For arm64, this results in good code generation. For example, the
      dynamic_cond_resched() wrapper looks as follows when enabled. When
      disabled, the first `B` is replaced with a `NOP`, resulting in an early
      return.
      
      | <dynamic_cond_resched>:
      |        bti     c
      |        b       <dynamic_cond_resched+0x10>     // or `nop`
      |        mov     w0, #0x0
      |        ret
      |        mrs     x0, sp_el0
      |        ldr     x0, [x0, #8]
      |        cbnz    x0, <dynamic_cond_resched+0x8>
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      ... compared to the regular form of the function:
      
      | <__cond_resched>:
      |        bti     c
      |        mrs     x0, sp_el0
      |        ldr     x1, [x0, #8]
      |        cbz     x1, <__cond_resched+0x18>
      |        mov     w0, #0x0
      |        ret
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      Any architecture which implements static keys should be able to use this
      to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
      calls. Since this is likely to have greater overhead than (inlined)
      static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
      HAVE_PREEMPT_DYNAMIC_CALL is selected.
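
      In C terms the static-key form amounts to roughly the following sketch
      (the upstream version is generated via macros):

	static DEFINE_STATIC_KEY_TRUE(sk_dynamic_cond_resched);

	int dynamic_cond_resched(void)
	{
		/* Patched to a NOP (early return) when full preemption is selected. */
		if (!static_branch_unlikely(&sk_dynamic_cond_resched))
			return 0;
		return __cond_resched();
	}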
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
  15. 15 Feb 2022, 1 commit
  16. 04 Feb 2022, 1 commit
    • Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Committed by Igor Pylypiv
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when the modprobe thread calls async_schedule(),
      but it does not work if the module dispatches its init code to a worker
      thread which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      the node where the device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
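
      After the revert, the wait in do_init_module() is roughly this sketch:

	/* Always wait for async work, unless async probing was requested. */
	if (!mod->async_probe_requested)
		async_synchronize_full();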
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 20 Jan 2022, 1 commit
  18. 18 Jan 2022, 1 commit
  19. 08 Jan 2022, 1 commit
  20. 10 Dec 2021, 1 commit
    • kcsan: Add core support for a subset of weak memory modeling · 69562e49
      Committed by Marco Elver
      Add support for modeling a subset of weak memory, which will enable
      detection of a subset of data races due to missing memory barriers.
      
      KCSAN's approach to detecting missing memory barriers is based on
      modeling access reordering, and enabled if `CONFIG_KCSAN_WEAK_MEMORY=y`,
      which depends on `CONFIG_KCSAN_STRICT=y`. The feature can be enabled or
      disabled at boot and runtime via the `kcsan.weak_memory` boot parameter.
      
      Each memory access for which a watchpoint is set up is also selected
      for simulated reordering within the scope of its function (at most 1
      in-flight access).
      
      We are limited to modeling the effects of "buffering" (delaying the
      access), since the runtime cannot "prefetch" accesses (therefore no
      acquire modeling). Once an access has been selected for reordering, it
      is checked along every other access until the end of the function scope.
      If an appropriate memory barrier is encountered, the access will no
      longer be considered for reordering.
      
      When the result of a memory operation should be ordered by a barrier,
      KCSAN can then detect data races where the conflict only occurs as a
      result of a missing barrier due to reordering accesses.
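
      As an illustration (not taken from the patch), this is the class of bug
      such reordering modeling can flag, namely a plain store published
      without a write barrier:

	int payload;
	int published;

	void producer(void)
	{
		payload = 42;
		/* Missing smp_wmb() here: the plain store above may be delayed
		 * past the flag store; KCSAN now simulates that reordering. */
		WRITE_ONCE(published, 1);
	}

	void consumer(void)
	{
		if (READ_ONCE(published)) {
			smp_rmb();
			BUG_ON(payload != 42);	/* racy read gets reported */
		}
	}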
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  21. 04 Dec 2021, 2 commits
    • locking: Mark racy reads of owner->on_cpu · 4cf75fd4
      Committed by Marco Elver
      One of the more frequent data races reported by KCSAN is the racy read
      in mutex_spin_on_owner(), which is usually reported as "race of unknown
      origin" without showing the writer. This is due to the racing write
      occurring in kernel/sched. Locally enabling KCSAN in kernel/sched shows:
      
       | write (marked) to 0xffff97f205079934 of 4 bytes by task 316 on cpu 6:
       |  finish_task                kernel/sched/core.c:4632 [inline]
       |  finish_task_switch         kernel/sched/core.c:4848
       |  context_switch             kernel/sched/core.c:4975 [inline]
       |  __schedule                 kernel/sched/core.c:6253
       |  schedule                   kernel/sched/core.c:6326
       |  schedule_preempt_disabled  kernel/sched/core.c:6385
       |  __mutex_lock_common        kernel/locking/mutex.c:680
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | read to 0xffff97f205079934 of 4 bytes by task 322 on cpu 3:
       |  mutex_spin_on_owner        kernel/locking/mutex.c:370
       |  mutex_optimistic_spin      kernel/locking/mutex.c:480
       |  __mutex_lock_common        kernel/locking/mutex.c:610
       |  __mutex_lock               kernel/locking/mutex.c:740 [inline]
       |  __mutex_lock_slowpath      kernel/locking/mutex.c:1028
       |  mutex_lock                 kernel/locking/mutex.c:283
       |  tty_open_by_driver         drivers/tty/tty_io.c:2062 [inline]
       |  ...
       |
       | value changed: 0x00000001 -> 0x00000000
      
      This race is clearly intentional, and the potential for miscompilation
      is slim due to surrounding barrier() and cpu_relax(), and the value
      being used as a boolean.
      
      Nevertheless, marking this reader would more clearly denote intent and
      make it obvious that concurrency is expected. Use READ_ONCE() to avoid
      having to reason about compiler optimizations now and in future.
      
      With the previous refactor, mark the read of owner->on_cpu in
      owner_on_cpu(), which immediately precedes the loop executing
      mutex_spin_on_owner().
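
      The marked read then looks roughly like this (a sketch, not the exact
      upstream helper):

	static inline bool owner_on_cpu(struct task_struct *owner)
	{
		/*
		 * owner->on_cpu is written by the scheduler on another CPU;
		 * READ_ONCE() documents the intentional race and keeps the
		 * compiler from tearing or re-reading the value.
		 */
		return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
	}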
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-3-wangkefeng.wang@huawei.com
    • locking: Make owner_on_cpu() into <linux/sched.h> · c0bed69d
      Committed by Kefeng Wang
      Move owner_on_cpu() from kernel/locking/rwsem.c into
      include/linux/sched.h under CONFIG_SMP, then use it in
      mutex/rwsem/rtmutex to simplify the code.
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20211203075935.136808-2-wangkefeng.wang@huawei.com
  22. 17 Nov 2021, 1 commit
    • sched/core: Forced idle accounting · 4feee7d1
      Committed by Josh Don
      Adds accounting for "forced idle" time, which is time where a cookie'd
      task forces its SMT sibling to idle, despite the presence of runnable
      tasks.
      
      Forced idle time is one means to measure the cost of enabling core
      scheduling (ie. the capacity lost due to the need to force idle).
      
      Forced idle time is attributed to the thread responsible for causing
      the forced idle.
      
      A few details:
       - Forced idle time is displayed via /proc/PID/sched. It also requires
         that schedstats is enabled.
       - Forced idle is only accounted when a sibling hyperthread is held
         idle despite the presence of runnable tasks. No time is charged if
         a sibling is idle but has no runnable tasks.
       - Tasks with 0 cookie are never charged forced idle.
       - For SMT > 2, we scale the amount of forced idle charged based on the
         number of forced idle siblings. Additionally, we split the time up and
         evenly charge it to all running tasks, as each is equally responsible
         for the forced idle.
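
      A hedged sketch of the SMT > 2 attribution described in the last point
      (variable and field names assumed):

	/* Scale by how many siblings were forced idle, then split the charge
	 * evenly across the running tasks. */
	delta *= nr_forced_idle_siblings;
	delta = div_u64(delta, nr_running_tasks);
	__schedstat_add(p->stats.core_forceidle_sum, delta);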
      Signed-off-by: Josh Don <joshdon@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
  23. 23 Oct 2021, 1 commit
    • sched: make task_struct->plug always defined · 599593a8
      Committed by Jens Axboe
      If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
      it generally available, so we don't break the compile:
      
      kernel/sched/core.c: In function ‘sched_submit_work’:
      kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
       6346 |                 blk_flush_plug(tsk->plug, true);
            |                                   ^~
      kernel/sched/core.c: In function ‘io_schedule_prepare’:
      kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
       8357 |         if (current->plug)
            |                    ^~
      kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
       8358 |                 blk_flush_plug(current->plug, true);
            |                                       ^~
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  24. 15 Oct 2021, 1 commit
  25. 14 Oct 2021, 1 commit
    • sched: Fill unconditional hole induced by sched_entity · 804bccba
      Committed by Kees Cook
      With struct sched_entity before the other sched entities, its alignment
      won't induce a struct hole. This saves 64 bytes in defconfig task_struct:
      
      Before:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              const struct sched_class  * sched_class;         /*   128     8 */
      
              /* XXX 56 bytes hole, try to pack */
      
              /* --- cacheline 3 boundary (192 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   192   448 */
              /* --- cacheline 10 boundary (640 bytes) --- */
              struct sched_rt_entity     rt;                   /*   640    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   688   224 */
              /* --- cacheline 14 boundary (896 bytes) was 16 bytes ago --- */
      
      After:
      	...
              unsigned int               rt_priority;          /*   120     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
              struct sched_entity        se __attribute__((__aligned__(64))); /*   128   448 */
              /* --- cacheline 9 boundary (576 bytes) --- */
              struct sched_rt_entity     rt;                   /*   576    48 */
              struct sched_dl_entity     dl __attribute__((__aligned__(8))); /*   624   224 */
              /* --- cacheline 13 boundary (832 bytes) was 16 bytes ago --- */
      
      Summary diff:
      -	/* size: 7040, cachelines: 110, members: 188 */
      +	/* size: 6976, cachelines: 109, members: 188 */
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210924025450.4138503-1-keescook@chromium.org
  26. 07 Oct 2021, 1 commit
    • coredump: Don't perform any cleanups before dumping core · 92307383
      Committed by Eric W. Biederman
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm remove code to
      test for a coredump in progress from mm_release.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  27. 05 Oct 2021, 2 commits
    • sched: Introduce task block time in schedstats · 847fc0cd
      Committed by Yafang Shao
      Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
      there's no metric to show how long a task spends in D state.  Once a task
      is in D state, it is blocked in the kernel, for example waiting for a
      mutex. The D state is more frequent than iowait, and it is more critical
      than S state, so it is worth adding a metric to measure it.
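
      Conceptually the new field sits next to the existing sleep-time sum (a
      sketch of the struct change, not the full patch):

	struct sched_statistics {
		...
		u64			sum_sleep_runtime;
		u64			sum_block_runtime;	/* new: time spent blocked in D state */
		...
	};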
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
    • sched: Make struct sched_statistics independent of fair sched class · ceeadb83
      Committed by Yafang Shao
      If we want to use the schedstats facility to trace other sched classes, we
      should make it independent of the fair sched class. The struct
      sched_statistics holds the scheduler statistics of a task_struct or a
      task_group, so we can move it into struct task_struct and struct
      task_group to achieve the goal.

      After the patch, schedstats are organized as follows:
      
          struct task_struct {
             ...
             struct sched_entity se;
             struct sched_rt_entity rt;
             struct sched_dl_entity dl;
             ...
             struct sched_statistics stats;
             ...
         };
      
      Regarding the task group, schedstats is only supported for fair group
      sched, and a new struct sched_entity_stats is introduced, suggested by
      Peter -
      
          struct sched_entity_stats {
              struct sched_entity     se;
              struct sched_statistics stats;
          } __no_randomize_layout;
      
      Then with the se in a task_group, we can easily get the stats.
      
      The sched_statistics members may be frequently modified when schedstats is
      enabled, in order to avoid impacting on random data which may in the same
      cacheline with them, the struct sched_statistics is defined as cacheline
      aligned.
      
      As this patch changes a core scheduler struct, I verified its potential
      performance impact on the scheduler with 'perf bench sched pipe', as
      suggested by Mel. Below is the result, in which all the values are in
      usecs/op.
                                        Before               After
            kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
            kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
      [These numbers differ a little from the earlier version because my old
       test machine was destroyed, so I had to use a different test machine.]
      
      Almost no impact on the sched performance.
      
      No functional change.
      
      [lkp@intel.com: reported build failure in earlier version]
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
  28. 01 Oct 2021, 5 commits
  29. 30 Sep 2021, 1 commit
    • sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y · bcf9033e
      Committed by Ard Biesheuvel
      THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
      causes some issues on architectures that define raw_smp_processor_id()
      in terms of this field, due to the fact that #include'ing linux/sched.h
      to get at struct task_struct is problematic in terms of circular
      dependencies.
      
      Given that thread_info and task_struct are the same data structure
      anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
      access to the type definition of struct thread_info is sufficient to
      reference the CPU number of the current task.
      
      Note that this requires THREAD_INFO_IN_TASK's definition of the
      task_thread_info() helper to be updated, as task_cpu() takes a
      pointer-to-const, whereas task_thread_info() (which is used to generate
      lvalues as well), needs a non-const pointer. So make it a macro instead.
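
      A sketch of the resulting definitions (simplified):

	#ifdef CONFIG_THREAD_INFO_IN_TASK
	/* A macro so it can also be used as an lvalue despite the const below. */
	# define task_thread_info(task)	(&(task)->thread_info)
	#endif

	static inline unsigned int task_cpu(const struct task_struct *p)
	{
		return READ_ONCE(task_thread_info(p)->cpu);
	}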
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
  30. 14 Sep 2021, 1 commit
    • x86/mce: Avoid infinite loop for copy from user recovery · 81065b35
      Committed by Tony Luck
      There are two cases for machine check recovery:
      
      1) The machine check was triggered by ring3 (application) code.
         This is the simpler case. The machine check handler simply queues
         work to be executed on return to user. That code unmaps the page
         from all users and arranges to send a SIGBUS to the task that
         triggered the poison.
      
      2) The machine check was triggered in kernel code that is covered by
         an exception table entry. In this case the machine check handler
         still queues a work entry to unmap the page, etc. but this will
         not be called right away because the #MC handler returns to the
         fix up code address in the exception table entry.
      
      Problems occur if the kernel triggers another machine check before the
      return to user processes the first queued work item.
      
      Specifically, the work is queued using the ->mce_kill_me callback
      structure in the task struct for the current thread. Attempting to queue
      a second work item using this same callback results in a loop in the
      linked list of work functions to call. So when the kernel does return to
      user, it enters an infinite loop processing the same entry for ever.
      
      There are some legitimate scenarios where the kernel may take a second
      machine check before returning to the user.
      
      1) Some code (e.g. futex) first tries a get_user() with page faults
         disabled. If this fails, the code retries with page faults enabled
         expecting that this will resolve the page fault.
      
      2) Copy from user code retries a copy in byte-at-time mode to check
         whether any additional bytes can be copied.
      
      On the other side of the fence are some bad drivers that do not check
      the return value from individual get_user() calls and may access
      multiple user addresses without noticing that some/all calls have
      failed.
      
      Fix by adding a counter (current->mce_count) to keep track of repeated
      machine checks before task_work() is called. First machine check saves
      the address information and calls task_work_add(). Subsequent machine
      checks before that task_work callback is executed check that the address
      is in the same page as the first machine check (since the callback will
      offline exactly one page).
      
      Expected worst case is four machine checks before moving on (e.g. one
      user access with page faults disabled, then a repeat to the same address
      with page faults enabled ... repeat in copy tail bytes). Just in case
      there is some code that loops forever, enforce a limit of 10.
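
      In outline, the repeat-#MC handling could look like this sketch
      (simplified; not the literal patch):

	current->mce_count++;

	if (current->mce_count == 1) {
		/* First machine check: remember the address, queue the work. */
		current->mce_addr = m->addr;
		task_work_add(current, &current->mce_kill_me, TWA_RESUME);
	} else if (current->mce_count > 10 ||
		   (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) {
		/* Too many repeats, or a different page: give up and panic. */
		panic("Repeated machine checks before the queued task_work ran");
	}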
      
       [ bp: Massage commit message, drop noinstr, fix typo, extend panic
         messages. ]
      
      Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
  31. 28 Aug 2021, 1 commit
    • eventfd: Make signal recursion protection a task bit · b542e383
      Committed by Thomas Gleixner
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
      
      No functional change for !RT kernels.
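
      A sketch of the per-task bit, assuming names based on the description:

	/* task_struct gains one more single-bit flag: */
	unsigned			in_eventfd_signal:1;

	/* eventfd_signal() then checks and sets it around the wakeup: */
	if (WARN_ON_ONCE(current->in_eventfd_signal))
		return 0;

	current->in_eventfd_signal = 1;
	/* ... wake up waiters, add to the count ... */
	current->in_eventfd_signal = 0;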
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
  32. 20 Aug 2021, 1 commit