1. 23 Oct 2021, 1 commit
    • J
      sched: make task_struct->plug always defined · 599593a8
      Committed by Jens Axboe
      If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
      it generally available, so we don't break the compile:
      
      kernel/sched/core.c: In function ‘sched_submit_work’:
      kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
       6346 |                 blk_flush_plug(tsk->plug, true);
            |                                   ^~
      kernel/sched/core.c: In function ‘io_schedule_prepare’:
      kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
       8357 |         if (current->plug)
            |                    ^~
      kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
       8358 |                 blk_flush_plug(current->plug, true);
            |                                       ^~
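      A minimal sketch of the shape of the fix (illustrative struct and stub only, not
      the upstream diff): keep the ->plug pointer unconditionally in task_struct and
      let the !CONFIG_BLOCK stub of blk_flush_plug() simply ignore it.
      
        /* Sketch only: mirrors the pattern of the fix, not the kernel sources. */
        #include <stdbool.h>
        #include <stddef.h>
        
        struct blk_plug;                        /* opaque; only ever used via a pointer */
        
        struct task_struct_sketch {
                struct blk_plug *plug;          /* always defined, NULL when unused */
        };
        
        #ifdef CONFIG_BLOCK
        void blk_flush_plug(struct blk_plug *plug, bool from_schedule);
        #else
        static inline void blk_flush_plug(struct blk_plug *plug, bool from_schedule)
        {
                /* No block layer configured: nothing to flush. */
                (void)plug;
                (void)from_schedule;
        }
        #endif
        
        static void sched_submit_work_sketch(struct task_struct_sketch *tsk)
        {
                if (tsk->plug)
                        blk_flush_plug(tsk->plug, true);
        }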
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      599593a8
  2. 01 Oct 2021, 1 commit
  3. 14 Sep 2021, 1 commit
    • T
      x86/mce: Avoid infinite loop for copy from user recovery · 81065b35
      Committed by Tony Luck
      There are two cases for machine check recovery:
      
      1) The machine check was triggered by ring3 (application) code.
         This is the simpler case. The machine check handler simply queues
         work to be executed on return to user. That code unmaps the page
         from all users and arranges to send a SIGBUS to the task that
         triggered the poison.
      
      2) The machine check was triggered in kernel code that is covered by
         an exception table entry. In this case the machine check handler
         still queues a work entry to unmap the page, etc. but this will
         not be called right away because the #MC handler returns to the
         fix up code address in the exception table entry.
      
      Problems occur if the kernel triggers another machine check before the
      return to user processes the first queued work item.
      
      Specifically, the work is queued using the ->mce_kill_me callback
      structure in the task struct for the current thread. Attempting to queue
      a second work item using this same callback results in a loop in the
      linked list of work functions to call. So when the kernel does return to
      user, it enters an infinite loop processing the same entry for ever.
      
      There are some legitimate scenarios where the kernel may take a second
      machine check before returning to the user.
      
      1) Some code (e.g. futex) first tries a get_user() with page faults
         disabled. If this fails, the code retries with page faults enabled
         expecting that this will resolve the page fault.
      
      2) Copy from user code retries a copy in byte-at-time mode to check
         whether any additional bytes can be copied.
      
      On the other side of the fence are some bad drivers that do not check
      the return value from individual get_user() calls and may access
      multiple user addresses without noticing that some/all calls have
      failed.
      
      Fix by adding a counter (current->mce_count) to keep track of repeated
      machine checks before task_work() is called. The first machine check saves
      the address information and calls task_work_add(). Subsequent machine
      checks that arrive before that task_work callback has executed check that
      the address is in the same page as the first machine check (since the
      callback will offline exactly one page).
      
      Expected worst case is four machine checks before moving on (e.g. one
      user access with page faults disabled, then a repeat to the same address
      with page faults enabled ... repeat in copy tail bytes). Just in case
      there is some code that loops forever, enforce a limit of 10.
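      A minimal sketch of the counter logic described above (simplified; field names
      as in the description, helper details may differ from the upstream diff):
      
        static void queue_task_work_sketch(struct mce *m)
        {
                int count = ++current->mce_count;
        
                if (count == 1) {
                        /* First #MC: save the address and queue the callback. */
                        current->mce_addr = m->addr;
                        current->mce_kill_me.func = kill_me_maybe;
                        task_work_add(current, &current->mce_kill_me, TWA_RESUME);
                } else if (count < 10) {
                        /* Repeat #MC before the callback ran: it must hit the same
                         * page, since the callback offlines exactly one page. */
                        if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
                                panic("Machine checks to different pages while handling the first one");
                } else {
                        /* Safety net against code that loops forever. */
                        panic("Too many consecutive machine checks");
                }
        }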
      
       [ bp: Massage commit message, drop noinstr, fix typo, extend panic
         messages. ]
      
      Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
      81065b35
  4. 28 Aug 2021, 1 commit
    • T
      eventfd: Make signal recursion protection a task bit · b542e383
      Committed by Thomas Gleixner
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
      
      No functional change for !RT kernels.
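      A minimal sketch of what the per-task bit buys (the field name in_eventfd_signal
      is assumed from the patch; the real eventfd_signal() carries more bookkeeping):
      
        static inline bool eventfd_signal_allowed(void)
        {
                return !current->in_eventfd_signal;
        }
        
        static __u64 eventfd_signal_sketch(struct eventfd_ctx *ctx, __u64 n)
        {
                unsigned long flags;
        
                if (WARN_ON_ONCE(!eventfd_signal_allowed()))
                        return 0;                       /* nested signal: refuse to recurse */
        
                spin_lock_irqsave(&ctx->wqh.lock, flags);
                current->in_eventfd_signal = 1;         /* per task, so RT preemption is harmless */
                if (ULLONG_MAX - ctx->count < n)
                        n = ULLONG_MAX - ctx->count;
                ctx->count += n;
                if (waitqueue_active(&ctx->wqh))
                        wake_up_locked_poll(&ctx->wqh, EPOLLIN);
                current->in_eventfd_signal = 0;
                spin_unlock_irqrestore(&ctx->wqh.lock, flags);
        
                return n;
        }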
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
      b542e383
  5. 20 Aug 2021, 3 commits
  6. 17 Aug 2021, 4 commits
    • T
      sched/core: Provide a scheduling point for RT locks · 6991436c
      Committed by Thomas Gleixner
      RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based
      on rtmutexes. Blocking on such a lock is similar to preemption versus:
      
       - I/O scheduling and worker handling, because these functions might block
         on another substituted lock, or come from a lock contention within these
         functions.
      
       - RCU considers this like a preemption, because the task might be in a read
         side critical section.
      
      Add a separate scheduling point for this, and hand a new scheduling mode
      argument to __schedule() which allows, along with separate mode masks, to
      handle this gracefully from within the scheduler, without proliferating that
      to other subsystems like RCU.
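      A minimal sketch of the new scheduling point (mode constants and function name
      as introduced by the patch; the mode-mask handling inside __schedule() is omitted):
      
        #define SM_NONE                 0x0
        #define SM_PREEMPT              0x1
        #define SM_RTLOCK_WAIT          0x2
        
        void __sched notrace schedule_rtlock(void)
        {
                do {
                        preempt_disable();
                        /* Tell __schedule() this is a sleeping-lock wait, not a
                         * regular preemption, so the I/O/worker flush and the RCU
                         * handling can be treated accordingly. */
                        __schedule(SM_RTLOCK_WAIT);
                        sched_preempt_enable_no_resched();
                } while (need_resched());
        }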
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.372319055@linutronix.de
      6991436c
    • T
      sched/wakeup: Prepare for RT sleeping spin/rwlocks · 5f220be2
      Committed by Thomas Gleixner
      Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
      preserving. Any wakeup which matches the state is valid.
      
      RT enabled kernels substitute them with 'sleeping' spinlocks. This creates
      an issue vs. task::__state.
      
      In order to block on the lock, the task has to overwrite task::__state and a
      consecutive wakeup issued by the unlocker sets the state back to
      TASK_RUNNING. As a consequence the task loses the state which was set
      before the lock acquire and also any regular wakeup targeted at the task
      while it is blocked on the lock.
      
      To handle this gracefully, add a 'saved_state' member to task_struct which
      is used in the following way:
      
       1) When a task blocks on a 'sleeping' spinlock, the current state is saved
          in task::saved_state before it is set to TASK_RTLOCK_WAIT.
      
       2) When the task unblocks and after acquiring the lock, it restores the saved
          state.
      
       3) When a regular wakeup happens for a task while it is blocked then the
          state change of that wakeup is redirected to operate on task::saved_state.
      
          This is also required when the task state is running because the task
          might have been woken up from the lock wait and has not yet restored
          the saved state.
      
      To make it complete, provide the necessary helpers to save and restore the
      saved state along with the necessary documentation of how the RT lock
      blocking is supposed to work.
      
      For non-RT kernels there is no functional change.
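      A minimal sketch of the save/restore helpers used in steps 1) and 2) above
      (CONFIG_PREEMPT_RT only; simplified, without the debug assertions):
      
        #define current_save_and_set_rtlock_wait_state()                        \
                do {                                                            \
                        raw_spin_lock(&current->pi_lock);                       \
                        current->saved_state = current->__state;                \
                        WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT);         \
                        raw_spin_unlock(&current->pi_lock);                     \
                } while (0)
        
        #define current_restore_rtlock_saved_state()                            \
                do {                                                            \
                        raw_spin_lock(&current->pi_lock);                       \
                        WRITE_ONCE(current->__state, current->saved_state);     \
                        current->saved_state = TASK_RUNNING;                    \
                        raw_spin_unlock(&current->pi_lock);                     \
                } while (0)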
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.258751046@linutronix.de
      5f220be2
    • T
      sched/wakeup: Reorganize the current::__state helpers · 85019c16
      Committed by Thomas Gleixner
      In order to avoid more duplicate implementations for the debug and
      non-debug variants of the state change macros, split the debug portion out
      and make that conditional on CONFIG_DEBUG_ATOMIC_SLEEP=y.
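      A minimal sketch of the resulting split (simplified; the special-state variant
      and the lockdep bookkeeping are omitted):
      
        #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
        # define debug_normal_state_change(state_value)                         \
                do {                                                            \
                        WARN_ON_ONCE(is_special_task_state(state_value));       \
                        current->task_state_change = _THIS_IP_;                 \
                } while (0)
        #else
        # define debug_normal_state_change(state_value) do { } while (0)
        #endif
        
        /* The state change itself is written once and shared by both variants. */
        #define set_current_state(state_value)                                  \
                do {                                                            \
                        debug_normal_state_change((state_value));               \
                        smp_store_mb(current->__state, (state_value));          \
                } while (0)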
      Suggested-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.200898048@linutronix.de
      85019c16
    • T
      sched/wakeup: Introduce the TASK_RTLOCK_WAIT state bit · cd781d0c
      Committed by Thomas Gleixner
      RT kernels have an extra quirk for try_to_wake_up() to handle task state
      preservation across periods of blocking on a 'sleeping' spin/rwlock.
      
      For this to function correctly and under all circumstances try_to_wake_up()
      must be able to identify whether the wakeup is lock related or not and
      whether the task is waiting for a lock or not.
      
      The original approach was to use a special wake_flag argument for
      try_to_wake_up() and just use TASK_UNINTERRUPTIBLE for the task's wait state
      and the try_to_wake_up() state argument.
      
      This works in principle, but it is suboptimal because try_to_wake_up()
      cannot determine whether the task is waiting for an RT lock wakeup or for a
      regular wakeup.
      
      RT kernels save the original task state when blocking on an RT lock and
      restore it when the lock has been acquired. Any non lock related wakeup is
      checked against the saved state and if it matches the saved state is set to
      running so that the wakeup is not lost when the state is restored.
      
      While the necessary logic for the wake_flag based solution is trivial, the
      downside is that any regular wakeup with TASK_UNINTERRUPTIBLE in the state
      argument set will wake the task despite the fact that it is still blocked
      on the lock. That's not a fatal problem as the lock wait has to deal with
      spurious wakeups anyway, but it introduces unnecessary latencies.
      
      Introduce the TASK_RTLOCK_WAIT state bit which will be set when a task
      blocks on an RT lock.
      
      The lock wakeup will use wake_up_state(TASK_RTLOCK_WAIT), so both the
      waiting state and the wakeup state are distinguishable, which avoids
      spurious wakeups and allows better analysis.
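      A minimal sketch of how the waiter and the unlocker pair up on the new bit
      (the bit value and rtlock internals are illustrative only):
      
        #define TASK_RTLOCK_WAIT        0x1000
        
        /* Waiter side (lock slowpath): block in the dedicated state. */
        static void rtlock_wait_sketch(void)
        {
                set_current_state(TASK_RTLOCK_WAIT);
                schedule_rtlock();
        }
        
        /* Unlocker side: wake only tasks blocked on the lock, so a regular
         * TASK_UNINTERRUPTIBLE wakeup can no longer be mistaken for a lock wakeup. */
        static void rtlock_wake_sketch(struct task_struct *waiter)
        {
                wake_up_state(waiter, TASK_RTLOCK_WAIT);
        }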
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210815211302.144989915@linutronix.de
      cd781d0c
  7. 28 Jul 2021, 1 commit
  8. 17 Jul 2021, 1 commit
    • A
      bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Committed by Andrii Nakryiko
      b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and unnecessary
      failure mode in which not all BPF programs might be executed if we fail to
      find an unused slot for cgroup storage, however unlikely that is. It might also
      not be so unlikely when/if we allow sleepable cgroup BPF programs in the
      future.
      
      Further, the way that cgroup storage is implemented as an ambiently-available
      property during the entire BPF program execution is a convenient way to pass
      extra information to BPF programs and helpers without requiring user code to
      pass around extra arguments explicitly. So it would be good to have a generic
      solution that allows implementing this without arbitrary restrictions.
      Ideally, such a solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.
      
      This patch introduces such a solution: bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by the BPF_PROG_RUN family
      of macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
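      A minimal sketch of the two helpers and the nesting pattern they implement
      (types trimmed to the essentials):
      
        struct bpf_run_ctx {};
        
        struct bpf_cg_run_ctx {
                struct bpf_run_ctx run_ctx;       /* embedded generic part */
                const struct bpf_prog_array_item *prog_item;
        };
        
        static inline struct bpf_run_ctx *bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
        {
                struct bpf_run_ctx *old_ctx = current->bpf_ctx;
        
                current->bpf_ctx = new_ctx;       /* remember the interrupted program's ctx */
                return old_ctx;
        }
        
        static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
        {
                current->bpf_ctx = old_ctx;       /* restore it when the nested run ends */
        }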
      
      Restoring the old value of the pointer handles preemption, while the bpf_run_ctx
      pointer being a property of the current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of a CPU. It would even allow CPU migration of BPF
      programs, even though that's not currently allowed by the BPF infra.
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
      c7603cfa
  9. 28 Jun 2021, 1 commit
    • L
      Revert "signal: Allow tasks to cache one sigqueue struct" · b4b27b9e
      Committed by Linus Torvalds
      This reverts commits 4bad58eb (and
      399f8dd9, which tried to fix it).
      
      I do not believe these are correct, and I'm about to release 5.13, so am
      reverting them out of an abundance of caution.
      
      The locking is odd, and appears broken.
      
      On the allocation side (in __sigqueue_alloc()), the locking is somewhat
      straightforward: it depends on sighand->siglock.  Since one caller
      doesn't hold that lock, it then additionally tests 'sigqueue_flags' to avoid
      the case with no locks held.
      
      On the freeing side (in sigqueue_cache_or_free()), there is no locking
      at all, and the logic instead depends on 'current' being a single
      thread, and not able to race with itself.
      
      To make things more exciting, there's also the data race between freeing
      a signal and allocating one, which is handled by using WRITE_ONCE() and
      READ_ONCE(), and being mutually exclusive wrt the initial state (ie
      freeing will only free if the old state was NULL, while allocating will
      obviously only use the value if it was non-NULL, so only one or the
      other will actually act on the value).
      
      However, while the free->alloc paths do seem mutually exclusive thanks
      to just the data value dependency, it's not clear what the memory
      ordering constraints are on it.  Could writes from the previous
      allocation possibly be delayed and seen by the new allocation later,
      causing logical inconsistencies?
      
      So it's all very exciting and unusual.
      
      And in particular, it seems that the freeing side is incorrect in
      depending on "current" being single-threaded.  Yes, 'current' is a
      single thread, but in the presence of asynchronous events even a single
      thread can have data races.
      
      And such asynchronous events can and do happen, with interrupts causing
      signals to be flushed and thus free'd (for example - sending a
      SIGCONT/SIGSTOP can happen from interrupt context, and can flush
      previously queued process control signals).
      
      So regardless of all the other questions about the memory ordering and
      locking for this new cached allocation, the sigqueue_cache_or_free()
      assumptions seem to be fundamentally incorrect.
      
      It may be that people will show me the errors of my ways, and tell me
      why this is all safe after all.  We can reinstate it if so.  But my
      current belief is that the WRITE_ONCE() that sets the cached entry needs
      to be a smp_store_release(), and the READ_ONCE() that finds a cached
      entry needs to be a smp_load_acquire() to handle memory ordering
      correctly.
      
      And the sequence in sigqueue_cache_or_free() would need to either use a
      lock or at least be interrupt-safe some way (perhaps by using something
      like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
      percpu operations it needs to be interrupt-safe).
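      For illustration only, the acquire/release pairing suggested here would look
      roughly like the sketch below; it addresses only the ordering concern, not the
      interrupt-safety of the free side:
      
        /* Free side: publish the cached entry only after its contents are visible. */
        if (!READ_ONCE(current->sigqueue_cache))
                smp_store_release(&current->sigqueue_cache, q);
        else
                kmem_cache_free(sigqueue_cachep, q);
        
        /* Alloc side: the acquire pairs with the release above before reuse. */
        q = smp_load_acquire(&current->sigqueue_cache);
        if (q)
                current->sigqueue_cache = NULL;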
      
      Fixes: 399f8dd9 ("signal: Prevent sigqueue caching after task got released")
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4b27b9e
  10. 18 Jun 2021, 3 commits
  11. 03 Jun 2021, 1 commit
    • D
      sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling · 68d7a190
      Committed by Dietmar Eggemann
      The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
      unnecessary util_est updates uses the LSB of util_est.enqueued. It is
      exposed via _task_util_est() (and task_util_est()).
      
      Commit 92a801e5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
      mentions that the LSB is lost for util_est resolution but
      find_energy_efficient_cpu() checks if task_util_est() returns 0 to
      return prev_cpu early.
      
      _task_util_est() returns the max value of util_est.ewma and
      util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
      So task_util_est() returning the max of task_util() and
      _task_util_est() will never return 0 under the default
      SCHED_FEAT(UTIL_EST, true).
      
      To fix this use the MSB of util_est.enqueued instead and keep the flag
      util_est internal, i.e. don't export it via _task_util_est().
      
      The maximal possible util_avg value for a task is 1024 so the MSB of
      'unsigned int util_est.enqueued' isn't used to store a util value.
      
      As a caveat the code behind the util_est_se trace point has to filter
      UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
      be easy to do.
      
      This also fixes an issue reported by Xuewen Yan that util_est_update()
      only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
      
        last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
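      A minimal sketch of the resulting flag placement and masking, matching the
      description above (details of kernel/sched/fair.c simplified):
      
        /* MSB of util_est.enqueued: the maximal task util is 1024, so this bit
         * can never be part of a real utilization value. */
        #define UTIL_AVG_UNCHANGED      0x80000000
        
        static inline unsigned long _task_util_est(struct task_struct *p)
        {
                struct util_est ue = READ_ONCE(p->se.avg.util_est);
        
                /* Keep the flag util_est-internal: strip it before exposing. */
                return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
        }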
      
      Fixes: b89997aa ("sched/pelt: Fix task util_est update filtering")
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
      Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
      68d7a190
  12. 13 May 2021, 1 commit
    • M
      tick/nohz: Kick only _queued_ task whose tick dependency is updated · a1dfb631
      Committed by Marcelo Tosatti
      When the tick dependency of a task is updated, we want it to acknowledge
      the new state and restart the tick if needed. If the task is not
      running, we don't need to kick it because it will observe the new
      dependency upon scheduling in. But if the task is running, we may need
      to send an IPI to it so that it gets notified.
      
      Unfortunately we don't have the means to check if a task is running
      in a race free way. Checking p->on_cpu in a synchronized way against
      p->tick_dep_mask would imply adding a full barrier between
      prepare_task_switch() and tick_nohz_task_switch(), which we want to
      avoid in this fast-path.
      
      Therefore we blindly fire an IPI to the task's CPU.
      
      Meanwhile we can check if the task is queued on the CPU rq because
      p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its
      full barrier that precedes tick_nohz_task_switch(). And if the task is
      queued on a nohz_full CPU, it also has a fair chance of being the running
      task, as the isolation constraints prescribe running a single task on full
      dynticks CPUs.
      
      So use this as a trick to check if we can spare an IPI toward a
      non-running task.
      
      NOTE: For the ordering to be correct, it is assumed that we never
      deactivate a task while it is running, the only exception being the task
      deactivating itself while scheduling out.
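      A minimal sketch of the resulting check (p->on_rq is read directly here; the
      upstream code goes through a scheduler helper):
      
        static void tick_nohz_kick_task_sketch(struct task_struct *tsk)
        {
                int cpu;
        
                /*
                 * Not queued means not running, and the task will observe the new
                 * tick dependency when it next schedules in, so no IPI is needed.
                 */
                if (!READ_ONCE(tsk->on_rq))
                        return;
        
                cpu = task_cpu(tsk);
        
                preempt_disable();
                if (cpu_online(cpu))
                        tick_nohz_full_kick_cpu(cpu);   /* blindly IPI: it may be running */
                preempt_enable();
        }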
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210512232924.150322-9-frederic@kernel.org
      a1dfb631
  13. 12 May 2021, 5 commits
  14. 06 May 2021, 1 commit
    • P
      mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN · 1a08ae36
      Committed by Pavel Tatashin
      PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
      return pages that might belong to CMA region.  This is currently used
      for long term gup to make sure that such pins are not going to be done
      on any CMA pages.
      
      When PF_MEMALLOC_NOCMA was introduced we didn't realize that it
      focuses too narrowly on CMA pages and that there is a larger class of
      pages that needs the same treatment.  The MOVABLE zone cannot contain any
      long term pins either, so it makes sense to reuse and redefine this flag
      for that use case as well.  Rename the flag to PF_MEMALLOC_PIN, which
      defines an allocation context that can only get pages suitable for
      long-term pins.
      
      Also rename memalloc_nocma_save()/memalloc_nocma_restore() to
      memalloc_pin_save()/memalloc_pin_restore() and make the new functions
      common.
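      A minimal sketch of the renamed helpers and a typical long-term-pin caller
      (shapes follow the usual PF_* save/set/restore pattern; simplified):
      
        static inline unsigned int memalloc_pin_save(void)
        {
                unsigned int flags = current->flags & PF_MEMALLOC_PIN;
        
                current->flags |= PF_MEMALLOC_PIN;      /* only pages safe for long-term pinning */
                return flags;
        }
        
        static inline void memalloc_pin_restore(unsigned int flags)
        {
                current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
        }
        
        /* Typical long-term GUP caller: */
        static void long_term_pin_example(void)
        {
                unsigned int flags = memalloc_pin_save();
        
                /* ... pin pages; the allocator now avoids CMA and MOVABLE pages ... */
        
                memalloc_pin_restore(flags);
        }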
      
      [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
        Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a08ae36
  15. 01 May 2021, 1 commit
  16. 15 Apr 2021, 1 commit
    • T
      signal: Allow tasks to cache one sigqueue struct · 4bad58eb
      Committed by Thomas Gleixner
      The idea for this originates from the real time tree to make signal
      delivery for realtime applications more efficient. In quite a few of these
      application scenarios a control task signals workers to start their
      computations. There is usually only one signal per worker in flight.  This
      works nicely as long as the kmem cache allocations do not hit the slow path
      and cause latencies.
      
      To cure this an optimistic caching was introduced (limited to RT tasks)
      which allows a task to cache a single sigqueue in a pointer in task_struct
      instead of handing it back to the kmem cache after consuming a signal. When
      the next signal is sent to the task then the cached sigqueue is used
      instead of allocating a new one. This solved the problem for this set of
      application scenarios nicely.
      
      The task cache is not preallocated, so the first signal sent to a task
      always goes to the kmem cache allocator. The cached sigqueue stays around
      until the task exits and is freed when task::sighand is dropped.
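      A minimal sketch of the single-entry cache just described (simplified; the real
      kernel/signal.c code carries additional locking and accounting, and this is the
      caching that entry 9 above later reverts):
      
        static struct sigqueue *sigqueue_alloc_sketch(struct task_struct *t, gfp_t gfp)
        {
                struct sigqueue *q = READ_ONCE(t->sigqueue_cache);
        
                if (q)
                        WRITE_ONCE(t->sigqueue_cache, NULL);    /* reuse the cached entry */
                else
                        q = kmem_cache_alloc(sigqueue_cachep, gfp);
                return q;
        }
        
        static void sigqueue_cache_or_free_sketch(struct sigqueue *q)
        {
                /* Keep at most one entry per task; hand the rest back to the slab. */
                if (!READ_ONCE(current->sigqueue_cache))
                        WRITE_ONCE(current->sigqueue_cache, q);
                else
                        kmem_cache_free(sigqueue_cachep, q);
        }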
      
      After posting this solution for mainline the discussion came up whether
      this would be useful in general and should not be limited to realtime
      tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org
      
      One concern leading to the original limitation was to avoid a large amount
      of pointlessly cached sigqueues in alive tasks. The other concern was
      vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.
      
      The accounting problem is real, but on the other hand slightly academic.
      After gathering some statistics it turned out that after boot of a regular
      distro install there are less than 10 sigqueues cached in ~1500 tasks.
      
      In case of a 'mass fork and fire signal to child' scenario the extra 80
      bytes of memory per task are well in the noise of the overall memory
      consumption of the fork bomb.
      
      If this should be limited then this would need an extra counter in struct
      user, more atomic instructions and a separate rlimit. Yet another tunable
      which is mostly unused.
      
      The caching is actually used. After boot and a full kernel compile on a
      64-CPU machine with make -j128 the number of 'allocations' looks like this:
      
        From slab:	   23996
        From task cache: 52223
      
      I.e. it reduces the number of slab cache operations by ~68%.
      
      A typical pattern there is:
      
      <...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
      <...>-58488 __sigqueue_free:   cache ffff8881132df460
      <...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
        bash-1149 exit_task_sighand: free ffff8881132df460
        bash-1149 __sigqueue_free:   cache ffff8881103dc550
      
      The interesting sequence is that the exiting task 58488 grabs the sigqueue
      from bash's task cache to signal exit and bash sticks it back into its own
      cache. Lather, rinse and repeat.
      
      The caching is probably not noticeable for the general use case, but the
      benefit for latency sensitive applications is clear. While kmem caches
      usually serve from the fast path, slab merging (the default) can, depending
      on the usage pattern of the merged slabs, cause occasional slow path
      allocations.
      
      The time spared per cached entry is a few microseconds per signal, which is
      not relevant for e.g. a kernel build, but for signal heavy workloads it's
      measurable.
      
      As there is no real downside of this caching mechanism making it
      unconditionally available is preferred over more conditional code or new
      magic tunables.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
      4bad58eb
  17. 22 Mar 2021, 1 commit
    • I
      sched: Fix various typos · 3b03706f
      Committed by Ingo Molnar
      Fix ~42 single-word typos in scheduler code comments.
      
      We have accumulated a few fun ones over the years. :-)
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: linux-kernel@vger.kernel.org
      3b03706f
  18. 17 Mar 2021, 1 commit
  19. 06 Mar 2021, 1 commit
  20. 27 Feb 2021, 1 commit
    • S
      bpf: Enable task local storage for tracing programs · a10787e6
      Committed by Song Liu
      To access per-task data, BPF programs usually create a hash table with the
      pid as the key. This is not ideal because:
       1. The user needs to estimate the proper size of the hash table, which may
          be inaccurate;
       2. Big hash tables are slow;
       3. To clean up the data properly during task termination, the user needs
          to write extra logic.
      
      Task local storage overcomes these issues and offers a better option for
      such per-task data. Task local storage is only available to BPF_LSM. Now
      enable it for tracing programs.
      
      Unlike LSM programs, tracing programs can be called in IRQ contexts.
      Helpers that access task local storage are updated to use
      raw_spin_lock_irqsave() instead of raw_spin_lock_bh().
      
      Tracing programs can attach to functions on the task free path, e.g.
      exit_creds(). To avoid allocating task local storage after
      bpf_task_storage_free(), bpf_task_storage_get() is updated to not allocate
      new storage when the task is not refcounted (task->usage == 0).
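      A minimal sketch of how a tracing program can now use a task local storage map
      (libbpf-style BPF C; the attach point, map name and value layout here are
      illustrative, not taken from the patch):
      
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        struct {
                __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                __uint(map_flags, BPF_F_NO_PREALLOC);
                __type(key, int);
                __type(value, __u64);
        } task_calls SEC(".maps");
        
        SEC("fentry/do_sys_openat2")
        int BPF_PROG(count_openat, int dfd, const char *filename)
        {
                __u64 *cnt;
        
                /* No hash table sizing and no manual cleanup on task exit. */
                cnt = bpf_task_storage_get(&task_calls, bpf_get_current_task_btf(), 0,
                                           BPF_LOCAL_STORAGE_GET_F_CREATE);
                if (cnt)
                        __sync_fetch_and_add(cnt, 1);
                return 0;
        }
        
        char LICENSE[] SEC("license") = "GPL";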
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: KP Singh <kpsingh@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210225234319.336131-2-songliubraving@fb.com
      a10787e6
  21. 22 Feb 2021, 1 commit
    • J
      io-wq: fork worker threads from original task · 3bfe6106
      Committed by Jens Axboe
      Instead of using regular kthread kernel threads, create kernel threads
      that are like a real thread that the task would create. This ensures that
      we get all the context that we need, without having to carry that state
      around. This greatly reduces the code complexity, and the risk of missing
      state for a given request type.
      
      With the move away from kthread, we can also dump everything related to
      assigned state to the new threads.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3bfe6106
  22. 17 Feb 2021, 2 commits
  23. 04 Feb 2021, 2 commits
  24. 28 Jan 2021, 1 commit
  25. 14 Jan 2021, 1 commit
  26. 23 Dec 2020, 1 commit
  27. 02 Dec 2020, 1 commit
    • G
      kernel: Implement selective syscall userspace redirection · 1446e1df
      Committed by Gabriel Krisman Bertazi
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but who need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
      
      The range [<offset>,<offset>+<length>) is a part of the process memory
      map that is allowed to by-pass the redirection code and dispatch
      syscalls directly, such that in fast paths a process doesn't need to
      disable the trap nor the kernel has to check the selector.  This is
      essential to return from SIGSYS to a blocked area without triggering
      another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that acts as a key switch for the mechanism. This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
      redirection without calling the kernel.
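      A minimal usage sketch from userspace: no dispatcher region is registered here
      (offset/length are zero), so the only thing demonstrated is toggling the
      selector without a syscall; the prctl constants require kernel headers from
      5.11 onwards, hence the fallback defines.
      
        #include <stdio.h>
        #include <sys/prctl.h>
        
        #ifndef PR_SET_SYSCALL_USER_DISPATCH
        # define PR_SET_SYSCALL_USER_DISPATCH   59
        # define PR_SYS_DISPATCH_OFF            0
        # define PR_SYS_DISPATCH_ON             1
        # define SYSCALL_DISPATCH_FILTER_ALLOW  0
        # define SYSCALL_DISPATCH_FILTER_BLOCK  1
        #endif
        
        static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;
        
        int main(void)
        {
                if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                          0UL, 0UL, &selector)) {
                        perror("PR_SET_SYSCALL_USER_DISPATCH");
                        return 1;
                }
        
                selector = SYSCALL_DISPATCH_FILTER_BLOCK;   /* plain store, no syscall */
                /* ... syscalls issued here would be redirected to SIGSYS ... */
                selector = SYSCALL_DISPATCH_FILTER_ALLOW;   /* back to native syscalls */
        
                return 0;
        }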
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  I avoided using seccomp,
      even though it duplicates some functionality, due to previous feedback
      that maybe it shouldn't mix with seccomp since it is not a security
      mechanism.  And obviously, this should never be considered a security
      mechanism, since any part of the program can by-pass it by using the
      syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead using only the direct dispatcher region to issue syscalls is
      pretty much irrelevant.  The overhead of using the selector goes around
      40ns for a native (unredirected) syscall in my system, and it is (as
      expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
      1446e1df