1. 17 Nov, 2020 · 2 commits
    • sched/deadline: Fix priority inheritance with multiple scheduling classes · 2279f540
      Committed by Juri Lelli
      Glenn reported that "an application [he developed produces] a BUG in
      deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
      PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
      task that was boosted by a SCHED_DEADLINE task boosts another CFS task
      (nested priority inheritance)."
      
       ------------[ cut here ]------------
       kernel BUG at kernel/sched/deadline.c:1462!
       invalid opcode: 0000 [#1] PREEMPT SMP
       CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
       Hardware name: ...
       RIP: 0010:enqueue_task_dl+0x335/0x910
       Code: ...
       RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
       RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
       RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
       RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
       R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
       R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
       FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        ? intel_pstate_update_util_hwp+0x13/0x170
        rt_mutex_setprio+0x1cc/0x4b0
        task_blocks_on_rt_mutex+0x225/0x260
        rt_spin_lock_slowlock_locked+0xab/0x2d0
        rt_spin_lock_slowlock+0x50/0x80
        hrtimer_grab_expiry_lock+0x20/0x30
        hrtimer_cancel+0x13/0x30
        do_nanosleep+0xa0/0x150
        hrtimer_nanosleep+0xe1/0x230
        ? __hrtimer_init_sleeper+0x60/0x60
        __x64_sys_nanosleep+0x8d/0xa0
        do_syscall_64+0x4a/0x100
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fa58b52330d
       ...
       ---[ end trace 0000000000000002 ]---
      
      He also provided a simple reproducer creating the situation below:
      
       So the execution order of the locking steps is the following
       (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
       are mutexes that are enabled with priority inheritance.)
      
       Time moves forward as this timeline goes down:
      
       N1              N2               D1
       |               |                |
       |               |                |
       Lock(M1)        |                |
       |               |                |
       |             Lock(M2)           |
       |               |                |
       |               |              Lock(M2)
       |               |                |
       |             Lock(M1)           |
       |             (!!bug triggered!) |
      
      Daniel reported a similar situation as well, by just letting ksoftirqd
      run with DEADLINE (and eventually block on a mutex).
      
      Problem is that boosted entities (Priority Inheritance) use static
      DEADLINE parameters of the top priority waiter. However, there might be
      cases where top waiter could be a non-DEADLINE entity that is currently
      boosted by a DEADLINE entity from a different lock chain (i.e., nested
      priority chains involving entities of non-DEADLINE classes). In this
      case, top waiter static DEADLINE parameters could be null (initialized
      to 0 at fork()) and replenish_dl_entity() would hit a BUG().
      
      Fix this by keeping track of the original donor and using its parameters
      when a task is boosted.
      Reported-by: Glenn Elliott <glenn@aurora.tech>
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
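The nested boosting scenario from the sched/deadline fix above can be modelled in user space. This is only a sketch under stated assumptions, not the kernel implementation: the `Task` class, its `donor` field, and the functions `replenish()`, `boost_naive()` and `boost_with_donor()` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    dl_runtime: int = 0              # static DEADLINE parameter; 0 for non-DEADLINE tasks
    donor: Optional["Task"] = None   # hypothetical: original DEADLINE donor, if boosted

    def is_deadline(self) -> bool:
        return self.dl_runtime > 0


def replenish(runtime: int) -> int:
    """Stand-in for the sanity check: null DEADLINE parameters are a bug."""
    if runtime <= 0:
        raise RuntimeError("BUG: replenish with null DEADLINE parameters")
    return runtime


def boost_naive(owner: Task, top_waiter: Task) -> int:
    # Behaviour described as broken: use the top waiter's static parameters.
    return replenish(top_waiter.dl_runtime)


def boost_with_donor(owner: Task, top_waiter: Task) -> int:
    # Behaviour described as the fix: follow the top waiter back to the
    # original DEADLINE donor and use that task's parameters.
    donor = top_waiter if top_waiter.is_deadline() else top_waiter.donor
    if donor is None:
        raise ValueError("top waiter is not boosted by a DEADLINE task")
    owner.donor = donor
    return replenish(donor.dl_runtime)


d1 = Task("D1", dl_runtime=10_000)
n1, n2 = Task("N1"), Task("N2")

boost_with_donor(n2, d1)   # D1 blocks on M2, which N2 owns
boost_with_donor(n1, n2)   # N2 blocks on M1, which N1 owns; the donor is still D1
```

In this model, calling `boost_naive(n1, n2)` raises, because N2 is a non-DEADLINE task whose static parameters are zero, which mirrors the situation in the reproducer timeline.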
    • sched: Fix data-race in wakeup · f97bb527
      Committed by Peter Zijlstra
      Mel reported that on some ARM64 platforms loadavg goes bananas and
      Will tracked it down to the following race:
      
        CPU0					CPU1
      
        schedule()
          prev->sched_contributes_to_load = X;
          deactivate_task(prev);
      
      					try_to_wake_up()
      					  if (p->on_rq && ...) // false
      					  if (smp_load_acquire(&p->on_cpu) && // true
      					      ttwu_queue_wakelist())
      					        p->sched_remote_wakeup = Y;
      
          smp_store_release(&prev->on_cpu, 0);
      
      where both p->sched_contributes_to_load and p->sched_remote_wakeup are
      in the same word, and thus the stores X and Y race (and can clobber
      one another's data).
      
      Prior to commit c6e7bd7a ("sched/core: Optimize ttwu()
      spinning on p->on_cpu"), the p->on_cpu handoff serialized access to
      p->sched_remote_wakeup (just as it still does with
      p->sched_contributes_to_load); that commit broke this by calling
      ttwu_queue_wakelist() with p->on_cpu != 0.
      
      However, due to
      
        p->XXX = X			ttwu()
        schedule()			  if (p->on_rq && ...) // false
          smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
          deactivate_task()		      ttwu_queue_wakelist())
            p->on_rq = 0;		        p->sched_remote_wakeup = Y;
      
      We can be sure any 'current' store is complete and 'current' is
      guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
      the current flags word.
      
      Note: while the observed failure was loadavg accounting gone wrong due
      to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
      is also possible where schedule() clobbers p->sched_remote_wakeup;
      this could result in enqueue_entity() wrecking ->vruntime and causing
      scheduling artifacts.
      
      Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
      Reported-by: Mel Gorman <mgorman@techsingularity.net>
      Debugged-by: Will Deacon <will@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
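The clobbering described in the wakeup data-race commit comes from two flags sharing one machine word, so each store is really a load, modify, store sequence on the whole word. The following deterministic sketch replays one bad interleaving. The bit positions and function names are assumptions made for illustration and do not reflect the real layout of the task structure.

```python
from typing import Tuple

CONTRIBUTES_TO_LOAD = 1 << 0   # hypothetical bit positions within one shared word
REMOTE_WAKEUP = 1 << 1


def set_bit(word: int, mask: int, value: bool) -> int:
    """Return `word` with the `mask` bit set or cleared (the 'modify' step)."""
    return (word | mask) if value else (word & ~mask)


def racy_interleaving(word: int) -> int:
    # Both CPUs load the same old value before either one stores.
    cpu0_copy = word                                            # CPU0: load word
    cpu1_copy = word                                            # CPU1: load word
    cpu0_copy = set_bit(cpu0_copy, CONTRIBUTES_TO_LOAD, True)   # CPU0: store X
    cpu1_copy = set_bit(cpu1_copy, REMOTE_WAKEUP, True)         # CPU1: store Y
    word = cpu0_copy                                            # CPU0 writes back
    word = cpu1_copy                                            # CPU1 writes back, losing X
    return word


def separate_words() -> Tuple[int, int]:
    # With the flags in different words, neither store can disturb the other.
    load_word = set_bit(0, CONTRIBUTES_TO_LOAD, True)
    wakeup_word = set_bit(0, REMOTE_WAKEUP, True)
    return load_word, wakeup_word


result = racy_interleaving(0)
```

After the racy interleaving only the `REMOTE_WAKEUP` bit survives in `result`; the update to `CONTRIBUTES_TO_LOAD` is lost, which corresponds to the loadavg symptom in the report.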
  2. 07 Oct, 2020 · 1 commit
    • x86/mce: Recover from poison found while copying from user space · c0ab7ffc
      Committed by Tony Luck
      Existing kernel code can only recover from a machine check on code that
      is tagged in the exception table with a fault handling recovery path.
      
      Add two new fields in the task structure to pass information from
      machine check handler to the "task_work" that is queued to run before
      the task returns to user mode:
      
      + mce_vaddr: will be initialized to the user virtual address of the fault
        in the case where the fault occurred in the kernel copying data from
        a user address.  This is so that kill_me_maybe() can provide that
        information to the user SIGBUS handler.
      
      + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
        to determine if mce_vaddr is applicable to this error.
      
      Add code to recover from a machine check while copying data from user
      space to the kernel. Action for this case is the same as if the user
      touched the poison directly; unmap the page and send a SIGBUS to the task.
      
      Use a new helper function to share common code between the "fault
      in user mode" case and the "fault while copying from user" case.
      
      New code paths will be activated by the next patch which sets
      MCE_IN_KERNEL_COPYIN.
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
  3. 03 Oct, 2020 · 1 commit
  4. 26 Aug, 2020 · 2 commits
  5. 06 Aug, 2020 · 2 commits
    • posix-cpu-timers: Provide mechanisms to defer timer handling to task_work · 1fb497dd
      Committed by Thomas Gleixner
      Running posix CPU timers in hard interrupt context has a few downsides:
      
       - For PREEMPT_RT it cannot work as the expiry code needs to take
         sighand lock, which is a 'sleeping spinlock' in RT. The original RT
         approach of offloading the posix CPU timer handling into a high
         priority thread was clumsy and provided no real benefit in general.
      
       - For fine grained accounting it's just wrong to run this in context of
         the timer interrupt because that way a process specific CPU time is
         accounted to the timer interrupt.
      
       - Long running timer interrupts caused by a large number of expiring
         timers which can be created and armed by unprivileged user space.
      
      There is no hard requirement to expire them in interrupt context.
      
      If the signal is targeted at the task itself then it won't be delivered
      before the task returns to user space anyway. If the signal is targeted at
      a supervisor process then it might be slightly delayed, but posix CPU
      timers are inaccurate anyway due to the fact that they are tied to the
      tick.
      
      Provide infrastructure to schedule task work which allows splitting the
      posix CPU timer code into a quick check in interrupt context and a thread
      context expiry and signal delivery function. This has to be enabled by
      architectures as it requires that the architecture specific KVM
      implementation handles pending task work before exiting to guest mode.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
    • locking/seqlock, headers: Untangle the spaghetti monster · 0cd39f46
      Committed by Peter Zijlstra
      By using lockdep_assert_*() from seqlock.h, the spaghetti monster
      attacked.
      
      Attack back by reducing seqlock.h dependencies from two key high level headers:
      
       - <linux/seqlock.h>:               -Remove <linux/ww_mutex.h>
       - <linux/time.h>:                  -Remove <linux/seqlock.h>
       - <linux/sched.h>:                 +Add    <linux/seqlock.h>
      
      The price was to add it to sched.h ...
      
      Core header fallout, we add direct header dependencies instead of gaining them
      parasitically from higher level headers:
      
       - <linux/dynamic_queue_limits.h>:  +Add <asm/bug.h>
       - <linux/hrtimer.h>:               +Add <linux/seqlock.h>
       - <linux/ktime.h>:                 +Add <asm/bug.h>
       - <linux/lockdep.h>:               +Add <linux/smp.h>
       - <linux/sched.h>:                 +Add <linux/seqlock.h>
       - <linux/videodev2.h>:             +Add <linux/kernel.h>
      
      Arch headers fallout:
      
       - PARISC: <asm/timex.h>:           +Add <asm/special_insns.h>
       - SH:     <asm/io.h>:              +Add <asm/page.h>
       - SPARC:  <asm/timer_64.h>:        +Add <uapi/asm/asi.h>
       - SPARC:  <asm/vvar.h>:            +Add <asm/processor.h>, <asm/barrier.h>
                                          -Remove <linux/seqlock.h>
       - X86:    <asm/fixmap.h>:          +Add <asm/pgtable_types.h>
                                          -Remove <asm/acpi.h>
      
      There's also a bunch of parasitic header dependency fallout in .c files, not listed
      separately.
      
      [ mingo: Extended the changelog, split up & fixed the original patch. ]
      Co-developed-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200804133438.GK2674@hirez.programming.kicks-ass.net
  6. 31 Jul, 2020 · 2 commits
  7. 29 Jul, 2020 · 2 commits
    • sched: tasks: Use sequence counter with associated spinlock · b7505861
      Committed by Ahmed S. Darwish
      A sequence counter write side critical section must be protected by some
      form of locking to serialize writers. A plain seqcount_t does not
      contain the information of which lock must be held when entering a write
      side critical section.
      
      Use the new seqcount_spinlock_t data type, which allows associating a
      spinlock with the sequence counter. This enables lockdep to verify that
      the spinlock used for writer serialization is held when the write side
      critical section is entered.
      
      If lockdep is disabled this lock association is compiled out and has
      neither storage size nor runtime overhead.
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de
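The idea of a sequence counter that knows its writer lock can be sketched in user space. This is a toy model, not the kernel's seqcount_spinlock_t: the `SeqcountSpinlock` class and its method names are hypothetical, and `threading.Lock.locked()` only reports that someone holds the lock, not that the calling thread does, so the check is weaker than what lockdep verifies.

```python
import threading


class SeqcountSpinlock:
    """Toy sequence counter that remembers which lock serializes its writers."""

    def __init__(self, lock: threading.Lock) -> None:
        self.sequence = 0
        self.lock = lock            # the associated writer lock

    def write_begin(self) -> None:
        # Stand-in for the lockdep assertion: the associated lock must be held.
        if not self.lock.locked():
            raise RuntimeError("write side entered without the associated lock")
        self.sequence += 1          # odd value: a write is in progress

    def write_end(self) -> None:
        self.sequence += 1          # even value: data is stable again

    def read_begin(self) -> int:
        return self.sequence

    def read_retry(self, start: int) -> bool:
        # Retry if a write was in progress at the start or completed meanwhile.
        return (start & 1) == 1 or self.sequence != start


lock = threading.Lock()
seq = SeqcountSpinlock(lock)
data = {"a": 0, "b": 0}

with lock:                          # writer: take the lock, then bump the counter
    seq.write_begin()
    data["a"], data["b"] = 1, 1
    seq.write_end()

while True:                         # reader: lockless, retries on concurrent writes
    start = seq.read_begin()
    snapshot = dict(data)
    if not seq.read_retry(start):
        break
```

A plain counter without the `lock` reference could not perform the check in `write_begin()`, which is the gap the commit message describes for seqcount_t.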
    • sched/uclamp: Add a new sysctl to control RT default boost value · 13685c4a
      Committed by Qais Yousef
      RT tasks by default run at the highest capacity/performance level. When
      uclamp is selected this default behavior is retained by enforcing the
      requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
      uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
      value.
      
      This is also referred to as 'the default boost value of RT tasks'.
      
      See commit 1a00d999 ("sched/uclamp: Set default clamps for RT tasks").
      
      On battery powered devices, it is desired to control this default
      (currently hardcoded) behavior at runtime to reduce energy consumed by
      RT tasks.
      
      For example, for mobile device manufacturers, where the big.LITTLE
      architecture is dominant, the performance of the little cores varies
      across SoCs, and on high end ones the big cores could be too power hungry.
      
      Given the diversity of SoCs, the new knob allows manufacturers to tune
      the best performance/power for RT tasks for the particular hardware they
      run on.
      
      They could opt to further tune the value when the user selects
      a different power saving mode or when the device is actively charging.
      
      The runtime aspect of it further helps in creating a single kernel image
      that can be run on multiple devices that require different tuning.
      
      Keep in mind that a lot of RT tasks in the system are created by the
      kernel. On Android for instance I can see over 50 RT tasks, only
      a handful of which are created by the Android framework.
      
      To allow system admins and device integrators to control the default
      behavior globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
      to change the default boost value of the RT tasks.
      
      I anticipate this to be mostly in the form of modifying the init script
      of a particular device.
      
      To avoid polluting the fast path with unnecessary code, the approach
      taken is to synchronously do the update by traversing all the existing
      tasks in the system. This could race with a concurrent fork(), which is
      dealt with by introducing sched_post_fork() function which will ensure
      the racy fork will get the right update applied.
      
      Tested on Juno-r2 in combination with the RT capacity awareness [1].
      By default an RT task will go to the highest capacity CPU and run at the
      maximum frequency, which is particularly energy inefficient on high end
      mobile devices because the biggest core[s] are 'huge' and power hungry.
      
      With this patch the RT task can be controlled to run anywhere by
      default, and doesn't cause the frequency to be maximum all the time.
      Yet any task that really needs to be boosted can easily escape this
      default behavior by modifying its requested uclamp.min value
      (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
      
      [1] 804d402f: ("sched/rt: Make RT capacity-aware")
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
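The update scheme described in the uclamp commit (walk all existing tasks when the knob changes, and leave tasks with an explicit request alone) can be modelled as follows. This is a simplified sketch: the `RtTask` class, the `user_defined` flag and `update_rt_default()` are hypothetical names, and the fork() race handled by sched_post_fork() is not modelled.

```python
SCHED_CAPACITY_SCALE = 1024

# Models the new knob; it starts at the previously hardcoded default.
sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE


class RtTask:
    def __init__(self, name, requested_min=None):
        self.name = name
        # True when the task set its own uclamp.min, e.g. via sched_setattr().
        self.user_defined = requested_min is not None
        if self.user_defined:
            self.uclamp_min = requested_min
        else:
            self.uclamp_min = sysctl_sched_uclamp_util_min_rt_default


def update_rt_default(tasks, new_default):
    """Apply a new default boost by walking every existing task."""
    global sysctl_sched_uclamp_util_min_rt_default
    if not 0 <= new_default <= SCHED_CAPACITY_SCALE:
        raise ValueError("value must be within [0, SCHED_CAPACITY_SCALE]")
    sysctl_sched_uclamp_util_min_rt_default = new_default
    for task in tasks:
        if not task.user_defined:       # explicit requests escape the default
            task.uclamp_min = new_default


tasks = [RtTask("kthread"), RtTask("audio", requested_min=800)]
update_rt_default(tasks, 128)
late_task = RtTask("created-after-update")
```

Doing the walk at write time keeps the per-task fast path free of extra checks, which is the trade-off the commit message gives for this approach.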
  8. 28 Jul, 2020 · 1 commit
  9. 22 Jul, 2020 · 1 commit
  10. 10 Jul, 2020 · 1 commit
  11. 08 Jul, 2020 · 3 commits
  12. 04 Jul, 2020 · 1 commit
  13. 28 Jun, 2020 · 2 commits
  14. 15 Jun, 2020 · 2 commits
    • sched: Remove sched_set_*() return value · 8b700983
      Committed by Peter Zijlstra
      Ingo suggested that since the new sched_set_*() functions are
      implemented using the 'nocheck' variants, they really shouldn't ever
      fail, so remove the return value.
      
      Cc: axboe@kernel.dk
      Cc: daniel.lezcano@linaro.org
      Cc: sudeep.holla@arm.com
      Cc: airlied@redhat.com
      Cc: broonie@kernel.org
      Cc: paulmck@kernel.org
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
    • sched: Provide sched_set_fifo() · 7318d4cc
      Committed by Peter Zijlstra
      SCHED_FIFO (or any static priority scheduler) is a broken scheduler
      model; it is fundamentally incapable of resource management, the one
      thing an OS is actually supposed to do.
      
      It is impossible to compose static priority workloads. One cannot take
      two well designed and functional static priority workloads and mash
      them together and still expect them to work.
      
      Therefore it doesn't make sense to expose the priority field; the
      kernel is fundamentally incapable of setting a sensible value, it
      needs systems knowledge that it doesn't have.
      
      Take away sched_setscheduler() / sched_setattr() from modules and
      replace them with:
      
        - sched_set_fifo(p); create a FIFO task (at prio 50)
        - sched_set_fifo_low(p); create a task higher than NORMAL,
      	which ends up being a FIFO task at prio 1.
        - sched_set_normal(p, nice); (re)set the task to normal
      
      This stops the proliferation of randomly chosen, and irrelevant, FIFO
      priorities that don't really mean anything anyway.
      
      The system administrator/integrator, whoever has insight into the
      actual system design and requirements (userspace) can set-up
      appropriate priorities if and when needed.
      
      Cc: airlied@redhat.com
      Cc: alexander.deucher@amd.com
      Cc: awalls@md.metrocast.net
      Cc: axboe@kernel.dk
      Cc: broonie@kernel.org
      Cc: daniel.lezcano@linaro.org
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: herbert@gondor.apana.org.au
      Cc: hverkuil@xs4all.nl
      Cc: john.stultz@linaro.org
      Cc: nico@fluxnic.net
      Cc: paulmck@kernel.org
      Cc: rafael.j.wysocki@intel.com
      Cc: rmk+kernel@arm.linux.org.uk
      Cc: sudeep.holla@arm.com
      Cc: tglx@linutronix.de
      Cc: ulf.hansson@linaro.org
      Cc: wim@linux-watchdog.org
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
  15. 11 Jun, 2020 · 1 commit
    • x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned · 17fae129
      Committed by Tony Luck
      An interesting thing happened when a guest Linux instance took a machine
      check. The VMM unmapped the bad page from guest physical space and
      passed the machine check to the guest.
      
      Linux took all the normal actions to offline the page from the process
      that was using it. But then guest Linux crashed because it said there
      was a second machine check inside the kernel with this stack trace:
      
      do_memory_failure
          set_mce_nospec
               set_memory_uc
                    _set_memory_uc
                         change_page_attr_set_clr
                              cpa_flush
                                   clflush_cache_range_opt
      
      This was odd, because a CLFLUSH instruction shouldn't raise a machine
      check (it isn't consuming the data). Further investigation showed that
      the VMM had passed in another machine check because it appeared that the
      guest was accessing the bad page.
      
      Fix is to check the scope of the poison by checking the MCi_MISC register.
      If the entire page is affected, then unmap the page. If only part of the
      page is affected, then mark the page as uncacheable.
      
      This assumes that VMMs will do the logical thing and pass in the "whole
      page scope" via the MCi_MISC register (since they unmapped the entire
      page).
      
        [ bp: Adjust to x86/entry changes. ]
      
      Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
      Reported-by: Jue Wang <juew@google.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jue Wang <juew@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
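The scope check from the x86/{mce,mm} commit can be sketched as a small decision function. It assumes that the low six bits of MCi_MISC hold the least significant valid bit of the reported address and that pages are 4 KiB. The function names are hypothetical, and treating a missing MISC value as "whole page" is an assumption of this sketch rather than something stated in the commit message.

```python
PAGE_SHIFT = 12    # 4 KiB pages


def mci_misc_addr_lsb(misc: int) -> int:
    """Bits 5:0 of MCi_MISC: least significant valid bit of the address."""
    return misc & 0x3F


def whole_page_affected(misc: int, misc_valid: bool = True) -> bool:
    # Assumption: without valid MISC information, be conservative and
    # treat the whole page as poisoned.
    if not misc_valid:
        return True
    return mci_misc_addr_lsb(misc) >= PAGE_SHIFT


def action(misc: int, misc_valid: bool = True) -> str:
    if whole_page_affected(misc, misc_valid):
        return "unmap page"
    return "mark page uncacheable"


cache_line_scope = action(0x06)    # LSB = 6: a 64-byte region is affected
page_scope = action(0x0C)          # LSB = 12: the whole 4 KiB page is affected
```

With an LSB of 6 only part of the page is poisoned, so the page is marked uncacheable; with an LSB of 12 or more the page is unmapped, matching the two cases in the commit message.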
  16. 05 Jun, 2020 · 1 commit
  17. 03 Jun, 2020 · 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      Committed by NeilBrown
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid the deadlock that happens when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently of which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal
      to) the free-run threshold for that bdi.  This ensures it will always be
      able to have some pages in flight, and so will not deadlock.
      
      In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by global threshold, but that is acceptable as it is
      only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
      it causes attention to be focussed only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
         global and the local free-run thresholds are exceeded.
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
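The PF_LOCAL_THROTTLE rule from the mm/writeback commit ("don't delay at all unless the global and the local free-run thresholds are exceeded") can be written as a boolean decision. This is a deliberately reduced sketch: `must_throttle()` and its parameters are hypothetical, the thresholds are passed in rather than computed, and it is assumed that no task is throttled while the global dirty count is at or below the global free-run threshold.

```python
def must_throttle(global_dirty: int, global_freerun: int,
                  bdi_dirty: int, bdi_freerun: int,
                  local_throttle: bool) -> bool:
    """Return True if the writer should be delayed (simplified model)."""
    if global_dirty <= global_freerun:
        return False    # assumption: nobody is delayed below the global threshold
    if local_throttle and bdi_dirty <= bdi_freerun:
        return False    # PF_LOCAL_THROTTLE: the target bdi is not busy
    return True


# Client bdi has flooded the system with dirty pages, final bdi is nearly idle.
nfsd_like = must_throttle(global_dirty=300, global_freerun=200,
                          bdi_dirty=5, bdi_freerun=10, local_throttle=True)
ordinary = must_throttle(global_dirty=300, global_freerun=200,
                         bdi_dirty=5, bdi_freerun=10, local_throttle=False)
```

In this model the daemon (`nfsd_like`) keeps writing to its idle target bdi while an ordinary task in the same situation is delayed, so the daemon can always keep some pages in flight.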
  18. 28 May, 2020 · 1 commit
    • sched: Replace rq::wake_list · a1488664
      Committed by Peter Zijlstra
      The recent commit: 90b5363a ("sched: Clean up scheduler_ipi()")
      got smp_call_function_single_async() subtly wrong. Even though it will
      return -EBUSY when trying to re-use a csd, that condition is not
      atomic and still requires external serialization.
      
      The change in ttwu_queue_remote() got this wrong.
      
      While on first reading ttwu_queue_remote() has an atomic test-and-set
      that appears to serialize the use, the matching 'release' is not in
      the right place to actually guarantee this serialization.
      
      The actual race is vs the sched_ttwu_pending() call in the idle loop;
      that can run the wakeup-list without consuming the CSD.
      
      Instead of trying to chain the lists, merge them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
  19. 19 May, 2020 · 2 commits
  20. 12 May, 2020 · 1 commit
  21. 28 Apr, 2020 · 3 commits
    • rcu-tasks: Split ->trc_reader_need_end · 276c4104
      Committed by Paul E. McKenney
      This commit splits ->trc_reader_need_end by using the rcu_special union.
      This change permits readers to check to see if a memory barrier is
      required without any added overhead in the common case where no such
      barrier is required.  This commit also adds the read-side checking.
      Later commits will add the machinery to properly set the new
      ->trc_reader_special.b.need_mb field.
      
      This commit also makes rcu_read_unlock_trace_special() tolerate nested
      read-side critical sections within interrupt and NMI handlers.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks · d5f177d3
      Committed by Paul E. McKenney
      Because RCU does not watch exception early-entry/late-exit, idle-loop,
      or CPU-hotplug execution, protection of tracing and BPF operations is
      needlessly complicated.  This commit therefore adds a variant of
      Tasks RCU that:
      
      o	Has explicit read-side markers to allow finite grace periods in
      	the face of in-kernel loops for PREEMPT=n builds.  These markers
      	are rcu_read_lock_trace() and rcu_read_unlock_trace().
      
      o	Protects code in the idle loop, exception entry/exit, and
      	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
      	similar to SRCU, but with lighter-weight readers.
      
      o	Avoids expensive read-side instructions, having overhead similar
      	to that of Preemptible RCU.
      
      There are of course downsides:
      
      o	The grace-period code can send IPIs to CPUs, even when those
      	CPUs are in the idle loop or in nohz_full userspace.  This is
      	mitigated by later commits.
      
      o	It is necessary to scan the full tasklist, much as for Tasks RCU.
      
      o	There is a single callback queue guarded by a single lock,
      	again, much as for Tasks RCU.  However, those early use cases
      	that request multiple grace periods in quick succession are
      	expected to do so from a single task, which makes the single
      	lock almost irrelevant.  If needed, multiple callback queues
      	can be provided using any number of schemes.
      
      Perhaps most important, this variant of RCU does not affect the vanilla
      flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
      readers can operate from idle, offline, and exception entry/exit in no
      way enables rcu_preempt and rcu_sched readers to do so.
      
      The memory ordering was outlined here:
      https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
      
      This effort benefited greatly from off-list discussions of BPF
      requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
      some of the on-list discussions are captured in the Link: tags below.
      In addition, KCSAN was quite helpful in finding some early bugs.
      
      Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
      Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
      [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
      [ paulmck: Fix locking issue reported by rcutorture. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field · f0bdf6d4
      Committed by Lai Jiangshan
      The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
      rcu_read_unlock_special() but never set to false.  This is not
      particularly useful, so this commit removes this field.
      
      The only possible justification for this field is to ease debugging
      of RCU deferred quiescent states, but the combination of the other
      ->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
      ->rcu_read_lock_nesting should cover debugging needs.  And if this last
      proves incorrect, this patch can always be reverted, along with the
      required setting of ->rcu_read_unlock_special.b.deferred_qs to false
      in rcu_preempt_deferred_qs_irqrestore().
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  22. 02 Apr, 2020 · 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Committed by Eric W. Biederman
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited testing
      of a 32bit exec_id, it can take as little as 19s to exec 65536 times,
      which means that it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded in wrapping the self_exec_id in 7
      days.  Even my slower timing is within the uptime of a typical server,
      which means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  This means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing after this change this still
      remains exploitable.
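
The window described above can be illustrated with a single-threaded model. This is only a sketch: the struct and function names are hypothetical, and the store order (low word first) is an assumption, since a real 32bit compiler may emit the two stores in either order.

```c
#include <stdint.h>

/* Model of a 64-bit counter on a 32-bit machine: two separate words. */
struct split_u64 {
    uint32_t lo;
    uint32_t hi;
};

/* Each store below stands for one 32-bit store instruction. */
void split_store_lo(struct split_u64 *v, uint64_t val)
{
    v->lo = (uint32_t)val;
}

void split_store_hi(struct split_u64 *v, uint64_t val)
{
    v->hi = (uint32_t)(val >> 32);
}

/* Two 32-bit loads combined into one 64-bit value. */
uint64_t split_load(const struct split_u64 *v)
{
    return ((uint64_t)v->hi << 32) | v->lo;
}

/* Return what a reader sees if it runs between the writer's two stores
 * while the counter is incremented from old_val to old_val + 1. */
uint64_t torn_read_during_increment(uint64_t old_val)
{
    struct split_u64 v;
    uint64_t new_val = old_val + 1;
    uint64_t seen;

    split_store_lo(&v, old_val);
    split_store_hi(&v, old_val);

    split_store_lo(&v, new_val);   /* writer: first half (assumed order) */
    seen = split_load(&v);         /* reader runs in the window */
    split_store_hi(&v, new_val);   /* writer: second half */
    return seen;
}
```

When the increment carries into the high word (old_val = 0xFFFFFFFF), the reader in this model sees 0, which is neither the old nor the new value. When no carry occurs, the value seen already equals the new value, so the tearing is only visible at a carry.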
      
      I have changed the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
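
A userspace sketch of that annotation pattern follows. The MODEL_ macros are simplified stand-ins built on volatile casts, not the kernel's definitions, and the struct and function names are hypothetical; `__typeof__` is a GCC/Clang extension.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel macros: force exactly one access
 * through a volatile-qualified lvalue. They mark a lockless access but
 * do not make a 64-bit access atomic on a 32-bit machine. */
#define MODEL_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, heavily reduced task structure. */
struct task_model {
    uint64_t self_exec_id;      /* bumped on every exec */
    uint64_t parent_exec_id;    /* parent's self_exec_id when we were created */
    struct task_model *parent;
};

/* Writer side: what an exec does to the counter in this model. */
void model_exec(struct task_model *t)
{
    MODEL_WRITE_ONCE(t->self_exec_id, t->self_exec_id + 1);
}

/* Reader side: has the parent exec'd since this child recorded its id? */
bool model_parent_unchanged(const struct task_model *child)
{
    return child->parent_exec_id ==
           MODEL_READ_ONCE(child->parent->self_exec_id);
}
```

The macros document that the reader and writer are not serialized by a lock; they do not close the 32bit tearing window described earlier.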
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      d1e7fd64
  23. 21 Mar, 2020 2 commits
    • S
      lockdep: Add hrtimer context tracing bits · 40db1739
      Sebastian Andrzej Siewior committed
      Set current->irq_config = 1 for hrtimers which are not marked to expire in
      hard interrupt context during hrtimer_init(). These timers will expire in
      softirq context on PREEMPT_RT.
      
      Setting this allows lockdep to differentiate these timers. If a timer is
      marked to expire in hard interrupt context, then its expiry callback must
      not acquire a regular spinlock; it has to use a raw_spinlock instead.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
      40db1739
    • P
      lockdep: Introduce wait-type checks · de8f5e4f
      Peter Zijlstra committed
      Extend lockdep to validate lock wait-type context.
      
      The current wait-types are:
      
      	LD_WAIT_FREE,		/* wait free, rcu etc.. */
      	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
      	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
      	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */
      
      Where lockdep validates that the current lock (the one being acquired)
      fits in the current wait-context (as generated by the held stack).
      
      This ensures that there is no attempt to acquire mutexes while holding
      spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
      other words, it's a fancier might_sleep().
      
      Obviously RCU made the entire ordeal more complex than a simple single
      value test because RCU can be acquired in (pretty much) any context, and
      the context it presents to nested locks is not the same as the one it
      was acquired in.
      
      Therefore it's necessary to split the wait_type into two values, one
      representing the acquire (outer) and one representing the nested context
      (inner). For most 'normal' locks these two are the same.
      
      [ To make static initialization easier we have the rule that:
        .outer == INV means .outer == .inner; because INV == 0. ]
      
      It further means that it's required to find the minimal .inner of the held
      stack to compare against the outer of the new lock; because while 'normal'
      RCU presents a CONFIG type to nested locks, if it is taken while already
      holding a SPIN type it obviously doesn't relax the rules.
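
The rule described so far (split outer/inner, INV meaning "same as inner", and the minimal .inner of the held stack) can be modelled in a few lines. This is a sketch of the check as described in this message, not lockdep's code. The WT_ names and the struct are hypothetical, and treating an empty held stack as sleepable context is an assumption.

```c
#include <stddef.h>
#include <stdbool.h>

/* Mirrors the wait types listed above; INV is 0 so that a zero-initialized
 * .outer means "same as .inner". */
enum wait_type { WT_INV = 0, WT_FREE, WT_SPIN, WT_CONFIG, WT_SLEEP };

struct lock_model {
    enum wait_type outer;   /* context needed to acquire this lock */
    enum wait_type inner;   /* context presented to locks nested inside */
};

/* Apply the ".outer == INV means .outer == .inner" rule. */
static enum wait_type effective_outer(const struct lock_model *l)
{
    return l->outer == WT_INV ? l->inner : l->outer;
}

/* Return true if 'next' may be acquired while 'held[0..nheld)' are held. */
bool wait_context_ok(const struct lock_model *held, size_t nheld,
                     const struct lock_model *next)
{
    enum wait_type next_outer = effective_outer(next);
    enum wait_type min_inner = WT_SLEEP;  /* assumption: empty stack may sleep */
    size_t i;

    if (next_outer == WT_INV)
        return true;                      /* unconverted lock: ignored */

    for (i = 0; i < nheld; i++) {
        if (held[i].inner == WT_INV)
            continue;                     /* unconverted lock: ignored */
        if (held[i].inner < min_inner)
            min_inner = held[i].inner;    /* most restrictive context wins */
    }
    return next_outer <= min_inner;
}
```

In this model a raw_spinlock_t is `{ WT_INV, WT_SPIN }`, a spinlock_t is `{ WT_INV, WT_CONFIG }`, and RCU is taken to be `{ WT_FREE, WT_CONFIG }`. The FREE outer value for RCU is an assumption based on the statement that it can be acquired in pretty much any context. Holding a raw spinlock and then RCU still rejects a spinlock_t, because the minimum .inner of the stack stays at SPIN.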
      
      Below is an example output generated by the trivial test code:
      
        raw_spin_lock(&foo);
        spin_lock(&bar);
        spin_unlock(&bar);
        raw_spin_unlock(&foo);
      
       [ BUG: Invalid wait context ]
       -----------------------------
       swapper/0/1 is trying to lock:
       ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
       other info that might help us debug this:
       1 lock held by swapper/0/1:
        #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
      
      The way to read it is to look at the new -{n:m} part in the lock
      description; -{3:3} for the attempted lock, and try and match that up to
      the held locks, which in this case is the one: -{2:2}.
      
      This tells us that the lock being acquired requires a more relaxed
      environment than the one presented by the lock stack.
      
      Currently only the normal locks and RCU are converted; the rest of the
      lockdep users default to .inner = INV, which is ignored. More conversions
      can be done when desired.
      
      The check for spinlock_t nesting is not enabled by default. It's a separate
      config option for now as there are known problems which are currently
      being addressed. The config option makes it possible to identify these
      problems and to verify that the solutions found are indeed solving them.
      
      The config switch will be removed and the checks will be permanently
      enabled once the vast majority of issues have been addressed.
      
      [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
      	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
      [ tglx: Add the config option ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
      de8f5e4f
  24. 20 Mar, 2020 1 commit
  25. 24 Feb, 2020 2 commits
  26. 26 Jan, 2020 1 commit