1. 31 May 2018, 4 commits
    • sched/headers: Fix typo · 595058b6
      Authored by Davidlohr Bueso
      I cannot spell 'throttling'.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      595058b6
    • sched/deadline: Fix missing clock update · ecda2b66
      Authored by Juri Lelli
      A missing clock update is causing the following warning:
      
       rq->clock_update_flags < RQCF_ACT_SKIP
       WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
       Call Trace:
        <IRQ>
        __hrtimer_run_queues+0x10f/0x530
        hrtimer_interrupt+0xe5/0x240
        smp_apic_timer_interrupt+0x79/0x2b0
        apic_timer_interrupt+0xf/0x20
        </IRQ>
        do_idle+0x203/0x280
        cpu_startup_entry+0x6f/0x80
        start_secondary+0x1b0/0x200
        secondary_startup_64+0xa5/0xb0
       hardirqs last  enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
       hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
       softirqs last  enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
       softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
      
      This happens because inactive_task_timer() calls sub_running_bw() (if
      TASK_DEAD and non_contending), which might trigger a schedutil update,
      which in turn might access the clock. The clock is, however, currently
      updated only later in the inactive_task_timer() function.
      
      Fix the problem by updating the clock right after task_rq_lock().
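      A minimal sketch of what that ordering looks like in inactive_task_timer()
      (illustrative, not the verbatim upstream hunk):

       rq = task_rq_lock(p, &rf);
       update_rq_clock(rq);   /* clock is valid before sub_running_bw() can
                               * kick a schedutil update that reads it */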
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ecda2b66
    • sched/core: Require cpu_active() in select_task_rq(), for user tasks · 7af443ee
      Authored by Paul Burton
      select_task_rq() is used in a few paths to select the CPU upon which a
      thread should be run - for example it is used by try_to_wake_up() & by
      fork or exec balancing. As-is it allows use of any online CPU that is
      present in the task's cpus_allowed mask.
      
      This presents a problem because there is a period whilst CPUs are
      brought online where a CPU is marked online, but is not yet fully
      initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
      CPUHP_ONLINE. Usually we don't run any user tasks during this window,
      but there are corner cases where this can happen. An example observed
      is:
      
        - Some user task A, running on CPU X, forks to create task B.
      
        - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
          task_struct::cpu field to X.
      
        - CPU X is offlined.
      
        - Task A, currently somewhere between the __set_task_cpu() in
          copy_process() and the call to wake_up_new_task(), is migrated to
          CPU Y by migrate_tasks() when CPU X is offlined.
      
        - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
          scheduler is now active on CPU X, but there are no user tasks on
          the runqueue.
      
        - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
          select_task_rq() with cpu=X, taken from task B's task_struct,
          and select_task_rq() allows CPU X to be returned.
      
        - Task A enqueues task B on CPU X's runqueue, via activate_task() &
          enqueue_task().
      
        - CPU X now has a user task on its runqueue before it has reached the
          CPUHP_ONLINE state.
      
      In most cases, the user tasks that schedule on the newly onlined CPU
      have no idea that anything went wrong, but one case observed to be
      problematic is if the task goes on to invoke the sched_setaffinity
      syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
      before the CPU that brought it online calls stop_machine_unpark(). This
      means that for a portion of the window of time between
      CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
      cpu_stopper has its enabled field set to false. If a user thread is
      executed on the CPU during this window and it invokes sched_setaffinity
      with a CPU mask that does not include the CPU it's running on, then when
      __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
      migration_cpu_stop() and perform the actual migration away from the CPU,
      it will simply return -ENOENT rather than calling migration_cpu_stop().
      We then return from the sched_setaffinity syscall back to the user task
      that is now running on a CPU which it just asked not to run on, and
      which is not present in its cpus_allowed mask.
      
      This patch resolves the problem by having select_task_rq() enforce that
      user tasks run on CPUs that are active - the same requirement that
      select_fallback_rq() already enforces. This should ensure that newly
      onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
      schedule user tasks, and also implies that bringup_wait_for_ap() will
      have called stop_machine_unpark() which resolves the sched_setaffinity
      issue above.
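      In code terms, the enforcement amounts to roughly the following in
      select_task_rq() (a hedged sketch; is_cpu_allowed() is the helper added
      by the companion patch below, and for anything that is not a per-CPU
      kthread it reduces to a cpu_active() check):

       if (unlikely(!is_cpu_allowed(p, cpu)))
               cpu = select_fallback_rq(task_cpu(p), p);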
      
      I haven't yet investigated this, but it may be of interest to review
      whether any of the actions performed by hotplug states between
      CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
      effects on user tasks that might schedule before those states are
      reached, which would widen the scope of the problem beyond the
      behaviour of sched_setaffinity.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7af443ee
    • sched/core: Fix rules for running on online && !active CPUs · 175f0e25
      Authored by Peter Zijlstra
      As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
      for running on an online && !active CPU are stricter than just being a
      kthread, you need to be a per-cpu kthread.
      
      If you're not strictly per-CPU, you have better CPUs to run on and
      don't need the partially booted one to get your work done.
      
      The exception is to allow smpboot threads to bootstrap the CPU itself
      and get kernel 'services' initialized before we allow userspace on it.
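      As a predicate, the rule looks roughly like this (a hedged sketch; helper
      names as I believe they appear in kernel/sched/core.c of that era, so
      treat the snippet as illustrative):

       static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
       {
               if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                       return false;

               /* Per-CPU kthreads may run on an online && !active CPU, e.g.
                * to bootstrap it; everybody else needs a fully active CPU. */
               if (is_per_cpu_kthread(p))
                       return cpu_online(cpu);

               return cpu_active(cpu);
       }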
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs")
      Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      175f0e25
  2. 29 May 2018, 1 commit
    • tracing: Make the snapshot trigger work with instances · 2824f503
      Authored by Steven Rostedt (VMware)
      The snapshot trigger currently only affects the main ring buffer, even when
      it is used by the instances. This can be confusing as the snapshot trigger
      is listed in the instance.
      
       > # cd /sys/kernel/tracing
       > # mkdir instances/foo
       > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger
       > # echo top buffer > trace_marker
       > # echo foo buffer > instances/foo/trace_marker
       > # touch /tmp/bar
       > # chown rostedt /tmp/bar
       > # cat instances/foo/snapshot
       # tracer: nop
       #
       #
       # * Snapshot is freed *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      
       > # cat snapshot
       # tracer: nop
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
                   bash-1189  [000] ....   111.488323: tracing_mark_write: top buffer
      
      Not only did the snapshot occur in the top-level buffer, but the
      instance's snapshot buffer, which should have been allocated, is still
      free.
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2824f503
  3. 28 May 2018, 1 commit
    • tracing: Fix crash when freeing instances with event triggers · 86b389ff
      Authored by Steven Rostedt (VMware)
      If an instance has an event trigger enabled when it is freed, it can cause
      an access of freed memory. Here's the case that crashes:
      
       # cd /sys/kernel/tracing
       # mkdir instances/foo
       # echo snapshot > instances/foo/events/initcall/initcall_start/trigger
       # rmdir instances/foo
      
      Would produce:
      
       general protection fault: 0000 [#1] PREEMPT SMP PTI
       Modules linked in: tun bridge ...
       CPU: 5 PID: 6203 Comm: rmdir Tainted: G        W         4.17.0-rc4-test+ #933
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:clear_event_triggers+0x3b/0x70
       RSP: 0018:ffffc90003783de0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0
       RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016
       R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0
       R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c
       FS:  00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0
       Call Trace:
        event_trace_del_tracer+0x2a/0xc5
        instance_rmdir+0x15c/0x200
        tracefs_syscall_rmdir+0x52/0x90
        vfs_rmdir+0xdb/0x160
        do_rmdir+0x16d/0x1c0
        __x64_sys_rmdir+0x17/0x20
        do_syscall_64+0x55/0x1a0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This was due to the call that clears out the triggers when an instance is
      being deleted not removing the trigger from the linked list.
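      The shape of the fix is to unlink each trigger before freeing it, roughly
      (a hedged sketch of clear_event_triggers(); the real code also disables
      the trigger first):

       list_for_each_entry_safe(data, n, &file->triggers, list) {
               list_del_rcu(&data->list);      /* take it off the list first */
               if (data->ops->free)
                       data->ops->free(data->ops, data);
       }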
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      86b389ff
  4. 26 May 2018, 1 commit
  5. 25 May 2018, 2 commits
    • kthread: Allow kthread_park() on a parked kthread · b1f5b378
      Authored by Peter Zijlstra
      The following commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      added a WARN() in the case where we call kthread_park() on an already
      parked thread, because the old code wasn't doing the right thing there
      and it wasn't at all clear that would happen.
      
      It turns out, this does in fact happen, so we have to deal with it.
      
      Instead of potentially returning early, also wait for the completion.
      This does however mean we have to use complete_all() and re-initialize
      the completion on re-use.
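      The generic completion pattern being described is roughly the following
      (illustrative only, not the actual kthread.c code; 'parked' stands for
      the per-thread completion):

       /* parkme side: every waiter, even a repeated park(), sees this */
       complete_all(&parked);

       /* kthread_park() side: waiting is safe even if already parked */
       wait_for_completion(&parked);

       /* kthread_unpark() side: re-arm for the next park cycle */
       reinit_completion(&parked);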
      Reported-by: LKP <lkp@01.org>
      Tested-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: wfg@linux.intel.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b1f5b378
    • sched/topology: Clarify root domain(s) debug string · bf5015a5
      Authored by Juri Lelli
      When scheduler debug is enabled, building scheduling domains outputs
      information about how the domains are laid out and to which root domain
      each CPU (or sets of CPUs) belongs, e.g.:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       span: 0-5 (max cpu_capacity = 1024)
      
      The fact that the last line refers to the CPUs 0-5 root domain isn't,
      however, immediately obvious: one might wonder why span 0-5 is reported
      "again".
      
      Make it clearer by adding "root domain" to it, so as to end up with the
      following:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       root domain span: 0-5 (max cpu_capacity = 1024)
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180524152936.17611-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bf5015a5
  6. 24 May 2018, 1 commit
    • bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Authored by Daniel Borkmann
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries, which
      would still allow a certain degree of out-of-bounds speculation
      against the smaller map; in the 2nd variant, the mask is applied
      from the map with the smaller number of entries, and we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
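      Conceptually, the per-call-site poison step described above amounts to
      something like this (a hedged sketch; the real verifier keeps this in
      insn_aux_data together with the unpriv_array bit):

       if (!aux->map_ptr)
               aux->map_ptr = map;                /* first path to reach this call */
       else if (aux->map_ptr != map)
               aux->map_ptr = BPF_MAP_PTR_POISON; /* different maps collided here */

       /* later, fixup_bpf_calls() sees the poison and, in the unprivileged
        * case, rejects the program since the index-mask rewrite is map-specific */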
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c93552c4
  7. 23 May 2018, 1 commit
  8. 20 May 2018, 1 commit
    • bpf: Prevent memory disambiguation attack · af86ca4e
      Authored by Alexei Starovoitov
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e
  9. 18 May 2018, 5 commits
    • sched/deadline: Make the grub_reclaim() function static · 3febfc8a
      Authored by Mathieu Malaterre
      Since the grub_reclaim() function can be made static, make it so.
      
      Silences the following GCC warning (W=1):
      
        kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3febfc8a
    • sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h · f6a34630
      Authored by Mathieu Malaterre
      In the following commit:
      
        6b55c965 ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
      
      the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
      right next to the prototypes for print_cfs_stats(), print_rt_stats()
      and print_dl_stats().
      
      Finish this previous commit and also move related prototypes for
      print_rt_rq() and print_dl_rq().
      
      Remove the existing extern declarations now that they are no longer needed.
      
      Silences the following GCC warning, triggered by W=1:
      
        kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
        kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f6a34630
    • bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Authored by Daniel Borkmann
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program itself which caused this had a long jump over the whole
      instruction sequence where all of the inner instructions required
      heavy expansions into multiple BPF instructions. Additionally, I also
      had BPF hardening enabled which requires once more rewrites of all
      constant values in order to blind them. Each time we rewrite insns,
      bpf_adj_branches() would need to potentially adjust branch targets
      which cross the patchlet boundary to accommodate for the additional
      delta. Eventually that led to the case where the target offset could
      no longer fit into insn->off's upper 0x7fff limit, so the offset
      wraps around and becomes negative (in the s16 universe), or vice versa
      depending on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such occasions
      in a generic way for native eBPF and cBPF to eBPF migrations. For
      the latter we can simply check bounds in the bpf_convert_filter()'s
      BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
      bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
      of subsequent hardening) is a bit more complex in that we need to
      detect such truncations before hitting the bpf_prog_realloc(). Thus
      the latter is split into an extra pass to probe problematic offsets
      on the original program in order to fail early. With that in place
      and carefully tested I no longer hit the panic and the rewrites are
      rejected properly. The above example panic was seen on bpf-next, though
      the issue itself is generic, and a guard against it in BPF seems the
      more appropriate fix here.
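      The guard itself reduces to a signed 16-bit range check on every adjusted
      branch target, along the lines of (illustrative; new_off is a hypothetical
      name for the recomputed offset):

       if (new_off < S16_MIN || new_off > S16_MAX)
               return -ERANGE;         /* offset would truncate -- reject it */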
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      050fad7c
    • bpf: parse and verdict prog attach may race with bpf map update · 96174560
      Authored by John Fastabend
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use bpf_prog_inc_not_zero() and a READ_ONCE()
      primitive to ensure the program pointer is not refetched and
      possibly NULL'd before the refcnt increment. This happens inside
      an RCU critical section, so although the pointer reference in the map
      object may be NULL'd (by a concurrent detach operation), the reference
      taken via READ_ONCE() will not be free'd until after the grace period.
      This ensures the object returned by READ_ONCE() is valid through the
      RCU critical section and safe to use, even though we "know" it may
      be free'd shortly after.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is that the logic checks that the object returned from the READ_ONCE()
      is not NULL and then tries to reference the object again, but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened, bpf_prog_inc_not_zero() could
      dereference a NULL pointer.
      
      Fix this by using the variable returned by READ_ONCE(), which has been
      checked for NULL.
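      The resulting pattern looks roughly like this (a hedged sketch of the
      idea, not the exact sockmap hunk):

       struct bpf_prog *prog = READ_ONCE(stab->bpf_verdict);

       if (prog) {
               /* take the reference on the snapshot we just checked, never on
                * a fresh stab->bpf_verdict load that a parallel detach may
                * have already NULL'd */
               prog = bpf_prog_inc_not_zero(prog);
               if (IS_ERR(prog))
                       return PTR_ERR(prog);
       }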
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      96174560
    • bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      Authored by John Fastabend
      If the user were to only attach one of the parse or verdict programs
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because in the
      rollback logic, after an error, we have to decrement the program
      reference count when its been incremented. However, we only increment
      the program reference count if the user has both a verdict and a
      parse program. The reason for this is because, at least at the
      moment, both are required for any one to be meaningful. The problem
      fixed here is that in the rollback path we decrement the program refcnt
      even if only one exists, even though we never incremented the refcnt in
      the first place, creating an imbalance.
      
      This patch fixes the error path to handle this case.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a593f708
  10. 16 May 2018, 23 commits
    • locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN · 5a817641
      Authored by Waiman Long
      The filesystem freezing code needs to transfer ownership of a rwsem
      embedded in a percpu-rwsem from the task that does the freezing to
      another one that does the thawing by calling percpu_rwsem_release()
      after freezing and percpu_rwsem_acquire() before thawing.
      
      However, the new rwsem debug code runs afoul of this scheme by warning
      that the task that releases the rwsem isn't the one that acquires it,
      as reported by Amir Goldstein:
      
        DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
        WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
      
        Call Trace:
         percpu_up_write+0x1f/0x28
         thaw_super_locked+0xdf/0x120
         do_vfs_ioctl+0x270/0x5f1
         ksys_ioctl+0x52/0x71
         __x64_sys_ioctl+0x16/0x19
         do_syscall_64+0x5d/0x167
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To work properly with the rwsem debug code, we need to annotate that the
      rwsem ownership is unknown during the transfer period until a brave soul
      comes forward to acquire the ownership. During that period, optimistic
      spinning will be disabled.
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5a817641
    • locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag · d7d760ef
      Authored by Waiman Long
      There are use cases where a rwsem can be acquired by one task, but
      released by another task. In these cases, optimistic spinning may need
      to be disabled.  One example is the filesystem freeze/thaw code
      where the task that freezes the filesystem will acquire a write lock
      on a rwsem and then un-owns it before returning to userspace. Later on,
      another task will come along, acquire the ownership, thaw the filesystem
      and release the rwsem.
      
      Bit 0 of the owner field was used to designate that it is a reader
      owned rwsem. It is now repurposed to mean that the owner of the rwsem
      is not known. If only bit 0 is set, the rwsem is reader owned. If bit
      0 and other bits are set, it is writer owned with an unknown owner.
      One such value for the latter case is (-1L). So we can set owner to 1 for
      reader-owned, -1 for writer-owned. The owner is unknown in both cases.
      
      To handle transfer of rwsem ownership, the higher level code should
      set the owner field to -1 to indicate a write-locked rwsem with unknown
      owner.  Optimistic spinning will be disabled in this case.
      
      Once the higher level code figures who the new owner is, it can then
      set the owner field accordingly.
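      Expressed as definitions, the encoding described above is roughly the
      following (a hedged sketch; the real definitions live in the rwsem
      headers of that era):

       #define RWSEM_ANONYMOUSLY_OWNED  (1UL << 0)  /* bit 0: owner not known */
       #define RWSEM_READER_OWNED  ((struct task_struct *)RWSEM_ANONYMOUSLY_OWNED)
       #define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-1L) /* writer, unknown owner */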
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d7d760ef
    • resource: switch to proc_create_seq_data · 4e292a96
      Authored by Christoph Hellwig
      And use the root resource directly from the proc private data.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      4e292a96
    • proc: introduce proc_create_single{,_data} · 3f3942ac
      Authored by Christoph Hellwig
      Variants of proc_create{,_data} that directly take a seq_file show
      callback and drastically reduce the boilerplate code in the callers.
      
      All trivial callers converted over.
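      A hedged usage sketch (the example names are made up for illustration):

       static int example_show(struct seq_file *m, void *v)
       {
               seq_printf(m, "hello from /proc/example\n");
               return 0;
       }

       /* one call replaces the open() wrapper + file_operations boilerplate */
       proc_create_single("example", 0444, NULL, example_show);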
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      3f3942ac
    • proc: introduce proc_create_seq_private · 44414d82
      Authored by Christoph Hellwig
      Variant of proc_create_data that directly takes a struct seq_operations
      argument + a private state size and drastically reduces the boilerplate
      code in the callers.
      
      All trivial callers converted over.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      44414d82
    • proc: introduce proc_create_seq{,_data} · fddda2b7
      Authored by Christoph Hellwig
      Variants of proc_create{,_data} that directly take a struct seq_operations
      argument and drastically reduce the boilerplate code in the callers.
      
      All trivial callers converted over.
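      A hedged usage sketch (the seq_operations callbacks here are hypothetical):

       static const struct seq_operations example_seq_ops = {
               .start  = example_start,
               .next   = example_next,
               .stop   = example_stop,
               .show   = example_show,
       };

       /* previously: proc_create() plus a file_operations/open() shim */
       proc_create_seq("example", 0444, NULL, &example_seq_ops);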
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      fddda2b7
    • tick/broadcast: Use for_each_cpu() specially on UP kernels · 5596fe34
      Authored by Dexuan Cui
      for_each_cpu() unintuitively reports CPU0 as set independent of the actual
      cpumask content on UP kernels. This causes an unexpected PIT interrupt
      storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as
      a result, the virtual machine can suffer from a strange random delay of 1~20
      minutes during boot-up, and sometimes it can hang forever.
      
      Protect it by checking whether the cpumask is empty before entering the
      for_each_cpu() loop.
      
      [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ]
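      The guard has roughly this shape (a hedged sketch; 'mask' stands for the
      broadcast cpumask being iterated):

       if (!IS_ENABLED(CONFIG_SMP) && cpumask_empty(mask))
               return;         /* on UP, for_each_cpu() would still report CPU0 */

       for_each_cpu(cpu, mask) {
               /* program the broadcast event for each CPU in the mask */
       }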
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      5596fe34
    • rcutorture: Print end-of-test state · 034777d7
      Authored by Paul E. McKenney
      This commit adds end-of-test state printout to help check whether RCU
      shut down nicely.  Note that this printout only helps for flavors of
      RCU that are not used much by the kernel.  In particular, for normal
      RCU having a grace period in progress is expected behavior.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      034777d7
    • rcu: Drop early GP request check from rcu_gp_kthread() · a458360a
      Authored by Paul E. McKenney
      Now that grace-period requests use funnel locking and now that they
      set ->gp_flags to RCU_GP_FLAG_INIT even when the RCU grace-period
      kthread has not yet started, rcu_gp_kthread() no longer needs to check
      need_any_future_gp() at startup time.  This commit therefore removes
      this check.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      a458360a
    • rcu: Simplify and inline cpu_needs_another_gp() · c1935209
      Authored by Paul E. McKenney
      Now that RCU no longer relies on failsafe checks, cpu_needs_another_gp()
      can be greatly simplified.  This simplification eliminates the last
      call to rcu_future_needs_gp() and to rcu_segcblist_future_gp_needed(),
      both of which can then be eliminated.  And then, because
      cpu_needs_another_gp() is called only from __rcu_pending(), it can be
      inlined and eliminated.
      
      This commit carries out the simplification, inlining, and elimination
      called out above.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      c1935209
    • rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() · 384f77f4
      Authored by Paul E. McKenney
      All of the cpu_needs_another_gp() function's checks (except for
      newly arrived callbacks) have been subsumed into the rcu_gp_cleanup()
      function's scan of the rcu_node tree.  This commit therefore drops the
      call to cpu_needs_another_gp().  The check for newly arrived callbacks
      is supplied by rcu_accelerate_cbs().  Any needed advancing (as in the
      earlier rcu_advance_cbs() call) will be supplied when the corresponding
      CPU becomes aware of the end of the now-completed grace period.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      384f77f4
    • rcu: Make rcu_start_this_gp() check for out-of-range requests · 665f08f1
      Authored by Paul E. McKenney
      If rcu_start_this_gp() is invoked with a requested grace period more
      than three in the future, then either the ->need_future_gp[] array
      needs to be bigger or the caller needs to be repaired.  This commit
      therefore adds a WARN_ON_ONCE() checking for this condition.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      665f08f1
    • rcu: Add funnel locking to rcu_start_this_gp() · 360e0da6
      Authored by Paul E. McKenney
      The rcu_start_this_gp() function had a simple form of funnel locking that
      used only the leaves and root of the rcu_node tree, which is fine for
      systems with only a few hundred CPUs, but sub-optimal for systems having
      thousands of CPUs.  This commit therefore adds full-tree funnel locking.
      
      This variant of funnel locking is unusual in the following ways:
      
      1.	The leaf-level rcu_node structure's ->lock is held throughout.
      	Other funnel-locking implementations drop the leaf-level lock
      	before progressing to the next level of the tree.
      
      2.	Funnel locking can be started at the root, which is convenient
      	for code that already holds the root rcu_node structure's ->lock.
      	Other funnel-locking implementations start at the leaves.
      
      3.	If an rcu_node structure other than the initial one believes
      	that a grace period is in progress, it is not necessary to
      	go further up the tree.  This is because grace-period cleanup
      	scans the full tree, so that marking the need for a subsequent
      	grace period anywhere in the tree suffices -- but only if
      	a grace period is currently in progress.
      
      4.	It is possible that the RCU grace-period kthread has not yet
      	started, and this case must be handled appropriately.
      
      However, the general approach of using a tree to control lock contention
      is still in place.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      360e0da6
    • rcu: Make rcu_start_future_gp() caller select grace period · 41e80595
      Authored by Paul E. McKenney
      The rcu_accelerate_cbs() function selects a grace-period target, which
      it uses to have rcu_segcblist_accelerate() assign numbers to recently
      queued callbacks.  Then it invokes rcu_start_future_gp(), which selects
      a grace-period target again, which is a bit pointless.  This commit
      therefore changes rcu_start_future_gp() to take the grace-period target as
      a parameter, thus avoiding double selection.  This commit also changes
      the name of rcu_start_future_gp() to rcu_start_this_gp() to reflect
      this change in functionality, and also makes a similar change to the
      name of trace_rcu_future_gp().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      41e80595
    • rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() · d5cd9685
      Authored by Paul E. McKenney
      The rcu_start_gp_advanced() is invoked only from rcu_start_future_gp() and
      much of its code is redundant when invoked from that context.  This commit
      therefore inlines rcu_start_gp_advanced() into rcu_start_future_gp(),
      then removes rcu_start_gp_advanced().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      d5cd9685
    • rcu: Clear request other than RCU_GP_FLAG_INIT at GP end · a824a287
      Authored by Paul E. McKenney
      Once the grace period has ended, any RCU_GP_FLAG_FQS requests are
      irrelevant:  The grace period has ended, so there is no longer any
      point in forcing quiescent states in order to try to make it end sooner.
      This commit therefore causes rcu_gp_cleanup() to clear any bits other
      than RCU_GP_FLAG_INIT from ->gp_flags at the end of the grace period.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      a824a287
    • rcu: Cleanup, don't put ->completed into an int · a508aa59
      Authored by Paul E. McKenney
      It is true that currently only the low-order two bits are used, so
      there should be no problem given modern machines and compilers, but
      good hygiene and maintainability dictates use of an unsigned long
      instead of an int.  This commit therefore makes this change.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      a508aa59
    • rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() · bd7af846
      Authored by Paul E. McKenney
      The __rcu_process_callbacks() function currently checks to see if
      the current CPU needs a grace period and also if there is any other
      reason to kick off a new grace period.  This is one of the fail-safe
      checks that has been rendered unnecessary by the changes that increase
      the accuracy of rcu_gp_cleanup()'s estimate as to whether another grace
      period is required.  Because this particular fail-safe involved acquiring
      the root rcu_node structure's ->lock, which has seen excessive contention
      in real life, this fail-safe needs to go.
      
      However, one check must remain, namely the check for newly arrived
      RCU callbacks that have not yet been associated with a grace period.
      One might hope that the checks in __note_gp_changes(), which is invoked
      indirectly from rcu_check_quiescent_state(), would suffice, but this
      function won't be invoked at all if RCU is idle.  It is therefore necessary
      to replace the fail-safe checks with a simpler check for newly arrived
      callbacks during an RCU idle period, which is exactly what this commit
      does.  This change removes the final call to rcu_start_gp(), so this
      function is removed as well.
      
      Note that lockless use of cpu_needs_another_gp() is racy, but that
      these races are harmless in this case.  If RCU really is idle, the
      values will not change, so the return value from cpu_needs_another_gp()
      will be correct.  If RCU is not idle, the resulting redundant call to
      rcu_accelerate_cbs() will be harmless, and might even have the benefit
      of reducing grace-period latency a bit.
      
      This commit also moves interrupt disabling into the "if" statement to
      improve real-time response a bit.
      Reported-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      bd7af846
    • rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition · a6058d85
      Authored by Paul E. McKenney
      When __call_rcu_core() notices excessive numbers of callbacks pending
      on the current CPU, we know that at least one of them is not yet
      classified, namely the one that was just now queued.  Therefore, it
      is not necessary to invoke rcu_start_gp() and thus not necessary to
      acquire the root rcu_node structure's ->lock.  This commit therefore
      replaces the rcu_start_gp() with rcu_accelerate_cbs(), thus replacing
      an acquisition of the root rcu_node structure's ->lock with that of
      this CPU's leaf rcu_node structure.
      
      This decreases contention on the root rcu_node structure's ->lock.
      Reported-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      a6058d85
    • rcu: Make rcu_migrate_callbacks wake GP kthread when needed · ec4eacce
      Authored by Paul E. McKenney
      The rcu_migrate_callbacks() function invokes rcu_advance_cbs()
      twice, ignoring the return value.  This is OK at present because of
      failsafe code that does the wakeup when needed.  However, this failsafe
      code acquires the root rcu_node structure's lock frequently, while
      rcu_migrate_callbacks() does so only once per CPU-offline operation.
      
      This commit therefore makes rcu_migrate_callbacks()
      wake up the RCU GP kthread when either call to rcu_advance_cbs()
      returns true, thus removing the need for the failsafe code.
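      The shape of the change (a hedged sketch; variable names illustrative):

       needwake  = rcu_advance_cbs(rsp, rnp_root, rdp);
       needwake |= rcu_advance_cbs(rsp, rnp_root, my_rdp);

       /* after dropping the rcu_node lock: */
       if (needwake)
               rcu_gp_kthread_wake(rsp);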
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      ec4eacce
    • rcu: Convert ->need_future_gp[] array to boolean · 6f576e28
      Authored by Paul E. McKenney
      There is no longer any need for ->need_future_gp[] to count the number of
      requests for future grace periods, so this commit converts the additions
      to assignments to "true" and reduces the size of each element to one byte.
      While we are in the area, fix an obsolete comment.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      6f576e28
    • rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements · 0ae94e00
      Authored by Paul E. McKenney
      Currently, the rcu_future_needs_gp() function checks only the current
      element of the ->need_future_gps[] array, which might miss elements that
      were offset from the expected element, for example, due to races with
      the start or the end of a grace period.  This commit therefore makes
      rcu_future_needs_gp() use the need_any_future_gp() macro to check all
      of the elements of this array.
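      What "check all of the elements" amounts to, as an illustrative expansion
      (the real need_any_future_gp() is a macro in kernel/rcu/tree.h):

       static bool example_need_any_future_gp(struct rcu_node *rnp)
       {
               int i;

               for (i = 0; i < ARRAY_SIZE(rnp->need_future_gp); i++)
                       if (READ_ONCE(rnp->need_future_gp[i]))
                               return true;
               return false;
       }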
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      0ae94e00
    • rcu: Avoid losing ->need_future_gp[] values due to GP start/end races · 51af970d
      Authored by Paul E. McKenney
      The rcu_cbs_completed() function provides the value of ->completed
      at which new callbacks can safely be invoked.  This is recorded in
      two-element ->need_future_gp[] arrays in the rcu_node structure, and
      the elements of these arrays corresponding to the just-completed grace
      period are zeroed at the end of that grace period.  However, the
      rcu_cbs_completed() function can return the current ->completed value
      plus either one or two, so it is possible for the corresponding
      ->need_future_gp[] entry to be cleared just after it was set, thus
      losing a request for a future grace period.
      
      This commit avoids this race by expanding ->need_future_gp[] to four
      elements.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Nicholas Piggin <npiggin@gmail.com>
      51af970d