1. 06 Jun 2018 (1 commit)
    • rseq: Introduce restartable sequences system call · d7822b1e
      Committed by Mathieu Desnoyers
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response time shows a 1-2% improvement in average
      latency, and the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
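
      For illustration, registration and the cpu_id read can be sketched from
      user-space as follows (a minimal sketch only: the RSEQ_SIG value shown and
      the lack of error handling are simplifications, and __NR_rseq may need to
      be defined by hand on older headers):

        #include <linux/rseq.h>         /* uapi struct rseq */
        #include <sys/syscall.h>
        #include <unistd.h>

        static __thread struct rseq rs __attribute__((aligned(32)));

        static int rseq_register(void)
        {
                /* sys_rseq(rseq, rseq_len, flags, sig) */
                return syscall(__NR_rseq, &rs, sizeof(rs), 0, 0x53053053);
        }

        static int read_current_cpu(void)
        {
                /* cpu_id is kept up to date by the notify-resume handler */
                return __atomic_load_n(&rs.cpu_id, __ATOMIC_RELAXED);
        }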
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over the mechanisms currently available for
      reading the current CPU number, with the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  2. 31 May 2018 (4 commits)
    • sched/headers: Fix typo · 595058b6
      Committed by Davidlohr Bueso
      I cannot spell 'throttling'.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      595058b6
    • sched/deadline: Fix missing clock update · ecda2b66
      Committed by Juri Lelli
      A missing clock update is causing the following warning:
      
       rq->clock_update_flags < RQCF_ACT_SKIP
       WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
       Call Trace:
        <IRQ>
        __hrtimer_run_queues+0x10f/0x530
        hrtimer_interrupt+0xe5/0x240
        smp_apic_timer_interrupt+0x79/0x2b0
        apic_timer_interrupt+0xf/0x20
        </IRQ>
        do_idle+0x203/0x280
        cpu_startup_entry+0x6f/0x80
        start_secondary+0x1b0/0x200
        secondary_startup_64+0xa5/0xb0
       hardirqs last  enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
       hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
       softirqs last  enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
       softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
      
      This happens because inactive_task_timer() calls sub_running_bw() (if
      TASK_DEAD and non_contending) that might trigger a schedutil update,
      which might access the clock. Clock is however currently updated only
      later in inactive_task_timer() function.
      
      Fix the problem by updating the clock right after task_rq_lock().
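
      In pattern form, the fix amounts to something like this at the top of
      inactive_task_timer() (a sketch, not the literal diff):

        rq = task_rq_lock(p, &rf);
        update_rq_clock(rq);    /* clock is valid before sub_running_bw() can
                                 * trigger a schedutil update that reads it */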
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ecda2b66
    • sched/core: Require cpu_active() in select_task_rq(), for user tasks · 7af443ee
      Committed by Paul Burton
      select_task_rq() is used in a few paths to select the CPU upon which a
      thread should be run - for example it is used by try_to_wake_up() & by
      fork or exec balancing. As-is it allows use of any online CPU that is
      present in the task's cpus_allowed mask.
      
      This presents a problem because there is a period whilst CPUs are
      brought online where a CPU is marked online, but is not yet fully
      initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
      CPUHP_ONLINE. Usually we don't run any user tasks during this window,
      but there are corner cases where this can happen. An example observed
      is:
      
        - Some user task A, running on CPU X, forks to create task B.
      
        - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
          task_struct::cpu field to X.
      
        - CPU X is offlined.
      
        - Task A, currently somewhere between the __set_task_cpu() in
          copy_process() and the call to wake_up_new_task(), is migrated to
          CPU Y by migrate_tasks() when CPU X is offlined.
      
        - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
          scheduler is now active on CPU X, but there are no user tasks on
          the runqueue.
      
        - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
          select_task_rq() with cpu=X, taken from task B's task_struct,
          and select_task_rq() allows CPU X to be returned.
      
        - Task A enqueues task B on CPU X's runqueue, via activate_task() &
          enqueue_task().
      
        - CPU X now has a user task on its runqueue before it has reached the
          CPUHP_ONLINE state.
      
      In most cases, the user tasks that schedule on the newly onlined CPU
      have no idea that anything went wrong, but one case observed to be
      problematic is if the task goes on to invoke the sched_setaffinity
      syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
      before the CPU that brought it online calls stop_machine_unpark(). This
      means that for a portion of the window of time between
      CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
      cpu_stopper has its enabled field set to false. If a user thread is
      executed on the CPU during this window and it invokes sched_setaffinity
      with a CPU mask that does not include the CPU it's running on, then when
      __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
      migration_cpu_stop() and perform the actual migration away from the CPU
      it will simply return -ENOENT rather than calling migration_cpu_stop().
      We then return from the sched_setaffinity syscall back to the user task
      that is now running on a CPU which it just asked not to run on, and
      which is not present in its cpus_allowed mask.
      
      This patch resolves the problem by having select_task_rq() enforce that
      user tasks run on CPUs that are active - the same requirement that
      select_fallback_rq() already enforces. This should ensure that newly
      onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
      schedule user tasks, and also implies that bringup_wait_for_ap() will
      have called stop_machine_unpark() which resolves the sched_setaffinity
      issue above.
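
      A sketch of the tightened rule (the exact condition and its placement in
      select_task_rq() may differ from the actual patch):

        if (!cpumask_test_cpu(cpu, &p->cpus_allowed) ||
            (!(p->flags & PF_KTHREAD) && !cpu_active(cpu)) ||
            !cpu_online(cpu))
                cpu = select_fallback_rq(task_cpu(p), p);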
      
      I haven't yet investigated them, but it may be of interest to review
      whether any of the actions performed by hotplug states between
      CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
      effects on user tasks that might schedule before they are reached, which
      might widen the scope of the problem from just affecting the behaviour
      of sched_setaffinity.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7af443ee
    • sched/core: Fix rules for running on online && !active CPUs · 175f0e25
      Committed by Peter Zijlstra
      As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
      for running on an online && !active CPU are stricter than just being a
      kthread, you need to be a per-cpu kthread.
      
      If you're not strictly per-CPU, you have better CPUs to run on and
      don't need the partially booted one to get your work done.
      
      The exception is to allow smpboot threads to bootstrap the CPU itself
      and get kernel 'services' initialized before we allow userspace on it.
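
      The distinction can be captured in a small helper along these lines
      (sketch):

        static inline bool is_per_cpu_kthread(struct task_struct *p)
        {
                if (!(p->flags & PF_KTHREAD))
                        return false;           /* user tasks never qualify */

                if (p->nr_cpus_allowed != 1)
                        return false;           /* not strictly per-CPU */

                return true;                    /* e.g. smpboot threads */
        }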
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs")
      Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      175f0e25
  3. 29 May 2018 (1 commit)
    • tracing: Make the snapshot trigger work with instances · 2824f503
      Committed by Steven Rostedt (VMware)
      The snapshot trigger currently only affects the main ring buffer, even when
      it is used by the instances. This can be confusing as the snapshot trigger
      is listed in the instance.
      
       > # cd /sys/kernel/tracing
       > # mkdir instances/foo
       > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger
       > # echo top buffer > trace_marker
       > # echo foo buffer > instances/foo/trace_marker
       > # touch /tmp/bar
       > # chown rostedt /tmp/bar
       > # cat instances/foo/snapshot
       # tracer: nop
       #
       #
       # * Snapshot is freed *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      
       > # cat snapshot
       # tracer: nop
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
                   bash-1189  [000] ....   111.488323: tracing_mark_write: top buffer
      
      Not only did the snapshot occur in the top level buffer, but the instance
      snapshot buffer should have been allocated, and it is still free.
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2824f503
  4. 28 May 2018 (1 commit)
    • tracing: Fix crash when freeing instances with event triggers · 86b389ff
      Committed by Steven Rostedt (VMware)
      If an instance has an event trigger enabled when it is freed, it could cause
      an access of freed memory. Here's the case that crashes:
      
       # cd /sys/kernel/tracing
       # mkdir instances/foo
       # echo snapshot > instances/foo/events/initcall/initcall_start/trigger
       # rmdir instances/foo
      
      Would produce:
      
       general protection fault: 0000 [#1] PREEMPT SMP PTI
       Modules linked in: tun bridge ...
       CPU: 5 PID: 6203 Comm: rmdir Tainted: G        W         4.17.0-rc4-test+ #933
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:clear_event_triggers+0x3b/0x70
       RSP: 0018:ffffc90003783de0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0
       RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016
       R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0
       R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c
       FS:  00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0
       Call Trace:
        event_trace_del_tracer+0x2a/0xc5
        instance_rmdir+0x15c/0x200
        tracefs_syscall_rmdir+0x52/0x90
        vfs_rmdir+0xdb/0x160
        do_rmdir+0x16d/0x1c0
        __x64_sys_rmdir+0x17/0x20
        do_syscall_64+0x55/0x1a0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This was due to the call that clears out the triggers when an instance is
      being deleted not removing the trigger from the linked list.
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      86b389ff
  5. 26 May 2018 (1 commit)
  6. 25 May 2018 (8 commits)
  7. 24 May 2018 (1 commit)
    • bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Committed by Daniel Borkmann
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is taken from the
      map with the significantly larger number of entries, so we would allow a
      certain degree of out-of-bounds speculation for the smaller map; in the
      2nd variant, the mask is taken from the map with the smaller number of
      entries, so we get buggy behavior since we truncate the index of the
      larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
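
      Conceptually the poison logic behaves like this (a simplified sketch; the
      real code keeps the state in insn_aux_data's map_state together with the
      unpriv_array bit):

        if (!aux->map_ptr)
                aux->map_ptr = map;                 /* first path: remember map */
        else if (aux->map_ptr != map)
                aux->map_ptr = BPF_MAP_PTR_POISON;  /* another path, another map */

        /* later, in fixup_bpf_calls(): */
        if (aux->map_ptr == BPF_MAP_PTR_POISON)
                return -EINVAL;                     /* no single index mask fits */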
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c93552c4
  8. 23 May 2018 (1 commit)
  9. 20 May 2018 (1 commit)
    • bpf: Prevent memory disambiguation attack · af86ca4e
      Committed by Alexei Starovoitov
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e
  10. 19 May 2018 (4 commits)
    • timekeeping: Add ktime_get_coarse_with_offset · b9ff604c
      Committed by Arnd Bergmann
      I have run into a couple of drivers using current_kernel_time()
      suffering from the y2038 problem, and they could be converted
      to using ktime_t, but don't have interfaces that skip the nanosecond
      calculation at the moment.
      
      This introduces ktime_get_coarse_with_offset() as a simpler
      variant of ktime_get_with_offset(), and adds wrappers for the
      three time domains we support with the existing function.
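
      The resulting wrappers look roughly like this (sketch of the inline
      helpers):

        static inline ktime_t ktime_get_coarse_realtime(void)
        {
                return ktime_get_coarse_with_offset(TK_OFFS_REAL);
        }

        static inline ktime_t ktime_get_coarse_boottime(void)
        {
                return ktime_get_coarse_with_offset(TK_OFFS_BOOT);
        }

        static inline ktime_t ktime_get_coarse_clocktai(void)
        {
                return ktime_get_coarse_with_offset(TK_OFFS_TAI);
        }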
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: y2038@lists.linaro.org
      Cc: John Stultz <john.stultz@linaro.org>
      Link: https://lkml.kernel.org/r/20180427134016.2525989-5-arnd@arndb.de
      b9ff604c
    • timekeeping: Standardize on ktime_get_*() naming · fb7fcc96
      Committed by Arnd Bergmann
      The current_kernel_time64, get_monotonic_coarse64, getrawmonotonic64,
      get_monotonic_boottime64 and timekeeping_clocktai64 interfaces have
      rather inconsistent naming, and they differ in the calling conventions
      by passing the output either by reference or as a return value.
      
      Rename them to ktime_get_coarse_real_ts64, ktime_get_coarse_ts64,
      ktime_get_raw_ts64, ktime_get_boottime_ts64 and ktime_get_clocktai_ts64
      respectively, and provide the interfaces with macros or inline
      functions as needed.
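
      As an example of how the old names keep working, the compatibility
      wrappers look roughly like this (sketch; exact form in the headers may
      differ):

        static inline void getrawmonotonic64(struct timespec64 *ts)
        {
                ktime_get_raw_ts64(ts);
        }

        static inline struct timespec64 current_kernel_time64(void)
        {
                struct timespec64 now;

                ktime_get_coarse_real_ts64(&now);
                return now;
        }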
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: y2038@lists.linaro.org
      Cc: John Stultz <john.stultz@linaro.org>
      Link: https://lkml.kernel.org/r/20180427134016.2525989-4-arnd@arndb.de
      fb7fcc96
    • timekeeping: Clean up ktime_get_real_ts64 · edca71fe
      Committed by Arnd Bergmann
      In a move to make ktime_get_*() the preferred driver interface into the
      timekeeping code, this sanitizes ktime_get_real_ts64() to be a proper
      exported symbol rather than an alias for getnstimeofday64().
      
      The internal __getnstimeofday64() is no longer used, so remove that
      and merge it into ktime_get_real_ts64().
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: y2038@lists.linaro.org
      Cc: John Stultz <john.stultz@linaro.org>
      Link: https://lkml.kernel.org/r/20180427134016.2525989-3-arnd@arndb.de
      edca71fe
    • timekeeping: Remove timespec64 hack · 4f0fad9a
      Committed by Arnd Bergmann
      At this point, we have converted most of the kernel to use timespec64
      consistently in place of timespec, so it seems it's time to make
      timespec64 the native structure and define timespec in terms of that
      one on 64-bit architectures.
      
      Starting with gcc-5, the compiler can completely optimize away the
      timespec_to_timespec64 and timespec64_to_timespec functions on 64-bit
      architectures. With older compilers, we introduce a couple of extra
      copies of local variables, but those are easily avoided by using
      the timespec64 based interfaces consistently, as we do in most of the
      important code paths already.
      
      The main upside of removing the hack is that printing the tv_sec
      field of a timespec64 structure can now use the %lld format
      string on all architectures without a cast to time64_t. Without
      this patch, the field is a 'long' type and would have to be printed
      using %ld on 64-bit architectures.
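
      For example, a driver can now print the seconds field directly (sketch):

        struct timespec64 ts;

        ktime_get_real_ts64(&ts);
        /* tv_sec is time64_t everywhere, so %lld needs no cast */
        pr_info("wall clock: %lld.%09ld\n", ts.tv_sec, ts.tv_nsec);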
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: y2038@lists.linaro.org
      Cc: John Stultz <john.stultz@linaro.org>
      Link: https://lkml.kernel.org/r/20180427134016.2525989-2-arnd@arndb.de
      4f0fad9a
  11. 18 May 2018 (5 commits)
    • sched/deadline: Make the grub_reclaim() function static · 3febfc8a
      Committed by Mathieu Malaterre
      Since the grub_reclaim() function can be made static, make it so.
      
      Silences the following GCC warning (W=1):
      
        kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3febfc8a
    • sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h · f6a34630
      Committed by Mathieu Malaterre
      In the following commit:
      
        6b55c965 ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
      
      the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
      right next to the prototypes for print_cfs_stats(), print_rt_stats()
      and print_dl_stats().
      
      Finish this previous commit and also move related prototypes for
      print_rt_rq() and print_dl_rq().
      
      Remove the existing extern declarations now that they are not needed anymore.
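
      The moved prototypes amount to the following (sketch; signatures taken
      from the definitions in kernel/sched/debug.c):

        extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
        extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);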
      
      Silences the following GCC warning, triggered by W=1:
      
        kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
        kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f6a34630
    • bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Committed by Daniel Borkmann
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program itself which caused this had a long jump over the whole
      instruction sequence where all of the inner instructions required
      heavy expansions into multiple BPF instructions. Additionally, I also
      had BPF hardening enabled which requires once more rewrites of all
      constant values in order to blind them. Each time we rewrite insns,
      bpf_adj_branches() would need to potentially adjust branch targets
      which cross the patchlet boundary to accommodate for the additional
      delta. Eventually that led to the case where the target offset could
      not fit into insn->off's upper 0x7fff limit anymore, where the offset
      then wraps around becoming negative (in s16 universe), or vice versa
      depending on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such occasions
      in a generic way for native eBPF and cBPF to eBPF migrations. For
      the latter we can simply check bounds in the bpf_convert_filter()'s
      BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
      bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
      of subsequent hardening) is a bit more complex in that we need to
      detect such truncations before hitting the bpf_prog_realloc(). Thus
      the latter is split into an extra pass to probe problematic offsets
      on the original program in order to fail early. With that in place
      and carefully tested I no longer hit the panic and the rewrites are
      rejected properly. I have seen the above example panic on bpf-next,
      though the issue itself is generic, and a guard against it in bpf
      seems appropriate in this case.
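
      The bounds check added to the BPF_EMIT_JMP helper macro is conceptually
      (a sketch; the macro internals are simplified):

        off = target - (i + 1);         /* offset relative to the next insn */
        if (off < S16_MIN || off > S16_MAX)
                goto err;               /* would truncate insn->off, bail out */
        insn->off = off;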
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      050fad7c
    • bpf: parse and verdict prog attach may race with bpf map update · 96174560
      Committed by John Fastabend
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use bpf_prog_inc_not_zero and a READ_ONCE()
      primitive to ensure the program pointer is not refetched and
      possibly NULL'd before the refcnt increment. This happens inside
      an RCU critical section, so although the pointer reference in the map
      object may be NULL'd (by a concurrent detach operation), the reference
      from READ_ONCE() will not be freed until after the grace period. This
      ensures the object returned by READ_ONCE() is valid through the
      RCU critical section and safe to use as long as we "know" it may
      be freed shortly.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is the logic checks the object returned from the READ_ONCE() is not
      NULL and then tries to reference the object again but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened bpf_prog_inc_not_zero could
      dereference a NULL pointer.
      
      Fix this by using the variable returned by READ_ONCE(), which is checked
      for NULL.
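
      The pattern of the fix, in simplified form:

        prog = READ_ONCE(stab->bpf_verdict);
        if (!prog)
                return -EINVAL;

        /* take the ref on the NULL-checked pointer from READ_ONCE(), not on
         * stab->bpf_verdict, which a concurrent detach may clear meanwhile
         */
        prog = bpf_prog_inc_not_zero(prog);
        if (IS_ERR(prog))
                return PTR_ERR(prog);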
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      96174560
    • bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      Committed by John Fastabend
      If the user were to only attach one of the parse or verdict programs
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because in the
      rollback logic, after an error, we have to decrement the program
      reference count when its been incremented. However, we only increment
      the program reference count if the user has both a verdict and a
      parse program. The reason for this is because, at least at the
      moment, both are required for any one to be meaningful. The problem
      fixed here is that in the rollback path we decrement the program refcnt
      even if only one exists. But we never incremented the refcnt in
      the first place, creating an imbalance.
      
      This patch fixes the error path to handle this case.
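
      In sketch form, the corrected rollback only drops references that were
      actually taken (variable names are illustrative):

        /* error rollback: refs were only taken when both programs exist */
        if (parse && verdict) {
                bpf_prog_put(parse);
                bpf_prog_put(verdict);
        }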
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a593f708
  12. 17 May 2018 (1 commit)
  13. 16 May 2018 (11 commits)