1. 03 Aug 2018 (2 commits)
  2. 01 Aug 2018 (1 commit)
    • bpf: verifier: MOV64 don't mark dst reg unbounded · fbeb1603
      By Arthur Fabre
      When check_alu_op() handles a BPF_MOV64 between two registers,
      it calls check_reg_arg(DST_OP) on the dst register, marking it
      as unbounded. If the src and dst register are the same, this
      marks the src as unbounded, which can lead to unexpected errors
      for further checks that rely on bounds info. For example:
      
      	BPF_MOV64_IMM(BPF_REG_2, 0),
      	BPF_MOV64_REG(BPF_REG_2, BPF_REG_2),
      	BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
      	BPF_MOV64_IMM(BPF_REG_0, 0),
      	BPF_EXIT_INSN(),
      
      Results in:
      
      	"math between ctx pointer and register with unbounded
      	min value is not allowed"
      
      check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs
      that need to mark the dst register (MOVIMM, MOV32) do so.
      
      Added a test case for MOV64 dst == src, and dst != src.
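      A minimal sketch of the change in check_alu_op() (not the verbatim diff;
      the surrounding code is omitted and reconstructed from the description
      above):
      
      	/* before: DST_OP wiped the dst bounds, clobbering src when dst == src */
      	err = check_reg_arg(env, insn->dst_reg, DST_OP);
      
      	/* after: leave the dst state alone here; the MOV32 and MOV-immediate
      	 * cases mark the dst register themselves, while MOV64 reg-to-reg
      	 * simply copies the src state with its bounds intact.
      	 */
      	err = check_reg_arg(env, insn->dst_reg, DST_OP_NO_MARK);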
      Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fbeb1603
  3. 21 Jul 2018 (1 commit)
  4. 18 Jul 2018 (8 commits)
    • bpf: offload: allow program and map sharing per-ASIC · fd4f227d
      By Jakub Kicinski
      Allow programs and maps to be re-used across different netdevs,
      as long as they belong to the same struct bpf_offload_dev.
      Update the bpf_offload_prog_map_match() helper for the verifier
      and export a new helper for the drivers to use when checking
      programs at attachment time.
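      A hedged sketch of how a driver might use the newly exported helper at
      attachment time (the surrounding driver code is illustrative, not from
      the patch):
      
      	/* reject programs that were loaded for a different ASIC */
      	if (prog && !bpf_offload_dev_match(prog, netdev))
      		return -EINVAL;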
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fd4f227d
    • bpf: offload: keep the offload state per-ASIC · 602144c2
      By Jakub Kicinski
      Create a higher-level entity to represent a device/ASIC to allow
      programs and maps to be shared between device ports.  The extra
      work is required to make sure we don't destroy BPF objects as
      soon as the netdev for which they were loaded gets destroyed,
      as other ports may still be using them.  When a netdev goes away,
      all of its BPF objects are moved to the other netdevs of the
      device, and only destroyed when the last netdev is unregistered.
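      A rough sketch of the resulting per-ASIC lifecycle from a driver's point
      of view (error handling trimmed; treat the exact sequence as an
      assumption):
      
      	/* probe: one offload dev represents the whole ASIC */
      	bpf_dev = bpf_offload_dev_create();
      	if (IS_ERR(bpf_dev))
      		return PTR_ERR(bpf_dev);
      
      	/* each port's netdev is attached to the shared offload dev */
      	err = bpf_offload_dev_netdev_register(bpf_dev, netdev);
      
      	/* teardown: per port, then per ASIC */
      	bpf_offload_dev_netdev_unregister(bpf_dev, netdev);
      	bpf_offload_dev_destroy(bpf_dev);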
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      602144c2
    • bpf: offload: aggregate offloads per-device · 9fd7c555
      By Jakub Kicinski
      Currently we have two lists of offloaded objects - programs and maps.
      Netdevice unregister notifier scans those lists to orphan objects
      associated with the device being unregistered.  This puts an unnecessary
      (even if negligible) burden on every netdev unregister call in a
      BPF-enabled kernel.  The lists of objects may potentially get long,
      making the linear scan even more problematic.  There haven't been
      complaints about this mechanism so far, but it is suboptimal.
      
      Instead of relying on notifiers, make the few BPF-capable drivers
      register explicitly for BPF offloads.  The programs and maps will
      now be collected per-device not on a global list, and only scanned
      for removal when driver unregisters from BPF offloads.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9fd7c555
    • bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match() · 09728266
      By Jakub Kicinski
      A set of new API functions exported for the drivers will soon use
      'bpf_offload_dev_' as a prefix.  Rename the bpf_offload_dev_match()
      which is internal to the core (used by the verifier) to avoid any
      confusion.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      09728266
    • bpf: sockmap: remove redundant pointer sg · c23e014a
      By Colin Ian King
      Pointer sg is being assigned but is never used hence it is
      redundant and can be removed.
      
      Cleans up clang warning:
      warning: variable 'sg' set but not used [-Wunused-but-set-variable]
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c23e014a
    • bpf: fix rcu annotations in compute_effective_progs() · 3960f4fd
      By Roman Gushchin
      The progs local variable in compute_effective_progs() is marked
      as __rcu, which is not correct. This is a local pointer, which
      is initialized by bpf_prog_array_alloc(), which also now
      returns a generic non-rcu pointer.
      
      The real rcu-protected pointer is *array (array is a pointer
      to an RCU-protected pointer), so the assignment should be performed
      using rcu_assign_pointer().
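      A minimal sketch of the pattern described above (variable names follow
      the commit message; this is not the full function):
      
      	struct bpf_prog_array *progs;	/* plain local pointer, no __rcu */
      
      	progs = bpf_prog_array_alloc(cnt, GFP_KERNEL);
      	if (!progs)
      		return -ENOMEM;
      	/* ... populate progs ... */
      
      	/* publish into the RCU-protected slot */
      	rcu_assign_pointer(*array, progs);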
      
      Fixes: 324bda9e ("bpf: multi program support for cgroup+bpf")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3960f4fd
    • bpf: bpf_prog_array_alloc() should return a generic non-rcu pointer · d29ab6e1
      By Roman Gushchin
      Currently the return type of the bpf_prog_array_alloc() is
      struct bpf_prog_array __rcu *, which is not quite correct.
      Obviously, the returned pointer is a generic pointer, which
      is valid for an indefinite amount of time and it's not shared
      with anyone else, so there is no sense in marking it as __rcu.
      
      This change eliminates the following sparse warnings:
      kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces)
      kernel/bpf/core.c:1544:31:    expected struct bpf_prog_array [noderef] <asn:4>*
      kernel/bpf/core.c:1544:31:    got void *
      kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces)
      kernel/bpf/core.c:1548:17:    expected struct bpf_prog_array [noderef] <asn:4>*
      kernel/bpf/core.c:1548:17:    got struct bpf_prog_array *<noident>
      kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces)
      kernel/bpf/core.c:1681:15:    expected struct bpf_prog_array *array
      kernel/bpf/core.c:1681:15:    got struct bpf_prog_array [noderef] <asn:4>*
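      A hedged sketch of the resulting prototype change (the "before" line is
      reconstructed from the description above):
      
      	/* before */
      	struct bpf_prog_array __rcu *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
      
      	/* after: callers publish it with rcu_assign_pointer() where needed */
      	struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);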
      
      Fixes: 324bda9e ("bpf: multi program support for cgroup+bpf")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d29ab6e1
    • Mark HI and TASKLET softirq synchronous · 3c53776e
      By Linus Torvalds
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This does only the tasklet cases.
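      A sketch of the shape of the change in kernel/softirq.c (reconstructed
      from the description above, not guaranteed to match the diff line for
      line): HI and TASKLET pending bits are never treated as "already being
      handled by ksoftirqd", so they keep running inline.
      
      	#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
      
      	static bool ksoftirqd_running(unsigned long pending)
      	{
      		struct task_struct *tsk = __this_cpu_read(ksoftirqd);
      
      		if (pending & SOFTIRQ_NOW_MASK)
      			return false;
      		return tsk && (tsk->state == TASK_RUNNING);
      	}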
      Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
      Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c53776e
  5. 13 Jul 2018 (2 commits)
    • tracing: Reorder display of TGID to be after PID · f8494fa3
      By Joel Fernandes (Google)
      Currently ftrace displays data in trace output like so:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID   CPU    TGID   ||||    TIMESTAMP  FUNCTION
                     | |       |      |    ||||       |         |
                  bash-1091  [000] ( 1091) d..2    28.313544: sched_switch:
      
      However, Android's trace visualization tools expect a slightly different
      format due to an out-of-tree patch that has been carried for a decade;
      notice that the TGID and CPU fields are reversed:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID    TGID   CPU   ||||    TIMESTAMP  FUNCTION
                     | |        |      |   ||||       |         |
                  bash-1091  ( 1091) [002] d..2    64.965177: sched_switch:
      
      From kernel v4.13 onwards, where TGID display was introduced, tracing
      with systrace on all Android kernels will break (most Android kernels
      have been on 4.9 with Android patches, so this issue hasn't been seen
      yet).
      
      The chrome browser's tracing tools also embed the systrace viewer which
      uses the legacy TGID format and updates to that are known to be
      difficult to make.
      
      Considering this, I suggest we make this change to the upstream kernel
      and backport it to all Android kernels. I believe this feature is merged
      recently enough into the upstream kernel that it shouldn't be a problem.
      Also logically, IMO it makes more sense to group the TGID with the
      TASK-PID and the CPU after these.
      
      Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
      
      Cc: jreck@google.com
      Cc: tkjos@google.com
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f8494fa3
    • bpf: don't leave partial mangled prog in jit_subprogs error path · c7a89784
      By Daniel Borkmann
      syzkaller managed to trigger the following bug through fault injection:
      
        [...]
        [  141.043668] verifier bug. No program starts at insn 3
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
        [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
        [  141.049877] Call Trace:
        [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
        [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
        [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
        [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
        [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
        [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
        [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
        [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [...]
      
      What happens in jit_subprogs() is that kcalloc() for the subprog func
      buffer fails with NULL and we then bail out. The latter is a plain
      return -ENOMEM, which is definitely not okay since earlier in the
      loop we are walking all subprogs and temporarily rewrite insn->off to
      remember the subprog id as well as insn->imm to temporarily point the
      call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
      out in such state and handing this over to the interpreter is troublesome
      since later/subsequent e.g. find_subprog() lookups are based on wrong
      insn->imm.
      
      Therefore, once we hit this point, we need to jump to out_free path
      where we undo all changes from earlier loop, so that interpreter can
      work on unmodified insn->{off,imm}.
      
      Another point is that should find_subprog() fail in jit_subprogs() due
      to a verifier bug, then we also should not simply defer the program to
      the interpreter since also here we did partial modifications. Instead
      we should just bail out entirely and return an error to the user who is
      trying to load the program.
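      A hedged sketch of the intended error handling in jit_subprogs() (the
      label name is an illustrative assumption):
      
      	func = kcalloc(env->subprog_cnt, sizeof(prog), GFP_KERNEL);
      	if (!func)
      		goto out_undo_insn;	/* instead of a bare return -ENOMEM */
      
      	/* ... */
      
      out_undo_insn:
      	/* walk the instructions again and restore the insn->off / insn->imm
      	 * values that were temporarily rewritten for the initial JIT pass,
      	 * so the interpreter sees an unmodified program
      	 */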
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c7a89784
  6. 12 Jul 2018 (2 commits)
  7. 11 Jul 2018 (5 commits)
    • rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      By Mathieu Desnoyers
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __u32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement a 64-bit __get_user, and ppc32 does not
      have a 64-bit get_user. Considering that the rseq_cs pointer does not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
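      A simplified sketch of the uapi idea (the real layout in linux/rseq.h
      also deals with endianness and LP64 vs ILP32, so treat this as an
      approximation):
      
      	union {
      		__u64 ptr64;
      		struct {
      			__u32 ptr32;
      			__u32 padding;	/* must be cleared by 32-bit user-space */
      		} ptr;
      	} rseq_cs;
      
      	/* the kernel always reads the full 64 bits, so it can verify that a
      	 * 32-bit task really left the upper half zeroed */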
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0
    • rseq: uapi: Update uapi comments · 0fb9a1ab
      By Mathieu Desnoyers
      Update rseq uapi header comments to reflect that user-space needs to do
      thread-local loads/stores from/to the struct rseq fields.
      
      As a consequence of this added requirement, the kernel does not need
      to perform loads/stores with single-copy atomicity.
      
      Update the comment associated with the "flags" field to describe
      more accurately that it's only useful to facilitate single-stepping
      through rseq critical sections with debuggers.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
      0fb9a1ab
    • rseq: Use get_user/put_user rather than __get_user/__put_user · 8f281770
      By Mathieu Desnoyers
      __get_user()/__put_user() is used to read values for address ranges that
      were already checked with access_ok() on rseq registration.
      
      It has been recognized that __get_user/__put_user are optimizing the
      wrong thing. Replace them by get_user/put_user across rseq instead.
      
      If those end up showing up in benchmarks, the proper approach would be to
      use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
      anyway.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com
      8f281770
    • rseq: Use __u64 for rseq_cs fields, validate user inputs · e96d7135
      By Mathieu Desnoyers
      Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
      fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
      than ignoring the upper 32 bits on 32-bit kernels. This ensures we have a
      consistent behavior for a 32-bit binary executed on 32-bit kernels and in
      compat mode on 64-bit kernels.
      
      Validating that the abort_ip field is below TASK_SIZE ensures the
      kernel doesn't return to an invalid address when returning to userspace
      after an abort. I don't fully trust each architecture's code to consistently
      deal with invalid return addresses.
      
      Validating the value of the start_ip and post_commit_offset fields
      prevents overflow on arithmetic performed on those values, used to
      check whether abort_ip is within the rseq critical section.
      
      If validation fails, the process is killed with a segmentation fault.
      
      When the signature encountered before abort_ip does not match the expected
      signature, return -EINVAL rather than -EPERM to be consistent with other
      input validation return codes from rseq_get_rseq_cs().
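      A hedged sketch of the validation this describes (field names come from
      the rseq_cs descriptor; the exact checks in rseq_get_rseq_cs() may
      differ slightly):
      
      	if (rseq_cs->start_ip >= TASK_SIZE ||
      	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
      	    rseq_cs->abort_ip >= TASK_SIZE)
      		return -EINVAL;	/* caller kills the task with a segfault */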
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
      e96d7135
    • Revert "tick: Prefer a lower rating device only if it's CPU local device" · 5b5ccbc2
      By Sudeep Holla
      This reverts commit 1332a905.
      
      The original issue was not caused by incorrect checking of the cpumask for
      the new and old tick devices. It was incorrectly analysed due to a
      misunderstanding of the comment and a misinterpretation of the return value
      from tick_check_preferred(). The main issue is with the clockevent driver
      that sets the cpumask to cpu_all_mask instead of cpu_possible_mask.
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Kevin Hilman <khilman@baylibre.com>
      Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com
      5b5ccbc2
  8. 08 Jul 2018 (6 commits)
    • xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      By Toshiaki Makita
      Otherwise we end up attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
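      A hedged sketch of what the shared check boils down to (fwd is the
      redirect target device; names are approximate, not the verbatim helper):
      
      	if (unlikely(!(fwd->flags & IFF_UP)))
      		return -ENETDOWN;
      	if (unlikely(len > fwd->mtu + fwd->hard_header_len + VLAN_HLEN))
      		return -EMSGSIZE;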
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d8d7218a
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      By John Fastabend
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety. Then, when v4.14 was merged into the v4.15-rc1 net-next tree,
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use the
      correct structures; it's unclear how long this would actually keep
      working before someone moves the structs around.
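      For reference, a sketch of the restored helper (a close approximation of
      what the sk_skb path needs, not necessarily the exact diff):
      
      	static void bpf_compute_data_end_sk_skb(struct sk_buff *skb)
      	{
      		struct tcp_skb_cb *cb = TCP_SKB_CB(skb);
      
      		cb->bpf.data_end = skb->data + skb_headlen(skb);
      	}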
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0ea488ff
    • bpf: sockmap, consume_skb in close path · 7ebc14d5
      By John Fastabend
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7ebc14d5
    • bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      By John Fastabend
      After the latest lock updates there is no longer anything preventing a
      close and recvmsg call from running in parallel. Additionally, update can
      race with close if we close a socket and simultaneously update it
      via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this take sock_lock in close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      99ba2b5a
    • bpf: sockmap, hash table is RCU so readers do not need locks · 1d1ef005
      By John Fastabend
      This removes locking from readers of the RCU hash table. It's not
      necessary.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1d1ef005
    • bpf: sockmap, error path can not release psock in multi-map case · 547b3aa4
      By John Fastabend
      The current code, in the error path of sock_hash_ctx_update_elem,
      checks if the sock has a psock in the user data and if so decrements
      the reference count of the psock. However, if the error happens early
      in the error path we may have never incremented the psock reference
      count and if the psock exists because the sock is in another map then
      we may inadvertently decrement the reference count.
      
      Fix this by making the error path only call smap_release_sock if the
      error happens after the increment.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      547b3aa4
  9. 04 Jul 2018 (7 commits)
  10. 03 Jul 2018 (6 commits)
    • kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      By Peter Zijlstra
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him. Because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered because the TASK_DEAD thing is done with
      preemption disabled, the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole kthread still gets migrated nonsense -- although it
      would be really good to get this done differently.
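      A hedged sketch of the kthread_park() side of this (trimmed; the real
      function also handles error cases):
      
      	set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
      	if (k != current) {
      		wake_up_process(k);
      		/* wait for __kthread_parkme() to set TASK_PARKED ... */
      		wait_for_completion(&kthread->parked);
      		/* ... and for the task to actually be scheduled out */
      		WARN_ON_ONCE(!wait_task_inactive(k, TASK_PARKED));
      	}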
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cef1150
    • sched/util_est: Fix util_est_dequeue() for throttled cfs_rq · 3482d98b
      By Vincent Guittot
      When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and
      everything happens at cfs_rq level. Currently util_est stays unchanged
      in such a case and it keeps accounting the utilization of throttled tasks.
      This can somewhat make sense as we don't dequeue the tasks but only the
      throttled cfs_rq.
      
      If a task of another group is enqueued/dequeued and root cfs_rq becomes
      idle during the dequeue, util_est will be cleared whereas it was
      accounting the util_est of throttled tasks before. So the behavior of util_est
      is not always the same regarding throttled tasks and depends on side
      activity. Furthermore, util_est will not be updated when the cfs_rq is
      unthrottled, as everything happens at cfs_rq level. The main result is that
      util_est stays null whereas we now have running tasks. We have to wait
      for the next dequeue/enqueue of the previously throttled tasks to get an
      up-to-date util_est.
      
      Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
      if there is no running task so the util_est of a task remains until the
      latter is dequeued even if its cfs_rq has been throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 7f65ea42 ("sched/fair: Add util_est on top of PELT")
      Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3482d98b
    • sched/fair: Advance global expiration when period timer is restarted · f1d1be8a
      By Xunlei Pang
      When the period gets restarted after some idle time, start_cfs_bandwidth()
      doesn't update the expiration information, so expire_cfs_rq_runtime() will
      see cfs_rq->runtime_expires smaller than the rq clock and fall into the
      clock drift logic, wasting needless CPU cycles on the scheduler hot path.
      
      Update the global expiration in start_cfs_bandwidth() to avoid frequent
      expire_cfs_rq_runtime() calls once a new period begins.
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1d1be8a
    • sched/fair: Fix bandwidth timer clock drift condition · 512ac999
      By Xunlei Pang
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage; this causes some jitter in the response
      time of some of our business containers when CPU quotas are enabled.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after period timer is re-armed.
      
      The current condition used to detect clock drift in expire_cfs_rq_runtime()
      is wrong: the two runtime_expires are actually the same when clock
      drift happens, so this condition can never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way: it adds a new field in both structures
      cfs_rq and cfs_bandwidth to record the expiration update sequence, and
      uses them to figure out if clock drift happens (true if they are equal).
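      A hedged sketch of the resulting check in expire_cfs_rq_runtime() (the
      field name expires_seq comes from the description above; the exact code
      may differ):
      
      	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
      		/* same period: the difference is clock drift, extend locally */
      		cfs_rq->runtime_expires += TICK_NSEC;
      	} else {
      		/* a new global period started: the runtime really expired */
      		cfs_rq->runtime_remaining = 0;
      	}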
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      512ac999
    • sched/rt: Fix call to cpufreq_update_util() · 296b2ffe
      By Vincent Guittot
      With commit:
      
        8f111bc3 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
      
      the schedutil governor uses rq->rt.rt_nr_running to detect whether an
      RT task is currently running on the CPU and to set frequency to max
      if necessary.
      
      cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
      rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
      called so schedutil still considers that an RT task is running when the
      last task is dequeued. The update of rq->rt.rt_nr_running happens later
      in dequeue_rt_stack().
      
      In fact, we can take advantage of the sequence in which rt entities are
      dequeued and then re-enqueued when an RT task is enqueued or dequeued;
      as a result, enqueue_top_rt_rq() is always called when a task is
      enqueued or dequeued, and also when groups are throttled or unthrottled.
      The only place that does not use enqueue_top_rt_rq() is when the root
      rt_rq is throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: juri.lelli@redhat.com
      Cc: patrick.bellasi@arm.com
      Cc: viresh.kumar@linaro.org
      Fixes: 8f111bc3 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
      Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      296b2ffe
    • sched/nohz: Skip remote tick on idle task entirely · d9c0ffca
      By Frederic Weisbecker
      Some people have reported that the warning in sched_tick_remote()
      occasionally triggers, especially under RCU-torture pressure:
      
      	WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
      	Modules linked in:
      	CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      	Workqueue: events_unbound sched_tick_remote
      	RIP: 0010:sched_tick_remote+0xb6/0xc0
      	Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb
      	+c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
      	Call Trace:
      	 process_one_work+0x1df/0x3b0
      	 worker_thread+0x44/0x3d0
      	 kthread+0xf3/0x130
      	 ? set_worker_desc+0xb0/0xb0
      	 ? kthread_create_worker_on_cpu+0x70/0x70
      	 ret_from_fork+0x35/0x40
      
      This happens when the remote tick applies on an idle task. Usually the
      idle_cpu() check avoids that, but it is performed before we lock the
      runqueue and it is therefore racy. It was intended to be that way in
      order to avoid useless runqueue locking, since the idle task's tick
      callback is a no-op.
      
      Now if the racy check slips out of our hands and we end up remotely
      ticking an idle task, the empty task_tick_idle() is harmless. Still
      it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
      not too far from curr->se.exec_start because update_curr_idle() doesn't
      update the exec_start value like other scheduler policies. Hence the
      reported false positive.
      
      So let's have another check, while the rq is locked, to make sure we
      don't remote-tick an idle task. The lockless idle_cpu() check still applies
      to avoid unnecessary rq lock contention.
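      A hedged sketch of the locked re-check in sched_tick_remote()
      (reconstructed from the description above, not the verbatim diff):
      
      	rq_lock_irq(rq, &rf);
      	curr = rq->curr;
      	if (is_idle_task(curr))		/* re-check under the rq lock */
      		goto out_unlock;
      
      	update_rq_clock(rq);
      	delta = rq_clock_task(rq) - curr->se.exec_start;
      	WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
      	curr->sched_class->task_tick(rq, curr, 0);
      out_unlock:
      	rq_unlock_irq(rq, &rf);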
      Reported-by: Jacek Tomaka <jacekt@dug.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d9c0ffca