1. 16 December 2017, 1 commit
    • bpf: guarantee r1 to be ctx in case of bpf_helper_changes_pkt_data · 04514d13
      Committed by Daniel Borkmann
      Some JITs don't cache the skb context on the stack in the prologue, so
      when LD_ABS/IND is used and a helper call is made for which
      bpf_helper_changes_pkt_data() returns true, they temporarily save and
      restore the skb pointer. However, the assumption that skb always has
      to be in r1 is a bit of a gamble. Right now it turned out to be true
      for all helpers listed in bpf_helper_changes_pkt_data(), but let's
      enforce that from the verifier side, so that we make this a guarantee
      and bail out if the func proto is misconfigured in future helpers.
      
      In the case of BPF helper calls from cBPF, bpf_helper_changes_pkt_data()
      is completely irrelevant here (since cBPF is context read-only) and
      therefore always false.
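      
      A minimal sketch of the kind of verifier-side check described above
      (names loosely modeled on kernel/bpf/verifier.c; an assumption, not
      the verbatim patch):
      
         /* in check_call(), fn is the helper's struct bpf_func_proto */
         changes_data = bpf_helper_changes_pkt_data(fn->func);
         if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) {
                 /* helpers that rewrite packet data must take ctx in r1,
                  * otherwise the JIT's save/restore assumption breaks
                  */
                 verbose("kernel subsystem misconfigured func proto\n");
                 return -EINVAL;
         }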
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 23 November 2017, 2 commits
    • bpf: fix branch pruning logic · c131187d
      Committed by Alexei Starovoitov
      When the verifier detects that a register contains a constant
      and it is compared with another constant, it prunes exploration
      of the branch that is guaranteed not to be taken at runtime.
      This is all correct, but a malicious program may be constructed
      in such a way that it always has a constant comparison and
      the other branch is never taken under any conditions.
      In this case such a path through the program will not be explored
      by the verifier. It won't be taken at run-time either, but since
      all instructions are JITed the malicious program may cause JITs
      to complain about using reserved fields, etc.
      To fix the issue we have to track the instructions explored by
      the verifier and sanitize instructions that are dead at run time
      with NOPs. We cannot reject such dead code, since llvm generates
      it for valid C code: it doesn't do as much data flow
      analysis as the verifier does.
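      
      The fix could look roughly like this (a sketch based on the
      description above; the seen flag and the nop encoding are
      assumptions, not a verbatim copy of the patch):
      
         /* during do_check(): mark every instruction the verifier walked */
         env->insn_aux_data[insn_idx].seen = true;
      
         /* after verification: overwrite never-seen (dead) insns with a nop */
         static void sanitize_dead_code(struct bpf_verifier_env *env)
         {
                 struct bpf_insn nop = BPF_MOV64_REG(BPF_REG_0, BPF_REG_0);
                 struct bpf_insn *insn = env->prog->insnsi;
                 int i;
      
                 for (i = 0; i < env->prog->len; i++)
                         if (!env->insn_aux_data[i].seen)
                                 insn[i] = nop;
         }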
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: introduce ARG_PTR_TO_MEM_OR_NULL · db1ac496
      Committed by Gianluca Borello
      With the current ARG_PTR_TO_MEM/ARG_PTR_TO_UNINIT_MEM semantics, a helper
      argument can be NULL when the next argument type is ARG_CONST_SIZE_OR_ZERO
      and the verifier can prove the value of this next argument is 0. However,
      most helpers are only interested in handling <!NULL, 0>, so forcing them to
      deal with <NULL, 0> makes their implementation more complicated for no
      apparent benefit: they have to explicitly handle those corner cases with
      checks that bpf programs could start relying upon, preventing the
      possibility of removing them later.
      
      Solve this by making ARG_PTR_TO_MEM/ARG_PTR_TO_UNINIT_MEM never accept NULL
      even when ARG_CONST_SIZE_OR_ZERO is set, and introduce a new argument type
      ARG_PTR_TO_MEM_OR_NULL to explicitly deal with the NULL case.
      
      Currently, the only helper that needs this is bpf_csum_diff_proto(), so
      change arg1 and arg3 to this new type as well.
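      
      For illustration, the resulting func proto plausibly looks like this
      (a sketch reconstructed from the description, not verbatim from the
      patch):
      
         static const struct bpf_func_proto bpf_csum_diff_proto = {
                 .func           = bpf_csum_diff,
                 .gpl_only       = false,
                 .pkt_access     = true,
                 .ret_type       = RET_INTEGER,
                 .arg1_type      = ARG_PTR_TO_MEM_OR_NULL,
                 .arg2_type      = ARG_CONST_SIZE_OR_ZERO,
                 .arg3_type      = ARG_PTR_TO_MEM_OR_NULL,
                 .arg4_type      = ARG_CONST_SIZE_OR_ZERO,
                 .arg5_type      = ARG_ANYTHING,
         };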
      
      Also add a new battery of tests that explicitly exercise the
      !ARG_PTR_TO_MEM_OR_NULL combination: all the current tests covering the
      various <NULL, 0> variations are focused on bpf_csum_diff, so cover
      other helpers as well.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 14 November 2017, 1 commit
    • bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics · 9fd29c08
      Committed by Yonghong Song
      For helpers, the argument type ARG_CONST_SIZE_OR_ZERO permits the
      access size to be 0 when accessing the previous argument (arg).
      Right now, it requires the arg to be NULL when the size passed
      is 0 or could be 0, and a non-NULL arg when the size is proved
      to be non-zero.
      
      This patch changes the verifier's ARG_CONST_SIZE_OR_ZERO behavior
      such that for a size of 0, or a possible size of 0, the arg is no
      longer required to be NULL.
      
      There are a couple of reasons for this semantics change, and
      all of them intend to simplify user bpf programs, which
      may improve the user experience and/or increase the chances of
      verifier acceptance. Together with the next patch, which
      changes the bpf_probe_read arg2 type from ARG_CONST_SIZE to
      ARG_CONST_SIZE_OR_ZERO, the following two examples, which
      fail the verifier currently, are able to get verifier acceptance.
      
      Example 1:
         unsigned long len = pend - pstart;
         len = len > MAX_PAYLOAD_LEN ? MAX_PAYLOAD_LEN : len;
         len &= MAX_PAYLOAD_LEN;
         bpf_probe_read(data->payload, len, pstart);
      
      It does not have a test for "len > 0", so it fails the verifier.
      Users may not be aware that they have to add this test.
      Converting the bpf_probe_read helper to
      ARG_CONST_SIZE_OR_ZERO helps the above code get
      verifier acceptance.
      
      Example 2:
        Here is one example where llvm "messed up" the code and
        the verifier fails.
      
      ......
         unsigned long len = pend - pstart;
         if (len > 0 && len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      ......
      
      The compiler generates the following code and the verifier fails:
      ......
      39: (79) r2 = *(u64 *)(r10 -16)
      40: (1f) r2 -= r8
      41: (bf) r1 = r2
      42: (07) r1 += -1
      43: (25) if r1 > 0xffe goto pc+3
        R0=inv(id=0) R1=inv(id=0,umax_value=4094,var_off=(0x0; 0xfff))
        R2=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4095,imm=0) R7=inv(id=0)
        R8=inv(id=0) R9=inv0 R10=fp0
      44: (bf) r1 = r6
      45: (bf) r3 = r8
      46: (85) call bpf_probe_read#45
      R2 min value is negative, either use unsigned or 'var &= const'
      ......
      
      The compiler optimization is correct: if r1 = 0, then
      r1 - 1 = 0xffffffffffffffff > 0xffe; if r1 != 0, r1 - 1 will not wrap.
      So "r1 > 0xffe" at insn #43 actually captures
      both "r1 > 0" and "len <= MAX_PAYLOAD_LEN".
      This however causes an issue in the verifier, as the value range of arg2
      "r2" does not get properly refined, leading to verification failure.
      
      Relaxing bpf_probe_read arg2 from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO
      allows the following simplified code:
         unsigned long len = pend - pstart;
         if (len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      
      The llvm compiler will generate less complex code and the
      verifier is able to verify that the program is okay.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 November 2017, 2 commits
  5. 05 November 2017, 3 commits
  6. 03 November 2017, 3 commits
    • bpf: fix verifier NULL pointer dereference · 8c01c4f8
      Committed by Craig Gallek
      do_check() can fail early without allocating env->cur_state under
      memory pressure.  Syzkaller found the stack below on the linux-next
      tree because of this.
      
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        task: ffff8801c2c74700 task.stack: ffff8801c3e28000
        RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
        RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
        RSP: 0018:ffff8801c3e2f5c8 EFLAGS: 00010202
        RAX: dffffc0000000000 RBX: 00000000fffffff4 RCX: 0000000000000000
        RDX: 0000000000000070 RSI: ffffffff817d5aa9 RDI: 0000000000000380
        RBP: ffff8801c3e2f668 R08: 0000000000000000 R09: 1ffff100387c5d9f
        R10: 00000000218c4e80 R11: ffffffff85b34380 R12: ffff8801c4dc6a28
        R13: 0000000000000000 R14: ffff8801c4dc6a00 R15: ffff8801c4dc6a20
        FS:  00007f311079b700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000004d4a24 CR3: 00000001cbcd0000 CR4: 00000000001406e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
         SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
         SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
         entry_SYSCALL_64_fastpath+0x1f/0xbe
        RIP: 0033:0x452869
        RSP: 002b:00007f311079abe8 EFLAGS: 00000212 ORIG_RAX: 0000000000000141
        RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452869
        RDX: 0000000000000030 RSI: 0000000020168000 RDI: 0000000000000005
        RBP: 00007f311079aa20 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000212 R12: 00000000004b7550
        R13: 00007f311079ab58 R14: 00000000004b7560 R15: 0000000000000000
        Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00 00 00 00 00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b6 0b 00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
        RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP: ffff8801c3e2f5c8
        RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP: ffff8801c3e2f5c8
        ---[ end trace c8d37f339dc64004 ]---
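      
      The fix amounts to guarding the cleanup path, roughly (a sketch; the
      exact free_verifier_state() signature at the time may differ):
      
         /* in bpf_check()'s error path: cur_state may never have been
          * allocated if do_check() bailed out early
          */
         if (env->cur_state) {
                 free_verifier_state(env->cur_state, true);
                 env->cur_state = NULL;
         }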
      
      Fixes: 638f5b90 ("bpf: reduce verifier memory consumption")
      Fixes: 1969db47 ("bpf: fix verifier memory leaks")
      Signed-off-by: Craig Gallek <kraig@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix out-of-bounds access warning in bpf_check · eba0c929
      Committed by Arnd Bergmann
      The bpf_verifier_ops array is generated dynamically and may be
      empty depending on configuration, which then causes an out-of-bounds
      access:
      
      kernel/bpf/verifier.c: In function 'bpf_check':
      kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds [-Werror=array-bounds]
      
      This adds a check to the start of the function as a workaround.
      I would assume that the function is never called in that configuration,
      so the warning is probably harmless.
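      
      A sketch of the workaround (assumed shape of the early check, based
      on the description above):
      
         /* at the start of bpf_check(): with no program types configured,
          * the bpf_verifier_ops array is empty and must not be indexed
          */
         if (ARRAY_SIZE(bpf_verifier_ops) == 0)
                 return -EINVAL;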
      
      Fixes: 00176a34 ("bpf: remove the verifier ops from program structure")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix link error without CONFIG_NET · 7cce782e
      Committed by Arnd Bergmann
      I ran into this link error with the latest net-next plus linux-next
      trees when networking is disabled:
      
      kernel/bpf/verifier.o:(.rodata+0x2958): undefined reference to `tc_cls_act_analyzer_ops'
      kernel/bpf/verifier.o:(.rodata+0x2970): undefined reference to `xdp_analyzer_ops'
      
      It seems that the code was written to deal with varying contents of
      the array, but the actual #ifdef was missing. Both tc_cls_act_analyzer_ops
      and xdp_analyzer_ops are defined in the core networking code, so adding
      a check for CONFIG_NET seems appropriate here, and I've verified this with
      many randconfig builds.
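      
      The fix plausibly looks like this (a sketch of the missing #ifdef;
      the array name is assumed from the surrounding commits):
      
         static const struct bpf_verifier_ops * const bpf_analyzer_ops[] = {
         #ifdef CONFIG_NET
                 [BPF_PROG_TYPE_XDP]             = &xdp_analyzer_ops,
                 [BPF_PROG_TYPE_SCHED_CLS]       = &tc_cls_act_analyzer_ops,
         #endif
         };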
      
      Fixes: 4f9218aa ("bpf: move knowledge about post-translation offsets out of verifier")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 02 November 2017, 2 commits
  8. 01 November 2017, 2 commits
  9. 22 October 2017, 2 commits
  10. 18 October 2017, 6 commits
    • bpf: move knowledge about post-translation offsets out of verifier · 4f9218aa
      Committed by Jakub Kicinski
      Use the fact that verifier ops are now separate from program
      ops to define a separate set of callbacks for verification of
      already translated programs.
      
      Since we expect the analyzer ops to be defined only for
      a small subset of all program types, initialize their array
      by hand (don't use linux/bpf_types.h).
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: remove the verifier ops from program structure · 00176a34
      Committed by Jakub Kicinski
      Since the verifier ops don't have to be associated with
      the program for its entire lifetime, we can move them into
      the verifier's struct bpf_verifier_env.
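      
      In effect (a sketch, not the full struct definition):
      
         struct bpf_verifier_env {
                 struct bpf_prog *prog;       /* eBPF program being verified */
                 const struct bpf_verifier_ops *ops; /* moved here from bpf_prog */
                 /* ... remaining verifier state ... */
         };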
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: split verifier and program ops · 7de16e3a
      Committed by Jakub Kicinski
      struct bpf_verifier_ops contains both verifier ops and operations
      used later during program's lifetime (test_run).  Split the runtime
      ops into a different structure.
      
      BPF_PROG_TYPE() will now append ## _prog_ops or ## _verifier_ops
      to the names.
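      
      That is, roughly (a sketch of the macro shape described above):
      
         /* one pair of declarations generated per type via linux/bpf_types.h */
         #define BPF_PROG_TYPE(_id, _name) \
                 extern const struct bpf_prog_ops _name ## _prog_ops; \
                 extern const struct bpf_verifier_ops _name ## _verifier_ops;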
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: disallow arithmetic operations on context pointer · 28e33f9d
      Committed by Jakub Kicinski
      Commit f1174f77 ("bpf/verifier: rework value tracking")
      removed the crafty selection of which pointer types are
      allowed to be modified.  This is OK for most pointer types
      since adjust_ptr_min_max_vals() will catch operations on
      immutable pointers.  One exception is PTR_TO_CTX which is
      now allowed to be offset freely.
      
      The intent of the aforementioned commit was to allow context
      access via modified registers.  The offset passed to the
      ->is_valid_access() verifier callback has been adjusted
      by the value of the variable offset.
      
      What is missing, however, is taking the variable offset
      into account when the context register is used.  Or, in terms
      of the code, adding the offset to the value passed to the
      ->convert_ctx_access() callback.  This leads to the following
      eBPF user code:
      eBPF user code:
      
           r1 += 68
           r0 = *(u32 *)(r1 + 8)
           exit
      
      being translated to this in kernel space:
      
         0: (07) r1 += 68
         1: (61) r0 = *(u32 *)(r1 +180)
         2: (95) exit
      
      Offset 8 corresponds to 180 in the kernel, but offset
      76 is valid too.  The verifier will "accept" the access at offset
      68+8=76 but then "convert" it as an access at offset 8, i.e. 180.
      The effective access at offset 68+180=248 is beyond the kernel context.
      (This is a __sk_buff example on a debug-heavy kernel -
      packet mark is 8 -> 180, 76 would be data.)
      
      Dereferencing the modified context pointer is not as easy
      as dereferencing other types, because we have to translate
      the access to reading a field in kernel structures which is
      usually at a different offset and often of a different size.
      To allow modifying the pointer we would have to make sure
      that a given eBPF instruction will always access the same
      field, or that the fields accessed are "compatible" in terms of
      offset and size...
      
      Disallow dereferencing modified context pointers and add
      the test case described here to selftests.
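      
      A sketch of the verifier-side rejection (assumed shape, based on the
      description; not verbatim from the patch):
      
         /* in check_mem_access(), when reg->type == PTR_TO_CTX: ctx accesses
          * must be at a fixed, unmodified offset so that we can determine
          * which field is accessed and how to convert it
          */
         if (reg->off || !tnum_is_const(reg->var_off)) {
                 verbose("dereference of modified ctx ptr R%d disallowed\n",
                         regno);
                 return -EACCES;
         }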
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: XDP_REDIRECT enable use of cpumap · 9c270af3
      Committed by Jesper Dangaard Brouer
      This patch connects cpumap to the xdp_do_redirect_map infrastructure.
      
      No SKB allocations are done yet.  The XDP frames are transferred
      to the other CPU, but they are simply refcnt decremented on the remote
      CPU.  This served as a good benchmark for measuring the overhead of
      the remote refcnt decrement.  If the driver's page recycle cache is not
      efficient, this exposes a bottleneck in the page allocator.
      
      A shout-out to MST's ptr_ring, which is the secret behind this being so
      efficient at transferring memory pointers between CPUs without
      constantly bouncing cache lines between CPUs.
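      
      For illustration, a minimal XDP program using such a map might look
      like this (hypothetical map and function names; a sketch, not part
      of the patch):
      
         struct bpf_map_def SEC("maps") cpu_map = {
                 .type        = BPF_MAP_TYPE_CPUMAP,
                 .key_size    = sizeof(u32),
                 .value_size  = sizeof(u32),  /* queue size for the CPU's kthread */
                 .max_entries = 64,
         };
      
         SEC("xdp")
         int xdp_redir_cpu(struct xdp_md *ctx)
         {
                 u32 cpu = 1;  /* steer all frames to CPU 1 */
      
                 return bpf_redirect_map(&cpu_map, cpu, 0);
         }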
      
      V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
      
      V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
       as the implementation requires a separate upstream discussion.
      
      V5:
       - Fix a maybe-uninitialized warning pointed out by kbuild test robot.
       - Restrict bpf-prog side access to cpumap, open when use-cases appear
       - Implement cpu_map_enqueue() as a more simple void pointer enqueue
      
      V6:
       - Allow cpumap type for usage in helper bpf_redirect_map,
         general bpf-prog side restriction moved to earlier patch.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Committed by Jesper Dangaard Brouer
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implements the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocations are done yet.
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, whose assumptions are documented extra carefully in the
      code comments.
      
      V2:
       - make sure the array isn't larger than NR_CPUS
       - make sure each CPU added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all notice by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down; this involved use of
         rcu_barrier(), and the kthread now only exits after its queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and grammar fixes by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 15 October 2017, 1 commit
  12. 11 October 2017, 4 commits
  13. 08 October 2017, 2 commits
    • bpf: fix liveness marking · 8fe2d6cc
      Committed by Alexei Starovoitov
      While processing an Rx = Ry instruction the verifier does
      regs[insn->dst_reg] = regs[insn->src_reg]
      which often clears the write mark (when Ry doesn't have one)
      that was just set by check_reg_arg(Rx) prior to the assignment.
      That causes mark_reg_read() to keep marking Rx in this block as
      REG_LIVE_READ (since the logic incorrectly misses that it's
      screened by the write) and in many of its parents (until a lucky
      write into the same Rx or the beginning of the program).
      That causes the is_state_visited() logic to miss many pruning opportunities.
      
      Furthermore, the mark_reg_read() logic propagates the read mark
      for BPF_REG_FP as well (though it's read-only), which causes
      harmless but unnecessary work during is_state_visited().
      Note that do_propagate_liveness() skips FP correctly,
      so do the same in mark_reg_read() as well.
      It saves 0.2 seconds for the test below:
      
      program               before  after
      bpf_lb-DLB_L3.o       2604    2304
      bpf_lb-DLB_L4.o       11159   3723
      bpf_lb-DUNKNOWN.o     1116    1110
      bpf_lxc-DDROP_ALL.o   34566   28004
      bpf_lxc-DUNKNOWN.o    53267   39026
      bpf_netdev.o          17843   16943
      bpf_overlay.o         8672    7929
      time                  ~11 sec  ~4 sec
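      
      The state-copy part of the fix is roughly (a sketch based on the
      description above):
      
         /* case: Rx = Ry. Copying Ry's state also copies its liveness and
          * wipes the write mark check_reg_arg() just set on Rx, so restore
          * the write mark explicitly after the copy
          */
         regs[insn->dst_reg] = regs[insn->src_reg];
         regs[insn->dst_reg].live |= REG_LIVE_WRITTEN;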
      
      Fixes: dc503a8a ("bpf/verifier: track liveness for pruning")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add helper bpf_perf_event_read_value for perf event array map · 908432ca
      Committed by Yonghong Song
      Hardware pmu counters are limited resources. When there are more
      pmu-based perf events opened than available counters, the kernel will
      multiplex these events so each event gets a certain percentage
      (but not 100%) of the pmu time. When multiplexing happens,
      the number of samples or the counter value will not reflect what
      they would be without multiplexing. This makes comparison between
      different runs difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for the event and time_running is
      the time running for the event since the last normalization.
      
      This patch adds the helper bpf_perf_event_read_value for kprobe-based
      perf event array maps, to read the perf counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      To compute a scaling factor between two bpf invocations, users
      can use cpu_id as the key (which is typical for the perf array usage model)
      to remember the previous value and do the calculation inside the
      bpf program.
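      
      A hypothetical usage sketch inside a bpf program (struct and helper
      names as introduced by this patch; process_value() is a placeholder):
      
         struct bpf_perf_event_value val = {};
         int err;
      
         err = bpf_perf_event_read_value(&perf_event_map, BPF_F_CURRENT_CPU,
                                         &val, sizeof(val));
         if (!err)
                 /* val.counter scaled by val.enabled / val.running yields
                  * the normalized value described above
                  */
                 process_value(val.counter, val.enabled, val.running);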
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 05 October 2017, 1 commit
  15. 29 September 2017, 2 commits
  16. 27 September 2017, 1 commit
    • bpf: add meta pointer for direct access · de8f3a83
      Committed by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
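      
      For illustration, a hypothetical program reserving 4 bytes of meta
      data might look like this (a sketch; the bounds check against
      xdp->data is what the verifier requires):
      
         SEC("xdp")
         int xdp_set_meta(struct xdp_md *ctx)
         {
                 void *data, *data_meta;
                 __u32 *meta;
      
                 if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                         return XDP_PASS;
      
                 data_meta = (void *)(long)ctx->data_meta;
                 data      = (void *)(long)ctx->data;
                 meta      = data_meta;
                 if ((void *)(meta + 1) > data)  /* required bounds check */
                         return XDP_PASS;
      
                 *meta = 0x42;  /* scratch value a clsact prog can read later */
                 return XDP_PASS;
         }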
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 20 September 2017, 1 commit
    • bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Committed by Daniel Borkmann
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself, to be loaded into r4 prior to the bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner out of
      the helper.
      
      The issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require the
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching in inlined helper functions or constant blinding. Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog during runtime, which could
      potentially incur cache misses due to false sharing, we are
      going to just use aux for the comparison on the map owner. This
      will also keep the patchlet the same size, and the later check
      in xdp_map_invalid() only accesses the read-only aux pointer from
      the prog; it's also in the same cacheline already from the prior
      access when calling bpf_func.
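      
      The resulting comparison is roughly (a sketch based on the
      description above):
      
         static bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
                                     unsigned long aux)
         {
                 return (unsigned long)xdp_prog->aux != aux;
         }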
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 16 September 2017, 1 commit
  19. 09 September 2017, 1 commit
    • bpf: don't select potentially stale ri->map from buggy xdp progs · 109980b8
      Committed by Daniel Borkmann
      We can potentially run into a couple of issues with the XDP
      bpf_redirect_map() helper. The ri->map in the per CPU storage
      can become stale in several ways, mostly due to misuse, where
      we can then trigger a use after free on the map:
      
      i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT,
      and running on a driver not supporting XDP_REDIRECT yet. The
      ri->map on that CPU becomes stale when the XDP program is unloaded
      from the driver, and a prog B is loaded on a different driver which
      supports the XDP_REDIRECT return code. prog B would only have to
      omit calling bpf_redirect_map() and just return XDP_REDIRECT, which
      would then access the freed map in xdp_do_redirect() since it was
      not cleared for that CPU.
      
      ii) prog A is calling bpf_redirect_map(), returning a code other
      than XDP_REDIRECT. prog A is then detached, which triggers release
      of the map. prog B is attached which, similarly to i), would
      just return XDP_REDIRECT without having called bpf_redirect_map()
      and thus be accessing the freed map in xdp_do_redirect() since it
      was not cleared for that CPU.
      
      iii) prog A is attached to generic XDP, calling the bpf_redirect_map()
      helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is
      currently not handling ri->map (will be fixed by Jesper), so it's
      not being reset. Later loading e.g. a native prog B which would,
      say, call bpf_xdp_redirect() and then return XDP_REDIRECT would
      find in xdp_do_redirect() that a map was set and use it, causing a
      use after free on map access.
      
      The fix thus needs to avoid accessing stale ri->map pointers. A naive
      way would be to call a BPF function from drivers that just resets
      it to NULL for all XDP return codes but XDP_REDIRECT, and including
      XDP_REDIRECT for drivers not supporting it yet (and let ri->map
      be handled in xdp_do_generic_redirect()). There is a less
      intrusive way w/o letting drivers call a reset for each BPF run.
      
      The verifier knows we're calling into the bpf_xdp_redirect_map()
      helper, so it can do a small insn rewrite, transparent to the prog
      itself, in the sense that it fills R4 with a pointer to the prog's
      own bpf_prog. We have that pointer at verification time anyway, and
      R4 is allowed to be used since, per calling convention, we scratch
      R0 to R5 anyway, so they become inaccessible and the program cannot
      read them prior to a write. Then, the helper stores the prog
      pointer in the current CPU's struct redirect_info. Later in
      xdp_do_*_redirect() we check whether the redirect_info's prog
      pointer is the same as the passed xdp_prog pointer, and if that's
      the case then all is good, since the prog holds a ref on the map
      anyway, so it is always valid at that point in time and must
      have a reference count of at least 1. If in the unlikely case
      they are not equal, it means we got a stale pointer, so we clear
      it and bail out right there. Also reset the map and the owning prog
      in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
      bpf_xdp_redirect() won't get mixed up; only the last call should
      take precedence. A tc bpf_redirect() doesn't use the map anywhere
      yet, so there is no need to clear it there since it is never
      accessed in that layer.
      
      Note that in case the prog is released, and thus the map as
      well, we're still under an RCU read critical section at that time
      and have preemption disabled as well. Once we commit with the
      __dev_map_insert_ctx() from xdp_do_redirect_map() and set the
      map to ri->map_to_flush, we still wait for a xdp_do_flush_map()
      to finish in devmap dismantle time once the flush_needed bit is set,
      so that is fine.
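      
      A sketch of the described mechanism (assumed shape of the rewrite
      and the runtime check, not verbatim from the patch):
      
         /* in the helper: remember both the map and which prog set it;
          * the verifier rewrite loaded the prog's own address into
          * R4/map_owner
          */
         ri->map = map;
         ri->map_owner = map_owner;
      
         /* in xdp_do_redirect(): bail out on a stale pointer */
         if (ri->map_owner != (unsigned long)xdp_prog) {
                 ri->map = NULL;
                 return -EINVAL;
         }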
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 24 August 2017, 2 commits