1. 29 Apr, 2020 12 commits
  2. 28 Apr, 2020 1 commit
  3. 27 Apr, 2020 12 commits
    • M
      libbpf: Return err if bpf_object__load failed · e411eb25
      Mao Wenan committed
      bpf_object__load() has various return codes. When it fails to load an
      object, it must return err instead of -EINVAL.
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200426063635.130680-3-maowenan@huawei.com
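A minimal sketch of what the libbpf change above means for callers, written in plain C with no libbpf dependency. The step functions `probe_caps` and `load_maps` and the two `load_object_*` wrappers are hypothetical stand-ins, not libbpf code; they only illustrate why collapsing every failure into -EINVAL hides the cause.

```c
#include <errno.h>

/* Hypothetical load steps; each returns 0 or a negative errno. */
static int probe_caps(int fail) { return fail == 1 ? -EPERM : 0; }
static int load_maps(int fail)  { return fail == 2 ? -ENOMEM : 0; }

/* Old behaviour (sketch): every failure is reported as -EINVAL. */
int load_object_old(int fail)
{
    int err = probe_caps(fail);

    if (!err)
        err = load_maps(fail);
    return err ? -EINVAL : 0;
}

/* New behaviour (sketch): the first failing step's code is returned. */
int load_object_new(int fail)
{
    int err = probe_caps(fail);

    if (!err)
        err = load_maps(fail);
    return err;
}
```

With the second form a caller can tell a permission problem from an out-of-memory condition without parsing log output.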
    • C
      sysctl: pass kernel pointers to ->proc_handler · 32927393
      Christoph Hellwig committed
      Instead of having all the sysctl handlers deal with user pointers, which
      is rather hairy in terms of the BPF interaction, copy the input to and
      from userspace in common code.  This also means that the strings are
      always NUL-terminated by the common code, making the API a little bit
      safer.
      
      As most handlers just pass through the data to one of the common handlers,
      a lot of the changes are mechanical.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
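The idea in the sysctl commit above can be pictured with a small user-space simulation. This is only a sketch under assumptions: `handler_fn`, `int_handler`, and `common_write` are hypothetical names, and `memcpy` stands in for the kernel's copy from user memory. The point it shows is that the copy and the NUL termination happen once, in common code, so each handler sees an ordinary terminated buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical handler type: receives an ordinary, NUL-terminated buffer. */
typedef int (*handler_fn)(char *buf, size_t *lenp);

/* Example handler: parses a decimal integer into a global. */
long parsed_value;

int int_handler(char *buf, size_t *lenp)
{
    (void)lenp;
    parsed_value = strtol(buf, NULL, 10);
    return 0;
}

/* Common code: copy the "user" bytes once, terminate, then dispatch. */
int common_write(handler_fn h, const char *ubuf, size_t count)
{
    char *kbuf = malloc(count + 1);
    int ret;

    if (!kbuf)
        return -1;
    memcpy(kbuf, ubuf, count);  /* stands in for copying from userspace */
    kbuf[count] = '\0';         /* handlers may rely on termination */
    ret = h(kbuf, &count);
    free(kbuf);
    return ret;
}
```

Because the handler never touches the caller's pointer, passing only the first two bytes of an unterminated input is still safe to parse.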
    • C
      sysctl: avoid forward declarations · f461d2dc
      Christoph Hellwig committed
      Move the sysctl tables to the end of the file to avoid lots of pointless
      forward declarations.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • C
      sysctl: remove all extern declaration from sysctl.c · 2374c09b
      Christoph Hellwig committed
      Extern declarations in .c files are bad style and can lead to
      mismatches.  Use existing definitions in headers where they exist,
      and otherwise move the external declarations to suitable header
      files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • C
      mm: remove watermark_boost_factor_sysctl_handler · 26363af5
      Christoph Hellwig committed
      watermark_boost_factor_sysctl_handler is just a pointless wrapper for
      proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
      directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • A
      Merge branch 'cloudflare-prog' · f131bd3e
      Alexei Starovoitov committed
      Lorenz Bauer says:
      
      ====================
      We've been developing an in-house L4 load balancer based on XDP
      and TC for a while. Following Alexei's call for more up-to-date examples of
      production BPF in the kernel tree [1], Cloudflare is making this available
      under dual GPL-2.0 or BSD 3-clause terms.
      
      The code requires at least v5.3 to function correctly.
      
      1: https://lore.kernel.org/bpf/20200326210719.den5isqxntnoqhmv@ast-mbp/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • L
      selftests/bpf: Add cls_redirect classifier · 23458901
      Lorenz Bauer committed
      cls_redirect is a TC clsact based replacement for the glb-redirect iptables
      module available at [1]. It enables what GitHub calls "second chance"
      flows [2], similarly proposed by the Beamer paper [3]. In contrast to
      glb-redirect, it also supports migrating UDP flows as long as connected
      sockets are used. cls_redirect is in production at Cloudflare, as part of
      our own L4 load balancer.
      
      We have modified the encapsulation format slightly from glb-redirect:
      glbgue_chained_routing.private_data_type has been repurposed to form a
      version field and several flags. Both have been arranged in a way that
      a private_data_type value of zero matches the current glb-redirect
      behaviour. This means that cls_redirect will understand packets in
      glb-redirect format, but not vice versa.
      
      The test suite only covers basic features. For example, cls_redirect will
      correctly forward path MTU discovery packets, but this is not exercised.
      It is also possible to switch the encapsulation format to GRE on the last
      hop, which is also not tested.
      
      There are two major distinctions from glb-redirect: first, cls_redirect
      relies on receiving encapsulated packets directly from a router. This is
      because we don't have access to the neighbour tables from BPF, yet. See
      forward_to_next_hop for details. Second, cls_redirect performs decapsulation
      instead of using separate ipip and sit tunnel devices. This
      avoids issues with the sit tunnel [4] and makes deploying the classifier
      easier: decapsulated packets appear on the same interface, so existing
      firewall rules continue to work as expected.
      
      The code base started its life on v4.19, so there are most likely still
      holdovers from old workarounds. In no particular order:
      
      - The function buf_off is required to defeat a clang optimization
        that leads to the verifier rejecting the program due to pointer
        arithmetic in the wrong order.
      
      - The function pkt_parse_ipv6 is force inlined, because it would
        otherwise be rejected due to returning a pointer to stack memory.
      
      - The functions fill_tuple and classify_tcp contain kludges, because
        we've run out of function arguments.
      
      - The logic in general is rather nested, due to verifier restrictions.
        I think this is either because the verifier loses track of constants
        on the stack, or because it can't track enum like variables.
      
      1: https://github.com/github/glb-director/tree/master/src/glb-redirect
      2: https://github.com/github/glb-director/blob/master/docs/development/second-chance-design.md
      3: https://www.usenix.org/conference/nsdi18/presentation/olteanu
      4: https://github.com/github/glb-director/issues/64
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200424185556.7358-2-lmb@cloudflare.com
    • A
      bpf: Make verifier log more relevant by default · 6f8a57cc
      Andrii Nakryiko committed
      To make the BPF verifier verbose log more relevant and easier to use to debug
      verification failures, "pop" parts of the log that were successfully verified.
      This has the effect of leaving only verifier logs that correspond to code branches
      that lead to verification failure, which in practice should result in much
      shorter and more relevant verifier log dumps. This behavior is made the
      default behavior and can be overridden to do exhaustive logging by specifying
      the BPF_LOG_LEVEL2 log level.
      
      Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some
      cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump
      verbosity, but still have only relevant verifier branches logged. But for this
      patch, I didn't want to add any new flags. It might be worthwhile to just
      rethink how BPF verifier logging is performed and requested and streamline it
      a bit. But this trimming of successfully verified branches seems to be useful
      and a good default behavior.
      
      To test this, I modified runqslower slightly to introduce a read of an
      uninitialized stack variable. Log (**truncated in the middle** to save many
      lines out of this commit message) BEFORE this change:
      
      ; int handle__sched_switch(u64 *ctx)
      0: (bf) r6 = r1
      ; struct task_struct *prev = (struct task_struct *)ctx[1];
      1: (79) r1 = *(u64 *)(r6 +8)
      func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
      2: (b7) r2 = 0
      ; struct event event = {};
      3: (7b) *(u64 *)(r10 -24) = r2
      last_idx 3 first_idx 0
      regs=4 stack=0 before 2: (b7) r2 = 0
      4: (7b) *(u64 *)(r10 -32) = r2
      5: (7b) *(u64 *)(r10 -40) = r2
      6: (7b) *(u64 *)(r10 -48) = r2
      ; if (prev->state == TASK_RUNNING)
      
      [ ... instruction dump from insn #7 through #50 are cut out ... ]
      
      51: (b7) r2 = 16
      52: (85) call bpf_get_current_comm#16
      last_idx 52 first_idx 42
      regs=4 stack=0 before 51: (b7) r2 = 16
      ; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
      53: (bf) r1 = r6
      54: (18) r2 = 0xffff8881f3868800
      56: (18) r3 = 0xffffffff
      58: (bf) r4 = r7
      59: (b7) r5 = 32
      60: (85) call bpf_perf_event_output#25
      last_idx 60 first_idx 53
      regs=20 stack=0 before 59: (b7) r5 = 32
      61: (bf) r2 = r10
      ; event.pid = pid;
      62: (07) r2 += -16
      ; bpf_map_delete_elem(&start, &pid);
      63: (18) r1 = 0xffff8881f3868000
      65: (85) call bpf_map_delete_elem#3
      ; }
      66: (b7) r0 = 0
      67: (95) exit
      
      from 44 to 66: safe
      
      from 34 to 66: safe
      
      from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
      ; bpf_map_update_elem(&start, &pid, &ts, 0);
      28: (bf) r2 = r10
      ;
      29: (07) r2 += -16
      ; tsp = bpf_map_lookup_elem(&start, &pid);
      30: (18) r1 = 0xffff8881f3868000
      32: (85) call bpf_map_lookup_elem#1
      invalid indirect read from stack off -16+0 size 4
      processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4
      
      Notice how there is a successful code path from instruction 0 through 67, a few
      successfully verified jumps (44->66, 34->66), and only after that the 11->28 jump
      plus the error on instruction #32.
      
      AFTER this change (full verifier log, **no truncation**):
      
      ; int handle__sched_switch(u64 *ctx)
      0: (bf) r6 = r1
      ; struct task_struct *prev = (struct task_struct *)ctx[1];
      1: (79) r1 = *(u64 *)(r6 +8)
      func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
      2: (b7) r2 = 0
      ; struct event event = {};
      3: (7b) *(u64 *)(r10 -24) = r2
      last_idx 3 first_idx 0
      regs=4 stack=0 before 2: (b7) r2 = 0
      4: (7b) *(u64 *)(r10 -32) = r2
      5: (7b) *(u64 *)(r10 -40) = r2
      6: (7b) *(u64 *)(r10 -48) = r2
      ; if (prev->state == TASK_RUNNING)
      7: (79) r2 = *(u64 *)(r1 +16)
      ; if (prev->state == TASK_RUNNING)
      8: (55) if r2 != 0x0 goto pc+19
       R1_w=ptr_task_struct(id=0,off=0,imm=0) R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
      ; trace_enqueue(prev->tgid, prev->pid);
      9: (61) r1 = *(u32 *)(r1 +1184)
      10: (63) *(u32 *)(r10 -4) = r1
      ; if (!pid || (targ_pid && targ_pid != pid))
      11: (15) if r1 == 0x0 goto pc+16
      
      from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
      ; bpf_map_update_elem(&start, &pid, &ts, 0);
      28: (bf) r2 = r10
      ;
      29: (07) r2 += -16
      ; tsp = bpf_map_lookup_elem(&start, &pid);
      30: (18) r1 = 0xffff8881db3ce800
      32: (85) call bpf_map_lookup_elem#1
      invalid indirect read from stack off -16+0 size 4
      processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4
      
      Notice how in this case, there are 0-11 instructions + jump from 11 to
      28 is recorded + 28-32 instructions with error on insn #32.
      
      test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for
      VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose
      log at BPF_LOG_LEVEL1.
      
      On success, the verbose log will only have a summary of the number of processed
      instructions, etc., but no branch tracing log. Having just the last successful
      branch trace seemed weird and confusing. Having a small and clean summary log
      in the success case seems quite logical and nice, though.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200423195850.1259827-1-andriin@fb.com
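The "pop" behaviour described in the verifier commit above can be sketched as a log buffer that is rewound to a saved position once a branch turns out to be safe. This is a simplified user-space illustration, not the kernel implementation: `struct vlog`, `vlog_append`, `vlog_reset`, and `verify_branch` are hypothetical names, and the real verifier tracks log positions together with its pending branch states.

```c
#include <string.h>

struct vlog {
    char buf[256];
    size_t len;
};

/* Append a line to the log; text that does not fit is dropped. */
void vlog_append(struct vlog *log, const char *line)
{
    size_t n = strlen(line);

    if (log->len + n + 1 > sizeof(log->buf))
        return;
    memcpy(log->buf + log->len, line, n + 1);
    log->len += n;
}

/* Rewind the log to a previously saved position. */
void vlog_reset(struct vlog *log, size_t pos)
{
    if (pos < log->len) {
        log->len = pos;
        log->buf[pos] = '\0';
    }
}

/* Log one branch's trace, then pop it again if the branch was safe. */
int verify_branch(struct vlog *log, const char *trace, int ok)
{
    size_t start = log->len;

    vlog_append(log, trace);
    if (ok)
        vlog_reset(log, start);
    return ok ? 0 : -1;
}
```

After a run in which only the last branch fails, the buffer holds just that branch's trace, which matches the shape of the AFTER log in the commit message.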
    • M
      bpf: add bpf_ktime_get_boot_ns() · 71d19214
      Maciej Żenczykowski committed
      On a device like a cellphone, which is constantly suspending
      and resuming, CLOCK_MONOTONIC is not particularly useful for
      keeping track of or reacting to external network events.
      Instead you want to use CLOCK_BOOTTIME.
      
      Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
      based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
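The difference between the two clocks named in the bpf_ktime_get_boot_ns() commit above can be observed from user space with clock_gettime(). The following is a sketch for Linux only; `clock_ns` and `suspended_ns` are hypothetical helper names, and the result is approximate because the two clocks are sampled one after the other.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Read a clock and return nanoseconds, or 0 on failure. */
uint64_t clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Approximate time spent suspended since boot, in nanoseconds. */
uint64_t suspended_ns(void)
{
    uint64_t mono = clock_ns(CLOCK_MONOTONIC);
    uint64_t boot = clock_ns(CLOCK_BOOTTIME);

    return boot > mono ? boot - mono : 0;
}
```

On a machine that never suspends the difference stays close to zero; on a phone it grows with every suspend, which is why timestamps for external network events are better taken from CLOCK_BOOTTIME.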
    • M
      net: bpf: Make bpf_ktime_get_ns() available to non GPL programs · 082b57e3
      Maciej Żenczykowski committed
      The entire implementation is in kernel/bpf/helpers.c:
      
      BPF_CALL_0(bpf_ktime_get_ns) {
             /* NMI safe access to clock monotonic */
             return ktime_get_mono_fast_ns();
      }
      
      const struct bpf_func_proto bpf_ktime_get_ns_proto = {
             .func           = bpf_ktime_get_ns,
             .gpl_only       = false,
             .ret_type       = RET_INTEGER,
      };
      
      and this was presumably marked GPL due to kernel/time/timekeeping.c:
        EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
      
      and while that may make sense for kernel modules (although even that
      is doubtful), there is currently AFAICT no other source of time
      available to ebpf.
      
      Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC)
      which is exposed to userspace (via vdso even to make it performant)...
      
      As such, I see no reason to keep the GPL restriction.
      (In the future I'd like to have access to time from Apache licensed ebpf code)
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • L
      net: bpf: Allow TC programs to call BPF_FUNC_skb_change_head · 6f3f65d8
      Lorenzo Colitti committed
      This allows TC eBPF programs to modify and forward (redirect) packets
      from interfaces without ethernet headers (for example cellular)
      to interfaces with (for example ethernet/wifi).
      
      The lack of this appears to simply be an oversight.
      
      Tested:
        in active use in Android R on 4.14+ devices for ipv6
        cellular to wifi tethering offload.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
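What forwarding from a header-less interface to an Ethernet one involves, as in the skb_change_head commit above, can be shown with a user-space sketch that prepends a 14-byte Ethernet header to a raw L3 packet. `prepend_eth` and `ETH_HDR_LEN` are hypothetical names used for illustration; in a TC program the headroom would instead be created with the bpf_skb_change_head helper that the commit makes available.

```c
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14  /* 6 bytes dst MAC + 6 bytes src MAC + 2 bytes ethertype */

/*
 * Build an Ethernet frame in out[] from a raw L3 packet.
 * Returns the frame length, or 0 if out[] is too small.
 */
size_t prepend_eth(uint8_t *out, size_t out_cap,
                   const uint8_t *l3, size_t l3_len,
                   const uint8_t dst[6], const uint8_t src[6],
                   uint16_t ethertype)
{
    if (out_cap < ETH_HDR_LEN || out_cap - ETH_HDR_LEN < l3_len)
        return 0;
    memcpy(out, dst, 6);
    memcpy(out + 6, src, 6);
    out[12] = (uint8_t)(ethertype >> 8);    /* network byte order */
    out[13] = (uint8_t)(ethertype & 0xff);
    memcpy(out + ETH_HDR_LEN, l3, l3_len);
    return ETH_HDR_LEN + l3_len;
}
```

For the IPv6 tethering case mentioned in the commit, the ethertype would be 0x86DD.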
  4. 26 Apr, 2020 15 commits