1. 29 Mar 2020, 1 commit
  2. 05 Mar 2020, 2 commits
  3. 04 Mar 2020, 1 commit
  4. 25 Feb 2020, 1 commit
  5. 19 Dec 2019, 1 commit
  6. 14 Dec 2019, 2 commits
  7. 11 Dec 2019, 1 commit
  8. 10 Dec 2019, 1 commit
  9. 19 Nov 2019, 1 commit
  10. 16 Nov 2019, 1 commit
  11. 16 Oct 2019, 1 commit
  12. 26 Jul 2019, 2 commits
  13. 31 May 2019, 1 commit
  14. 28 Apr 2019, 1 commit
    • bpf: Introduce bpf sk local storage · 6ac99e8f
      Authored by Martin KaFai Lau
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock by "bpf_sk_fullsock()"
      - get the bpf_tcp_sock by "bpf_tcp_sock()"
      - get the listener sock by "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running contexts,

      this patch is another effort to make bpf network programming
      more intuitive (together with memory and performance benefits).
      
      When a bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuple (src/dst ip/port) as the key.
      If multiple bpf progs need to store different sk data, multiple maps
      have to be defined, wasting memory on storing the duplicated
      keys (i.e. the 4-tuple here) in each of the bpf maps.
      [ The smallest key could be the sk pointer itself which requires
        some enhancement in the verifier and it is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when the sk is freed.
      Otherwise, the bpf map will quickly become full and unusable.
      The sk-free tracking can currently be done during sk state
      transitions (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined, which usually ends up
      over-provisioned in production.  Even if the map were resizable,
      since sockets naturally come and go, this potential resize
      operation is arguably redundant if the data can be attached directly
      to the sk itself instead of being proxied through a bpf map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at the sk for bpf progs to use.  The space is allocated when the first bpf
      prog creates data for this particular sk.
      
      The design optimizes the bpf prog's lookup (optionally followed by
      an inline update).  bpf_spin_lock should be used if the inline update
      needs to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wide
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
      "sk-local-storage" data from a particular sk.
      Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar userspace syscall API to other maps (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "types" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[], which is a fixed-size array.  Each
      "sk-local-storage" type has a stable offset into the cache[] array.
      In the future, a map flag could be introduced to do cache
      opt-out/enforcement if it becomes necessary.
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share a map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-keyed map usage
      to minimize the map lookup penalty.  16 leaves enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helpers bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  bpf_sk_storage_get() also allows
      "creating" a new elem if one does not exist in the sk, via the new
      BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value when BPF_SK_STORAGE_GET_F_CREATE is used.
      The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      these eliminate the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
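
      To make the above concrete, below is a minimal sketch (not part of this
      commit) of how a bpf prog could define and use such a map, assuming
      libbpf-style BTF map definitions; the names "sk_stg_map", "struct sk_stg"
      and "count_egress" are made up for illustration.

        /* Illustrative only: map, struct and section names are hypothetical. */
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct sk_stg {
            __u32 egress_cnt;               /* per-socket packet counter */
        };

        struct {
            __uint(type, BPF_MAP_TYPE_SK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct sk_stg);
        } sk_stg_map SEC(".maps");

        SEC("cgroup_skb/egress")
        int count_egress(struct __sk_buff *skb)
        {
            struct sk_stg init = {};
            struct bpf_sock *sk;
            struct sk_stg *stg;

            sk = skb->sk;
            if (!sk)
                return 1;
            sk = bpf_sk_fullsock(sk);
            if (!sk)
                return 1;

            /* Lookup, or create the elem with an initial value. */
            stg = bpf_sk_storage_get(&sk_stg_map, sk, &init,
                                     BPF_SK_STORAGE_GET_F_CREATE);
            if (stg)
                __sync_fetch_and_add(&stg->egress_cnt, 1);
            return 1;
        }

        char _license[] SEC("license") = "GPL";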
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective, the map has the socket fd as the key, while the map
         can be shared via a pinned file or map-id (see the userspace sketch
         after these notes).
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel-defined btf with the 4-tuple as the returned key
         could also be explored later.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations are used
         instead, e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for details on the
         synchronization cases and considerations.
      
      3. The mem is charged to sk->sk_omem_alloc, as the sk filter does.
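
      From userspace, reading back one socket's storage is then an ordinary
      map lookup keyed by the socket fd.  A small sketch (assuming libbpf,
      and the hypothetical "struct sk_stg" from the BPF-side sketch above):

        #include <bpf/bpf.h>

        struct sk_stg { __u32 egress_cnt; }; /* must match the map's value type */

        int read_sk_storage(int map_fd, int sock_fd)
        {
            struct sk_stg val;

            /* key is the socket fd; value is this socket's storage */
            if (bpf_map_lookup_elem(map_fd, &sock_fd, &val))
                return -1;
            return (int)val.egress_cnt;
        }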
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog uses the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key.  (The verifier is modified to support the sk ptr as
      the key, which should have shortened the key lookup time.)
      
      The other bpf prog uses the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both store a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb, and then bump the cnt.  netperf is used to drive
      data over 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
      
      Here is a high-level picture of how the objects are organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage────▶ bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─▶│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│◀─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─▶│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────▶│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─▶│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│◀─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └─────────▶│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage──────▶bpf_sk_storage
        └──────┘
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  15. 27 Apr 2019, 1 commit
  16. 24 Apr 2019, 3 commits
  17. 12 Apr 2019, 1 commit
  18. 11 Apr 2019, 1 commit
    • bpf: support input __sk_buff context in BPF_PROG_TEST_RUN · b0b9395d
      Authored by Stanislav Fomichev
      Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
      * ctx_in/ctx_size_in - input context
      * ctx_out/ctx_size_out - output context
      
      The intended use case is to pass some metadata to the test runs that
      operate on an skb (this has been brought up at a recent LPC).
      
      For programs that use bpf_prog_test_run_skb, support __sk_buff input and
      output. Initially, copy _only_ cb and priority from the input __sk_buff
      into the skb; all other non-zero fields are rejected (with EINVAL).
      If the user has set ctx_out/ctx_size_out, copy the potentially modified
      __sk_buff back to userspace.
      
      We require all fields of the input __sk_buff, except the ones we explicitly
      support, to be set to zero. The expectation is that in the future we might
      add support for more fields and we want to fail explicitly if the user
      runs the program on the kernel where we don't yet support them.
      
      The API is intentionally generic (i.e. we don't explicitly add __sk_buff
      to bpf_attr, but ctx_in) to potentially let other test_run types use
      this interface in the future (this can be xdp_md for xdp program types,
      for example).
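
      As a rough illustration (not part of this commit), a userspace caller
      could exercise the new ctx_in/ctx_out fields as sketched below, assuming
      a libbpf recent enough to provide bpf_prog_test_run_opts(); prog_fd and
      the dummy packet are placeholders:

        #include <linux/bpf.h>
        #include <bpf/bpf.h>

        int run_with_ctx(int prog_fd)
        {
            struct __sk_buff ctx_in = {}, ctx_out = {};
            unsigned char pkt[64] = {};         /* dummy packet data */

            ctx_in.cb[0] = 1;                   /* metadata visible to the prog */
            ctx_in.priority = 2;

            LIBBPF_OPTS(bpf_test_run_opts, opts,
                .data_in = pkt,
                .data_size_in = sizeof(pkt),
                .ctx_in = &ctx_in,
                .ctx_size_in = sizeof(ctx_in),
                .ctx_out = &ctx_out,
                .ctx_size_out = sizeof(ctx_out),
                .repeat = 1,
            );

            if (bpf_prog_test_run_opts(prog_fd, &opts))
                return -1;

            /* ctx_out.cb[]/priority now hold what the prog left behind. */
            return opts.retval;
        }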
      
      v4:
        * don't copy more than allowed in bpf_ctx_init [Martin]
      
      v3:
        * handle case where ctx_in is NULL, but ctx_out is not [Martin]
        * convert size==0 checks to ptr==NULL checks and add some extra ptr
          checks [Martin]
      
      v2:
        * Addressed comments from Martin Lau
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  19. 09 Mar 2019, 1 commit
  20. 26 Feb 2019, 1 commit
  21. 19 Feb 2019, 1 commit
  22. 29 Jan 2019, 1 commit
  23. 05 Dec 2018, 1 commit
  24. 02 Dec 2018, 1 commit
  25. 20 Oct 2018, 1 commit
  26. 01 Oct 2018, 1 commit
  27. 03 Aug 2018, 1 commit
  28. 12 Jul 2018, 1 commit
    • bpf: fix panic due to oob in bpf_prog_test_run_skb · 6e6fddc7
      Authored by Daniel Borkmann
      syzkaller triggered several panics similar to the one below:
      
        [...]
        [  248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90
        [  248.857656] Read of size 985 at addr ffff8808017ffff2 by task a.out/1425
        [...]
        [  248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13
        [  248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS 2.1a 03/08/2018
        [  248.865905] Call Trace:
        [  248.865910]  dump_stack+0xd6/0x185
        [  248.865911]  ? show_regs_print_info+0xb/0xb
        [  248.865913]  ? printk+0x9c/0xc3
        [  248.865915]  ? kmsg_dump_rewind_nolock+0xe4/0xe4
        [  248.865919]  print_address_description+0x6f/0x270
        [  248.865920]  kasan_report+0x25b/0x380
        [  248.865922]  ? _copy_to_user+0x5c/0x90
        [  248.865924]  check_memory_region+0x137/0x190
        [  248.865925]  kasan_check_read+0x11/0x20
        [  248.865927]  _copy_to_user+0x5c/0x90
        [  248.865930]  bpf_test_finish.isra.8+0x4f/0xc0
        [  248.865932]  bpf_prog_test_run_skb+0x6a0/0xba0
        [...]
      
      After scrubbing the BPF prog a bit from the noise, it turns out it called
      bpf_skb_change_head() for the lwt_xmit prog with a headroom of 2. Nothing
      wrong with that; however, this was run with repeat >> 0 in bpf_prog_test_run_skb()
      and the same skb thus keeps changing until the pskb_expand_head() called
      from skb_cow() keeps bailing out in atomic alloc context with -ENOMEM.
      So upon return we'll basically have 0 headroom left, yet we blindly do the
      __skb_push() of 14 bytes and keep copying data from there out of bounds
      in bpf_test_finish(). Fix this by checking whether we have enough headroom
      and, if pskb_expand_head() fails, bailing out with an error.
      
      Another bug independent of this fix (but related in triggering the above)
      is that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp buffer
      to its original input state, as otherwise repeating the same test in a
      loop won't work for benchmarking when the underlying input buffer is
      changed by the prog each time and reused for the next run, leading to
      unexpected results.
      
      Fixes: 1cf1cae9 ("bpf: introduce BPF_PROG_TEST_RUN command")
      Reported-by: syzbot+709412e651e55ed96498@syzkaller.appspotmail.com
      Reported-by: syzbot+54f39d6ab58f39720a55@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  29. 19 Apr 2018, 1 commit
  30. 01 Feb 2018, 1 commit
    • bpf: fix null pointer deref in bpf_prog_test_run_xdp · 65073a67
      Authored by Daniel Borkmann
      syzkaller was able to generate the following XDP program ...
      
        (18) r0 = 0x0
        (61) r5 = *(u32 *)(r1 +12)
        (04) (u32) r0 += (u32) 0
        (95) exit
      
      ... and trigger a NULL pointer dereference in ___bpf_prog_run()
      via bpf_prog_test_run_xdp() where this was attempted to run.
      
      The reason is that the recent xdp_rxq_info addition to XDP programs
      updated all drivers, but not bpf_prog_test_run_xdp(), where the
      xdp_buff is set up. Thus, when the context rewriter does the deref
      on the netdev, it's NULL at runtime. Fix it by using the xdp_rxq
      from the loopback dev. The __netif_get_rx_queue() helper can also be
      reused in various other locations later on.
      
      Fixes: 02dd3291 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
      Reported-by: syzbot+1eb094057b338eb1fc00@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  31. 27 Sep 2017, 2 commits
    • bpf: add meta pointer for direct access · de8f3a83
      Authored by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then to data_meta, which directly precedes data, followed
      by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into
      account whether meta data is already prepended and, if so, memmove()s it
      along with the given offset, provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
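
      A rough end-to-end sketch (not from this patch) of the intended usage,
      with a made-up "struct meta", values and section names: the XDP program
      reserves a small meta area in front of the packet and a clsact/tc
      ingress program on the same device reads it back.

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        struct meta {
            __u32 flow_mark;                    /* example value carried XDP -> tc */
        };

        SEC("xdp")
        int xdp_tag(struct xdp_md *ctx)
        {
            void *data;
            struct meta *m;

            /* Grow the meta area by sizeof(*m) in front of the packet data. */
            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m)))
                return XDP_PASS;

            m = (void *)(long)ctx->data_meta;
            data = (void *)(long)ctx->data;
            if ((void *)(m + 1) > data)         /* bounds check for the verifier */
                return XDP_PASS;

            m->flow_mark = 42;
            return XDP_PASS;
        }

        SEC("tc")
        int tc_read_meta(struct __sk_buff *skb)
        {
            struct meta *m = (void *)(long)skb->data_meta;
            void *data = (void *)(long)skb->data;

            if ((void *)(m + 1) > data)
                return TC_ACT_OK;

            skb->mark = m->flow_mark;           /* hand the value to the stack */
            return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";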
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Authored by Daniel Borkmann
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 02 May 2017, 2 commits
  33. 02 Apr 2017, 1 commit
    • bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Authored by Alexei Starovoitov
      Development and testing of networking bpf programs is quite cumbersome.
      Despite the availability of user space bpf interpreters, the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test the functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and a precise queue
      configuration, the xdp program transferred from host into guest,
      attached to virtio/eth0, and traffic generated from the host
      while the results are captured from the guest.
      
      Moreover, analyzing performance bottlenecks in an XDP program is
      impossible in a virtio environment, since the cost of running the program
      is tiny compared to the overhead of virtio packet processing,
      so performance testing can only be done on a physical nic
      with another server generating traffic.
      
      Furthermore, ongoing changes to the user space control plane of production
      applications cannot be run on the test servers, leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite, which has no ability to test the generated code.
      
      To improve this situation, introduce the BPF_PROG_TEST_RUN command
      to test and performance-benchmark bpf programs.
      
      Joint work with Daniel Borkmann.
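
      For reference, a minimal userspace sketch (not part of this commit) of
      driving the new command through the raw bpf(2) syscall; prog_fd and the
      dummy Ethernet frame are assumed to exist and the wrapper name is made up:

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int prog_test_run(int prog_fd, void *data, __u32 size, int repeat,
                                 __u32 *retval, __u32 *duration)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.test.prog_fd = prog_fd;
            attr.test.data_in = (__u64)(unsigned long)data;
            attr.test.data_size_in = size;
            attr.test.repeat = repeat;          /* > 1 for performance runs */

            if (syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr)))
                return -1;

            if (retval)
                *retval = attr.test.retval;     /* program's return code */
            if (duration)
                *duration = attr.test.duration; /* average ns per run */
            return 0;
        }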
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>