1. 20 11月, 2019 3 次提交
    • A
      selftests/bpf: Enforce no-ALU32 for test_progs-no_alu32 · 24f65050
      Andrii Nakryiko 提交于
      With the most recent Clang, alu32 is enabled by default if -mcpu=probe or
      -mcpu=v3 is specified. Use a separate build rule with -mcpu=v2 to enforce no
      ALU32 mode.
      Suggested-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20191120002510.4130605-1-andriin@fb.com
      24f65050
    • A
      libbpf: Fix call relocation offset calculation bug · a0d7da26
      Andrii Nakryiko 提交于
      When relocating subprogram call, libbpf doesn't take into account
      relo->text_off, which comes from symbol's value. This generally works fine for
      subprograms implemented as static functions, but breaks for global functions.
      
      Taking a simplified test_pkt_access.c as an example:
      
      __attribute__ ((noinline))
      static int test_pkt_access_subprog1(volatile struct __sk_buff *skb)
      {
              return skb->len * 2;
      }
      
      __attribute__ ((noinline))
      static int test_pkt_access_subprog2(int val, volatile struct __sk_buff *skb)
      {
              return skb->len + val;
      }
      
      SEC("classifier/test_pkt_access")
      int test_pkt_access(struct __sk_buff *skb)
      {
              if (test_pkt_access_subprog1(skb) != skb->len * 2)
                      return TC_ACT_SHOT;
              if (test_pkt_access_subprog2(2, skb) != skb->len + 2)
                      return TC_ACT_SHOT;
              return TC_ACT_UNSPEC;
      }
      
      When compiled, we get two relocations, pointing to '.text' symbol. .text has
      st_value set to 0 (it points to the beginning of .text section):
      
      0000000000000008  000000050000000a R_BPF_64_32            0000000000000000 .text
      0000000000000040  000000050000000a R_BPF_64_32            0000000000000000 .text
      
      test_pkt_access_subprog1 and test_pkt_access_subprog2 offsets (targets of two
      calls) are encoded within call instruction's imm32 part as -1 and 2,
      respectively:
      
      0000000000000000 test_pkt_access_subprog1:
             0:       61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
             1:       64 00 00 00 01 00 00 00 w0 <<= 1
             2:       95 00 00 00 00 00 00 00 exit
      
      0000000000000018 test_pkt_access_subprog2:
             3:       61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
             4:       04 00 00 00 02 00 00 00 w0 += 2
             5:       95 00 00 00 00 00 00 00 exit
      
      0000000000000000 test_pkt_access:
             0:       bf 16 00 00 00 00 00 00 r6 = r1
      ===>   1:       85 10 00 00 ff ff ff ff call -1
             2:       bc 01 00 00 00 00 00 00 w1 = w0
             3:       b4 00 00 00 02 00 00 00 w0 = 2
             4:       61 62 00 00 00 00 00 00 r2 = *(u32 *)(r6 + 0)
             5:       64 02 00 00 01 00 00 00 w2 <<= 1
             6:       5e 21 08 00 00 00 00 00 if w1 != w2 goto +8 <LBB0_3>
             7:       bf 61 00 00 00 00 00 00 r1 = r6
      ===>   8:       85 10 00 00 02 00 00 00 call 2
             9:       bc 01 00 00 00 00 00 00 w1 = w0
            10:       61 62 00 00 00 00 00 00 r2 = *(u32 *)(r6 + 0)
            11:       04 02 00 00 02 00 00 00 w2 += 2
            12:       b4 00 00 00 ff ff ff ff w0 = -1
            13:       1e 21 01 00 00 00 00 00 if w1 == w2 goto +1 <LBB0_3>
            14:       b4 00 00 00 02 00 00 00 w0 = 2
      0000000000000078 LBB0_3:
            15:       95 00 00 00 00 00 00 00 exit
      
      Now, if we compile example with global functions, the setup changes.
      Relocations are now against specifically test_pkt_access_subprog1 and
      test_pkt_access_subprog2 symbols, with test_pkt_access_subprog2 pointing 24
      bytes into its respective section (.text), i.e., 3 instructions in:
      
      0000000000000008  000000070000000a R_BPF_64_32            0000000000000000 test_pkt_access_subprog1
      0000000000000048  000000080000000a R_BPF_64_32            0000000000000018 test_pkt_access_subprog2
      
      Calls instructions now encode offsets relative to function symbols and are both
      set ot -1:
      
      0000000000000000 test_pkt_access_subprog1:
             0:       61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
             1:       64 00 00 00 01 00 00 00 w0 <<= 1
             2:       95 00 00 00 00 00 00 00 exit
      
      0000000000000018 test_pkt_access_subprog2:
             3:       61 20 00 00 00 00 00 00 r0 = *(u32 *)(r2 + 0)
             4:       0c 10 00 00 00 00 00 00 w0 += w1
             5:       95 00 00 00 00 00 00 00 exit
      
      0000000000000000 test_pkt_access:
             0:       bf 16 00 00 00 00 00 00 r6 = r1
      ===>   1:       85 10 00 00 ff ff ff ff call -1
             2:       bc 01 00 00 00 00 00 00 w1 = w0
             3:       b4 00 00 00 02 00 00 00 w0 = 2
             4:       61 62 00 00 00 00 00 00 r2 = *(u32 *)(r6 + 0)
             5:       64 02 00 00 01 00 00 00 w2 <<= 1
             6:       5e 21 09 00 00 00 00 00 if w1 != w2 goto +9 <LBB2_3>
             7:       b4 01 00 00 02 00 00 00 w1 = 2
             8:       bf 62 00 00 00 00 00 00 r2 = r6
      ===>   9:       85 10 00 00 ff ff ff ff call -1
            10:       bc 01 00 00 00 00 00 00 w1 = w0
            11:       61 62 00 00 00 00 00 00 r2 = *(u32 *)(r6 + 0)
            12:       04 02 00 00 02 00 00 00 w2 += 2
            13:       b4 00 00 00 ff ff ff ff w0 = -1
            14:       1e 21 01 00 00 00 00 00 if w1 == w2 goto +1 <LBB2_3>
            15:       b4 00 00 00 02 00 00 00 w0 = 2
      0000000000000080 LBB2_3:
            16:       95 00 00 00 00 00 00 00 exit
      
      Thus the right formula to calculate target call offset after relocation should
      take into account relocation's target symbol value (offset within section),
      call instruction's imm32 offset, and (subtracting, to get relative instruction
      offset) instruction index of call instruction itself. All that is shifted by
      number of instructions in main program, given all sub-programs are copied over
      after main program.
      
      Convert few selftests relying on bpf-to-bpf calls to use global functions
      instead of static ones.
      
      Fixes: 48cca7e4 ("libbpf: add support for bpf_call")
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191119224447.3781271-1-andriin@fb.com
      a0d7da26
    • L
      net-af_xdp: Use correct number of channels from ethtool · 3de88c91
      Luigi Rizzo 提交于
      Drivers use different fields to report the number of channels, so take
      the maximum of all data channels (rx, tx, combined) when determining the
      size of the xsk map. The current code used only 'combined' which was set
      to 0 in some drivers e.g. mlx4.
      
      Tested: compiled and run xdpsock -q 3 -r -S on mlx4
      Signed-off-by: NLuigi Rizzo <lrizzo@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20191119001951.92930-1-lrizzo@google.com
      3de88c91
  2. 19 11月, 2019 11 次提交
  3. 18 11月, 2019 6 次提交
    • D
      Merge branch 'bpf-array-mmap' · b97e12e5
      Daniel Borkmann 提交于
      Andrii Nakryiko says:
      
      ====================
      This patch set adds ability to memory-map BPF array maps (single- and
      multi-element). The primary use case is memory-mapping BPF array maps, created
      to back global data variables, created by libbpf implicitly. This allows for
      much better usability, along with avoiding syscalls to read or update data
      completely.
      
      Due to memory-mapping requirements, BPF array map that is supposed to be
      memory-mapped, has to be created with special BPF_F_MMAPABLE attribute, which
      triggers slightly different memory allocation strategy internally. See
      patch 1 for details.
      
      Libbpf is extended to detect kernel support for this flag, and if supported,
      will specify it for all global data maps automatically.
      
      Patch #1 refactors bpf_map_inc() and converts bpf_map's refcnt to atomic64_t
      to make refcounting never fail. Patch #2 does similar refactoring for
      bpf_prog_add()/bpf_prog_inc().
      
      v5->v6:
      - add back uref counting (Daniel);
      
      v4->v5:
      - change bpf_prog's refcnt to atomic64_t (Daniel);
      
      v3->v4:
      - add mmap's open() callback to fix refcounting (Johannes);
      - switch to remap_vmalloc_pages() instead of custom fault handler (Johannes);
      - converted bpf_map's refcnt/usercnt into atomic64_t;
      - provide default bpf_map_default_vmops handling open/close properly;
      
      v2->v3:
      - change allocation strategy to avoid extra pointer dereference (Jakub);
      
      v1->v2:
      - fix map lookup code generation for BPF_F_MMAPABLE case;
      - prevent BPF_F_MMAPABLE flag for all but plain array map type;
      - centralize ref-counting in generic bpf_map_mmap();
      - don't use uref counting (Alexei);
      - use vfree() directly;
      - print flags with %x (Song);
      - extend tests to verify bpf_map_{lookup,update}_elem() logic as well.
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b97e12e5
    • A
      selftests/bpf: Add BPF_TYPE_MAP_ARRAY mmap() tests · 5051b384
      Andrii Nakryiko 提交于
      Add selftests validating mmap()-ing BPF array maps: both single-element and
      multi-element ones. Check that plain bpf_map_update_elem() and
      bpf_map_lookup_elem() work correctly with memory-mapped array. Also convert
      CO-RE relocation tests to use memory-mapped views of global data.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-6-andriin@fb.com
      5051b384
    • A
      libbpf: Make global data internal arrays mmap()-able, if possible · 7fe74b43
      Andrii Nakryiko 提交于
      Add detection of BPF_F_MMAPABLE flag support for arrays and add it as an extra
      flag to internal global data maps, if supported by kernel. This allows users
      to memory-map global data and use it without BPF map operations, greatly
      simplifying user experience.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-5-andriin@fb.com
      7fe74b43
    • A
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko 提交于
      Add ability to memory-map contents of BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It allows to avoid
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      There had to be special considerations for map freezing, to avoid having
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing is happening under mutex now:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accomodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration regarding how memory-mapping subsystem functions.
      Memory-mapping subsystem provides few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
    • A
      bpf: Convert bpf_prog refcnt to atomic64_t · 85192dbf
      Andrii Nakryiko 提交于
      Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
      and remove artificial 32k limit. This allows to make bpf_prog's refcounting
      non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.
      
      Validated compilation by running allyesconfig kernel build.
      Suggested-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
      85192dbf
    • A
      bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko 提交于
      92117d84 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
      potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
      (32k). Due to using 32-bit counter, it's possible in practice to overflow
      refcounter and make it wrap around to 0, causing erroneous map free, while
      there are still references to it, causing use-after-free problems.
      
      But having a failing refcounting operations are problematic in some cases. One
      example is mmap() interface. After establishing initial memory-mapping, user
      is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
      splitting it into multiple non-contiguous regions. All this happening without
      any control from the users of mmap subsystem. Rather mmap subsystem sends
      notifications to original creator of memory mapping through open/close
      callbacks, which are optionally specified during initial memory mapping
      creation. These callbacks are used to maintain accurate refcount for bpf_map
      (see next patch in this series). The problem is that open() callback is not
      supposed to fail, because memory-mapped resource is set up and properly
      referenced. This is posing a problem for using memory-mapping with BPF maps.
      
      One solution to this is to maintain separate refcount for just memory-mappings
      and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
      There are similar use cases in current work on tcp-bpf, necessitating extra
      counter as well. This seems like a rather unfortunate and ugly solution that
      doesn't scale well to various new use cases.
      
      Another approach to solve this is to use non-failing refcount_t type, which
      uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
      stays there. This utlimately causes memory leak, but prevents use after free.
      
      But given refcounting is not the most performance-critical operation with BPF
      maps (it's not used from running BPF program code), we can also just switch to
      64-bit counter that can't overflow in practice, potentially disadvantaging
      32-bit platforms a tiny bit. This simplifies semantics and allows above
      described scenarios to not worry about failing refcount increment operation.
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans up its
      interface to match bpf_map_put with separate operations for bpf_map_inc and
      bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
      respectively). Also, given there are no users of bpf_map_inc_not_zero
      specifying uref=true, remove uref flag and default to uref=false internally.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
      1e0bd5a0
  4. 16 11月, 2019 20 次提交