1. 12 4月, 2019 6 次提交
    • D
      Merge branch 'bpf-l2-encap' · 94c59aab
      Daniel Borkmann 提交于
      Alan Maguire says:
      
      ====================
      Extend bpf_skb_adjust_room growth to mark inner MAC header so
      that L2 encapsulation can be used for tc tunnels.
      
      Patch #1 extends the existing test_tc_tunnel to support UDP
      encapsulation; later we want to be able to test MPLS over UDP
      and MPLS over GRE encapsulation.
      
      Patch #2 adds the BPF_F_ADJ_ROOM_ENCAP_L2(len) macro, which
      allows specification of inner mac length.  Other approaches were
      explored prior to taking this approach.  Specifically, I tried
      automatically computing the inner mac length on the basis of the
      specified flags (so inner maclen for GRE/IPv4 encap is the len_diff
      specified to bpf_skb_adjust_room minus GRE + IPv4 header length
      for example).  Problem with this is that we don't know for sure
      what form of GRE/UDP header we have; is it a full GRE header,
      or is it a FOU UDP header or generic UDP encap header? My fear
      here was we'd end up with an explosion of flags.  The other approach
      tried was to support inner L2 header marking as a separate room
      adjustment, i.e. adjust for L3/L4 encap, then call
      bpf_skb_adjust_room for L2 encap.  This can be made to work but
      because it imposed an order on operations, felt a bit clunky.
      
      Patch #3 syncs tools/ bpf.h.
      
      Patch #4 extends the tests again to support MPLSoverGRE,
      MPLSoverUDP, and transparent ethernet bridging (TEB) where
      the inner L2 header is an ethernet header.  Testing of BPF
      encap against tunnels is done for cases where configuration
      of such tunnels is possible (MPLSoverGRE[6], MPLSoverUDP,
      gre[6]tap), and skipped otherwise.  Testing of BPF encap/decap
      is always carried out.
      
      Changes since v2:
       - updated tools/testing/selftest/bpf/config with FOU/MPLS CONFIG
         variables (patches 1, 4)
       - reduced noise in patch 1 by avoiding unnecessary movement of code
       - eliminated inner_mac variable in bpf_skb_net_grow (patch 2)
      
      Changes since v1:
       - fixed formatting of commit references.
       - BPF_F_ADJ_ROOM_FIXED_GSO flag enabled on all variants (patch 1)
       - fixed fou6 options for UDP encap; checksum errors observed were
         due to the fact fou6 tunnel was not set up with correct ipproto
         options (41 -6).  0 checksums work fine (patch 1)
       - added definitions for mask and shift used in setting L2 length
         (patch 2)
       - allow udp encap with fixed GSO (patch 2)
       - changed "elen" to "l2_len" to be more descriptive (patch 4)
      ====================
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      94c59aab
    • A
      selftests_bpf: add L2 encap to test_tc_tunnel · 3ec61df8
      Alan Maguire 提交于
      Update test_tc_tunnel to verify adding inner L2 header
      encapsulation (an MPLS label or ethernet header) works.
      Signed-off-by: NAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3ec61df8
    • A
      bpf: sync bpf.h to tools/ for BPF_F_ADJ_ROOM_ENCAP_L2 · 1db04c30
      Alan Maguire 提交于
      Sync include/uapi/linux/bpf.h with tools/ equivalent to add
      BPF_F_ADJ_ROOM_ENCAP_L2(len) macro.
      Signed-off-by: NAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1db04c30
    • A
      bpf: add layer 2 encap support to bpf_skb_adjust_room · 58dfc900
      Alan Maguire 提交于
      commit 868d5235 ("bpf: add bpf_skb_adjust_room encap flags")
      introduced support to bpf_skb_adjust_room for GSO-friendly GRE
      and UDP encapsulation.
      
      For GSO to work for skbs, the inner headers (mac and network) need to
      be marked.  For L3 encapsulation using bpf_skb_adjust_room, the mac
      and network headers are identical.  Here we provide a way of specifying
      the inner mac header length for cases where L2 encap is desired.  Such
      an approach can support encapsulated ethernet headers, MPLS headers etc.
      For example to convert from a packet of form [eth][ip][tcp] to
      [eth][ip][udp][inner mac][ip][tcp], something like the following could
      be done:
      
      	headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen;
      
      	ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC,
      				  BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
      				  BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
      				  BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen));
      Signed-off-by: NAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      58dfc900
    • A
      selftests_bpf: extend test_tc_tunnel for UDP encap · 166b5a7f
      Alan Maguire 提交于
      commit 868d5235 ("bpf: add bpf_skb_adjust_room encap flags")
      introduced support to bpf_skb_adjust_room for GSO-friendly GRE
      and UDP encapsulation and later introduced associated test_tc_tunnel
      tests.  Here those tests are extended to cover UDP encapsulation also.
      Signed-off-by: NAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      166b5a7f
    • S
      bpf: fix missing bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUN · c695865c
      Stanislav Fomichev 提交于
      Commit b0b9395d ("bpf: support input __sk_buff context in
      BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in
      BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined
      for !CONFIG_BPF_SYSCALL:
      
      net/bpf/test_run.c: In function ‘bpf_ctx_init’:
      net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration]
         err = bpf_check_uarg_tail_zero(data_in, max_size, size);
               ^~~~~~~~~~~~~~~~~~~~~~~~
      
      Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Fixes: b0b9395d ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c695865c
  2. 11 4月, 2019 6 次提交
  3. 10 4月, 2019 19 次提交
    • M
      libbpf: fix crash in XDP socket part with new larger BPF_LOG_BUF_SIZE · 50bd645b
      Magnus Karlsson 提交于
      In commit da11b417 ("libbpf: teach libbpf about log_level bit 2"),
      the BPF_LOG_BUF_SIZE was increased to 16M. The XDP socket part of
      libbpf allocated the log_buf on the stack, but for the new 16M buffer
      size this is not going to work. Change the code so it uses a 16K buffer
      instead.
      
      Fixes: da11b417 ("libbpf: teach libbpf about log_level bit 2")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      50bd645b
    • Y
      bpf, bpftool: fix a few ubsan warnings · 69a0f9ec
      Yonghong Song 提交于
      The issue is reported at https://github.com/libbpf/libbpf/issues/28.
      
      Basically, per C standard, for
        void *memcpy(void *dest, const void *src, size_t n)
      if "dest" or "src" is NULL, regardless of whether "n" is 0 or not,
      the result of memcpy is undefined. clang ubsan reported three such
      instances in bpf.c with the following pattern:
        memcpy(dest, 0, 0).
      
      Although in practice, no known compiler will cause issues when
      copy size is 0. Let us still fix the issue to silence ubsan
      warnings.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      69a0f9ec
    • A
      Merge branch 'support-global-data' · 6316f783
      Alexei Starovoitov 提交于
      Daniel Borkmann says:
      
      ====================
      This series is a major rework of previously submitted libbpf
      patches [0] in order to add global data support for BPF. The
      kernel has been extended to add proper infrastructure that allows
      for full .bss/.data/.rodata sections on BPF loader side based
      upon feedback from LPC discussions [1]. Latter support is then
      also added into libbpf in this series which allows for more
      natural C-like programming of BPF programs. For more information
      on loader, please refer to 'bpf, libbpf: support global data/bss/
      rodata sections' patch in this series.
      
      Thanks a lot!
      
        v5 -> v6:
         - Removed synchronize_rcu() from map freeze (Jann)
         - Rest as-is
        v4 -> v5:
         - Removed index selection again for ldimm64 (Alexei)
         - Adapted related test cases and added new ones to test
           rejection of off != 0
        v3 -> v4:
         - Various fixes in BTF verification e.g. to disallow
           Var and DataSec to be an intermediate type during resolve (Martin)
         - More BTF test cases added
         - Few cleanups in key-less BTF commit (Martin)
         - Bump libbpf minor version from 2 to 3
         - Renamed and simplified read-only locking
         - Various minor improvements all over the place
        v2 -> v3:
         - Implement BTF support in kernel, libbpf, bpftool, add tests
         - Fix idx + off conversion (Andrii)
         - Document lower / higher bits for direct value access (Andrii)
         - Add tests with small value size (Andrii)
         - Add index selection into ldimm64 (Andrii)
         - Fix missing fdput() (Jann)
         - Reject invalid flags in BPF_F_*_PROG (Jakub)
         - Complete rework of libbpf support, includes:
          - Add objname to map name (Stanislav)
          - Make .rodata map full read-only after setup (Andrii)
          - Merge relocation handling into single one (Andrii)
          - Store global maps into obj->maps array (Andrii, Alexei)
          - Debug message when skipping section (Andrii)
          - Reject non-static global data till we have
            semantics for sharing them (Yonghong, Andrii, Alexei)
          - More test cases and completely reworked prog test (Alexei)
         - Fixes, cleanups, etc all over the set
         - Not yet addressed:
          - Make BTF mandatory for these maps (Alexei)
          -> Waiting till BTF support for these lands first
        v1 -> v2:
          - Instead of 32-bit static data, implement full global
            data support (Alexei)
      
        [0] https://patchwork.ozlabs.org/cover/1040290/
        [1] http://vger.kernel.org/lpc-bpf2018.html#session-3
      ====================
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      6316f783
    • D
      bpf, selftest: add test cases for BTF Var and DataSec · c861168b
      Daniel Borkmann 提交于
      Extend test_btf with various positive and negative tests around
      BTF verification of kind Var and DataSec. All passing as well:
      
        # ./test_btf
        [...]
        BTF raw test[4] (global data test #1): OK
        BTF raw test[5] (global data test #2): OK
        BTF raw test[6] (global data test #3): OK
        BTF raw test[7] (global data test #4, unsupported linkage): OK
        BTF raw test[8] (global data test #5, invalid var type): OK
        BTF raw test[9] (global data test #6, invalid var type (fwd type)): OK
        BTF raw test[10] (global data test #7, invalid var type (fwd type)): OK
        BTF raw test[11] (global data test #8, invalid var size): OK
        BTF raw test[12] (global data test #9, invalid var size): OK
        BTF raw test[13] (global data test #10, invalid var size): OK
        BTF raw test[14] (global data test #11, multiple section members): OK
        BTF raw test[15] (global data test #12, invalid offset): OK
        BTF raw test[16] (global data test #13, invalid offset): OK
        BTF raw test[17] (global data test #14, invalid offset): OK
        BTF raw test[18] (global data test #15, not var kind): OK
        BTF raw test[19] (global data test #16, invalid var referencing sec): OK
        BTF raw test[20] (global data test #17, invalid var referencing var): OK
        BTF raw test[21] (global data test #18, invalid var loop): OK
        BTF raw test[22] (global data test #19, invalid var referencing var): OK
        BTF raw test[23] (global data test #20, invalid ptr referencing var): OK
        BTF raw test[24] (global data test #21, var included in struct): OK
        BTF raw test[25] (global data test #22, array of var): OK
        [...]
        PASS:167 SKIP:0 FAIL:0
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c861168b
    • J
      bpf, selftest: test global data/bss/rodata sections · b915ebe6
      Joe Stringer 提交于
      Add tests for libbpf relocation of static variable references
      into the .data, .rodata and .bss sections of the ELF, also add
      read-only test for .rodata. All passing:
      
        # ./test_progs
        [...]
        test_global_data:PASS:load program 0 nsec
        test_global_data:PASS:pass global data run 925 nsec
        test_global_data_number:PASS:relocate .bss reference 925 nsec
        test_global_data_number:PASS:relocate .data reference 925 nsec
        test_global_data_number:PASS:relocate .rodata reference 925 nsec
        test_global_data_number:PASS:relocate .bss reference 925 nsec
        test_global_data_number:PASS:relocate .data reference 925 nsec
        test_global_data_number:PASS:relocate .rodata reference 925 nsec
        test_global_data_number:PASS:relocate .bss reference 925 nsec
        test_global_data_number:PASS:relocate .bss reference 925 nsec
        test_global_data_number:PASS:relocate .rodata reference 925 nsec
        test_global_data_number:PASS:relocate .rodata reference 925 nsec
        test_global_data_number:PASS:relocate .rodata reference 925 nsec
        test_global_data_string:PASS:relocate .rodata reference 925 nsec
        test_global_data_string:PASS:relocate .data reference 925 nsec
        test_global_data_string:PASS:relocate .bss reference 925 nsec
        test_global_data_string:PASS:relocate .data reference 925 nsec
        test_global_data_string:PASS:relocate .bss reference 925 nsec
        test_global_data_struct:PASS:relocate .rodata reference 925 nsec
        test_global_data_struct:PASS:relocate .bss reference 925 nsec
        test_global_data_struct:PASS:relocate .rodata reference 925 nsec
        test_global_data_struct:PASS:relocate .data reference 925 nsec
        test_global_data_rdonly:PASS:test .rodata read-only map 925 nsec
        [...]
        Summary: 229 PASSED, 0 FAILED
      
      Note map helper signatures have been changed to avoid warnings
      when passing in const data.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NJoe Stringer <joe@wand.net.nz>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b915ebe6
    • D
      bpf, selftest: test {rd, wr}only flags and direct value access · fb2abb73
      Daniel Borkmann 提交于
      Extend test_verifier with various test cases around the two kernel
      extensions, that is, {rd,wr}only map support as well as direct map
      value access. All passing, one skipped due to xskmap not present
      on test machine:
      
        # ./test_verifier
        [...]
        #948/p XDP pkt read, pkt_meta' <= pkt_data, bad access 1 OK
        #949/p XDP pkt read, pkt_meta' <= pkt_data, bad access 2 OK
        #950/p XDP pkt read, pkt_data <= pkt_meta', good access OK
        #951/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
        #952/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
        Summary: 1410 PASSED, 1 SKIPPED, 0 FAILED
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fb2abb73
    • D
      bpf: bpftool support for dumping data/bss/rodata sections · 817998af
      Daniel Borkmann 提交于
      Add the ability to bpftool to handle BTF Var and DataSec kinds
      in order to dump them out of btf_dumper_type(). The value has a
      single object with the section name, which itself holds an array
      of variables it dumps. A single variable is an object by itself
      printed along with its name. From there further type information
      is dumped along with corresponding value information.
      
      Example output from .rodata:
      
        # ./bpftool m d i 150
        [{
                "value": {
                    ".rodata": [{
                            "load_static_data.bar": 18446744073709551615
                        },{
                            "num2": 24
                        },{
                            "num5": 43947
                        },{
                            "num6": 171
                        },{
                            "str0": [97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,0,0,0,0,0,0
                            ]
                        },{
                            "struct0": {
                                "a": 42,
                                "b": 4278120431,
                                "c": 1229782938247303441
                            }
                        },{
                            "struct2": {
                                "a": 0,
                                "b": 0,
                                "c": 0
                            }
                        }
                    ]
                }
            }
        ]
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      817998af
    • D
      bpf, libbpf: add support for BTF Var and DataSec · 1713d68b
      Daniel Borkmann 提交于
      This adds libbpf support for BTF Var and DataSec kinds. Main point
      here is that libbpf needs to do some preparatory work before the
      whole BTF object can be loaded into the kernel, that is, fixing up
      of DataSec size taken from the ELF section size and non-static
      variable offset which needs to be taken from the ELF's string section.
      
      Upstream LLVM doesn't fix these up since at time of BTF emission
      it is too early in the compilation process thus this information
      isn't available yet, hence loader needs to take care of it.
      
      Note, deduplication handling has not been in the scope of this work
      and needs to be addressed in a future commit.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://reviews.llvm.org/D59441Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1713d68b
    • D
      bpf, libbpf: support global data/bss/rodata sections · d859900c
      Daniel Borkmann 提交于
      This work adds BPF loader support for global data sections
      to libbpf. This allows to write BPF programs in more natural
      C-like way by being able to define global variables and const
      data.
      
      Back at LPC 2018 [0] we presented a first prototype which
      implemented support for global data sections by extending BPF
      syscall where union bpf_attr would get additional memory/size
      pair for each section passed during prog load in order to later
      add this base address into the ldimm64 instruction along with
      the user provided offset when accessing a variable. Consensus
      from LPC was that for proper upstream support, it would be
      more desirable to use maps instead of bpf_attr extension as
      this would allow for introspection of these sections as well
      as potential live updates of their content. This work follows
      this path by taking the following steps from loader side:
      
       1) In bpf_object__elf_collect() step we pick up ".data",
          ".rodata", and ".bss" section information.
      
       2) If present, in bpf_object__init_internal_map() we add
          maps to the obj's map array that corresponds to each
          of the present sections. Given section size and access
          properties can differ, a single entry array map is
          created with value size that is corresponding to the
          ELF section size of .data, .bss or .rodata. These
          internal maps are integrated into the normal map
          handling of libbpf such that when user traverses all
          obj maps, they can be differentiated from user-created
          ones via bpf_map__is_internal(). In later steps when
          we actually create these maps in the kernel via
          bpf_object__create_maps(), then for .data and .rodata
          sections their content is copied into the map through
          bpf_map_update_elem(). For .bss this is not necessary
          since array map is already zero-initialized by default.
          Additionally, for .rodata the map is frozen as read-only
          after setup, such that neither from program nor syscall
          side writes would be possible.
      
       3) In bpf_program__collect_reloc() step, we record the
          corresponding map, insn index, and relocation type for
          the global data.
      
       4) And last but not least in the actual relocation step in
          bpf_program__relocate(), we mark the ldimm64 instruction
          with src_reg = BPF_PSEUDO_MAP_VALUE where in the first
          imm field the map's file descriptor is stored as similarly
          done as in BPF_PSEUDO_MAP_FD, and in the second imm field
          (as ldimm64 is 2-insn wide) we store the access offset
          into the section. Given these maps have only single element
          ldimm64's off remains zero in both parts.
      
       5) On kernel side, this special marked BPF_PSEUDO_MAP_VALUE
          load will then store the actual target address in order
          to have a 'map-lookup'-free access. That is, the actual
          map value base address + offset. The destination register
          in the verifier will then be marked as PTR_TO_MAP_VALUE,
          containing the fixed offset as reg->off and backing BPF
          map as reg->map_ptr. Meaning, it's treated as any other
          normal map value from verification side, only with
          efficient, direct value access instead of actual call to
          map lookup helper as in the typical case.
      
      Currently, only support for static global variables has been
      added, and libbpf rejects non-static global variables from
      loading. This can be lifted until we have proper semantics
      for how BPF will treat multi-object BPF loads. From BTF side,
      libbpf will set the value type id of the types corresponding
      to the ".bss", ".data" and ".rodata" names which LLVM will
      emit without the object name prefix. The key type will be
      left as zero, thus making use of the key-less BTF option in
      array maps.
      
      Simple example dump of program using globals vars in each
      section:
      
        # bpftool prog
        [...]
        6784: sched_cls  name load_static_dat  tag a7e1291567277844  gpl
              loaded_at 2019-03-11T15:39:34+0000  uid 0
              xlated 1776B  jited 993B  memlock 4096B  map_ids 2238,2237,2235,2236,2239,2240
      
        # bpftool map show id 2237
        2237: array  name test_glo.bss  flags 0x0
              key 4B  value 64B  max_entries 1  memlock 4096B
        # bpftool map show id 2235
        2235: array  name test_glo.data  flags 0x0
              key 4B  value 64B  max_entries 1  memlock 4096B
        # bpftool map show id 2236
        2236: array  name test_glo.rodata  flags 0x80
              key 4B  value 96B  max_entries 1  memlock 4096B
      
        # bpftool prog dump xlated id 6784
        int load_static_data(struct __sk_buff * skb):
        ; int load_static_data(struct __sk_buff *skb)
           0: (b7) r6 = 0
        ; test_reloc(number, 0, &num0);
           1: (63) *(u32 *)(r10 -4) = r6
           2: (bf) r2 = r10
        ; int load_static_data(struct __sk_buff *skb)
           3: (07) r2 += -4
        ; test_reloc(number, 0, &num0);
           4: (18) r1 = map[id:2238]
           6: (18) r3 = map[id:2237][0]+0    <-- direct addr in .bss area
           8: (b7) r4 = 0
           9: (85) call array_map_update_elem#100464
          10: (b7) r1 = 1
        ; test_reloc(number, 1, &num1);
        [...]
        ; test_reloc(string, 2, str2);
         120: (18) r8 = map[id:2237][0]+16   <-- same here at offset +16
         122: (18) r1 = map[id:2239]
         124: (18) r3 = map[id:2237][0]+16
         126: (b7) r4 = 0
         127: (85) call array_map_update_elem#100464
         128: (b7) r1 = 120
        ; str1[5] = 'x';
         129: (73) *(u8 *)(r9 +5) = r1
        ; test_reloc(string, 3, str1);
         130: (b7) r1 = 3
         131: (63) *(u32 *)(r10 -4) = r1
         132: (b7) r9 = 3
         133: (bf) r2 = r10
        ; int load_static_data(struct __sk_buff *skb)
         134: (07) r2 += -4
        ; test_reloc(string, 3, str1);
         135: (18) r1 = map[id:2239]
         137: (18) r3 = map[id:2235][0]+16   <-- direct addr in .data area
         139: (b7) r4 = 0
         140: (85) call array_map_update_elem#100464
         141: (b7) r1 = 111
        ; __builtin_memcpy(&str2[2], "hello", sizeof("hello"));
         142: (73) *(u8 *)(r8 +6) = r1       <-- further access based on .bss data
         143: (b7) r1 = 108
         144: (73) *(u8 *)(r8 +5) = r1
        [...]
      
      For Cilium use-case in particular, this enables migrating configuration
      constants from Cilium daemon's generated header defines into global
      data sections such that expensive runtime recompilations with LLVM can
      be avoided altogether. Instead, the ELF file becomes effectively a
      "template", meaning, it is compiled only once (!) and the Cilium daemon
      will then rewrite relevant configuration data from the ELF's .data or
      .rodata sections directly instead of recompiling the program. The
      updated ELF is then loaded into the kernel and atomically replaces
      the existing program in the networking datapath. More info in [0].
      
      Based upon recent fix in LLVM, commit c0db6b6bd444 ("[BPF] Don't fail
      for static variables").
      
        [0] LPC 2018, BPF track, "ELF relocation for static data in BPF",
            http://vger.kernel.org/lpc-bpf2018.html#session-3Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d859900c
    • J
      bpf, libbpf: refactor relocation handling · f8c7a4d4
      Joe Stringer 提交于
      Adjust the code for relocations slightly with no functional changes,
      so that upcoming patches that will introduce support for relocations
      into the .data, .rodata and .bss sections can be added independent
      of these changes.
      Signed-off-by: NJoe Stringer <joe@wand.net.nz>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f8c7a4d4
    • D
      bpf: sync {btf, bpf}.h uapi header from tools infrastructure · c83fef6b
      Daniel Borkmann 提交于
      Pull in latest changes from both headers, so we can make use of
      them in libbpf.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c83fef6b
    • D
      bpf: allow for key-less BTF in array map · 2824ecb7
      Daniel Borkmann 提交于
      Given we'll be reusing BPF array maps for global data/bss/rodata
      sections, we need a way to associate BTF DataSec type as its map
      value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR()
      macro hack e.g. via 38d5d3b3 ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
      to get initial map to type association going. While more use cases
      for it are discouraged, this also won't work for global data since
      the use of array map is a BPF loader detail and therefore unknown
      at compilation time. For array maps with just a single entry we make
      an exception in terms of BTF in that key type is declared optional
      if value type is of DataSec type. The latter LLVM is guaranteed to
      emit and it also aligns with how we regard global data maps as just
      a plain buffer area reusing existing map facilities for allowing
      things like introspection with existing tools.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2824ecb7
    • D
      bpf: kernel side support for BTF Var and DataSec · 1dc92851
      Daniel Borkmann 提交于
      This work adds kernel-side verification, logging and seq_show dumping
      of BTF Var and DataSec kinds which are emitted with latest LLVM. The
      following constraints apply:
      
      BTF Var must have:
      
      - Its kind_flag is 0
      - Its vlen is 0
      - Must point to a valid type
      - Type must not resolve to a forward type
      - Size of underlying type must be > 0
      - Must have a valid name
      - Can only be a source type, not sink or intermediate one
      - Name may include dots (e.g. in case of static variables
        inside functions)
      - Cannot be a member of a struct/union
      - Linkage so far can either only be static or global/allocated
      
      BTF DataSec must have:
      
      - Its kind_flag is 0
      - Its vlen cannot be 0
      - Its size cannot be 0
      - Must have a valid name
      - Can only be a source type, not sink or intermediate one
      - Name may include dots (e.g. to represent .bss, .data, .rodata etc)
      - Cannot be a member of a struct/union
      - Inner btf_var_secinfo array with {type,offset,size} triple
        must be sorted by offset in ascending order
      - Type must always point to BTF Var
      - BTF resolved size of Var must be <= size provided by triple
      - DataSec size must be >= sum of triple sizes (thus holes
        are allowed)
      
      btf_var_resolve(), btf_ptr_resolve() and btf_modifier_resolve()
      are on a high level quite similar but each come with slight,
      subtle differences. They could potentially be a bit refactored
      in future which hasn't been done here to ease review.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1dc92851
    • D
      bpf: add specification for BTF Var and DataSec kinds · f063c889
      Daniel Borkmann 提交于
      This adds the BTF specification and UAPI bits for supporting BTF Var
      and DataSec kinds. This is following LLVM upstream commit ac4082b77e07
      ("[BPF] Add BTF Var and DataSec Support") which has been merged recently.
      Var itself is for describing a global variable and DataSec to describe
      ELF sections e.g. data/bss/rodata sections that hold one or multiple
      global variables.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f063c889
    • D
      bpf: allow . char as part of the object name · 3e0ddc4f
      Daniel Borkmann 提交于
      Trivial addition to allow '.' aside from '_' as "special" characters
      in the object name. Used to allow for substrings in maps from loader
      side such as ".bss", ".data", ".rodata", but could also be useful for
      other purposes.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3e0ddc4f
    • D
      bpf: add syscall side map freeze support · 87df15de
      Daniel Borkmann 提交于
      This patch adds a new BPF_MAP_FREEZE command which allows to
      "freeze" the map globally as read-only / immutable from syscall
      side.
      
      Map permission handling has been refactored into map_get_sys_perms()
      and drops FMODE_CAN_WRITE in case of locked map. Main use case is
      to allow for setting up .rodata sections from the BPF ELF which
      are loaded into the kernel, meaning BPF loader first allocates
      map, sets up map value by copying .rodata section into it and once
      complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
      modifications.
      
      Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
      bpf_attr members are required to be zero. I didn't add write-only
      locking here as counterpart since I don't have a concrete use-case
      for it on my side, and I think it makes probably more sense to wait
      once there is actually one. In that case bpf_attr can be extended
      as usual with a flag field and/or others where flag 0 means that
      we lock the map read-only hence this doesn't prevent to add further
      extensions to BPF_MAP_FREEZE upon need.
      
      A map creation flag like BPF_F_WRONCE was not considered for couple
      of reasons: i) in case of a generic implementation, a map can consist
      of more than just one element, thus there could be multiple map
      updates needed to set the map into a state where it can then be
      made immutable, ii) WRONCE indicates exact one-time write before
      it is then set immutable. A generic implementation would set a bit
      atomically on map update entry (if unset), indicating that every
      subsequent update from then onwards will need to bail out there.
      However, map updates can fail, so upon failure that flag would need
      to be unset again and the update attempt would need to be repeated
      for it to be eventually made immutable. While this can be made
      race-free, this approach feels less clean and in combination with
      reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
      command directly sets the flag and caller has the guarantee that
      map is immutable from syscall side upon successful return for any
      future syscall invocations that would alter the map state, which
      is also more intuitive from an API point of view. A command name
      such as BPF_MAP_LOCK has been avoided as it's too close with BPF
      map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
      is so far only enabled for privileged users.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      87df15de
    • D
      bpf: add program side {rd, wr}only support for maps · 591fe988
      Daniel Borkmann 提交于
      This work adds two new map creation flags BPF_F_RDONLY_PROG
      and BPF_F_WRONLY_PROG in order to allow for read-only or
      write-only BPF maps from a BPF program side.
      
      Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
      applies to system call side, meaning the BPF program has full
      read/write access to the map as usual while bpf(2) calls with
      map fd can either only read or write into the map depending
      on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
      for the exact opposite such that verifier is going to reject
      program loads if write into a read-only map or a read into a
      write-only map is detected. For read-only map case also some
      helpers are forbidden for programs that would alter the map
      state such as map deletion, update, etc. As opposed to the two
      BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
      as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
      
      We've enabled this generic map extension to various non-special
      maps holding normal user data: array, hash, lru, lpm, local
      storage, queue and stack. Further generic map types could be
      followed up in future depending on use-case. Main use case
      here is to forbid writes into .rodata map values from verifier
      side.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      591fe988
    • D
      bpf: do not retain flags that are not tied to map lifetime · be70bcd5
      Daniel Borkmann 提交于
      Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file
      descriptor, but not to the map object itself! Meaning, at map
      creation time BPF_F_RDONLY can be set to make the map read-only
      from syscall side, but this holds only for the returned fd, so
      any other fd either retrieved via bpf file system or via map id
      for the very same underlying map object can have read-write access
      instead.
      
      Given that, keeping the two flags around in the map_flags attribute
      and exposing them to user space upon map dump is misleading and
      may lead to false conclusions. Since these two flags are not
      tied to the map object lets also not store them as map property.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      be70bcd5
    • D
      bpf: implement lookup-free direct value access for maps · d8eca5bb
      Daniel Borkmann 提交于
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for ldimm64 instruction that indicates
      that inside the first part of the double insns's imm field is a
      file descriptor which the verifier then replaces as a full 64bit
      address of the map into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is array map with single entry.
      It is possible to support more than just single map element by
      reusing both 16bit off fields of the insns as a map index, so
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of concrete use case, but
      could easily be done so in future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
      map pointer versus load of map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
      regular map pointer and map value pointer would add unnecessary
      complexity and increases barrier for debugability thus less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since maximum
      possible value size is in u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8eca5bb
  4. 07 4月, 2019 1 次提交
  5. 05 4月, 2019 8 次提交
    • D
      Merge branch 'bpf-varstack-fixes' · 347807d3
      Daniel Borkmann 提交于
      Andrey Ignatov says:
      
      ====================
      v2->v3:
      - sanity check max value for variable offset.
      
      v1->v2:
      - rely on meta = NULL to reject var_off stack access to uninit buffer.
      
      This patch set is a follow-up for discussion [1].
      
      It fixes variable offset stack access handling for raw and unprivileged
      mode, rejecting both of them, and sanity checks max variable offset value.
      
      Patch 1 handles raw (uninitialized) mode.
      Patch 2 adds test for raw mode.
      Patch 3 handles unprivileged mode.
      Patch 4 adds test for unprivileged mode.
      Patch 5 adds sanity check for max value of variable offset.
      Patch 6 adds test for variable offset max value checking.
      Patch 7 is a minor fix in verbose log.
      
      Unprivileged mode is an interesting case since one (and only?) way to come
      up with variable offset is to use pointer arithmetics. Though pointer
      arithmetics is already prohibited for unprivileged mode. I'm not sure if
      it's enough though and it seems like a good idea to still reject variable
      offset for unpriv in check_stack_boundary(). Please see patches 3 and 4
      for more details on this.
      
      [1] https://marc.info/?l=linux-netdev&m=155419526427742&w=2
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      347807d3
    • A
      bpf: Add missed newline in verifier verbose log · 1fbd20f8
      Andrey Ignatov 提交于
      check_stack_access() that prints verbose log is used in
      adjust_ptr_min_max_vals() that prints its own verbose log and now they
      stick together, e.g.:
      
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
        size=1R2 stack pointer arithmetic goes out of range, prohibited for
        !root
      
      Add missing newline so that log is more readable:
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
        R2 stack pointer arithmetic goes out of range, prohibited for !root
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1fbd20f8
    • A
      selftests/bpf: Test unbounded var_off stack access · 07f91962
      Andrey Ignatov 提交于
      Test the case when reg->smax_value is too small/big and can overflow,
      and separately min and max values outside of stack bounds.
      
      Example of output:
        # ./test_verifier
        #856/p indirect variable-offset stack access, unbounded OK
        #857/p indirect variable-offset stack access, max out of bound OK
        #858/p indirect variable-offset stack access, min out of bound OK
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      07f91962
    • A
      bpf: Sanity check max value for var_off stack access · 107c26a7
      Andrey Ignatov 提交于
      As discussed in [1] max value of variable offset has to be checked for
      overflow on stack access otherwise verifier would accept code like this:
      
        0: (b7) r2 = 6
        1: (b7) r3 = 28
        2: (7a) *(u64 *)(r10 -16) = 0
        3: (7a) *(u64 *)(r10 -8) = 0
        4: (79) r4 = *(u64 *)(r1 +168)
        5: (c5) if r4 s< 0x0 goto pc+4
         R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
         R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
         0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
        6: (17) r4 -= 16
        7: (0f) r4 += r10
        8: (b7) r5 = 8
        9: (85) call bpf_getsockopt#57
        10: (b7) r0 = 0
        11: (95) exit
      
      , where R4 obviosly has unbounded max value.
      
      Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
      BPF_MAX_VAR_OFF) range.
      
      reg->smax_value is used instead of reg->umax_value because stack
      pointers are calculated using negative offset from fp. This is opposite
      to e.g. map access where offset must be non-negative and where
      umax_value is used.
      
      Also dedicated verbose logs are added for both min and max bound check
      failures to have diagnostics consistent with variable offset handling in
      check_map_access().
      
      [1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      107c26a7
    • A
      selftests/bpf: Test indirect var_off stack access in unpriv mode · 2c6927db
      Andrey Ignatov 提交于
      Test that verifier rejects indirect stack access with variable offset in
      unprivileged mode and accepts same code in privileged mode.
      
      Since pointer arithmetics is prohibited in unprivileged mode verifier
      should reject the program even before it gets to helper call that uses
      variable offset, at the time when that variable offset is trying to be
      constructed.
      
      Example of output:
        # ./test_verifier
        ...
        #859/u indirect variable-offset stack access, priv vs unpriv OK
        #859/p indirect variable-offset stack access, priv vs unpriv OK
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2c6927db
    • A
      bpf: Reject indirect var_off stack access in unpriv mode · 088ec26d
      Andrey Ignatov 提交于
      Proper support of indirect stack access with variable offset in
      unprivileged mode (!root) requires corresponding support in Spectre
      masking for stack ALU in retrieve_ptr_limit().
      
      There are no use-case for variable offset in unprivileged mode though so
      make verifier reject such accesses for simplicity.
      
      Pointer arithmetics is one (and only?) way to cause variable offset and
      it's already rejected in unpriv mode so that verifier won't even get to
      helper function whose argument contains variable offset, e.g.:
      
        0: (7a) *(u64 *)(r10 -16) = 0
        1: (7a) *(u64 *)(r10 -8) = 0
        2: (61) r2 = *(u32 *)(r1 +0)
        3: (57) r2 &= 4
        4: (17) r2 -= 16
        5: (0f) r2 += r10
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
        stack pointer arithmetic goes out of range, prohibited for !root
      
      Still it looks like a good idea to reject variable offset indirect stack
      access for unprivileged mode in check_stack_boundary() explicitly.
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      088ec26d
    • A
      selftests/bpf: Test indirect var_off stack access in raw mode · f68a5b44
      Andrey Ignatov 提交于
      Test that verifier rejects indirect access to uninitialized stack with
      variable offset.
      
      Example of output:
        # ./test_verifier
        ...
        #859/p indirect variable-offset stack access, uninitialized OK
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f68a5b44
    • A
      bpf: Reject indirect var_off stack access in raw mode · f2bcd05e
      Andrey Ignatov 提交于
      It's hard to guarantee that whole memory is marked as initialized on
      helper return if uninitialized stack is accessed with variable offset
      since specific bounds are unknown to verifier. This may cause
      uninitialized stack leaking.
      
      Reject such an access in check_stack_boundary to prevent possible
      leaking.
      
      There are no known use-cases for indirect uninitialized stack access
      with variable offset so it shouldn't break anything.
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f2bcd05e