1. 02 4月, 2017 1 次提交
    • A
      bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Alexei Starovoitov 提交于
      development and testing of networking bpf programs is quite cumbersome.
      Despite availability of user space bpf interpreters the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and precise queue
      configuration, transferring of xdp program from host into guest,
      attaching to virtio/eth0 and generating traffic from the host
      while capturing the results from the guest.
      
      Moreover analyzing performance bottlenecks in XDP program is
      impossible in virtio environment, since cost of running the program
      is tiny comparing to the overhead of virtio packet processing,
      so performance testing can only be done on physical nic
      with another server generating traffic.
      
      Furthermore ongoing changes to user space control plane of production
      applications cannot be run on the test servers leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite which has no ability to test the code generated.
      
      To improve this situation introduce BPF_PROG_TEST_RUN command
      to test and performance benchmark bpf programs.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cf1cae9
  2. 23 3月, 2017 4 次提交
    • M
      bpf: Add hash of maps support · bcc6b1b7
      Martin KaFai Lau 提交于
      This patch adds hash of maps support (hashmap->bpf_map).
      BPF_MAP_TYPE_HASH_OF_MAPS is added.
      
      A map-in-map contains a pointer to another map and lets call
      this pointer 'inner_map_ptr'.
      
      Notes on deleting inner_map_ptr from a hash map:
      
      1. For BPF_F_NO_PREALLOC map-in-map, when deleting
         an inner_map_ptr, the htab_elem itself will go through
         a rcu grace period and the inner_map_ptr resides
         in the htab_elem.
      
      2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
         when deleting an inner_map_ptr, the htab_elem may
         get reused immediately.  This situation is similar
         to the existing prealloc-ated use cases.
      
         However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
         inner_map->ops->map_free(inner_map) which will go
         through a rcu grace period (i.e. all bpf_map's map_free
         currently goes through a rcu grace period).  Hence,
         the inner_map_ptr is still safe for the rcu reader side.
      
      This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
      check_map_prealloc() in the verifier.  preallocation is a
      must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even we don't expect
      heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
      is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
      map-in-map first.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bcc6b1b7
    • M
      bpf: Add array of maps support · 56f668df
      Martin KaFai Lau 提交于
      This patch adds a few helper funcs to enable map-in-map
      support (i.e. outer_map->inner_map).  The first outer_map type
      BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
      The next patch will introduce a hash of maps type.
      
      Any bpf map type can be acted as an inner_map.  The exception
      is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
      indirection makes it harder to verify the owner_prog_type
      and owner_jited.
      
      Multi-level map-in-map is not supported (i.e. map->map is ok
      but not map->map->map).
      
      When adding an inner_map to an outer_map, it currently checks the
      map_type, key_size, value_size, map_flags, max_entries and ops.
      The verifier also uses those map's properties to do static analysis.
      map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
      is using a preallocated hashtab for the inner_hash also.  ops and
      max_entries are needed to generate inlined map-lookup instructions.
      For simplicity reason, a simple '==' test is used for both map_flags
      and max_entries.  The equality of ops is implied by the equality of
      map_type.
      
      During outer_map creation time, an inner_map_fd is needed to create an
      outer_map.  However, the inner_map_fd's life time does not depend on the
      outer_map.  The inner_map_fd is merely used to initialize
      the inner_map_meta of the outer_map.
      
      Also, for the outer_map:
      
      * It allows element update and delete from syscall
      * It allows element lookup from bpf_prog
      
      The above is similar to the current fd_array pattern.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56f668df
    • M
      bpf: Fix and simplifications on inline map lookup · fad73a1a
      Martin KaFai Lau 提交于
      Fix in verifier:
      For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
      a broken case is "a different type of map could be used for the
      same lookup instruction". For example, an array in one case and a
      hashmap in another.  We have to resort to the old dynamic call behavior
      in this case.  The fix is to check for collision on insn_aux->map_ptr.
      If there is collision, don't inline the map lookup.
      
      Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
      patch for how-to trigger the above case.
      
      Simplifications on array_map_gen_lookup():
      1. Calculate elem_size from map->value_size.  It removes the
         need for 'struct bpf_array' which makes the later map-in-map
         implementation easier.
      2. Remove the 'elem_size == 1' test
      
      Fixes: 81ed18ab ("bpf: add helper inlining infra and optimize map_array lookup")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fad73a1a
    • A
      bpf: fix hashmap extra_elems logic · 8c290e60
      Alexei Starovoitov 提交于
      In both kmalloc and prealloc mode the bpf_map_update_elem() is using
      per-cpu extra_elems to do atomic update when the map is full.
      There are two issues with it. The logic can be misused, since it allows
      max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
      at map creation time can fail percpu alloc for large map values with a warn:
      WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
      illegal size (32824) or align (8) for percpu allocation
      
      The fixes for both of these issues are different for kmalloc and prealloc modes.
      For prealloc mode allocate extra num_possible_cpus elements and store
      their pointers into extra_elems array instead of actual elements.
      Hence we can use these hidden(spare) elements not only when the map is full
      but during bpf_map_update_elem() that replaces existing element too.
      That also improves performance, since pcpu_freelist_pop/push is avoided.
      Unfortunately this approach cannot be used for kmalloc mode which needs
      to kfree elements after rcu grace period. Therefore switch it back to normal
      kmalloc even when full and old element exists like it was prior to
      commit 6c905981 ("bpf: pre-allocate hash map elements").
      
      Add tests to check for over max_entries and large map values.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c290e60
  3. 17 3月, 2017 5 次提交
  4. 10 3月, 2017 2 次提交
  5. 06 3月, 2017 1 次提交
  6. 02 3月, 2017 2 次提交
  7. 23 2月, 2017 1 次提交
  8. 18 2月, 2017 3 次提交
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66
    • D
      bpf: remove stubs for cBPF from arch code · 9383191d
      Daniel Borkmann 提交于
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9383191d
    • D
      bpf: mark all registered map/prog types as __ro_after_init · c78f8bdf
      Daniel Borkmann 提交于
      All map types and prog types are registered to the BPF core through
      bpf_register_map_type() and bpf_register_prog_type() during init and
      remain unchanged thereafter. As by design we don't (and never will)
      have any pluggable code that can register to that at any later point
      in time, lets mark all the existing bpf_{map,prog}_type_list objects
      in the tree as __ro_after_init, so they can be moved to read-only
      section from then onwards.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c78f8bdf
  9. 15 2月, 2017 1 次提交
    • A
      bpf: reduce compiler warnings by adding fallthrough comments · 7e57fbb2
      Alexander Alemayhu 提交于
      Fixes the following warnings:
      
      kernel/bpf/verifier.c: In function ‘may_access_direct_pkt_data’:
      kernel/bpf/verifier.c:702:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
         if (t == BPF_WRITE)
            ^
      kernel/bpf/verifier.c:704:2: note: here
        case BPF_PROG_TYPE_SCHED_CLS:
        ^~~~
      kernel/bpf/verifier.c: In function ‘reg_set_min_max_inv’:
      kernel/bpf/verifier.c:2057:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
         true_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2058:2: note: here
        case BPF_JSGT:
        ^~~~
      kernel/bpf/verifier.c:2068:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
         true_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2069:2: note: here
        case BPF_JSGE:
        ^~~~
      kernel/bpf/verifier.c: In function ‘reg_set_min_max’:
      kernel/bpf/verifier.c:2009:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
         false_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2010:2: note: here
        case BPF_JSGT:
        ^~~~
      kernel/bpf/verifier.c:2019:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
         false_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2020:2: note: here
        case BPF_JSGE:
        ^~~~
      Reported-by: NDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: NAlexander Alemayhu <alexander@alemayhu.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e57fbb2
  10. 13 2月, 2017 1 次提交
    • A
      bpf: introduce BPF_F_ALLOW_OVERRIDE flag · 7f677633
      Alexei Starovoitov 提交于
      If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
      to the given cgroup the descendent cgroup will be able to override
      effective bpf program that was inherited from this cgroup.
      By default it's not passed, therefore override is disallowed.
      
      Examples:
      1.
      prog X attached to /A with default
      prog Y fails to attach to /A/B and /A/B/C
      Everything under /A runs prog X
      
      2.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A/B with default (non-override)
      prog M attached to /A/B with allow_override.
      Everything under /A/B runs prog M only.
      
      3.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A with default.
      The user has to detach first to switch the mode.
      
      In the future this behavior may be extended with a chain of
      non-overridable programs.
      
      Also fix the bug where detach from cgroup where nothing is attached
      was not throwing error. Return ENOENT in such case.
      
      Add several testcases and adjust libbpf.
      
      Fixes: 30070984 ("cgroup: add support for eBPF programs")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDaniel Mack <daniel@zonque.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f677633
  11. 09 2月, 2017 1 次提交
    • D
      bpf, lpm: fix overflows in trie_alloc checks · c502faf9
      Daniel Borkmann 提交于
      Cap the maximum (total) value size and bail out if larger than KMALLOC_MAX_SIZE
      as otherwise it doesn't make any sense to proceed further, since we're
      guaranteed to fail to allocate elements anyway in lpm_trie_node_alloc();
      likleyhood of failure is still high for large values, though, similarly
      as with htab case in non-prealloc.
      
      Next, make sure that cost vars are really u64 instead of size_t, so that we
      don't overflow on 32 bit and charge only tiny map.pages against memlock while
      allowing huge max_entries; cap also the max cost like we do with other map
      types.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c502faf9
  12. 07 2月, 2017 1 次提交
  13. 26 1月, 2017 1 次提交
    • D
      bpf: add initial bpf tracepoints · a67edbf4
      Daniel Borkmann 提交于
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For bpf(2) syscall, tracepoints have been placed for main commands
      when they succeed. In XDP case, tracepoint is for exceptions, that
      is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
      return code, or when error occurs during XDP_TX action and the packet
      could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get into our way in
      future, they can also removed [1]. Of course, these tracepoints (like
      any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a67edbf4
  14. 25 1月, 2017 1 次提交
    • D
      bpf: enable verifier to better track const alu ops · 3fadc801
      Daniel Borkmann 提交于
      William reported couple of issues in relation to direct packet
      access. Typical scheme is to check for data + [off] <= data_end,
      where [off] can be either immediate or coming from a tracked
      register that contains an immediate, depending on the branch, we
      can then access the data. However, in case of calculating [off]
      for either the mentioned test itself or for access after the test
      in a more "complex" way, then the verifier will stop tracking the
      CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.
      
      Adding that UNKNOWN_VALUE typed register to a pkt() marked
      register, the verifier then bails out in check_packet_ptr_add()
      as it finds the registers imm value below 48. In the first below
      example, that is due to evaluate_reg_imm_alu() not handling right
      shifts and thus marking the register as UNKNOWN_VALUE via helper
      __mark_reg_unknown_value() that resets imm to 0.
      
      In the second case the same happens at the time when r4 is set
      to r4 &= r5, where it transitions to UNKNOWN_VALUE from
      evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
      evaluate_reg_alu(), where the register's imm turns into 3. That
      is, for registers with type UNKNOWN_VALUE, imm of 0 means that
      we don't know what value the register has, and for imm > 0 it
      means that the value has [imm] upper zero bits. F.e. when shifting
      an UNKNOWN_VALUE register by 3 to the right, no matter what value
      it had, we know that the 3 upper most bits must be zero now.
      This is to make sure that ALU operations with unknown registers
      don't overflow. Meaning, once we know that we have more than 48
      upper zero bits, or, in other words cannot go beyond 0xffff offset
      with ALU ops, such an addition will track the target register
      as a new pkt() register with a new id, but 0 offset and 0 range,
      so for that a new data/data_end test will be required. Is the source
      register a CONST_IMM one that is to be added to the pkt() register,
      or the source instruction is an add instruction with immediate
      value, then it will get added if it stays within max 0xffff bounds.
      >From there, pkt() type, can be accessed should reg->off + imm be
      within the access range of pkt().
      
        [...]
        from 28 to 30: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=22) R2=pkt_end
          R3=imm144,min_value=144,max_value=144
          R4=imm0,min_value=0,max_value=0
          R5=inv48,min_value=2054,max_value=2054 R10=fp
        30: (bf) r5 = r3
        31: (07) r5 += 23
        32: (77) r5 >>= 3
        33: (bf) r6 = r1
        34: (0f) r6 += r5
        cannot add integer value with 0 upper zero bits to ptr_to_packet
      
        [...]
        from 52 to 80: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
          R4=imm272 R5=inv56,min_value=17,max_value=17
          R6=pkt(id=0,off=26,r=34) R10=fp
        80: (07) r4 += 71
        81: (18) r5 = 0xfffffff8
        83: (5f) r4 &= r5
        84: (77) r4 >>= 3
        85: (0f) r1 += r4
        cannot add integer value with 3 upper zero bits to ptr_to_packet
      
      Thus to get above use-cases working, evaluate_reg_imm_alu() has
      been extended for further ALU ops. This is fine, because we only
      operate strictly within realm of CONST_IMM types, so here we don't
      care about overflows as they will happen in the simulated but also
      real execution and interaction with pkt() in check_packet_ptr_add()
      will check actual imm value once added to pkt(), but it's irrelevant
      before.
      
      With regards to 06c1c049 ("bpf: allow helpers access to variable
      memory") that works on UNKNOWN_VALUE registers, the verifier becomes
      now a bit smarter as it can better resolve ALU ops, so we need to
      adapt two test cases there, as min/max bound tracking only becomes
      necessary when registers were spilled to stack. So while mask was
      set before to track upper bound for UNKNOWN_VALUE case, it's now
      resolved directly as CONST_IMM, and such contructs are only necessary
      when f.e. registers are spilled.
      
      For commit 6b173873 ("bpf: recognize 64bit immediate loads as
      consts") that initially enabled dw load tracking only for nfp jit/
      analyzer, I did couple of tests on large, complex programs and we
      don't increase complexity badly (my tests were in ~3% range on avg).
      I've added a couple of tests similar to affected code above, and
      it works fine with verifier now.
      Reported-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Gianluca Borello <g.borello@gmail.com>
      Cc: William Tu <u9012063@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fadc801
  15. 24 1月, 2017 2 次提交
  16. 19 1月, 2017 1 次提交
    • D
      bpf: don't trigger OOM killer under pressure with map alloc · d407bd25
      Daniel Borkmann 提交于
      This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
      that are to be used for map allocations. Using kmalloc() for very large
      allocations can cause excessive work within the page allocator, so i) fall
      back earlier to vmalloc() when the attempt is considered costly anyway,
      and even more importantly ii) don't trigger OOM killer with any of the
      allocators.
      
      Since this is based on a user space request, for example, when creating
      maps with element pre-allocation, we really want such requests to fail
      instead of killing other user space processes.
      
      Also, don't spam the kernel log with warnings should any of the allocations
      fail under pressure. Given that, we can make backend selection in
      bpf_map_area_alloc() generic, and convert all maps over to use this API
      for spots with potentially large allocation requests.
      
      Note, replacing the one kmalloc_array() is fine as overflow checks happen
      earlier in htab_map_alloc(), since it must also protect the multiplication
      for vmalloc() should kmalloc_array() fail.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d407bd25
  17. 17 1月, 2017 1 次提交
    • D
      bpf: rework prog_digest into prog_tag · f1f7714e
      Daniel Borkmann 提交于
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to
      admittedly suboptimal name of "prog_digest" in combination
      with sha1 hash usage, thus inevitably and rightfully concerns
      about its security in terms of collision resistance were
      raised with regards to use-cases.
      
      The intended use cases are for debugging resp. introspection
      only for providing a stable "tag" over the instruction sequence
      that both kernel and user space can calculate independently.
      It's not usable at all for making a security relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1f7714e
  18. 12 1月, 2017 2 次提交
    • D
      bpf: allow b/h/w/dw access for bpf's cb in ctx · 62c7989b
      Daniel Borkmann 提交于
      When structs are used to store temporary state in cb[] buffer that is
      used with programs and among tail calls, then the generated code will
      not always access the buffer in bpf_w chunks. We can ease programming
      of it and let this act more natural by allowing for aligned b/h/w/dw
      sized access for cb[] ctx member. Various test cases are attached as
      well for the selftest suite. Potentially, this can also be reused for
      other program types to pass data around.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62c7989b
    • D
      bpf: pass original insn directly to convert_ctx_access · 6b8cc1d1
      Daniel Borkmann 提交于
      Currently, when calling convert_ctx_access() callback for the various
      program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
      the original instruction. This information is needed to rewrite the
      instruction that is based on the user ctx structure into a kernel
      representation for the ctx. As we'd like to allow access size beyond
      just BPF_W, we'd need also insn->code for that in order to decode the
      original access size. Given that, lets just pass insn directly to the
      convert_ctx_access() callback and work on that to not clutter the
      callback with even more arguments we need to pass when everything is
      already contained in insn. So lets go through that once, no functional
      change.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b8cc1d1
  19. 11 1月, 2017 3 次提交
  20. 10 1月, 2017 5 次提交
    • A
      bpf: rename ARG_PTR_TO_STACK · 39f19ebb
      Alexei Starovoitov 提交于
      since ARG_PTR_TO_STACK is no longer just pointer to stack
      rename it to ARG_PTR_TO_MEM and adjust comment.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39f19ebb
    • G
      bpf: allow helpers access to variable memory · 06c1c049
      Gianluca Borello 提交于
      Currently, helpers that read and write from/to the stack can do so using
      a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
      ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
      that the verifier can safely check the memory access. However, requiring
      the argument to be a constant can be limiting in some circumstances.
      
      Since the current logic keeps track of the minimum and maximum value of
      a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
      be changed to also accept an UNKNOWN_VALUE register in case its
      boundaries have been set and the range doesn't cause invalid memory
      accesses.
      
      One common situation when this is useful:
      
      int len;
      char buf[BUFSIZE]; /* BUFSIZE is 128 */
      
      if (some_condition)
      	len = 42;
      else
      	len = 84;
      
      some_helper(..., buf, len & (BUFSIZE - 1));
      
      The compiler can often decide to assign the constant values 42 or 48
      into a variable on the stack, instead of keeping it in a register. When
      the variable is then read back from stack into the register in order to
      be passed to the helper, the verifier will not be able to recognize the
      register as constant (the verifier is not currently tracking all
      constant writes into memory), and the program won't be valid.
      
      However, by allowing the helper to accept an UNKNOWN_VALUE register,
      this program will work because the bitwise AND operation will set the
      range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
      so the verifier can guarantee the helper call will be safe (assuming the
      argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
      check against 0 would be needed). Custom ranges can be set not only with
      ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
      register with constants.
      
      Another very common example happens when intercepting system call
      arguments and accessing user-provided data of variable size using
      bpf_probe_read(). One can load at runtime the user-provided length in an
      UNKNOWN_VALUE register, and then read that exact amount of data up to a
      compile-time determined limit in order to fit into the proper local
      storage allocated on the stack, without having to guess a suboptimal
      access size at compile time.
      
      Also, in case the helpers accepting the UNKNOWN_VALUE register operate
      in raw mode, disable the raw mode so that the program is required to
      initialize all memory, since there is no guarantee the helper will fill
      it completely, leaving possibilities for data leak (just relevant when
      the memory used by the helper is the stack, not when using a pointer to
      map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
      be treated as ARG_PTR_TO_STACK.
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06c1c049
    • G
      bpf: allow adjusted map element values to spill · f0318d01
      Gianluca Borello 提交于
      commit 48461135 ("bpf: allow access into map value arrays")
      introduces the ability to do pointer math inside a map element value via
      the PTR_TO_MAP_VALUE_ADJ register type.
      
      The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
      is spilled into the stack, limiting several use cases, especially when
      generating bpf code from a compiler.
      
      Handle this case by explicitly enabling the register type
      PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
      max_value are reset just for BPF_LDX operations that don't result in a
      restore of a spilled register from stack.
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0318d01
    • G
      bpf: allow helpers access to map element values · 5722569b
      Gianluca Borello 提交于
      Enable helpers to directly access a map element value by passing a
      register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
      arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
      
      This enables several use cases. For example, a typical tracing program
      might want to capture pathnames passed to sys_open() with:
      
      struct trace_data {
      	char pathname[PATHLEN];
      };
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	struct trace_data data;
      	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
      
      	/* consume data.pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      }
      
      Such a program could easily hit the stack limit in case PATHLEN needs to
      be large or more local variables need to exist, both of which are quite
      common scenarios. Allowing direct helper access to map element values,
      one could do:
      
      struct bpf_map_def SEC("maps") scratch_map = {
      	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
      	.key_size = sizeof(u32),
      	.value_size = sizeof(struct trace_data),
      	.max_entries = 1,
      };
      
      SEC("kprobe/sys_open")
      int bpf_sys_open(struct pt_regs *ctx)
      {
      	int id = 0;
      	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
      	if (!p)
      		return;
      	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
      
      	/* consume p->pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      }
      
      And wouldn't risk exhausting the stack.
      
      Code changes are loosely modeled after commit 6841de8b ("bpf: allow
      helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
      changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
      ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
      trivial, but since there is not currently a use case for that, it's
      reasonable to limit the set of changes.
      
      Also, add new tests to make sure accesses to map element values from
      helpers never go out of boundary, even when adjusted.
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5722569b
    • G
      bpf: split check_mem_access logic for map values · dbcfe5f7
      Gianluca Borello 提交于
      Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
      check_mem_access() to a separate helper check_map_access_adj(). This
      enables to use those checks in other parts of the verifier as well,
      where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
      example when checking helper function arguments. The same thing is
      already happening for other types such as PTR_TO_PACKET and its
      check_packet_access() helper.
      
      The code has been copied verbatim, with the only difference of removing
      the "off += reg->max_value" statement and moving the sum into the call
      statement to check_map_access(), as that was only needed due to the
      earlier common check_map_access() call.
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbcfe5f7
  21. 18 12月, 2016 1 次提交
    • D
      bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Daniel Borkmann 提交于
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before was, because as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, then this reg's id was cleared
      when calling mark_reg_unknown_value() thus that every subsequent
      register was probed against id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6760bf2d