1. 02 April 2017, 2 commits
    • bpf, verifier: fix rejection of unaligned access checks for map_value_adj · 79adffcd
      Committed by Daniel Borkmann
      Currently, the verifier doesn't reject unaligned access for map_value_adj
      register types. Commit 48461135 ("bpf: allow access into map value
      arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
      to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
      is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
      non-zero, meaning we can cause BPF_H/_W/_DW-based unaligned accesses on
      architectures not supporting efficient unaligned access; in the worst
      case this could raise exceptions on archs that cannot fix up the
      unaligned access, or perform a different memory access than the one
      actually requested.
      
      i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
         on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x42533a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+11
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (61) r1 = *(u32 *)(r0 +0)
         8: (35) if r1 >= 0xb goto pc+9
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
         9: (07) r0 += 3
        10: (79) r7 = *(u64 *)(r0 +0)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
        11: (79) r7 = *(u64 *)(r0 +2)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
        [...]
      
      ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
          on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x4df16a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+19
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (07) r0 += 3
         8: (7a) *(u64 *)(r0 +0) = 42
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
         9: (7a) *(u64 *)(r0 +2) = 43
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        10: (7a) *(u64 *)(r0 -2) = 44
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        [...]
      
      For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
      was fetched, it later receives a reg->id from env->id_gen generator
      once another register with UNKNOWN_VALUE type was added to it via
      check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
      is used in find_good_pkt_pointers() for setting the allowed access
      range for regs with PTR_TO_PACKET of same id once verifier matched
      on data/data_end tests, and ii) it lets check_ptr_alignment() determine
      that, when efficient unaligned access is not available and a register
      with UNKNOWN_VALUE was added to PTR_TO_PACKET, we're only allowed
      to access the content bytewise due to the unknown alignment. reg->id
      was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
      always zero; the only marking is in PTR_TO_MAP_VALUE_OR_NULL, which
      was added after 48461135 via 57a09bf0 ("bpf: Detect identical
      PTR_TO_MAP_VALUE_OR_NULL registers"). The above tests will fail in a
      non-root environment due to prohibited pointer arithmetic.
      
      The fix splits register-type specific checks into their own helper
      instead of keeping them combined, so we don't run into a similar
      issue in future once we extend check_ptr_alignment() further and
      forget to add reg->type checks for some of the checks.
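
      Roughly, the split can be pictured like this (simplified sketch based on
      the description above, not the literal kernel diff):

        static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
                                           int off, int size)
        {
                if (reg->id && size != 1)
                        return -EACCES; /* unknown offset added, bytewise only */
                /* skb->data is NET_IP_ALIGN-ed */
                if ((NET_IP_ALIGN + reg->off + off) % size != 0)
                        return -EACCES;
                return 0;
        }

        static int check_val_ptr_alignment(const struct bpf_reg_state *reg,
                                           int size)
        {
                /* variable part of the map value offset is unknown here */
                if (size != 1)
                        return -EACCES;
                return 0;
        }

        static int check_ptr_alignment(const struct bpf_reg_state *reg,
                                       int off, int size)
        {
                if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
                        return 0;

                switch (reg->type) {
                case PTR_TO_PACKET:
                        return check_pkt_ptr_alignment(reg, off, size);
                case PTR_TO_MAP_VALUE_ADJ:
                        return check_val_ptr_alignment(reg, size);
                default:
                        return 0;       /* other reg types handled elsewhere */
                }
        }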
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: fix alu ops against map_value{,_adj} register types · fce366a9
      Committed by Daniel Borkmann
      While looking into map_value_adj, I noticed that alu operations
      directly on the map_value() resp. map_value_adj() register (any
      alu operation on a map_value() register will turn it into a
      map_value_adj() typed register) are not sufficiently protected
      against some of the operations. Two non-exhaustive examples are
      provided that the verifier needs to reject:
      
       i) BPF_AND on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xbf842a00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (57) r0 &= 8
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xc24eee00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (04) (u32) r0 += (u32) 0
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      Issue is, while min_value / max_value boundaries for the access
      are adjusted appropriately, we change the pointer value in a way
      that cannot be sufficiently tracked anymore from its origin.
      Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
      that is PTR_TO_MAP_VALUE{,_ADJ} were probably unintended; in fact,
      all the test cases coming with 48461135 ("bpf: allow access
      into map value arrays") perform BPF_ADD only on the destination
      register that is PTR_TO_MAP_VALUE_ADJ.
      
      Only for UNKNOWN_VALUE register types such operations make sense,
      f.e. with unknown memory content fetched initially from a constant
      offset from the map value memory into a register. That register is
      then later tested against lower / upper bounds, so that the verifier
      can then do the tracking of min_value / max_value, and properly
      check once that UNKNOWN_VALUE register is added to the destination
      register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
      original use-case is solving. Note, tracking on what is being
      added is done through adjust_reg_min_max_vals() and later access
      to the map value enforced with these boundaries and the given offset
      from the insn through check_map_access_adj().
      
      Tests will fail in a non-root environment due to prohibited pointer
      arithmetic; in particular, in check_alu_op() we bail out on the
      is_pointer_value() check on the dst_reg (which is false in the root
      case as we allow for pointer arithmetic via env->allow_ptr_leaks).
      
      Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
      allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
      mode BPF_ADD. The test_verifier suite runs fine after the patch
      and it also rejects mentioned test cases.
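
      The shape of that restriction inside check_alu_op() is roughly the
      following (sketch; message text and exact placement are assumptions):

        /* dst_reg = &regs[insn->dst_reg], opcode = BPF_OP(insn->code) */
        if (dst_reg->type == PTR_TO_MAP_VALUE ||
            dst_reg->type == PTR_TO_MAP_VALUE_ADJ) {
                if (opcode != BPF_ADD || BPF_CLASS(insn->code) != BPF_ALU64) {
                        verbose("only 64-bit BPF_ADD allowed on map value pointers\n");
                        return -EACCES;
                }
        }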
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 25 March 2017, 1 commit
  3. 23 March 2017, 1 commit
    • bpf: fix hashmap extra_elems logic · 8c290e60
      Committed by Alexei Starovoitov
      In both kmalloc and prealloc mode the bpf_map_update_elem() is using
      per-cpu extra_elems to do atomic update when the map is full.
      There are two issues with it. The logic can be misused, since it allows
      max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
      at map creation time can fail percpu alloc for large map values with a warn:
      WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
      illegal size (32824) or align (8) for percpu allocation
      
      The fixes for both of these issues are different for kmalloc and prealloc modes.
      For prealloc mode allocate extra num_possible_cpus elements and store
      their pointers into extra_elems array instead of actual elements.
      Hence we can use these hidden (spare) elements not only when the map is full,
      but also during bpf_map_update_elem() calls that replace an existing element.
      That also improves performance, since pcpu_freelist_pop/push is avoided.
      Unfortunately this approach cannot be used for kmalloc mode, which needs
      to kfree elements after an RCU grace period. Therefore switch it back to normal
      kmalloc even when the map is full and an old element exists, like it was prior to
      commit 6c905981 ("bpf: pre-allocate hash map elements").
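
      Conceptually, the prealloc-mode handling can be sketched in C like this
      (types and helper names simplified, not the literal kernel code):

        /* each CPU owns a pointer to one spare, pre-allocated element */
        static struct htab_elem *get_spare_elem(struct bpf_htab *htab)
        {
                return *this_cpu_ptr(htab->extra_elems);
        }

        /* an update that replaces an existing element parks the old one as
         * the new per-cpu spare instead of pushing it to the freelist
         */
        static void put_spare_elem(struct bpf_htab *htab, struct htab_elem *old)
        {
                *this_cpu_ptr(htab->extra_elems) = old;
        }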
      
      Add tests to check for over max_entries and large map values.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 March 2017, 2 commits
  5. 06 March 2017, 1 commit
  6. 02 March 2017, 2 commits
  7. 23 February 2017, 1 commit
  8. 18 February 2017, 3 commits
    • bpf: make jited programs visible in traces · 74451e66
      Committed by Daniel Borkmann
      A long-standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file is done via an RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aid
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such program types relevant in this context are for root only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: remove stubs for cBPF from arch code · 9383191d
      Committed by Daniel Borkmann
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
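
      The single core stub presumably ends up along these lines (sketch):

        /* kernel/bpf/core.c: overridden by arch code that provides a cBPF JIT */
        void __weak bpf_jit_compile(struct bpf_prog *prog)
        {
        }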
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: mark all registered map/prog types as __ro_after_init · c78f8bdf
      Committed by Daniel Borkmann
      All map types and prog types are registered to the BPF core through
      bpf_register_map_type() and bpf_register_prog_type() during init and
      remain unchanged thereafter. As by design we don't (and never will)
      have any pluggable code that can register to that at any later point
      in time, let's mark all the existing bpf_{map,prog}_type_list objects
      in the tree as __ro_after_init, so they can be moved to read-only
      section from then onwards.
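
      For illustration, a registration object then looks roughly as follows
      (identifiers borrowed from the array map as an assumed example):

        static struct bpf_map_type_list array_type __ro_after_init = {
                .ops  = &array_ops,
                .type = BPF_MAP_TYPE_ARRAY,
        };

        static int __init register_array_map(void)
        {
                bpf_register_map_type(&array_type);
                return 0;
        }
        late_initcall(register_array_map);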
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 15 February 2017, 1 commit
    • bpf: reduce compiler warnings by adding fallthrough comments · 7e57fbb2
      Committed by Alexander Alemayhu
      Fixes the following warnings:
      
      kernel/bpf/verifier.c: In function ‘may_access_direct_pkt_data’:
      kernel/bpf/verifier.c:702:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
         if (t == BPF_WRITE)
            ^
      kernel/bpf/verifier.c:704:2: note: here
        case BPF_PROG_TYPE_SCHED_CLS:
        ^~~~
      kernel/bpf/verifier.c: In function ‘reg_set_min_max_inv’:
      kernel/bpf/verifier.c:2057:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
         true_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2058:2: note: here
        case BPF_JSGT:
        ^~~~
      kernel/bpf/verifier.c:2068:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
         true_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2069:2: note: here
        case BPF_JSGE:
        ^~~~
      kernel/bpf/verifier.c: In function ‘reg_set_min_max’:
      kernel/bpf/verifier.c:2009:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
         false_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2010:2: note: here
        case BPF_JSGT:
        ^~~~
      kernel/bpf/verifier.c:2019:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
         false_reg->min_value = 0;
         ~~~~~~~~~~~~~~~~~~~~~^~~
      kernel/bpf/verifier.c:2020:2: note: here
        case BPF_JSGE:
        ^~~~
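
      The fix pattern is simply to annotate each intentional fall through so the
      warning is silenced, e.g. for the first hunk (illustrative snippet, not the
      literal diff):

        switch (env->prog->type) {
        case BPF_PROG_TYPE_LWT_XMIT:
                if (t == BPF_WRITE)
                        env->seen_direct_write = true;
                /* fallthrough */
        case BPF_PROG_TYPE_SCHED_CLS:
        case BPF_PROG_TYPE_SCHED_ACT:
        case BPF_PROG_TYPE_XDP:
                return true;
        default:
                return false;
        }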
      Reported-by: David Binderman <dcb314@hotmail.com>
      Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 13 February 2017, 1 commit
    • bpf: introduce BPF_F_ALLOW_OVERRIDE flag · 7f677633
      Committed by Alexei Starovoitov
      If the BPF_F_ALLOW_OVERRIDE flag is used in the BPF_PROG_ATTACH command
      for a given cgroup, descendant cgroups will be able to override the
      effective bpf program that was inherited from this cgroup.
      By default it's not passed, therefore overriding is disallowed.
      
      Examples:
      1.
      prog X attached to /A with default
      prog Y fails to attach to /A/B and /A/B/C
      Everything under /A runs prog X
      
      2.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A/B with default (non-override)
      prog M attached to /A/B with allow_override.
      Everything under /A/B runs prog M only.
      
      3.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A with default.
      The user has to detach first to switch the mode.
      
      In the future this behavior may be extended with a chain of
      non-overridable programs.
      
      Also fix the bug where detaching from a cgroup with nothing attached
      was not returning an error. Return ENOENT in such a case.
      
      Add several testcases and adjust libbpf.
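
      A hedged user space sketch of attaching with the new flag via the raw
      syscall (the attach type here is picked arbitrarily for illustration):

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* attach prog_fd to cgroup_fd so descendant cgroups may override it */
        static int attach_allow_override(int cgroup_fd, int prog_fd)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.target_fd     = cgroup_fd;
                attr.attach_bpf_fd = prog_fd;
                attr.attach_type   = BPF_CGROUP_INET_INGRESS;
                attr.attach_flags  = BPF_F_ALLOW_OVERRIDE;

                return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }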
      
      Fixes: 30070984 ("cgroup: add support for eBPF programs")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Daniel Mack <daniel@zonque.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 09 February 2017, 1 commit
    • bpf, lpm: fix overflows in trie_alloc checks · c502faf9
      Committed by Daniel Borkmann
      Cap the maximum (total) value size and bail out if larger than KMALLOC_MAX_SIZE,
      as otherwise it doesn't make any sense to proceed further, since we're
      guaranteed to fail to allocate elements anyway in lpm_trie_node_alloc();
      likelihood of failure is still high for large values, though, similar
      to the htab case in non-prealloc mode.
      
      Next, make sure that cost vars are really u64 instead of size_t, so that we
      don't overflow on 32 bit and end up charging only a tiny map.pages against
      memlock while allowing huge max_entries; also cap the max cost like we do
      with other map types.
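
      A simplified sketch of those checks (structure is illustrative; the exact
      bounds and constants are whatever the kernel uses):

        u64 cost_per_node, cost;

        /* total per-node size must stay kmalloc-able */
        if (attr->value_size > KMALLOC_MAX_SIZE - sizeof(struct lpm_trie_node))
                return ERR_PTR(-E2BIG);

        cost_per_node = sizeof(struct lpm_trie_node) + attr->value_size +
                        trie->data_size;
        cost = sizeof(*trie) + (u64) attr->max_entries * cost_per_node;
        if (cost >= U32_MAX - PAGE_SIZE)
                return ERR_PTR(-E2BIG);         /* cap like other map types */

        trie->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;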
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 07 February 2017, 1 commit
  13. 26 January 2017, 1 commit
    • bpf: add initial bpf tracepoints · a67edbf4
      Committed by Daniel Borkmann
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For bpf(2) syscall, tracepoints have been placed for main commands
      when they succeed. In the XDP case, the tracepoint is for exceptions, that
      is, f.e. on an abnormal BPF program exit such as an unknown or XDP_ABORTED
      return code, or when an error occurs during the XDP_TX action and the packet
      could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get in our way in
      the future, they can also be removed [1]. Of course, these tracepoints (like
      any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 25 January 2017, 1 commit
    • bpf: enable verifier to better track const alu ops · 3fadc801
      Committed by Daniel Borkmann
      William reported a couple of issues in relation to direct packet
      access. The typical scheme is to check for data + [off] <= data_end,
      where [off] can be either an immediate or come from a tracked
      register that contains an immediate; depending on the branch, we
      can then access the data. However, when [off] is calculated
      for either the mentioned test itself or for access after the test
      in a more "complex" way, the verifier will stop tracking the
      CONST_IMM marked register and will mark it as an UNKNOWN_VALUE one.
      
      When that UNKNOWN_VALUE typed register is added to a pkt() marked
      register, the verifier then bails out in check_packet_ptr_add()
      as it finds the register's imm value below 48. In the first below
      example, that is due to evaluate_reg_imm_alu() not handling right
      shifts and thus marking the register as UNKNOWN_VALUE via helper
      __mark_reg_unknown_value() that resets imm to 0.
      
      In the second case the same happens at the time when r4 is set
      to r4 &= r5, where it transitions to UNKNOWN_VALUE from
      evaluate_reg_imm_alu(). Later, we shift r4 right by 3 inside
      evaluate_reg_alu(), where the register's imm turns into 3. That
      is, for registers with type UNKNOWN_VALUE, imm of 0 means that
      we don't know what value the register has, and for imm > 0 it
      means that the value has [imm] upper zero bits. F.e. when shifting
      an UNKNOWN_VALUE register by 3 to the right, no matter what value
      it had, we know that the 3 uppermost bits must be zero now.
      This is to make sure that ALU operations with unknown registers
      don't overflow. Meaning, once we know that we have more than 48
      upper zero bits, or, in other words cannot go beyond 0xffff offset
      with ALU ops, such an addition will track the target register
      as a new pkt() register with a new id, but 0 offset and 0 range,
      so for that a new data/data_end test will be required. If the source
      register is a CONST_IMM one that is to be added to the pkt() register,
      or the source instruction is an add instruction with an immediate
      value, then it will get added if it stays within the max 0xffff bounds.
      From there, the pkt() type can be accessed should reg->off + imm be
      within the access range of pkt().
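
      To make the imm semantics above concrete, a toy model (not kernel code)
      of this bookkeeping:

        /* for UNKNOWN_VALUE registers, imm counts guaranteed upper zero bits */
        struct toy_reg {
                unsigned int imm;
        };

        static void toy_rsh(struct toy_reg *r, unsigned int shift)
        {
                r->imm += shift;        /* right shift adds known upper zero bits */
                if (r->imm > 64)
                        r->imm = 64;
        }

        static int toy_may_add_to_pkt(const struct toy_reg *r)
        {
                return r->imm >= 48;    /* value then stays within 0xffff */
        }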
      
        [...]
        from 28 to 30: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=22) R2=pkt_end
          R3=imm144,min_value=144,max_value=144
          R4=imm0,min_value=0,max_value=0
          R5=inv48,min_value=2054,max_value=2054 R10=fp
        30: (bf) r5 = r3
        31: (07) r5 += 23
        32: (77) r5 >>= 3
        33: (bf) r6 = r1
        34: (0f) r6 += r5
        cannot add integer value with 0 upper zero bits to ptr_to_packet
      
        [...]
        from 52 to 80: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
          R4=imm272 R5=inv56,min_value=17,max_value=17
          R6=pkt(id=0,off=26,r=34) R10=fp
        80: (07) r4 += 71
        81: (18) r5 = 0xfffffff8
        83: (5f) r4 &= r5
        84: (77) r4 >>= 3
        85: (0f) r1 += r4
        cannot add integer value with 3 upper zero bits to ptr_to_packet
      
      Thus, to get the above use cases working, evaluate_reg_imm_alu() has
      been extended for further ALU ops. This is fine, because we only
      operate strictly within the realm of CONST_IMM types, so here we don't
      care about overflows as they will happen in the simulated but also
      real execution and interaction with pkt() in check_packet_ptr_add()
      will check actual imm value once added to pkt(), but it's irrelevant
      before.
      
      With regards to 06c1c049 ("bpf: allow helpers access to variable
      memory") that works on UNKNOWN_VALUE registers, the verifier becomes
      now a bit smarter as it can better resolve ALU ops, so we need to
      adapt two test cases there, as min/max bound tracking only becomes
      necessary when registers were spilled to stack. So while a mask was
      set before to track the upper bound for the UNKNOWN_VALUE case, it's now
      resolved directly as CONST_IMM, and such constructs are only necessary
      when f.e. registers are spilled.
      
      For commit 6b173873 ("bpf: recognize 64bit immediate loads as
      consts") that initially enabled dw load tracking only for nfp jit/
      analyzer, I did a couple of tests on large, complex programs and we
      don't increase complexity badly (my tests were in the ~3% range on avg).
      I've added a couple of tests similar to the affected code above, and
      it works fine with the verifier now.
      Reported-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Gianluca Borello <g.borello@gmail.com>
      Cc: William Tu <u9012063@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 24 January 2017, 2 commits
  16. 19 January 2017, 1 commit
    • bpf: don't trigger OOM killer under pressure with map alloc · d407bd25
      Committed by Daniel Borkmann
      This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
      that are to be used for map allocations. Using kmalloc() for very large
      allocations can cause excessive work within the page allocator, so i) fall
      back earlier to vmalloc() when the attempt is considered costly anyway,
      and even more importantly ii) don't trigger OOM killer with any of the
      allocators.
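
      The helpers presumably end up along these lines (sketch of the approach,
      not necessarily the exact kernel code):

        void *bpf_map_area_alloc(size_t size)
        {
                /* avoid retries and the OOM killer, stay quiet on failure */
                const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY;
                void *area;

                if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                        area = kmalloc(size, GFP_USER | flags);
                        if (area)
                                return area;
                }
                /* costly or failed kmalloc attempts fall back to vmalloc */
                return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
                                 PAGE_KERNEL);
        }

        void bpf_map_area_free(void *area)
        {
                kvfree(area);
        }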
      
      Since this is based on a user space request, for example, when creating
      maps with element pre-allocation, we really want such requests to fail
      instead of killing other user space processes.
      
      Also, don't spam the kernel log with warnings should any of the allocations
      fail under pressure. Given that, we can make backend selection in
      bpf_map_area_alloc() generic, and convert all maps over to use this API
      for spots with potentially large allocation requests.
      
      Note, replacing the one kmalloc_array() is fine as overflow checks happen
      earlier in htab_map_alloc(), since it must also protect the multiplication
      for vmalloc() should kmalloc_array() fail.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 17 January 2017, 1 commit
    • bpf: rework prog_digest into prog_tag · f1f7714e
      Committed by Daniel Borkmann
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to
      admittedly suboptimal name of "prog_digest" in combination
      with sha1 hash usage, thus inevitably and rightfully concerns
      about its security in terms of collision resistance were
      raised with regards to use-cases.
      
      The intended use cases are for debugging resp. introspection
      only for providing a stable "tag" over the instruction sequence
      that both kernel and user space can calculate independently.
      It's not usable at all for making a security relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 12 January 2017, 2 commits
    • bpf: allow b/h/w/dw access for bpf's cb in ctx · 62c7989b
      Committed by Daniel Borkmann
      When structs are used to store temporary state in cb[] buffer that is
      used with programs and among tail calls, then the generated code will
      not always access the buffer in bpf_w chunks. We can ease programming
      of it and let this act more naturally by allowing aligned b/h/w/dw
      sized access to the cb[] ctx member. Various test cases are attached as
      well for the selftest suite. Potentially, this can also be reused for
      other program types to pass data around.
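
      An illustrative (hypothetical) program showing the kind of access this
      enables:

        /* small state kept in skb->cb[] across tail calls; cb[] is 20 bytes */
        struct pkt_state {
                __u8  stage;
                __u8  flags;
                __u16 vlan;
                __u32 cookie;
        };

        SEC("classifier")
        int cls_main(struct __sk_buff *skb)
        {
                struct pkt_state *st = (struct pkt_state *)skb->cb;

                st->stage = 1;          /* byte-sized (BPF_B) store into cb[] */
                st->vlan  = 42;         /* half-word (BPF_H) store, now allowed */
                return st->cookie ? 1 : 0;      /* word-sized load as before */
        }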
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: pass original insn directly to convert_ctx_access · 6b8cc1d1
      Committed by Daniel Borkmann
      Currently, when calling convert_ctx_access() callback for the various
      program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
      the original instruction. This information is needed to rewrite the
      instruction that is based on the user ctx structure into a kernel
      representation for the ctx. As we'd like to allow access sizes beyond
      just BPF_W, we'd also need insn->code in order to decode the
      original access size. Given that, let's just pass insn directly to the
      convert_ctx_access() callback and work on that, so as not to clutter the
      callback with even more arguments when everything is
      already contained in insn. So let's go through that once, no functional
      change.
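
      After the change, the callback presumably takes the instruction itself,
      roughly:

        /* before: (type, dst_reg, src_reg, ctx_off, insn_buf, prog)
         * after:  the original insn is handed over as a whole
         */
        u32 (*convert_ctx_access)(enum bpf_access_type type,
                                  const struct bpf_insn *si,
                                  struct bpf_insn *insn_buf,
                                  struct bpf_prog *prog);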
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 11 January 2017, 3 commits
  20. 10 January 2017, 5 commits
    • bpf: rename ARG_PTR_TO_STACK · 39f19ebb
      Committed by Alexei Starovoitov
      Since ARG_PTR_TO_STACK is no longer just a pointer to stack,
      rename it to ARG_PTR_TO_MEM and adjust the comment.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow helpers access to variable memory · 06c1c049
      Committed by Gianluca Borello
      Currently, helpers that read and write from/to the stack can do so using
      a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
      ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
      that the verifier can safely check the memory access. However, requiring
      the argument to be a constant can be limiting in some circumstances.
      
      Since the current logic keeps track of the minimum and maximum value of
      a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
      be changed to also accept an UNKNOWN_VALUE register in case its
      boundaries have been set and the range doesn't cause invalid memory
      accesses.
      
      One common situation when this is useful:
      
      int len;
      char buf[BUFSIZE]; /* BUFSIZE is 128 */
      
      if (some_condition)
      	len = 42;
      else
      	len = 84;
      
      some_helper(..., buf, len & (BUFSIZE - 1));
      
      The compiler can often decide to assign the constant values 42 or 84
      into a variable on the stack, instead of keeping it in a register. When
      the variable is then read back from stack into the register in order to
      be passed to the helper, the verifier will not be able to recognize the
      register as constant (the verifier is not currently tracking all
      constant writes into memory), and the program won't be valid.
      
      However, by allowing the helper to accept an UNKNOWN_VALUE register,
      this program will work because the bitwise AND operation will set the
      range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
      so the verifier can guarantee the helper call will be safe (assuming the
      argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
      check against 0 would be needed). Custom ranges can be set not only with
      ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
      register with constants.
      
      Another very common example happens when intercepting system call
      arguments and accessing user-provided data of variable size using
      bpf_probe_read(). One can load at runtime the user-provided length in an
      UNKNOWN_VALUE register, and then read that exact amount of data up to a
      compile-time determined limit in order to fit into the proper local
      storage allocated on the stack, without having to guess a suboptimal
      access size at compile time.
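
      A hedged sketch of that pattern (the args pointer and its fields are
      hypothetical):

        char buf[256];
        int len = 0;

        bpf_probe_read(&len, sizeof(len), &args->len);  /* runtime length */
        len &= sizeof(buf) - 1;                         /* bound to [0, 255] */
        if (len > 0)
                bpf_probe_read(buf, len, args->data);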
      
      Also, in case the helpers accepting the UNKNOWN_VALUE register operate
      in raw mode, disable the raw mode so that the program is required to
      initialize all memory, since there is no guarantee the helper will fill
      it completely, leaving possibilities for data leak (just relevant when
      the memory used by the helper is the stack, not when using a pointer to
      map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
      be treated as ARG_PTR_TO_STACK.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow adjusted map element values to spill · f0318d01
      Committed by Gianluca Borello
      Commit 48461135 ("bpf: allow access into map value arrays")
      introduces the ability to do pointer math inside a map element value via
      the PTR_TO_MAP_VALUE_ADJ register type.
      
      The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
      is spilled into the stack, limiting several use cases, especially when
      generating bpf code from a compiler.
      
      Handle this case by explicitly enabling the register type
      PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
      max_value are reset just for BPF_LDX operations that don't result in a
      restore of a spilled register from stack.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow helpers access to map element values · 5722569b
      Committed by Gianluca Borello
      Enable helpers to directly access a map element value by passing a
      register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
      arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
      
      This enables several use cases. For example, a typical tracing program
      might want to capture pathnames passed to sys_open() with:
      
      struct trace_data {
      	char pathname[PATHLEN];
      };
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	struct trace_data data;
      	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
      
      	/* consume data.pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      }
      
      Such a program could easily hit the stack limit in case PATHLEN needs to
      be large or more local variables need to exist, both of which are quite
      common scenarios. Allowing direct helper access to map element values,
      one could do:
      
      struct bpf_map_def SEC("maps") scratch_map = {
      	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
      	.key_size = sizeof(u32),
      	.value_size = sizeof(struct trace_data),
      	.max_entries = 1,
      };
      
      SEC("kprobe/sys_open")
      int bpf_sys_open(struct pt_regs *ctx)
      {
      	int id = 0;
      	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
      	if (!p)
      		return 0;
      	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
      
      	/* consume p->pathname, for example via
      	 * bpf_trace_printk() or bpf_perf_event_output()
      	 */
      	return 0;
      }
      
      And wouldn't risk exhausting the stack.
      
      Code changes are loosely modeled after commit 6841de8b ("bpf: allow
      helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
      changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
      ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
      trivial, but since there is not currently a use case for that, it's
      reasonable to limit the set of changes.
      
      Also, add new tests to make sure accesses to map element values from
      helpers never go out of boundary, even when adjusted.
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: split check_mem_access logic for map values · dbcfe5f7
      Committed by Gianluca Borello
      Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
      check_mem_access() to a separate helper check_map_access_adj(). This
      enables using those checks in other parts of the verifier as well,
      where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
      example when checking helper function arguments. The same thing is
      already happening for other types such as PTR_TO_PACKET and its
      check_packet_access() helper.
      
      The code has been copied verbatim, with the only difference of removing
      the "off += reg->max_value" statement and moving the sum into the call
      statement to check_map_access(), as that was only needed due to the
      earlier common check_map_access() call.
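
      A simplified sketch of the resulting helper, per the description above:

        static int check_map_access_adj(struct bpf_verifier_env *env, u32 regno,
                                        int off, int size)
        {
                struct bpf_reg_state *reg = &env->cur_state.regs[regno];
                int err;

                /* the adjusted offset may lie anywhere in [min_value, max_value],
                 * so both ends of the range must stay within the map value
                 */
                err = check_map_access(env, regno, reg->min_value + off, size);
                if (err)
                        return err;
                return check_map_access(env, regno, reg->max_value + off, size);
        }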
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 18 December 2016, 3 commits
    • bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Committed by Daniel Borkmann
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason it
      wasn't noticed before is that, as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, this reg's id was cleared when
      calling mark_reg_unknown_value(), so that every subsequent register
      was probed against an id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type, is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
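
      The fix then roughly takes this shape (sketch based on the description):

        /* no bounds check: also usable for spilled regs / stack slots */
        static void __mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
        {
                regs[regno].type = UNKNOWN_VALUE;
                regs[regno].id = 0;
                regs[regno].imm = 0;
        }

        static void mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
        {
                BUG_ON(regno >= MAX_BPF_REG);   /* only for the non-spilled case */
                __mark_reg_unknown_value(regs, regno);
        }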
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix overflow in prog accounting · 5ccb071e
      Committed by Daniel Borkmann
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      its original value that we used to bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is freed eventually. I noticed this when, despite having unlimited
      memlock, programs suddenly refused to load with an EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) we want accounting
      to rather be more accurate instead of further fooling the limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
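
      The realloc-time delta accounting can be sketched as follows (helper name
      and exact placement are assumptions, not the literal kernel diff):

        static int bpf_prog_charge_delta(struct user_struct *user,
                                         u32 old_pages, u32 new_pages)
        {
                unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
                u32 delta = new_pages - old_pages;

                if (atomic_long_add_return(delta, &user->locked_vm) > memlock_limit) {
                        atomic_long_sub(delta, &user->locked_vm);
                        return -EPERM;
                }
                return 0;
        }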
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Committed by Daniel Borkmann
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of
      variable 'raw' from bss section, and should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant during the program creation path, which can be
      considered slow-path anyway, let's allocate that dynamically and not be
      implicitly dependent on the verifier mutex. Move bpf_prog_calc_digest() to
      the beginning of replace_map_fd_with_map_ptr(); error handling also
      stays straightforward.
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 17 December 2016, 1 commit
    • bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Committed by Daniel Borkmann
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that the verifier needs to perform a lot more work for the
      same program with regard to the involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE, where the id becomes stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      already pointed to 57a09bf0, where we really go beyond the complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers in type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, therefore also id was relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
      its id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
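
      A sketch of the corrected marking logic, simplified from the description:

        static void mark_map_reg(struct bpf_reg_state *regs, u32 regno,
                                 u32 id, bool is_null)
        {
                struct bpf_reg_state *reg = &regs[regno];

                if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
                        reg->type = is_null ? UNKNOWN_VALUE : PTR_TO_MAP_VALUE;
                        reg->id = 0;    /* id is not needed past the transition */
                }
        }

        static void mark_map_regs(struct bpf_verifier_state *state, u32 regno,
                                  bool is_null)
        {
                u32 id = state->regs[regno].id; /* cache before it gets reset */
                u32 i;

                for (i = 0; i < MAX_BPF_REG; i++)
                        mark_map_reg(state->regs, i, id, is_null);
                /* spilled registers on the stack are walked the same way */
        }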
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 09 December 2016, 2 commits
  24. 08 December 2016, 1 commit
    • bpf: fix loading of BPF_MAXINSNS sized programs · ef0915ca
      Committed by Daniel Borkmann
      The general assumption is that a single program can hold up to BPF_MAXINSNS,
      that is, 4096 instructions. It is the case with cBPF and
      that limit was carried over to eBPF. When recently testing digest, I
      noticed that it's actually not possible to feed 4096 instructions
      via bpf(2).
      
      The check for > BPF_MAXINSNS was added back then to bpf_check() in
      cbd35700 ("bpf: verifier (add ability to receive verification log)").
      However, 09756af4 ("bpf: expand BPF syscall with program load/unload")
      added yet another check that comes before that into bpf_prog_load(),
      but this time bails out already in case of >= BPF_MAXINSNS.
      
      Fix it up and perform the check early in bpf_prog_load(), so we can drop
      the second one in bpf_check(). It makes sense, because a 0 insn
      program is useless as well and we don't want to waste any resources doing work
      up to the bpf_check() point. The existing bpf(2) man page documents E2BIG
      as the official error for such cases, so just stick with it as well.
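
      The corrected early check in bpf_prog_load() is then roughly:

        /* reject empty and oversized programs before doing any further work */
        if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS)
                return -E2BIG;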
      
      Fixes: 09756af4 ("bpf: expand BPF syscall with program load/unload")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>