1. 16 November 2016 (4 commits)
    • bpf: Add BPF_MAP_TYPE_LRU_HASH · 29ba732a
      Committed by Martin KaFai Lau
      Provide an LRU version of the existing BPF_MAP_TYPE_HASH.
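      As a quick illustration (not part of this patch; the function name and
      sizes below are made up for the example), such a map could be created
      from user space through the bpf(2) syscall like any other hash map, only
      with the new map type:

      #include <linux/bpf.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Illustrative sketch: create an LRU hash map via BPF_MAP_CREATE. */
      static int create_lru_hash_sketch(int key_size, int value_size,
      				  int max_entries)
      {
      	union bpf_attr attr;

      	memset(&attr, 0, sizeof(attr));
      	attr.map_type    = BPF_MAP_TYPE_LRU_HASH; /* LRU variant of HASH */
      	attr.key_size    = key_size;
      	attr.value_size  = value_size;
      	attr.max_entries = max_entries;

      	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      }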
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29ba732a
    • bpf: Refactor codes handling percpu map · fd91de7b
      Committed by Martin KaFai Lau
      Refactor the code that populates the value
      of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
      typed bpf_map.
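      A hypothetical sketch of the copy loop such a refactor centralizes (the
      names here are illustrative, not the refactored kernel functions): the
      same value is written into every possible CPU's slot of the per-cpu area.

      #include <linux/percpu.h>
      #include <linux/string.h>

      /* Illustrative sketch: copy 'value' into each CPU's slot. */
      static void pcpu_copy_value_sketch(void __percpu *pptr,
      				   const void *value, u32 size)
      {
      	int cpu;

      	for_each_possible_cpu(cpu)
      		memcpy(per_cpu_ptr(pptr, cpu), value, size);
      }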
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd91de7b
    • bpf: Add percpu LRU list · 961578b6
      Committed by Martin KaFai Lau
      Instead of having a common LRU list, this patch allows a
      percpu LRU list which can be selected by specifying a map
      attribute.  The map attribute will be added in the later
      patch.
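      As a sketch of how the selection could look from user space (the
      attribute is only added later in this series; the flag name
      BPF_F_NO_COMMON_LRU used here is an assumption taken from the eventual
      uapi):

      union bpf_attr attr = {
      	.map_type    = BPF_MAP_TYPE_LRU_HASH,
      	.key_size    = sizeof(__u32),
      	.value_size  = sizeof(__u64),
      	.max_entries = 4096,
      	.map_flags   = BPF_F_NO_COMMON_LRU,	/* one LRU list per CPU */
      };
      /* the attr is then passed to BPF_MAP_CREATE as usual */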
      
      While the common use case for LRU is #reads >> #updates, a
      percpu LRU list allows a bpf prog to absorb an unusual number of
      #updates in pathological cases (e.g. an external-traffic-facing
      machine which could be under attack).
      
      The percpu LRU lists are isolated from each other.  The LRU nodes
      (including free nodes) cannot be moved across different LRU lists.
      
      Here is the update performance comparison between the
      common LRU list and the percpu LRU list (the test code is
      in the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      961578b6
    • bpf: LRU List · 3a08c2fd
      Committed by Martin KaFai Lau
      Introduce bpf_lru_list which will provide LRU capability to
      the bpf_htab in the later patch.
      
      * General Thoughts:
      1. Target use case: reads are more frequent than updates
         (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
         If a bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation as far as the LRU list
         is concerned.
      2. It may be useful to think of it as an LRU cache.
      3. Optimize for the read case:
         3.1 No lock in the read case.
         3.2 LRU maintenance is only done during bpf_update_elem().
      4. A percpu LRU list loses the system-wide LRU property.  A
         completely isolated percpu LRU list has the best performance,
         but the memory utilization is not ideal considering the workload
         may be imbalanced.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list, batching operations before accessing the global LRU list.
         As an LRU cache where #reads >> #updates/#inserts, it will work well.
      6. There is a per-cpu local list named 'struct bpf_lru_locallist'.
         This local list is not used to maintain the LRU ordering.  Instead,
         it is there to batch enough operations before acquiring the lock
         of the global LRU list.  More details on this later.
      7. A later patch allows a percpu LRU list, selected by a map
         attribute, for scalability reasons and for use cases that need to
         prepare for the worst (and pathological) case like a DoS attack.
         The percpu LRU lists are completely isolated from each other, and
         LRU nodes (including free nodes) cannot be moved across lists.  The
         following description is for the global LRU list but mostly applies
         to the percpu LRU list as well.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit in the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
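      A simplified structural sketch of the above (illustrative names, not
      the exact kernel layout):

      #include <linux/list.h>
      #include <linux/spinlock.h>

      /* Sketch: one node per pre-allocated element. */
      struct bpf_lru_node_sketch {
      	struct list_head list;
      	u8 ref;				/* ref-bit, set during bpf_lookup_elem() */
      };

      /* Sketch: the global LRU list and its three sub-lists. */
      struct bpf_lru_list_sketch {
      	struct list_head active;	/* frequently accessed nodes */
      	struct list_head inactive;	/* eviction candidates */
      	struct list_head free;		/* pre-allocated, unused nodes */
      	raw_spinlock_t   lock;		/* the lru_lock discussed below */
      };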
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name suggests, maintains the active set of
         nodes.  We can think of it as the working set, or the more frequently
         accessed nodes.  The access frequency is approximated by a ref-bit,
         which is set during bpf_lookup_elem().
      2. The inactive list, likewise, maintains the less active set of
         nodes.  They are the candidates to be removed from the bpf_htab
         when we are running out of free nodes.
      3. The ordering of these two lists acts as a rough clock.
         The tail of the inactive list holds the older nodes, which
         should be released first when the bpf_htab needs a free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         "Inactive running low" is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         nodes without the ref-bit set to the head of the inactive list.
         It moves nodes with the ref-bit set back to the head of the active
         list and then clears their ref-bit.
      4. The inactive rotation is pretty simple.
         It walks the inactive list and moves a node back to the head of the
         active list if its ref-bit is set.  The ref-bit is cleared after the
         move to the active list.
         If the node does not have the ref-bit set, it is left as is
         because it is already in the inactive list.
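      A sketch of the inactive rotation in item (4), building on the
      structural sketch above (illustrative only):

      static void rotate_inactive_sketch(struct bpf_lru_list_sketch *l)
      {
      	struct bpf_lru_node_sketch *node, *tmp;

      	list_for_each_entry_safe(node, tmp, &l->inactive, list) {
      		if (!node->ref)
      			continue;		/* already inactive, leave it */
      		node->ref = 0;			/* clear ref-bit on promotion */
      		list_move(&node->list, &l->active);
      	}
      }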
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it walks the inactive list from the tail and
         deletes the nodes without the ref-bit set from the bpf_htab.
      4. If no free node is found after step (3), it forcefully takes
         one node from the tail of the inactive or active list.  "Forcefully"
         means that it ignores the ref-bit.
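      A sketch of steps (3) and (4), again with illustrative names; the
      forced reclaim of step (4) is left to the caller:

      static struct bpf_lru_node_sketch *
      shrink_inactive_sketch(struct bpf_lru_list_sketch *l)
      {
      	struct bpf_lru_node_sketch *node, *tmp;

      	/* Walk from the tail, i.e. the oldest nodes first. */
      	list_for_each_entry_safe_reverse(node, tmp, &l->inactive, list) {
      		if (node->ref)
      			continue;		/* referenced, keep it */
      		list_del(&node->list);		/* caller also removes it from the bpf_htab */
      		return node;
      	}
      	return NULL;	/* step (4): caller reclaims forcefully, ignoring the ref-bit */
      }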
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it first tries to get a node from the
         free-list of the current CPU's local list.
      4. If the local free-list is empty, it acquires nodes from the
         global LRU list.  The global LRU list can satisfy the request
         either from its global free-list or by shrinking the global
         inactive list.  Since the global LRU list lock has already been
         acquired, it tries to move up to LOCAL_FREE_TARGET elements
         to the local free-list.
      5. When a new element is added to the bpf_htab, it first sits in
         the pending-list of the local list.  The pending-list is flushed
         to the global LRU list the next time free nodes need to be
         acquired from the global list.
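      A rough sketch of the allocation path in items (3)-(5).  Every name
      here is hypothetical: 'struct bpf_lru_locallist_sketch' and the helpers
      local_free_pop(), flush_pending_to_global() and refill_local_free()
      only stand in for the steps described above.

      static struct bpf_lru_node_sketch *
      lru_get_free_node_sketch(struct bpf_lru_locallist_sketch *loc,
      			 struct bpf_lru_list_sketch *global)
      {
      	struct bpf_lru_node_sketch *node;

      	node = local_free_pop(loc);		/* (3) try the local free-list first */
      	if (node)
      		return node;

      	raw_spin_lock(&global->lock);		/* lru_lock */
      	flush_pending_to_global(loc, global);	/* (5) flush the local pending-list */
      	refill_local_free(loc, global);		/* (4) move up to LOCAL_FREE_TARGET nodes */
      	raw_spin_unlock(&global->lock);

      	return local_free_pop(loc);
      }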
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, the two locks are never held together (i.e. one
      lock is always released before another lock is acquired).
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a08c2fd
  2. 15 November 2016 (1 commit)
  3. 13 November 2016 (1 commit)
  4. 10 November 2016 (1 commit)
  5. 08 November 2016 (2 commits)
  6. 01 November 2016 (1 commit)
    • bpf, inode: add support for symlinks and fix mtime/ctime · 0f98621b
      Committed by Daniel Borkmann
      While commit bb35a6ef ("bpf, inode: allow for rename and link ops")
      added support for hard links that can be used for prog and map nodes,
      this work adds simple symlink support, which can be used f.e. for
      directories, also when unprivileged, and works with cmdline tooling that
      understands S_IFLNK anyway. Since the switch in e27f4a94 ("bpf: Use
      mount_nodev not mount_ns to mount the bpf filesystem"), there can be
      various mount instances with mount_nodev() and thus the hierarchy can be
      flattened to facilitate object sharing. Thus, we can keep bpf tooling
      also working by repointing paths.
      
      Most of the functionality can be used from vfs library operations. The
      symlink is stored in the inode itself, that is in i_link, which is
      sufficient in our case as opposed to storing it in the page cache.
      While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
      the directory's mtime and ctime, so add a common helper for it called
      bpf_dentry_finalize() that takes care of it for all cases now.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f98621b
  7. 30 October 2016 (1 commit)
  8. 23 October 2016 (1 commit)
  9. 19 October 2016 (1 commit)
    • bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers · 57a09bf0
      Committed by Thomas Graf
      A BPF program is required to check the return register of a
      map_elem_lookup() call before accessing memory. The verifier keeps
      track of this by converting the type of the result register from
      PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
      jump ensures safety. This check is currently exclusively performed
      for the result register 0.
      
      In the event the compiler reorders instructions, BPF_MOV64_REG
      instructions may be moved before the conditional jump which causes
      them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
      verifier objects when the register is accessed:
      
      0: (b7) r1 = 10
      1: (7b) *(u64 *)(r10 -8) = r1
      2: (bf) r2 = r10
      3: (07) r2 += -8
      4: (18) r1 = 0x59c00000
      6: (85) call 1
      7: (bf) r4 = r0
      8: (15) if r0 == 0x0 goto pc+1
       R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
      9: (7a) *(u64 *)(r4 +0) = 0
      R4 invalid mem access 'map_value_or_null'
      
      This commit extends the verifier to keep track of all identical
      PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
      assigning them an ID and then marking them all when the conditional
      jump is observed.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57a09bf0
  10. 29 September 2016 (1 commit)
    • bpf: allow access into map value arrays · 48461135
      Committed by Josef Bacik
      Suppose you have a map array value that is something like this
      
      struct foo {
      	unsigned iter;
      	int array[SOME_CONSTANT];
      };
      
      You can easily insert this into an array, but you cannot modify the contents of
      foo->array[] after the fact.  This is because we have no way to verify we won't
      go off the end of the array at verification time.  This patch provides a start
      for this work.  We accomplish this by keeping track of a minimum and maximum
      value a register could be while we're checking the code.  Then at the time we
      try to do an access into a MAP_VALUE we verify that the maximum offset into that
      region is a valid access into that memory region.  So in practice, code such as
      this
      
      unsigned index = 0;
      
      if (foo->iter >= SOME_CONSTANT)
      	foo->iter = index;
      else
      	index = foo->iter++;
      foo->array[index] = bar;
      
      would be allowed, as we can verify that index will always be between 0 and
      SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
      check to make sure the index isn't less than 0, or do something like index %=
      SOME_CONSTANT.
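      A sketch of the extra check mentioned for signed values (using the
      struct from above):

      int idx = foo->iter;

      if (idx >= 0 && idx < SOME_CONSTANT)	/* reject negative and out-of-range indices */
      	foo->array[idx] = bar;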
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48461135
  11. 28 September 2016 (2 commits)
  12. 27 September 2016 (3 commits)
    • fs: rename "rename2" i_op to "rename" · 2773bf00
      Committed by Miklos Szeredi
      Generated patch:
      
      sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
      sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      2773bf00
    • libfs: support RENAME_NOREPLACE in simple_rename() · e0e0be8a
      Committed by Miklos Szeredi
      This is trivial to do:
      
       - add flags argument to simple_rename()
       - check that flags contains nothing other than RENAME_NOREPLACE (see
         the sketch after this list)
       - assign simple_rename() to .rename2 instead of .rename
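      A sketch of the flag check from the second item (only the check is
      shown; the rest of simple_rename() stays as it is):

      int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
      		  struct inode *new_dir, struct dentry *new_dentry,
      		  unsigned int flags)
      {
      	if (flags & ~RENAME_NOREPLACE)
      		return -EINVAL;
      	/* ... existing rename logic ... */
      }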
      
      Filesystems converted:
      
      hugetlbfs, ramfs, bpf.
      
      Debugfs uses simple_rename() to implement debugfs_rename(), which is for
      debugfs instances to rename files internally, not for userspace filesystem
      access.  For this case pass zero flags to simple_rename().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      e0e0be8a
    • bpf: Set register type according to is_valid_access() · 1955351d
      Committed by Mickaël Salaün
      This prevents future potential pointer leaks when an unprivileged eBPF
      program reads a pointer value from its context. Even if
      is_valid_access() returns a pointer type, the eBPF verifier replaces it
      with UNKNOWN_VALUE. The register value that contains a kernel address is
      then allowed to leak. Moreover, this fix allows unprivileged eBPF
      programs to use functions with (legitimate) pointer arguments.
      
      Not an issue currently since reg_type is only set for PTR_TO_PACKET or
      PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
      privileged. For now, the only unprivileged eBPF program allowed is for
      socket filtering and all the types from its context are UNKNOWN_VALUE.
      However, this fix is important for future unprivileged eBPF programs
      which could use pointers in their context.
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1955351d
  13. 22 September 2016 (4 commits)
  14. 21 September 2016 (2 commits)
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      Committed by Daniel Borkmann
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both read and
      write, because they would otherwise be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can just pull them in once with the new helper, retest and eventually
      access them.
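      A sketch of that pull-and-retest pattern in a tc (cls_act) program; the
      64-byte length and the helper declaration style are illustrative
      assumptions:

      #include <linux/bpf.h>
      #include <linux/pkt_cls.h>

      static int (*bpf_skb_pull_data)(struct __sk_buff *skb, __u32 len) =
      	(void *)BPF_FUNC_skb_pull_data;

      static inline int handle_sketch(struct __sk_buff *skb)
      {
      	void *data_end = (void *)(long)skb->data_end;
      	void *data     = (void *)(long)skb->data;

      	if (data + 64 > data_end) {
      		/* data may sit in non-linear parts: pull it in once ... */
      		if (bpf_skb_pull_data(skb, 64))
      			return TC_ACT_OK;
      		/* ... then re-load the pointers and retest */
      		data_end = (void *)(long)skb->data_end;
      		data     = (void *)(long)skb->data;
      		if (data + 64 > data_end)
      			return TC_ACT_OK;
      	}
      	/* direct read/write of the first 64 bytes is now allowed */
      	return TC_ACT_OK;
      }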
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and has read access for the most parts with,
      for example, drop return codes, then such write action can be delegated
      to a tail called program for mitigating this cost of potential uncloning
      to a late point in time where it would have been paid similarly with the
      bpf_skb_store_bytes() as well. The advantage of direct write is that the
      writes are inlined, whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() even for small sizes; the
      cost of the helper call itself and its sanity checks is also avoided. Plus, when
      direct read is already used, we don't need to cache or perform rechecks
      on the data boundaries (due to verifier invalidating previous checks for
      helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed for
      this are those part of bpf_helper_changes_skb_data() since they can change
      underlying data, and those that write into memory as this could happen for
      packet typed args when still cloned. The bpf_csum_update() helper accommodates
      the fact that we need to fix up checksum_complete when using direct write
      instead of bpf_skb_store_bytes(), meaning the programs can use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36bbef52
    • bpf, verifier: enforce larger zero range for pkt on overloading stack buffs · b399cf64
      Committed by Daniel Borkmann
      Current contract for the following two helper argument types is:
      
        * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
        * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
          (NULL, 0) or (ptr, >0).
      
      With 6841de8b ("bpf: allow helpers access the packet directly"), we can
      pass also raw packet data to helpers, so depending on the argument type
      being PTR_TO_PACKET, we now either assert memory via check_packet_access()
      or check_stack_boundary(). As a result, the tests in check_packet_access()
      currently allow more than intended with regards to reg->imm.
      
      Back in 969bf05e ("bpf: direct packet access"), check_packet_access()
      was fine to ignore size argument since in check_mem_access() size was
      bpf_size_to_bytes() derived and prior to the call to check_packet_access()
      guaranteed to be larger than zero.
      
      However, for the above two argument types, it currently means we can have
      a size <= 0, thus breaking current guarantees for helpers. Enforce a
      check for size <= 0 and bail out if so.
      
      check_stack_boundary() doesn't have such an issue since it already tests
      for access_size <= 0 and bails out, resp. access_size == 0 in case of NULL
      pointer passed when allowed.
      
      Fixes: 6841de8b ("bpf: allow helpers access the packet directly")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b399cf64
  15. 10 September 2016 (2 commits)
    • bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
      Committed by Daniel Borkmann
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, eases writing future helpers,
      avoids getting the casting mess wrong, and allows for extending all helpers at
      once (f.e. build time checks, etc). It also helps detecting more easily in
      code reviews that unused registers are not instrumented in the code by accident,
      breaking compatibility with existing programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
      fundamental differences, for example, for generating the actual helper function
      that carries all u64 regs, we need to fill unused regs, so that we always end up
      with 5 u64 regs as an argument.
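      Roughly what a converted helper looks like with these macros (a sketch
      modeled on the bpf_map_lookup_elem() helper; the exact body may differ):

      BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
      {
      	WARN_ON_ONCE(!rcu_read_lock_held());
      	return (unsigned long) map->ops->map_lookup_elem(map, key);
      }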
      
      I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
      they look all as expected. No sparse issue spotted. We let this also sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3694e00
    • bpf: minor cleanups in helpers · 6088b582
      Committed by Daniel Borkmann
      Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
      it, and in __bpf_skb_max_len(), I missed that we always have skb->dev
      valid anyway, so we can drop the unneeded test for dev; a few more
      other misc bits are addressed here as well.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6088b582
  16. 09 September 2016 (1 commit)
    • bpf: fix range propagation on direct packet access · 2d2be8ca
      Committed by Daniel Borkmann
      LLVM can generate code that tests for direct packet access via
      skb->data/data_end in a way that currently gets rejected by the
      verifier, example:
      
        [...]
         7: (61) r3 = *(u32 *)(r6 +80)
         8: (61) r9 = *(u32 *)(r6 +76)
         9: (bf) r2 = r9
        10: (07) r2 += 54
        11: (3d) if r3 >= r2 goto pc+12
         R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
         R9=pkt(id=0,off=0,r=0) R10=fp
        12: (18) r4 = 0xffffff7a
        14: (05) goto pc+430
        [...]
      
        from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
                       R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
        24: (7b) *(u64 *)(r10 -40) = r1
        25: (b7) r1 = 0
        26: (63) *(u32 *)(r6 +56) = r1
        27: (b7) r2 = 40
        28: (71) r8 = *(u8 *)(r9 +20)
        invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)
      
      The reason why this gets rejected despite a proper test is that we
      currently call find_good_pkt_pointers() only in case where we detect
      tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
      derived, for example, from a register of type pkt(id=Y,off=0,r=0)
      pointing to skb->data. find_good_pkt_pointers() then fills the range
      in the current branch to pkt(id=Y,off=0,r=Z) on success.
      
      For above case, we need to extend that to recognize pkt_end >= rX
      pattern and mark the other branch that is taken on success with the
      appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
      Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
      only two practical options to test for from what LLVM could have
      generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
      that we would need to take into account as well.
      
      After the fix:
      
        [...]
         7: (61) r3 = *(u32 *)(r6 +80)
         8: (61) r9 = *(u32 *)(r6 +76)
         9: (bf) r2 = r9
        10: (07) r2 += 54
        11: (3d) if r3 >= r2 goto pc+12
         R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
         R9=pkt(id=0,off=0,r=0) R10=fp
        12: (18) r4 = 0xffffff7a
        14: (05) goto pc+430
        [...]
      
        from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
                       R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
        24: (7b) *(u64 *)(r10 -40) = r1
        25: (b7) r1 = 0
        26: (63) *(u32 *)(r6 +56) = r1
        27: (b7) r2 = 40
        28: (71) r8 = *(u8 *)(r9 +20)
        29: (bf) r1 = r8
        30: (25) if r8 > 0x3c goto pc+47
         R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
         R9=pkt(id=0,off=0,r=54) R10=fp
        31: (b7) r1 = 1
        [...]
      
      Verifier test cases are also added in this work, one that demonstrates
      the mentioned example here and one that tries a bad packet access for
      the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
      pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
      with both test variants (>, >=).
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d2be8ca
  17. 03 September 2016 (2 commits)
  18. 13 August 2016 (3 commits)
  19. 07 August 2016 (1 commit)
    • bpf: restore behavior of bpf_map_update_elem · a6ed3ea6
      Committed by Alexei Starovoitov
      The introduction of pre-allocated hash elements inadvertently broke
      the behavior of bpf hash maps where users expected to call
      bpf_map_update_elem() without considering that the map can be full.
      Some programs do:
      old_value = bpf_map_lookup_elem(map, key);
      if (old_value) {
        ... prepare new_value on stack ...
        bpf_map_update_elem(map, key, new_value);
      }
      Before pre-alloc, update() for an existing element would work even
      in the 'map full' condition. Restore this behavior.
      
      The above program could have updated old_value in place instead of
      calling update(), which would be faster, and most programs use that
      approach, but sometimes the values are large and the programs use the
      update() helper to do an atomic replacement of the element.
      Note we cannot simply update the element's value in place like the
      percpu hash map does; we have to allocate num_possible_cpu extra
      elements and use this extra reserve when the map is full.
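      The in-place alternative mentioned above, as a sketch ('counter' is an
      illustrative field of the map value):

      value = bpf_map_lookup_elem(map, key);
      if (value)
      	value->counter++;	/* modify *value directly, no update() needed */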
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6ed3ea6
  20. 04 August 2016 (1 commit)
    • bpf: fix method of PTR_TO_PACKET reg id generation · 1f415a74
      Committed by Jakub Kicinski
      Using per-register incrementing ID can lead to
      find_good_pkt_pointers() confusing registers which
      have completely different values.  Consider example:
      
      0: (bf) r6 = r1
      1: (61) r8 = *(u32 *)(r6 +76)
      2: (61) r0 = *(u32 *)(r6 +80)
      3: (bf) r7 = r8
      4: (07) r8 += 32
      5: (2d) if r8 > r0 goto pc+9
       R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
      6: (bf) r8 = r7
      7: (bf) r9 = r7
      8: (71) r1 = *(u8 *)(r7 +0)
      9: (0f) r8 += r1
      10: (71) r1 = *(u8 *)(r7 +1)
      11: (0f) r9 += r1
      12: (07) r8 += 32
      13: (2d) if r8 > r0 goto pc+1
       R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
      14: (71) r1 = *(u8 *)(r9 +16)
      15: (b7) r7 = 0
      16: (bf) r0 = r7
      17: (95) exit
      
      We need to get a UNKNOWN_VALUE with imm to force id
      generation so lines 0-5 make r7 a valid packet pointer.
      We then read two different bytes from the packet and
      add them to copies of the constructed packet pointer.
      r8 (line 9) and r9 (line 11) will get the same id of 1,
      independently.  When either of them is validated (line
      13) - find_good_pkt_pointers() will also mark the other
      as safe.  This leads to access on line 14 being mistakenly
      considered safe.
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f415a74
  21. 20 July 2016 (3 commits)
  22. 17 July 2016 (1 commit)
  23. 16 July 2016 (1 commit)
    • bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Committed by Daniel Borkmann
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure BPF
      verifier can do sanity checks on it during verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
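      A sketch of the resulting call (the exact flag encoding shown here is an
      assumption based on the final uapi):

      /* append 'skb_len' bytes of packet payload after the 'meta' struct */
      __u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb_len << 32);

      bpf_perf_event_output(skb, &perf_map, flags, &meta, sizeof(meta));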
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      555c8a86