1. 23 May 2018 (2 commits)
  2. 19 May 2018 (1 commit)
  3. 17 May 2018 (1 commit)
  4. 16 May 2018 (1 commit)
  5. 15 May 2018 (2 commits)
    • bpf: sockmap, refactor sockmap routines to work with hashmap · e5cd3abc
      John Fastabend committed
      This patch only refactors the existing sockmap code. This will allow
      much of the psock initialization code path and the bpf helper code to
      work for both sockmap bpf map types: the currently supported
      array-backed type, and the new hash-backed map type, sockhash.
      
      Most of the fallout comes from three changes:
      
        - Pushing bpf programs into an independent structure so we
          can use it from the htab struct in the next patch.
        - Generalizing helpers to use void *key instead of the hardcoded
          u32.
        - Instead of passing map/key through the metadata, we now do
          the lookup inline. This avoids storing the key in the metadata,
          which will be useful once keys can be longer than 4 bytes. We
          also rename the sk pointers to sk_redir at this point to
          avoid any confusion between the current sk pointer and the
          redirect pointer sk_redir.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e5cd3abc
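
      A rough sketch of the first two changes listed above, with names
      approximated rather than copied from the patch: the parser/verdict/tx
      programs move into their own struct so a future htab-backed map can
      embed it too, and the lookup helper takes an opaque key instead of a
      hardcoded u32 index.

          /* Sketch only; actual struct and helper names may differ. */
          struct bpf_sock_progs {
              struct bpf_prog *bpf_tx_msg;
              struct bpf_prog *bpf_parse;
              struct bpf_prog *bpf_verdict;
          };

          /* before: key is a hardcoded u32 array index */
          struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key);

          /* after: key is opaque, so a u32 array index and a longer hash key
           * can both go through the same helper code */
          struct sock *__sock_map_lookup_elem(struct bpf_map *map, void *key);
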
    • vmcore: add API to collect hardware dump in second kernel · 2724273e
      Rahul Lakkireddy committed
      The sequence of actions done by device drivers to append their
      device-specific hardware/firmware logs to /proc/vmcore is as follows:
      
      1. During probe (before the hardware is initialized), device drivers
      register with the vmcore module (via vmcore_add_device_dump()),
      providing a callback function along with the buffer size and log name
      needed for firmware/hardware log collection.
      
      2. The vmcore module allocates a buffer of the requested size. It adds
      an Elf note and invokes the device driver's registered callback
      function.
      
      3. The device driver collects all hardware/firmware logs into the buffer
      and returns control back to the vmcore module.
      
      Ensure that the device dump buffer size is always aligned to the page
      size so that it can be mmapped.
      
      Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
      generic and reserve NT_VMCOREDD note type to indicate vmcore device
      dump.
      
      Suggested-by: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2724273e
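
      To make the three-step sequence concrete, here is a minimal driver-side
      sketch, assuming the vmcore_add_device_dump()/struct vmcoredd_data
      interface this patch adds; the driver name, dump name and snapshot
      routine are made up:

          #include <linux/crash_dump.h>
          #include <linux/sizes.h>
          #include <linux/string.h>

          int mydrv_snapshot_fw_log(void *buf, unsigned int len);  /* hypothetical */

          /* Step 3: vmcore hands us the buffer it allocated (data->size bytes);
           * fill it with hardware/firmware logs and return. */
          static int mydrv_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
          {
              return mydrv_snapshot_fw_log(buf, data->size);
          }

          static struct vmcoredd_data mydrv_vmcoredd = {
              .size = SZ_2M,                       /* requested buffer size */
              .vmcoredd_callback = mydrv_vmcoredd_collect,
          };

          /* Step 1: called from probe, before the hardware is initialized. */
          static int mydrv_register_vmcore_dump(void)
          {
              strscpy(mydrv_vmcoredd.dump_name, "mydrv_fw_log",
                      sizeof(mydrv_vmcoredd.dump_name));
              return vmcore_add_device_dump(&mydrv_vmcoredd);
          }
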
  6. 14 May 2018 (1 commit)
  7. 11 May 2018 (1 commit)
  8. 10 May 2018 (2 commits)
  9. 09 May 2018 (1 commit)
    • bpf: btf: Introduce BTF ID · 78958fca
      Martin KaFai Lau committed
      This patch gives an ID to each loaded BTF.  The ID is allocated by
      the idr like the existing prog-id and map-id.
      
      The btf_put(map->btf) is moved to __bpf_map_put() so that
      userspace can stop seeing the BTF ID ASAP when the last BTF
      refcnt is gone.
      
      It also makes BTF accessible from userspace through:
      1. the new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN,
         which is in line with the BPF_BTF_LOAD cmd and the existing
         BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
      2. the new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
      
      Once the BTF ID is accessible from userspace, freeing a BTF
      object has to go through an rcu grace period.  The BPF_BTF_GET_FD_BY_ID
      cmd can then be done under rcu_read_lock() instead of taking the
      spin_lock.
      [Note: A similar rcu usage can be done to the existing
             bpf_prog_get_fd_by_id() in a follow up patch]
      
      When processing the BPF_BTF_GET_FD_BY_ID cmd,
      refcount_inc_not_zero() is needed because the BTF object
      could already be in the rcu dead row.  btf_get() is
      removed since its usage is currently limited to btf.c
      alone; refcount_inc() is used directly instead.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      78958fca
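
      A hedged userspace sketch of the two additions, using the raw bpf(2)
      syscall; only BPF_BTF_GET_FD_BY_ID and the new btf_id field come from
      this patch, the rest is ordinary boilerplate, and the new command
      needs CAP_SYS_ADMIN:

          #include <linux/bpf.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* Fetch a map's BTF ID via BPF_OBJ_GET_INFO_BY_FD, then open an fd
           * to that BTF object with the new BPF_BTF_GET_FD_BY_ID command. */
          static int btf_fd_for_map(int map_fd)
          {
              struct bpf_map_info info = {};
              union bpf_attr attr = {};

              attr.info.bpf_fd = map_fd;
              attr.info.info_len = sizeof(info);
              attr.info.info = (__u64)(unsigned long)&info;
              if (syscall(__NR_bpf, BPF_OBJ_GET_INFO_BY_FD, &attr, sizeof(attr)))
                  return -1;

              memset(&attr, 0, sizeof(attr));
              attr.btf_id = info.btf_id;           /* field added by this patch */
              return syscall(__NR_bpf, BPF_BTF_GET_FD_BY_ID, &attr, sizeof(attr));
          }
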
  10. 08 May 2018 (5 commits)
  11. 07 May 2018 (1 commit)
  12. 05 May 2018 (2 commits)
  13. 04 May 2018 (8 commits)
    • bpf: centre subprog information fields · 9c8105bd
      Jiong Wang committed
      It is better to centre all subprog information fields into one structure.
      This structure could later serve as a function node in the call graph.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9c8105bd
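
      A minimal sketch of what such a consolidated per-subprog structure
      could look like; the exact fields are an assumption, not copied from
      the patch:

          #include <linux/types.h>

          /* One node per (sub)program in the verifier; later work can hang
           * call-graph data, e.g. callee lists, off the same node. */
          struct bpf_subprog_info {
              u32 start;         /* insn index of the subprog entry point */
              u16 stack_depth;   /* max stack depth used by this subprog */
          };
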
    • bpf: unify main prog and subprog · f910cefa
      Jiong Wang committed
      Currently, the verifier treats the main prog and subprogs differently.
      All detected subprogs are kept in env->subprog_starts while the main prog
      is not kept there. Instead, the main prog is implicitly defined as the
      prog starting at 0.
      
      There is actually no difference between the main prog and a subprog, so
      it is better to unify them and register all detected progs in
      env->subprog_starts.
      
      This also helps simplify some of the code logic.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f910cefa
    • bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann committed
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF, and
      keeping them around in native eBPF caused way more trouble than
      they are actually worth. To list just some of the security issues
      from the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and the more efficient/flexible
      alternatives that have been developed over time, such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness, and their exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") along with the similar exception cases we had.
      In native eBPF, more recent program types disable LD_ABS/LD_IND
      altogether through may_access_skb() in the verifier, and given the
      limitations in terms of exception handling, they are also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, LD_ABS/LD_IND is used in networking programs
      to access packet data. It is not used in seccomp-BPF, but it is used
      by programs that rely on cBPF for socket filtering or reuseport
      demuxing. This is mostly relevant for applications that have not yet
      migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND comes
      from their implementation in the various JITs. Most of them keep
      the model from cBPF times by implementing a fastpath written
      in asm. They typically use two CPU registers, hidden from the BPF
      program, for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires keeping track
      of whether LD_ABS/LD_IND are used and, if so, the two registers need
      to be recached each time a BPF helper changes the underlying
      packet data in the native eBPF case. At least in the eBPF case,
      available CPU registers are rare, and the additional exit path out
      of the asm-written JIT helper also makes it inflexible, since not
      all parts of the JIT are in control from plain C. An LD_ABS/LD_IND
      implementation in eBPF therefore allows the complexity in the JITs
      to be reduced significantly, with comparable performance results
      for them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This also slightly shrinks the
      image on x86_64 for seccomp programs, since mapping to %rsi is
      not an ereg. In the callee-saved R8 and R9 we now track skb data
      and headlen, respectively. For normal prologue emission in the JITs
      this does not add any extra instructions, since R8 and R9 are pushed
      to the stack in any case on the eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter, which probes the fast path inline
      already and falls back to the bpf_skb_load_helper_{8,16,32}() helpers
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data(),
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      and neither caches skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, this
      avoids any reload of skb data and headlen in the first place.
      Additionally, for negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly to bpf_skb_load_bytes() and allows for more flexibility.
      Tested by myself on x64, arm64, s390x; by Sandipan on ppc64.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e0cea7ce
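
      As a usage sketch of the bpf_skb_load_bytes_relative() helper mentioned
      at the end, here is a hedged eBPF socket-filter example; the header path
      and section names follow the selftests style of the time and may differ
      in other trees:

          #include <linux/bpf.h>
          #include <linux/in.h>
          #include <linux/ip.h>
          #include "bpf_helpers.h"        /* selftests-era helper macros */

          SEC("socket")
          int tcp_only_filter(struct __sk_buff *skb)
          {
              struct iphdr iph;

              /* Load the IPv4 header relative to the network header, the kind
               * of job LD_ABS with special negative offsets used to do. */
              if (bpf_skb_load_bytes_relative(skb, 0, &iph, sizeof(iph),
                                              BPF_HDR_START_NET))
                  return 0;

              return iph.protocol == IPPROTO_TCP;
          }

          char _license[] SEC("license") = "GPL";
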
    • bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann committed
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. The
      reason is that the eBPF tests from the test_bpf module do not go through
      the BPF verifier, and therefore any instruction rewrites from the
      verifier cannot take place.
      
      Therefore, move them into test_verifier, which runs out of user space,
      so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming
      patches. It will have the same effect since runtime tests are also
      performed from there. This also allows us to finally unexport
      bpf_skb_vlan_{push,pop}_proto and keep it internal to the core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      93731ef0
    • dev: packet: make packet_direct_xmit a common function · 865b03f2
      Magnus Karlsson committed
      The new dev_direct_xmit will be used by AF_XDP in later commits.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      865b03f2
    • xsk: wire up XDP_SKB side of AF_XDP · 02671e23
      Björn Töpel committed
      This commit wires up the xskmap to XDP_SKB layer.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      02671e23
    • bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel committed
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to a certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than the one
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fbfc504a
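
      A hedged sketch of the redirect path just described, written in the
      bpf_map_def style the selftests used at the time; the map size and
      program name are illustrative:

          #include <linux/bpf.h>
          #include "bpf_helpers.h"        /* selftests-era helper macros */

          /* Userspace creates this BPF_MAP_TYPE_XSKMAP and inserts AF_XDP
           * socket fds keyed by rx queue index. */
          struct bpf_map_def SEC("maps") xsks_map = {
              .type = BPF_MAP_TYPE_XSKMAP,
              .key_size = sizeof(int),
              .value_size = sizeof(int),
              .max_entries = 64,
          };

          SEC("xdp")
          int xdp_redirect_xsk(struct xdp_md *ctx)
          {
              /* Hand the frame to the AF_XDP socket bound to this rx queue,
               * if any; the socket only accepts frames from the netdev/queue
               * it is bound to. */
              return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
          }

          char _license[] SEC("license") = "GPL";
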
    • net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel committed
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      68e8b849
  14. 03 May 2018 (1 commit)
  15. 02 May 2018 (2 commits)
    • tcp: send in-queue bytes in cmsg upon read · b75eba76
      Soheil Hassas Yeganeh committed
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer sizes to drain sockets. Knowing
      the length of the receive queue, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of the receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b75eba76
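
      A hedged userspace sketch of the pattern described above; TCP_INQ and
      TCP_CM_INQ come from the UAPI header this patch extends, the rest is
      illustrative and error handling is trimmed:

          #include <netinet/in.h>        /* IPPROTO_TCP (== SOL_TCP) */
          #include <linux/tcp.h>         /* TCP_INQ, TCP_CM_INQ */
          #include <string.h>
          #include <sys/socket.h>
          #include <sys/types.h>
          #include <sys/uio.h>

          /* Enable once after the connection is set up. */
          static void enable_tcp_inq(int fd)
          {
              int one = 1;

              setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
          }

          /* Read data and pick up the number of bytes still queued on the
           * socket from the TCP_CM_INQ control message. */
          static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
          {
              char cbuf[CMSG_SPACE(sizeof(int))];
              struct iovec iov = { .iov_base = buf, .iov_len = len };
              struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
              };
              struct cmsghdr *cm;
              ssize_t ret = recvmsg(fd, &msg, 0);

              for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                  if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ)
                      memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
              return ret;
          }
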
    • net: core: Inline netdev_features_size_check() · e283de3a
      Florian Fainelli committed
      We do not require this inline function to be used in multiple different
      locations; just inline it where it gets used, in register_netdevice().
      Suggested-by: David Miller <davem@davemloft.net>
      Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e283de3a
  16. 01 May 2018 (5 commits)
  17. 30 April 2018 (3 commits)
  18. 29 April 2018 (1 commit)
    • bpf/verifier: improve register value range tracking with ARSH · 9cbe1f5a
      Yonghong Song committed
      When a helper like bpf_get_stack returns an int value that is
      later used for arithmetic computation, LSH and ARSH operations
      are often required to get proper sign extension into 64 bits.
      For example, without this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0)
      With this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
      With the better range of "R8", when "R8" is later added to another
      register, e.g., a map pointer or a scalar-value register, a better
      range can be derived for the result and verifier failure may be avoided.
      
      In the following example,
          ......
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0)
              return 0;
          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
          ......
      Without improved ARSH value range tracking, the register representing
      "max_len - usize" will have smin_value equal to S64_MIN and will be
      rejected by the verifier.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9cbe1f5a