1. 04 May, 2018 19 commits
    • bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann committed
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming patches.
      It will have the same effect since runtime tests are also performed from
      there. This also finally allows us to unexport bpf_skb_vlan_{push,pop}_proto
      and keep it internal to the core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: prefix cbpf internal helpers with bpf_ · b390134c
      Daniel Borkmann committed
      No change in functionality, just remove the '__' prefix and replace it
      with a 'bpf_' prefix instead. We later on add a couple of more helpers
      for cBPF and keeping the scheme with '__' is suboptimal there.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'AF_XDP-initial-support' · 08dbc7a6
      Alexei Starovoitov committed
      Björn Töpel says:
      
      ====================
      This patch set introduces a new address family called AF_XDP that is
      optimized for high performance packet processing and, in upcoming
      patch sets, zero-copy semantics. In this patch set, we have removed
      all zero-copy related code in order to make it smaller, simpler and
      hopefully more review friendly. This patch set only supports copy-mode
      for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
      for RX using the XDP_DRV path. Zero-copy support requires XDP and
      driver changes that Jesper Dangaard Brouer is working on. Some of his
      work has already been accepted. We will publish our zero-copy support
      for RX and TX on top of his patch sets at a later point in time.
      
      An AF_XDP socket (XSK) is created with the normal socket()
      syscall. Associated with each XSK are two queues: the RX queue and the
      TX queue. A socket can receive packets on the RX queue and it can send
      packets on the TX queue. These queues are registered and sized with
      the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
      mandatory to have at least one of these queues for each socket. In
      contrast to AF_PACKET V2/V3 these descriptor queues are separated from
      packet buffers. An RX or TX descriptor points to a data buffer in a
      memory area called a UMEM. RX and TX can share the same UMEM so that a
      packet does not have to be copied between RX and TX. Moreover, if a
      packet needs to be kept for a while due to a possible retransmit, the
      descriptor that points to that packet can be changed to point to
      another and reused right away. This again avoids copying data.
      
      This new dedicated packet buffer area is called a UMEM. It consists of a
      number of equally sized frames and each frame has a unique frame id. A
      descriptor in one of the queues references a frame by referencing its
      frame id. The user space allocates memory for this UMEM using whatever
      means it feels is most appropriate (malloc, mmap, huge pages,
      etc). This memory area is then registered with the kernel using the new
      setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
      and the COMPLETION queue. The fill queue is used by the application to
      send down frame ids for the kernel to fill in with RX packet
      data. References to these frames will then appear in the RX queue of
      the XSK once they have been received. The completion queue, on the
      other hand, contains frame ids that the kernel has transmitted
      completely and can now be used again by user space, for either TX or
      RX. Thus, the frame ids appearing in the completion queue are ids that
      were previously transmitted using the TX queue. In summary, the RX and
      FILL queues are used for the RX path and the TX and COMPLETION queues
      are used for the TX path.
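
      As a rough illustration of the frame addressing described above (a
      sketch, not code from this patch set): assuming the UMEM is one
      contiguous area of equally sized frames and a frame id is a plain
      index into it, a frame id could be resolved to an address as
      follows. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space view of a UMEM: one contiguous buffer cut
 * into equally sized frames. Not the kernel's data structure. */
struct umem_view {
    uint8_t *base;        /* start of the registered memory area */
    uint32_t frame_size;  /* size of each frame in bytes */
    uint32_t nframes;     /* number of frames in the area */
};

/* Map a frame id to the address of that frame.
 * Returns NULL when the id does not name a frame in this UMEM. */
static uint8_t *umem_frame_addr(const struct umem_view *u, uint32_t id)
{
    if (id >= u->nframes)
        return NULL;
    return u->base + (size_t)id * u->frame_size;
}
```

      With 2048-byte frames, for example, frame id 3 would start 6144
      bytes into the area.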
      
      The socket is then finally bound with a bind() call to a device and a
      specific queue id on that device, and it is not until bind is
      completed that traffic starts to flow. Note that in this patch set,
      all packet data is copied out to user-space.
      
      A new feature in this patch set is that the UMEM can be shared between
      processes, if desired. If a process wants to do this, it simply skips
      the registration of the UMEM and its corresponding two queues, sets a
      flag in the bind call and submits the XSK of the process it would like
      to share UMEM with as well as its own newly created XSK socket. The
      new process will then receive frame id references in its own RX queue
      that point to this shared UMEM. Note that since the queue structures
      are single-consumer / single-producer (for performance reasons), the
      new process has to create its own socket with associated RX and TX
      queues, since it cannot share this with the other process. This is
      also the reason that there is only one set of FILL and COMPLETION
      queues per UMEM. It is the responsibility of a single process to
      handle the UMEM. If multiple-producer / multiple-consumer queues are
      implemented in the future, this requirement could be relaxed.
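
      To make the single-producer / single-consumer constraint concrete,
      here is a minimal ring of frame ids. It is a single-threaded model
      under assumed names (id_ring, ring_push, ring_pop) and is not the
      queue layout used by this patch set; a real shared ring would also
      need memory barriers between the two sides.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* must be a power of two so masking works */

/* Hypothetical ring of frame ids. Exactly one side advances prod and
 * exactly one side advances cons, which is why the structure cannot be
 * shared by two producers or two consumers without extra locking. */
struct id_ring {
    uint32_t prod;             /* written only by the producer */
    uint32_t cons;             /* written only by the consumer */
    uint32_t ids[RING_SIZE];
};

/* Producer side: enqueue one frame id, or return false if the ring is full. */
static bool ring_push(struct id_ring *r, uint32_t id)
{
    if (r->prod - r->cons == RING_SIZE)
        return false;
    r->ids[r->prod & (RING_SIZE - 1)] = id;
    r->prod++;
    return true;
}

/* Consumer side: dequeue one frame id, or return false if the ring is empty. */
static bool ring_pop(struct id_ring *r, uint32_t *id)
{
    if (r->prod == r->cons)
        return false;
    *id = r->ids[r->cons & (RING_SIZE - 1)];
    r->cons++;
    return true;
}
```

      In this model the application would be the producer on a FILL-like
      ring and the consumer on an RX-like ring, with the kernel taking the
      opposite role on each.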
      
      How are packets then distributed between these two XSKs? We have
      introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
      full). The user-space application can place an XSK at an arbitrary
      place in this map. The XDP program can then redirect a packet to a
      specific index in this map and at this point XDP validates that the
      XSK in that map was indeed bound to that device and queue number. If
      not, the packet is dropped. If the map is empty at that index, the
      packet is also dropped. This also means that it is currently mandatory
      to have an XDP program loaded (and one XSK in the XSKMAP) to be able
      to get any traffic to user space through the XSK.
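
      The drop rules above can be modelled in plain C. This is only a
      user-space model with hypothetical types (xsk_model, verdict), not
      the kernel's XSKMAP code; dropping on an out-of-range index is an
      assumption added to keep the model total.

```c
#include <stddef.h>
#include <stdint.h>

#define XSKMAP_SLOTS 4

/* Hypothetical stand-in for a bound XSK: the device and queue it was
 * bound to with bind(). */
struct xsk_model {
    uint32_t ifindex;
    uint32_t queue_id;
};

enum verdict { VERDICT_DROP, VERDICT_DELIVER };

/* Decide what happens to a packet received on (rx_ifindex, rx_queue)
 * when the XDP program redirects it to map[index]. */
static enum verdict xskmap_redirect(struct xsk_model *map[], uint32_t index,
                                    uint32_t rx_ifindex, uint32_t rx_queue)
{
    const struct xsk_model *xs;

    if (index >= XSKMAP_SLOTS)
        return VERDICT_DROP;        /* assumption: bad index drops */
    xs = map[index];
    if (xs == NULL)
        return VERDICT_DROP;        /* empty slot: packet is dropped */
    if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue)
        return VERDICT_DROP;        /* XSK bound to another device/queue */
    return VERDICT_DELIVER;
}
```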
      
      AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
      driver does not have support for XDP, or XDP_SKB is explicitly chosen
      when loading the XDP program, XDP_SKB mode is employed that uses SKBs
      together with the generic XDP support and copies out the data to user
      space. This is a fallback mode that works for any network device. On the other
      hand, if the driver has support for XDP, it will be used by the AF_XDP
      code to provide better performance, but there is still a copy of the
      data into user space.
      
      There is an xdpsock benchmarking/test application included that
      demonstrates how to use AF_XDP sockets with both private and shared
      UMEMs. Say that you would like your UDP traffic from port 4242 to end
      up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
      for this:
      
            ethtool -N p3p2 rx-flow-hash udp4 fn
            ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
                action 16
      
      Running the rxdrop benchmark in XDP_DRV mode can then be done
      using:
      
            samples/bpf/xdpsock -i p3p2 -q 16 -r -N
      
      For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
      can be displayed with "-h", as usual.
      
      We have run some benchmarks on a dual socket system with two Broadwell
      E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
      cores which gives a total of 28, but only two cores are used in these
      experiments. One for TX/RX and one for the user space application. The
      memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
      8192MB and with 8 of those DIMMs in the system we have 64 GB of total
      memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
      NIC is Intel I40E 40Gbit/s using the i40e driver.
      
      Below are the results in Mpps of the I40E NIC benchmark runs for 64
      and 1500 byte packets, generated by a commercial packet generator HW
      outputting packets at full 40 Gbit/s line rate. The results are without
      retpoline so that we can compare against previous numbers. With
      retpoline, the AF_XDP numbers drop by 10 to 15 percent.
      
      AF_XDP performance 64 byte packets. Results from V2 in parentheses.
      Benchmark   XDP_SKB    XDP_DRV
      rxdrop      2.9(3.0)   9.6(9.5)
      txpush      2.6(2.5)   NA*
      l2fwd       1.9(1.9)   2.5(2.5) (TX using XDP_SKB in both cases)

      AF_XDP performance 1500 byte packets:
      Benchmark   XDP_SKB    XDP_DRV
      rxdrop      2.1(2.2)   3.3(3.3)
      l2fwd       1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)
      
      * NA since we have no support for TX using the XDP_DRV infrastructure
        in this patch set. This is for a future patch set since it involves
        changes to the XDP NDOs. Some of this has been upstreamed by Jesper
        Dangaard Brouer.
      
      XDP performance on our system as a baseline:
      
      64 byte packets:
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      16      32.3(32.9)M  0
      
      1500 byte packets:
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      16      3.3(3.3)M    0
      
      Changes from V2:
      
      * Fixed a race in XSKMAP map found by Will. The code has been
        completely rearchitected and is now simpler, faster, and hopefully
        also not racy. Please review and check if it holds.
      
      If you would like to diff V2 against V3, you can find them here:
      https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
      https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next
      
      The structure of the patch set is as follows:
      
      Patches 1-3: Basic socket and umem plumbing
      Patches 4-9: RX support together with the new XSKMAP
      Patches 10-13: TX support
      Patch 14: Statistics support with getsockopt()
      Patch 15: Sample application
      
      We based this patch set on bpf-next commit a3fe1f6f ("tools:
      bpftool: change time format for program 'loaded at:' information")
      
      To do for this patch set:
      
      * Syzkaller torture session being worked on
      
      Post-series plan:
      
      * Optimize performance
      
      * Kernel selftest
      
      * Loadable kernel module support for AF_XDP would be nice. Unclear how to
        achieve this though since our XDP code depends on net/core.
      
      * Support for AF_XDP sockets without an XDP program loaded. In this
        case all the traffic on a queue should go up to the user space socket.
      
      * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
        XDP_PASS" for a tcpdump-like functionality.
      
      * And of course getting to zero-copy support in small increments,
        starting with TX then adding RX.
      
      Thanks: Björn and Magnus
      ====================
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: sample application and documentation for AF_XDP sockets · b4b8faa1
      Magnus Karlsson committed
      This is a sample application for AF_XDP sockets. The application
      supports three different modes of operation: rxdrop, txonly and l2fwd.
      
      To show-case a simple round-robin load-balancing between a set of
      sockets in an xskmap, set the RR_LB compile time define option to 1 in
      "xdpsock.h".
      
      v2: The entries variable was calculated twice in {umem,xq}_nb_avail.
      Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: statistics support · af75d9e0
      Magnus Karlsson committed
      In this commit, a new getsockopt is added: XDP_STATISTICS. This is
      used to obtain stats from the sockets.
      
      v2: getsockopt now returns size of stats structure.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: support for Tx · 35fcde7f
      Magnus Karlsson committed
      Here, Tx support is added. The user fills the Tx queue with frames to
      be sent by the kernel, and lets the kernel know using the sendmsg
      syscall.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • dev: packet: make packet_direct_xmit a common function · 865b03f2
      Magnus Karlsson committed
      The new dev_direct_xmit will be used by AF_XDP in later commits.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Tx queue setup and mmap support · f6145903
      Magnus Karlsson committed
      Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
      a queue, where the user process can pass frames to be transmitted by
      the kernel.
      
      The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add umem completion queue support and mmap · fe230832
      Magnus Karlsson committed
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
      process can ask the kernel to allocate a queue (ring buffer) and also
      mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      kernel to user process. This will be used by the TX path to tell user
      space that a certain frame has been transmitted and user space can use
      it for something else, if it wishes.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: wire up XDP_SKB side of AF_XDP · 02671e23
      Björn Töpel committed
      This commit wires up the xskmap to XDP_SKB layer.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: wire up XDP_DRV side of AF_XDP · 1b1a251c
      Björn Töpel committed
      This commit wires up the xskmap to XDP_DRV layer.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel committed
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Rx receive functions and poll support · c497176c
      Björn Töpel committed
      Here the actual receive functions of AF_XDP are implemented, that in a
      later commit, will be called from the XDP layers.
      
      There's one set of functions for the XDP_DRV side and another for
      XDP_SKB (generic).
      
      A new XDP API, xdp_return_buff, is also introduced.
      
      xdp_return_buff is analogous to xdp_return_frame, but acts upon a
      struct xdp_buff. The API will be used by AF_XDP in future commits.
      
      Support for the poll syscall is also implemented.
      
      v2: xskq_validate_id did not update cons_tail.
          The entries variable was calculated twice in xskq_nb_avail.
          Squashed xdp_return_buff commit.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add support for bind for Rx · 965a9909
      Magnus Karlsson committed
      Here, the bind syscall is added. Binding an AF_XDP socket means
      associating the socket with a umem, a netdev and a queue index. This
      can be done in two ways.

      The first way is to create a "socket from scratch". Create the umem using
      the XDP_UMEM_REG setsockopt and an associated fill queue with
      XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
      setsockopt. Call bind passing ifindex and queue index ("channel" in
      ethtool speak).
      
      The second way to bind a socket is simply skipping the
      umem/netdev/queue index, and passing another already setup AF_XDP
      socket. The new socket will then have the same umem/netdev/queue index
      as the parent so it will share the same umem. You must also set the
      flags field in the socket address to XDP_SHARED_UMEM.
      
      v2: Use PTR_ERR instead of passing error variable explicitly.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Rx queue setup and mmap support · b9b6b68e
      Björn Töpel committed
      Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
      a queue, through which the kernel can pass completed Rx frames to the
      user process.
      
      The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add umem fill queue support and mmap · 423f3832
      Magnus Karlsson committed
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
      ask the kernel to allocate a queue (ring buffer) and also mmap it
      (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      user process to the kernel. These frames will in a later patch be
      filled in with Rx packet data by the kernel.
      
      v2: Fixed potential crash in xsk_mmap.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add user memory registration support sockopt · c0c77d8f
      Björn Töpel committed
      In this commit the base structure of the AF_XDP address family is set
      up. Further, we introduce the ability to register a window of user memory
      to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
      window is viewed by an AF_XDP socket as a set of equally large
      frames. After a user memory registration all frames are "owned" by the
      user application, and not the kernel.
      
      v2: More robust checks on umem creation and unaccount on error.
          Call set_page_dirty_lock on cleanup.
          Simplified xdp_umem_reg.
      Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel committed
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x86_32: add eBPF JIT compiler for ia32 · 03f5781b
      Wang YanQing committed
      The JIT compiler emits ia32 instructions. Currently, it supports eBPF
      only. Classic BPF is supported because of the conversion by BPF core.
      
      Almost all instructions from the eBPF ISA are supported except the following:
      BPF_ALU64 | BPF_DIV | BPF_K
      BPF_ALU64 | BPF_DIV | BPF_X
      BPF_ALU64 | BPF_MOD | BPF_K
      BPF_ALU64 | BPF_MOD | BPF_X
      BPF_STX | BPF_XADD | BPF_W
      BPF_STX | BPF_XADD | BPF_DW
      
      It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
      
      IA32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI. I use
      EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
      ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding. All other
      eBPF registers, R0-R10, are simulated through scratch space on stack.
      
      The reasons behind the hardware register allocation policy are:
      1:MUL needs EAX:EDX and shift operations need ECX, so they aren't fit
        for general eBPF 64bit register simulation.
      2:We need at least 4 registers to simulate most eBPF ISA operations
        on register operands instead of on register&memory operands.
      3:We need to put BPF_REG_AX in hardware registers, or constant blinding
        will degrade JIT performance heavily.
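
      For readers unfamiliar with simulating 64bit registers on a 32bit
      machine, the idea can be sketched in C: each eBPF register is held
      as a low and a high 32bit half, and a 64bit add becomes an add on
      the low halves followed by an add-with-carry on the high halves.
      The names reg64 and add64 are illustrative and do not come from the
      JIT source.

```c
#include <stdint.h>

/* A 64bit eBPF register modelled as two 32bit halves, as it would be
 * kept in two stack slots or two hardware registers on ia32. */
struct reg64 {
    uint32_t lo;
    uint32_t hi;
};

/* 64bit addition using only 32bit arithmetic. */
static struct reg64 add64(struct reg64 a, struct reg64 b)
{
    struct reg64 r;
    uint32_t carry;

    r.lo = a.lo + b.lo;          /* add the low halves (wraps mod 2^32) */
    carry = r.lo < a.lo;         /* wrap-around means a carry out */
    r.hi = a.hi + b.hi + carry;  /* add the high halves plus the carry */
    return r;
}
```

      Each such operation needs several 32bit temporaries at once, which
      is consistent with reason 2 above.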
      
      Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
      Testing results on i5-5200U:
      1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
      2) test_progs: Summary: 83 PASSED, 0 FAILED.
      3) test_lpm: OK
      4) test_lru_map: OK
      5) test_verifier: Summary: 828 PASSED, 0 FAILED.
      
      The above tests were all run under the following two conditions separately:
      1:bpf_jit_enable=1 and bpf_jit_harden=0
      2:bpf_jit_enable=1 and bpf_jit_harden=2
      
      Below are some numbers for this jit implementation:
      Note:
        I run test_progs in kselftest 100 times continuously for every condition,
        the numbers are in format: total/times=avg.
        The numbers that test_bpf reports show almost the same relation.
      
      a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
        test_pkt_access:PASS:ipv4:15622/100=156    test_pkt_access:PASS:ipv4:10674/100=106
        test_pkt_access:PASS:ipv6:9130/100=91      test_pkt_access:PASS:ipv6:4855/100=48
        test_xdp:PASS:ipv4:240198/100=2401         test_xdp:PASS:ipv4:138912/100=1389
        test_xdp:PASS:ipv6:137326/100=1373         test_xdp:PASS:ipv6:68542/100=685
        test_l4lb:PASS:ipv4:61100/100=611          test_l4lb:PASS:ipv4:37302/100=373
        test_l4lb:PASS:ipv6:101000/100=1010        test_l4lb:PASS:ipv6:55030/100=550
      
      c:jit_enable=1 and jit_harden=2
        test_pkt_access:PASS:ipv4:10558/100=105
        test_pkt_access:PASS:ipv6:5092/100=50
        test_xdp:PASS:ipv4:131902/100=1319
        test_xdp:PASS:ipv6:77932/100=779
        test_l4lb:PASS:ipv4:38924/100=389
        test_l4lb:PASS:ipv6:57520/100=575
      
      The numbers show we get 30%~50% improvement.
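
      The percentage can be recomputed from the averages above. The helper
      below is a small sketch (improvement_pct is a made-up name) and
      assumes the reported figures are durations, so lower is better.

```c
/* Relative improvement of the JIT'ed average over the interpreter
 * average, in percent. */
static double improvement_pct(double baseline, double jitted)
{
    return (baseline - jitted) / baseline * 100.0;
}
```

      For instance, test_pkt_access ipv4 goes from 156 to 106, roughly a
      32% improvement, and test_xdp ipv6 goes from 1373 to 685, roughly
      50%.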
      
      See Documentation/networking/filter.txt for more information.
      
      Changelog:
      
       Changes v5-v6:
       1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
         consistency reasons.
       2:Clean up non-standard comments, reported by Daniel Borkmann.
       3:Fix a memory leak issue, reported by Daniel Borkmann.
      
       Changes v4-v5:
       1:Delete is_on_stack, BPF_REG_AX is the only one
         on real hardware registers, so just check with
         it.
       2:Apply commit 1612a981 ("bpf, x64: fix JIT emission
         for dead code"), suggested by Daniel Borkmann.
      
       Changes v3-v4:
       1:Fix changelog in commit.
         After installing llvm-6.0, test_progs no longer reports errors.
         I submitted another patch:
         "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
         to fix another problem; after that patch, test_verifier no longer reports errors either.
       2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
      
       Changes v2-v3:
       1:Move BPF_REG_AX to real hardware registers for performance reason.
       3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
       4:Delete partial codes in 1c2a088a, suggested by Daniel Borkmann.
       5:Some bug fixes and comments improvement.
      
       Changes v1-v2:
       1:Fix bug in emit_ia32_neg64.
       2:Fix bug in emit_ia32_arsh_r64.
       3:Delete filename in top level comment, suggested by Thomas Gleixner.
       4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
       5:Rewrite some words in changelog.
       6:CodingStyle improvements and a few more comments.
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 02 May, 2018 3 commits
    • bpf: relax constraints on formatting for eBPF helper documentation · 6f96674d
      Quentin Monnet committed
      The Python script used to parse and extract eBPF helpers documentation
      from include/uapi/linux/bpf.h expects a very specific formatting for the
      descriptions (single dot represents a space, '>' stands for a tab):
      
          /*
           ...
           *.int bpf_helper(list of arguments)
           *.>    Description
           *.>    >       Start of description
           *.>    >       Another line of description
           *.>    >       And yet another line of description
           *.>    Return
           *.>    >       0 on success, or a negative error in case of failure
           ...
           */
      
      This is too strict, and painful for developers who want to add
      documentation for new helpers. Worse, it is extremely difficult to check
      that the formatting is correct during reviews. Change the format
      expected by the script and make it more flexible. The script now works
      whether or not the initial space (right after the star) is present, and
      accepts both tabs and white spaces (or a combination of both) for
      indenting description sections and contents.
      
      Concretely, something like the following would now be supported:
      
          /*
           ...
           *int bpf_helper(list of arguments)
           *......Description
           *.>    >       Start of description...
           *>     >       Another line of description
           *..............And yet another line of description
           *>     Return
           *.>    ........0 on success, or a negative error in case of failure
           ...
           */
      
      While at it, remove unnecessary carets from each regex used with match()
      in the script. They are redundant, as match() tries to match from the
      beginning of the string by default.
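
      The relaxed matching can be illustrated with POSIX regular
      expressions in C. This is not the regex from the Python script; the
      pattern and the function name is_description_header are assumptions
      chosen to mirror the rule described above (optional space after the
      star, then any mix of tabs and spaces). Unlike Python's match(),
      regexec() is not anchored at the start of the string, so the caret
      is required here.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the line looks like a "Description" section header
 * inside a comment block, indented with tabs, spaces, or both. */
static bool is_description_header(const char *line)
{
    regex_t re;
    bool ok;

    /* " *" followed by one or more tabs/spaces, then the keyword. */
    if (regcomp(&re, "^ \\*[ \t]+Description$", REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    ok = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```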
      
      v2: Remove unnecessary caret when a regex is used with match().
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • x86/bpf: Clean up non-standard comments, to make the code more readable · a2c7a983
      Ingo Molnar committed
      So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
      noticed the weird and inconsistent comment style it mistakenly learned from
      the networking code:
      
       /* Multi-line comment ...
        * ... looks like this.
        */
      
      Fix this to use the standard comment style specified in Documentation/CodingStyle
      and used in arch/x86/ as well:
      
       /*
        * Multi-line comment ...
        * ... looks like this.
        */
      
      Also, to quote Linus's ... more explicit views about this:
      
        http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
      
        > But no, the networking code picked *none* of the above sane formats.
        > Instead, it picked these two models that are just half-arsed
        > shit-for-brains:
        >
        >  (no)
        >      /* This is disgusting drug-induced
        >        * crap, and should die
        >        */
        >
        >   (no-no-no)
        >       /* This is also very nasty
        >        * and visually unbalanced */
        >
        > Please. The networking code actually has the *worst* possible comment
        > style. You can literally find that (no-no-no) style, which is just
        > really horribly disgusting and worse than the otherwise fairly similar
        > (d) in pretty much every way.
      
      Also improve the comments and some other details while at it:
      
       - Don't mix same-line and previous-line comment style on otherwise
         identical code patterns within the same function,
      
       - capitalize 'BPF' and x86 register names consistently,
      
       - capitalize sentences consistently,
      
       - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
      
       - use more consistent punctuation,
      
       - use standard coding style in macros as well,
      
       - fix typos and a few other minor details.
      
      Consistent coding style is not optional, at least in arch/x86/.
      
      No change in functionality.
      
      ( In case this commit causes conflicts with pending development code
        I'll be glad to help resolve any conflicts! )
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tools: bpftool: change time format for program 'loaded at:' information · a3fe1f6f
      Quentin Monnet committed
      To make eBPF program load time easier to parse from "bpftool prog"
      output for machines, change the time format used by the program. The
      format now differs for plain and JSON version:
      
      - Plain version uses a string formatted according to ISO 8601.
      - JSON uses the number of seconds since the Epoch, which is less friendly
        for humans but even easier to process.
      
      Example output:
      
          # ./bpftool prog
          41298: xdp  tag a04f5eef06a7f555 dev foo
                  loaded_at 2018-04-18T17:19:47+0100  uid 0
                  xlated 16B  not jited  memlock 4096B
      
          # ./bpftool prog -p
          [{
                  "id": 41298,
                  "type": "xdp",
                  "tag": "a04f5eef06a7f555",
                  "gpl_compatible": false,
                  "dev": {
                      "ifindex": 14,
                      "ns_dev": 3,
                      "ns_inode": 4026531993,
                      "ifname": "foo"
                  },
                  "loaded_at": 1524068387,
                  "uid": 0,
                  "bytes_xlated": 16,
                  "jited": false,
                  "bytes_memlock": 4096
              }
          ]
      
      Previously, "Apr 18/17:19" would be used in both places.
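
      As a rough illustration of the two formats, the C sketch below renders
      one timestamp both ways. The function name format_loaded_at is
      hypothetical and this is not bpftool's actual code. It formats in UTC
      with a fixed +0000 offset so the result is reproducible, whereas the
      example output above shows a local-time offset.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: write a load time into buf.
 * json != 0: seconds since the Epoch as a decimal number.
 * json == 0: ISO 8601 string, rendered in UTC (assumption of this
 *            sketch; the commit's example shows a local offset).
 * Returns 0 on success, -1 if formatting fails or buf is too small. */
int format_loaded_at(time_t secs, int json, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    if (json) {
        int n = snprintf(buf, size, "%lld", (long long)secs);
        return (n >= 0 && (size_t)n < size) ? 0 : -1;
    }

    const struct tm *tm = gmtime(&secs);
    if (!tm)
        return -1;

    /* strftime returns 0 when the result does not fit in buf. */
    return strftime(buf, size, "%Y-%m-%dT%H:%M:%S+0000", tm) ? 0 : -1;
}
```

      With the value 1524068387 from the JSON example, the plain form comes
      out as 2018-04-18T16:19:47+0000, which is the same instant as the
      17:19:47+0100 shown above.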
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a3fe1f6f
  3. 30 Apr, 2018 8 commits
  4. 29 Apr, 2018 10 commits
    • A
      Merge branch 'fix-bpf-helpers-doc' · fcf85729
      Alexei Starovoitov committed
      Andrey Ignatov says:
      
      ====================
      BPF helpers documentation in UAPI refers to kernel ctx structures when it
      has to refer to user visible ones. Fix it.
      ====================
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fcf85729
    • A
      bpf: Sync bpf.h to tools/ · 96871b9f
      Andrey Ignatov committed
      The patch syncs bpf.h to tools/.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      96871b9f
    • A
      bpf: Fix helpers ctx struct types in uapi doc · a3ef8e9a
      Andrey Ignatov committed
      Helpers may operate on two types of ctx structures: user visible ones
      (e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
      (e.g. `struct bpf_sock_ops_kern`) in the kernel implementation.
      
      UAPI documentation must refer only to user visible structures.
      
      The patch replaces references to `_kern` structures in the BPF helper
      descriptions with the corresponding user visible structures.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a3ef8e9a
    • A
      Merge branch 'bpf_get_stack' · f60ad0a0
      Alexei Starovoitov committed
      Yonghong Song says:
      
      ====================
      Currently, the stackmap and the bpf_get_stackid helper are provided
      for BPF programs to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table regardless of
      whether BPF_F_REUSE_STACKID is specified or not,
      so some stack traces may be missing from the user's perspective.
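
      The limitation can be pictured with a toy model. The sketch below is
      not kernel code: the table size, the toy_hash function and the
      TOY_REUSE_STACKID flag are all hypothetical stand-ins. It only shows
      that when two different traces land in the same slot, the slot holds
      one of them at a time, with or without the reuse flag.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 4u            /* toy table size, a power of two */
#define MAX_DEPTH 4u            /* toy maximum trace depth */
#define TOY_REUSE_STACKID 1u    /* stand-in for BPF_F_REUSE_STACKID */

struct trace {
    uint32_t nr;                /* number of valid entries in ip[] */
    uint64_t ip[MAX_DEPTH];
};

struct bucket {
    int used;
    struct trace t;
};

/* Hypothetical hash, chosen only so that collisions are easy to build. */
uint32_t toy_hash(const struct trace *t)
{
    uint32_t h = 0;
    for (uint32_t i = 0; i < t->nr; i++)
        h = h * 31u + (uint32_t)t->ip[i];
    return h;
}

int same_trace(const struct trace *a, const struct trace *b)
{
    return a->nr == b->nr &&
           memcmp(a->ip, b->ip, a->nr * sizeof(a->ip[0])) == 0;
}

/* Returns the slot id, or -1 when the slot already holds a different
 * trace and the caller did not ask to reuse it. */
int toy_get_stackid(struct bucket *tab, const struct trace *t, unsigned flags)
{
    assert(t->nr <= MAX_DEPTH);

    uint32_t id = toy_hash(t) & (N_BUCKETS - 1u);
    struct bucket *b = &tab[id];

    if (b->used && same_trace(&b->t, t))
        return (int)id;         /* same trace already stored */
    if (b->used && !(flags & TOY_REUSE_STACKID))
        return -1;              /* collision: new trace is dropped */

    b->t = *t;                  /* empty slot, or overwrite on reuse */
    b->used = 1;
    return (int)id;
}
```

      In this model, the single-entry traces {4} and {8} both map to slot 0.
      Without the flag the second one is rejected; with the flag it replaces
      the first. Either way one of the two traces is lost.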
      
      This patch set implements a new helper, bpf_get_stack, which
      sends stack traces directly to the bpf program. The bpf program
      is able to see all stack traces, and can then do in-kernel
      processing or send stack traces to user space through a
      shared map or bpf_perf_event_output.
      
      Patches #1 and #2 implemented the core kernel support.
      Patch #3 removes two never-hit branches in the verifier.
      Patches #4 and #5 are two verifier improvements to make
      bpf programming easier. Patch #6 synced the new helper
      to tools headers. Patch #7 moved perf_event polling code
      and ksym lookup code from samples/bpf to
      tools/testing/selftests/bpf. Patch #8 added a verifier
      test in tools/bpf for the new verifier change.
      Patches #9 and #10 added tests for raw tracepoint prog
      and tracepoint prog respectively.
      
      Changelogs:
        v8 -> v9:
          . make function perf_event_mmap (in trace_helpers.c) extern
            to decouple perf_event_mmap and perf_event_poller.
          . add jit enabled handling for kernel stack verification
            in Patch #9. Since we did not have a good way to
            verify jit enabled kernel stack, just return true if
            the kernel stack is not empty.
          . In patch #9, used raw_syscalls/sys_enter instead of
            sched/sched_switch, and removed the call to cmd
            "task 1 dd if=/dev/zero of=/dev/null", which left
            a dangling process after the program exited.
        v7 -> v8:
          . rebase on top of latest bpf-next
          . simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
          . rewrite the description of bpf_get_stack() in uapi bpf.h
            based on new format.
        v6 -> v7:
          . do perf callchain buffer allocation inside the
            verifier. so if the prog->has_callchain_buf is set,
            it is guaranteed that the buffer has been allocated.
          . change condition "trace_nr <= skip" to "trace_nr < skip"
            so that for zero size buffer, return 0 instead of -EFAULT
        v5 -> v6:
          . after refining return register smax_value and umax_value
            for helpers bpf_get_stack and bpf_probe_read_str,
            bounds and var_off of the return register are further refined.
          . added missing commit message for tools header sync commit.
          . removed one unnecessary empty line.
        v4 -> v5:
          . relied on dst_reg->var_off to refine umin_val/umax_val
            in verifier handling BPF_ARSH value range tracking,
            suggested by Edward.
        v3 -> v4:
          . fixed a bug when meta ptr is set to NULL in check_func_arg.
          . introduced tnum_arshift and added detailed comments for
            the underlying implementation
          . avoided using VLA in tools/bpf test_progs.
        v2 -> v3:
          . used meta to track helper memory size argument
          . implemented range checking for ARSH in verifier
          . moved perf event polling and ksym related functions
            from samples/bpf to tools/bpf
          . added test to compare build id's between bpf_get_stackid
            and bpf_get_stack
        v1 -> v2:
          . fixed compilation error when CONFIG_PERF_EVENTS is not enabled
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f60ad0a0
    • Y
      tools/bpf: add a test for bpf_get_stack with tracepoint prog · 79b45350
      Yonghong Song committed
      The test_stacktrace_map and test_stacktrace_build_id tests are
      enhanced to call bpf_get_stack as well to get the stack
      trace. The stack traces from bpf_get_stack and bpf_get_stackid
      are compared to ensure that, for the same stack (as identified
      by the same hash), the IP addresses or build IDs are the same.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      79b45350
    • Y
      tools/bpf: add a test for bpf_get_stack with raw tracepoint prog · 173965fb
      Yonghong Song committed
      The test attaches a raw_tracepoint program to raw_syscalls/sys_enter.
      It tests getting the stack for user space, for kernel space, and for
      user space with a build_id request. It also tests getting the user
      and kernel stacks into the same buffer with back-to-back
      bpf_get_stack helper calls.
      
      If JIT is not enabled, the user space application checks
      that the kernel function for raw_tracepoint,
      ___bpf_prog_run, is part of the stack.
      
      If JIT is enabled, there is no reliable way to
      verify the kernel stack, so the test just assumes the kernel stack
      is good when the kernel stack size is greater than 0.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      173965fb
    • Y
      tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH · 2abe611c
      Yonghong Song committed
      The test_verifier already has a few ARSH test cases.
      This patch adds a new test case which takes advantage of newly
      improved verifier behavior for bpf_get_stack and ARSH.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2abe611c
    • Y
      samples/bpf: move common-purpose trace functions to selftests · 28dbf861
      Yonghong Song committed
      There is no functionality change in this patch. The common-purpose
      trace functions, including perf_event polling and ksym lookup,
      are moved from trace_output_user.c and bpf_load.c to
      selftests/bpf/trace_helpers.c so that these functions can
      be reused later in selftests.
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      28dbf861
    • Y
      tools/bpf: add bpf_get_stack helper to tools headers · de2ff05f
      Yonghong Song committed
      The tools header file bpf.h is synced with kernel uapi bpf.h.
      The new helper is also added to bpf_helpers.h.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      de2ff05f
    • Y
      bpf/verifier: improve register value range tracking with ARSH · 9cbe1f5a
      Yonghong Song committed
      When a helper like bpf_get_stack returns an int value that is
      later used for arithmetic computation, the LSH and ARSH
      operations are often required to get proper sign extension into
      64-bit. For example, without this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0)
      With this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
      With a better range for "R8", when "R8" is later added to another register,
      e.g., a map pointer or scalar-value register, a better register
      range can be derived and verifier failure may be avoided.
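
      The effect of the shift pair on both a concrete value and the tracked
      var_off can be reproduced in user space. The sketch below is modeled
      on the kernel's (value; mask) tnum representation and on the
      tnum_arshift helper named in the series changelog, but it is a
      simplified re-creation under that assumption, not the kernel source.
      The names arsh64 and sext32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's tristate number: bits set in
 * mask are unknown, bits clear in mask are known and equal value. */
struct tnum {
    uint64_t value;
    uint64_t mask;
};

/* Arithmetic right shift of a signed 64-bit value, spelled out so it
 * does not rely on how the compiler shifts negative numbers. */
int64_t arsh64(int64_t v, unsigned shift)
{
    assert(shift < 64);
    if (v >= 0)
        return v >> shift;
    return ~(~v >> shift);      /* ~v is non-negative here */
}

/* Shift value and mask alike, each treated as a signed quantity, so an
 * unknown sign bit spreads into every bit that is shifted in. */
struct tnum tnum_arshift(struct tnum a, unsigned shift)
{
    struct tnum r = {
        (uint64_t)arsh64((int64_t)a.value, shift),
        (uint64_t)arsh64((int64_t)a.mask, shift),
    };
    return r;
}

/* The "r8 <<= 32; r8 s>>= 32" pair from the log: sign-extend the low
 * 32 bits of a register. Assumes two's-complement conversion from
 * uint64_t to int64_t, as on all mainstream compilers. */
int64_t sext32(uint64_t reg)
{
    return arsh64((int64_t)(reg << 32), 32);
}
```

      Applied to the log above, tnum_arshift on (0x0; 0x3ff00000000) with a
      shift of 32 gives (0x0; 0x3ff), matching the var_off printed for R8
      with the patch. sext32 leaves 800 unchanged and turns 0xfffffff2 into
      -14, which is how a negative int return value would be recovered.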
      
      In our later example,
          ......
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0)
              return 0;
          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
          ......
      Without improving ARSH value range tracking, the register representing
      "max_len - usize" will have smin_value equal to S64_MIN and will be
      rejected by verifier.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9cbe1f5a