1. 11 Nov 2018, 19 commits
    • bpf: Allow narrow loads with offset > 0 · 46f53a65
      Andrey Ignatov committed
      Currently the BPF verifier allows narrow loads for a context field only
      with offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      
      On the other hand, LLVM can generate a load with an offset different
      from zero that makes sense from the program logic point of view, but
      the verifier doesn't accept it.
      
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
      
        #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
        ...
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      
      where ctx is struct bpf_sock_addr.
      
      Some versions of LLVM can produce the following byte code for it:
      
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      
      where `*(u8 *)(r1 + 7)` means a narrow load for ctx->user_ip4 with size=1
      and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
      rejected by the verifier.
      
      The verifier code that rejects such loads is in bpf_ctx_narrow_access_ok(),
      which means any is_valid_access implementation that uses the function
      works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
      sock_addr_is_valid_access() for bpf_sock_addr.
      
      The patch makes such loads supported. The offset can be in [0; size_default)
      but has to be a multiple of the load size. E.g. for a __u32 field the
      following loads are supported now:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
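
      The relaxed check can be sketched as a small predicate. This is an
      illustrative sketch only; the helper name is made up, and it is not
      the kernel's actual bpf_ctx_narrow_access_ok():

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the relaxed narrow-load rule: a load of `size`
 * bytes at `off` into a field of `size_default` bytes is allowed when
 * the offset stays inside the field and is a multiple of the load size. */
static bool narrow_load_ok(uint32_t off, uint32_t size, uint32_t size_default)
{
    if (size > size_default)      /* load cannot be wider than the field */
        return false;
    if (off >= size_default)      /* load must start inside the field */
        return false;
    return off % size == 0;       /* offset aligned to the load size */
}
```

      For load sizes 1, 2 and 4 against a __u32 field (size_default = 4),
      this accepts exactly the seven combinations listed above.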
      Reported-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpftool-flow-dissector' · f2cbf958
      Alexei Starovoitov committed
      Stanislav Fomichev says:
      
      ====================
      v5 changes:
      * FILE -> PATH for load/loadall (can be either file or directory now)
      * simpler implementation for __bpf_program__pin_name
      * removed p_err for REQ_ARGS checks
      * parse_atach_detach_args -> parse_attach_detach_args
      * for -> while in bpf_object__pin_{programs,maps} recovery
      
      v4 changes:
      * addressed another round of comments/style issues from Jakub Kicinski &
        Quentin Monnet (thanks!)
      * implemented bpf_object__pin_maps and bpf_object__pin_programs helpers and
        used them in bpf_program__pin
      * added new pin_name to bpf_program so bpf_program__pin
        works with sections that contain '/'
      * moved *loadall* command implementation into a separate patch
      * added patch that implements *pinmaps* to pin maps when doing
        load/loadall
      
      v3 changes:
      * (maybe) better cleanup for partial failure in bpf_object__pin
      * added special case in bpf_program__pin for programs with single
        instances
      
      v2 changes:
      * addressed comments/style issues from Jakub Kicinski & Quentin Monnet
      * removed logic that populates jump table
      * added cleanup for partial failure in bpf_object__pin
      
      This patch series adds support for loading and attaching flow dissector
      programs from the bpftool:
      
      * first patch fixes flow dissector section name in the selftests (so
        libbpf auto-detection works)
      * second patch adds proper cleanup to bpf_object__pin, parts of which are now
        being used to attach all flow dissector progs/maps
      * third patch adds special case in bpf_program__pin for programs with
        single instances (we don't create <prog>/0 pin anymore, just <prog>)
      * fourth patch adds pin_name to the bpf_program struct
        which is now used as a pin name in bpf_program__pin et al
      * fifth patch adds *loadall* command that pins all programs, not just
        the first one
      * sixth patch adds *pinmaps* argument to load/loadall to let users pin
        all maps of the obj file
      * seventh patch adds actual flow_dissector support to the bpftool and
        an example
      ====================
      Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: support loading flow dissector · 092f0892
      Stanislav Fomichev committed
      This commit adds support for loading/attaching/detaching a flow
      dissector program.
      
      When `bpftool loadall` is called with a flow_dissector prog (i.e. when the
      'type flow_dissector' argument is passed), we load and pin all programs.
      The user is responsible for constructing the jump table for the tail calls.
      
      The last argument of `bpftool attach` is made optional for this use
      case.
      
      Example:
      bpftool prog load tools/testing/selftests/bpf/bpf_flow.o \
              /sys/fs/bpf/flow type flow_dissector \
      	pinmaps /sys/fs/bpf/flow
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 0 0 0 0 \
              value pinned /sys/fs/bpf/flow/IP
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 1 0 0 0 \
              value pinned /sys/fs/bpf/flow/IPV6
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 2 0 0 0 \
              value pinned /sys/fs/bpf/flow/IPV6OP
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 3 0 0 0 \
              value pinned /sys/fs/bpf/flow/IPV6FR
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 4 0 0 0 \
              value pinned /sys/fs/bpf/flow/MPLS
      
      bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
              key 5 0 0 0 \
              value pinned /sys/fs/bpf/flow/VLAN
      
      bpftool prog attach pinned /sys/fs/bpf/flow/flow_dissector flow_dissector
      
      Tested by using the above lines to load the prog in
      the test_flow_dissector.sh selftest.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add pinmaps argument to the load/loadall · 3767a94b
      Stanislav Fomichev committed
      This new argument lets users pin all maps from the object at the
      specified path.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add loadall command · 77380998
      Stanislav Fomichev committed
      This patch adds a new *loadall* command which slightly differs from the
      existing *load*. The *load* command loads all programs from the obj file,
      but pins only the first program. *loadall* pins all programs from the
      obj file under the specified directory.
      
      The intended use case is flow_dissector, where we want to load a bunch
      of progs, pin them all, and after that construct a jump table.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add internal pin_name · 33a2c75c
      Stanislav Fomichev committed
      pin_name is the same as section_name where '/' is replaced
      by '_'. bpf_object__pin_programs is converted to use pin_name
      to avoid the situation where section_name would require creating another
      subdirectory for a pin (as, for example, when calling bpf_object__pin_programs
      for programs in sections like "cgroup/connect6").
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: bpf_program__pin: add special case for instances.nr == 1 · fd734c5c
      Stanislav Fomichev committed
      When bpf_program has only one instance, don't create a subdirectory with
      per-instance pin files (<prog>/0). Instead, just create a single pin file
      for that single instance. This simplifies object pinning by not creating
      unnecessary subdirectories.
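
      The resulting pin-path logic can be sketched like this (the helper name
      and shape are illustrative, not libbpf's exact code):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of the pin-path choice after this change: with a
 * single instance the program pins directly at <path>; with several
 * instances each one still pins at <path>/<i>. */
static void instance_pin_path(char *buf, size_t len, const char *path,
                              int instances_nr, int i)
{
    if (instances_nr == 1)
        snprintf(buf, len, "%s", path);
    else
        snprintf(buf, len, "%s/%d", path, i);
}
```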
      
      This can potentially break existing users that depend on the case
      where '/0' is always created. However, I couldn't find any serious
      usage of bpf_program__pin inside the kernel tree and I suppose there
      should be none outside.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: cleanup after partial failure in bpf_object__pin · 0c19a9fb
      Stanislav Fomichev committed
      bpftool will use bpf_object__pin in the next commits to pin all programs
      and maps from the file; in case of a partial failure, we need to get
      back to the clean state (undo previous program/map pins).
      
      As part of a cleanup, I've added and exported separate routines to
      pin all maps (bpf_object__pin_maps) and progs (bpf_object__pin_programs)
      of an object.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: rename flow dissector section to flow_dissector · 108d50a9
      Stanislav Fomichev committed
      This makes it compatible with the logic that derives the program type
      from the section name in libbpf_prog_type_by_name.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'device-ops-as-cb' · 0157edc8
      Alexei Starovoitov committed
      Quentin Monnet says:
      
      ====================
      For passing device functions for offloaded eBPF programs, there used to
      be no place to store the pointer without making the non-offloaded
      programs pay a memory price.
      
      As a consequence, three functions were called with ndo_bpf() through
      specific commands. Now that we have struct bpf_offload_dev, and since none
      of those operations rely on RTNL, we can turn these three commands into
      hooks inside the struct bpf_prog_offload_ops, and pass them as part of
      bpf_offload_dev_create().
      
      This patch set changes the offload architecture to do so, and brings the
      relevant changes to the nfp and netdevsim drivers.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: do not pass netdev to translate() and prepare() offload callbacks · 16a8cb5c
      Quentin Monnet committed
      The kernel functions to prepare the verifier and translate for an
      offloaded program retrieve "offload" from "prog", and "netdev" from
      "offload". Then both "prog" and "netdev" are passed to the callbacks.
      
      Simplify this by letting the drivers retrieve the net device themselves
      from the offload object attached to prog - if they need it at all. There
      is currently no need to pass the netdev as an argument to those
      functions.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: pass prog instead of env to bpf_prog_offload_verifier_prep() · a40a2632
      Quentin Monnet committed
      Function bpf_prog_offload_verifier_prep(), called from the kernel BPF
      verifier to run a driver-specific callback for preparing for the
      verification step for offloaded programs, takes a pointer to a struct
      bpf_verifier_env object. However, no driver callback needs the whole
      structure at this time: the two drivers supporting this, nfp and
      netdevsim, only need a pointer to the struct bpf_prog instance held by
      env.
      
      Update the callback accordingly, on kernel side and in these two
      drivers.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: pass destroy() as a callback and remove its ndo_bpf subcommand · eb911947
      Quentin Monnet committed
      As part of the transition from ndo_bpf() to callbacks attached to struct
      bpf_offload_dev for some of the eBPF offload operations, move the
      functions related to program destruction to the struct and remove the
      subcommand that was used to call them through the NDO.
      
      Remove function __bpf_offload_ndo(), which is no longer used.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: pass translate() as a callback and remove its ndo_bpf subcommand · b07ade27
      Quentin Monnet committed
      As part of the transition from ndo_bpf() to callbacks attached to struct
      bpf_offload_dev for some of the eBPF offload operations, move the
      functions related to code translation to the struct and remove the
      subcommand that was used to call them through the NDO.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: call verifier_prep from its callback in struct bpf_offload_dev · 00db12c3
      Quentin Monnet committed
      In a way similar to the change previously brought to the verify_insn
      hook and to the finalize callback, switch to the newly added ops in
      struct bpf_prog_offload for calling the functions used to prepare driver
      verifiers.
      
      Since the dev_ops pointer in struct bpf_prog_offload is no longer used
      by any callback, we can now remove it from struct bpf_prog_offload.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: call finalize() from its callback in struct bpf_offload_dev · 6dc18fa6
      Quentin Monnet committed
      In a way similar to the change previously brought to the verify_insn
      hook, switch to the newly added ops in struct bpf_prog_offload for
      calling the functions used to perform final verification steps for
      offloaded programs.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: call verify_insn from its callback in struct bpf_offload_dev · 341b3e7b
      Quentin Monnet committed
      We intend to remove the dev_ops in struct bpf_prog_offload, and to only
      keep the ops in struct bpf_offload_dev instead, which is accessible from
      more locations for passing function pointers.
      
      But dev_ops is used for calling the verify_insn hook. Switch to the
      newly added ops in struct bpf_prog_offload instead.
      
      To avoid table lookups for each eBPF instruction to verify, we remember
      the offdev attached to a netdev and modify bpf_offload_find_netdev() to
      avoid performing a lookup more than once for a given offload object.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: pass a struct with offload callbacks to bpf_offload_dev_create() · 1385d755
      Quentin Monnet committed
      For passing device functions for offloaded eBPF programs, there used to
      be no place to store the pointer without making the non-offloaded
      programs pay a memory price.
      
      As a consequence, three functions were called with ndo_bpf() through
      specific commands. Now that we have struct bpf_offload_dev, and since
      none of those operations rely on RTNL, we can turn these three commands
      into hooks inside the struct bpf_prog_offload_ops, and pass them as part
      of bpf_offload_dev_create().
      
      This commit effectively passes a pointer to the struct to
      bpf_offload_dev_create(). We temporarily have two struct
      bpf_prog_offload_ops instances, one under offdev->ops and one under
      offload->dev_ops. The next patches will make the transition towards the
      former, so that offload->dev_ops can be removed, and callbacks relying
      on ndo_bpf() added to offdev->ops as well.
      
      While at it, rename "nfp_bpf_analyzer_ops" as "nfp_bpf_dev_ops" (and
      similarly for netdevsim).
      Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nfp: bpf: move nfp_bpf_analyzer_ops from verifier.c to offload.c · 1da6f573
      Quentin Monnet committed
      We are about to add several new callbacks to the struct, all of them
      defined in offload.c. Move the struct bpf_prog_offload_ops object in
      that file. As a consequence, nfp_verify_insn() and nfp_finalize() can no
      longer be static.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 09 Nov 2018, 7 commits
    • bpf: Extend the sk_lookup() helper to XDP hookpoint. · c8123ead
      Nitin Hande committed
      This patch proposes to extend the sk_lookup() BPF API to the
      XDP hookpoint. The sk_lookup() helper supports a lookup
      on an incoming packet to find the corresponding socket that will
      receive this packet. Current support for this BPF API is
      at the tc hookpoint. This patch extends this API to the XDP
      hookpoint. An XDP program can map the incoming packet to the
      5-tuple parameter and invoke the API to find the corresponding
      socket structure.
      Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpftool: Improve handling of ENOENT on map dumps · bf598a8f
      David Ahern committed
      bpftool output is not user-friendly when dumping a map with only a few
      populated entries:
      
          $ bpftool map
          1: devmap  name tx_devmap  flags 0x0
                  key 4B  value 4B  max_entries 64  memlock 4096B
          2: array  name tx_idxmap  flags 0x0
                  key 4B  value 4B  max_entries 64  memlock 4096B
      
          $ bpftool map dump id 1
          key:
          00 00 00 00
          value:
          No such file or directory
          key:
          01 00 00 00
          value:
          No such file or directory
          key:
          02 00 00 00
          value:
          No such file or directory
          key: 03 00 00 00  value: 03 00 00 00
      
      Handle ENOENT by keeping the line format sane and dumping
      "<no entry>" for the value:
      
          $ bpftool map dump id 1
          key: 00 00 00 00  value: <no entry>
          key: 01 00 00 00  value: <no entry>
          key: 02 00 00 00  value: <no entry>
          key: 03 00 00 00  value: 03 00 00 00
          ...
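
      The gist of the fix can be sketched as follows (the function is
      hypothetical; bpftool's real dump code is structured differently):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: format the value column of one dumped entry,
 * mapping a lookup failure with ENOENT to "<no entry>" instead of
 * printing a strerror() message on a line of its own. */
static void format_value(char *out, size_t len, int lookup_err,
                         const char *hex_value)
{
    if (lookup_err == ENOENT)
        snprintf(out, len, "value: <no entry>");
    else if (lookup_err)
        snprintf(out, len, "value: %s", strerror(lookup_err));
    else
        snprintf(out, len, "value: %s", hex_value);
}
```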
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: add a test case for sock_ops perf-event notification · 435f90a3
      Sowmini Varadhan committed
      This patch provides a tcp_bpf based eBPF sample. The test
      
      - uses ncat(1) as the TCP client program to connect() to a port
        with the intention of triggering SYN retransmissions: we
        first install an iptables DROP rule to make sure ncat SYNs are
        resent (instead of aborting instantly after a TCP RST)
      
      - has a bpf kernel module that sends a perf-event notification for
        each TCP retransmit, and also tracks the number of such notifications
        sent in the global_map
      
      The test passes when the number of event notifications intercepted
      in user-space matches the value in the global_map.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add perf event notification support for sock_ops · a5a3a828
      Sowmini Varadhan committed
      This patch allows eBPF programs that use sock_ops to send perf
      based event notifications using bpf_perf_event_output(). Our main
      use case for this is the following:
      
        We would like to monitor some subset of TCP sockets in user-space,
        (the monitoring application would define 4-tuples it wants to monitor)
        using TCP_INFO stats to analyze reported problems. The idea is to
        use those stats to see where the bottlenecks are likely to be ("is
        it application-limited?" or "is there evidence of BufferBloat in
        the path?" etc).
      
        Today we can do this by periodically polling for tcp_info, but this
        could be made more efficient if the kernel would asynchronously
        notify the application via tcp_info when some "interesting"
        thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
        are reached. And to make this effective, it is better if
        we could apply the threshold check *before* constructing the
        tcp_info netlink notification, so that we don't waste resources
        constructing notifications that will be discarded by the filter.
      
      This work solves the problem by adding perf event based notification
      support for sock_ops. The eBPF program can thus be designed to apply
      any desired filters to the bpf_sock_ops and trigger a perf event
      notification based on the evaluation from the filter. The user space
      component can use these perf event notifications to either read any
      state managed by the eBPF program, or issue a TCP_INFO netlink call
      if desired.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-max-pkt-offset' · 185067a8
      Daniel Borkmann committed
      Jiong Wang says:
      
      ====================
      The maximum packet offset accessed by one BPF program is useful
      information.
      
      Sometimes packets can be split, and for some reasons (for example,
      performance) we may want to reject the BPF program if the maximum
      packet size would trigger such a split. Normally, the MTU value is
      treated as the maximum packet size, but a BPF program does not always
      access the whole packet; it may only access the head portion of the data.
      
      We could let the verifier calculate the maximum packet offset ever used
      and record it inside the prog auxiliary information structure as a new
      field "max_pkt_offset".
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • nfp: bpf: relax prog rejection through max_pkt_offset · cf599f50
      Jiong Wang committed
      NFP refuses to offload programs whenever the MTU is set to a value
      larger than the maximum packet bytes that fit in the NFP Cluster Target
      Memory (CTM). However, an eBPF program doesn't always need to access
      the whole packet data.
      
      The verifier has always calculated the maximum direct packet access (DPA)
      offset and kept it in max_pkt_offset inside the prog auxiliary
      information. This patch relaxes prog rejection based on max_pkt_offset.
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: let verifier calculate and record max_pkt_offset · e647815a
      Jiong Wang committed
      In check_packet_access, update max_pkt_offset after the offset has passed
      __check_packet_access.
      
      It should be safe to use u32 for max_pkt_offset as explained in code
      comment.
      
      Also, when there is a tail call, the max_pkt_offset of the called program
      is unknown, so conservatively set max_pkt_offset to MAX_PACKET_OFF in
      that case.
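
      The bookkeeping can be sketched as follows (the names mirror the commit
      message, but the code is an illustration, not the verifier's actual
      implementation):

```c
#include <stdint.h>

/* Illustrative value; the kernel defines its own MAX_PACKET_OFF. */
#define MAX_PACKET_OFF 0xffff

struct prog_aux { uint32_t max_pkt_offset; };

/* After a packet access at [off, off + size) has passed the range
 * check, remember the largest byte offset ever touched. */
static void record_pkt_access(struct prog_aux *aux, uint32_t off, uint32_t size)
{
    if (off + size > aux->max_pkt_offset)
        aux->max_pkt_offset = off + size;
}

/* On a tail call the callee's accesses are unknown: be conservative. */
static void record_tail_call(struct prog_aux *aux)
{
    aux->max_pkt_offset = MAX_PACKET_OFF;
}
```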
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 08 Nov 2018, 14 commits