1. 22 Mar, 2018 11 commits
    • r8169: change type of argument in rtl_disable/enable_clock_request · 73c86ee3
      Heiner Kallweit committed
      Changing the argument type to struct rtl8169_private * is more in line
      with the other functions in the driver and allows reducing the code size.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73c86ee3
    • r8169: change type of first argument in rtl_tx_performance_tweak · cb73200c
      Heiner Kallweit committed
      Changing the type of the first argument to struct rtl8169_private * is more
      in line with the other functions in the driver and allows reducing the
      code size.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb73200c
    • r8169: simplify rtl_set_mac_address · 1f7aa2bc
      Heiner Kallweit committed
      Replace open-coded functionality with eth_mac_addr().
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f7aa2bc
    • Merge tag 'batadv-next-for-davem-20180319' of git://git.open-mesh.org/linux-merge · 755f6633
      David S. Miller committed
      Simon Wunderlich says:
      
      ====================
      This feature/cleanup patchset includes the following patches:
      
       - avoid redundant multicast TT entries, by Linus Luessing
      
       - add netlink support for distributed arp table cache and multicast flags,
         by Linus Luessing (2 patches)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      755f6633
    • rds: tcp: remove register_netdevice_notifier infrastructure. · bdf5bd7f
      Sowmini Varadhan committed
      The netns deletion path does not need to wait for all net_devices
      to be unregistered before dismantling rds_tcp state for the netns
      (we are able to dismantle this state on module unload even when
      all net_devices are active so there is no dependency here).
      
      This patch removes code related to netdevice notifiers and
      refactors all the code needed to dismantle rds_tcp state
      into a ->exit callback for the pernet_operations used with
      register_pernet_device().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdf5bd7f
    • netns: send uevent messages · 692ec06d
      Christian Brauner committed
      This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
      to allow sending uevent messages into the network namespace the socket
      belongs to.
      
      Currently non-initial network namespaces are already isolated and don't
      receive uevents. There are a number of cases where it is beneficial for a
      sufficiently privileged userspace process to send a uevent into a network
      namespace.
      
      One such use case would be debugging and fuzzing of a piece of software
      which listens and reacts to uevents. By running a copy of that software
      inside a network namespace, specific uevents could then be presented to it.
      More concretely, this would allow for easy testing of udevd/ueventd.
      
      This will also allow some piece of software to run components inside a
      separate network namespace and then effectively filter what that software
      can receive. Some examples of software that do directly listen to uevents
      and that we have in the past attempted to run inside a network namespace
      are rbd (CEPH client) or the X server.
      
      Implementation:
      The implementation has been kept as simple as possible from the kernel's
      perspective. Specifically, a simple input method uevent_net_rcv() is added
      to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
      af_netlink infrastructure and neither adds an additional netlink family
      nor requires any user-visible changes.
      
      For example, by using netlink_rcv_skb() we can make use of existing netlink
      infrastructure to report back informative error messages to userspace.
      
      Furthermore, this implementation does not introduce any overhead for
      existing uevent generating codepaths. struct net got a new uevent
      socket member that records the uevent socket associated with that network
      namespace including its position in the uevent socket list. Since we record
      the uevent socket for each network namespace in struct net we don't have to
      walk the whole uevent socket list. Instead we can directly retrieve the
      relevant uevent socket and send the message. At exit time we can now also
      trivially remove the uevent socket from the uevent socket list. This keeps
      the codepath very performant without introducing needless overhead and even
      makes older codepaths faster.
      
      Uevent sequence numbers are kept global. When a uevent message is sent to
      another network namespace the implementation will simply increment the
      global uevent sequence number and append it to the received uevent. This
      has the advantage that the kernel will never need to parse the received
      uevent message to replace any existing uevent sequence numbers. Instead it
      is up to the userspace process to remove any existing uevent sequence
      numbers in case the uevent message to be sent contains any.
      
      Security:
      In order for a caller to send uevent messages to a target network namespace
      the caller must have CAP_SYS_ADMIN in the owning user namespace of the
      target network namespace. Additionally, any received uevent message is
      verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
      needed to append the uevent sequence number.
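
The sequence-number and size rules above can be sketched in userspace C. This is only an illustration, not the kernel code: the names `sketch_append_seqnum`, `sketch_seqnum` and `SKETCH_UEVENT_BUFFER_SIZE` are hypothetical, the 2048-byte limit is an assumed value, and the `SEQNUM=<n>` record format with a trailing NUL is an assumption about how the number is appended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed limit for this sketch; the real constant lives in kernel headers. */
#define SKETCH_UEVENT_BUFFER_SIZE 2048

/* Stands in for the global uevent sequence counter (hypothetical name). */
static unsigned long long sketch_seqnum;

/*
 * Append "SEQNUM=<next>" and its terminating NUL to a message of len bytes.
 * msg must have room for SKETCH_UEVENT_BUFFER_SIZE bytes.
 * Returns the new length, or -1 if the result would exceed the limit.
 */
static long sketch_append_seqnum(char *msg, size_t len)
{
	char tag[32];
	int n;

	assert(msg != NULL);

	n = snprintf(tag, sizeof(tag), "SEQNUM=%llu", sketch_seqnum + 1);
	if (n < 0 || (size_t)n >= sizeof(tag))
		return -1;
	n++; /* count the terminating NUL as part of the record */

	/* The size check includes the space needed for the sequence number. */
	if (len > SKETCH_UEVENT_BUFFER_SIZE ||
	    (size_t)n > SKETCH_UEVENT_BUFFER_SIZE - len)
		return -1;

	memcpy(msg + len, tag, (size_t)n);
	sketch_seqnum++; /* commit the new number only once it is appended */
	return (long)(len + (size_t)n);
}
```

Because the number is only ever appended, the received message is never parsed; a sender that leaves an old sequence number in the payload would end up with two of them, which is why removing it is left to userspace.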
      
      Testing:
      This patch has been tested and verified to work with the following udev
      implementations:
      1. CentOS 6 with udevd version 147
      2. Debian Sid with systemd-udevd version 237
      3. Android 7.1.1 with ueventd
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      692ec06d
    • net: add uevent socket member · 94e5e308
      Christian Brauner committed
      This commit adds struct uevent_sock to struct net. Since struct uevent_sock
      records the position of the uevent socket in the uevent socket list we can
      trivially remove it from the uevent socket list during cleanup. This speeds
      up the old removal codepath.
      Note, list_del() will hit __list_del_entry_valid() in its call chain which
      will validate that the element is a member of the list. If it isn't it will
      take care that the list is not modified.
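
A minimal userspace sketch of why recording the list position makes removal trivial: with a pointer to the element's own links, unlinking needs no walk over the list. All names here (`sketch_list`, `sketch_uevent_sock`, `sketch_net`, `sketch_net_exit`) are hypothetical stand-ins, and the consistency check only loosely imitates the kind of validation described above.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list links, in the style of a kernel list head. */
struct sketch_list {
	struct sketch_list *prev, *next;
};

struct sketch_uevent_sock {
	struct sketch_list list; /* position in the global socket list */
	int id;
};

/* Per-namespace state that remembers its own uevent socket. */
struct sketch_net {
	struct sketch_uevent_sock *uevent_sock;
};

static void sketch_list_init(struct sketch_list *head)
{
	head->prev = head;
	head->next = head;
}

static void sketch_list_add_tail(struct sketch_list *n, struct sketch_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Returns 1 if unlinked, 0 if the links are inconsistent (list untouched). */
static int sketch_list_del(struct sketch_list *n)
{
	if (!n->prev || !n->next || n->prev->next != n || n->next->prev != n)
		return 0;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
	n->next = NULL;
	return 1;
}

/* Namespace teardown: O(1) removal, no search through the socket list. */
static int sketch_net_exit(struct sketch_net *net)
{
	assert(net != NULL && net->uevent_sock != NULL);
	return sketch_list_del(&net->uevent_sock->list);
}
```

Calling `sketch_net_exit()` twice on the same namespace is harmless in this sketch: the second call sees cleared links and leaves the list unmodified.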
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94e5e308
    • net: Convert nf_ct_net_ops · aa65f636
      Kirill Tkhai committed
      These pernet_operations register and unregister sysctl.
      Also, there is inet_frags_exit_net() called in exit method,
      which has to be safe after a5600024 "net: Fix hlist
      corruptions in inet_evict_bucket()".
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa65f636
    • net: Convert lowpan_frags_ops · 08012631
      Kirill Tkhai committed
      These pernet_operations register and unregister sysctl.
      Also, there is inet_frags_exit_net() called in exit method,
      which has to be safe after a5600024 "net: Fix hlist
      corruptions in inet_evict_bucket()".
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08012631
    • net: Convert can_pernet_ops · 1ae77627
      Kirill Tkhai committed
      These pernet_operations create and destroy /proc entries
      and cancel per-net timer.
      
      Also, there are unneeded iterations over an empty list of net
      devices, since all net devices must already have been moved
      to init_net or unregistered by default_device_ops. This
      was already mentioned here:
      
      https://marc.info/?l=linux-can&m=150169589119335&w=2
      
      So, it looks safe to make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ae77627
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 454bfe97
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-03-21
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add a BPF hook for sendmsg and sendfile by reusing the ULP infrastructure
         and sockmap. Three helpers are added along with this, bpf_msg_apply_bytes(),
         bpf_msg_cork_bytes(), and bpf_msg_pull_data(). The first is used to tell
         how many bytes the verdict should be applied to, the second to tell
         that x bytes need to be queued first to retrigger the BPF program for a
         verdict, and the third helper is mainly for the sendfile case to pull in
         data for making it private for reading and/or writing, from John.
      
      2) Improve address to symbol resolution of user stack traces in BPF stackmap.
         Currently, the latter stores the address for each entry in the call trace,
         however to map these addresses to user space files, it is necessary to
         maintain the mapping from these virtual addresses to symbols in the binary
         which is not practical for system-wide profiling. Instead, this option for
         the stackmap rather stores the ELF build id and offset for the call trace
         entries, from Song.
      
      3) Add support that allows BPF programs attached to perf events to read the
         address values recorded with the perf events. They are requested through
         PERF_SAMPLE_ADDR via perf_event_open(). Main motivation behind it is to
         support building memory or lock access profiling and tracing tools with
         the help of BPF, from Teng.
      
      4) Several improvements to the tools/bpf/ Makefiles. The 'make bpf' in the
         tools directory does not provide the standard quiet output except for
         bpftool and it also does not respect specifying a build output directory.
         'make bpf_install' command neither respects specified destination nor
         prefix, all from Jiri. In addition, Jakub fixes several other minor issues
         in the Makefiles on top of that, e.g. fixing dependency paths, phony
         targets and more.
      
      5) Various doc updates e.g. add a comment for BPF fs about reserved names
         to make the dentry lookup from there a bit more obvious, and a comment
         to the bpf_devel_QA file in order to explain the diff between native
         and bpf target clang usage with regards to pointer size, from Quentin
         and Daniel.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      454bfe97
  2. 21 Mar, 2018 10 commits
  3. 20 Mar, 2018 19 commits
    • mlx5: Remove call to ida_pre_get · c846d8da
      Matthew Wilcox committed
      The mlx5 driver calls ida_pre_get() in a loop for no readily apparent
      reason.  The driver uses ida_simple_get() which will call ida_pre_get()
      by itself and there's no need to use ida_pre_get() unless using
      ida_get_new().
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c846d8da
    • Merge branch 'bpf-sockmap-ulp' · d48ce3e5
      Daniel Borkmann committed
      John Fastabend says:
      
      ====================
      This series adds a BPF hook for sendmsg and sendfile by using
      the ULP infrastructure and sockmap. A simple pseudocode example
      would be,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                      &obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above snippet any socket attached to the map would run
      msg_prog on sendmsg and sendfile system calls.
      
      Three additional helpers are added bpf_msg_apply_bytes(),
      bpf_msg_cork_bytes(), and bpf_msg_pull_data(). With
      bpf_msg_apply_bytes BPF programs can tell the infrastructure how
      many bytes the given verdict should apply to. This has two cases.
      First, if a BPF program applies a verdict to fewer bytes than the
      current sendmsg/sendfile msg contains, the verdict is applied to the
      first N bytes of the message and the BPF program is then run again with
      data pointers recalculated to the N+1 byte. In the second case, the
      BPF program applies a verdict to more bytes than the current sendmsg
      or sendfile system call contains. Here the infrastructure will cache
      the verdict and apply it to future sendmsg/sendfile calls until the
      byte limit is reached. This avoids the overhead of running BPF
      programs on large payloads.
      
      The helper bpf_msg_cork_bytes() handles a different case where
      a BPF program can not reach a verdict on a msg until it receives
      more bytes AND the program doesn't want to forward the packet
      until it is known to be "good". The example case is a user
      (albeit probably a dumb one) sending an N-byte header in 1-byte system
      calls. The BPF program can call bpf_msg_cork_bytes with the
      required byte limit to reach a verdict and then the program will
      only be called again once N bytes are received.
      
      The last helper added in this series is bpf_msg_pull_data(). It
      is used to pull data in for modification or reading. Similar to
      how skb_pull_data() works, msg_pull_data can be used to access data
      not in the initial (data_start, data_end) range. For sendpage()
      calls this is needed if any data is accessed because the BPF
      sendpage hook initializes the data_start and data_end pointers to
      zero. We do this because sendpage data is shared with the user
      and can be modified during or after the BPF verdict possibly
      invalidating any verdict the BPF program decides. For sendmsg
      the data is already copied by the sendmsg bpf infrastructure so
      we only copy the data if the user requests a data range that is
      not already linearized. This happens if the user requests larger
      blocks of data that are not in a single scatterlist element. The
      common case seems to be accessing headers which normally are
      in the first scatterlist element and already linearized.
      
      For more examples please review the sample program. There are
      examples for all the actions and helpers there.
      
      Patches 1-8 implement the above sockmap/BPF infrastructure. The
      remaining patches flesh out some minimal selftests and the sample
      sockmap program. The sockmap sample program is the main vehicle
      for testing this infrastructure and will be moved into selftests
      shortly. The final patch in this series is a simple shell script
      to run a set of tests. These are the tests I run after any changes
      to sockmap. The next task on the list after this series is to
      push those into selftests so we can avoid manually testing.
      
      Couple notes on future items in the pipeline,
      
        0. move sample sockmap programs into selftests (noted above)
        1. add additional support for tcp flags, most are ignored now.
        2. add a Documentation/bpf/sockmap file with these details
        3. support stacked ULP types to allow this and ktls to cooperate
        4. Ingress flag support, redirect only supports egress here. The
           other redirect helpers support ingress and egress flags.
        5. add optimizations, I cut a few optimizations here in the
           first iteration of the code for later study/implementation
      
      -v3 updates
        : u32 data pointers in msg_md changed to void *
        : page_address NULL check and flag verification in msg_pull_data
        : remove old note in commit msg that is no longer relevant
        : remove enum sk_msg_action, it's not used anywhere
        : fixup test_verifier W -> DW insn to account for data pointers
        : unintentionally dropped a smap_stop_tx() call in sockmap.c
      
      I propagated the ACKs forward because above changes were small
      one/two line changes.
      
      -v2 updates (discussion):
      
      Dave noticed that sendpage call was previously (in v1) running
      on the data directly. This allowed users to potentially modify
      the data after or during the BPF program. However doing a copy
      automatically even if the data is not accessed has measurable
      performance impact. So we added another helper modeled after
      the existing skb_pull_data() helper to allow users to selectively
      pull data from the msg. This is also useful in the sendmsg case
      when users need to access data outside the first scatterlist
      element or across scatterlist boundaries.
      
      While doing this I also unified the sendmsg and sendfile handlers
      a bit. Originally the sendfile call was optimized for never
      touching the data. I've decided for a first submission to drop
      this optimization and we can add it back later. It introduced
      unnecessary complexity, at least for a first posting, for a
      use case I have not entirely fleshed out yet. When the use
      case is deployed we can add it back if needed. Then we can
      review concrete performance deltas as well on real-world
      use-cases/applications.
      
      Lastly, I reorganized the patches a bit. Now all sockmap
      changes are in a single patch and each helper gets its own
      patch. This, at least IMO, makes it easier to review because
      sockmap changes are not spread across the patch series. On
      the other hand now apply_bytes, cork_bytes logic is only
      activated later in the series. But that should be OK.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d48ce3e5
    • bpf: sockmap test script · ae30727f
      John Fastabend committed
      This adds the test script I am currently using to validate
      the latest sockmap changes. Shortly sockmap will be ported
      to selftests and these will be run from the infrastructure
      there. Until then add the script here so we have a coverage
      checklist when porting into selftests.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ae30727f
    • bpf: sockmap sample test for bpf_msg_pull_data · 0dcbbf67
      John Fastabend committed
      This adds an option to test the msg_pull_data helper. This
      uses two options txmsg_start and txmsg_end to let the user
      specify start and end bytes to pull.
      
      The options can be used with txmsg_apply, txmsg_cork options
      as well as with any of the basic tests, txmsg, txmsg_redir and
      txmsg_drop (plus noisy variants) to run pull_data inline with
      those tests. By giving user direct control over the variables
      we can easily do negative testing as well as positive tests.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0dcbbf67
    • bpf: sockmap add SK_DROP tests · e6373ce7
      John Fastabend committed
      Add tests for SK_DROP.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e6373ce7
    • bpf: sockmap sample support for bpf_msg_cork_bytes() · 468b3fde
      John Fastabend committed
      Add sample application support for the bpf_msg_cork_bytes helper. This
      lets the user specify how many bytes must be queued before the BPF
      program is run again for a verdict.
      
      Similar to apply_bytes() tests these can be run as a stand-alone test
      when used without other options or inline with other tests by using
      the txmsg_cork option along with any of the basic tests txmsg,
      txmsg_redir, txmsg_drop.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      468b3fde
    • bpf: sockmap, add sample option to test apply_bytes helper · 1c16c312
      John Fastabend committed
      This adds an option to test the apply_bytes helper. This option lets
      the user specify an int on the command line specifying how much data
      each verdict should apply to.
      
      When this is set a map entry is set with the bytes input by the user
      and then the specified program --txmsg or --txmsg_redir will use the
      value and set the applied data. If no other option is set then a
      default --txmsg_apply program is run. This program will drop pkts
      if an error is detected on the bytes map lookup. Useful to verify
      the map lookup and apply helper are working and causing a hard
      error if it is not.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1c16c312
    • bpf: sockmap sample, add data verification option · 6bce9d2c
      John Fastabend committed
      To verify data is not being dropped or corrupted this adds an option
      to verify test-patterns on recv.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6bce9d2c
    • bpf: sockmap sample, add sendfile test · e67463cb
      John Fastabend committed
      To exercise TX ULP sendpage implementation we need a test that does
      a sendfile. Add sendfile test option here.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e67463cb
    • bpf: sockmap sample, add option to attach SK_MSG program · 4c4c3c27
      John Fastabend committed
      Add sockmap option to use SK_MSG program types.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4c4c3c27
    • bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG · 1acc60b6
      John Fastabend committed
      Test read and writes for BPF_PROG_TYPE_SK_MSG.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1acc60b6
    • bpf: add map tests for BPF_PROG_TYPE_SK_MSG · 82a86168
      John Fastabend committed
      Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      82a86168
    • bpf: sk_msg program helper bpf_sk_msg_pull_data · 015632bb
      John Fastabend committed
      Currently, if a bpf sk msg program is run the program
      can only parse data that the (start,end) pointers already
      consumed. For sendmsg hooks this is likely the first
      scatterlist element. For sendpage this will be the range
      (0,0) because the data is shared with userspace and by
      default we want to avoid allowing userspace to modify
      data while (or after) BPF verdict is being decided.
      
      To support pulling in additional bytes for parsing use
      a new helper bpf_sk_msg_pull(start, end, flags) which
      works similar to cls tc logic. This helper will attempt
      to point the data start pointer at 'start' bytes offset
      into msg and data end pointer at 'end' bytes offset into
      message.
      
      After basic sanity checks to ensure 'start' <= 'end' and
      'end' <= msg_length there are a few cases we need to
      handle.
      
      First the sendmsg hook has already copied the data from
      userspace and has exclusive access to it. Therefore, it
      is not necessary to copy the data. However, it may
      be required. After finding the scatterlist element with
      'start' offset byte in it there are two cases. In one, the
      range (start,end) is entirely contained in the sg element
      and is already linear. All that is needed is to update the
      data pointers, no allocate/copy is needed. The other case
      is (start, end) crosses sg element boundaries. In this
      case we allocate a block of size 'end - start' and copy
      the data to linearize it.
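
The two sendmsg cases can be sketched as a small classifier over scatterlist element lengths. This is a userspace illustration only: `sketch_classify_pull` and the `SKETCH_PULL_*` values are hypothetical names, and treating 'end' as an exclusive bound is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_pull {
	SKETCH_PULL_INVALID = -1, /* failed the basic sanity checks */
	SKETCH_PULL_LINEAR = 0,   /* range sits inside one element */
	SKETCH_PULL_COPY = 1      /* range crosses an element boundary */
};

/*
 * sg_len[i] is the byte length of scatterlist element i. The range is
 * [start, end), with 'end' treated as exclusive (assumption of this sketch).
 */
static enum sketch_pull sketch_classify_pull(const size_t *sg_len, size_t nsg,
					     size_t start, size_t end)
{
	size_t total = 0, off = 0, i;

	assert(sg_len != NULL);

	for (i = 0; i < nsg; i++)
		total += sg_len[i];

	/* Basic sanity checks: start <= end and end <= msg_length. */
	if (start > end || end > total)
		return SKETCH_PULL_INVALID;

	/* Find the element that holds byte 'start'. */
	for (i = 0; i < nsg; i++) {
		size_t elem_end = off + sg_len[i];

		if (start < elem_end)
			return end <= elem_end ? SKETCH_PULL_LINEAR
					       : SKETCH_PULL_COPY;
		off = elem_end;
	}

	/* No element holds 'start' (empty range at the very end). */
	return SKETCH_PULL_INVALID;
}
```

In the `SKETCH_PULL_COPY` case a real implementation would allocate 'end - start' bytes and copy the data to linearize it; in the linear case only the data pointers need updating.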
      
      Next sendpage hook has not copied any data in initial
      state so that data pointers are (0,0). In this case we
      handle it similar to the above sendmsg case except the
      allocation/copy must always happen. Then when sending
      the data we have possibly three memory regions that
      need to be sent, (0, start - 1), (start, end), and
      (end + 1, msg_length). This is required to ensure any
      writes by the BPF program are correctly transmitted.
      
      Lastly this operation will invalidate any previous
      data checks so BPF programs will have to revalidate
      pointers after making this BPF call.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      015632bb
    • bpf: sockmap, add msg_cork_bytes() helper · 91843d54
      John Fastabend committed
      In the case where we need a specific number of bytes before a
      verdict can be assigned, even if the data spans multiple sendmsg
      or sendfile calls, the BPF program may use msg_cork_bytes().
      
      The extreme case is a user can call sendmsg repeatedly with
      1-byte msg segments. Obviously, this is bad for performance but
      is still valid. If the BPF program needs N bytes to validate
      a header it can use msg_cork_bytes to specify N bytes and the
      BPF program will not be called again until N bytes have been
      accumulated. The infrastructure will attempt to coalesce data
      if possible so in many cases (most of my use cases at least) the
      data will be in a single scatterlist element with data pointers
      pointing to start/end of the element. However, this is dependent
      on available memory so is not guaranteed. So BPF programs must
      validate data pointer ranges, but this is the case anyways to
      convince the verifier the accesses are valid.
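
The corking behaviour can be modelled as a simple byte accumulator. This is a userspace sketch, not the kernel logic: `sketch_cork_release_index` is a hypothetical name, and it only tracks when enough bytes have been queued for the program to be called again.

```c
#include <assert.h>
#include <stddef.h>

/*
 * send_len[i] is the size of the i-th sendmsg/sendfile call. After the
 * program asks to cork cork_n bytes, return the index of the call at which
 * cork_n bytes have accumulated (the program would run again there), or -1
 * if the sends never add up to cork_n.
 */
static long sketch_cork_release_index(const size_t *send_len, size_t nsends,
				      size_t cork_n)
{
	size_t queued = 0, i;

	assert(send_len != NULL && cork_n > 0);

	for (i = 0; i < nsends; i++) {
		queued += send_len[i]; /* data is held, not forwarded yet */
		if (queued >= cork_n)
			return (long)i;
	}
	return -1;
}
```

For the extreme case described above, eight 1-byte sends with an 8-byte header mean the program is not called again until the eighth call.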
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      91843d54
    • bpf: sockmap, add bpf_msg_apply_bytes() helper · 2a100317
      John Fastabend committed
      A single sendmsg or sendfile system call can contain multiple logical
      messages that a BPF program may want to read and apply a verdict. But,
      without an apply_bytes helper any verdict on the data applies to all
      bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
      care to read the first N bytes of a msg. If the payload is large say
      MB or even GB setting up and calling the BPF program repeatedly for
      all bytes, even though the verdict is already known, creates
      unnecessary overhead.
      
      To allow BPF programs to control how many bytes a given verdict
      applies to we implement a bpf_msg_apply_bytes() helper. When called
      from within a BPF program this sets a counter, internal to the
      BPF infrastructure, that applies the last verdict to the next N
      bytes. If the N is smaller than the current data being processed
      from a sendmsg/sendfile call, the first N bytes will be sent and
      the BPF program will be re-run with start_data pointing to the N+1
      byte. If N is larger than the current data being processed the
      BPF verdict will be applied to multiple sendmsg/sendfile calls
      until N bytes are consumed.
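
A small userspace model of the counter described above, counting how often the BPF program would run. It is a sketch under stated assumptions: `sketch_count_prog_runs` is a hypothetical name, and the modelled program always asks for the same N on every run.

```c
#include <assert.h>
#include <stddef.h>

/*
 * send_len[i] is the size of the i-th sendmsg/sendfile call. Each program
 * run sets the counter to apply_n; the verdict then covers that many bytes,
 * within one call or across several, before the program runs again.
 */
static unsigned sketch_count_prog_runs(const size_t *send_len, size_t nsends,
				       size_t apply_n)
{
	size_t budget = 0; /* bytes still covered by the cached verdict */
	unsigned runs = 0;
	size_t i;

	assert(send_len != NULL && apply_n > 0);

	for (i = 0; i < nsends; i++) {
		size_t left = send_len[i];

		while (left > 0) {
			size_t take;

			if (budget == 0) {
				runs++;           /* program runs, picks a verdict */
				budget = apply_n; /* and applies it to N bytes */
			}
			take = left < budget ? left : budget;
			left -= take;
			budget -= take;
		}
	}
	return runs;
}
```

With a single 100-byte send and N = 40 the program runs three times (bytes 0-39, 40-79, 80-99); with N larger than the total payload it runs once and the cached verdict covers every later call.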
      
      Note: if a socket closes with the apply_bytes counter non-zero this
      is not a problem because data is not being buffered for N bytes
      and is sent as it is received.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2a100317
    • bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend committed
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again, a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      part of the message currently being processed.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similarly to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting the MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. These will follow the initial
      series shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • J
      net: generalize sk_alloc_sg to work with scatterlist rings · 8c05dbf0
      John Fastabend committed
      The current implementation of sk_alloc_sg expects scatterlist to always
      start at entry 0 and complete at entry MAX_SKB_FRAGS.
      
      Future patches will want to support starting at an arbitrary offset
      into the scatterlist, so add an additional sg_start parameter and
      then default to the current values in TLS code paths.
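
      The ring behaviour described above can be illustrated with a small
      standalone model. It is not sk_alloc_sg itself: model_ring_alloc(),
      the value of MODEL_MAX_FRAGS, and the convention of keeping one slot
      free to tell a full ring from an empty one are all assumptions.

```c
#include <assert.h>

#define MODEL_MAX_FRAGS 17	/* stand-in for MAX_SKB_FRAGS; value assumed */

/* Advance a ring index, wrapping to 0 after the last slot. */
static int model_ring_next(int i)
{
	return (i + 1 == MODEL_MAX_FRAGS) ? 0 : i + 1;
}

/* Mark up to `want` slots as used, starting at *sg_end and wrapping
 * around the ring. Stops when the next slot would collide with sg_start.
 * Returns the number of slots marked and updates *sg_end. */
static int model_ring_alloc(int used[MODEL_MAX_FRAGS], int sg_start,
			    int *sg_end, int want)
{
	int got = 0;

	assert(sg_start >= 0 && sg_start < MODEL_MAX_FRAGS);
	assert(*sg_end >= 0 && *sg_end < MODEL_MAX_FRAGS);

	while (got < want) {
		int next = model_ring_next(*sg_end);

		if (next == sg_start)
			break;	/* ring full in this model */
		used[*sg_end] = 1;
		*sg_end = next;
		got++;
	}
	return got;
}
```

      Passing sg_start = 0 with an empty ring reproduces the old
      "start at entry 0" behaviour, while a non-zero sg_start wraps past
      the last entry back to slot 0.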
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8c05dbf0
    • J
      net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG · 312fc2b4
      John Fastabend committed
      When calling do_tcp_sendpages() from within the kernel, and we know
      the data has no references from the user side, we can omit the
      SKBTX_SHARED_FRAG flag. This patch adds an internal flag,
      NO_SKBTX_SHARED_FRAG, that can be used to omit setting
      SKBTX_SHARED_FRAG.
      
      The flag is not exposed to userspace because the sendpage call from
      the splice logic masks out all bits except MSG_MORE.
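
      The masking argument can be checked with a tiny model. The bit value
      chosen for MODEL_NO_SHARED_FRAG and both function names are
      hypothetical; only the "mask everything except MSG_MORE" step is
      taken from the text.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef MSG_MORE
#define MSG_MORE 0x8000		/* fallback for non-Linux builds; assumed */
#endif

/* Hypothetical bit for the internal flag; the real value may differ. */
#define MODEL_NO_SHARED_FRAG 0x80000

/* Model of the splice-side sendpage call: only MSG_MORE survives. */
static int model_splice_flags(int user_flags)
{
	/* The internal bit must not overlap the one bit that is kept. */
	assert((MODEL_NO_SHARED_FRAG & MSG_MORE) == 0);
	return user_flags & MSG_MORE;
}

/* Model of the in-kernel decision: mark the frag as shared unless the
 * internal flag asks not to. Returns 1 if the shared marker is set. */
static int model_sets_shared_frag(int flags)
{
	return !(flags & MODEL_NO_SHARED_FRAG);
}
```

      Even if userspace tried to smuggle the internal bit in, the mask
      strips it, so the shared-frag marker is still set on that path.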
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      312fc2b4
    • J
      sockmap: convert refcnt to an atomic refcnt · ffa35660
      John Fastabend committed
      The sockmap refcnt up until now has been wrapped in the
      sk_callback_lock(), so it has not needed any locking of its
      own. The counter itself tracks the lifetime of the psock object.
      Sockets in a sockmap have a lifetime that is independent of the
      map they are part of. This is possible because a single socket may
      be in multiple maps. When this happens we can only release the
      psock data associated with the socket when the refcnt reaches
      zero. There are three possible paths that decrement the sock
      reference. First, through the normal sockmap process, the user
      deletes the socket from the map. Second, the map is removed and all
      sockets in the map are removed; this delete path is similar to
      case 1. The third case is an asynchronous socket event such as
      closing the socket. The last case handles removing sockets that are
      no longer available. For completeness, although inc does not pose
      any problems in this patch series, the inc case only happens when a
      psock is added to a map.
      
      Next we plan to add another socket prog type to handle policy and
      monitoring on the TX path. When we do this however we will need to
      keep a reference count open across the sendmsg/sendpage call and
      holding the sk_callback_lock() here (on every send) seems less than
      ideal, and it may also sleep in cases where we hit memory pressure.
      Instead of dealing with these issues in some clever way, simply make
      the reference counting a refcount_t type and do proper atomic ops.
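
      The lifetime rule described above (release the psock only when the
      last reference is dropped, without holding a lock around the counter)
      can be sketched in userspace with C11 atomics. struct model_psock and
      its helpers are hypothetical names; the kernel uses its own refcount
      primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-socket psock state. */
struct model_psock {
	atomic_int refcnt;
	bool released;		/* set when the last reference is dropped */
};

static void model_psock_init(struct model_psock *p)
{
	atomic_init(&p->refcnt, 1);	/* reference held by the first map */
	p->released = false;
}

/* Take a reference only while the object is still live (count != 0),
 * e.g. when the sock is added to another map or across a send. */
static bool model_psock_get(struct model_psock *p)
{
	int old = atomic_load(&p->refcnt);

	while (old != 0) {
		/* On failure `old` is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&p->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the caller that takes the count to zero releases. */
static void model_psock_put(struct model_psock *p)
{
	int old = atomic_fetch_sub(&p->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		p->released = true;	/* stand-in for freeing psock state */
}
```

      Each of the three decrement paths would call model_psock_put() in
      this model, and whichever one runs last performs the release.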
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ffa35660