1. 19 9月, 2020 2 次提交
    • S
      devlink: add timeout information to status_notify · f92970c6
      Shannon Nelson 提交于
      Add a timeout element to the DEVLINK_CMD_FLASH_UPDATE_STATUS
      netlink message for use by a userland utility to show that
      a particular firmware flash activity may take a long but
      bounded time to finish.  Also add a handy helper for drivers
      to make use of the new timeout value.
      
      UI usage hints:
       - if non-zero, add timeout display to the end of the status line
       	[component] status_msg  ( Xm Ys : Am Bs )
           using the timeout value for Am Bs and updating the Xm Ys
           every second
       - if the timeout expires while awaiting the next update,
         display something like
       	[component] status_msg  ( timeout reached : Am Bs )
       - if new status notify messages are received, remove
         the timeout and start over
      Signed-off-by: NShannon Nelson <snelson@pensando.io>
      Reviewed-by: NJakub Kicinski <kuba@kernel.org>
      Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f92970c6
    • F
      net: use exponential backoff in netdev_wait_allrefs · 0e4be9e5
      Francesco Ruggeri 提交于
      The combination of aca_free_rcu, introduced in commit 2384d025
      ("net/ipv6: Add anycast addresses to a global hashtable"), and
      fib6_info_destroy_rcu, introduced in commit 9b0a8da8 ("net/ipv6:
      respect rcu grace period before freeing fib6_info"), can result in
      an extra rcu grace period being needed when deleting an interface,
      with the result that netdev_wait_allrefs ends up hitting the msleep(250),
      which is considerably longer than the required grace period.
      This can result in long delays when deleting a large number of interfaces,
      and it can be observed with this script:
      
      ns=dummy-ns
      NIFS=100
      
      ip netns add $ns
      ip netns exec $ns ip link set lo up
      ip netns exec $ns sysctl net.ipv6.conf.default.disable_ipv6=0
      ip netns exec $ns sysctl net.ipv6.conf.default.forwarding=1
      
      for ((i=0; i<$NIFS; i++))
      do
              if=eth$i
              ip netns exec $ns ip link add $if type dummy
              ip netns exec $ns ip link set $if up
              ip netns exec $ns ip -6 addr add 2021:$i::1/120 dev $if
      done
      
      for ((i=0; i<$NIFS; i++))
      do
              if=eth$i
              ip netns exec $ns ip link del $if
      done
      
      ip netns del $ns
      
      Instead of using a fixed msleep(250), this patch tries an extra
      rcu_barrier() followed by an exponential backoff.
      
      Time with this patch on a 5.4 kernel:
      
      real	0m7.704s
      user	0m0.385s
      sys	0m1.230s
      
      Time without this patch:
      
      real    0m31.522s
      user    0m0.438s
      sys     0m1.156s
      
      v2: use exponential backoff instead of trying to wake up
          netdev_wait_allrefs.
      v3: preserve reverse christmas tree ordering of local variables
      v4: try an extra rcu_barrier before the backoff, plus some
          cosmetic changes.
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e4be9e5
  2. 16 9月, 2020 1 次提交
  3. 15 9月, 2020 2 次提交
    • V
      __netif_receive_skb_core: don't untag vlan from skb on DSA master · b14a9fc4
      Vladimir Oltean 提交于
      A DSA master interface has upper network devices, each representing an
      Ethernet switch port attached to it. Demultiplexing the source ports and
      setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
      packet_type handler. Catch-all because DSA vendors have various header
      implementations, which can be placed anywhere in the frame: before the
      DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
      handler acts like an rx_handler more than anything.
      
      It is unlikely for the DSA master interface to have any other upper than
      the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
      is very likely that the DSA master will have no 8021q upper. So
      __netif_receive_skb_core() will try to untag the VLAN, despite the fact
      that the DSA switch interface might have an 8021q upper. So the skb will
      never reach that.
      
      So far, this hasn't been a problem because most of the possible
      placements of the DSA switch header mentioned in the first paragraph
      will displace the VLAN header when the DSA master receives the frame, so
      __netif_receive_skb_core() will not actually execute any VLAN-specific
      code for it. This only becomes a problem when the DSA switch header does
      not displace the VLAN header (for example with a tail tag).
      
      What the patch does is it bypasses the untagging of the skb when there
      is a DSA switch attached to this net device. So, DSA is the only
      packet_type handler which requires seeing the VLAN header. Once skb->dev
      will be changed, __netif_receive_skb_core() will be invoked again and
      untagging, or delivery to an 8021q upper, will happen in the RX of the
      DSA switch interface itself.
      
      *see commit 9eb8eff0 ("net: bridge: allow enslaving some DSA master
      network devices". This is actually the reason why I prefer keeping DSA
      as a packet_type handler of ETH_P_XDSA rather than converting to an
      rx_handler. Currently the rx_handler code doesn't support chaining, and
      this is a problem because a DSA master might be bridged.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b14a9fc4
    • P
      net: try to avoid unneeded backlog flush · 2de79ee2
      Paolo Abeni 提交于
      flush_all_backlogs() may cause deadlock on systems
      running processes with FIFO scheduling policy.
      
      The above is critical in -RT scenarios, where user-space
      specifically ensure no network activity is scheduled on
      the CPU running the mentioned FIFO process, but still get
      stuck.
      
      This commit tries to address the problem checking the
      backlog status on the remote CPUs before scheduling the
      flush operation. If the backlog is empty, we can skip it.
      
      v1 -> v2:
       - explicitly clear flushed cpu mask - Eric
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2de79ee2
  4. 11 9月, 2020 4 次提交
    • J
      net: make sure napi_list is safe for RCU traversal · 5251ef82
      Jakub Kicinski 提交于
      netpoll needs to traverse dev->napi_list under RCU, make
      sure it uses the right iterator and that removal from this
      list is handled safely.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5251ef82
    • J
      net: manage napi add/del idempotence explicitly · 4d092dd2
      Jakub Kicinski 提交于
      To RCUify napi->dev_list we need to replace list_del_init()
      with list_del_rcu(). There is no _init() version for RCU for
      obvious reasons. Up until now netif_napi_del() was idempotent
      so to make sure it remains such add a bit which is set when
      NAPI is listed, and cleared when it removed. Since we don't
      expect multiple calls to netif_napi_add() to be correct,
      add a warning on that side.
      
      Now that napi_hash_add / napi_hash_del are only called by
      napi_add / del we can actually steal its bit. We just need
      to make sure hash node is initialized correctly.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d092dd2
    • J
      net: remove napi_hash_del() from driver-facing API · 5198d545
      Jakub Kicinski 提交于
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a left over rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5198d545
    • J
      devlink: don't crash if netdev is NULL · 3ea87ca7
      Jakub Kicinski 提交于
      Following change will add support for a corner case where
      we may not have a netdev to pass to devlink_port_type_eth_set()
      but we still want to set port type.
      
      This is definitely a corner case, and drivers should not normally
      pass NULL netdev - print a warning message when this happens.
      
      Sadly for other port types (ib) switches don't have a device
      reference, the way we always do for Ethernet, so we can't put
      the warning in __devlink_port_type_set().
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ea87ca7
  5. 10 9月, 2020 3 次提交
    • P
      devlink: Use controller while building phys_port_name · 66b17082
      Parav Pandit 提交于
      Now that controller number attribute is available, use it when
      building phsy_port_name for external controller ports.
      
      An example devlink port and representor netdev name consist of controller
      annotation for external controller with controller number = 1,
      for a VF 1 of PF 0:
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev ens2f0c1pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false
        function:
          hw_addr 00:00:00:00:00:00
      
      $ devlink port show pci/0000:06:00.0/2 -jp
      {
          "port": {
              "pci/0000:06:00.0/2": {
                  "type": "eth",
                  "netdev": "ens2f0c1pf0vf1",
                  "flavour": "pcivf",
                  "controller": 1,
                  "pfnum": 0,
                  "vfnum": 1,
                  "external": true,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:00:00"
                  }
              }
          }
      }
      
      Controller number annotation is skipped for non external controllers to
      maintain backward compatibility.
      Signed-off-by: NParav Pandit <parav@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66b17082
    • P
      devlink: Introduce controller number · 3a2d9588
      Parav Pandit 提交于
      A devlink port may be for a controller consist of PCI device.
      A devlink instance holds ports of two types of controllers.
      (1) controller discovered on same system where eswitch resides
      This is the case where PCI PF/VF of a controller and devlink eswitch
      instance both are located on a single system.
      (2) controller located on external host system.
      This is the case where a controller is located in one system and its
      devlink eswitch ports are located in a different system.
      
      When a devlink eswitch instance serves the devlink ports of both
      controllers together, PCI PF/VF numbers may overlap.
      Due to this a unique phys_port_name cannot be constructed.
      
      For example in below such system controller-0 and controller-1, each has
      PCI PF pf0 whose eswitch ports can be present in controller-0.
      These results in phys_port_name as "pf0" for both.
      Similar problem exists for VFs and upcoming Sub functions.
      
      An example view of two controller systems:
      
                   ---------------------------------------------------------
                   |                                                       |
                   |           --------- ---------         ------- ------- |
      -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
      | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
      | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
      | connect |  | -------                       -------                 |
      -----------  |     | controller_num=1 (no eswitch)                   |
                   ------|--------------------------------------------------
                   (internal wire)
                         |
                   ---------------------------------------------------------
                   | devlink eswitch ports and reps                        |
                   | ----------------------------------------------------- |
                   | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                   | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                   | ----------------------------------------------------- |
                   | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                   | |pf1    | pf1vfN | pf1sfN | pf1    | pf1vfN |pf0sfN | |
                   | ----------------------------------------------------- |
                   |                                                       |
                   |                                                       |
                   |           --------- ---------         ------- ------- |
                   |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
                   | -------   ----/---- ---/----- ------- ---/--- ---/--- |
                   | | pf0 |______/________/       | pf1 |___/_______/     |
                   | -------                       -------                 |
                   |                                                       |
                   |  local controller_num=0 (eswitch)                     |
                   ---------------------------------------------------------
      
      An example devlink port for external controller with controller
      number = 1 for a VF 1 of PF 0:
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false
        function:
          hw_addr 00:00:00:00:00:00
      
      $ devlink port show pci/0000:06:00.0/2 -jp
      {
          "port": {
              "pci/0000:06:00.0/2": {
                  "type": "eth",
                  "netdev": "ens2f0pf0vf1",
                  "flavour": "pcivf",
                  "controller": 1,
                  "pfnum": 0,
                  "vfnum": 1,
                  "external": true,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:00:00"
                  }
              }
          }
      }
      Signed-off-by: NParav Pandit <parav@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a2d9588
    • P
      devlink: Introduce external controller flag · 05b595e9
      Parav Pandit 提交于
      A devlink eswitch port may represent PCI PF/VF ports of a controller.
      
      A controller either located on same system or it can be an external
      controller located in host where such NIC is plugged in.
      
      Add the ability for driver to specify if a port is for external
      controller.
      
      Use such flag in the mlx5_core driver.
      
      An example of an external controller having VF1 of PF0 belong to
      controller 1.
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false
        function:
          hw_addr 00:00:00:00:00:00
      $ devlink port show pci/0000:06:00.0/2 -jp
      {
          "port": {
              "pci/0000:06:00.0/2": {
                  "type": "eth",
                  "netdev": "ens2f0pf0vf1",
                  "flavour": "pcivf",
                  "pfnum": 0,
                  "vfnum": 1,
                  "external": true,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:00:00"
                  }
              }
          }
      }
      Signed-off-by: NParav Pandit <parav@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05b595e9
  6. 07 9月, 2020 1 次提交
  7. 02 9月, 2020 1 次提交
  8. 28 8月, 2020 2 次提交
    • M
      net: add option to not create fall-back tunnels in root-ns as well · 316cdaa1
      Mahesh Bandewar 提交于
      The sysctl that was added  earlier by commit 79134e6c ("net: do
      not create fallback tunnels for non-default namespaces") to create
      fall-back only in root-ns. This patch enhances that behavior to provide
      option not to create fallback tunnels in root-ns as well. Since modules
      that create fallback tunnels could be built-in and setting the sysctl
      value after booting is pointless, so added a kernel cmdline options to
      change this default. The default setting is preseved for backward
      compatibility. The kernel command line option of fb_tunnels=initns will
      set the sysctl value to 1 and will create fallback tunnels only in initns
      while kernel cmdline fb_tunnels=none will set the sysctl value to 2 and
      fallback tunnels are skipped in every netns.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Zenczykowski <maze@google.com>
      Cc: Jian Yang <jianyang@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      316cdaa1
    • M
      bpf: Add map_meta_equal map ops · f4d05259
      Martin KaFai Lau 提交于
      Some properties of the inner map is used in the verification time.
      When an inner map is inserted to an outer map at runtime,
      bpf_map_meta_equal() is currently used to ensure those properties
      of the inserting inner map stays the same as the verification
      time.
      
      In particular, the current bpf_map_meta_equal() checks max_entries which
      turns out to be too restrictive for most of the maps which do not use
      max_entries during the verification time.  It limits the use case that
      wants to replace a smaller inner map with a larger inner map.  There are
      some maps do use max_entries during verification though.  For example,
      the map_gen_lookup in array_map_ops uses the max_entries to generate
      the inline lookup code.
      
      To accommodate differences between maps, the map_meta_equal is added
      to bpf_map_ops.  Each map-type can decide what to check when its
      map is used as an inner map during runtime.
      
      Also, some map types cannot be used as an inner map and they are
      currently black listed in bpf_map_meta_alloc() in map_in_map.c.
      It is not unusual that the new map types may not aware that such
      blacklist exists.  This patch enforces an explicit opt-in
      and only allows a map to be used as an inner map if it has
      implemented the map_meta_equal ops.  It is based on the
      discussion in [1].
      
      All maps that support inner map has its map_meta_equal points
      to bpf_map_meta_equal in this patch.  A later patch will
      relax the max_entries check for most maps.  bpf_types.h
      counts 28 map types.  This patch adds 23 ".map_meta_equal"
      by using coccinelle.  -5 for
      	BPF_MAP_TYPE_PROG_ARRAY
      	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
      	BPF_MAP_TYPE_STRUCT_OPS
      	BPF_MAP_TYPE_ARRAY_OF_MAPS
      	BPF_MAP_TYPE_HASH_OF_MAPS
      
      The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
      is moved such that the same error is returned.
      
      [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
      f4d05259
  9. 27 8月, 2020 2 次提交
  10. 26 8月, 2020 5 次提交
  11. 25 8月, 2020 9 次提交
    • H
      net: Get rid of consume_skb when tracing is off · be769db2
      Herbert Xu 提交于
      The function consume_skb is only meaningful when tracing is enabled.
      This patch makes it conditional on CONFIG_TRACEPOINTS.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be769db2
    • P
      devlink: Protect devlink port list traversal · 5d080b50
      Parav Pandit 提交于
      Cited patch in fixes tag misses to protect port list traversal
      while traversing per port reporter list.
      
      Protect it using devlink instance lock.
      
      Fixes: f4f54166 ("devlink: Implement devlink health reporters on per-port basis")
      Signed-off-by: NParav Pandit <parav@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d080b50
    • P
      devlink: Fix per port reporter fields initialization · 79604c5d
      Parav Pandit 提交于
      Cited patch in fixes tag initializes reporters_list and reporters_lock
      of a devlink port after devlink port is added to the list. Once port
      is added to the list, devlink_nl_cmd_health_reporter_get_dumpit()
      can access the uninitialized mutex and reporters list head.
      Fix it by initializing port reporters field before adding port to the
      list.
      
      Fixes: f4f54166 ("devlink: Implement devlink health reporters on per-port basis")
      Signed-off-by: NParav Pandit <parav@nvidia.com>
      Reviewed-by: NJiri Pirko <jiri@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79604c5d
    • M
      tcp: bpf: Optionally store mac header in TCP_SAVE_SYN · 267cf9fa
      Martin KaFai Lau 提交于
      This patch is adapted from Eric's patch in an earlier discussion [1].
      
      The TCP_SAVE_SYN currently only stores the network header and
      tcp header.  This patch allows it to optionally store
      the mac header also if the setsockopt's optval is 2.
      
      It requires one more bit for the "save_syn" bit field in tcp_sock.
      This patch achieves this by moving the syn_smc bit next to the is_mptcp.
      The syn_smc is currently used with the TCP experimental option.  Since
      syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
      the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
      with "IS_ENABLED(CONFIG_MPTCP)".
      
      The mac_hdrlen is also stored in the "struct saved_syn"
      to allow a quick offset from the bpf prog if it chooses to start
      getting from the network header or the tcp header.
      
      [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
      267cf9fa
    • M
      bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841
      Martin KaFai Lau 提交于
      [ Note: The TCP changes here is mainly to implement the bpf
        pieces into the bpf_skops_*() functions introduced
        in the earlier patches. ]
      
      The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
      algorithm to be written in BPF.  It opens up opportunities to allow
      a faster turnaround time in testing/releasing new congestion control
      ideas to production environment.
      
      The same flexibility can be extended to writing TCP header option.
      It is not uncommon that people want to test new TCP header option
      to improve the TCP performance.  Another use case is for data-center
      that has a more controlled environment and has more flexibility in
      putting header options for internal only use.
      
      For example, we want to test the idea in putting maximum delay
      ACK in TCP header option which is similar to a draft RFC proposal [1].
      
      This patch introduces the necessary BPF API and use them in the
      TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
      and write TCP header options.  It currently supports most of
      the TCP packet except RST.
      
      Supported TCP header option:
      ───────────────────────────
      This patch allows the bpf-prog to write any option kind.
      Different bpf-progs can write its own option by calling the new helper
      bpf_store_hdr_opt().  The helper will ensure there is no duplicated
      option in the header.
      
      By allowing bpf-prog to write any option kind, this gives a lot of
      flexibility to the bpf-prog.  Different bpf-prog can write its
      own option kind.  It could also allow the bpf-prog to support a
      recently standardized option on an older kernel.
      
      Sockops Callback Flags:
      ──────────────────────
      The bpf program will only be called to parse/write tcp header option
      if the following newly added callback flags are enabled
      in tp->bpf_sock_ops_cb_flags:
      BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
      
      A few words on the PARSE CB flags.  When the above PARSE CB flags are
      turned on, the bpf-prog will be called on packets received
      at a sk that has at least reached the ESTABLISHED state.
      The parsing of the SYN-SYNACK-ACK will be discussed in the
      "3 Way HandShake" section.
      
      The default is off for all of the above new CB flags, i.e. the bpf prog
      will not be called to parse or write bpf hdr option.  There are
      details comment on these new cb flags in the UAPI bpf.h.
      
      sock_ops->skb_data and bpf_load_hdr_opt()
      ─────────────────────────────────────────
      sock_ops->skb_data and sock_ops->skb_data_end covers the whole
      TCP header and its options.  They are read only.
      
      The new bpf_load_hdr_opt() helps to read a particular option "kind"
      from the skb_data.
      
      Please refer to the comment in UAPI bpf.h.  It has details
      on what skb_data contains under different sock_ops->op.
      
      3 Way HandShake
      ───────────────
      The bpf-prog can learn if it is sending SYN or SYNACK by reading the
      sock_ops->skb_tcp_flags.
      
      * Passive side
      
      When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
      the received SYN skb will be available to the bpf prog.  The bpf prog can
      use the SYN skb (which may carry the header option sent from the remote bpf
      prog) to decide what bpf header option should be written to the outgoing
      SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
      More on this later.  Also, the bpf prog can learn if it is in syncookie
      mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
      
      The bpf prog can store the received SYN pkt by using the existing
      bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
      [ Note that the fullsock here is a listen sk, bpf_sk_storage
        is not very useful here since the listen sk will be shared
        by many concurrent connection requests.
      
        Extending bpf_sk_storage support to request_sock will add weight
        to the minisock and it is not necessary better than storing the
        whole ~100 bytes SYN pkt. ]
      
      When the connection is established, the bpf prog will be called
      in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
      the bpf prog can get the header option from the saved syn and
      then apply the needed operation to the newly established socket.
      The later patch will use the max delay ack specified in the SYN
      header and set the RTO of this newly established connection
      as an example.
      
      The received ACK (that concludes the 3WHS) will also be available to
      the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
      It could be useful in syncookie scenario.  More on this later.
      
      There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
      saved syn pkt which includes the IP[46] header and the TCP header.
      A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
      start getting from, e.g. starting from TCP header, or from IP[46] header.
      
      The new getsockopt(TCP_BPF_SYN*) will also know where it can get
      the SYN's packet from:
        - (a) the just received syn (available when the bpf prog is writing SYNACK)
              and it is the only way to get SYN during syncookie mode.
        or
        - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
              existing CB).
      
      The bpf prog does not need to know where the SYN pkt is coming from.
      The getsockopt(TCP_BPF_SYN*) will hide this details.
      
      Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
      bpf_load_hdr_opt() to read a particular header option from the SYN packet.
      
      * Fastopen
      
      Fastopen should work the same as the regular non fastopen case.
      This is a test in a later patch.
      
      * Syncookie
      
      For syncookie, the later example patch asks the active
      side's bpf prog to resend the header options in ACK.  The server
      can use bpf_load_hdr_opt() to look at the options in this
      received ACK during PASSIVE_ESTABLISHED_CB.
      
      * Active side
      
      The bpf prog will get a chance to write the bpf header option
      in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
      pkt will also be available to the bpf prog during the existing
      ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
      and bpf_load_hdr_opt().
      
      * Turn off header CB flags after 3WHS
      
      If the bpf prog does not need to write/parse header options
      beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
      to avoid being called for header options.
      Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
      so that the kernel will only call it when there is option that
      the kernel cannot handle.
      
      [1]: draft-wang-tcpm-low-latency-opt-00
           https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
      0813a841
    • M
      bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 · c9985d09
      Martin KaFai Lau 提交于
      A later patch needs to add a few pointers and a few u8 to
      sock_ops_kern.  Hence, this patch saves some spaces by moving
      some of the existing members from u32 to u8 so that the later
      patch can still fit everything in a cacheline.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
      c9985d09
    • M
      tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt · ca584ba0
      Martin KaFai Lau 提交于
      This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
      to set the min rto of a connection.  It could be used together
      with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).
      
      A later selftest patch will communicate the max delay ack in a
      bpf tcp header option and then the receiving side can use
      bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
      ca584ba0
    • M
      tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt · 2b8ee4f0
      Martin KaFai Lau 提交于
      This change is mostly from an internal patch and adapts it from sysctl
      config to the bpf_setsockopt setup.
      
      The bpf_prog can set the max delay ack by using
      bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
      to its peer through bpf header option.  The receiving peer can then use
      this max delay ack and set a potentially lower rto by using
      bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
      in the next patch.
      
      Another later selftest patch will also use it like the above to show
      how to write and parse bpf tcp header option.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
      2b8ee4f0
    • M
      tcp: Use a struct to represent a saved_syn · 70a217f1
      Martin KaFai Lau 提交于
      The TCP_SAVE_SYN has both the network header and tcp header.
      The total length of the saved syn packet is currently stored in
      the first 4 bytes (u32) of an array and the actual packet data is
      stored after that.
      
      A later patch will add a bpf helper that allows to get the tcp header
      alone from the saved syn without the network header.  It will be more
      convenient to have a direct offset to a specific header instead of
      re-parsing it.  This requires to separately store the network hdrlen.
      The total header length (i.e. network + tcp) is still needed for the
      current usage in getsockopt.  Although this total length can be obtained
      by looking into the tcphdr and then get the (th->doff << 2), this patch
      chooses to directly store the tcp hdrlen in the second four bytes of
      this newly created "struct saved_syn".  By using a new struct, it can
      give a readable name to each individual header length.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
      70a217f1
  12. 24 8月, 2020 1 次提交
  13. 22 8月, 2020 5 次提交
  14. 21 8月, 2020 2 次提交