1. 11 8月, 2018 3 次提交
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
    • M
      net: Add ID (if needed) to sock_reuseport and expose reuseport_lock · 736b4602
      Martin KaFai Lau 提交于
      A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
      allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
      is removed from reuse->socks[], it also needs to be removed from
      the bpf map.  Also, when adding a sk to a bpf map, the bpf
      map needs to ensure it is indeed in a reuse->socks[].
      Hence, reuseport_lock is needed by the bpf map to ensure its
      map_update_elem() and map_delete_elem() operations are in-sync with
      the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
      acquire the reuseport_lock after ensuring the adding sk is already
      in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
      will be lockless.
      
      This patch also adds an ID to sock_reuseport.  A later patch
      will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
      a bpf prog to select a sk from a bpf map.  It is inflexible to
      statically enforce a bpf map can only contain the sk belonging to
      a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
      verification time. For example, think about the the map-in-map situation
      where the inner map can be dynamically changed in runtime and the outer
      map may have inner maps belonging to different reuseport groups.
      Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
      type) selects a sk,  this selected sk has to be checked to ensure it
      belongs to the requesting reuseport group (i.e. the group serving
      that IP:PORT).
      
      The "sk->sk_reuseport_cb" pointer cannot be used for this checking
      purpose because the pointer value will change after reuseport_grow().
      Instead of saving all checking conditions like the ones
      preced calling "reuseport_add_sock()" and compare them everytime a
      bpf_prog is run, a 32bits ID is introduced to survive the
      reuseport_grow().  The ID is only acquired if any of the
      reuse->socks[] is added to the newly introduced
      "BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
      
      If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
      patch is a no-op.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      736b4602
    • M
      tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      Martin KaFai Lau 提交于
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk specific info, it checks whether the sk has recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per sk "recent synq overflow event timestamp" works well
      for non SO_REUSEPORT use case.  However, it may cause random
      connection request reject/discard when SO_REUSEPORT is used with
      syncookie because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, it usually has multiple listening
      socks serving TCP connection requests destinated to the same local IP:PORT.
      There are cases that the TCP-ACK-COOKIE may not be received
      by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1},
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         There are other ordering that will trigger the similar situation
         below but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      The userspace may create and remove listening SO_REUSEPORT sockets
      as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
      incoming requests, old process stopping and new process starting...etc.
      With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      the sockets leaving and joining a reuseport group makes picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      The later patches will allow bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and selecting a particular
      SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
      replace the whole bpf reuseport_array in one map_update() by using
      map-in-map.  Getting the syncookie check working smoothly across
      socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set the newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of a reuseport group but that will require
      to iterate through reuse->socks[]  OR
      pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
      joining a reuseport group.  However, neither of them will solve the
      existing sk getting moved around the reuse->socks[] and that
      sk may not have ts_recent_stamp updated, unlikely under continuous
      synflood but not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
      
      A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change is observed.  Before and
      after the change is ~9Mpps in IPv4.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      40a1227e
  2. 10 8月, 2018 6 次提交
  3. 09 8月, 2018 3 次提交
  4. 08 8月, 2018 6 次提交
    • C
      llc: use refcount_inc_not_zero() for llc_sap_find() · 0dcb8225
      Cong Wang 提交于
      llc_sap_put() decreases the refcnt before deleting sap
      from the global list. Therefore, there is a chance
      llc_sap_find() could find a sap with zero refcnt
      in this global list.
      
      Close this race condition by checking if refcnt is zero
      or not in llc_sap_find(), if it is zero then it is being
      removed so we can just treat it as gone.
      
      Reported-by: <syzbot+278893f3f7803871f7ce@syzkaller.appspotmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0dcb8225
    • A
      net: phy: Add support for Broadcom Omega internal Combo GPHY · 6fdecfe3
      Arun Parameswaran 提交于
      Add support for the Broadcom Omega SoC internal Combo Ethernet
      GPHY to the bcm7xxx phy driver.
      Signed-off-by: NArun Parameswaran <arun.parameswaran@broadcom.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fdecfe3
    • C
      vsock: split dwork to avoid reinitializations · 455f05ec
      Cong Wang 提交于
      syzbot reported that we reinitialize an active delayed
      work in vsock_stream_connect():
      
      	ODEBUG: init active (active state 0) object type: timer_list hint:
      	delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
      	WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
      	debug_print_object+0x16a/0x210 lib/debugobjects.c:326
      
      The pattern is apparently wrong, we should only initialize
      the dealyed work once and could repeatly schedule it. So we
      have to move out the initializations to allocation side.
      And to avoid confusion, we can split the shared dwork
      into two, instead of re-using the same one.
      
      Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
      Reported-by: <syzbot+8a9b1bd330476a4f3db6@syzkaller.appspotmail.com>
      Cc: Andy king <acking@vmware.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Jorgen Hansen <jhansen@vmware.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      455f05ec
    • P
      net/sched: allow flower to match tunnel options · 0a6e7778
      Pieter Jansen van Vuuren 提交于
      Allow matching on options in Geneve tunnel headers.
      This makes use of existing tunnel metadata support.
      
      The options can be described in the form
      CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
      represented as a 16bit hexadecimal value, TYPE as an 8bit
      hexadecimal value and DATA as a variable length hexadecimal value.
      
      e.g.
       # ip link add name geneve0 type geneve dstport 0 external
       # tc qdisc add dev geneve0 ingress
       # tc filter add dev geneve0 protocol ip parent ffff: \
           flower \
             enc_src_ip 10.0.99.192 \
             enc_dst_ip 10.0.99.193 \
             enc_key_id 11 \
             geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
             ip_proto udp \
             action mirred egress redirect dev eth1
      
      This patch adds support for matching Geneve options in the order
      supplied by the user. This leads to an efficient implementation in
      the software datapath (and in our opinion hardware datapaths that
      offload this feature). It is also compatible with Geneve options
      matching provided by the Open vSwitch kernel datapath which is
      relevant here as the Flower classifier may be used as a mechanism
      to program flows into hardware as a form of Open vSwitch datapath
      offload (sometimes referred to as OVS-TC). The netlink
      Kernel/Userspace API may be extended, for example by adding a flag,
      if other matching options are desired, for example matching given
      options in any order. This would require an implementation in the
      TC software datapath. And be done in a way that drivers that
      facilitate offload of the Flower classifier can reject or accept
      such flows based on hardware datapath capabilities.
      
      This approach was discussed and agreed on at Netconf 2017 in Seoul.
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NPieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a6e7778
    • S
      flow_dissector: allow dissection of tunnel options from metadata · 92e2c405
      Simon Horman 提交于
      Allow the existing 'dissection' of tunnel metadata to 'dissect'
      options already present in tunnel metadata. This dissection is
      controlled by a new dissector key, FLOW_DISSECTOR_KEY_ENC_OPTS.
      
      This dissection only occurs when skb_flow_dissect_tunnel_info()
      is called, currently only the Flower classifier makes that call.
      So there should be no impact on other users of the flow dissector.
      
      This is in preparation for allowing the flower classifier to
      match on Geneve options.
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NPieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92e2c405
    • F
      ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE · 6cfef793
      Florian Fainelli 提交于
      Add the ability to specify through ethtool::rxnfc that a rule location is
      special and will be used to participate in Wake-on-LAN, by e.g: having a
      specific pattern be matched. When this is the case, fs->ring_cookie must
      be set to the special value RX_CLS_FLOW_WAKE.
      
      We also define an additional ethtool::wolinfo flag: WAKE_FILTER which
      can be used to configure an Ethernet adapter to allow Wake-on-LAN using
      previously programmed filters.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6cfef793
  5. 07 8月, 2018 1 次提交
    • J
      vhost: switch to use new message format · 429711ae
      Jason Wang 提交于
      We use to have message like:
      
      struct vhost_msg {
      	int type;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      Unfortunately, there will be a hole of 32bit in 64bit machine because
      of the alignment. This leads a different formats between 32bit API and
      64bit API. What's more it will break 32bit program running on 64bit
      machine.
      
      So fixing this by introducing a new message type with an explicit
      32bit reserved field after type like:
      
      struct vhost_msg_v2 {
      	__u32 type;
      	__u32 reserved;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      We will have a consistent ABI after switching to use this. To enable
      this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for
      userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      429711ae
  6. 06 8月, 2018 4 次提交
  7. 05 8月, 2018 2 次提交
  8. 04 8月, 2018 8 次提交
  9. 03 8月, 2018 7 次提交