1. 11 Aug 2018, 1 commit
    • bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Committed by Martin KaFai Lau
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make a decision.  In this case, the
      decision is which SO_REUSEPORT sk should serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      A later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT so that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in a later patch) to learn
      the application-level connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection-info contract/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call, "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old fds.  [Note that
      deleting an fd from a bpf map does not necessarily mean the fd is closed.]
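
      A minimal userspace sketch of that switch, assuming libbpf's low-level
      wrappers from <bpf/bpf.h>; the outer-map layout, the fd variables and
      the index below are illustrative, not part of this patch:

      #include <bpf/bpf.h>
      #include <stdio.h>

      /* Swap the inner reuseport array consulted by the bpf prog in one call. */
      int switch_server_generation(int outer_map_fd, int new_reuseport_array_fd)
      {
              __u32 index = 0;        /* outer-map slot the bpf prog looks up */

              if (bpf_map_update_elem(outer_map_fd, &index,
                                      &new_reuseport_array_fd, BPF_ANY)) {
                      perror("bpf_map_update_elem");
                      return -1;
              }
              /* The old process can keep draining pending requests before
               * closing its fds; replacing map entries does not close them. */
              return 0;
      }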
      
      During map_update_elem(), only a SO_REUSEPORT sk (i.e. one which has
      already been added to a reuse->socks[]) can be used.  That means a
      SO_REUSEPORT sk that has been "bind()"-ed for UDP or "bind()+listen()"-ed
      for TCP.  These conditions are ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice, even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT;
      there is no need to re-create a similar usage on the BPF side.
      
      When a SO_REUSEPORT sk is removed from the "reuse->socks[]" (e.g. on
      "close()"), it will notify the bpf map to remove it from the map as well.
      This is done through "bpf_sk_reuseport_detach()", which is only called
      if at least one of the "reuse->socks[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to stay in sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here as well.  Care has
      been taken to ensure the lock is only acquired after the
      sk being added passes some strict tests, and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64 bits, but this goes against the natural userspace
      expectation of a 32-bit sizeof(fd) (which the other fd-based bpf maps also
      use).  It may catch users by surprise if we enforce value_size=8 while
      userspace still passes a 32-bit fd during update.  Supporting a different
      value_size for lookup and update also seems unintuitive.
      
      We also need to consider the case where other existing fd-based maps want
      to return a 64-bit value from the syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, on the
      assumption that users will usually use value_size=4.  The syscall's lookup
      returns ENOSPC for value_size=4.  It only returns the 64-bit value from
      sock_gen_cookie() when the user consciously chooses value_size=8 (as a
      signal that lookup is desired), which then requires a 64-bit value in
      both lookup and update.
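
      A small hedged sketch of the syscall-side lookup, assuming the array was
      created with value_size=8; the fd and index names are illustrative:

      #include <bpf/bpf.h>
      #include <stdio.h>

      /* Print the cookie of the sk stored at 'index'; with value_size=4 the
       * same lookup would fail with ENOSPC instead. */
      void dump_sk_cookie(int reuseport_array_fd, __u32 index)
      {
              __u64 cookie;

              if (bpf_map_lookup_elem(reuseport_array_fd, &index, &cookie) == 0)
                      printf("slot %u -> sock cookie %llu\n", index,
                             (unsigned long long)cookie);
      }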
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 08 Aug 2018, 2 commits
    • net/sched: allow flower to match tunnel options · 0a6e7778
      Committed by Pieter Jansen van Vuuren
      Allow matching on options in Geneve tunnel headers.
      This makes use of existing tunnel metadata support.
      
      The options can be described in the form
      CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
      represented as a 16-bit hexadecimal value, TYPE as an 8-bit
      hexadecimal value and DATA as a variable-length hexadecimal value.
      
      e.g.
       # ip link add name geneve0 type geneve dstport 0 external
       # tc qdisc add dev geneve0 ingress
       # tc filter add dev geneve0 protocol ip parent ffff: \
           flower \
             enc_src_ip 10.0.99.192 \
             enc_dst_ip 10.0.99.193 \
             enc_key_id 11 \
             geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
             ip_proto udp \
             action mirred egress redirect dev eth1
      
      This patch adds support for matching Geneve options in the order
      supplied by the user. This leads to an efficient implementation in
      the software datapath (and, in our opinion, in hardware datapaths that
      offload this feature). It is also compatible with the Geneve options
      matching provided by the Open vSwitch kernel datapath, which is
      relevant here as the Flower classifier may be used as a mechanism
      to program flows into hardware as a form of Open vSwitch datapath
      offload (sometimes referred to as OVS-TC). The netlink
      Kernel/Userspace API may be extended, for example by adding a flag,
      if other matching behaviours are desired, for example matching the
      given options in any order. This would require an implementation in
      the TC software datapath, and would need to be done in a way that
      drivers which offload the Flower classifier can accept or reject
      such flows based on hardware datapath capabilities.
      
      This approach was discussed and agreed on at Netconf 2017 in Seoul.
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE · 6cfef793
      Committed by Florian Fainelli
      Add the ability to specify through ethtool::rxnfc that a rule location is
      special and will be used to participate in Wake-on-LAN, e.g. by having a
      specific pattern matched. When this is the case, fs->ring_cookie must
      be set to the special value RX_CLS_FLOW_WAKE.
      
      We also define an additional ethtool::wolinfo flag, WAKE_FILTER, which
      can be used to configure an Ethernet adapter to allow Wake-on-LAN using
      previously programmed filters.
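
      A hedged sketch of how a wake-up rule might be installed from userspace,
      assuming headers that already carry the RX_CLS_FLOW_WAKE definition from
      this patch and a driver advertising WAKE_FILTER; the interface name, UDP
      port and use of RX_CLS_LOC_ANY are illustrative:

      #include <string.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>
      #include <arpa/inet.h>
      #include <linux/ethtool.h>
      #include <linux/sockios.h>

      int install_wake_rule(const char *ifname)
      {
              struct ethtool_rxnfc nfc = { .cmd = ETHTOOL_SRXCLSRLINS };
              struct ifreq ifr = {};
              int fd = socket(AF_INET, SOCK_DGRAM, 0);
              int ret;

              nfc.fs.flow_type = UDP_V4_FLOW;
              nfc.fs.h_u.udp_ip4_spec.pdst = htons(9);   /* match UDP dport 9 */
              nfc.fs.m_u.udp_ip4_spec.pdst = 0xffff;     /* full mask on dport */
              nfc.fs.location = RX_CLS_LOC_ANY;          /* let the driver pick a slot */
              nfc.fs.ring_cookie = RX_CLS_FLOW_WAKE;     /* wake-up, not queue steering */

              strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
              ifr.ifr_data = (void *)&nfc;
              ret = ioctl(fd, SIOCETHTOOL, &ifr);
              close(fd);
              return ret;
      }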
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 07 Aug 2018, 1 commit
    • vhost: switch to use new message format · 429711ae
      Committed by Jason Wang
      We used to have a message format like:
      
      struct vhost_msg {
      	int type;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      Unfortunately, there is a 32-bit hole on 64-bit machines because of the
      alignment. This leads to different formats between the 32-bit and 64-bit
      APIs. What's more, it breaks 32-bit programs running on a 64-bit
      machine.
      
      Fix this by introducing a new message type with an explicit
      32-bit reserved field after the type, like:
      
      struct vhost_msg_v2 {
      	__u32 type;
      	__u32 reserved;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      We will have a consistent ABI after switching to this. To enable
      this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURE) for
      userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).
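
      The hole can be demonstrated with a small, self-contained userspace
      program; the structs below are local replicas of the two layouts (not
      the real uapi headers), compared with offsetof():

      #include <stdint.h>
      #include <stddef.h>
      #include <stdio.h>

      struct iotlb_like { uint64_t iova, size, uaddr; uint8_t perm, type; };

      struct msg_v1 {                 /* old vhost_msg layout */
              int type;
              union { struct iotlb_like iotlb; uint8_t padding[64]; };
      };

      struct msg_v2 {                 /* new vhost_msg_v2 layout */
              uint32_t type;
              uint32_t reserved;      /* explicit hole, same on 32- and 64-bit */
              union { struct iotlb_like iotlb; uint8_t padding[64]; };
      };

      int main(void)
      {
              /* On x86-64 the union in msg_v1 starts at offset 8 (implicit
               * 4-byte hole after type); on i386 it starts at offset 4, so
               * the two ABIs disagree.  msg_v2 is 8 on both. */
              printf("v1 payload offset: %zu\n", offsetof(struct msg_v1, iotlb));
              printf("v2 payload offset: %zu\n", offsetof(struct msg_v2, iotlb));
              return 0;
      }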
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 Aug 2018, 1 commit
  5. 05 Aug 2018, 1 commit
  6. 04 Aug 2018, 6 commits
  7. 03 Aug 2018, 2 commits
  8. 02 Aug 2018, 4 commits
  9. 31 Jul 2018, 2 commits
    • bpf: Support bpf_get_socket_cookie in more prog types · d692f113
      Committed by Andrey Ignatov
      The bpf_get_socket_cookie() helper can be used to identify skbs that
      correspond to the same socket.
      
      However, the socket cookie can also be useful in many other use-cases where
      a socket is available in the program context. Specifically, BPF_PROG_TYPE_CGROUP_SOCK_ADDR
      and BPF_PROG_TYPE_SOCK_OPS programs can benefit from it so that one of
      them can augment a value in a map prepared earlier by another program for
      the same socket.
      
      The patch adds support to call bpf_get_socket_cookie() from
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS.
      
      It doesn't introduce new helpers. Instead, it reuses the same helper name,
      bpf_get_socket_cookie(), but extends this helper to accept
      `struct bpf_sock_addr` and `struct bpf_sock_ops`.
      
      Documentation in bpf.h is changed in a way that should not break
      automatic generation of markdown.
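
      A hedged sketch of a sock_ops program keyed by the socket cookie; it
      uses today's libbpf conventions (BTF-style map definitions, clang
      -target bpf), which postdate this patch, and the map/program names are
      illustrative:

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_HASH);
              __uint(max_entries, 1024);
              __type(key, __u64);             /* socket cookie */
              __type(value, __u64);           /* per-socket counter */
      } sock_vals SEC(".maps");

      SEC("sockops")
      int count_ops(struct bpf_sock_ops *skops)
      {
              __u64 cookie = bpf_get_socket_cookie(skops);
              __u64 one = 1, *val;

              val = bpf_map_lookup_elem(&sock_vals, &cookie);
              if (val)
                      __sync_fetch_and_add(val, 1);
              else
                      bpf_map_update_elem(&sock_vals, &cookie, &one, BPF_NOEXIST);
              return 1;
      }

      char _license[] SEC("license") = "GPL";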
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/sched: user-space can't set unknown tcfa_action values · 802bfb19
      Committed by Paolo Abeni
      Currently, when initializing an action, user-space can specify
      and use arbitrary values for the tcfa_action field. If the value
      is unknown to the kernel, it is implicitly treated as TC_ACT_UNSPEC.
      
      This change explicitly checks for unknown values at action creation
      time and explicitly converts them to TC_ACT_UNSPEC. No functional
      changes are introduced, but this will allow introducing tcfa_action
      values not exposed to user-space in a later patch.
      
      Note: we can't use the above to hide TC_ACT_REDIRECT from user-space,
      as the latter is already part of uAPI.
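
      A standalone sketch mirroring the kind of validation described here
      (not the kernel's actual helper, whose exact list of accepted values
      may differ); the TC_ACT_* constants come from <linux/pkt_cls.h>:

      #include <linux/pkt_cls.h>

      static int sanitize_tcfa_action(int action)
      {
              switch (action) {
              case TC_ACT_UNSPEC:
              case TC_ACT_OK:
              case TC_ACT_RECLASSIFY:
              case TC_ACT_SHOT:
              case TC_ACT_PIPE:
              case TC_ACT_STOLEN:
              case TC_ACT_QUEUED:
              case TC_ACT_REPEAT:
              case TC_ACT_TRAP:
                      return action;          /* known value, keep as-is */
              default:
                      return TC_ACT_UNSPEC;   /* unknown: convert explicitly */
              }
      }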
      
      v3 -> v4:
       - use a helper to check for action validity (JiriP)
       - emit an extack for invalid actions (JiriP)
      v4 -> v5:
       - keep messages on a single line, drop net_warn (Marcelo)
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 30 Jul 2018, 6 commits
  11. 28 Jul 2018, 3 commits
  12. 27 Jul 2018, 1 commit
  13. 26 Jul 2018, 1 commit
  14. 25 Jul 2018, 3 commits
    • perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) · 6cbc304f
      Committed by Peter Zijlstra
      Vince reported the perf_fuzzer giving various unwinder warnings and
      Josh reported:
      
      > Deja vu.  Most of these are related to perf PEBS, similar to the
      > following issue:
      >
      >   b8000586 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
      >
      > This is basically the ORC version of that.  setup_pebs_sample_data() is
      > assembling a franken-pt_regs which ORC isn't happy about.  RIP is
      > inconsistent with some of the other registers (like RSP and RBP).
      
      And where the previous unwinder only needed BP and SP, ORC also requires
      IP. But we cannot spoof IP because then the sample will get displaced,
      entirely negating the point of PEBS.
      
      So cure the whole thing differently by doing the unwind early; this
      does however require a means to communicate we did the unwind early.
      We (ab)use an unused sample_type bit for this, which we set on events
      that fill out the data->callchain before the normal
      perf_prepare_sample().
      Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • net/sched: add skbprio scheduler · aea5f654
      Committed by Nishanth Devarajan
      Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
      according to their skb->priority field. Under congestion, already-enqueued lower
      priority packets will be dropped to make space available for higher priority
      packets. Skbprio was conceived as a solution for denial-of-service defenses that
      need to route packets with different priorities as a means to overcome DoS
      attacks.
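
      As a rough illustration (not the kernel qdisc code), a minimal model of
      that enqueue policy, assuming a numerically higher skb->priority means
      more important and that the caller clamps priorities to the array size:

      #define SKBPRIO_PRIOS   64
      #define SKBPRIO_LIMIT   64      /* default sch->limit from v5 below */

      struct prio_model {
              unsigned int qlen[SKBPRIO_PRIOS];  /* packets queued per priority */
              unsigned int total;
              unsigned int lowest;               /* lowest (worst) occupied priority */
      };

      /* Returns 1 if the arriving packet of priority 'prio' is admitted. */
      static int skbprio_admit(struct prio_model *q, unsigned int prio)
      {
              if (q->total < SKBPRIO_LIMIT) {
                      if (!q->total || prio < q->lowest)
                              q->lowest = prio;
                      q->qlen[prio]++;
                      q->total++;
                      return 1;
              }
              /* Full: admit only if strictly better than the worst queued packet. */
              if (prio <= q->lowest)
                      return 0;                  /* arrival dropped, overlimit++ */
              q->qlen[q->lowest]--;              /* push out one worst-priority packet */
              q->qlen[prio]++;
              while (!q->qlen[q->lowest])
                      q->lowest++;               /* worst occupied priority moves up */
              return 1;
      }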
      
      v5
      *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
      default sch->limit to 64.
      
      v4
      *Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
      page for Skbprio, in iproute2.
      
      v3
      *Drop max_limit parameter in struct skbprio_sched_data and instead use
      sch->limit.
      
      *Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
      qdisc (previously being referenced every time qdisc changes).
      
      *Move qdisc's detailed description from in-code to Documentation/networking.
      
      *When qdisc is saturated, enqueue incoming packet first before dequeueing
      lowest priority packet in queue - improves usage of call stack registers.
      
      *Introduce and use overlimit stat to keep track of number of dropped packets.
      
      v2
      *Use skb->priority field rather than DS field. Rename queueing discipline as
      SKB Priority Queue (previously Gatekeeper Priority Queue).
      
      *Queueing discipline is made classful to expose Skbprio's internal priority
      queues.
      Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
      Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
      Reviewed-by: Cody Doucette <doucette@bu.edu>
      Reviewed-by: Michel Machado <michel@digirati.com.br>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: add GBit master / slave error detection · b8f8c8eb
      Committed by Heiner Kallweit
      Certain PHYs have issues when operating in GBit slave mode and can
      be forced to master mode. An example is the RTL8211C; the Micrel PHY
      driver also has a DT setting to force master mode.
      If two such chips are link partners, autonegotiation will fail.
      The standard defines a self-clearing-on-read, latched-high bit to
      indicate this error. Check this bit to inform the user.
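
      A hedged userspace illustration of the bit in question: read the
      1000BASE-T status register (MII_STAT1000) over the MII ioctls and test
      the master/slave resolution fault bit (LPA_1000MSFAIL). The patch
      itself does this inside phylib; the interface name and error handling
      here are illustrative only:

      #include <string.h>
      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>
      #include <linux/mii.h>
      #include <linux/sockios.h>

      int gbit_ms_fault(const char *ifname)
      {
              struct ifreq ifr = {};
              struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;
              int fd = socket(AF_INET, SOCK_DGRAM, 0);

              strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
              if (ioctl(fd, SIOCGMIIPHY, &ifr) < 0)   /* fills in the PHY address */
                      return -1;
              mii->reg_num = MII_STAT1000;
              if (ioctl(fd, SIOCGMIIREG, &ifr) < 0)
                      return -1;
              /* Latched-high, self-clearing on read: a set bit means the
               * master/slave resolution failed since the last read. */
              return !!(mii->val_out & LPA_1000MSFAIL);
      }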
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 24 Jul 2018, 4 commits
    • rds: Extend RDS API for IPv6 support · b7ff8b10
      Committed by Ka-Cheong Poon
      There are many data structures (RDS socket options) used by RDS apps
      which use a 32-bit integer to store an IP address. To support IPv6,
      struct in6_addr needs to be used. To ensure backward compatibility, a
      new data structure is introduced for each of those data structures
      which use a 32-bit integer to represent an IP address, and new socket
      options are introduced to use those new structures. This means that
      existing apps should work without a problem with the new RDS module.
      For apps which want to use IPv6, those new data structures and socket
      options can be used. IPv4-mapped addresses are used to represent IPv4
      addresses in the new data structures.
      
      v4: Revert changes to SO_RDS_TRANSPORT
      Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: introduce chain object to uapi · 32a4f5ec
      Committed by Jiri Pirko
      Allow the user to create, destroy, get and dump chain objects. Do that by
      extending the rtnl commands with chain-specific ones. The user will now be
      able to explicitly create or destroy chains (so far this was done only
      automatically, according to the filter/act needs and refcounting). Also,
      the user will receive a notification about any chain creation or destruction.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: provide smc mode in smc_diag.c · c601171d
      Committed by Karsten Graul
      Rename the field diag_fallback to diag_mode and set the SMC mode of a
      connection explicitly.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: add support for backup port · 2756f68c
      Committed by Nikolay Aleksandrov
      This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT - which
      allows setting a backup port to be used for known unicast traffic if the
      port has gone carrier down. The backup pointer is RCU protected and set
      only under RTNL; a counter is maintained so that when deleting a port we
      know how many other ports reference it as a backup and can remove it from
      all of them. Also, the pointer is in the first cache line, which is hot at
      the time of the check, so in the common case we only add one more test.
      The backup port is used only for the non-flooding case since
      it's a part of the bridge and flooded packets will be forwarded to it
      anyway. To remove the forwarding, just send a 0/non-existing backup port.
      This is used to avoid numerous scalability problems when using MLAG, most
      notably that if we have thousands of fdbs one would need to change all of
      them on port carrier going down, which takes too long and causes a storm
      of fdb notifications (and again when the port comes back up). In a
      Multi-chassis Link Aggregation setup, hosts are usually connected to two
      different switches which act as a single logical switch. Those switches
      usually have a control and backup link between them called a peerlink,
      which might be used for communication in case a host loses connectivity
      to one of them.
      We need a fast way to fail over in case a host port goes down, and
      currently none of the existing solutions (like bonding) can fulfill the
      requirements, because the participating ports are actually the "master"
      devices and would have to have the same peerlink as their backup interface
      while at the same time all of them must participate in the bridge device.
      As Roopa noted, this is normal practice in routing, called fast re-route,
      where a precalculated backup path is used when the main one is down.
      Another use case of this is with EVPN, having a single vxlan device which
      is a backup of every port. Due to the nature of master devices, it is not
      currently possible to use one device as a backup for many and still have
      all of them participate in the bridge (which is itself a master).
      More detailed information about MLAG is available at the link below.
      https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
      
      Further explanation and a diagram by Roopa:
      Two switches acting in a MLAG pair are connected by the peerlink
      interface which is a bridge port.
      
      The config on one of the switches looks like the below; the other
      switch has a similar config.
      eth0 is connected to one port on the server, and the server is
      connected to both switches.
      
      br0 -- team0---eth0
            |
            -- switch-peerlink
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 20 Jul 2018, 2 commits