1. 04 Dec, 2020 1 commit
    • macvlan: Support for high multicast packet rate · d4bff72c
      Thomas Karlsson committed
      Background:
      Broadcast and multicast packets are enqueued for later processing.
      This queue was previously hardcoded to 1000.
      
      This proved insufficient for handling very high packet rates
      and resulted in packet drops for multicast,
      while unicast worked fine at the same time.
      
      The change:
      This patch makes the queue length adjustable to accommodate
      environments with very high multicast packet rates,
      while still keeping the default value of 1000 unless specified.
      
      The queue length is specified as a request per macvlan
      using the IFLA_MACVLAN_BC_QUEUE_LEN parameter.
      
      The actual used queue length will then be the maximum of
      any macvlan connected to the same port. The actual used
      queue length for the port can be retrieved (read only)
      by the IFLA_MACVLAN_BC_QUEUE_LEN_USED parameter for verification.
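
The "maximum of any macvlan on the same port" rule can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the function name is made up for illustration, and the behaviour for a port with no macvlans attached is an assumption, since the commit message does not describe it.

```python
from typing import Iterable, Optional

DEFAULT_BC_QUEUE_LEN = 1000  # default value stated in the commit message


def effective_bc_queue_len(requests: Iterable[Optional[int]]) -> int:
    """Return the queue length a port would actually use.

    Each item is one macvlan's request. None means that macvlan did not
    specify IFLA_MACVLAN_BC_QUEUE_LEN and so keeps the default.
    """
    lengths = [DEFAULT_BC_QUEUE_LEN if r is None else r for r in requests]
    # Assumption: with no macvlans at all, this sketch returns the default.
    return max(lengths, default=DEFAULT_BC_QUEUE_LEN)


# Two macvlans keep the default, one requests 5000: the port uses 5000.
used = effective_bc_queue_len([None, 5000, None])
```

In this model, `used` plays the role of the read-only IFLA_MACVLAN_BC_QUEUE_LEN_USED value.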
      
      This will be followed up by a patch to iproute2
      in order to adjust the parameter from userspace.
      Signed-off-by: Thomas Karlsson <thomas.karlsson@paneda.se>
      Link: https://lore.kernel.org/r/dd4673b2-7eab-edda-6815-85c67ce87f63@paneda.se
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 18 Sep, 2020 1 commit
  3. 08 Sep, 2020 1 commit
    • net: tighten the definition of interface statistics · 0db0c34c
      Jakub Kicinski committed
      This patch is born out of an investigation into which IEEE statistics
      correspond to which struct rtnl_link_stats64 members. Turns out that
      there seems to be reasonable consensus on the matter, among many drivers.
      To save others the time (and it took more time than I'm comfortable
      admitting) I'm adding comments referring to IEEE attributes to
      struct rtnl_link_stats64.
      
      Up until now we had two forms of documentation for stats - in
      Documentation/ABI/testing/sysfs-class-net-statistics and the comments
      on struct rtnl_link_stats64 itself. While the former is very cautious
      in defining the expected behavior, the latter feels quite dated and
      may not be easy for a modern-day driver author to understand
      (e.g. rx_over_errors). At the same time modern systems are far more
      complex and once-obvious definitions have lost their clarity. For example
      - does rx_packets count at the MAC layer (aFramesReceivedOK)?
      packets processed correctly by hardware? received by the driver?
      or maybe received by the stack?
      
      I tried to clarify the expectations, further clarifications from
      others are very welcome.
      
      The part hardest to untangle is rx_over_errors vs rx_fifo_errors
      vs rx_missed_errors. After much deliberation I concluded that for
      modern HW only two of the counters will make sense. The distinction
      between internal FIFO overflow and packets dropped due to back-pressure
      from the host is likely too implementation (driver and device) specific
      to expose in the standard stats.
      
      Now - which two of those counters we select to use is anyone's pick:
      
      sysfs documentation suggests rx_over_errors counts packets which
      did not fit into buffers due to MTU being too small, which I reused.
      There don't seem to be many modern drivers using it (well, CAN drivers
      seem to love this statistic).
      
      Of the remaining two I picked rx_missed_errors to report device drops.
      bnxt reports it and it's folded into "drop"s in procfs (while
      rx_fifo_errors is an error, and modern devices usually receive the frame
      OK, they just can't admit it into the pipeline).
      
      Of the drivers I looked at only AMD Lance-like and NS8390-like use all
      three of these counters. rx_missed_errors counts missed frames,
      rx_over_errors counts overflow events, and rx_fifo_errors counts frames
      which were truncated because they didn't fit into buffers. This suggests
      that rx_fifo_errors may be the correct stat for truncated packets, but
      I'd think a FIFO stat counting truncated packets would be very confusing
      to a modern reader.
      
      v2:
       - add driver developer notes about ethtool stat count and reset
       - replace Ethernet with IEEE 802.3 to better indicate source of attrs
       - mention byte counters don't count FCS
       - clarify RX counter is from device to host
       - drop "sightly" from sysfs paragraph
       - add examples of ethtool stats
       - s/incoming/received/ s/incoming/transmitted/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 01 Aug, 2020 1 commit
    • rtnetlink: add support for protodown reason · 829eb208
      Roopa Prabhu committed
      netdev protodown is a mechanism that allows protocols to
      hold an interface down. It was initially introduced in
      the kernel to hold links down by a multihoming protocol.
      There was also an attempt to introduce a protodown
      reason at the time, but it was rejected. protodown and protodown reason
      are supported by almost every switching and routing platform.
      It was OK for a while to live without a protodown reason.
      But it has become more critical now, given that more than
      one protocol may need to keep a link down on a system
      at the same time, e.g. vrrp peer node, port security,
      multihoming protocol. It's common for network operators and
      protocol developers to look for such a reason on a networking
      box (it's also known as errDisable by most network operators).
      
      This patch adds support for link protodown reason
      attribute. There are two ways to maintain protodown
      reasons.
      (a) enumerate every possible reason code in kernel
          - A protocol developer has to make a request and
            have that appear in a certain kernel version
      (b) provide the bits in the kernel, and allow user-space
          (sysadmin or NOS distributions) to manage the bit-to-reasonname
          map.
          - This makes extending reason codes easier (kind of like
            the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/)
      
      This patch takes approach (b).
      
      A few things about the patch:
      - It treats the protodown reason bits as a counter to indicate
      active protodown users
      - Since the protodown attribute is already an exposed UAPI,
      the reason is not enforced on a protodown set. It's a no-op
      if not used.
      The patch follows the below algorithm:
        - presence of reason bits set indicates protodown
          is in use
        - user can set protodown and protodown reason in a
          single or multiple setlink operations
        - setlink operation to clear protodown, will return -EBUSY
          if there are active protodown reason bits
        - reason is not included in link dumps if not used
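
The algorithm listed above can be modelled in a few lines. The following Python sketch is a toy model, not the kernel implementation: the `Link` class and its method names are hypothetical, and only the rules stated in the list (reason bits indicate active users, clearing protodown returns -EBUSY while any bit is set) are taken from the commit message.

```python
import errno


class Link:
    """Toy model of a netdev's protodown state (hypothetical, not kernel code)."""

    def __init__(self) -> None:
        self.protodown = False
        self.reason_bits = 0  # bit N set => reason N is currently active

    def set_reason(self, bit: int, on: bool) -> None:
        """Set or clear one protodown reason bit."""
        if on:
            self.reason_bits |= 1 << bit
        else:
            self.reason_bits &= ~(1 << bit)

    def set_protodown(self, on: bool) -> int:
        """Return 0 on success, or -EBUSY when clearing with active reasons."""
        if not on and self.reason_bits:
            return -errno.EBUSY
        self.protodown = on
        return 0


# Bit numbers follow the example r.conf in this commit: 0 = mlag, 2 = vrrp.
link = Link()
link.set_protodown(True)
link.set_reason(2, True)
link.set_reason(0, True)
busy = link.set_protodown(False)  # refused: mlag and vrrp are still active
```

In this model the reasons have to be cleared before protodown can be turned off; how a single setlink carrying both changes is ordered inside the kernel is not covered by the sketch.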
      
      example with patched iproute2:
      $cat /etc/iproute2/protodown_reasons.d/r.conf
      0 mlag
      1 evpn
      2 vrrp
      3 psecurity
      
      $ip link set dev vxlan0 protodown on protodown_reason vrrp on
      $ip link set dev vxlan0 protodown_reason mlag on
      $ip link show
      14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
      DEFAULT group default qlen 1000
          link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp>
      
      $ip link set dev vxlan0 protodown_reason mlag off
      $ip link set dev vxlan0 protodown off protodown_reason vrrp off
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 28 Jul, 2020 1 commit
    • hsr: enhance netlink socket interface to support PRP · 8f4c0e01
      Murali Karicheri committed
      Parallel Redundancy Protocol (PRP) is another redundancy protocol
      introduced by the IEC 62439 standard. It is similar to HSR in many
      aspects:
      
       - Uses a pair of Ethernet interfaces to create the PRP device
       - Uses a 6-byte redundancy protocol part (RCT, Redundancy Check
         Trailer) similar to the HSR Tag.
       - Has a Link Redundancy Entity (LRE) that works with the RCT to implement
         redundancy.
      
      The key difference is that the protocol unit is a trailer instead of a
      prefix as in HSR. That makes it interoperable with traditional network
      components such as bridges/switches, which treat it as pad bytes,
      whereas HSR nodes require some kind of translator (called a RedBox) to
      talk to regular network devices. This feature allows a regular Linux box
      to be converted to a DAN-P box. DAN-P stands for Dual Attached Node - PRP,
      similar to DAN-H (Dual Attached Node - HSR).
      
      Add a comment in the header/source code to explicitly state that the
      driver files also handle the PRP protocol.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 22 Jul, 2020 1 commit
  7. 15 Jul, 2020 1 commit
  8. 29 Jun, 2020 1 commit
  9. 28 Apr, 2020 1 commit
  10. 30 Mar, 2020 1 commit
  11. 29 Mar, 2020 1 commit
  12. 27 Mar, 2020 1 commit
  13. 25 Feb, 2020 2 commits
  14. 15 Jan, 2020 1 commit
  15. 13 Dec, 2019 1 commit
  16. 02 Oct, 2019 1 commit
  17. 05 Jul, 2019 1 commit
    • bonding: add an option to specify a delay between peer notifications · 07a4ddec
      Vincent Bernat committed
      Currently, gratuitous ARP/ND packets are sent every `miimon'
      milliseconds. This commit allows a user to specify a custom delay
      through a new option, `peer_notif_delay'.
      
      Like for `updelay' and `downdelay', this delay should be a multiple of
      `miimon' to avoid managing an additional work queue. The configuration
      logic is copied from `updelay' and `downdelay'. However, the default
      value cannot be set using a module parameter: Netlink or sysfs should
      be used to configure this feature.
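
Because the delay is counted in `miimon' intervals, it is effectively stored as a number of monitor ticks. The Python sketch below illustrates that conversion. The function name is hypothetical, and rounding a non-multiple down is an assumption about what "copied from `updelay' and `downdelay'" implies, not something the commit message states.

```python
def delay_to_ticks(delay_ms: int, miimon_ms: int) -> int:
    """Convert a delay in milliseconds to whole miimon intervals.

    Assumption: a delay that is not a multiple of miimon is rounded down.
    """
    if miimon_ms <= 0:
        raise ValueError("miimon must be non-zero to express a delay in ticks")
    return delay_ms // miimon_ms


# The example from this commit: miimon = 100, peer_notif_delay = 500.
ticks = delay_to_ticks(500, 100)
effective_delay_ms = ticks * 100
```

With these values one notification is due every 5 monitor runs, which is in line with the roughly half-second spacing in the capture that follows.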
      
      When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
      observe the 500 ms delay is respected:
      
          20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
          20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
      
      In bond_mii_monitor(), I have tried to keep the lock logic readable.
      The change is due to the fact we cannot rely on a notification to
      lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
      only triggered once every N times, while we need to decrement the
      counter each time.
      
      iproute2 also needs to be updated to be able to specify this new
      attribute through `ip link'.
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 19 Jun, 2019 1 commit
    • ipoib: show VF broadcast address · 75345f88
      Denis Kirjanov committed
      In the IPoIB case we can't see the broadcast address for a VF, but
      we can see it for the PF.
      
      Before:
      11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
      state UP mode DEFAULT group default qlen 256
          link/infiniband
      80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
      00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
          vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable,
      trust off, query_rss off
      ...
      
      After:
      11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
      state UP mode DEFAULT group default qlen 256
          link/infiniband
      80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
      00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
          vf 0     link/infiniband
      80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
      00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
      checking off, link-state disable, trust off, query_rss off
      
      v1->v2: add the IFLA_VF_BROADCAST constant
      v2->v3: put IFLA_VF_BROADCAST at the end
      to avoid KABI breakage and set NLA_REJECT
      dev_setlink
      Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 23 Jan, 2019 1 commit
    • bonding: add support for xstats and export 3ad stats · a258aeac
      Nikolay Aleksandrov committed
      This patch adds support for extended statistics (xstats) call to the
      bonding. The first user would be the 3ad code which counts the following
      events:
       - LACPDU Rx/Tx
       - LACPDU unknown type Rx
       - LACPDU illegal Rx
       - Marker Rx/Tx
       - Marker response Rx/Tx
       - Marker unknown type Rx
      
      All of these are exported via netlink as separate attributes to be
      easily extensible as we plan to add more in the future.
      Similar to how the bridge and other xstats are exported, the structure
      inside is:
       [ IFLA_STATS_LINK_XSTATS ]
         -> [ LINK_XSTATS_TYPE_BOND ]
              -> [ BOND_XSTATS_3AD ]
                   -> [ 3ad stats attributes ]
      
      With this structure it's easy to add more stat types later.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 28 Nov, 2018 1 commit
    • net: bridge: add support for user-controlled bool options · a428afe8
      Nikolay Aleksandrov committed
      We have been adding many new bridge options, a big number of which are
      boolean but still take up netlink attribute ids and waste space in the skb.
      Recently we discussed learning from link-local packets[1] and decided
      yet another new boolean option will be needed, thus introducing this API
      to save some bridge nl space.
      The API supports changing the value of multiple boolean options at once
      via the br_boolopt_multi struct which has an optmask (which options to
      set, bit per opt) and optval (options' new values). Future boolean
      options will only be added to the br_boolopt_id enum and then will have
      to be handled in br_boolopt_toggle/get. The API will automatically
      add the ability to change and export them via netlink, sysfs can use the
      single boolopt function versions to do the same. The behaviour with
      failing/succeeding is the same as with normal netlink option changing.
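
The optmask/optval pair can be read as "which bits to touch" and "what to set them to". The Python sketch below shows that bit arithmetic only. The class and function names are made up for illustration, and unlike the kernel, which handles each option through br_boolopt_toggle and may reject a change, this model applies every requested bit unconditionally.

```python
from dataclasses import dataclass


@dataclass
class BoolOptMulti:
    """Illustrative stand-in for the br_boolopt_multi struct."""

    optmask: int  # bit N set => option N is being changed
    optval: int   # bit N holds the new value for option N


def apply_boolopts(current: int, req: BoolOptMulti) -> int:
    """Return the option bitmap after applying req to current."""
    # Keep the bits outside the mask, take the masked bits from optval.
    return (current & ~req.optmask) | (req.optval & req.optmask)


# Options 0 and 2 are on; change option 0 to off and option 1 to on.
new_opts = apply_boolopts(0b0101, BoolOptMulti(optmask=0b0011, optval=0b0010))
```

Bits of optval that fall outside optmask are ignored in this model, so a caller can change several options in one request without disturbing the others.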
      
      If an option requires mapping to internal kernel flag or needs special
      configuration to be enabled then it should be handled in
      br_boolopt_toggle. It should also be able to retrieve an option's current
      state via br_boolopt_get.
      
      v2: WARN_ON() on unsupported option as that shouldn't be possible and
          also will help catch people who add new options without handling
          them for both set and get. Pass down extack so if an option desires
          it could set it on error and be more user-friendly.
      
      [1] https://www.spinics.net/lists/netdev/msg532698.html
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 09 Nov, 2018 2 commits
    • geneve: Allow configuration of DF behaviour · a025fb5f
      Stefano Brivio committed
      draft-ietf-nvo3-geneve-08 says:
      
         It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
         [RFC1981]) be used by setting the DF bit in the IP header when Geneve
         packets are transmitted over IPv4 (this is the default with IPv6).
      
      Now that ICMP error handling is working for GENEVE, we can comply with
      this recommendation.
      
      Make this configurable, though, to avoid breaking existing setups. By
      default, DF won't be set. It can be set or inherited from inner IPv4
      packets. If it's configured to be inherited and we are encapsulating IPv6,
      it will be set.
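
The three behaviours described above (not set by default, set, or inherited) reduce to a small decision function. The Python sketch below is illustrative only: the enum and function names are hypothetical, and leaving DF clear for payloads that are neither IPv4 nor IPv6 is an assumption, since the commit message does not mention that case.

```python
from enum import Enum


class DfMode(Enum):
    UNSET = "unset"      # default: DF is not set on the outer header
    SET = "set"          # always set DF
    INHERIT = "inherit"  # take DF from the inner IPv4 header


def outer_df(mode: DfMode, inner_proto: str, inner_df: bool = False) -> bool:
    """Decide the DF bit of the outer IPv4 header for a non-lwt tunnel."""
    if mode is DfMode.SET:
        return True
    if mode is DfMode.INHERIT:
        if inner_proto == "ipv4":
            return inner_df
        if inner_proto == "ipv6":
            return True  # inherit with an inner IPv6 packet means DF is set
        # Assumption: other payloads leave DF clear.
        return False
    return False
```

For example, `outer_df(DfMode.INHERIT, "ipv6")` is true, while `outer_df(DfMode.UNSET, "ipv4", inner_df=True)` is false because the default ignores the inner header.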
      
      This only applies to non-lwt tunnels: if an external control plane is
      used, tunnel key will still control the DF flag.
      
      v2:
      - DF behaviour configuration only applies for non-lwt tunnels, apply DF
        setting only if (!geneve->collect_md) in geneve_xmit_skb()
        (Stephen Hemminger)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Allow configuration of DF behaviour · b4d30697
      Stefano Brivio committed
      Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
      value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
      DF is configured to be inherited, always set it.
      
      For IPv4, inheriting DF from the inner header was probably intended from
      the very beginning judging by the comment to vxlan_xmit(), but it wasn't
      actually implemented -- also because it would have done more harm than
      good, without handling for ICMP Fragmentation Needed messages.
      
      According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
      draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
      PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
      SHOULD set the DF-bit [...]", whatever that means.
      
      Given this background, the only sane option is probably to let the user
      decide, and keep the current behaviour as default.
      
      This only applies to non-lwt tunnels: if an external control plane is
      used, tunnel key will still control the DF flag.
      
      v2:
      - DF behaviour configuration only applies for non-lwt tunnels, move DF
        setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 13 Oct, 2018 1 commit
    • net: bridge: add support for per-port vlan stats · 9163a0fc
      Nikolay Aleksandrov committed
      This patch adds an option to have per-port vlan stats instead of the
      default global stats. The option can be set only when there are no port
      vlans in the bridge since we need to allocate the stats if it is set
      when vlans are being added to ports (and respectively free them
      when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
      largest user of options. The current stats design allows us to add
      these without any changes to the fast-path, it all comes down to
      the per-vlan stats pointer which, if this option is enabled, will
      be allocated for each port vlan instead of using the global bridge-wide
      one.
      
      CC: bridge@lists.linux-foundation.org
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 13 Sep, 2018 1 commit
  24. 06 Sep, 2018 1 commit
  25. 30 Jul, 2018 1 commit
  26. 24 Jul, 2018 1 commit
    • net: bridge: add support for backup port · 2756f68c
      Nikolay Aleksandrov committed
      This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
      allows setting a backup port to be used for known unicast traffic if the
      port has gone carrier down. The backup pointer is rcu protected and set
      only under RTNL, a counter is maintained so when deleting a port we know
      how many other ports reference it as a backup and we remove it from all.
      Also the pointer is in the first cache line which is hot at the time of
      the check and thus in the common case we only add one more test.
      The backup port will be used only for the non-flooding case since
      it's a part of the bridge and the flooded packets will be forwarded to it
      anyway. To remove the forwarding just send a 0/non-existing backup port.
      This is used to avoid numerous scalability problems when using MLAG most
      notably if we have thousands of fdbs one would need to change all of them
      on port carrier going down which takes too long and causes a storm of fdb
      notifications (and again when the port comes back up). In a Multi-chassis
      Link Aggregation setup usually hosts are connected to two different
      switches which act as a single logical switch. Those switches usually have
      a control and backup link between them called peerlink which might be used
      for communication in case a host loses connectivity to one of them.
      We need a fast way to failover in case a host port goes down and currently
      none of the solutions (like bond) can fulfill the requirements because
      the participating ports are actually the "master" devices and must have the
      same peerlink as their backup interface and at the same time all of them
      must participate in the bridge device. As Roopa noted it's normal practice
      in routing called fast re-route where a precalculated backup path is used
      when the main one is down.
      Another use case of this is with EVPN, having a single vxlan device which
      is backup of every port. Due to the nature of master devices it's not
      currently possible to use one device as a backup for many and still have
      all of them participate in the bridge (which is master itself).
      More detailed information about MLAG is available at the link below.
      https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
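
The forwarding decision for known unicast traffic can be sketched as follows. This Python model is a simplification under stated assumptions: the `Port` class and `egress_port` function are hypothetical names, and the sketch ignores RCU, the reference counter, and flooding, which the commit message says is unaffected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    """Toy bridge port with an optional backup port (not kernel code)."""

    name: str
    carrier_up: bool = True
    backup: Optional["Port"] = None


def egress_port(fdb_port: Port) -> Port:
    """Pick the egress port for a known-unicast frame.

    fdb_port is the port the FDB entry points at. If it has lost carrier
    and a backup is configured, the frame is redirected to the backup.
    """
    if not fdb_port.carrier_up and fdb_port.backup is not None:
        return fdb_port.backup
    return fdb_port


# MLAG-style setup: the host-facing port uses the peerlink as its backup.
peerlink = Port("switch-peerlink")
team0 = Port("team0", backup=peerlink)
team0.carrier_up = False
chosen = egress_port(team0)
```

The FDB entries themselves stay untouched in this model, which is the point of the feature: no mass FDB rewrite is needed when the host port goes down.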
      
      Further explanation and a diagram by Roopa:
      Two switches acting in a MLAG pair are connected by the peerlink
      interface which is a bridge port.
      
      The config on one of the switches looks like the below. The other
      switch also has a similar config.
      eth0 is connected to one port on the server. And the server is
      connected to both switches.
      
      br0 -- team0---eth0
            |
            -- switch-peerlink
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 14 Jul, 2018 2 commits
  28. 23 Jun, 2018 1 commit
  29. 26 May, 2018 1 commit
  30. 18 Apr, 2018 1 commit
  31. 23 Mar, 2018 1 commit
  32. 17 Feb, 2018 1 commit
  33. 30 Jan, 2018 1 commit
  34. 23 Jan, 2018 1 commit
  35. 09 Jan, 2018 1 commit
  36. 05 Nov, 2017 1 commit
    • rtnetlink: use netnsid to query interface · 79e1ad14
      Jiri Benc committed
      Currently, when an application gets netnsid from the kernel (for example as
      the result of RTM_GETLINK call on one end of the veth pair), it's not very
      useful. There's no reliable way to get to the netns fd from the netnsid, nor
      does any kernel API accept netnsid.
      
      Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
      netns with the given netnsid in such case. Of course, the calling process
      needs to have enough capabilities in the target name space; for now, require
      CAP_NET_ADMIN. This can be relaxed in the future.
      
      To signal to the calling process that the kernel understood the new
      IFLA_IF_NETNSID attribute in the query, it will include it in the response.
      This is needed to detect older kernels, as they will just ignore
      IFLA_IF_NETNSID and query in the current name space.
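
From the caller's side, the detection rule amounts to checking whether the attribute is echoed back. The Python sketch below assumes the netlink response has already been parsed into a mapping keyed by attribute name. That mapping, the symbolic key, and the function name are all hypothetical and do not correspond to any particular netlink library.

```python
from typing import Mapping

# Symbolic key used by this sketch; not the numeric attribute id.
IFLA_IF_NETNSID = "IFLA_IF_NETNSID"


def kernel_honoured_netnsid(response_attrs: Mapping[str, object]) -> bool:
    """Return True if the reply shows the kernel understood the netnsid query.

    An older kernel ignores the attribute, answers for the current
    name space, and does not include IFLA_IF_NETNSID in its response.
    """
    return IFLA_IF_NETNSID in response_attrs


new_kernel_reply = {"IFLA_IFNAME": "veth0", IFLA_IF_NETNSID: 3}
old_kernel_reply = {"IFLA_IFNAME": "veth0"}
```

A caller that gets a reply without the attribute should treat the result as describing its own name space rather than the one it asked about.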
      
      This patch implements IFLA_IF_NETNSID only for get and dump. For set
      operations, this can be extended later.
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  37. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX license identifier to uapi header files with no license · 6f52b16c
      Greg Kroah-Hartman committed
      Many user space API headers are missing licensing information, which
      makes it hard for compliance tools to determine the correct license.
      
      By default, files without license information fall under the default
      license of the kernel, which is GPLV2.  Marking them GPLV2 would exclude
      them from being included in non GPLV2 code, which is obviously not
      intended. The user space API headers fall under the syscall exception
      which is in the kernel's COPYING file:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      otherwise syscall usage would not be possible.
      
      Update the files which contain no license information with an SPDX
      license identifier.  The chosen identifier is 'GPL-2.0 WITH
      Linux-syscall-note' which is the officially assigned identifier for the
      Linux syscall exception.  SPDX license identifiers are a legally binding
      shorthand, which can be used instead of the full boiler plate text.
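
The mechanical part of such an update, adding the identifier line to a header that lacks one, can be sketched in Python. This is not the tooling used for the patch: the function name is hypothetical, and the sketch only checks for an existing SPDX tag, whereas deciding that a file "contains no license information" required the research described in the previous patch of the series.

```python
SPDX_LINE = "/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */\n"


def add_spdx(source: str) -> str:
    """Prepend the SPDX identifier to a header's text if it has none."""
    if "SPDX-License-Identifier:" in source:
        return source  # already tagged, leave the file alone
    return SPDX_LINE + source


header = "#ifndef _EXAMPLE_H\n#define _EXAMPLE_H\n#endif\n"
tagged = add_spdx(header)
```

Running the function on an already tagged file returns it unchanged, so the operation is safe to repeat.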
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>