1. 21 Jan, 2021 4 commits
  2. 20 Jan, 2021 5 commits
    • tcp: fix TCP socket rehash stats mis-accounting · 9c30ae83
      Yuchung Cheng committed
      The previous commit 32efcc06 ("tcp: export count for rehash attempts")
      would mis-account rehashing SNMP and socket stats:
      
        a. During handshake of an active open, only counts the first
           SYN timeout
      
        b. After handshake of passive and active open, stop updating
           after (roughly) TCP_RETRIES1 recurring RTOs
      
        c. After the socket aborts, over-counts timeout_rehash by 1
      
      This patch fixes this by checking the rehash result from sk_rethink_txhash.
      
      Fixes: 32efcc06 ("tcp: export count for rehash attempts")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
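The accounting rule described in this commit (count a rehash only when the rehash actually took place) can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct sock_sketch`, `rethink_txhash()` and `on_retransmit_timeout()` are hypothetical stand-ins, and the assumption that a zero hash means "no hash in use" is made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the socket fields involved; not struct sock. */
struct sock_sketch {
    uint32_t txhash;         /* current transmit hash; 0 means "no hash in use" (assumption) */
    uint32_t timeout_rehash; /* per-socket count of timeout-driven rehashes */
};

/* Hypothetical stand-in for sk_rethink_txhash(): picks a new hash only if
 * one is in use, and reports whether it did. */
static bool rethink_txhash(struct sock_sketch *sk, uint32_t new_hash)
{
    assert(sk != NULL);
    if (sk->txhash == 0)
        return false;
    sk->txhash = new_hash;
    return true;
}

/* Count a rehash only when the rehash result says it happened. */
static void on_retransmit_timeout(struct sock_sketch *sk, uint32_t new_hash)
{
    if (rethink_txhash(sk, new_hash))
        sk->timeout_rehash++;
}
```

Tying the counter to the return value keeps the statistic from drifting away from what the socket really did, which is the class of mis-accounting listed in (a) to (c).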
    • bonding: add a vlan+srcmac tx hashing option · 7b8fc010
      Jarod Wilson committed
      This comes from an end-user request, where they're running multiple VMs on
      hosts with bonded interfaces connected to some interesting switch topologies,
      where 802.3ad isn't an option. They're currently running a proprietary
      solution that effectively achieves load-balancing of VMs and bandwidth
      utilization improvements with a similar form of transmission algorithm.
      
      Basically, each VM has its own vlan, so it always sends its traffic out
      the same interface, unless that interface fails. Traffic gets split
      between the interfaces, maintaining a consistent path, with failover still
      available if an interface goes down.
      
      Unlike bond_eth_hash(), this hash function is using the full source MAC
      address instead of just the last byte, as there are so few components to
      the hash, and in the no-vlan case, we would be returning just the last
      byte of the source MAC as the hash value. It's entirely possible to have
      two NICs in a bond with the same last byte of their MAC, but not the same
      MAC, so this adjustment should guarantee distinct hashes in all cases.
      
      This has been rudimentarily tested to provide similar results to the
      proprietary solution it is aiming to replace. A patch for iproute2 is also
      posted, to properly support the new mode there as well.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Thomas Davis <tadavis@lbl.gov>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
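A hash over the VLAN ID and the full source MAC, as this commit describes, might look roughly like the sketch below. The function names, the split of the MAC into two 24-bit halves, and the XOR fold are assumptions made for illustration; the commit message does not spell out the exact formula.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fold all six source-MAC bytes and, if present, the VLAN ID into one value.
 * The two 24-bit halves and the XOR fold are assumptions of this sketch. */
static uint32_t vlan_srcmac_hash(const uint8_t mac[6], bool has_vlan, uint16_t vlan_id)
{
    assert(mac != NULL);
    uint32_t hi = ((uint32_t)mac[0] << 16) | ((uint32_t)mac[1] << 8) | mac[2];
    uint32_t lo = ((uint32_t)mac[3] << 16) | ((uint32_t)mac[4] << 8) | mac[5];
    uint32_t hash = hi ^ lo;

    if (has_vlan)
        hash ^= vlan_id;
    return hash;
}

/* Map the hash onto one of slave_count interfaces. */
static unsigned int pick_slave(uint32_t hash, unsigned int slave_count)
{
    assert(slave_count > 0);
    return hash % slave_count;
}
```

With this fold, two MACs that share a last byte but differ elsewhere generally produce different values, which is the motivation given above. An XOR fold can still map some distinct MACs to the same value, so the sketch does not by itself demonstrate distinctness in every case.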
    • net: add inline function skb_csum_is_sctp · fa821170
      Xin Long committed
      This patch defines an inline function, skb_csum_is_sctp(), and uses it
      in all places that check whether an skb carries an SCTP checksum.
      The function will be used by many networking drivers in the following
      patches.
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
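A helper of this kind is essentially a named test of a single flag. The sketch below assumes the flag is a one-bit field called `csum_not_inet`; the cut-down `struct skb_sketch` and the `_sketch` suffix mark everything here as a hypothetical stand-in rather than the kernel definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct sk_buff. */
struct skb_sketch {
    /* Assumed flag: set when the pending checksum is CRC32c (as used by
     * SCTP) rather than the regular Internet checksum. */
    unsigned int csum_not_inet : 1;
};

/* One named predicate instead of open-coded flag tests in every driver. */
static inline bool skb_csum_is_sctp_sketch(const struct skb_sketch *skb)
{
    assert(skb != NULL);
    return skb->csum_not_inet;
}
```

Centralizing the test means drivers do not depend on how the flag is stored, so the representation can change later without touching each caller.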
    • mdio-bitbang: Export mdiobb_{read,write}() · 8eed01b5
      Geert Uytterhoeven committed
      Export mdiobb_read() and mdiobb_write(), so Ethernet controller drivers
      can call them from their MDIO read/write wrappers.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mdio, phy: fix -Wshadow warnings triggered by nested container_of() · 7eab14de
      Alexander Lobakin committed
      The container_of() macro declares a local variable '__mptr' internally.
      This becomes a problem when several container_of() calls are nested
      within a single line or within plain macros.
      As the C preprocessor doesn't support generating unique variable names,
      the only solution is to avoid defining macros that consist solely of
      container_of() calls, or they will self-shadow '__mptr' each time:
      
      In file included from ./include/linux/bitmap.h:10,
                       from drivers/net/phy/phy_device.c:12:
      drivers/net/phy/phy_device.c: In function ‘phy_device_release’:
      ./include/linux/kernel.h:693:8: warning: declaration of ‘__mptr’ shadows a previous local [-Wshadow]
        693 |  void *__mptr = (void *)(ptr);     \
            |        ^~~~~~
      ./include/linux/phy.h:647:26: note: in expansion of macro ‘container_of’
        647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
            |                          ^~~~~~~~~~~~
      ./include/linux/mdio.h:52:27: note: in expansion of macro ‘container_of’
         52 | #define to_mdio_device(d) container_of(d, struct mdio_device, dev)
            |                           ^~~~~~~~~~~~
      ./include/linux/phy.h:647:39: note: in expansion of macro ‘to_mdio_device’
        647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
            |                                       ^~~~~~~~~~~~~~
      drivers/net/phy/phy_device.c:217:8: note: in expansion of macro ‘to_phy_device’
        217 |  kfree(to_phy_device(dev));
            |        ^~~~~~~~~~~~~
      ./include/linux/kernel.h:693:8: note: shadowed declaration is here
        693 |  void *__mptr = (void *)(ptr);     \
            |        ^~~~~~
      ./include/linux/phy.h:647:26: note: in expansion of macro ‘container_of’
        647 | #define to_phy_device(d) container_of(to_mdio_device(d), \
            |                          ^~~~~~~~~~~~
      drivers/net/phy/phy_device.c:217:8: note: in expansion of macro ‘to_phy_device’
        217 |  kfree(to_phy_device(dev));
            |        ^~~~~~~~~~~~~
      
      As they are declared in header files, these warnings are highly
      repetitive and very annoying (along with the one from linux/pci.h).
      
      Convert the related macros from linux/{mdio,phy}.h to static inlines
      to avoid self-shadowing and potentially improve bug-catching.
      No functional changes implied.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20210116161246.67075-1-alobakin@pm.me
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
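The macro-to-inline conversion can be illustrated with a simplified, userspace `container_of()`. The struct layouts below are cut-down stand-ins chosen for this sketch, and the statement-expression syntax requires GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(); like the kernel's, it declares a local __mptr. */
#define container_of(ptr, type, member) ({                      \
    void *__mptr = (void *)(ptr);                               \
    ((type *)((char *)__mptr - offsetof(type, member))); })

/* Cut-down stand-ins for the nested structures. */
struct device { int id; };
struct mdio_device { struct device dev; int addr; };
struct phy_device { struct mdio_device mdio; int phy_id; };

/* As static inlines, each container_of() expands inside its own function
 * scope, so nesting the helpers never declares __mptr twice in one scope. */
static inline struct mdio_device *to_mdio_device(struct device *d)
{
    return container_of(d, struct mdio_device, dev);
}

static inline struct phy_device *to_phy_device(struct device *d)
{
    assert(d != NULL);
    return container_of(to_mdio_device(d), struct phy_device, mdio);
}
```

Besides avoiding the shadowing, the inline versions type-check their argument: passing anything other than a `struct device *` is diagnosed by the compiler, which is the "potentially improve bug-catching" point made above.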
  3. 19 Jan, 2021 6 commits
    • net/bonding: Declare TLS RX device offload support · dc5809f9
      Tariq Toukan committed
      Following the description in the previous patch (for TX):
      As the bond interface is being bypassed by the TLS module, interacting
      directly against the lower devs, there is no way for the bond interface
      to disable its device offload capabilities, as long as the mode/policy
      config allows it.
      Hence, the feature flag is not directly controllable, but just reflects
      the offload status based on the logic under bond_sk_check().
      
      Here we just declare RX device offload support, and expose it via the
      NETIF_F_HW_TLS_RX flag.
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/bonding: Implement TLS TX device offload · 89df6a81
      Tariq Toukan committed
      Implement TLS TX device offload for bonding interfaces.
      This allows kTLS sockets running on a bond to benefit from the
      device offload on capable lower devices.
      
      To allow a simple and fast maintenance of the TLS context in SW and
      lower devices, we bind the TLS socket to a specific lower dev.
      To achieve a behavior similar to SW kTLS, we support only balance-xor
      and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
      in bond_sk_check(), done in a previous patch.
      
      For the above configuration, the SW implementation keeps picking the
      same exact lower dev for all the socket's SKBs. The device offload
      behaves similarly, making the decision once at the connection creation.
      
      Per socket, the TLS module should work directly with the lowest netdev
      in chain, to call the tls_dev_ops operations.
      
      As the bond interface is being bypassed by the TLS module, interacting
      directly against the lower devs, there is no way for the bond interface
      to disable its device offload capabilities, as long as the mode/policy
      config allows it.
      Hence, the feature flag is not directly controllable, but just reflects
      the current offload status based on the logic under bond_sk_check().
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/bonding: Implement ndo_sk_get_lower_dev · 007feb87
      Tariq Toukan committed
      Add ndo_sk_get_lower_dev() implementation for bond interfaces.
      
      Support is limited to cases where the socket's hash and the SKBs' hash
      yield an identical value for the whole connection lifetime.
      
      Here we restrict it to L3+4 sockets only, with
      xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
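The restriction stated in this commit (xor or 802.3ad mode, combined with the layer3+4 hash policy) amounts to a small predicate. The sketch below uses hypothetical enums and a hypothetical `bond_sk_check_sketch()`; it shows only the shape of the check, not the actual bond_sk_check() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enums covering the modes and policies named in the text. */
enum bond_mode_sketch { MODE_ROUNDROBIN, MODE_ACTIVEBACKUP, MODE_XOR, MODE_8023AD };
enum xmit_policy_sketch { POLICY_LAYER2, POLICY_LAYER34, POLICY_LAYER23 };

struct bond_sketch {
    enum bond_mode_sketch mode;
    enum xmit_policy_sketch xmit_policy;
};

/* True only for configurations in which every packet of a socket is
 * expected to hash to the same lower device. */
static bool bond_sk_check_sketch(const struct bond_sketch *bond)
{
    assert(bond != NULL);
    switch (bond->mode) {
    case MODE_XOR:
    case MODE_8023AD:
        return bond->xmit_policy == POLICY_LAYER34;
    default:
        return false;
    }
}
```

Any other mode or policy returns false, so a caller would simply decline to report a lower device for the socket in those configurations.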
    • net: netdevice: Add operation ndo_sk_get_lower_dev · 719a402c
      Tariq Toukan committed
      ndo_sk_get_lower_dev returns the lower netdev that corresponds to
      a given socket.
      Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
      the lowest one in chain.
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
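Finding "the lowest one in chain" is a loop that keeps asking each device for its lower device until none is reported. The following is a userspace sketch under assumed types: `struct netdev_sketch`, its `lower` field, and `example_hook()` are hypothetical, and the socket is passed as an opaque pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device with an optional per-socket hook. */
struct netdev_sketch {
    const char *name;
    /* Returns the lower device used for sk, or NULL if there is none. */
    struct netdev_sketch *(*sk_get_lower_dev)(struct netdev_sketch *dev, const void *sk);
    struct netdev_sketch *lower; /* used only by example_hook() below */
};

/* Follow the hook down the stack until a device reports no lower device. */
static struct netdev_sketch *sk_get_lowest_dev(struct netdev_sketch *dev, const void *sk)
{
    assert(dev != NULL);
    struct netdev_sketch *cur = dev;

    while (cur->sk_get_lower_dev) {
        struct netdev_sketch *next = cur->sk_get_lower_dev(cur, sk);
        if (!next)
            break;
        cur = next;
    }
    return cur;
}

/* Example hook: a stacked device that always forwards to its single lower device. */
static struct netdev_sketch *example_hook(struct netdev_sketch *dev, const void *sk)
{
    (void)sk;
    return dev->lower;
}
```

For a stack such as a VLAN device on top of a bond on top of a physical NIC, the walk ends at the NIC, which is the device whose offload operations a module like TLS would call.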
    • net_sched: fix RTNL deadlock again caused by request_module() · d349f997
      Cong Wang committed
      tcf_action_init_1() loads tc action modules automatically with
      request_module() after parsing the tc action names. It drops the RTNL
      lock before request_module() and re-takes it afterwards. This causes a
      lot of trouble, as discovered by syzbot, because we can be in the
      middle of batch initialization when we create an array of tc actions.
      
      One of the problems is a deadlock:
      
      CPU 0					CPU 1
      rtnl_lock();
      for (...) {
        tcf_action_init_1();
          -> rtnl_unlock();
          -> request_module();
      				rtnl_lock();
      				for (...) {
      				  tcf_action_init_1();
      				    -> tcf_idr_check_alloc();
      				   // Insert one action into idr,
      				   // but it is not committed until
      				   // tcf_idr_insert_many(), then drop
      				   // the RTNL lock in the _next_
      				   // iteration
      				   -> rtnl_unlock();
          -> rtnl_lock();
          -> a_o->init();
            -> tcf_idr_check_alloc();
            // Now waiting for the same index
            // to be committed
      				    -> request_module();
      				    -> rtnl_lock()
      				    // Now waiting for RTNL lock
      				}
      				rtnl_unlock();
      }
      rtnl_unlock();
      
      This is not easy to solve. We can move the request_module() before
      this loop and pre-load all the modules we need for this netlink
      message, and then do the rest of the initialization. So the loop now
      breaks down into two:
      
              for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                      struct tc_action_ops *a_o;
      
                      a_o = tc_action_load_ops(name, tb[i]...);
                      ops[i - 1] = a_o;
              }
      
              for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                      act = tcf_action_init_1(ops[i - 1]...);
              }
      
      Although this looks serious, it has only been reported by syzbot, so it
      seems hard for humans to trigger. And given the size of this patch,
      I'd suggest that it go to net-next and not be backported to stable.
      
      This patch has been tested by syzbot and tested with tdc.py by me.
      
      Fixes: 0fedc63f ("net_sched: commit action insertions together")
      Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
      Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: fix TCP_USER_TIMEOUT with zero window · 9d9b1ee0
      Enke Chen committed
      The TCP session does not terminate with TCP_USER_TIMEOUT when data
      remain untransmitted due to zero window.
      
      The number of unanswered zero-window probes (tcp_probes_out) is
      reset to zero with incoming acks irrespective of the window size,
      as described in tcp_probe_timer():
      
          RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
          as long as the receiver continues to respond probes. We support
          this by default and reset icsk_probes_out with incoming ACKs.
      
      This counter, however, is the wrong one to be used in calculating the
      duration that the window remains closed and data remain untransmitted.
      Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
      actual issue.
      
      In this patch a new timestamp is introduced for the socket in order to
      track the elapsed time for the zero-window probes that have not been
      answered with any non-zero window ack.
      
      Fixes: 9721e709 ("tcp: simplify window probe aborting on USER_TIMEOUT")
      Reported-by: William McCall <william.mccall@gmail.com>
      Co-developed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
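The timestamp approach from this commit can be sketched as follows: the start of a zero-window stretch is recorded once, only an ACK that opens the window clears it, and the user timeout is compared against the elapsed time rather than against a probe counter. All names are hypothetical, times are in milliseconds, and the sketch assumes the clock never reads 0 (0 is used as "not probing").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct probe_state {
    uint32_t probes_tstamp_ms; /* start of the current zero-window stretch; 0 = not probing */
    uint32_t user_timeout_ms;  /* user timeout in ms; 0 = disabled */
};

/* Called when the probe timer fires. Returns true if the connection
 * should be aborted because the window stayed closed for too long. */
static bool probe_timer_fired(struct probe_state *st, uint32_t now_ms)
{
    assert(st != NULL);
    if (st->probes_tstamp_ms == 0) {
        st->probes_tstamp_ms = now_ms; /* first probe of this stretch */
        return false;
    }
    return st->user_timeout_ms != 0 &&
           (int32_t)(now_ms - st->probes_tstamp_ms) >= (int32_t)st->user_timeout_ms;
}

/* Called for each incoming ACK: only a non-zero window ends the stretch. */
static void ack_received(struct probe_state *st, uint32_t window)
{
    assert(st != NULL);
    if (window != 0)
        st->probes_tstamp_ms = 0;
}
```

Because zero-window ACKs leave the timestamp untouched, a peer that keeps answering probes with a closed window no longer prevents the timeout from expiring.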
  4. 16 Jan, 2021 8 commits
    • GTP: add support for flow based tunneling API · 9ab7e76a
      Pravin B Shelar committed
      The following patch adds support for the flow-based tunneling API,
      to send and receive GTP tunnel packets over the tunnel metadata API.
      This allows the device to be integrated with OVS or eBPF using
      flow-based tunneling APIs.
      Signed-off-by: Pravin B Shelar <pbshelar@fb.com>
      Link: https://lore.kernel.org/r/20210110070021.26822-1-pbshelar@fb.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: configure watermarks using devlink-sb · f59fd9ca
      Vladimir Oltean committed
      Using devlink-sb, we can configure 12/16 (the important 75%) of the
      switch's controlling watermarks for congestion drops, and we can monitor
      50% of the watermark occupancies (we can monitor the reservation
      watermarks, but not the sharing watermarks, which are exposed as pool
      sizes).
      
      The following definitions can be made:
      
      SB_BUF=0 # The devlink-sb for frame buffers
      SB_REF=1 # The devlink-sb for frame references
      POOL_ING=0 # The pool for ingress traffic. Both devlink-sb instances
                 # have one of these.
      POOL_EGR=1 # The pool for egress traffic. Both devlink-sb instances
                 # have one of these.
      
      Editing the hardware watermarks is done in the following way:
      BUF_xxxx_I is accessed when sb=$SB_BUF and pool=$POOL_ING
      REF_xxxx_I is accessed when sb=$SB_REF and pool=$POOL_ING
      BUF_xxxx_E is accessed when sb=$SB_BUF and pool=$POOL_EGR
      REF_xxxx_E is accessed when sb=$SB_REF and pool=$POOL_EGR
      
      Configuring the sharing watermarks for COL_SHR(dp=0) is done implicitly
      by modifying the corresponding pool size. By default, the pool size has
      maximum size, so this can be skipped.
      
      devlink sb pool set pci/0000:00:00.5 sb $SB_BUF pool $POOL_ING \
      	size 129840 thtype static
      
      Since by default there is no buffer reservation, the above command has
      maxed out BUF_COL_SHR_I(dp=0).
      
      Configuring the per-port reservation watermark (P_RSRV) is done in the
      following way:
      
      devlink sb port pool set pci/0000:00:00.5/0 sb $SB_BUF \
      	pool $POOL_ING th 1000
      
      The above command sets BUF_P_RSRV_I(port 0) to 1000 bytes. After this
      command, the sharing watermarks are internally reconfigured with 1000
      bytes less, i.e. from 129840 bytes to 128840 bytes.
      
      Configuring the per-port-tc reservation watermarks (Q_RSRV) is done in
      the following way:
      
      for tc in {0..7}; do
      	devlink sb tc bind set pci/0000:00:00.5/0 sb 0 tc $tc \
      		type ingress pool $POOL_ING \
      		th 3000
      done
      
      The above loop sets BUF_Q_RSRV_I(port 0, tc 0..7) to 3000 bytes each.
      The sharing watermarks are again reconfigured with 24000 bytes less.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
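The arithmetic behind the reconfigured sharing watermarks can be checked with a few lines of C. This sketch assumes, as the text describes, that reservations are simply subtracted from the pool size, and it models a single port; `NUM_TC` and `shared_after_reservations()` are names chosen for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TC 8 /* traffic classes per port, matching the tc 0..7 loop */

/* Shared space left after subtracting one port's reservation and that
 * port's per-traffic-class reservations from the pool size. */
static uint32_t shared_after_reservations(uint32_t pool_size,
                                          uint32_t port_rsrv,
                                          uint32_t tc_rsrv)
{
    uint32_t reserved = port_rsrv + NUM_TC * tc_rsrv;

    assert(reserved <= pool_size);
    return pool_size - reserved;
}
```

With a 129840-byte pool, a 1000-byte port reservation leaves 128840 bytes, and eight further 3000-byte per-tc reservations (24000 bytes in total) leave 104840 bytes.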
    • net: mscc: ocelot: register devlink ports · 6c30384e
      Vladimir Oltean committed
      Add devlink integration into the mscc_ocelot switchdev driver. All
      physical ports (i.e. the unused ones as well) except the CPU port module
      at ocelot->num_phys_ports are registered with devlink, and that requires
      keeping the devlink_port structure outside struct ocelot_port_private,
      since the latter has a 1:1 mapping with a struct net_device (which does
      not exist for unused ports).
      
      Since we use devlink_port_type_eth_set to link the devlink port to the
      net_device, we can as well remove the .ndo_get_phys_port_name and
      .ndo_get_port_parent_id implementations, since devlink takes care of
      retrieving the port name and number automatically, once
      .ndo_get_devlink_port is implemented.
      
      Note that the felix DSA driver is already integrated with devlink by
      default, since that is a thing that the DSA core takes care of. This is
      the reason why these devlink stubs were put in ocelot_net.c and not in
      the common library. It is also the reason why ocelot::devlink is a
      pointer and not a full structure embedded inside struct ocelot: because
      the mscc_ocelot driver allocates that by itself (as the container of
      struct ocelot, in fact), but in the case of felix, it is DSA who
      allocates the devlink, and felix just propagates the pointer towards
      struct ocelot.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: export NUM_TC constant from felix to common switch lib · 70d39a6e
      Vladimir Oltean committed
      We should be moving anything that isn't DSA-specific or SoC-specific out
      of the felix DSA driver, and into the common mscc_ocelot switch library.
      
      The number of traffic classes is one of the aspects that is common
      between all ocelot switches, so it belongs in the library.
      
      This patch also makes seville use 8 TX queues, and therefore enables
      prioritization via the QOS_CLASS field in the NPI injection header.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: add ops for devlink-sb · 2a6ef763
      Vladimir Oltean committed
      Switches that care about QoS might have hardware support for reserving
      buffer pools for individual ports or traffic classes, and configuring
      their sizes and thresholds. Through devlink-sb (shared buffers), this is
      all configurable, as well as their occupancy being viewable.
      
      Add the plumbing in DSA for these operations.
      
      Individual drivers still need to call devlink_sb_register() with the
      shared buffers they want to expose. A helper was not created in DSA for
      this purpose (unlike, say, dsa_devlink_params_register), since in my
      opinion it does not bring any benefit over plainly calling
      devlink_sb_register() directly.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: add ops for decoding watermark threshold and occupancy · 703b7621
      Vladimir Oltean committed
      We'll need to read back the watermark thresholds and occupancy from
      hardware (for devlink-sb integration), not only to write them as we did
      so far in ocelot_port_set_maxlen. So introduce 2 new functions in struct
      ocelot_ops, similar to wm_enc, and implement them for the 3 supported
      mscc_ocelot switches.
      
      Remove the INUSE and MAXUSE unpacking helpers for the QSYS_RES_STAT
      register, because that doesn't scale with the number of switches that
      mscc_ocelot supports now. They have different bit widths for the
      watermarks, and we need function pointers to abstract that difference
      away.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
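Abstracting the watermark format behind function pointers could look like the sketch below. The ops table and the encoding (values below 2^8 stored as-is, larger values stored in units of 16 with bit 8 as a multiplier flag) are assumptions chosen to illustrate the pattern; actual bit widths differ per switch, which is the reason for the indirection.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-switch ops table, in the spirit of struct ocelot_ops. */
struct wm_ops_sketch {
    uint16_t (*wm_enc)(uint16_t value); /* value -> register encoding */
    uint16_t (*wm_dec)(uint16_t wm);    /* register encoding -> value */
};

/* Assumed encoding with an 8-bit mantissa and a x16 multiplier flag in bit 8. */
static uint16_t wm_enc_8bit(uint16_t value)
{
    assert(value < 16 * (1u << 8));
    if (value >= (1u << 8))
        return (uint16_t)((1u << 8) | (value / 16));
    return value;
}

static uint16_t wm_dec_8bit(uint16_t wm)
{
    if (wm & (1u << 8))
        return (uint16_t)((wm & 0xffu) * 16);
    return wm;
}

static const struct wm_ops_sketch ops_8bit = { wm_enc_8bit, wm_dec_8bit };
```

Common code would call `ops->wm_dec()` on a raw register value without knowing the width. The encoding is lossy above the threshold: a value such as 1000 reads back as 992, the nearest lower multiple of 16.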
    • net: mscc: ocelot: auto-detect packet buffer size and number of frame references · f6fe01d6
      Vladimir Oltean committed
      Instead of reading these values from the reference manual and writing
      them down into the driver, it appears that the hardware gives us the
      option of detecting them dynamically.
      
      The number of frame references corresponds to what the reference manual
      notes, however it seems that the frame buffers are reported as slightly
      less than the books would indicate. On VSC9959 (Felix), the books say it
      should have 128KB of packet buffer, but the registers indicate only
      129840 bytes (126.79 KB). Also, the unit of measurement for FREECNT from
      the documentation of all these devices is incorrect (taken from an older
      generation). This was confirmed by Younes Leroul from Microchip support.
      
      Not having anything better to do with these values at the moment* (this
      will change soon), let's just print them.
      
      *The frame buffer size is, in fact, used to calculate the tail dropping
      watermarks.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dsa: add support for Arrow XRS700x tag trailer · 54a52823
      George McCollister committed
      Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
      is modeled on tag_trailer.c which works in a similar way.
      Signed-off-by: George McCollister <george.mccollister@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
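A single-byte tag trailer can be sketched as one byte appended to the frame on transmit and stripped on receive. The one-hot port encoding, the 8-port limit, and the function names below are assumptions made for illustration; the commit message only states that the tag is a single trailing byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append a one-byte trailer carrying a one-hot destination-port bit.
 * Returns the new frame length, or 0 if the port is out of range or
 * the buffer has no room for the trailer. */
static size_t trailer_xmit(uint8_t *frame, size_t len, size_t cap, unsigned int port)
{
    assert(frame != NULL);
    if (port > 7 || len + 1 > cap)
        return 0;
    frame[len] = (uint8_t)(1u << port);
    return len + 1;
}

/* Strip the trailer and recover the source port.
 * Returns -1 (leaving *len untouched) unless exactly one bit is set. */
static int trailer_rcv(const uint8_t *frame, size_t *len)
{
    assert(frame != NULL && len != NULL);
    if (*len == 0)
        return -1;

    uint8_t t = frame[*len - 1];
    if (t == 0 || (t & (t - 1)) != 0)
        return -1;

    int port = 0;
    while (!(t & 1)) {
        t >>= 1;
        port++;
    }
    *len -= 1;
    return port;
}
```

Placing the tag at the tail leaves the Ethernet header untouched, so the receive path only has to trim the frame length once the port has been decoded.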
  5. 15 Jan, 2021 11 commits
  6. 14 Jan, 2021 6 commits