1. 07 5月, 2020 2 次提交
  2. 05 5月, 2020 2 次提交
  3. 25 4月, 2020 1 次提交
  4. 24 4月, 2020 1 次提交
    • A
      net: dsa: add GRO support via gro_cells · e131a563
      Alexander Lobakin 提交于
      gro_cells lib is used by different encapsulating netdevices, such as
      geneve, macsec, vxlan etc. to speed up decapsulated traffic processing.
      CPU tag is a sort of "encapsulation", and we can use the same mechs to
      greatly improve overall DSA performance.
      skbs are passed to the GRO layer after removing CPU tags, so we don't
      need any new packet offload types as it was firstly proposed by me in
      the first GRO-over-DSA variant [1].
      
      The size of struct gro_cells is sizeof(void *), so hot struct
      dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
      remain in one 32-byte cacheline.
      The other positive side effect is that drivers for network devices
      that can be shipped as CPU ports of DSA-driven switches can now use
      napi_gro_frags() to pass skbs to kernel. Packets built that way are
      completely non-linear and are likely being dropped without GRO.
      
      This was tested on to-be-mainlined-soon Ethernet driver that uses
      napi_gro_frags(), and the overall performance was on par with the
      variant from [1], sometimes even better due to minimal overhead.
      net.core.gro_normal_batch tuning may help to push it to the limit
      on particular setups and platforms.
      
      iperf3 IPoE VLAN NAT TCP forwarding (port1.218 -> port0) setup
      on 1.2 GHz MIPS board:
      
      5.7-rc2 baseline:
      
      [ID]  Interval         Transfer     Bitrate        Retr
      [ 5]  0.00-120.01 sec  9.00 GBytes  644 Mbits/sec  413  sender
      [ 5]  0.00-120.00 sec  8.99 GBytes  644 Mbits/sec       receiver
      
      Iface      RX packets  TX packets
      eth0       7097731     7097702
      port0      426050      6671829
      port1      6671681     425862
      port1.218  6671677     425851
      
      With this patch:
      
      [ID]  Interval         Transfer     Bitrate        Retr
      [ 5]  0.00-120.01 sec  12.2 GBytes  870 Mbits/sec  122  sender
      [ 5]  0.00-120.00 sec  12.2 GBytes  870 Mbits/sec       receiver
      
      Iface      RX packets  TX packets
      eth0       9474792     9474777
      port0      455200      353288
      port1      9019592     455035
      port1.218  353144      455024
      
      v2:
       - Add some performance examples in the commit message;
       - No functional changes.
      
      [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/Signed-off-by: NAlexander Lobakin <bloodyreaper@yandex.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e131a563
  5. 23 4月, 2020 1 次提交
  6. 15 4月, 2020 1 次提交
    • A
      net: dsa: Down cpu/dsa ports phylink will control · 3be98b2d
      Andrew Lunn 提交于
      DSA and CPU ports can be configured in two ways. By default, the
      driver should configure such ports to there maximum bandwidth. For
      most use cases, this is sufficient. When this default is insufficient,
      a phylink instance can be bound to such ports, and phylink will
      configure the port, e.g. based on fixed-link properties. phylink
      assumes the port is initially down. Given that the driver should have
      already configured it to its maximum speed, ask the driver to down
      the port before instantiating the phylink instance.
      
      Fixes: 30c4a5b0 ("net: mv88e6xxx: use resolved link config in mac_link_up()")
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3be98b2d
  7. 02 4月, 2020 1 次提交
  8. 01 4月, 2020 1 次提交
    • R
      net: dsa: fix oops while probing Marvell DSA switches · 765bda93
      Russell King 提交于
      Fix an oops in dsa_port_phylink_mac_change() caused by a combination
      of a20f9970 ("net: dsa: Don't instantiate phylink for CPU/DSA
      ports unless needed") and the net-dsa-improve-serdes-integration
      series of patches 65b7a2c8 ("Merge branch
      'net-dsa-improve-serdes-integration'").
      
      Unable to handle kernel NULL pointer dereference at virtual address 00000124
      pgd = c0004000
      [00000124] *pgd=00000000
      Internal error: Oops: 805 [#1] SMP ARM
      Modules linked in: tag_edsa spi_nor mtd xhci_plat_hcd mv88e6xxx(+) xhci_hcd armada_thermal marvell_cesa dsa_core ehci_orion libdes phy_armada38x_comphy at24 mcp3021 sfp evbug spi_orion sff mdio_i2c
      CPU: 1 PID: 214 Comm: irq/55-mv88e6xx Not tainted 5.6.0+ #470
      Hardware name: Marvell Armada 380/385 (Device Tree)
      PC is at phylink_mac_change+0x10/0x88
      LR is at mv88e6352_serdes_irq_status+0x74/0x94 [mv88e6xxx]
      Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: NVivien Didelot <vivien.didelot@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      765bda93
  9. 31 3月, 2020 3 次提交
  10. 28 3月, 2020 2 次提交
    • V
      net: dsa: implement auto-normalization of MTU for bridge hardware datapath · bff33f7e
      Vladimir Oltean 提交于
      Many switches don't have an explicit knob for configuring the MTU
      (maximum transmission unit per interface).  Instead, they do the
      length-based packet admission checks on the ingress interface, for
      reasons that are easy to understand (why would you accept a packet in
      the queuing subsystem if you know you're going to drop it anyway).
      
      So it is actually the MRU that these switches permit configuring.
      
      In Linux there only exists the IFLA_MTU netlink attribute and the
      associated dev_set_mtu function. The comments like to play blind and say
      that it's changing the "maximum transfer unit", which is to say that
      there isn't any directionality in the meaning of the MTU word. So that
      is the interpretation that this patch is giving to things: MTU == MRU.
      
      When 2 interfaces having different MTUs are bridged, the bridge driver
      MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
      adjusts the MTU of the bridge net device itself (and not that of the
      slave net devices) to the minimum value of all slave interfaces, in
      order for forwarded packets to not exceed the MTU regardless of the
      interface they are received and send on.
      
      The idea behind this behavior, and why the slave MTUs are not adjusted,
      is that normal termination from Linux over the L2 forwarding domain
      should happen over the bridge net device, which _is_ properly limited by
      the minimum MTU. And termination over individual slave devices is
      possible even if those are bridged. But that is not "forwarding", so
      there's no reason to do normalization there, since only a single
      interface sees that packet.
      
      The problem with those switches that can only control the MRU is with
      the offloaded data path, where a packet received on an interface with
      MRU 9000 would still be forwarded to an interface with MRU 1500. And the
      br_mtu_auto_adjust() function does not really help, since the MTU
      configured on the bridge net device is ignored.
      
      In order to enforce the de-facto MTU == MRU rule for these switches, we
      need to do MTU normalization, which means: in order for no packet larger
      than the MTU configured on this port to be sent, then we need to limit
      the MRU on all ports that this packet could possibly come from. AKA
      since we are configuring the MRU via MTU, it means that all ports within
      a bridge forwarding domain should have the same MTU.
      
      And that is exactly what this patch is trying to do.
      
      >From an implementation perspective, we try to follow the intent of the
      user, otherwise there is a risk that we might livelock them (they try to
      change the MTU on an already-bridged interface, but we just keep
      changing it back in an attempt to keep the MTU normalized). So the MTU
      that the bridge is normalized to is either:
      
       - The most recently changed one:
      
         ip link set dev swp0 master br0
         ip link set dev swp1 master br0
         ip link set dev swp0 mtu 1400
      
         This sequence will make swp1 inherit MTU 1400 from swp0.
      
       - The one of the most recently added interface to the bridge:
      
         ip link set dev swp0 master br0
         ip link set dev swp1 mtu 1400
         ip link set dev swp1 master br0
      
         The above sequence will make swp0 inherit MTU 1400 as well.
      Suggested-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bff33f7e
    • V
      net: dsa: configure the MTU for switch ports · bfcb8132
      Vladimir Oltean 提交于
      It is useful be able to configure port policers on a switch to accept
      frames of various sizes:
      
      - Increase the MTU for better throughput from the default of 1500 if it
        is known that there is no 10/100 Mbps device in the network.
      - Decrease the MTU to limit the latency of high-priority frames under
        congestion, or work around various network segments that add extra
        headers to packets which can't be fragmented.
      
      For DSA slave ports, this is mostly a pass-through callback, called
      through the regular ndo ops and at probe time (to ensure consistency
      across all supported switches).
      
      The CPU port is called with an MTU equal to the largest configured MTU
      of the slave ports. The assumption is that the user might want to
      sustain a bidirectional conversation with a partner over any switch
      port.
      
      The DSA master is configured the same as the CPU port, plus the tagger
      overhead. Since the MTU is by definition L2 payload (sans Ethernet
      header), it is up to each individual driver to figure out if it needs to
      do anything special for its frame tags on the CPU port (it shouldn't
      except in special cases). So the MTU does not contain the tagger
      overhead on the CPU port.
      However the MTU of the DSA master, minus the tagger overhead, is used as
      a proxy for the MTU of the CPU port, which does not have a net device.
      This is to avoid uselessly calling the .change_mtu function on the CPU
      port when nothing should change.
      
      So it is safe to assume that the DSA master and the CPU port MTUs are
      apart by exactly the tagger's overhead in bytes.
      
      Some changes were made around dsa_master_set_mtu(), function which was
      now removed, for 2 reasons:
        - dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
          do the same thing in DSA
        - __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
      That is to say, there's no need for this function in DSA, we can safely
      call dev_set_mtu() directly, take the rtnl lock when necessary, and just
      propagate whatever errors get reported (since the user probably wants to
      be informed).
      
      Some inspiration (mainly in the MTU DSA notifier) was taken from a
      vaguely similar patch from Murali and Florian, who are credited as
      co-developers down below.
      Co-developed-by: NMurali Krishna Policharla <murali.policharla@broadcom.com>
      Signed-off-by: NMurali Krishna Policharla <murali.policharla@broadcom.com>
      Co-developed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfcb8132
  11. 25 3月, 2020 1 次提交
    • V
      net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop · e80f40cb
      Vladimir Oltean 提交于
      Not only did this wheel did not need reinventing, but there is also
      an issue with it: It doesn't remove the VLAN header in a way that
      preserves the L2 payload checksum when that is being provided by the DSA
      master hw.  It should recalculate checksum both for the push, before
      removing the header, and for the pull afterwards. But the current
      implementation is quite dizzying, with pulls followed immediately
      afterwards by pushes, the memmove is done before the push, etc.  This
      makes a DSA master with RX checksumming offload to print stack traces
      with the infamous 'hw csum failure' message.
      
      So remove the dsa_8021q_remove_header function and replace it with
      something that actually works with inet checksumming.
      
      Fixes: d4619336 ("net: dsa: tag_8021q: Create helper function for removing VLAN header")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e80f40cb
  12. 24 3月, 2020 2 次提交
  13. 18 3月, 2020 1 次提交
  14. 16 3月, 2020 1 次提交
  15. 12 3月, 2020 1 次提交
    • A
      net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed · a20f9970
      Andrew Lunn 提交于
      By default, DSA drivers should configure CPU and DSA ports to their
      maximum speed. In many configurations this is sufficient to make the
      link work.
      
      In some cases it is necessary to configure the link to run slower,
      e.g. because of limitations of the SoC it is connected to. Or back to
      back PHYs are used and the PHY needs to be driven in order to
      establish link. In this case, phylink is used.
      
      Only instantiate phylink if it is required. If there is no PHY, or no
      fixed link properties, phylink can upset a link which works in the
      default configuration.
      
      Fixes: 0e279218 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a20f9970
  16. 09 3月, 2020 1 次提交
  17. 05 3月, 2020 1 次提交
    • V
      net: mscc: ocelot: eliminate confusion between CPU and NPI port · 69df578c
      Vladimir Oltean 提交于
      Ocelot has the concept of a CPU port. The CPU port is represented in the
      forwarding and the queueing system, but it is not a physical device. The
      CPU port can either be accessed via register-based injection/extraction
      (which is the case of Ocelot), via Frame-DMA (similar to the first one),
      or "connected" to a physical Ethernet port (called NPI in the datasheet)
      which is the case of the Felix DSA switch.
      
      In Ocelot the CPU port is at index 11.
      In Felix the CPU port is at index 6.
      
      The CPU bit is treated special in the forwarding, as it is never cleared
      from the forwarding port mask (once added to it). Other than that, it is
      treated the same as a normal front port.
      
      Both Felix and Ocelot should use the CPU port in the same way. This
      means that Felix should not use the NPI port directly when forwarding to
      the CPU, but instead use the CPU port.
      
      This patch is fixing this such that Felix will use port 6 as its CPU
      port, and just use the NPI port to carry the traffic.
      
      Therefore, eliminate the "ocelot->cpu" variable which was holding the
      index of the NPI port for Felix, and the index of the CPU port module
      for Ocelot, so the variable was actually configuring different things
      for different drivers and causing at least part of the confusion.
      
      Also remove the "ocelot->num_cpu_ports" variable, which is the result of
      another confusion. The 2 CPU ports mentioned in the datasheet are
      because there are two frame extraction channels (register based or DMA
      based). This is of no relevance to the driver at the moment, and
      invisible to the analyzer module.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Suggested-by: NAllan W. Nielsen <allan.nielsen@microchip.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69df578c
  18. 04 3月, 2020 2 次提交
  19. 28 2月, 2020 2 次提交
  20. 14 2月, 2020 2 次提交
  21. 27 1月, 2020 1 次提交
    • V
      net: dsa: Fix use-after-free in probing of DSA switch tree · 6dc43cd3
      Vladimir Oltean 提交于
      DSA sets up a switch tree little by little. Every switch of the N
      members of the tree calls dsa_register_switch, and (N - 1) will just
      touch the dst->ports list with their ports and quickly exit. Only the
      last switch that calls dsa_register_switch will find all DSA links
      complete in dsa_tree_setup_routing_table, and not return zero as a
      result but instead go ahead and set up the entire DSA switch tree
      (practically on behalf of the other switches too).
      
      The trouble is that the (N - 1) switches don't clean up after themselves
      after they get an error such as EPROBE_DEFER. Their footprint left in
      dst->ports by dsa_switch_touch_ports is still there. And switch N, the
      one responsible with actually setting up the tree, is going to work with
      those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
      ds->dev might get freed by the device driver.
      
      Be there a 2-switch tree and the following calling order:
      - Switch 1 calls dsa_register_switch
        - Calls dsa_switch_touch_ports, populates dst->ports
        - Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
      - Switch 2 calls dsa_register_switch
        - Calls dsa_switch_touch_ports, populates dst->ports
        - Probe doesn't get deferred, so it goes ahead.
        - Calls dsa_tree_setup_routing_table, which returns "complete == true"
          due to Switch 1 having called dsa_switch_touch_ports before.
        - Because the DSA links are complete, it calls dsa_tree_setup_switches
          now.
        - dsa_tree_setup_switches iterates through dst->ports, initializing
          the Switch 1 ds structure (invalid) and the Switch 2 ds structure
          (valid).
        - Undefined behavior (use after free, sometimes NULL pointers, etc).
      
      Real example below (debugging prints added by me, as well as guards
      against NULL pointers):
      
      [    5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
      [    6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
      [    6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      [    7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
      
      The solution is to recognize that the functions that call
      dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
      effects, and therefore one should clean up their side effects on error
      path. The cleanup of dst->ports was taken from dsa_switch_remove and
      moved into a dedicated dsa_switch_release_ports function, which should
      really be per-switch (free only the members of dst->ports that are also
      members of ds, instead of all switch ports).
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dc43cd3
  22. 16 1月, 2020 2 次提交
  23. 09 1月, 2020 1 次提交
    • F
      net: dsa: Get information about stacked DSA protocol · 4d776482
      Florian Fainelli 提交于
      It is possible to stack multiple DSA switches in a way that they are not
      part of the tree (disjoint) but the DSA master of a switch is a DSA
      slave of another. When that happens switch drivers may have to know this
      is the case so as to determine whether their tagging protocol has a
      remove chance of working.
      
      This is useful for specific switch drivers such as b53 where devices
      have been known to be stacked in the wild without the Broadcom tag
      protocol supporting that feature. This allows b53 to continue supporting
      those devices by forcing the disabling of Broadcom tags on the outermost
      switches if necessary.
      
      The get_tag_protocol() function is therefore updated to gain an
      additional enum dsa_tag_protocol argument which denotes the current
      tagging protocol used by the DSA master we are attached to, else
      DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d776482
  24. 06 1月, 2020 3 次提交
    • V
      net: dsa: Pass pcs_poll flag from driver to PHYLINK · 787cac3f
      Vladimir Oltean 提交于
      The DSA drivers that implement .phylink_mac_link_state should normally
      register an interrupt for the PCS, from which they should call
      phylink_mac_change(). However not all switches implement this, and those
      who don't should set this flag in dsa_switch in the .setup callback, so
      that PHYLINK will poll for a few ms until the in-band AN link timer
      expires and the PCS state settles.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      787cac3f
    • V
      net: dsa: tag_sja1105: Slightly improve the Xmas tree in sja1105_xmit · 2821d50f
      Vladimir Oltean 提交于
      This is a cosmetic patch that makes the dp, tx_vid, queue_mapping and
      pcp local variable definitions a bit closer in length, so they don't
      look like an eyesore as much.
      
      The 'ds' variable is not used otherwise, except for ds->dp.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2821d50f
    • V
      net: dsa: Make deferred_xmit private to sja1105 · a68578c2
      Vladimir Oltean 提交于
      There are 3 things that are wrong with the DSA deferred xmit mechanism:
      
      1. Its introduction has made the DSA hotpath ever so slightly more
         inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
         to be initialized to false for every transmitted frame, in order to
         figure out whether the driver requested deferral or not (a very rare
         occasion, rare even for the only driver that does use this mechanism:
         sja1105). That was necessary to avoid kfree_skb from freeing the skb.
      
      2. Because L2 PTP is a link-local protocol like STP, it requires
         management routes and deferred xmit with this switch. But as opposed
         to STP, the deferred work mechanism needs to schedule the packet
         rather quickly for the TX timstamp to be collected in time and sent
         to user space. But there is no provision for controlling the
         scheduling priority of this deferred xmit workqueue. Too bad this is
         a rather specific requirement for a feature that nobody else uses
         (more below).
      
      3. Perhaps most importantly, it makes the DSA core adhere a bit too
         much to the NXP company-wide policy "Innovate Where It Doesn't
         Matter". The sja1105 is probably the only DSA switch that requires
         some frames sent from the CPU to be routed to the slave port via an
         out-of-band configuration (register write) rather than in-band (DSA
         tag). And there are indeed very good reasons to not want to do that:
         if that out-of-band register is at the other end of a slow bus such
         as SPI, then you limit that Ethernet flow's throughput to effectively
         the throughput of the SPI bus. So hardware vendors should definitely
         not be encouraged to design this way. We do _not_ want more
         widespread use of this mechanism.
      
      Luckily we have a solution for each of the 3 issues:
      
      For 1, we can just remove that variable in the skb->cb and counteract
      the effect of kfree_skb with skb_get, much to the same effect. The
      advantage, of course, being that anybody who doesn't use deferred xmit
      doesn't need to do any extra operation in the hotpath.
      
      For 2, we can create a kernel thread for each port's deferred xmit work.
      If the user switch ports are named swp0, swp1, swp2, the kernel threads
      will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
      character length limit on kernel thread names). With this, the user can
      change the scheduling priority with chrt $(pidof swp2_xmit).
      
      For 3, we can actually move the entire implementation to the sja1105
      driver.
      
      So this patch deletes the generic implementation from the DSA core and
      adds a new one, more adequate to the requirements of PTP TX
      timestamping, in sja1105_main.c.
      Suggested-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a68578c2
  25. 29 12月, 2019 1 次提交
    • V
      net: dsa: Deny PTP on master if switch supports it · f685e609
      Vladimir Oltean 提交于
      It is possible to kill PTP on a DSA switch completely and absolutely,
      until a reboot, with a simple command:
      
      tcpdump -i eth2 -j adapter_unsynced
      
      where eth2 is the switch's DSA master.
      
      Why? Well, in short, the PTP API in place today is a bit rudimentary and
      relies on applications to retrieve the TX timestamps by polling the
      error queue and looking at the cmsg structure. But there is no timestamp
      identification of any sorts (except whether it's HW or SW), you don't
      know how many more timestamps are there to come, which one is this one,
      from whom it is, etc. In other words, the SO_TIMESTAMPING API is
      fundamentally limited in that you can get a single HW timestamp from the
      stack.
      
      And the "-j adapter_unsynced" flag of tcpdump enables hardware
      timestamping.
      
      So let's imagine what happens when the DSA master decides it wants to
      deliver TX timestamps to the skb's socket too:
      - The timestamp that the user space sees is taken by the DSA master.
        Whereas the RX timestamp will eventually be overwritten by the DSA
        switch. So the RX and TX timestamps will be in different time bases
        (aka garbage).
      - The user space applications have no way to deal with the second (real)
        TX timestamp finally delivered by the DSA switch, or even to know to
        wait for it.
      
      Take ptp4l from the linuxptp project, for example. This is its behavior
      after running tcpdump, before the patch:
      
      ptp4l[172]: [6469.594] Unexpected data on socket err queue:
      ptp4l[172]: [6469.693] rms    8 max   16 freq -21257 +/-  11 delay   748 +/-   0
      ptp4l[172]: [6469.711] Unexpected data on socket err queue:
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 05 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.721] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b1 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.838] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 06 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.848] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 13 02
      ptp4l[172]: 0010 00 36 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 04 1a 45 05 7f
      ptp4l[172]: 0030 00 00 5e 05 41 32 27 c2 1a 68 00 04 9f ff fe 05
      ptp4l[172]: 0040 de 06 00 01
      ptp4l[172]: [6469.855] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b2 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.974] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 07 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      
      The ptp4l program itself is heavily patched to show this (more details
      here [0]). Otherwise, by default it just hangs.
      
      On the other hand, with the DSA patch to disallow HW timestamping
      applied:
      
      tcpdump -i eth2 -j adapter_unsynced
      tcpdump: SIOCSHWTSTAMP failed: Device or resource busy
      
      So it is a fact of life that PTP timestamping on the DSA master is
      incompatible with timestamping on the switch MAC, at least with the
      current API. And if the switch supports PTP, taking the timestamps from
      the switch MAC is highly preferable anyway, due to the fact that those
      don't contain the queuing latencies of the switch. So just disallow PTP
      on the DSA master if there is any PTP-capable switch attached.
      
      [0]: https://sourceforge.net/p/linuxptp/mailman/message/36880648/
      
      Fixes: 0336369d ("net: dsa: forward hardware timestamping ioctls to switch driver")
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f685e609
  26. 21 12月, 2019 2 次提交
  27. 18 12月, 2019 1 次提交