1. 05 Feb, 2021 15 commits
  2. 04 Feb, 2021 25 commits
    • W
      tcp: use a smaller percpu_counter batch size for sk_alloc · f5a5589c
      Wei Wang committed
      Currently, a percpu_counter with the default batch size (2*nr_cpus) is
      used to record the total # of active sockets per protocol. This means
      sk_sockets_allocated_read_positive() could be off by +/-2*(nr_cpus^2).
      This under/over-estimation could lead to wrong memory suppression
      conditions in __sk_raise_mem_allocated().
      Fix this by using a more reasonable fixed batch size of 16.
      
      See related commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting") that addresses a similar issue.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20210202193408.1171634-1-weiwan@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f5a5589c
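The error bound quoted in this commit can be illustrated with a toy userspace model. This is a sketch only, not the kernel's percpu_counter code: the `toy_*` names are hypothetical, and the model assumes each CPU folds its local delta into the global count once the magnitude reaches the batch size. A cheap read of the global count can then miss up to (batch - 1) per CPU.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_CPUS 64

/* Toy model of a batched per-CPU counter (not the kernel implementation). */
struct toy_counter {
    long global;               /* value seen by a cheap read */
    long local[TOY_MAX_CPUS];  /* per-CPU deltas not yet folded in */
    int nr_cpus;
    long batch;
};

static void toy_init(struct toy_counter *c, int nr_cpus, long batch)
{
    assert(nr_cpus > 0 && nr_cpus <= TOY_MAX_CPUS && batch > 0);
    c->global = 0;
    c->nr_cpus = nr_cpus;
    c->batch = batch;
    for (int i = 0; i < nr_cpus; i++)
        c->local[i] = 0;
}

/* Add amount on one CPU; fold into the global count once |local| >= batch. */
static void toy_add(struct toy_counter *c, int cpu, long amount)
{
    long v = c->local[cpu] + amount;

    if (labs(v) >= c->batch) {
        c->global += v;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Exact value: the global count plus every unfolded per-CPU delta. */
static long toy_sum(const struct toy_counter *c)
{
    long sum = c->global;

    for (int i = 0; i < c->nr_cpus; i++)
        sum += c->local[i];
    return sum;
}

/* Worst-case gap between the cheap read and the exact sum in this model. */
static long toy_max_drift(int nr_cpus, long batch)
{
    return (long)nr_cpus * (batch - 1);
}
```

With 64 CPUs, a batch of 2*nr_cpus = 128 allows a drift of 64 * 127 in this model, while a fixed batch of 16 caps it at 64 * 15.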
    • J
      Merge branch 'support-setting-lanes-via-ethtool' · 6fd5eeee
      Jakub Kicinski committed
      Danielle Ratson says:
      
      ====================
      Support setting lanes via ethtool
      
      Some speeds can be achieved with different number of lanes. For example,
      100Gbps can be achieved using two lanes of 50Gbps or four lanes of
      25Gbps. This patchset adds a new selector that allows ethtool to
      advertise link modes according to their number of lanes and also force a
      specific number of lanes when autonegotiation is off.
      
      Advertising all link modes with a speed of 100Gbps that use two lanes:
      
      $ ethtool -s swp1 speed 100000 lanes 2 autoneg on
      
      Forcing a speed of 100Gbps using four lanes:
      
      $ ethtool -s swp1 speed 100000 lanes 4 autoneg off
      ====================
      
      Link: https://lore.kernel.org/r/20210202180612.325099-1-danieller@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6fd5eeee
    • D
      net: selftests: Add lanes setting test · f72e2f48
      Danielle Ratson committed
      Test that setting the lanes parameter works.
      
      Set max speed and max lanes in the list of advertised link modes,
      and then try to set max speed with a lanes value below max lanes, if
      such a link mode exists in the list.
      
      Then test that setting a number of lanes larger than max lanes fails.
      
      Do the above for both autoneg on and off.
      
      $ ./ethtool_lanes.sh
      
      TEST: 4 lanes is autonegotiated                                     [ OK ]
      TEST: Lanes number larger than max width is not set                 [ OK ]
      TEST: Autoneg off, 4 lanes detected during force mode               [ OK ]
      TEST: Lanes number larger than max width is not set                 [ OK ]
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f72e2f48
    • D
      mlxsw: ethtool: Pass link mode in use to ethtool · 25a96f05
      Danielle Ratson committed
      Currently, when user space queries the link's parameters, such as speed
      and duplex, each parameter is passed from the driver to ethtool.
      
      Instead, pass the link mode bit in use.
      In Spectrum-1, simply pass the bit that is set to '1' from PTYS register.
      In Spectrum-2, pass the first link mode bit in the mask of the used
      link mode.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      25a96f05
    • D
      mlxsw: ethtool: Add support for setting lanes when autoneg is off · 763ece86
      Danielle Ratson committed
      Currently, when auto negotiation is set to off, the user can force a
      specific speed or both speed and duplex. The user cannot influence the
      number of lanes that will be forced.
      
      Add support for setting speed along with lanes so one would be able
      to choose how many lanes will be forced.
      
      When the lanes parameter is passed from user space, choose the link mode
      whose actual width equals it.
      Otherwise, the default link mode will be the one that supports the width
      of the port.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      763ece86
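The selection rule described in this commit can be sketched as a table lookup. This is a minimal model under stated assumptions, not mlxsw code: the `toy_*` names and the small table are hypothetical, and `lanes == 0` stands for "lanes not passed from user space".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical link mode description used only for this sketch. */
struct toy_link_mode {
    const char *name;
    unsigned int speed_mbps;
    unsigned int lanes;
};

static const struct toy_link_mode toy_modes[] = {
    { "50000baseCR2/Full",   50000, 2 },
    { "50000baseCR/Full",    50000, 1 },
    { "100000baseCR4/Full", 100000, 4 },
    { "100000baseCR2/Full", 100000, 2 },
};

/*
 * Pick the mode to force when autoneg is off.
 * If lanes is given, the mode's width must equal it; otherwise fall back
 * to the mode whose width matches the port width.
 * Returns NULL when no mode fits.
 */
static const struct toy_link_mode *
toy_pick_forced_mode(unsigned int speed_mbps, unsigned int lanes,
                     unsigned int port_width)
{
    unsigned int want = lanes ? lanes : port_width;

    assert(port_width > 0);
    for (size_t i = 0; i < sizeof(toy_modes) / sizeof(toy_modes[0]); i++) {
        if (toy_modes[i].speed_mbps == speed_mbps &&
            toy_modes[i].lanes == want)
            return &toy_modes[i];
    }
    return NULL;
}
```

For a 4-lane port, `toy_pick_forced_mode(100000, 2, 4)` selects the 2-lane mode, and `toy_pick_forced_mode(100000, 0, 4)` selects the 4-lane one.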
    • D
      mlxsw: ethtool: Remove max lanes filtering · 5fc4053d
      Danielle Ratson committed
      Currently, when a speed can be supported by different numbers of lanes,
      the supported link modes bitmask contains only link modes with a single
      number of lanes.
      
      This was done in order to prevent auto negotiation on number of
      lanes after 50G-1-lane and 100G-2-lanes link modes were introduced.
      
      For example, if a port's max width is 4, only link modes with 4 lanes
      will be presented as supported by that port, so 100G is always achieved by
      4 lanes of 25G.
      
      After the previous patches that allow selection of the number of lanes,
      auto negotiation on number of lanes becomes practical.
      
      Remove the filtering of supported link modes by the maximum number of
      lanes, so that all supported and advertised link modes are shown.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5fc4053d
    • D
      ethtool: Expose the number of lanes in use · 7dc33f09
      Danielle Ratson committed
      Currently, ethtool does not expose how many lanes are used when the
      link is up.
      
      After adding a possibility to advertise or force a specific number of
      lanes, the lanes in use value can be either the maximum width of the port
      or below.
      
      Extend ethtool to expose the number of lanes currently in use for
      drivers that support it.
      
      For example:
      
      $ ethtool -s swp1 speed 100000 lanes 4
      $ ethtool -s swp2 speed 100000 lanes 4
      $ ip link set swp1 up
      $ ip link set swp2 up
      $ ethtool swp1
      Settings for swp1:
              Supported ports: [ FIBRE         Backplane ]
              Supported link modes:   1000baseT/Full
                                      10000baseT/Full
                                      1000baseKX/Full
                                      10000baseKR/Full
                                      10000baseR_FEC
                                      40000baseKR4/Full
                                      40000baseCR4/Full
                                      40000baseSR4/Full
                                      40000baseLR4/Full
                                      25000baseCR/Full
                                      25000baseKR/Full
                                      25000baseSR/Full
                                      50000baseCR2/Full
                                      50000baseKR2/Full
                                      100000baseKR4/Full
                                      100000baseSR4/Full
                                      100000baseCR4/Full
                                      100000baseLR4_ER4/Full
                                      50000baseSR2/Full
                                      10000baseCR/Full
                                      10000baseSR/Full
                                      10000baseLR/Full
                                      10000baseER/Full
                                      50000baseKR/Full
                                      50000baseSR/Full
                                      50000baseCR/Full
                                      50000baseLR_ER_FR/Full
                                      50000baseDR/Full
                                      100000baseKR2/Full
                                      100000baseSR2/Full
                                      100000baseCR2/Full
                                      100000baseLR2_ER2_FR2/Full
                                      100000baseDR2/Full
                                      200000baseKR4/Full
                                      200000baseSR4/Full
                                      200000baseLR4_ER4_FR4/Full
                                      200000baseDR4/Full
                                      200000baseCR4/Full
              Supported pause frame use: Symmetric Receive-only
              Supports auto-negotiation: Yes
              Supported FEC modes: Not reported
              Advertised link modes:  1000baseT/Full
                                      10000baseT/Full
                                      1000baseKX/Full
                                      1000baseKX/Full
                                      10000baseKR/Full
                                      10000baseR_FEC
                                      40000baseKR4/Full
                                      40000baseCR4/Full
                                      40000baseSR4/Full
                                      40000baseLR4/Full
                                      25000baseCR/Full
                                      25000baseKR/Full
                                      25000baseSR/Full
                                      50000baseCR2/Full
                                      50000baseKR2/Full
                                      100000baseKR4/Full
                                      100000baseSR4/Full
                                      100000baseCR4/Full
                                      100000baseLR4_ER4/Full
                                      50000baseSR2/Full
                                      10000baseCR/Full
                                      10000baseSR/Full
                                      10000baseLR/Full
                                      10000baseER/Full
                                      200000baseKR4/Full
                                      200000baseSR4/Full
                                      200000baseLR4_ER4_FR4/Full
                                      200000baseDR4/Full
                                      200000baseCR4/Full
              Advertised pause frame use: No
              Advertised auto-negotiation: Yes
              Advertised FEC modes: Not reported
              Advertised link modes:  100000baseKR4/Full
                                      100000baseSR4/Full
                                      100000baseCR4/Full
                                      100000baseLR4_ER4/Full
      	Advertised pause frame use: No
      	Advertised auto-negotiation: Yes
      	Advertised FEC modes: Not reported
      	Speed: 100000Mb/s
      	Lanes: 4
      	Duplex: Full
      	Auto-negotiation: on
      	Port: Direct Attach Copper
      	PHYAD: 0
      	Transceiver: internal
      	Link detected: yes
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7dc33f09
    • D
      ethtool: Get link mode in use instead of speed and duplex parameters · c8907043
      Danielle Ratson committed
      Currently, when user space queries the link's parameters, such as speed
      and duplex, each parameter is passed from the driver to ethtool.
      
      Instead, get the link mode bit in use, and derive each of the parameters
      from it in ethtool.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c8907043
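Deriving the parameters from a single link mode can be pictured as one table lookup. The sketch below is illustrative only: the `TOY_*` enumerators are hypothetical and are not the real ETHTOOL_LINK_MODE_* bit numbers, and the table holds just three modes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical link mode indices for this sketch. */
enum toy_mode {
    TOY_25000baseCR_Full,
    TOY_50000baseCR2_Full,
    TOY_100000baseCR4_Full,
    TOY_MODE_COUNT
};

enum toy_duplex { TOY_DUPLEX_HALF, TOY_DUPLEX_FULL };

struct toy_mode_info {
    unsigned int speed_mbps;
    unsigned int lanes;
    enum toy_duplex duplex;
};

/* One row per link mode; everything the caller reports comes from here. */
static const struct toy_mode_info toy_info[TOY_MODE_COUNT] = {
    [TOY_25000baseCR_Full]   = {  25000, 1, TOY_DUPLEX_FULL },
    [TOY_50000baseCR2_Full]  = {  50000, 2, TOY_DUPLEX_FULL },
    [TOY_100000baseCR4_Full] = { 100000, 4, TOY_DUPLEX_FULL },
};

/* Fill *out from the link mode in use; returns -1 for an unknown mode. */
static int toy_params_from_mode(int mode, struct toy_mode_info *out)
{
    assert(out != NULL);
    if (mode < 0 || mode >= TOY_MODE_COUNT)
        return -1;
    *out = toy_info[mode];
    return 0;
}
```

The driver side then only has to report which mode is in use, and speed, lanes and duplex stay consistent with each other by construction.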
    • D
      ethtool: Extend link modes settings uAPI with lanes · 012ce4dd
      Danielle Ratson committed
      Currently, when auto negotiation is on, the user can advertise all the
      linkmodes which correspond to a specific speed, but does not have a
      similar selector for the number of lanes. This is significant when a
      specific speed can be achieved using different number of lanes.  For
      example, 2x50 or 4x25.
      
      Add 'ETHTOOL_A_LINKMODES_LANES' attribute and expand 'struct
      ethtool_link_settings' with lanes field in order to implement a new
      lanes-selector that will enable the user to advertise a specific number
      of lanes as well.
      
      When auto negotiation is off, the lanes parameter can be forced only if
      the driver supports it. Add a capability bit in 'struct ethtool_ops' that
      lets ethtool know whether the driver can handle the lanes parameter when
      auto negotiation is off; if it cannot, an error message is returned
      when trying to set lanes.
      
      Example:
      
      $ ethtool -s swp1 lanes 4
      $ ethtool swp1
        Settings for swp1:
      	Supported ports: [ FIBRE ]
              Supported link modes:   1000baseKX/Full
                                      10000baseKR/Full
                                      40000baseCR4/Full
      				40000baseSR4/Full
      				40000baseLR4/Full
                                      25000baseCR/Full
                                      25000baseSR/Full
      				50000baseCR2/Full
                                      100000baseSR4/Full
      				100000baseCR4/Full
              Supported pause frame use: Symmetric Receive-only
              Supports auto-negotiation: Yes
              Supported FEC modes: Not reported
              Advertised link modes:  40000baseCR4/Full
      				40000baseSR4/Full
      				40000baseLR4/Full
                                      100000baseSR4/Full
      				100000baseCR4/Full
              Advertised pause frame use: No
              Advertised auto-negotiation: Yes
              Advertised FEC modes: Not reported
              Speed: Unknown!
              Duplex: Unknown! (255)
              Auto-negotiation: on
              Port: Direct Attach Copper
              PHYAD: 0
              Transceiver: internal
              Link detected: no
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      012ce4dd
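The lanes selector for the autoneg-on case amounts to filtering the supported link modes by speed and lane count. A minimal sketch, assuming a hypothetical four-entry table where bit i of a mask refers to `toy_modes[i]`, and where 0 means "no constraint" for either selector:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_mode {
    unsigned int speed_mbps;
    unsigned int lanes;
};

/* Bit i of a mask refers to toy_modes[i] (hypothetical numbering). */
static const struct toy_mode toy_modes[] = {
    {  40000, 4 },  /* bit 0: e.g. 40000baseCR4  */
    {  50000, 2 },  /* bit 1: e.g. 50000baseCR2  */
    { 100000, 4 },  /* bit 2: e.g. 100000baseCR4 */
    { 100000, 2 },  /* bit 3: e.g. 100000baseCR2 */
};

/*
 * Return the subset of supported modes matching the selectors.
 * speed_mbps == 0 or lanes == 0 means that selector is not applied.
 */
static uint32_t toy_advertise(uint32_t supported, unsigned int speed_mbps,
                              unsigned int lanes)
{
    uint32_t adv = 0;
    size_t n = sizeof(toy_modes) / sizeof(toy_modes[0]);

    assert(n <= 32);
    for (size_t i = 0; i < n; i++) {
        if (!(supported & (UINT32_C(1) << i)))
            continue;
        if (speed_mbps && toy_modes[i].speed_mbps != speed_mbps)
            continue;
        if (lanes && toy_modes[i].lanes != lanes)
            continue;
        adv |= UINT32_C(1) << i;
    }
    return adv;
}
```

With all four modes supported, selecting only `lanes 4` keeps the two 4-lane modes, which mirrors the shape of the ethtool output in the example.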
    • D
      ethtool: Validate master slave configuration before rtnl_lock() · 189e7a8d
      Danielle Ratson committed
      Create a new function for input validations to be called before
      rtnl_lock() and move the master slave validation to that function.
      
      This is a cleanup for the next patch, which adds another validation
      to the new function.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      189e7a8d
    • V
      net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored · 99b8202b
      Vladimir Oltean committed
      The bridge emits VLAN filtering events and quite a few others via
      switchdev with orig_dev = br->dev. After the blamed commit, these events
      started getting ignored.
      
      The point of the patch was to not offload switchdev objects for ports
      that didn't go through dsa_port_bridge_join, because the configuration
      is unsupported:
      - ports that offload a bonding/team interface go through
        dsa_port_bridge_join when that bonding/team interface is later bridged
        with another switch port or LAG
      - ports that don't offload LAG don't get notified of the bridge that is
        on top of that LAG.
      
      Sadly, a check is missing, which is that the orig_dev is equal to the
      bridge device. This check is compatible with the original intention,
      because ports that don't offload bridging because they use a software
      LAG don't have dp->bridge_dev set.
      
      On a semi-related note, we should not offload switchdev objects or
      populate dp->bridge_dev if the driver doesn't implement .port_bridge_join
      either. However there is no regression associated with that, so it can
      be done separately.
      
      Fixes: 5696c8ae ("net: dsa: Don't offload port attributes on standalone ports")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
      Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
      Link: https://lore.kernel.org/r/20210202233109.1591466-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      99b8202b
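The missing condition can be modelled in a few lines. The commit text does not show the code, so this is only a model of the described check: the `toy_*` structs and functions are hypothetical, and the "before" variant assumes that only events emitted for the port's own netdev were accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_netdev {
    const char *name;
};

struct toy_port {
    const struct toy_netdev *self;        /* the port's own netdev */
    const struct toy_netdev *bridge_dev;  /* set only after a bridge join */
};

/* Assumed regressed behaviour: events with orig_dev == bridge are dropped. */
static bool toy_offloads_before(const struct toy_port *p,
                                const struct toy_netdev *orig_dev)
{
    assert(p != NULL);
    return orig_dev == p->self;
}

/* With the added check: events emitted for the port's bridge also pass. */
static bool toy_offloads_after(const struct toy_port *p,
                               const struct toy_netdev *orig_dev)
{
    assert(p != NULL);
    if (orig_dev == p->self)
        return true;
    return p->bridge_dev != NULL && orig_dev == p->bridge_dev;
}
```

A port that never joined the bridge has `bridge_dev == NULL`, so it still ignores bridge events, which matches the original intention stated in the commit.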
    • J
      Merge branch 'chelsio-cxgb-use-threaded-interrupts-for-deferred-work' · 75b8f78f
      Jakub Kicinski committed
      Sebastian Andrzej Siewior says:
      
      ====================
      chelsio: cxgb: Use threaded interrupts for deferred work
      
      Patch #2 fixes an issue in which del_timer_sync() and tasklet_kill() are
      invoked from the interrupt handler. This is probably a rare error case
      since it disables interrupts / the card in that case.
      Patch #1 converts a worker to use a threaded interrupt which is then
      also used in patch #2 instead of adding another worker for this task (and
      flush_work() to synchronise vs rmmod).
      ====================
      
      Link: https://lore.kernel.org/r/20210202170104.1909200-1-bigeasy@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      75b8f78f
    • S
      chelsio: cxgb: Disable the card on error in threaded interrupt · 82154580
      Sebastian Andrzej Siewior committed
      t1_fatal_err() is invoked from the interrupt handler. The bad part is
      that it invokes (via t1_sge_stop()) del_timer_sync() and tasklet_kill().
      Neither function may be called from an interrupt because it could
      end up waiting for the completion of the timer/tasklet it just
      interrupted.
      
      In case of a fatal error, use t1_interrupts_disable() to disable all
      interrupt sources and then wake the interrupt thread with
      F_PL_INTR_SGE_ERR as pending flag. The threaded-interrupt will stop the
      card via t1_sge_stop() and not re-enable the interrupts again.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      82154580
    • S
      chelsio: cxgb: Replace the workqueue with threaded interrupt · fec7fa0a
      Sebastian Andrzej Siewior committed
      The external interrupt (F_PL_INTR_EXT) needs to be handled in a process
      context and this is accomplished by utilizing a workqueue.
      
      The process context can also be provided by a threaded interrupt instead
      of a workqueue. The threaded interrupt can be used later for other
      interrupt related processing which requires non-atomic context without
      using yet another workqueue. free_irq() also ensures that the thread is
      done which is currently missing (the worker could continue after the
      module has been removed).
      
      Save pending flags in pending_thread_intr. Use the same mechanism
      to disable F_PL_INTR_EXT as interrupt source like it is used before the
      worker is scheduled. Enable the interrupt again once
      t1_elmer0_ext_intr_handler() is done.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fec7fa0a
    • J
      Merge branch 'support-for-octeontx2-98xx-cpt-block' · 462e99a1
      Jakub Kicinski committed
      Srujana Challa says:
      
      ====================
      Support for OcteonTX2 98xx CPT block.
      
      OcteonTX2 series of silicons have multiple variants, the
      98xx variant has two crypto (CPT) blocks to double the crypto
      performance. This patchset adds support for new CPT block(CPT1).
      ====================
      
      Link: https://lore.kernel.org/r/20210202152709.20450-1-schalla@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      462e99a1
    • S
      octeontx2-af: Handle CPT function level reset · c57c58fd
      Srujana Challa committed
      When FLR is initiated for a VF (PCI function level reset),
      the parent PF gets an interrupt. PF then sends a message to
      admin function (AF), which then cleans up all resources
      attached to that VF. This patch adds support to handle
      CPT FLR.
      Signed-off-by: Narayana Prasad Raju Atherya <pathreya@marvell.com>
      Signed-off-by: Suheil Chandran <schandran@marvell.com>
      Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: Srujana Challa <schalla@marvell.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c57c58fd
    • S
      octeontx2-af: Add support for CPT1 in debugfs · b0f60fab
      Srujana Challa committed
      Adds support to display block CPT1 stats at
      "/sys/kernel/debug/octeontx2/cpt1".
      Signed-off-by: Mahipal Challa <mchalla@marvell.com>
      Signed-off-by: Srujana Challa <schalla@marvell.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b0f60fab
    • S
      octeontx2-af: Mailbox changes for 98xx CPT block · de2854c8
      Srujana Challa committed
      This patch changes CPT mailbox message format to
      support new block CPT1 in 98xx silicon.
      
      cpt_rd_wr_reg ->
          Modify cpt_rd_wr_reg mailbox and its handler to
          accommodate new block CPT1.
      cpt_lf_alloc ->
          Modify cpt_lf_alloc mailbox and its handler to
          configure LFs from a block address out of multiple
          blocks of same type. If a PF/VF needs to configure
          LFs from both the blocks then this mbox should be
          called twice.
      Signed-off-by: Mahipal Challa <mchalla@marvell.com>
      Signed-off-by: Srujana Challa <schalla@marvell.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      de2854c8
    • M
      net: mdiobus: Prevent spike on MDIO bus reset signal · e0183b97
      Mike Looijmans committed
      The mdio_bus reset code first de-asserted the reset by allocating with
      GPIOD_OUT_LOW, then asserted and de-asserted again. In other words, if
      the reset signal defaulted to asserted, there'd be a short "spike"
      before the reset.
      
      Here is what happens depending on the pre-existing state of the reset
      signal:
      Reset (previously asserted):   ~~~|_|~~~~|_______
      Reset (previously deasserted): _____|~~~~|_______
                                        ^ ^    ^
                                        A B    C
      
      At point A, the low going transition is because the reset line is
      requested using GPIOD_OUT_LOW. If the line is successfully requested,
      the first thing we do is set it high _without_ any delay. This is
      point B. So, a glitch occurs between A and B.
      
      We then fsleep() and finally set the GPIO low at point C.
      
      Requesting the line using GPIOD_OUT_HIGH eliminates the A and B
      transitions. Instead we get:
      
      Reset (previously asserted)  : ~~~~~~~~~~|______
      Reset (previously deasserted): ____|~~~~~|______
                                         ^     ^
                                         A     C
      
      Where A and C are the points described above in the code. Point B
      has been eliminated.
      
      The issue was found when we pulled down the reset signal for the
      Marvell 88E1512P PHY (because it requires at least 50ms after POR with
      an active clock). Looking at the reset signal with a scope revealed a
      short spike, point B in the artwork above.
      Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20210202143239.10714-1-mike.looijmans@topic.nl
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e0183b97
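The two waveforms in this commit can be reproduced by counting edges on a simulated line. This is a toy model, not GPIO API code: the `toy_*` names are hypothetical, 1 stands for "reset asserted", and requesting the line with an initial value is modelled as a plain write of that value.

```c
#include <assert.h>

/* Simulated reset line: logical value plus the number of edges seen. */
struct toy_line {
    int value;        /* 1 = reset asserted, 0 = de-asserted */
    int transitions;  /* edges produced so far */
};

static void toy_set(struct toy_line *l, int v)
{
    assert(v == 0 || v == 1);
    if (l->value != v)
        l->transitions++;
    l->value = v;
}

/* Old sequence: request as output-low, assert, (sleep), de-assert. */
static int toy_reset_old(int initial)
{
    struct toy_line l = { initial, 0 };

    toy_set(&l, 0);  /* point A: request with the line low */
    toy_set(&l, 1);  /* point B: assert immediately afterwards */
    toy_set(&l, 0);  /* point C: de-assert after the delay */
    return l.transitions;
}

/* New sequence: request as output-high, (sleep), de-assert. */
static int toy_reset_new(int initial)
{
    struct toy_line l = { initial, 0 };

    toy_set(&l, 1);  /* request with the line already asserted */
    toy_set(&l, 0);  /* point C: de-assert after the delay */
    return l.transitions;
}
```

For a line that starts asserted, the old sequence yields three edges (the glitch between A and B plus C) and the new one yields a single edge at C.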
    • D
      net: mscc: ocelot: fix error code in mscc_ocelot_probe() · 4160d9ec
      Dan Carpenter committed
      Probe should return an error code if platform_get_irq_byname() fails
      but it returns success instead.
      
      Fixes: 6c30384e ("net: mscc: ocelot: register devlink ports")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/YBkXyFIl4V9hgxYM@mwanda
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4160d9ec
    • D
      net: mscc: ocelot: fix error handling bugs in mscc_ocelot_init_ports() · e0c16233
      Dan Carpenter committed
      There are several error handling bugs in mscc_ocelot_init_ports().  I
      went through the code, and carefully audited it and made fixes and
      cleanups.
      
      1) The ocelot_probe_port() function didn't have a mirror release function
         so it was hard to follow.  I created the ocelot_release_port()
         function.
      2) In the ocelot_probe_port() function, if the register_netdev() call
         failed, then it led to a double free_netdev(dev) bug.  Fix this by
         setting "ocelot->ports[port] = NULL" on the error path.
      3) I was concerned that the "port" which comes from of_property_read_u32()
         might be out of bounds so I added a check for that.
      4) In the original code if ocelot_regmap_init() failed then the driver
         tried to continue but I think that should be a fatal error.
      5) If ocelot_probe_port() failed then the most recent devlink was leaked.
         The fix for this mostly came from Vladimir Oltean.  Get rid of "registered_ports"
         and just set a bit in "devlink_ports_registered" to say when the
         devlink port has been registered (and needs to be unregistered on
         error).  There are fewer than 32 ports so a u32 is large enough for
         this purpose.
      6) The error handling if the final ocelot_port_devlink_init() failed had
         two problems.  The "while (port-- >= 0)" loop should have been
         "--port" pre-op instead of a post-op to avoid a buffer underflow.
         The "if (!registered_ports[port])" condition was reversed leading to
         resource leaks and double frees.
      
      Fixes: 6c30384e ("net: mscc: ocelot: register devlink ports")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/YBkXhqRxHtRGzSnJ@mwanda
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e0c16233
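The underflow in item 6 comes down to when the decrement happens relative to the comparison. A small standalone sketch (the `toy_*` helpers are hypothetical and only record which index the loop body would touch):

```c
#include <assert.h>

/*
 * Post-decrement unwind: the comparison uses the old value, the body then
 * sees the decremented one, so the final iteration runs with port == -1.
 * Returns the lowest index the body would touch.
 */
static int toy_lowest_index_post(int port)
{
    int lowest = port;

    assert(port >= 0);
    while (port-- >= 0) {
        if (port < lowest)
            lowest = port;
    }
    return lowest;
}

/*
 * Pre-decrement unwind: decrement first, then compare, so the body only
 * runs for port - 1 down to 0. Returns the lowest index touched, or the
 * starting value if the body never runs.
 */
static int toy_lowest_index_pre(int port)
{
    int lowest = port;

    assert(port >= 0);
    while (--port >= 0) {
        if (port < lowest)
            lowest = port;
    }
    return lowest;
}
```

Starting from port 3, the post-decrement form reaches index -1, while the pre-decrement form stops at index 0.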
    • J
      Merge branch 'net-use-indirect_call-in-some-dst_ops' · 2d912da0
      Jakub Kicinski committed
      Brian Vazquez says:
      
      ====================
      net: use INDIRECT_CALL in some dst_ops
      
      This patch series uses the INDIRECT_CALL wrappers in some dst_ops
      functions to mitigate retpoline costs. Benefits depend on the
      platform as described below.
      
      Background: The kernel rewrites the retpoline code at
      __x86_indirect_thunk_r11 depending on the CPU's requirements.
      The INDIRECT_CALL wrappers provide hints on possible targets and
      save the retpoline overhead using a direct call in case the
      target matches one of the hints.
      
      The retpoline overhead for the following three cases has been
      measured by Luigi Rizzo in microbenchmarks, using CPU performance
      counters, and cover reasonably well the range of possible retpoline
      overheads compared to a plain indirect call (in equal conditions,
      specifically with predicted branch, hot cache):
      
      - just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
        In this case the overhead is just 2 clock cycles.
      
      - "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
        In this case the lfence is blocked until pending reads complete,
        so the actual overhead depends on previous instructions.
        In the best case we have measured 15 clock cycles of overhead.
      
      - worst case, e.g. skylake, the full retpoline is used
      
          __x86_indirect_thunk_r11:     call set_up_target
          capture_speculation:          pause
                                        lfence
                                        jmp capture_speculation
          .align 16
          set_up_target:                mov %r11, (%rsp)
                                        ret
      
         In this case the overhead has been measured in 35-40 clock cycles.
      
      The actual time saved hence depends on the platform and current
      clock speed (which varies heavily, especially when C-states are active).
      Also note that actual benefit might be lower than expected if the
      longer retpoline overlaps with some pending memory read.
      
      MEASUREMENTS:
      The INDIRECT_CALL wrappers in this patchset involve the processing
      of incoming SYN and generation of syncookies. Hence, the test has been
      run by configuring a receiving host with a single NIC rx queue, disabling
      RPS and RFS so that all processing occurs on the same core.
      An external source generates SYN fast enough to saturate the receiving CPU.
      We ran two sets of experiments, with and without the dst_output patch,
      comparing the number of syncookies generated over a 20s period
      in multiple runs.
      
      Assuming the CPU is saturated, the time per packet is
         t = total_time/number_of_packets
      and if the two datasets have statistically meaningful difference,
      the difference in times between the two cases gives an estimate
      of the benefits from one INDIRECT_CALL.
      
      Here are the experimental results:
      
      Skylake     Syncookies over 20s (5 tests)
      ---------------------------------------------------
      indirect    9166325 9182023 9170093 9134014 9171082
      retpoline   9099308 9126350 9154841 9056377 9122376
      
      Computing the stats on the per-packet time in seconds (20/total_packets) gives the following:
      
      $ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
      x /tmp/sk-indirect
      + /tmp/sk-retp
      +----------------------------------------------------------------------+
      |x     xx x     +          x    + +           +                       +|
      ||______M__A_______|_|____________M_____A___________________|          |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   2.17817e-06   2.18962e-06     2.181e-06  2.182292e-06 4.3252133e-09
      +   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
      Difference at 95.0% confidence
              1.2682e-08 +/- 1.01766e-08
              0.581132% +/- 0.466326%
              (Student's t, pooled s = 6.97772e-09)
      
      This suggests a difference of 13ns +/- 10ns
      Our expectation from microbenchmarks was 35-40 cycles per call,
      but part of the gains may be eaten by stalls from pending memory reads.
      
      For Cascadelake:
      Cascadelake     Syncookies over 20s (5 tests)
      ---------------------------------------------------------
      indirect     10339797 10297547 10366826 10378891 10384854
      retpoline    10332674 10366805 10320374 10334272 10374087
      
      Computing the stats on the per-packet time in seconds (20/total_packets) gives no
      meaningful difference even at just 80% confidence (this was expected):
      
      $ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
      x /tmp/cl-indirect
      + /tmp/cl-retp
      +----------------------------------------------------------------------+
      |   x    x  +     *                   x   + +        +                x|
      ||______________|_M_________A_____A_______M________|___|               |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
      +   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
      No difference proven at 80.0% confidence
      ====================
      
      Link: https://lore.kernel.org/r/20210201174132.3534118-1-brianvv@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2d912da0
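The Skylake averages reported by ministat can be recomputed from the raw syncookie counts. The helper below is a hypothetical sketch: it takes the per-run packet counts and the run length in seconds, and returns the mean time per packet in seconds, assuming the CPU was saturated for the whole run.

```c
#include <assert.h>
#include <stddef.h>

/* Mean time per packet (seconds) over n runs of the given duration. */
static double toy_mean_time_per_packet(const unsigned long *counts, size_t n,
                                       double run_seconds)
{
    double sum = 0.0;

    assert(counts != NULL && n > 0 && run_seconds > 0.0);
    for (size_t i = 0; i < n; i++)
        sum += run_seconds / (double)counts[i];
    return sum / (double)n;
}

/* Skylake syncookie counts over 20 s, copied from the table in the cover letter. */
static const unsigned long toy_sk_indirect[5] = {
    9166325UL, 9182023UL, 9170093UL, 9134014UL, 9171082UL
};
static const unsigned long toy_sk_retpoline[5] = {
    9099308UL, 9126350UL, 9154841UL, 9056377UL, 9122376UL
};
```

Subtracting the two means gives a per-packet difference of roughly 1.27e-08 s, in line with the "13ns +/- 10ns" summary.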
    • B
      net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df
      Brian Vazquez committed
      This patch avoids the indirect call for the common case:
      ip6_dst_check and ipv4_dst_check
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bbd807df
    • B
      net: use indirect call helpers for dst_mtu · f67fbeae
      Brian Vazquez committed
      This patch avoids the indirect call for the common case:
      ip6_mtu and ipv4_mtu
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f67fbeae
    • B
      net: use indirect call helpers for dst_output · 6585d7dc
      Brian Vazquez committed
      This patch avoids the indirect call for the common case:
      ip6_output and ip_output
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6585d7dc
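The idea behind the indirect call helpers used in these three patches can be shown in userspace. The macros below are a simplified imitation under hypothetical `TOY_` names, without the kernel's branch hints or config gating, and the three `toy_*_mtu` callbacks have arbitrary behaviour chosen only to tell them apart.

```c
#include <assert.h>
#include <stddef.h>

/* Compare the pointer against likely targets and call those directly. */
#define TOY_INDIRECT_CALL_1(f, f1, ...) \
    (((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define TOY_INDIRECT_CALL_2(f, f2, f1, ...) \
    (((f) == (f2)) ? f2(__VA_ARGS__) : TOY_INDIRECT_CALL_1(f, f1, __VA_ARGS__))

/* Hypothetical ops table with a single callback. */
struct toy_dst_ops {
    unsigned int (*mtu)(unsigned int dev_mtu);
};

static unsigned int toy_ipv4_mtu(unsigned int dev_mtu)
{
    return dev_mtu;
}

static unsigned int toy_ip6_mtu(unsigned int dev_mtu)
{
    return dev_mtu < 1280 ? 1280 : dev_mtu;
}

static unsigned int toy_other_mtu(unsigned int dev_mtu)
{
    return dev_mtu - 8;
}

/* Common targets become direct calls; anything else stays indirect. */
static unsigned int toy_dst_mtu(const struct toy_dst_ops *ops,
                                unsigned int dev_mtu)
{
    assert(ops != NULL && ops->mtu != NULL);
    return TOY_INDIRECT_CALL_2(ops->mtu, toy_ip6_mtu, toy_ipv4_mtu, dev_mtu);
}
```

The result is the same whichever branch is taken; only the call instruction differs, which is where the retpoline cost is avoided for the common targets.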