Commits · 8109e1232b3e5322415a9b5e09951617c5fae277 · gsplhtlxg / clone-Linux

03 Jul, 2014 17 commits

i40e/i40evf: Add new HW link info variable an_enabled and function update_link_info · 8109e123

Authored by Catherine Sullivan on Jun 04, 2014

Add a new variable, hw.phy.link_info.an_enabled, to track whether autoneg is
enabled. Also add a new function update_link_info that will update that
variable as well as calling get_link_info to update the rest of the link info.
Also add get_phy_capabilities to support this.

Change-ID: I5157ef03492b6dd8ec5e608ba0cf9b0db9c01710
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Finish implementation of ethtool get settings · 4e91bcd5

Authored by Jesse Brandeburg on Jun 04, 2014

Finish the i40e implementation of get_settings for ethtool.

Change-ID: Iec81835aa9380723ae9288bcb79b30a6a1ecd498
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: disable TPH · f846c1a0

Authored by Jesse Brandeburg on Jun 04, 2014

TPH is not currently enabled in this product, make sure it
isn't enabled by default.

Change-ID: Ibb1a10799c33c4c76dec06fcd53b1d6efa13c1f5
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Fix a boundary condition and turning off of ntuple · 8a4f34fb

Authored by Anjali Singhai Jain on Jun 04, 2014

When turning off ntuple with a FD table full situation,
the driver would have auto disabled FD filter additions.
Clear the auto disable flag for FD_SB so that when the
feature is turned on again using "ethtool -K ethx ntuple on"
we can start adding filters once again.

Change-ID: I036a32e7331bcae765b657c8abb4fa070940b163
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: invite vector 0 to the interrupt party · 164ec1bf

Authored by Mitch Williams on Jun 04, 2014

The i40evf_irq_enable and i40evf_fire_sw_interrupt functions were
unfairly discriminating against MSI-X vector 0, just because it doesn't
handle traffic. That doesn't mean it's not essential to the operation of
the driver. This change allows the watchdog to fire vector 0 via
software, which makes the driver tolerant of dropped interrupts on that
vector.

Buck up, vector 0! You can be part of our gang!

Change-ID: I37131d955018a6b3e711e1732d21428acd0d767e
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: tolerate lost interrupts · 56497978

Authored by Mitch Williams on Jun 04, 2014

If the AQ interrupt gets lost for some reason, VF communications will
stall as the VFs have no way of reaching the PF, which is essentially
deaf. The VFs end up waiting forever for a reply that will never come.

To alleviate this condition, go ahead and check the ARQ every time we
run the service task. Remove the check for a pending event, and get rid
of a chatty error message that is now meaningless.

Change-ID: I0fc9d18169cd45c98f60188aef872cd6cee9a027
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
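
The idea in the commit above, polling the receive queue on every run of the service task instead of trusting a "pending" flag set by an interrupt, can be sketched roughly as follows. This is an illustration only: the struct and function names are hypothetical and are not taken from the i40e driver.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a receive queue shared with hardware. */
struct recv_queue {
    int  queued;       /* messages waiting to be handled */
    bool irq_pending;  /* set by the interrupt handler; stays false if the IRQ is lost */
};

/* Handle everything currently queued and return how many messages were handled. */
static int drain_queue(struct recv_queue *q)
{
    int handled = q->queued;

    q->queued = 0;
    q->irq_pending = false;
    return handled;
}

/*
 * Periodic service task: drain unconditionally.  A version that first
 * tested q->irq_pending would never see messages whose interrupt was lost.
 */
int service_task(struct recv_queue *q)
{
    return drain_queue(q);
}
```

With this shape, a queue holding three messages is drained by `service_task()` even when `irq_pending` was never set.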

i40e/i40evf: Force a shifted '1' to be unsigned · 30fe8ad3

Authored by Paul M Stillwell Jr on Jun 04, 2014

Force a shifted '1' to be unsigned to avoid shifting a signed int.

Change-ID: I688cbd082af0f2e1df548fda25847a5ca04babcf
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
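
A minimal sketch of the pattern behind this change, assuming a 32-bit `int`. The helper name `bit32` is hypothetical and does not come from the driver. In C the literal `1` has type `int`, so `1 << 31` shifts into the sign bit, which is undefined behavior; starting from an unsigned operand avoids that.

```c
#include <stdint.h>

/* Return a 32-bit mask with only bit n set (valid for 0 <= n <= 31). */
static inline uint32_t bit32(unsigned int n)
{
    /* UINT32_C(1) is unsigned, so shifting into bit 31 is well defined. */
    return UINT32_C(1) << n;
}
```

For example, `bit32(31)` yields `0x80000000`, whereas the signed expression `1 << 31` is not guaranteed to.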

i40evf: don't violate scope · 4334edf5

Authored by Mitch Williams on Jun 04, 2014

Move a declaration up one level so we don't dereference it out of scope.
This didn't cause any panics, but the details->async field would
mysteriously disappear, causing unnecessary delays when sending AQ
commands. Also, the code is just plain wrong.

Change-ID: I753f64f13c55e5d75ea4351e29b14fb53b2f0104
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
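
The class of bug described above can be illustrated with a small sketch. All names here are hypothetical and only loosely modeled on the message; this is not the driver's code. If `details` were declared inside the `if` block, the pointer to it would dangle once the block ended; declaring it one level up keeps it alive for the call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command options, loosely modeled on the "details->async" field. */
struct cmd_details {
    bool async;
};

/* Hypothetical consumer standing in for a send routine; reports the async flag it saw. */
static bool send_cmd(const struct cmd_details *details)
{
    return details != NULL && details->async;
}

bool send_with_details(bool want_async)
{
    /* Declared at function scope so it outlives the if block below. */
    struct cmd_details details = { .async = false };
    const struct cmd_details *p = NULL;

    if (want_async) {
        details.async = true;
        p = &details;  /* still valid when send_cmd() dereferences it */
    }
    return send_cmd(p);
}
```

Here `send_with_details(true)` reliably sees `async` set, which is the property the out-of-scope version could lose.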

i40e/i40evf: Do not free the dummy packet buffer synchronously · 49d7d933

Authored by Anjali Singhai Jain on Jun 04, 2014

The HW still needs to consume it and freeing it in the function
that created it would mean we will be racing with the HW. The
i40e_clean_tx_ring() routine will free up the buffer attached once
the HW has consumed it.  The clean_fdir_tx_irq function had to be fixed
to handle the freeing correctly.

In cases where we program more than one filter per flow (IPv4), the
code had to be changed to allocate the dummy buffer multiple times
since it will be freed by the clean routine.  This also fixes an issue
where the filter program routine was not checking if there were
descriptors available for programming a filter.

Change-ID: Idf72028fd873221934e319d021ef65a1e51acaf7
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

drivers/net/hyperv/netvsc.c: remove unnecessary null test before kfree · bd4578bc

Authored by Fabian Frederick on Jun 28, 2014

Fix checkpatch warning:
WARNING: kfree(NULL) is safe this check is probably not required

Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
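
The same cleanup applies outside the kernel. As a userspace analogy (not the netvsc code), standard C defines `free(NULL)` as a no-op, just as the checkpatch warning says of `kfree(NULL)`, so a NULL test before the call adds nothing. The `holder` type below is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical holder for an optional heap buffer. */
struct holder {
    char *buf;
};

/* Release the buffer; safe whether or not buf was ever allocated. */
void holder_release(struct holder *h)
{
    free(h->buf);   /* free(NULL) does nothing, so no "if (h->buf)" is needed */
    h->buf = NULL;  /* makes repeated calls harmless */
}
```

Calling `holder_release()` on a holder whose `buf` is already NULL is therefore fine.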

sh_eth: remove checks around dev_kfree_skb() calls · 179d80af

Authored by Sergei Shtylyov on Jun 28, 2014

Since consume_skb() (and hence dev_kfree_skb() macro) checks the passed pointer
for NULL, there's no need to check for NULL before invoking dev_kfree_skb().
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: Update tg3 maintainer · 23629477

Authored by Prashant Sreedharan on Jun 27, 2014

Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qlcnic-next' · af7efaff

Authored by David S. Miller on Jul 02, 2014

Harish Patil says:

====================
qlcnic: Enhance Tx timeout debug data collection.

The following set of patches are for enhancing Tx timeout debug collection

- Collect a firmware dump on first Tx timeout if netif_msg_tx_err() is set
- Log Receive and Status ring info on Tx timeout, in addition to Tx ring info
- Log additional Tx ring info if netif_msg_tx_err() is set
- Update driver version to 5.3.61

Please apply this series to net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Update version to 5.3.61 · 28470572

Authored by Harish Patil on Jun 27, 2014

Signed-off-by: Harish Patil <harish.patil@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Enhance Tx timeout debug data collection. · 665d1eca

Authored by Harish Patil on Jun 27, 2014

- Collect a firmware dump on first Tx timeout if netif_msg_tx_err() is set
- Log Receive and Status ring info on Tx timeout, in addition to Tx ring info
- Log additional Tx ring info if netif_msg_tx_err() is set
Signed-off-by: Harish Patil <harish.patil@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/caif/caif_socket.c: remove unnecessary null test before debugfs_remove_recursive · fb0d164c

Authored by Fabian Frederick on Jun 27, 2014

based on checkpatch:
"debugfs_remove_recursive(NULL) is safe this check is probably not required"

Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: remove unnecessary null test... · 9f16dc2e

Authored by Fabian Frederick on Jun 27, 2014

drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: remove unnecessary null test before debugfs_remove_recursive

Fix checkpatch warning:
"WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required"

Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 Jul, 2014 23 commits

inet: move ipv6only in sock_common · 9fe516ba

Authored by Eric Dumazet on Jun 27, 2014

When an UDP application switches from AF_INET to AF_INET6 sockets, we
have a small performance degradation for IPv4 communications because of
extra cache line misses to access ipv6only information.

This can also be noticed for TCP listeners, as ipv6_only_sock() is also
used from __inet_lookup_listener()->compute_score()

This is magnified when SO_REUSEPORT is used.

Move ipv6only into struct sock_common so that it is available at
no extra cost in lookups.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 090cce42

Authored by David S. Miller on Jul 01, 2014

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-07-01

This series contains updates to i40e, i40evf, igb and ixgbe.

Shannon adds the Base Address High and Low to the admin queue structure
to simplify the logic in the configuration routines.  Also adds code to
clear all queues and interrupts to help clean up after a PXE or other
early boot activity.

Kevin fixes mask assignment value since -1 cannot be used for unsigned
integer types.

Mitch fixes an issue where in some circumstances the reply from the PF
would come back before we were able to properly modify the admin queue
pending and required flags.  This would mess up the flags and put the
driver in an indeterminate state, so fix this by simply setting the flags
before sending the request to the admin queue.  Also changes the branding
string for i40evf to reduce confusion and to match up with our other
marketing materials.

Kamil adds a new variable defining admin send queue (ASQ) command write
back timeout to allow for dynamic modification of this timeout.

Anjali fixes a bug in the flow director filter replay logic, so that we
call a replay after a sideband reset correctly.

Jesse adds code to initialize all members of the context descriptor to
prevent possible stale data.

Christopher fixes i40e to prevent writing to reserved bits, since the
queue index is only 0-127.

Jacob removes the unneeded header export.h from the i40e PTP code.
Fixes ixgbe PTP code where the PPS signal was not correct, as it
generates a one half HZ clock signal, it only generates one level
change per second.  To generate a full clock, we need two level changes
per second.

Todd provides a fix for igb to bring up link when the PHY has powered
up, which was reported by Jeff Westfahl.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: allow to add vlans on top of empty bond · 763e0ecd

Authored by Jiri Pirko on Jun 27, 2014

This limitation maybe had some reason in the past, but now there is not
one -> removing this.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-next' · 813f8e29

Authored by David S. Miller on Jul 01, 2014

Hariprasad Shenai says:

====================
cxgb4: Fix for PCI passthrough and some Misc. fixes

This patch series fixes probe failure in VM when PF is exposed through PCI
Passthrough. Adds support to use firmware interface to get BAR0 value.
Replace the backdoor mechanism to access the HW memory with PCIe Window method
which fixes memory I/O. Also adds device ID of few more adapters for cxgb4 and
cxgb4vf driver.

The patches series is created against 'net-next' tree.
And includes patches on cxgb4, cxgb4vf and iw_cxgb4 driver.

Since this patch-series contains mainly cxgb4 related changes, we would like to
request this patch series to get merged via David Miller's 'net-next' tree.

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4vf: Adds device ID for few more Chelsio T4 Adapters · dde3aadf

Authored by Hariprasad Shenai on Jun 27, 2014

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Adds device ID for few more Chelsio T4 Adapters · fb1e933d

Authored by Hariprasad Shenai on Jun 27, 2014

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Replaced the backdoor mechanism to access the HW memory with PCIe Window method · fc5ab020

Authored by Hariprasad Shenai on Jun 27, 2014

Rip out a bunch of redundant PCI-E Memory Window Read/Write routines,
collapse the more general purpose routines into a single routine
thereby eliminating the need for a large stack frame (and extra data
copying) in the outer routine, change everything to use the improved
routine t4_memory_rw.

Based on original work by Casey Leedom <leedom@chelsio.com> and
Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Use FW interface to get BAR0 value · 0abfd152

Authored by Hariprasad Shenai on Jun 27, 2014

Use the firmware interface to get the BAR0 value since we really don't want
to use the PCI-E Configuration Space Backdoor access which is owned by the
firmware.

Set up PCI-E Memory Window registers using the true values programmed into
BAR registers.  When the PF4 "Master Function" is exported to a Virtual
Machine, the values returned by pci_resource_start() will be for the
synthetic PCI-E Configuration Space and not the real addresses. But we need
to program the PCI-E Memory Window address decoders with the real addresses
that we're going to be using in order to have accesses through the Memory
Windows work.

Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rdma/cxgb4: Fixes cxgb4 probe failure in VM when PF is exposed through PCI Passthrough · 35b1de55

Authored by Hariprasad Shenai on Jun 27, 2014

Change logic which determines our Physical Function at PCI Probe time.
Now we read the PL_WHOAMI register and get the Physical Function.

Pass Physical Function to Upper Layer Drivers in lld_info structure in the
new field "pf" added to lld_info. This is useful for the cases where the
PF, say PF4, is attached to a Virtual Machine via some form of "PCI
Pass Through" technology and the PCI Function shows up as PF0 in the VM.

Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dp83640-next' · 2eb27a16

Authored by David S. Miller on Jul 01, 2014

Stefan Sørensen says:

====================
dp83640: Increase support perout pins

This patch series increases the number of periodic output pins supported
on the dp83640 to 7, and allows for reprogramming the calibration pin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: Allow reassigning calibration pin function · 72df7a72

Authored by Stefan Sørensen on Jun 27, 2014

The ptp pin function programming does not allow calibration pin to change
function. This is problematic on hardware that uses the default calibration
pin for other purposes.

Removing this limitation does not impact calibration if userspace does not
reprogram the calibration pin.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dp83640: Get calibration pin with ptp_find_pin · e0155950

Authored by Stefan Sørensen on Jun 27, 2014

For consistency, use the ptp_find_pin function to get the calibration pin,
not gpio_tab.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dp83640: Verify calibration pin assignment · 6f39eb87

Authored by Stefan Sørensen on Jun 27, 2014

This constrains the pin assignment to not allow the calibration function to
be reassigned and only allow reassigning the calibration pin if only one phy is
connected.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dp83640: Increase supported perout pins to 7 · ad01577a

Authored by Stefan Sørensen on Jun 27, 2014

This patch increases the number of supported periodic output pins from
1 to 7. The last pin is reserved for sync.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dp83640: Program pulsewidth2 values of perout triggers 0 and 1 · 35e872ae

Authored by Stefan Sørensen on Jun 27, 2014

Periodic output triggers 0 and 1 of the dp83640 have a programmable
duty-cycle which is controlled by the Pulsewidth2 field of the trigger
data register.  This field is not documented in the datasheet, but it
is described in the "PHYTER Software Development Guide" section
3.1.4.1. Failing to set the field can result in unstable/no trigger
output.

Add programming of the Pulsewidth2 field, setting it to the same value
as the Pulsewidth field for a 50% duty cycle.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnx2x-next' · b6fd8b7f

Authored by David S. Miller on Jul 01, 2014

Yuval Mintz says:

====================
bnx2x: Enhancement patch series

This patch series introduces the ability to propagate link parameters
to VFs as well as control the VF link via hypervisor.

In addition, it contains 2 small improvements [one IOV-related and the
other improves performance on machines with short cache lines].

Please consider applying these patches to `net-next'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Fail probe of VFs using an old incompatible driver · ebf457f9

Authored by Yuval Mintz on Jun 26, 2014

There are linux distributions where the inbox bnx2x driver contains SRIOV
support but doesn't contain the changes introduced in b9871bcf
"bnx2x: VF RSS support - PF side".

A VF in a VM running that distribution over a new hypervisor will access
incorrect addresses when trying to transmit packets, causing an attention
in the hypervisor and making that VF inactive until FLRed.

The driver in the VM has to be upgraded [no real way to overcome this], but
due to the HW attention currently arising, upgrading the driver in the VM
would not suffice [since the VF also needs to be FLRed if the previous driver
was already loaded].

This patch causes the PF to fail the acquire message from a VF running an
old problematic driver; the VF will then gracefully fail its probe, preventing
the HW attention [and allowing a clean upgrade of the driver in the VM].
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: enlarge minimal alignemnt of data offset · 9927b514

Authored by Dmitry Kravkov on Jun 26, 2014

This improves the performance of driver on machine with L1_CACHE_SHIFT of at
most 32 bytes [HW was planned for 64-byte aligned fastpath data].
Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: VF can report link speed · 6495d15a

Authored by Dmitry Kravkov on Jun 26, 2014

Until now VFs were oblivious to the actual configured link parameters.
This patch does 2 things:

  1. It enables a PF to inform its VF using the bulletin board of the link
     configured, and allows the VF to present that information.

  2. It adds support of `ndo_set_vf_link_state', allowing the hypervisor
     to set the VF link state.
Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'pktgen' · edd79ca8

Authored by David S. Miller on Jul 01, 2014

Jesper Dangaard Brouer says:

====================
Optimizing pktgen for single CPU performance

This series focuses on optimizing "pktgen" for single CPU performance.

V2-series:
 - Removed some patches
 - Doc real reason for TX ring buffer filling up

NIC tuning for pktgen:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html

General overload setup according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Hardware:
 System: CPU E5-2630
 NIC: Intel ixgbe/82599 chip

Testing done with net-next git tree on top of
 commit 6623b419 ("Merge branch 'master' of...jkirsher/net-next")

Pktgen script exercising race condition:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh

Tool for measuring LOCK overhead:
 https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: RCU-ify "if_list" to remove lock in next_to_run() · 8788370a

Authored by Jesper Dangaard Brouer on Jun 26, 2014

The if_lock()/if_unlock() in next_to_run() adds a significant
overhead, because it's called for every packet in the busy loop of
pktgen_thread_worker().  (Thomas Graf originally pointed me
at this lock problem).

Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2), as illustrated below we do save 16ns when removing
the locks and introducing RCU protection.

Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
 * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
 * Diff   : +588195 pps --> -16.50ns

To understand this RCU patch, I describe the pktgen thread model
below.

In pktgen there are several kernel threads, but there is only one CPU
running each kernel thread.  Communication with the kernel threads is
done through some thread control flags.  This allows the thread to
change data structures at a known synchronization point, see main
thread func pktgen_thread_worker().

Userspace changes are communicated through proc-file writes.  There
are three types of changes, general control changes "pgctrl"
(func:pgctrl_write), thread changes "kpktgend_X"
(func:pktgen_thread_write), and interface config changes "etcX@N"
(func:pktgen_if_write).

Userspace "pgctrl" and "thread" changes are synchronized via the mutex
pktgen_thread_lock, thus only a single userspace instance can run.
The mutex is taken while the packet generator is running, by pgctrl
"start".  Thus e.g. "add_device" cannot be invoked when pktgen is
running/started.

All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags.  The main problem is the
exception "add_device", that modifies threads "if_list" directly.

Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally doesn't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allows the race to occur.

The race affects the threads (list of devices) "if_list".  The if_lock
is used for protecting this "if_list".  Other readers are given
lock-free access to the list under RCU read sections.

Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit.  I'm assuming proc_remove() takes
appropriate locks, to ensure no writers exist after proc_remove()
finishes.

I've been running a script exercising the race condition (leading me
to fix the proc_remove order), without any issues.  The script also
exercises concurrent proc writes, while the interface config is
getting removed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
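
The per-packet figures quoted in the performance data above follow from inverting the packet rate (1/pps * 10^9). A small sketch of that arithmetic, with hypothetical helper names that are not part of pktgen:

```c
/* Convert a packet rate in packets per second to nanoseconds per packet. */
double ns_per_packet(double pps)
{
    return 1e9 / pps;
}

/* Per-packet time saved, in nanoseconds, when the rate rises from pps_before to pps_after. */
double ns_saved(double pps_before, double pps_after)
{
    return ns_per_packet(pps_before) - ns_per_packet(pps_after);
}
```

With the rates from the message, `ns_per_packet(5684009)` is about 175.93, `ns_per_packet(6272204)` is about 159.43, and `ns_saved(5684009, 6272204)` is about 16.50, matching the reported difference.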

pktgen: avoid expensive set_current_state() call in loop · baac167b

Authored by Jesper Dangaard Brouer on Jun 26, 2014

Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker().  In case of pkt_dev->delay, then it is still
used/enabled in pktgen_xmit() via the spin() call.

The set_current_state(TASK_INTERRUPTIBLE) uses a xchg, which is implicitly
LOCK prefixed.  I've measured the asm LOCK operation to take approx
8ns on this E5-2630 CPU.  The performance increase correlates with this
measurement.

Performance data with CLONE_SKB==100000, rx-usecs=30:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev:  5454050 pps --> 183.35ns (1/5454050*10^9)
 * Now:   5684009 pps --> 175.93ns (1/5684009*10^9)
 * Diff:  +229959 pps -->  -7.42ns
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: document tuning for max NIC performance · 9ceb87fc

Authored by Jesper Dangaard Brouer on Jun 26, 2014

Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX ring
running full.  Thus, the TX ring is artificially limiting pktgen.
(Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.)

Using ixgbe, the real reason behind the TX ring running full, is due
to TX ring not being cleaned up fast enough. The ixgbe driver combines
TX+RX ring cleanups, and the cleanup interval is affected by the
ethtool --coalesce setting of parameter "rx-usecs".

Do not increase the default NIC TX ring buffer or default cleanup
interval.  Instead simply document that pktgen needs special NIC
tuning for maximum packet per sec performance.

Performance results with pktgen with clone_skb=100000.
TX ring size 512 (default), adjusting "rx-usecs":
 (Single CPU performance, E5-2630, ixgbe)
 - 3935002 pps - rx-usecs:  1 (irqs:  9346)
 - 5132350 pps - rx-usecs: 10 (irqs: 99157)
 - 5375111 pps - rx-usecs: 20 (irqs: 50154)
 - 5454050 pps - rx-usecs: 30 (irqs: 33872)
 - 5496320 pps - rx-usecs: 40 (irqs: 26197)
 - 5502510 pps - rx-usecs: 50 (irqs: 21527)

TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
 - 3935002 pps - tx-size:  512
 - 5354401 pps - tx-size:  768
 - 5356847 pps - tx-size: 1024
 - 5327595 pps - tx-size: 1536
 - 5356779 pps - tx-size: 2048
 - 5353438 pps - tx-size: 4096

Notice after commit 6f25cd47 (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respects this in the call to netif_xmit_frozen_or_drv_stopped(txq).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>