Commits · 3179698d48eb2ac19387a9c5e971314e22239b12 · openanolis / cloud-kernel

01 Nov, 2017 24 commits

liquidio: Configure switchdev with devlink · d4be8ebe

Authored by Vijaya Mohan Guvva on Oct 31, 2017

Enable and disable switchdev on SRIOV capable LiquidIO NIC with devlink.
Create a representor netdev for each SRIOV VF function on SRIOV enable
and do the cleanup on SRIOV disable.
Signed-off-by: Vijaya Mohan Guvva <vijaya.guvva@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4be8ebe

liquidio: switchdev support for LiquidIO NIC · 1f233f32

Authored by Vijaya Mohan Guvva on Oct 31, 2017

Enable switchdev for SRIOV capable LiquidIO NIC. It registers
a representor netdev (with switchdev_ops) for each SRIOV VF created.
It also has changes to send representor interface configurations like
admin state and MTU to LiquidIO firmware and to retrieve HW counted
VF stats for VF representor.
Signed-off-by: Vijaya Mohan Guvva <vijaya.guvva@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f233f32

net/mlx5e: Switch channels counters to use stats group API · 1fe85006

Authored by Kamal Heib on Aug 23, 2017

Switch the channels counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

1fe85006

net/mlx5e: Switch ipsec counters to use stats group API · e185d43f

Authored by Kamal Heib on Aug 23, 2017

Switch the ipsec counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

e185d43f

net/mlx5e: Switch pme counters to use stats group API · 0e6f01a4

Authored by Kamal Heib on Aug 23, 2017

Switch the pme counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

0e6f01a4

net/mlx5e: Switch per prio pfc counters to use stats group API · 4377bea2

Authored by Kamal Heib on Aug 23, 2017

Switch the per prio pfc counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

4377bea2

net/mlx5e: Switch per prio traffic counters to use stats group API · e6000651

Authored by Kamal Heib on Aug 23, 2017

Switch the per prio traffic counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

e6000651

net/mlx5e: Switch pcie counters to use stats group API · 9fd2b5f1

Authored by Kamal Heib on Aug 23, 2017

Switch the pcie counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

9fd2b5f1

net/mlx5e: Switch ethernet extended counters to use stats group API · 3488bd4c

Authored by Kamal Heib on Aug 23, 2017

Switch the ethernet extended counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

3488bd4c

net/mlx5e: Switch physical statistical counters to use stats group API · 2e4df0b2

Authored by Kamal Heib on Aug 23, 2017

Switch the physical statistical counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

2e4df0b2

net/mlx5e: Switch RFC 2819 counters to use stats group API · e0e0def9

Authored by Kamal Heib on Aug 23, 2017

Switch the RFC 2819 counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

e0e0def9

net/mlx5e: Switch RFC 2863 counters to use stats group API · fc8e64a3

Authored by Kamal Heib on Aug 23, 2017

Switch the RFC 2863 counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

fc8e64a3

net/mlx5e: Switch IEEE 802.3 counters to use stats group API · 6e6ef814

Authored by Kamal Heib on Aug 23, 2017

Switch the IEEE 802.3 counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

6e6ef814

net/mlx5e: Switch vport counters to use the stats group API · 40cab9f1

Authored by Kamal Heib on Aug 23, 2017

Switch the vport counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

40cab9f1

net/mlx5e: Switch Q counters to use the stats group API · fd8dcdb8

Authored by Kamal Heib on Aug 23, 2017

Switch the Q counters to use the new stats group API.
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

fd8dcdb8

net/mlx5e: Introduce stats group API · c0752f2b

Authored by Kamal Heib on Aug 23, 2017

Currently the mlx5e driver has multiple groups of stats. Each group is
used for a different purpose and may or may not depend on hardware
capabilities. The problem with the current implementation is that there
is no clear API to create a new group of stats.

This change defines a new API to create a group of stats and simplifies
the way of handling them by defining a new struct "mlx5e_stats_grp" which
has the following three function pointers:
- get_num_stats() - return the number of counters in the group.
- fill_strings() - fill counters strings within the group.
- fill_stats() - fill counters values within the group.

The above function pointers are used within the ethtool callbacks when
calling "ethtool -S" from userspace. This change also switches the SW
group to use the new API.
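
The three callbacks can be pictured with a small userspace sketch. It is not the driver code: the demo_ names, the two counters, the private struct and the exact callback signatures (an index passed in and returned) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define STAT_STRLEN 32	/* stand-in for the ethtool string length */

/* Hypothetical private state; the real driver keeps far more. */
struct demo_priv {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

/* One group of counters, described by the three callbacks. */
struct demo_stats_grp {
	int (*get_num_stats)(struct demo_priv *priv);
	/* Both fill callbacks take the next free index and return the new one. */
	int (*fill_strings)(struct demo_priv *priv, char *data, int idx);
	int (*fill_stats)(struct demo_priv *priv, uint64_t *data, int idx);
};

static int sw_get_num_stats(struct demo_priv *priv)
{
	(void)priv;
	return 2;
}

static int sw_fill_strings(struct demo_priv *priv, char *data, int idx)
{
	(void)priv;
	snprintf(data + idx * STAT_STRLEN, STAT_STRLEN, "rx_packets");
	idx++;
	snprintf(data + idx * STAT_STRLEN, STAT_STRLEN, "tx_packets");
	idx++;
	return idx;
}

static int sw_fill_stats(struct demo_priv *priv, uint64_t *data, int idx)
{
	data[idx++] = priv->rx_packets;
	data[idx++] = priv->tx_packets;
	return idx;
}

/* Adding a new group means adding one entry to this table. */
static const struct demo_stats_grp demo_groups[] = {
	{ sw_get_num_stats, sw_fill_strings, sw_fill_stats },
};
#define DEMO_NUM_GROUPS (sizeof(demo_groups) / sizeof(demo_groups[0]))

/* Sum the group sizes, as a string-count query would. */
static int demo_total_stats(struct demo_priv *priv)
{
	int total = 0;

	for (size_t i = 0; i < DEMO_NUM_GROUPS; i++)
		total += demo_groups[i].get_num_stats(priv);
	return total;
}

/* Walk every group once to collect names and values. */
static int demo_collect(struct demo_priv *priv, char *names, uint64_t *values)
{
	int sidx = 0, vidx = 0;

	for (size_t i = 0; i < DEMO_NUM_GROUPS; i++) {
		sidx = demo_groups[i].fill_strings(priv, names, sidx);
		vidx = demo_groups[i].fill_stats(priv, values, vidx);
	}
	assert(sidx == vidx);	/* names and values must stay aligned */
	return vidx;
}
```

A caller would size its buffers with demo_total_stats() and then call demo_collect(). The point of the table is that the generic walking code never needs to know which groups exist.
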
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

c0752f2b

i40e: Enable cloud filters via tc-flower · 2f4b411a

Authored by Amritha Nambiar on Oct 27, 2017

This patch enables tc-flower based hardware offloads. tc flower
filter provided by the kernel is configured as driver specific
cloud filter. The patch implements functions and admin queue
commands needed to support cloud filters in the driver and
adds cloud filters to configure these tc-flower filters.

The classification function of the filter is to direct matched
packets to a traffic class. The hardware traffic class is set
based on the classid reserved in the range :ffe0 - :ffef.

Match Dst MAC and route to TC0:
  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
  hw_tc 1

Match Dst IPv4,Dst Port and route to TC1:
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw\
  hw_tc 2

Match Dst IPv6,Dst Port and route to TC1:
  prio 3 flower dst_ip fe8::200:1\
  ip_proto udp dst_port 66 skip_sw\
  hw_tc 2

Delete tc flower filter:
Example:

Flow Director Sideband is disabled while configuring cloud filters
via tc-flower and until any cloud filter exists.

Unsupported matches when cloud filters are added using enhanced
big buffer cloud filter mode of underlying switch include:
1. source port and source IP
2. Combined MAC address and IP fields.
3. Not specifying L4 port

These filter matches can however be used to redirect traffic to
the main VSI (tc 0) which does not require the enhanced big buffer
cloud filter support.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2f4b411a

i40e: Clean up of cloud filters · aaf66502

Authored by Amritha Nambiar on Oct 27, 2017

Introduce the cloud filter data structure and cleanup of cloud
filters associated with the device.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

aaf66502

i40e: Admin queue definitions for cloud filters · 2c001523

Authored by Amritha Nambiar on Oct 27, 2017

Add new admin queue definitions and extended fields for cloud
filter support. Define big buffer for extended general fields
in Add/Remove Cloud filters command.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2c001523

i40e: Cloud filter mode for set_switch_config command · 5efe0c6c

Authored by Amritha Nambiar on Oct 27, 2017

Add definitions for L4 filters and switch modes based on cloud filter
modes and extend the set switch config command to include the
additional cloud filter mode.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

5efe0c6c

i40e: Map TCs with the VSI seids · aa5cb02a

Authored by Amritha Nambiar on Oct 27, 2017

Add mapping of TCs with the seids of the channel VSIs. TC0
will be mapped to the main VSI seid and all other TCs are
mapped to the seid of the corresponding channel VSI.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

aa5cb02a

i40e/i40evf: Revert "i40e/i40evf: bump tail only in multiples of 8" · aa250f11

Authored by Alexander Duyck on Oct 21, 2017

This reverts commit 11f29003.

I am reverting this as I am fairly certain this can result in a memory leak
when combined with the current page recycling scheme. Specifically we end
up attempting to allocate fewer buffers than we recycled and this results
in us rewinding the next to alloc pointer which leads to leaks when we
overwrite the rx_buffer_info when processing the next frame.

Fixes: 11f29003 ("i40e/i40evf: bump tail only in multiples of 8")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

aa250f11

i40e: only redistribute MSI-X vectors when needed · 3e6b1cf7

Authored by Shannon Nelson on Oct 10, 2017

Whether or not there are vectors_left, we only need to redistribute
our vectors if we didn't get as many as we requested. With the current
check, the code will try to redistribute even if we did in fact get all
the vectors we requested - this can happen when we have more CPUs than
we do vectors. This restores an earlier check to be sure we only
redistribute if we didn't get the full count we requested.

Fixes: 4ce20abc (i40e: fix MSI-X vector redistribution if hw limit is reached)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

3e6b1cf7

i40e: mark PM functions as __maybe_unused · 254d152a

Authored by Arnd Bergmann on Oct 10, 2017

A cleanup of the PM code left an incorrect #ifdef in place, leading
to a harmless build warning:

drivers/net/ethernet/intel/i40e/i40e_main.c:12223:12: error: 'i40e_resume' defined but not used [-Werror=unused-function]
drivers/net/ethernet/intel/i40e/i40e_main.c:12185:12: error: 'i40e_suspend' defined but not used [-Werror=unused-function]

It's easier to use __maybe_unused attributes here, since you
can't pick the wrong one.

Fixes: 0e5d3da4 ("i40e: use newer generic PM support instead of legacy PM callbacks")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

254d152a

29 Oct, 2017 9 commits

ipvlan: implement VEPA mode · fe89aa6b

Authored by Mahesh Bandewar on Oct 26, 2017

This is very similar to the Macvlan VEPA mode, but there is one
difference. IPvlan uses the mac-address of the lower device, so the VEPA
mode has implications of ICMP-redirects for packets destined for its
immediate neighbors sharing the same master, since the packets will have
the same source and dest mac. The external switch/router will send a
redirect msg.

Having said that, this will be a useful tool in terms of debugging
since IPvlan will not switch packets within its slaves and will rely
completely on the external entity as intended in 802.1Qbg.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe89aa6b

ipvlan: introduce 'private' attribute for all existing modes. · a190d04d

Authored by Mahesh Bandewar on Oct 26, 2017

IPvlan has always operated in bridge mode. However there are scenarios
where each slave should be able to talk through the master device but
not necessarily across each other. Think of an environment where each
namespace is a private and independent customer. In this scenario
the machine which is hosting these namespaces does not want to tell them
who their neighbor is, nor do the individual namespaces care to talk to
a neighbor on a short-circuited network path.

This patch implements the mode that is very similar to the 'private' mode
in macvlan where individual slaves can send and receive traffic through
the master device, just that they can not talk among slave devices.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a190d04d

net: aquantia: Make local functions static · 2660d226

Authored by Wei Yongjun on Oct 28, 2017

Fixes the following sparse warnings:

drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c:224:5: warning:
 symbol 'aq_ethtool_get_coalesce' was not declared. Should it be static?
drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c:245:5: warning:
 symbol 'aq_ethtool_set_coalesce' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2660d226

net: dsa: b53: Export b53_configure_vlan() · 5c1a6eaf

Authored by Florian Fainelli on Oct 27, 2017

bcm_sf2 and b53 replicate the same operations: clear all VLANs and set
their ports to the default VLAN tag (1 for these devices), so export the
b53 function doing just that.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c1a6eaf

liquidio: get rid of false alarm "Unknown cmd 27" in dmesg · 641da8ed

Authored by Felix Manlunas on Oct 27, 2017

Creating a macvtap interface with the liquidio VF driver as lower device
causes this alarming message to show up in dmesg:

    liquidio_link_ctrl_cmd_completion Unknown cmd 27

That's actually a false alarm because cmd 27 is the value of the macro
OCTNET_CMD_SET_UC_LIST which is known.  It's a control command sent from
host to NIC firmware to set the unicast MAC address list of the macvtap
lower device.

Make the false alarm go away by adding a case for OCTNET_CMD_SET_UC_LIST
in liquidio_link_ctrl_cmd_completion().
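
The shape of the fix can be sketched in plain C. Only the value 27 and the "Unknown cmd" message come from the text above. The demo_ names, the second command and the return convention are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

enum demo_cmd {
	DEMO_CMD_CHANGE_MTU  = 1,	/* hypothetical placeholder command */
	DEMO_CMD_SET_UC_LIST = 27,	/* stands in for OCTNET_CMD_SET_UC_LIST */
};

/* Returns 1 if the command was recognised, 0 if it fell to the default case. */
static int demo_ctrl_cmd_completion(int cmd)
{
	switch (cmd) {
	case DEMO_CMD_CHANGE_MTU:
		printf("MTU changed\n");
		return 1;
	case DEMO_CMD_SET_UC_LIST:
		/* Known command with nothing to report: stay silent. */
		return 1;
	default:
		fprintf(stderr, "%s Unknown cmd %d\n", __func__, cmd);
		return 0;
	}
}
```

Without the DEMO_CMD_SET_UC_LIST case, the value 27 would reach the default branch and print the alarming message even though the command is valid.
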
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

641da8ed

hv_netvsc: Set tx_table to equal weight after subchannels open · a6fb6aa3

Authored by Haiyang Zhang on Oct 27, 2017

In some cases, like internal vSwitch, the host doesn't provide
send indirection table updates. This patch sets the table to be
equal weight after subchannels are all open. Otherwise, all workload
will be on one TX channel.

As tested, this patch has largely increased the throughput over
internal vSwitch.
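
As a rough illustration of what "equal weight" means for a send indirection table, the sketch below fills the slots round-robin across the open channels. The table size, the demo_ names and the hash-based lookup are assumptions, not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SEND_TAB_SIZE 16	/* hypothetical table size */

/*
 * Spread the table slots round-robin so that every open channel gets
 * an equal share (up to rounding when the size is not a multiple of
 * the channel count).
 */
static void demo_set_equal_weight(uint16_t *tx_table, unsigned int num_chn)
{
	assert(num_chn > 0);
	for (unsigned int i = 0; i < DEMO_SEND_TAB_SIZE; i++)
		tx_table[i] = (uint16_t)(i % num_chn);
}

/* Pick a TX channel for a flow by indexing the table with its hash. */
static unsigned int demo_pick_channel(const uint16_t *tx_table, uint32_t hash)
{
	return tx_table[hash % DEMO_SEND_TAB_SIZE];
}
```

If the table were left zero-filled, demo_pick_channel() would return channel 0 for every hash, which matches the "all workload on one TX channel" symptom described above.
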
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6fb6aa3

ppp: allow usage in namespaces · 90e229ef

Authored by Matteo Croce on Oct 27, 2017

Check for CAP_NET_ADMIN with ns_capable() instead of capable()
to allow usage of ppp in user namespaces other than the init one.
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90e229ef

cxgb3: Check and handle the dma mapping errors · c69fe407

Authored by Arjun Vynipadath on Oct 27, 2017

This patch adds checks at appropriate places for whether the *dma_map*()
calls have succeeded or not.

Original Work by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c69fe407

r8169: Add support for interrupt coalesce tuning (ethtool -C) · 50970831

Authored by Francois Romieu on Oct 27, 2017

Kirr: In particular with

	ethtool -C <ifname> rx-usecs 0 rx-frames 0

now it is possible to disable RX delays when NIC usage requires low-latency.

See this thread for context:

	https://www.spinics.net/lists/netdev/msg217665.html

My specific case is that:

We have many computers with gigabit Realtek NICs. For 2 such computers
connected to a gigabit store-and-forward switch the minimum round-trip
time for small pings (`ping -i 0 -w 3 -s 56 -q peer`) is ~ 30μs.

However it turned out that when Ethernet frame length transitions 127 ->
128 bytes (`ping -i 0 -w 3 -s {81 -> 82} -q peer`) the lowest RTT
transitions step-wise to ~ 270μs.

As David Laight said this is RX interrupt mitigation done by NIC which creates
the latency. For workloads when low-latency is required with e.g. Intel,
BCM etc NIC drivers one just uses `ethtool -C rx-usecs ...` to reduce
the time NIC delays before interrupting CPU, but it turned out
`ethtool -C` is not supported by r8169 driver.

Like Stéphane ANCELOT I've traced the problem down to IntrMitigate being
hardcoded to != 0 for our chips (we have 8168 based NICs):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n5460
static void rtl_hw_start_8169(struct net_device *dev) {
        ...
        /*
         * Undocumented corner. Supposedly:
         * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
         */
        RTL_W16(IntrMitigate, 0x0000);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n6346
static void rtl_hw_start_8168(struct net_device *dev) {
        ...
        RTL_W16(IntrMitigate, 0x5151);

and then I've also found

	https://www.spinics.net/lists/netdev/msg217665.html

and original Francois' patch:

	https://www.spinics.net/lists/netdev/msg217984.html
	https://www.spinics.net/lists/netdev/msg218207.html

So could we please finally get support for tuning r8169 interrupt
coalescing in tree? (so that next poor soul who hits the problem does
not need to go all the way to dig into driver sources and internet
wildly and finally patch locally

        -RTL_W16(IntrMitigate, 0x5151);
        +RTL_W16(IntrMitigate, 0x5100);

guessing whether it is right or not and also having to care to deploy
the patch everywhere it needs to be used, etc...).
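
To make the two register values easier to read, here is a small decoder for the field layout quoted in the driver comment above. That comment itself calls the layout "undocumented" and "supposedly" correct, so the sketch inherits the same uncertainty. The demo_ names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Field layout as quoted in the driver comment:
 * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
 */
struct demo_intr_mitigate {
	unsigned int tx_timer;
	unsigned int tx_packets;
	unsigned int rx_timer;
	unsigned int rx_packets;
};

/* Split a 16-bit register value into its four 4-bit fields. */
static struct demo_intr_mitigate demo_decode(uint16_t reg)
{
	struct demo_intr_mitigate f = {
		.tx_timer   = (reg >> 12) & 0xf,
		.tx_packets = (reg >> 8) & 0xf,
		.rx_timer   = (reg >> 4) & 0xf,
		.rx_packets = reg & 0xf,
	};
	return f;
}

/* Pack four 4-bit fields back into a register value. */
static uint16_t demo_encode(struct demo_intr_mitigate f)
{
	assert(f.tx_timer < 16 && f.tx_packets < 16 &&
	       f.rx_timer < 16 && f.rx_packets < 16);
	return (uint16_t)((f.tx_timer << 12) | (f.tx_packets << 8) |
			  (f.rx_timer << 4) | f.rx_packets);
}
```

Under that layout, 0x5151 decodes to TxTimer 5, TxPackets 1, RxTimer 5, RxPackets 1, while the locally patched 0x5100 keeps the TX fields and zeroes both RX fields.
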

To do so I've taken original Francois's patch from 2012 and reworked it a bit:

- updated to latest net-next.git;
- adjusted scaling setup based on feedback from Hayes to pick up scaling
  vector depending not only on link speed but also on CPlusCmd[0:1] and to
  adjust CPlusCmd[0:1] correspondingly when setting timings;
- improved a bit (I think so) error handling.

I've tested the patch on "RTL8168d/8111d" (XID 083000c0) and with it and
`ethtool -C rx-usecs 0 rx-frames 0` on both ends it improves:

- minimum RTT latency:

        ~270μs ->  ~30μs (small packet),
        ~330μs -> ~110μs (full 1.5K ethernet frame)

- average RTT latency:

        ~480μs ->  ~50μs (small packet),
        ~560μs -> ~125μs (full 1.5K ethernet frame)

( before:

        root@neo1:# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5906 packets transmitted, 5905 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.274/0.485/0.607/0.026 ms, ipg/ewma 0.508/0.489 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5073 packets transmitted, 5073 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.330/0.566/0.710/0.028 ms, ipg/ewma 0.591/0.544 ms

  after:

        root@neo1# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        45815 packets transmitted, 45815 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.036/0.051/0.368/0.010 ms, ipg/ewma 0.065/0.053 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        21250 packets transmitted, 21250 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.112/0.125/0.390/0.007 ms, ipg/ewma 0.141/0.125 ms

  the small -> 1.5K latency growth is understandable as it takes ~15μs
  to transmit 1.5K on 1Gbps on the wire and with 2 hosts and 1 switch
  and ICMP ECHO + ECHO reply the packet has to travel 4 ethernet
  segments which is already 60μs;

  probably something a bit else is also there as e.g. on Linux, even
  with `cpupower frequency-set -g performance`, on some computers I've
  noticed the kernel can be spending more time in software-only mode
  when incoming packets go in less frequently. E.g. this program can
  demonstrate the effect for ICMP ECHO processing:

  https://lab.nexedi.com/kirr/bcc/blob/43cfc13b/tools/pinglat.py

  (later this was found to be partly due to C-states exit latencies) )

We have this patch running in our testing setup for 1 month already
without any issues observed.

It remains to be clarified whether RX and TX timers use the same base.
For now I've set them equally, but Francois's original patch version
suggests it could be not the same.

I've got no feedback at all to my original posting of this patch and questions

	https://www.spinics.net/lists/netdev/msg457173.html

neither from Francois, nor from any people from Realtek during one month.

So I suggest we simply apply it to net-next.git now.

Cc: Francois Romieu <romieu@fr.zoreil.com>
Cc: Hayes Wang <hayeswang@realtek.com>
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Stéphane ANCELOT <sancelot@free.fr>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50970831

28 Oct, 2017 7 commits

tap: reference to KVA of an unloaded module causes kernel panic · dea6e19f

Authored by Girish Moodalbail on Oct 27, 2017

The commit 9a393b5d ("tap: tap as an independent module") created a
separate tap module that implements tap functionality and exports
interfaces that will be used by macvtap and ipvtap modules to create
respective tap devices.

However, that patch introduced a regression wherein the modules macvtap
and ipvtap can be removed (through modprobe -r) while there are
applications using the respective /dev/tapX devices. These applications
cause kernel to hold reference to /dev/tapX through 'struct cdev
macvtap_cdev' and 'struct cdev ipvtap_dev' defined in macvtap and ipvtap
modules respectively. So,  when the application is later closed the
kernel panics because we are referencing KVA that is present in the
unloaded modules.

----------8<------- Example ----------8<----------
$ sudo ip li add name mv0 link enp7s0 type macvtap
$ sudo ip li show mv0 |grep mv0| awk -e '{print $1 $2}'
  14:mv0@enp7s0:
$ cat /dev/tap14 &
$ lsmod |egrep -i 'tap|vlan'
macvtap                16384  0
macvlan                24576  1 macvtap
tap                    24576  3 macvtap
$ sudo modprobe -r macvtap
$ fg
cat /dev/tap14
^C

<...system panics...>
BUG: unable to handle kernel paging request at ffffffffa038c500
IP: cdev_put+0xf/0x30
----------8<-----------------8<----------

The fix is to set cdev.owner to the module that creates the tap device
(either macvtap or ipvtap). With this set, the operations (in
fs/char_dev.c) on the char device hold and release the module through
cdev_get() and cdev_put() and will not allow the module to unload
prematurely.

Fixes: 9a393b5d (tap: tap as an independent module)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dea6e19f

drivers/net: smsc: Convert timers to use timer_setup() · 267146d4

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
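
The idea behind the conversion can be modelled in userspace. The sketch below is not kernel code: struct timer_list is reduced to a single field, the demo_ helpers are stand-ins (the kernel's timer_setup() also takes a flags argument), and demo_nic is a hypothetical driver structure. It relies on the GNU `__typeof__` extension, available in gcc and clang.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in; the kernel's struct timer_list has more fields. */
struct timer_list {
	void (*function)(struct timer_list *t);
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same idea as from_timer(): recover the object that embeds the timer. */
#define demo_from_timer(var, timer_ptr, field) \
	demo_container_of(timer_ptr, __typeof__(*(var)), field)

static void demo_timer_setup(struct timer_list *t,
			     void (*fn)(struct timer_list *))
{
	t->function = fn;
}

/* Hypothetical driver-private structure with an embedded timer. */
struct demo_nic {
	int link_checks;
	struct timer_list watchdog;
};

/* New-style callback: it receives the timer, not an opaque unsigned long. */
static void demo_watchdog(struct timer_list *t)
{
	struct demo_nic *nic = demo_from_timer(nic, t, watchdog);

	nic->link_checks++;
}

/* Stand-in for the timer core running an expired timer. */
static void demo_fire(struct timer_list *t)
{
	assert(t->function != NULL);
	t->function(t);
}
```

Because the callback derives its context from the timer pointer itself, no separate data argument has to be stored and cast back, which is what allows the timer pointer to be passed unconditionally.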

Cc: "David S. Miller" <davem@davemloft.net>
Cc: "yuval.shaia@oracle.com" <yuval.shaia@oracle.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Philippe Reynes <tremyfr@gmail.com>
Cc: Allen Pais <allen.lkml@gmail.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

267146d4

drivers/net: packetengines: Convert timers to use timer_setup() · 8089c6f4

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Allen Pais <allen.lkml@gmail.com>
Cc: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Cc: Philippe Reynes <tremyfr@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8089c6f4

drivers/net: natsemi: Convert timers to use timer_setup() · 15735c9d

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Allen Pais <allen.lkml@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Philippe Reynes <tremyfr@gmail.com>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

15735c9d

drivers/net: mellanox: Convert timers to use timer_setup() · 0365b047

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Matan Barak <matanb@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0365b047

drivers/net: korina: Convert timers to use timer_setup() · 34309b36

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Roman Yeryomin <leroi.lists@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

34309b36

drivers/net: fealnx: Convert timers to use timer_setup() · 8b3718dc

Authored by Kees Cook on Oct 26, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: "yuval.shaia@oracle.com" <yuval.shaia@oracle.com>
Cc: Allen Pais <allen.lkml@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Philippe Reynes <tremyfr@gmail.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b3718dc

openanolis / cloud-kernel synced successfully more than 1 year ago