Commits · e00431bc93bb48c650273be4a00007b2a392d32a · openeuler / Kernel

08 Jun, 2016 34 commits

tcp: accept RST if SEQ matches right edge of right-most SACK block · e00431bc

Authored by Pau Espin Pedrol on Jun 07, 2016

RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.

However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently ran into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.

In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which usually gets lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in a few seconds, easily exhausting
all the resources of the server.

For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
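
The acceptance rule described above can be sketched as a small userspace model. This is an illustration under stated assumptions, not the kernel code: the names `seq_after` and `rst_seq_acceptable` and the representation of SACK blocks as `(start_seq, end_seq)` tuples are hypothetical, and sequence numbers are treated as 32-bit values that wrap.

```python
MASK32 = 0xFFFFFFFF
MAX_CHECKED_SACKS = 4  # mirrors the selective_acks[4] limit mentioned above


def seq_after(a, b):
    """True if 32-bit sequence number a is later than b, allowing wraparound."""
    return ((b - a) & MASK32) >= 0x80000000


def rst_seq_acceptable(rst_seq, rcv_nxt, sack_blocks):
    """Hypothetical model: accept an RST whose SEQ equals rcv_nxt, or equals
    the right edge of the right-most of the (at most 4) SACK blocks given."""
    if rst_seq == rcv_nxt:
        return True
    blocks = sack_blocks[:MAX_CHECKED_SACKS]
    if not blocks:
        return False
    max_end = blocks[0][1]
    for _start, end in blocks[1:]:
        if seq_after(end, max_end):
            max_end = end
    return rst_seq == max_end
```

With `rcv_nxt` frozen at 1000 and SACK blocks `(2000, 3000)` and `(4000, 5000)`, an RST with SEQ 5000 is accepted in this model, while one with SEQ 3000 is not.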

This patch was tested in a 3.18 kernel and proved to improve the
situation in the scenario described above.

Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: potential overflow in qed_cxt_src_t2_alloc() · 01e517f1

Authored by Dan Carpenter on Jun 07, 2016

In the current code "ent_per_page" could be more than "conn_num" making
"conn_num" negative after the subtraction.  In the next iteration
through the loop then the negative is treated as a very high positive
meaning we don't put a limit on "ent_num".  It could lead to memory
corruption.
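
The wraparound described above can be reproduced in a small model. This is a sketch, not the driver's loop: it assumes the counter behaves as an unsigned 32-bit value (which the "very high positive" wording suggests), and the helper name `remaining_after_page` and its `clamp` parameter are hypothetical.

```python
U32_MASK = 0xFFFFFFFF


def remaining_after_page(conn_num, ent_per_page, clamp):
    """Connections left after filling one page, using unsigned 32-bit math.

    clamp=False subtracts the full page size even when fewer connections
    remain; clamp=True subtracts only the entries actually used.
    """
    ent_num = min(ent_per_page, conn_num)
    step = ent_num if clamp else ent_per_page
    return (conn_num - step) & U32_MASK
```

Subtracting a full page of 64 entries from 10 remaining connections wraps to a value close to 2^32, so a later `min(ent_per_page, conn_num)` no longer limits anything. Subtracting only the entries used leaves 0.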

Fixes: dbb799c3 ('qed: Initialize hardware for new protocols')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vrf-local' · f02ea215

Authored by David S. Miller on Jun 08, 2016

David Ahern says:

====================
net: vrf: Add support for local traffic to local addresses

Add support for locally originated traffic to VRF-local addresses,
be it addresses on enslaved devices or addresses on the VRF device:

$ ip addr show dev red
33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000
    link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.1/32 scope global red
       valid_lft forever preferred_lft forever
    inet6 1111:1::1/128 scope global
       valid_lft forever preferred_lft forever

$ ip addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
       valid_lft forever preferred_lft forever

$ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

$ ping -c1 -I red 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms

$ ping6 -c1 -I red  2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms

--- 2100:1::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms

$ ping6 -c1 -I red 1111::1
PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes
64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms

--- 1111::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms

This change also enables use of loopback address on the VRF device:
$ ip addr add dev red 127.0.0.1/8

$ ping -c1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vrf: ipv6 support for local traffic to local addresses · b4869aa2

Authored by David Ahern on Jun 06, 2016

Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:

    $ ip addr show dev eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
        link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
        inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 2100:1::1/120 scope global
           valid_lft forever preferred_lft forever
        inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
           valid_lft forever preferred_lft forever

    $ ping6 -c1 -I red 2100:1::1
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vrf: ipv4 support for local traffic to local addresses · afe80a49

Authored by David Ahern on Jun 06, 2016

Add support for locally originated traffic to VRF-local addresses. If
destination device for an skb is the loopback or VRF device then set
its dst to a local version of the VRF cached dst_entry and call netif_rx
to insert the packet onto the rx queue - similar to what is done for
loopback. This patch handles IPv4 support; follow on patch handles IPv6.

With this patch, ping, tcp and udp packets to a local IPv4 address are
successfully routed:

    $ ip addr show dev eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
        link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
        inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 2100:1::1/120 scope global
           valid_lft forever preferred_lft forever
        inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
           valid_lft forever preferred_lft forever

    $ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

This patch also enables use of IPv4 loopback address on the VRF device:
    $ ip addr add dev red 127.0.0.1/8

    $ ping -c1 -I red 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vrf: Minor refactoring for local address patches · 911a66fb

Authored by David Ahern on Jun 06, 2016

Move the stripping of the ethernet header from is_ip_tx_frame into the
ipv4 and ipv6 outbound functions and collapse vrf_send_v4_prep into
vrf_process_v4_outbound.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gue: Implement direction IP encapsulation · c1e48af7

Authored by Tom Herbert on Jun 06, 2016

This patch implements direct encapsulation of IPv4 and IPv6 packets
in UDP. This is done as version "1" of GUE, as explained in I-D
draft-ietf-nvo3-gue-03.

Changes here are only in the receive path; fou with IPxIPx already
supports the transmit side. Both the normal receive path and
GRO path are modified to check for GUE version and check for
IP version in the case that GUE version is "1".

Tested:

IPIP with direct GUE encap
  1 TCP_STREAM
    4530 Mbps
  200 TCP_RR
    1297625 tps
    135/232/444 90/95/99% latencies

IP4IP6 with direct GUE encap
  1 TCP_STREAM
    4903 Mbps
  200 TCP_RR
    1184481 tps
    149/253/473 90/95/99% latencies

IP6IP6 direct GUE encap
  1 TCP_STREAM
   5146 Mbps
  200 TCP_RR
    1202879 tps
    146/251/472 90/95/99% latencies

SIT with direct GUE encap
  1 TCP_STREAM
    6111 Mbps
  200 TCP_RR
    1250337 tps
    139/241/467 90/95/99% latencies

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-sched-fast-stats' · 34fe76ab

Authored by David S. Miller on Jun 07, 2016

Eric Dumazet says:

====================
net: sched: faster stats gathering

A while back, I sent one RFC patch using lockless stats gathering
on 64bit arches.

This patch series does it more cleanly, using a seqcount.

Since qdisc/class stats are written at dequeue() time,
we can ask the dequeue to change the seqcount, so that
stats readers can avoid taking the root qdisc lock
and instead use the typical read_seqcount_{begin|retry} guarded
loop.

This does not change fast path costs, as the seqcount
increments are not more expensive than the bit manipulation,
and allows readers to not freeze the fast path anymore.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1

Authored by Eric Dumazet on Jun 06, 2016

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only do the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats().

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: transform qdisc running bit into a seqcount · f9eb8aea

Authored by Eric Dumazet on Jun 06, 2016

Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.

This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.
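
The idea of sampling statistics through a sequence counter instead of a lock can be illustrated with a single-threaded model. This is a sketch under stated assumptions, not the kernel's seqcount implementation: the classes `SeqCount` and `QdiscModel` and the function `read_stats` are hypothetical, and the counter is odd while a writer is active and even otherwise.

```python
class SeqCount:
    """Toy sequence counter: odd while a write is in progress."""

    def __init__(self):
        self.sequence = 0

    def write_begin(self):
        self.sequence += 1  # becomes odd: "running"

    def write_end(self):
        self.sequence += 1  # becomes even: stable again

    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        # Retry if a write was in progress or completed since read_begin().
        return (start & 1) != 0 or self.sequence != start


class QdiscModel:
    def __init__(self):
        self.running = SeqCount()
        self.packets = 0
        self.bytes = 0

    def dequeue(self, pkt_len):
        self.running.write_begin()
        self.packets += 1
        self.bytes += pkt_len
        self.running.write_end()


def read_stats(qdisc, max_tries=100):
    """Lock-free style read: loop until a consistent snapshot is seen."""
    for _ in range(max_tries):
        start = qdisc.running.read_begin()
        snapshot = (qdisc.packets, qdisc.bytes)
        if not qdisc.running.read_retry(start):
            return snapshot
    raise RuntimeError("no consistent snapshot obtained")
```

In this model the reader never blocks the writer. It only repeats its read when the counter shows that a write overlapped with it.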

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'be2net-noncrit-fixes' · 64151ae3

Authored by David S. Miller on Jun 07, 2016

Sathya Perla says:

====================
be2net: patch set

Hi David, the following patch set contains three non-critical fixes that
can go into the net-next tree.

Patch 1 fixes the logic for provisioning queue pairs on VFs to take into
account the limit on number of TXQs too as in some profiles the number
of TXQs is less than that of RXQs.

Patch 2 enables WoL support from shutdown on Skyhawk.

Patch 3 enhances the logic for provisioning queue pairs on VFs on
SR-IOV over multi-partition configs. Each PF (partition) on a port has to
compute the number of RSS tables its VFs can use.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix provisioning of RSS for VFs in multi-partition configurations · de2b1e03

Authored by Somnath Kotur on Jun 06, 2016

Currently, we do not distribute queue resources to enable RSS for VFs
in multi-channel/partition configurations.
Fix this by having each PF (SR-IOV capable) calculate its share of the
15 RSS Policy Tables available per port before provisioning resources for
all the VFs.
This proportional share is calculated by dividing the PF's MAX VFs by
the Total MAX VFs on that port. The PF also needs to learn the number
of NIC PFs on the port and subtract that from the 15 RSS Policy Tables
on the port.
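
One possible reading of this calculation is sketched below. The function name `pf_rss_table_share`, its parameter names, the order of operations and the round-down integer division are all assumptions made for illustration; the commit message does not spell out the exact formula.

```python
RSS_POLICY_TABLES_PER_PORT = 15  # figure quoted in the commit message


def pf_rss_table_share(pf_max_vfs, port_total_max_vfs, nic_pfs_on_port):
    """Hypothetical proportional share of RSS policy tables for one PF's VFs."""
    if port_total_max_vfs <= 0:
        return 0
    # Tables left for VFs once the NIC PFs on the port are subtracted.
    available = max(RSS_POLICY_TABLES_PER_PORT - nic_pfs_on_port, 0)
    # Share in proportion to this PF's MAX VFs (rounding down is an assumption).
    return available * pf_max_vfs // port_total_max_vfs
```

For example, with 2 NIC PFs on the port, 13 tables remain, and a PF owning 32 of the port's 64 MAX VFs would get 6 of them under this model.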

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Enable Wake-On-LAN from shutdown for Skyhawk · 45f13df7

Authored by Sriharsha Basavapatna on Jun 06, 2016

Skyhawk does support wake-up from ACPI shutdown state - S5, provided the
platform supports it (like Auxiliary power source etc). The changes listed
below are done to fix this.

1) There's no need to defer the HW configuration of WOL to be_suspend().
Remove this in be_suspend() and move it to be_set_wol() ethtool function
so it is configured directly in the context of ethtool. This automatically
takes care of the shutdown case.

2) The driver incorrectly uses WOL_CAP field in the FW response to
get_acpi_wol_cap() command, to determine if WOL is enabled. Instead the
driver must rely on the macaddr field in the response to infer WOL state.

3) In be_get_config() during init, if we find that WOL is enabled in FW,
call pci_enable_wake() to enable pmcsr.pme_en bit. This is needed to
support persistent WOL configuration provided by the FW in some platforms.

4) Remove code in be_set_wol() that writes to PCICFG_PM_CONTROL_OFFSET
to set pme_en bit; pci_enable_wake() sets that.

Fixes: 028991e4 ("Enabling Wake-on-LAN is not supported in S5 state")
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: use max-TXQs limit too while provisioning VF queue pairs · b9263cbf

Authored by Suresh Reddy on Jun 06, 2016

When the PF driver provisions resources for VFs, it currently only looks
at max RSS queues available to calculate the number of VF queue pairs.
This logic breaks when there are fewer TX-queues than RSS-queues.
This patch fixes this problem by using the max-TXQs available in the
PF-pool in the calculations. As a part of this change the
be_calculate_vf_qs() routine is renamed as be_calculate_vf_res() and the
code that calculates limits on other related resources is moved here to
contain all resource calculation code inside one routine.

Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers/net: support hdlc function for QE-UCC · c19b6d24

Authored by Zhao Qiang on Jun 06, 2016

The driver adds HDLC support for the Freescale QUICC Engine.
It supports NMSI and TSA modes.

Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fsl/qe: Add QE TDM lib · 35ef1c20

Authored by Zhao Qiang on Jun 06, 2016

QE has a module to support TDM, and some other protocols
supported by QE are based on TDM.
Add a qe-tdm lib that provides functions for the protocols
using TDM to configure QE-TDM.

Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fsl/qe: Make regs resouce_size_t · 19163ac3

Authored by Zhao Qiang on Jun 06, 2016

Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fsl/qe: setup clock source for TDM mode · bb8b2062

Authored by Zhao Qiang on Jun 06, 2016

Add tdm clock configuration in both qe clock system and ucc
fast controller.

Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fsl/qe: add rx_sync and tx_sync for TDM mode · 68f047e3

Authored by Zhao Qiang on Jun 06, 2016

Rx_sync and tx_sync are used by QE-TDM mode,
add them to struct ucc_fast_info.

Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched: indentation and other OCD stylistic fixes · 0b0f43fe

Authored by Jamal Hadi Salim on Jun 05, 2016

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

Merge branch 'sch-action-tstamp' · be119913

Authored by David S. Miller on Jun 07, 2016

Jamal Hadi Salim says:

====================
net sched action timestamp improvements

Various aggregations of duplicated code, fixes and introduction of firstused
timestamp

v2: add const for source time info per suggestion from Cong
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched actions: aggregate dumping of actions timeinfo · 48d8ee16

Authored by Jamal Hadi Salim on Jun 06, 2016

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched actions: introduce timestamp for firsttime use · 53eb440f

Authored by Jamal Hadi Salim on Jun 06, 2016

Useful to know when the action was first used for accounting
(and debugging)

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched: actions use tcf_lastuse_update for consistency · 9c4a4e48

Authored by Jamal Hadi Salim on Jun 06, 2016

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: cls_flower: Introduce support in SKIP SW flag · e69985c6

Authored by Amir Vadai on Jun 05, 2016

In order to make a filter processed only by hardware, skip_sw flag
should be supplied. This is an addition to the already existing skip_hw
flag (filter will be processed by software only). If no flag is
specified, filter will be processed by both software and hardware.

If only hardware offloaded filters exist, fl_classify() will return
without doing anything.
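
The flag semantics described above can be summarized in a short model. The function name `processing_targets` is hypothetical, and rejecting a filter that sets both flags is an assumption of this sketch rather than something stated in the commit message.

```python
def processing_targets(skip_sw=False, skip_hw=False):
    """Return where a filter is processed, given its skip flags."""
    if skip_sw and skip_hw:
        # Assumption: a filter that skips both paths would never match anything.
        raise ValueError("skip_sw and skip_hw cannot both be set")
    targets = set()
    if not skip_sw:
        targets.add("software")
    if not skip_hw:
        targets.add("hardware")
    return targets
```

A filter with no flags is handled by both software and hardware, `skip_sw` leaves only hardware, and `skip_hw` leaves only software.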

A following userspace patch will be sent once kernel patch is accepted.

Example:

tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \
	flower \
		ip_proto 6 \
		indev enp0s9 \
		skip_sw \
	action skbedit mark 0x1234

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qed-iov-fw-reqs' · 919f274f

Authored by David S. Miller on Jun 07, 2016

Yuval Mintz says:

====================
qed: IOV series - relax firmware requirements

In order for VFs to work, the current implementation demands that the VF's
required storm firmware be exactly the version that was loaded by
the PF, which is a very harsh requirement.
This patch series is intended to relax this:
the recently submitted firmware is intended to be forward/backward
compatible in its fastpath [slowpath is configured by PF on behalf of VF],
and so VFs would only be required to have the same major fastpath HSI in
order to work.

Most of the other patches in this series extend the current forward
compatibility of the driver to reduce the chance of breaking PF/VF
compatibility in the future. A few are unrelated IOV changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: PF to reply to unknown messages · 54fdd80f

Authored by Yuval Mintz on Jun 05, 2016

If a future VF would send the PF an unknown message, the PF today would
not send a reply. This would have 2 bad effects:
  a. VF would have to timeout on the request.
  b. If VF were to send an additional message to PF, firmware would mark
     it as malicious.

Instead, if there's some valid reply-address on the message - let the PF
answer and tell the VF it doesn't know the message.
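
The intended behaviour can be sketched as a dispatcher with a fallback reply. The message names, the status strings and the function `pf_process_message` are hypothetical and only illustrate the control flow; a falsy `reply_addr` stands in for a message without a valid reply address.

```python
# Hypothetical message names, for illustration only.
KNOWN_MESSAGES = {"acquire", "vport_start", "close"}


def pf_process_message(msg_type, reply_addr):
    """Return the status the PF model sends back, or None if it cannot reply."""
    if msg_type in KNOWN_MESSAGES:
        return "success"
    if reply_addr:
        # Unknown request but a usable reply address: answer explicitly so
        # the VF does not have to wait for a timeout.
        return "not_supported"
    return None
```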

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: PF enforce MAC limitation of VFs · 8246d0b4

Authored by Yuval Mintz on Jun 05, 2016

The only limitation relating to MACs the PF enforces today on its VFs
is in case it has a forced-unicast MAC address for them, in which case
they can't configure other unicast addresses.
Specifically, the PF isn't enforcing the number of MAC addresses a VF can
configure, regardless of the number of such filters agreed upon by PF and
VF during the acquisition process.

PF's shadow-config is now extended to also contain information about its
VFs' unicast addresses configuration, allowing such enforcement.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Move doorbell calculation from VF to PF · 5040acf5

Authored by Yuval Mintz on Jun 05, 2016

Today, the VF is aware of its queues context-ids, and calculates the
doorbell address when opening its queues on its own.
The configuration of doorbells in HW can sometime in the future be changed
by the PF [hw has several configurable features that might affect doorbell
addresses, e.g., dpm support], this would break compatibility with older
VFs as their calculated doorbell addresses would be incorrect for such a
configuration.

In order to avoid such a backward compatibility failure, let the PF make
the calculation of the doorbell offset based on the context-id, and pass
that to the VF.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Make PF more robust against malicious VF · 41086467

Authored by Yuval Mintz on Jun 05, 2016

There are several requests the VF can make toward the PF which the driver
would pass to firmware without checking the validity first - specifically,
opening queues and updating vports. Such configurations might cause the
firmware to assert.

This adds validation of the legality of said configurations on the PF side
before passing it onward via ramrod to firmware.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: PF-VF resource negotiation · 1cf2b1a9

Authored by Yuval Mintz on Jun 05, 2016

One of the goals of the VF's first message to the PF [acquire]
is to learn about the number of resources available to it [macs, vlans,
etc.]. This is done via negotiation - the VF requests a set of resources,
which the PF either approves or disapproves and sends a smaller set of
resources as an alternative. In this latter case, the VF is then expected to
either abort the probe or re-send the acquire message with fewer
required resources.

While this infrastructure has existed since the initial submission of qed
SRIOV support, it's in fact completely inoperational - PF isn't really
looking into the resources the VF has asked for and is never going to
reply to the VF that it lacks resources.

This patch addresses this flow, fixing it and allowing the PF and VF
to actually agree on a set of resources.
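
The negotiation flow described above can be sketched as follows. This is a toy model under stated assumptions: resources are plain dictionaries of counts, and the function names `pf_answer` and `vf_acquire` and the `max_attempts` parameter are hypothetical.

```python
def pf_answer(requested, available):
    """PF side: approve the request, or offer a smaller set it can provide."""
    offer = {name: min(count, available.get(name, 0))
             for name, count in requested.items()}
    return offer == requested, offer


def vf_acquire(requested, available, max_attempts=2):
    """VF side: re-send the acquire with the PF's offer, or give up.

    Returns the agreed resources, or None if the probe is aborted.
    """
    for _ in range(max_attempts):
        approved, offer = pf_answer(requested, available)
        if approved:
            return requested
        requested = offer  # retry with the reduced set
    return None
```

In this model a VF asking for 4 MACs when only 2 are available is offered 2, re-sends the request and is approved. With a single attempt allowed, the same VF aborts its probe.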

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Relax VF firmware requirements · 1fe614d1

Authored by Yuval Mintz on Jun 05, 2016

The current driver requires an exact match between VF and PF storm firmware;
any difference would fail the VF acquire message, causing the VF probe
to be aborted.

While there are still dependencies between the two, the recent FW submission
has relaxed the match requirement - instead of an exact match, there's now
a 'fastpath' HSI major/minor scheme, where VFs and PFs that match in their
major number can co-exist even if their minor is different.

In order to accommodate this change, some changes in the vf-start init flow
had to be made, as the VF start ramrod now has to be sent only after PF
learns which fastpath HSI its VF is requiring.
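
The relaxed rule amounts to comparing only the major number. A minimal sketch, assuming versions are represented as `(major, minor)` tuples and using the hypothetical name `fastpath_hsi_compatible`:

```python
def fastpath_hsi_compatible(pf_hsi, vf_hsi):
    """True if PF and VF can co-exist under the major/minor scheme."""
    pf_major, _pf_minor = pf_hsi
    vf_major, _vf_minor = vf_hsi
    return pf_major == vf_major
```

Under this model a PF at `(8, 10)` accepts a VF at `(8, 7)` but not one at `(7, 10)`, whereas the earlier exact-match rule would have rejected both.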

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: get rid of spin_trylock() in net_tx_action() · 3bcb846c

Authored by Eric Dumazet on Jun 04, 2016

Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.

The reasons we want to get rid of this spin_trylock() are :

1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.

2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.

Since we want to adhere more strictly to ksoftirqd being woken up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.

3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.

4) RT kernels can not use the spin_trylock() here.

With the help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick only brings a performance penalty.

Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)

("mpstat -I SCPU 1" is much happier now)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_net: stop polling socket during rx processing · 8241a1e4

Authored by Jason Wang on Jun 01, 2016

We don't stop polling the socket during rx processing, which leads to
unnecessary wakeups from underlying net devices (e.g.
sock_def_readable() from tun). Rx is slowed down in this
way. This patch avoids this by stopping socket polling during rx
processing. A small drawback is that this introduces some overhead in
the light load case because of the extra start/stop polling, but single
netperf TCP_RR does not notice any change. In a super heavy load case,
e.g. using pktgen to inject packets to the guest, we get about 8.8%
improvement on pps:

before: ~1240000 pkt/s
after:  ~1350000 pkt/s

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

07 Jun, 2016 6 commits

net: ethernet: cavium: liquidio: request_manager: Remove create_workqueue · aaa76724

Authored by Bhaktipriya Shridhar on Jun 04, 2016

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem viz
(&db_wq->wk.work which maps to check_db_timeout) is involved
in normal device operation. WQ_MEM_RECLAIM has been set to guarantee
forward progress under memory pressure, which is a requirement here.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: cavium: liquidio: response_manager: Remove create_workqueue · 523a61b4

Authored by Bhaktipriya Shridhar on Jun 04, 2016

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem viz
(&cwq->wk.work which maps to oct_poll_req_completion) is involved
in normal device operation. WQ_MEM_RECLAIM has been set to guarantee
forward progress under memory pressure, which is a requirement here.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty. Hence the call to flush_workqueue() has been dropped.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: Add initial MTU advice feature · 14de9d11

Authored by Aaron Conole on Jun 03, 2016

This commit adds the feature bit and associated mtu device entry for the
virtio network device.  When a virtio device comes up, it checks the
feature bit for the VIRTIO_NET_F_MTU feature.  If such feature bit is
enabled, the driver will read the advised MTU and use it as the initial
value.
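
The feature check described above can be modelled in a few lines. This is a sketch, not the driver code: the function name `initial_mtu` and the fallback default of 1500 are assumptions, while bit number 3 for `VIRTIO_NET_F_MTU` follows the virtio specification.

```python
VIRTIO_NET_F_MTU = 3  # feature bit number for the MTU advice feature


def initial_mtu(device_features, advised_mtu, default_mtu=1500):
    """Pick the initial MTU: the device's advice if the feature bit is set."""
    if device_features & (1 << VIRTIO_NET_F_MTU):
        return advised_mtu
    return default_mtu  # assumed fallback when no advice is offered
```

A device offering the feature with an advised MTU of 9000 would start at 9000 in this model, and any other device keeps the default.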

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Revert vrf-local changes. · 3d9dc408

Authored by David S. Miller on Jun 06, 2016

This reverts commit 2fb7ea45.

It results in build errors because ip6_input is not a
symbol exported to modules.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vrf-local' · 2fb7ea45

Authored by David S. Miller on Jun 06, 2016

David Ahern says:

====================
net: vrf: Add support for local traffic to local addresses

Add support for locally originated traffic to VRF-local addresses,
be it addresses on enslaved devices or addresses on the VRF device:

$ ip addr show dev red
33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000
    link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.1/32 scope global red
       valid_lft forever preferred_lft forever
    inet6 1111:1::1/128 scope global
       valid_lft forever preferred_lft forever

$ ip addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
       valid_lft forever preferred_lft forever

$ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

$ ping -c1 -I red 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms

$ ping6 -c1 -I red  2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms

--- 2100:1::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms

$ ping6 -c1 -I red 1111::1
PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes
64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms

--- 1111::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms

This change also enables use of loopback address on the VRF device:
$ ip addr add dev red 127.0.0.1/8

$ ping -c1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vrf: ipv6 support for local traffic to local addresses · 625b47b5

Authored by David Ahern on Jun 02, 2016

Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:

    $ ip addr show dev eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
        link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
        inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 2100:1::1/120 scope global
           valid_lft forever preferred_lft forever
        inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
           valid_lft forever preferred_lft forever

    $ ping6 -c1 -I red 2100:1::1
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openeuler / Kernel synced successfully about 1 year ago