提交 · 890b7916d0965829ad1c457aa61f049a210c19f8 · openanolis / cloud-kernel

05 11月, 2014 14 次提交

udp: remove blank line between set and test · 6cf1093e

由 Fabian Frederick 提交于 11月 04, 2014

Suggested-by: NJoe Perches <joe@perches.com>
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6cf1093e

esp4: remove assignment in if condition · 05006e8c

由 Fabian Frederick 提交于 11月 04, 2014

Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

05006e8c

net: allow setting ecn via routing table · f7b3bec6

由 Florian Westphal 提交于 11月 03, 2014

This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.

One can use 'ip route change dev $dev $net features ecn' to toggle this.

Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.

There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:

Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.

It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.

We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.

Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).

 [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
 [2] http://ecn.ethz.ch/

Joint work with Daniel Borkmann.

Reference: http://thread.gmane.org/gmane.linux.network/335797Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f7b3bec6

syncookies: split cookie_check_timestamp() into two functions · f1673381

由 Florian Westphal 提交于 11月 03, 2014

The function cookie_check_timestamp(), both called from IPv4/6 context,
is being used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.

We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function into cookie_decode_options().
cookie_decode_options() just fills in tcp_option struct based on the
echoed timestamp received from the peer. Anything that fails in this
function will actually discard the request socket.

While this is the natural place for decoding options such as ECN which
commit 172d69e6 ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.

Therefore, we split this functionality into cookie_ecn_ok(), which tells
us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.

Joint work with Daniel Borkmann.
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f1673381

syncookies: avoid magic values and document which-bit-is-what-option · 274e2da0

由 Florian Westphal 提交于 11月 03, 2014

Was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.

Joint work with Daniel Borkmann.
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

274e2da0

igmp: remove camel case definitions · 436f7c20

由 Fabian Frederick 提交于 11月 04, 2014

use standard uppercase for definitions
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

436f7c20

udp: remove else after return · c18450a5

由 Fabian Frederick 提交于 11月 04, 2014

else is unnecessary after return 0 in __udp4_lib_rcv()
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c18450a5

inet: frags: remove inline on static in c file · aa1f731e

由 Fabian Frederick 提交于 11月 04, 2014

remove __inline__ / inline and let compiler decide what to do
with static functions
Inspired-by: N"David S. Miller" <davem@davemloft.net>
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aa1f731e

ipv4: remove 0/NULL assignment on static · 0d3979b9

由 Fabian Frederick 提交于 11月 04, 2014

static values are automatically initialized to 0
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0d3979b9

F
ipv4: use seq_puts instead of seq_printf where possible · c9f503b0
由 Fabian Frederick 提交于 11月 04, 2014
```
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
c9f503b0

tcp: spelling s/plugable/pluggable · b92022f3

由 Fabian Frederick 提交于 11月 04, 2014

Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b92022f3

cipso: remove NULL assignment on static · 988b1343

由 Fabian Frederick 提交于 11月 04, 2014

Also add blank line after structure declarations
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

988b1343

ipv4: include linux/bug.h instead of asm/bug.h · 4c787b16

由 Fabian Frederick 提交于 11月 04, 2014

Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c787b16

cipso: kerneldoc warning fix · 4973404f

由 Fabian Frederick 提交于 11月 04, 2014

Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4973404f

31 10月, 2014 7 次提交

netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions · 052b9498

由 Pablo Neira Ayuso 提交于 10月 25, 2014

That can be reused by the reject bridge expression to build the reject
packet. The new functions are:

* nf_reject_ip_tcphdr_get(): to sanitize and to obtain the TCP header.
* nf_reject_iphdr_put(): to build the IPv4 header.
* nf_reject_ip_tcphdr_put(): to build the TCP header.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

052b9498

net: skb_fclone_busy() needs to detect orphaned skb · 39bb5e62

由 Eric Dumazet 提交于 10月 30, 2014

Some drivers are unable to perform TX completions in a bound time.
They instead call skb_orphan()

Problem is skb_fclone_busy() has to detect this case, otherwise
we block TCP retransmits and can freeze unlucky tcp sessions on
mostly idle hosts.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 1f3279ae ("tcp: avoid retransmits of TCP packets hanging in host queues")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39bb5e62

tcp: Correction to RFC number in comment · cd214535

由 Sowmini Varadhan 提交于 10月 30, 2014

Challenge ACK is described in RFC 5961, fix typo.
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd214535

gre: Use inner mac length when computing tunnel length · 14051f04

由 Tom Herbert 提交于 10月 30, 2014

Currently, skb_inner_network_header is used but this does not account
for Ethernet header for ETH_P_TEB. Use skb_inner_mac_header which
handles TEB and also should work with IP encapsulation in which case
inner mac and inner network headers are the same.

Tested: Ran TCP_STREAM over GRE, worked as expected.
Signed-off-by: NTom Herbert <therbert@google.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

14051f04

ipv4: Do not cache routing failures due to disabled forwarding. · fa19c2b0

由 Nicolas Cavallari 提交于 10月 30, 2014

If we cache them, the kernel will reuse them, independently of
whether forwarding is enabled or not.  Which means that if forwarding is
disabled on the input interface where the first routing request comes
from, then that unreachable result will be cached and reused for
other interfaces, even if forwarding is enabled on them.  The opposite
is also true.

This can be verified with two interfaces A and B and an output interface
C, where B has forwarding enabled, but not A and trying
ip route get $dst iif A from $src && ip route get $dst iif B from $src
Signed-off-by: NNicolas Cavallari <nicolas.cavallari@green-communications.fr>
Reviewed-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa19c2b0

syncookies: only increment SYNCOOKIESFAILED on validation error · 646697b9

由 Florian Westphal 提交于 10月 30, 2014

Only count packets that failed cookie-authentication.
We can get SYNCOOKIESFAILED > 0 while we never even sent a single cookie.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

646697b9

ipv4: minor spelling fixes · f4e715c3

由 stephen hemminger 提交于 10月 29, 2014

Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4e715c3

30 10月, 2014 3 次提交

inet: frags: remove the WARN_ON from inet_evict_bucket · d70127e8

由 Nikolay Aleksandrov 提交于 10月 28, 2014

The WARN_ON in inet_evict_bucket can be triggered by a valid case:
inet_frag_kill and inet_evict_bucket can be running in parallel on the
same queue which means that there has been at least one more ref added
by a previous inet_frag_find call, but inet_frag_kill can delete the
timer before inet_evict_bucket which will cause the WARN_ON() there to
trigger since we'll have refcnt!=1. Now, this case is valid because the
queue is being "killed" for some reason (removed from the chain list and
its timer deleted) so it will get destroyed in the end by one of the
inet_frag_put() calls which reaches 0 i.e. refcnt is still valid.

CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>

Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
Reported-by: NPatrick McLean <chutzpah@gentoo.org>
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d70127e8

inet: frags: fix a race between inet_evict_bucket and inet_frag_kill · 65ba1f1e

由 Nikolay Aleksandrov 提交于 10月 28, 2014

When the evictor is running it adds some chosen frags to a local list to
be evicted once the chain lock has been released but at the same time
the *frag_queue can be running for some of the same queues and it
may call inet_frag_kill which will wait on the chain lock and
will then delete the queue from the wrong list since it was added in the
eviction one. The fix is simple - check if the queue has the evict flag
set under the chain lock before deleting it, this is safe because the
evict flag is set only under that lock and having the flag set also means
that the queue has been detached from the chain list, so no need to delete
it again.
An important note to make is that we're safe w.r.t refcnt because
inet_frag_kill and inet_evict_bucket will sync on the del_timer operation
where only one of the two can succeed (or if the timer is executing -
none of them), the cases are:
1. inet_frag_kill succeeds in del_timer
 - then the timer ref is removed, but inet_evict_bucket will not add
   this queue to its expire list but will restart eviction in that chain
2. inet_evict_bucket succeeds in del_timer
 - then the timer ref is kept until the evictor "expires" the queue, but
   inet_frag_kill will remove the initial ref and will set
   INET_FRAG_COMPLETE which will make the frag_expire fn just to remove
   its ref.
In the end all of the queue users will do an inet_frag_put and the one
that reaches 0 will free it. The refcount balance should be okay.

CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>

Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Reported-by: NPatrick McLean <chutzpah@gentoo.org>
Tested-by: NPatrick McLean <chutzpah@gentoo.org>
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65ba1f1e

tcp: allow for bigger reordering level · dca145ff

由 Eric Dumazet 提交于 10月 27, 2014

While testing upcoming Yaogong patch (converting out of order queue
into an RB tree), I hit the max reordering level of linux TCP stack.

Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.

Allow a new max limit of 300, and add a sysctl to allow admins to even
allow bigger (or lower) values if needed.

[1] Aggregation of links, per packet load balancing, fabrics not doing
 deep packet inspections, alternative TCP congestion modules...
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dca145ff

28 10月, 2014 1 次提交

ipv4: remove set but unused variable sha · 6b436d33

由 Fabian Frederick 提交于 10月 27, 2014

unsigned char *sha (source) was already in original git version
 but was never used.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b436d33

26 10月, 2014 1 次提交

tcp: md5: do not use alloc_percpu() · 349ce993

由 Eric Dumazet 提交于 10月 23, 2014

percpu tcp_md5sig_pool contains memory blobs that ultimately
go through sg_set_buf().

-> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));

This requires that whole area is in a physically contiguous portion
of memory. And that @buf is not backed by vmalloc().

Given that alloc_percpu() can use vmalloc() areas, this does not
fit the requirements.

Replace alloc_percpu() by a static DEFINE_PER_CPU() as tcp_md5sig_pool
is small anyway, there is no gain to dynamically allocate it.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 765cf997 ("tcp: md5: remove one indirection level in tcp_md5sig_pool")
Reported-by: NCrestez Dan Leonard <cdleonard@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

349ce993

23 10月, 2014 1 次提交

net: fix saving TX flow hash in sock for outgoing connections · 9e7ceb06

由 Sathya Perla 提交于 10月 22, 2014

The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
to set skb->hash which is used to spread flows across multiple TXQs.

But, the above routines are invoked before the source port of the connection
is created. Because of this all outgoing connections that just differ in the
source port get hashed into the same TXQ.

This patch fixes this problem for IPv4/6 by invoking the the above routines
after the source port is available for the socket.

Fixes: b73c3d0e("net: Save TX flow hash in sock and set in skbuf on xmit")
Signed-off-by: NSathya Perla <sathya.perla@emulex.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9e7ceb06

21 10月, 2014 2 次提交

net: make skb_gso_segment error handling more robust · 330966e5

由 Florian Westphal 提交于 10月 20, 2014

skb_gso_segment has three possible return values:
1. a pointer to the first segmented skb
2. an errno value (IS_ERR())
3. NULL.  This can happen when GSO is used for header verification.

However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
and would oops when NULL is returned.

Note that these call sites should never actually see such a NULL return
value; all callers mask out the GSO bits in the feature argument.

However, there have been issues with some protocol handlers erronously not
respecting the specified feature mask in some cases.

It is preferable to get 'have to turn off hw offloading, else slow' reports
rather than 'kernel crashes'.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

330966e5

net: gso: use feature flag argument in all protocol gso handlers · 1e16aa3d

由 Florian Westphal 提交于 10月 20, 2014

skb_gso_segment() has a 'features' argument representing offload features
available to the output path.

A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use
those instead of the provided ones when handing encapsulation/tunnels.

Depending on dev->hw_enc_features of the output device skb_gso_segment() can
then return NULL even when the caller has disabled all GSO feature bits,
as segmentation of inner header thinks device will take care of segmentation.

This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
that did not fit the remaining token quota as the segmentation does not work
when device supports corresponding hw offload capabilities.

Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e16aa3d

19 10月, 2014 1 次提交

ipv4: fix a potential use after free in gre_offload.c · b4e3cef7

由 Li RongQing 提交于 10月 18, 2014

pskb_may_pull() may change skb->data and make greh pointer oboslete;
so need to reassign greh;
but since first calling pskb_may_pull already ensured that skb->data
has enough space for greh, so move the reference of greh before second
calling pskb_may_pull(), to avoid reassign greh.

Fixes: 7a7ffbab("ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC")
Cc: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4e3cef7

18 10月, 2014 6 次提交

ipv4: fix a potential use after free in fou.c · d8f00d27

由 Li RongQing 提交于 10月 17, 2014

pskb_may_pull() maybe change skb->data and make uh pointer oboslete,
so reload uh and guehdr

Fixes: 37dd0247 ("gue: Receive side for Generic UDP Encapsulation")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8f00d27

ipv4: fix a potential use after free in ip_tunnel_core.c · 1245dfc8

由 Li RongQing 提交于 10月 17, 2014

pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
so set eth after pskb_may_pull()

Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1245dfc8

ipv4: dst_entry leak in ip_send_unicast_reply() · 4062090e

由 Vasily Averin 提交于 10月 15, 2014

ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
and in case errors in __ip_append_data() nobody frees stolen dst entry

Fixes: 2e77d89b ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
Signed-off-by: NVasily Averin <vvs@parallels.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4062090e

ipv4: clean up cookie_v4_check() · 461b74c3

由 Cong Wang 提交于 10月 15, 2014

We can retrieve opt from skb, no need to pass it as a parameter.
And opt should always be non-NULL, no need to check.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Tested-by: NKrzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

461b74c3

ipv4: share tcp_v4_save_options() with cookie_v4_check() · e25f866f

由 Cong Wang 提交于 10月 15, 2014

cookie_v4_check() allocates ip_options_rcu in the same way
with tcp_v4_save_options(), we can just make it a helper function.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e25f866f

ipv4: call __ip_options_echo() in cookie_v4_check() · 2077eebf

由 Cong Wang 提交于 10月 15, 2014

commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
missed that cookie_v4_check() still calls ip_options_echo() which uses
IPCB(). It should use TCPCB() at TCP layer, so call __ip_options_echo()
instead.

Fixes: commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: NKrzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: NKrzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: NCong Wang <cwang@twopensource.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2077eebf

15 10月, 2014 4 次提交

tcp: TCP Small Queues and strange attractors · 9b462d02

由 Eric Dumazet 提交于 10月 13, 2014

TCP Small queues tries to keep number of packets in qdisc
as small as possible, and depends on a tasklet to feed following
packets at TX completion time.
Choice of tasklet was driven by latencies requirements.

Then, TCP stack tries to avoid reorders, by locking flows with
outstanding packets in qdisc in a given TX queue.

What can happen is that many flows get attracted by a low performing
TX queue, and cpu servicing TX completion has to feed packets for all of
them, making this cpu 100% busy in softirq mode.

This became particularly visible with latest skb->xmit_more support

Strategy adopted in this patch is to detect when tcp_wfree() is called
from ksoftirqd and let the outstanding queue for this flow being drained
before feeding additional packets, so that skb->ooo_okay can be set
to allow select_queue() to select the optimal queue :

Incoming ACKS are normally handled by different cpus, so this patch
gives more chance for these cpus to take over the burden of feeding
qdisc with future packets.

Tested:

lpaa23:~# ./super_netperf 1400 --google-pacing-rate 3028000 -H lpaa24 -l 3600 &

lpaa23:~# sar -n DEV 1 10 | grep eth1
06:16:18 AM eth1 595448.00 1190564.00 38381.09 1760253.12 0.00 0.00 1.00
06:16:19 AM eth1 594858.00 1189686e.00 38340.76 1758952.72 0.00 0.00 0.00
06:16:20 AM eth1 597017.00 1194019.00 38480.79 1765370.29 0.00 0.00 1.00
06:16:21 AM eth1 595450.00 1190936.00 38380.19 1760805.05 0.00 0.00 0.00
06:16:22 AM eth1 596385.00 1193096.00 38442.56 1763976.29 0.00 0.00 1.00
06:16:23 AM eth1 598155.00 1195978.00 38552.97 1768264.60 0.00 0.00 0.00
06:16:24 AM eth1 594405.00 1188643.00 38312.57 1757414.89 0.00 0.00 1.00
06:16:25 AM eth1 593366.00 1187154.00 38252.16 1755195.83 0.00 0.00 0.00
06:16:26 AM eth1 593188.00 1186118.00 38232.88 1753682.57 0.00 0.00 1.00
06:16:27 AM eth1 596301.00 1192241.00 38440.94 1762733.09 0.00 0.00 0.00
Average: eth1 595457.30 1190843.50 38381.69 1760664.84 0.00 0.00 0.50
lpaa23:~# ./tc -s -d qd sh dev eth1 | grep backlog
backlog 7606336b 2513p requeues 167982
backlog 224072b 74p requeues 566
backlog 581376b 192p requeues 5598
backlog 181680b 60p requeues 1070
backlog 5305056b 1753p requeues 110166 // Here, this TX queue is attracting flows
backlog 157456b 52p requeues 1758
backlog 672216b 222p requeues 3025
backlog 60560b 20p requeues 24541
backlog 448144b 148p requeues 21258

lpaa23:~# echo 1 >/proc/sys/net/ipv4/tcp_tsq_enable_tcp_wfree_ksoftirqd_detect

Immediate jump to full bandwidth, and traffic is properly
shard on all tx queues.

lpaa23:~# sar -n DEV 1 10 | grep eth1
06:16:46 AM eth1 1397632.00 2795397.00 90081.87 4133031.26 0.00 0.00 1.00
06:16:47 AM eth1 1396874.00 2793614.00 90032.99 4130385.46 0.00 0.00 0.00
06:16:48 AM eth1 1395842.00 2791600.00 89966.46 4127409.67 0.00 0.00 1.00
06:16:49 AM eth1 1395528.00 2791017.00 89946.17 4126551.24 0.00 0.00 0.00
06:16:50 AM eth1 1397891.00 2795716.00 90098.74 4133497.39 0.00 0.00 1.00
06:16:51 AM eth1 1394951.00 2789984.00 89908.96 4125022.51 0.00 0.00 0.00
06:16:52 AM eth1 1394608.00 2789190.00 89886.90 4123851.36 0.00 0.00 1.00
06:16:53 AM eth1 1395314.00 2790653.00 89934.33 4125983.09 0.00 0.00 0.00
06:16:54 AM eth1 1396115.00 2792276.00 89984.25 4128411.21 0.00 0.00 1.00
06:16:55 AM eth1 1396829.00 2793523.00 90030.19 4130250.28 0.00 0.00 0.00
Average: eth1 1396158.40 2792297.00 89987.09 4128439.35 0.00 0.00 0.50

lpaa23:~# tc -s -d qd sh dev eth1 | grep backlog
backlog 7900052b 2609p requeues 173287
backlog 878120b 290p requeues 589
backlog 1068884b 354p requeues 5621
backlog 996212b 329p requeues 1088
backlog 984100b 325p requeues 115316
backlog 956848b 316p requeues 1781
backlog 1080996b 357p requeues 3047
backlog 975016b 322p requeues 24571
backlog 990156b 327p requeues 21274

(All 8 TX queues get a fair share of the traffic)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b462d02

ipv4: fix nexthop attlen check in fib_nh_match · f76936d0

由 Jiri Pirko 提交于 10月 13, 2014

fib_nh_match does not match nexthops correctly. Example:

ip route add 172.16.10/24 nexthop via 192.168.122.12 dev eth0 \
                          nexthop via 192.168.122.13 dev eth0
ip route del 172.16.10/24 nexthop via 192.168.122.14 dev eth0 \
                          nexthop via 192.168.122.15 dev eth0

Del command is successful and route is removed. After this patch
applied, the route is correctly matched and result is:
RTNETLINK answers: No such process

Please consider this for stable trees as well.

Fixes: 4e902c57 ("[IPv4]: FIB configuration using struct fib_config")
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f76936d0

tcp: fix tcp_ack() performance problem · ad971f61

由 Eric Dumazet 提交于 10月 11, 2014

We worked hard to improve tcp_ack() performance, by not accessing
skb_shinfo() in fast path (cd7d8498 tcp: change tcp_skb_pcount()
location)

We still have one spurious access because of ACK timestamping,
added in commit e1c8a607 ("net-timestamp: ACK timestamp for
bytestreams")

By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set,
we can avoid two cache line misses for the common case.

While we are at it, add two prefetchw() :

One in tcp_ack() to bring skb at the head of write queue.

One in tcp_clean_rtx_queue() loop to bring following skb,
as we will delete skb from the write queue and dirty skb->next->prev.

Add a couple of [un]likely() clauses.

After this patch, tcp_ack() is no longer the most consuming
function in tcp stack.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad971f61

tcp: fix ooo_okay setting vs Small Queues · b2532eb9

由 Eric Dumazet 提交于 10月 10, 2014

TCP Small Queues (tcp_tsq_handler()) can hold one reference on
sk->sk_wmem_alloc, preventing skb->ooo_okay being set.

We should relax test done to set skb->ooo_okay to take care
of this extra reference.

Minimal truesize of skb containing one byte of payload is
SKB_TRUESIZE(1)

Without this fix, we have more chance locking flows into the wrong
transmit queue.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2532eb9

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功