提交 · 14e1d0fa97f821b42e8683500cf4ec817bb5d940 · openeuler / Kernel

28 5月, 2015 5 次提交

vxlan: release lock after each bucket in vxlan_cleanup · 14e1d0fa

由 Sorin Dumitru 提交于 5月 26, 2015

We're seeing some softlockups from this function when there
are a lot fdb entries on a vxlan device. Taking the lock for
each bucket instead of the whole table is enough to fix that.
Signed-off-by: NSorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

14e1d0fa

tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900

由 Eric Dumazet 提交于 5月 24, 2015

A long standing problem on busy servers is the tiny available TCP port
range (/proc/sys/net/ipv4/ip_local_port_range) and the default
sequential allocation of source ports in connect() system call.

If a host is having a lot of active TCP sessions, chances are
very high that all ports are in use by at least one flow,
and subsequent bind(0) attempts fail, or have to scan a big portion of
space to find a slot.

In this patch, I changed the starting point in __inet_hash_connect()
so that we try to favor even [1] ports, leaving odd ports for bind()
users.

We still perform a sequential search, so there is no guarantee, but
if connect() targets are very different, end result is we leave
more ports available to bind(), and we spread them all over the range,
lowering time for both connect() and bind() to find a slot.

This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
is even, ie if start/end values have different parity.

Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
32768 - 60999 (instead of 32768 - 61000)

There is no change on security aspects here, only some poor hashing
schemes could be eventually impacted by this change.

[1] : The odd/even property depends on ip_local_port_range values parity
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07f4c900

Merge branch 'ip_frag_next' · 837b9955

由 David S. Miller 提交于 5月 27, 2015

Florian Westphal says:

====================
net: force refragmentation for DF reassembed skbs

output path tests:

    if (skb->len > mtu) ip_fragment()

This breaks connectivity in one corner case:
 If the skb was reassembled, but has the DF bit set and ..
 .. its reassembled size is <= outdev mtu ..
 .. we will forward a DF packet larger than what the sender
    transmitted on wire.

If a router later in the path can't forward this packet, it will send an
icmp error in response to an mtu that the original sender never exceeded.

This changes ipv4 defrag/output path to

a) force refragmentation for DF reassembled skbs and
b) set DF bit on all fragments when refragmenting if it was set on original
frags.

tested via:
from scapy.all import *
dip="10.23.42.2"
payload="A"*1400
packet=IP(dst=dip,id=12345,flags='DF')/UDP(sport=42,dport=42)/payload
frags=fragment(packet,fragsize=1200)
for fragment in frags:
    send(fragment)

Without this patch, we generate fragments without df bit set based
on the outgoing device mtu when fragmenting after forwarding, ie.

IP (ttl 64, id 12345, offset 0, flags [+, DF], proto UDP (17), length 1204)
    192.168.7.1.42 > 10.23.42.2.42: UDP, length 1400
IP (ttl 64, id 12345, offset 1184, flags [DF], proto UDP (17), length 244)
    192.168.7.1 > 10.23.42.2: ip-proto-17

on ingress will either turn into

IP (ttl 63, id 12345, offset 0, flags [+], proto UDP (17), length 1396)
    192.168.7.1.42 > 10.23.42.2.42: UDP, length 1400
IP (ttl 63, id 12345, offset 1376, flags [none], proto UDP (17), length 52)

(mtu 1400: We strip df and send larger fragment), or

IP (ttl 63, id 12345, offset 0, flags [DF], proto UDP (17), length 1428)
    192.168.7.1.42 > 10.23.42.2.42: [udp sum ok] UDP, length 1400

if mtu is 1500.  And in this case things break; router with a smaller mtu
will send icmp error, but original sender only sent packets <= 1204 byte.

With patch, we keep intent of such fragments and will emit DF-fragments
that won't exceed 1204 byte in size.

Joint work with Hannes Frederic Sowa.

Changes since v2:
 - split unrelated patches from series
 - rework changelog of patch #2 to better illustrate breakage
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

837b9955

ip_fragment: don't forward defragmented DF packet · d6b915e2

由 Florian Westphal 提交于 5月 22, 2015

We currently always send fragments without DF bit set.

Thus, given following setup:

mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
   A           R1              R2         B

Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.

However, if R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 will send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.

The other minor issue is that a refragmentation on R1 will conceal the
MTU of R2-B since refragmentation does not set DF bit on the fragments.

This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.

If the DF fragment size is larger or equal to the non-df one, we will
consider the packet a path mtu probe:
We set DF bit on the reassembled skb and also tag it with a new IPCB flag
to force refragmentation even if skb fits outdev mtu.

We will also set DF bit on each fragment in this case.

Joint work with Hannes Frederic Sowa.
Reported-by: NJesse Gross <jesse@nicira.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6b915e2

net: ipv4: avoid repeated calls to ip_skb_dst_mtu helper · c5501eb3

由 Florian Westphal 提交于 5月 22, 2015

ip_skb_dst_mtu is small inline helper, but its called in several places.

before: 17061      44       0   17105    42d1 net/ipv4/ip_output.o
after:  16805      44       0   16849    41d1 net/ipv4/ip_output.o
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c5501eb3

27 5月, 2015 7 次提交

Merge branch 'phy_rgmii' · 8c0ce770

由 David S. Miller 提交于 5月 27, 2015

Florian Fainelli says:

====================
net: phy: phy_interface_is_rgmii helper

As you suggested, here is the helper function to avoid missing some RGMII
interface checks. Had to wait for net to be merged in net-next to avoid
submitting the same patch/commit.

Dan, you might want to rebase your dp83867 submission to use that helper
when you this patchset gets merged into net-next, thanks!
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8c0ce770

net: phy: Utilize phy_interface_is_rgmii · 32a64161

由 Florian Fainelli 提交于 5月 26, 2015

Update all open-coded tests for all 4 PHY_INTERFACE_MODE_RGMII* values
to use the newly introduced helper: phy_interface_is_rgmii.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

32a64161

net: phy: Add phy_interface_is_rgmii helper · e463d88c

由 Florian Fainelli 提交于 5月 26, 2015

RGMII interfaces come in 4 different flavors that the PHY library needs
to care about: regular RGMII (no delays), RGMII with either RX or TX
delay, and both. In order to avoid errors of checking only for one type
of RGMII interface and miss the 3 others, introduce a convenience
function which tests for all values.
Suggested-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e463d88c

ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. · ffa915d0

由 David S. Miller 提交于 5月 27, 2015

We used to get this indirectly I supposed, but no longer do.

Either way, an explicit include should have been done in the
first place.

   net/ipv4/fib_trie.c: In function '__node_free_rcu':
>> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
      vfree(n);
      ^
   net/ipv4/fib_trie.c: In function 'tnode_alloc':
>> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
      return vzalloc(size);
      ^
>> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast
   cc1: some warnings being treated as errors
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ffa915d0

tcp: tcp_tso_autosize() minimum is one packet · d6a4e26a

由 Eric Dumazet 提交于 5月 26, 2015

By making sure sk->sk_gso_max_segs minimal value is one,
and sysctl_tcp_min_tso_segs minimal value is one as well,
tcp_tso_autosize() will return a non zero value.

We can then revert 843925f3
("tcp: Do not apply TSO segment limit to non-TSO packets")
and save few cpu cycles in fast path.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6a4e26a

tcp: fix/cleanup inet_ehash_locks_alloc() · 095dc8e0

由 Eric Dumazet 提交于 5月 26, 2015

If tcp ehash table is constrained to a very small number of buckets
(eg boot parameter thash_entries=128), then we can crash if spinlock
array has more entries.

While we are at it, un-inline inet_ehash_locks_alloc() and make
following changes :

- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
  (Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

095dc8e0

tipc: fix bug in link protocol message create function · f3903bcc

由 Jon Paul Maloy 提交于 5月 26, 2015

In commit dd3f9e70
("tipc: add packet sequence number at instant of transmission") we
made a change with the consequence that packets in the link backlog
queue don't contain valid sequence numbers.

However, when we create a link protocol message, we still use the
sequence number of the first packet in the backlog, if there is any,
as "next_sent" indicator in the message. This may entail unnecessary
retransissions or stale packet transmission when there is very low
traffic on the link.

This commit fixes this issue by only using the current value of
tipc_link::snd_nxt as indicator.
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f3903bcc

26 5月, 2015 28 次提交

net: fix inet_proto_csum_replace4() sparse errors · 05c98543

由 Eric Dumazet 提交于 5月 25, 2015

make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o
...
net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:307:72: expected restricted __wsum [usertype] addend
net/core/utils.c:307:72: got restricted __be32 [usertype] from
net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:308:34: expected restricted __wsum [usertype] addend
net/core/utils.c:308:34: got restricted __be32 [usertype] to
net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:70: expected restricted __wsum [usertype] addend
net/core/utils.c:310:70: got restricted __be32 [usertype] from
net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:77: expected restricted __wsum [usertype] addend
net/core/utils.c:310:77: got restricted __be32 [usertype] to
net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:312:72: expected restricted __wsum [usertype] addend
net/core/utils.c:312:72: got restricted __be32 [usertype] from
net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:313:35: expected restricted __wsum [usertype] addend
net/core/utils.c:313:35: got restricted __be32 [usertype] to

Note we can use csum_replace4() helper

Fixes: 58e3cac5 ("net: optimise inet_proto_csum_replace4()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

05c98543

net: remove a sparse error in secure_dccpv6_sequence_number() · 68319052

由 Eric Dumazet 提交于 5月 25, 2015

make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o
net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to
integer
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68319052

bridge: skip fdb add if the port shouldn't learn · eb8d7baa

由 Wilson Kok 提交于 5月 25, 2015

Check in fdb_add_entry() if the source port should learn, similar
check is used in br_fdb_update.
Note that new fdb entries which are added manually or
as local ones are still permitted.
This patch has been tested by running traffic via a bridge port and
switching the port's state, also by manually adding/removing entries
from the bridge's fdb.
Signed-off-by: NWilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb8d7baa

pktgen: remove one sparse error · d4969581

由 Eric Dumazet 提交于 5月 25, 2015

net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types)
net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident>
net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol

Let's use proper struct ethhdr instead of hard coding everything.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4969581

ipv6: ipv6_select_ident() returns a __be32 · 7f159867

由 Eric Dumazet 提交于 5月 25, 2015

ipv6_select_ident() returns a 32bit value in network order.

Fixes: 286c2349 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7f159867

Merge branch 'cpsw-cleanups' · eedf4c66

由 David S. Miller 提交于 5月 25, 2015

Richard Cochran says:

====================
cpsw cleanups

While working on an out-of-tree customization, I noticed a few minor
problems in the cpsw code.  This series cleans up the issues I found.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eedf4c66

net: cpsw: remove redundant calls disabling dma interrupts. · 61d22596

由 Richard Cochran 提交于 5月 25, 2015

The function, cpsw_intr_disable, already calls cpdma_ctlr_int_ctrl.  There
is no need to disable the dma interrupts twice.  This patch removes the
extra calls.
Signed-off-by: NRichard Cochran <richardcochran@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61d22596

net: cpsw: remove redundant calls enabling dma interrupts. · 071f1a96

由 Richard Cochran 提交于 5月 25, 2015

The function, cpsw_intr_enable, already calls cpdma_ctlr_int_ctrl.  There
is no need to enable the dma interrupts twice.  This patch removes the
extra call.
Signed-off-by: NRichard Cochran <richardcochran@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

071f1a96

net: cpsw: remove two unused global functions · 202c5919

由 Richard Cochran 提交于 5月 25, 2015

The funtions, cpsw_ale_flush and cpsw_ale_set_ageout, have never been used
since they were first introduced. This patch removes the dead code.
Signed-off-by: NRichard Cochran <richardcochran@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

202c5919

net: cpsw: fix misplaced break statements. · 26fe7eb8

由 Richard Cochran 提交于 5月 25, 2015

Having the breaks too far to the left makes parsing the dense switch/case
block unnecessarily harder.
Signed-off-by: NRichard Cochran <richardcochran@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26fe7eb8

Merge branch 'rocker-cleanups' · d1a0ed79

由 David S. Miller 提交于 5月 25, 2015

Simon Horman says:

====================
rocker: unused parameter and const cleanups

This series provides some minor though verbose cleanup of rocker.

The second patch depends on the first though it could be rebased.

I had previously asked for v2 to be put on hold while some bugs I had found
in the rocker driver were shaken out. That has now happened and the bugs
turned out to be unrelated.  Accordingly I am reposting the series.

* Changes v2 -> v3
  - Rebase and update for new variables and parameters that may be const

* Changes v1 -> v2
  - Found quite a few more variables and parameters to make const
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d1a0ed79

rocker: mark parameters and local variables as const · e5054643

由 Simon Horman 提交于 5月 25, 2015

Mark parameters and local variables as const where possible.
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Acked-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e5054643

rocker: remove unused rocker_port parameter from rocker_port_kfree · 0985df73

由 Simon Horman 提交于 5月 25, 2015

Remove unused rocker_port parameter from rocker_port_kfree.
Also remove the rocker_port parameter from callers of rocker_port_kfree
where the parameter it is now unused.
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Acked-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0985df73

irda: use msecs_to_jiffies for conversion to jiffies · 005e8709

由 Nicholas Mc Guire 提交于 5月 25, 2015

API compliance scanning with coccinelle flagged:
./net/irda/timer.c:63:35-37: use of msecs_to_jiffies probably perferable

Converting milliseconds to jiffies by "val * HZ / 1000" technically
is not a clean solution as it does not handle all corner cases correctly.
By changing the conversion to use msecs_to_jiffies(val) conversion is
correct in all cases. Further the () around the arithmetic expression
was dropped.

Patch was compile tested for x86_64_defconfig + CONFIG_IRDA=m

Patch is against 4.1-rc4 (localversion-next is -next-20150522)
Signed-off-by: NNicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

005e8709

neterion: s2io: Fix kernel doc formatting · d07ce242

由 Joe Perches 提交于 5月 23, 2015

These two uses seem to have had carriage returns removed.
Make these entries like all the others in this file.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d07ce242

irda: irda-usb: use msecs_to_jiffies for conversions · bbfe0f37

由 Nicholas Mc Guire 提交于 5月 23, 2015

API compliance scanning with coccinelle flagged:

Converting milliseconds to jiffies by "val * HZ / 1000" is technically
is not a clean solution as it does not handle all corner cases correctly.
By changing the conversion to use msecs_to_jiffies(val) conversion is
correct in all cases.

in the current code:
  mod_timer(&self->rx_defer_timer, jiffies + (10 * HZ / 1000));
for HZ < 100 (e.g. CONFIG_HZ == 64|32 in alpha) this effectively results
in no delay at all.

Patch was compile tested for x86_64_defconfig (implies CONFIG_USB_IRDA=m)

Patch is against 4.1-rc4 (localversion-next is -next-20150522)
Signed-off-by: NNicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bbfe0f37

bridge: allow setting hash_max + multicast_router if interface is down · 6ae4ae8e

由 Linus Lüssing 提交于 5月 23, 2015

Network managers like netifd (used in OpenWRT for instance) try to
configure interface options after creation but before setting the
interface up.

Unfortunately the sysfs / bridge currently only allows to configure the
hash_max and multicast_router options when the bridge interface is up.
But since br_multicast_init() doesn't start any timers and only sets
default values and initializes timers it should be save to reconfigure
the default values after that, before things actually get active after
the bridge is set up.
Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ae4ae8e

ipv6: don't increase size when refragmenting forwarded ipv6 skbs · 485fca66

由 Florian Westphal 提交于 5月 22, 2015

since commit 6aafeef0 ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.

ipv6 defrag preserves the original skbs using the skb frag list, i.e. as long
as the skb frag list is preserved there is no problem since we keep
original geometry of fragments intact.

However, in the rare case where the frag list is munged or skb
is linearized, we might send larger fragments than what we originally
received.

A router in the path might then send packet-too-big errors even if
sender never sent fragments exceeding the reported mtu:

mtu 1500 - 1500:1400 - 1400:1280 - 1280
     A         R1         R2        B

1 - A sends to B, fragment size 1400
2 - R2 sends pkttoobig error for 1280
3 - A sends to B, fragment size 1280
4 - R2 sends pkttoobig error for 1280 again because it sees fragments of size 1400.

make sure ip6_fragment always caps MTU at largest packet size seen
when defragmented skb is forwarded.
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

485fca66

atm:he - Change 1 to true for bool type variable. · 376cd36d

由 Shailendra Verma 提交于 5月 26, 2015

The variable irq_coalesce is bool type.
So assign the value true instead of 1.
Signed-off-by: NShailendra Verma <shailendra.capricorn@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

376cd36d

net:xen-netback - Change 1 to true for bool type variable. · c489dbb1

由 Shailendra Verma 提交于 5月 25, 2015

The variable separate_tx_rx_irq is bool type so assigning true
instead of 1.
Signed-off-by: NShailendra Verma <shailendra.capricorn@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c489dbb1

Merge branch 'ipv6_route_sharing' · c1a34035

由 David S. Miller 提交于 5月 25, 2015

Martin KaFai Lau says:

====================
ipv6: Only create RTF_CACHE route after encountering pmtu exception

v4 -> v5:
- Patch 1 is new. Clean up the ipv6_select_ident() and ip6_fragment().

- Further simplify the newly added rt6_get_pcpu_route().  If there is a
  'prev' after cmpxchg, return prev instead of the newly created percpu
  clone.

v3 -> v4:
- Patch 8 is new. It keeps track of the DST_NOCACHE routes in a list to handle
  the iface down/unregister event.

- Remove rcu from the newly added rt6i_pcpu variable.  It is not needed
  because it has already been protected by the existing reader/writer lock.

- Thanks to 'Julian Anastasov <ja@ssi.bg>' for testing the FLOWI_FLAG_KNOWN_NH
  patches.

v2 -> v3:
- Patch 5 to 7 are new.  They take care of cases where the daddr in
  skb is not the one used to do the route look-up.  There is also
  related changes to rt6_nexthop() since v2 which is in patch 2/9.
  Thanks to 'Julian Anastasov <ja@ssi.bg>' for pointing it out.

- Fix a few problems in __ip6_rt_update_pmtu(), like setting the expire
  and mtu before inserting to the tree and don't do dst_destroy() after
  tree insertion failure.  Also update the rt6i_pmtu in fib6_add_rt2node().
  Thanks to 'Steffen Klassert <steffen.klassert@secunet.com>' for pointing
  it out.

- Merge ip6_pmtu_rt_cache_alloc() into ip6_rt_cache_alloc().

v1 -> v2:
- Move the /128 route bug fixes to another series (accepted).
- Create a function for checking (rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)).
- Avoid shuffling the skb network_header.  Instead, change the function
  signature to take iph instead of skb.

- Many Thanks to 'Hannes Frederic Sowa <hannes@stressinduktion.org>' on
  reviewing v1 and v2 and giving advice.

--Martin

~~~ start: v1 compose message (with the out-dated parts removed) ~~~

This series is to avoid creating a RTF_CACHE route whenever we are consulting
the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
when we see a pmtu exception.

Out of all ipv6 RTF_CACHE routes that are created, the percentage that has a
different mtu is very small. In one of our end-user facing proxy server,
only 1k out of 80k RTF_CACHE routes have a smaller MTU.  For our DC
traffic, there is no mtu exception.

A large fib6 tree has problems like, 'ip -6 r show' takes a long time.
gc may kick in too often.  Also, when a service has restarted and a lot
of new TCP conn requests come in, it creates pressure on the tree by inserting
a lot of RTF_CACHE in a short time and it currently requires a write lock
to do that.

The first few patches are prep works to remove assumption that the
returned rt is always RTF_CACHE.

The patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception'
do the lazy RTF_CACHE route creation.

The following patches added percpu rt to compensate the performance loss after
doing the RTF_CACHE lazy creation.

Here is some numbers of the udpflood test.  The udpflood has been
slightly modified to have a time limit instead of count limit.

A /64 via gateway route is used for the test. Each udpflood uses 10000 dst
addresses.  The dst addresses of different udpflood processes do not overlap
with each other.

1                    16M                          15M
10                   61M                          61M
20                   65M                          62M
40                   88M                          83M

~~~ end: v1 compose message ~~~
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1a34035

ipv6: Create percpu rt6_info · d52d3997

由 Martin KaFai Lau 提交于 5月 22, 2015

After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate the performance hit (bouncing dst->__refcnt).
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d52d3997

ipv6: Break up ip6_rt_copy() · 83a09abd

由 Martin KaFai Lau 提交于 5月 22, 2015

This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and
ip6_rt_cache_alloc().

In the later patch, we need to create a percpu rt6_info copy. Hence,
refactor the common rt6_info init codes to ip6_rt_copy_init().
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

83a09abd

ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister · 8d0b94af

由 Martin KaFai Lau 提交于 5月 22, 2015

This patch keeps track of the DST_NOCACHE routes in a list and replaces its
dev with loopback during the iface down/unregister event.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d0b94af

ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set · 3da59bd9

由 Martin KaFai Lau 提交于 5月 22, 2015

This patch always creates RTF_CACHE clone with DST_NOCACHE
when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to
the fl6->daddr.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NJulian Anastasov <ja@ssi.bg>
Tested-by: NJulian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3da59bd9

ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags · 48e8aa6e

由 Martin KaFai Lau 提交于 5月 22, 2015

The neighbor look-up used to depend on the rt6i_gateway (if
there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone)
as the nexthop address.  Note that rt6i_dst is set to fl6->daddr
for the RTF_CACHE clone where fl6->daddr is the one used to do
the route look-up.

Now, we only create RTF_CACHE clone after encountering exception.
When doing the neighbor look-up with a route that is neither a gateway
nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop.

In some cases, the daddr in skb is not the one used to do
the route look-up.  One example is in ip_vs_dr_xmit_v6() where the
real nexthop server address is different from the one in the skb.

This patch is going to follow the IPv4 approach and ask the
ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly.

In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH
and create a RTF_CACHE clone.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NJulian Anastasov <ja@ssi.bg>
Tested-by: NJulian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

48e8aa6e

ipv6: Add rt6_get_cookie() function · b197df4f

由 Martin KaFai Lau 提交于 5月 22, 2015

Instead of doing the rt6->rt6i_node check whenever we need
to get the route's cookie.  Refactor it into rt6_get_cookie().
It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also
percpu rt6_info later.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b197df4f

ipv6: Only create RTF_CACHE routes after encountering pmtu exception · 45e4fd26

由 Martin KaFai Lau 提交于 5月 22, 2015

This patch creates a RTF_CACHE routes only after encountering a pmtu
exception.

After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6
tree, the rt->rt6i_node->fn_sernum is bumped which will fail the
ip6_dst_check() and trigger a relookup.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45e4fd26

openeuler / Kernel 11 个月 前同步成功

openeuler / Kernel
11 个月前同步成功