1. 11 Sep 2016 (8 commits)
  2. 10 Sep 2016 (2 commits)
    • ip_tunnel: do not clear l4 hashes · bf8d85d4
      Committed by Eric Dumazet
      If the skb has a valid l4 hash, there is no point in clearing the hash
      and forcing a further flow dissection when a tunnel encapsulation is
      added.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
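      The idea in compilable form, as a minimal userspace sketch (the struct
      and helper below are illustrative stand-ins, not the kernel's; the
      kernel has a helper with this behavior, skb_clear_hash_if_not_l4(),
      though the exact diff of this commit is not shown here):

      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified stand-in for the two skb fields involved (not the kernel struct). */
      struct skb_model {
              unsigned int hash;
              bool l4_hash;     /* hash was computed over the L4 4-tuple */
      };

      /* Only drop the hash when it is NOT an L4 hash, so a valid 4-tuple hash
       * survives tunnel encapsulation and no extra flow dissection is needed. */
      static void clear_hash_if_not_l4(struct skb_model *skb)
      {
              if (!skb->l4_hash)
                      skb->hash = 0;
      }

      int main(void)
      {
              struct skb_model skb = { .hash = 0xabcd1234, .l4_hash = true };

              clear_hash_if_not_l4(&skb);
              printf("hash after encap: 0x%x\n", skb.hash);  /* preserved */
              return 0;
      }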
    • ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events · b93e1fa7
      Committed by Guillaume Nault
      fib_table_insert() inconsistently fills the nlmsg_flags field in its
      notification messages.
      
      Since commit b8f55831 ("[RTNETLINK]: Fix sending netlink message
      when replace route."), the netlink message has its nlmsg_flags set to
      NLM_F_REPLACE if the route replaced a preexisting one.
      
      Then commit a2bb6d7d ("ipv4: include NLM_F_APPEND flag in append
      route notifications") started setting nlmsg_flags to NLM_F_APPEND if
      the route matched a preexisting one but was appended.
      
      In other cases (exclusive creation or prepend), nlmsg_flags is 0.
      
      This patch sets ->nlmsg_flags in all situations, preserving the
      semantics of the NLM_F_* bits:
      
        * NLM_F_CREATE: a new fib entry has been created for this route.
        * NLM_F_EXCL: no other fib entry existed for this route.
        * NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
        * NLM_F_APPEND: the new fib entry was added after other entries for
          the same route.
      
      As a result, the following flag combinations can now be reported
      (iproute2's terminology in parentheses):
      
        * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
          ("add").
        * NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
          added after preexisting ones ("append").
        * NLM_F_CREATE: route did already exist, new route added before
          preexisting ones ("prepend").
        * NLM_F_REPLACE: route did already exist, new route replaced the
          first preexisting one ("change").
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
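      A small hedged helper (userspace illustration only, not part of the
      patch) that maps the reported combinations back to iproute2's verbs,
      following the table above:

      #include <stdio.h>
      #include <linux/netlink.h>  /* NLM_F_CREATE, NLM_F_EXCL, NLM_F_REPLACE, NLM_F_APPEND */

      /* Decode nlmsg_flags of an RTM_NEWROUTE notification into iproute2 terms. */
      static const char *route_event_verb(unsigned int flags)
      {
              if (flags & NLM_F_REPLACE)
                      return "change";    /* replaced the first preexisting entry */
              if ((flags & (NLM_F_CREATE | NLM_F_EXCL)) == (NLM_F_CREATE | NLM_F_EXCL))
                      return "add";       /* no other fib entry existed */
              if ((flags & (NLM_F_CREATE | NLM_F_APPEND)) == (NLM_F_CREATE | NLM_F_APPEND))
                      return "append";    /* added after preexisting entries */
              if (flags & NLM_F_CREATE)
                      return "prepend";   /* added before preexisting entries */
              return "unknown";
      }

      int main(void)
      {
              printf("%s\n", route_event_verb(NLM_F_CREATE | NLM_F_EXCL)); /* add */
              printf("%s\n", route_event_verb(NLM_F_REPLACE));             /* change */
              return 0;
      }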
  3. 09 Sep 2016 (3 commits)
    • ipv4: accept u8 in IP_TOS ancillary data · e895cdce
      Committed by Eric Dumazet
      In commit f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as
      ancillary data"), Francesco added support for IP_TOS values specified
      as an integer.
      
      However, the kernel sends the IP_TOS value to userspace (at recvmsg()
      time) as a single byte when IP_RECVTOS is set on the socket.
      
      It can be very useful to reflect all ancillary options exactly as
      given by the kernel in a subsequent sendmsg(), instead of having that
      sendmsg() aborted with EINVAL, as happens after Francesco's patch.
      
      So this patch extends the IP_TOS ancillary data to also accept a u8,
      so that a UDP server can simply reuse the same ancillary block without
      having to mangle it.
      
      Jesper can then augment
      https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
      to add TOS reflection ;)
      
      Fixes: f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Fusco <ffusco@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
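      A hedged userspace sketch of what this enables: a UDP sender passing
      the TOS as a single byte of IP_TOS ancillary data, exactly as
      IP_RECVTOS delivers it on the receive side (the function name and the
      minimal error handling are illustrative):

      #include <string.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>

      /* Send one datagram with the TOS byte supplied as IP_TOS ancillary data.
       * With this patch a one-byte value is accepted; before, only an int was. */
      static ssize_t send_with_tos(int fd, const struct sockaddr_in *dst,
                                   const void *payload, size_t len, unsigned char tos)
      {
              union {
                      char buf[CMSG_SPACE(sizeof(unsigned char))];
                      struct cmsghdr align;   /* keeps the control buffer aligned */
              } u;
              struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
              struct msghdr msg = {
                      .msg_name = (void *)dst,
                      .msg_namelen = sizeof(*dst),
                      .msg_iov = &iov,
                      .msg_iovlen = 1,
                      .msg_control = u.buf,
                      .msg_controllen = sizeof(u.buf),
              };
              struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

              cmsg->cmsg_level = IPPROTO_IP;
              cmsg->cmsg_type = IP_TOS;
              cmsg->cmsg_len = CMSG_LEN(sizeof(unsigned char));
              memcpy(CMSG_DATA(cmsg), &tos, sizeof(tos));

              return sendmsg(fd, &msg, 0);
      }

      A server that enabled IP_RECVTOS receives the same one-byte cmsg from
      recvmsg() and, with this change, can feed that block back into
      sendmsg() unmodified.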
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      Committed by Yaogong Wang
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbyte limit.
      
      Even with the current window scale limit of 14, ~1 Gbyte maps to
      ~740,000 MSS.
      
      In the presence of packet losses (or reordering), TCP stores incoming
      packets in an out-of-order queue, and the number of skbs sitting there
      waiting for the missing packets to arrive can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when packets
      can finally be transferred to the receive queue, we scan the queue
      from its head.
      
      However, in the presence of heavy losses, we might have to find an
      arbitrary insertion point in this queue, which means a linear scan for
      every incoming packet, trashing CPU caches.
      
      This patch converts it to an RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added the ofo_last_skb cache, and did the
      polishing and tests.
      
      Tested with the network dropping between 1 and 10% of packets, with
      good success (about a 30% throughput increase in stress tests).
      
      The next step would be to also use an RB tree for the write queue on
      the sender side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
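      The effect can be sketched in userspace: key out-of-order segments by
      their starting sequence number in a balanced search tree, so the
      insertion point for a new arrival is found in O(log n) instead of by a
      linear walk. The hedged sketch below uses POSIX tsearch()/tfind()
      (which glibc implements with a red-black tree) purely as a stand-in
      for the kernel's rb tree; the names and the wrap-safe comparison are
      illustrative, not the kernel code:

      #include <search.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Out-of-order segment keyed by its starting sequence number. */
      struct ooo_seg {
              uint32_t seq;
              uint32_t len;
      };

      /* Wraparound-safe comparison of 32-bit TCP sequence numbers. */
      static int seg_cmp(const void *a, const void *b)
      {
              const struct ooo_seg *x = a, *y = b;
              int32_t d = (int32_t)(x->seq - y->seq);

              return (d > 0) - (d < 0);
      }

      int main(void)
      {
              void *root = NULL;   /* tsearch()-managed tree, rb-tree stand-in */
              uint32_t seqs[] = { 4000, 1000, 3000, 2000, 5000 };

              for (size_t i = 0; i < sizeof(seqs) / sizeof(seqs[0]); i++) {
                      struct ooo_seg *s = malloc(sizeof(*s));

                      s->seq = seqs[i];
                      s->len = 1000;
                      tsearch(s, &root, seg_cmp);   /* O(log n) insertion point */
              }

              struct ooo_seg key = { .seq = 3000 };
              struct ooo_seg **found = tfind(&key, &root, seg_cmp);

              if (found)
                      printf("segment at seq %u found without a linear scan\n",
                             (*found)->seq);
              return 0;
      }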
    • net: inet: diag: expose the socket mark to privileged processes. · d545caca
      Committed by Lorenzo Colitti
      This adds the capability for a process that has CAP_NET_ADMIN on
      a socket to see the socket mark in socket dumps.
      
      Commit a52e95ab ("net: diag: allow socket bytecode filters to
      match socket marks") recently gave privileged processes the
      ability to filter socket dumps based on mark. This patch is
      complementary: it ensures that the mark is also passed to
      userspace in the socket's netlink attributes.  It is useful for
      tools like ss which display information about sockets.
      
      Tested: https://android-review.googlesource.com/270210
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
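      For context on the mark itself, a hedged snippet of the setsockopt
      side only (setting SO_MARK needs CAP_NET_ADMIN; the mark value is an
      arbitrary example). With this patch the same mark also appears in
      inet_diag socket dumps, which is how tools like ss can display it; the
      diag/netlink side is not shown here:

      #include <stdio.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              unsigned int mark = 0x123;      /* arbitrary example mark */
              socklen_t len = sizeof(mark);

              /* Needs CAP_NET_ADMIN; fails with EPERM otherwise. */
              if (setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)) < 0)
                      perror("setsockopt(SO_MARK)");

              mark = 0;
              getsockopt(fd, SOL_SOCKET, SO_MARK, &mark, &len);
              printf("socket mark: 0x%x\n", mark);
              return 0;
      }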
  4. 08 Sep 2016 (1 commit)
  5. 02 Sep 2016 (2 commits)
  6. 31 Aug 2016 (1 commit)
    • net: lwtunnel: Handle fragmentation · 14972cbd
      Committed by Roopa Prabhu
      Today the mpls iptunnel lwtunnel_output redirect expects the tunnel
      output function to handle fragmentation. This is OK, but it can be
      avoided if we do not do the mpls output redirect too early, i.e. we
      could wait until IP fragmentation is done and then call the mpls
      output for each IP fragment.
      
      To make this work we need:
      1) the lwtunnel state to carry the encap headroom, and
      2) to do the redirect to the encap output handler on each IP fragment
      (essentially, do the output redirect after fragmentation).
      
      This patch adds the tunnel headroom to lwtstate to make sure we
      account for tunnel data in MTU calculations during fragmentation,
      and adds a new xmit redirect handler that redirects to the lwtunnel
      xmit function after IP fragmentation.
      
      This includes IPv6 support, some MTU fixes, and testing from David Ahern.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
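      A hedged back-of-the-envelope illustration of the headroom accounting
      described above (the struct and numbers are illustrative, not the
      kernel's lwtstate fields):

      #include <stdio.h>

      /* Assumed, simplified stand-in for the lwtunnel state. */
      struct lwt_state_sketch {
              unsigned int headroom;   /* encap bytes added at xmit time */
      };

      /* The MTU used by IP fragmentation must reserve the encap headroom so
       * that each fragment, once encapsulated, still fits the link MTU. */
      static unsigned int frag_mtu(unsigned int link_mtu,
                                   const struct lwt_state_sketch *lw)
      {
              return link_mtu - lw->headroom;
      }

      int main(void)
      {
              struct lwt_state_sketch mpls = { .headroom = 8 }; /* e.g. two MPLS labels */

              printf("fragment payload mtu: %u\n", frag_mtu(1500, &mpls));
              return 0;
      }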
  7. 30 Aug 2016 (2 commits)
  8. 29 Aug 2016 (2 commits)
    • tcp: add tcp_add_backlog() · c9c33212
      Committed by Eric Dumazet
      When TCP operates in lossy environments (between 1 and 10% packet
      loss), many SACK blocks can be exchanged, and I noticed we could drop
      them on busy senders if these SACK blocks have to be queued into the
      socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      The cause of the drops is skb->truesize overestimation, caused by:
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment's truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
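      A rough worked example of the overestimation (all numbers below are
      assumptions typical of such setups, not measurements from this
      commit): a small pure-ACK frame carried in a ~2 KB driver fragment is
      charged far more truesize than it has payload, so a backlog bounded by
      the receive buffer fills after comparatively few packets.

      #include <stdio.h>

      int main(void)
      {
              /* Assumed, typical numbers (not taken from this commit). */
              unsigned int frame_bytes  = 66;      /* pure ACK carrying SACK blocks */
              unsigned int frag_alloc   = 2048;    /* driver rx fragment size */
              unsigned int skb_overhead = 256;     /* rough sk_buff + shinfo overhead */
              unsigned int truesize     = frag_alloc + skb_overhead;
              unsigned int rcvbuf       = 212992;  /* a common default rcvbuf */

              printf("truesize per packet : %u bytes for %u bytes of frame\n",
                     truesize, frame_bytes);
              printf("packets until limit : ~%u (vs ~%u if charged by frame size)\n",
                     rcvbuf / truesize, rcvbuf / frame_bytes);
              return 0;
      }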
    • tcp: Set read_sock and peek_len proto_ops · 32035585
      Committed by Tom Herbert
      In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
      tcp_peek_len (which is just a stub function that calls tcp_inq).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
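      For readers less familiar with ops tables, a toy, self-contained
      illustration of the pattern used here: wiring callbacks into a const
      ops structure with designated initializers (the struct and functions
      below are stand-ins, not the kernel's proto_ops):

      #include <stdio.h>

      /* Toy ops table mirroring the idea of proto_ops entries. */
      struct stream_ops {
              int (*read_sock)(int fd);     /* stand-in for tcp_read_sock */
              int (*peek_len)(int fd);      /* stand-in for tcp_peek_len -> tcp_inq */
      };

      static int toy_read_sock(int fd) { (void)fd; return 0; }
      static int toy_peek_len(int fd)  { (void)fd; return 42; }

      /* Designated initializers, the same style used for inet_stream_ops. */
      static const struct stream_ops stream_ops_sketch = {
              .read_sock = toy_read_sock,
              .peek_len  = toy_peek_len,
      };

      int main(void)
      {
              printf("peek_len() -> %d\n", stream_ops_sketch.peek_len(0));
              return 0;
      }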
  9. 26 Aug 2016 (2 commits)
  10. 25 Aug 2016 (2 commits)
  11. 24 Aug 2016 (7 commits)
  12. 23 Aug 2016 (2 commits)
    • net: ipconfig: Fix NULL pointer dereference on RARP/BOOTP/DHCP timeout · 1ae292a2
      Committed by Geert Uytterhoeven
      If no RARP, BOOTP, or DHCP response is received, ic_dev is never set,
      causing a NULL pointer dereference in ic_close_devs():
      
          Sending DHCP requests ...... timed out!
          Unable to handle kernel NULL pointer dereference at virtual address 00000004
      
      To fix this, add a check to avoid dereferencing ic_dev if it is still
      NULL.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Fixes: 2647cffb ("net: ipconfig: Support using "delayed" DHCP replies")
      Signed-off-by: David S. Miller <davem@davemloft.net>
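      A hedged userspace model of the failure and the guard (the variables
      below are stand-ins for the ipconfig state, not the actual kernel
      diff): when every RARP/BOOTP/DHCP request times out, the "selected
      device" pointer stays NULL, and the cleanup path has to tolerate that.

      #include <stdio.h>
      #include <stddef.h>

      struct net_dev { const char *name; };

      struct ic_device {
              struct ic_device *next;
              struct net_dev *dev;
              int opened;
      };

      /* Device picked by a RARP/BOOTP/DHCP reply; stays NULL when every
       * request times out, which is exactly the crashing case. */
      static struct ic_device *ic_dev;

      static void close_devs(struct ic_device *devs)
      {
              for (struct ic_device *d = devs; d; d = d->next) {
                      /* The fix: never dereference ic_dev when no reply selected a device. */
                      if (!ic_dev || d->dev != ic_dev->dev)
                              d->opened = 0;   /* close everything we opened */
              }
      }

      int main(void)
      {
              struct net_dev n0 = { "eth0" }, n1 = { "eth1" };
              struct ic_device d1 = { NULL, &n1, 1 };
              struct ic_device d0 = { &d1, &n0, 1 };

              close_devs(&d0);   /* no NULL dereference on timeout */
              printf("%s opened=%d, %s opened=%d\n",
                     n0.name, d0.opened, n1.name, d1.opened);
              return 0;
      }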
    • net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset · c0451fe1
      Committed by Shmulik Ladkani
      In b8247f09,
      
         "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"
      
      gso skbs arriving from an ingress interface that go through UDP
      tunneling are allowed to be fragmented if the resulting encapsulated
      segments exceed the dst mtu of the egress interface.
      
      This aligned the behavior of gso skbs with that of non-gso skbs going
      through the udp encapsulation path.
      
      However, the non-gso vs gso anomaly is also present in the following
      cases of a GRE tunnel:
       - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
         (e.g. OvS vport-gre with df_default=false)
       - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
      
      In both of the above cases, the non-gso skbs get fragmented, whereas
      the gso skbs (whose skb_gso_network_seglen exceeds the dst mtu) get
      dropped, as they don't go through the segment+fragment code path.
      
      Fix: set IPSKB_FRAG_SEGS if the tunnel-specified IP_DF bit is NOT set.
      
      Tunnels that do set IP_DF will not fall through to fragmentation of
      segments. This preserves the behavior of ip_gre in (the default)
      pmtudisc mode.
      
      Fixes: b8247f09 ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
      Reported-by: wenxu <wenxu@ucloud.cn>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Tested-by: wenxu <wenxu@ucloud.cn>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
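      A hedged userspace model of the decision described in the fix (the
      flag name in the comment mirrors the kernel's IPSKB_FRAG_SEGS, but
      this is not the kernel code path):

      #include <stdbool.h>
      #include <stdio.h>

      /* Model of the fix: segments produced from a gso skb may be fragmented
       * only when the tunnel did not request the DF bit. */
      static bool allow_frag_segs(bool tunnel_sets_df)
      {
              return !tunnel_sets_df;   /* i.e. set IPSKB_FRAG_SEGS only if IP_DF is unset */
      }

      int main(void)
      {
              /* ip_gre with df_default=false or IGNORE_DF: fragmentation allowed. */
              printf("DF unset -> frag segs: %d\n", allow_frag_segs(false));
              /* default pmtudisc mode sets DF: oversized gso segments stay dropped. */
              printf("DF set   -> frag segs: %d\n", allow_frag_segs(true));
              return 0;
      }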
  13. 20 Aug 2016 (3 commits)
  14. 19 Aug 2016 (3 commits)