Commits · c380d37e97e783e36a924279fbd2f6837508546a · openanolis / cloud-kernel

16 Jul, 2016 1 commit

tcp_timer.c: Add kernel-doc function descriptions · c380d37e

Committed by Richard Sailer on Jul 16, 2016

This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Jul, 2016 2 commits

ipv4: af_inet: make it explicitly non-modular · d3fc0353

Committed by Paul Gortmaker on Jul 11, 2016

The Makefile controlling compilation of this file is obj-y,
meaning that it currently is never being built as a module.

Since MODULE_ALIAS is a no-op for non-modular code, we can simply
remove the MODULE_ALIAS_NETPROTO variant used here.

We replace module.h with kmod.h since the file does make use of
request_module() in order to load other modules from here.

We don't have to worry about init.h coming in via the removed
module.h since the file explicitly includes init.h already.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tunnels: correct conditional build of MPLS and IPv6 · aa9667e7

Committed by Simon Horman on Jul 10, 2016

Using a combination of #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.

This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.
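
The approach described above can be sketched roughly as follows. This is an illustration only, not the patch itself: `proto_register()`, `proto_unregister()`, the `WITH_IPV6`/`WITH_MPLS` switches, and the `fail_on` test hook are hypothetical stand-ins for the kernel's protocol registration calls and config options.

```c
#include <stdbool.h>

/* Hypothetical build switches standing in for kernel config options. */
#ifndef WITH_IPV6
#define WITH_IPV6 1
#endif
#ifndef WITH_MPLS
#define WITH_MPLS 1
#endif

/* Hypothetical protocol identifiers and registration state. */
enum proto { PROTO_IPIP, PROTO_IPV6, PROTO_MPLS, PROTO_COUNT };

static bool registered[PROTO_COUNT];
static int fail_on = -1; /* set to a proto value to simulate a failure */

static int proto_register(enum proto p)
{
    if ((int)p == fail_on)
        return -1;
    registered[p] = true;
    return 0;
}

static void proto_unregister(enum proto p)
{
    registered[p] = false;
}

/* Each failure path directly undoes what was registered before it,
 * so no label exists only for some configurations. */
int tunnels_init(void)
{
    if (proto_register(PROTO_IPIP))
        goto err;
#if WITH_IPV6
    if (proto_register(PROTO_IPV6)) {
        proto_unregister(PROTO_IPIP);
        goto err;
    }
#endif
#if WITH_MPLS
    if (proto_register(PROTO_MPLS)) {
        proto_unregister(PROTO_IPIP);
#if WITH_IPV6
        proto_unregister(PROTO_IPV6);
#endif
        goto err;
    }
#endif
    return 0;

err:
    return -1;
}
```

With a single always-used `err` label, disabling either switch leaves neither an unused label nor a missing unregister call.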

Fixes: 8afe97e5 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10 Jul, 2016 3 commits

ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf() · fa17806c

Committed by Eric Dumazet on Jul 08, 2016

inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.

If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and won't touch precious memory
reserves either and risk dropping incoming packets.

inet_netconf_get_devconf() can also use GFP_KERNEL allocation.

Fixes: edc9e748 ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e551110 ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipip: support MPLS over IPv4 · 1b69e7e6

Committed by Simon Horman on Jul 07, 2016

Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of existing support for IPv4 over IPv4 and is based on multiple
inner-protocol support for the SIT driver.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tunnels: support MPLS over IPv4 tunnels · 8afe97e5

Committed by Simon Horman on Jul 07, 2016

Extend tunnel support to MPLS over IPv4.  The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Jul, 2016 1 commit

netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF · c37a2dfa

Committed by Joe Perches on Jun 24, 2016

netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with an invflags member.

$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))

Consolidate these macros into a single NF_INVF macro.
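
As a rough illustration of the consolidation, the sketch below assumes a macro shape that takes the structure pointer explicitly instead of hiding it, mirroring the XOR form of the FWINV variants listed above. The struct, the flag value, and the helper name are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Assumed shape of the consolidated macro: the pointer is an explicit
 * argument rather than a hidden local variable name. */
#define NF_INVF(ptr, flag, boolean) \
    ((boolean) ^ !!((ptr)->invflags & (flag)))

/* Hypothetical flag and match-info structure for illustration. */
#define EX_INV_SRCIP 0x01

struct example_info {
    unsigned int invflags;
};

/* Returns true when the rule does not match: the raw comparison
 * result is flipped if the inversion flag is set on the rule. */
bool src_mismatch(const struct example_info *info, bool addr_differs)
{
    return NF_INVF(info, EX_INV_SRCIP, addr_differs);
}
```

Because the pointer is passed in, the same macro can serve every table type whose match structure has an `invflags` member.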

Miscellanea:

o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

01 Jul, 2016 2 commits

netfilter: x_tables: simplify ip{6}table_mangle_hook() · 468b021b

Committed by Pablo Neira Ayuso on Jun 24, 2016

No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

tcp: md5: use kmalloc() backed scratch areas · 19689e38

Committed by Eric Dumazet on Jun 27, 2016

Some arches have virtually mapped kernel stacks, or will soon have.

tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.

David says that using percpu storage is also problematic on non SMP
builds.

Just use kmalloc() to allocate scratch areas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Jun, 2016 2 commits

ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output · fedbb6b4

Committed by Shmulik Ladkani on Jun 29, 2016

ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).

However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).

OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
 - In case of local sockets, sk is same as skb->sk;
 - In case of a udp tunnel, sk is the tunneling socket.

Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1dd 'netfilter: Pass socket pointer down through okfn().'

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add an ability to dump and restore window parameters · b1ed4c4f

Committed by Andrey Vagin on Jun 27, 2016

We found that sometimes a restored tcp socket doesn't work.

One cause of this bug is incorrect window parameters; in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).

Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.

This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"

Then Alexey recommended that I restore the window parameters instead of
adjusting them according to the data in the send queue. This sounds reasonable.

rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of reasons why we need to restore snd_wnd was described above.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

29 Jun, 2016 1 commit

tcp: do not send too big packets at retransmit time · a3d2e9f8

Committed by Eric Dumazet on Jun 27, 2016

Arjun reported a bug in TCP stack and bisected it to a recent commit.

In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.

However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.

Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.

Thanks to Arjun for reporting the bug and running the tests !

Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28 Jun, 2016 2 commits

net: diag: Add support to filter on device index · 637c841d

Committed by David Ahern on Jun 23, 2016

Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipmr/ip6mr: Initialize the last assert time of mfc entries. · 70a0dec4

Committed by Tom Goff on Jun 23, 2016

This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.
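
The wraparound behind this fix can be sketched as below. This is an assumption-laden illustration, not the kernel code: it assumes the uninitialized last-assert field starts at zero, uses a 32-bit stand-in for jiffies, and picks `EX_HZ` and `EX_ASSERT_THRESH` as hypothetical values. The helper names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tick rate and threshold; real values depend on config. */
#define EX_HZ 1000u
#define EX_ASSERT_THRESH (3u * EX_HZ)

/* Wrap-safe "a is after b" on a 32-bit tick counter: the unsigned
 * difference is interpreted as a signed value. */
static bool time_after32(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* True when enough time has passed since the last assert. */
bool assert_due(uint32_t now, uint32_t last_assert)
{
    return time_after32(now, last_assert + EX_ASSERT_THRESH);
}

/* Initial value chosen so the very first check already passes. */
uint32_t init_last_assert(uint32_t now)
{
    return now - EX_ASSERT_THRESH - 1u;
}
```

With a zero timestamp and a counter beyond 2^31 plus the threshold, the signed difference turns positive and the check reports "not due"; an initial value slightly older than the threshold avoids that.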

Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Jun, 2016 1 commit

netfilter: nf_reject_ipv4: don't send tcp RST if the packet is non-TCP · e1dbbc59

Committed by Liping Zhang on Jun 20, 2016

In iptables, if the user adds a rule to send a tcp RST and specifies a
non-TCP protocol, such as UDP, the kernel will reject this request. But
in nftables, this validity check only occurs in the nft tool, i.e. only
in userspace.

This means that a user can add a rule like the following via nfnetlink:
  "nft add rule filter forward ip protocol udp reject with tcp reset"

This will generate some confusing tcp RST packets. So we should send a
tcp RST only when the packet is TCP.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

23 Jun, 2016 1 commit

esp: Fix ESN generation under UDP encapsulation · 962fcef3

Committed by Herbert Xu on Jun 18, 2016

Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.

This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.

Fixes: 7021b2e1 ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Jun, 2016 2 commits

ipv6: RFC 4884 partial support for SIT/GRE tunnels · 20e1954f

Committed by Eric Dumazet on Jun 18, 2016

When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.

This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.

It also avoids complex skb logic if the data_len is not
a multiple of 8.

RFC states :

   The "original datagram" field MUST contain at least 128 octets.
   If the original datagram did not contain 128 octets, the
   "original datagram" field MUST be zero padded to 128 octets.

In practice routers use 128 bytes of original datagram, not more.

Initial translation was added in commit ca15a078
("sit: generate icmpv6 error when receiving icmpv4 error")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: better support for ICMP messages for gre+ipv6 · 9b8c6d7b

Committed by Eric Dumazet on Jun 18, 2016

ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Jun, 2016 3 commits

net: Remove deprecated tunnel specific UDP offload functions · 1938ee1f

Committed by Alexander Duyck on Jun 16, 2016

Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Merge VXLAN and GENEVE push notifiers into a single notifier · 7c46a640

Committed by Alexander Duyck on Jun 16, 2016

This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.

In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.

I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Combine GENEVE and VXLAN port notifiers into single functions · e7b3db5e

Committed by Alexander Duyck on Jun 16, 2016

This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path.  This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jun, 2016 1 commit

net: xfrm: fix old-style declaration · 318d3cc0

Committed by Arnd Bergmann on Jun 16, 2016

Modern C standards expect the '__inline__' keyword to come before the return
type in a declaration, and we get a couple of warnings for this with "make W=1"
in the xfrm{4,6}_policy.c files:

net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm6_net_sysctl_init(struct net *net)
net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm6_net_sysctl_exit(struct net *net)
net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm4_net_sysctl_init(struct net *net)
net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm4_net_sysctl_exit(struct net *net)

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Jun, 2016 2 commits

gre: fix error handler · e582615a

Committed by Eric Dumazet on Jun 15, 2016

1) gre_parse_header() can be called from gre_err()

   At this point transport header points to ICMP header, not the inner
header.

2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())

3) pskb_may_pull() logic in gre_parse_header() really works
  if we are interested in the zone pointed to by skb->data

4) As Jiri explained in commit b7f8fe25 ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.

So this fix :

A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()

B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
  This offset is 0 for normal receive path.

C) remove obsolete IPV6 includes

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Add ability to have GRE ignore DF bit in IPv4 payloads · 22a59be8

Committed by Philip Prindeville on Jun 14, 2016

In the presence of firewalls which improperly block ICMP Unreachable
(including Fragmentation Required) messages, Path MTU Discovery is
prevented from working.

A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
is done for other payloads like AppleTalk--and doing transparent
fragmentation and reassembly.

Redux includes the enforcement of mutual exclusion between this feature
and Path MTU Discovery as suggested by Alexander Duyck.

Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15 Jun, 2016 5 commits

tcp: return sizeof tcp_dctcp_info in dctcp_get_info() · dcf1158b

Committed by Neal Cardwell on Jun 13, 2016

Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info).  There is no problem yet, but that union may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.
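
The size mismatch described above can be sketched as follows. The types here are hypothetical stand-ins (they are not the layouts of tcp_cc_info or tcp_dctcp_info); the second union member exists only to make the union larger than the member that gets filled in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical small info struct that the getter actually fills. */
struct ex_small_info {
    uint16_t enabled;
    uint16_t ce_state;
    uint32_t alpha;
};

/* Hypothetical larger member that some other algorithm might add. */
struct ex_big_info {
    uint32_t fields[8];
};

union ex_cc_info {
    struct ex_small_info small;
    struct ex_big_info big;
};

/* Zeroes and fills only the small member, and reports exactly that
 * size, so a caller copying the returned length never reads bytes
 * this function did not initialize. */
size_t ex_get_info(union ex_cc_info *info)
{
    memset(&info->small, 0, sizeof(info->small));
    info->small.enabled = 1;
    return sizeof(info->small);
}
```

Returning `sizeof(*info)` instead would report the size of the whole union, which in this sketch is larger than the initialized region.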

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

ovs/gre: fix rtnl notifications on iface deletion · da6f1da8

Committed by Nicolas Dichtel on Jun 13, 2016

The function gretap_fb_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.

Fixes: b2acd1dc ("openvswitch: Use regular GRE net_device instead of vport")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ovs/gre,geneve: fix error path when creating an iface · 106da663

Committed by Nicolas Dichtel on Jun 13, 2016

After ipgre_newlink()/geneve_configure() call, the netdev is registered.

Fixes: 7e059158 ("vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices")
CC: David Wragg <david@weave.works>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp reuseport: fix packet of same flow hashed to different socket · d1e37288

Committed by Su, Xuemin on Jun 13, 2016

There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aforementioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.
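
The port arithmetic in step 3 can be checked with a small sketch. It assumes the slot is simply the port masked by the table size and ignores any per-namespace mixing the kernel may add; `EX_UDP_HASH_SIZE` and `ex_port_slot()` are names made up for this example.

```c
#include <stdint.h>

/* Table size taken from the reproduction steps above. */
#define EX_UDP_HASH_SIZE 4096u

/* Simplified slot selection: low bits of the port number. */
unsigned int ex_port_slot(uint16_t port)
{
    return port & (EX_UDP_HASH_SIZE - 1u);
}
```

Under this assumption, ports 40000 and 44096 differ by exactly the table size and therefore share a slot, which is why the eleventh socket raises the count seen by the first ten.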

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fix checksum annotation in udp4_csum_init · b46d9f62

Committed by Hannes Frederic Sowa on Jun 12, 2016

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Jun, 2016 1 commit

ipconfig: Protect ic_addrservaddr with IPCONFIG_DYNAMIC. · 86ef7f9c

Committed by David S. Miller on Jun 11, 2016

>> net/ipv4/ipconfig.c:130:15: warning: 'ic_addrservaddr' defined but not used [-Wunused-variable]
    static __be32 ic_addrservaddr = NONE; /* IP Address of the IP addresses'server */

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11 Jun, 2016 3 commits

net: ipconfig: avoid warning by making ic_addrservaddr static · 0b392be9

Committed by Ben Dooks on Jun 10, 2016

The symbol ic_addrservaddr is not static, but has no declaration
to match so make it static to fix the following warning:

net/ipv4/ipconfig.c:130:8: warning: symbol 'ic_addrservaddr' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add NV congestion control · 699fafaf

Committed by Lawrence Brakmo on Jun 08, 2016

TCP-NV (New Vegas) is a major update to TCP-Vegas.
An earlier version of NV was presented at 2010's LPC.
It is a delay-based congestion avoidance for the
data center. This version has been tested within a
10G rack where the HW RTTs are 20-50us and with
1 to 400 flows.

A description of TCP-NV, including implementation
details as well as experimental results, can be found at:
http://www.brakmo.org/networking/tcp-nv/TCPNV.html

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add in_flight to tcp_skb_cb · 6f094b9e

Committed by Lawrence Brakmo on Jun 08, 2016

Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09 Jun, 2016 1 commit

net: Add l3mdev rule · 96c63fa7

Committed by David Ahern on Jun 08, 2016

Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
lookup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.

This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).

If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.

Currently, the rules list for both ipv4 and ipv6 look like this:
    $ ip  ru ls
    1000:       from all oif vrf1 lookup 1001
    1000:       from all iif vrf1 lookup 1001
    1000:       from all oif vrf2 lookup 1002
    1000:       from all iif vrf2 lookup 1002
    1000:       from all oif vrf3 lookup 1003
    1000:       from all iif vrf3 lookup 1003
    1000:       from all oif vrf4 lookup 1004
    1000:       from all iif vrf4 lookup 1004
    1000:       from all oif vrf5 lookup 1005
    1000:       from all iif vrf5 lookup 1005
    1000:       from all oif vrf6 lookup 1006
    1000:       from all iif vrf6 lookup 1006
    1000:       from all oif vrf7 lookup 1007
    1000:       from all iif vrf7 lookup 1007
    1000:       from all oif vrf8 lookup 1008
    1000:       from all iif vrf8 lookup 1008
    ...
    32765:      from all lookup local
    32766:      from all lookup main
    32767:      from all lookup default

With the l3mdev rule the list is just the following regardless of the
number of VRFs:
    $ ip ru ls
    1000:       from all lookup [l3mdev table]
    32765:      from all lookup local
    32766:      from all lookup main
    32767:      from all lookup default

(Note: the above pretty print of the rule is based on an iproute2
       prototype. Actual verbiage may change)

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Jun, 2016 2 commits

tcp: accept RST if SEQ matches right edge of right-most SACK block · e00431bc

Committed by Pau Espin Pedrol on Jun 07, 2016

RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.

However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently ran into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.

In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.

For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
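
The acceptance rule described above can be sketched like this. It is a simplified illustration rather than the patch: the struct and function names are hypothetical, and the caller is assumed to pass at most the few most recently updated SACK blocks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical SACK block: [start_seq, end_seq). */
struct ex_sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Wrap-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* A RST is accepted if its sequence number equals the next expected
 * one, or equals the right edge of the right-most SACK block. */
bool ex_rst_seq_ok(uint32_t seq, uint32_t rcv_nxt,
                   const struct ex_sack_block *sacks, int num_sacks)
{
    if (seq == rcv_nxt)
        return true;
    if (num_sacks <= 0)
        return false;

    uint32_t max_end = sacks[0].end_seq;
    for (int i = 1; i < num_sacks; i++) {
        if (seq_after(sacks[i].end_seq, max_end))
            max_end = sacks[i].end_seq;
    }
    return seq == max_end;
}
```

Only one extra sequence number becomes acceptable, which keeps the spoofing surface close to what RFC 5961 intends.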

This patch was tested in a 3.18 kernel and proved to improve the
situation in the scenario described above.

Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gue: Implement direction IP encapsulation · c1e48af7

Committed by Tom Herbert on Jun 06, 2016

This patch implements direct encapsulation of IPv4 and IPv6 packets
in UDP. This is done as version "1" of GUE and as explained in I-D
draft-ietf-nvo3-gue-03.

Changes here are only in the receive path, fou with IPxIPx already
supports the transmit side. Both the normal receive path and
GRO path are modified to check for GUE version and check for
IP version in the case that GUE version is "1".

Tested:

IPIP with direct GUE encap
  1 TCP_STREAM
    4530 Mbps
  200 TCP_RR
    1297625 tps
    135/232/444 90/95/99% latencies

IP4IP6 with direct GUE encap
  1 TCP_STREAM
    4903 Mbps
  200 TCP_RR
    1184481 tps
    149/253/473 90/95/99% latencies

IP6IP6 direct GUE encap
  1 TCP_STREAM
    5146 Mbps
  200 TCP_RR
    1202879 tps
    146/251/472 90/95/99% latencies

SIT with direct GUE encap
  1 TCP_STREAM
    6111 Mbps
  200 TCP_RR
    1250337 tps
    139/241/467 90/95/99% latencies

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06 Jun, 2016 1 commit

net: disable fragment reassembly if high_thresh is zero · 30759219

Committed by Michal Kubeček on May 27, 2016

Before commit 6d7b857d ("net: use lib/percpu_counter API for
fragmentation mem accounting"), setting the reassembly high threshold
to 0 prevented fragment reassembly as first fragment would be always
evicted before second could be added to the queue. While inefficient,
some users apparently relied on this method.

Since the commit mentioned above, a percpu counter is used for
reassembly memory accounting and high batch size avoids taking slow path
in most common scenarios. As a result, a whole full sized packet can be
reassembled without the percpu counter's main counter changing its value
so that even with high_thresh set to 0, fragmented packets can be still
reassembled and processed.

Add explicit check preventing reassembly if high threshold is zero.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Jun, 2016 1 commit

skbuff: introduce skb_gso_validate_mtu · ae7ef81e

Committed by Marcelo Ricardo Leitner on Jun 02, 2016

skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.

This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given MTU, and improve the check.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Jun, 2016 1 commit

Possible problem with ("udp: remove headers from UDP packets before queueing") · ce25d66a

Committed by Eric Dumazet on Jun 02, 2016

Paul Moore tracked a regression caused by a recent commit, which
mistakenly assumed that sk_filter() could be avoided if socket
had no current BPF filter.

The intent was to avoid udp_lib_checksum_complete() overhead.

But sk_filter() also checks skb_pfmemalloc() and
security_sock_rcv_skb(), so better call it.

Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Moore <paul@paul-moore.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: samanthakumar <samanthakumar@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 May, 2016 1 commit

ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n · 049bbf58

Committed by Ezequiel Garcia on May 20, 2016

Commit fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
moves the default TTL assignment, and as side-effect IPv4 TTL now
has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).

The sysctl_ip_default_ttl is fundamental for IP to work properly,
as it provides the TTL to be used as default. The default TTL may be
used in ip_select_ttl, through the following flow:

  ip_select_ttl
    ip4_dst_hoplimit
      net->ipv4.sysctl_ip_default_ttl

This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
in net_init_net, called during ipv4's initialization.

Without this commit, a kernel built without sysctl support will send
all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
with setsockopt).

Given a similar issue might appear on the other knobs that were
namespacified, this commit also moves them.

Fixes: fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: David S. Miller <davem@davemloft.net>
