提交 · d7226c7a4dd19929d6df4ae04698da2fcf6f875a · openeuler / Kernel

24 8月, 2016 4 次提交

net: diag: Fix refcnt leak in error path destroying socket · d7226c7a

由 David Ahern 提交于 8月 23, 2016

inet_diag_find_one_icsk takes a reference to a socket that is not
released if sock_diag_destroy returns an error. Fix by changing
tcp_diag_destroy to manage the refcnt for all cases and remove
the sock_put calls from tcp_abort.

Fixes: c1e64e29 ("net: diag: Support destroying TCP sockets")
Reported-by: NLorenzo Colitti <lorenzo@google.com>
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d7226c7a

udp: get rid of SLAB_DESTROY_BY_RCU allocations · 75d855a5

由 Eric Dumazet 提交于 8月 23, 2016

After commit ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU")
we do not need this special allocation mode anymore, even if it is
harmless.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75d855a5

tcp: properly scale window in tcp_v[46]_reqsk_send_ack() · 20a2b49f

由 Eric Dumazet 提交于 8月 22, 2016

When sending an ack in SYN_RECV state, we must scale the offered
window if wscale option was negotiated and accepted.

Tested:
 Following packetdrill test demonstrates the issue :

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0>
+0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7>

+0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100>
// check that window is properly scaled !
+0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20a2b49f

udp: fix poll() issue with zero sized packets · e83c6744

由 Eric Dumazet 提交于 8月 23, 2016

Laura tracked poll() [and friends] regression caused by commit
e6afc8ac ("udp: remove headers from UDP packets before queueing")

udp_poll() needs to know if there is a valid packet in receive queue,
even if its payload length is 0.

Change first_packet_length() to return an signed int, and use -1
as the indication of an empty queue.

Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
Reported-by: NLaura Abbott <labbott@redhat.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Tested-by: NLaura Abbott <labbott@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e83c6744

23 8月, 2016 1 次提交

net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset · c0451fe1

由 Shmulik Ladkani 提交于 8月 21, 2016

In b8247f09,

   "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"

gso skbs arriving from an ingress interface that go through UDP
tunneling, are allowed to be fragmented if the resulting encapulated
segments exceed the dst mtu of the egress interface.

This aligned the behavior of gso skbs to non-gso skbs going through udp
encapsulation path.

However the non-gso vs gso anomaly is present also in the following
cases of a GRE tunnel:
 - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
   (e.g. OvS vport-gre with df_default=false)
 - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set

In both of the above cases, the non-gso skbs get fragmented, whereas the
gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
as they don't go through the segment+fragment code path.

Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.

Tunnels that do set IP_DF, will not go to fragmentation of segments.
This preserves behavior of ip_gre in (the default) pmtudisc mode.

Fixes: b8247f09 ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
Reported-by: Nwenxu <wenxu@ucloud.cn>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Tested-by: Nwenxu <wenxu@ucloud.cn>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c0451fe1

19 8月, 2016 1 次提交

fib_trie: Fix the description of pos and bits · 98a384ec

由 Xunlei Pang 提交于 8月 18, 2016

1) Fix one typo: s/tn/tp/
2) Fix the description about the "u" bits.
Signed-off-by: NXunlei Pang <xlpang@redhat.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

98a384ec

16 8月, 2016 1 次提交

gre: set inner_protocol on xmit · 3d7b3320

由 Simon Horman 提交于 8月 15, 2016

Ensure that the inner_protocol is set on transmit so that GSO segmentation,
which relies on that field, works correctly.

This is achieved by setting the inner_protocol in gre_build_header rather
than each caller of that function. It ensures that the inner_protocol is
set when gre_fb_xmit() is used to transmit GRE which was not previously the
case.

I have observed this is not the case when OvS transmits GRE using
lwtunnel metadata (which it always does).

Fixes: 38720352 ("gre: Use inner_proto to obtain inner header protocol")
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NPravin B Shelar <pshelar@ovn.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3d7b3320

10 8月, 2016 1 次提交

vti: flush x-netns xfrm cache when vti interface is removed · a5d0dc81

由 Lance Richardson 提交于 8月 09, 2016

When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):

  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.

Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.

A more detailed description of the problem scenario (referencing commands
in the script below):

(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

  The vti_test interface is created in the init namespace. vti_tunnel_init()
  attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
  setting the tunnel net to &init_net.

(2) ip link set vti_test netns secure

  The vti_test interface is moved to the "secure" netns. Note that
  the associated struct ip_tunnel still has tunnel->net set to &init_net.

(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

  The first packet sent using the vti device causes xfrm_lookup() to be
  called as follows:

      dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

  Note that tunnel->net is the init namespace, while skb_dst(skb) references
  the vti_test interface in the "secure" namespace. The returned dst
  references an interface in the init namespace.

  Also note that the first parameter to xfrm_lookup() determines which flow
  cache is used to store the computed xfrm bundle, so after xfrm_lookup()
  returns there will be a cached bundle in the init namespace flow cache
  with a dst referencing a device in the "secure" namespace.

(4) ip netns del secure

  Kernel begins to delete the "secure" namespace.  At some point the
  vti_test interface is deleted, at which point dst_ifdown() changes
  the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
  in the "secure" namespace however).
  Since nothing has happened to cause the init namespace's flow cache
  to be garbage collected, this dst remains attached to the flow cache,
  so the kernel loops waiting for the last reference to lo to go away.

<Begin script>
ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8

ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
   proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
   proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

ip netns del secure
<End script>
Reported-by: NHangbin Liu <haliu@redhat.com>
Reported-by: NJan Tluka <jtluka@redhat.com>
Signed-off-by: NLance Richardson <lrichard@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a5d0dc81

06 8月, 2016 1 次提交

ipv4: panic in leaf_walk_rcu due to stale node pointer · 94d9f1c5

由 David Forster 提交于 8月 03, 2016

Panic occurs when issuing "cat /proc/net/route" whilst
populating FIB with > 1M routes.

Use of cached node pointer in fib_route_get_idx is unsafe.

 BUG: unable to handle kernel paging request at ffffc90001630024
 IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
 PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0
 Oops: 0000 [#1] SMP
 Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac
 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti
 acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd
tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod
 CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
 task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000
 RIP: 0010:[<ffffffff814cf6a0>]  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
 RSP: 0018:ffff88011a05fda0  EFLAGS: 00010202
 RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20
 RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950
 RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000
 R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950
 R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480
 FS:  00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0
 Stack:
  ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00
  ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000
  ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80
 Call Trace:
  [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0
  [<ffffffff811dd8b9>] ? seq_read+0x149/0x380
  [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500
  [<ffffffff8138dce0>] ? process_echoes+0x70/0x70
  [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70
  [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0
  [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0
  [<ffffffff811bbe61>] ? vfs_read+0x81/0x120
  [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0
  [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75
 Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00
a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89
 RIP  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
  RSP <ffff88011a05fda0>
 CR2: ffffc90001630024
Signed-off-by: NDave Forster <dforster@brocade.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

94d9f1c5

31 7月, 2016 1 次提交

tcp: consider recv buf for the initial window scale · f626300a

由 Soheil Hassas Yeganeh 提交于 7月 29, 2016

tcp_select_initial_window() intends to advertise a window
scaling for the maximum possible window size. To do so,
it considers the maximum of net.ipv4.tcp_rmem[2] and
net.core.rmem_max as the only possible upper-bounds.
However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE
to set the socket's receive buffer size to values
larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max.
Thus, SO_RCVBUFFORCE is effectively ignored by
tcp_select_initial_window().

To fix this, consider the maximum of net.ipv4.tcp_rmem[2],
net.core.rmem_max and socket's initial buffer space.

Fixes: b0573dea ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options")
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Suggested-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f626300a

27 7月, 2016 1 次提交

net: ipmr/ip6mr: update lastuse on entry change · 90b5ca17

由 Nikolay Aleksandrov 提交于 7月 26, 2016

Currently lastuse is updated on entry creation and cache hit, but it should
also be updated on entry change. Since both on add and update the ttl array
is updated we can simply update the lastuse in ipmr_update_thresholds.
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

90b5ca17

26 7月, 2016 1 次提交

udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skb · ba66bbe5

由 Daniel Borkmann 提交于 7月 25, 2016

After a6127697 ("udp: prevent bugcheck if filter truncates packet
too much"), there followed various other fixes for similar cases such
as f4979fce ("rose: limit sk_filter trim to payload").

Latter introduced a new helper sk_filter_trim_cap(), where we can pass
the trim limit directly to the socket filter handling. Make use of it
here as well with sizeof(struct udphdr) as lower cap limit and drop the
extra skb->len test in UDP's input path.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ba66bbe5

20 7月, 2016 2 次提交

net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow... · b8247f09

由 Shmulik Ladkani 提交于 7月 18, 2016

net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs

Given:
 - tap0 and vxlan0 are bridged
 - vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)

Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).

After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu.

These skbs are accidentally passed to ip_finish_output2 AS IS.
Alas, each final segment (segmented either by validate_xmit_skb or by
hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain networks.

This behavior is not aligned with the NON-GSO case:
Assume a non-gso 1500-sized IP packet arrives from tap0. After
encapsulation, the vxlan datagram is fragmented normally at the
ip_finish_output-->ip_fragment code path.

The expected behavior for the GSO case would be segmenting the
"gso-oversized" skb first, then fragmenting each segment according to
dst mtu, and finally passing the resulting fragments to ip_finish_output2.

'ip_finish_output_gso' already supports this "Slowpath" behavior,
according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
forwarding (not set in the bridged case).

In order to support the bridged case, we'll mark skbs arriving from an
ingress interface that get udp-encaspulated as "allowed to be fragmented",
causing their network_seglen to be validated by 'ip_finish_output_gso'
(and fragment if needed).

Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
gso and non-gso cases), which serves users wishing to forbid
fragmentation at the udp tunnel endpoint.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8247f09

net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags · 359ebda2

由 Shmulik Ladkani 提交于 7月 18, 2016

This flag indicates whether fragmentation of segments is allowed.

Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

359ebda2

19 7月, 2016 1 次提交

netfilter: x_tables: speed up jump target validation · f4dc7771

由 Florian Westphal 提交于 7月 14, 2016

The dummy ruleset I used to test the original validation change was broken,
most rules were unreachable and were not tested by mark_source_chains().

In some cases rulesets that used to load in a few seconds now require
several minutes.

sample ruleset that shows the behaviour:

echo "*filter"
for i in $(seq 0 100000);do
        printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT

[ pipe result into iptables-restore ]

This ruleset will be about 74mbyte in size, with ~500k searches
though all 500k[1] rule entries. iptables-restore will take forever
(gave up after 10 minutes)

Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.

After this change ruleset restore times get again close to what one
gets when reverting 36472341 (~3 seconds on my workstation).

[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -> 500k rule entries

Fixes: 36472341 ("netfilter: x_tables: validate targets of jumps")
Reported-by: NJeff Wu <wujiafu@gmail.com>
Tested-by: NJeff Wu <wujiafu@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f4dc7771

17 7月, 2016 1 次提交

net: ipmr/ip6mr: add support for keeping an entry age · 43b9e127

由 Nikolay Aleksandrov 提交于 7月 14, 2016

In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows to check (and later update) the age of entries.
Relying on stats alone can show activity but not actual age of the entry,
furthermore when there're tens of thousands of entries a lot of the
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out,
these can then be naturally translated into age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the members in that cache line are written to
along with the age.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_EXPIRES attribute.
Also do a minor local variable declaration style adjustment - arrange them
longest to shortest.
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
CC: Satish Ashok <sashok@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

43b9e127

16 7月, 2016 2 次提交

tcp_timer.c: Add kernel-doc function descriptions · c380d37e

由 Richard Sailer 提交于 7月 16, 2016

This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.
Signed-off-by: NRichard Sailer <richard@weltraumpflege.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c380d37e

tcp: enable per-socket rate limiting of all 'challenge acks' · 083ae308

由 Jason Baron 提交于 7月 14, 2016

The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:

commit f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock")

And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.

Since we have the global tcp_challenge_ack_limit, this patch allows for
tcp_challenge_ack_limit to be set to a large value and effectively rely on
the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
still prevents a single connections from consuming the entire challenge ack
quota.

It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This a follow-up to:
Subject: tcp: make challenge acks less predictable

Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: NJason Baron <jbaron@akamai.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

083ae308

12 7月, 2016 5 次提交

ipv4: af_inet: make it explicitly non-modular · d3fc0353

由 Paul Gortmaker 提交于 7月 11, 2016

The Makefile controlling compilation of this file is obj-y,
meaning that it currently is never being built as a module.

Since MODULE_ALIAS is a no-op for non-modular code, we can simply
remove the MODULE_ALIAS_NETPROTO variant used here.

We replace module.h with kmod.h since the file does make use of
request_module() in order to load other modules from here.

We don't have to worry about init.h coming in via the removed
module.h since the file explicitly includes init.h already.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d3fc0353

ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space · 80610229

由 Julian Anastasov 提交于 7月 10, 2016

Vegard Nossum is reporting for a crash in fib_dump_info
when nh_dev = NULL and fib_nhs == 1:

Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
RIP: 0033:[<00000000602b3d18>]
RSP: 0000000062623890  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
Stack:
 626238f0 960226a02 00000400 000000fe
 62623910 600afca7 62623970 62623a48
 62468038 00000018 00000000 00000000
Call Trace:
 [<602b3e93>] rtmsg_fib+0xd3/0x190
 [<602b6680>] fib_table_insert+0x260/0x500
 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
 [<60267079>] netlink_rcv_skb+0xc9/0xe0
 [<60250d4b>] rtnetlink_rcv+0x3b/0x50
 [<60265400>] netlink_unicast+0x1a0/0x2c0
 [<60265e47>] netlink_sendmsg+0x3f7/0x470
 [<6021dc9a>] sock_sendmsg+0x3a/0x90
 [<6021e0d0>] ___sys_sendmsg+0x300/0x360
 [<6021fa64>] __sys_sendmsg+0x54/0xa0
 [<6021fac0>] SyS_sendmsg+0x10/0x20
 [<6001ea68>] handle_syscall+0x88/0x90
 [<600295fd>] userspace+0x3fd/0x500
 [<6001ac55>] fork_handler+0x85/0x90

$ addr2line -e vmlinux -i 0x602b3d18
include/linux/inetdevice.h:222
net/ipv4/fib_semantics.c:1264

Problem happens when RTNH_F_LINKDOWN is provided from user space
when creating routes that do not use the flag, catched with
netlink fuzzer.

Currently, the kernel allows user space to set both flags
to nh_flags and fib_flags but this is not intentional, the
assumption was that they are not set. Fix this by rejecting
both flags with EINVAL.
Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

80610229

tcp: make challenge acks less predictable · 75ff39cc

由 Eric Dumazet 提交于 7月 10, 2016

Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.

This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.

Based on initial analysis and patch from Linus.

Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.

v2: randomize the count of challenge acks per second, not the period.

Fixes: 282f23c6 ("tcp: implement RFC 5961 3.2")
Reported-by: NYue Cao <ycao009@ucr.edu>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75ff39cc

tunnels: correct conditional build of MPLS and IPv6 · aa9667e7

由 Simon Horman 提交于 7月 10, 2016

Using a combination if #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.

This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.

Fixes: 8afe97e5 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aa9667e7

udp: prevent bugcheck if filter truncates packet too much · a6127697

由 Michal Kubeček 提交于 7月 08, 2016

If socket filter truncates an udp packet below the length of UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
kernel is configured that way) can be easily enforced by an unprivileged
user which was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8

Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
Reported-by: NMarco Grassi <marco.gra@gmail.com>
Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a6127697

11 7月, 2016 1 次提交

netfilter: conntrack: fix race between nf_conntrack proc read and hash resize · 64b87639

由 Liping Zhang 提交于 7月 03, 2016

When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows：

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
  IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
  Call Trace:
  [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
  [<ffffffff81261a1c>] seq_read+0x2cc/0x390
  [<ffffffff812a8d62>] proc_reg_read+0x42/0x70
  [<ffffffff8123bee7>] __vfs_read+0x37/0x130
  [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
  [<ffffffff8123cf75>] vfs_read+0x95/0x140
  [<ffffffff8123e475>] SyS_read+0x55/0xc0
  [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4

It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
  while : ; do
    echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
  done
2. open more shells and input the following cmds:
  while : ; do
    cat /proc/net/nf_conntrack
  done
3. just wait a monent, oops will happen soon.

The solution in this patch is based on Florian's Commit 5e3c61f9
("netfilter: conntrack: fix lookup race during hash resize"). And
add a wrapper function nf_conntrack_get_ht to get hash and hsize
suggested by Florian Westphal.
Signed-off-by: NLiping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

64b87639

10 7月, 2016 3 次提交

ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf() · fa17806c

由 Eric Dumazet 提交于 7月 08, 2016

inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.

If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and wont touch precious memory
reserves either and risk dropping incoming packets.

inet_netconf_get_devconf() can also use GFP_KERNEL allocation.

Fixes: edc9e748 ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e551110 ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa17806c

ipip: support MPLS over IPv4 · 1b69e7e6

由 Simon Horman 提交于 7月 07, 2016

Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of existing support for IPv4 over IPv4 and is based of multiple
inner-protocol support for the SIT driver.
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Reviewed-by: NDinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b69e7e6

tunnels: support MPLS over IPv4 tunnels · 8afe97e5

由 Simon Horman 提交于 7月 07, 2016

Extend tunnel support to MPLS over IPv4.  The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Reviewed-by: NDinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8afe97e5

07 7月, 2016 1 次提交

timers, net/ipv4/inet: Initialize connection request timers as pinned · f3438bc7

由 Thomas Gleixner 提交于 7月 04, 2016

Pinned timers must carry the pinned attribute in the timer structure
itself, so convert the code to the new API.

No functional change.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.617891430@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

f3438bc7

03 7月, 2016 1 次提交

netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF · c37a2dfa

由 Joe Perches 提交于 6月 24, 2016

netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with a invflags member.

$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))

Consolidate these macros into a single NF_INVF macro.

Miscellanea:

o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c37a2dfa

01 7月, 2016 2 次提交

netfilter: x_tables: simplify ip{6}table_mangle_hook() · 468b021b

由 Pablo Neira Ayuso 提交于 6月 24, 2016

No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

468b021b

tcp: md5: use kmalloc() backed scratch areas · 19689e38

由 Eric Dumazet 提交于 6月 27, 2016

Some arches have virtually mapped kernel stacks, or will soon have.

tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.

David says that using percpu storage is also problematic on non SMP
builds.

Just use kmalloc() to allocate scratch areas.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19689e38

30 6月, 2016 2 次提交

ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output · fedbb6b4

由 Shmulik Ladkani 提交于 6月 29, 2016

ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).

However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).

OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
 - In case of local sockets, sk is same as skb->sk;
 - In case of a udp tunnel, sk is the tunneling socket.

Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1dd 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fedbb6b4

tcp: add an ability to dump and restore window parameters · b1ed4c4f

由 Andrey Vagin 提交于 6月 27, 2016

We found that sometimes a restored tcp socket doesn't work.

A reason of this bug is incorrect window parameters and in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).

Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.

This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"

Then Alexey recommended me to restore window parameters instead of
adjusted them according with data in a sent queue. This sounds resonable.

rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of reasons why we need to restore snd_wnd was described above.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: NAndrey Vagin <avagin@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1ed4c4f

29 6月, 2016 1 次提交

tcp: do not send too big packets at retransmit time · a3d2e9f8

由 Eric Dumazet 提交于 6月 27, 2016

Arjun reported a bug in TCP stack and bisected it to a recent commit.

In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.

However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.

Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.

Thanks to Arjun for reporting the bug and running the tests !

Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NArjun V <arjun@chelsio.com>
Tested-by: NArjun V <arjun@chelsio.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a3d2e9f8

28 6月, 2016 4 次提交

net: diag: Add support to filter on device index · 637c841d

由 David Ahern 提交于 6月 23, 2016

Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

637c841d

ipmr/ip6mr: Initialize the last assert time of mfc entries. · 70a0dec4

由 Tom Goff 提交于 6月 23, 2016

This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.
Signed-off-by: NTom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70a0dec4

ipv6: Allow request socks to contain IPv6 options. · 56ac42bc

由 Huw Davies 提交于 6月 27, 2016

If set, these will take precedence over the parent's options during
both sending and child creation.  If they're not set, the parent's
options (if any) will be used.

This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.
Signed-off-by: NHuw Davies <huw@codeweavers.com>
Signed-off-by: NPaul Moore <paul@paul-moore.com>

56ac42bc

netlabel: Move bitmap manipulation functions to the NetLabel core. · 3faa8f98

由 Huw Davies 提交于 6月 27, 2016

This is to allow the CALIPSO labelling engine to use these.
Signed-off-by: NHuw Davies <huw@codeweavers.com>
Signed-off-by: NPaul Moore <paul@paul-moore.com>

3faa8f98

24 6月, 2016 1 次提交

netfilter: nf_reject_ipv4: don't send tcp RST if the packet is non-TCP · e1dbbc59

由 Liping Zhang 提交于 6月 20, 2016

In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.

This means that user can add such a rule like follows via nfnetlink:
  "nft add rule filter forward ip protocol udp reject with tcp reset"

This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.
Signed-off-by: NLiping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e1dbbc59

23 6月, 2016 1 次提交

esp: Fix ESN generation under UDP encapsulation · 962fcef3

由 Herbert Xu 提交于 6月 18, 2016

Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.

This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.

Fixes: 7021b2e1 ("esp4: Switch to new AEAD interface")
Reported-by: NBlair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

962fcef3

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功