Commits · 65a2022e89a4760f9702837e2d9d15a39a9c68a3 · openanolis / cloud-kernel

11 May, 2018 · 7 commits

net/ipv6: Add fib lookup stubs for use in bpf helper · 65a2022e

Authored by David Ahern on May 09, 2018

Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.

The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Update fib6 tracepoint to take fib6_info · d4bea421

Authored by David Ahern on May 09, 2018

Similar to IPv4, IPv6 should use the FIB lookup result in the
tracepoint.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Add fib6_lookup · 138118ec

Authored by David Ahern on May 09, 2018

Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules,
but returns a FIB entry, fib6_info, rather than a dst based rt6_info.
fib6_lookup is anywhere from 140% (MULTIPLE_TABLES config disabled)
to 60% faster than any of the dst based lookup methods (without custom
rules) and 25% faster with custom rules (e.g., l3mdev rule).

Since the lookup function has a completely different signature,
fib6_rule_action is split into 2 paths: the existing one is
renamed __fib6_rule_action and a new one for the fib6_info path
is added. fib6_rule_action decides which to call based on the
lookup_ptr. If it is fib6_table_lookup then the new path is taken.

Caller must hold rcu lock as no reference is taken on the returned
fib entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Refactor fib6_rule_action · cc065a9e

Authored by David Ahern on May 09, 2018

Move source address lookup from fib6_rule_action to a helper. It will be
used in a later patch by a second variant for fib6_rule_action.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Extract table lookup from ip6_pol_route · 1d053da9

Authored by David Ahern on May 09, 2018

ip6_pol_route is used for ingress and egress FIB lookups. Refactor it
moving the table lookup into a separate fib6_table_lookup that can be
invoked separately and export the new function.

ip6_pol_route now calls fib6_table_lookup and uses the result to generate
a dst based rt6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Rename rt6_multipath_select · 3b290a31

Authored by David Ahern on May 09, 2018

Rename rt6_multipath_select to fib6_multipath_select and export it.
A later patch wants access to it similar to IPv4's fib_select_path.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/ipv6: Rename fib6_lookup to fib6_node_lookup · 6454743b

Authored by David Ahern on May 09, 2018

Rename fib6_lookup to fib6_node_lookup to better reflect what it
returns. The fib6_lookup name will be used in a later patch for
an IPv6 equivalent to IPv4's fib_lookup.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

08 May, 2018 · 2 commits

net: ipv6/gre: Add GRO support · 0c1dd2a1

Authored by Eran Ben Elisha on May 07, 2018

Add GRO capability for IPv6 GRE tunnel and ip6erspan tap, via gro_cells
infrastructure.

Performance testing: 55% higher bandwidth.
Measuring bandwidth of 1 thread IPv4 TCP traffic over IPv6 GRE tunnel
while GRO on the physical interface is disabled.
CPU: Intel Xeon E312xx (Sandy Bridge)
NIC: Mellanox Technologies MT27700 Family [ConnectX-4]
Before (GRO not working in tunnel) : 2.47 Gbits/sec
After  (GRO working in tunnel)     : 3.85 Gbits/sec
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv6: Fix typo in ipv6_find_hdr() documentation · 6f2f8212

Authored by Tariq Toukan on May 07, 2018

Fix 'an' into 'and', and use a comma instead of a period.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

07 May, 2018 · 2 commits

netfilter: nf_nat: remove unused ct arg from lookup functions · 3a2e86f6

Authored by Florian Westphal on Apr 26, 2018

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ip6t_srh: extend SRH matching for previous, next and last SID · c1c7e44b

Authored by Ahmed Abdelsalam on Apr 25, 2018

IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed
by an SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

When a firewall receives an SR encapsulated packet, it should be able
to identify which node previously processed the packet (previous SID),
which node is going to process the packet next (next SID), and which
node is the last to process the packet (last SID), which represents the
final destination of the packet in case of inline SR mode.

An example use-case of these features could be a SID list that
includes two firewalls. When the second firewall receives a packet,
it can check whether the packet has been processed by the first firewall
or not. Based on that check, it decides to apply all rules, apply just
a subset of the rules, or totally skip all rules and forward the packet
to the next SID.

This patch extends SRH match to support matching previous SID, next SID,
and last SID.
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

05 May, 2018 · 1 commit

net/ipv6: rename rt6_next to fib6_next · 8fb11a9a

Authored by David Ahern on May 04, 2018

This slipped through the cracks in the followup set to the fib6_info flip.
Rename rt6_next to fib6_next.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 May, 2018 · 2 commits

ip6_gre: correct the function name in ip6gre_tnl_addr_conflict() comment · 7ccbdff1

Authored by Sun Lianwen on May 03, 2018

The function name is wrong in the ip6gre_tnl_addr_conflict() comment,
which uses ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.
Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6" · 30ca22e4

Authored by Ido Schimmel on May 02, 2018

This reverts commit edd7ceb7 ("ipv6: Allow non-gateway ECMP for
IPv6").

Eric reported a division by zero in rt6_multipath_rebalance(), which is
caused by the above commit considering identical local routes to be
siblings. The division by zero happens because a nexthop weight is not
set for local routes.
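
The failure mode described above can be illustrated with a small sketch. This is not the kernel's rt6_multipath_rebalance(); it is a simplified Python model that assumes each sibling's hash upper bound is derived by dividing a cumulative weight by the total weight. The names `upper_bounds` and `rebalance_is_safe`, and the 31-bit scaling, are assumptions made for illustration only.

```python
def upper_bounds(weights):
    """Return a cumulative hash upper bound per nexthop (simplified model)."""
    total = sum(weights)
    bounds = []
    cumulative = 0
    for weight in weights:
        cumulative += weight
        # Scale the cumulative share into a 31-bit hash space, rounding to
        # nearest. This division faults when the total weight is zero.
        bounds.append(((cumulative << 31) + total // 2) // total - 1)
    return bounds


def rebalance_is_safe(weights):
    """Report whether the bounds can be computed without dividing by zero."""
    try:
        upper_bounds(weights)
    except ZeroDivisionError:
        return False
    return True
```

In this model, two siblings with weight 1 each split the hash space evenly, while two siblings whose weights were never set (both 0) hit the division by zero.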

Revert the commit as it does not fix a bug and has side effects.

To reproduce:

# ip -6 address add 2001:db8::1/64 dev dummy0
# ip -6 address add 2001:db8::1/64 dev dummy1

Fixes: edd7ceb7 ("ipv6: Allow non-gateway ECMP for IPv6")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02 May, 2018 · 3 commits

ipv6: Allow non-gateway ECMP for IPv6 · edd7ceb7

Authored by Thomas Winter on May 01, 2018

It is valid to have static routes where the nexthop
is an interface rather than an address, such as tunnels.
For IPv4 it was possible to use ECMP on these routes
but not for IPv6.
Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Cc: David Ahern <dsahern@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: disable gso with no_check_tx · a8c744a8

Authored by Willem de Bruijn on Apr 30, 2018

Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.

  RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
   skb_gso_segment include/linux/netdevice.h:4038 [inline]
   validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
   __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
   dev_queue_xmit+0x17/0x20 net/core/dev.c:3618

UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.

After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).

Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fix uninit-value in ip6_multipath_l3_keys() · cea67a2d

Authored by Eric Dumazet on Apr 29, 2018

syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(),
root caused to a bad assumption of ICMP header being already
pulled in skb->head.

ip_multipath_l3_keys() does the correct thing, so it is an IPv6 only bug.

BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
 ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
 rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
 ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884
 ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208
 __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
 vfs_write+0x463/0x8d0 fs/read_write.c:544
 SYSC_write+0x172/0x360 fs/read_write.c:589
 SyS_write+0x55/0x80 fs/read_write.c:581
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 23aebdac ("ipv6: Compute multipath hash for ICMP errors from offending packet")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01 May, 2018 · 2 commits

change the comment of vti6_ioctl · 154a8c46

Authored by Sun Lianwen on Apr 29, 2018

The comment of vti6_ioctl() is wrong: it uses vti6_tnl_ioctl
instead of vti6_ioctl.
Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

ipv6: sr: extract the right key values for "seg6_make_flowlabel" · 6df93462

Authored by Ahmed Abdelsalam on Apr 28, 2018

The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the
flowlabel from a given skb. It relies on skb_get_hash() which eventually
calls __skb_flow_dissect() to extract the flow_keys struct values from
the skb.

In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(),
skb_reset_network_header(), and skb_mac_header_rebuild() results in a
flow_keys struct with all key values set to zero.

This patch calls seg6_make_flowlabel() before resetting the headers of skb
to get the right key values.

Extracted key values are based on the type of inner packet as follows:
1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet.
2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port.
3) L2 traffic: depends on what kind of traffic is carried in the L2
frame. IPv6 and IPv4 traffic works as discussed in 1) and 2).

Here is a hex dump of struct flow_keys for IPv4 and IPv6 traffic:
10.100.1.100: 47302 > 30.0.0.2: 5001
00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
00000020: 0a 64 01 64

fc00:a1:a > b2::2
00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
00000030: 00 00 00 00 00 00 00 00 00 00 00 0a
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Apr, 2018 · 2 commits

erspan: auto detect truncated packets. · 1baf5ebf

Authored by William Tu on Apr 27, 2018

Currently the truncated bit is set only when the mirrored packet
is larger than the MTU.  For certain cases, the packet might already
have been truncated before being sent to the erspan tunnel.  In this
case, the patch detects whether the IP header's total length is larger
than the actual skb->len.  If true, this indicates that the
mirrored packet is truncated, and the erspan truncate bit is set.
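
The check described above can be sketched outside the kernel. The following is a minimal Python illustration, assuming a raw IPv4 packet held in a bytes object whose length stands in for skb->len; the helper names (`ipv4_total_length`, `looks_truncated`, `make_ipv4_packet`) are hypothetical and do not appear in the patch.

```python
import struct


def ipv4_total_length(packet: bytes) -> int:
    """Read the 16-bit Total Length field at bytes 2-3 of an IPv4 header."""
    (total_length,) = struct.unpack_from("!H", packet, 2)
    return total_length


def looks_truncated(packet: bytes) -> bool:
    """True when the header claims more bytes than the buffer actually holds."""
    return ipv4_total_length(packet) > len(packet)


def make_ipv4_packet(total_length: int, actual_length: int) -> bytes:
    """Build a dummy packet: a minimal 20-byte header plus zero padding."""
    header = bytearray(20)
    header[0] = 0x45  # version 4, header length 5 words
    struct.pack_into("!H", header, 2, total_length)
    return bytes(header) + bytes(actual_length - 20)
```

A packet whose header advertises 1500 bytes but which only carries 100 bytes would be flagged, while one whose advertised and actual lengths agree would not.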

I tested the patch using bpf_skb_change_tail helper function to
shrink the packet size and send to erspan tunnel.
Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82

Authored by Eric Dumazet on Apr 27, 2018

When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
  This operation does not involve any TCP locking.

2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
  the transfer of pages from skbs to one VMA.
  This operation only uses down_read(&current->mm->mmap_sem) after
  holding TCP lock, thus solving the lockdep issue.

This new implementation was suggested by Andy Lutomirski with great details.

Benefits are :

- Better scalability, in case multiple threads reuse VMAs
   (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery.
   The previous mmap() model had to provide the expected size of the
   mapping. If for some reason one part could not be mapped (partial MSS),
   the whole operation had to be aborted.
   With the tcp_zerocopy_receive struct, kernel can report how
   many bytes were successfully mapped, and how many bytes should
   be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed while mmap_sem has been released

Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)

Note that memcg might require additional changes.

Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Apr, 2018 · 5 commits

udp: add gso segment cmsg · 2e8de857

Authored by Willem de Bruijn on Apr 26, 2018

Allow specifying segment size in the send call.

The new control message performs the same function as socket option
UDP_SEGMENT while avoiding the extra system call.

[ Export udp_cmsg_send for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: paged allocation with gso · 15e36f5b

Authored by Willem de Bruijn on Apr 26, 2018

When sending large datagrams that are later segmented, store data in
page frags to avoid copying from linear in skb_segment.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: generate gso with UDP_SEGMENT · bec1f6f6

Authored by Willem de Bruijn on Apr 26, 2018

Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
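
The segmentation rule above (fixed-size segments, with a shorter final one when the total is not an exact multiple) can be sketched as follows. This is a plain Python model of the arithmetic only, not of the kernel data path; `split_into_segments` is a hypothetical name and no socket option is actually set here.

```python
def split_into_segments(payload: bytes, gso_size: int) -> list:
    """Split one large payload into datagram payloads of at most gso_size bytes."""
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    # Every slice is gso_size bytes long except possibly the last one.
    return [payload[offset:offset + gso_size]
            for offset in range(0, len(payload), gso_size)]
```

For instance, a 3000-byte payload with a segment size of 1200 would yield two full segments and one 600-byte tail.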

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
     3197 MB/s 54232 msg/s 54232 calls/s
         6,457,754,262      cycles

    tcp gso
     1765 MB/s 29939 msg/s 29939 calls/s
        11,203,021,806      cycles

    tcp without tso/gso *
      739 MB/s 12548 msg/s 12548 calls/s
        11,205,483,630      cycles

    udp
      876 MB/s 14873 msg/s 624666 calls/s
        11,205,777,429      cycles

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

   [*] after reverting commit 0a6b2a1d
       ("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

  perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: add udp gso · ee80d1eb

Authored by Willem de Bruijn on Apr 26, 2018

Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.

UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.

The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).

IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.

[ Export __udp_gso_segment for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: expose inet cork to udp · 1cd7884d

Authored by Willem de Bruijn on Apr 26, 2018

UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.

This patch is a noop otherwise.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Apr, 2018 · 3 commits

xfrm: remove VLA usage in __xfrm6_sort() · c926ca16

Authored by Kees Cook on Apr 25, 2018

In the quest to remove all stack VLA usage from the kernel[1],
just use XFRM_MAX_DEPTH as already done for the "class" array. In one
case, it'll do this loop up to 5, the other caller up to 6.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou <andreaschristofo@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

ipv6: addrconf: don't evaluate keep_addr_on_down twice · 0aef78aa

Authored by Ivan Vecera on Apr 24, 2018

The addrconf_ifdown() evaluates keep_addr_on_down state twice. There
is no need to do it.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode · b5facfdb

Authored by Ahmed Abdelsalam on Apr 24, 2018

ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple (src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the absence of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and sets it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per namespace 'seg6_flowlabel' sysctl value.

The currently supported behaviours are as follows:
-1 set flowlabel to zero.
0 copy flowlabel from the inner packet in case of inner IPv6
(set flowlabel to 0 in case of IPv4/L2).
1 compute the flowlabel using seg6_make_flowlabel().
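
The three sysctl behaviours listed above can be sketched as a small decision function. This is an illustrative Python model under stated assumptions, not the kernel code: `outer_flowlabel` and its parameters are hypothetical names, the packet hash is passed in rather than computed from an skb, and the way the kernel folds the hash into 20 bits is not modelled beyond a simple mask.

```python
FLOWLABEL_MASK = 0x000FFFFF  # the IPv6 flow label field is 20 bits wide


def outer_flowlabel(mode, inner_is_ipv6, inner_flowlabel, packet_hash):
    """Pick the outer flow label for a given 'seg6_flowlabel' style mode."""
    if mode > 0:
        # Mode 1: derive the label from a hash of the inner packet.
        return packet_hash & FLOWLABEL_MASK
    if mode == 0 and inner_is_ipv6:
        # Mode 0: copy the inner label, which only exists for inner IPv6.
        return inner_flowlabel & FLOWLABEL_MASK
    # Mode -1, or mode 0 with an IPv4/L2 payload: the label is zero.
    return 0
```

With mode 1, an IPv4 or L2 payload still receives a non-zero label as long as the hash is non-zero, which is the property the commit message relies on for load balancing.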

This patch has been tested for IPv6, IPv4, and L2 traffic.
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

25 Apr, 2018 · 1 commit

net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt() · 091311de

Authored by Eric Dumazet on Apr 24, 2018

rt6_remove_exception_rt() is called under rcu_read_lock() only.

We lock rt6_exception_lock a bit later, so we do not hold
rt6_exception_lock yet.

Fixes: 8a14e46f ("net/ipv6: Fix missing rcu dereferences on from")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Apr, 2018 · 7 commits

netfilter: x_tables: remove duplicate ip6t_get_target function call · 4351bef0

Authored by Taehee Yoo on Apr 09, 2018

In the check_target, ip6t_get_target is called twice.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add NAT support for shifted portmap ranges · 2eb0f624

Authored by Thierry Du Tre on Apr 04, 2018

This is a patch proposal to support shifted ranges in portmaps. (i.e. tcp/udp
incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100)

Currently DNAT only works for single port or identical port ranges. (i.e.
ports 5000-5100 on WAN interface redirected to a LAN host while original
destination port is not altered) When different port ranges are configured,
either 'random' mode should be used, or else all incoming connections are
mapped onto the first port in the redirect range. (in described example
WAN:5000-5100 will all be mapped to 192.168.1.5:2000)

This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET
which uses a base port value to calculate an offset with the destination port
present in the incoming stream. That offset is then applied as index within the
redirect port range (index modulo rangewidth to handle range overflow).

In described example the base port would be 5000. An incoming stream with
destination port 5004 would result in an offset value 4 which means that the
NAT'ed stream will be using destination port 2004.
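
The offset calculation described above can be sketched in a few lines. This is a minimal Python model of the arithmetic, assuming an inclusive redirect range; `map_shifted_port` and its parameter names are hypothetical and are not taken from the patch.

```python
def map_shifted_port(dst_port, base_port, range_min, range_max):
    """Map an incoming destination port into the redirect range by offset."""
    width = range_max - range_min + 1   # number of ports in the redirect range
    offset = dst_port - base_port       # distance from the configured base port
    # The modulo wraps the offset back into the range on overflow.
    return range_min + offset % width
```

With base port 5000 and redirect range 2000-2100, destination port 5004 maps to 2004, matching the example in the text; an offset one past the end of the range wraps around to the first redirect port.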

Other possibilities include deterministic mapping of larger or multiple ranges
to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port
51xx)

This patch does not change any current behavior. It just adds new NAT proto
range functionality which must be selected via the specific flag when intended
to use.

A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed
which makes this functionality immediately available.
Signed-off-by: Thierry Du Tre <thierry@dtsystems.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_flow_table: move init code to nf_flow_table_core.c · a268de77

Authored by Felix Fietkau on Feb 26, 2018

Reduces duplication of .gc and .params in flowtable type definitions and
makes the API clearer.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table · a908fdec

Authored by Felix Fietkau on Feb 26, 2018

Useful as preparation for adding iptables support for offload.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

net/ipv6: Fix missing rcu dereferences on from · 8a14e46f

Authored by David Ahern on Apr 23, 2018

kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
   sure it is always called with rcu lock held.

2. change rt6_do_redirect to take a reference on 'from' when accessed
   the first time so it can be used the second time outside of the lock.

Fixes: a68886a6 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv6: add rcu locking to ip6_negative_advice · c3c14da0

Authored by David Ahern on Apr 23, 2018

syzbot reported a suspicious rcu_dereference_check:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
  dst_negative_advice include/net/sock.h:1786 [inline]
  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911

Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.

Fixes: a68886a6 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy · aa8f8778

Authored by Eric Dumazet on Apr 22, 2018

KMSAN reported use of uninit-value that I tracked to lack
of proper size check on RTA_TABLE attribute.

I also believe RTA_PREFSRC lacks a similar check.
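
The kind of size check at issue can be sketched as follows. This is a simplified Python model of attribute validation against a policy table, not the kernel's netlink parser; the `POLICY` table, its minimum lengths (4 bytes for a 32-bit table id, 16 bytes for an IPv6 address) and the `attribute_is_valid` helper are assumptions made for illustration.

```python
# Minimum payload length per attribute, as an illustrative policy table.
POLICY = {
    "RTA_TABLE": 4,     # assumed: a 32-bit table id
    "RTA_PREFSRC": 16,  # assumed: a full IPv6 address
}


def attribute_is_valid(name, payload):
    """Accept an attribute only if its payload is long enough for its type."""
    required = POLICY.get(name)
    if required is None:
        # No policy entry means no length check at all, which is how a
        # too-short payload can end up being read as uninitialized data.
        return True
    return len(payload) >= required
```

In this model, a 2-byte RTA_TABLE payload is rejected once the attribute has a policy entry, whereas an attribute missing from the table passes unchecked.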

Fixes: 86872cb5 ("[IPv6] route: FIB6 configuration using struct fib6_config")
Fixes: c3968a85 ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Apr, 2018 · 2 commits

net: fib_rules: add extack support · b16fb418

Authored by Roopa Prabhu on Apr 21, 2018

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts · a957fa19

Authored by Ahmed Abdelsalam on Apr 20, 2018

In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.

The net_device is required for set_tun_src(). However, calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results in the following bug.

Using just dst->dev should fix this BUG.

[  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[  196.243329] Oops: 0000 [#1] SMP PTI
[  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[  196.247804] Call Trace:
[  196.247972]  seg6_do_srh+0x15b/0x1c0
[  196.248156]  seg6_output+0x3c/0x220
[  196.248341]  ? prandom_u32+0x14/0x20
[  196.248526]  ? ip_idents_reserve+0x6c/0x80
[  196.248723]  ? __ip_select_ident+0x90/0x100
[  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
[  196.249133]  lwtunnel_output+0x44/0x70
[  196.249328]  ip_send_skb+0x15/0x40
[  196.249515]  raw_sendmsg+0x8c3/0xac0
[  196.249701]  ? _copy_from_user+0x2e/0x60
[  196.249897]  ? rw_copy_check_uvector+0x53/0x110
[  196.250106]  ? _copy_from_user+0x2e/0x60
[  196.250299]  ? copy_msghdr_from_user+0xce/0x140
[  196.250508]  sock_sendmsg+0x36/0x40
[  196.250690]  ___sys_sendmsg+0x292/0x2a0
[  196.250881]  ? _cond_resched+0x15/0x30
[  196.251074]  ? copy_termios+0x1e/0x70
[  196.251261]  ? _copy_to_user+0x22/0x30
[  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
[  196.251782]  ? _cond_resched+0x15/0x30
[  196.251972]  ? mutex_lock+0xe/0x30
[  196.252152]  ? vvar_fault+0xd2/0x110
[  196.252337]  ? __do_fault+0x1f/0xc0
[  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
[  196.252727]  ? __sys_sendmsg+0x63/0xa0
[  196.252919]  __sys_sendmsg+0x63/0xa0
[  196.253107]  do_syscall_64+0x72/0x200
[  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  196.253530] RIP: 0033:0x7fc4480b0690
[  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[  196.256445] CR2: 0000000000000000
[  196.256676] ---[ end trace 71af7d093603885c ]---

Fixes: 8936ef76 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22 Apr, 2018 · 1 commit

net/ipv6: Remove unnecessary check on f6i in fib6_check · 8ae86971

Authored by David Ahern on Apr 20, 2018

Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openanolis / cloud-kernel: last synced successfully nearly 2 years ago