提交 · 559bf69e3c8f8892c3205417c01cdf8d7823d03c · openeuler / Kernel

25 10月, 2018 4 次提交

net: rtnl_dump_all needs to propagate error from dumpit function · c63586dc

由 David Ahern 提交于 10月 24, 2018

If an address, route or netconf dump request is sent for AF_UNSPEC, then
rtnl_dump_all is used to do the dump across all address families. If one
of the dumpit functions fails (e.g., invalid attributes in the dump
request) then rtnl_dump_all needs to propagate that error so the user
gets an appropriate response instead of just getting no data.

Fixes: effe6792 ("net: Enable kernel side filtering of route dumps")
Fixes: 5fcd266a ("net/ipv4: Add support for dumping addresses for a specific device")
Fixes: 6371a71f ("net/ipv6: Add support for dumping addresses for a specific device")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c63586dc

net: Don't return invalid table id error when dumping all families · ae677bbb

由 David Ahern 提交于 10月 24, 2018

When doing a route dump across all address families, do not error out
if the table does not exist. This allows a route dump for AF_UNSPEC
with a table id that may only exist for some of the families.

Do return the table does not exist error if dumping routes for a
specific family and the table does not exist.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ae677bbb

net/ipv6: Put target net when address dump fails due to bad attributes · 242afaa6

由 David Ahern 提交于 10月 24, 2018

If tgt_net is set based on IFA_TARGET_NETNSID attribute in the dump
request, make sure all error paths call put_net.

Fixes: 6371a71f ("net/ipv6: Add support for dumping addresses for a specific device")
Fixes: ed6eff11 ("net/ipv6: Update inet6_dump_addr for strict data checking")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

242afaa6

net/ipv4: Put target net when address dump fails due to bad attributes · d7e38611

由 David Ahern 提交于 10月 24, 2018

If tgt_net is set based on IFA_TARGET_NETNSID attribute in the dump
request, make sure all error paths call put_net.

Fixes: 5fcd266a ("net/ipv4: Add support for dumping addresses for a specific device")
Fixes: c33078e3 ("net/ipv4: Update inet_dump_ifaddr for strict data checking")
Reported-by: NLi RongQing <lirongqing@baidu.com>
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d7e38611

24 10月, 2018 2 次提交

tcp: add tcp_reset_xmit_timer() helper · 3f80e08f

由 Eric Dumazet 提交于 10月 23, 2018

With EDT model, SRTT no longer is inflated by pacing delays.

This means that RTO and some other xmit timers might be setup
incorrectly. This is particularly visible with either :

- Very small enforced pacing rates (SO_MAX_PACING_RATE)
- Reduced rto (from the default 200 ms)

This can lead to TCP flows aborts in the worst case,
or spurious retransmits in other cases.

For example, this session gets far more throughput
than the requested 80kbit :

$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

540000 262144 262144    104.00      2.66

With the fix :

$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

540000 262144 262144    104.00      0.12

EDT allows for better control of rtx timers, since TCP has
a better idea of the earliest departure time of each skb
in the rtx queue. We only have to eventually add to the
timer the difference of the EDT time with current time.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f80e08f

Revert "net: simplify sock_poll_wait" · 89ab066d

由 Karsten Graul 提交于 10月 23, 2018

This reverts commit dd979b4d.

This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89ab066d

23 10月, 2018 11 次提交

llc: do not use sk_eat_skb() · 604d415e

由 Eric Dumazet 提交于 10月 22, 2018

syzkaller triggered a use-after-free [1], caused by a combination of
skb_get() in llc_conn_state_process() and usage of sk_eat_skb()

sk_eat_skb() is assuming the skb about to be freed is only used by
the current thread. TCP/DCCP stacks enforce this because current
thread holds the socket lock.

llc_conn_state_process() wants to make sure skb does not disappear,
and holds a reference on the skb it manipulates. But as soon as this
skb is added to socket receive queue, another thread can consume it.

This means that llc must use regular skb_unlink() and kfree_skb()
so that both producer and consumer can safely work on the same skb.

[1]
BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in refcount_read include/linux/refcount.h:43 [inline]
BUG: KASAN: use-after-free in skb_unref include/linux/skbuff.h:967 [inline]
BUG: KASAN: use-after-free in kfree_skb+0xb7/0x580 net/core/skbuff.c:655
Read of size 4 at addr ffff8801d1f6fba4 by task ksoftirqd/1/18

CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.19.0-rc8+ #295
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b6 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_read include/linux/refcount.h:43 [inline]
 skb_unref include/linux/skbuff.h:967 [inline]
 kfree_skb+0xb7/0x580 net/core/skbuff.c:655
 llc_sap_state_process+0x9b/0x550 net/llc/llc_sap.c:224
 llc_sap_rcv+0x156/0x1f0 net/llc/llc_sap.c:297
 llc_sap_handler+0x65e/0xf80 net/llc/llc_sap.c:438
 llc_rcv+0x79e/0xe20 net/llc/llc_input.c:208
 __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
 process_backlog+0x218/0x6f0 net/core/dev.c:5829
 napi_poll net/core/dev.c:6249 [inline]
 net_rx_action+0x7c5/0x1950 net/core/dev.c:6315
 __do_softirq+0x30c/0xb03 kernel/softirq.c:292
 run_ksoftirqd+0x94/0x100 kernel/softirq.c:653
 smpboot_thread_fn+0x68b/0xa00 kernel/smpboot.c:164
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

Allocated by task 18:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
 __alloc_skb+0x119/0x770 net/core/skbuff.c:193
 alloc_skb include/linux/skbuff.h:995 [inline]
 llc_alloc_frame+0xbc/0x370 net/llc/llc_sap.c:54
 llc_station_ac_send_xid_r net/llc/llc_station.c:52 [inline]
 llc_station_rcv+0x1dc/0x1420 net/llc/llc_station.c:111
 llc_rcv+0xc32/0xe20 net/llc/llc_input.c:220
 __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
 process_backlog+0x218/0x6f0 net/core/dev.c:5829
 napi_poll net/core/dev.c:6249 [inline]
 net_rx_action+0x7c5/0x1950 net/core/dev.c:6315
 __do_softirq+0x30c/0xb03 kernel/softirq.c:292

Freed by task 16383:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x83/0x290 mm/slab.c:3756
 kfree_skbmem+0x154/0x230 net/core/skbuff.c:582
 __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
 sk_eat_skb include/net/sock.h:2366 [inline]
 llc_ui_recvmsg+0xec2/0x1610 net/llc/af_llc.c:882
 sock_recvmsg_nosec net/socket.c:794 [inline]
 sock_recvmsg+0xd0/0x110 net/socket.c:801
 ___sys_recvmsg+0x2b6/0x680 net/socket.c:2278
 __sys_recvmmsg+0x303/0xb90 net/socket.c:2390
 do_sys_recvmmsg+0x181/0x1a0 net/socket.c:2466
 __do_sys_recvmmsg net/socket.c:2484 [inline]
 __se_sys_recvmmsg net/socket.c:2480 [inline]
 __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2480
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801d1f6fac0
 which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 228 bytes inside of
 232-byte region [ffff8801d1f6fac0, ffff8801d1f6fba8)
The buggy address belongs to the page:
page:ffffea000747dbc0 count:1 mapcount:0 mapping:ffff8801d9be7680 index:0xffff8801d1f6fe80
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0007346e88 ffffea000705b108 ffff8801d9be7680
raw: ffff8801d1f6fe80 ffff8801d1f6f0c0 000000010000000b 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801d1f6fa80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 ffff8801d1f6fb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8801d1f6fb80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
                               ^
 ffff8801d1f6fc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801d1f6fc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

604d415e

net: dsa: legacy: simplify getting .driver_data · 2af1ccd5

由 Wolfram Sang 提交于 10月 21, 2018

We should get 'driver_data' from 'struct device' directly. Going via
platform_device is an unneeded step back and forth.
Signed-off-by: NWolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2af1ccd5

net/sched: act_police: disallow 'goto chain' on fallback control action · c08f5ed5

由 Davide Caratti 提交于 10月 20, 2018

in the following command:

 # tc action add action police rate <r> burst <b> conform-exceed <c1>/<c2>

'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
crash with NULL pointer dereference, since TC core doesn't initialize the
chain handle.
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c08f5ed5

net/sched: act_gact: disallow 'goto chain' on fallback control action · 9469f375

由 Davide Caratti 提交于 10月 20, 2018

in the following command:

 # tc action add action <c1> random <rand_type> <c2> <rand_param>

'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
crash with NULL pointer dereference, since TC core doesn't initialize the
chain handle.
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9469f375

O
net: bpfilter: Set user mode helper's command line · 4b78030b
由 Olivier Brunel 提交于 10月 20, 2018
```
Signed-off-by: NOlivier Brunel <jjk@jjacky.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
4b78030b

net/ipv6: Add support for dumping addresses for a specific device · 6371a71f

由 David Ahern 提交于 10月 19, 2018

If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
header, then return only the addresses for that device.

Since inet6_dump_addr is reused for multicast and anycast addresses,
this adds support for device specfic dumps of RTM_GETMULTICAST and
RTM_GETANYCAST as well.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6371a71f

net/ipv4: Add support for dumping addresses for a specific device · 5fcd266a

由 David Ahern 提交于 10月 19, 2018

If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
header, then return only the addresses for that device.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5fcd266a

net/ipv6: Remove ip_idx arg to in6_dump_addrs · fe884c2b

由 David Ahern 提交于 10月 19, 2018

ip_idx is always 0 going into in6_dump_addrs; it is passed as a pointer
to save the last good index into cb. Since cb is already argument to
in6_dump_addrs, just save the value there.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fe884c2b

net/ipv4: Move loop over addresses on a device into in_dev_dump_addr · 1c98eca4

由 David Ahern 提交于 10月 19, 2018

Similar to IPv6 move the logic that walks over the ipv4 address list
for a device into a helper.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c98eca4

tipc: eliminate message disordering during binding table update · 988f3f16

由 Jon Maloy 提交于 10月 19, 2018

We have seen the following race scenario:
1) named_distribute() builds a "bulk" message, containing a PUBLISH
   item for a certain publication. This is based on the contents of
   the binding tables's 'cluster_scope' list.
2) tipc_named_withdraw() removes the same publication from the list,
   bulds a WITHDRAW message and distributes it to all cluster nodes.
3) tipc_named_node_up(), which was calling named_distribute(), sends
   out the bulk message built under 1)
4) The WITHDRAW message arrives at the just detected node, finds
   no corresponding publication, and is dropped.
5) The PUBLISH item arrives at the same node, is added to its binding
   table, and remains there forever.

This arrival disordering was earlier taken care of by the backlog queue,
originally added for a different purpose, which was removed in the
commit referred to below, but we now need a different solution.
In this commit, we replace the rcu lock protecting the 'cluster_scope'
list with a regular RW lock which comprises even the sending of the
bulk message. This both guarantees both the list integrity and the
message sending order. We will later add a commit which cleans up
this code further.

Note that this commit needs recently added commit d3092b2e ("tipc:
fix unsafe rcu locking when accessing publication list") to apply
cleanly.

Fixes: 37922ea4 ("tipc: permit overlapping service ranges in name table")
Reported-by: NTuong Lien Tong <tuong.t.lien@dektech.com.au>
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

988f3f16

tipc: use destination length for copy string · 29e270fc

由 Guoqing Jiang 提交于 10月 19, 2018

Got below warning with gcc 8.2 compiler.

net/tipc/topsrv.c: In function ‘tipc_topsrv_start’:
net/tipc/topsrv.c:660:2: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
  strncpy(srv->name, name, strlen(name) + 1);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/tipc/topsrv.c:660:27: note: length computed here
  strncpy(srv->name, name, strlen(name) + 1);
                           ^~~~~~~~~~~~
So change it to correct length and use strscpy.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Acked-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29e270fc

21 10月, 2018 4 次提交

ulp: remove uid and user_visible members · c16ee04c

由 Daniel Borkmann 提交于 10月 21, 2018

They are not used anymore and therefore should be removed.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

c16ee04c

Revert "neighbour: force neigh_invalidate when NUD_FAILED update is from admin" · d2fb4fb8

由 Roopa Prabhu 提交于 10月 20, 2018

This reverts commit 8e326289.

This patch results in unnecessary netlink notification when one
tries to delete a neigh entry already in NUD_FAILED state. Found
this with a buggy app that tries to delete a NUD_FAILED entry
repeatedly. While the notification issue can be fixed with more
checks, adding more complexity here seems unnecessary. Also,
recent tests with other changes in the neighbour code have
shown that the INCOMPLETE and PROBE checks are good enough for
the original issue.
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2fb4fb8

net/ipv6: Fix index counter for unicast addresses in in6_dump_addrs · 4ba4c566

由 David Ahern 提交于 10月 19, 2018

The loop wants to skip previously dumped addresses, so loops until
current index >= saved index. If the message fills it wants to save
the index for the next address to dump - ie., the one that did not
fit in the current message.

Currently, it is incrementing the index counter before comparing to the
saved index, and then the saved index is off by 1 - it assumes the
current address is going to fit in the message.

Change the index handling to increment only after a succesful dump.

Fixes: 502a2ffd ("ipv6: convert idev_list to list macros")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4ba4c566

bpf: sk_msg program helper bpf_msg_push_data · 6fff607e

由 John Fastabend 提交于 10月 19, 2018

This allows user to push data into a msg using sk_msg program types.
The format is as follows,

	bpf_msg_push_data(msg, offset, len, flags)

this will insert 'len' bytes at offset 'offset'. For example to
prepend 10 bytes at the front of the message the user can,

	bpf_msg_push_data(msg, 0, 10, 0);

This will invalidate data bounds so BPF user will have to then recheck
data bounds after calling this. After this the msg size will have been
updated and the user is free to write into the added bytes. We allow
any offset/len as long as it is within the (data, data_end) range.
However, a copy will be required if the ring is full and its possible
for the helper to fail with ENOMEM or EINVAL errors which need to be
handled by the BPF program.

This can be used similar to XDP metadata to pass data between sk_msg
layer and lower layers.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

6fff607e

20 10月, 2018 7 次提交

net: fix pskb_trim_rcsum_slow() with odd trim offset · d55bef50

由 Dimitris Michailidis 提交于 10月 19, 2018

We've been getting checksum errors involving small UDP packets, usually
59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
has been complaining that HW is providing bad checksums. Turns out the
problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98
("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").

The source of the problem is that when the bytes we are trimming start
at an odd address, as in the case of the 1 padding byte above,
skb_checksum() returns a byte-swapped value. We cannot just combine this
with skb->csum using csum_sub(). We need to use csum_block_sub() here
that takes into account the parity of the start address and handles the
swapping.

Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().

Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Signed-off-by: NDimitris Michailidis <dmichail@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d55bef50

netpoll: allow cleanup to be synchronous · c9fbd71f

由 Debabrata Banerjee 提交于 10月 18, 2018

This fixes a problem introduced by:
commit 2cde6acd ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")

When using netconsole on a bond, __netpoll_cleanup can asynchronously
recurse multiple times, each __netpoll_free_async call can result in
more __netpoll_free_async's. This means there is now a race between
cleanup_work queues on multiple netpoll_info's on multiple devices and
the configuration of a new netpoll. For example if a netconsole is set
to enable 0, reconfigured, and enable 1 immediately, this netconsole
will likely not work.

Given the reason for __netpoll_free_async is it can be called when rtnl
is not locked, if it is locked, we should be able to execute
synchronously. It appears to be locked everywhere it's called from.

Generalize the design pattern from the teaming driver for current
callers of __netpoll_free_async.

CC: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c9fbd71f

bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079

由 John Fastabend 提交于 10月 18, 2018

Before using the psock returned by sk_psock_get() when adding it to a
sockmap we need to ensure it is actually a sockmap based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

5032d079

bpf: add tests for direct packet access from CGROUP_SKB · 2cb494a3

由 Song Liu 提交于 10月 19, 2018

Tests are added to make sure CGROUP_SKB cannot access:
  tc_classid, data_meta, flow_keys

and can read and write:
  mark, prority, and cb[0-4]

and can read other fields.

To make selftest with skb->sk work, a dummy sk is added in
bpf_prog_test_run_skb().
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

2cb494a3

bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41

由 Song Liu 提交于 10月 19, 2018

BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.

Two helper functions bpf_compute_and_save_data_end() and
bpf_restore_data_end() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

b39b5f41

bpf: add queue and stack maps · f1a2e44a

由 Mauricio Vasquez B 提交于 10月 18, 2018

Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers.  Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.

As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.

Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

f1a2e44a

Revert "bond: take rcu lock in netpoll_send_skb_on_dev" · 48995423

由 David S. Miller 提交于 10月 19, 2018

This reverts commit 6fe94878.

It is causing more serious regressions than the RCU warning
it is fixing.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

48995423

19 10月, 2018 12 次提交

Revert "netfilter: xt_quota: fix the behavior of xt_quota module" · af510ebd

由 Pablo Neira Ayuso 提交于 10月 19, 2018

This reverts commit e9837e55.

When talking to Maze and Chenbo, we agreed to keep this back by now
due to problems in the ruleset listing path with 32-bit arches.
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

af510ebd

netfilter: remove two unused variables. · da8a705c

由 Weongyo Jeong 提交于 10月 17, 2018

nft_dup_netdev_ingress_ops and nft_fwd_netdev_ingress_ops variables are
no longer used at the code.
Signed-off-by: NWeongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

da8a705c

netfilter: nf_flow_table: do not remove offload when other netns's interface is down · a3fb3698

由 Taehee Yoo 提交于 10月 12, 2018

When interface is down, offload cleanup function(nf_flow_table_do_cleanup)
is called and that checks whether interface index of offload and
index of link down interface is same. but only interface index checking
is not enough because flowtable is not pernet list.
So that, if other netns's interface that has index is same with offload
is down, that offload will be removed.
This patch adds netns checking code to the offload cleanup routine.

Fixes: 59c466dd ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

a3fb3698

netfilter: nf_flow_table: remove unnecessary parameter of nf_flow_table_cleanup() · 5f1be84a

由 Taehee Yoo 提交于 10月 12, 2018

parameter net of nf_flow_table_cleanup() is not used.
So that it can be removed.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

5f1be84a

netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine · b7f1a16d

由 Taehee Yoo 提交于 10月 02, 2018

When device is unregistered, flowtable flush routine is called
by notifier_call(nf_tables_flowtable_event). and exit callback of
nftables pernet_operation(nf_tables_exit_net) also has flowtable flush
routine. but when network namespace is destroyed, both notifier_call
and pernet_operation are called. hence flowtable flush routine in
pernet_operation is unnecessary.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 nft add table ip filter
   %ip netns exec vm1 nft add flowtable ip filter w \
	{ hook ingress priority 0\; devices = { lo }\; }
   %ip netns del vm1

splat looks like:
[  265.187019] WARNING: CPU: 0 PID: 87 at net/netfilter/core.c:309 nf_hook_entry_head+0xc7/0xf0
[  265.187112] Modules linked in: nf_flow_table_ipv4 nf_flow_table nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip_tables x_tables
[  265.187390] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc3+ #5
[  265.187453] Workqueue: netns cleanup_net
[  265.187514] RIP: 0010:nf_hook_entry_head+0xc7/0xf0
[  265.187546] Code: 8d 81 68 03 00 00 5b c3 89 d0 83 fa 04 48 8d 84 c7 e8 11 00 00 76 81 0f 0b 31 c0 e9 78 ff ff ff 0f 0b 48 83 c4 08 31 c0 5b c3 <0f> 0b 31 c0 e9 65 ff ff ff 0f 0b 31 c0 e9 5c ff ff ff 48 89 0c 24
[  265.187573] RSP: 0018:ffff88011546f098 EFLAGS: 00010246
[  265.187624] RAX: ffffffff8d90e135 RBX: 1ffff10022a8de1c RCX: 0000000000000000
[  265.187645] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880116298040
[  265.187645] RBP: ffff88010ea4c1a8 R08: 0000000000000000 R09: 0000000000000000
[  265.187645] R10: ffff88011546f1d8 R11: ffffed0022c532c1 R12: ffff88010ea4c1d0
[  265.187645] R13: 0000000000000005 R14: dffffc0000000000 R15: ffff88010ea4c1c4
[  265.187645] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
[  265.187645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  265.187645] CR2: 00007fdfb8d00000 CR3: 0000000057a16000 CR4: 00000000001006f0
[  265.187645] Call Trace:
[  265.187645]  __nf_unregister_net_hook+0xca/0x5d0
[  265.187645]  ? nf_hook_entries_free.part.3+0x80/0x80
[  265.187645]  ? save_trace+0x300/0x300
[  265.187645]  nf_unregister_net_hooks+0x2e/0x40
[  265.187645]  nf_tables_exit_net+0x479/0x1340 [nf_tables]
[  265.187645]  ? find_held_lock+0x39/0x1c0
[  265.187645]  ? nf_tables_abort+0x30/0x30 [nf_tables]
[  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
[  265.187645]  ? trace_hardirqs_on+0x93/0x210
[  265.187645]  ? __bpf_trace_preemptirq_template+0x10/0x10
[  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
[  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
[  265.187645]  ? __mutex_unlock_slowpath+0x17f/0x740
[  265.187645]  ? wait_for_completion+0x710/0x710
[  265.187645]  ? bucket_table_free+0xb2/0x1f0
[  265.187645]  ? nested_table_free+0x130/0x130
[  265.187645]  ? __lock_is_held+0xb4/0x140
[  265.187645]  ops_exit_list.isra.10+0x94/0x140
[  265.187645]  cleanup_net+0x45b/0x900
[ ... ]

This WARNING means that hook unregisteration is failed because
all flowtables hooks are already unregistered by notifier_call.

Network namespace exit routine guarantees that all devices will be
unregistered first. then, other exit callbacks of pernet_operations
are called. so that removing flowtable flush routine in exit callback of
pernet_operation(nf_tables_exit_net) doesn't make flowtable leak.
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

b7f1a16d

ip6_tunnel: Fix encapsulation layout · d4d576f5

由 Stefano Brivio 提交于 10月 18, 2018

Commit 058214a4 ("ip6_tun: Add infrastructure for doing
encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.

As long as the option didn't actually end up in generated packets, this
wasn't an issue. Then commit 89a23c8b ("ip6_tunnel: Fix missing tunnel
encapsulation limit option") fixed sending of this option, and the
resulting layout, e.g. for FoU, is:

.-------------------.------------.----------.-------------------.----- - -
| Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
'-------------------'------------'----------'-------------------'----- - -

Needless to say, FoU and GUE (at least) won't work over IPv6. The option
is appended by default, and I couldn't find a way to disable it with the
current iproute2.

Turn this into a more reasonable:

.-------------------.----------.------------.-------------------.----- - -
| Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
'-------------------'----------'------------'-------------------'----- - -

With this, and with 84dad559 ("udp6: fix encap return code for
resubmitting"), FoU and GUE work again over IPv6.

Fixes: 058214a4 ("ip6_tun: Add infrastructure for doing encapsulation")
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4d576f5

tcp: fix TCP_REPAIR xmit queue setup · 79861919

由 Eric Dumazet 提交于 10月 18, 2018

Andrey reported the following warning triggered while running CRIU tests:

tcp_clean_rtx_queue()
...
	last_ackt = tcp_skb_timestamp_us(skb);
	WARN_ON_ONCE(last_ackt == 0);

This is caused by 5f6188a8 ("tcp: do not change tcp_wstamp_ns
in tcp_mstamp_refresh"), as we end up having skbs in retransmit queue
with a zero skb->skb_mstamp_ns field.

We could fix this bug in different ways, like making sure
tp->tcp_wstamp_ns is not zero at socket creation, but as Neal pointed
out, we also do not want that pacing status of a repaired socket
could push tp->tcp_wstamp_ns far ahead in the future.

So we prefer changing tcp_write_xmit() to not call tcp_update_skb_after_send()
and instead do what is requested by TCP_REPAIR logic.

Fixes: 5f6188a8 ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NAndrey Vagin <avagin@openvz.org>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79861919

tipc: fix info leak from kernel tipc_event · b06f9d9f

由 Jon Maloy 提交于 10月 18, 2018

We initialize a struct tipc_event allocated on the kernel stack to
zero to avert info leak to user space.

Reported-by: syzbot+057458894bc8cada4dee@syzkaller.appspotmail.com
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b06f9d9f

net: socket: fix a missing-check bug · b6168562

由 Wenwen Wang 提交于 10月 18, 2018

In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
statement to see whether it is necessary to pre-process the ethtool
structure, because, as mentioned in the comment, the structure
ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
is allocated through compat_alloc_user_space(). One thing to note here is
that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
partially determined by 'rule_cnt', which is actually acquired from the
user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
get_user(). After 'rxnfc' is allocated, the data in the original user-space
buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
including the 'rule_cnt' field. However, after this copy, no check is
re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
race to change the value in the 'compat_rxnfc->rule_cnt' between these two
copies. Through this way, the attacker can bypass the previous check on
'rule_cnt' and inject malicious data. This can cause undefined behavior of
the kernel and introduce potential security risk.

This patch avoids the above issue via copying the value acquired by
get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
Signed-off-by: NWenwen Wang <wang6495@umn.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b6168562

net: sched: Fix for duplicate class dump · 3c53ed8f

由 Phil Sutter 提交于 10月 18, 2018

When dumping classes by parent, kernel would return classes twice:

| # tc qdisc add dev lo root prio
| # tc class show dev lo
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| # tc class show dev lo parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:

This comes from qdisc_match_from_root() potentially returning the root
qdisc itself if its handle matched. Though in that case, root's classes
were already dumped a few lines above.

Fixes: cb395b20 ("net: sched: optimize class dumps")
Signed-off-by: NPhil Sutter <phil@nwl.cc>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c53ed8f

sctp: use sk_wmem_queued to check for writable space · cd305c74

由 Xin Long 提交于 10月 17, 2018

sk->sk_wmem_queued is used to count the size of chunks in out queue
while sk->sk_wmem_alloc is for counting the size of chunks has been
sent. sctp is increasing both of them before enqueuing the chunks,
and using sk->sk_wmem_alloc to check for writable space.

However, sk_wmem_alloc is also increased by 1 for the skb allocked
for sending in sctp_packet_transmit() but it will not wake up the
waiters when sk_wmem_alloc is decreased in this skb's destructor.

If msg size is equal to sk_sndbuf and sendmsg is waiting for sndbuf,
the check 'msg_len <= sctp_wspace(asoc)' in sctp_wait_for_sndbuf()
will keep waiting if there's a skb allocked in sctp_packet_transmit,
and later even if this skb got freed, the waiting thread will never
get waked up.

This issue has been there since very beginning, so we change to use
sk->sk_wmem_queued to check for writable space as sk_wmem_queued is
not increased for the skb allocked for sending, also as TCP does.

SOCK_SNDBUF_LOCK check is also removed here as it's for tx buf auto
tuning which I will add in another patch.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd305c74

sctp: count both sk and asoc sndbuf with skb truesize and sctp_chunk size · 605c0ac1

由 Xin Long 提交于 10月 17, 2018

Now it's confusing that asoc sndbuf_used is doing memory accounting with
SCTP_DATA_SNDSIZE(chunk) + sizeof(sk_buff) + sizeof(sctp_chunk) while sk
sk_wmem_alloc is doing that with skb->truesize + sizeof(sctp_chunk).

It also causes sctp_prsctp_prune to count with a wrong freed memory when
sndbuf_policy is not set.

To make this right and also keep consistent between asoc sndbuf_used, sk
sk_wmem_alloc and sk_wmem_queued, use skb->truesize + sizeof(sctp_chunk)
for them.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

605c0ac1

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功