提交 · af545bb5ee53f5261db631db2ac4cde54038bdaf · openeuler / Kernel

27 10月, 2020 1 次提交

vsock: use ns_capable_noaudit() on socket create · af545bb5

由 Jeff Vander Stoep 提交于 10月 23, 2020

During __vsock_create() CAP_NET_ADMIN is used to determine if the
vsock_sock->trusted should be set to true. This value is used later
for determing if a remote connection should be allowed to connect
to a restricted VM. Unfortunately, if the caller doesn't have
CAP_NET_ADMIN, an audit message such as an selinux denial is
generated even if the caller does not want a trusted socket.

Logging errors on success is confusing. To avoid this, switch the
capable(CAP_NET_ADMIN) check to the noaudit version.
Reported-by: NRoman Kiryanov <rkir@google.com>
https://android-review.googlesource.com/c/device/generic/goldfish/+/1468545/Signed-off-by: NJeff Vander Stoep <jeffv@google.com>
Reviewed-by: NJames Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/20201023143757.377574-1-jeffv@google.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

af545bb5

24 10月, 2020 1 次提交

tcp: Prevent low rmem stalls with SO_RCVLOWAT. · 435ccfa8

由 Arjun Roy 提交于 10月 23, 2020

With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.

Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.

Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: NArjun Roy <arjunroy@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

435ccfa8

23 10月, 2020 2 次提交

tcp: fix to update snd_wl1 in bulk receiver fast path · 18ded910

由 Neal Cardwell 提交于 10月 22, 2020

In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:
after(ack_seq, tp->snd_wl1)

If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".

The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.

This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).

This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Reported-by: NApollon Oikonomopoulos <apoikos@dmesg.gr>
Tested-by: NApollon Oikonomopoulos <apoikos@dmesg.gr>
Link: https://www.spinics.net/lists/netdev/msg692430.htmlAcked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

18ded910

net: Properly typecast int values to set sk_max_pacing_rate · 700465fd

由 Ke Li 提交于 10月 22, 2020

In setsockopt(SO_MAX_PACING_RATE) on 64bit systems, sk_max_pacing_rate,
after extended from 'u32' to 'unsigned long', takes unintentionally
hiked value whenever assigned from an 'int' value with MSB=1, due to
binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes
0xFFFFFFFF80000000.

Thus inflated sk_max_pacing_rate causes subsequent getsockopt to return
~0U unexpectedly. It may also result in increased pacing rate.

Fix by explicitly casting the 'int' value to 'unsigned int' before
assigning it to sk_max_pacing_rate, for zero extension to happen.

Fixes: 76a9ebe8 ("net: extend sk_pacing_rate to unsigned long")
Signed-off-by: NJi Li <jli@akamai.com>
Signed-off-by: NKe Li <keli@akamai.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

700465fd

22 10月, 2020 3 次提交

netfilter: nf_fwd_netdev: clear timestamp in forwarding path · c77761c8

由 Pablo Neira Ayuso 提交于 10月 21, 2020

Similar to 7980d2ea ("ipvs: clear skb->tstamp in forwarding path").
fq qdisc requires tstamp to be cleared in forwarding path.

Fixes: 8203e2d8 ("net: clear skb->tstamp in forwarding paths")
Fixes: fb420d5d ("tcp/fq: move back to CLOCK_MONOTONIC")
Fixes: 80b14dee ("net: Add a new socket option for a future transmit time.")
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c77761c8

rtnetlink: fix data overflow in rtnl_calcit() · ebfe3c51

由 Di Zhu 提交于 10月 21, 2020

"ip addr show" command execute error when we have a physical
network card with a large number of VFs

The return value of if_nlmsg_size() in rtnl_calcit() will exceed
range of u16 data type when any network cards has a larger number of
VFs. rtnl_vfinfo_size() will significant increase needed dump size when
the value of num_vfs is larger.

Eventually we get a wrong value of min_ifinfo_dump_size because of overflow
which decides the memory size needed by netlink dump and netlink_dump()
will return -EMSGSIZE because of not enough memory was allocated.

So fix it by promoting min_dump_alloc data type to u32 to
avoid whole netlink message size overflow and it's also align
with the data type of struct netlink_callback{}.min_dump_alloc
which is assigned by return value of rtnl_calcit()
Signed-off-by: NDi Zhu <zhudi21@huawei.com>
Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

ebfe3c51

bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop · ba452c9e

由 Toke Høiland-Jørgensen 提交于 10月 20, 2020

Based on the discussion in [0], update the bpf_redirect_neigh() helper to
accept an optional parameter specifying the nexthop information. This makes
it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
incurring a duplicate FIB lookup - since the FIB lookup helper will return
the nexthop information even if no neighbour is present, this can simply
be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.

  [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk

ba452c9e

21 10月, 2020 12 次提交

mptcp: depends on IPV6 but not as a module · 0ed37ac5

由 Matthieu Baerts 提交于 10月 21, 2020

Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP IPv6'
support also depends on CONFIG_IPV6. But not all functions from IPv6
code are exported.

To simplify the code and reduce modifications outside MPTCP, it was
decided from the beginning to support MPTCP with IPv6 only if
CONFIG_IPV6 was built inlined. That's also why CONFIG_MPTCP_IPV6 was
created. More modifications are needed to support CONFIG_IPV6=m.

Even if it was not explicit, until recently, we were forcing CONFIG_IPV6
to be built-in because we had "select IPV6" in Kconfig. Now that we have
"depends on IPV6", we have to explicitly set "IPV6=y" to force
CONFIG_IPV6 not to be built as a module.

In other words, we can now only have CONFIG_MPTCP_IPV6=y if
CONFIG_IPV6=y.

Note that the new dependency might hide the fact IPv6 is not supported
in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did
before was forcing it to be built-in while it was maybe not what the
user wants.
Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Fixes: 010b430d ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it")
Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.netSigned-off-by: NJakub Kicinski <kuba@kernel.org>

0ed37ac5

mpls: load mpls_gso after mpls_iptunnel · b7c24497

由 Alexander Ovechkin 提交于 10月 20, 2020

mpls_iptunnel is used only for mpls encapsuation, and if encaplusated
packet is larger than MTU we need mpls_gso for segmentation.
Signed-off-by: NAlexander Ovechkin <ovov@yandex-team.ru>
Acked-by: NDmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Link: https://lore.kernel.org/r/20201020114333.26866-1-ovov@yandex-team.ruSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b7c24497

net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels · a7a12b5a

由 Davide Caratti 提交于 10月 21, 2020

the following command

 # tc action add action tunnel_key \
 > set src_ip 2001:db8::1 dst_ip 2001:db8::2 id 10 erspan_opts 1:6789:0:0

generates the following splat:

 BUG: KASAN: slab-out-of-bounds in tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
 Write of size 4 at addr ffff88813f5f1cc8 by task tc/873

 CPU: 2 PID: 873 Comm: tc Not tainted 5.9.0+ #282
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 Call Trace:
  dump_stack+0x99/0xcb
  print_address_description.constprop.7+0x1e/0x230
  kasan_report.cold.13+0x37/0x7c
  tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
  tunnel_key_init+0x160c/0x1f40 [act_tunnel_key]
  tcf_action_init_1+0x5b5/0x850
  tcf_action_init+0x15d/0x370
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x29b/0x3a0
  rtnetlink_rcv_msg+0x341/0x8d0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x719/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5ba/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f872a96b338
 Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
 RSP: 002b:00007ffffe367518 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 000000005f8f5aed RCX: 00007f872a96b338
 RDX: 0000000000000000 RSI: 00007ffffe367580 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000001c
 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001
 R13: 0000000000686760 R14: 0000000000000601 R15: 0000000000000000

 Allocated by task 873:
  kasan_save_stack+0x19/0x40
  __kasan_kmalloc.constprop.7+0xc1/0xd0
  __kmalloc+0x151/0x310
  metadata_dst_alloc+0x20/0x40
  tunnel_key_init+0xfff/0x1f40 [act_tunnel_key]
  tcf_action_init_1+0x5b5/0x850
  tcf_action_init+0x15d/0x370
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x29b/0x3a0
  rtnetlink_rcv_msg+0x341/0x8d0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x719/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5ba/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

 The buggy address belongs to the object at ffff88813f5f1c00
  which belongs to the cache kmalloc-256 of size 256
 The buggy address is located 200 bytes inside of
  256-byte region [ffff88813f5f1c00, ffff88813f5f1d00)
 The buggy address belongs to the page:
 page:0000000011b48a19 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13f5f0
 head:0000000011b48a19 order:1 compound_mapcount:0
 flags: 0x17ffffc0010200(slab|head)
 raw: 0017ffffc0010200 0000000000000000 0000000d00000001 ffff888107c43400
 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88813f5f1b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88813f5f1c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88813f5f1c80: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
                                               ^
  ffff88813f5f1d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88813f5f1d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

using IPv6 tunnels, act_tunnel_key allocates a fixed amount of memory for
the tunnel metadata, but then it expects additional bytes to store tunnel
specific metadata with tunnel_key_copy_opts().

Fix the arguments of __ipv6_tun_set_dst(), so that 'md_size' contains the
size previously computed by tunnel_key_get_opts_len(), like it's done for
IPv4 tunnels.

Fixes: 0ed5269f ("net/sched: add tunnel option support to act_tunnel_key")
Reported-by: NShuang Li <shuali@redhat.com>
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/36ebe969f6d13ff59912d6464a4356fe6f103766.1603231100.git.dcaratti@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a7a12b5a

net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() · b1307621

由 Guillaume Nault 提交于 10月 20, 2020

We need to jump to the "err_out_locked" label when
tcf_gate_get_entries() fails. Otherwise, tc_setup_flow_action() exits
with ->tcfa_lock still held.

Fixes: d29bdd69 ("net: schedule: add action gate offloading")
Signed-off-by: NGuillaume Nault <gnault@redhat.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/12f60e385584c52c22863701c0185e40ab08a7a7.1603207948.git.gnault@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b1307621

mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it · 010b430d

由 Geert Uytterhoeven 提交于 10月 20, 2020

MPTCP_IPV6 selects IPV6, thus enabling an optional feature the user may
not want to enable. Fix this by making MPTCP_IPV6 depend on IPV6, like
is done for all other IPv6 features.

Fixes: f870fa0b ("mptcp: Add MPTCP socket stubs")
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201020073839.29226-1-geert@linux-m68k.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

010b430d

nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME attribute in nfc_genl_fw_download() · 280e3ebd

由 Defang Bo 提交于 10月 19, 2020

Check that the NFC_ATTR_FIRMWARE_NAME attributes are provided by
the netlink client prior to accessing them.This prevents potential
unhandled NULL pointer dereference exceptions which can be triggered
by malicious user-mode programs, if they omit one or both of these
attributes.

Similar to commit a0323b97 ("nfc: Ensure presence of required attributes in the activate_target handler").

Fixes: 9674da87 ("NFC: Add firmware upload netlink command")
Signed-off-by: NDefang Bo <bodefang@126.com>
Link: https://lore.kernel.org/r/1603107538-4744-1-git-send-email-bodefang@126.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

280e3ebd

mptcp: MPTCP_KUNIT_TESTS should depend on MPTCP instead of selecting it · b142083b

由 Geert Uytterhoeven 提交于 10月 19, 2020

MPTCP_KUNIT_TESTS selects MPTCP, thus enabling an optional feature the
user may not want to enable. Fix this by making the test depend on
MPTCP instead.

Fixes: a00a5822 ("mptcp: move crypto test to KUNIT")
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201019113240.11516-1-geert@linux-m68k.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

b142083b

mptcp: move mptcp_options_received's port initialization · 65b8c8a6

由 Geliang Tang 提交于 10月 19, 2020

Move mptcp_options_received's port initialization from
mptcp_parse_option to mptcp_get_options, put it together with
the other fields initializations of mptcp_options_received.
Signed-off-by: NGeliang Tang <geliangtang@gmail.com>
Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

65b8c8a6

mptcp: initialize mptcp_options_received's ahmac · fe2d9b1a

由 Geliang Tang 提交于 10月 19, 2020

Initialize mptcp_options_received's ahmac to zero, otherwise it
will be a random number when receiving ADD_ADDR suboption with echo-flag=1.

Fixes: 3df523ab ("mptcp: Add ADD_ADDR handling")
Signed-off-by: NGeliang Tang <geliangtang@gmail.com>
Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

fe2d9b1a

net/sched: act_ct: Fix adding udp port mangle operation · 47b5d2a1

由 Roi Dayan 提交于 10月 19, 2020

Need to use the udp header type and not tcp.

Fixes: 9c26ba9b ("net/sched: act_ct: Instantiate flow table entry actions")
Signed-off-by: NRoi Dayan <roid@nvidia.com>
Reviewed-by: NPaul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/20201019090244.3015186-1-roid@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

47b5d2a1

SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() · d48c8124

由 Martijn de Gouw 提交于 10月 19, 2020

When the passed token is longer than 4032 bytes, the remaining part
of the token must be copied from the rqstp->rq_arg.pages. But the
copy must make sure it happens in a consecutive way.

With the existing code, the first memcpy copies 'length' bytes from
argv->iobase, but since the header is in front, this never fills the
whole first page of in_token->pages.

The mecpy in the loop copies the following bytes, but starts writing at
the next page of in_token->pages.  This leaves the last bytes of page 0
unwritten.

Symptoms were that users with many groups were not able to access NFS
exports, when using Active Directory as the KDC.
Signed-off-by: NMartijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Fixes: 5866efa8 "SUNRPC: Fix svcauth_gss_proxy_init()"
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

d48c8124

sunrpc: raise kernel RPC channel buffer size · 27a1e8a0

由 Roberto Bergantinos Corpas 提交于 10月 19, 2020

Its possible that using AUTH_SYS and mountd manage-gids option a
user may hit the 8k RPC channel buffer limit. This have been observed
on field, causing unanswered RPCs on clients after mountd fails to
write on channel :

rpc.mountd[11231]: auth_unix_gid: error writing reply

Userland nfs-utils uses a buffer size of 32k (RPC_CHAN_BUF_SIZE), so
lets match those two.
Signed-off-by: NRoberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

27a1e8a0

20 10月, 2020 7 次提交

netfilter: nftables_offload: KASAN slab-out-of-bounds Read in nft_flow_rule_create · 31cc578a

由 Saeed Mirzamohammadi 提交于 10月 20, 2020

This patch fixes the issue due to:

BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
net/netfilter/nf_tables_offload.c:40
Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244

The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds.

This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue.

Add nft_expr_more() and use it to fix this problem.
Signed-off-by: NSaeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

31cc578a

netfilter: ebtables: Fixes dropping of small packets in bridge nat · 63137bc5

由 Timothée COCAULT 提交于 10月 14, 2020

Fixes an error causing small packets to get dropped. skb_ensure_writable
expects the second parameter to be a length in the ethernet payload.=20
If we want to write the ethernet header (src, dst), we should pass 0.
Otherwise, packets with small payloads (< ETH_ALEN) will get dropped.

Fixes: c1a83116 ("netfilter: bridge: convert skb_make_writable to skb_ensure_writable")
Signed-off-by: NTimothée COCAULT <timothee.cocault@orange.com>
Reviewed-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

63137bc5

netfilter: Drop fragmented ndisc packets assembled in netfilter · 68f9f9c2

由 Georg Kohmann 提交于 10月 13, 2020

Fragmented ndisc packets assembled in netfilter not dropped as specified
in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

Setting IP6SKB_FRAGMENTED flag during reassembly.

References: commit b800c3b9 ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
Signed-off-by: NGeorg Kohmann <geokohma@cisco.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

68f9f9c2

netfilter: conntrack: connection timeout after re-register · 4f25434b

由 Francesco Ruggeri 提交于 10月 07, 2020

If the first packet conntrack sees after a re-register is an outgoing
keepalive packet with no data (SEG.SEQ = SND.NXT-1), td_end is set to
SND.NXT-1.
When the peer correctly acknowledges SND.NXT, tcp_in_window fails
check III (Upper bound for valid (s)ack: sack <= receiver.td_end) and
returns false, which cascades into nf_conntrack_in setting
skb->_nfct = 0 and in later conntrack iptables rules not matching.
In cases where iptables are dropping packets that do not match
conntrack rules this can result in idle tcp connections to time out.

v2: adjust td_end when getting the reply rather than when sending out
    the keepalive packet.

Fixes: f94e6380 ("netfilter: conntrack: reset tcp maxwin on re-register")
Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

4f25434b

ipvs: adjust the debug info in function set_tcp_state · 79dce09a

由 longguang.yue 提交于 9月 28, 2020

Outputting client,virtual,dst addresses info when tcp state changes,
which makes the connection debug more clear
Signed-off-by: Nlongguang.yue <bigclouds@163.com>
Acked-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

79dce09a

nexthop: Fix performance regression in nexthop deletion · df6afe2f

由 Ido Schimmel 提交于 10月 16, 2020

While insertion of 16k nexthops all using the same netdev ('dummy10')
takes less than a second, deletion takes about 130 seconds:

# time -p ip -b nexthop.batch
real 0.29
user 0.01
sys 0.15

# time -p ip link set dev dummy10 down
real 131.03
user 0.06
sys 0.52

This is because of repeated calls to synchronize_rcu() whenever a
nexthop is removed from a nexthop group:

# /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
...
    b'finish_task_switch'
    b'schedule'
    b'schedule_timeout'
    b'wait_for_completion'
    b'__wait_rcu_gp'
    b'synchronize_rcu.part.0'
    b'synchronize_rcu'
    b'__remove_nexthop'
    b'remove_nexthop'
    b'nexthop_flush_dev'
    b'nh_netdev_event'
    b'raw_notifier_call_chain'
    b'call_netdevice_notifiers_info'
    b'__dev_notify_flags'
    b'dev_change_flags'
    b'do_setlink'
    b'__rtnl_newlink'
    b'rtnl_newlink'
    b'rtnetlink_rcv_msg'
    b'netlink_rcv_skb'
    b'rtnetlink_rcv'
    b'netlink_unicast'
    b'netlink_sendmsg'
    b'____sys_sendmsg'
    b'___sys_sendmsg'
    b'__sys_sendmsg'
    b'__x64_sys_sendmsg'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    -                ip (277)
        126554955

Since nexthops are always deleted under RTNL, synchronize_net() can be
used instead. It will call synchronize_rcu_expedited() which only blocks
for several microseconds as opposed to multiple milliseconds like
synchronize_rcu().

With this patch deletion of 16k nexthops takes less than a second:

# time -p ip link set dev dummy10 down
real 0.12
user 0.00
sys 0.04

Tested with fib_nexthops.sh which includes torture tests that prompted
the initial change:

# ./fib_nexthops.sh
...
Tests passed: 134
Tests failed:   0

Fixes: 90f33bff ("nexthops: don't modify published nexthop groups")
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

df6afe2f

net: dsa: tag_ksz: KSZ8795 and KSZ9477 also use tail tags · bc7e343d

由 Christian Eggers 提交于 10月 16, 2020

The Marvell 88E6060 uses tag_trailer.c and the KSZ8795, KSZ9477 and
KSZ9893 switches also use tail tags.

Fixes: 7a6ffe76 ("net: dsa: point out the tail taggers")
Signed-off-by: NChristian Eggers <ceggers@arri.de>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201016171603.10587-1-ceggers@arri.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>

bc7e343d

19 10月, 2020 2 次提交

net: core: use list_del_init() instead of list_del() in netdev_run_todo() · 0e8b8d6a

由 Taehee Yoo 提交于 10月 15, 2020

dev->unlink_list is reused unless dev is deleted.
So, list_del() should not be used.
Due to using list_del(), dev->unlink_list can't be reused so that
dev->nested_level update logic doesn't work.
In order to fix this bug, list_del_init() should be used instead
of list_del().

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond0 master bond1
    ip link set bond0 nomaster
    ip link set bond1 master bond0
    ip link set bond1 nomaster

Splat looks like:
[  255.750458][ T1030] ============================================
[  255.751967][ T1030] WARNING: possible recursive locking detected
[  255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted
[  255.754553][ T1030] --------------------------------------------
[  255.756047][ T1030] ip/1030 is trying to acquire lock:
[  255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150
[  255.760056][ T1030]
[  255.760056][ T1030] but task is already holding lock:
[  255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.764581][ T1030]
[  255.764581][ T1030] other info that might help us debug this:
[  255.766645][ T1030]  Possible unsafe locking scenario:
[  255.766645][ T1030]
[  255.768566][ T1030]        CPU0
[  255.769415][ T1030]        ----
[  255.770259][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.771629][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.772994][ T1030]
[  255.772994][ T1030]  *** DEADLOCK ***
[  255.772994][ T1030]
[  255.775091][ T1030]  May be due to missing lock nesting notation
[  255.775091][ T1030]
[  255.777182][ T1030] 2 locks held by ip/1030:
[  255.778299][ T1030]  #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0
[  255.780600][ T1030]  #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.783411][ T1030]
[  255.783411][ T1030] stack backtrace:
[  255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772
[  255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  255.789030][ T1030] Call Trace:
[  255.789850][ T1030]  dump_stack+0x99/0xd0
[  255.790882][ T1030]  __lock_acquire.cold.71+0x166/0x3cc
[  255.792285][ T1030]  ? register_lock_class+0x1a30/0x1a30
[  255.793619][ T1030]  ? rcu_read_lock_sched_held+0x91/0xc0
[  255.794963][ T1030]  ? rcu_read_lock_bh_held+0xa0/0xa0
[  255.796246][ T1030]  lock_acquire+0x1b8/0x850
[  255.797332][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.798624][ T1030]  ? bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.800039][ T1030]  ? check_flags+0x50/0x50
[  255.801143][ T1030]  ? lock_contended+0xd80/0xd80
[  255.802341][ T1030]  _raw_spin_lock_nested+0x2e/0x70
[  255.803592][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.804897][ T1030]  dev_mc_sync_multiple+0xc2/0x150
[  255.806168][ T1030]  bond_enslave+0x3d58/0x43e0 [bonding]
[  255.807542][ T1030]  ? __lock_acquire+0xe53/0x51b0
[  255.808824][ T1030]  ? bond_update_slave_arr+0xdc0/0xdc0 [bonding]
[  255.810451][ T1030]  ? check_chain_key+0x236/0x5e0
[  255.811742][ T1030]  ? mutex_is_locked+0x13/0x50
[  255.812910][ T1030]  ? rtnl_is_locked+0x11/0x20
[  255.814061][ T1030]  ? netdev_master_upper_dev_get+0xf/0x120
[  255.815553][ T1030]  do_setlink+0x94c/0x3040
[ ... ]

Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com
Fixes: 1fc70edb ("net: core: add nested_level variable in net_device")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

0e8b8d6a

net: openvswitch: fix to make sure flow_lookup() is not preempted · f981fc3d

由 Eelco Chaudron 提交于 10月 17, 2020

The flow_lookup() function uses per CPU variables, which must be called
with BH disabled. However, this is fine in the general NAPI use case
where the local BH is disabled. But, it's also called from the netlink
context. The below patch makes sure that even in the netlink path, the
BH is disabled.

In addition, u64_stats_update_begin() requires a lock to ensure one writer
which is not ensured here. Making it per-CPU and disabling NAPI (softirq)
ensures that there is always only one writer.

Fixes: eac87c41 ("net: openvswitch: reorder masks array based on usage")
Reported-by: NJuri Lelli <jlelli@redhat.com>
Signed-off-by: NEelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160295903253.7789.826736662555102345.stgit@ebuildSigned-off-by: NJakub Kicinski <kuba@kernel.org>

f981fc3d

17 10月, 2020 5 次提交

icmp: randomize the global rate limiter · b38e7819

由 Eric Dumazet 提交于 10月 15, 2020

Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.

Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.

Fixes: 4cdf507d ("icmp: add a global rate limitation")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NKeyu Man <kman001@ucr.edu>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b38e7819

tipc: fix incorrect setting window for bcast link · ec78e318

由 Hoang Huu Le 提交于 10月 16, 2020

In commit 16ad3f40
("tipc: introduce variable window congestion control"), we applied
the algorithm to select window size from minimum window to the
configured maximum window for unicast link, and, besides we chose
to keep the window size for broadcast link unchanged and equal (i.e
fix window 50)

However, when setting maximum window variable via command, the window
variable was re-initialized to unexpect value (i.e 32).

We fix this by updating the fix window for broadcast as we stated.

Fixes: 16ad3f40 ("tipc: introduce variable window congestion control")
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NHoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ec78e318

tipc: re-configure queue limit for broadcast link · 75cee397

由 Hoang Huu Le 提交于 10月 16, 2020

The queue limit of the broadcast link is being calculated base on initial
MTU. However, when MTU value changed (e.g manual changing MTU on NIC
device, MTU negotiation etc.,) we do not re-calculate queue limit.
This gives throughput does not reflect with the change.

So fix it by calling the function to re-calculate queue limit of the
broadcast link.
Acked-by: NJon Maloy <jmaloy@redhat.com>
Signed-off-by: NHoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

75cee397

svcrdma: fix bounce buffers for unaligned offsets and multiple pages · c327a310

由 Dan Aloni 提交于 10月 02, 2020

This was discovered using O_DIRECT at the client side, with small
unaligned file offsets or IOs that span multiple file pages.

Fixes: e248aa7b ("svcrdma: Remove max_sge check at connect time")
Signed-off-by: NDan Aloni <dan@kernelim.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

c327a310

net/sunrpc: Fix return value for sysctl sunrpc.transports · c09f56b8

由 Artur Molchanov 提交于 10月 12, 2020

Fix returning value for sysctl sunrpc.transports.
Return error code from sysctl proc_handler function proc_do_xprt instead of number of the written bytes.
Otherwise sysctl returns random garbage for this key.

Since v1:
- Handle negative returned value from memory_read_from_buffer as an error
Signed-off-by: NArtur Molchanov <arturmolchanov@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

c09f56b8

16 10月, 2020 7 次提交

Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH" · 2ecbc1f6

由 Jakub Kicinski 提交于 10月 15, 2020

This reverts commit 1d273fcc.

Alexei points out there's nothing implying headers will be built
and therefore exist under usr/include, so this fix doesn't make
much sense.
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

2ecbc1f6

net, sockmap: Don't call bpf_prog_put() on NULL pointer · 83c11c17

由 Alex Dewar 提交于 10月 12, 2020

If bpf_prog_inc_not_zero() fails for skb_parser, then bpf_prog_put() is
called unconditionally on skb_verdict, even though it may be NULL. Fix
and tidy up error path.

Fixes: 743df8b7 ("bpf, sockmap: Check skb_verdict and skb_parser programs explicitly")
Addresses-Coverity-ID: 1497799: Null pointer dereferences (FORWARD_NULL)
Signed-off-by: NAlex Dewar <alex.dewar90@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201012170952.60750-1-alex.dewar90@gmail.com

83c11c17

bpf, sockmap: Add locking annotations to iterator · f58423ae

由 Lorenz Bauer 提交于 10月 12, 2020

The sparse checker currently outputs the following warnings:

include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_hash_seq_start' - wrong count at exit
include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_map_seq_start' - wrong count at exit

Add the necessary __acquires and __release annotations to make the
iterator locking schema palatable to sparse. Also add __must_hold
for good measure.

The kernel codebase uses both __acquires(rcu) and __acquires(RCU).
I couldn't find any guidance which one is preferred, so I used
what is easier to type out.

Fixes: 03653515 ("net: Allow iterating sockmap and sockhash")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NLorenz Bauer <lmb@cloudflare.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20201012091850.67452-1-lmb@cloudflare.com

f58423ae

netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements · 346e320c

由 Davide Caratti 提交于 10月 15, 2020

nftables payload statements are used to mangle SCTP headers, but they can
only replace the Internet Checksum. As a consequence, nftables rules that
mangle sport/dport/vtag in SCTP headers potentially generate packets that
are discarded by the receiver, unless the CRC-32C is "offloaded" (e.g the
rule mangles a skb having 'ip_summed' equal to 'CHECKSUM_PARTIAL'.

Fix this extending uAPI definitions and L4 checksum update function, in a
way that userspace programs (e.g. nft) can instruct the kernel to compute
CRC-32C in SCTP headers. Also ensure that LIBCRC32C is built if NF_TABLES
is 'y' or 'm' in the kernel build configuration.
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

346e320c

net: fix pos incrementment in ipv6_route_seq_next · 6617dfd4

由 Yonghong Song 提交于 10月 14, 2020

Commit 4fc427e0 ("ipv6_route_seq_next should increase position index")
tried to fix the issue where seq_file pos is not increased
if a NULL element is returned with seq_ops->next(). See bug
https://bugzilla.kernel.org/show_bug.cgi?id=206283
The commit effectively does:
- increase pos for all seq_ops->start()
- increase pos for all seq_ops->next()

For ipv6_route, increasing pos for all seq_ops->next() is correct.
But increasing pos for seq_ops->start() is not correct
since pos is used to determine how many items to skip during
seq_ops->start():
iter->skip = *pos;
seq_ops->start() just fetches the *current* pos item.
The item can be skipped only after seq_ops->show() which essentially
is the beginning of seq_ops->next().

For example, I have 7 ipv6 route entries,
root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo
fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
0+1 records in
0+1 records out
1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
root@arch-fb-vm1:~/net-next

In the above, I specify buffer size 4096, so all records can be returned
to user space with a single trip to the kernel.

If I use buffer size 128, since each record size is 149, internally
kernel seq_read() will read 149 into its internal buffer and return the data
to user space in two read() syscalls. Then user read() syscall will trigger
next seq_ops->start(). Since the current implementation increased pos even
for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first
record is #1.

root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
4+1 records in
4+1 records out
600 bytes copied, 0.00127758 s, 470 kB/s

To fix the problem, create a fake pos pointer so seq_ops->start()
won't actually increase seq_file pos. With this fix, the
above `dd` command with `bs=128` will show correct result.

Fixes: 4fc427e0 ("ipv6_route_seq_next should increase position index")
Cc: Alexei Starovoitov <ast@kernel.org>
Suggested-by: NVasily Averin <vvs@virtuozzo.com>
Reviewed-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NYonghong Song <yhs@fb.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

6617dfd4

net/smc: fix invalid return code in smcd_new_buf_create() · 6b1bbf94

由 Karsten Graul 提交于 10月 14, 2020

smc_ism_register_dmb() returns error codes set by the ISM driver which
are not guaranteed to be negative or in the errno range. Such values
would not be handled by ERR_PTR() and finally the return code will be
used as a memory address.
Fix that by using a valid negative errno value with ERR_PTR().

Fixes: 72b7f6c4 ("net/smc: unique reason code for exceeded max dmb count")
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

6b1bbf94

net/smc: fix valid DMBE buffer sizes · ef12ad45

由 Karsten Graul 提交于 10月 14, 2020

The SMCD_DMBE_SIZES should include all valid DMBE buffer sizes, so the
correct value is 6 which means 1MB. With 7 the registration of an ISM
buffer would always fail because of the invalid size requested.
Fix that and set the value to 6.

Fixes: c6ba7c9b ("net/smc: add base infrastructure for SMC-D and ISM")
Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ef12ad45

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功