1. 11 3月, 2021 2 次提交
    • D
      net, bpf: Fix ip6ip6 crash with collect_md populated skbs · a188bb56
      Daniel Borkmann 提交于
      I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
      set to collect_md mode was receiving collect_md populated skbs for xmit.
      
      The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
      assigning special metadata dst entry and then redirecting the skb to the
      device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
      and in the latter it performs a neigh lookup based on skb_dst(skb) where
      we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
      the md_dst_ops do not populate neigh_lookup callback with a fake handler.
      
      Transform the md_dst_ops into generic dst_blackhole_ops that can also be
      reused elsewhere when needed, and use them for the metadata dst entries as
      callback ops.
      
      Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
      from dst_init() which free the skb the same way modulo the splat. Given we
      will be able to recover just fine from there, avoid any potential splats
      iff this gets ever triggered in future (or worse, panic on warns when set).
      
      Fixes: f38a9eb1 ("dst: Metadata destinations")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a188bb56
    • D
      net: Consolidate common blackhole dst ops · c4c877b2
      Daniel Borkmann 提交于
      Move generic blackhole dst ops to the core and use them from both
      ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
      functional change otherwise. We need these also in other locations
      and having to define them over and over again is not great.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4c877b2
  2. 10 3月, 2021 1 次提交
  3. 09 3月, 2021 3 次提交
    • J
      net: qrtr: fix error return code of qrtr_sendmsg() · 179d0ba0
      Jia-Ju Bai 提交于
      When sock_alloc_send_skb() returns NULL to skb, no error return code of
      qrtr_sendmsg() is assigned.
      To fix this bug, rc is assigned with -ENOMEM in this case.
      
      Fixes: 194ccc88 ("net: qrtr: Support decoding incoming v2 packets")
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      Signed-off-by: NJia-Ju Bai <baijiaju1990@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      179d0ba0
    • D
      mptcp: fix length of ADD_ADDR with port sub-option · 27ab92d9
      Davide Caratti 提交于
      in current Linux, MPTCP peers advertising endpoints with port numbers use
      a sub-option length that wrongly accounts for the trailing TCP NOP. Also,
      receivers will only process incoming ADD_ADDR with port having such wrong
      sub-option length. Fix this, making ADD_ADDR compliant to RFC8684 §3.4.1.
      
      this can be verified running tcpdump on the kselftests artifacts:
      
       unpatched kernel:
       [root@bottarga mptcp]# tcpdump -tnnr unpatched.pcap | grep add-addr
       reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1), snapshot length 65535
       IP 10.0.1.1.10000 > 10.0.1.2.53078: Flags [.], ack 101, win 509, options [nop,nop,TS val 214459678 ecr 521312851,mptcp add-addr v1 id 1 a00:201:2774:2d88:7436:85c3:17fd:101], length 0
       IP 10.0.1.2.53078 > 10.0.1.1.10000: Flags [.], ack 101, win 502, options [nop,nop,TS val 521312852 ecr 214459678,mptcp add-addr[bad opt]]
      
       patched kernel:
       [root@bottarga mptcp]# tcpdump -tnnr patched.pcap | grep add-addr
       reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1), snapshot length 65535
       IP 10.0.1.1.10000 > 10.0.1.2.38178: Flags [.], ack 101, win 509, options [nop,nop,TS val 3728873902 ecr 2732713192,mptcp add-addr v1 id 1 10.0.2.1:10100 hmac 0xbccdfcbe59292a1f,nop,nop], length 0
       IP 10.0.1.2.38178 > 10.0.1.1.10000: Flags [.], ack 101, win 502, options [nop,nop,TS val 2732713195 ecr 3728873902,mptcp add-addr v1-echo id 1 10.0.2.1:10100,nop,nop], length 0
      
      Fixes: 22fb85ff ("mptcp: add port support for ADD_ADDR suboption writing")
      CC: stable@vger.kernel.org # 5.11+
      Reviewed-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-and-tested-by: NGeliang Tang <geliangtang@gmail.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27ab92d9
    • V
      net: dsa: fix switchdev objects on bridge master mistakenly being applied on ports · 03cbb870
      Vladimir Oltean 提交于
      Tobias reports that after the blamed patch, VLAN objects being added to
      a bridge device are being added to all slave ports instead (swp2, swp3).
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set swp2 master br0
      ip link set swp3 master br0
      bridge vlan add dev br0 vid 100 self
      
      This is because the fix was too broad: we made dsa_port_offloads_netdev
      say "yes, I offload the br0 bridge" for all slave ports, but we didn't
      add the checks whether the switchdev object was in fact meant for the
      physical port or for the bridge itself. So we are reacting on events in
      a way in which we shouldn't.
      
      The reason why the fix was too broad is because the question itself,
      "does this DSA port offload this netdev", was too broad in the first
      place. The solution is to disambiguate the question and separate it into
      two different functions, one to be called for each switchdev attribute /
      object that has an orig_dev == net_bridge (dsa_port_offloads_bridge),
      and the other for orig_dev == net_bridge_port (*_offloads_bridge_port).
      
      In the case of VLAN objects on the bridge interface, this solves the
      problem because we know that VLAN objects are per bridge port and not
      per bridge. And when orig_dev is equal to the net_bridge, we offload it
      as a bridge, but not as a bridge port; that's how we are able to skip
      reacting on those events. Note that this is compatible with future plans
      to have explicit offloading of VLAN objects on the bridge interface as a
      bridge port (in DSA, this signifies that we should add that VLAN towards
      the CPU port).
      
      Fixes: 99b8202b ("net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored")
      Reported-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NTobias Waldekranz <tobias@waldekranz.com>
      Tested-by: NTobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03cbb870
  4. 06 3月, 2021 1 次提交
  5. 05 3月, 2021 13 次提交
  6. 04 3月, 2021 5 次提交
    • P
      netfilter: nftables: bogus check for netlink portID with table owner · bd1777b3
      Pablo Neira Ayuso 提交于
      The existing branch checks for 0 != table->nlpid which always evaluates
      true for tables that have an owner.
      
      Fixes: 6001a930 ("netfilter: nftables: introduce table ownership")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      bd1777b3
    • P
      netfilter: nftables: fix possible double hook unregistration with table owner · 2888b080
      Pablo Neira Ayuso 提交于
      Skip hook unregistration of owner tables from the netns exit path,
      nft_rcv_nl_event() unregisters the table hooks before tearing down
      the table content.
      
      Fixes: 6001a930 ("netfilter: nftables: introduce table ownership")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2888b080
    • Z
      rtnetlink: using dev_base_seq from target net · a9ecb0cb
      zhang kai 提交于
      Signed-off-by: Nzhang kai <zhangkaiheb@126.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9ecb0cb
    • J
      net: 9p: advance iov on empty read · d65614a0
      Jisheng Zhang 提交于
      I met below warning when cating a small size(about 80bytes) txt file
      on 9pfs(msize=2097152 is passed to 9p mount option), the reason is we
      miss iov_iter_advance() if the read count is 0 for zerocopy case, so
      we didn't truncate the pipe, then iov_iter_pipe() thinks the pipe is
      full. Fix it by removing the exception for 0 to ensure to call
      iov_iter_advance() even on empty read for zerocopy case.
      
      [    8.279568] WARNING: CPU: 0 PID: 39 at lib/iov_iter.c:1203 iov_iter_pipe+0x31/0x40
      [    8.280028] Modules linked in:
      [    8.280561] CPU: 0 PID: 39 Comm: cat Not tainted 5.11.0+ #6
      [    8.281260] RIP: 0010:iov_iter_pipe+0x31/0x40
      [    8.281974] Code: 2b 42 54 39 42 5c 76 22 c7 07 20 00 00 00 48 89 57 18 8b 42 50 48 c7 47 08 b
      [    8.283169] RSP: 0018:ffff888000cbbd80 EFLAGS: 00000246
      [    8.283512] RAX: 0000000000000010 RBX: ffff888000117d00 RCX: 0000000000000000
      [    8.283876] RDX: ffff88800031d600 RSI: 0000000000000000 RDI: ffff888000cbbd90
      [    8.284244] RBP: ffff888000cbbe38 R08: 0000000000000000 R09: ffff8880008d2058
      [    8.284605] R10: 0000000000000002 R11: ffff888000375510 R12: 0000000000000050
      [    8.284964] R13: ffff888000cbbe80 R14: 0000000000000050 R15: ffff88800031d600
      [    8.285439] FS:  00007f24fd8af600(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
      [    8.285844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    8.286150] CR2: 00007f24fd7d7b90 CR3: 0000000000c97000 CR4: 00000000000406b0
      [    8.286710] Call Trace:
      [    8.288279]  generic_file_splice_read+0x31/0x1a0
      [    8.289273]  ? do_splice_to+0x2f/0x90
      [    8.289511]  splice_direct_to_actor+0xcc/0x220
      [    8.289788]  ? pipe_to_sendpage+0xa0/0xa0
      [    8.290052]  do_splice_direct+0x8b/0xd0
      [    8.290314]  do_sendfile+0x1ad/0x470
      [    8.290576]  do_syscall_64+0x2d/0x40
      [    8.290818]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [    8.291409] RIP: 0033:0x7f24fd7dca0a
      [    8.292511] Code: c3 0f 1f 80 00 00 00 00 4c 89 d2 4c 89 c6 e9 bd fd ff ff 0f 1f 44 00 00 31 8
      [    8.293360] RSP: 002b:00007ffc20932818 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
      [    8.293800] RAX: ffffffffffffffda RBX: 0000000001000000 RCX: 00007f24fd7dca0a
      [    8.294153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
      [    8.294504] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
      [    8.294867] R10: 0000000001000000 R11: 0000000000000206 R12: 0000000000000003
      [    8.295217] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
      [    8.295782] ---[ end trace 63317af81b3ca24b ]---
      Signed-off-by: NJisheng Zhang <Jisheng.Zhang@synaptics.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d65614a0
    • M
      net: l2tp: reduce log level of messages in receive path, add counter instead · 3e59e885
      Matthias Schiffer 提交于
      Commit 5ee759cd ("l2tp: use standard API for warning log messages")
      changed a number of warnings about invalid packets in the receive path
      so that they are always shown, instead of only when a special L2TP debug
      flag is set. Even with rate limiting these warnings can easily cause
      significant log spam - potentially triggered by a malicious party
      sending invalid packets on purpose.
      
      In addition these warnings were noticed by projects like Tunneldigger [1],
      which uses L2TP for its data path, but implements its own control
      protocol (which is sufficiently different from L2TP data packets that it
      would always be passed up to userspace even with future extensions of
      L2TP).
      
      Some of the warnings were already redundant, as l2tp_stats has a counter
      for these packets. This commit adds one additional counter for invalid
      packets that are passed up to userspace. Packets with unknown session are
      not counted as invalid, as there is nothing wrong with the format of
      these packets.
      
      With the additional counter, all of these messages are either redundant
      or benign, so we reduce them to pr_debug_ratelimited().
      
      [1] https://github.com/wlanslovenija/tunneldigger/issues/160
      
      Fixes: 5ee759cd ("l2tp: use standard API for warning log messages")
      Signed-off-by: NMatthias Schiffer <mschiffer@universe-factory.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e59e885
  7. 02 3月, 2021 8 次提交
    • P
      netfilter: nftables: disallow updates on table ownership · 9cc0001a
      Pablo Neira Ayuso 提交于
      Disallow updating the ownership bit on an existing table: Do not allow
      to grab ownership on an existing table. Do not allow to drop ownership
      on an existing table.
      
      Fixes: 6001a930 ("netfilter: nftables: introduce table ownership")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      9cc0001a
    • E
      tcp: add sanity tests to TCP_QUEUE_SEQ · 8811f4a9
      Eric Dumazet 提交于
      Qingyu Li reported a syzkaller bug where the repro
      changes RCV SEQ _after_ restoring data in the receive queue.
      
      mprotect(0x4aa000, 12288, PROT_READ)    = 0
      mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
      mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
      mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
      sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
      setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
      setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
      recvfrom(3, NULL, 20, 0, NULL, NULL)    = -1 ECONNRESET (Connection reset by peer)
      
      syslog shows:
      [  111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
      [  111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
      
      This should not be allowed. TCP_QUEUE_SEQ should only be used
      when queues are empty.
      
      This patch fixes this case, and the tx path as well.
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005Reported-by: NQingyu Li <ieatmuttonchuan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8811f4a9
    • D
      net: dsa: tag_mtk: fix 802.1ad VLAN egress · 9200f515
      DENG Qingfang 提交于
      A different TPID bit is used for 802.1ad VLAN frames.
      Reported-by: NIlario Gelmetti <iochesonome@gmail.com>
      Fixes: f0af3431 ("net: dsa: mediatek: combine MediaTek tag with VLAN tag")
      Signed-off-by: NDENG Qingfang <dqfext@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9200f515
    • W
      net: expand textsearch ts_state to fit skb_seq_state · b228c9b0
      Willem de Bruijn 提交于
      The referenced commit expands the skb_seq_state used by
      skb_find_text with a 4B frag_off field, growing it to 48B.
      
      This exceeds container ts_state->cb, causing a stack corruption:
      
      [   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack
      is corrupted in: skb_find_text+0xc5/0xd0
      [   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
      [   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.14.0-2 04/01/2014
      [   73.260078] Call Trace:
      [   73.264677]  dump_stack+0x57/0x6a
      [   73.267866]  panic+0xf6/0x2b7
      [   73.270578]  ? skb_find_text+0xc5/0xd0
      [   73.273964]  __stack_chk_fail+0x10/0x10
      [   73.277491]  skb_find_text+0xc5/0xd0
      [   73.280727]  string_mt+0x1f/0x30
      [   73.283639]  ipt_do_table+0x214/0x410
      
      The struct is passed between skb_find_text and its callbacks
      skb_prepare_seq_read, skb_seq_read and skb_abort_seq read through
      the textsearch interface using TS_SKB_CB.
      
      I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
      skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.
      
      skb->cb was increased from 40B to 48B after ts_state was introduced,
      in commit 3e3850e9 ("[NETFILTER]: Fix xfrm lookup in
      ip_route_me_harder/ip6_route_me_harder").
      
      Increase ts_state.cb[] to 48 to fit the struct.
      
      Also add a BUILD_BUG_ON to avoid a repeat.
      
      The alternative is to directly add a dependency from textsearch onto
      linux/skbuff.h, but I think the intent is textsearch to have no such
      dependencies on its callers.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
      Fixes: 97550f6f ("net: compound page support in skb_seq_read")
      Reported-by: NKris Karas <bugs-a17@moonlit-rail.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b228c9b0
    • Y
      inetpeer: use div64_ul() and clamp_val() calculate inet_peer_threshold · 8bd2a055
      Yejune Deng 提交于
      In inet_initpeers(), struct inet_peer on IA32 uses 128 bytes in nowdays.
      Get rid of the cascade and use div64_ul() and clamp_val() calculate that
      will not need to be adjusted in the future as suggested by Eric Dumazet.
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NYejune Deng <yejune.deng@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8bd2a055
    • P
      net/qrtr: fix __netdev_alloc_skb call · 093b036a
      Pavel Skripkin 提交于
      syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER.
      It was caused by a huge length value passed from userspace to qrtr_tun_write_iter(),
      which tries to allocate skb. Since the value comes from the untrusted source
      there is no need to raise a warning in __alloc_pages_nodemask().
      
      [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
      Call Trace:
       __alloc_pages include/linux/gfp.h:511 [inline]
       __alloc_pages_node include/linux/gfp.h:524 [inline]
       alloc_pages_node include/linux/gfp.h:538 [inline]
       kmalloc_large_node+0x60/0x110 mm/slub.c:3999
       __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
       __kmalloc_reserve net/core/skbuff.c:150 [inline]
       __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
       __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
       netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
       qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
       qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
       call_write_iter include/linux/fs.h:1901 [inline]
       new_sync_write+0x426/0x650 fs/read_write.c:518
       vfs_write+0x791/0xa30 fs/read_write.c:605
       ksys_write+0x12d/0x250 fs/read_write.c:658
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported-by: syzbot+80dccaee7c6630fa9dcf@syzkaller.appspotmail.com
      Signed-off-by: NPavel Skripkin <paskripkin@gmail.com>
      Acked-by: NAlexander Lobakin <alobakin@pm.me>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      093b036a
    • J
      net: always use icmp{,v6}_ndo_send from ndo_start_xmit · 4372339e
      Jason A. Donenfeld 提交于
      There were a few remaining tunnel drivers that didn't receive the prior
      conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
      memory corrution (see ee576c47 ("net: icmp: pass zeroed opts from
      icmp{,v6}_ndo_send before sending") for details), there's even more
      imperative to have these all converted. So this commit goes through the
      remaining cases that I could find and does a boring translation to the
      ndo variety.
      
      The Fixes: line below is the merge that originally added icmp{,v6}_
      ndo_send and converted the first batch of icmp{,v6}_send users. The
      rationale then for the change applies equally to this patch. It's just
      that these drivers were left out of the initial conversion because these
      network devices are hiding in net/ rather than in drivers/net/.
      
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Fixes: 803381f9 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4372339e
    • D
      net: dsa: tag_rtl4_a: fix egress tags · 9eb8bc59
      DENG Qingfang 提交于
      Commit 86dd9868 has several issues, but was accepted too soon
      before anyone could take a look.
      
      - Double free. dsa_slave_xmit() will free the skb if the xmit function
        returns NULL, but the skb is already freed by eth_skb_pad(). Use
        __skb_put_padto() to avoid that.
      - Unnecessary allocation. It has been done by DSA core since commit
        a3b0b647.
      - A u16 pointer points to skb data. It should be __be16 for network
        byte order.
      - Typo in comments. "numer" -> "number".
      
      Fixes: 86dd9868 ("net: dsa: tag_rtl4_a: Support also egress tags")
      Signed-off-by: NDENG Qingfang <dqfext@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9eb8bc59
  8. 01 3月, 2021 2 次提交
    • D
      net: Fix gro aggregation for udp encaps with zero csum · 89e5c58f
      Daniel Borkmann 提交于
      We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
      csum for the UDP header itself is 0. In that case, GRO aggregation does
      not take place on the phys dev, but instead is deferred to the vxlan/geneve
      driver (see trace below).
      
      The reason is essentially that GRO aggregation bails out in udp_gro_receive()
      for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e,
      others) where for non-zero csums 2abb7cdc ("udp: Add support for doing
      checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
      and napi context has csum_valid set. This is however not the case for zero
      UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).
      
      At the same time 57c67ff4 ("udp: additional GRO support") added matches
      on !uh->check ^ !uh2->check as part to determine candidates for aggregation,
      so it certainly is expected to handle zero csums in udp_gro_receive(). The
      purpose of the check added via 662880f4 ("net: Allow GRO to use and set
      levels of checksum unnecessary") seems to catch bad csum and stop aggregation
      right away.
      
      One way to fix aggregation in the zero case is to only perform the !csum_valid
      check in udp_gro_receive() if uh->check is infact non-zero.
      
      Before:
      
        [...]
        swapper     0 [008]   731.946506: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100400 len=1500   (1)
        swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100200 len=1500
        swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101100 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101700 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101b00 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100600 len=1500
        swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100f00 len=1500
        swapper     0 [008]   731.946509: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100a00 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100500 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100700 len=1500
        swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101d00 len=1500   (2)
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101000 len=1500
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101c00 len=1500
        swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101400 len=1500
        swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100e00 len=1500
        swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101600 len=1500
        swapper     0 [008]   731.946521: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100800 len=774
        swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1)
        swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112  (2)
        [...]
      
        # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
        MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
         87380  16384  16384    20.01    13129.24
      
      After:
      
        [...]
        swapper     0 [026]   521.862641: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479000 len=11286 (1)
        swapper     0 [026]   521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1)
        swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d478500 len=2898  (2)
        swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479f00 len=8490  (3)
        swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848  (2)
        swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440  (3)
        [...]
      
        # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
        MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
        Recv   Send    Send
        Socket Socket  Message  Elapsed
        Size   Size    Size     Time     Throughput
        bytes  bytes   bytes    secs.    10^6bits/sec
      
         87380  16384  16384    20.01    24576.53
      
      Fixes: 57c67ff4 ("udp: additional GRO support")
      Fixes: 662880f4 ("net: Allow GRO to use and set levels of checksum unnecessary")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.netSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      89e5c58f
    • Y
      ethtool: fix the check logic of at least one channel for RX/TX · a4fc088a
      Yinjun Zhang 提交于
      The command "ethtool -L <intf> combined 0" may clean the RX/TX channel
      count and skip the error path, since the attrs
      tb[ETHTOOL_A_CHANNELS_RX_COUNT] and tb[ETHTOOL_A_CHANNELS_TX_COUNT]
      are NULL in this case when recent ethtool is used.
      
      Tested using ethtool v5.10.
      
      Fixes: 7be92514 ("ethtool: check if there is at least one channel for TX/RX in the core")
      Signed-off-by: NYinjun Zhang <yinjun.zhang@corigine.com>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NLouis Peens <louis.peens@netronome.com>
      Link: https://lore.kernel.org/r/20210225125102.23989-1-simon.horman@netronome.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      a4fc088a
  9. 28 2月, 2021 4 次提交
    • V
      netfilter: x_tables: gpf inside xt_find_revision() · 8e24eddd
      Vasily Averin 提交于
      nested target/match_revfn() calls work with xt[NFPROTO_UNSPEC] lists
      without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload
      and cause host to crash:
      
      general protection fault: 0000 [#1]
      Modules linked in: ... [last unloaded: xt_cluster]
      CPU: 0 PID: 542455 Comm: iptables
      RIP: 0010:[<ffffffff8ffbd518>]  [<ffffffff8ffbd518>] strcmp+0x18/0x40
      RDX: 0000000000000003 RSI: ffff9a5a5d9abe10 RDI: dead000000000111
      R13: ffff9a5a5d9abe10 R14: ffff9a5a5d9abd8c R15: dead000000000100
      (VvS: %R15 -- &xt_match,  %RDI -- &xt_match.name,
      xt_cluster unregister match in xt[NFPROTO_UNSPEC].match list)
      Call Trace:
       [<ffffffff902ccf44>] match_revfn+0x54/0xc0
       [<ffffffff902ccf9f>] match_revfn+0xaf/0xc0
       [<ffffffff902cd01e>] xt_find_revision+0x6e/0xf0
       [<ffffffffc05a5be0>] do_ipt_get_ctl+0x100/0x420 [ip_tables]
       [<ffffffff902cc6bf>] nf_getsockopt+0x4f/0x70
       [<ffffffff902dd99e>] ip_getsockopt+0xde/0x100
       [<ffffffff903039b5>] raw_getsockopt+0x25/0x50
       [<ffffffff9026c5da>] sock_common_getsockopt+0x1a/0x20
       [<ffffffff9026b89d>] SyS_getsockopt+0x7d/0xf0
       [<ffffffff903cbf92>] system_call_fastpath+0x25/0x2a
      
      Fixes: 656caff2 ("netfilter 04/09: x_tables: fix match/target revision lookup")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8e24eddd
    • F
      netfilter: conntrack: avoid misleading 'invalid' in log message · 07b5a76e
      Florian Westphal 提交于
      The packet is not flagged as invalid: conntrack will accept it and
      its associated with the conntrack entry.
      
      This happens e.g. when receiving a retransmitted SYN in SYN_RECV state.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      07b5a76e
    • F
      netfilter: nf_nat: undo erroneous tcp edemux lookup · 03a3ca37
      Florian Westphal 提交于
      Under extremely rare conditions TCP early demux will retrieve the wrong
      socket.
      
      1. local machine establishes a connection to a remote server, S, on port
         p.
      
         This gives:
         laddr:lport -> S:p
         ... both in tcp and conntrack.
      
      2. local machine establishes a connection to host H, on port p2.
         2a. TCP stack choses same laddr:lport, so we have
         laddr:lport -> H:p2 from TCP point of view.
         2b). There is a destination NAT rewrite in place, translating
              H:p2 to S:p.  This results in following conntrack entries:
      
         I)  laddr:lport -> S:p  (origin)  S:p -> laddr:lport (reply)
         II) laddr:lport -> H:p2 (origin)  S:p -> laddr:lport2 (reply)
      
         NAT engine has rewritten laddr:lport to laddr:lport2 to map
         the reply packet to the correct origin.
      
         When server sends SYN/ACK to laddr:lport2, the PREROUTING hook
         will undo-the SNAT transformation, rewriting IP header to
         S:p -> laddr:lport
      
         This causes TCP early demux to associate the skb with the TCP socket
         of the first connection.
      
         The INPUT hook will then reverse the DNAT transformation, rewriting
         the IP header to H:p2 -> laddr:lport.
      
      Because packet ends up with the wrong socket, the new connection
      never completes: originator stays in SYN_SENT and conntrack entry
      remains in SYN_RECV until timeout, and responder retransmits SYN/ACK
      until it gives up.
      
      To resolve this, orphan the skb after the input rewrite:
      Because the source IP address changed, the socket must be incorrect.
      We can't move the DNAT undo to prerouting due to backwards
      compatibility, doing so will make iptables/nftables rules to no longer
      match the way they did.
      
      After orphan, the packet will be handed to the next protocol layer
      (tcp, udp, ...) and that will repeat the socket lookup just like as if
      early demux was disabled.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      03a3ca37
    • K
      netfilter: conntrack: Remove a double space in a log message · c57ea2d7
      Klemen Košir 提交于
      Removed an extra space in a log message and an extra blank line in code.
      Signed-off-by: NKlemen Košir <klemen.kosir@kream.io>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c57ea2d7
  10. 27 2月, 2021 1 次提交