1. 14 10月, 2020 1 次提交
    • C
      ip_gre: set dev->hard_header_len and dev->needed_headroom properly · fdafed45
      Cong Wang 提交于
      GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
      conditionally. When it is set, it assumes the outer IP header is
      already created before ipgre_xmit().
      
      This is not true when we send packets through a raw packet socket,
      where L2 headers are supposed to be constructed by user. Packet
      socket calls dev_validate_header() to validate the header. But
      GRE tunnel does not set dev->hard_header_len, so that check can
      be simply bypassed, therefore uninit memory could be passed down
      to ipgre_xmit(). Similar for dev->needed_headroom.
      
      dev->hard_header_len is supposed to be the length of the header
      created by dev->header_ops->create(), so it should be used whenever
      header_ops is set, and dev->needed_headroom should be used when it
      is not set.
      
      Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
      Cc: William Tu <u9012063@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NXie He <xie.he.0141@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      fdafed45
  2. 11 10月, 2020 1 次提交
  3. 06 10月, 2020 1 次提交
    • E
      tcp: fix receive window update in tcp_add_backlog() · 86bccd03
      Eric Dumazet 提交于
      We got reports from GKE customers flows being reset by netfilter
      conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
      
      Traces seemed to suggest ACK packet being dropped by the
      packet capture, or more likely that ACK were received in the
      wrong order.
      
       wscale=7, SYN and SYNACK not shown here.
      
       This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
       New right edge of the window -> 51359321+1871*128=51598809
      
       09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
       09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
       09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       Receiver now allows to send 606*128=77568 from seq 51521241 :
       New right edge of the window -> 51521241+606*128=51598809
      
       09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
      
       09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
       09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
      
       09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
      
       netfilter conntrack is not happy and sends RST
      
       09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
       09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
      
       Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
       New right edge of the window -> 51521241+859*128=51631193
      
      Normally TCP stack handles OOO packets just fine, but it
      turns out tcp_add_backlog() does not. It can update the window
      field of the aggregated packet even if the ACK sequence
      of the last received packet is too old.
      
      Many thanks to Alexandre Ferrieux for independently reporting the issue
      and suggesting a fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAlexandre Ferrieux <alexandre.ferrieux@orange.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86bccd03
  4. 03 10月, 2020 2 次提交
  5. 25 9月, 2020 1 次提交
  6. 22 9月, 2020 1 次提交
    • E
      inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute · d5e4d0a5
      Eric Dumazet 提交于
      User space could send an invalid INET_DIAG_REQ_PROTOCOL attribute
      as caught by syzbot.
      
      BUG: KMSAN: uninit-value in inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
      BUG: KMSAN: uninit-value in __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
      CPU: 0 PID: 8505 Comm: syz-executor174 Not tainted 5.9.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x21c/0x280 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
       inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
       __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
       inet_diag_dump_compat+0x2a5/0x380 net/ipv4/inet_diag.c:1254
       netlink_dump+0xb73/0x1cb0 net/netlink/af_netlink.c:2246
       __netlink_dump_start+0xcf2/0xea0 net/netlink/af_netlink.c:2354
       netlink_dump_start include/linux/netlink.h:246 [inline]
       inet_diag_rcv_msg_compat+0x5da/0x6c0 net/ipv4/inet_diag.c:1288
       sock_diag_rcv_msg+0x24f/0x620 net/core/sock_diag.c:256
       netlink_rcv_skb+0x6d7/0x7e0 net/netlink/af_netlink.c:2470
       sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275
       netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
       netlink_unicast+0x11c8/0x1490 net/netlink/af_netlink.c:1330
       netlink_sendmsg+0x173a/0x1840 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x441389
      Code: e8 fc ab 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff3b02ce98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441389
      RDX: 0000000000000000 RSI: 0000000020001500 RDI: 0000000000000003
      RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000402130
      R13: 00000000004021c0 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:143 [inline]
       kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:126
       kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80
       slab_alloc_node mm/slub.c:2907 [inline]
       __kmalloc_node_track_caller+0x9aa/0x12f0 mm/slub.c:4511
       __kmalloc_reserve net/core/skbuff.c:142 [inline]
       __alloc_skb+0x35f/0xb30 net/core/skbuff.c:210
       alloc_skb include/linux/skbuff.h:1094 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
       netlink_sendmsg+0xdb9/0x1840 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 3f935c75 ("inet_diag: support for wider protocol numbers")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5e4d0a5
  7. 16 9月, 2020 1 次提交
    • D
      ipv4: Update exception handling for multipath routes via same device · 2fbc6e89
      David Ahern 提交于
      Kfir reported that pmtu exceptions are not created properly for
      deployments where multipath routes use the same device.
      
      After some digging I see 2 compounding problems:
      1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
         the route lookup. This is the second use case where this has
         been a problem (the first is related to use of vti devices with
         VRF). I can not find any reason for the oif to be changed after the
         lookup; the code goes back to the start of git. It does not seem
         logical so remove it.
      
      2. fib_lookups for exceptions do not call fib_select_path to handle
         multipath route selection based on the hash.
      
      The end result is that the fib_lookup used to add the exception
      always creates it based using the first leg of the route.
      
      An example topology showing the problem:
      
                       |  host1
                   +------+
                   | eth0 |  .209
                   +------+
                       |
                   +------+
           switch  | br0  |
                   +------+
                       |
             +---------+---------+
             | host2             |  host3
         +------+             +------+
         | eth0 | .250        | eth0 | 192.168.252.252
         +------+             +------+
      
         +-----+             +-----+
         | vti | .2          | vti | 192.168.247.3
         +-----+             +-----+
             \                  /
       =================================
       tunnels
               192.168.247.1/24
      
      for h in host1 host2 host3; do
              ip netns add ${h}
              ip -netns ${h} link set lo up
              ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
      done
      
      ip netns add switch
      ip -netns switch li set lo up
      ip -netns switch link add br0 type bridge stp 0
      ip -netns switch link set br0 up
      
      for n in 1 2 3; do
              ip -netns switch link add eth-sw type veth peer name eth-h${n}
              ip -netns switch li set eth-h${n} master br0 up
              ip -netns switch li set eth-sw netns host${n} name eth0
      done
      
      ip -netns host1 addr add 192.168.252.209/24 dev eth0
      ip -netns host1 link set dev eth0 up
      ip -netns host1 route add 192.168.247.0/24 \
              nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
      
      ip -netns host2 addr add 192.168.252.250/24 dev eth0
      ip -netns host2 link set dev eth0 up
      
      ip -netns host2 addr add 192.168.252.252/24 dev eth0
      ip -netns host3 link set dev eth0 up
      
      ip netns add tunnel
      ip -netns tunnel li set lo up
      ip -netns tunnel li add br0 type bridge
      ip -netns tunnel li set br0 up
      for n in $(seq 11 20); do
              ip -netns tunnel addr add dev br0 192.168.247.${n}/24
      done
      
      for n in 2 3
      do
              ip -netns tunnel link add vti${n} type veth peer name eth${n}
              ip -netns tunnel link set eth${n} mtu 1360 master br0 up
              ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
              ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
      done
      ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3
      
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
      ip -netns host1 ro ls cache
      
      Before this patch the cache always shows exceptions against the first
      leg in the multipath route; 192.168.252.250 per this example. Since the
      hash has an initial random seed, you may need to vary the final octet
      more than what is listed. In my tests, using addresses between 11 and 19
      usually found 1 that used both legs.
      
      With this patch, the cache will have exceptions for both legs.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions")
      Reported-by: NKfir Itzhak <mastertheknife@gmail.com>
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2fbc6e89
  8. 15 9月, 2020 2 次提交
  9. 09 9月, 2020 1 次提交
  10. 29 8月, 2020 1 次提交
  11. 27 8月, 2020 2 次提交
    • M
      net: Fix some comments · 645f0897
      Miaohe Lin 提交于
      Fix some comments, including wrong function name, duplicated word and so
      on.
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      645f0897
    • I
      ipv4: Silence suspicious RCU usage warning · 7f6f32bb
      Ido Schimmel 提交于
      fib_info_notify_update() is always called with RTNL held, but not from
      an RCU read-side critical section. This leads to the following warning
      [1] when the FIB table list is traversed with
      hlist_for_each_entry_rcu(), but without a proper lockdep expression.
      
      Since modification of the list is protected by RTNL, silence the warning
      by adding a lockdep expression which verifies RTNL is held.
      
      [1]
       =============================
       WARNING: suspicious RCU usage
       5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
       -----------------------------
       net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       1 lock held by ip/834:
        #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
      
       stack backtrace:
       CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
       Call Trace:
        dump_stack+0x100/0x184
        lockdep_rcu_suspicious+0x143/0x14d
        fib_info_notify_update+0x8d1/0xa60
        __nexthop_replace_notify+0xd2/0x290
        rtm_new_nexthop+0x35e2/0x5946
        rtnetlink_rcv_msg+0x4f7/0xbd0
        netlink_rcv_skb+0x17a/0x480
        rtnetlink_rcv+0x22/0x30
        netlink_unicast+0x5ae/0x890
        netlink_sendmsg+0x98a/0xf40
        ____sys_sendmsg+0x879/0xa00
        ___sys_sendmsg+0x122/0x190
        __sys_sendmsg+0x103/0x1d0
        __x64_sys_sendmsg+0x7d/0xb0
        do_syscall_64+0x32/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fde28c3be57
       Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51
      c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
      RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
      RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
      R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
      R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020
      
      Fixes: 1bff1a0c ("ipv4: Add function to send route updates")
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f6f32bb
  12. 23 8月, 2020 1 次提交
    • N
      net: nexthop: don't allow empty NHA_GROUP · eeaac363
      Nikolay Aleksandrov 提交于
      Currently the nexthop code will use an empty NHA_GROUP attribute, but it
      requires at least 1 entry in order to function properly. Otherwise we
      end up derefencing null or random pointers all over the place due to not
      having any nh_grp_entry members allocated, nexthop code relies on having at
      least the first member present. Empty NHA_GROUP doesn't make any sense so
      just disallow it.
      Also add a WARN_ON for any future users of nexthop_create_group().
      
       BUG: kernel NULL pointer dereference, address: 0000000000000080
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
       RIP: 0010:fib_check_nexthop+0x4a/0xaa
       Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
       RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
       RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
       RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
       RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
       R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
       R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
       FS:  00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
       Call Trace:
        fib_create_info+0x64d/0xaf7
        fib_table_insert+0xf6/0x581
        ? __vma_adjust+0x3b6/0x4d4
        inet_rtm_newroute+0x56/0x70
        rtnetlink_rcv_msg+0x1e3/0x20d
        ? rtnl_calcit.isra.0+0xb8/0xb8
        netlink_rcv_skb+0x5b/0xac
        netlink_unicast+0xfa/0x17b
        netlink_sendmsg+0x334/0x353
        sock_sendmsg_nosec+0xf/0x3f
        ____sys_sendmsg+0x1a0/0x1fc
        ? copy_msghdr_from_user+0x4c/0x61
        ___sys_sendmsg+0x63/0x84
        ? handle_mm_fault+0xa39/0x11b5
        ? sockfd_lookup_light+0x72/0x9a
        __sys_sendmsg+0x50/0x6e
        do_syscall_64+0x54/0xbe
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f10dacc0bb7
       Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
       RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
       RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
       RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
       R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
       R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
       Modules linked in:
       CR2: 0000000000000080
      
      CC: David Ahern <dsahern@gmail.com>
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeaac363
  13. 19 8月, 2020 1 次提交
  14. 13 8月, 2020 2 次提交
  15. 12 8月, 2020 2 次提交
    • T
      net: initialize fastreuse on inet_inherit_port · d76f3351
      Tim Froidcoeur 提交于
      In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
      SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
      behaviour or in the incorrect reuse of a bind.
      
      the kernel keeps track for each bind_bucket if all sockets in the
      bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags.
      These flags allow skipping the costly bind_conflict check when possible
      (meaning when all sockets have the proper SO_REUSE option).
      
      For every socket added to a bind_bucket, these flags need to be updated.
      As soon as a socket that does not support reuse is added, the flag is
      set to false and will never go back to true, unless the bind_bucket is
      deleted.
      
      Note that there is no mechanism to re-evaluate these flags when a socket
      is removed (this might make sense when removing a socket that would not
      allow reuse; this leaves room for a future patch).
      
      For this optimization to work, it is mandatory that these flags are
      properly initialized and updated.
      
      When a child socket is created from a listen socket in
      __inet_inherit_port, the TPROXY case could create a new bind bucket
      without properly initializing these flags, thus preventing the
      optimization to work. Alternatively, a socket not allowing reuse could
      be added to an existing bind bucket without updating the flags, causing
      bind_conflict to never be called as it should.
      
      Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
      a new bind_bucket or use a different bind_bucket than the one of the
      listen socket.
      
      Fixes: 093d2823 ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
      Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NTim Froidcoeur <tim.froidcoeur@tessares.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d76f3351
    • T
      net: refactor bind_bucket fastreuse into helper · 62ffc589
      Tim Froidcoeur 提交于
      Refactor the fastreuse update code in inet_csk_get_port into a small
      helper function that can be called from other places.
      Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NTim Froidcoeur <tim.froidcoeur@tessares.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62ffc589
  16. 11 8月, 2020 2 次提交
  17. 08 8月, 2020 2 次提交
    • Y
      ip_vti: Fix unused variable warning · 61ee4137
      YueHaibing 提交于
      If CONFIG_INET_XFRM_TUNNEL is set but CONFIG_IPV6 is n,
      
      net/ipv4/ip_vti.c:493:27: warning: 'vti_ipip6_handler' defined but not used [-Wunused-variable]
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      61ee4137
    • W
      mm, treewide: rename kzfree() to kfree_sensitive() · 453431a5
      Waiman Long 提交于
      As said by Linus:
      
        A symmetric naming is only helpful if it implies symmetries in use.
        Otherwise it's actively misleading.
      
        In "kzalloc()", the z is meaningful and an important part of what the
        caller wants.
      
        In "kzfree()", the z is actively detrimental, because maybe in the
        future we really _might_ want to use that "memfill(0xdeadbeef)" or
        something. The "zero" part of the interface isn't even _relevant_.
      
      The main reason that kzfree() exists is to clear sensitive information
      that should not be leaked to other future users of the same memory
      objects.
      
      Rename kzfree() to kfree_sensitive() to follow the example of the recently
      added kvfree_sensitive() and make the intention of the API more explicit.
      In addition, memzero_explicit() is used to clear the memory to make sure
      that it won't get optimized away by the compiler.
      
      The renaming is done by using the command sequence:
      
        git grep -w --name-only kzfree |\
        xargs sed -i 's/kzfree/kfree_sensitive/'
      
      followed by some editing of the kfree_sensitive() kerneldoc and adding
      a kzfree backward compatibility macro in slab.h.
      
      [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
      [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
      Suggested-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      453431a5
  18. 06 8月, 2020 1 次提交
  19. 05 8月, 2020 2 次提交
    • S
      tunnels: PMTU discovery support for directly bridged IP packets · 4cb47a86
      Stefano Brivio 提交于
      It's currently possible to bridge Ethernet tunnels carrying IP
      packets directly to external interfaces without assigning them
      addresses and routes on the bridged network itself: this is the case
      for UDP tunnels bridged with a standard bridge or by Open vSwitch.
      
      PMTU discovery is currently broken with those configurations, because
      the encapsulation effectively decreases the MTU of the link, and
      while we are able to account for this using PMTU discovery on the
      lower layer, we don't have a way to relay ICMP or ICMPv6 messages
      needed by the sender, because we don't have valid routes to it.
      
      On the other hand, as a tunnel endpoint, we can't fragment packets
      as a general approach: this is for instance clearly forbidden for
      VXLAN by RFC 7348, section 4.3:
      
         VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
         fragment encapsulated VXLAN packets due to the larger frame size.
         The destination VTEP MAY silently discard such VXLAN fragments.
      
      The same paragraph recommends that the MTU over the physical network
      accomodates for encapsulations, but this isn't a practical option for
      complex topologies, especially for typical Open vSwitch use cases.
      
      Further, it states that:
      
         Other techniques like Path MTU discovery (see [RFC1191] and
         [RFC1981]) MAY be used to address this requirement as well.
      
      Now, PMTU discovery already works for routed interfaces, we get
      route exceptions created by the encapsulation device as they receive
      ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
      we already rebuild those messages with the appropriate MTU and route
      them back to the sender.
      
      Add the missing bits for bridged cases:
      
      - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
        to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
        RFC 4443 section 2.4 for ICMPv6. This function is already called by
        UDP tunnels
      
      - a new function generating those ICMP or ICMPv6 replies. We can't
        reuse icmp_send() and icmp6_send() as we don't see the sender as a
        valid destination. This doesn't need to be generic, as we don't
        cover any other type of ICMP errors given that we only provide an
        encapsulation function to the sender
      
      While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
      we might receive GSO buffers here, and the passed headroom already
      includes the inner MAC length, so we don't have to account for it
      a second time (that would imply three MAC headers on the wire, but
      there are just two).
      
      This issue became visible while bridging IPv6 packets with 4500 bytes
      of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
      bytes of encapsulation headroom, we would advertise MTU as 3950, and
      we would reject fragmented IPv6 datagrams of 3958 bytes size on the
      wire. We're exclusively dealing with network MTU here, though, so we
      could get Ethernet frames up to 3964 octets in that case.
      
      v2:
      - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
      - split IPv4/IPv6 functions (David Ahern)
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cb47a86
    • S
      ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Stefano Brivio 提交于
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with a MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern)
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df23bb18
  20. 04 8月, 2020 3 次提交
    • J
      tcp: apply a floor of 1 for RTT samples from TCP timestamps · 730e700e
      Jianfeng Wang 提交于
      For retransmitted packets, TCP needs to resort to using TCP timestamps
      for computing RTT samples. In the common case where the data and ACK
      fall in the same 1-millisecond interval, TCP senders with millisecond-
      granularity TCP timestamps compute a ca_rtt_us of 0. This ca_rtt_us
      of 0 propagates to rs->rtt_us.
      
      This value of 0 can cause performance problems for congestion control
      modules. For example, in BBR, the zero min_rtt sample can bring the
      min_rtt and BDP estimate down to 0, reduce snd_cwnd and result in a
      low throughput. It would be hard to mitigate this with filtering in
      the congestion control module, because the proper floor to apply would
      depend on the method of RTT sampling (using timestamp options or
      internally-saved transmission timestamps).
      
      This fix applies a floor of 1 for the RTT sample delta from TCP
      timestamps, so that seq_rtt_us, ca_rtt_us, and rs->rtt_us will be at
      least 1 * (USEC_PER_SEC / TCP_TS_HZ).
      
      Note that the receiver RTT computation in tcp_rcv_rtt_measure() and
      min_rtt computation in tcp_update_rtt_min() both already apply a floor
      of 1 timestamp tick, so this commit makes the code more consistent in
      avoiding this edge case of a value of 0.
      Signed-off-by: NJianfeng Wang <jfwang@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NKevin Yang <yyd@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      730e700e
    • L
      net: gre: recompute gre csum for sctp over gre tunnels · 622e32b7
      Lorenzo Bianconi 提交于
      The GRE tunnel can be used to transport traffic that does not rely on a
      Internet checksum (e.g. SCTP). The issue can be triggered creating a GRE
      or GRETAP tunnel and transmitting SCTP traffic ontop of it where CRC
      offload has been disabled. In order to fix the issue we need to
      recompute the GRE csum in gre_gso_segment() not relying on the inner
      checksum.
      The issue is still present when we have the CRC offload enabled.
      In this case we need to disable the CRC offload if we require GRE
      checksum since otherwise skb_checksum() will report a wrong value.
      
      Fixes: 90017acc ("sctp: Add GSO support")
      Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org>
      Reviewed-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      622e32b7
    • J
      udp_tunnel: add the ability to hard-code IANA VXLAN · 966e5059
      Jakub Kicinski 提交于
      mlx5 has the IANA VXLAN port (4789) hard coded by the device,
      instead of being added dynamically when tunnels are created.
      
      To support this add a workaround flag to struct udp_tunnel_nic_info.
      Skipping updates for the port is fairly trivial, dumping the hard
      coded port via ethtool requires some code duplication. The port
      is not a part of any real table, we dump it in a special table
      which has no tunnel types supported and only one entry.
      
      This is the last known workaround / hack needed to convert
      all drivers to the new infra.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      966e5059
  21. 02 8月, 2020 1 次提交
    • E
      tcp: fix build fong CONFIG_MPTCP=n · 0e8642cf
      Eric Dumazet 提交于
      Fixes these errors:
      
      net/ipv4/syncookies.c: In function 'tcp_get_cookie_sock':
      net/ipv4/syncookies.c:216:19: error: 'struct tcp_request_sock' has no
      member named 'drop_req'
        216 |   if (tcp_rsk(req)->drop_req) {
            |                   ^~
      net/ipv4/syncookies.c: In function 'cookie_tcp_reqsk_alloc':
      net/ipv4/syncookies.c:289:27: warning: unused variable 'treq'
      [-Wunused-variable]
        289 |  struct tcp_request_sock *treq;
            |                           ^~~~
      make[3]: *** [scripts/Makefile.build:280: net/ipv4/syncookies.o] Error 1
      make[3]: *** Waiting for unfinished jobs....
      
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e8642cf
  22. 01 8月, 2020 4 次提交
  23. 31 7月, 2020 1 次提交
  24. 30 7月, 2020 1 次提交
    • I
      ipv4: Silence suspicious RCU usage warning · 83f35228
      Ido Schimmel 提交于
      fib_trie_unmerge() is called with RTNL held, but not from an RCU
      read-side critical section. This leads to the following warning [1] when
      the FIB alias list in a leaf is traversed with
      hlist_for_each_entry_rcu().
      
      Since the function is always called with RTNL held and since
      modification of the list is protected by RTNL, simply use
      hlist_for_each_entry() and silence the warning.
      
      [1]
      WARNING: suspicious RCU usage
      5.8.0-rc4-custom-01520-gc1f937f3f83b #30 Not tainted
      -----------------------------
      net/ipv4/fib_trie.c:1867 RCU-list traversed in non-reader section!!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by ip/164:
       #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
      
      stack backtrace:
      CPU: 0 PID: 164 Comm: ip Not tainted 5.8.0-rc4-custom-01520-gc1f937f3f83b #30
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack+0x100/0x184
       lockdep_rcu_suspicious+0x153/0x15d
       fib_trie_unmerge+0x608/0xdb0
       fib_unmerge+0x44/0x360
       fib4_rule_configure+0xc8/0xad0
       fib_nl_newrule+0x37a/0x1dd0
       rtnetlink_rcv_msg+0x4f7/0xbd0
       netlink_rcv_skb+0x17a/0x480
       rtnetlink_rcv+0x22/0x30
       netlink_unicast+0x5ae/0x890
       netlink_sendmsg+0x98a/0xf40
       ____sys_sendmsg+0x879/0xa00
       ___sys_sendmsg+0x122/0x190
       __sys_sendmsg+0x103/0x1d0
       __x64_sys_sendmsg+0x7d/0xb0
       do_syscall_64+0x54/0xa0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fc80a234e97
      Code: Bad RIP value.
      RSP: 002b:00007ffef8b66798 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc80a234e97
      RDX: 0000000000000000 RSI: 00007ffef8b66800 RDI: 0000000000000003
      RBP: 000000005f141b1c R08: 0000000000000001 R09: 0000000000000000
      R10: 00007fc80a2a8ac0 R11: 0000000000000246 R12: 0000000000000001
      R13: 0000000000000000 R14: 00007ffef8b67008 R15: 0000556fccb10020
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83f35228
  25. 29 7月, 2020 3 次提交