1. 13 Oct, 2022 2 commits
    • udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM). · 21985f43
      Kuniyuki Iwashima committed
      Commit 4b340ae2 ("IPv6: Complete IPV6_DONTFRAG support") forgot
      to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
      socket into IPv4 with IPV6_ADDRFORM.  After conversion, sk_prot is
      changed to udp_prot and ->destroy() never cleans it up, resulting in
      a memory leak.
      
      This is due to the discrepancy between inet6_destroy_sock() and
      IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
      to remove the difference.
      
      However, this is not enough for now because rxpmtu can be changed
      without lock_sock() after commit 03485f2a ("udpv6: Add lockless
      sendmsg() support").  We will fix this case in the following patch.
      
      Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
      remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
      in the future.
      
      Fixes: 4b340ae2 ("IPv6: Complete IPV6_DONTFRAG support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp/udp: Fix memory leak in ipv6_renew_options(). · 3c52c6bb
      Kuniyuki Iwashima committed
      syzbot reported a memory leak [0] related to IPV6_ADDRFORM.
      
      The scenario is that while one thread is converting an IPv6 socket into
      IPv4 with IPV6_ADDRFORM, another thread calls do_ipv6_setsockopt() and
      allocates memory to inet6_sk(sk)->XXX after conversion.
      
      Then, the converted sk with (tcp|udp)_prot never frees the IPv6 resources,
      which inet6_destroy_sock() should have cleaned up.
      
      setsockopt(IPV6_ADDRFORM)                 setsockopt(IPV6_DSTOPTS)
      +-----------------------+                 +----------------------+
      - do_ipv6_setsockopt(sk, ...)
        - sockopt_lock_sock(sk)                 - do_ipv6_setsockopt(sk, ...)
          - lock_sock(sk)                         ^._ called via tcpv6_prot
        - WRITE_ONCE(sk->sk_prot, &tcp_prot)          before WRITE_ONCE()
        - xchg(&np->opt, NULL)
        - txopt_put(opt)
        - sockopt_release_sock(sk)
          - release_sock(sk)                      - sockopt_lock_sock(sk)
                                                    - lock_sock(sk)
                                                  - ipv6_set_opt_hdr(sk, ...)
                                                    - ipv6_update_options(sk, opt)
                                                      - xchg(&inet6_sk(sk)->opt, opt)
                                                        ^._ opt is never freed.
      
                                                  - sockopt_release_sock(sk)
                                                    - release_sock(sk)
      
      Since IPV6_DSTOPTS allocates options under lock_sock(), we can avoid this
      memory leak by testing whether sk_family is changed by IPV6_ADDRFORM after
      acquiring the lock.
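
The check described above can be sketched in user space. This is a minimal, single-threaded model, not the kernel patch: `struct fake_sock`, `fake_addrform()` and `fake_set_dstopts()` are hypothetical names, the lock is a plain flag, and the `-ENOPROTOOPT` return value is an assumption of the sketch. The second setter simply "arrives" after the conversion has finished.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Hypothetical stand-in for a socket; NOT the kernel's struct sock. */
struct fake_sock {
    int   family;  /* AF_INET6 until IPV6_ADDRFORM converts it */
    int   locked;  /* models lock_sock() / release_sock() */
    void *opt;     /* models inet6_sk(sk)->opt */
};

static void fake_lock(struct fake_sock *sk)
{
    assert(!sk->locked);
    sk->locked = 1;
}

static void fake_release(struct fake_sock *sk)
{
    assert(sk->locked);
    sk->locked = 0;
}

/* Models setsockopt(IPV6_ADDRFORM): drop IPv6 state, then switch family. */
static void fake_addrform(struct fake_sock *sk)
{
    fake_lock(sk);
    free(sk->opt);
    sk->opt = NULL;
    sk->family = AF_INET;
    fake_release(sk);
}

/* Models setsockopt(IPV6_DSTOPTS): allocate only if the socket is still IPv6. */
static int fake_set_dstopts(struct fake_sock *sk, size_t len)
{
    int ret = -ENOPROTOOPT;  /* error code is an assumption of this sketch */
    void *opt;

    fake_lock(sk);
    if (sk->family != AF_INET6)  /* converted before we got the lock */
        goto unlock;

    opt = malloc(len);
    if (!opt) {
        ret = -ENOMEM;
        goto unlock;
    }
    free(sk->opt);  /* replace any previous options */
    sk->opt = opt;
    ret = 0;
unlock:
    fake_release(sk);
    return ret;
}
```

Because the family test runs only after the lock is held, a setter that lost the race returns an error instead of attaching memory that the IPv4 destroy path would never free.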
      
      This issue exists from the initial commit between IPV6_ADDRFORM and
      IPV6_PKTOPTIONS.
      
      [0]:
      BUG: memory leak
      unreferenced object 0xffff888009ab9f80 (size 96):
        comm "syz-executor583", pid 328, jiffies 4294916198 (age 13.034s)
        hex dump (first 32 bytes):
          01 00 00 00 48 00 00 00 08 00 00 00 00 00 00 00  ....H...........
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000002ee98ae1>] kmalloc include/linux/slab.h:605 [inline]
          [<000000002ee98ae1>] sock_kmalloc+0xb3/0x100 net/core/sock.c:2566
          [<0000000065d7b698>] ipv6_renew_options+0x21e/0x10b0 net/ipv6/exthdrs.c:1318
          [<00000000a8c756d7>] ipv6_set_opt_hdr net/ipv6/ipv6_sockglue.c:354 [inline]
          [<00000000a8c756d7>] do_ipv6_setsockopt.constprop.0+0x28b7/0x4350 net/ipv6/ipv6_sockglue.c:668
          [<000000002854d204>] ipv6_setsockopt+0xdf/0x190 net/ipv6/ipv6_sockglue.c:1021
          [<00000000e69fdcf8>] tcp_setsockopt+0x13b/0x2620 net/ipv4/tcp.c:3789
          [<0000000090da4b9b>] __sys_setsockopt+0x239/0x620 net/socket.c:2252
          [<00000000b10d192f>] __do_sys_setsockopt net/socket.c:2263 [inline]
          [<00000000b10d192f>] __se_sys_setsockopt net/socket.c:2260 [inline]
          [<00000000b10d192f>] __x64_sys_setsockopt+0xbe/0x160 net/socket.c:2260
          [<000000000a80d7aa>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<000000000a80d7aa>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
          [<000000004562b5c6>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 12 Oct, 2022 2 commits
    • netfilter: rpfilter/fib: Populate flowic_l3mdev field · acc641ab
      Phil Sutter committed
      Use the introduced field for correct operation with VRF devices instead
      of conditionally overwriting flowic_oif. This is a partial revert of
      commit b575b24b ("netfilter: Fix rpfilter dropping vrf packets by
      mistake"), implementing a simpler solution.

      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • inet: ping: fix recent breakage · 0d24148b
      Eric Dumazet committed
      Blamed commit broke the assumption used by ping sendmsg() that
      allocated skb would have MAX_HEADER bytes in skb->head.
      
      This patch changes the way ping works, by making sure
      the skb head contains space for the icmp header,
      and adjusting ping_getfrag() which was desperate
      about going past the icmp header :/
      
      This is adopting what UDP does, mostly.
      
      syzbot is able to crash a host using both kfence and following repro in a loop.
      
      fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6)
      connect(fd, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0),
      		inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28
      sendmsg(fd, {msg_name=NULL, msg_namelen=0, msg_iov=[
      		{iov_base="\200\0\0\0\23\0\0\0\0\0\0\0\0\0"..., iov_len=65496}],
      		msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0
      
      When kfence triggers, skb->head only has 64 bytes, immediately followed
      by struct skb_shared_info (no extra headroom based on ksize(ptr))
      
      Then icmpv6_push_pending_frames() is overwriting first bytes
      of skb_shinfo(skb), making nr_frags bigger than MAX_SKB_FRAGS,
      and/or setting shinfo->gso_size to a non zero value.
      
      If nr_frags is mangled, a crash happens in skb_release_data()
      
      If gso_size is mangled, we have the following report:
      
      lo: caps=(0x00000516401d7c69, 0x00000516401d7c69)
      WARNING: CPU: 0 PID: 7548 at net/core/dev.c:3239 skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
      Modules linked in:
      CPU: 0 PID: 7548 Comm: syz-executor268 Not tainted 6.0.0-syzkaller-02754-g557f0501 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      RIP: 0010:skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
      Code: 70 03 00 00 e8 58 c3 24 fa 4c 8d a5 e8 00 00 00 e8 4c c3 24 fa 4c 89 e9 4c 89 e2 4c 89 f6 48 c7 c7 00 53 f5 8a e8 13 ac e7 01 <0f> 0b 5b 5d 41 5c 41 5d 41 5e e9 28 c3 24 fa e8 23 c3 24 fa 48 89
      RSP: 0018:ffffc9000366f3e8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff88807a9d9d00 RCX: 0000000000000000
      RDX: ffff8880780c0000 RSI: ffffffff8160f6f8 RDI: fffff520006cde6f
      RBP: ffff888079952000 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000400 R11: 0000000000000000 R12: ffff8880799520e8
      R13: ffff88807a9da070 R14: ffff888079952000 R15: 0000000000000000
      FS: 0000555556be6300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020010000 CR3: 000000006eb7b000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      gso_features_check net/core/dev.c:3521 [inline]
      netif_skb_features+0x83e/0xb90 net/core/dev.c:3554
      validate_xmit_skb+0x2b/0xf10 net/core/dev.c:3659
      __dev_queue_xmit+0x998/0x3ad0 net/core/dev.c:4248
      dev_queue_xmit include/linux/netdevice.h:3008 [inline]
      neigh_hh_output include/net/neighbour.h:530 [inline]
      neigh_output include/net/neighbour.h:544 [inline]
      ip6_finish_output2+0xf97/0x1520 net/ipv6/ip6_output.c:134
      __ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
      ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206
      NF_HOOK_COND include/linux/netfilter.h:291 [inline]
      ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227
      dst_output include/net/dst.h:445 [inline]
      ip6_local_out+0xaf/0x1a0 net/ipv6/output_core.c:161
      ip6_send_skb+0xb7/0x340 net/ipv6/ip6_output.c:1966
      ip6_push_pending_frames+0xdd/0x100 net/ipv6/ip6_output.c:1986
      icmpv6_push_pending_frames+0x2af/0x490 net/ipv6/icmp.c:303
      ping_v6_sendmsg+0xc44/0x1190 net/ipv6/ping.c:190
      inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
      sock_sendmsg_nosec net/socket.c:714 [inline]
      sock_sendmsg+0xcf/0x120 net/socket.c:734
      ____sys_sendmsg+0x712/0x8c0 net/socket.c:2482
      ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
      __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f21aab42b89
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fff1729d038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f21aab42b89
      RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 000000000000000d R09: 000000000000000d
      R10: 000000000000000d R11: 0000000000000246 R12: 00007fff1729d050
      R13: 00000000000f4240 R14: 0000000000021dd1 R15: 00007fff1729d044
      </TASK>
      
      Fixes: 47cf8899 ("net: unify alloclen calculation for paged requests")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Oct, 2022 2 commits
  4. 30 Sep, 2022 2 commits
  5. 29 Sep, 2022 7 commits
  6. 28 Sep, 2022 2 commits
  7. 24 Sep, 2022 1 commit
  8. 21 Sep, 2022 6 commits
    • ipv6: Fix crash when IPv6 is administratively disabled · 76dd0728
      Ido Schimmel committed
      The global 'raw_v6_hashinfo' variable can be accessed even when IPv6 is
      administratively disabled via the 'ipv6.disable=1' kernel command line
      option, leading to a crash [1].
      
      Fix by restoring the original behavior and always initializing the
      variable, regardless of IPv6 support being administratively disabled or
      not.
      
      [1]
       BUG: unable to handle page fault for address: ffffffffffffffc8
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 173e18067 P4D 173e18067 PUD 173e1a067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP KASAN
       CPU: 3 PID: 271 Comm: ss Not tainted 6.0.0-rc4-custom-00136-g0727a9a5 #1396
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
       RIP: 0010:raw_diag_dump+0x310/0x7f0
       [...]
       Call Trace:
        <TASK>
        __inet_diag_dump+0x10f/0x2e0
        netlink_dump+0x575/0xfd0
        __netlink_dump_start+0x67b/0x940
        inet_diag_handler_cmd+0x273/0x2d0
        sock_diag_rcv_msg+0x317/0x440
        netlink_rcv_skb+0x15e/0x430
        sock_diag_rcv+0x2b/0x40
        netlink_unicast+0x53b/0x800
        netlink_sendmsg+0x945/0xe60
        ____sys_sendmsg+0x747/0x960
        ___sys_sendmsg+0x13a/0x1e0
        __sys_sendmsg+0x118/0x1e0
        do_syscall_64+0x34/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Fixes: 0daf07e5 ("raw: convert raw sockets to RCU")
      Reported-by: Roberto Ricci <rroberto2r@gmail.com>
      Tested-by: Roberto Ricci <rroberto2r@gmail.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220916084821.229287-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Save unnecessary inet_twsk_purge() calls. · edc12f03
      Kuniyuki Iwashima committed
      While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
      and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  These commands
      trigger the kernel to walk through the potentially big ehash twice even
      though the netns has no TIME_WAIT sockets.
      
        # ip netns add test
        # ip netns del test
      
        or
      
        # unshare -n /bin/true >/dev/null
      
      When tw_refcount is 1, we need not call inet_twsk_purge() at least
      for the net.  We can save such unneeded iterations if all netns in
      net_exit_list have no TIME_WAIT sockets.  This change eliminates
      the tax by the additional unshare() described in the next patch to
      guarantee the per-netns ehash size.
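
The skip condition can be illustrated with a small user-space sketch. All names here (`struct fake_net`, `fake_twsk_purge()`, `fake_exit_batch()`) are hypothetical, and the only rule modelled is the one stated above: a `tw_refcount` of 1 means the netns has no TIME_WAIT sockets, so the table walk is skipped when every netns in the batch is in that state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-netns state; NOT the kernel's struct net. */
struct fake_net {
    int tw_refcount;  /* 1 means "no TIME_WAIT sockets", as in the text */
};

static int purge_calls;  /* counts simulated ehash walks */

static void fake_twsk_purge(void)
{
    purge_calls++;
}

/* Walk the table once only if some exiting netns may own TIME_WAIT sockets.
 * Returns 1 if a purge was performed, 0 if it was skipped. */
static int fake_exit_batch(const struct fake_net *nets, size_t n)
{
    size_t i;

    assert(nets != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (nets[i].tw_refcount != 1) {
            fake_twsk_purge();
            return 1;
        }
    }
    return 0;
}
```

A batch of freshly created and destroyed namespaces, as in the `unshare -n /bin/true` loop, would take the "skipped" branch every time in this model.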
      
      Tested:
      
        # mount -t debugfs none /sys/kernel/debug/
        # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
        # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
        # echo function > /sys/kernel/debug/tracing/current_tracer
        # cat ./add_del_unshare.sh
        for i in `seq 1 40`
        do
            (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
        done
        wait;
        # ./add_del_unshare.sh
      
      Before the patch:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
          kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
          kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
          kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch
      
      After:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
          kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
          kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Access &tcp_hashinfo via net. · 4461568a
      Kuniyuki Iwashima committed
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use tcp_hashinfo directly in most places.
      
      Instead, access it via net->ipv4.tcp_death_row.hashinfo.
      
      Direct access to tcp_hashinfo will be valid only while initialising
      tcp_hashinfo itself and creating/destroying each netns.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Set NULL to sk->sk_prot->h.hashinfo. · 429e42c1
      Kuniyuki Iwashima committed
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use the global sk->sk_prot->h.hashinfo
      to fetch a TCP hashinfo.
      
      Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
      a proper hashinfo from net->ipv4.tcp_death_row.hashinfo.
      
      Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
      disabled.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. · e9bd0cca
      Kuniyuki Iwashima committed
      We will soon introduce an optional per-netns ehash and access hash
      tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo
      in most places.
      
      It could harm the fast path because dereferences of two fields in net
      and tcp_death_row might incur two extra cache line misses.  To save one
      dereference, let's place tcp_death_row back in netns_ipv4 and fetch
      hashinfo via net->ipv4.tcp_death_row"."hashinfo.
      
      Note tcp_death_row was initially placed in netns_ipv4, and commit
      fbb82952 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
      changed it to a pointer so that we can fire TIME_WAIT timers after freeing
      net.  However, we don't do so after commit 04c494e6 ("Revert "tcp/dccp:
      get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
      pointer.
      
      Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to
      tcp_sk_exit_batch() as a debug check.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Clean up some functions. · 08eaef90
      Kuniyuki Iwashima committed
      This patch adds no functional change and cleans up some functions
      that the following patches touch around so that we make them tidy
      and easy to review/revert.  The changes are
      
        - Keep reverse christmas tree order
        - Remove unnecessary init of port in inet_csk_find_open_port()
        - Use req_to_sk() once in reqsk_queue_unlink()
        - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect()

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 20 Sep, 2022 4 commits
    • ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section · b07a9b26
      Ido Schimmel committed
      These functions expect to be called from RCU read-side critical section,
      but this only happens when invoked from the data path via
      ip{,6}_mr_input(). They can also be invoked from process context in
      response to user space adding a multicast route which resolves a cache
      entry with queued packets [1][2].
      
      Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call
      paths.
      
      [1]
      WARNING: suspicious RCU usage
      6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
      -----------------------------
      net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by smcrouted/246:
       #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420
      
      stack backtrace:
      CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x91/0xb9
       vif_dev_read+0xbf/0xd0
       ipmr_queue_xmit+0x135/0x1ab0
       ip_mr_forward+0xe7b/0x13d0
       ipmr_mfc_add+0x1a06/0x2ad0
       ip_mroute_setsockopt+0x5c1/0x1420
       do_ip_setsockopt+0x23d/0x37f0
       ip_setsockopt+0x56/0x80
       raw_setsockopt+0x219/0x290
       __sys_setsockopt+0x236/0x4d0
       __x64_sys_setsockopt+0xbe/0x160
       do_syscall_64+0x34/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [2]
      WARNING: suspicious RCU usage
      6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
      -----------------------------
      net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by smcrouted/246:
       #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630
      
      stack backtrace:
      CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x91/0xb9
       vif_dev_read+0xbf/0xd0
       ip6mr_forward2.isra.0+0xc9/0x1160
       ip6_mr_forward+0xef0/0x13f0
       ip6mr_mfc_add+0x1ff2/0x31f0
       ip6_mroute_setsockopt+0x1825/0x2630
       do_ipv6_setsockopt+0x462/0x4440
       ipv6_setsockopt+0x105/0x140
       rawv6_setsockopt+0xd8/0x690
       __sys_setsockopt+0x236/0x4d0
       __x64_sys_setsockopt+0xbe/0x160
       do_syscall_64+0x34/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: ebc31979 ("ipmr: add rcu protection over (struct vif_device)->dev")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • seg6: add NEXT-C-SID support for SRv6 End behavior · 848f3c0d
      Andrea Mayer committed
      The NEXT-C-SID mechanism described in [1] offers the possibility of
      encoding several SRv6 segments within a single 128 bit SID address. Such
      a SID address is called a Compressed SID (C-SID) container. In this way,
      the length of the SID List can be drastically reduced.
      
      A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
      logically structured in three main blocks: i) Locator-Block; ii)
      Locator-Node Function; iii) Argument.
      
                              C-SID container
      +------------------------------------------------------------------+
      |     Locator-Block      |Loc-Node|            Argument            |
      |                        |Function|                                |
      +------------------------------------------------------------------+
      <--------- B -----------> <- NF -> <------------- A --------------->
      
         (i) The Locator-Block can be any IPv6 prefix available to the provider;
      
        (ii) The Locator-Node Function represents the node and the function to
             be triggered when a packet is received on the node;
      
       (iii) The Argument carries the remaining C-SIDs in the current C-SID
             container.
      
      The NEXT-C-SID mechanism relies on the "flavors" framework defined in
      [2]. The flavors represent additional operations that can modify or
      extend a subset of the existing behaviors.
      
      This patch introduces the support for flavors in SRv6 End behavior
      implementing the NEXT-C-SID one. An SRv6 End behavior with NEXT-C-SID
      flavor works as an End behavior but it is capable of processing the
      compressed SID List encoded in C-SID containers.
      
      An SRv6 End behavior with NEXT-C-SID flavor can be configured to support
      user-provided Locator-Block and Locator-Node Function lengths. In this
      implementation, such lengths must be evenly divisible by 8 (i.e. must be
      byte-aligned), otherwise the kernel informs the user about invalid
      values with a meaningful error code and message through netlink_ext_ack.
      
      If Locator-Block and/or Locator-Node Function lengths are not provided
      by the user during configuration of an SRv6 End behavior instance with
      NEXT-C-SID flavor, the kernel will choose their default values i.e.,
      32-bit Locator-Block and 16-bit Locator-Node Function.
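
The length rules above can be sketched as a small validation helper. This is an illustration, not the kernel code: `struct csid_cfg`, `csid_cfg_init()` and the convention that 0 means "not provided" are assumptions of the sketch, the error strings are invented, and the 128-bit upper bound is a sanity check added here because the text does not state the kernel's exact bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define CSID_DEFAULT_LBLOCK_BITS 32u  /* default Locator-Block length (from the text) */
#define CSID_DEFAULT_NFUNC_BITS  16u  /* default Locator-Node Function length (from the text) */

/* Hypothetical configuration record for one End behavior instance. */
struct csid_cfg {
    unsigned lblock_bits;
    unsigned nfunc_bits;
};

/* Validate user-provided lengths; 0 means "not provided" in this sketch.
 * On success fills *cfg and returns 0; on failure leaves *cfg untouched,
 * sets *errmsg (standing in for netlink_ext_ack) and returns -EINVAL. */
static int csid_cfg_init(struct csid_cfg *cfg, unsigned lblock_bits,
                         unsigned nfunc_bits, const char **errmsg)
{
    assert(cfg != NULL && errmsg != NULL);

    if (lblock_bits == 0)
        lblock_bits = CSID_DEFAULT_LBLOCK_BITS;
    if (nfunc_bits == 0)
        nfunc_bits = CSID_DEFAULT_NFUNC_BITS;

    if (lblock_bits % 8 != 0) {
        *errmsg = "Locator-Block length must be byte-aligned";
        return -EINVAL;
    }
    if (nfunc_bits % 8 != 0) {
        *errmsg = "Locator-Node Function length must be byte-aligned";
        return -EINVAL;
    }
    /* Sanity check added by this sketch: both blocks share one 128-bit SID. */
    if (lblock_bits + nfunc_bits > 128) {
        *errmsg = "Locator-Block and Locator-Node Function exceed 128 bits";
        return -EINVAL;
    }

    cfg->lblock_bits = lblock_bits;
    cfg->nfunc_bits = nfunc_bits;
    return 0;
}
```

Returning a message alongside the error code mirrors the intent of reporting invalid values "with a meaningful error code and message".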
      
      [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
      [2] - https://datatracker.ietf.org/doc/html/rfc8986

      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • seg6: add netlink_ext_ack support in parsing SRv6 behavior attributes · e2a8ecc4
      Andrea Mayer committed
      An SRv6 behavior instance can be set up using mandatory and/or optional
      attributes.
      In the setup phase, each supplied attribute is parsed and processed. If
      the parsing operation fails, the creation of the behavior instance stops
      and an error number/code is reported to the user.  In many cases, it is
      challenging for the user to figure out exactly what happened by relying
      only on the error code.
      
      For this reason, we add the support for netlink_ext_ack in parsing SRv6
      behavior attributes. In this way, when an SRv6 behavior attribute is
      parsed and an error occurs, the kernel can send a message to the
      userspace describing the error through a meaningful text message in
      addition to the classic error code.

      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net-next: gro: Fix use of skb_gro_header_slow · cb628a9a
      Richard Gobert committed
      In the cited commit, the function ipv6_gro_receive was accidentally
      changed to use skb_gro_header_slow, without attempting the fast path.
      Fix it.
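
The "fast path first, slow path as fallback" pattern can be sketched as follows. This is a user-space model with hypothetical names (`struct fake_skb` and the `fake_gro_header*()` helpers); it only loosely mirrors the kernel helpers named in the commit and makes no claim about their exact signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet model; NOT the kernel's struct sk_buff. */
struct fake_skb {
    const unsigned char *frag0;   /* directly readable region, may be NULL */
    size_t frag0_len;             /* bytes readable via frag0 */
    const unsigned char *linear;  /* region the slow path can provide */
    size_t linear_len;
    int slow_calls;               /* how often the slow path ran */
};

/* The fast path is unusable when frag0 does not cover hlen bytes. */
static int fake_gro_header_hard(const struct fake_skb *skb, size_t hlen)
{
    return skb->frag0_len < hlen;
}

static const void *fake_gro_header_fast(const struct fake_skb *skb, size_t off)
{
    return skb->frag0 + off;
}

/* Slow path: stands in for the more expensive pull operation. */
static const void *fake_gro_header_slow(struct fake_skb *skb, size_t hlen,
                                        size_t off)
{
    skb->slow_calls++;
    if (hlen > skb->linear_len)
        return NULL;
    return skb->linear + off;
}

/* Combined helper: try the fast path, fall back to the slow one. */
static const void *fake_gro_header(struct fake_skb *skb, size_t hlen,
                                   size_t off)
{
    assert(skb != NULL && off <= hlen);
    if (!fake_gro_header_hard(skb, hlen))
        return fake_gro_header_fast(skb, off);
    return fake_gro_header_slow(skb, hlen, off);
}
```

Calling only the slow helper, as the commit says happened by accident, would still return a valid header but would pay the slow-path cost on every packet.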
      
      Fixes: 35ffb665 ("net: gro: skb_gro_header helper function")
      Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
      Link: https://lore.kernel.org/r/20220911184835.GA105063@debian
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  10. 10 Sep, 2022 1 commit
  11. 05 Sep, 2022 2 commits
    • ipv6: sr: fix out-of-bounds read when setting HMAC data. · 84a53580
      David Lebrun committed
      The SRv6 layer allows defining HMAC data that can later be used to sign IPv6
      Segment Routing Headers. This configuration is realised via netlink through
      four attributes: SEG6_ATTR_HMACKEYID, SEG6_ATTR_SECRET, SEG6_ATTR_SECRETLEN and
      SEG6_ATTR_ALGID. Because the SECRETLEN attribute is decoupled from the actual
      length of the SECRET attribute, it is possible to provide invalid combinations
      (e.g., secret = "", secretlen = 64). This case is not checked in the code and
      with an appropriately crafted netlink message, an out-of-bounds read of up
      to 64 bytes (max secret length) can occur past the skb end pointer and into
      skb_shared_info:
      
      Breakpoint 1, seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
      208		memcpy(hinfo->secret, secret, slen);
      (gdb) bt
       #0  seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
       #1  0xffffffff81e012e9 in genl_family_rcv_msg_doit (skb=skb@entry=0xffff88800b1f9f00, nlh=nlh@entry=0xffff88800b1b7600,
          extack=extack@entry=0xffffc90000ba7af0, ops=ops@entry=0xffffc90000ba7a80, hdrlen=4, net=0xffffffff84237580 <init_net>, family=<optimized out>,
          family=<optimized out>) at net/netlink/genetlink.c:731
       #2  0xffffffff81e01435 in genl_family_rcv_msg (extack=0xffffc90000ba7af0, nlh=0xffff88800b1b7600, skb=0xffff88800b1f9f00,
          family=0xffffffff82fef6c0 <seg6_genl_family>) at net/netlink/genetlink.c:775
       #3  genl_rcv_msg (skb=0xffff88800b1f9f00, nlh=0xffff88800b1b7600, extack=0xffffc90000ba7af0) at net/netlink/genetlink.c:792
       #4  0xffffffff81dfffc3 in netlink_rcv_skb (skb=skb@entry=0xffff88800b1f9f00, cb=cb@entry=0xffffffff81e01350 <genl_rcv_msg>)
          at net/netlink/af_netlink.c:2501
       #5  0xffffffff81e00919 in genl_rcv (skb=0xffff88800b1f9f00) at net/netlink/genetlink.c:803
       #6  0xffffffff81dff6ae in netlink_unicast_kernel (ssk=0xffff888010eec800, skb=0xffff88800b1f9f00, sk=0xffff888004aed000)
          at net/netlink/af_netlink.c:1319
       #7  netlink_unicast (ssk=ssk@entry=0xffff888010eec800, skb=skb@entry=0xffff88800b1f9f00, portid=portid@entry=0, nonblock=<optimized out>)
          at net/netlink/af_netlink.c:1345
       #8  0xffffffff81dff9a4 in netlink_sendmsg (sock=<optimized out>, msg=0xffffc90000ba7e48, len=<optimized out>) at net/netlink/af_netlink.c:1921
      ...
      (gdb) p/x ((struct sk_buff *)0xffff88800b1f9f00)->head + ((struct sk_buff *)0xffff88800b1f9f00)->end
      $1 = 0xffff88800b1b76c0
      (gdb) p/x secret
      $2 = 0xffff88800b1b76c0
      (gdb) p slen
      $3 = 64 '@'
      
      The OOB data can then be read back from userspace by dumping HMAC state. This
      commit fixes this by ensuring SECRETLEN cannot exceed the actual length of
      SECRET.
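
The bounds check described above can be sketched in user space. The names `struct fake_hmac_info` and `fake_sethmac()` are hypothetical, and `attr_len` stands in for the actual payload length of the SECRET attribute; the 64-byte maximum is the figure quoted in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define FAKE_SECRET_MAX 64  /* max secret length quoted in the text */

/* Hypothetical HMAC record; NOT the kernel's struct seg6_hmac_info. */
struct fake_hmac_info {
    unsigned char secret[FAKE_SECRET_MAX];
    unsigned char slen;
};

/* Copy a secret only if the declared length fits both the destination
 * buffer and the bytes actually supplied in the attribute. */
static int fake_sethmac(struct fake_hmac_info *h, const void *secret,
                        size_t attr_len, unsigned char slen)
{
    assert(h != NULL && secret != NULL);

    if (slen > FAKE_SECRET_MAX)
        return -EINVAL;
    if (slen > attr_len)   /* declared length exceeds supplied data */
        return -EINVAL;

    memcpy(h->secret, secret, slen);
    h->slen = slen;
    return 0;
}
```

With the second check in place, the combination from the example (an empty secret with a declared length of 64) is rejected before any copy takes place.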

      Reported-by: Lucas Leong <wmliang.tw@gmail.com>
      Tested: verified that EINVAL is correctly returned when secretlen > len(secret)
      Fixes: 4f4853dc ("ipv6: sr: implement API to control SR HMAC structure")
      Signed-off-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: add all node mcast address when slave up · fd16eb94
      Hangbin Liu committed
      When a link is enslaved to a bond, the interface must first be set down.
      This makes the slave remove the MAC multicast address 33:33:00:00:00:01
      (the IPv6 multicast address ff02::1 is kept even when the interface is
      down). When the bond sets the slave up, ipv6_mc_up() is not called due to
      commit c2edacf8 ("bonding / ipv6: no addrconf for slaves separately from
      master").

      This was not an issue before the lladdr target feature was added to
      bonding, because the MAC multicast address is added back when the bond
      interface comes up and joins group ff02::1.

      With the lladdr target feature, however, when the user sets a lladdr
      target, the unsolicited NA message with the all-nodes multicast
      destination is dropped because the slave interface never adds
      33:33:00:00:00:01 back.

      Fix this by calling ipv6_mc_up() to add 33:33:00:00:00:01 back when
      the slave interface comes up.

      Reported-by: LiLiang <liali@redhat.com>
      Fixes: 5e1eeef6 ("bonding: NS target should accept link local address")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 03 Sep, 2022 5 commits
  13. 02 Sep, 2022 1 commit
    • ipv6: tcp: send consistent autoflowlabel in SYN_RECV state · aa51b80e
      Eric Dumazet committed
      This is a followup of commit c67b8555 ("ipv6: tcp: send consistent
      autoflowlabel in TIME_WAIT state"), but for SYN_RECV state.
      
      In some cases, TCP sends a challenge ACK on behalf of a SYN_RECV request.
      When this happens, we want to use the flow label that was used when
      the prior SYNACK packet was sent, instead of another one.

      After this patch, the following packetdrill test passes:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.2 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > (flowlabel 0x11) S. 0:0(0) ack 1 <...>
      // Test if a challenge ack is properly sent (same flowlabel as prior SYNACK)
         +.01 < . 4000000000:4000000000(0) ack 1 win 320
         +0  > (flowlabel 0x11) . 1:1(0) ack 1

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220831203729.458000-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  14. 01 Sep, 2022 1 commit
    • rxrpc: Fix ICMP/ICMP6 error handling · ac56a0b4
      David Howells committed
      Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing
      it to siphon off UDP packets early in the handling of received UDP packets
      thereby avoiding the packet going through the UDP receive queue, it doesn't
      get ICMP packets through the UDP ->sk_error_report() callback.  In fact, it
      doesn't appear that there's any usable option for getting hold of ICMP
      packets.
      
      Fix this by adding a new UDP encap hook to distribute error messages for
      UDP tunnels.  If the hook is set, then the tunnel driver will be able to
      see ICMP packets.  The hook provides the offset into the packet of the UDP
      header of the original packet that caused the notification.
      
      An alternative would be to call the ->error_handler() hook, but that
      requires that the skbuff be cloned (as ip_icmp_error() or ipv6_icmp_error()
      do), which isn't really necessary or desirable in rxrpc's case, as we want
      to parse them there and then, not queue them.
      
      Changes
      =======
      ver #3)
       - Fixed an uninitialised variable.
      
      ver #2)
       - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals.
      
      Fixes: 5271953c ("rxrpc: Use the UDP encap_rcv hook")
      Signed-off-by: David Howells <dhowells@redhat.com>
  15. 30 Aug, 2022 1 commit
  16. 29 Aug, 2022 1 commit
    • genetlink: start to validate reserved header bytes · 9c5d03d3
      Jakub Kicinski committed
      We had historically not checked that genlmsghdr.reserved
      is 0 on input which prevents us from using those precious
      bytes in the future.
      
      One use case would be to extend the cmd field, which is
      currently just 8 bits wide and 256 is not a lot of commands
      for some core families.
      
      To make sure that new families do the right thing by default
      put the onus of opting out of validation on existing families.
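
The validation policy can be sketched as below. `struct fake_genlmsghdr` copies the three-field layout of the generic netlink header (8-bit cmd, 8-bit version, 16-bit reserved); `struct fake_family`, its `skip_reserved_check` flag and `fake_check_hdr()` are hypothetical names for illustration and do not reflect how the kernel expresses the opt-out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the generic netlink message header. */
struct fake_genlmsghdr {
    uint8_t  cmd;
    uint8_t  version;
    uint16_t reserved;
};

/* Hypothetical family descriptor: existing families opt out explicitly,
 * new families get validation by default (flag left at 0). */
struct fake_family {
    int skip_reserved_check;
};

/* Reject messages with non-zero reserved bytes unless the family opted out. */
static int fake_check_hdr(const struct fake_family *f,
                          const struct fake_genlmsghdr *h)
{
    assert(f != NULL && h != NULL);

    if (f->skip_reserved_check)
        return 0;
    return h->reserved != 0 ? -EINVAL : 0;
}
```

Making the strict behavior the zero-initialized default is what keeps the reserved bytes available for later use, such as widening the 8-bit `cmd` field.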

      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
      Signed-off-by: David S. Miller <davem@davemloft.net>