1. 07 Dec 2021, 2 commits
  2. 02 Dec 2021, 2 commits
  3. 30 Nov 2021, 1 commit
  4. 29 Nov 2021, 2 commits
  5. 25 Nov 2021, 3 commits
  6. 24 Nov 2021, 1 commit
  7. 23 Nov 2021, 1 commit
    • net: nexthop: fix null pointer dereference when IPv6 is not enabled · 1c743127
      Authored by Nikolay Aleksandrov
      When we try to add an IPv6 nexthop and IPv6 is not enabled
      (!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
      of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
      has been present since the beginning of IPv6 nexthop gateway support.
      Commit 1aefd3de ("ipv6: Add fib6_nh_init and release to stubs") tells
      us that only fib6_nh_init has a dummy stub because fib6_nh_release should
      not be called if fib6_nh_init returns an error, but the commit below added
      a call to ipv6_stub->fib6_nh_release in its error path. To fix it, return
      the dummy stub's -EAFNOSUPPORT error directly, without calling
      ipv6_stub->fib6_nh_release, in nh_create_ipv6()'s error path (sketched below).
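      
      A minimal sketch of the corrected error handling, assuming the
      surrounding nh_create_ipv6() context (abbreviated; only the error
      path is shown):
      
       err = ipv6_stub->fib6_nh_init(net, fib6_nh, &fib6_cfg,
                                     GFP_KERNEL, extack);
       /* !CONFIG_IPV6: the init stub returns -EAFNOSUPPORT and there is
        * no fib6_nh_release stub, so bail out without calling it.
        */
       if (err == -EAFNOSUPPORT)
               goto out;
       if (err)
               ipv6_stub->fib6_nh_release(fib6_nh);
       out:
               return err;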
      
      [1]
       Output is a bit truncated, but it clearly shows the error.
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor instruction fetch in kernel mode
       #PF: error_code(0x0010) - not-present page
       PGD 0 P4D 0
       Oops: 0010 [#1] PREEMPT SMP NOPTI
       CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:0x0
       Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
       RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860
       RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000
       R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f
       R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840
       FS:  00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0
       Call Trace:
        <TASK>
        nh_create_ipv6+0xed/0x10c
        rtm_new_nexthop+0x6d7/0x13f3
        ? check_preemption_disabled+0x3d/0xf2
        ? lock_is_held_type+0xbe/0xfd
        rtnetlink_rcv_msg+0x23f/0x26a
        ? check_preemption_disabled+0x3d/0xf2
        ? rtnl_calcit.isra.0+0x147/0x147
        netlink_rcv_skb+0x61/0xb2
        netlink_unicast+0x100/0x187
        netlink_sendmsg+0x37f/0x3a0
        ? netlink_unicast+0x187/0x187
        sock_sendmsg_nosec+0x67/0x9b
        ____sys_sendmsg+0x19d/0x1f9
        ? copy_msghdr_from_user+0x4c/0x5e
        ? rcu_read_lock_any_held+0x2a/0x78
        ___sys_sendmsg+0x6c/0x8c
        ? asm_sysvec_apic_timer_interrupt+0x12/0x20
        ? lockdep_hardirqs_on+0xd9/0x102
        ? sockfd_lookup_light+0x69/0x99
        __sys_sendmsg+0x50/0x6e
        do_syscall_64+0xcb/0xf2
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f98dea28914
       Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
       RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914
       RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003
       RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
       R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001
       R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0
       </TASK>
       Modules linked in: bridge stp llc bonding virtio_net
      
      Cc: stable@vger.kernel.org
      Fixes: 53010f99 ("nexthop: Add support for IPv6 gateways")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 Nov 2021, 2 commits
    • net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group · 1005f19b
      Authored by Nikolay Aleksandrov
      When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
      the removed nexthop entries after an RCU grace period because they
      contain references to the nexthop's net device and to the fib6 info.
      With a specific series of events[1] we can reach a net device refcount
      imbalance which is unrecoverable. IPv4 is not affected because its dsts
      don't take a refcount on the route. A sketch of the release helper follows.
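      
      A minimal sketch of the per-cpu dst release, assuming a helper shaped
      like the upstream change (the helper name here is illustrative):
      
       static void release_fib6_nh_dsts(struct fib6_nh *fib6_nh)
       {
               int cpu;
      
               if (!fib6_nh->rt6i_pcpu)
                       return;
      
               for_each_possible_cpu(cpu) {
                       struct rt6_info *pcpu_rt, **ppcpu_rt;
      
                       ppcpu_rt = per_cpu_ptr(fib6_nh->rt6i_pcpu, cpu);
                       pcpu_rt = xchg(ppcpu_rt, NULL);
                       if (pcpu_rt) {
                               /* drops the dev and route references */
                               dst_dev_put(&pcpu_rt->dst);
                               dst_release(&pcpu_rt->dst);
                       }
               }
       }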
      
      [1]
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't really be deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create circular dependency between nh 200 and the IPv6 route, and
      also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency, because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. Since the group drops the reference it holds over its
      entries only at free time (i.e. when its own refcount drops to 0), that will
      never happen and we get a permanent ref on them; and since one of the entries
      holds a reference over the IPv6 route, the route will also never be released.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU, so the rt6_info (dst) will
      get released due to the IPv6 genid; that in turn will free the IPv6
      route, which in turn will drop the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203; it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info: we have an IPv6 route with a
      ref over group nh 203, group nh 203 with refs over nh 200 and 201, and nh
      200 with an rt6_info (dst) with refs over the net device and the IPv6
      route. All of these objects are only unlinked and cannot be released, thus
      they can't release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arp: Remove #ifdef CONFIG_PROC_FS · e968b1b3
      Authored by Yajun Deng
      proc_create_net() and remove_proc_entry() already handle the case where
      CONFIG_PROC_FS is not defined, so the #ifdef CONFIG_PROC_FS guards can be
      removed (see the sketch below).
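      
      For reference, roughly how the stubs look when CONFIG_PROC_FS is
      disabled (abbreviated from include/linux/proc_fs.h; the exact form
      may differ):
      
       #ifndef CONFIG_PROC_FS
       /* callers may use these unconditionally */
       #define proc_create_net(name, mode, parent, ops, state_size) ({ NULL; })
       #define remove_proc_entry(name, parent) do {} while (0)
       #endif
      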
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 20 Nov 2021, 1 commit
  10. 18 Nov 2021, 1 commit
  11. 17 Nov 2021, 1 commit
  12. 16 Nov 2021, 17 commits
    • net: drop nopreempt requirement on sock_prot_inuse_add() · b3cb764a
      Authored by Eric Dumazet
      This is really distracting; let's make this simpler, because many callers
      had to take care of it by themselves, even if on x86 this adds more code
      than really needed. A typical caller simplification is illustrated below.
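      
      An illustrative before/after for a typical caller (sketch only):
      
       /* before: callers had to disable preemption around the helper */
       preempt_disable();
       sock_prot_inuse_add(net, sk->sk_prot, 1);
       preempt_enable();
      
       /* after: the helper may be called with preemption enabled */
       sock_prot_inuse_add(net, sk->sk_prot, 1);
      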
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gro: move skb_gro_receive_list to udp_offload.c · 0b935d7f
      Authored by Eric Dumazet
      This helper is used only once; there is no need to keep it in the fat
      net/core/skbuff.c.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move gro definitions to include/net/gro.h · 4721031c
      Authored by Eric Dumazet
      include/linux/netdevice.h has become too big; move the GRO definitions
      into include/net/gro.h.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not call tcp_cleanup_rbuf() if we have a backlog · 29fbc26e
      Authored by Eric Dumazet
      Under pressure, tcp recvmsg() has logic to process the socket backlog,
      but it called tcp_cleanup_rbuf() right before doing so.
      
      Avoiding sending an ACK right before processing new segments makes
      a lot of sense, as this decreases the number of ACK packets,
      with no impact on effective ACK clocking (see the sketch below).
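      
      A simplified sketch of the resulting recvmsg() loop shape (details of
      the real code may differ):
      
       if (copied >= target) {
               /* backlog pending: process it first, no ACK yet */
               __sk_flush_backlog(sk);
       } else {
               /* about to sleep: send any needed ACK now */
               tcp_cleanup_rbuf(sk, copied);
               sk_wait_data(sk, &timeo, last);
       }
      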
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: check local var (timeo) before socket fields in one test · 8bd172b7
      Authored by Eric Dumazet
      Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.
      
      Modern applications use non-blocking IO, while a socket is terminated
      only once during its lifetime (sketched below).
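      
      A sketch of the reordered test (simplified):
      
       if (copied) {
               /* timeo is a hot local variable; test it before the
                * colder socket fields
                */
               if (!timeo ||
                   sk->sk_err ||
                   sk->sk_state == TCP_CLOSE ||
                   (sk->sk_shutdown & RCV_SHUTDOWN))
                       break;
       }
      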
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: defer skb freeing after socket lock is released · f35f8219
      Authored by Eric Dumazet
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from BH handler helps a lot,
      because this avoids the usual alloc/free asymmetry
      when BH handler and user thread do not run on the same cpu or
      NUMA node.
      
      One cpu can now be fully utilized for the kernel->user copy, while
      another cpu handles BH processing and skb/page allocs/frees (assuming
      RFS is not forcing use of a single CPU). A minimal sketch of the
      deferral follows.
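      
      A minimal sketch of the deferral, assuming an llist-based per-socket
      list (the defer_list/ll_node field names are illustrative):
      
       static void flush_deferred_skbs(struct sock *sk)
       {
               struct llist_node *head = llist_del_all(&sk->defer_list);
               struct sk_buff *skb, *n;
      
               /* free outside the socket lock, or from BH on this cpu */
               llist_for_each_entry_safe(skb, n, head, ll_node)
                       __kfree_skb(skb);
       }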
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid indirect calls to sock_rfree · 3df684c1
      Authored by Eric Dumazet
      TCP uses sk_eat_skb() when skbs can be removed from the receive queue.
      However, the call to skb_orphan() from __kfree_skb() incurs
      an indirect call to sock_rfree(), which is more expensive than
      a direct call, especially with CONFIG_RETPOLINE=y.
      
      Add a tcp_eat_recv_skb() function that makes the call direct before
      __kfree_skb().
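      
      A sketch of the helper along those lines:
      
       static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
       {
               __skb_unlink(skb, &sk->sk_receive_queue);
               if (likely(skb->destructor == sock_rfree)) {
                       sock_rfree(skb);        /* direct call, no retpoline */
                       skb->destructor = NULL;
                       skb->sk = NULL;
               }
               __kfree_skb(skb);
       }
      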
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tp->urg_data is unlikely to be set · b96c51bd
      Authored by Eric Dumazet
      Use some unlikely() hints in the fast path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: annotate races around tp->urg_data · 7b6a893a
      Authored by Eric Dumazet
      tcp_poll() and tcp_ioctl() read tp->urg_data without owning the socket
      lock.
      
      Also, it is faster to first check tp->urg_data in tcp_poll(),
      and only then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
      located in a different/cold cache line (see the sketch below).
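      
      A sketch of the annotation pattern: lockless readers use READ_ONCE(),
      paired with WRITE_ONCE() on the update side (the consumer shown is
      illustrative):
      
       if (READ_ONCE(tp->urg_data) &&
           READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq))
               mask |= EPOLLPRI;       /* e.g. in tcp_poll() */
      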
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: annotate data-races on tp->segs_in and tp->data_segs_in · 0307a0b7
      Authored by Eric Dumazet
      tcp_segs_in() can be called from BH while the socket spinlock is held
      but the socket is owned by the user, racing with tcp_get_info() reading
      these fields.
      
      Found by code inspection; no need to backport this patch
      to older kernels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: small optimization in tcp recvmsg() · 93afcfd1
      Authored by Eric Dumazet
      When reading large chunks of data, incoming packets might
      be added to the backlog from BH.
      
      tcp recvmsg() detects that the backlog queue is not empty, and uses
      a release_sock()/lock_sock() pair to process this backlog.
      
      We now have __sk_flush_backlog() to perform this
      a bit faster (illustrated below).
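      
      Illustrative before/after (sketch):
      
       /* before: a full unlock/lock cycle just to drain the backlog */
       release_sock(sk);
       lock_sock(sk);
      
       /* after: drain the backlog while keeping the lock owned */
       __sk_flush_backlog(sk);
      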
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cache align tcp_memory_allocated, tcp_sockets_allocated · 91b6d325
      Authored by Eric Dumazet
      tcp_memory_allocated and tcp_sockets_allocated often share
      a common cache line, a source of false sharing.
      
      Also take care of udp_memory_allocated and mptcp_sockets_allocated
      (see the sketch below).
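      
      A sketch of the change, using the usual kernel annotation:
      
       /* give each hot global its own cache line */
       atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
       struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
      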
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove sk_route_nocaps · aba54656
      Authored by Eric Dumazet
      Instead of using a full netdev_features_t, we can use a single bit,
      as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
      sk->sk_route_caps.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove sk_route_forced_caps · d0d598ca
      Authored by Eric Dumazet
      We were only using one bit, and we can replace it with sk_is_tcp().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: minor optimization in tcp_add_backlog() · d519f350
      Authored by Eric Dumazet
      If the packet is going to be coalesced, the sk_sndbuf/sk_rcvbuf values
      are not used; defer their access to the point we need them.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Validate checksum in udp_read_sock() · 099f896f
      Authored by Cong Wang
      It turns out the skbs in the sock receive queue could have bad checksums,
      which is why both ->poll() and ->recvmsg() validate checksums on read. We
      have to do the same in the ->read_sock() path too, before the skbs are
      redirected in sockmap (sketched below).
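      
      A sketch of the validation added to the ->read_sock() loop (shape
      only; the exact control flow in udp_read_sock() may differ):
      
       if (!skb_csum_unnecessary(skb) &&
           udp_lib_checksum_complete(skb)) {
               /* drop skbs with bad checksums, as ->recvmsg() would */
               kfree_skb(skb);
               continue;
       }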
      
      Fixes: d7f57118 ("udp: Implement ->read_sock() for sockmap")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com
    • bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs · 5e0bc308
      Authored by Dmitrii Banshchikov
      Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
      progs may result in locking issues.
      
      bpf_ktime_get_coarse_ns() uses the ktime_get_coarse_ns() time accessor,
      which isn't safe in every context:
      ======================================================
      WARNING: possible circular locking dependency detected
      5.15.0-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.4/14877 is trying to acquire lock:
      ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
      
      but task is already holding lock:
      ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
             _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
             __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
             debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
             debug_init kernel/time/hrtimer.c:468 [inline]
             hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
             ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
             ntp_init+0xa1/0xad kernel/time/ntp.c:1095
             timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
             start_kernel+0x267/0x56e init/main.c:1030
             secondary_startup_64_no_verify+0xb1/0xbb
      
      -> #0 (tk_core.seq.seqcount){----}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:3051 [inline]
             check_prevs_add kernel/locking/lockdep.c:3174 [inline]
             validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
             __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
             ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
             ktime_get_coarse include/linux/timekeeping.h:120 [inline]
             ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
             ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
             bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
             bpf_prog_a99735ebafdda2f1+0x10/0xb50
             bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
             __bpf_prog_run include/linux/filter.h:626 [inline]
             bpf_prog_run include/linux/filter.h:633 [inline]
             BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
             trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
             perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
             perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
             trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
             lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
             _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
             debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
             debug_deactivate kernel/time/hrtimer.c:481 [inline]
             __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
             __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
             hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
             local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
             __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
             sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
             asm_sysvec_apic_timer_interrupt+0x12/0x20
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
             _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
             try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
             wake_up_process kernel/sched/core.c:4200 [inline]
             wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
             futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
             do_futex+0x367/0x560 kernel/futex/syscalls.c:127
             __do_sys_futex kernel/futex/syscalls.c:199 [inline]
             __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
             do_syscall_x64 arch/x86/entry/common.c:50 [inline]
             do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      There is a possible deadlock with the bpf_timer_* set of helpers:
      hrtimer_start()
        lock_base();
        trace_hrtimer...()
          perf_event()
            bpf_run()
              bpf_timer_start()
                hrtimer_start()
                  lock_base()         <- DEADLOCK
      
      Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
      BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
      and BPF_PROG_TYPE_RAW_TRACEPOINT prog types (sketched below).
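      
      A sketch of the enforcement: the tracing func_proto hook simply stops
      returning these helpers' protos (function name and placement are
      illustrative):
      
       static const struct bpf_func_proto *
       tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
       {
               switch (func_id) {
               case BPF_FUNC_ktime_get_coarse_ns:
               case BPF_FUNC_timer_init:
               case BPF_FUNC_timer_set_callback:
               case BPF_FUNC_timer_start:
               case BPF_FUNC_timer_cancel:
                       return NULL;    /* unsafe in tracing context */
               default:
                       return bpf_base_func_proto(func_id);
               }
       }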
      
      Fixes: d0551261 ("bpf: Add bpf_ktime_get_coarse_ns helper")
      Fixes: b00628b1 ("bpf: Introduce bpf timers.")
      Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com
      Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
  13. 15 Nov 2021, 1 commit
  14. 14 Nov 2021, 2 commits
  15. 13 Nov 2021, 1 commit
    • tcp: Fix uninitialized access in skb frags array for Rx 0cp. · 70701b83
      Authored by Arjun Roy
      TCP Receive zerocopy iterates through the SKB queue via
      tcp_recv_skb(), acquiring a pointer to an SKB and an offset within
      that SKB to read from. From there, it iterates the SKB frags array to
      determine which offset to start remapping pages from.
      
      However, this is built on the assumption that the offset read so far
      within the SKB is smaller than the SKB length. If this assumption is
      violated, we can attempt to read an invalid frags array element, which
      would cause a fault.
      
      tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN
      flag is set. Therefore, we must guard against this occurrence inside
      skb_advance_to_frag() (sketched below).
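      
      A sketch of the guard, with the frag walk following the description
      above (simplified from the zerocopy code):
      
       static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb,
                                              u32 offset_skb,
                                              u32 *offset_frag)
       {
               skb_frag_t *frag;
      
               /* an skb returned with skb->len == offset (e.g. FIN) has
                * no valid frag to start from
                */
               if (unlikely(offset_skb >= skb->len))
                       return NULL;
      
               frag = skb_shinfo(skb)->frags;
               while (offset_skb) {
                       if (skb_frag_size(frag) > offset_skb) {
                               *offset_frag = offset_skb;
                               return frag;
                       }
                       offset_skb -= skb_frag_size(frag);
                       ++frag;
               }
               *offset_frag = 0;
               return frag;
       }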
      
      One way that we can reproduce this error follows:
      1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with:
      char some_array[32 * 1024];
      struct tcp_zerocopy_receive zc = {
        .copybuf_address  = (__u64) &some_array[0],
        .copybuf_len = 32 * 1024,
      };
      
      2) In a sender program, after a TCP handshake, send the following
      sequence of packets:
        i) Seq = [X, X+4000]
        ii) Seq = [X+4000, X+5000]
        iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000
      
      (This can happen without URG, if we have a signal pending, but URG is
      a convenient way to reproduce the behaviour).
      
      In this case, the following event sequence will occur on the receiver:
      
      tcp_zerocopy_receive():
      -> receive_fallback_to_copy() // copybuf_len >= inq
      -> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG
      -> tcp_recv_skb() // yields skb with skb->len == offset
      -> tcp_zerocopy_set_hint_for_skb()
      -> skb_advance_to_frag() // returns a frags ptr >= nr_frags
      -> find_next_mappable_frag() // dereferences this bad frags ptr.
      
      With this patch, skb_advance_to_frag() will no longer return an
      invalid frags pointer, and will return NULL instead, fixing the issue.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  16. 11 Nov 2021, 1 commit
  17. 09 Nov 2021, 1 commit
    • bpf, sockmap: Fix race in ingress receive verdict with redirect to self · c5d2177a
      Authored by John Fastabend
      A socket in a sockmap may have different combinations of programs attached
      depending on configuration. There can be no programs, in which case the
      socket acts as a sink only. There can be a TX program, in which case a BPF
      program is attached to the sending side but no RX program is attached.
      There can be an RX program only, where sends have no BPF program attached
      but receives are hooked with BPF. And finally, both TX and RX programs may
      be attached, giving us the permutations:
      
       None, Tx, Rx, and TxRx
      
      To date most of our use cases have been the TX case, used as a fast
      datapath to copy directly between a local application and a userspace
      proxy, or RX and TXRX applications operating an in-kernel proxy. The
      traffic in the first case, where we hook applications into a userspace
      proxy, looks like this:
      
        AppA  redirect   AppB
         Tx <-----------> Rx
         |                |
         +                +
         TCP <--> lo <--> TCP
      
      In this case all traffic from AppA (after 3whs) is copied into the AppB
      ingress queue and no traffic is ever on the TCP receive_queue.
      
      In the second case the application never receives, except in some rare error
      cases, traffic on the actual user space socket. Instead the send happens in
      the kernel.
      
                 AppProxy       socket pool
             sk0 ------------->{sk1,sk2, skn}
              ^                      |
              |                      |
              |                      v
             ingress              lb egress
             TCP                  TCP
      
      Here, because traffic is never read off the socket with userspace recv()
      APIs, there is only ever one reader on the sk receive_queue: the BPF
      programs.
      
      However, we've started to introduce a third configuration where the BPF program
      on receive should process the data, but then the normal case is to push the
      data into the receive queue of AppB.
      
             AppB
             recv()                (userspace)
           -----------------------
             tcp_bpf_recvmsg()     (kernel)
               |             |
               |             |
               |             |
             ingress_msgQ    |
               |             |
             RX_BPF          |
               |             |
               v             v
             sk->receive_queue
      
      This is different from the App{A,B} redirect because traffic is first received
      on the sk->receive_queue.
      
      Now for the issue. The tcp_bpf_recvmsg() handler first checks the
      ingress_msg queue for any data handled by the BPF rx program and returned
      with a PASS code, so that it was enqueued on the ingress msg queue. Then,
      if no data exists on that queue, it checks the socket receive queue.
      Unfortunately, this is the same receive_queue the BPF program is reading
      data off of, so we get a race. It's possible for the recvmsg() hook to
      pull data off the receive_queue before the BPF hook has a chance to read
      it. It typically happens when an application is banging on recv() and
      getting EAGAINs, until it manages to race with the RX BPF program.
      
      To fix this, note that before this patch, at attach time when the socket
      is loaded into the map, we check whether it needs a TX program or just the
      base set of proto bpf hooks, and then use the above general RX hook
      regardless of whether a BPF program is attached at RX or not. This patch
      extends that check to handle all the cases enumerated above: TX, RX, TXRX,
      and none. And to fix the above race when an RX program is attached, we use
      a new hook that is nearly identical to the old one, except that we no
      longer let the recv() call skip the RX BPF program. Now only the BPF
      program pulls data from sk->receive_queue, and recv() only pulls data from
      the ingress msgQ after BPF program handling (sketched below).
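      
      A sketch of the attach-time selection (enum and helper names are
      illustrative; the program slots follow struct sk_psock_progs):
      
       enum { TCP_BPF_BASE, TCP_BPF_TX, TCP_BPF_RX, TCP_BPF_TXRX };
      
       static int tcp_bpf_config(struct sk_psock *psock)
       {
               int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;
      
               /* an RX verdict program upgrades the proto so that recv()
                * can no longer bypass the BPF program
                */
               if (psock->progs.stream_verdict || psock->progs.skb_verdict)
                       config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
      
               return config;
       }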
      
      With this resolved, our AppB from above has been up and running for many
      hours without detecting any errors. We verify this by correlating counters
      in the RX BPF events and in AppB to ensure data never skips the BPF
      program. Selftests were not able to detect this because we only run them
      for a short period of time on well-ordered send/recvs, so we don't get any
      of the noise we see in real application environments.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com