1. 17 Dec, 2021 11 commits
    • J
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 0c3e2474
      Jakub Kicinski committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-12-16
      
      We've added 15 non-merge commits during the last 7 day(s) which contain
      a total of 12 files changed, 434 insertions(+), 30 deletions(-).
      
      The main changes are:
      
      1) Fix incorrect verifier state pruning behavior for <8B register spill/fill,
         from Paul Chaignon.
      
      2) Fix x86-64 JIT's extable handling for fentry/fexit when return pointer
         is an ERR_PTR(), from Alexei Starovoitov.
      
      3) Fix 3 different possibilities that BPF verifier missed where unprivileged
         could leak kernel addresses, from Daniel Borkmann.
      
      4) Fix xsk's poll behavior under need_wakeup flag, from Magnus Karlsson.
      
      5) Fix an oob-write in test_verifier due to a missed MAX_NR_MAPS bump,
         from Kumar Kartikeya Dwivedi.
      
      6) Fix a race in test_btf_skc_cls_ingress selftest, from Martin KaFai Lau.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
        selftest/bpf: Add a test that reads various addresses.
        bpf: Fix extable address check.
        bpf: Fix extable fixup offset.
        bpf, selftests: Add test case trying to taint map value pointer
        bpf: Make 32->64 bounds propagation slightly more robust
        bpf: Fix signed bounds propagation after mov32
        bpf, selftests: Update test case for atomic cmpxchg on r0 with pointer
        bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg
        bpf, selftests: Add test case for atomic fetch on spilled pointer
        bpf: Fix kernel address leakage in atomic fetch
        selftests/bpf: Fix OOB write in test_verifier
        xsk: Do not sleep in poll() when need_wakeup set
        selftests/bpf: Tests for state pruning with u32 spill/fill
        bpf: Fix incorrect state pruning for <8B spill/fill
      ====================
      
      Link: https://lore.kernel.org/r/20211216210005.13815-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0c3e2474
    • M
      bpf, selftests: Fix racing issue in btf_skc_cls_ingress test · c2fcbf81
      Martin KaFai Lau committed
      The libbpf CI reported occasional failure in btf_skc_cls_ingress:
      
        test_syncookie:FAIL:Unexpected syncookie states gen_cookie:80326634 recv_cookie:0
        bpf prog error at line 97
      
      "error at line 97" means the bpf prog cannot find the listening socket
      when the final ack is received.  It then skipped processing
      the syncookie in the final ack which then led to "recv_cookie:0".
      
      The problem is the userspace program did not do accept() and went
      ahead to close(listen_fd) before the kernel (and the bpf prog) had
      a chance to process the final ack.
      
      The fix is to add accept() call so that the userspace will wait for
      the kernel to finish processing the final ack first before close()-ing
      everything.
      
      Fixes: 9a856cae ("bpf: selftest: Add test_btf_skc_cls_ingress")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211216191630.466151-1-kafai@fb.com
      c2fcbf81
    • A
      selftest/bpf: Add a test that reads various addresses. · 7edc3fcb
      Alexei Starovoitov committed
      Add a function to bpf_testmod that returns invalid kernel and user addresses.
      Then attach an fexit program to that function that tries to read
      memory through these addresses.
      
      This checks that the bpf_probe_read_kernel and BPF_PROBE_MEM logic is sane.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      7edc3fcb
    • A
      bpf: Fix extable address check. · 588a25e9
      Alexei Starovoitov committed
      The verifier checks that a PTR_TO_BTF_ID pointer is either valid or NULL,
      but it cannot distinguish an IS_ERR pointer from a valid one.
      
      When an offset is added to an IS_ERR pointer, the result may become a small
      positive value, which is a user address that is not handled by the extable
      logic and has to be checked for at runtime.
      
      Tighten BPF_PROBE_MEM pointer check code to prevent this case.
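
      The wrap-around described above can be reproduced in user space. The
      following is a minimal sketch, not the verifier or JIT code: it models an
      ERR_PTR-style value as a 64-bit integer and shows that adding a field
      offset can land on a small positive address. MODEL_MAX_ERRNO mirrors the
      kernel's MAX_ERRNO of 4095; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kernel's MAX_ERRNO; every name in this file is a hypothetical model. */
#define MODEL_MAX_ERRNO 4095u

/* Encode a negative errno as an address at the very top of the address space,
 * which is how ERR_PTR()-style values are represented. */
static uint64_t model_err_ptr(int err)
{
    assert(err < 0 && (unsigned int)(-err) <= MODEL_MAX_ERRNO);
    return (uint64_t)(int64_t)err;
}

/* True when the value lies in the last MODEL_MAX_ERRNO bytes of the address space. */
static bool model_is_err(uint64_t addr)
{
    return addr >= (uint64_t)-(int64_t)MODEL_MAX_ERRNO;
}

/* Address that a load of "base + off" would touch; unsigned math wraps modulo 2^64. */
static uint64_t model_load_addr(uint64_t base, uint32_t off)
{
    return base + off;
}
```

      For example, model_load_addr(model_err_ptr(-2), 16) wraps around to 14,
      which no longer looks like an error value and falls in the user address
      range.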
      
      Fixes: 4c5de127 ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.")
      Reported-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      588a25e9
    • A
      bpf: Fix extable fixup offset. · 433956e9
      Alexei Starovoitov committed
      prog - start_of_ldx is the offset from before the faulting ldx to the
      location after it. It is used to adjust pt_regs->ip so that execution jumps
      over the faulting instruction and continues. With the old temp value, the
      fixup would have pointed to the wrong offset, causing a crash.
      
      Fixes: 4c5de127 ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      433956e9
    • D
      bpf, selftests: Add test case trying to taint map value pointer · b1a7288d
      Daniel Borkmann committed
      Add a test case which tries to taint map value pointer arithmetic into an
      unknown scalar with subsequent export through the map.
      
      Before fix:
      
        # ./test_verifier 1186
        #1186/u map access: trying to leak tained dst reg FAIL
        Unexpected success to load!
        verification time 24 usec
        stack depth 8
        processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
        #1186/p map access: trying to leak tained dst reg FAIL
        Unexpected success to load!
        verification time 8 usec
        stack depth 8
        processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
        Summary: 0 PASSED, 0 SKIPPED, 2 FAILED
      
      After fix:
      
        # ./test_verifier 1186
        #1186/u map access: trying to leak tained dst reg OK
        #1186/p map access: trying to leak tained dst reg OK
        Summary: 2 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      b1a7288d
    • D
      bpf: Make 32->64 bounds propagation slightly more robust · e572ff80
      Daniel Borkmann committed
      Make the bounds propagation in __reg_assign_32_into_64() slightly more
      robust and readable by aligning it similarly as we did back in the
      __reg_combine_64_into_32() counterpart. Meaning, only propagate or
      pessimize them as a smin/smax pair.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      e572ff80
    • D
      bpf: Fix signed bounds propagation after mov32 · 3cf2b61e
      Daniel Borkmann committed
      For the case where both s32_{min,max}_value bounds are positive, the
      __reg_assign_32_into_64() directly propagates them to their 64 bit
      counterparts, otherwise it pessimises them into [0,u32_max] universe and
      tries to refine them later on by learning through the tnum as per comment
      in mentioned function. However, that does not always happen, for example,
      in mov32 operation we call zext_32_to_64(dst_reg) which invokes the
      __reg_assign_32_into_64() as is without subsequent bounds update as
      elsewhere thus no refinement based on tnum takes place.
      
      Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() /
      __reg_bound_offset() triplet as we do, for example, in case of ALU ops via
      adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when
      dumping the full register state:
      
      Before fix:
      
        0: (b4) w0 = -1
        1: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
        1: (bc) w0 = w0
        2: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=0,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
      Technically, the smin_value=0 and smax_value=4294967295 bounds are not
      incorrect, but given the register is still a constant, they break assumptions
      about const scalars that smin_value == smax_value and umin_value == umax_value.
      
      After fix:
      
        0: (b4) w0 = -1
        1: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
        1: (bc) w0 = w0
        2: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
      Without the smin_value == smax_value and umin_value == umax_value invariant
      being intact for const scalars, it is possible to leak out kernel pointers
      from unprivileged user space if the latter is enabled. For example, when such
      registers are involved in pointer arithmetic, then adjust_ptr_min_max_vals()
      will taint the destination register into an unknown scalar, and the latter
      can be exported and stored e.g. into a BPF map value.
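
      The register dumps above can be reproduced with a small user-space model.
      This is a sketch under simplifying assumptions, not the verifier code: the
      struct and function names are hypothetical, only the signed bounds and the
      tnum are modeled, and the refinement step stands in for the bounds-update
      helpers named in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the verifier's register state. */
struct model_reg {
    int64_t  smin, smax;       /* 64-bit signed bounds */
    int32_t  s32_min, s32_max; /* 32-bit signed bounds */
    uint64_t var_value;        /* known bits (tnum value) */
    uint64_t var_mask;         /* unknown bits (tnum mask) */
};

/* Propagate the 32-bit signed bounds only as a pair when both are
 * non-negative; otherwise pessimize to [0, U32_MAX]. */
static void model_assign_32_into_64(struct model_reg *r)
{
    assert(r);
    if (r->s32_min >= 0 && r->s32_max >= 0) {
        r->smin = r->s32_min;
        r->smax = r->s32_max;
    } else {
        r->smin = 0;
        r->smax = UINT32_MAX;
    }
}

/* Tighten the signed bounds from the known bits, as a bounds update would. */
static void model_refine_from_tnum(struct model_reg *r)
{
    int64_t lo = (int64_t)(r->var_value | (r->var_mask & (uint64_t)INT64_MIN));
    int64_t hi = (int64_t)(r->var_value | (r->var_mask & (uint64_t)INT64_MAX));

    if (lo > r->smin)
        r->smin = lo;
    if (hi < r->smax)
        r->smax = hi;
}
```

      For w0 = -1 (s32 bounds of -1, tnum value 0xffffffff with mask 0), the
      first helper alone leaves smin=0 and smax=4294967295, matching the
      "Before fix" dump. Applying the refinement afterwards restores
      smin == smax == 4294967295, matching the "After fix" dump.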
      
      Fixes: 3f50f132 ("bpf: Verifier, do explicit ALU32 bounds tracking")
      Reported-by: Kuee K1r0a <liulin063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      3cf2b61e
    • E
      sit: do not call ipip6_dev_free() from sit_init_net() · e28587cc
      Eric Dumazet committed
      ipip6_dev_free is sit dev->priv_destructor, already called
      by register_netdevice() if something goes wrong.
      
      Alternative would be to make ipip6_dev_free() robust against
      multiple invocations, but other drivers do not implement this
      strategy.
      
      syzbot reported:
      
      dst_release underflow
      WARNING: CPU: 0 PID: 5059 at net/core/dst.c:173 dst_release+0xd8/0xe0 net/core/dst.c:173
      Modules linked in:
      CPU: 1 PID: 5059 Comm: syz-executor.4 Not tainted 5.16.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:dst_release+0xd8/0xe0 net/core/dst.c:173
      Code: 4c 89 f2 89 d9 31 c0 5b 41 5e 5d e9 da d5 44 f9 e8 1d 90 5f f9 c6 05 87 48 c6 05 01 48 c7 c7 80 44 99 8b 31 c0 e8 e8 67 29 f9 <0f> 0b eb 85 0f 1f 40 00 53 48 89 fb e8 f7 8f 5f f9 48 83 c3 a8 48
      RSP: 0018:ffffc9000aa5faa0 EFLAGS: 00010246
      RAX: d6894a925dd15a00 RBX: 00000000ffffffff RCX: 0000000000040000
      RDX: ffffc90005e19000 RSI: 000000000003ffff RDI: 0000000000040000
      RBP: 0000000000000000 R08: ffffffff816a1f42 R09: ffffed1017344f2c
      R10: ffffed1017344f2c R11: 0000000000000000 R12: 0000607f462b1358
      R13: 1ffffffff1bfd305 R14: ffffe8ffffcb1358 R15: dffffc0000000000
      FS:  00007f66c71a2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f88aaed5058 CR3: 0000000023e0f000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       dst_cache_destroy+0x107/0x1e0 net/core/dst_cache.c:160
       ipip6_dev_free net/ipv6/sit.c:1414 [inline]
       sit_init_net+0x229/0x550 net/ipv6/sit.c:1936
       ops_init+0x313/0x430 net/core/net_namespace.c:140
       setup_net+0x35b/0x9d0 net/core/net_namespace.c:326
       copy_net_ns+0x359/0x5c0 net/core/net_namespace.c:470
       create_new_namespaces+0x4ce/0xa00 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0x11e/0x180 kernel/nsproxy.c:226
       ksys_unshare+0x57d/0xb50 kernel/fork.c:3075
       __do_sys_unshare kernel/fork.c:3146 [inline]
       __se_sys_unshare kernel/fork.c:3144 [inline]
       __x64_sys_unshare+0x34/0x40 kernel/fork.c:3144
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f66c882ce99
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f66c71a2168 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
      RAX: ffffffffffffffda RBX: 00007f66c893ff60 RCX: 00007f66c882ce99
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000048040200
      RBP: 00007f66c8886ff1 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff6634832f R14: 00007f66c71a2300 R15: 0000000000022000
       </TASK>
      
      Fixes: cf124db5 ("net: Fix inconsistent teardown and release of private netdev state.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20211216111741.1387540-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e28587cc
    • F
      net: systemport: Add global locking for descriptor lifecycle · 8b8e6e78
      Florian Fainelli committed
      The descriptor list is a shared resource across all of the transmit queues, and
      the locking mechanism used today only protects concurrency across a given
      transmit queue between the transmit and reclaiming. This creates an opportunity
      for the SYSTEMPORT hardware to work on corrupted descriptors if we have
      multiple producers at once which is the case when using multiple transmit
      queues.
      
      This was particularly noticeable when using multiple flows/transmit queues and
      it showed up in interesting ways in that UDP packets would get a correct UDP
      header checksum being calculated over an incorrect packet length. Similarly TCP
      packets would get an equally correct checksum computed by the hardware over an
      incorrect packet length.
      
      The SYSTEMPORT hardware maintains an internal descriptor list that it re-arranges
      when the driver produces a new descriptor anytime it writes to the
      WRITE_PORT_{HI,LO} registers, there is however some delay in the hardware to
      re-organize its descriptors and it is possible that concurrent TX queues
      eventually break this internal allocation scheme to the point where the
      length/status part of the descriptor gets used for an incorrect data buffer.
      
      The fix is to impose a global serialization for all TX queues in the short
      section where we are writing to the WRITE_PORT_{HI,LO} registers which solves
      the corruption even with multiple concurrent TX queues being used.
      
      Fixes: 80105bef ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20211215202450.4086240-1-f.fainelli@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8b8e6e78
    • D
      net/smc: Prevent smc_release() from long blocking · 5c15b312
      D. Wythe committed
      In an nginx/wrk benchmark, the client hangs with high probability in a
      case like the following (the client takes several minutes to exit):
      
      server: smc_run nginx
      
      client: smc_run wrk -c 10000 -t 1 http://server
      
      Client hangs with the following backtrace:
      
      0 [ffffa7ce80f3bbf8] __schedule at ffffffff9f9e0d5f
      1 [ffffa7ce80f3bc88] schedule at ffffffff9f9e10e6
      2 [ffffa7ce80f3bca0] schedule_timeout at ffffffff9f9e3f3c
      3 [ffffa7ce80f3bd20] wait_for_common at ffffffff9f9e19de
      4 [ffffa7ce80f3bd80] __flush_work at ffffffff9f0fe013
      5 [ffffa7ce80f3bdf0] smc_release at ffffffffc0697d24 [smc]
      6 [ffffa7ce80f3be20] __sock_release at ffffffff9f802e2d
      7 [ffffa7ce80f3be40] sock_close at ffffffff9f802eb1
      8 [ffffa7ce80f3be48] __fput at ffffffff9f334f93
      9 [ffffa7ce80f3be78] task_work_run at ffffffff9f101ff5
      10 [ffffa7ce80f3bea0] do_exit at ffffffff9f0e5012
      11 [ffffa7ce80f3bf10] do_group_exit at ffffffff9f0e592a
      12 [ffffa7ce80f3bf38] __x64_sys_exit_group at ffffffff9f0e5994
      13 [ffffa7ce80f3bf40] do_syscall_64 at ffffffff9f9d4373
      14 [ffffa7ce80f3bf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c
      
      This issue is due to flush_work(), which smc_release() uses to wait for
      smc_connect_work() to finish. Once many smc_connect_work() items are
      pending, or all executing work items are stuck, smc_release() has to block
      until a worker becomes free, which is equivalent to waiting for another
      smc_connect_work() to finish.
      
      To fix this, there are two changes:
      
      1. For an idle smc_connect_work(), cancel it from the workqueue; for an
         executing smc_connect_work(), wait for it to finish. For that
         purpose, replace flush_work() with cancel_work_sync().
      
      2. Since smc_connect() holds a reference for passive closing, if
         smc_connect_work() has been cancelled, release the reference.
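
      The two changes can be pictured with a small user-space model. This is a
      sketch only, not the SMC code: the types and names are hypothetical, the
      "wait" for a running item is reduced to a state change, and the only
      kernel behavior assumed is that cancel_work_sync() returns true when the
      work item was still pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states of the connect work item in this model. */
enum model_work_state { MODEL_WORK_IDLE, MODEL_WORK_PENDING, MODEL_WORK_RUNNING };

/* Hypothetical socket: a reference count plus the state of its connect work. */
struct model_sock {
    int refcnt;
    enum model_work_state connect_work;
};

/* Model of cancel_work_sync(): a queued item is removed and true is returned;
 * a running item is waited for and false is returned. */
static bool model_cancel_work_sync(struct model_sock *sk)
{
    bool was_pending = (sk->connect_work == MODEL_WORK_PENDING);

    sk->connect_work = MODEL_WORK_IDLE;
    return was_pending;
}

/* Release path: if the work was cancelled before it ran, drop the reference
 * that the connect path took on its behalf. */
static void model_release(struct model_sock *sk)
{
    assert(sk && sk->refcnt > 0);
    if (model_cancel_work_sync(sk))
        sk->refcnt--;
}
```

      In this model a pending item costs the release path nothing but a
      dequeue, while a running item is left to finish without touching the
      reference count here.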
      
      Fixes: 24ac3a08 ("net/smc: rebuild nonblocking connect")
      Reported-by: Tony Lu <tonylu@linux.alibaba.com>
      Tested-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5c15b312
  2. 16 Dec, 2021 15 commits
  3. 15 Dec, 2021 13 commits
    • D
      Merge tag 'wireless-drivers-2021-12-15' of... · 1d1c950f
      David S. Miller committed
      Merge tag 'wireless-drivers-2021-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      wireless-drivers fixes for v5.16
      
      Second set of fixes for v5.16, hopefully also the last one. I changed
      my email in MAINTAINERS, one crash fix in iwlwifi and some build
      problems fixed.
      
      iwlwifi
      
      * fix crash caused by a warning
      
      * fix LED linking problem
      
      brcmsmac
      
      * rework LED dependencies for being consistent with other drivers
      
      mt76
      
      * mt7921: fix build regression
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d1c950f
    • D
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 7c8089f9
      David S. Miller committed
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2021-12-14
      
      This series contains updates to ice driver only.
      
      Karol corrects division that was causing incorrect calculations and
      adds a check to ensure stale timestamps are not being used.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c8089f9
    • D
      bpf, selftests: Update test case for atomic cmpxchg on r0 with pointer · e523102c
      Daniel Borkmann committed
      Fix up unprivileged test case results for 'Dest pointer in r0' verifier tests
      given they now need to reject R0 containing a pointer value, and add a couple
      of new related ones with 32bit cmpxchg as well.
      
        root@foo:~/bpf/tools/testing/selftests/bpf# ./test_verifier
        #0/u invalid and of negative number OK
        #0/p invalid and of negative number OK
        [...]
        #1268/p XDP pkt read, pkt_meta' <= pkt_data, bad access 1 OK
        #1269/p XDP pkt read, pkt_meta' <= pkt_data, bad access 2 OK
        #1270/p XDP pkt read, pkt_data <= pkt_meta', good access OK
        #1271/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
        #1272/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
        Summary: 1900 PASSED, 0 SKIPPED, 0 FAILED
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e523102c
    • D
      bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg · a82fe085
      Daniel Borkmann committed
      The implementation of BPF_CMPXCHG on a high level has the following parameters:
      
        .-[old-val]                                          .-[new-val]
        BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG)
                                `-[mem-loc]          `-[old-val]
      
      Given a BPF insn can only have two registers (dst, src), the R0 is fixed and
      used as an auxiliary register for input (old value) as well as output (returning
      old value from memory location). While the verifier performs a number of safety
      checks, it fails to reject unprivileged programs where R0 contains a pointer as
      old value.
      
      Through brute-forcing it takes about 16 seconds on my machine to leak a kernel pointer
      with BPF_CMPXCHG. The PoC is basically probing for kernel addresses by storing the
      guessed address into the map slot as a scalar, and using the map value pointer as
      R0 while SRC_REG has a canary value to detect a matching address.
      
      Fix it by checking R0 for pointers, and reject if that's the case for unprivileged
      programs.
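
      The probing idea can be modeled in plain C. The sketch below is a
      non-atomic, user-space illustration of the compare-and-exchange semantics
      described above, not the PoC and not BPF code: all names are
      hypothetical, and the pointer is represented as a 64-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the cmpxchg semantics: compare *mem with r0 and store src on a
 * match. The previous content of *mem is always returned. */
static uint64_t model_cmpxchg(uint64_t *mem, uint64_t r0, uint64_t src)
{
    uint64_t old = *mem;

    if (old == r0)
        *mem = src;
    return old;
}

/* One probe step: the guess is stored in the slot as a scalar, the secret
 * pointer plays the role of R0, and a canary plays the role of SRC_REG.
 * Finding the canary in the slot afterwards means the guess was right. */
static bool model_probe(uint64_t secret_ptr, uint64_t guess)
{
    const uint64_t canary = 0x1122334455667788ULL;
    uint64_t slot = guess;

    assert(guess != canary); /* the canary must be distinguishable from guesses */
    model_cmpxchg(&slot, secret_ptr, canary);
    return slot == canary;
}
```

      Each call to model_probe() reveals one bit of information (match or no
      match) without the program ever reading the pointer directly, which is
      why rejecting a pointer in R0 closes the leak.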
      
      Fixes: 5ffa2550 ("bpf: Add instructions for atomic_[cmp]xchg")
      Reported-by: Ryota Shiga (Flatt Security)
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a82fe085
    • D
      bpf, selftests: Add test case for atomic fetch on spilled pointer · 180486b4
      Daniel Borkmann committed
      Test whether unprivileged would be able to leak the spilled pointer either
      by exporting the returned value from the atomic{32,64} operation or by reading
      and exporting the value from the stack after the atomic operation took place.
      
      Note that for unprivileged, the below atomic cmpxchg test case named "Dest
      pointer in r0 - succeed" is failing. The reason is that in the dst memory
      location (r10 -8) there is the spilled register r10:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (bf) r0 = r10
        1: R0_w=fp0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        1: (7b) *(u64 *)(r10 -8) = r0
        2: R0_w=fp0 R1=ctx(id=0,off=0,imm=0) R10=fp0 fp-8_w=fp
        2: (b7) r1 = 0
        3: R0_w=fp0 R1_w=invP0 R10=fp0 fp-8_w=fp
        3: (db) r0 = atomic64_cmpxchg((u64 *)(r10 -8), r0, r1)
        4: R0_w=fp0 R1_w=invP0 R10=fp0 fp-8_w=mmmmmmmm
        4: (79) r1 = *(u64 *)(r0 -8)
        5: R0_w=fp0 R1_w=invP(id=0) R10=fp0 fp-8_w=mmmmmmmm
        5: (b7) r0 = 0
        6: R0_w=invP0 R1_w=invP(id=0) R10=fp0 fp-8_w=mmmmmmmm
        6: (95) exit
      
      However, allowing this case for unprivileged is a bit useless given an
      update with a new pointer will fail anyway:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (bf) r0 = r10
        1: R0_w=fp0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        1: (7b) *(u64 *)(r10 -8) = r0
        2: R0_w=fp0 R1=ctx(id=0,off=0,imm=0) R10=fp0 fp-8_w=fp
        2: (db) r0 = atomic64_cmpxchg((u64 *)(r10 -8), r0, r10)
        R10 leaks addr into mem
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      180486b4
    • D
      bpf: Fix kernel address leakage in atomic fetch · 7d3baf0a
      Daniel Borkmann committed
      The change in commit 37086bfd ("bpf: Propagate stack bounds to registers
      in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since
      this would allow for unprivileged users to leak kernel pointers. For example,
      an atomic fetch/and with -1 on a stack destination which holds a spilled
      pointer will migrate the spilled register type into a scalar, which can then
      be exported out of the program (since scalar != pointer) by dumping it into
      a map value.
      
      The original implementation of XADD was preventing this situation by using
      a double call to check_mem_access() one with BPF_READ and a subsequent one
      with BPF_WRITE, in both cases passing -1 as a placeholder value instead of
      register as per XADD semantics since it didn't contain a value fetch. The
      BPF_READ also included a check in check_stack_read_fixed_off() which rejects
      the program if the stack slot is of __is_pointer_value() if dst_regno < 0.
      The latter is to distinguish whether we're dealing with a regular stack spill/
      fill or some arithmetical operation which is disallowed on non-scalars, see
      also 6e7e63cb ("bpf: Forbid XADD on spilled pointers for unprivileged
      users") for more context on check_mem_access() and its handling of placeholder
      value -1.
      
      One minimally intrusive option to fix the leak is for the BPF_FETCH case to
      initially check the BPF_READ case via check_mem_access() with -1 as register,
      followed by the actual load case with non-negative load_reg to propagate
      stack bounds to registers.
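
      Why an atomic fetch/and with -1 amounts to a plain read can be seen with
      C11 atomics in user space. This is an illustration only, not BPF or
      verifier code, and the function name is hypothetical: AND with all ones
      leaves the stored value unchanged while the operation still returns the
      previous contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Fetch-and with an all-ones mask: the slot keeps its value, and the
 * previous value (here standing in for a spilled pointer) is returned. */
static uint64_t model_fetch_and_all_ones(_Atomic uint64_t *slot)
{
    assert(slot);
    return atomic_fetch_and(slot, UINT64_MAX);
}
```

      If the slot held a spilled pointer, the returned value carries the same
      bits, which is the reason the read side has to be checked as if it were
      an arithmetic operation on the slot.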
      
      Fixes: 37086bfd ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
      Reported-by: <n4ke4mry@gmail.com>
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7d3baf0a
    • J
      Merge branch 'mptcp-fixes-for-ulp-a-deadlock-and-netlink-docs' · 500f3720
      Jakub Kicinski committed
      Mat Martineau says:
      
      ====================
      mptcp: Fixes for ULP, a deadlock, and netlink docs
      
      Two of the MPTCP fixes in this set are related to the TCP_ULP socket
      option with MPTCP sockets operating in "fallback" mode (the connection
      has reverted to regular TCP). The other issues are an observed deadlock
      and missing parameter documentation in the MPTCP netlink API.
      
      Patch 1 marks TCP_ULP as unsupported earlier in MPTCP setsockopt code,
      so the fallback code path in the MPTCP layer does not pass the TCP_ULP
      option down to the subflow TCP socket.
      
      Patch 2 makes sure a TCP fallback socket returned to userspace by
      accept()ing on a MPTCP listening socket does not allow use of the
      "mptcp" TCP_ULP type. That ULP is intended only for use by in-kernel
      MPTCP subflows.
      
      Patch 3 fixes the possible deadlock when sending data and there are
      socket option changes to sync to the subflows.
      
      Patch 4 makes sure all MPTCP netlink event parameters are documented
      in the MPTCP uapi header.
      ====================
      
      Link: https://lore.kernel.org/r/20211214231604.211016-1-mathew.j.martineau@linux.intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      500f3720
    • M
      mptcp: add missing documented NL params · 6813b192
      Matthieu Baerts committed
      'loc_id' and 'rem_id' are set in all events linked to subflows but those
      were missing in the events description in the comments.
      
      Fixes: b911c97c ("mptcp: add netlink event support")
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6813b192
    • M
      mptcp: fix deadlock in __mptcp_push_pending() · 3d79e375
      Maxim Galaganov committed
      __mptcp_push_pending() may call mptcp_flush_join_list() with subflow
      socket lock held. If such call hits mptcp_sockopt_sync_all() then
      subsequently __mptcp_sockopt_sync() could try to lock the subflow
      socket for itself, causing a deadlock.
      
      sysrq: Show Blocked State
      task:ss-server       state:D stack:    0 pid:  938 ppid:     1 flags:0x00000000
      Call Trace:
       <TASK>
       __schedule+0x2d6/0x10c0
       ? __mod_memcg_state+0x4d/0x70
       ? csum_partial+0xd/0x20
       ? _raw_spin_lock_irqsave+0x26/0x50
       schedule+0x4e/0xc0
       __lock_sock+0x69/0x90
       ? do_wait_intr_irq+0xa0/0xa0
       __lock_sock_fast+0x35/0x50
       mptcp_sockopt_sync_all+0x38/0xc0
       __mptcp_push_pending+0x105/0x200
       mptcp_sendmsg+0x466/0x490
       sock_sendmsg+0x57/0x60
       __sys_sendto+0xf0/0x160
       ? do_wait_intr_irq+0xa0/0xa0
       ? fpregs_restore_userregs+0x12/0xd0
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f9ba546c2d0
      RSP: 002b:00007ffdc3b762d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f9ba56c8060 RCX: 00007f9ba546c2d0
      RDX: 000000000000077a RSI: 0000000000e5e180 RDI: 0000000000000234
      RBP: 0000000000cc57f0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f9ba56c8060
      R13: 0000000000b6ba60 R14: 0000000000cc7840 R15: 41d8685b1d7901b8
       </TASK>
      
      Fix the issue by using __mptcp_flush_join_list() instead of plain
      mptcp_flush_join_list() inside __mptcp_push_pending(), as suggested by
      Florian. The sockopt sync will be deferred to the workqueue.
      
      Fixes: 1b3e7ede ("mptcp: setsockopt: handle SO_KEEPALIVE and SO_PRIORITY")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/244
      Suggested-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Maxim Galaganov <max@internet.ru>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3d79e375
    • F
      mptcp: clear 'kern' flag from fallback sockets · d6692b3b
      Florian Westphal committed
      The mptcp ULP extension relies on sk->sk_sock_kern being set correctly:
      It prevents setsockopt(fd, IPPROTO_TCP, TCP_ULP, "mptcp", 6); from
      working for plain tcp sockets (any userspace-exposed socket).
      
      But in case of fallback, accept() can return a plain tcp sk.
      In such case, sk is still tagged as 'kernel' and setsockopt will work.
      
      This will crash the kernel because the subflow extension has a NULL
      ctx->conn mptcp socket:
      
      BUG: KASAN: null-ptr-deref in subflow_data_ready+0x181/0x2b0
      Call Trace:
       tcp_data_ready+0xf8/0x370
       [..]
      
      Fixes: cf7da0d6 ("mptcp: Create SUBFLOW socket for incoming connections")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d6692b3b
    • F
      mptcp: remove tcp ulp setsockopt support · 404cd9a2
      Florian Westphal committed
      TCP_ULP setsockopt cannot be used for mptcp because it is already
      used internally to plumb subflow (tcp) sockets to the mptcp layer.
      
      syzbot managed to trigger a crash for mptcp connections that are
      in fallback mode:
      
      KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
      CPU: 1 PID: 1083 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0
      RIP: 0010:tls_build_proto net/tls/tls_main.c:776 [inline]
      [..]
       __tcp_set_ulp net/ipv4/tcp_ulp.c:139 [inline]
       tcp_set_ulp+0x428/0x4c0 net/ipv4/tcp_ulp.c:160
       do_tcp_setsockopt+0x455/0x37c0 net/ipv4/tcp.c:3391
       mptcp_setsockopt+0x1b47/0x2400 net/mptcp/sockopt.c:638
      
      Remove support for TCP_ULP setsockopt.
      
      Fixes: d9e4c129 ("mptcp: only admit explicitly supported sockopt")
      Reported-by: syzbot+1fd9b69cde42967d1add@syzkaller.appspotmail.com
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      404cd9a2
    • K
      ice: Don't put stale timestamps in the skb · 37e738b6
      Karol Kolacinski committed
      The driver has to make sure it does not accidentally put a timestamp in
      the SKB before the previous timestamp has been overwritten.
      Timestamp values in the PHY are read only and do not get cleared except
      at hardware reset or when a new timestamp value is captured.
      The cached_tstamp field is used to detect the case where a new timestamp
      has not yet been captured, ensuring that we avoid sending stale
      timestamp data to the stack.
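
      The cached-value check can be sketched as follows. This is a simplified
      user-space model rather than the ice driver code: only the cached_tstamp
      field name comes from the description above, and the struct and function
      names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-slot tracking state; only cached_tstamp is named in the text. */
struct model_tx_slot {
    uint64_t cached_tstamp; /* last raw value already seen for this slot */
};

/* Returns true and updates the cache only when the register holds a value
 * different from the one seen last time, i.e. a new capture has happened. */
static bool model_tstamp_is_fresh(struct model_tx_slot *slot, uint64_t raw_tstamp)
{
    assert(slot);
    if (raw_tstamp == slot->cached_tstamp)
        return false; /* register not overwritten yet: stale value */
    slot->cached_tstamp = raw_tstamp;
    return true;
}
```

      A caller in this model would only hand the timestamp to the stack when
      the function returns true, and would otherwise retry later.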
      
      Fixes: ea9b847c ("ice: enable transmit timestamps for E810 devices")
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      37e738b6
    • K
      ice: Use div64_u64 instead of div_u64 in adjfine · 0013881c
      Karol Kolacinski committed
      Change the division in ice_ptp_adjfine from div_u64 to div64_u64.
      div_u64 is used when the divisor is 32 bit but in this case incval is
      64 bit and it caused incorrect calculations and incval adjustments.
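
      The effect of passing a 64-bit divisor through a 32-bit parameter can be
      shown in user space. The two functions below are hypothetical models of
      the parameter widths only (a 32-bit divisor versus a 64-bit divisor);
      they are not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Models a helper whose divisor parameter is 32 bits wide: a 64-bit
 * argument is truncated before the division takes place. */
static uint64_t model_div_u64(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* Models a helper that takes the full 64-bit divisor. */
static uint64_t model_div64_u64(uint64_t dividend, uint64_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}
```

      With a dividend of 0x400000008 and a divisor of 0x100000002, the 64-bit
      model yields 4, while the 32-bit model sees the divisor truncated to 2
      and yields 0x200000004.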
      
      Fixes: 06c16d89 ("ice: register 1588 PTP clock device object for E810 devices")
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      0013881c
  4. 14 Dec, 2021 1 commit