1. 12 Jul, 2021 4 commits
    • X
      bpf, test: fix NULL pointer dereference on invalid expected_attach_type · 5e21bb4e
      Xuan Zhuo committed
      These two types of XDP progs (BPF_XDP_DEVMAP, BPF_XDP_CPUMAP) will not be
      executed directly in the driver, therefore we should also not directly
      run them from here. To run in these two situations, there must be further
      preparations done, otherwise these may cause a kernel panic.
      
      For more details, see also dev_xdp_attach().
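
A minimal userspace sketch of the kind of guard described above, assuming the test-run entry point can inspect the program's expected attach type before executing it. The types `prog_model` and `attach_type_model` and the function `xdp_test_run_check` are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's expected_attach_type values. */
enum attach_type_model {
    MODEL_XDP_PLAIN,
    MODEL_XDP_DEVMAP,   /* stands in for BPF_XDP_DEVMAP */
    MODEL_XDP_CPUMAP,   /* stands in for BPF_XDP_CPUMAP */
};

struct prog_model {
    enum attach_type_model expected_attach_type;
};

/*
 * Return 0 if the program may be run directly, or -EINVAL if it is a
 * devmap/cpumap program that needs further preparation first.
 */
int xdp_test_run_check(const struct prog_model *prog)
{
    assert(prog != NULL);

    if (prog->expected_attach_type == MODEL_XDP_DEVMAP ||
        prog->expected_attach_type == MODEL_XDP_CPUMAP)
        return -EINVAL;

    return 0;
}
```

Rejecting the request up front keeps the unprepared program from ever reaching the interpreter, which is where the trace below faults.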
      
        [   46.982479] BUG: kernel NULL pointer dereference, address: 0000000000000000
        [   46.984295] #PF: supervisor read access in kernel mode
        [   46.985777] #PF: error_code(0x0000) - not-present page
        [   46.987227] PGD 800000010dca4067 P4D 800000010dca4067 PUD 10dca6067 PMD 0
        [   46.989201] Oops: 0000 [#1] SMP PTI
        [   46.990304] CPU: 7 PID: 562 Comm: a.out Not tainted 5.13.0+ #44
        [   46.992001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/24
        [   46.995113] RIP: 0010:___bpf_prog_run+0x17b/0x1710
        [   46.996586] Code: 49 03 14 cc e8 76 f6 fe ff e9 ad fe ff ff 0f b6 43 01 48 0f bf 4b 02 48 83 c3 08 89 c2 83 e0 0f c0 ea 04 02
        [   47.001562] RSP: 0018:ffffc900005afc58 EFLAGS: 00010246
        [   47.003115] RAX: 0000000000000000 RBX: ffffc9000023f068 RCX: 0000000000000000
        [   47.005163] RDX: 0000000000000000 RSI: 0000000000000079 RDI: ffffc900005afc98
        [   47.007135] RBP: 0000000000000000 R08: ffffc9000023f048 R09: c0000000ffffdfff
        [   47.009171] R10: 0000000000000001 R11: ffffc900005afb40 R12: ffffc900005afc98
        [   47.011172] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff825258a8
        [   47.013244] FS:  00007f04a5207580(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
        [   47.015705] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [   47.017475] CR2: 0000000000000000 CR3: 0000000100182005 CR4: 0000000000770ee0
        [   47.019558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [   47.021595] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [   47.023574] PKRU: 55555554
        [   47.024571] Call Trace:
        [   47.025424]  __bpf_prog_run32+0x32/0x50
        [   47.026296]  ? printk+0x53/0x6a
        [   47.027066]  ? ktime_get+0x39/0x90
        [   47.027895]  bpf_test_run.cold.28+0x23/0x123
        [   47.028866]  ? printk+0x53/0x6a
        [   47.029630]  bpf_prog_test_run_xdp+0x149/0x1d0
        [   47.030649]  __sys_bpf+0x1305/0x23d0
        [   47.031482]  __x64_sys_bpf+0x17/0x20
        [   47.032316]  do_syscall_64+0x3a/0x80
        [   47.033165]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [   47.034254] RIP: 0033:0x7f04a51364dd
        [   47.035133] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 48
        [   47.038768] RSP: 002b:00007fff8f9fc518 EFLAGS: 00000213 ORIG_RAX: 0000000000000141
        [   47.040344] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f04a51364dd
        [   47.041749] RDX: 0000000000000048 RSI: 0000000020002a80 RDI: 000000000000000a
        [   47.043171] RBP: 00007fff8f9fc530 R08: 0000000002049300 R09: 0000000020000100
        [   47.044626] R10: 0000000000000004 R11: 0000000000000213 R12: 0000000000401070
        [   47.046088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        [   47.047579] Modules linked in:
        [   47.048318] CR2: 0000000000000000
        [   47.049120] ---[ end trace 7ad34443d5be719a ]---
        [   47.050273] RIP: 0010:___bpf_prog_run+0x17b/0x1710
        [   47.051343] Code: 49 03 14 cc e8 76 f6 fe ff e9 ad fe ff ff 0f b6 43 01 48 0f bf 4b 02 48 83 c3 08 89 c2 83 e0 0f c0 ea 04 02
        [   47.054943] RSP: 0018:ffffc900005afc58 EFLAGS: 00010246
        [   47.056068] RAX: 0000000000000000 RBX: ffffc9000023f068 RCX: 0000000000000000
        [   47.057522] RDX: 0000000000000000 RSI: 0000000000000079 RDI: ffffc900005afc98
        [   47.058961] RBP: 0000000000000000 R08: ffffc9000023f048 R09: c0000000ffffdfff
        [   47.060390] R10: 0000000000000001 R11: ffffc900005afb40 R12: ffffc900005afc98
        [   47.061803] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff825258a8
        [   47.063249] FS:  00007f04a5207580(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
        [   47.065070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [   47.066307] CR2: 0000000000000000 CR3: 0000000100182005 CR4: 0000000000770ee0
        [   47.067747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [   47.069217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [   47.070652] PKRU: 55555554
        [   47.071318] Kernel panic - not syncing: Fatal exception
        [   47.072854] Kernel Offset: disabled
        [   47.073683] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Fixes: 92164774 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
      Fixes: fbee97fe ("bpf: Add support to attach bpf program to a devmap entry")
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: David Ahern <dsahern@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20210708080409.73525-1-xuanzhuo@linux.alibaba.com
    • B
      doc, af_xdp: Fix bind flags option typo · f35e0cc2
      Baruch Siach committed
      Fix XDP_ZERO_COPY flag typo since it is actually named XDP_ZEROCOPY
      instead as per if_xdp.h uapi header.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/1656fdf94704e9e735df0f8b97667d8f26dd098b.1625550240.git.baruch@tkos.co.il
    • M
      net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340 · a5de4be0
      Marek Behún committed
      It seems that we cannot differentiate 88X3310 from 88X3340 by simply
      looking at bit 3 of revision ID. This only works on revisions A0 and A1.
      On revision B0, this bit is always 1.
      
      Instead use the 3.d00d register for differentiation, since this register
      contains information about number of ports on the device.
      
      Fixes: 9885d016 ("net: phy: marvell10g: add separate structure for 88X3340")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reported-by: Matteo Croce <mcroce@linux.microsoft.com>
      Tested-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • K
      dsa: fix for_each_child.cocci warnings · 84f7e0bb
      kernel test robot committed
      for_each_available_child_of_node should have of_node_put() before
      return around line 423.
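
The pattern behind this warning can be sketched in plain C, assuming an iterator that holds a reference on the current child for the duration of each loop body and drops it only when advancing. `node_model`, `node_get`, `node_put` and `find_available_child` are hypothetical names used for illustration; they only model the reference counting, not the device tree API.

```c
#include <assert.h>
#include <stddef.h>

struct node_model {
    int id;
    int available;
    int refcount;
};

void node_get(struct node_model *n)
{
    n->refcount++;
}

void node_put(struct node_model *n)
{
    assert(n->refcount > 0);
    n->refcount--;
}

/* Return the index of the first available child with wanted_id, or -1. */
int find_available_child(struct node_model *children, size_t count,
                         int wanted_id)
{
    for (size_t i = 0; i < count; i++) {
        struct node_model *child = &children[i];

        if (!child->available)
            continue;

        node_get(child);        /* reference held by the iterator */
        if (child->id == wanted_id) {
            node_put(child);    /* leaving the loop early: drop it here */
            return (int)i;
        }
        node_put(child);        /* the iterator drops it when advancing */
    }
    return -1;
}
```

Without the `node_put()` on the early-return path, the matched child would keep an extra reference forever, which is the leak the Coccinelle script flags.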
      
      Generated by: scripts/coccinelle/iterators/for_each_child.cocci
      
      CC: Alexander Lobakin <alobakin@pm.me>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Jul, 2021 1 commit
  3. 10 Jul, 2021 16 commits
    • D
      Merge branch 'mptcp-Connection-and-accounting-fixes' · 849fd444
      David S. Miller committed
      Mat Martineau says:
      
      ====================
      mptcp: Connection and accounting fixes
      
      Here are some miscellaneous fixes for MPTCP:
      
      Patch 1 modifies an MPTCP hash so it doesn't depend on one of skb->dev
      and skb->sk being non-NULL.
      
      Patch 2 removes an extra destructor call when rejecting a join due to
      port mismatch.
      
      Patches 3 and 5 more cleanly handle error conditions with MP_JOIN and
      syncookies, and update a related self test.
      
      Patch 4 makes sure packets that trigger a subflow TCP reset during MPTCP
      option header processing are correctly dropped.
      
      Patch 6 addresses a rmem accounting issue that could keep packets in
      subflow receive buffers longer than necessary, delaying MPTCP-level
      ACKs.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: properly account bulk freed memory · ce599c51
      Paolo Abeni committed
      After commit 87952603 ("mptcp: protect the rx path with
      the msk socket spinlock"), the rmem currently used by a given
      msk is actually sk_rmem_alloc - rmem_released.
      
      The safety check in mptcp_data_ready() does not take this into
      account. As a result, legitimate incoming data is kept in the
      subflow receive queue for no reason, delaying or blocking
      MPTCP-level ACK generation.
      
      This change addresses the issue by introducing a new helper that
      fetches the rmem in use and calling it where needed. It also adds
      a MIB counter for the exceptional event described above, which
      indicates that the peer is misbehaving.
      
      Finally, introduce the required annotation when rmem_released is
      updated.
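
A minimal sketch of the accounting described above, assuming the limit is compared against the receive buffer size. The struct `msk_model`, its `rcv_pruned` counter, and the functions `msk_rmem` and `msk_may_queue` are hypothetical; only the formula sk_rmem_alloc - rmem_released comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct msk_model {
    int sk_rmem_alloc;        /* bytes charged to the receive buffer */
    int rmem_released;        /* bytes consumed but not yet uncharged */
    int sk_rcvbuf;            /* receive buffer limit */
    unsigned long rcv_pruned; /* hypothetical counter for the rare overflow */
};

/* Memory really in use: charged bytes minus those already released. */
int msk_rmem(const struct msk_model *msk)
{
    return msk->sk_rmem_alloc - msk->rmem_released;
}

/* Decide whether more data may be moved to the msk receive queue. */
bool msk_may_queue(struct msk_model *msk)
{
    assert(msk != NULL);

    if (msk_rmem(msk) > msk->sk_rcvbuf) {
        msk->rcv_pruned++;    /* exceptional: the peer is misbehaving */
        return false;
    }
    return true;
}
```

Checking the raw `sk_rmem_alloc` instead would refuse data whenever released bytes had not been uncharged yet, which matches the stall the message describes.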
      
      Fixes: 87952603 ("mptcp: protect the rx path with the msk socket spinlock")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/211
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      selftests: mptcp: fix case multiple subflows limited by server · a7da4416
      Jianguo Wu committed
      After the patch "mptcp: fix syncookie process if mptcp can not_accept new
      subflow", an MP_JOIN SYN is dropped when the subflow limit is reached, and
      no SYN/ACK is sent in reply.
      
      So in the "multiple subflows limited by server" case, the expected SYN/ACK
      count should be 1.
      
      Fixes: 00587187 ("selftests: mptcp: add test cases for mptcp join tests with syn cookies")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mptcp: avoid processing packet if a subflow reset · 6787b7e3
      Jianguo Wu committed
      If check_fully_established() causes a subflow reset, tcp_data_queue()
      should not continue to process the packet.
      Add a return value to mptcp_incoming_options(): return false if the
      subflow has been reset, true otherwise. Then drop the packet in
      tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options()
      returns false.
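
The control flow can be sketched as follows, assuming a simplified model in which option processing either succeeds or resets the subflow. `subflow_model`, `incoming_options` and `data_queue` are hypothetical names; they mirror only the "return false, then drop" contract from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict_model { PKT_QUEUED, PKT_DROPPED };

struct subflow_model {
    bool establish_check_fails; /* input: option check will fail */
    bool reset;                 /* output: subflow was reset */
    int queued;                 /* packets accepted so far */
};

/* Return false if processing the options reset the subflow. */
bool incoming_options(struct subflow_model *sf)
{
    assert(sf != NULL);

    if (sf->establish_check_fails) {
        sf->reset = true;
        return false;
    }
    return true;
}

/* The caller drops the packet instead of touching a reset subflow. */
enum verdict_model data_queue(struct subflow_model *sf)
{
    if (!incoming_options(sf))
        return PKT_DROPPED;

    sf->queued++;
    return PKT_QUEUED;
}
```

Returning a status from the option handler lets each caller decide to drop, rather than having the handler's side effect go unnoticed further down the receive path.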
      
      Fixes: d5824847 ("mptcp: fix fallback for MP_JOIN subflows")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mptcp: fix syncookie process if mptcp can not_accept new subflow · 8547ea5f
      Jianguo Wu committed
      The client side logs many "TCP: tcp_fin: Impossible, sk->sk_state=7"
      warnings during stress testing with wrk and webfsd.
      
      At least two cases may trigger this warning:
      1. MPTCP is using syncookies and the server receives an MP_JOIN SYN
        request. In subflow_check_req(), mptcp_can_accept_new_subflow()
        returns false, so subflow_init_req_cookie_join_save() isn't
        called, i.e. the data present in the MP_JOIN SYN request and the
        random nonce are not stored in the join_entries[] hash table,
        but a SYN/ACK is still sent. When the third ACK is received,
        mptcp_token_join_cookie_init_state() returns false and the
        third ACK is dropped. If the client then closes the MPTCP
        connection, it sends a DATA_FIN and an MPTCP FIN. The DATA_FIN
        doesn't carry MP_CAPABLE or MP_JOIN,
        so mptcp_subflow_init_cookie_req() returns 0 and the cookie
        check passes: the MP_JOIN request falls back to normal TCP.
        The server sends a TCP FIN when it closes. On the client side,
        processing that TCP FIN triggers a reset; the code path is:
          tcp_data_queue()->mptcp_incoming_options()
            ->check_fully_established()->mptcp_subflow_reset().
        mptcp_subflow_reset() sets the socket state to TCP_CLOSE,
        so tcp_fin hits TCP_CLOSE and prints the warning.
      
      2. MPTCP is using syncookies and the server receives the third ACK.
        In mptcp_subflow_init_cookie_req(), mptcp_can_accept_new_subflow()
        returns false and subflow_req->mp_join is not set to 1,
        so subflow_syn_recv_sock() does not reset the MP_JOIN
        subflow but falls back to normal TCP. The same thing then
        happens when the server sends a TCP FIN on close.
      
      For case 1, make subflow_check_req() return -EPERM,
      so that tcp_conn_request() drops the MP_JOIN SYN.
      
      For case 2, let subflow_syn_recv_sock() call
      mptcp_can_accept_new_subflow(), do a fatal fallback, and send a reset.
      
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mptcp: remove redundant req destruct in subflow_check_req() · 030d37bd
      Jianguo Wu committed
      In subflow_check_req(), if the subflow source port does not match, the
      code puts the msk, destroys the token, and destructs the req before
      returning -EPERM. All of this is already done by
      subflow_req_destructor() via:
      
        tcp_conn_request()
          |--__reqsk_free()
            |--subflow_req_destructor()
      
      So remove this redundant code; otherwise tcp_v4_reqsk_destructor()
      is called twice and may double free
      inet_rsk(req)->ireq_opt.
      
      Fixes: 5bc56388 ("mptcp: add port number check for MP_JOIN")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow join · 0c71929b
      Jianguo Wu committed
      I ran a stress test with wrk[1] and webfsd[2] with the assistance of
      mptcp-tools[3]:
      
        Server side:
            ./use_mptcp.sh webfsd -4 -R /tmp/ -p 8099
        Client side:
            ./use_mptcp.sh wrk -c 200 -d 30 -t 4 http://192.168.174.129:8099/
      
      and got the following warning message:
      
      [   55.552626] TCP: request_sock_subflow: Possible SYN flooding on port 8099. Sending cookies.  Check SNMP counters.
      [   55.553024] ------------[ cut here ]------------
      [   55.553027] WARNING: CPU: 0 PID: 10 at net/core/flow_dissector.c:984 __skb_flow_dissect+0x280/0x1650
      ...
      [   55.553117] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.12.0+ #18
      [   55.553121] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
      [   55.553124] RIP: 0010:__skb_flow_dissect+0x280/0x1650
      ...
      [   55.553133] RSP: 0018:ffffb79580087770 EFLAGS: 00010246
      [   55.553137] RAX: 0000000000000000 RBX: ffffffff8ddb58e0 RCX: ffffb79580087888
      [   55.553139] RDX: ffffffff8ddb58e0 RSI: ffff8f7e4652b600 RDI: 0000000000000000
      [   55.553141] RBP: ffffb79580087858 R08: 0000000000000000 R09: 0000000000000008
      [   55.553143] R10: 000000008c622965 R11: 00000000d3313a5b R12: ffff8f7e4652b600
      [   55.553146] R13: ffff8f7e465c9062 R14: 0000000000000000 R15: ffffb79580087888
      [   55.553149] FS:  0000000000000000(0000) GS:ffff8f7f75e00000(0000) knlGS:0000000000000000
      [   55.553152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   55.553154] CR2: 00007f73d1d19000 CR3: 0000000135e10004 CR4: 00000000003706f0
      [   55.553160] Call Trace:
      [   55.553166]  ? __sha256_final+0x67/0xd0
      [   55.553173]  ? sha256+0x7e/0xa0
      [   55.553177]  __skb_get_hash+0x57/0x210
      [   55.553182]  subflow_init_req_cookie_join_save+0xac/0xc0
      [   55.553189]  subflow_check_req+0x474/0x550
      [   55.553195]  ? ip_route_output_key_hash+0x67/0x90
      [   55.553200]  ? xfrm_lookup_route+0x1d/0xa0
      [   55.553207]  subflow_v4_route_req+0x8e/0xd0
      [   55.553212]  tcp_conn_request+0x31e/0xab0
      [   55.553218]  ? selinux_socket_sock_rcv_skb+0x116/0x210
      [   55.553224]  ? tcp_rcv_state_process+0x179/0x6d0
      [   55.553229]  tcp_rcv_state_process+0x179/0x6d0
      [   55.553235]  tcp_v4_do_rcv+0xaf/0x220
      [   55.553239]  tcp_v4_rcv+0xce4/0xd80
      [   55.553243]  ? ip_route_input_rcu+0x246/0x260
      [   55.553248]  ip_protocol_deliver_rcu+0x35/0x1b0
      [   55.553253]  ip_local_deliver_finish+0x44/0x50
      [   55.553258]  ip_local_deliver+0x6c/0x110
      [   55.553262]  ? ip_rcv_finish_core.isra.19+0x5a/0x400
      [   55.553267]  ip_rcv+0xd1/0xe0
      ...
      
      After debugging, I found that in __skb_flow_dissect() both skb->dev and
      skb->sk are NULL, so net is NULL and WARN_ON_ONCE(!net) triggers.
      In fact, net is always NULL in this code path, because skb->dev is set
      to NULL in tcp_v4_rcv() and skb->sk is never set.
      
      Code snippet in __skb_flow_dissect() that triggers the warning:
        975         if (skb) {
        976                 if (!net) {
        977                         if (skb->dev)
        978                                 net = dev_net(skb->dev);
        979                         else if (skb->sk)
        980                                 net = sock_net(skb->sk);
        981                 }
        982         }
        983
        984         WARN_ON_ONCE(!net);
      
      So use a hash derived from the sequence number and the transport header instead.
      
      [1] https://github.com/wg/wrk
      [2] https://github.com/ourway/webfsd
      [3] https://github.com/pabeni/mptcp-tools
      
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 5d52c906
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-07-09
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 9 non-merge commits during the last 9 day(s) which contain
      a total of 13 files changed, 118 insertions(+), 62 deletions(-).
      
      The main changes are:
      
      1) Fix runqslower task->state access from BPF, from SanjayKumar Jeyakumar.
      
      2) Fix subprog poke descriptor tracking use-after-free, from John Fastabend.
      
      3) Fix sparse complaint from prior devmap RCU conversion, from Toke Høiland-Jørgensen.
      
      4) Fix missing va_end in bpftool JIT json dump's error path, from Gu Shengxian.
      
      5) Fix tools/bpf install target from missing runqslower install, from Wei Li.
      
      6) Fix xdpsock BPF sample to unload program on shared umem option, from Wang Hai.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      net: validate lwtstate->data before returning from skb_tunnel_info() · 67a9c943
      Taehee Yoo committed
      skb_tunnel_info() returns lwtstate->data as a pointer to ip_tunnel_info
      without validation. lwtstate->data can hold various types, such as
      mpls_iptunnel_encap, and these are not compatible with each other.
      So skb_tunnel_info() should validate the type before returning that pointer.
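
A minimal sketch of such a validation, assuming the state object carries a type tag that says how its payload must be interpreted. `encap_model`, `lwt_state_model`, `tun_info_model` and `tunnel_info` are hypothetical names for illustration, not the kernel's structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encapsulation tags; only IP and IP6 carry tunnel info. */
enum encap_model { ENCAP_MODEL_MPLS, ENCAP_MODEL_IP, ENCAP_MODEL_IP6 };

struct tun_info_model {
    unsigned short tp_dst;
};

struct lwt_state_model {
    enum encap_model type;
    void *data;            /* payload whose layout depends on type */
};

/* Hand out the payload as tunnel info only when the tag says it is one. */
const struct tun_info_model *tunnel_info(const struct lwt_state_model *st)
{
    if (st == NULL)
        return NULL;
    if (st->type != ENCAP_MODEL_IP && st->type != ENCAP_MODEL_IP6)
        return NULL;
    return st->data;
}
```

Callers that receive NULL treat the packet as having no tunnel metadata, instead of reading an MPLS payload through the wrong struct layout as in the KASAN report below.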
      
      Splat looks like:
      BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan]
      Read of size 2 at addr ffff888106ec2698 by task ping/811
      
      CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195
      Call Trace:
       dump_stack_lvl+0x56/0x7b
       print_address_description.constprop.8.cold.13+0x13/0x2ee
       ? vxlan_get_route+0x418/0x4b0 [vxlan]
       ? vxlan_get_route+0x418/0x4b0 [vxlan]
       kasan_report.cold.14+0x83/0xdf
       ? vxlan_get_route+0x418/0x4b0 [vxlan]
       vxlan_get_route+0x418/0x4b0 [vxlan]
       [ ... ]
       vxlan_xmit_one+0x148b/0x32b0 [vxlan]
       [ ... ]
       vxlan_xmit+0x25c5/0x4780 [vxlan]
       [ ... ]
       dev_hard_start_xmit+0x1ae/0x6e0
       __dev_queue_xmit+0x1f39/0x31a0
       [ ... ]
       neigh_xmit+0x2f9/0x940
       mpls_xmit+0x911/0x1600 [mpls_iptunnel]
       lwtunnel_xmit+0x18f/0x450
       ip_finish_output2+0x867/0x2040
       [ ... ]
      
      Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: ip_tunnel: fix mtu calculation for ETHER tunnel devices · 9992a078
      Hangbin Liu committed
      Commit 28e104d0 ("net: ip_tunnel: fix mtu calculation") removed the
      dev->hard_header_len subtraction when calculating the MTU for tunnel
      devices, as there is an overhead for devices that have header_ops.
      
      But there are ETHER tunnel devices, like gre_tap or erspan, which don't
      have header_ops but set dev->hard_header_len during setup. As a result,
      packets larger than (MTU - ETH_HLEN) cannot be transmitted. Fix it by
      subtracting the ETHER tunnel devices' dev->hard_header_len in the MTU
      calculation.
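
The arithmetic can be illustrated with a small sketch, assuming a 1500-byte underlay MTU and a 24-byte encapsulation overhead (20-byte IPv4 header plus 4-byte GRE header). `tunnel_dev_model` and `tunnel_mtu` are hypothetical names; the 14-byte value corresponds to ETH_HLEN.

```c
#include <assert.h>
#include <stdbool.h>

struct tunnel_dev_model {
    bool is_ether;        /* true for Ethernet-type tunnels such as gre_tap */
    int hard_header_len;  /* inner link-layer header length set at setup */
};

/* MTU left for the inner payload after encapsulation overhead. */
int tunnel_mtu(int lower_mtu, int encap_overhead,
               const struct tunnel_dev_model *dev)
{
    int mtu = lower_mtu - encap_overhead;

    /* Ethernet tunnels also carry an inner Ethernet header. */
    if (dev->is_ether)
        mtu -= dev->hard_header_len;

    return mtu;
}
```

Under these assumptions an Ethernet-type tunnel ends up with 1500 - 24 - 14 = 1462 bytes, while a plain layer-3 tunnel keeps 1500 - 24 = 1476.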
      
      Fixes: 28e104d0 ("net: ip_tunnel: fix mtu calculation")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache · 28b34f01
      Antoine Tenart committed
      Some socket buffers allocated in the fclone cache (in __alloc_skb) can
      end up in the following path[1]:
      
      napi_skb_finish
        __kfree_skb_defer
          napi_skb_cache_put
      
      The issue is napi_skb_cache_put is not fclone friendly and will put
      those skbuff in the skb cache to be reused later, although this cache
      only expects skbuff allocated from skbuff_head_cache. When this happens
      the skbuff is eventually freed using the wrong origin cache, and we can
      see traces similar to:
      
      [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
      [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
      [ 1223.950211] Modules linked in:
      [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
      [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
      [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0
      
      This sometimes leads to other memory-related issues.
      
      Fix this by using __kfree_skb for fclone skbuffs, similar to what is done
      at the other place where __kfree_skb_defer is called.
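
The decision can be sketched as a small dispatch, assuming each buffer records which cache it came from. `skb_model`, `free_path_model` and `pick_free_path` are hypothetical names that model only the choice between a direct free and the deferred reuse cache.

```c
#include <assert.h>
#include <stdbool.h>

enum free_path_model {
    FREE_DIRECT,          /* return the object to its own origin cache */
    FREE_DEFER_TO_CACHE,  /* park the object in the reuse cache */
};

struct skb_model {
    bool from_fclone_cache; /* allocated from the fclone cache */
};

/* Only buffers from the plain head cache are safe to park for reuse. */
enum free_path_model pick_free_path(const struct skb_model *skb)
{
    return skb->from_fclone_cache ? FREE_DIRECT : FREE_DEFER_TO_CACHE;
}
```

The reuse cache assumes a single origin cache for everything it holds, so anything allocated elsewhere has to bypass it.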
      
      [1] At least in setups using veth pairs and tunnels. Building a kernel
          with KASAN we can for example see packets allocated in
          sk_stream_alloc_skb hit the above path and later the issue arises
          when the skbuff is reused.
      
      Fixes: 9243adfc ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
      Cc: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path · 358ed624
      Talal Ahmad committed
      sk_wmem_schedule makes sure that sk_forward_alloc has enough
      bytes for charging that is going to be done by sk_mem_charge.
      
      In the transmit zerocopy path, there is sk_mem_charge but there was
      no call to sk_wmem_schedule. This change adds that call.
      
      Without this call to sk_wmem_schedule, sk_forward_alloc can go
      negative, which is a bug: sk_forward_alloc is per-socket space
      that has already been forward charged, so it can't be negative.
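
A minimal sketch of the schedule-then-charge ordering, assuming a simplified model in which the socket tops up its forward allocation byte by byte from a global budget (the kernel works in larger units). `sock_model`, `wmem_schedule`, `mem_charge` and `charge_zerocopy` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct sock_model {
    int forward_alloc; /* bytes reserved in advance for this socket */
    int budget;        /* bytes a hypothetical global pool can still grant */
};

/* Make sure forward_alloc covers size, topping it up from the budget. */
bool wmem_schedule(struct sock_model *sk, int size)
{
    int missing;

    if (sk->forward_alloc >= size)
        return true;

    missing = size - sk->forward_alloc;
    if (missing > sk->budget)
        return false;

    sk->budget -= missing;
    sk->forward_alloc += missing;
    return true;
}

/* Consume bytes from the forward allocation. */
void mem_charge(struct sock_model *sk, int size)
{
    sk->forward_alloc -= size;
}

/* Charge only after the reservation succeeded. */
bool charge_zerocopy(struct sock_model *sk, int size)
{
    if (!wmem_schedule(sk, size))
        return false;

    mem_charge(sk, size);
    assert(sk->forward_alloc >= 0);
    return true;
}
```

Calling `mem_charge()` alone on an empty reservation would drive `forward_alloc` below zero, which is the state the commit rules out.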
      
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: send SYNACK packet with accepted fwmark · 43b90bfa
      Alexander Ovechkin committed
      Commit e05a90ec ("net: reflect mark on tcp syn ack packets")
      fixed IPv4 only.
      
      This part is for the IPv6 side.
      
      Fixes: e05a90ec ("net: reflect mark on tcp syn ack packets")
      Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
      Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      net: ti: fix UAF in tlan_remove_one · 0336f8ff
      Pavel Skripkin committed
      priv is netdev private data and cannot be used after the
      free_netdev() call. Using priv after free_netdev()
      can cause a UAF bug. Fix it by moving free_netdev() to the end of the
      function.
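
The ordering rule can be sketched in userspace, assuming the private data lives inside the same allocation as the device object, so freeing the device also frees the private data. `netdev_model`, `priv_model`, `netdev_model_alloc` and `remove_one` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct priv_model {
    int work_pending;
};

/* The private area is embedded in the device allocation. */
struct netdev_model {
    struct priv_model priv;
};

struct netdev_model *netdev_model_alloc(void)
{
    return calloc(1, sizeof(struct netdev_model));
}

/* Tear down the device; returns the number of work items cancelled. */
int remove_one(struct netdev_model *dev)
{
    struct priv_model *priv = &dev->priv;
    int cancelled = 0;

    /* Every use of priv happens while the allocation is still alive. */
    if (priv->work_pending) {
        priv->work_pending = 0;
        cancelled = 1;
    }

    /* Freeing the device is the last step: priv is invalid afterwards. */
    free(dev);
    return cancelled;
}
```

Moving the `free()` above the `work_pending` check would reproduce the use-after-free pattern that the fix removes.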
      
      Fixes: 1e0a8b13 ("tlan: cancel work at remove path")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      net: qcom/emac: fix UAF in emac_remove · ad297cd2
      Pavel Skripkin committed
      adpt is netdev private data and cannot be used after the
      free_netdev() call. Using adpt after free_netdev()
      can cause a UAF bug. Fix it by moving free_netdev() to the end of the
      function.
      
      Fixes: 54e19bc7 ("net: qcom/emac: do not use devm on internal phy pdev")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      net: moxa: fix UAF in moxart_mac_probe · c78eaeeb
      Pavel Skripkin committed
      In case of netdev registration failure the code path will
      jump to the init_fail label:
      
      init_fail:
      	netdev_err(ndev, "init failed\n");
      	moxart_mac_free_memory(ndev);
      irq_map_fail:
      	free_netdev(ndev);
      	return ret;
      
      So there is no need to call free_netdev() before jumping
      to the error handling path, since doing so can cause a UAF or
      double-free bug.
      
      Fixes: 6c821bd9 ("net: Add MOXA ART SoCs ethernet driver")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 09 Jul, 2021 13 commits
    • J
      bpf: Selftest to verify mixing bpf2bpf calls and tailcalls with insn patch · 1fb5ba29
      John Fastabend committed
      This adds some extra noise to the tailcall_bpf2bpf4 tests that will cause
      the verifier to patch insns. This then moves around the subprog start/end
      insn index and poke descriptor insn index to ensure that the verifier and
      JIT will continue to track these correctly.
      
      If done correctly, the verifier should pass this program the same as before
      and the JIT should emit tail call logic.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-3-john.fastabend@gmail.com
    • J
      bpf: Track subprog poke descriptors correctly and fix use-after-free · f263a814
      John Fastabend committed
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were treating
      these objects as special; I agree, it's unnecessary. The fix here removes the per
      subprogram poke table allocation and map tracking and instead simply points
      the aux->poke_tab pointer at the main program's poke table. This way, map
      tracking is simplified to the main program and we do not need to manage them
      per subprogram.
      
      This also means, bpf_prog_free_deferred(), which unwinds the program reference
      counting and kfrees objects, needs to ensure that we don't try to double free
      the poke_tab when free'ing the subprog structures. This is easily solved by
      NULL'ing the poke_tab pointer. The second detail is to ensure that per
      subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure to point at the subprogram value
      so JITs can easily check while walking the poke_tab structure if the current
      entry belongs to the current program. The aux pointer is stable and therefore
      suitable for such comparison. On the jit_subprogs() error path, we omit
      cleaning up the poke->aux field because these are only ever referenced from
      the JIT side, but on error we will never make it to the JIT, so it's fine to
      leave them dangling. Removing these pointers would complicate the error path
      for no reason. However, we do need to untrack all poke descriptors from the
      main program as otherwise they could race with the freeing of JIT memory from
      the subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
      However, subprog_end is the next subprogram's start instruction.
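
      The off-by-one can be illustrated with a small userspace sketch. This is
      not the verifier code: struct subprog_range and the helper names are
      hypothetical, and the model assumes only what the text states, namely
      that a subprogram's end index is the next subprogram's first instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a subprogram's instruction range: [start, end),
 * where 'end' is the first instruction of the next subprogram. */
struct subprog_range {
	unsigned int start;
	unsigned int end;
};

/* Half-open check: the boundary instruction belongs to the next subprogram. */
static int insn_in_subprog(unsigned int insn_idx, struct subprog_range r)
{
	assert(r.start <= r.end);
	return insn_idx >= r.start && insn_idx < r.end;
}

/* Inclusive check as quoted in the text; it also claims the boundary. */
static int insn_in_subprog_inclusive(unsigned int insn_idx,
				     struct subprog_range r)
{
	assert(r.start <= r.end);
	return insn_idx >= r.start && insn_idx <= r.end;
}

typedef int (*range_check_fn)(unsigned int, struct subprog_range);

/* Counts how many subprograms claim the given instruction index. */
static int count_owners(unsigned int insn_idx,
			const struct subprog_range *ranges, size_t n,
			range_check_fn check)
{
	int owners = 0;

	for (size_t i = 0; i < n; i++)
		owners += check(insn_idx, ranges[i]);
	return owners;
}
```

      For two adjacent ranges such as [0, 5) and [5, 9), instruction 5 has one
      owner under the half-open check but two under the inclusive one, so a
      poke descriptor at the boundary would be attributed to both subprograms.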
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      f263a814
    • F
      net: bcmgenet: Ensure all TX/RX queues DMAs are disabled · 2b452550
      Florian Fainelli committed
      Make sure that we disable each of the TX and RX queues in the TDMA and
      RDMA control registers. This is a correctness change to be symmetrical
      with the code that enables the TX and RX queues.
      Tested-by: Maxime Ripard <maxime@cerno.tech>
      Fixes: 1c1008c7 ("net: bcmgenet: add main driver file")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b452550
    • D
      Merge branch 'ncsi-phy-link-up' · 5702b81e
      David S. Miller committed
      Ivan Mikhaylov says:
      
      ====================
      net/ncsi: Add NCSI Intel OEM command to keep PHY link up
      
      Add an NCSI Intel OEM command that keeps the PHY link up and prevents any
      channel resets during the host load on i210. Also include a dummy response
      handler for the Intel manufacturer ID.
      
      Changes from v1:
       1. sparse fixes about casts
       2. put it after ncsi_dev_state_probe_cis instead of
          ncsi_dev_state_probe_channel because sometimes channel is not ready
          after it
       3. inl -> intel
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5702b81e
    • I
      net/ncsi: add dummy response handler for Intel boards · 163f5de5
      Ivan Mikhaylov committed
      Add the dummy response handler for Intel boards to prevent incorrect
      handling of OEM commands.
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      163f5de5
    • I
      net/ncsi: add NCSI Intel OEM command to keep PHY up · abd2fddc
      Ivan Mikhaylov committed
      This allows keeping the PHY link up and prevents any channel resets
      during the host load.
      
      It is the KEEP_PHY_LINK_UP option (Veto bit) in the i210 datasheet, which
      blocks PHY reset and power state changes.
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abd2fddc
    • I
      net/ncsi: fix restricted cast warning of sparse · 27fa107d
      Ivan Mikhaylov committed
      Sparse reports:
      net/ncsi/ncsi-rsp.c:406:24: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:732:33: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:756:25: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:779:25: warning: cast to restricted __be32
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27fa107d
    • R
      net: microchip: sparx5: fix kconfig warning · 96248d6d
      Randy Dunlap committed
      PHY_SPARX5_SERDES depends on OF so SPARX5_SWITCH should also depend
      on OF since 'select' does not follow any dependencies.
      
      WARNING: unmet direct dependencies detected for PHY_SPARX5_SERDES
        Depends on [n]: (ARCH_SPARX5 || COMPILE_TEST [=n]) && OF [=n] && HAS_IOMEM [=y]
        Selected by [y]:
        - SPARX5_SWITCH [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && NET_SWITCHDEV [=y] && HAS_IOMEM [=y]
      
      Fixes: 3cfa11ba ("net: sparx5: add the basic sparx5 driver")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96248d6d
    • S
      cxgb4: fix IRQ free race during driver unload · 015fe6fd
      Shahjada Abul Husain committed
      IRQs are requested during driver's ndo_open() and then later
      freed up in disable_interrupts() during driver unload.
      A race exists where driver can set the CXGB4_FULL_INIT_DONE
      flag in ndo_open() after the disable_interrupts() in driver
      unload path checks it, and hence misses calling free_irq().
      
      Fix by unregistering netdevice first and sync with driver's
      ndo_open(). This ensures disable_interrupts() checks the flag
      correctly and frees up the IRQs properly.
      
      Fixes: b37987e8 ("cxgb4: Disable interrupts and napi before unregistering netdev")
      Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
      Signed-off-by: Raju Rangoju <rajur@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      015fe6fd
    • A
      mt76: mt7921: continue to probe driver when fw already downloaded · c3426904
      Aaron Ma committed
      When the system is rebooted without a power cycle, the firmware is already
      downloaded, and returning -EIO breaks the driver probe with the error:
      mt7921e: probe of 0000:03:00.0 failed with error -5
      
      Skip firmware download and continue to probe.
      Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
      Fixes: 1c099ab4 ("mt76: mt7921: add MCU support")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3426904
    • G
      atl1c: fix Mikrotik 10/25G NIC detection · b9d233ea
      Gatis Peisenieks committed
      Since Mikrotik 10/25G NIC MDIO op emulation is not 100% reliable,
      on rare occasions some physical functions of the NIC do not get
      initialized because an early MDIO op timed out.
      
      This changes the atl1c probe on Mikrotik 10/25G NIC not to
      depend on MDIO op emulation.
      Signed-off-by: Gatis Peisenieks <gatis@mikrotik.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9d233ea
    • J
      ptp: Relocate lookup cookie to correct block. · debdd8e3
      Jonathan Lemon committed
      An earlier commit set the pps_lookup cookie, but the line
      was somehow added to the wrong code block.  Correct this.
      
      Fixes: 8602e40f ("ptp: Set lookup cookie when creating a PTP PPS source.")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Dario Binacchi <dariobin@libero.it>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      debdd8e3
    • E
      ipv6: tcp: drop silly ICMPv6 packet too big messages · c7bb4b89
      Eric Dumazet committed
      While the TCP stack scales reasonably well, there is still one part that
      can be used to DDoS it.
      
      IPv6 Packet too big messages have to look up/insert a new route,
      and if abused by attackers, can easily put hosts under high stress,
      with many CPUs contending on a spinlock while one is stuck in fib6_run_gc():
      
      ip6_protocol_deliver_rcu()
       icmpv6_rcv()
        icmpv6_notify()
         tcp_v6_err()
          tcp_v6_mtu_reduced()
           inet6_csk_update_pmtu()
            ip6_rt_update_pmtu()
             __ip6_rt_update_pmtu()
              ip6_rt_cache_alloc()
               ip6_dst_alloc()
                dst_alloc()
                 ip6_dst_gc()
                  fib6_run_gc()
                   spin_lock_bh() ...
      
      Some of our servers have been hit by malicious ICMPv6 packets
      trying to _increase_ the MTU/MSS of TCP flows.
      
      We believe these ICMPv6 packets are a result of a bug in one ISP stack,
      since they were blindly sent back for _every_ (small) packet sent to them.
      
      These packets are for one TCP flow:
      09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      
      TCP stack can filter some silly requests :
      
      1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
      2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
      
      These tests happen before the IPv6 routing stack is entered, thus
      removing the potential contention and route exhaustion.
      
      Note that the IPv6 stack was already performing these checks, but too late
      (i.e. after the route has been added, and after the potential
      garbage collection war).
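
      The two filters can be sketched as a small userspace model. This is not
      the kernel implementation: model_mtu_to_mss() is a hypothetical
      simplification of tcp_mtu_to_mss() that assumes a 40-byte IPv6 header
      and a 20-byte TCP header with no options, the enum and classify_ptb()
      are made-up names, and treating an unchanged MSS as droppable is an
      assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define IPV6_MIN_MTU 1280   /* minimum IPv6 link MTU (RFC 8200) */

/* Hypothetical simplification of tcp_mtu_to_mss(): fixed 40-byte IPv6
 * header plus 20-byte TCP header, no options or extension headers. */
static uint32_t model_mtu_to_mss(uint32_t mtu)
{
	assert(mtu >= 60);
	return mtu - 40 - 20;
}

enum ptb_verdict {
	PTB_DROP_BELOW_MIN_MTU,    /* filter 1: MTU below IPV6_MIN_MTU */
	PTB_DROP_NOT_A_REDUCTION,  /* filter 2: would not lower the MSS */
	PTB_ACCEPT                 /* hand over to the routing stack */
};

/* Classifies a Packet Too Big message before any route lookup happens. */
static enum ptb_verdict classify_ptb(uint32_t mtu, uint32_t cur_mss)
{
	if (mtu < IPV6_MIN_MTU)
		return PTB_DROP_BELOW_MIN_MTU;
	if (model_mtu_to_mss(mtu) >= cur_mss)
		return PTB_DROP_NOT_A_REDUCTION;
	return PTB_ACCEPT;
}
```

      Under these assumptions, a message advertising mtu 1460 (as in the
      capture above) for a flow whose current MSS is 1220 would be dropped by
      the second filter, since it could only raise the MSS.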
      
      v2: fix typo caught by Martin, thanks !
      v3: exports tcp_mtu_to_mss(), caught by David, thanks !
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7bb4b89
  5. 08 Jul, 2021 6 commits