1. 10 Jul, 2021 8 commits
    • mptcp: avoid processing packet if a subflow reset · 6787b7e3
      Jianguo Wu committed
      If check_fully_established() causes a subflow reset, tcp_data_queue()
      should not continue to process the packet.
      Add a return value to mptcp_incoming_options(): return false if the
      subflow has been reset, else return true. Then drop the packet in
      tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options()
      returns false.
      
      Fixes: d5824847 ("mptcp: fix fallback for MP_JOIN subflows")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
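The control flow described in the commit above (mptcp: avoid processing packet if a subflow reset) can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: `struct subflow_model`, `incoming_options()` and `data_queue()` are hypothetical stand-ins for the subflow socket, mptcp_incoming_options() and tcp_data_queue().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Numeric values mirror the kernel's TCP_ESTABLISHED and TCP_CLOSE states. */
enum { MODEL_TCP_ESTABLISHED = 1, MODEL_TCP_CLOSE = 7 };

/* Hypothetical model of a subflow; not a kernel structure. */
struct subflow_model {
    int state;          /* one of the MODEL_TCP_* values */
    bool reset_needed;  /* stands in for check_fully_established() resetting */
    int queued;         /* packets that reached the receive queue */
};

/* Returns false when option processing reset the subflow. */
static bool incoming_options(struct subflow_model *sf)
{
    if (sf->reset_needed) {
        sf->state = MODEL_TCP_CLOSE;
        return false;
    }
    return true;
}

/* The caller drops the packet instead of queueing it after a reset. */
static void data_queue(struct subflow_model *sf)
{
    assert(sf != NULL);
    if (!incoming_options(sf))
        return; /* drop: the subflow is already closed */
    sf->queued++;
}
```

With the early return, a packet that triggered a reset never reaches the code that would otherwise run against a socket in the closed state.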
    • mptcp: fix syncookie process if mptcp can not_accept new subflow · 8547ea5f
      Jianguo Wu committed
      Stress testing with wrk and webfsd produces many
      "TCP: tcp_fin: Impossible, sk->sk_state=7" warnings on the client side.
      
      At least two cases can trigger this warning:
      1. MPTCP is using syncookies and the server receives an MP_JOIN SYN.
         In subflow_check_req(), mptcp_can_accept_new_subflow() returns
         false, so subflow_init_req_cookie_join_save() is not called, i.e.
         the data present in the MP_JOIN SYN and the random nonce are not
         stored in the join_entries[] hash table, but a SYN-ACK is still
         sent. When the third ACK arrives,
         mptcp_token_join_cookie_init_state() returns false and the ACK is
         dropped. If the client then closes the MPTCP connection, it sends
         a DATA_FIN and an MPTCP FIN. The DATA_FIN carries neither
         MP_CAPABLE nor MP_JOIN, so mptcp_subflow_init_cookie_req() returns
         0, the cookie check passes, and the MP_JOIN request falls back to
         plain TCP. When the server closes, it sends a TCP FIN; on the
         client side, processing that FIN resets the subflow via:
           tcp_data_queue()->mptcp_incoming_options()
             ->check_fully_established()->mptcp_subflow_reset().
         mptcp_subflow_reset() sets the socket state to TCP_CLOSE, so
         tcp_fin() sees TCP_CLOSE and prints the warning.
      
      2. MPTCP is using syncookies and the server receives the third ACK.
         In mptcp_subflow_init_cookie_req(), mptcp_can_accept_new_subflow()
         returns false and subflow_req->mp_join is not set to 1, so
         subflow_syn_recv_sock() does not reset the MP_JOIN subflow but
         falls back to plain TCP. The same thing then happens when the
         server closes and sends a TCP FIN.
      
      For case 1, make subflow_check_req() return -EPERM so that
      tcp_conn_request() drops the MP_JOIN SYN.
      
      For case 2, let subflow_syn_recv_sock() call
      mptcp_can_accept_new_subflow(), do a fatal fallback, and send a reset.
      
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: remove redundant req destruct in subflow_check_req() · 030d37bd
      Jianguo Wu committed
      In subflow_check_req(), if the subflow source port does not match, the
      code puts the msk, destroys the token, destructs the req, and then
      returns -EPERM. All of that is already done by
      subflow_req_destructor() via:
      
        tcp_conn_request()
          |--__reqsk_free()
            |--subflow_req_destructor()
      
      So remove the redundant code; otherwise tcp_v4_reqsk_destructor() is
      called twice and may double free inet_rsk(req)->ireq_opt.
      
      Fixes: 5bc56388 ("mptcp: add port number check for MP_JOIN")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow join · 0c71929b
      Jianguo Wu committed
      I ran a stress test with wrk[1] and webfsd[2], with the assistance of
      mptcp-tools[3]:
      
        Server side:
            ./use_mptcp.sh webfsd -4 -R /tmp/ -p 8099
        Client side:
            ./use_mptcp.sh wrk -c 200 -d 30 -t 4 http://192.168.174.129:8099/
      
      and got the following warning message:
      
      [   55.552626] TCP: request_sock_subflow: Possible SYN flooding on port 8099. Sending cookies.  Check SNMP counters.
      [   55.553024] ------------[ cut here ]------------
      [   55.553027] WARNING: CPU: 0 PID: 10 at net/core/flow_dissector.c:984 __skb_flow_dissect+0x280/0x1650
      ...
      [   55.553117] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.12.0+ #18
      [   55.553121] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
      [   55.553124] RIP: 0010:__skb_flow_dissect+0x280/0x1650
      ...
      [   55.553133] RSP: 0018:ffffb79580087770 EFLAGS: 00010246
      [   55.553137] RAX: 0000000000000000 RBX: ffffffff8ddb58e0 RCX: ffffb79580087888
      [   55.553139] RDX: ffffffff8ddb58e0 RSI: ffff8f7e4652b600 RDI: 0000000000000000
      [   55.553141] RBP: ffffb79580087858 R08: 0000000000000000 R09: 0000000000000008
      [   55.553143] R10: 000000008c622965 R11: 00000000d3313a5b R12: ffff8f7e4652b600
      [   55.553146] R13: ffff8f7e465c9062 R14: 0000000000000000 R15: ffffb79580087888
      [   55.553149] FS:  0000000000000000(0000) GS:ffff8f7f75e00000(0000) knlGS:0000000000000000
      [   55.553152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   55.553154] CR2: 00007f73d1d19000 CR3: 0000000135e10004 CR4: 00000000003706f0
      [   55.553160] Call Trace:
      [   55.553166]  ? __sha256_final+0x67/0xd0
      [   55.553173]  ? sha256+0x7e/0xa0
      [   55.553177]  __skb_get_hash+0x57/0x210
      [   55.553182]  subflow_init_req_cookie_join_save+0xac/0xc0
      [   55.553189]  subflow_check_req+0x474/0x550
      [   55.553195]  ? ip_route_output_key_hash+0x67/0x90
      [   55.553200]  ? xfrm_lookup_route+0x1d/0xa0
      [   55.553207]  subflow_v4_route_req+0x8e/0xd0
      [   55.553212]  tcp_conn_request+0x31e/0xab0
      [   55.553218]  ? selinux_socket_sock_rcv_skb+0x116/0x210
      [   55.553224]  ? tcp_rcv_state_process+0x179/0x6d0
      [   55.553229]  tcp_rcv_state_process+0x179/0x6d0
      [   55.553235]  tcp_v4_do_rcv+0xaf/0x220
      [   55.553239]  tcp_v4_rcv+0xce4/0xd80
      [   55.553243]  ? ip_route_input_rcu+0x246/0x260
      [   55.553248]  ip_protocol_deliver_rcu+0x35/0x1b0
      [   55.553253]  ip_local_deliver_finish+0x44/0x50
      [   55.553258]  ip_local_deliver+0x6c/0x110
      [   55.553262]  ? ip_rcv_finish_core.isra.19+0x5a/0x400
      [   55.553267]  ip_rcv+0xd1/0xe0
      ...
      
      After debugging, I found that in __skb_flow_dissect(), skb->dev and
      skb->sk are both NULL, so net is NULL and WARN_ON_ONCE(!net) triggers.
      In fact net is always NULL in this code path, because skb->dev is set
      to NULL in tcp_v4_rcv() and skb->sk is never set.
      
      Code snippet in __skb_flow_dissect() that triggers the warning:
        975         if (skb) {
        976                 if (!net) {
        977                         if (skb->dev)
        978                                 net = dev_net(skb->dev);
        979                         else if (skb->sk)
        980                                 net = sock_net(skb->sk);
        981                 }
        982         }
        983
        984         WARN_ON_ONCE(!net);
      
      So use a hash derived from the sequence number and the transport
      header instead.
      
      [1] https://github.com/wg/wrk
      [2] https://github.com/ourway/webfsd
      [3] https://github.com/pabeni/mptcp-tools
      
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip_tunnel: fix mtu calculation for ETHER tunnel devices · 9992a078
      Hangbin Liu committed
      Commit 28e104d0 ("net: ip_tunnel: fix mtu calculation") removed the
      dev->hard_header_len subtraction when calculating the MTU for tunnel
      devices, as there is an overhead for devices that have header_ops.
      
      But there are ETHER tunnel devices, like gre_tap or erspan, which don't
      have header_ops but set dev->hard_header_len during setup. As a result,
      packets larger than (MTU - ETH_HLEN) cannot be transmitted. Fix it by
      subtracting the ETHER tunnel devices' dev->hard_header_len in the MTU
      calculation.
      
      Fixes: 28e104d0 ("net: ip_tunnel: fix mtu calculation")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
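The MTU arithmetic from the ip_tunnel commit above can be illustrated with a small userspace sketch. The struct and function names here (`struct tun_dev_model`, `tunnel_mtu()`) are hypothetical and only model the idea; they are not the kernel's data structures. The example overhead of 24 bytes assumes an outer IPv4 header (20) plus a basic GRE header (4).

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical model of a tunnel device; not a kernel structure. */
struct tun_dev_model {
    bool is_ether;        /* true for gre_tap/erspan style devices */
    int hard_header_len;  /* set during device setup */
    int tunnel_hlen;      /* encapsulation overhead: outer IP + tunnel header */
};

/* Derive the tunnel MTU from the MTU of the underlying device. */
static int tunnel_mtu(const struct tun_dev_model *dev, int underlay_mtu)
{
    int mtu = underlay_mtu - dev->tunnel_hlen;

    /* ETHER tunnel devices also carry an inner Ethernet header. */
    if (dev->is_ether)
        mtu -= dev->hard_header_len;

    assert(mtu > 0);
    return mtu;
}
```

For a 1500-byte underlay, this gives 1476 for a layer-3 GRE device and 1462 for an Ethernet-over-GRE device, so frames that include the inner Ethernet header still fit.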
    • net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache · 28b34f01
      Antoine Tenart committed
      Some socket buffers allocated in the fclone cache (in __alloc_skb) can
      end up in the following path[1]:
      
      napi_skb_finish
        __kfree_skb_defer
          napi_skb_cache_put
      
      The issue is napi_skb_cache_put is not fclone friendly and will put
      those skbuff in the skb cache to be reused later, although this cache
      only expects skbuff allocated from skbuff_head_cache. When this happens
      the skbuff is eventually freed using the wrong origin cache, and we can
      see traces similar to:
      
      [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
      [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
      [ 1223.950211] Modules linked in:
      [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
      [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
      [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0
      
      This sometimes leads to other memory-related issues.
      
      Fix this by using __kfree_skb for fclone skbuff, similar to what is done
      at the other place where __kfree_skb_defer is called.
      
      [1] At least in setups using veth pairs and tunnels. Building a kernel
          with KASAN we can for example see packets allocated in
          sk_stream_alloc_skb hit the above path and later the issue arises
          when the skbuff is reused.
      
      Fixes: 9243adfc ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
      Cc: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
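The rule behind the fclone commit above, that a buffer must go back to the cache it came from, can be sketched as follows. This is a userspace model with hypothetical names (`struct skb_model`, `struct free_stats`, `finish_skb()`); it only counts where each buffer would go and does not reproduce the kernel's allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Which slab cache a buffer was allocated from (model only). */
enum skb_origin { FROM_HEAD_CACHE, FROM_FCLONE_CACHE };

struct skb_model {
    enum skb_origin origin;
};

struct free_stats {
    int cached;       /* placed in the reuse cache for later allocation */
    int freed_direct; /* returned straight to its own slab cache */
};

/* Only buffers from the head cache may enter the reuse cache. */
static void finish_skb(struct free_stats *st, const struct skb_model *skb)
{
    assert(st != NULL && skb != NULL);
    if (skb->origin == FROM_FCLONE_CACHE)
        st->freed_direct++;   /* models a direct free */
    else
        st->cached++;         /* models the deferred, cached free */
}
```

Checking the origin at the point of release keeps fclone buffers out of a cache that would later free them with the wrong slab.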
    • tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path · 358ed624
      Talal Ahmad committed
      sk_wmem_schedule makes sure that sk_forward_alloc has enough
      bytes for charging that is going to be done by sk_mem_charge.
      
      In the transmit zerocopy path, there is sk_mem_charge but there was
      no call to sk_wmem_schedule. This change adds that call.
      
      Without this call to sk_wmem_schedule, sk_forward_alloc can go
      negative, which is a bug: sk_forward_alloc is per-socket space that
      has already been forward charged, so it can't be negative.
      
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
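The accounting rule in the zerocopy commit above (reserve first, then charge) can be modelled in a few lines of userspace C. Everything here is an assumption made for illustration: `struct sock_model`, the 4096-byte quantum, and the helper names only mimic the roles of sk_forward_alloc, sk_wmem_schedule and sk_mem_charge.

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_QUANTUM 4096 /* assumed reservation granularity in bytes */

/* Hypothetical model of per-socket memory accounting. */
struct sock_model {
    int forward_alloc; /* bytes reserved but not yet charged */
    int quanta_left;   /* remaining budget, in quanta */
};

/* Reserve enough quanta so that 'size' bytes can be charged afterwards. */
static bool wmem_schedule(struct sock_model *sk, int size)
{
    if (size <= sk->forward_alloc)
        return true;

    int need = size - sk->forward_alloc;
    int quanta = (need + MEM_QUANTUM - 1) / MEM_QUANTUM;

    if (quanta > sk->quanta_left)
        return false;
    sk->quanta_left -= quanta;
    sk->forward_alloc += quanta * MEM_QUANTUM;
    return true;
}

/* Consume bytes from the reservation. */
static void mem_charge(struct sock_model *sk, int size)
{
    sk->forward_alloc -= size;
}

/* Schedule before charging, so the reservation never goes negative. */
static bool charge_zerocopy(struct sock_model *sk, int size)
{
    if (!wmem_schedule(sk, size))
        return false;
    mem_charge(sk, size);
    assert(sk->forward_alloc >= 0);
    return true;
}
```

Calling `mem_charge()` alone on an empty reservation would drive `forward_alloc` below zero, which is the state the commit message calls a bug.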
    • net: send SYNACK packet with accepted fwmark · 43b90bfa
      Alexander Ovechkin committed
      commit e05a90ec ("net: reflect mark on tcp syn ack packets")
      fixed IPv4 only.
      
      This part is for the IPv6 side.
      
      Fixes: e05a90ec ("net: reflect mark on tcp syn ack packets")
      Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
      Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Jul, 2021 4 commits
    • net/ncsi: add dummy response handler for Intel boards · 163f5de5
      Ivan Mikhaylov committed
      Add the dummy response handler for Intel boards to prevent incorrect
      handling of OEM commands.
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: add NCSI Intel OEM command to keep PHY up · abd2fddc
      Ivan Mikhaylov committed
      This keeps the PHY link up and prevents any channel resets during
      the host load.
      
      It is the KEEP_PHY_LINK_UP option (Veto bit) in the i210 datasheet,
      which blocks PHY reset and power state changes.
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: fix restricted cast warning of sparse · 27fa107d
      Ivan Mikhaylov committed
      Sparse reports:
      net/ncsi/ncsi-rsp.c:406:24: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:732:33: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:756:25: warning: cast to restricted __be32
      net/ncsi/ncsi-manage.c:779:25: warning: cast to restricted __be32
      Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: tcp: drop silly ICMPv6 packet too big messages · c7bb4b89
      Eric Dumazet committed
      While TCP stack scales reasonably well, there is still one part that
      can be used to DDOS it.
      
      IPv6 Packet too big messages have to lookup/insert a new route,
      and if abused by attackers, can easily put hosts under high stress,
      with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
      
      ip6_protocol_deliver_rcu()
       icmpv6_rcv()
        icmpv6_notify()
         tcp_v6_err()
          tcp_v6_mtu_reduced()
           inet6_csk_update_pmtu()
            ip6_rt_update_pmtu()
             __ip6_rt_update_pmtu()
              ip6_rt_cache_alloc()
               ip6_dst_alloc()
                dst_alloc()
                 ip6_dst_gc()
                  fib6_run_gc()
                   spin_lock_bh() ...
      
      Some of our servers have been hit by malicious ICMPv6 packets
      trying to _increase_ the MTU/MSS of TCP flows.
      
      We believe these ICMPv6 packets are a result of a bug in one ISP stack,
      since they were blindly sent back for _every_ (small) packet sent to them.
      
      These packets are for one TCP flow:
      09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      
      TCP stack can filter some silly requests :
      
      1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
      2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
      
      These tests happen before the IPv6 routing stack is entered, thus
      removing the potential contention and route exhaustion.
      
      Note that the IPv6 stack was already performing these checks, but too
      late (i.e. after the route has been added, and after the potential
      garbage collect war).
      
      v2: fix typo caught by Martin, thanks !
      v3: exports tcp_mtu_to_mss(), caught by David, thanks !
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
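The two early checks listed in the ICMPv6 commit above can be sketched as a single predicate. This is a simplified userspace model: `accept_ptb()` and `mtu_to_mss()` are hypothetical names, and the MSS conversion assumes a fixed 60 bytes of overhead (40-byte IPv6 header plus 20-byte TCP header, no options or extension headers), which the kernel does not assume.

```c
#include <assert.h>
#include <stdbool.h>

#define IPV6_MIN_MTU 1280 /* minimum link MTU required by IPv6 */
#define HDR_OVERHEAD 60   /* assumed: IPv6 (40) + TCP (20), no options */

/* Simplified MTU-to-MSS conversion for this sketch only. */
static unsigned int mtu_to_mss(unsigned int mtu)
{
    assert(mtu >= HDR_OVERHEAD);
    return mtu - HDR_OVERHEAD;
}

/*
 * Decide whether a Packet Too Big message is worth handing to the
 * routing layer. Returns false for messages that can be dropped early.
 */
static bool accept_ptb(unsigned int reported_mtu, unsigned int current_mss)
{
    /* 1) An MTU below the IPv6 minimum is never valid. */
    if (reported_mtu < IPV6_MIN_MTU)
        return false;

    /* 2) A message that would not shrink the MSS changes nothing. */
    if (mtu_to_mss(reported_mtu) >= current_mss)
        return false;

    return true;
}
```

Under these assumptions, a message reporting an MTU of 1460 for a flow whose MSS is already 1220 is rejected by the second check, before any route lookup or allocation would happen.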
  3. 08 Jul, 2021 3 commits
  4. 07 Jul, 2021 3 commits
    • ipv6: fix 'disable_policy' for fwd packets · ccd27f05
      Nicolas Dichtel committed
      The goal of commit df789fe7 ("ipv6: Provide ipv6 version of
      "disable_policy" sysctl") was to have the disable_policy from ipv4
      available on ipv6.
      However, it's not exactly the same mechanism. On IPv4, all packets coming
      from an interface, which has disable_policy set, bypass the policy check.
      For ipv6, this is done only for local packets, i.e. for packets destined to
      an address configured on the incoming interface.
      
      Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same
      effect for both protocols.
      
      My first approach was to create a new kind of route cache entry, to be
      able to set DST_NOPOLICY without modifying routes. This would have added a
      lot of code. Because the local delivery path is already handled, I chose
      to focus on the forwarding path to minimize code churn.
      
      Fixes: df789fe7 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized · be5d1b61
      Nguyen Dinh Phi committed
      This commit fixes a bug (found by syzkaller) that could cause spurious
      double-initializations for congestion control modules, which could cause
      memory leaks or other problems for congestion control modules (like CDG)
      that allocate memory in their init functions.
      
      The buggy scenario constructed by syzkaller was something like:
      
      (1) create a TCP socket
      (2) initiate a TFO connect via sendto()
      (3) while socket is in TCP_SYN_SENT, call setsockopt(TCP_CONGESTION),
          which calls:
             tcp_set_congestion_control() ->
               tcp_reinit_congestion_control() ->
                 tcp_init_congestion_control()
      (4) receive ACK, connection is established, call tcp_init_transfer(),
          set icsk_ca_initialized=0 (without first calling cc->release()),
          call tcp_init_congestion_control() again.
      
      Note that in this sequence tcp_init_congestion_control() is called
      twice without a cc->release() call in between. Thus, for CC modules
      that allocate memory in their init() function, e.g, CDG, a memory leak
      may occur. The syzkaller tool managed to find a reproducer that
      triggered such a leak in CDG.
      
      The bug was introduced when commit 8919a9b3 ("tcp: Only init
      congestion control if not initialized already")
      introduced icsk_ca_initialized and set icsk_ca_initialized to 0 in
      tcp_init_transfer(), missing the possibility for a sequence like the
      one above, where a process could call setsockopt(TCP_CONGESTION) in
      state TCP_SYN_SENT (i.e. after the connect() or TFO open sendmsg()),
      which would call tcp_init_congestion_control(). It did not intend to
      reset any initialization that the user had already explicitly made;
      it just missed the possibility of that particular sequence (which
      syzkaller managed to find).
      
      Fixes: 8919a9b3 ("tcp: Only init congestion control if not initialized already")
      Reported-by: syzbot+f1e24a0594d4e3a895d3@syzkaller.appspotmail.com
      Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
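The double-initialization sequence described in the tcp_init_transfer() commit above can be reproduced with a toy model. The names below (`struct ca_model`, `ca_init()`, `init_transfer_buggy()`, `init_transfer_fixed()`) are hypothetical; the model only counts allocations and does not reflect where the kernel actually resets the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a congestion control module's per-socket state. */
struct ca_model {
    bool initialized;     /* plays the role of icsk_ca_initialized */
    int live_allocations; /* allocations made by init without a release */
};

/* Models an init() hook that allocates memory, as CDG does. */
static void ca_init(struct ca_model *ca)
{
    assert(ca != NULL);
    ca->live_allocations++;
    ca->initialized = true;
}

/* Buggy variant: clears the flag first, so init always runs again. */
static void init_transfer_buggy(struct ca_model *ca)
{
    ca->initialized = false;
    if (!ca->initialized)
        ca_init(ca);
}

/* Fixed variant: respects an initialization that already happened. */
static void init_transfer_fixed(struct ca_model *ca)
{
    if (!ca->initialized)
        ca_init(ca);
}
```

If `ca_init()` has already run once, as it would after setsockopt(TCP_CONGESTION) in TCP_SYN_SENT, the buggy variant leaves two live allocations while the fixed variant leaves one.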
    • skbuff: Release nfct refcount on napi stolen or re-used skbs · 8550ff8d
      Paul Blakey committed
      When multiple SKBs are merged to a new skb under napi GRO,
      or SKB is re-used by napi, if nfct was set for them in the
      driver, it will not be released while freeing their stolen
      head state or on re-use.
      
      Release nfct on napi's stolen or re-used SKBs, and
      in gro_list_prepare, check conntrack metadata diff.
      
      Fixes: 5c6b9460 ("net/mlx5e: CT: Handle misses after executing CT action")
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 06 Jul, 2021 5 commits
  6. 03 Jul, 2021 8 commits
  7. 02 Jul, 2021 9 commits
    • netfilter: ctnetlink: suspicious RCU usage in ctnetlink_dump_helpinfo · c23a9fd2
      Vasily Averin committed
      The two patches listed below moved the ctnetlink_dump_helpinfo() call
      out from under rcu_read_lock(). Its rcu_dereference() now generates the
      following warning:
      =============================
      WARNING: suspicious RCU usage
      5.13.0+ #5 Not tainted
      -----------------------------
      net/netfilter/nf_conntrack_netlink.c:221 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      rcu_scheduler_active = 2, debug_locks = 1
      stack backtrace:
      CPU: 1 PID: 2251 Comm: conntrack Not tainted 5.13.0+ #5
      Call Trace:
       dump_stack+0x7f/0xa1
       ctnetlink_dump_helpinfo+0x134/0x150 [nf_conntrack_netlink]
       ctnetlink_fill_info+0x2c2/0x390 [nf_conntrack_netlink]
       ctnetlink_dump_table+0x13f/0x370 [nf_conntrack_netlink]
       netlink_dump+0x10c/0x370
       __netlink_dump_start+0x1a7/0x260
       ctnetlink_get_conntrack+0x1e5/0x250 [nf_conntrack_netlink]
       nfnetlink_rcv_msg+0x613/0x993 [nfnetlink]
       netlink_rcv_skb+0x50/0x100
       nfnetlink_rcv+0x55/0x120 [nfnetlink]
       netlink_unicast+0x181/0x260
       netlink_sendmsg+0x23f/0x460
       sock_sendmsg+0x5b/0x60
       __sys_sendto+0xf1/0x160
       __x64_sys_sendto+0x24/0x30
       do_syscall_64+0x36/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 49ca022b ("netfilter: ctnetlink: don't dump ct extensions of unconfirmed conntracks")
      Fixes: 0b35f603 ("netfilter: Remove duplicated rcu_read_lock.")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: nf_ct_gre_keymap_flush() removal · a23f89a9
      Vasily Averin committed
      nf_ct_gre_keymap_flush() is useless.
      It is called from nf_conntrack_cleanup_net_list() only and tries to remove
      nf_ct_gre_keymap entries from pernet gre keymap list. Though:
      a) at this point the list should already be empty, all its entries were
      deleted during the conntracks cleanup, because
      nf_conntrack_cleanup_net_list() executes nf_ct_iterate_cleanup(kill_all)
      before nf_conntrack_proto_pernet_fini():
       nf_conntrack_cleanup_net_list
        +- nf_ct_iterate_cleanup
        |   nf_ct_put
        |    nf_conntrack_put
        |     nf_conntrack_destroy
        |      destroy_conntrack
        |       destroy_gre_conntrack
        |        nf_ct_gre_keymap_destroy
        `- nf_conntrack_proto_pernet_fini
            nf_ct_gre_keymap_flush
      
      b) Let's say we find that the keymap list is not empty. This means netns
      still has a conntrack associated with gre, in which case we should not free
      its memory, because this will lead to a double free and related crashes.
      However I doubt it could have gone unnoticed for years, obviously
      this does not happen in real life. So I think we can remove
      both nf_ct_gre_keymap_flush() and nf_conntrack_proto_pernet_fini().
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: Fix dereference of null pointer flow · 4ca041f9
      Colin Ian King committed
      In the case where chain->flags & NFT_CHAIN_HW_OFFLOAD is false then
      nft_flow_rule_create is not called and flow is NULL. The subsequent
      error handling execution via label err_destroy_flow_rule will lead
      to a null pointer dereference on flow when calling nft_flow_rule_destroy.
      Since the error path to err_destroy_flow_rule has to cater for null
      and non-null flows, only call nft_flow_rule_destroy if flow is non-null
      to fix this issue.
      
      Addresses-Coverity: ("Explicity null dereference")
      Fixes: 3c5e4462 ("netfilter: nf_tables: memleak in hw offload abort path")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
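The shared error label in the nf_tables commit above has to cope with both a NULL and a non-NULL flow. A minimal userspace sketch of that pattern follows; `struct flow_rule_model`, `flow_rule_destroy()` and `add_rule()` are hypothetical names, and the `fail_later` parameter exists only to force the error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an offload flow rule. */
struct flow_rule_model {
    int placeholder;
};

static int destroy_calls; /* counts destroy invocations for the sketch */

/* Like the real destroy helper, this must not be given NULL. */
static void flow_rule_destroy(struct flow_rule_model *flow)
{
    assert(flow != NULL);
    destroy_calls++;
    free(flow);
}

/* The flow only exists when hardware offload is requested. */
static int add_rule(bool hw_offload, bool fail_later)
{
    struct flow_rule_model *flow = NULL;

    if (hw_offload) {
        flow = calloc(1, sizeof(*flow));
        if (!flow)
            return -1;
    }

    if (fail_later)
        goto err_destroy_flow_rule;

    free(flow); /* the model does not keep the rule; free(NULL) is a no-op */
    return 0;

err_destroy_flow_rule:
    if (flow) /* the label is reached with and without a flow */
        flow_rule_destroy(flow);
    return -1;
}
```

The guard at the label keeps the single error path usable for both cases without duplicating the cleanup code.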
    • netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state · e15d4cdf
      Florian Westphal committed
      Consider:
        client -----> conntrack ---> Host
      
      client sends a SYN, but $Host is unreachable/silent.
      Client eventually gives up and the conntrack entry will time out.
      
      However, if the client is restarted with same addr/port pair, it
      may prevent the conntrack entry from timing out.
      
      This is noticeable when the existing conntrack entry has no NAT
      transformation or an outdated one and port reuse happens either
      on client or due to a NAT middlebox.
      
      This change prevents refresh of the timeout for SYN retransmits,
      so the entry goes away after nf_conntrack_tcp_timeout_syn_sent
      seconds (default: 60).
      
      The entry will be re-created on the next connection attempt, but then
      NAT rules will be evaluated again.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
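The timeout behaviour described in the conntrack commit above can be modelled as a refresh rule that skips SYN retransmits. This is a sketch with hypothetical names (`struct ct_entry`, `ct_on_packet()`); times are plain integers in seconds and the timeout is passed in rather than hard-coded.

```c
#include <assert.h>
#include <stdbool.h>

enum ct_state { CT_SYN_SENT, CT_ESTABLISHED };

/* Hypothetical model of a conntrack entry. */
struct ct_entry {
    enum ct_state state;
    long expires; /* absolute expiry time, in seconds */
};

/* Refresh the expiry on a packet, except for SYN retransmits in SYN_SENT. */
static void ct_on_packet(struct ct_entry *ct, bool syn_retransmit,
                         long now, long timeout)
{
    assert(timeout > 0);
    if (ct->state == CT_SYN_SENT && syn_retransmit)
        return; /* keep the original deadline so the entry can expire */
    ct->expires = now + timeout;
}
```

Because retransmitted SYNs no longer push the deadline forward, a stale entry expires and the next connection attempt creates a fresh one.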
    • s390: iucv: Avoid field over-reading memcpy() · 5140aaa4
      Kees Cook committed
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally reading across neighboring array fields.
      
      Add a wrapping struct to serve as the memcpy() source so the compiler
      can perform appropriate bounds checking, avoiding this future warning:
      
      In function '__fortify_memcpy',
          inlined from 'iucv_message_pending' at net/iucv/iucv.c:1663:4:
      ./include/linux/fortify-string.h:246:4: error: call to '__read_overflow2_field' declared with attribute error: detected read beyond size of field (2nd parameter)
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
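The wrapping-struct technique from the iucv commit above can be shown on a made-up layout. The fields below (`path_id`, `tag`, `data`) are hypothetical and do not reproduce the iucv structures; the point is only that the memcpy() source is a single named field whose size covers both arrays.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; not the real iucv structure. */
struct pending_model {
    uint16_t path_id;
    struct {
        uint8_t tag[4];
        uint8_t data[8];
    } body; /* wrapper: one field spanning both neighbouring arrays */
};

/* Copy both arrays using the wrapper as the source object. */
static void copy_body(uint8_t dst[12], const struct pending_model *src)
{
    _Static_assert(sizeof(src->body) == 12, "body must be 12 bytes");
    assert(dst != NULL && src != NULL);
    memcpy(dst, &src->body, sizeof(src->body));
}
```

Copying from `src->tag` with a length of 12 would read past the end of that 4-byte field, which is what field-level bounds checking flags; copying from `&src->body` reads exactly one field.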
    • udp: annotate data races around unix_sk(sk)->gso_size · 18a419ba
      Eric Dumazet committed
      Accesses to unix_sk(sk)->gso_size are lockless.
      Add READ_ONCE()/WRITE_ONCE() around them.
      
      BUG: KCSAN: data-race in udp_lib_setsockopt / udpv6_sendmsg
      
      write to 0xffff88812d78f47c of 2 bytes by task 10849 on cpu 1:
       udp_lib_setsockopt+0x3b3/0x710 net/ipv4/udp.c:2696
       udpv6_setsockopt+0x63/0x90 net/ipv6/udp.c:1630
       sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3265
       __sys_setsockopt+0x18f/0x200 net/socket.c:2104
       __do_sys_setsockopt net/socket.c:2115 [inline]
       __se_sys_setsockopt net/socket.c:2112 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88812d78f47c of 2 bytes by task 10852 on cpu 0:
       udpv6_sendmsg+0x161/0x16b0 net/ipv6/udp.c:1299
       inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2337
       ___sys_sendmsg net/socket.c:2391 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2477
       __do_sys_sendmmsg net/socket.c:2506 [inline]
       __se_sys_sendmmsg net/socket.c:2503 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2503
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x0005
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 10852 Comm: syz-executor.0 Not tainted 5.13.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
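The READ_ONCE()/WRITE_ONCE() annotation from the gso_size commit above can be approximated in userspace. The macros below are simplified stand-ins built on a volatile cast and the GNU C `__typeof__` extension, not the kernel definitions, and `struct udp_sock_model` is a hypothetical one-field model of the socket.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace approximations of the kernel macros (GNU C). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical model holding only the field discussed in the commit. */
struct udp_sock_model {
    unsigned short gso_size;
};

/* Writer side, as a setsockopt() handler would do without a lock. */
static void set_gso_size(struct udp_sock_model *up, unsigned short val)
{
    assert(up != NULL);
    WRITE_ONCE(up->gso_size, val);
}

/* Reader side, as the send path would do without a lock. */
static unsigned short get_gso_size(struct udp_sock_model *up)
{
    assert(up != NULL);
    return READ_ONCE(up->gso_size);
}
```

The volatile access forces the compiler to perform exactly one load or store for each call, which documents the lockless access and prevents the value from being re-read or torn by compiler transformations.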
    • net: socket: support hardware timestamp conversion to PHC bound · d7c08826
      Yangbo Lu committed
      This patch adds support for converting hardware timestamps to the
      bound PHC. This applies to both RX and TX, since their skb
      handling (for TX, the skb clone in the error queue) goes
      through __sock_recv_timestamp.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sock: extend SO_TIMESTAMPING for PHC binding · d463126e
      Yangbo Lu committed
      Since PTP virtual clock support is added, there can be
      several PTP virtual clocks based on one PTP physical
      clock for timestamping.
      
      This patch extends the SO_TIMESTAMPING API to support
      PHC (PTP Hardware Clock) binding by adding a new flag,
      SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are
      in use, user space can bind one for timestamping;
      binding a PTP physical clock is neither supported
      nor needed.
      
      This patch is preparation for timestamp conversion from
      a raw timestamp to a specific PTP virtual clock time in
      core net.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: setsockopt: convert to mptcp_setsockopt_sol_socket_timestamping() · 6c9a0a0f
      Yangbo Lu committed
      Split timestamping handling into a new function,
      mptcp_setsockopt_sol_socket_timestamping().
      This is preparation for extending SO_TIMESTAMPING
      for PHC binding, since optval will no longer be
      an integer.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>