1. 08 January 2021 (4 commits)
  2. 29 December 2020 (2 commits)
    • erspan: fix version 1 check in gre_parse_header() · 085c7c4e
      Authored by Cong Wang
      Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
      have an ERSPAN header. The check in gre_parse_header() is therefore
      wrong; we have to distinguish version 1 from version 0.

      We can simply check the GRE header length, as is_erspan_type1() does.
      
      Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup")
      Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
      Cc: William Tu <u9012063@gmail.com>
      Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      085c7c4e
    • ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() · 21fdca22
      Authored by Guillaume Nault
      RT_TOS() only clears one of the ECN bits. Therefore, when
      fib_compute_spec_dst() resorts to a fib lookup, it can return
      different results depending on the value of the second ECN bit.
      
      For example, ECT(0) and ECT(1) packets could be treated differently.
      
        $ ip netns add ns0
        $ ip netns add ns1
        $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
        $ ip -netns ns0 link set dev lo up
        $ ip -netns ns1 link set dev lo up
        $ ip -netns ns0 link set dev veth01 up
        $ ip -netns ns1 link set dev veth10 up
      
        $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
        $ ip -netns ns1 address add 192.0.2.11/24 dev veth10
      
        $ ip -netns ns1 address add 192.0.2.21/32 dev lo
        $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
        $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
      
      With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
      (ping uses -Q to set all TOS and ECN bits):
      
        $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
      
      But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
      because the "tos 4" route isn't matched:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
      
      After this patch the ECN bits don't affect the result anymore:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
      
      Fixes: 35ebf65e ("ipv4: Create and use fib_compute_spec_dst() helper.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21fdca22
  3. 18 December 2020 (1 commit)
  4. 15 December 2020 (2 commits)
  5. 13 December 2020 (1 commit)
    • inet: frags: batch fqdir destroy works · 0b9b2414
      Authored by SeongJae Park
      On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
      calls make the number of active slab objects, including the
      'sock_inode_cache' type, increase rapidly and continuously.  As a
      result, memory pressure occurs.

      In more detail, I wrote an artificial reproducer that resembles the
      workload in which we found the problem and reproduces it faster.  It
      merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop and takes
      about 2 minutes.  On a machine with 40 CPU cores and 70GB of DRAM, the
      available memory shrank quickly (about 120MB per second, 15GB in total
      within the 2 minutes).  Note that the issue doesn't reproduce on every
      machine; on my 6 CPU core machine, it didn't.
      
      'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate
      the relevant memory objects.  They are invoked asynchronously from work
      queues and internally use 'rcu_barrier()' to ensure safe destruction.
      'cleanup_net()' works in a batched manner in a single-threaded worker,
      while 'fqdir_work_fn()' runs once per 'fqdir_exit()' call in the
      'system_wq'.  As a result, 'fqdir_work_fn()' was called frequently
      under the workload and made contention on 'rcu_barrier()' high; in
      more detail, the global mutex 'rcu_state.barrier_mutex' became the
      bottleneck.

      This commit avoids that contention by doing the 'rcu_barrier()' and
      the subsequent lightweight work in a batched manner, similar to
      'cleanup_net()'.  The fqdir hashtable destruction, which is done
      before the 'rcu_barrier()', is still allowed to run in parallel for
      fast processing, but this commit makes it use a dedicated work queue
      instead of the 'system_wq' to ensure that the number of threads is
      bounded.
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0b9b2414
  6. 11 December 2020 (1 commit)
  7. 10 December 2020 (2 commits)
  8. 09 December 2020 (1 commit)
    • tcp: select sane initial rcvq_space.space for big MSS · 72d05c00
      Authored by Eric Dumazet
      Before commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
      
      This is no longer the case, and Hazem Mohamed Abuelfotoh reported
      that DRS would not work for MTU 9000 endpoints receiving regular
      (1500-byte) frames.
      
      The root cause is that tcp_init_buffer_space() uses tp->rcv_wnd as the
      upper limit for the rcvq_space.space computation, while it can later
      select a smaller value for tp->rcv_ssthresh and tp->window_clamp.
      
      ss -temoi on the receiver would show:
      
      skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596
      
      This means that TCP cannot increase its window in tcp_grow_window(),
      and that DRS can never kick in.
      
      Fix this by making sure that rcvq_space.space is not bigger than the
      number of bytes that can be held in the TCP receive queue.

      People unable or unwilling to change their kernel can work around this
      issue by selecting a bigger tcp_rmem[1] value, as in:
      
      echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem
      
      Based on an initial report and patch from Hazem Mohamed Abuelfotoh
       https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
      
      Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Fixes: 041a14d2 ("tcp: start receiver buffer autotuning sooner")
      Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72d05c00
  9. 08 December 2020 (1 commit)
    • netfilter: x_tables: Switch synchronization to RCU · cc00bcaa
      Authored by Subash Abhinov Kasiviswanathan
      When running concurrent iptables rules replacement with data, the
      per-CPU sequence count is checked after the assignment of the new
      information.  The sequence count is used to synchronize with the packet
      path without any explicit locking.  If any packets in the packet path
      are using the table information, the sequence count is incremented to
      an odd value, and back to an even value once packet processing
      completes.
      
      The new table value assignment is followed by a write memory barrier so
      that every CPU should see the latest value.  If the packet path has
      started with the old table information, the sequence counter will be
      odd, and the iptables replacement will wait until the sequence count is
      even before freeing the old table info.

      However, this assumes that the new table information assignment and the
      memory barrier are actually executed before the counter check in the
      replacement thread.  If the CPU decides to execute the assignment
      later, as there is no user of the table information prior to the
      sequence check, the packet path on another CPU may use the old table
      information.  The replacement thread would then free the table
      information under it, leading to a use-after-free in the packet
      processing context:
      
      Unable to handle kernel NULL pointer dereference at virtual
      address 000000000000008e
      pc : ip6t_do_table+0x5d0/0x89c
      lr : ip6t_do_table+0x5b8/0x89c
      ip6t_do_table+0x5d0/0x89c
      ip6table_filter_hook+0x24/0x30
      nf_hook_slow+0x84/0x120
      ip6_input+0x74/0xe0
      ip6_rcv_finish+0x7c/0x128
      ipv6_rcv+0xac/0xe4
      __netif_receive_skb+0x84/0x17c
      process_backlog+0x15c/0x1b8
      napi_poll+0x88/0x284
      net_rx_action+0xbc/0x23c
      __do_softirq+0x20c/0x48c
      
      This could be fixed by forcing instruction order after the new table
      information assignment or by switching to RCU for the synchronization.
      
      Fixes: 80055dab ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
      Reported-by: Sean Tranchetti <stranche@codeaurora.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cc00bcaa
  10. 07 December 2020 (1 commit)
  11. 05 December 2020 (9 commits)
  12. 04 December 2020 (3 commits)
    • bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier · 22dc4a0f
      Authored by Andrii Nakryiko
      Remove a pervasive assumption throughout the BPF verifier of vmlinux
      BTF.  Instead, wherever BTF type IDs are involved, also track the
      instance of struct btf that goes along with the type ID.  This allows
      gradually adding support for kernel module BTFs and using/tracking
      module types across BPF helper calls and registers.

      This patch also renames the btf_id() function to btf_obj_id() to
      minimize the naming clash with using btf_id to denote a BTF *type* ID
      rather than a BTF *object*'s ID.

      Also, although btf_vmlinux can't be destroyed and thus doesn't need
      refcounting, module BTFs do, so apply BTF refcounting universally when
      a BPF program uses BTF-powered attachment (tp_btf, fentry/fexit, etc.).
      This makes for simpler cleanup code.
      
      Now that a BTF type ID is not enough to uniquely identify a BTF type,
      extend the BPF trampoline key to include the BTF object ID.  To
      differentiate that from the target program's BPF ID, set the 31st bit
      of the type ID.  BTF type IDs (at least currently) are not allowed to
      use the full 32 bits, so there is no danger of confusing that bit with
      a valid BTF type ID.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
      22dc4a0f
    • bpf: Adds support for setting window clamp · cb811109
      Authored by Prankur gupta
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
      which sets the maximum receiver window size. It will be useful for
      limiting the receiver window based on RTT.
      Signed-off-by: Prankur gupta <prankgup@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
      cb811109
    • tcp: merge 'init_req' and 'route_req' functions · 7ea851d1
      Authored by Florian Westphal
      The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
      a TCP reset if the token in a MP_JOIN request is unknown.
      
      We don't currently do this; the 3WHS completes and the 'new subflow'
      is reset afterwards.  There are two ways to allow MPTCP to send the
      reset.
      
      1. override 'send_synack' callback and emit the rst from there.
         The drawback is that the request socket gets inserted into the
         listeners queue just to get removed again right away.
      
      2. Send the reset from the 'route_req' function instead.
         This avoids the 'add&remove request socket', but route_req lacks the
         skb that is required to send the TCP reset.
      
      Instead of just adding the skb to that function for MPTCP's sake alone,
      Paolo suggested merging the init_req and route_req functions.

      This saves one indirection in the SYN processing path and provides the
      skb to the merged function at the same time.

      'send reset on unknown mptcp join token' is added in the next patch.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7ea851d1
  13. 03 December 2020 (1 commit)
  14. 01 December 2020 (1 commit)
  15. 29 November 2020 (1 commit)
    • ipv4: Fix tos mask in inet_rtm_getroute() · 1ebf1790
      Authored by Guillaume Nault
      When inet_rtm_getroute() was converted to use the RCU variants of
      ip_route_input() and ip_route_output_key(), the TOS parameters
      stopped being masked with IPTOS_RT_MASK before doing the route lookup.
      
      As a result, "ip route get" can return a different route than what
      would be used when sending real packets.
      
      For example:
      
          $ ip route add 192.0.2.11/32 dev eth0
          $ ip route add unreachable 192.0.2.11/32 tos 2
          $ ip route get 192.0.2.11 tos 2
          RTNETLINK answers: No route to host
      
      But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
      actually be routed using the first route:
      
          $ ping -c 1 -Q 2 192.0.2.11
          PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
          64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
      
          --- 192.0.2.11 ping statistics ---
          1 packets transmitted, 1 received, 0% packet loss, time 0ms
          rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
      
      This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
      return results consistent with real route lookups.
      
      Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1ebf1790
  16. 25 November 2020 (1 commit)
    • tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7
      Authored by Alexander Duyck
      When a BPF program is used to select a TCP congestion control
      algorithm that may or may not use ECN, there is a case where the
      SYN/ACK goes out without the ECT0 bit set.  A bit of research found
      that this was due to the final socket being configured for DCTCP while
      the listener socket stayed on CUBIC.
      
      To reproduce it, all that is needed is to monitor TCP traffic while
      running the sample BPF program "samples/bpf/tcp_cong_kern.c".
      Assuming the tcp_dctcp module is loaded or compiled in and the traffic
      matches the rules in the sample file, the ECT0 bit is set on all
      frames except the SYN/ACK.
      
      To address that, it is necessary to make one additional call to
      tcp_bpf_ca_needs_ecn using the request socket, and then use the output
      of that to set the ECT0 bit in the tos/tclass of the packet.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      407c85c7
  17. 24 November 2020 (2 commits)
    • tcp: fix race condition when creating child sockets from syncookies · 01770a16
      Authored by Ricardo Dias
      When the TCP stack is in SYN flood mode, the server child socket is
      created from the SYN cookie received in a TCP packet with the ACK flag
      set.
      
      The child socket is created when the server receives the first TCP
      packet with a valid SYN cookie from the client.  Usually, this packet
      corresponds to the final step of the TCP 3-way handshake, the ACK
      packet.  But it is also possible to receive a valid SYN cookie in the
      first TCP data packet sent by the client, and thus create a child
      socket from that SYN cookie.
      
      Since a client socket is ready to send data as soon as it receives the
      SYN+ACK packet from the server, the client can send the ACK packet
      (sent by the TCP stack code) and the first data packet (sent by the
      userspace program) almost at the same time, and thus the server will
      receive the two TCP packets with valid SYN cookies almost at the same
      instant.
      
      When such an event happens, the TCP stack code has a race condition
      between the moment a lookup is done in the established connections
      hashtable to check for an existing connection for the same client, and
      the moment the child socket is added to the established connections
      hashtable.  As a consequence, this race condition can lead to a
      situation where we add two child sockets to the established
      connections hashtable and deliver two sockets for the same client to
      the userspace program.
      
      This patch fixes the race condition by checking whether a child socket
      already exists for the same client when we are adding the second child
      socket to the established connections hashtable.  If one exists, we
      drop the packet and discard the second child socket.
      Signed-off-by: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      01770a16
    • lsm,selinux: pass flowi_common instead of flowi to the LSM hooks · 3df98d79
      Authored by Paul Moore
      As pointed out by Herbert in a recent related patch, the LSM hooks do
      not have the necessary address family information to use the flowi
      struct safely.  As none of the LSMs currently use any of the
      protocol-specific flowi information, replace the flowi pointers with
      pointers to the address-family-independent flowi_common struct.
      Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      3df98d79
  18. 21 November 2020 (3 commits)
    • tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control · 55472017
      Authored by Alexander Duyck
      When setting congestion control via a BPF program, the SYN/ACK for
      packets within a given flow will not include the ECT0 flag.  A bit of
      simple printk debugging shows that when this is configured without
      BPF, the INET_ECN_xmit value is initialized in
      tcp_assign_congestion_control; however, when we configure this via
      BPF, the socket is in the closed state and as such it isn't
      configured, and I do not see it being initialized when we transition
      the socket into the listen state.  The result is that the ECT0 bit is
      configured based on whatever the default state is for the socket.
      
      An easy way to reproduce this is to monitor the following with tcpdump:
      tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
      
      Without this patch, the SYN/ACK will follow whatever the default is:
      if it is dctcp, all SYN/ACK packets will have the ECT0 bit set, and if
      not, ECT0 will be cleared on all SYN/ACK packets.  With this patch
      applied, the SYN/ACK bit matches the value seen on the other packets
      in the given stream.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      55472017
    • tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5
      Authored by Alexander Duyck
      An issue was recently found where DCTCP SYN/ACK packets did not have
      the ECT bit set in the L3 header.  A bit of code review found that the
      recent change referenced below had gone through and added a mask that
      prevented the ECN bits from being populated in the L3 header.
      
      This patch addresses that by rolling back the mask so that it is only
      applied to the flags coming from the incoming TCP request instead of
      applying it to the socket tos/tclass field.  With this change, the ECT
      bits were restored in the SYN/ACK packets in my testing.
      
      One thing not addressed by this patch set is the fact that
      tcp_reflect_tos appears to be incompatible with ECN-based congestion
      avoidance algorithms.  At a minimum, the feature should likely be
      documented, which it currently isn't.
      
      Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      861602b5
    • mptcp: track window announced to peer · fa3fe2b1
      Authored by Florian Westphal
      OoO handling attempts to detect when a packet is out-of-window by
      testing the current ack sequence and remaining space vs. the sequence
      number.

      This doesn't work reliably.  Store the highest allowed sequence number
      that we've announced and use it to detect out-of-window packets.

      Do this when the mptcp options get written to the packet (wire
      format).  For this to work we need to move the write_options call
      until after the stack has selected a new TCP window.
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fa3fe2b1
  19. 18 November 2020 (3 commits)