1. 02 11月, 2021 3 次提交
    • J
      net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      James Prestwood 提交于
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option has been defaulted to '1' which maintains existing
      behavior.
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this changes is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays sending out packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what caues the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyways.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. It was recorded the interface went UP at
      10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared) it is seen
      that the first ping is never responded to, so its not only an issue with
      ARP but with data packets in general.
      
      As mentioned prior, the wireless interface was also monitored to verify
      the ping/ARP packet made it OTA which was observed to be true.
      Signed-off-by: NJames Prestwood <prestwoj@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      fcdb44d0
    • T
      net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad 提交于
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: NTalal Ahmad <talalahmad@google.com>
      Acked-by: NArjun Roy <arjunroy@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      f1a456f8
    • T
      tcp: rename sk_wmem_free_skb · 03271f3a
      Talal Ahmad 提交于
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h
      Signed-off-by: NTalal Ahmad <talalahmad@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NArjun Roy <arjunroy@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      03271f3a
  2. 28 10月, 2021 10 次提交
  3. 27 10月, 2021 4 次提交
  4. 26 10月, 2021 7 次提交
  5. 15 10月, 2021 5 次提交
    • L
      tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0 · a76c2315
      Leonard Crestez 提交于
      Multiple VRFs are generally meant to be "separate" but right now md5
      keys for the default VRF also affect connections inside VRFs if the IP
      addresses happen to overlap.
      
      So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
      was an error, accept this to mean "key only applies to default VRF".
      This is what applications using VRFs for traffic separation want.
      Signed-off-by: NLeonard Crestez <cdleonard@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a76c2315
    • L
      tcp: md5: Fix overlap between vrf and non-vrf keys · 86f1e3a8
      Leonard Crestez 提交于
      With net.ipv4.tcp_l3mdev_accept=1 it is possible for a listen socket to
      accept connection from the same client address in different VRFs. It is
      also possible to set different MD5 keys for these clients which differ
      only in the tcpm_l3index field.
      
      This appears to work when distinguishing between different VRFs but not
      between non-VRF and VRF connections. In particular:
      
       * tcp_md5_do_lookup_exact will match a non-vrf key against a vrf key.
      This means that adding a key with l3index != 0 after a key with l3index
      == 0 will cause the earlier key to be deleted. Both keys can be present
      if the non-vrf key is added later.
       * _tcp_md5_do_lookup can match a non-vrf key before a vrf key. This
      casues failures if the passwords differ.
      
      Fix this by making tcp_md5_do_lookup_exact perform an actual exact
      comparison on l3index and by making  __tcp_md5_do_lookup perfer
      vrf-bound keys above other considerations like prefixlen.
      
      Fixes: dea53bb8 ("tcp: Add l3index to tcp_md5sig_key and md5 functions")
      Signed-off-by: NLeonard Crestez <cdleonard@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86f1e3a8
    • E
      tcp: switch orphan_count to bare per-cpu counters · 19757ceb
      Eric Dumazet 提交于
      Use of percpu_counter structure to track count of orphaned
      sockets is causing problems on modern hosts with 256 cpus
      or more.
      
      Stefan Bach reported a serious spinlock contention in real workloads,
      that I was able to reproduce with a netfilter rule dropping
      incoming FIN packets.
      
          53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                  |
                  ---queued_spin_lock_slowpath
                     |
                      --53.51%--_raw_spin_lock_irqsave
                                |
                                 --53.51%--__percpu_counter_sum
                                           tcp_check_oom
                                           |
                                           |--39.03%--__tcp_close
                                           |          tcp_close
                                           |          inet_release
                                           |          inet6_release
                                           |          sock_close
                                           |          __fput
                                           |          ____fput
                                           |          task_work_run
                                           |          exit_to_usermode_loop
                                           |          do_syscall_64
                                           |          entry_SYSCALL_64_after_hwframe
                                           |          __GI___libc_close
                                           |
                                            --14.48%--tcp_out_of_resources
                                                      tcp_write_timeout
                                                      tcp_retransmit_timer
                                                      tcp_write_timer_handler
                                                      tcp_write_timer
                                                      call_timer_fn
                                                      expire_timers
                                                      __run_timers
                                                      run_timer_softirq
                                                      __softirqentry_text_start
      
      As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting"), default batch size is too big
      for the default value of tcp_max_orphans (262144).
      
      But even if we reduce batch sizes, there would still be cases
      where the estimated count of orphans is beyond the limit,
      and where tcp_too_many_orphans() has to call the expensive
      percpu_counter_sum_positive().
      
      One solution is to use plain per-cpu counters, and have
      a timer to periodically refresh this cache.
      
      Updating this cache every 100ms seems about right, tcp pressure
      state is not radically changing over shorter periods.
      
      percpu_counter was nice 15 years ago while hosts had less
      than 16 cpus, not anymore by current standards.
      
      v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
          reported by kernel test robot <lkp@intel.com>
          Remove unused socket argument from tcp_too_many_orphans()
      
      Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NStefan Bach <sfb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19757ceb
    • F
      netfilter: arp_tables: allow use of arpt_do_table as hookfn · e8d225b6
      Florian Westphal 提交于
      This is possible now that the xt_table structure is passed in via *priv.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      e8d225b6
    • F
      netfilter: iptables: allow use of ipt_do_table as hookfn · 8844e010
      Florian Westphal 提交于
      This is possible now that the xt_table structure is passed in via *priv.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8844e010
  6. 14 10月, 2021 2 次提交
    • X
      icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe · 1fcd7945
      Xin Long 提交于
      In icmp_build_probe(), the icmp_ext_echo_iio parsing should be done
      step by step and skb_header_pointer() return value should always be
      checked, this patch fixes 3 places in there:
      
        - On case ICMP_EXT_ECHO_CTYPE_NAME, it should only copy ident.name
          from skb by skb_header_pointer(), its len is ident_len. Besides,
          the return value of skb_header_pointer() should always be checked.
      
        - On case ICMP_EXT_ECHO_CTYPE_INDEX, move ident_len check ahead of
          skb_header_pointer(), and also do the return value check for
          skb_header_pointer().
      
        - On case ICMP_EXT_ECHO_CTYPE_ADDR, before accessing iio->ident.addr.
          ctype3_hdr.addrlen, skb_header_pointer() should be called first,
          then check its return value and ident_len.
          On subcases ICMP_AFI_IP and ICMP_AFI_IP6, also do check for ident.
          addr.ctype3_hdr.addrlen and skb_header_pointer()'s return value.
          On subcase ICMP_AFI_IP, the len for skb_header_pointer() should be
          "sizeof(iio->extobj_hdr) + sizeof(iio->ident.addr.ctype3_hdr) +
          sizeof(struct in_addr)" or "ident_len".
      
      v1->v2:
        - To make it more clear, call skb_header_pointer() once only for
          iio->indent's parsing as Jakub Suggested.
      v2->v3:
        - The extobj_hdr.length check against sizeof(_iio) should be done
          before calling skb_header_pointer(), as Eric noticed.
      
      Fixes: d329ea5b ("icmp: add response to RFC 8335 PROBE messages")
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/31628dd76657ea62f5cf78bb55da6b35240831f1.1634205050.git.lucien.xin@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      1fcd7945
    • J
      ip: use dev_addr_set() in tunnels · 5a1b7e1a
      Jakub Kicinski 提交于
      Use dev_addr_set() instead of writing to netdev->dev_addr
      directly in ip tunnels drivers.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      5a1b7e1a
  7. 07 10月, 2021 1 次提交
    • M
      net: prefer socket bound to interface when not in VRF · 8d6c414c
      Mike Manning 提交于
      The commit 6da5b0f0 ("net: ensure unbound datagram socket to be
      chosen when not in a VRF") modified compute_score() so that a device
      match is always made, not just in the case of an l3mdev skb, then
      increments the score also for unbound sockets. This ensures that
      sockets bound to an l3mdev are never selected when not in a VRF.
      But as unbound and bound sockets are now scored equally, this results
      in the last opened socket being selected if there are matches in the
      default VRF for an unbound socket and a socket bound to a dev that is
      not an l3mdev. However, handling prior to this commit was to always
      select the bound socket in this case. Reinstate this handling by
      incrementing the score only for bound sockets. The required isolation
      due to choosing between an unbound socket and a socket bound to an
      l3mdev remains in place due to the device match always being made.
      The same approach is taken for compute_score() for stream sockets.
      
      Fixes: 6da5b0f0 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
      Fixes: e7819058 ("net: ensure unbound stream socket to be chosen when not in a VRF")
      Signed-off-by: NMike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/cf0a8523-b362-1edf-ee78-eef63cbbb428@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      8d6c414c
  8. 30 9月, 2021 4 次提交
  9. 29 9月, 2021 2 次提交
  10. 28 9月, 2021 2 次提交