1. 16 May 2022, 6 commits
    • net: call skb_defer_free_flush() before each napi_poll() · 90987650
      Committed by Eric Dumazet
      skb_defer_free_flush() can consume cpu cycles,
      it seems better to call it in the inner loop:
      
      - Potentially frees page/skb that will be reallocated while hot.
      
      - Account for the cpu cycles in the @time_limit determination.
      
      - Keep softnet_data.defer_count small to reduce chances for
        skb_attempt_defer_free() to send an IPI.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
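      A minimal user-space sketch of where the flush moves; flush_deferred(),
      poll_one() and the types below are hypothetical stand-ins for
      skb_defer_free_flush(), napi_poll() and the kernel structures, not
      kernel code:

      #include <stdbool.h>

      struct deferred_list { int count; };
      struct pollable { bool has_work; };

      /* Hypothetical stand-ins for skb_defer_free_flush() and napi_poll(). */
      static void flush_deferred(struct deferred_list *dl) { dl->count = 0; }
      static int poll_one(struct pollable *p, int budget)
      {
              p->has_work = false;
              return budget > 0 ? 1 : 0;
      }

      /* Before this commit the flush ran once per round; here it runs inside
       * the inner loop, so freed buffers are recycled while cache-hot and the
       * cycles spent freeing count against the same time/budget limit. */
      static int poll_round(struct pollable *items, int n,
                            struct deferred_list *dl, int budget)
      {
              for (int i = 0; i < n && budget > 0; i++) {
                      flush_deferred(dl);        /* moved into the inner loop */
                      budget -= poll_one(&items[i], budget);
              }
              return budget;
      }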
    • net: add skb_defer_max sysctl · 39564c3f
      Committed by Eric Dumazet
      commit 68822bdf ("net: generalize skb freeing
      deferral to per-cpu lists") added another per-cpu
      cache of skbs. It was expected to be small,
      and an IPI was forced whenever the list reached 128
      skbs.
      
      We might need to control the queue capacity and the added
      latency more precisely.

      An IPI is generated whenever the queue reaches half capacity.

      The default value of the new limit is 64.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
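      A user-space model of the queueing policy described above; defer_enqueue()
      and struct defer_queue are hypothetical, with defer_max playing the role
      of the new sysctl limit:

      #include <stdbool.h>

      struct defer_queue {
              unsigned int defer_count;
              unsigned int defer_max;   /* analogous to the new limit, default 64 */
              bool kick_pending;        /* stand-in for "send an IPI to the owner cpu" */
      };

      /* Returns true when the caller should free the object immediately
       * because the queue is at capacity. */
      static bool defer_enqueue(struct defer_queue *q)
      {
              if (q->defer_count >= q->defer_max)
                      return true;

              q->defer_count++;
              if (q->defer_count >= q->defer_max / 2)
                      q->kick_pending = true;   /* IPI generated at half capacity */
              return false;
      }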
    • net: use napi_consume_skb() in skb_defer_free_flush() · 2db60eed
      Committed by Eric Dumazet
      Since skb_defer_free_flush() runs from softirq context, we have the
      opportunity to refill the napi_alloc_cache and/or use
      kmem_cache_free_bulk() when this cache is full.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
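      A minimal model of the flush path; consume_buf() is a hypothetical
      stand-in for napi_consume_skb(), which in the kernel can refill the
      napi_alloc_cache or fall back to kmem_cache_free_bulk():

      #include <stdlib.h>

      struct buf { struct buf *next; };

      /* Stand-in for napi_consume_skb(); here it is a plain free(). */
      static void consume_buf(struct buf *b)
      {
              free(b);
      }

      /* Model of the flush: detach the whole deferred list, then release
       * every entry through the cache-aware path rather than a generic
       * per-object free. */
      static void flush_deferred(struct buf **list)
      {
              struct buf *b = *list;

              *list = NULL;
              while (b) {
                      struct buf *next = b->next;

                      consume_buf(b);
                      b = next;
              }
      }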
    • net: fix possible race in skb_attempt_defer_free() · 97e719a8
      Committed by Eric Dumazet
      A cpu can observe sd->defer_count reaching 128 and call
      smp_call_function_single_async().

      The problem is that the remote CPU can clear sd->defer_count
      before the IPI is run/acknowledged.

      Other cpus can queue more packets and also decide to call
      smp_call_function_single_async() while the pending IPI has
      not yet been delivered.
      
      This is a common issue with smp_call_function_single_async().
      Callers must ensure correct synchronization and serialization.
      
      I triggered this issue while experimenting with a smaller threshold.
      Performing the call to smp_call_function_single_async()
      under sd->defer_lock protection did not solve the problem.
      
      Commit 5a18ceca ("smp: Allow smp_call_function_single_async()
      to insert locked csd") replaced an informative WARN_ON_ONCE()
      with a return of -EBUSY, which is often ignored.
      Testing for CSD_FLAG_LOCK presence is racy anyway.
      
      Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
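      One common way to serialize callers of an async single-target call is to
      allow only one kick in flight, re-armed by the handler itself. A
      user-space model with C11 atomics, illustrating the pattern rather than
      the exact fix applied here:

      #include <stdatomic.h>
      #include <stdbool.h>

      /* At most one asynchronous kick may be in flight; only the handler
       * running on the target cpu re-arms it once it has executed. */
      struct kick_state { atomic_bool in_flight; };

      static void send_async_kick(void)
      {
              /* stand-in for smp_call_function_single_async() */
      }

      static void maybe_kick(struct kick_state *ks)
      {
              bool expected = false;

              /* Only the caller that flips false -> true sends the kick. */
              if (atomic_compare_exchange_strong(&ks->in_flight, &expected, true))
                      send_async_kick();
      }

      static void kick_handler(struct kick_state *ks)
      {
              /* runs on the target cpu: process the queue, then allow a new kick */
              atomic_store(&ks->in_flight, false);
      }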
    • net: allow gro_max_size to exceed 65536 · 0fe79f28
      Committed by Alexander Duyck
      Allow gro_max_size to exceed 65536.
      
      There weren't really any external limitations that prevented this other
      than the fact that IPv4 only supports a 16-bit length field. Since we
      have the option of adding a hop-by-hop header for IPv6, we can allow
      IPv6 to exceed this value, and for IPv4 and non-TCP flows we can cap
      things at 65536 via a constant rather than relying on gro_max_size.
      
      [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
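      A sketch of the capping rule; the constants and effective_gro_limit()
      below are illustrative, not the kernel's definitions:

      #include <stdbool.h>

      #define LEGACY_MAX_SIZE  65536u          /* 16-bit IPv4 length limit */
      #define BIG_GRO_MAX_SIZE (8u * 65535u)   /* ceiling chosen to avoid overflows */

      /* Only IPv6 TCP flows may use the larger configured limit; IPv4 and
       * non-TCP flows stay capped at the legacy 64K constant. */
      static unsigned int effective_gro_limit(bool ipv6_tcp, unsigned int gro_max_size)
      {
              unsigned int limit = ipv6_tcp ? gro_max_size : LEGACY_MAX_SIZE;

              return limit > BIG_GRO_MAX_SIZE ? BIG_GRO_MAX_SIZE : limit;
      }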
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Committed by Alexander Duyck
      The code for gso_max_size was originally added to allow debugging and
      working around buggy devices that couldn't support TSO with 64K blocks.
      The original reason for limiting it to 64K was that this matched the
      existing limits of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
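      A sketch of the trimming rule; names are illustrative, while the
      kernel's sk_trim_gso_size() and netif_set_tso_max_size() implement the
      real logic:

      #include <limits.h>

      #define MODEL_GSO_LEGACY_MAX_SIZE 65536u   /* the old 64K ceiling */
      #define MODEL_GSO_MAX_SIZE        UINT_MAX /* the new theoretical upper bound */

      /* Non-IPv6-TCP flows remain capped at the legacy 64K size; IPv6 TCP may
       * use the configured gso_max_size, further bounded by the device's
       * tso_max_size. */
      static unsigned int trim_gso_size(int ipv6_tcp, unsigned int gso_max_size,
                                        unsigned int tso_max_size)
      {
              unsigned int limit = ipv6_tcp ? gso_max_size : MODEL_GSO_LEGACY_MAX_SIZE;

              return limit < tso_max_size ? limit : tso_max_size;
      }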
  2. 13 May 2022, 1 commit
  3. 11 May 2022, 3 commits
  4. 06 May 2022, 3 commits
  5. 04 May 2022, 1 commit
  6. 30 Apr 2022, 1 commit
  7. 28 Apr 2022, 1 commit
  8. 27 Apr 2022, 2 commits
    • net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Committed by Sebastian Andrzej Siewior
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already disabled preemption. This results in fewer
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
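      A rough user-space analogy of the before/after shape, using
      _Thread_local as a stand-in for per-CPU storage; the kernel uses
      this_cpu_inc(), and the struct and function names here are hypothetical:

      /* _Thread_local models per-CPU data in this analogy only. */
      struct core_stats_model { unsigned long rx_dropped; };

      static _Thread_local struct core_stats_model core_stats_slot;

      /* before (modeled): guard, fetch the per-cpu pointer, increment, unguard */
      static void inc_rx_dropped_old(void)
      {
              /* preempt_disable(); */
              struct core_stats_model *p = &core_stats_slot;

              p->rx_dropped++;                 /* local_inc() equivalent */
              /* preempt_enable(); */
      }

      /* after (modeled): a single per-cpu increment, no explicit guard needed */
      static void inc_rx_dropped_new(void)
      {
              core_stats_slot.rx_dropped++;    /* this_cpu_inc() equivalent */
      }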
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Committed by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb frees
      outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick the skb before the recvmsg() thread. This issue is more
      visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks the skbs, they are still picked
      from the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      the page recycling strategy used by the NIC driver (its page pool
      capacity being too small compared to the number of skbs/pages held
      in socket receive queues).

      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show a
      high cost for skb-freeing-related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
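      A simplified user-space model of the deferral idea; the struct names,
      the pthread lock and attempt_defer_free() are illustrative stand-ins,
      not the kernel's softnet_data/skb_attempt_defer_free():

      #include <pthread.h>
      #include <stdlib.h>

      /* Each buffer remembers the "cpu" that allocated it; the consumer
       * pushes it back onto that cpu's list instead of freeing it remotely.
       * The IPI that raises NET_RX_SOFTIRQ is modeled as a flag. */
      struct buf { struct buf *next; int alloc_cpu; };

      struct percpu_defer {
              pthread_mutex_t lock;
              struct buf *list;
              int count;
              int kick_needed;
      };

      static void attempt_defer_free(struct percpu_defer *cpus, int this_cpu,
                                     struct buf *b)
      {
              struct percpu_defer *sd = &cpus[b->alloc_cpu];

              if (b->alloc_cpu == this_cpu) {   /* already local: free right away */
                      free(b);
                      return;
              }

              pthread_mutex_lock(&sd->lock);
              b->next = sd->list;
              sd->list = b;
              if (++sd->count >= 128)           /* threshold used in this commit */
                      sd->kick_needed = 1;      /* ask the owner cpu to drain soon */
              pthread_mutex_unlock(&sd->lock);
      }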
  9. 19 Apr 2022, 1 commit
    • net: sched: use queue_mapping to pick tx queue · 2f1e85b1
      Committed by Tonghao Zhang
      This patch fixes an issue:
      * If we install tc filters with act_skbedit in the clsact hook,
        they don't work, because netdev_core_pick_tx() overwrites
        queue_mapping.

        $ tc filter ... action skbedit queue_mapping 1
      
      This patch is also useful:
      * We can use FQ + EDT to implement efficient policies. Tx queues
        are picked by xps, the ndo_select_queue of the netdev driver, or the
        skb hash in netdev_core_pick_tx(). In fact, the netdev driver and the
        skb hash are _not_ under our control. xps uses the CPU map to select Tx
        queues, but in most cases we can't figure out which pod/container
        task_struct is running on a given cpu. We can use clsact filters to
        classify one pod/container's traffic to one Tx queue. Why?
      
        In a container networking environment, there are two kinds of pods/
        containers/net-namespaces. For one kind (e.g. P1, P2), high throughput
        is key for the application, but to avoid running out of network
        resources, the outbound traffic of these pods is limited, using or
        sharing dedicated Tx queues assigned an HTB/TBF/FQ Qdisc. For the other
        kind of pods (e.g. Pn), low latency of data access is key and the
        traffic is not limited; these pods use or share other dedicated Tx
        queues assigned a FIFO Qdisc. This choice provides two benefits. First,
        contention on the HTB/FQ Qdisc lock is significantly reduced since
        fewer CPUs contend for the same queue. More importantly, Qdisc
        contention can be eliminated completely if each CPU has its own FIFO
        Qdisc for the second kind of pods.
      
        There must be a mechanism in place to support classifying traffic from
        pods/containers to different Tx queues. Note that clsact is outside of
        the Qdisc, while a Qdisc can run a classifier to select a sub-queue
        under its lock.
      
        In general recording the decision in the skb seems a little heavy handed.
        This patch introduces a per-CPU variable, suggested by Eric.
      
        The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
        - A Tx Qdisc may install such skbedit actions; the xmit.skip_txqueue
          flag is then set in qdisc->enqueue() even though the tx queue has
          already been selected in netdev_tx_queue_mapping() or
          netdev_core_pick_tx(). Clearing that flag first in __dev_queue_xmit()
          is useful to:
        - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next
          netdev in a stacked setup such as eth0 macvlan - eth0.3 vlan - eth0
          ixgbe-phy. For example, eth0 (macvlan in the pod), whose root Qdisc
          installs skbedit queue_mapping, sends packets to eth0.3 (vlan in the
          host). __dev_queue_xmit() of eth0.3 clears the flag and does not
          select the tx queue according to skb->queue_mapping, because there
          are no filters in the clsact or tx Qdisc of this netdev. The same
          action is taken on eth0 (ixgbe in the host).
        - Avoid picking the Tx queue for the next packet. If we set
          xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way
          to clear it is to clear it in __dev_queue_xmit() when processing the
          next packets.
      
        For performance reasons, a static key is used. If CONFIG_NET_EGRESS is
        not enabled, this code is not compiled in.
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | q1        | qn
          v           v           v
        HTB/FQ      HTB/FQ  ...  FIFO
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
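      A user-space sketch of the per-CPU flag's role; the names below are
      hypothetical stand-ins (the kernel uses a per-CPU xmit.skip_txqueue
      field and sch_handle_egress(), not a _Thread_local variable):

      #include <stdbool.h>
      #include <stdint.h>

      struct pkt { uint16_t queue_mapping; };

      /* stands in for the per-CPU xmit.skip_txqueue flag */
      static _Thread_local bool skip_txqueue;

      /* The skbedit-like action records the chosen queue and marks that the
       * core queue selection should be skipped. */
      static void edit_queue_mapping(struct pkt *p, uint16_t q)
      {
              p->queue_mapping = q;
              skip_txqueue = true;
      }

      /* Model of __dev_queue_xmit(): clear the flag first, run the egress
       * filter (which may set it), and only then decide whether to honour
       * queue_mapping.  A stacked device (macvlan -> vlan -> physical)
       * therefore never inherits a stale decision from the previous device. */
      static uint16_t xmit_pick_queue(struct pkt *p, uint16_t core_pick,
                                      void (*egress_filter)(struct pkt *))
      {
              skip_txqueue = false;
              if (egress_filter)
                      egress_filter(p);
              return skip_txqueue ? p->queue_mapping : core_pick;
      }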
  10. 11 Apr 2022, 1 commit
  11. 08 Apr 2022, 3 commits
  12. 06 Apr 2022, 2 commits
  13. 29 Mar 2022, 1 commit
  14. 22 Mar 2022, 1 commit
  15. 19 Mar 2022, 1 commit
    • net: set default rss queues num to physical cores / 2 · 046e1537
      Committed by Íñigo Huguet
      Network drivers can call netif_get_num_default_rss_queues() to get the
      default number of receive queues to use. Right now, this default number
      is min(8, num_online_cpus()).
      
      Instead, as suggested by Jakub, use the number of physical cores divided
      by 2 as a way to avoid wasting CPU resources and to avoid using both
      threads of a core, while still allowing scaling for high-end processors
      with many cores.
      
      As an exception, select 2 queues for processors with 2 cores, because
      otherwise it won't take any advantage of RSS despite being SMP capable.
      
      Tested: Processor Intel Xeon E5-2620 (2 sockets, 6 cores/socket, 2
      threads/core). NIC Broadcom NetXtreme II BCM57810 (10 Gbps). Ran some
      tests with `perf stat iperf3 -R`, with parallelisms of 1, 8 and 24,
      getting the following results:
      - Number of queues: 6 (instead of 8)
      - Network throughput: not affected
      - CPU usage: utilized 0.05-0.12 CPUs more than before (with 24 CPUs
        this is only 0.2-0.5% higher)
      - Reduced the number of context switches by 7-50%, being more noticeable
        when using a higher number of parallel threads.
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Link: https://lore.kernel.org/r/20220315091832.13873-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
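      A sketch of the selection rule described above; illustrative only, as
      the rounding in the real netif_get_num_default_rss_queues() may differ:

      /* Half the physical cores, with the 2-core exception from the commit
       * text so a 2-core SMP machine still gets 2 queues. */
      static unsigned int default_rss_queues(unsigned int physical_cores)
      {
              if (physical_cores <= 2)
                      return physical_cores;
              return physical_cores / 2;
      }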
  16. 14 Mar 2022, 1 commit
  17. 12 Mar 2022, 2 commits
  18. 04 Mar 2022, 7 commits
  19. 03 Mar 2022, 2 commits
    • net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally · cd14e9b7
      Committed by Martin KaFai Lau
      The previous patches handled the delivery_time in the ingress path
      before the routing decision is made.  This patch postpones clearing
      the delivery_time in an skb until it is known that the skb is delivered
      locally, and also sets the (rcv) timestamp if needed.  It moves
      skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
      and ip6_input_finish().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Committed by Martin KaFai Lau
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set skb->mono_delivery_time to flag that skb->tstamp
      holds the mono delivery_time (EDT) instead of the (rcv) timestamp,
      and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This will make bpf_redirect_*()
      keep the mono delivery_time so it can be used by a qdisc (fq) of
      the egressing interface.
      
      A later patch will postpone skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally, which will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time as well.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited to CONFIG_NET_INGRESS, to avoid too much code
      churn in this set.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
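      A tiny model of what the flag means; the struct and helpers are
      illustrative, not the kernel's skb_set_delivery_time()/
      skb_clear_delivery_time():

      #include <stdbool.h>
      #include <stdint.h>

      /* When mono_delivery_time is set, tstamp carries the EDT delivery time;
       * clearing the flag means tstamp must no longer be interpreted that way
       * (here it is simply reset). */
      struct pkt_ts {
              uint64_t tstamp;
              bool mono_delivery_time;
      };

      static void set_delivery_time(struct pkt_ts *p, uint64_t edt)
      {
              p->tstamp = edt;
              p->mono_delivery_time = true;
      }

      static void clear_delivery_time(struct pkt_ts *p)
      {
              if (p->mono_delivery_time) {
                      p->mono_delivery_time = false;
                      p->tstamp = 0;   /* the stack may later stamp the rcv time */
              }
      }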