1. 04 Nov 2022 (2 commits)
  2. 01 Nov 2022 (1 commit)
  3. 31 Oct 2022 (1 commit)
  4. 15 Oct 2022 (1 commit)
  5. 02 Oct 2022 (1 commit)
  6. 30 Sep 2022 (2 commits)
  7. 29 Sep 2022 (1 commit)
  8. 20 Sep 2022 (1 commit)
    • net: introduce iterators over synced hw addresses · db01868b
      Authored by Vladimir Oltean
      Some network drivers use __dev_mc_sync()/__dev_uc_sync() and therefore
      program the hardware only with addresses with a non-zero sync_cnt.
      
      Some of the above drivers also need to save/restore the address
      filtering lists when certain events happen, and they need to walk
      through the struct net_device::uc and struct net_device::mc lists.
      But these lists contain unsynced addresses too.
      
      To keep the appearance of an elementary form of data encapsulation,
      provide iterators through these lists that only look at entries with a
      non-zero sync_cnt, instead of filtering entries out from device drivers.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
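      A minimal usage sketch of the iterators this entry introduces, for a driver
      restoring its multicast filters after a reset; the iterator name follows the
      commit title, and the foo_* helpers are hypothetical stand-ins.

      #include <linux/netdevice.h>

      /* Hypothetical: program one multicast address into the hardware filter. */
      static void foo_hw_add_mc_addr(struct net_device *dev, const unsigned char *addr)
      {
      }

      static void foo_restore_mc_filters(struct net_device *dev)
      {
              struct netdev_hw_addr *ha;

              netif_addr_lock_bh(dev);
              /* Walks only entries with a non-zero sync_cnt, i.e. the ones
               * __dev_mc_sync() actually programmed into the hardware. */
              netdev_for_each_synced_mc_addr(ha, dev)
                      foo_hw_add_mc_addr(dev, ha->addr);
              netif_addr_unlock_bh(dev);
      }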
  9. 02 Sep 2022 (2 commits)
  10. 26 Aug 2022 (1 commit)
  11. 24 Aug 2022 (2 commits)
  12. 22 Aug 2022 (1 commit)
  13. 25 Jun 2022 (1 commit)
  14. 10 Jun 2022 (2 commits)
  15. 23 May 2022 (1 commit)
  16. 16 May 2022 (5 commits)
    • net: fix dev_fill_forward_path with pppoe + bridge · cf2df74e
      Authored by Felix Fietkau
      When calling dev_fill_forward_path on a pppoe device, the provided destination
      address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
      code needs to update ctx->daddr to the correct value.
      Fix this by storing the address inside struct net_device_path_ctx.
      
      Fixes: f6efc675 ("net: ppp: resolve forwarding path for bridge pppoe devices")
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
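      A sketch of the fix's shape, under the assumption that struct
      net_device_path_ctx gains a daddr field and the pppoe path-resolution code
      copies the session's peer MAC into it before the bridge fdb lookup runs;
      the helper below is illustrative, not the exact upstream diff.

      #include <linux/etherdevice.h>
      #include <linux/netdevice.h>

      /* Hypothetical helper: record the PPPoE peer MAC in the shared context so
       * that the following bridge fdb lookup resolves the right destination. */
      static void pppoe_path_set_daddr(struct net_device_path_ctx *ctx,
                                       const u8 *peer_mac)
      {
              ether_addr_copy(ctx->daddr, peer_mac);
      }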
    • net: fix possible race in skb_attempt_defer_free() · 97e719a8
      Authored by Eric Dumazet
      A cpu can observe sd->defer_count reaching 128
      and call smp_call_function_single_async().
      
      Problem is that the remote CPU can clear sd->defer_count
      before the IPI is run/acknowledged.
      
      Other cpus can queue more packets and also decide
      to call smp_call_function_single_async() while the pending
      IPI was not yet delivered.
      
      This is a common issue with smp_call_function_single_async().
      Callers must ensure correct synchronization and serialization.
      
      I triggered this issue while experimenting with a smaller threshold.
      Performing the call to smp_call_function_single_async()
      under sd->defer_lock protection did not solve the problem.
      
      Commit 5a18ceca ("smp: Allow smp_call_function_single_async()
      to insert locked csd") replaced an informative WARN_ON_ONCE()
      with a return of -EBUSY, which is often ignored.
      Testing for CSD_FLAG_LOCK presence is racy anyway.
      
      Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
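      A sketch of the caller-side serialization this class of bug calls for,
      assuming a per-CPU "IPI pending" flag that the handler clears; the names
      are illustrative and this is not necessarily the exact upstream fix.

      #include <linux/smp.h>

      struct defer_kick_sketch {
              call_single_data_t csd;  /* assumed set up with INIT_CSD(&csd, handler, self) */
              int ipi_pending;         /* 1 while an IPI is in flight */
      };

      static void defer_ipi_handler_sketch(void *arg)
      {
              struct defer_kick_sketch *s = arg;

              /* ... raise NET_RX_SOFTIRQ / free the deferred skbs here ... */
              smp_store_release(&s->ipi_pending, 0);
      }

      static void kick_remote_cpu_sketch(int cpu, struct defer_kick_sketch *s)
      {
              /* Only one IPI may be outstanding per target; otherwise
               * smp_call_function_single_async() returns -EBUSY or races on the
               * csd lock bit, exactly the problem described above. */
              if (!cmpxchg(&s->ipi_pending, 0, 1))
                      smp_call_function_single_async(cpu, &s->csd);
      }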
    • net: allow gro_max_size to exceed 65536 · 0fe79f28
      Authored by Alexander Duyck
      Allow gro_max_size to be set to a value larger than 65536.
      
      There weren't really any external limitations preventing this other
      than the fact that IPv4 only supports a 16-bit length field. Since we have
      the option of adding a hop-by-hop header for IPv6, we can allow IPv6 to
      exceed this value, and for IPv4 and non-TCP flows we can cap things at 65536
      via a constant rather than relying on gro_max_size.
      
      [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
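      A sketch of the capping rule described above, assuming a legacy 64K
      constant for IPv4/non-TCP flows; the helper is illustrative rather than
      the exact upstream code.

      #include <linux/minmax.h>
      #include <linux/netdevice.h>

      #define GRO_LEGACY_MAX_SIZE_SKETCH      65536u

      static unsigned int gro_size_limit_sketch(const struct net_device *dev,
                                                bool ipv6_tcp_flow)
      {
              unsigned int limit = READ_ONCE(dev->gro_max_size);

              /* Only IPv6 TCP can carry the hop-by-hop jumbo option, so IPv4
               * and non-TCP flows stay at the classic 64K ceiling. */
              if (!ipv6_tcp_flow)
                      limit = min(limit, GRO_LEGACY_MAX_SIZE_SKETCH);
              return limit;
      }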
    • net: limit GSO_MAX_SIZE to 524280 bytes · 34b92e8d
      Authored by Eric Dumazet
      Make sure we will not overflow shinfo->gso_segs.

      The minimal TCP MSS is 8 bytes, and shinfo->gso_segs
      is a 16-bit field.

      TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h;
      it seems cleaner not to bring TCP details into include/linux/netdevice.h.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
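      The arithmetic behind 524280, written out as constants (simplified for
      illustration, not the literal upstream diff): a 16-bit gso_segs count
      times the 8-byte minimum MSS gives 65535 * 8 = 524280 bytes.

      #define TCP_MIN_GSO_SIZE_SKETCH  8       /* smallest MSS handled */
      #define GSO_MAX_SEGS_SKETCH      65535   /* shinfo->gso_segs is 16 bits */
      #define GSO_MAX_SIZE_SKETCH      (TCP_MIN_GSO_SIZE_SKETCH * GSO_MAX_SEGS_SKETCH) /* 524280 */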
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Authored by Alexander Duyck
      The code for gso_max_size was added originally to allow for debugging and
      working around buggy devices that couldn't support TSO with blocks 64K in
      size. The original reason for limiting it to 64K was that this matched the
      existing limits of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
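      A sketch of the split described above, with a _SKETCH suffix marking the
      definitions as illustrative: the old 64K ceiling survives as a legacy
      constant while GSO_MAX_SIZE grows to UINT_MAX, and non-TCPv6 users are
      trimmed back to the legacy value (the sk_trim_gso_size() idea from the
      v6 note).

      #include <linux/minmax.h>

      #define GSO_LEGACY_MAX_SIZE_SKETCH      65536u   /* old ceiling, kept for non-TCPv6 flows */

      static unsigned int trim_gso_size_sketch(unsigned int gso_max_size,
                                               bool ipv6_tcp_jumbo_capable)
      {
              /* Only IPv6 TCP sockets that may emit a hop-by-hop jumbogram keep
               * the big value; everything else stays within the legacy limit. */
              if (ipv6_tcp_jumbo_capable)
                      return gso_max_size;
              return min(gso_max_size, GSO_LEGACY_MAX_SIZE_SKETCH);
      }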
  17. 11 May 2022 (1 commit)
  18. 10 May 2022 (1 commit)
    • ptp: Support late timestamp determination · 97dc7cd9
      Authored by Gerhard Engleder
      If a physical clock supports a free running cycle counter, then
      timestamps shall be based on this time too. For TX it is known in
      advance before the transmission if a timestamp based on the free running
      cycle counter is needed. For RX it is impossible to know which timestamp
      is needed before the packet is received and assigned to a socket.
      
      Support late timestamp determination by a network device. Therefore, an
      address/cookie is stored within the new netdev_data field of struct
      skb_shared_hwtstamps. This address/cookie is provided to a new network
      device function called ndo_get_tstamp(), which returns a timestamp based
      on the normal/adjustable time or based on the free running cycle
      counter. If the function is not supported, timestamp handling is
      unchanged.
      
      This mechanism is intended for RX, but TX use is also possible.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
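      A driver-side sketch of the callback described above, assuming it receives
      the skb_shared_hwtstamps (whose netdev_data field carries the stored
      address/cookie) plus a flag selecting the free running cycle counter; the
      foo_* descriptor types and helpers are hypothetical.

      #include <linux/ktime.h>
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>

      struct foo_rx_desc;                                     /* hypothetical HW descriptor */
      ktime_t foo_desc_ptp_time(struct foo_rx_desc *desc);    /* hypothetical */
      ktime_t foo_desc_cycle_time(struct foo_rx_desc *desc);  /* hypothetical */

      static ktime_t foo_get_tstamp(struct net_device *dev,
                                    const struct skb_shared_hwtstamps *hwtstamps,
                                    bool cycles)
      {
              struct foo_rx_desc *desc = hwtstamps->netdev_data;

              /* 'cycles' selects the free running cycle counter domain;
               * otherwise return the normal/adjustable PTP time. */
              return cycles ? foo_desc_cycle_time(desc) : foo_desc_ptp_time(desc);
      }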
  19. 09 May 2022 (1 commit)
  20. 06 May 2022 (3 commits)
  21. 04 May 2022 (1 commit)
  22. 30 Apr 2022 (1 commit)
  23. 27 Apr 2022 (2 commits)
    • net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Authored by Sebastian Andrzej Siewior
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already disabled preemption. This results in fewer
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
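      A sketch of the increment pattern after this change, simplified from the
      dev_core_stats_*_inc() macro family named above: the per-CPU pointer is
      read once and this_cpu_inc() bumps the field without an explicit
      preempt-disable section.

      #include <linux/netdevice.h>

      static inline void core_stats_rx_dropped_inc_sketch(struct net_device *dev)
      {
              struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);

              if (unlikely(!p))
                      p = netdev_core_stats_alloc(dev);  /* allocates on first use */
              if (p)
                      this_cpu_inc(p->rx_dropped);
      }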
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Authored by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick the skb before the recvmsg() thread. This issue is more
      visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks the skbs, they are still picked
      on the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) case where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with a 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the
      page recycling strategy used by the NIC driver (its page pool capacity
      being too small compared to the number of skbs/pages held in socket
      receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show high
      cost for skb-freeing-related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
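      A structural sketch of the mechanism described above, with simplified
      names (the real fields live in softnet_data and the threshold is
      illustrative): producers push skbs onto the allocating CPU's list under a
      spinlock, and an IPI is only used when that CPU is slow to drain it.

      #include <linux/skbuff.h>
      #include <linux/smp.h>
      #include <linux/spinlock.h>

      struct defer_list_sketch {
              spinlock_t lock;
              struct sk_buff *head;    /* singly linked via skb->next */
              unsigned int count;
              call_single_data_t csd;  /* assumed wired to a NET_RX_SOFTIRQ kick */
      };

      static void defer_free_sketch(struct defer_list_sketch *dl,
                                    struct sk_buff *skb, int alloc_cpu)
      {
              bool kick;

              spin_lock_bh(&dl->lock);
              skb->next = dl->head;
              dl->head = skb;
              kick = ++dl->count >= 128;  /* illustrative threshold */
              spin_unlock_bh(&dl->lock);

              /* Unlikely case: the owning CPU is not running net_rx_action()
               * often enough, so poke it with an IPI. */
              if (kick)
                      smp_call_function_single_async(alloc_cpu, &dl->csd);
      }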
  24. 19 Apr 2022 (1 commit)
    • net: sched: use queue_mapping to pick tx queue · 2f1e85b1
      Authored by Tonghao Zhang
      This patch fixes an issue:
      * If we install tc filters with act_skbedit on the clsact hook,
        they don't work, because netdev_core_pick_tx() overwrites
        queue_mapping.

        $ tc filter ... action skbedit queue_mapping 1
      
      And this patch is useful:
      * We can use FQ + EDT to implement efficient policies. Tx queues
        are picked by XPS, the netdev driver's ndo_select_queue, or the skb
        hash in netdev_core_pick_tx(). In fact, the netdev driver and the skb
        hash are _not_ under our control. XPS uses the CPU map to select Tx
        queues, but in most cases we can't figure out which task_struct of a
        pod/container is running on a given cpu. We can use clsact filters to
        classify one pod/container's traffic to one Tx queue. Why?
      
        In a container networking environment, there are two kinds of pod/
        container/net-namespace. For the first kind (e.g. P1, P2), high throughput
        is key for these applications, but to avoid exhausting network resources,
        the outbound traffic of these pods is limited, using or sharing one
        dedicated Tx queue assigned an HTB/TBF/FQ Qdisc. For the other kind of
        pods (e.g. Pn), low latency of data access is key, and the traffic is not
        limited. These pods use or share other dedicated Tx queues assigned a FIFO
        Qdisc. This choice provides two benefits. First, contention on the HTB/FQ
        Qdisc lock is significantly reduced since fewer CPUs contend for the same
        queue. More importantly, Qdisc contention can be eliminated completely if
        each CPU has its own FIFO Qdisc for the second kind of pods.
      
        There must be a mechanism in place to support classifying traffic based on
        pods/containers to different Tx queues. Note that clsact is outside of the
        Qdisc, while a Qdisc can run a classifier to select a sub-queue under its
        lock.

        In general, recording the decision in the skb seems a little heavy handed.
        This patch introduces a per-CPU variable, suggested by Eric.
      
        The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
        - A Tx Qdisc may install skbedit actions; the xmit.skip_txqueue flag
          is then set in qdisc->enqueue() even though the tx queue has already
          been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx().
          Clearing the flag first in __dev_queue_xmit() is useful:
        - It avoids picking the Tx queue with netdev_tx_queue_mapping() in the
          next netdev in a chain such as eth0 macvlan - eth0.3 vlan - eth0
          ixgbe-phy. For example, eth0, a macvlan in a pod whose root Qdisc
          installs skbedit queue_mapping, sends packets to eth0.3, a vlan in the
          host. In __dev_queue_xmit() of eth0.3, the flag is cleared and the tx
          queue is not selected according to skb->queue_mapping, because there
          are no filters in clsact or the tx Qdisc of this netdev. The same
          applies to eth0, the ixgbe device in the host.
        - It avoids picking the Tx queue for the next packet. If we set
          xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way to
          clear it is to clear it in __dev_queue_xmit() when processing the next
          packets.
      
        For performance reasons, a static key is used. If CONFIG_NET_EGRESS is
        not enabled, this code is not compiled in.
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | q1        | qn
          v           v           v
        HTB/FQ      HTB/FQ  ...  FIFO
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
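      A sketch of the per-CPU flag flow described above, with simplified
      stand-ins for the static-key guarded code under CONFIG_NET_EGRESS: the
      egress skbedit action sets the flag, and the transmit path consumes and
      clears it before choosing a txq.

      #include <linux/netdevice.h>
      #include <linux/percpu.h>
      #include <linux/skbuff.h>

      static DEFINE_PER_CPU(bool, skip_txqueue_sketch);

      /* Called from the clsact/egress hook when act_skbedit set queue_mapping. */
      static void egress_mark_txqueue_sketch(void)
      {
              this_cpu_write(skip_txqueue_sketch, true);
      }

      /* Simplified stand-in for the txq selection step in __dev_queue_xmit(). */
      static struct netdev_queue *pick_tx_sketch(struct net_device *dev,
                                                 struct sk_buff *skb)
      {
              if (this_cpu_read(skip_txqueue_sketch)) {
                      this_cpu_write(skip_txqueue_sketch, false);
                      return netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
              }
              return netdev_core_pick_tx(dev, skb, NULL);
      }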
  25. 13 Apr 2022 (1 commit)
  26. 08 Apr 2022 (2 commits)
    • net-core: rx_otherhost_dropped to core_stats · 794c24e9
      Authored by Jeffrey Ji
      Increment the rx_otherhost_dropped counter when a packet is dropped due to
      a mismatched destination MAC address.

      An example of when this drop can occur is when manually crafting raw
      packets that will be consumed by a user space application via a tap
      device. For testing purposes, local traffic was generated using trafgen
      for the client and netcat to start a server.
      
      Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other
      with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the
      counter was incremented. (Also had to modify iproute2 to show the stat,
      additional patch for that coming next.)
      Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
      Reviewed-by: Brian Vazquez <brianvv@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
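      A sketch of where the new counter is bumped, as a simplified stand-in for
      the real IPv4/IPv6 receive path: packets whose destination MAC did not
      match us (PACKET_OTHERHOST) are counted via the new core stat and dropped.

      #include <linux/if_packet.h>
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>

      static bool drop_otherhost_sketch(struct sk_buff *skb)
      {
              if (skb->pkt_type == PACKET_OTHERHOST) {
                      /* Destination MAC was not ours: account it, then drop. */
                      dev_core_stats_rx_otherhost_dropped_inc(skb->dev);
                      kfree_skb(skb);
                      return true;
              }
              return false;
      }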
    • net: extract a few internals from netdevice.h · 6264f58c
      Authored by Jakub Kicinski
      There are a number of functions and static variables used
      under net/core/ but not from the outside. We currently
      dump most of them into netdevice.h. That's bad for many
      reasons:
       - netdevice.h is very cluttered, making it hard to figure out
         what the APIs are;
       - netdevice.h is very long;
       - we have to touch netdevice.h more often, which causes expensive
         incremental builds.
      
      Create a header under net/core/ and move some declarations.
      
      The new header is also a bit of a catch-all, but that's
      fine; if we create more specific headers, people will
      likely over-think where their declarations fit best
      and end up putting them in netdevice.h again.

      More work should be done on splitting netdevice.h into more
      targeted headers, but that would be more time consuming, so take
      small steps.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  27. 06 Apr 2022 (1 commit)