1. 09 May 2022, 1 commit
  2. 06 May 2022, 3 commits
  3. 04 May 2022, 1 commit
  4. 30 Apr 2022, 1 commit
  5. 27 Apr 2022, 2 commits
    • net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Sebastian Andrzej Siewior authored
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already disabled preemption. This results in fewer
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
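      
      A minimal sketch of the helper after this change, assuming a
      dev_core_stats() lookup that falls back to netdev_core_stats_alloc() on
      first use (an assumption for this sketch, not a verbatim copy of the
      patch):
      
          #define DEV_CORE_STATS_INC(FIELD)                                        \
          static inline void dev_core_stats_##FIELD##_inc(struct net_device *dev)  \
          {                                                                        \
                  struct net_device_core_stats __percpu *p;                       \
                                                                                   \
                  p = dev_core_stats(dev); /* lazily allocates per-CPU memory */   \
                  if (p)                                                           \
                          this_cpu_inc(p->FIELD); /* no preempt_disable() needed */\
          }
      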
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet authored
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb frees
      outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread. This issue is
      more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks up the skbs, they are still
      picked up on the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) case where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
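      
      A sketch of the deferral step under the assumptions above: each per-cpu
      softnet_data gains defer_list/defer_lock/defer_count and an IPI csd.
      Names are taken from this changelog where possible (defer_list,
      defer_lock, alloc_cpu); the rest, such as defer_csd, is illustrative:
      
          static void skb_defer_free_sketch(struct sk_buff *skb)
          {
                  int cpu = skb->alloc_cpu;        /* cpu that allocated the skb */
                  struct softnet_data *sd;
                  bool kick;
      
                  /* Free locally if the allocating cpu is us or is offline. */
                  if (!cpu_online(cpu) || cpu == raw_smp_processor_id()) {
                          __kfree_skb(skb);
                          return;
                  }
      
                  sd = &per_cpu(softnet_data, cpu);
                  spin_lock_bh(&sd->defer_lock);
                  skb->next = sd->defer_list;       /* single-linked list (v2) */
                  WRITE_ONCE(sd->defer_list, skb);  /* paired with lockless read */
                  kick = sd->defer_count++ == 0;
                  spin_unlock_bh(&sd->defer_lock);
      
                  /* Make sure net_rx_action() runs soon on the remote cpu. */
                  if (kick)
                          smp_call_function_single_async(cpu, &sd->defer_csd);
          }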
      
      Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the
      page recycling strategy used by the NIC driver (its page pool capacity
      being too small compared to the number of skbs/pages held in socket
      receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show a
      high cost in skb-freeing-related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 19 Apr 2022, 1 commit
    • net: sched: use queue_mapping to pick tx queue · 2f1e85b1
      Tonghao Zhang authored
      This patch fixes an issue:
      * If we install tc filters with act_skbedit in the clsact hook,
        they do not work, because netdev_core_pick_tx() overwrites
        queue_mapping.
      
        $ tc filter ... action skbedit queue_mapping 1
      
      This patch is also useful:
      * We can use FQ + EDT to implement efficient policies. Tx queues
        are picked by XPS, the ndo_select_queue of the netdev driver, or the
        skb hash in netdev_core_pick_tx(). In fact, the netdev driver and the
        skb hash are _not_ under our control. XPS uses the CPU map to select
        Tx queues, but in most cases we can't figure out which pod/container
        task_struct is running on a given cpu. We can use clsact filters to
        classify one pod/container's traffic to one Tx queue. Why?
      
        In a container networking environment, there are two kinds of pod/
        container/net-namespace. For the first kind (e.g. P1, P2), high
        throughput is key for these applications, but to avoid exhausting
        network resources, the outbound traffic of these pods is limited,
        using or sharing dedicated Tx queues assigned an HTB/TBF/FQ Qdisc.
        For the other kind of pods (e.g. Pn), low latency of data access is
        key, and the traffic is not limited. These pods use or share other
        dedicated Tx queues assigned a FIFO Qdisc. This choice provides two
        benefits. First, contention on the HTB/FQ Qdisc lock is significantly
        reduced since fewer CPUs contend for the same queue. More importantly,
        Qdisc contention can be eliminated completely if each CPU has its own
        FIFO Qdisc for the second kind of pods.
      
        There must be a mechanism in place to support classifying traffic
        based on pods/containers to different Tx queues. Note that clsact is
        outside of the Qdisc, while a Qdisc can run a classifier to select a
        sub-queue under its lock.
      
        In general, recording the decision in the skb seems a little heavy
        handed. This patch introduces a per-CPU variable, suggested by Eric.
      
        The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
        - A tx Qdisc may install such skbedit actions, and then the
          xmit.skip_txqueue flag is set in qdisc->enqueue() even though the
          tx queue has already been selected in netdev_tx_queue_mapping() or
          netdev_core_pick_tx(). Clearing that flag first in
          __dev_queue_xmit() is useful:
        - It avoids picking a Tx queue with netdev_tx_queue_mapping() in the
          next netdev in a stacked setup such as eth0 macvlan - eth0.3 vlan -
          eth0 ixgbe-phy. For example, eth0, a macvlan in a pod whose root
          Qdisc installs skbedit queue_mapping, sends packets to eth0.3, a
          vlan in the host. In __dev_queue_xmit() of eth0.3, the flag is
          cleared, and the tx queue is not selected according to
          skb->queue_mapping because there are no filters in clsact or in the
          tx Qdisc of this netdev. The same action is taken in eth0, the
          ixgbe device in the host.
        - It avoids picking a Tx queue for the next packet. If we set
          xmit.skip_txqueue in a tx Qdisc (qdisc->enqueue()), the proper way
          to clear it is to clear it in __dev_queue_xmit() when processing
          the next packets.
      
        For performance reasons, a static key is used: if CONFIG_NET_EGRESS is
        not enabled, this code is not compiled in. A rough sketch of the
        resulting queue selection is shown after the diagram below.
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | q1        | qn
          v           v           v
        HTB/FQ      HTB/FQ  ...  FIFO
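      
      A rough sketch, not the exact upstream code, of how a per-CPU flag can
      carry the skbedit decision into __dev_queue_xmit(); the placement of the
      flag in softnet_data and the helper names are assumptions for
      illustration:
      
          /* Set from qdisc/clsact context once an skbedit action has written
           * skb->queue_mapping, so the stack keeps that choice. */
          static inline void netdev_xmit_skip_txqueue_sketch(bool skip)
          {
                  __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
          }
      
          /* In __dev_queue_xmit(), before the normal queue selection. */
          static struct netdev_queue *pick_tx_sketch(struct net_device *dev,
                                                     struct sk_buff *skb)
          {
                  if (__this_cpu_read(softnet_data.xmit.skip_txqueue))
                          return netdev_get_tx_queue(dev,
                                                     skb_get_queue_mapping(skb));
                  return netdev_core_pick_tx(dev, skb, NULL);
          }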
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  7. 13 Apr 2022, 1 commit
  8. 08 Apr 2022, 2 commits
    • net-core: rx_otherhost_dropped to core_stats · 794c24e9
      Jeffrey Ji authored
      Increment the rx_otherhost_dropped counter when a packet is dropped
      due to a mismatched destination MAC address.
      
      An example of when this drop can occur is when manually crafting raw
      packets that will be consumed by a user space application via a tap
      device. For testing purposes, local traffic was generated using trafgen
      for the client and netcat to start a server.
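      
      For illustration, a sketch (not necessarily the exact hunk of this patch)
      of where such a counter would be bumped in the L3 receive path; the
      helper name follows the dev_core_stats_*_inc() pattern:
      
          /* Early in ip_rcv_core()/ip6_rcv_core(): packets addressed to
           * another host are accounted and then dropped by the caller. */
          static bool rx_otherhost_check_sketch(struct sk_buff *skb)
          {
                  if (skb->pkt_type == PACKET_OTHERHOST) {
                          dev_core_stats_rx_otherhost_dropped_inc(skb->dev);
                          return true;    /* caller drops the skb */
                  }
                  return false;
          }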
      
      Tested: Created 2 netns and sent 1 packet using trafgen from one to the
      other with "{eth(daddr=$INCORRECT_MAC...}", then verified that iproute2
      showed the counter being incremented. (iproute2 also had to be modified
      to show the stat; an additional patch for that is coming next.)
      Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
      Reviewed-by: Brian Vazquez <brianvv@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: extract a few internals from netdevice.h · 6264f58c
      Jakub Kicinski authored
      There's a number of functions and static variables used
      under net/core/ but not from the outside. We currently
      dump most of them into netdevice.h. That's bad for many
      reasons:
       - netdevice.h is very cluttered, making it hard to figure out
         what the APIs are;
       - netdevice.h is very long;
       - we have to touch netdevice.h more, which causes expensive
         incremental builds.
      
      Create a header under net/core/ and move some declarations.
      
      The new header is also a bit of a catch-all, but that's
      fine; if we create more specific headers, people will
      likely over-think where their declarations fit best
      and end up putting them in netdevice.h again.
      
      More work should be done on splitting netdevice.h into more
      targeted headers, but that'd be more time consuming, so small
      steps.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 06 Apr 2022, 2 commits
  10. 29 Mar 2022, 1 commit
  11. 22 Mar 2022, 1 commit
  12. 19 Mar 2022, 1 commit
    • net: set default rss queues num to physical cores / 2 · 046e1537
      Íñigo Huguet authored
      Network drivers can call netif_get_num_default_rss_queues() to get the
      default number of receive queues to use. Right now, this default number
      is min(8, num_online_cpus()).
      
      Instead, as suggested by Jakub, use the number of physical cores divided
      by 2 as a way to avoid wasting CPU resources and to avoid using both CPU
      threads, while still allowing scaling for high-end processors with many
      cores.
      
      As an exception, select 2 queues for processors with 2 cores, because
      otherwise they would not take advantage of RSS despite being SMP capable.
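      
      A sketch of how such a default can be computed with the usual topology
      helpers (the actual implementation may differ in details such as kdump
      handling):
      
          static unsigned int default_rss_queues_sketch(void)
          {
                  cpumask_var_t cpus;
                  unsigned int cpu, cores = 0;
      
                  if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
                          return 1;
      
                  cpumask_copy(cpus, cpu_online_mask);
                  for_each_cpu(cpu, cpus) {
                          cores++;
                          /* Drop the SMT siblings of this core from the mask. */
                          cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
                  }
                  free_cpumask_var(cpus);
      
                  /* 1-2 cores: one queue per core; otherwise half the cores. */
                  return cores > 2 ? DIV_ROUND_UP(cores, 2) : cores;
          }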
      
      Tested: Processor Intel Xeon E5-2620 (2 sockets, 6 cores/socket, 2
      threads/core). NIC Broadcom NetXtreme II BCM57810 (10 Gbps). Ran some
      tests with `perf stat iperf3 -R`, with parallelisms of 1, 8 and 24,
      getting the following results:
      - Number of queues: 6 (instead of 8)
      - Network throughput: not affected
      - CPU usage: utilized 0.05-0.12 CPUs more than before (having 24 CPUs
        this is only 0.2-0.5% higher)
      - Reduced the number of context switches by 7-50%, being more noticeable
        when using a higher number of parallel threads.
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Link: https://lore.kernel.org/r/20220315091832.13873-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 15 Mar 2022, 1 commit
    • net: disable preemption in dev_core_stats_XXX_inc() helpers · fc93db15
      Eric Dumazet authored
      syzbot was kind enough to remind us that dev->{tx_dropped|rx_dropped}
      could be increased in process context.
      
      BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor413/3593
      caller is netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298
      CPU: 1 PID: 3593 Comm: syz-executor413 Not tainted 5.17.0-rc7-syzkaller-02426-g97aeb877 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       check_preemption_disabled+0x16b/0x170 lib/smp_processor_id.c:49
       netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298
       dev_core_stats include/linux/netdevice.h:3855 [inline]
       dev_core_stats_rx_dropped_inc include/linux/netdevice.h:3866 [inline]
       tun_get_user+0x3455/0x3ab0 drivers/net/tun.c:1800
       tun_chr_write_iter+0xe1/0x200 drivers/net/tun.c:2015
       call_write_iter include/linux/fs.h:2074 [inline]
       new_sync_write+0x431/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x12d/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f2cf4f887e3
      Code: 5d 41 5c 41 5d 41 5e e9 9b fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
      RSP: 002b:00007ffd50dd5fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007ffd50dd6000 RCX: 00007f2cf4f887e3
      RDX: 000000000000002a RSI: 0000000000000000 RDI: 00000000000000c8
      RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007ffd50dd5ff0 R14: 00007ffd50dd5fe8 R15: 00007ffd50dd5fe4
       </TASK>
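      
      The shape of the fix, as a sketch only (the exact macro text in the patch
      may differ; dev_core_stats() is the inline helper visible in the trace
      above, and local_inc() assumes local_t counters):
      
          #define DEV_CORE_STATS_INC(FIELD)                                        \
          static inline void dev_core_stats_##FIELD##_inc(struct net_device *dev)  \
          {                                                                        \
                  struct net_device_core_stats *p;                                 \
                                                                                   \
                  preempt_disable();       /* smp_processor_id() is now legal */   \
                  p = dev_core_stats(dev); /* may call netdev_core_stats_alloc */  \
                  if (p)                                                           \
                          local_inc(&p->FIELD);                                    \
                  preempt_enable();                                                \
          }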
      
      Fixes: 625788b5 ("net: add per-cpu storage and net->core_stats")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: jeffreyji <jeffreyji@google.com>
      Cc: Brian Vazquez <brianvv@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220312214505.3294762-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  14. 12 Mar 2022, 1 commit
  15. 07 Mar 2022, 2 commits
  16. 03 Mar 2022, 1 commit
    • net: dev: Add hardware stats support · 9309f97a
      Petr Machata authored
      Offloading switch device drivers may be able to collect statistics of the
      traffic taking place in the HW datapath that pertains to a certain soft
      netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
      these statistics to the offloaded netdevice in question. The API was shaped
      by the following considerations:
      
      - Collection of HW statistics is not free: there may be a finite number of
        counters, and the act of counting may have a performance impact. It is
        therefore necessary to allow toggling whether HW counting should be done
        for any particular SW netdevice.
      
      - As the drivers are loaded and removed, a particular device may get
        offloaded and unoffloaded again. At the same time, the statistics values
        need to stay monotonic (modulo the eventual 64-bit wraparound),
        increasing only to reflect traffic measured in the device.
      
        To that end, the netdevice keeps around a lazily-allocated copy of struct
        rtnl_link_stats64. Device drivers then contribute to the values kept
        therein at various points. Even as the driver goes away, the struct stays
        around to maintain the statistics values.
      
      - Different HW devices may be able to count different things. The
        motivation behind this patch in particular is exposure of HW counters on
        Nvidia Spectrum switches, where the only practical approach to counting
        traffic on offloaded soft netdevices currently is to use router interface
        counters, and count L3 traffic. Correspondingly that is the statistics
        suite added in this patch.
      
        Other devices may be able to measure different kinds of traffic, and for
        that reason, the APIs are built to allow uniform access to different
        statistics suites.
      
      - Because soft netdevices and offloading drivers are only loosely bound, a
        netdevice uses a notifier chain to communicate with the drivers. Several
        new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
        to the offloading drivers.
      
      - Devices can have various conditions for when a particular counter is
        available. As the device is configured and reconfigured, the device
        offload may become or cease being suitable for counter binding. A
        netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
        ping offloading drivers and determine whether anyone currently implements
        a given statistics suite. This information can then be propagated to user
        space.
      
        When the driver decides to unoffload a netdevice, it can use a
        newly-added function, netdev_offload_xstats_report_delta(), to record
        outstanding collected statistics, before destroying the HW counter.
      
      This patch adds a helper, call_netdevice_notifiers_info_robust(), for
      dispatching a notifier with the possibility of unwind when one of the
      consumers bails. Given the wish to eventually get rid of the global
      notifier block altogether, this helper only invokes the per-netns notifier
      block.
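      
      A loose sketch of the driver side of this contract; the payload handling
      is elided, NETDEV_OFFLOAD_XSTATS_REPORT_DELTA is assumed to be among the
      NETDEV_OFFLOAD_XSTATS_* notifiers, and everything else is illustrative:
      
          static int my_switch_netdevice_event(struct notifier_block *nb,
                                               unsigned long event, void *ptr)
          {
                  switch (event) {
                  case NETDEV_OFFLOAD_XSTATS_REPORT_USED:
                          /* Tell the core that this driver implements the
                           * requested statistics suite for the netdevice. */
                          return NOTIFY_OK;
                  case NETDEV_OFFLOAD_XSTATS_REPORT_DELTA:
                          /* Hand back the HW counters collected so far, e.g.
                           * via netdev_offload_xstats_report_delta(). */
                          return NOTIFY_OK;
                  default:
                          return NOTIFY_DONE;
                  }
          }
      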
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 23 Feb 2022, 1 commit
    • drop_monitor: remove quadratic behavior · b26ef81c
      Eric Dumazet authored
      drop_monitor is using a single list on which all netdevices in
      the host have an element, regardless of their netns.
      
      This scales poorly, not only at device unregister time (what I
      caught during my netns dismantle stress tests), but also at packet
      processing time whenever trace_napi_poll_hit() is called.
      
      If the intent was to avoid adding one pointer in 'struct net_device'
      then surely we prefer O(1) behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 14 Feb 2022, 2 commits
    • net: dev: Makes sure netif_rx() can be invoked in any context. · baebdf48
      Sebastian Andrzej Siewior authored
      Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
      work in all contexts and get rid of netif_rx_ni()". Eric agreed and
      pointed out that modern devices should use netif_receive_skb() to avoid
      the overhead.
      In the meantime someone added another variant, netif_rx_any_context(),
      which behaves as suggested.
      
      netif_rx() must be invoked with disabled bottom halves to ensure that
      pending softirqs, which were raised within the function, are handled.
      netif_rx_ni() can be invoked only from process context (bottom halves
      must be enabled) because the function handles pending softirqs without
      checking if bottom halves were disabled or not.
      netif_rx_any_context() invokes one of the former functions based on the
      result of in_interrupt().
      
      netif_rx() could be taught to handle both cases (disabled and enabled
      bottom halves) by simply disabling bottom halves while invoking
      netif_rx_internal(). The local_bh_enable() invocation will then invoke
      pending softirqs only if the BH-disable counter drops to zero.
      
      Eric is concerned about the overhead of BH-disable+enable especially in
      regard to the loopback driver. As critical as this driver is, it will
      receive a shortcut to avoid the additional overhead which is not needed.
      
      Add a local_bh_disable() section in netif_rx() to ensure softirqs are
      handled if needed.
      Provide __netif_rx() which does not disable BH and has a lockdep assert
      to ensure that interrupts are disabled. Use this shortcut in the
      loopback driver and in drivers/net/*.c.
      Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
      can be removed once there are no users left.
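      
      A condensed sketch of the resulting pair of entry points, assuming
      netif_rx_internal() as the shared backend (tracing and statistics
      omitted; the lockdep assertion shown is one plausible form):
      
          /* Shortcut for callers that already run with BH (or IRQs) disabled,
           * e.g. the loopback driver. */
          int __netif_rx(struct sk_buff *skb)
          {
                  lockdep_assert_once(hardirq_count() | softirq_count());
                  return netif_rx_internal(skb);
          }
      
          /* Safe in any context: the BH-disable section guarantees that
           * softirqs raised here run when the count drops back to zero. */
          int netif_rx(struct sk_buff *skb)
          {
                  int ret;
      
                  local_bh_disable();
                  ret = netif_rx_internal(skb);
                  local_bh_enable();
                  return ret;
          }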
      
      Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: add __rcu annotation to netdev->qdisc · 5891cd5e
      Eric Dumazet authored
      syzbot found a data-race [1] which led me to add __rcu
      annotations to netdev->qdisc, and proper accessors
      to get LOCKDEP support.
      
      [1]
      BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu
      
      write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
       attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
       dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
       __dev_open+0x2e9/0x3a0 net/core/dev.c:1416
       __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
       rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
       __rtnl_newlink net/core/rtnetlink.c:3489 [inline]
       rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
       rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
       qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
       __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
       tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
       rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
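      
      A sketch of the accessor discipline this implies once the field is
      annotated "struct Qdisc __rcu *qdisc" (call sites taken from the traces
      above; the snippet below is illustrative, not a verbatim hunk):
      
          /* Writer, e.g. attach_default_qdiscs()/dev_activate(), RTNL held: */
          rcu_assign_pointer(dev->qdisc, new_qdisc);
      
          /* Lockless reader, e.g. qdisc_lookup_rcu(), under rcu_read_lock(): */
          struct Qdisc *q = rcu_dereference(dev->qdisc);
      
          /* RTNL-protected readers use the LOCKDEP-checked variant instead: */
          struct Qdisc *q2 = rtnl_dereference(dev->qdisc);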
      
      Fixes: 470502de ("net: sched: unlock rules update API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 09 Feb 2022, 1 commit
  20. 05 Feb 2022, 1 commit
    • net: refine dev_put()/dev_hold() debugging · 4c6c11ea
      Eric Dumazet authored
      We are still chasing some syzbot reports where we think a rogue dev_put()
      is called with no corresponding prior dev_hold().
      Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
      dev_hold_track(), meaning that the refcount saturation splat comes
      too late to be useful.
      
      Make sure that 'not tracked' dev_put() and dev_hold() better use
      CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:
      
      A prior patch in the series allowed ref_tracker_alloc() and
      ref_tracker_free() to be called with a NULL @trackerp parameter, and to
      use a separate refcount only to detect too many put() calls, even in the
      following case:
      
      dev_hold_track(dev, tracker_1, GFP_ATOMIC);
       dev_hold(dev);
       dev_put(dev);
       dev_put(dev); // Should complain loudly here.
      dev_put_track(dev, tracker_1); // instead of here
      
      Add clarification about netdev_tracker_alloc() role.
      
      v2: I replaced the dev_put() in linkwatch_do_dev()
          with __dev_put() because callers called netdev_tracker_free().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 20 Jan 2022, 1 commit
    • net: fix information leakage in /proc/net/ptype · 47934e06
      Congyu Liu authored
      In one net namespace, after creating a packet socket without binding
      it to a device, users in other net namespaces can observe the new
      `packet_type` added by this packet socket by reading the
      `/proc/net/ptype` file. This is a minor information leak, as packet
      sockets are namespace aware.
      
      Add a net pointer in `packet_type` to keep the net namespace of the
      corresponding packet socket. In `ptype_seq_show`, this net pointer is
      checked when it is not NULL.
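      
      A sketch of the resulting check; the field name af_packet_net is an
      assumption for illustration:
      
          /* In ptype_seq_show(): skip packet_type entries that belong to a
           * different net namespace than the one reading the file. */
          static bool ptype_visible_sketch(const struct packet_type *pt,
                                           struct seq_file *seq)
          {
                  return !pt->af_packet_net ||
                         net_eq(pt->af_packet_net, seq_file_net(seq));
          }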
      
      Fixes: 2feb27db ("[NETNS]: Minor information leak via /proc/net/ptype file.")
      Signed-off-by: Congyu Liu <liu3101@purdue.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 06 Jan 2022, 1 commit
  23. 31 Dec 2021, 1 commit
  24. 19 Dec 2021, 1 commit
  25. 17 Dec 2021, 1 commit
  26. 15 Dec 2021, 2 commits
  27. 10 Dec 2021, 1 commit
    • net: add networking namespace refcount tracker · 9ba74e6c
      Eric Dumazet authored
      We have 100+ syzbot reports about netns being dismantled too soon,
      still unresolved as of today.
      
      We think a missing get_net() or an extra put_net() is the root cause.
      
      In order to find the bug(s), and be able to spot future ones,
      this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
      to precisely pair all put_net() with corresponding get_net().
      
      To use these helpers, each data structure owning a refcount
      should also use a "netns_tracker" to pair the get and put.
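      
      A sketch of how a refcount-owning structure would use such a tracker; the
      helper names get_net_track()/put_net_track() are assumptions here, the
      point being that every get is paired with its put through the same
      tracker:
      
          struct foo_obj {
                  struct net      *net;
                  netns_tracker   ns_tracker;     /* records this reference */
          };
      
          static void foo_obj_init(struct foo_obj *foo, struct net *net)
          {
                  /* Assumed helper: get_net() plus tracker registration. */
                  foo->net = get_net_track(net, &foo->ns_tracker, GFP_KERNEL);
          }
      
          static void foo_obj_release(struct foo_obj *foo)
          {
                  /* Assumed helper: put_net() plus tracker release. */
                  put_net_track(foo->net, &foo->ns_tracker);
          }
      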
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  28. 08 Dec 2021, 1 commit
  29. 07 Dec 2021, 4 commits