1. 25 Feb 2021, 1 commit
    • net: introduce CAN specific pointer in the struct net_device · 4e096a18
      Committed by Oleksij Rempel
      Since 20dd3850 ("can: Speed up CAN frame receiption by using
      ml_priv") the CAN framework uses per-device specific data in the AF_CAN
      protocol. For this purpose, struct net_device->ml_priv is used. Later
      the ml_priv usage in CAN was extended for other users, one of them being
      CAN_J1939.
      
      Later, ml_priv was converted to a union in the kernel and used by other
      drivers; e.g. the tun driver started storing its stats pointer there.
      
      Since tun devices can claim to be a CAN device, CAN-specific protocols
      will wrongly interpret this pointer, which will cause system crashes.
      This issue is mostly visible in the CAN_J1939 stack.
      
      To fix this issue, we request a dedicated CAN pointer within the
      net_device struct.
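
      A minimal sketch of the idea, assuming illustrative member and helper
      names (the exact upstream naming may differ):

          /* A dedicated, type-safe CAN pointer in struct net_device, so CAN
           * protocols no longer have to trust the shared ml_priv union. */
          struct net_device {
                  /* ... other members ... */
          #if IS_ENABLED(CONFIG_CAN)
                  struct can_ml_priv *can_ml_priv; /* set only by real CAN devices */
          #endif
          };

          /* CAN code reads the dedicated pointer instead of casting ml_priv;
           * a tun device merely claiming ARPHRD_CAN yields NULL here. */
          static inline struct can_ml_priv *can_get_ml_priv(struct net_device *dev)
          {
                  return dev->type == ARPHRD_CAN ? dev->can_ml_priv : NULL;
          }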
      
      Reported-by: syzbot+5138c4dd15a0401bec7b@syzkaller.appspotmail.com
      Fixes: 20dd3850 ("can: Speed up CAN frame receiption by using ml_priv")
      Fixes: ffd956ee ("can: introduce CAN midlayer private and allocate it automatically")
      Fixes: 9d71dd0c ("can: add support of SAE J1939 protocol")
      Fixes: 497a5757 ("tun: switch to net core provided statistics counters")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Link: https://lore.kernel.org/r/20210223070127.4538-1-o.rempel@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2. 13 Feb 2021, 1 commit
3. 12 Feb 2021, 1 commit
    • net: fix dev_ifsioc_locked() race condition · 3b23a32a
      Committed by Cong Wang
      dev_ifsioc_locked() is called with only RCU read lock, so when
      there is a parallel writer changing the mac address, it could
      get a partially updated mac address, as shown below:
      
      Thread 1			Thread 2
      // eth_commit_mac_addr_change()
      memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
      				// dev_ifsioc_locked()
      				memcpy(ifr->ifr_hwaddr.sa_data,
      					dev->dev_addr,...);
      
      Close this race condition by guarding them with an RW semaphore,
      like netdev_get_name() does. We cannot use a seqlock here as it does
      not allow blocking. The writers already take RTNL anyway, so this does
      not affect the slow path. To avoid bothering existing
      dev_set_mac_address() callers in drivers, introduce a new wrapper
      just for user-facing callers on the ioctl and rtnetlink paths.
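
      A sketch of the locking scheme under the names used by this patch
      (treat the exact signatures as an approximation):

          static DECLARE_RWSEM(dev_addr_sem);

          /* reader side: the ioctl path copying dev->dev_addr */
          int dev_get_mac_address(struct sockaddr *sa, struct net *net,
                                  char *dev_name)
          {
                  int ret = 0;

                  down_read(&dev_addr_sem);
                  /* ... look up the device and memcpy() its address ... */
                  up_read(&dev_addr_sem);
                  return ret;
          }

          /* user-facing wrapper for the ioctl and rtnetlink writers, so
           * in-kernel dev_set_mac_address() callers stay untouched */
          int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
                                       struct netlink_ext_ack *extack)
          {
                  int ret;

                  down_write(&dev_addr_sem);
                  ret = dev_set_mac_address(dev, sa, extack);
                  up_write(&dev_addr_sem);
                  return ret;
          }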
      
      Note, bonding also changes slave mac addresses but that requires
      a separate patch due to the complexity of bonding code.
      
      Fixes: 3710becf ("net: RCU locking for simple ioctl()")
      Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
4. 10 Feb 2021, 2 commits
5. 09 Feb 2021, 1 commit
6. 29 Jan 2021, 1 commit
    • net: adjust net_device layout for cacheline usage · 28af22c6
      Committed by Jesper Dangaard Brouer
      The current layout of net_device is not optimal for cacheline usage.
      
      The member adj_list.lower linked list is split between cacheline 2 and 3.
      The ifindex is placed together with stats (struct net_device_stats),
      although most modern drivers don't update this stats member.
      
      The members netdev_ops, mtu and hard_header_len are placed on three
      different cachelines. These members are accessed for XDP redirect into
      a devmap, which was noticeable with the perf tool. When not using the
      map redirect variant (as TC-BPF does), ifindex is also used, and it is
      placed on a separate fourth cacheline. These members are also accessed
      during forwarding with the regular network stack. The members priv_flags
      and flags are on the fast path of the network stack transmit path in
      __dev_queue_xmit (currently located together with the mtu cacheline).
      
      This patch creates a read-mostly cacheline, with the purpose of keeping
      the above mentioned members on the same cacheline.
      
      Some netdev_features_t members also become part of this cacheline, which
      is on purpose, as the function netif_skb_features() is on the fast path
      via validate_xmit_skb().
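
      A rough sketch of the resulting layout, with illustrative member order
      (not the exact upstream placement):

          struct net_device {
                  /* ... */
                  /* --- read-mostly cacheline for the TX/forwarding fast path --- */
                  unsigned int                 flags;
                  unsigned int                 priv_flags;
                  const struct net_device_ops *netdev_ops;
                  int                          ifindex;
                  unsigned short               hard_header_len;
                  unsigned int                 mtu;
                  netdev_features_t            features; /* netif_skb_features() */
                  /* ... write-mostly and slow-path members follow ... */
          };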
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/r/161168277983.410784.12401225493601624417.stgit@firesoul
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7. 23 Jan 2021, 1 commit
    • sch_htb: Hierarchical QoS hardware offload · d03b195b
      Committed by Maxim Mikityanskiy
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
          classid 1:10
      
      It's possible to move classification to clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80
          action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and doesn't
      require changes to either the kernel or iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue, it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
      
      Some features of HTB may not be supported by particular hardware, for
      example, the maximum number of classes may be limited, the granularity
      of the rate and ceil parameters may differ, etc., so the offload is not
      enabled by default; a new parameter is used to enable it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
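
      A sketch of how a driver might dispatch these commands in ndo_setup_tc;
      the command names follow the offload API this patch adds, while the
      mydrv_* callbacks are hypothetical:

          static int mydrv_setup_tc(struct net_device *dev, enum tc_setup_type type,
                                    void *type_data)
          {
                  struct tc_htb_qopt_offload *htb = type_data;

                  if (type != TC_SETUP_QDISC_HTB)
                          return -EOPNOTSUPP;

                  switch (htb->command) {
                  case TC_HTB_CREATE:
                          return mydrv_htb_root_add(dev);        /* hypothetical */
                  case TC_HTB_DESTROY:
                          return mydrv_htb_root_del(dev);        /* hypothetical */
                  case TC_HTB_LEAF_ALLOC_QUEUE:
                          /* replicate a leaf class (rate, ceil) as a NIC send queue */
                          return mydrv_htb_leaf_alloc(dev, htb); /* hypothetical */
                  default:
                          return -EOPNOTSUPP;
                  }
          }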
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8. 20 Jan 2021, 1 commit
    • bonding: add a vlan+srcmac tx hashing option · 7b8fc010
      Committed by Jarod Wilson
      This comes from an end-user request, where they're running multiple VMs on
      hosts with bonded interfaces connected to some interesting switch
      topologies where 802.3ad isn't an option. They're currently running a
      proprietary solution that effectively achieves load balancing of VMs and
      bandwidth utilization improvements with a similar form of transmission
      algorithm.
      
      Basically, each VM has its own vlan, so it always sends its traffic out
      the same interface, unless that interface fails. Traffic gets split
      between the interfaces, maintaining a consistent path, with failover still
      available if an interface goes down.
      
      Unlike bond_eth_hash(), this hash function is using the full source MAC
      address instead of just the last byte, as there are so few components to
      the hash, and in the no-vlan case, we would be returning just the last
      byte of the source MAC as the hash value. It's entirely possible to have
      two NICs in a bond with the same last byte of their MAC, but not the same
      MAC, so this adjustment should guarantee distinct hashes in all cases.
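
      A sketch of the resulting hash, close to the patch (the byte folding is
      shown explicitly):

          static u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
          {
                  struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
                  u32 srcmac_vendor = 0, srcmac_dev = 0;
                  u16 vlan;
                  int i;

                  /* fold the full 6-byte source MAC, not just its last byte */
                  for (i = 0; i < 3; i++)
                          srcmac_vendor = (srcmac_vendor << 8) | mac_hdr->h_source[i];
                  for (i = 3; i < ETH_ALEN; i++)
                          srcmac_dev = (srcmac_dev << 8) | mac_hdr->h_source[i];

                  if (!skb_vlan_tag_present(skb))
                          return srcmac_vendor ^ srcmac_dev;

                  vlan = skb_vlan_tag_get(skb);
                  return vlan ^ srcmac_vendor ^ srcmac_dev;
          }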
      
      This has been rudimentarily tested to provide similar results to the
      proprietary solution it aims to replace. A patch for iproute2 has also
      been posted, to properly support the new mode there as well.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Thomas Davis <tadavis@lbl.gov>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9. 19 Jan 2021, 1 commit
10. 10 Jan 2021, 1 commit
11. 08 Jan 2021, 1 commit
12. 17 Dec 2020, 1 commit
13. 02 Dec 2020, 1 commit
14. 01 Dec 2020, 1 commit
    • net: Introduce preferred busy-polling · 7fd3253a
      Committed by Björn Töpel
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      opportunistic: if the NAPI context is not already scheduled, it will be
      polled. If the budget is exceeded after busy-polling, the busy-polling
      logic will schedule the NAPI onto the regular softirq handling.
      
      One implication of the behavior above is that a busy/heavily loaded NAPI
      context will never enter or allow busy-polling. Some applications
      prefer that most NAPI processing be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. These knobs were introduced in commit 6f8b12d6 ("net: napi:
      add hard irqs deferral feature") and allow a user to defer enabling
      interrupts and instead schedule the NAPI context from a watchdog timer.
      When a user enables SO_PREFER_BUSY_POLL, again with the other knobs set,
      and the NAPI context is being processed by a softirq, the softirq NAPI
      processing will exit early to allow the busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will timeout, and
      regular softirq handling will resume.
      
      In summary: heavy-traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window, otherwise the watchdog will timeout and fall back to regular
      softirq processing.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
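
      A minimal userspace sketch, assuming the uapi value of
      SO_PREFER_BUSY_POLL at the time of this patch (69) where libc headers
      lack the definition:

          #include <sys/socket.h>

          #ifndef SO_PREFER_BUSY_POLL
          #define SO_PREFER_BUSY_POLL 69
          #endif

          static int enable_prefer_busy_poll(int fd)
          {
                  int one = 1;
                  int usecs = 50; /* example per-syscall busy-read budget */

                  if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                                 &usecs, sizeof(usecs)) < 0)
                          return -1;

                  return setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                                    &one, sizeof(one));
          }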
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
15. 25 Nov 2020, 1 commit
16. 24 Nov 2020, 2 commits
17. 18 Nov 2020, 1 commit
18. 10 Nov 2020, 1 commit
19. 01 Nov 2020, 2 commits
20. 14 Oct 2020, 1 commit
21. 12 Oct 2020, 1 commit
    • bpf: Add redirect_peer helper · 9aa1206e
      Committed by Daniel Borkmann
      Add an efficient ingress-to-ingress netns switch that can be used from tc
      BPF programs in order to redirect traffic from host-ns ingress into a
      container veth device's ingress without having to go via the CPU backlog
      queue [0]. For local containers this can also be utilized, and the path
      via the CPU backlog queue only needs to be taken once, not twice. On a
      high level this borrows from ipvlan, which does a similar switch in
      __netif_receive_skb_core() and then iterates via another_round.
      This helps to reduce latency for the mentioned use cases.
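
      A sketch of a tc BPF program using the helper; PEER_IFINDEX is an
      assumed example value for the target device:

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          #define PEER_IFINDEX 42 /* example: the veth peer's ifindex */

          SEC("tc")
          int tc_redirect_peer(struct __sk_buff *skb)
          {
                  /* switch into the peer netns device's ingress without a
                   * trip through the CPU backlog queue */
                  return bpf_redirect_peer(PEER_IFINDEX, 0);
          }

          char LICENSE[] SEC("license") = "GPL";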
      
      Pod to remote pod with redirect(), TCP_RR [1]:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
              MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
            STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
               MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
               P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
               P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
               P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
      
          TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
      
      Pod to remote pod with redirect_peer(), TCP_RR:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
              MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
            STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
               MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
               P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
               P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
               P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
      
          TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
      
        [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
  [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
22. 06 Oct 2020, 1 commit
23. 04 Oct 2020, 1 commit
24. 03 Oct 2020, 1 commit
25. 30 Sep 2020, 1 commit
    • net: Add netif_rx_any_context() · c11171a4
      Committed by Sebastian Andrzej Siewior
      Quite a few drivers make conditional decisions based on in_interrupt() to
      invoke either netif_rx() or netif_rx_ni().
      
      Conditionals based on in_interrupt() or other variants of preempt-count
      checks should not exist in drivers for various reasons, and Linus clearly
      requested to either split the code paths or pass an argument to the
      common functions that provides the context.
      
      This is obviously the correct solution, but for some of the affected
      drivers this needs a major rewrite due to their convoluted structure.
      
      As in_interrupt() usage in drivers needs to be phased out, provide
      netif_rx_any_context() as a stopgap for these drivers.
      
      This confines the in_interrupt() conditional to core code, which in turn
      allows removing this check from driver code and provides one central
      place for further modifications once the driver maze is cleaned up.
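
      Sketched, the stop-gap helper is essentially just the confined
      conditional:

          int netif_rx_any_context(struct sk_buff *skb)
          {
                  /* the single remaining in_interrupt() check, in core code */
                  if (in_interrupt())
                          return netif_rx(skb);
                  else
                          return netif_rx_ni(skb);
          }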
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
26. 29 Sep 2020, 2 commits
    • net: core: add nested_level variable in net_device · 1fc70edb
      Committed by Taehee Yoo
      This patch adds a new variable 'nested_level' to the net_device
      structure.
      This variable will be used as a parameter of spin_lock_nested() for
      dev->addr_list_lock.
      
      netif_addr_lock() can be called recursively, so spin_lock_nested() is
      used instead of spin_lock(), and dev->lower_level is used as its
      parameter.
      But the dev->lower_level value can be updated while it is being used,
      so lockdep warns about a possible deadlock scenario.
      
      When a stacked interface is deleted, netif_{uc | mc}_sync() is called
      recursively, so spin_lock_nested() is called recursively too.
      At this moment, the dev->lower_level variable is used as its parameter.
      dev->lower_level is updated immediately when interfaces are being
      unlinked/linked.
      Thus, after unlinking, dev->lower_level shouldn't be used as a parameter
      of spin_lock_nested().
      
          A (macvlan)
          |
          B (vlan)
          |
          C (bridge)
          |
          D (macvlan)
          |
          E (vlan)
          |
          F (bridge)
      
          A->lower_level : 6
          B->lower_level : 5
          C->lower_level : 4
          D->lower_level : 3
          E->lower_level : 2
          F->lower_level : 1
      
      When an interface 'A' is removed, it releases resources.
      At this moment, netif_addr_lock() would be called.
      Then, netdev_upper_dev_unlink() is called recursively.
      Then dev->lower_level is updated.
      There is no problem.
      
      But when the bridge module is removed, the 'C' and 'F' interfaces
      are removed at once.
      If 'F' is removed first, the lower_level values look like this:
          A->lower_level : 5
          B->lower_level : 4
          C->lower_level : 3
          D->lower_level : 2
          E->lower_level : 1
          F->lower_level : 1
      
      Then 'C' is removed. At this moment, netif_addr_lock() is called
      recursively.
      The ordering is like this:
      C(3)->D(2)->E(1)->F(1)
      At this moment, the lower_level values of 'E' and 'F' are the same,
      so lockdep warns about a possible deadlock scenario.
      
      In order to avoid this problem, a new variable 'nested_level' is added.
      Its value is the same as dev->lower_level - 1, but it is only updated
      in rtnl_unlock(). So, this variable can be used as a parameter of
      spin_lock_nested() safely in the rtnl context.
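
      A sketch of the lock site (helper shown in simplified form):

          static inline void netif_addr_lock(struct net_device *dev)
          {
                  /* dev->nested_level only changes in rtnl_unlock(), so it
                   * is stable here, unlike dev->lower_level while
                   * interfaces are being unlinked */
                  spin_lock_nested(&dev->addr_list_lock, dev->nested_level);
          }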
      
      Test commands:
         ip link add br0 type bridge vlan_filtering 1
         ip link add vlan1 link br0 type vlan id 10
         ip link add macvlan2 link vlan1 type macvlan
         ip link add br3 type bridge vlan_filtering 1
         ip link set macvlan2 master br3
         ip link add vlan4 link br3 type vlan id 10
         ip link add macvlan5 link vlan4 type macvlan
         ip link add br6 type bridge vlan_filtering 1
         ip link set macvlan5 master br6
         ip link add vlan7 link br6 type vlan id 10
         ip link add macvlan8 link vlan7 type macvlan
      
         ip link set br0 up
         ip link set vlan1 up
         ip link set macvlan2 up
         ip link set br3 up
         ip link set vlan4 up
         ip link set macvlan5 up
         ip link set br6 up
         ip link set vlan7 up
         ip link set macvlan8 up
         modprobe -rv bridge
      
      Splat looks like:
      [   36.057436][  T744] WARNING: possible recursive locking detected
      [   36.058848][  T744] 5.9.0-rc6+ #728 Not tainted
      [   36.059959][  T744] --------------------------------------------
      [   36.061391][  T744] ip/744 is trying to acquire lock:
      [   36.062590][  T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30
      [   36.064922][  T744]
      [   36.064922][  T744] but task is already holding lock:
      [   36.066626][  T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.068851][  T744]
      [   36.068851][  T744] other info that might help us debug this:
      [   36.070731][  T744]  Possible unsafe locking scenario:
      [   36.070731][  T744]
      [   36.072497][  T744]        CPU0
      [   36.073238][  T744]        ----
      [   36.074007][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.075290][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.076590][  T744]
      [   36.076590][  T744]  *** DEADLOCK ***
      [   36.076590][  T744]
      [   36.078515][  T744]  May be due to missing lock nesting notation
      [   36.078515][  T744]
      [   36.080491][  T744] 3 locks held by ip/744:
      [   36.081471][  T744]  #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490
      [   36.083614][  T744]  #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.085942][  T744]  #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80
      [   36.088400][  T744]
      [   36.088400][  T744] stack backtrace:
      [   36.089772][  T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728
      [   36.091364][  T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [   36.093630][  T744] Call Trace:
      [   36.094416][  T744]  dump_stack+0x77/0x9b
      [   36.095385][  T744]  __lock_acquire+0xbc3/0x1f40
      [   36.096522][  T744]  lock_acquire+0xb4/0x3b0
      [   36.097540][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.098657][  T744]  ? rtmsg_ifinfo+0x1f/0x30
      [   36.099711][  T744]  ? __dev_notify_flags+0xa5/0xf0
      [   36.100874][  T744]  ? rtnl_is_locked+0x11/0x20
      [   36.101967][  T744]  ? __dev_set_promiscuity+0x7b/0x1a0
      [   36.103230][  T744]  _raw_spin_lock_bh+0x38/0x70
      [   36.104348][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.105461][  T744]  dev_set_rx_mode+0x19/0x30
      [   36.106532][  T744]  dev_set_promiscuity+0x36/0x50
      [   36.107692][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.108929][  T744]  dev_set_promiscuity+0x1e/0x50
      [   36.110093][  T744]  br_port_set_promisc+0x1f/0x40 [bridge]
      [   36.111415][  T744]  br_manage_promisc+0x8b/0xe0 [bridge]
      [   36.112728][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.113967][  T744]  ? __hw_addr_sync_one+0x23/0x50
      [   36.115135][  T744]  __dev_set_rx_mode+0x68/0x90
      [   36.116249][  T744]  dev_uc_sync+0x70/0x80
      [   36.117244][  T744]  dev_uc_add+0x50/0x60
      [   36.118223][  T744]  macvlan_open+0x18e/0x1f0 [macvlan]
      [   36.119470][  T744]  __dev_open+0xd6/0x170
      [   36.120470][  T744]  __dev_change_flags+0x181/0x1d0
      [   36.121644][  T744]  dev_change_flags+0x23/0x60
      [   36.122741][  T744]  do_setlink+0x30a/0x11e0
      [   36.123778][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.124929][  T744]  ? __nla_validate_parse.part.6+0x45/0x8e0
      [   36.126309][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.127457][  T744]  __rtnl_newlink+0x546/0x8e0
      [   36.128560][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.129623][  T744]  ? deactivate_slab.isra.85+0x6a1/0x850
      [   36.130946][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.132102][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.133176][  T744]  ? is_bpf_text_address+0x5/0xe0
      [   36.134364][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.135445][  T744]  ? rcu_read_lock_sched_held+0x32/0x60
      [   36.136771][  T744]  ? kmem_cache_alloc_trace+0x2d8/0x380
      [   36.138070][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.139164][  T744]  rtnl_newlink+0x47/0x70
      [ ... ]
      
      Fixes: 845e0ebb ("net: change addr_list_lock back to static key")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: introduce struct netdev_nested_priv for nested interface infrastructure · eff74233
      Committed by Taehee Yoo
      Functions related to the nested interface infrastructure, such as
      netdev_walk_all_{ upper | lower }_dev(), pass both private callback
      functions and a "data" pointer to handle their own state.
      At this point, the data pointer type is void *.
      In order to make it easier to add common variables and functions,
      the new netdev_nested_priv structure is added.
      
      In the following patch, a new member variable will be added into this
      struct to fix the lockdep issue.
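
      A sketch of the structure and a hypothetical walk callback using it:

          struct netdev_nested_priv {
                  unsigned char flags; /* room for common state (used by the
                                        * follow-up lockdep fix) */
                  void *data;          /* the old opaque per-caller pointer */
          };

          /* hypothetical callback for netdev_walk_all_lower_dev() */
          static int count_lower_cb(struct net_device *lower,
                                    struct netdev_nested_priv *priv)
          {
                  unsigned long *count = priv->data;

                  (*count)++;
                  return 0; /* non-zero would stop the walk */
          }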
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
27. 19 Sep 2020, 1 commit
28. 18 Sep 2020, 1 commit
29. 11 Sep 2020, 2 commits
    • net: manage napi add/del idempotence explicitly · 4d092dd2
      Committed by Jakub Kicinski
      To RCUify napi->dev_list we need to replace list_del_init()
      with list_del_rcu(). There is no _init() version for RCU, for
      obvious reasons. Up until now netif_napi_del() was idempotent,
      so to make sure it remains such, add a bit which is set when the
      NAPI is listed and cleared when it is removed. Since we don't
      expect multiple calls to netif_napi_add() to be correct,
      add a warning on that side.
      
      Now that napi_hash_add / napi_hash_del are only called by
      netif_napi_add / del we can actually steal their bit. We just need
      to make sure the hash node is initialized correctly.
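
      A sketch of the idempotence mechanism (the bit name follows this patch;
      surrounding details are elided):

          void netif_napi_del(struct napi_struct *napi)
          {
                  /* only the first deletion actually unlinks the NAPI */
                  if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
                          return;

                  napi_hash_del(napi);
                  list_del_rcu(&napi->dev_list); /* RCU readers may still see it */
          }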
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove napi_hash_del() from driver-facing API · 5198d545
      Committed by Jakub Kicinski
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
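
      A sketch of the restructured pair: the common case pays the RCU wait,
      while drivers that batch grace periods call the double-underscore
      variant and synchronize later themselves:

          void __netif_napi_del(struct napi_struct *napi); /* no RCU wait */

          void netif_napi_del(struct napi_struct *napi)
          {
                  __netif_napi_del(napi);
                  synchronize_net(); /* wait out RCU readers of the lists */
          }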
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a leftover rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
30. 08 Sep 2020, 2 commits
31. 01 Sep 2020, 1 commit
32. 28 Aug 2020, 1 commit
    • net: add option to not create fall-back tunnels in root-ns as well · 316cdaa1
      Committed by Mahesh Bandewar
      Commit 79134e6c ("net: do not create fallback tunnels for
      non-default namespaces") added a sysctl to create fallback tunnels
      only in the root netns. This patch extends that behavior with an
      option to not create fallback tunnels in the root netns either. Since
      the modules that create fallback tunnels can be built in, and setting
      the sysctl value after booting would be pointless, a kernel command-line
      option is added to change this default. The default setting is preserved
      for backward compatibility. The kernel command-line option
      fb_tunnels=initns sets the sysctl value to 1 and creates fallback
      tunnels only in the init netns, while fb_tunnels=none sets the sysctl
      value to 2 and skips fallback tunnels in every netns.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Zenczykowski <maze@google.com>
      Cc: Jian Yang <jianyang@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
33. 27 Aug 2020, 1 commit
34. 05 Aug 2020, 1 commit
    • ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Committed by Stefano Brivio
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with a MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern); see the
        sketch below
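
      A sketch of that helper, per this patch:

          static inline bool netif_is_any_bridge_port(const struct net_device *dev)
          {
                  return netif_is_bridge_port(dev) || netif_is_ovs_port(dev);
          }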
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>