1. 27 Mar 2020, 1 commit
  2. 20 Mar 2020, 5 commits
  3. 06 Mar 2020, 1 commit
    •
      veth: ignore peer tx_dropped when counting local rx_dropped · e25d5dbc
      Authored by Jiang Lidong
      When the local NET_RX backlog is full due to traffic overrun, the
      peer veth tx_dropped counter increases. Listing the local veth
      stats at that point shows rx_dropped at double the peer's
      tx_dropped, even larger than the number of packets the peer
      transmitted.
      
      In the NET_RX softirq, whenever a packet is dropped, the dev's
      rx_dropped counter is incremented and NET_RX_DROP is returned.

      On the veth tx side, any error returned from the peer's netif_rx
      is recorded in the local dev's tx_dropped counter.
      
      In the veth get-stats path, the local dev's rx_dropped and the
      peer dev's tx_dropped are added together as the local rx_dropped
      value, so in this case it reports double the real number of
      dropped packets.
      
      This patch ignores peer tx_dropped when counting local rx_dropped,
      since peer tx_dropped duplicates local rx_dropped in most cases.
      Signed-off-by: Jiang Lidong <jianglidong3@jd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e25d5dbc
  4. 27 Jan 2020, 1 commit
    •
      bpf, xdp: Remove no longer required rcu_read_{un}lock() · b23bfa56
      Authored by John Fastabend
      Now that we depend on call_rcu() and synchronize_rcu() to also wait
      for preempt-disabled regions to complete, the RCU read critical
      section in __dev_map_flush() is no longer required, except in a few
      special cases in drivers that need it for other reasons.
      
      These originally ensured the map reference was safe while a map was
      being freed, and additionally that bpf program updates via ndo_bpf
      did not happen while flush updates were in flight. But under the new
      rules flush can only be called from preempt-disabled NAPI context.
      The synchronize_rcu from the map free path and the call_rcu from the
      delete path ensure the reference there is safe. So let's remove the
      rcu_read_lock and rcu_read_unlock pair to avoid any confusion around
      how this is being protected.
      
      If the rcu_read_lock were required, it would mean there are errors
      in the above logic and the original patch would also be wrong.
      
      Having done the above, we put the rcu_read_lock in the driver code
      where it is needed, in a driver-dependent way. This helps
      readability, since we know where and why read locks are taken.
      Most drivers will not need rcu_read_locks here, and XDP drivers
      already take rcu_read_locks in their RX paths for reading xdp
      programs, so this makes things symmetric: we no longer have half of
      the RCU critical section defined in the driver and the other half
      in devmap.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
      b23bfa56
  5. 17 Jan 2020, 1 commit
    •
      xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Authored by Toke Høiland-Jørgensen
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      Since this leaves less reason to keep the non-map redirect code in a
      separate function, get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
      1d233886
  6. 08 Nov 2019, 1 commit
  7. 25 Jun 2019, 1 commit
    •
      veth: Support bulk XDP_TX · 9cda7807
      Authored by Toshiaki Makita
      XDP_TX is similar to XDP_REDIRECT in that it essentially redirects
      packets to the device itself. XDP_REDIRECT has a bulk transmit
      mechanism to avoid the heavy cost of indirect calls, and bulking
      also reduces lock acquisitions on destination devices that need
      locks, like veth and tun.

      XDP_TX does not use indirect calls, but drivers that require locks
      can benefit from bulk transmit for XDP_TX as well.

      This patch introduces a bulk transmit mechanism in veth using a
      bulk queue on the stack, and improves XDP_TX performance by about 9%.
      
      Here are single-core/single-flow XDP_TX test results. CPU consumptions
      are taken from "perf report --no-child".
      
      - Before:
      
        7.26 Mpps
      
        _raw_spin_lock  7.83%
        veth_xdp_xmit  12.23%
      
      - After:
      
        7.94 Mpps
      
        _raw_spin_lock  1.08%
        veth_xdp_xmit   6.10%
      
      v2:
      - Use stack for bulk queue instead of a global variable.
      Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9cda7807
  8. 19 Jun 2019, 1 commit
  9. 21 May 2019, 1 commit
  10. 13 Apr 2019, 1 commit
  11. 24 Feb 2019, 1 commit
    •
      veth: Fix -Wformat-truncation · abdf47aa
      Authored by Florian Fainelli
      Provide a precision hint to snprintf() in order to eliminate the
      -Wformat-truncation warning shown below. A maximum of 11 characters
      is allowed for the stat-name suffix in order to stay within 32 - 1
      characters, given that the queue number can be as large as UINT_MAX,
      which occupies 10 characters. Incidentally, 11 is the length of
      "xdp_packets", the largest string we append.
      
      drivers/net/veth.c: In function 'veth_get_strings':
      drivers/net/veth.c:118:47: warning: '%s' directive output may be
      truncated writing up to 31 bytes into a region of size between 12 and 21
      [-Wformat-truncation=]
           snprintf(p, ETH_GSTRING_LEN, "rx_queue_%u_%s",
                                                     ^~
      drivers/net/veth.c:118:5: note: 'snprintf' output between 12 and 52
      bytes into a destination of size 32
           snprintf(p, ETH_GSTRING_LEN, "rx_queue_%u_%s",
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             i, veth_rq_stats_desc[j].desc);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abdf47aa
  12. 09 Feb 2019, 1 commit
  13. 07 Nov 2018, 1 commit
  14. 16 Oct 2018, 3 commits
  15. 19 Sep 2018, 1 commit
  16. 17 Sep 2018, 1 commit
    •
      veth: Orphan skb before GRO · 4bf9ffa0
      Authored by Toshiaki Makita
      GRO expects skbs not to be owned by sockets, but when XDP is enabled
      veth passed skbs owned by sockets, which corrupted sk_wmem_alloc.
      
      Paolo Abeni reported the following splat:
      
      [  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
      [  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
      [  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
      [  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
      [  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
      [  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
      [  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
      [  362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
      [  362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
      [  362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
      [  362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
      [  362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
      [  362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
      [  362.265337] FS:  00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
      [  362.274363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
      [  362.288733] Call Trace:
      [  362.291459]  <IRQ>
      [  362.293702]  ex_handler_refcount+0x4e/0x80
      [  362.298269]  fixup_exception+0x35/0x40
      [  362.302451]  do_trap+0x109/0x150
      [  362.306048]  do_error_trap+0xd5/0x130
      [  362.315766]  invalid_op+0x14/0x20
      [  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
      [  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
      [  362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
      [  362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
      [  362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
      [  362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
      [  362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
      [  362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
      [  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
      [  362.399357]  veth_poll+0x4d/0x17a [veth]
      [  362.403731]  net_rx_action+0x2af/0x3f0
      [  362.407912]  __do_softirq+0xdd/0x29e
      [  362.411897]  do_softirq_own_stack+0x2a/0x40
      [  362.416561]  </IRQ>
      [  362.418899]  do_softirq+0x4b/0x70
      [  362.422594]  __local_bh_enable_ip+0x50/0x60
      [  362.427258]  ip_finish_output2+0x16a/0x390
      [  362.431824]  ip_output+0x71/0xe0
      [  362.440670]  __tcp_transmit_skb+0x583/0xab0
      [  362.445333]  tcp_write_xmit+0x247/0xfb0
      [  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
      [  362.454760]  tcp_sendmsg_locked+0x857/0xd30
      [  362.459424]  tcp_sendmsg+0x27/0x40
      [  362.463216]  sock_sendmsg+0x36/0x50
      [  362.467104]  sock_write_iter+0x87/0x100
      [  362.471382]  __vfs_write+0x112/0x1a0
      [  362.475369]  vfs_write+0xad/0x1a0
      [  362.479062]  ksys_write+0x52/0xc0
      [  362.482759]  do_syscall_64+0x5b/0x180
      [  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  362.492473] RIP: 0033:0x7f1d22293238
      [  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
      [  362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
      [  362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
      [  362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
      [  362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
      [  362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260
      
      In order to avoid this, orphan the skb before entering GRO.
      
      Fixes: 948d4f21 ("veth: Add driver XDP")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Tested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bf9ffa0
  17. 01 Sep 2018, 1 commit
  18. 17 Aug 2018, 1 commit
    •
      veth: Free queues on link delete · 7797b93b
      Authored by Toshiaki Makita
      David Ahern reported memory leak in veth.
      
      =======================================================================
      $ cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800354d5c00 (size 1024):
        comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
          [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
          [<(____ptrval____)>] __kmalloc+0x101/0x142
          [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
          [<(____ptrval____)>] veth_newlink+0x147/0x3ac [veth]
          ...
      unreferenced object 0xffff88002e009c00 (size 1024):
        comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
          [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
          [<(____ptrval____)>] __kmalloc+0x101/0x142
          [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
          [<(____ptrval____)>] veth_newlink+0x219/0x3ac [veth]
      =======================================================================
      
      The veth_rq array allocated in veth_newlink() was not freed on dellink.

      We need to free the queues after veth_close() so that no packets
      reference them afterwards; thus free them in veth_dev_free(), the
      same way the stats structure (vstats) is freed.

      Also move the queue allocation to veth_dev_init() to be in line
      with the stats allocation.
      
      Fixes: 638264dc ("veth: Support per queue XDP ring")
      Reported-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7797b93b
  19. 10 Aug 2018, 6 commits
    •
      veth: Support per queue XDP ring · 638264dc
      Authored by Toshiaki Makita
      Move XDP and napi related fields from veth_priv to newly created veth_rq
      structure.
      
      When xdp_frames are enqueued from ndo_xdp_xmit or XDP_TX, the rxq
      is selected by the current cpu.

      When skbs are enqueued from the peer device, the rxq is a
      one-to-one mapping of its peer's txq. This imposes the restriction
      that the number of rxqs must not be less than the number of peer
      txqs, but leaves open the possibility of bulk skb xmit in the
      future, because the txq lock would make it possible to remove the
      rxq ptr_ring lock.
      
      v3:
      - Add extack messages.
      - Fix array overrun in veth_xmit.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      638264dc
    •
      veth: Add XDP TX and REDIRECT · d1396004
      Authored by Toshiaki Makita
      This allows further redirection of xdp_frames like
      
       NIC   -> veth--veth -> veth--veth
       (XDP)          (XDP)         (XDP)
      
      The intermediate XDP program, redirecting packets from the NIC to
      the other veth, reuses xdp_mem_info from the NIC so that page
      recycling of the NIC works on the destination veth's XDP.
      As a result, return_frame is not fully guarded by NAPI, since
      another NAPI handler on another cpu may use the same xdp_mem_info
      concurrently. Thus disable napi_direct via
      xdp_set_return_frame_no_direct() during the NAPI context.
      
      v8:
      - Don't use xdp_frame pointer address for data_hard_start of xdp_buff.
      
      v4:
      - Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
        xdp_mem_info.
      
      v3:
      - Fix double free when veth_xdp_tx() returns a positive value.
      - Convert xdp_xmit and xdp_redir variables into flags.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d1396004
    •
      veth: Add ndo_xdp_xmit · af87a3aa
      Authored by Toshiaki Makita
      This allows a NIC's XDP to redirect packets to veth. The destination
      veth device enqueues redirected packets onto the napi ring of its
      peer, and they are then processed by XDP on the peer veth device.
      When the peer enables driver XDP, this can be thought of as one XDP
      program calling another via REDIRECT.

      Note that when the peer veth device does not have driver xdp set,
      redirected packets are dropped because the peer is not ready for NAPI.
      
      v4:
      - Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
        Add comments about it and check only MTU.
      
      v2:
      - Drop the part converting xdp_frame into skb when XDP is not enabled.
      - Implement bulk interface of ndo_xdp_xmit.
      - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      af87a3aa
    •
      veth: Handle xdp_frames in xdp napi ring · 9fc8d518
      Authored by Toshiaki Makita
      This is preparation for XDP TX and ndo_xdp_xmit.
      It allows the napi handler to handle xdp_frames through the xdp
      ring as well as sk_buffs.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head and
        headroom.
      
      v7:
      - Use xdp_scrub_frame() instead of memset().
      
      v3:
      - Revert v2 change around rings and use a flag to differentiate skb and
        xdp_frame, since bulk skb xmit makes little performance difference
        for now.
      
      v2:
      - Use another ring instead of using flag to differentiate skb and
        xdp_frame. This approach makes bulk skb transmit possible in
        veth_xmit later.
      - Clear xdp_frame fields in skb->head.
      - Implement adjust_tail.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9fc8d518
    •
      veth: Avoid drops by oversized packets when XDP is enabled · dc224822
      Authored by Toshiaki Makita
      Oversized packets, including GSO packets, can be dropped if XDP is
      enabled on the receiver side, so don't send such packets from the peer.

      Drop the TSO and SCTP fragmentation features so that veth devices
      themselves segment packets when XDP is enabled, and cap the MTU
      accordingly.
      
      v4:
      - Don't auto-adjust MTU but cap max MTU.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      dc224822
    •
      veth: Add driver XDP · 948d4f21
      Authored by Toshiaki Makita
      This is the basic implementation of veth driver XDP.
      
      Incoming packets are sent from the peer veth device in the form of skb,
      so this is generally doing the same thing as generic XDP.
      
      This itself is not so useful, but a starting point to implement other
      useful veth XDP features like TX and REDIRECT.
      
      This introduces NAPI when XDP is enabled, because XDP now heavily
      relies on the NAPI context. A ptr_ring is used to emulate the NIC
      ring: the tx function enqueues packets to the ring and the peer's
      NAPI handler drains it.

      Currently only one ring is allocated for each veth device, so this
      does not scale in a multiqueue environment. That can be resolved
      later by allocating rings on a per-queue basis.

      Note that when XDP is not loaded, netif_rx is used rather than
      NAPI, so the default behaviour is unchanged.
      
      v6:
      - Check skb->len only when allocation is needed.
      - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
        events.
      
      v3:
      - Fix race on closing the device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squashed with the patch adding NAPI.
      - Implement adjust_tail.
      - Don't acquire consumer lock because it is guarded by NAPI.
      - Make poll_controller noop since it is unnecessary.
      - Register rxq_info on enabling XDP rather than on opening the device.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      948d4f21
  20. 09 Dec 2017, 1 commit
  21. 27 Jun 2017, 2 commits
  22. 22 Jun 2017, 1 commit
    •
      veth: Be more robust on network device creation when no attributes · 191cdb38
      Authored by Serhey Popovych
      There are a number of problems with configuring the peer network
      device in the absence of IFLA_VETH_PEER attributes, where the
      attributes of the main network device are shared with the peer.

      First, it is not feasible to configure both network devices with
      the same MAC address, since that makes communication in such a
      configuration problematic.

      This case can be reproduced with the following sequence:
      
        # ip link add address 02:11:22:33:44:55 type veth
        # ip li sh
        ...
        26: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
        noop state DOWN mode DEFAULT qlen 1000
            link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
        27: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
        noop state DOWN mode DEFAULT qlen 1000
            link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
      
      Second, it is not possible to register both the main and peer
      network devices with the same name. That happens when the name for
      the main interface is given with IFLA_IFNAME and the same attribute
      is reused for the peer.
      
      This case can be reproduced with the following sequence:
      
        # ip link add dev veth1a type veth
        RTNETLINK answers: File exists
      
      To fix both cases, take the corresponding netlink attributes from
      peer_tb when valid; otherwise fall back to a name based on the rtnl
      ops kind and a random address.
      Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      191cdb38
  23. 08 Jun 2017, 1 commit
    •
      net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      Authored by David S. Miller
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().

      This creates a problem for the logic in register_netdevice(),
      because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf124db5
  24. 14 Apr 2017, 1 commit
  25. 30 Mar 2017, 1 commit
  26. 09 Jan 2017, 1 commit
  27. 21 Oct 2016, 1 commit
    •
      net: use core MTU range checking in core net infra · 91572088
      Authored by Jarod Wilson
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91572088
  28. 31 Aug 2016, 1 commit