1. 01 Nov 2017, 1 commit
    • tun/tap: sanitize TUNSETSNDBUF input · 93161922
      Authored by Craig Gallek
      Syzkaller found several variants of the lockup below by setting negative
      values with the TUNSETSNDBUF ioctl.  This patch adds a sanity check
      to both the tun and tap versions of this ioctl.
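
      For illustration, a minimal sketch of such a check in the TUNSETSNDBUF
      handler; the surrounding locals (sndbuf, argp, ret) and the exact
      placement in the ioctl switch are assumptions here, not a quote of the
      patch:

      case TUNSETSNDBUF:
              if (copy_from_user(&sndbuf, argp, sizeof(sndbuf))) {
                      ret = -EFAULT;
                      break;
              }
              /* Reject zero/negative sizes so sock_alloc_send_pskb() cannot
               * end up spinning on a send buffer that can never have room. */
              if (sndbuf <= 0) {
                      ret = -EINVAL;
                      break;
              }
              tun->sndbuf = sndbuf;
              /* ... propagate the new value to the per-queue sockets ... */
              break;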
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389]
        Modules linked in:
        irq event stamp: 329692056
        hardirqs last  enabled at (329692055): [<ffffffff824b8381>] _raw_spin_unlock_irqrestore+0x31/0x75
        hardirqs last disabled at (329692056): [<ffffffff824b9e58>] apic_timer_interrupt+0x98/0xb0
        softirqs last  enabled at (35659740): [<ffffffff824bc958>] __do_softirq+0x328/0x48c
        softirqs last disabled at (35659731): [<ffffffff811c796c>] irq_exit+0xbc/0xd0
        CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff880009452140 task.stack: ffff880006a20000
        RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80
        RSP: 0018:ffff880006a27c50 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
        RAX: ffff880009ac68d0 RBX: ffff880006a27ce0 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff880006a27ce0 RDI: ffff880009ac6900
        RBP: ffff880006a27c60 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000001 R11: 000000000063ff00 R12: ffff880009ac6900
        R13: ffff880006a27cf8 R14: 0000000000000001 R15: ffff880006a27cf8
        FS:  00007f4be4838700(0000) GS:ffff88000cc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000020101000 CR3: 0000000009616000 CR4: 00000000000006f0
        Call Trace:
         prepare_to_wait+0x26/0xc0
         sock_alloc_send_pskb+0x14e/0x270
         ? remove_wait_queue+0x60/0x60
         tun_get_user+0x2cc/0x19d0
         ? __tun_get+0x60/0x1b0
         tun_chr_write_iter+0x57/0x86
         __vfs_write+0x156/0x1e0
         vfs_write+0xf7/0x230
         SyS_write+0x57/0xd0
         entry_SYSCALL_64_fastpath+0x1f/0xbe
        RIP: 0033:0x7f4be4356df9
        RSP: 002b:00007ffc18101c08 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4be4356df9
        RDX: 0000000000000046 RSI: 0000000020101000 RDI: 0000000000000005
        RBP: 00007ffc18101c40 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000293 R12: 0000559c75f64780
        R13: 00007ffc18101d30 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: 33dccbb0 ("tun: Limit amount of queued packets per device")
      Fixes: 20d29d7a ("net: macvtap driver")
      Signed-off-by: Craig Gallek <kraig@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 28 Oct 2017, 1 commit
    • tuntap: properly align skb->head before building skb · 63b9ab65
      Authored by Jason Wang
      An unaligned alloc_frag->offset caused by a previous allocation will
      result in an unaligned skb->head. This in turn leads to an unaligned
      skb_shared_info and thus an unaligned dataref, which must be aligned
      for access on some architectures. Fix this by aligning
      alloc_frag->offset before refilling the frag.
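
      A minimal sketch of the realignment, assuming the page-frag refill path
      in tun_build_skb() (buflen, len and pad are the surrounding locals):

      buflen += SKB_DATA_ALIGN(len + pad);
      /* A previous, differently sized allocation may have left offset
       * unaligned; realign it so skb->head, and with it
       * skb_shared_info/dataref, lands on a properly aligned boundary. */
      alloc_frag->offset = ALIGN(alloc_frag->offset, SMP_CACHE_BYTES);
      if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
              return ERR_PTR(-ENOMEM);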
      
      Fixes: 0bbd7dad ("tun: make tun_build_skb() thread safe")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Wei Wei <dotweiba@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Reported-by: Wei Wei <dotweiba@gmail.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 26 Oct 2017, 1 commit
  4. 25 Oct 2017, 1 commit
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      Authored by Mark Rutland
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
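
      For illustration, the transformation the script performs on a typical
      (hypothetical) use:

      /* Before */
      val = ACCESS_ONCE(ring->head);
      ACCESS_ONCE(ring->tail) = val;

      /* After */
      val = READ_ONCE(ring->head);
      WRITE_ONCE(ring->tail, val);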
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 17 Oct 2017, 1 commit
    • tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Authored by Cong Wang
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called. For the tun
      device this is a problem, because a timer and other state are already
      initialized and the driver expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into ->ndo_init() so that
      register_netdevice() knows better; however, this is still
      complicated due to the logic in tun_detach().
      
      Therefore, I choose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit.
      And for this specific case, it is already enough.
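
      A minimal sketch of the resulting ordering in tun_set_iff() (the error
      labels are assumptions):

      /* Resolve/validate the name up front so a bad name fails here,
       * before the timer and flow cache have been initialized. */
      err = dev_get_valid_name(net, dev, name);
      if (err < 0)
              goto err_free_dev;

      /* ... remaining initialization ... */

      err = register_netdevice(tun->dev);
      if (err < 0)
              goto err_detach;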
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Sep 2017, 1 commit
    • tun: bail out from tun_get_user() if the skb is empty · 2580c4c1
      Authored by Alexander Potapenko
      KMSAN (https://github.com/google/kmsan) reported an access to
      uninitialized skb->data[0] when the skb is empty (i.e. skb->len is 0):
      
      ================================================
      BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
      CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
      ...
       __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
       tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
      ...
      origin:
      ...
       kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
       slab_alloc_node mm/slub.c:2732
       __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
       __kmalloc_reserve net/core/skbuff.c:138
       __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
       alloc_skb ./include/linux/skbuff.h:903
       alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
       sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
       tun_alloc_skb drivers/net/tun.c:1144
       tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
      ================================================
      
      Make sure tun_get_user() doesn't touch skb->data[0] unless there is
      actual data.
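
      A minimal sketch of the guard, assuming the IFF_NO_PI protocol-probing
      code in tun_get_user() (the drop-accounting details are elided):

      if (tun->flags & IFF_NO_PI) {
              /* Only peek at the first byte when the frame carries data;
               * an empty write now falls through to the drop case. */
              u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;

              switch (ip_version) {
              case 4:
                      pi.proto = htons(ETH_P_IP);
                      break;
              case 6:
                      pi.proto = htons(ETH_P_IPV6);
                      break;
              default:
                      /* count the drop and free the skb */
                      kfree_skb(skb);
                      return -EINVAL;
              }
      }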
      
      C reproducer below:
      ==========================
          // autogenerated by syzkaller (http://github.com/google/syzkaller)
      
          #define _GNU_SOURCE
      
          #include <fcntl.h>
          #include <linux/if_tun.h>
          #include <netinet/ip.h>
          #include <net/if.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <sys/socket.h>  /* socket() */
          #include <unistd.h>      /* write()  */
      
          int main()
          {
            int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
            int tun_fd = open("/dev/net/tun", O_RDWR);
            struct ifreq req;
            memset(&req, 0, sizeof(struct ifreq));
            strcpy((char*)&req.ifr_name, "gre0");
            req.ifr_flags = IFF_UP | IFF_MULTICAST;
            ioctl(tun_fd, TUNSETIFF, &req);
            ioctl(sock, SIOCSIFFLAGS, "gre0");
            write(tun_fd, "hi", 0);
            return 0;
          }
      ==========================
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 06 Sep 2017, 2 commits
  8. 19 Aug 2017, 1 commit
    • tun: handle register_netdevice() failures properly · ff244c6b
      Authored by Eric Dumazet
      syzkaller reported a double free [1], caused by the fact
      that the tun driver was not updated properly when priv_destructor
      was added.
      
      When/if register_netdevice() fails, priv_destructor() must have been
      called already.
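
      A condensed, hedged sketch of the corrected error path in tun_set_iff()
      (label names and ordering are approximations of the driver code):

      err_detach:
              tun_detach_all(dev);
              /* register_netdevice() has already run dev->priv_destructor()
               * (tun_free_netdev), so the flow/security state is gone;
               * do not free it a second time. */
              goto err_free_dev;

      err_free_flow:
              tun_flow_uninit(tun);
              security_tun_dev_free_security(tun->security);
      err_free_dev:
              free_netdev(dev);
              return err;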
      
      [1]
      BUG: KASAN: double-free or invalid-free in selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
      
      CPU: 0 PID: 2919 Comm: syzkaller227220 Not tainted 4.13.0-rc4+ #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       print_address_description+0x7f/0x260 mm/kasan/report.c:252
       kasan_report_double_free+0x55/0x80 mm/kasan/report.c:333
       kasan_slab_free+0xa0/0xc0 mm/kasan/kasan.c:514
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_set_iff drivers/net/tun.c:1884 [inline]
       __tun_chr_ioctl+0x2ce6/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x443ff9
      RSP: 002b:00007ffc34271f68 EFLAGS: 00000217 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000443ff9
      RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000003
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401ce0
      R13: 0000000000401d70 R14: 0000000000000000 R15: 0000000000000000
      
      Allocated by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:551
       kmem_cache_alloc_trace+0x101/0x6f0 mm/slab.c:3627
       kmalloc include/linux/slab.h:493 [inline]
       kzalloc include/linux/slab.h:666 [inline]
       selinux_tun_dev_alloc_security+0x49/0x170 security/selinux/hooks.c:5012
       security_tun_dev_alloc_security+0x6d/0xa0 security/security.c:1506
       tun_set_iff drivers/net/tun.c:1839 [inline]
       __tun_chr_ioctl+0x1730/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Freed by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_free_netdev+0x13b/0x1b0 drivers/net/tun.c:1563
       register_netdevice+0x8d0/0xee0 net/core/dev.c:7605
       tun_set_iff drivers/net/tun.c:1859 [inline]
       __tun_chr_ioctl+0x1caf/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      The buggy address belongs to the object at ffff8801d2843b40
       which belongs to the cache kmalloc-32 of size 32
      The buggy address is located 0 bytes inside of
       32-byte region [ffff8801d2843b40, ffff8801d2843b60)
      The buggy address belongs to the page:
      page:ffffea000660cea8 count:1 mapcount:0 mapping:ffff8801d2843000 index:0xffff8801d2843fc1
      flags: 0x200000000000100(slab)
      raw: 0200000000000100 ffff8801d2843000 ffff8801d2843fc1 000000010000003f
      raw: ffffea0006626a40 ffffea00066141a0 ffff8801dbc00100
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc
      >ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc
                                                 ^
       ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
      
      ==================================================================
      
      Fixes: cf124db5 ("net: Fix inconsistent teardown and release of private netdev state.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 17 Aug 2017, 2 commits
  10. 14 Aug 2017, 2 commits
    • tap: XDP support · 761876c8
      Authored by Jason Wang
      This patch tries to implement XDP for tun. The implementation is
      split into two parts (see the sketch after this list):
      
      - fast path: small, non-GSO packets. We try to run XDP at the page
        level before build_skb(). For XDP_TX, since creating/destroying
        queues is completely under the control of userspace, it is
        implemented through the generic XDP helper after the skb has been
        built. This could be optimized in the future.
      - slow path: big or GSO packets. XDP is run after the skb has been
        created, through the generic XDP helpers.
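
      A hedged sketch of the page-level fast path; buf, pad, len and
      xdp_prog are assumed locals, and only the generic bpf_prog_run_xdp()
      call and the standard verdicts are taken as given:

      struct xdp_buff xdp;
      u32 act;

      xdp.data_hard_start = buf;
      xdp.data = buf + pad;
      xdp.data_end = xdp.data + len;

      act = bpf_prog_run_xdp(xdp_prog, &xdp);
      switch (act) {
      case XDP_PASS:
              break;          /* continue and build the skb from the page */
      case XDP_DROP:
      default:
              goto drop;      /* page stays with the frag allocator */
      /* XDP_TX is handled through the generic XDP helper after the skb
       * has been built, as described above. */
      }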
      
      Tests were done through pktgen with small packets.
      
      xdp1 test shows ~41.1% improvement:
      
      Before: ~1.7Mpps
      After:  ~2.3Mpps
      
      xdp_redirect to ixgbe shows ~60% improvement:
      
      Before: ~0.8Mpps
      After:  ~1.38Mpps
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tap: use build_skb() for small packet · 66ccbc9c
      Authored by Jason Wang
      In the past we used tun_alloc_skb(), which calls
      sock_alloc_send_pskb(), to allocate the skb. This socket-based method
      is not suitable for high-speed userspace such as virtualization,
      which usually:
      
      - ignores sk_sndbuf (INT_MAX) and expects to receive packets as fast
        as possible
      - doesn't want to be blocked at sendmsg()
      
      To eliminate the above overheads, this patch tries to use build_skb()
      for small packets. We do this only when the following conditions are
      all met (see the sketch after this list):
      
      - TAP instead of TUN
      - sk_sndbuf is INT_MAX
      - the caller doesn't want to be blocked
      - zerocopy is not used
      - the packet size is small enough to use build_skb()
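
      A hedged sketch of the fast path once those conditions hold; the
      TUN_RX_PAD constant and the page_frag bookkeeping follow the shape of
      drivers/net/tun.c but are simplified here:

      buflen = SKB_DATA_ALIGN(len + TUN_RX_PAD) +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
      if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
              return ERR_PTR(-ENOMEM);

      buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
      if (copy_page_from_iter(alloc_frag->page,
                              alloc_frag->offset + TUN_RX_PAD,
                              len, from) != len)
              return ERR_PTR(-EFAULT);

      skb = build_skb(buf, buflen);           /* no socket accounting */
      if (!skb)
              return ERR_PTR(-ENOMEM);

      skb_reserve(skb, TUN_RX_PAD);
      skb_put(skb, len);
      get_page(alloc_frag->page);             /* the skb now owns a reference */
      alloc_frag->offset += buflen;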
      
      Pktgen from guest to host shows ~11% improvement for rx pps of tap:
      
      Before: ~1.70Mpps
      After : ~1.88Mpps
      
      What's more important, this makes it possible to implement XDP for tap
      before creating skbs.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 04 Aug 2017, 1 commit
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Authored by Willem de Bruijn
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, etc. paths
      support reference-counted zerocopy buffers, so they do not do a deep
      copy. Add skb_orphan_frags_rx for paths that may loop packets to
      receive sockets. That is not allowed, as it may cause unbounded
      latency. Deep copy all zerocopy buffers, ref-counted or not, in this
      path.
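
      A hedged sketch of the two variants (close to, but not necessarily
      identical to, the include/linux/skbuff.h helpers):

      static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
      {
              if (likely(!skb_zcopy(skb)))
                      return 0;
              if (skb_uarg(skb)->callback == sock_zerocopy_callback)
                      return 0;       /* ref-counted ubuf: no deep copy needed */
              return skb_copy_ubufs(skb, gfp_mask);
      }

      /* For paths that may loop packets to receive sockets: always deep copy. */
      static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
      {
              if (likely(!skb_zcopy(skb)))
                      return 0;
              return skb_copy_ubufs(skb, gfp_mask);
      }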
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      the SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 25 Jul 2017, 1 commit
  13. 18 Jul 2017, 1 commit
  14. 27 Jun 2017, 1 commit
  15. 08 Jun 2017, 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      Authored by David S. Miller
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init(). However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice(), because
      all callers of register_netdevice() manage the freeing of the netdev
      and invoke free_netdev(dev) if register_netdevice() fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call netdev_ops->ndo_uninit(). But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, when register_netdevice() fails after netdev_ops->ndo_init() has
      succeeded, it can sanely release all resources by invoking both
      netdev_ops->ndo_uninit() and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
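
      For illustration, how a driver adopts the new fields (the foo_* names
      and struct foo_priv are hypothetical):

      static void foo_priv_destructor(struct net_device *dev)
      {
              struct foo_priv *priv = netdev_priv(dev);

              free_percpu(priv->stats);   /* private state only, no free_netdev() */
      }

      static void foo_setup(struct net_device *dev)
      {
              /* ... */
              dev->priv_destructor = foo_priv_destructor;
              dev->needs_free_netdev = true;  /* unregister_netdevice() will free_netdev() */
      }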
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 07 Jun 2017, 1 commit
  17. 18 May 2017, 2 commits
  18. 22 Mar 2017, 1 commit
  19. 14 Mar 2017, 2 commits
  20. 10 Mar 2017, 1 commit
  21. 02 Mar 2017, 1 commit
  22. 07 Feb 2017, 1 commit
  23. 21 Jan 2017, 1 commit
  24. 19 Jan 2017, 1 commit
    • tun: rx batching · 5503fcec
      Authored by Jason Wang
      We can only process one packet at a time during sendmsg(). This often
      leads to bad cache utilization under heavy load. So this patch tries
      to do some batching during rx before submitting packets to the host
      network stack. This is done by accepting MSG_MORE as a hint from the
      sendmsg() caller: if it is set, batch the packet temporarily in a
      linked list and submit them all once MSG_MORE is cleared.
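
      A hedged sketch of the batching decision; the queue/rx_batched names
      are approximations and the per-queue locking is omitted:

      if (more && skb_queue_len(queue) < tun->rx_batched) {
              __skb_queue_tail(queue, skb);           /* keep batching */
      } else {
              struct sk_buff *nskb;

              local_bh_disable();
              while ((nskb = __skb_dequeue(queue)))   /* flush the batch */
                      netif_receive_skb(nskb);
              netif_receive_skb(skb);
              local_bh_enable();
      }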
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Users are allowed to change the number of batched packets per device
      through ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as an upper
      limit to prevent bh from being disabled for too long.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 09 Jan 2017, 1 commit
  26. 25 Dec 2016, 1 commit
  27. 07 Dec 2016, 1 commit
  28. 06 Dec 2016, 1 commit
    • [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Authored by Al Viro
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et al., advancing the iterator only in the case of a successful full
      copy and returning whether it had been successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing iov_iter in failure case is the right thing in
      this case.  Anything that does short read/short write kind of
      stuff (or is in a loop, etc.) is unlikely to be a good one.
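
      For illustration, a typical conversion; the virtio-net header copy in
      tun_get_user() is one such user, and the gso/from names are assumed:

      /* Before: size_t return value, iterator advanced even on a short copy */
      if (copy_from_iter(&gso, sizeof(gso), from) != sizeof(gso))
              return -EFAULT;

      /* After: boolean result, iterator only advances on a full copy */
      if (!copy_from_iter_full(&gso, sizeof(gso), from))
              return -EFAULT;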
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  29. 01 Dec 2016, 1 commit
  30. 25 Nov 2016, 1 commit
  31. 19 Nov 2016, 2 commits
  32. 31 Oct 2016, 1 commit
  33. 30 Oct 2016, 1 commit
  34. 21 Oct 2016, 1 commit
    • net: use core MTU range checking in core net infra · 91572088
      Authored by Jarod Wilson
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu (see the sketch after
        this list)
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
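
      A hedged sketch of what the tun entry above amounts to; the constants
      vary per driver, and the core then range-checks new values in
      dev_set_mtu(), so the driver's own change_mtu bounds check can go away:

      /* e.g. in the driver's setup routine (hypothetical placement): */
      dev->min_mtu = 68;              /* historic IPv4 minimum */
      dev->max_mtu = 65535;
      /* the now-redundant .ndo_change_mtu range check is removed */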
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>