1. 19 Nov, 2017 1 commit
  2. 05 Nov, 2017 1 commit
  3. 01 Nov, 2017 1 commit
    • tun/tap: sanitize TUNSETSNDBUF input · 93161922
      Committed by Craig Gallek
      Syzkaller found several variants of the lockup below by setting negative
      values with the TUNSETSNDBUF ioctl.  This patch adds a sanity check
      to both the tun and tap versions of this ioctl.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389]
        Modules linked in:
        irq event stamp: 329692056
        hardirqs last  enabled at (329692055): [<ffffffff824b8381>] _raw_spin_unlock_irqrestore+0x31/0x75
        hardirqs last disabled at (329692056): [<ffffffff824b9e58>] apic_timer_interrupt+0x98/0xb0
        softirqs last  enabled at (35659740): [<ffffffff824bc958>] __do_softirq+0x328/0x48c
        softirqs last disabled at (35659731): [<ffffffff811c796c>] irq_exit+0xbc/0xd0
        CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff880009452140 task.stack: ffff880006a20000
        RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80
        RSP: 0018:ffff880006a27c50 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
        RAX: ffff880009ac68d0 RBX: ffff880006a27ce0 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff880006a27ce0 RDI: ffff880009ac6900
        RBP: ffff880006a27c60 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000001 R11: 000000000063ff00 R12: ffff880009ac6900
        R13: ffff880006a27cf8 R14: 0000000000000001 R15: ffff880006a27cf8
        FS:  00007f4be4838700(0000) GS:ffff88000cc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000020101000 CR3: 0000000009616000 CR4: 00000000000006f0
        Call Trace:
         prepare_to_wait+0x26/0xc0
         sock_alloc_send_pskb+0x14e/0x270
         ? remove_wait_queue+0x60/0x60
         tun_get_user+0x2cc/0x19d0
         ? __tun_get+0x60/0x1b0
         tun_chr_write_iter+0x57/0x86
         __vfs_write+0x156/0x1e0
         vfs_write+0xf7/0x230
         SyS_write+0x57/0xd0
         entry_SYSCALL_64_fastpath+0x1f/0xbe
        RIP: 0033:0x7f4be4356df9
        RSP: 002b:00007ffc18101c08 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4be4356df9
        RDX: 0000000000000046 RSI: 0000000020101000 RDI: 0000000000000005
        RBP: 00007ffc18101c40 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000293 R12: 0000559c75f64780
        R13: 00007ffc18101d30 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: 33dccbb0 ("tun: Limit amount of queued packets per device")
      Fixes: 20d29d7a ("net: macvtap driver")
      Signed-off-by: Craig Gallek <kraig@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
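The sanity check described in the commit above can be sketched in plain C. This is a minimal userspace model, not the kernel patch itself: `model_set_sndbuf()` and its parameters are hypothetical names, and the sketch assumes the check simply rejects zero or negative sizes with -EINVAL.

```c
#include <errno.h>

/* Hypothetical userspace model of validating a TUNSETSNDBUF-style value.
 * cur_sndbuf stands in for the socket's send-buffer limit. */
static int model_set_sndbuf(int *cur_sndbuf, int requested)
{
    /* Reject non-positive sizes before they can reach the send path. */
    if (requested <= 0)
        return -EINVAL;

    *cur_sndbuf = requested;
    return 0;
}
```

Rejecting the value at the ioctl boundary leaves the previous limit untouched, so a bad request cannot put the writer into the wait loop seen in the trace.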
  4. 28 Oct, 2017 1 commit
    • tuntap: properly align skb->head before building skb · 63b9ab65
      Committed by Jason Wang
      An unaligned alloc_frag->offset left by a previous allocation will
      result in an unaligned skb->head. This leads to an unaligned
      skb_shared_info and then an unaligned dataref, which must be
      aligned for access on some architectures. Fix this by aligning
      alloc_frag->offset before the frag refilling.
      
      Fixes: 0bbd7dad ("tun: make tun_build_skb() thread safe")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Wei Wei <dotweiba@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Reported-by: Wei Wei <dotweiba@gmail.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
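The idea of aligning the fragment offset before carving out a buffer can be illustrated with a small userspace sketch. The names `frag_model`, `align_up()` and `frag_alloc_aligned()` are hypothetical, and the alignment value is left to the caller; the commit message does not say which alignment the driver uses.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical model of a page-fragment cursor, loosely mirroring alloc_frag. */
struct frag_model {
    size_t offset; /* next free byte within the fragment */
    size_t size;   /* total fragment size */
};

/* Align the cursor first, then carve out len bytes. Returns the start
 * offset of the allocation, or (size_t)-1 if the fragment is too small. */
static size_t frag_alloc_aligned(struct frag_model *f, size_t len, size_t align)
{
    size_t start = align_up(f->offset, align);

    if (start > f->size || len > f->size - start)
        return (size_t)-1;
    f->offset = start + len;
    return start;
}
```

Because the cursor is aligned before the size check, an odd-sized earlier allocation cannot leave the next buffer head misaligned.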
  5. 26 Oct, 2017 1 commit
  6. 25 Oct, 2017 1 commit
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Committed by Mark Rutland
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
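The effect of the two rewrite rules in the commit above can be seen in a small userspace sketch. `MODEL_READ_ONCE()` and `MODEL_WRITE_ONCE()` are simplified stand-ins built on volatile casts, assuming a compiler that supports `__typeof__` (GCC or Clang); they are not the kernel's definitions, and `model_queue` is a hypothetical type.

```c
/* Simplified stand-ins for the kernel macros (assumption: GCC/Clang
 * __typeof__ is available). The real kernel definitions are more involved. */
#define MODEL_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct model_queue {
    int head;
};

/* A store that would previously have been "ACCESS_ONCE(q->head) = v". */
static void model_publish(struct model_queue *q, int v)
{
    MODEL_WRITE_ONCE(q->head, v);
}

/* A load that would previously have been "ACCESS_ONCE(q->head)". */
static int model_peek(struct model_queue *q)
{
    return MODEL_READ_ONCE(q->head);
}
```

Separate accessors make the direction of each access visible at the call site, which is the read/write distinction the commit message calls critical.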
  7. 22 Oct, 2017 3 commits
  8. 20 Oct, 2017 1 commit
    • net-tun: fix panics at dismantle time · aec72f33
      Committed by Eric Dumazet
      syzkaller got crashes at dismantle time [1]
      
      It is not correct to test (tun->flags & IFF_NAPI) in tun_napi_disable()
      and tun_napi_del(): each tun_file can have a different mode, depending
      on how it was created.
      
      Similarly I have changed tun_get_user() and tun_poll_controller()
      to use the new tfile->napi_enabled boolean.
      
      [  154.331360] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  154.339220] IP: [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
      [  154.344983] PGD 0
      [  154.347009] Oops: 0000 [#1] SMP
      [  154.350680] gsmi: Log Shutdown Reason 0x03
      [  154.379572] task: ffff994719150dc0 ti: ffff99475c0ae000 task.ti: ffff99475c0ae000
      [  154.387043] RIP: 0010:[<ffffffff9634cad6>]  [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
      [  154.395232] RSP: 0018:ffff99475c0afce8  EFLAGS: 00010246
      [  154.400542] RAX: ffff994754850ac0 RBX: ffff994753e65408 RCX: ffff994753e65388
      [  154.407666] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff994753e65408
      [  154.414790] RBP: ffff99475c0afce8 R08: 0000000000000000 R09: 0000000000000000
      [  154.421921] R10: ffff99475f6f5910 R11: 0000000000000001 R12: 0000000000000000
      [  154.429044] R13: ffff99417deab668 R14: ffff99417deaa780 R15: ffff99475f45dde0
      [  154.436174] FS:  0000000000000000(0000) GS:ffff994767a00000(0000) knlGS:0000000000000000
      [  154.444249] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  154.449986] CR2: 0000000000000000 CR3: 00000005a8a0e000 CR4: 0000000000022670
      [  154.457110] Stack:
      [  154.459120]  ffff99475c0afd28 ffffffff9634d614 1000000000000000 0000000000000000
      [  154.466598]  ffffe54240000000 ffff994753e65408 ffff994753e653a8 ffff99417deab668
      [  154.474067]  ffff99475c0afd48 ffffffff9634d6fd ffff99474c2be678 ffff994753e65398
      [  154.481537] Call Trace:
      [  154.483985]  [<ffffffff9634d614>] hrtimer_try_to_cancel+0x24/0xf0
      [  154.490074]  [<ffffffff9634d6fd>] hrtimer_cancel+0x1d/0x30
      [  154.495563]  [<ffffffff96860b3c>] napi_disable+0x3c/0x70
      [  154.500875]  [<ffffffff9678ae62>] __tun_detach+0xd2/0x360
      [  154.506272]  [<ffffffff9678b117>] tun_chr_close+0x27/0x40
      [  154.511669]  [<ffffffff9646ebe6>] __fput+0xd6/0x1e0
      [  154.516548]  [<ffffffff9646ed3e>] ____fput+0xe/0x10
      [  154.521429]  [<ffffffff963035a2>] task_work_run+0x72/0x90
      [  154.526827]  [<ffffffff962e9407>] do_exit+0x317/0xb60
      [  154.531879]  [<ffffffff962e9c8f>] do_group_exit+0x3f/0xa0
      [  154.537275]  [<ffffffff962e9d07>] SyS_exit_group+0x17/0x20
      [  154.542769]  [<ffffffff969784be>] entry_SYSCALL_64_fastpath+0x12/0x17
      
      Fixes: 94317099 ("net-tun: enable NAPI for TUN/TAP driver")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
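The per-file flag described in the commit above can be modelled in userspace as follows. All `model_*` names and the flag value are hypothetical; the sketch only shows why teardown should consult state recorded on the file rather than a device-wide flag.

```c
#include <stdbool.h>

#define MODEL_IFF_NAPI 0x0010 /* illustrative value for this sketch only */

struct model_tun_dev {
    unsigned int flags;  /* device-wide flags */
};

struct model_tun_file {
    struct model_tun_dev *dev;
    bool napi_enabled;   /* recorded when this queue was set up */
    bool napi_active;    /* stands in for the live NAPI context */
};

/* Record the mode on the file itself at init time. */
static void model_napi_init(struct model_tun_file *f, bool napi_en)
{
    f->napi_enabled = napi_en;
    f->napi_active = napi_en;
}

/* Teardown consults the per-file flag, not dev->flags.
 * Returns true if a NAPI context was actually torn down. */
static bool model_napi_disable(struct model_tun_file *f)
{
    if (!f->napi_enabled)
        return false;
    f->napi_active = false;
    return true;
}
```

A file that never had NAPI set up is skipped even when the device flag is set, so teardown never touches uninitialized NAPI state.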
  9. 19 Oct, 2017 1 commit
  10. 17 Oct, 2017 1 commit
    • tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Committed by Cong Wang
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called. For tun
      device, this is a problem because a timer etc. are already
      initialized and it expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into a ->ndo_init() so
      that register_netdevice() knows better, however this is still
      complicated due to the logic in tun_detach().
      
      Therefore, I choose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit.
      And for this specific case, it is already enough.
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 28 Sep, 2017 1 commit
    • tun: bail out from tun_get_user() if the skb is empty · 2580c4c1
      Committed by Alexander Potapenko
      KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
      skb->data[0] in the case the skb is empty (i.e. skb->len is 0):
      
      ================================================
      BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
      CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
      ...
       __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
       tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
      ...
      origin:
      ...
       kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
       slab_alloc_node mm/slub.c:2732
       __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
       __kmalloc_reserve net/core/skbuff.c:138
       __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
       alloc_skb ./include/linux/skbuff.h:903
       alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
       sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
       tun_alloc_skb drivers/net/tun.c:1144
       tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
      ================================================
      
      Make sure tun_get_user() doesn't touch skb->data[0] unless there is
      actual data.
      
      C reproducer below:
      ==========================
          // autogenerated by syzkaller (http://github.com/google/syzkaller)
      
          #define _GNU_SOURCE
      
          #include <fcntl.h>
          #include <linux/if_tun.h>
          #include <netinet/ip.h>
          #include <net/if.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <unistd.h>
      
          int main()
          {
            int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
            int tun_fd = open("/dev/net/tun", O_RDWR);
            struct ifreq req;
            memset(&req, 0, sizeof(struct ifreq));
            strcpy((char*)&req.ifr_name, "gre0");
            req.ifr_flags = IFF_UP | IFF_MULTICAST;
            ioctl(tun_fd, TUNSETIFF, &req);
            ioctl(sock, SIOCSIFFLAGS, "gre0");
            write(tun_fd, "hi", 0);
            return 0;
          }
      ==========================
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
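The guard described in the commit above amounts to checking the length before reading the first byte. A minimal userspace sketch follows, with `model_ip_version()` as a hypothetical helper; returning 0 for an empty buffer is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the IP version nibble of a raw packet, or 0 when the buffer is
 * empty, so that data[0] is never read without actual data. */
static unsigned int model_ip_version(const uint8_t *data, size_t len)
{
    return len ? (unsigned int)(data[0] >> 4) : 0;
}
```

A zero-length write, like the one issued by the reproducer, then falls through to whatever the caller does for an unknown version instead of inspecting uninitialized memory.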
  12. 27 Sep, 2017 1 commit
    • bpf: add meta pointer for direct access · de8f3a83
      Committed by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data, followed
      by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
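The buffer layout and the size rules stated in the commit above (metadata sits directly before the packet data, is a multiple of 4 bytes, and is at most 32 bytes) can be modelled in userspace. This sketch uses byte offsets instead of pointers, and `model_xdp_buff`, `model_adjust_meta()` and the specific error codes are assumptions made for illustration, not the kernel helper.

```c
#include <errno.h>

/* Offsets are relative to the start of the underlying buffer. */
struct model_xdp_buff {
    long data_hard_start; /* start of headroom */
    long data_meta;       /* start of metadata; equals data when empty */
    long data;            /* start of packet payload */
    long data_end;        /* end of packet payload */
};

/* Move data_meta by offset (negative grows the metadata area). */
static int model_adjust_meta(struct model_xdp_buff *xdp, long offset)
{
    long meta, metalen;

    /* A driver without metadata support sets data_meta past data. */
    if (xdp->data_meta > xdp->data)
        return -ENOTSUP;

    meta = xdp->data_meta + offset;
    if (meta < xdp->data_hard_start || meta > xdp->data)
        return -EINVAL;

    /* Metadata must be a multiple of 4 bytes and at most 32 bytes. */
    metalen = xdp->data - meta;
    if ((metalen & 3) || metalen > 32)
        return -EACCES;

    xdp->data_meta = meta;
    return 0;
}
```

The `data_meta > data` test is how the "xdp->data + 1" convention for unsupported drivers is detected in this model.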
  13. 26 Sep, 2017 3 commits
    • tun: delete original tun_get() and rename __tun_get() to tun_get() · 9484dc74
      Committed by yuan linyu
      There seems to be no need to keep both tun_get() and __tun_get().
      Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: enable napi_gro_frags() for TUN/TAP driver · 90e33d45
      Committed by Petar Penkov
      Add a TUN/TAP receive mode that exercises the napi_gro_frags()
      interface. This mode is available only in TAP mode, as the interface
      expects packets with Ethernet headers.
      
      Furthermore, packets follow the layout of the iovec_iter that was
      received. The first iovec is the linear data, and every one after the
      first is a fragment. If there are more fragments than the max number,
      drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
      dissector code and to verify that the header resides in the linear data.
      
      The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
      This is imposed because this mode is intended for testing via tools like
      syzkaller and packetdrill, and the increased flexibility it provides can
      introduce security vulnerabilities. This flag is accepted only if the
      device is in TAP mode and has the IFF_NAPI flag set as well. This is
      done because both of these are explicit requirements for correct
      operation in this mode.
      Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
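The flag dependency described in the commit above (IFF_NAPI_FRAGS is accepted only together with TAP mode and IFF_NAPI) can be written as a small validation function. The `MODEL_*` constants carry illustrative values for this sketch only, and returning -EINVAL is an assumption; the commit message does not say how the driver reports a rejected flag.

```c
#include <errno.h>

/* Illustrative flag values for this sketch only; consult <linux/if_tun.h>
 * for the real definitions. */
#define MODEL_IFF_TUN        0x0001
#define MODEL_IFF_TAP        0x0002
#define MODEL_IFF_NAPI       0x0010
#define MODEL_IFF_NAPI_FRAGS 0x0020

/* Accept NAPI_FRAGS only when both of its prerequisites are present. */
static int model_check_flags(unsigned int flags)
{
    if (flags & MODEL_IFF_NAPI_FRAGS) {
        if (!(flags & MODEL_IFF_TAP) || !(flags & MODEL_IFF_NAPI))
            return -EINVAL;
    }
    return 0;
}
```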
    • tun: enable NAPI for TUN/TAP driver · 94317099
      Committed by Petar Penkov
      Changes TUN driver to use napi_gro_receive() upon receiving packets
      rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
      changes and operation is not affected if the flag is disabled.  SKBs
      are constructed upon packet arrival and are queued to be processed
      later.
      
      The new path was evaluated with a benchmark with the following setup:
      Open two tap devices and a receiver thread that reads in a loop for
      each device. Start one sender thread and pin all threads to different
      CPUs. Send 1M minimum UDP packets to each device and measure sending
      time for each of the sending methods:
      	napi_gro_receive():	4.90s
      	netif_rx_ni():		4.90s
      	netif_receive_skb():	7.20s
      Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 06 Sep, 2017 2 commits
  15. 19 Aug, 2017 1 commit
    • tun: handle register_netdevice() failures properly · ff244c6b
      Committed by Eric Dumazet
      syzkaller reported a double free [1], caused by the fact
      that tun driver was not updated properly when priv_destructor
      was added.
      
      When/if register_netdevice() fails, priv_destructor() must have been
      called already.
      
      [1]
      BUG: KASAN: double-free or invalid-free in selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
      
      CPU: 0 PID: 2919 Comm: syzkaller227220 Not tainted 4.13.0-rc4+ #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       print_address_description+0x7f/0x260 mm/kasan/report.c:252
       kasan_report_double_free+0x55/0x80 mm/kasan/report.c:333
       kasan_slab_free+0xa0/0xc0 mm/kasan/kasan.c:514
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_set_iff drivers/net/tun.c:1884 [inline]
       __tun_chr_ioctl+0x2ce6/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x443ff9
      RSP: 002b:00007ffc34271f68 EFLAGS: 00000217 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000443ff9
      RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000003
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401ce0
      R13: 0000000000401d70 R14: 0000000000000000 R15: 0000000000000000
      
      Allocated by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:551
       kmem_cache_alloc_trace+0x101/0x6f0 mm/slab.c:3627
       kmalloc include/linux/slab.h:493 [inline]
       kzalloc include/linux/slab.h:666 [inline]
       selinux_tun_dev_alloc_security+0x49/0x170 security/selinux/hooks.c:5012
       security_tun_dev_alloc_security+0x6d/0xa0 security/security.c:1506
       tun_set_iff drivers/net/tun.c:1839 [inline]
       __tun_chr_ioctl+0x1730/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Freed by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_free_netdev+0x13b/0x1b0 drivers/net/tun.c:1563
       register_netdevice+0x8d0/0xee0 net/core/dev.c:7605
       tun_set_iff drivers/net/tun.c:1859 [inline]
       __tun_chr_ioctl+0x1caf/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      The buggy address belongs to the object at ffff8801d2843b40
       which belongs to the cache kmalloc-32 of size 32
      The buggy address is located 0 bytes inside of
       32-byte region [ffff8801d2843b40, ffff8801d2843b60)
      The buggy address belongs to the page:
      page:ffffea000660cea8 count:1 mapcount:0 mapping:ffff8801d2843000 index:0xffff8801d2843fc1
      flags: 0x200000000000100(slab)
      raw: 0200000000000100 ffff8801d2843000 ffff8801d2843fc1 000000010000003f
      raw: ffffea0006626a40 ffffea00066141a0 ffff8801dbc00100
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc
      >ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc
                                                 ^
       ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
      
      ==================================================================
      
      Fixes: cf124db5 ("net: Fix inconsistent teardown and release of private netdev state.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 17 Aug, 2017 2 commits
  17. 14 Aug, 2017 2 commits
    • tap: XDP support · 761876c8
      Committed by Jason Wang
      This patch tries to implement XDP for tun. The implementation was
      split into two parts:
      
      - fast path: small and no gso packet. We try to do XDP at page level
        before build_skb(). For XDP_TX, since creating/destroying queues
        were completely under control of userspace, it was implemented
        through generic XDP helper after skb has been built. This could be
        optimized in the future.
      - slow path: big or gso packet. We try to do it after skb was created
        through generic XDP helpers.
      
      Tests were done through pktgen with small packets.
      
      xdp1 test shows ~41.1% improvement:
      
      Before: ~1.7Mpps
      After:  ~2.3Mpps
      
      xdp_redirect to ixgbe shows ~60% improvement:
      
      Before: ~0.8Mpps
      After:  ~1.38Mpps
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tap: use build_skb() for small packet · 66ccbc9c
      Committed by Jason Wang
      We used tun_alloc_skb(), which calls sock_alloc_send_pskb(), to allocate
      the skb in the past. This socket-based method is not suitable for high
      speed userspace like virtualization, which usually:
      
      - ignores sk_sndbuf (INT_MAX) and expects to receive the packet as fast as
        possible
      - doesn't want to be blocked at sendmsg()
      
      To eliminate the above overheads, this patch tries to use build_skb()
      for small packets. We will do this only when the following conditions
      are all met:
      
      - TAP instead of TUN
      - sk_sndbuf is INT_MAX
      - caller doesn't want to be blocked
      - zerocopy is not used
      - packet size is small enough to use build_skb()
      
      Pktgen from guest to host shows ~11% improvement for rx pps of tap:
      
      Before: ~1.70Mpps
      After : ~1.88Mpps
      
      What's more important, this makes it possible to implement XDP for tap
      before creating skbs.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
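The list of conditions in the commit above reads naturally as a single predicate. The sketch below is a userspace model: `model_rx_ctx`, `model_can_build_skb()` and the `MODEL_BUILD_SKB_MAX` threshold are hypothetical, since the commit message does not state the size limit.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size threshold; the real limit is not given in the message. */
#define MODEL_BUILD_SKB_MAX 2048

struct model_rx_ctx {
    bool   is_tap;       /* TAP instead of TUN */
    int    sndbuf;       /* sk_sndbuf of the backing socket */
    bool   nonblocking;  /* caller does not want to be blocked */
    bool   zerocopy;     /* zerocopy requested for this packet */
    size_t len;          /* packet length in bytes */
};

/* True only when every condition for the build_skb() path holds. */
static bool model_can_build_skb(const struct model_rx_ctx *c)
{
    return c->is_tap &&
           c->sndbuf == INT_MAX &&
           c->nonblocking &&
           !c->zerocopy &&
           c->len <= MODEL_BUILD_SKB_MAX;
}
```

Any packet that fails one of the checks would stay on the existing socket-based allocation path.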
  18. 04 Aug, 2017 1 commit
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Committed by Willem de Bruijn
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, .. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy copy buffers, ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 25 Jul, 2017 1 commit
  20. 18 Jul, 2017 1 commit
  21. 27 Jun, 2017 1 commit
  22. 08 Jun, 2017 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      Committed by David S. Miller
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
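The ownership rule introduced by the commit above can be modelled in userspace: when registration fails after init, the registration routine itself runs the uninit step and the private destructor, so the caller must not release private state a second time. Everything named `model_*` is hypothetical, and the counters exist only to make the behaviour observable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_netdev {
    void (*priv_destructor)(struct model_netdev *dev);
    bool needs_free_netdev; /* whether unregister should also free the device */
    int  priv_released;     /* how many times private state was released */
    int  uninit_calls;      /* how many times the uninit step ran */
};

/* Stand-in for a driver's private cleanup. */
static void model_priv_release(struct model_netdev *dev)
{
    dev->priv_released++;
}

/* Simulate registration; fail_after_init forces a late failure. */
static int model_register(struct model_netdev *dev, bool fail_after_init)
{
    /* The init step is assumed to have succeeded at this point. */
    if (fail_after_init) {
        dev->uninit_calls++;               /* models ndo_uninit() */
        if (dev->priv_destructor)
            dev->priv_destructor(dev);     /* private state released once */
        return -EINVAL;
    }
    return 0;
}
```

On failure the caller in this model is left with only the final free of the device structure, which avoids both the leak and the double release.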
  23. 07 Jun, 2017 1 commit
  24. 18 May, 2017 2 commits
  25. 22 Mar, 2017 1 commit
  26. 14 Mar, 2017 2 commits
  27. 10 Mar, 2017 1 commit
  28. 02 Mar, 2017 1 commit
  29. 07 Feb, 2017 1 commit
  30. 21 Jan, 2017 1 commit
  31. 19 Jan, 2017 1 commit
    • tun: rx batching · 5503fcec
      Committed by Jason Wang
      We can only process 1 packet at a time during sendmsg(). This often
      leads to bad cache utilization under heavy load. So this patch tries to do
      some batching during rx before submitting packets to the host network
      stack. This is done by accepting MSG_MORE as a hint from the
      sendmsg() caller: if it is set, batch the packet temporarily in a
      linked list, and submit them all once MSG_MORE is cleared.
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Users are allowed to change the per-device number of batched packets through
      ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as the upper limit
      to prevent bh from being disabled for too long.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
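The batching scheme described in the commit above can be sketched as a small userspace state machine. `model_batch`, `model_rx()` and `MODEL_BATCH_LIMIT` are hypothetical names; the array stands in for the linked list mentioned in the message, and the counters only make submissions observable.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BATCH_LIMIT 64 /* upper bound, standing in for the NAPI weight cap */

struct model_batch {
    int    pkts[MODEL_BATCH_LIMIT]; /* packets currently held back */
    size_t queued;                  /* number of held-back packets */
    size_t submitted;               /* packets handed to the stack so far */
    size_t flushes;                 /* number of submissions to the stack */
};

/* Receive one packet. more models the MSG_MORE hint; rx_frames models the
 * per-device batch size (0 disables batching). */
static void model_rx(struct model_batch *b, int pkt, bool more, size_t rx_frames)
{
    if (rx_frames > MODEL_BATCH_LIMIT)
        rx_frames = MODEL_BATCH_LIMIT;

    /* Batching off, or no hint and nothing pending: submit directly. */
    if (rx_frames == 0 || (!more && b->queued == 0)) {
        b->submitted++;
        b->flushes++;
        return;
    }

    /* Hint cleared or batch full: submit the backlog plus this packet. */
    if (!more || b->queued >= rx_frames) {
        b->submitted += b->queued + 1;
        b->queued = 0;
        b->flushes++;
        return;
    }

    /* Otherwise hold the packet back for a later submission. */
    b->pkts[b->queued++] = pkt;
}
```

Capping the batch size bounds how long packets can be held back, which matches the message's concern about keeping bottom halves disabled for too long.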