1. 10 3月, 2018 1 次提交
  2. 27 2月, 2018 3 次提交
  3. 17 2月, 2018 1 次提交
    • E
      tun: fix tun_napi_alloc_frags() frag allocator · 43a08e0f
      Eric Dumazet 提交于
      <Mark Rutland reported>
          While fuzzing arm64 v4.16-rc1 with Syzkaller, I've been hitting a
          misaligned atomic in __skb_clone:
      
              atomic_inc(&(skb_shinfo(skb)->dataref));
      
         where dataref doesn't have the required natural alignment, and the
         atomic operation faults. e.g. i often see it aligned to a single
         byte boundary rather than a four byte boundary.
      
         AFAICT, the skb_shared_info is misaligned at the instant it's
         allocated in __napi_alloc_skb()  __napi_alloc_skb()
      </end of report>
      
      Problem is caused by tun_napi_alloc_frags() using
      napi_alloc_frag() with user provided seg sizes,
      leading to other users of this API getting unaligned
      page fragments.
      
      Since we would like to not necessarily add paddings or alignments to
      the frags that tun_napi_alloc_frags() attaches to the skb, switch to
      another page frag allocator.
      
      As a bonus skb_page_frag_refill() can use GFP_KERNEL allocations,
      meaning that we can not deplete memory reserves as easily.
      
      Fixes: 90e33d45 ("tun: enable napi_gro_frags() for TUN/TAP driver")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43a08e0f
  4. 12 2月, 2018 1 次提交
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds 提交于
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  5. 09 2月, 2018 1 次提交
    • J
      tuntap: add missing xdp flush · 762c330d
      Jason Wang 提交于
      When using devmap to redirect packets between interfaces,
      xdp_do_flush() is usually a must to flush any batched
      packets. Unfortunately this is missed in current tuntap
      implementation.
      
      Unlike most hardware driver which did XDP inside NAPI loop and call
      xdp_do_flush() at then end of each round of poll. TAP did it in the
      context of process e.g tun_get_user(). So fix this by count the
      pending redirected packets and flush when it exceeds NAPI_POLL_WEIGHT
      or MSG_MORE was cleared by sendmsg() caller.
      
      With this fix, xdp_redirect_map works again between two TAPs.
      
      Fixes: 761876c8 ("tap: XDP support")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      762c330d
  6. 23 1月, 2018 1 次提交
  7. 22 1月, 2018 1 次提交
  8. 18 1月, 2018 3 次提交
  9. 09 1月, 2018 2 次提交
    • J
      tuntap: XDP transmission · fc72d1d5
      Jason Wang 提交于
      This patch implements XDP transmission for TAP. Since we can't create
      new queues for TAP during XDP set, exist ptr_ring was reused for
      queuing XDP buffers. To differ xdp_buff from sk_buff, TUN_XDP_FLAG
      (0x1UL) was encoded into lowest bit of xpd_buff pointer during
      ptr_ring_produce, and was decoded during consuming. XDP metadata was
      stored in the headroom of the packet which should work in most of
      cases since driver usually reserve enough headroom. Very minor changes
      were done for vhost_net: it just need to peek the length depends on
      the type of pointer.
      
      Tests were done on two Intel E5-2630 2.40GHz machines connected back
      to back through two 82599ES. Traffic were generated/received through
      MoonGen/testpmd(rxonly). It reports ~20% improvements when
      xdp_redirect_map is doing redirection from ixgbe to TAP (from 2.50Mpps
      to 3.05Mpps)
      
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc72d1d5
    • J
      tun/tap: use ptr_ring instead of skb_array · 5990a305
      Jason Wang 提交于
      This patch switches to use ptr_ring instead of skb_array. This will be
      used to enqueue different types of pointers by encoding type into
      lower bits.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5990a305
  10. 06 1月, 2018 1 次提交
  11. 08 12月, 2017 1 次提交
  12. 07 12月, 2017 1 次提交
  13. 06 12月, 2017 1 次提交
  14. 03 12月, 2017 2 次提交
  15. 29 11月, 2017 1 次提交
  16. 24 11月, 2017 1 次提交
    • W
      net: accept UFO datagrams from tuntap and packet · 0c19f846
      Willem de Bruijn 提交于
      Tuntap and similar devices can inject GSO packets. Accept type
      VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
      
      Processes are expected to use feature negotiation such as TUNSETOFFLOAD
      to detect supported offload types and refrain from injecting other
      packets. This process breaks down with live migration: guest kernels
      do not renegotiate flags, so destination hosts need to expose all
      features that the source host does.
      
      Partially revert the UFO removal from 182e0b6b~1..d9d30adf.
      This patch introduces nearly(*) no new code to simplify verification.
      It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
      insertion and software UFO segmentation.
      
      It does not reinstate protocol stack support, hardware offload
      (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
      of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
      
      To support SKB_GSO_UDP reappearing in the stack, also reinstate
      logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
      by squashing in commit 93991221 ("net: skb_needs_check() removes
      CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee6
      ("net: avoid skb_warn_bad_offload false positives on UFO").
      
      (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
      ipv6_proxy_select_ident is changed to return a __be32 and this is
      assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
      at the end of the enum to minimize code churn.
      
      Tested
        Booted a v4.13 guest kernel with QEMU. On a host kernel before this
        patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
        enabled, same as on a v4.13 host kernel.
      
        A UFO packet sent from the guest appears on the tap device:
          host:
            nc -l -p -u 8000 &
            tcpdump -n -i tap0
      
          guest:
            dd if=/dev/zero of=payload.txt bs=1 count=2000
            nc -u 192.16.1.1 8000 < payload.txt
      
        Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
        packets arriving fragmented:
      
          ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
          (from https://github.com/wdebruij/kerneltools/tree/master/tests)
      
      Changes
        v1 -> v2
          - simplified set_offload change (review comment)
          - documented test procedure
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
      Fixes: fb652fdf ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c19f846
  17. 22 11月, 2017 1 次提交
    • K
      treewide: setup_timer() -> timer_setup() · e99e88a9
      Kees Cook 提交于
      This converts all remaining cases of the old setup_timer() API into using
      timer_setup(), where the callback argument is the structure already
      holding the struct timer_list. These should have no behavioral changes,
      since they just change which pointer is passed into the callback with
      the same available pointers after conversion. It handles the following
      examples, in addition to some other variations.
      
      Casting from unsigned long:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, ptr);
      
      and forced object casts:
      
          void my_callback(struct something *ptr)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
      
      become:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      Direct function assignments:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          ptr->my_timer.function = my_callback;
      
      have a temporary cast added, along with converting the args:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
      
      And finally, callbacks without a data assignment:
      
          void my_callback(unsigned long data)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, 0);
      
      have their argument renamed to verify they're unused during conversion:
      
          void my_callback(struct timer_list *unused)
          {
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      The conversion is done with the following Coccinelle script:
      
      spatch --very-quiet --all-includes --include-headers \
      	-I ./arch/x86/include -I ./arch/x86/include/generated \
      	-I ./include -I ./arch/x86/include/uapi \
      	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
      	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
      	--dir . \
      	--cocci-file ~/src/data/timer_setup.cocci
      
      @fix_address_of@
      expression e;
      @@
      
       setup_timer(
      -&(e)
      +&e
       , ...)
      
      // Update any raw setup_timer() usages that have a NULL callback, but
      // would otherwise match change_timer_function_usage, since the latter
      // will update all function assignments done in the face of a NULL
      // function initialization in setup_timer().
      @change_timer_function_usage_NULL@
      expression _E;
      identifier _timer;
      type _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, NULL, _E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, &_E);
      +timer_setup(&_E._timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
      +timer_setup(&_E._timer, NULL, 0);
      )
      
      @change_timer_function_usage@
      expression _E;
      identifier _timer;
      struct timer_list _stl;
      identifier _callback;
      type _cast_func, _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
       _E->_timer@_stl.function = _callback;
      |
       _E->_timer@_stl.function = &_callback;
      |
       _E->_timer@_stl.function = (_cast_func)_callback;
      |
       _E->_timer@_stl.function = (_cast_func)&_callback;
      |
       _E._timer@_stl.function = _callback;
      |
       _E._timer@_stl.function = &_callback;
      |
       _E._timer@_stl.function = (_cast_func)_callback;
      |
       _E._timer@_stl.function = (_cast_func)&_callback;
      )
      
      // callback(unsigned long arg)
      @change_callback_handle_cast
       depends on change_timer_function_usage@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      (
      	... when != _origarg
      	_handletype *_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      )
       }
      
      // callback(unsigned long arg) without existing variable
      @change_callback_handle_cast_no_arg
       depends on change_timer_function_usage &&
                           !change_callback_handle_cast@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      +	_handletype *_origarg = from_timer(_origarg, t, _timer);
      +
      	... when != _origarg
      -	(_handletype *)_origarg
      +	_origarg
      	... when != _origarg
       }
      
      // Avoid already converted callbacks.
      @match_callback_converted
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
      	    !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       { ... }
      
      // callback(struct something *handle)
      @change_callback_handle_arg
       depends on change_timer_function_usage &&
      	    !match_callback_converted &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_handletype *_handle
      +struct timer_list *t
       )
       {
      +	_handletype *_handle = from_timer(_handle, t, _timer);
      	...
       }
      
      // If change_callback_handle_arg ran on an empty function, remove
      // the added handler.
      @unchange_callback_handle_arg
       depends on change_timer_function_usage &&
      	    change_callback_handle_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       {
      -	_handletype *_handle = from_timer(_handle, t, _timer);
       }
      
      // We only want to refactor the setup_timer() data argument if we've found
      // the matching callback. This undoes changes in change_timer_function_usage.
      @unchange_timer_function_usage
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg &&
      	    !change_callback_handle_arg@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type change_timer_function_usage._cast_data;
      @@
      
      (
      -timer_setup(&_E->_timer, _callback, 0);
      +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      |
      -timer_setup(&_E._timer, _callback, 0);
      +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      )
      
      // If we fixed a callback from a .function assignment, fix the
      // assignment cast now.
      @change_timer_function_assignment
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_func;
      typedef TIMER_FUNC_TYPE;
      @@
      
      (
       _E->_timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -&_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      )
      
      // Sometimes timer functions are called directly. Replace matched args.
      @change_timer_function_calls
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression _E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_data;
      @@
      
       _callback(
      (
      -(_cast_data)_E
      +&_E->_timer
      |
      -(_cast_data)&_E
      +&_E._timer
      |
      -_E
      +&_E->_timer
      )
       )
      
      // If a timer has been configured without a data argument, it can be
      // converted without regard to the callback argument, since it is unused.
      @match_timer_function_unused_data@
      expression _E;
      identifier _timer;
      identifier _callback;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, 0);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0L);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0UL);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0L);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0UL);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0L);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0UL);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0L);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0UL);
      +timer_setup(_timer, _callback, 0);
      )
      
      @change_callback_unused_data
       depends on match_timer_function_unused_data@
      identifier match_timer_function_unused_data._callback;
      type _origtype;
      identifier _origarg;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *unused
       )
       {
      	... when != _origarg
       }
      Signed-off-by: NKees Cook <keescook@chromium.org>
      e99e88a9
  18. 19 11月, 2017 1 次提交
  19. 05 11月, 2017 1 次提交
  20. 01 11月, 2017 1 次提交
    • C
      tun/tap: sanitize TUNSETSNDBUF input · 93161922
      Craig Gallek 提交于
      Syzkaller found several variants of the lockup below by setting negative
      values with the TUNSETSNDBUF ioctl.  This patch adds a sanity check
      to both the tun and tap versions of this ioctl.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389]
        Modules linked in:
        irq event stamp: 329692056
        hardirqs last  enabled at (329692055): [<ffffffff824b8381>] _raw_spin_unlock_irqrestore+0x31/0x75
        hardirqs last disabled at (329692056): [<ffffffff824b9e58>] apic_timer_interrupt+0x98/0xb0
        softirqs last  enabled at (35659740): [<ffffffff824bc958>] __do_softirq+0x328/0x48c
        softirqs last disabled at (35659731): [<ffffffff811c796c>] irq_exit+0xbc/0xd0
        CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff880009452140 task.stack: ffff880006a20000
        RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80
        RSP: 0018:ffff880006a27c50 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
        RAX: ffff880009ac68d0 RBX: ffff880006a27ce0 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff880006a27ce0 RDI: ffff880009ac6900
        RBP: ffff880006a27c60 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000001 R11: 000000000063ff00 R12: ffff880009ac6900
        R13: ffff880006a27cf8 R14: 0000000000000001 R15: ffff880006a27cf8
        FS:  00007f4be4838700(0000) GS:ffff88000cc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000020101000 CR3: 0000000009616000 CR4: 00000000000006f0
        Call Trace:
         prepare_to_wait+0x26/0xc0
         sock_alloc_send_pskb+0x14e/0x270
         ? remove_wait_queue+0x60/0x60
         tun_get_user+0x2cc/0x19d0
         ? __tun_get+0x60/0x1b0
         tun_chr_write_iter+0x57/0x86
         __vfs_write+0x156/0x1e0
         vfs_write+0xf7/0x230
         SyS_write+0x57/0xd0
         entry_SYSCALL_64_fastpath+0x1f/0xbe
        RIP: 0033:0x7f4be4356df9
        RSP: 002b:00007ffc18101c08 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4be4356df9
        RDX: 0000000000000046 RSI: 0000000020101000 RDI: 0000000000000005
        RBP: 00007ffc18101c40 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000293 R12: 0000559c75f64780
        R13: 00007ffc18101d30 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: 33dccbb0 ("tun: Limit amount of queued packets per device")
      Fixes: 20d29d7a ("net: macvtap driver")
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93161922
  21. 28 10月, 2017 1 次提交
    • J
      tuntap: properly align skb->head before building skb · 63b9ab65
      Jason Wang 提交于
      An unaligned alloc_frag->offset caused by previous allocation will
      result an unaligned skb->head. This will lead unaligned
      skb_shared_info and then unaligned dataref which requires to be
      aligned for accessing on some architecture. Fix this by aligning
      alloc_frag->offset before the frag refilling.
      
      Fixes: 0bbd7dad ("tun: make tun_build_skb() thread safe")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Wei Wei <dotweiba@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Reported-by: NWei Wei <dotweiba@gmail.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63b9ab65
  22. 26 10月, 2017 1 次提交
  23. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  24. 22 10月, 2017 3 次提交
  25. 20 10月, 2017 1 次提交
    • E
      net-tun: fix panics at dismantle time · aec72f33
      Eric Dumazet 提交于
      syzkaller got crashes at dismantle time [1]
      
      It is not correct to test (tun->flags & IFF_NAPI) in tun_napi_disable()
      and tun_napi_del() : Each tun_file can have different mode, depending
      on how they were created.
      
      Similarly I have changed tun_get_user() and tun_poll_controller()
      to use the new tfile->napi_enabled boolean.
      
      [  154.331360] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  154.339220] IP: [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
      [  154.344983] PGD 0
      [  154.347009] Oops: 0000 [#1] SMP
      [  154.350680] gsmi: Log Shutdown Reason 0x03
      [  154.379572] task: ffff994719150dc0 ti: ffff99475c0ae000 task.ti: ffff99475c0ae000
      [  154.387043] RIP: 0010:[<ffffffff9634cad6>]  [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
      [  154.395232] RSP: 0018:ffff99475c0afce8  EFLAGS: 00010246
      [  154.400542] RAX: ffff994754850ac0 RBX: ffff994753e65408 RCX: ffff994753e65388
      [  154.407666] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff994753e65408
      [  154.414790] RBP: ffff99475c0afce8 R08: 0000000000000000 R09: 0000000000000000
      [  154.421921] R10: ffff99475f6f5910 R11: 0000000000000001 R12: 0000000000000000
      [  154.429044] R13: ffff99417deab668 R14: ffff99417deaa780 R15: ffff99475f45dde0
      [  154.436174] FS:  0000000000000000(0000) GS:ffff994767a00000(0000) knlGS:0000000000000000
      [  154.444249] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  154.449986] CR2: 0000000000000000 CR3: 00000005a8a0e000 CR4: 0000000000022670
      [  154.457110] Stack:
      [  154.459120]  ffff99475c0afd28 ffffffff9634d614 1000000000000000 0000000000000000
      [  154.466598]  ffffe54240000000 ffff994753e65408 ffff994753e653a8 ffff99417deab668
      [  154.474067]  ffff99475c0afd48 ffffffff9634d6fd ffff99474c2be678 ffff994753e65398
      [  154.481537] Call Trace:
      [  154.483985]  [<ffffffff9634d614>] hrtimer_try_to_cancel+0x24/0xf0
      [  154.490074]  [<ffffffff9634d6fd>] hrtimer_cancel+0x1d/0x30
      [  154.495563]  [<ffffffff96860b3c>] napi_disable+0x3c/0x70
      [  154.500875]  [<ffffffff9678ae62>] __tun_detach+0xd2/0x360
      [  154.506272]  [<ffffffff9678b117>] tun_chr_close+0x27/0x40
      [  154.511669]  [<ffffffff9646ebe6>] __fput+0xd6/0x1e0
      [  154.516548]  [<ffffffff9646ed3e>] ____fput+0xe/0x10
      [  154.521429]  [<ffffffff963035a2>] task_work_run+0x72/0x90
      [  154.526827]  [<ffffffff962e9407>] do_exit+0x317/0xb60
      [  154.531879]  [<ffffffff962e9c8f>] do_group_exit+0x3f/0xa0
      [  154.537275]  [<ffffffff962e9d07>] SyS_exit_group+0x17/0x20
      [  154.542769]  [<ffffffff969784be>] entry_SYSCALL_64_fastpath+0x12/0x17
      
      Fixes: 94317099 ("net-tun: enable NAPI for TUN/TAP driver")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aec72f33
  26. 19 10月, 2017 1 次提交
  27. 17 10月, 2017 1 次提交
    • C
      tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Cong Wang 提交于
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called. For tun
      device, this is a problem because a timer etc. are already
      initialized and it expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into a ->ndo_init() so
      that register_netdevice() knows better, however this is still
      complicated due to the logic in tun_detach().
      
      Therefore, I choose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit.
      And for this specific case, it is already enough.
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: NDmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ad646c8
  28. 28 9月, 2017 1 次提交
    • A
      tun: bail out from tun_get_user() if the skb is empty · 2580c4c1
      Alexander Potapenko 提交于
      KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
      skb->data[0] in the case the skb is empty (i.e. skb->len is 0):
      
      ================================================
      BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
      CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
      ...
       __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
       tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
      ...
      origin:
      ...
       kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
       slab_alloc_node mm/slub.c:2732
       __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
       __kmalloc_reserve net/core/skbuff.c:138
       __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
       alloc_skb ./include/linux/skbuff.h:903
       alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
       sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
       tun_alloc_skb drivers/net/tun.c:1144
       tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
       tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
       call_write_iter ./include/linux/fs.h:1743
       new_sync_write fs/read_write.c:457
       __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
       vfs_write+0x3e4/0x770 fs/read_write.c:518
       SYSC_write+0x12f/0x2b0 fs/read_write.c:565
       SyS_write+0x55/0x80 fs/read_write.c:557
       do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
       return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
      ================================================
      
      Make sure tun_get_user() doesn't touch skb->data[0] unless there is
      actual data.
      
      C reproducer below:
      ==========================
          // autogenerated by syzkaller (http://github.com/google/syzkaller)
      
          #define _GNU_SOURCE
      
          #include <fcntl.h>
          #include <linux/if_tun.h>
          #include <netinet/ip.h>
          #include <net/if.h>
          #include <string.h>
          #include <sys/ioctl.h>
      
          int main()
          {
            int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
            int tun_fd = open("/dev/net/tun", O_RDWR);
            struct ifreq req;
            memset(&req, 0, sizeof(struct ifreq));
            strcpy((char*)&req.ifr_name, "gre0");
            req.ifr_flags = IFF_UP | IFF_MULTICAST;
            ioctl(tun_fd, TUNSETIFF, &req);
            ioctl(sock, SIOCSIFFLAGS, "gre0");
            write(tun_fd, "hi", 0);
            return 0;
          }
      ==========================
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2580c4c1
  29. 27 9月, 2017 1 次提交
    • D
      bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann 提交于
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from it's
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8f3a83
  30. 26 9月, 2017 3 次提交
    • Y
      tun: delete original tun_get() and rename __tun_get() to tun_get() · 9484dc74
      yuan linyu 提交于
      it seems no need to keep tun_get() and __tun_get() at same time.
      Signed-off-by: Nyuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9484dc74
    • P
      tun: enable napi_gro_frags() for TUN/TAP driver · 90e33d45
      Petar Penkov 提交于
      Add a TUN/TAP receive mode that exercises the napi_gro_frags()
      interface. This mode is available only in TAP mode, as the interface
      expects packets with Ethernet headers.
      
      Furthermore, packets follow the layout of the iovec_iter that was
      received. The first iovec is the linear data, and every one after the
      first is a fragment. If there are more fragments than the max number,
      drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
      dissector code and to verify that the header resides in the linear data.
      
      The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
      This is imposed because this mode is intended for testing via tools like
      syzkaller and packetdrill, and the increased flexibility it provides can
      introduce security vulnerabilities. This flag is accepted only if the
      device is in TAP mode and has the IFF_NAPI flag set as well. This is
      done because both of these are explicit requirements for correct
      operation in this mode.
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google,com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90e33d45
    • P
      tun: enable NAPI for TUN/TAP driver · 94317099
      Petar Penkov 提交于
      Changes TUN driver to use napi_gro_receive() upon receiving packets
      rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
      changes and operation is not affected if the flag is disabled.  SKBs
      are constructed upon packet arrival and are queued to be processed
      later.
      
      The new path was evaluated with a benchmark with the following setup:
      Open two tap devices and a receiver thread that reads in a loop for
      each device. Start one sender thread and pin all threads to different
      CPUs. Send 1M minimum UDP packets to each device and measure sending
      time for each of the sending methods:
      	napi_gro_receive():	4.90s
      	netif_rx_ni():		4.90s
      	netif_receive_skb():	7.20s
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94317099