1. 14 Dec 2018, 1 commit
  2. 07 Dec 2018, 2 commits
  3. 04 Dec 2018, 1 commit
    • tun: remove skb access after netif_receive_skb · 4e4b08e5
      Authored by Prashant Bhole
      In tun.c, skb->len was accessed for stats accounting after a call to
      netif_receive_skb(). The skb must not be touched after this call
      because the buffer may already have been dropped (freed).

      One fix would be to store skb->len in a local variable and use that
      after netif_receive_skb(). Using the xdp data size for byte accounting
      is better, however, because the input to tun_xdp_one() is an xdp_buff.

      Hence this patch:
      - fixes the bug by removing the skb access after netif_receive_skb()
      - uses the xdp data size for byte accounting (see the sketch below)
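
      A minimal sketch of the ordering change described above (illustrative
      only, not the literal upstream diff; tun_stats_add() stands in for the
      driver's per-cpu stats update):

        /* before (buggy): the stack may already have freed the skb */
        netif_receive_skb(skb);
        tun_stats_add(tun, skb->len);                  /* use-after-free */

        /* after (fixed): byte count taken from the xdp_buff up front */
        unsigned int datasize = xdp->data_end - xdp->data;
        netif_receive_skb(skb);
        tun_stats_add(tun, datasize);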
      
      [613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
      [613.021062] Read of size 4 at addr ffff8881da9ab7c0 by task vhost-1115/1155
      [613.023073]
      [613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
      [613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [613.029116] Call Trace:
      [613.031145]  dump_stack+0x5b/0x90
      [613.032219]  print_address_description+0x6c/0x23c
      [613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
      [613.036141]  kasan_report.cold.5+0x241/0x308
      [613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
      [613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
      [613.042094]  ? __isolate_free_page+0x270/0x270
      [613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
      [613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
      [613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
      [613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
      [613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
      [613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
      [613.053139]  ? __mutex_lock+0x8d9/0xb30
      [613.053139]  ? finish_task_switch+0x8f/0x3f0
      [613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
      [613.053139]  ? mutex_trylock+0x110/0x110
      [613.053139]  ? finish_task_switch+0xcf/0x3f0
      [613.053139]  ? finish_task_switch+0x240/0x3f0
      [613.053139]  ? __switch_to_asm+0x34/0x70
      [613.053139]  ? __switch_to_asm+0x40/0x70
      [613.053139]  ? __schedule+0x506/0xf10
      [613.053139]  handle_tx+0xc7/0x120 [vhost_net]
      [613.053139]  vhost_worker+0x166/0x200 [vhost]
      [613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
      [613.053139]  ? __kthread_parkme+0x77/0x90
      [613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
      [613.053139]  kthread+0x1b1/0x1d0
      [613.053139]  ? kthread_park+0xb0/0xb0
      [613.053139]  ret_from_fork+0x35/0x40
      [613.088705]
      [613.088705] Allocated by task 1155:
      [613.088705]  kasan_kmalloc+0xbf/0xe0
      [613.088705]  kmem_cache_alloc+0xdc/0x220
      [613.088705]  __build_skb+0x2a/0x160
      [613.088705]  build_skb+0x14/0xc0
      [613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
      [613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
      [613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
      [613.088705]  handle_tx+0xc7/0x120 [vhost_net]
      [613.088705]  vhost_worker+0x166/0x200 [vhost]
      [613.088705]  kthread+0x1b1/0x1d0
      [613.088705]  ret_from_fork+0x35/0x40
      [613.088705]
      [613.088705] Freed by task 1155:
      [613.088705]  __kasan_slab_free+0x12e/0x180
      [613.088705]  kmem_cache_free+0xa0/0x230
      [613.088705]  ip6_mc_input+0x40f/0x5a0
      [613.088705]  ipv6_rcv+0xc9/0x1e0
      [613.088705]  __netif_receive_skb_one_core+0xc1/0x100
      [613.088705]  netif_receive_skb_internal+0xc4/0x270
      [613.088705]  br_pass_frame_up+0x2b9/0x2e0
      [613.088705]  br_handle_frame_finish+0x2fb/0x7a0
      [613.088705]  br_handle_frame+0x30f/0x6c0
      [613.088705]  __netif_receive_skb_core+0x61a/0x15b0
      [613.088705]  __netif_receive_skb_one_core+0x8e/0x100
      [613.088705]  netif_receive_skb_internal+0xc4/0x270
      [613.088705]  tun_sendmsg+0x738/0xc50 [tun]
      [613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
      [613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
      [613.088705]  handle_tx+0xc7/0x120 [vhost_net]
      [613.088705]  vhost_worker+0x166/0x200 [vhost]
      [613.088705]  kthread+0x1b1/0x1d0
      [613.088705]  ret_from_fork+0x35/0x40
      [613.088705]
      [613.088705] The buggy address belongs to the object at ffff8881da9ab740
      [613.088705]  which belongs to the cache skbuff_head_cache of size 232
      
      Fixes: 043d222f ("tuntap: accept an array of XDP buffs through sendmsg()")
      Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 01 Dec 2018, 2 commits
  5. 19 Nov 2018, 2 commits
    • tuntap: fix multiqueue rx · 8ebebcba
      Authored by Matthew Cover
      When writing packets to a descriptor associated with a combined queue, the
      packets should end up on that queue.
      
      Before this change, all packets written to any descriptor associated
      with a tap interface ended up on rx-0, even when the descriptor was
      associated with a different queue.
      
      The rx traffic can be generated by either of the following.
        1. a simple tap program which spins up multiple queues and writes packets
           to each of the file descriptors
        2. tx from a qemu vm with a tap multiqueue netdev
      
      The queue for rx traffic can be observed by either of the following (done
      on the hypervisor in the qemu case).
        1. a simple netmap program which opens and reads from per-queue
           descriptors
        2. configuring RPS and doing per-cpu captures with rxtxcpu
      
      Alternatively, if you printk() the return value of skb_get_rx_queue() just
      before each instance of netif_receive_skb() in tun.c, you will get 65535
      for every skb.
      
      Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
      the association between descriptor and rx queue.
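
      A sketch of the pattern the fix applies (illustrative; the real change
      touches the receive paths in tun.c):

        /* record the rx queue before the skb is handed to the stack */
        skb_record_rx_queue(skb, tfile->queue_index);
        netif_receive_skb(skb);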
      Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: use netdev_alloc_frag() in tun_napi_alloc_frags() · aa6daaca
      Authored by Eric Dumazet
      In order to build skbs the same way Ethernet drivers do, it is better
      not to use GFP_KERNEL, but rather to use the GFP_ATOMIC and PFMEMALLOC
      mechanisms provided by netdev_alloc_frag().

      This allows the tun driver to be used even under memory pressure,
      especially if swap is carried over this tun channel.
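
      A rough sketch of the allocation this switches to (the surrounding
      context is assumed; buflen is a placeholder for the required fragment
      size):

        /* page-frag allocator used by RX paths: GFP_ATOMIC and PFMEMALLOC aware */
        buf = netdev_alloc_frag(buflen);
        if (!buf)
                return ERR_PTR(-ENOMEM);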
      
      Fixes: 90e33d45 ("tun: enable napi_gro_frags() for TUN/TAP driver")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Petar Penkov <peterpenkov96@gmail.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 18 Nov 2018, 2 commits
  7. 08 Nov 2018, 1 commit
  8. 16 Oct 2018, 1 commit
    • tun: Consistently configure generic netdev params via rtnetlink · df52eab2
      Authored by Serhey Popovych
      Configuring generic network device parameters on tun will fail in the
      presence of an IFLA_INFO_KIND attribute in the IFLA_LINKINFO nested
      attribute, since tun_validate() always returns failure.
      
      This can be visualized with following ip-link(8) command sequences:
      
        # ip link set dev tun0 group 100
        # ip link set dev tun0 group 100 type tun
        RTNETLINK answers: Invalid argument
      
      with contrast to dummy and veth drivers:
      
        # ip link set dev dummy0 group 100
        # ip link set dev dummy0 type dummy
      
        # ip link set dev veth0 group 100
        # ip link set dev veth0 group 100 type veth
      
      Fix this by returning zero in tun_validate() when @data is NULL, which
      is always the case since rtnl_link_ops->maxtype is zero in the tun
      driver.
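
      A minimal sketch of the resulting validation logic (assumed shape, not
      necessarily the exact upstream function body):

        static int tun_validate(struct nlattr *tb[], struct nlattr *data[],
                                struct netlink_ext_ack *extack)
        {
                /* always taken: tun sets rtnl_link_ops->maxtype to zero */
                if (!data)
                        return 0;

                /* creating/changing tun devices via rtnetlink stays unsupported */
                return -EINVAL;
        }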
      
      Fixes: f019a7a5 ("tun: Implement ip link del tunXXX")
      Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 11 Oct 2018, 1 commit
  10. 02 Oct 2018, 3 commits
    • tun: napi flags belong to tfile · af3fb24e
      Authored by Eric Dumazet
      Since tun->flags might be shared by multiple tfile structures,
      it is better to make sure tun_get_user() is using the flags
      for the current tfile.
      
      The presence of the READ_ONCE() in tun_napi_frags_enabled() hinted at
      what could happen, but something stronger is needed to satisfy syzbot.
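
      A sketch of the per-queue state this moves to (the field name is an
      assumption):

        static bool tun_napi_frags_enabled(const struct tun_file *tfile)
        {
                /* read the flag stored on the tfile, not on tun->flags */
                return tfile->napi_frags_enabled;
        }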
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4
      RSP: 0018:ffff8801c400f410 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325
      RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0
      RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000
      R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358
      R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004
      FS:  00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715
       tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922
       tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967
       call_write_iter include/linux/fs.h:1808 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
       vfs_write+0x1fc/0x560 fs/read_write.c:549
       ksys_write+0x101/0x260 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:607
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457579
      Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fe003614c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579
      RDX: 0000000000000012 RSI: 0000000020000000 RDI: 000000000000000a
      RBP: 000000000072c040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0036156d4
      R13: 00000000004c5574 R14: 00000000004d8e98 R15: 00000000ffffffff
      Modules linked in:
      
      RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4
      RSP: 0018:ffff8801c400f410 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325
      RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0
      RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000
      R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358
      R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004
      FS:  00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fixes: 90e33d45 ("tun: enable napi_gro_frags() for TUN/TAP driver")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: initialize napi_mutex unconditionally · c7256f57
      Authored by Eric Dumazet
      This is the first part of the fix for the following syzbot report:
      
      console output: https://syzkaller.appspot.com/x/log.txt?x=145378e6400000
      kernel config:  https://syzkaller.appspot.com/x/.config?x=443816db871edd66
      dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80
      
      The following patch fixes the race condition, but it seems safer to
      initialize this mutex at tfile creation anyway.
      
      Fixes: 90e33d45 ("tun: enable napi_gro_frags() for TUN/TAP driver")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+e662df0ac1d753b57e80@syzkaller.appspotmail.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: remove unused parameters · 06e55add
      Authored by Eric Dumazet
      tun_napi_disable() and tun_napi_del() do not need a pointer to the
      tun_struct.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 Sep 2018, 1 commit
    • tun: remove ndo_poll_controller · 765cdc20
      Authored by Eric Dumazet
      As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous
      on loaded hosts, since the cpu calling ndo_poll_controller() might
      steal all NAPI contexts (for all RX/TX queues of the NIC). This
      capture can last for an unlimited amount of time, since one cpu is
      generally not able to drain all the queues under load.

      tun uses NAPI for TX completions, so it is better to let the core
      networking stack call napi->poll() and avoid the capture.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 14 Sep 2018, 9 commits
  13. 05 Aug 2018, 1 commit
  14. 21 Jul 2018, 1 commit
    • signal: Use PIDTYPE_TGID to clearly store where file signals will be sent · 01919134
      Authored by Eric W. Biederman
      When f_setown is called a pid and a pid type are stored.  Replace the use
      of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
      group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
      is only for a thread.
      
      Update the users of __f_setown to use PIDTYPE_TGID instead of
      PIDTYPE_PID.
      
      For now the code continues to capture task_pid (when task_tgid would
      really be appropriate), and iterate on PIDTYPE_PID (even when type ==
      PIDTYPE_TGID) out of an abundance of caution to preserve existing
      behavior.
      
      Oleg Nesterov suggested that the test ensuring PIDTYPE_PID is used for
      the tgid lookup also be used to avoid taking the tasklist lock.
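
      For tun this is a caller-side change in the fasync path; a hedged
      illustration of the new call:

        /* deliver file signals (SIGIO) to the whole thread group */
        __f_setown(file, task_pid(current), PIDTYPE_TGID, 0);
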
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  15. 17 Jul 2018, 1 commit
  16. 14 Jul 2018, 1 commit
  17. 10 Jul 2018, 1 commit
  18. 08 Jun 2018, 1 commit
    • net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan · fd3a8862
      Authored by Willem de Bruijn
      Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
      to communicate packet metadata to userspace.
      
      For skbuffs with vlan, the first two return the packet as it may have
      existed on the wire, inserting the VLAN tag in the user buffer.  Then
      virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
      
      Commit f09e2249 ("macvtap: restore vlan header on user read")
      added this feature to macvtap. Commit 3ce9b20f ("macvtap: Fix
      csum_start when VLAN tags are present") then fixed up csum_start.
      
      Virtio, packet and uml do not insert the vlan header in the user
      buffer.
      
      When introducing virtio_net_hdr_from_skb to deduplicate filling in
      the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
      applied uniformly, breaking csum offset for packets with vlan on
      virtio and packet.
      
      Make insertion of VLAN_HLEN optional. Convert the callers to pass it
      when needed.
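
      A hedged sketch of the csum_start adjustment the callers now control
      (the vlan_hlen parameter name is approximate):

        /* account for a VLAN tag only if it was actually inserted into the
         * user-visible packet (tun/tap); pass vlan_hlen = 0 otherwise
         */
        hdr->csum_start = __cpu_to_virtio16(little_endian,
                                skb_checksum_start_offset(skb) + vlan_hlen);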
      
      Fixes: e858fae2 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
      Fixes: 1276f24e ("packet: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 05 Jun 2018, 2 commits
  20. 03 Jun 2018, 2 commits
  21. 29 May 2018, 1 commit
    • tun: Fix NULL pointer dereference in XDP redirect · 6547e387
      Authored by Toshiaki Makita
      XDP redirection must be called with bottom halves (bh) disabled.
      Otherwise a softirq can run another XDP program and its redirection
      functions, and the per-cpu static variable ri->map can be overwritten
      to NULL.

      The first trace below is the generic XDP case called from tun.
      
      [ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [ 3535.743974] PGD 0 P4D 0
      [ 3535.746530] Oops: 0000 [#1] SMP PTI
      [ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core
      [ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2
      [ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
      [ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
      [ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246
      [ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: ffff9fdf27fe3100
      [ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 0000000000000000
      [ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 0000000000000ec0
      [ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: ffff9fdf9e72c000
      [ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: ffff9fdfc82cdd00
      [ 3535.874990] FS:  0000000000000000(0000) GS:ffff9fdfcfe80000(0000) knlGS:0000000000000000
      [ 3535.883152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 00000000007626e0
      [ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 3535.910538] PKRU: 55555554
      [ 3535.913267] Call Trace:
      [ 3535.915736]  xdp_do_generic_redirect+0x7a/0x310
      [ 3535.920310]  do_xdp_generic.part.117+0x285/0x370
      [ 3535.924970]  tun_get_user+0x5b9/0x1260 [tun]
      [ 3535.929279]  tun_sendmsg+0x52/0x70 [tun]
      [ 3535.933237]  handle_tx+0x2ad/0x5f0 [vhost_net]
      [ 3535.937721]  vhost_worker+0xa5/0x100 [vhost]
      [ 3535.942030]  kthread+0xf5/0x130
      [ 3535.945198]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
      [ 3535.950031]  ? kthread_bind+0x10/0x10
      [ 3535.953727]  ret_from_fork+0x35/0x40
      [ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3
      [ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58
      [ 3535.982883] CR2: 0000000000000018
      [ 3535.987096] ---[ end trace 383b299dd1430240 ]---
      [ 3536.131325] Kernel panic - not syncing: Fatal exception
      [ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      And a kernel with the generic case fixed still panics in the tun
      driver's XDP redirect, because it disabled only preemption, not bh.
      
      [ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [ 2055.136662] PGD 0 P4D 0
      [ 2055.139219] Oops: 0000 [#1] SMP PTI
      [ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core
      [ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G        W         4.17.0-rc5-fix-tun+ #1
      [ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
      [ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
      [ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246
      [ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 0000000000000004
      [ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 0000000000000000
      [ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 0000000000005d80
      [ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: ffff998b097a9000
      [ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: ffff998b07607d78
      [ 2055.269879] FS:  0000000000000000(0000) GS:ffff8dbd8ff00000(0000) knlGS:0000000000000000
      [ 2055.278039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 00000000007626e0
      [ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 2055.305424] PKRU: 55555554
      [ 2055.308153] Call Trace:
      [ 2055.310624]  xdp_do_redirect+0x7b/0x380
      [ 2055.314499]  tun_get_user+0x10fe/0x12a0 [tun]
      [ 2055.318895]  tun_sendmsg+0x52/0x70 [tun]
      [ 2055.322852]  handle_tx+0x2ad/0x5f0 [vhost_net]
      [ 2055.327337]  vhost_worker+0xa5/0x100 [vhost]
      [ 2055.331646]  kthread+0xf5/0x130
      [ 2055.334813]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
      [ 2055.339646]  ? kthread_bind+0x10/0x10
      [ 2055.343343]  ret_from_fork+0x35/0x40
      [ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3
      [ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0
      [ 2055.372500] CR2: 0000000000000018
      [ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]---
      [ 2055.523626] Kernel panic - not syncing: Fatal exception
      [ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      v2:
       - Removed preempt_disable/enable since local_bh_disable() prevents
         preemption as well (feedback from Jason Wang).
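
      A sketch of the locking pattern the fix applies around tun's XDP
      redirect paths (illustrative, not the literal diff):

        local_bh_disable();
        rcu_read_lock();
        xdp_prog = rcu_dereference(tun->xdp_prog);
        /* run the program; on XDP_REDIRECT call xdp_do_redirect(), which
         * relies on per-cpu state that a softirq must not clobber
         */
        rcu_read_unlock();
        local_bh_enable();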
      
      Fixes: 761876c8 ("tap: XDP support")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 25 May 2018, 1 commit
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Authored by Jesper Dangaard Brouer
      This patch changes the API for ndo_xdp_xmit to support bulking
      xdp_frames.

      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge
      slowdown. Most of the slowdown is caused by indirect function calls in
      the DMA API, but the net_device->ndo_xdp_xmit() call contributes as
      well.

      Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map
      in a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed the
      following improvement:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

      With the frames available as a bulk inside the driver's ndo_xdp_xmit
      call, further optimizations are possible, such as bulk DMA-mapping for
      TX.

      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.

      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid per-frame producer locking and instead amortize the locking cost
      over the bulk (see the sketch below).
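
      The ndo goes from taking a single frame to taking an array plus a
      count; roughly (the exact signature at this point in history may
      differ slightly):

        /* returns the number of frames successfully queued for transmission */
        int (*ndo_xdp_xmit)(struct net_device *dev, int n,
                            struct xdp_frame **frames);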
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  23. 24 May 2018, 1 commit
    • tuntap: correctly set SOCKWQ_ASYNC_NOSPACE · 2f3ab622
      Authored by Jason Wang
      When the link is down, writes to the device might fail with -EIO, and
      userspace needs an indication when the status is resolved. As a fix,
      tun_net_open() attempts to wake up writers, but that is only effective
      if SOCKWQ_ASYNC_NOSPACE has been set in the past. This is not the case
      for vhost_net, which only polls for EPOLLOUT after it hits errors
      during sendmsg().

      This patch fixes that by making sure SOCKWQ_ASYNC_NOSPACE is set when
      the socket is not writable or the device is down, guaranteeing that
      EPOLLOUT will be raised in either tun_chr_poll() or
      tun_sock_write_space() once the device is up.
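
      A hedged sketch of the poll-side check (simplified; tun_sock_writeable()
      stands in for the driver's writability test):

        /* in tun_chr_poll(): arm the async-nospace flag when not writable,
         * then re-check so a concurrent wakeup is not lost
         */
        if (tun_sock_writeable(tun, tfile) ||
            (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
             tun_sock_writeable(tun, tfile)))
                mask |= EPOLLOUT | EPOLLWRNORM;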
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 1bd4978a ("tun: honor IFF_UP in tun_get_user()")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 17 May 2018, 1 commit