1. 03 Jul, 2019 4 commits
    • selftests: bpf: fix inlines in test_lwt_seg6local · 11aca65e
      Jiri Benc committed
      Selftests are reporting this failure in test_lwt_seg6local.sh:
      
      + ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2
      Error fetching program/map!
      Failed to parse eBPF program: Operation not permitted
      
      The problem is that __attribute__((always_inline)) alone is not enough to
      prevent clang from emitting those functions in .text. In that case, .text
      is not marked as relocatable.
      
      See the output of objdump -h test_lwt_seg6local.o:
      
      Idx Name          Size      VMA               LMA               File off  Algn
        0 .text         00003530  0000000000000000  0000000000000000  00000040  2**3
                        CONTENTS, ALLOC, LOAD, READONLY, CODE
      
      This causes the iproute bpf loader to fail in bpf_fetch_prog_sec:
      bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no
      relocatable .text section in the file.
      
      To fix this, convert to 'static __always_inline'.
      
      v2: Use 'static __always_inline' instead of 'static inline
          __attribute__((always_inline))'
      
      Fixes: c99a84ea ("selftests/bpf: test for seg6local End.BPF action")
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
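As a rough userspace illustration of the form the fix converts to (not the selftest's own code), the helper below is declared `static __always_inline`. The fallback macro definition and the helper name `ext_hdr_bytes` are assumptions made for this sketch; the only fact it relies on is that an IPv6 extension header's length field counts 8-byte units beyond the first 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback for toolchains whose headers do not already provide the macro. */
#ifndef __always_inline
#define __always_inline inline __attribute__((always_inline))
#endif

/*
 * Hypothetical helper: size in bytes of an IPv6 extension header whose
 * "Hdr Ext Len" field is hdrlen. Being static and always-inline, it is
 * folded into its callers instead of needing an out-of-line copy.
 */
static __always_inline uint32_t ext_hdr_bytes(uint8_t hdrlen)
{
    return ((uint32_t)hdrlen + 1) << 3;
}

/* Hypothetical caller: sums the sizes of n chained extension headers. */
static uint32_t ext_chain_bytes(const uint8_t *hdrlens, int n)
{
    uint32_t total = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        total += ext_hdr_bytes(hdrlens[i]);
    return total;
}
```

Whether a given object file still carries a `.text` section can be checked with `objdump -h`, as the commit message does.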
    • selftests: bpf: add tests for shifts by zero · ac8786c7
      Luke Nelson committed
      There are currently no tests for ALU64 shift operations when the shift
      amount is 0. This adds 6 new tests to make sure they are equivalent
      to a no-op. The x32 JIT had such bugs that could have been caught by
      these tests.
      
      Cc: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf, x32: Fix bug with ALU64 {LSH, RSH, ARSH} BPF_K shift by 0 · 6fa632e7
      Luke Nelson committed
      The current x32 BPF JIT does not correctly compile shift operations when
      the immediate shift amount is 0. The expected behavior is for this to
      be a no-op.
      
      The following program demonstrates the bug. The expected result is 1,
      but the current JITed code returns 2.
      
        r0 = 1
        r1 = 1
        r1 <<= 0
        if r1 == 1 goto end
        r0 = 2
      end:
        exit
      
      This patch simplifies the code and fixes the bug.
      
      Fixes: 03f5781b ("bpf, x86_32: add eBPF JIT compiler for ia32")
      Co-developed-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf, x32: Fix bug with ALU64 {LSH, RSH, ARSH} BPF_X shift by 0 · 68a8357e
      Luke Nelson committed
      The current x32 BPF JIT for shift operations is not correct when the
      shift amount in a register is 0. The expected behavior is a no-op, whereas
      the current implementation changes bits in the destination register.
      
      The following example demonstrates the bug. The expected result of this
      program is 1, but the current JITed code returns 2.
      
        r0 = 1
        r1 = 1
        r2 = 0
        r1 <<= r2
        if r1 == 1 goto end
        r0 = 2
      end:
        exit
      
      The bug is caused by an incorrect assumption by the JIT that a shift by
      32 clears the register. On x32 however, shifts use the lower 5 bits of
      the source, making a shift by 32 equivalent to a shift by 0.
      
      This patch fixes the bug using double-precision shifts, which also
      simplifies the code.
      
      Fixes: 03f5781b ("bpf, x86_32: add eBPF JIT compiler for ia32")
      Co-developed-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
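The count masking described in this commit can be modelled in plain C. This is only a sketch under stated assumptions, not the JIT's code: all function names are hypothetical, and `lsh64_naive` is a guess at the kind of sequence that goes wrong, based solely on the "a shift by 32 clears the register" assumption named in the message. `lsh64_shld` models a 64-bit left shift built from a double-precision shift (SHLD) on two 32-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit x86 shifts use only the low 5 bits of the count. */
static uint32_t x86_shl(uint32_t v, uint32_t cl) { return v << (cl & 31); }
static uint32_t x86_shr(uint32_t v, uint32_t cl) { return v >> (cl & 31); }

/* SHLD dst, src, cl: shift dst left, filling from the top bits of src. */
static uint32_t x86_shld(uint32_t dst, uint32_t src, uint32_t cl)
{
    uint32_t n = cl & 31;

    if (n == 0)
        return dst; /* a masked count of 0 leaves dst unchanged */
    return (dst << n) | (src >> (32 - n));
}

/* Assumed buggy pattern: relies on "lo >> 32" producing 0. */
static uint64_t lsh64_naive(uint64_t v, uint32_t n)
{
    assert(n < 32);
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);

    hi = x86_shl(hi, n) | x86_shr(lo, 32 - n); /* n == 0: ORs lo into hi */
    lo = x86_shl(lo, n);
    return ((uint64_t)hi << 32) | lo;
}

/* Double-precision variant: correct for every count in 0..63. */
static uint64_t lsh64_shld(uint64_t v, uint32_t n)
{
    assert(n < 64);
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);

    hi = x86_shld(hi, lo, n);
    lo = x86_shl(lo, n);
    if (n >= 32) { /* counts of 32 and above move lo into hi */
        hi = lo;
        lo = 0;
    }
    return ((uint64_t)hi << 32) | lo;
}
```

In this model, shifting 1 by 0 with `lsh64_naive` sets a bit in the upper half, so the result no longer compares equal to 1, which mirrors the failing comparison in the example program.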
  2. 29 Jun, 2019 1 commit
  3. 26 Jun, 2019 3 commits
  4. 25 Jun, 2019 1 commit
  5. 24 Jun, 2019 2 commits
  6. 18 Jun, 2019 2 commits
  7. 17 Jun, 2019 8 commits
    • lapb: fixed leak of control-blocks. · 6be8e297
      Jeremy Sowden committed
      lapb_register calls lapb_create_cb, which initializes the control-
      block's ref-count to one, and __lapb_insert_cb, which increments it when
      adding the new block to the list of blocks.
      
      lapb_unregister calls __lapb_remove_cb, which decrements the ref-count
      when removing control-block from the list of blocks, and calls lapb_put
      itself to decrement the ref-count before returning.
      
      However, lapb_unregister also calls __lapb_devtostruct to look up the
      right control-block for the given net_device, and __lapb_devtostruct
      also bumps the ref-count, which means that when lapb_unregister returns
      the ref-count is still 1 and the control-block is leaked.
      
      Call lapb_put after __lapb_devtostruct to fix the leak.
      
      Reported-by: syzbot+afb980676c836b4a0afa@syzkaller.appspotmail.com
      Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
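The reference counting walked through in this commit message can be replayed with a small userspace model. It is a sketch that follows the message step by step, not the lapb code itself: `struct cb_model` and every function name below are hypothetical, and the `fixed` flag stands in for the added lapb_put call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a LAPB control-block. */
struct cb_model {
    int refcnt;
    bool in_list;
};

static void cb_hold(struct cb_model *cb) { cb->refcnt++; }

static void cb_put(struct cb_model *cb)
{
    assert(cb->refcnt > 0); /* dropping below zero would be a bug */
    cb->refcnt--;
}

/* Models lapb_register: create (count 1), then insert (count 2). */
static void model_register(struct cb_model *cb)
{
    cb->refcnt = 1;     /* lapb_create_cb */
    cb->in_list = true; /* __lapb_insert_cb takes a list reference */
    cb_hold(cb);
}

/* Models lapb_unregister; returns the count left afterwards. */
static int model_unregister(struct cb_model *cb, bool fixed)
{
    cb_hold(cb);         /* __lapb_devtostruct lookup bumps the count */
    if (fixed)
        cb_put(cb);      /* the fix: drop the lookup reference */
    cb->in_list = false; /* __lapb_remove_cb drops the list reference */
    cb_put(cb);
    cb_put(cb);          /* final lapb_put before returning */
    return cb->refcnt;
}
```

Without the extra put the model ends at a count of 1, the leak the message describes; with it the count reaches 0.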
    • tipc: purge deferredq list for each grp member in tipc_group_delete · 5cf02612
      Xin Long committed
      Syzbot reported a memleak caused by grp members' deferredq list not
      being purged when the grp is deleted.
      
      The issue occurs when more(msg_grp_bc_seqno(hdr), m->bc_rcv_nxt) is true
      in tipc_group_filter_msg(): the skb then stays in deferredq.
      
      So fix it by calling __skb_queue_purge for each member's deferredq
      in tipc_group_delete() when a tipc sk leaves the grp.
      
      Fixes: b87a5ea3 ("tipc: guarantee group unicast doesn't bypass group broadcast")
      Reported-by: syzbot+78fbe679c8ca8d264a8d@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: fix inconsistent lock state in ax25_destroy_timer · d4d5d8e8
      Eric Dumazet committed
      Before a thread in process context uses bh_lock_sock(),
      we must disable bh.
      
      syzbot reported:
      
      WARNING: inconsistent lock state
      5.2.0-rc3+ #32 Not tainted
      
      inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      blkid/26581 [HC0[0]:SC1[1]:HE1:SE0] takes:
      00000000e0da85ee (slock-AF_AX25){+.?.}, at: spin_lock include/linux/spinlock.h:338 [inline]
      00000000e0da85ee (slock-AF_AX25){+.?.}, at: ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
      {SOFTIRQ-ON-W} state was registered at:
        lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
        spin_lock include/linux/spinlock.h:338 [inline]
        ax25_rt_autobind+0x3ca/0x720 net/ax25/ax25_route.c:429
        ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1221
        __sys_connect+0x264/0x330 net/socket.c:1834
        __do_sys_connect net/socket.c:1845 [inline]
        __se_sys_connect net/socket.c:1842 [inline]
        __x64_sys_connect+0x73/0xb0 net/socket.c:1842
        do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      irq event stamp: 2272
      hardirqs last  enabled at (2272): [<ffffffff810065f3>] trace_hardirqs_on_thunk+0x1a/0x1c
      hardirqs last disabled at (2271): [<ffffffff8100660f>] trace_hardirqs_off_thunk+0x1a/0x1c
      softirqs last  enabled at (1522): [<ffffffff87400654>] __do_softirq+0x654/0x94c kernel/softirq.c:320
      softirqs last disabled at (2267): [<ffffffff81449010>] invoke_softirq kernel/softirq.c:374 [inline]
      softirqs last disabled at (2267): [<ffffffff81449010>] irq_exit+0x180/0x1d0 kernel/softirq.c:414
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(slock-AF_AX25);
        <Interrupt>
          lock(slock-AF_AX25);
      
       *** DEADLOCK ***
      
      1 lock held by blkid/26581:
       #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:175 [inline]
       #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: call_timer_fn+0xe0/0x720 kernel/time/timer.c:1312
      
      stack backtrace:
      CPU: 1 PID: 26581 Comm: blkid Not tainted 5.2.0-rc3+ #32
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_usage_bug.cold+0x393/0x4a2 kernel/locking/lockdep.c:2935
       valid_state kernel/locking/lockdep.c:2948 [inline]
       mark_lock_irq kernel/locking/lockdep.c:3138 [inline]
       mark_lock+0xd46/0x1370 kernel/locking/lockdep.c:3513
       mark_irqflags kernel/locking/lockdep.c:3391 [inline]
       __lock_acquire+0x159f/0x5490 kernel/locking/lockdep.c:3745
       lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:338 [inline]
       ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
       call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
       expire_timers kernel/time/timer.c:1366 [inline]
       __run_timers kernel/time/timer.c:1685 [inline]
       __run_timers kernel/time/timer.c:1653 [inline]
       run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0033:0x7f858d5c3232
      Code: 8b 61 08 48 8b 84 24 d8 00 00 00 4c 89 44 24 28 48 8b ac 24 d0 00 00 00 4c 8b b4 24 e8 00 00 00 48 89 7c 24 68 48 89 4c 24 78 <48> 89 44 24 58 8b 84 24 e0 00 00 00 89 84 24 84 00 00 00 8b 84 24
      RSP: 002b:00007ffcaf0cf5c0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
      RAX: 00007f858d7d27a8 RBX: 00007f858d7d8820 RCX: 00007f858d3940d8
      RDX: 00007ffcaf0cf798 RSI: 00000000f5e616f3 RDI: 00007f858d394fee
      RBP: 0000000000000000 R08: 00007ffcaf0cf780 R09: 00007f858d7db480
      R10: 0000000000000000 R11: 0000000009691a75 R12: 0000000000000005
      R13: 00000000f5e616f3 R14: 0000000000000000 R15: 00007ffcaf0cf798

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neigh: fix use-after-free read in pneigh_get_next · f3e92cb8
      Eric Dumazet committed
      Nine years ago, I added RCU handling to neighbours, not pneighbours.
      (pneigh are not commonly used)
      
      Unfortunately I missed that /proc dump operations would use a
      common entry and exit point : neigh_seq_start() and neigh_seq_stop()
      
      We need to read_lock(tbl->lock) or risk use-after-free while
      iterating the pneigh structures.
      
      We might later convert pneigh to RCU and revert this patch.
      
      syzbot reported:
      
      BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
      Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825
      
      CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
       neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240
       seq_read+0x9cf/0x1110 fs/seq_file.c:258
       proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221
       do_loop_readv_writev fs/read_write.c:714 [inline]
       do_loop_readv_writev fs/read_write.c:701 [inline]
       do_iter_read+0x4a4/0x660 fs/read_write.c:935
       vfs_readv+0xf0/0x160 fs/read_write.c:997
       kernel_readv fs/splice.c:359 [inline]
       default_file_splice_read+0x475/0x890 fs/splice.c:414
       do_splice_to+0x127/0x180 fs/splice.c:877
       splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1063
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4592c9
      Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
      RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4
      R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff
      
      Allocated by task 9827:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
       __do_kmalloc mm/slab.c:3660 [inline]
       __kmalloc+0x15c/0x740 mm/slab.c:3669
       kmalloc include/linux/slab.h:552 [inline]
       pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731
       arp_req_set_public net/ipv4/arp.c:1010 [inline]
       arp_req_set+0x613/0x720 net/ipv4/arp.c:1026
       arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226
       inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926
       sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043
       sock_ioctl+0x3ed/0x780 net/socket.c:1194
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 9824:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kfree+0xcf/0x220 mm/slab.c:3755
       pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline]
       __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356
       neigh_ifdown+0x20/0x30 net/core/neighbour.c:372
       arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274
       inetdev_destroy net/ipv4/devinet.c:319 [inline]
       inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544
       notifier_call_chain+0xc2/0x230 kernel/notifier.c:95
       __raw_notifier_call_chain kernel/notifier.c:396 [inline]
       raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403
       call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749
       call_netdevice_notifiers_extack net/core/dev.c:1761 [inline]
       call_netdevice_notifiers net/core/dev.c:1775 [inline]
       rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178
       rollback_registered+0x109/0x1d0 net/core/dev.c:8220
       unregister_netdevice_queue net/core/dev.c:9267 [inline]
       unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260
       unregister_netdevice include/linux/netdevice.h:2631 [inline]
       __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724
       tun_detach drivers/net/tun.c:741 [inline]
       tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451
       __fput+0x2ff/0x890 fs/file_table.c:280
       ____fput+0x16/0x20 fs/file_table.c:313
       task_work_run+0x145/0x1c0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:185 [inline]
       exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168
       prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
       do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888097f2a700
       which belongs to the cache kmalloc-64 of size 64
      The buggy address is located 0 bytes inside of
       64-byte region [ffff888097f2a700, ffff888097f2a740)
      The buggy address belongs to the page:
      page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0
      flags: 0x1fffc0000000200(slab)
      raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340
      raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
       ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      >ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                         ^
       ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      
      Fixes: 767e97e1 ("neigh: RCU conversion of struct neighbour")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix compile error if !CONFIG_SYSCTL · 2e05fcae
      Eric Dumazet committed
      tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available
      even if CONFIG_SYSCTL is not set.
      
      Fixes: 0b7d7f6b ("tcp: add tcp_tx_skb_cache sysctl")
      Fixes: ede61ca4 ("tcp: add tcp_rx_skb_cache sysctl")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_sock: Suppress bogus "may be used uninitialized" warnings · d424a2af
      Dexuan Cui committed
      gcc 8.2.0 may report these bogus warnings under some conditions:
      
      warning: ‘vnew’ may be used uninitialized in this function
      warning: ‘hvs_new’ may be used uninitialized in this function
      
      Actually, the 2 pointers are only initialized and used if the variable
      "conn_from_host" is true. The code is not buggy here.

      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: Fix number of Rx queues used for flow hashing · 718f4a25
      Ivan Vecera committed
      The number of Rx queues used for flow hashing returned by the driver is
      incorrect, and this bug prevents the user from using the last Rx queue in
      the indirection table.
      
      Let's say we have a NIC with 6 combined queues:
      
      [root@sm-03 ~]# ethtool -l enp4s0f0
      Channel parameters for enp4s0f0:
      Pre-set maximums:
      RX:             5
      TX:             5
      Other:          0
      Combined:       6
      Current hardware settings:
      RX:             0
      TX:             0
      Other:          0
      Combined:       6
      
      Default indirection table maps all (6) queues equally but the driver
      reports only 5 rings available.
      
      [root@sm-03 ~]# ethtool -x enp4s0f0
      RX flow hash indirection table for enp4s0f0 with 5 RX ring(s):
          0:      0     1     2     3     4     5     0     1
          8:      2     3     4     5     0     1     2     3
         16:      4     5     0     1     2     3     4     5
         24:      0     1     2     3     4     5     0     1
      ...
      
      Now change indirection table somehow:
      
      [root@sm-03 ~]# ethtool -X enp4s0f0 weight 1 1
      [root@sm-03 ~]# ethtool -x enp4s0f0
      RX flow hash indirection table for enp4s0f0 with 6 RX ring(s):
          0:      0     0     0     0     0     0     0     0
      ...
         64:      1     1     1     1     1     1     1     1
      ...
      
      Now it is not possible to change mapping back to equal (default) state:
      
      [root@sm-03 ~]# ethtool -X enp4s0f0 equal 6
      Cannot set RX flow hash configuration: Invalid argument
      
      Fixes: 594ad54a ("be2net: Add support for setting and getting rx flow hash options")
      Reported-by: Tianhao <tizhao@redhat.com>
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
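The failure in the last ethtool command can be pictured with a small model. This is a sketch under an assumption the commit message does not spell out: that a new indirection table is rejected when any entry names a ring index at or above the ring count the driver reports. Both function names are hypothetical, and the table size used when calling them is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Spread table entries round-robin over n_rings, like "equal N". */
static void indir_fill_equal(uint32_t *table, size_t size, uint32_t n_rings)
{
    assert(n_rings > 0);
    for (size_t i = 0; i < size; i++)
        table[i] = (uint32_t)(i % n_rings);
}

/* Assumed check: every entry must be below the reported ring count. */
static bool indir_valid(const uint32_t *table, size_t size,
                        uint32_t reported_rings)
{
    for (size_t i = 0; i < size; i++)
        if (table[i] >= reported_rings)
            return false;
    return true;
}
```

With 6 combined queues but only 5 rings reported, an "equal 6" table contains ring index 5 and fails this check; once the driver reports 6, the same table passes.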
    • net: handle 802.1P vlan 0 packets properly · 36b2f61a
      Govindarajulu Varadarajan committed
      When the stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4],
      vlan_do_receive() returns false if it does not find vlan_dev. Later
      __netif_receive_skb_core() fails to find a packet type handler for
      skb->protocol 802.1AD and drops the packet.
      
      An 802.1P header with vlan id 0 should be handled as an untagged packet.
      This patch fixes it by checking whether vlan_id is 0 and, if so, processing
      the next vlan header.

      Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 Jun, 2019 11 commits
  9. 15 Jun, 2019 8 commits
    • Merge branch 'tcp-add-three-static-keys' · 35fc07ae
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: add three static keys
      
      Recent addition of per TCP socket rx/tx cache brought
      regressions for some workloads, as reported by Feng Tang.
      
      It seems better to make them opt-in, before we adopt better
      heuristics.
      
      The last patch adds high_order_alloc_disable sysctl
      to ask TCP sendmsg() to exclusively use order-0 allocations,
      as mm layer has specific optimizations.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add high_order_alloc_disable sysctl/static key · ce27ec60
      Eric Dumazet committed
      Since linux-3.7 (commit 5640f768 "net: use a per task frag
      allocator"), TCP sendmsg() has preferred using order-3 allocations.
      
      While it gives good results for most cases, we had reports
      that heavy uses of TCP over loopback were hitting a spinlock
      contention in page allocations/freeing.
      
      This commit adds a sysctl so that admins can opt in
      to order-0 allocations. Hopefully the mm layer might optimize
      order-3 allocations in the future, since that could give us
      a nice boost (see the first 8 lines of the following benchmark).
      
      The following benchmark shows a win when more than 8 TCP_STREAM
      threads are running (56 x86 cores server in my tests)
      
      for thr in {1..30}
      do
       sysctl -wq net.core.high_order_alloc_disable=0
       T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
       sysctl -wq net.core.high_order_alloc_disable=1
       T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
       echo $thr:$T0:$T1
      done
      
      1: 49979: 37267
      2: 98745: 76286
      3: 141088: 110051
      4: 177414: 144772
      5: 197587: 173563
      6: 215377: 208448
      7: 241061: 234087
      8: 267155: 263373
      9: 295069: 297402
      10: 312393: 335213
      11: 340462: 368778
      12: 371366: 403954
      13: 412344: 443713
      14: 426617: 473580
      15: 474418: 507861
      16: 503261: 538539
      17: 522331: 563096
      18: 532409: 567084
      19: 550824: 605240
      20: 525493: 641988
      21: 564574: 665843
      22: 567349: 690868
      23: 583846: 710917
      24: 588715: 736306
      25: 603212: 763494
      26: 604083: 792654
      27: 602241: 796450
      28: 604291: 797993
      29: 611610: 833249
      30: 577356: 841062

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add tcp_tx_skb_cache sysctl · 0b7d7f6b
      Eric Dumazet committed
      Feng Tang reported a performance regression after introduction
      of per TCP socket tx/rx caches, for TCP over loopback (netperf)
      
      There is a high chance the regression is caused by a change in
      how well the 32 KB per-thread page (current->task_frag) can
      be recycled, and by the lack of pcp caches for order-3 pages.
      
      I could not reproduce the regression myself: the cpus were all
      spinning on the mm spinlocks for page allocs/freeing, regardless
      of whether the per tcp socket caches were enabled or disabled.
      
      It seems best to disable the feature by default, and let
      admins enable it.
      
      MM layer either needs to provide scalable order-3 pages
      allocations, or could attempt a trylock on zone->lock if
      the caller only attempts to get a high-order page and is
      able to fallback to order-0 ones in case of pressure.
      
      Tests were run on a 56-core host (112 hyper threads).
      
      -	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
         - 35.49% queued_spin_lock_slowpath
      	  - 18.18% get_page_from_freelist
      		 - __alloc_pages_nodemask
      			- 18.18% alloc_pages_current
      				 skb_page_frag_refill
      				 sk_page_frag_refill
      				 tcp_sendmsg_locked
      				 tcp_sendmsg
      				 inet_sendmsg
      				 sock_sendmsg
      				 __sys_sendto
      				 __x64_sys_sendto
      				 do_syscall_64
      				 entry_SYSCALL_64_after_hwframe
      				 __libc_send
      	  + 17.31% __free_pages_ok
      +	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
      +	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
      +	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
      +	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
      	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave
      
      Fixes: 472c2e07 ("tcp: add one skb cache for tx")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add tcp_rx_skb_cache sysctl · ede61ca4
      Eric Dumazet committed
      Instead of relying on rps_needed, it is safer to use a separate
      static key, since we do not want to enable TCP rx_skb_cache
      by default. This feature can cause a huge increase in memory
      usage on hosts with millions of sockets.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sysctl: define proc_do_static_key() · a8e11e5c
      Eric Dumazet committed
      Convert proc_dointvec_minmax_bpf_stats() into a more generic
      helper, since we are going to use jump labels more often.
      
      Note that sysctl_bpf_stats_enabled is removed, since
      it is no longer needed/used.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_netvsc: Set probe mode to sync · 9a33629b
      Haiyang Zhang committed
      For better consistency of synthetic NIC names, we set the probe mode to
      PROBE_FORCE_SYNCHRONOUS. So the names can be aligned with the vmbus
      channel offer sequence.
      
      Fixes: af0a5646 ("use the new async probing feature for the hyperv drivers")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: flower: don't call synchronize_rcu() on mask creation · 99815f50
      Vlad Buslov committed
      The current flower mask creation code assumes that the temporary mask used
      when inserting a new filter is stack allocated. To prevent a race condition
      with the data path, synchronize_rcu() is called every time
      fl_create_new_mask() replaces the temporary stack-allocated mask. As
      reported by Jiri, this increases the runtime of creating 20000 flower
      classifiers from 4 seconds to 163 seconds. However, this design is no
      longer necessary since the temporary mask was converted to be dynamically
      allocated by commit 2cddd201 ("net/sched: cls_flower: allocate mask
      dynamically in fl_change()").
      
      Remove synchronize_rcu() calls from mask creation code. Instead, refactor
      fl_change() to always deallocate temporary mask with rcu grace period.
      
      Fixes: 195c234d ("net: sched: flower: handle concurrent mask insertion")
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Tested-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: fix warning same module names · f0c03ee0
      Anders Roxell committed
      When building with CONFIG_NET_DSA_REALTEK_SMI and CONFIG_REALTEK_PHY
      enabled as loadable modules, we see the following warning:
      
      warning: same module names found:
        drivers/net/phy/realtek.ko
        drivers/net/dsa/realtek.ko
      
      Rework so the driver name is realtek-smi instead of realtek.

      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>