1. 26 5月, 2021 3 次提交
    • V
      net: zero-initialize tc skb extension on allocation · 9453d45e
      Vlad Buslov 提交于
      Function skb_ext_add() doesn't initialize created skb extension with any
      value and leaves it up to the user. However, since extension of type
      TC_SKB_EXT originally contained only single value tc_skb_ext->chain its
      users used to just assign the chain value without setting whole extension
      memory to zero first. This assumption changed when TC_SKB_EXT extension was
      extended with additional fields but not all users were updated to
      initialize the new fields which leads to use of uninitialized memory
      afterwards. UBSAN log:
      
      [  778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28
      [  778.301495] load of value 107 is not a valid value for type '_Bool'
      [  778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2
      [  778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  778.307901] Call Trace:
      [  778.308680]  <IRQ>
      [  778.309358]  dump_stack+0xbb/0x107
      [  778.310307]  ubsan_epilogue+0x5/0x40
      [  778.311167]  __ubsan_handle_load_invalid_value.cold+0x43/0x48
      [  778.312454]  ? memset+0x20/0x40
      [  778.313230]  ovs_flow_key_extract.cold+0xf/0x14 [openvswitch]
      [  778.314532]  ovs_vport_receive+0x19e/0x2e0 [openvswitch]
      [  778.315749]  ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch]
      [  778.317188]  ? create_prof_cpu_mask+0x20/0x20
      [  778.318220]  ? arch_stack_walk+0x82/0xf0
      [  778.319153]  ? secondary_startup_64_no_verify+0xb0/0xbb
      [  778.320399]  ? stack_trace_save+0x91/0xc0
      [  778.321362]  ? stack_trace_consume_entry+0x160/0x160
      [  778.322517]  ? lock_release+0x52e/0x760
      [  778.323444]  netdev_frame_hook+0x323/0x610 [openvswitch]
      [  778.324668]  ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch]
      [  778.325950]  __netif_receive_skb_core+0x771/0x2db0
      [  778.327067]  ? lock_downgrade+0x6e0/0x6f0
      [  778.328021]  ? lock_acquire+0x565/0x720
      [  778.328940]  ? generic_xdp_tx+0x4f0/0x4f0
      [  778.329902]  ? inet_gro_receive+0x2a7/0x10a0
      [  778.330914]  ? lock_downgrade+0x6f0/0x6f0
      [  778.331867]  ? udp4_gro_receive+0x4c4/0x13e0
      [  778.332876]  ? lock_release+0x52e/0x760
      [  778.333808]  ? dev_gro_receive+0xcc8/0x2380
      [  778.334810]  ? lock_downgrade+0x6f0/0x6f0
      [  778.335769]  __netif_receive_skb_list_core+0x295/0x820
      [  778.336955]  ? process_backlog+0x780/0x780
      [  778.337941]  ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core]
      [  778.339613]  ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0
      [  778.341033]  ? kvm_clock_get_cycles+0x14/0x20
      [  778.342072]  netif_receive_skb_list_internal+0x5f5/0xcb0
      [  778.343288]  ? __kasan_kmalloc+0x7a/0x90
      [  778.344234]  ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core]
      [  778.345676]  ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core]
      [  778.347140]  ? __netif_receive_skb_list_core+0x820/0x820
      [  778.348351]  ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core]
      [  778.349688]  ? napi_gro_flush+0x26c/0x3c0
      [  778.350641]  napi_complete_done+0x188/0x6b0
      [  778.351627]  mlx5e_napi_poll+0x373/0x1b80 [mlx5_core]
      [  778.352853]  __napi_poll+0x9f/0x510
      [  778.353704]  ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core]
      [  778.355158]  net_rx_action+0x34c/0xa40
      [  778.356060]  ? napi_threaded_poll+0x3d0/0x3d0
      [  778.357083]  ? sched_clock_cpu+0x18/0x190
      [  778.358041]  ? __common_interrupt+0x8e/0x1a0
      [  778.359045]  __do_softirq+0x1ce/0x984
      [  778.359938]  __irq_exit_rcu+0x137/0x1d0
      [  778.360865]  irq_exit_rcu+0xa/0x20
      [  778.361708]  common_interrupt+0x80/0xa0
      [  778.362640]  </IRQ>
      [  778.363212]  asm_common_interrupt+0x1e/0x40
      [  778.364204] RIP: 0010:native_safe_halt+0xe/0x10
      [  778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00
      [  778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246
      [  778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468
      [  778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e
      [  778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb
      [  778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000
      [  778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000
      [  778.378491]  ? rcu_eqs_enter.constprop.0+0xb8/0xe0
      [  778.379606]  ? default_idle_call+0x5e/0xe0
      [  778.380578]  default_idle+0xa/0x10
      [  778.381406]  default_idle_call+0x96/0xe0
      [  778.382350]  do_idle+0x3d4/0x550
      [  778.383153]  ? arch_cpu_idle_exit+0x40/0x40
      [  778.384143]  cpu_startup_entry+0x19/0x20
      [  778.385078]  start_kernel+0x3c7/0x3e5
      [  778.385978]  secondary_startup_64_no_verify+0xb0/0xbb
      
      Fix the issue by providing new function tc_skb_ext_alloc() that allocates
      tc skb extension and initializes its memory to 0 before returning it to the
      caller. Change all existing users to use new API instead of calling
      skb_ext_add() directly.
      
      Fixes: 038ebb1a ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
      Fixes: d29334c1 ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
      Signed-off-by: NVlad Buslov <vladbu@nvidia.com>
      Acked-by: NCong Wang <cong.wang@bytedance.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9453d45e
    • X
      sctp: fix the proc_handler for sysctl encap_port · b2540cdc
      Xin Long 提交于
      proc_dointvec() cannot do min and max check for setting a value
      when extra1/extra2 is set, so change it to proc_dointvec_minmax()
      for sysctl encap_port.
      
      Fixes: e8a3001c ("sctp: add encap_port for netns sock asoc and transport")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2540cdc
    • X
      sctp: add the missing setting for asoc encap_port · 297739bd
      Xin Long 提交于
      This patch is to add the missing setting back for asoc encap_port.
      
      Fixes: 8dba2960 ("sctp: add SCTP_REMOTE_UDP_ENCAPS_PORT sockopt")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      297739bd
  2. 25 5月, 2021 2 次提交
    • G
      net: hsr: fix mac_len checks · 48b491a5
      George McCollister 提交于
      Commit 2e9f6093 ("net: hsr: check skb can contain struct hsr_ethhdr
      in fill_frame_info") added the following which resulted in -EINVAL
      always being returned:
      	if (skb->mac_len < sizeof(struct hsr_ethhdr))
      		return -EINVAL;
      
      mac_len was not being set correctly so this check completely broke
      HSR/PRP since it was always 14, not 20.
      
      Set mac_len correctly and modify the mac_len checks to test in the
      correct places since sometimes it is legitimately 14.
      
      Fixes: 2e9f6093 ("net: hsr: check skb can contain struct hsr_ethhdr in fill_frame_info")
      Signed-off-by: NGeorge McCollister <george.mccollister@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48b491a5
    • T
      sch_dsmark: fix a NULL deref in qdisc_reset() · 9b76eade
      Taehee Yoo 提交于
      If Qdisc_ops->init() is failed, Qdisc_ops->reset() would be called.
      When dsmark_init(Qdisc_ops->init()) is failed, it possibly doesn't
      initialize dsmark_qdisc_data->q. But dsmark_reset(Qdisc_ops->reset())
      uses dsmark_qdisc_data->q pointer wihtout any null checking.
      So, panic would occur.
      
      Test commands:
          sysctl net.core.default_qdisc=dsmark -w
          ip link add dummy0 type dummy
          ip link add vw0 link dummy0 type virt_wifi
          ip link set vw0 up
      
      Splat looks like:
      KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
      CPU: 3 PID: 684 Comm: ip Not tainted 5.12.0+ #910
      RIP: 0010:qdisc_reset+0x2b/0x680
      Code: 1f 44 00 00 48 b8 00 00 00 00 00 fc ff df 41 57 41 56 41 55 41 54
      55 48 89 fd 48 83 c7 18 53 48 89 fa 48 c1 ea 03 48 83 ec 20 <80> 3c 02
      00 0f 85 09 06 00 00 4c 8b 65 18 0f 1f 44 00 00 65 8b 1d
      RSP: 0018:ffff88800fda6bf8 EFLAGS: 00010282
      RAX: dffffc0000000000 RBX: ffff8880050ed800 RCX: 0000000000000000
      RDX: 0000000000000003 RSI: ffffffff99e34100 RDI: 0000000000000018
      RBP: 0000000000000000 R08: fffffbfff346b553 R09: fffffbfff346b553
      R10: 0000000000000001 R11: fffffbfff346b552 R12: ffffffffc0824940
      R13: ffff888109e83800 R14: 00000000ffffffff R15: ffffffffc08249e0
      FS:  00007f5042287680(0000) GS:ffff888119800000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055ae1f4dbd90 CR3: 0000000006760002 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? rcu_read_lock_bh_held+0xa0/0xa0
       dsmark_reset+0x3d/0xf0 [sch_dsmark]
       qdisc_reset+0xa9/0x680
       qdisc_destroy+0x84/0x370
       qdisc_create_dflt+0x1fe/0x380
       attach_one_default_qdisc.constprop.41+0xa4/0x180
       dev_activate+0x4d5/0x8c0
       ? __dev_open+0x268/0x390
       __dev_open+0x270/0x390
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b76eade
  3. 24 5月, 2021 2 次提交
    • D
      net/sched: fq_pie: fix OOB access in the traffic path · e70f7a11
      Davide Caratti 提交于
      the following script:
      
        # tc qdisc add dev eth0 handle 0x1 root fq_pie flows 2
        # tc qdisc add dev eth0 clsact
        # tc filter add dev eth0 egress matchall action skbedit priority 0x10002
        # ping 192.0.2.2 -I eth0 -c2 -w1 -q
      
      produces the following splat:
      
       BUG: KASAN: slab-out-of-bounds in fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
       Read of size 4 at addr ffff888171306924 by task ping/942
      
       CPU: 3 PID: 942 Comm: ping Not tainted 5.12.0+ #441
       Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
       Call Trace:
        dump_stack+0x92/0xc1
        print_address_description.constprop.7+0x1a/0x150
        kasan_report.cold.13+0x7f/0x111
        fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
        __dev_queue_xmit+0x1034/0x2b10
        ip_finish_output2+0xc62/0x2120
        __ip_finish_output+0x553/0xea0
        ip_output+0x1ca/0x4d0
        ip_send_skb+0x37/0xa0
        raw_sendmsg+0x1c4b/0x2d00
        sock_sendmsg+0xdb/0x110
        __sys_sendto+0x1d7/0x2b0
        __x64_sys_sendto+0xdd/0x1b0
        do_syscall_64+0x3c/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7fe69735c3eb
       Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
       RSP: 002b:00007fff06d7fb38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 000055e961413700 RCX: 00007fe69735c3eb
       RDX: 0000000000000040 RSI: 000055e961413700 RDI: 0000000000000003
       RBP: 0000000000000040 R08: 000055e961410500 R09: 0000000000000010
       R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff06d81260
       R13: 00007fff06d7fb40 R14: 00007fff06d7fc30 R15: 000055e96140f0a0
      
       Allocated by task 917:
        kasan_save_stack+0x19/0x40
        __kasan_kmalloc+0x7f/0xa0
        __kmalloc_node+0x139/0x280
        fq_pie_init+0x555/0x8e8 [sch_fq_pie]
        qdisc_create+0x407/0x11b0
        tc_modify_qdisc+0x3c2/0x17e0
        rtnetlink_rcv_msg+0x346/0x8e0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x439/0x630
        netlink_sendmsg+0x719/0xbf0
        sock_sendmsg+0xe2/0x110
        ____sys_sendmsg+0x5ba/0x890
        ___sys_sendmsg+0xe9/0x160
        __sys_sendmsg+0xd3/0x170
        do_syscall_64+0x3c/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       The buggy address belongs to the object at ffff888171306800
        which belongs to the cache kmalloc-256 of size 256
       The buggy address is located 36 bytes to the right of
        256-byte region [ffff888171306800, ffff888171306900)
       The buggy address belongs to the page:
       page:00000000bcfb624e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x171306
       head:00000000bcfb624e order:1 compound_mapcount:0
       flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
       raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100042b40
       raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff888171306800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff888171306880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
       >ffff888171306900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                      ^
        ffff888171306980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff888171306a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      fix fq_pie traffic path to avoid selecting 'q->flows + q->flows_cnt' as a
      valid flow: it's an address beyond the allocated memory.
      
      Fixes: ec97ecf1 ("net: sched: add Flow Queue PIE packet scheduler")
      CC: stable@vger.kernel.org
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e70f7a11
    • D
      net/sched: fq_pie: re-factor fix for fq_pie endless loop · 3a62fed2
      Davide Caratti 提交于
      the patch that fixed an endless loop in_fq_pie_init() was not considering
      that 65535 is a valid class id. The correct bugfix for this infinite loop
      is to change 'idx' to become an u32, like Colin proposed in the past [1].
      
      Fix this as follows:
       - restore 65536 as maximum possible values of 'flows_cnt'
       - use u32 'idx' when iterating on 'q->flows'
       - fix the TDC selftest
      
      This reverts commit bb2f930d.
      
      [1] https://lore.kernel.org/netdev/20210407163808.499027-1-colin.king@canonical.com/
      
      CC: Colin Ian King <colin.king@canonical.com>
      CC: stable@vger.kernel.org
      Fixes: bb2f930d ("net/sched: fix infinite loop in sch_fq_pie")
      Fixes: ec97ecf1 ("net: sched: add Flow Queue PIE packet scheduler")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a62fed2
  4. 22 5月, 2021 2 次提交
    • F
      ipv6: record frag_max_size in atomic fragments in input path · e29f011e
      Francesco Ruggeri 提交于
      Commit dbd1759e ("ipv6: on reassembly, record frag_max_size")
      filled the frag_max_size field in IP6CB in the input path.
      The field should also be filled in case of atomic fragments.
      
      Fixes: dbd1759e ('ipv6: on reassembly, record frag_max_size')
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e29f011e
    • R
      RDS tcp loopback connection can hang · aced3ce5
      Rao Shoaib 提交于
      When TCP is used as transport and a program on the
      system connects to RDS port 16385, connection is
      accepted but denied per the rules of RDS. However,
      RDS connections object is left in the list. Next
      loopback connection will select that connection
      object as it is at the head of list. The connection
      attempt will hang as the connection object is set
      to connect over TCP which is not allowed
      
      The issue can be reproduced easily, use rds-ping
      to ping a local IP address. After that use any
      program like ncat to connect to the same IP
      address and port 16385. This will hang so ctrl-c out.
      Now try rds-ping, it will hang.
      
      To fix the issue this patch adds checks to disallow
      the connection object creation and destroys the
      connection object.
      Signed-off-by: NRao Shoaib <rao.shoaib@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aced3ce5
  5. 20 5月, 2021 1 次提交
  6. 19 5月, 2021 1 次提交
  7. 18 5月, 2021 6 次提交
    • J
      netlink: disable IRQs for netlink_lock_table() · 1d482e66
      Johannes Berg 提交于
      Syzbot reports that in mac80211 we have a potential deadlock
      between our "local->stop_queue_reasons_lock" (spinlock) and
      netlink's nl_table_lock (rwlock). This is because there's at
      least one situation in which we might try to send a netlink
      message with this spinlock held while it is also possible to
      take the spinlock from a hardirq context, resulting in the
      following deadlock scenario reported by lockdep:
      
             CPU0                    CPU1
             ----                    ----
        lock(nl_table_lock);
                                     local_irq_disable();
                                     lock(&local->queue_stop_reason_lock);
                                     lock(nl_table_lock);
        <Interrupt>
          lock(&local->queue_stop_reason_lock);
      
      This seems valid, we can take the queue_stop_reason_lock in
      any kind of context ("CPU0"), and call ieee80211_report_ack_skb()
      with the spinlock held and IRQs disabled ("CPU1") in some
      code path (ieee80211_do_stop() via ieee80211_free_txskb()).
      
      Short of disallowing netlink use in scenarios like these
      (which would be rather complex in mac80211's case due to
      the deep callchain), it seems the only fix for this is to
      disable IRQs while nl_table_lock is held to avoid hitting
      this scenario, this disallows the "CPU0" portion of the
      reported deadlock.
      
      Note that the writer side (netlink_table_grab()) already
      disables IRQs for this lock.
      
      Unfortunately though, this seems like a huge hammer, and
      maybe the whole netlink table locking should be reworked.
      
      Reported-by: syzbot+69ff9dff50dcfe14ddd4@syzkaller.appspotmail.com
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d482e66
    • J
      net/smc: remove device from smcd_dev_list after failed device_add() · 444d7be9
      Julian Wiedmann 提交于
      If the device_add() for a smcd_dev fails, there's no cleanup step that
      rolls back the earlier list_add(). The device subsequently gets freed,
      and we end up with a corrupted list.
      
      Add some error handling that removes the device from the list.
      
      Fixes: c6ba7c9b ("net/smc: add base infrastructure for SMC-D and ISM")
      Signed-off-by: NJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      444d7be9
    • X
      tipc: wait and exit until all work queues are done · 04c26faa
      Xin Long 提交于
      On some host, a crash could be triggered simply by repeating these
      commands several times:
      
        # modprobe tipc
        # tipc bearer enable media udp name UDP1 localip 127.0.0.1
        # rmmod tipc
      
        [] BUG: unable to handle kernel paging request at ffffffffc096bb00
        [] Workqueue: events 0xffffffffc096bb00
        [] Call Trace:
        []  ? process_one_work+0x1a7/0x360
        []  ? worker_thread+0x30/0x390
        []  ? create_worker+0x1a0/0x1a0
        []  ? kthread+0x116/0x130
        []  ? kthread_flush_work_fn+0x10/0x10
        []  ? ret_from_fork+0x35/0x40
      
      When removing the TIPC module, the UDP tunnel sock will be delayed to
      release in a work queue as sock_release() can't be done in rtnl_lock().
      If the work queue is schedule to run after the TIPC module is removed,
      kernel will crash as the work queue function cleanup_beareri() code no
      longer exists when trying to invoke it.
      
      To fix it, this patch introduce a member wq_count in tipc_net to track
      the numbers of work queues in schedule, and  wait and exit until all
      work queues are done in tipc_exit_net().
      
      Fixes: d0f91938 ("tipc: add ip/udp media type")
      Reported-by: NShuang Li <shuali@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04c26faa
    • T
      mld: fix panic in mld_newpack() · 020ef930
      Taehee Yoo 提交于
      mld_newpack() doesn't allow to allocate high order page,
      only order-0 allocation is allowed.
      If headroom size is too large, a kernel panic could occur in skb_put().
      
      Test commands:
          ip netns del A
          ip netns del B
          ip netns add A
          ip netns add B
          ip link add veth0 type veth peer name veth1
          ip link set veth0 netns A
          ip link set veth1 netns B
      
          ip netns exec A ip link set lo up
          ip netns exec A ip link set veth0 up
          ip netns exec A ip -6 a a 2001:db8:0::1/64 dev veth0
          ip netns exec B ip link set lo up
          ip netns exec B ip link set veth1 up
          ip netns exec B ip -6 a a 2001:db8:0::2/64 dev veth1
          for i in {1..99}
          do
              let A=$i-1
              ip netns exec A ip link add ip6gre$i type ip6gre \
      	local 2001:db8:$A::1 remote 2001:db8:$A::2 encaplimit 100
              ip netns exec A ip -6 a a 2001:db8:$i::1/64 dev ip6gre$i
              ip netns exec A ip link set ip6gre$i up
      
              ip netns exec B ip link add ip6gre$i type ip6gre \
      	local 2001:db8:$A::2 remote 2001:db8:$A::1 encaplimit 100
              ip netns exec B ip -6 a a 2001:db8:$i::2/64 dev ip6gre$i
              ip netns exec B ip link set ip6gre$i up
          done
      
      Splat looks like:
      kernel BUG at net/core/skbuff.c:110!
      invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.12.0+ #891
      Workqueue: ipv6_addrconf addrconf_dad_work
      RIP: 0010:skb_panic+0x15d/0x15f
      Code: 92 fe 4c 8b 4c 24 10 53 8b 4d 70 45 89 e0 48 c7 c7 00 ae 79 83
      41 57 41 56 41 55 48 8b 54 24 a6 26 f9 ff <0f> 0b 48 8b 6c 24 20 89
      34 24 e8 4a 4e 92 fe 8b 34 24 48 c7 c1 20
      RSP: 0018:ffff88810091f820 EFLAGS: 00010282
      RAX: 0000000000000089 RBX: ffff8881086e9000 RCX: 0000000000000000
      RDX: 0000000000000089 RSI: 0000000000000008 RDI: ffffed1020123efb
      RBP: ffff888005f6eac0 R08: ffffed1022fc0031 R09: ffffed1022fc0031
      R10: ffff888117e00187 R11: ffffed1022fc0030 R12: 0000000000000028
      R13: ffff888008284eb0 R14: 0000000000000ed8 R15: 0000000000000ec0
      FS:  0000000000000000(0000) GS:ffff888117c00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8b801c5640 CR3: 0000000033c2c006 CR4: 00000000003706f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
       ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
       skb_put.cold.104+0x22/0x22
       ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
       ? rcu_read_lock_sched_held+0x91/0xc0
       mld_newpack+0x398/0x8f0
       ? ip6_mc_hdr.isra.26.constprop.46+0x600/0x600
       ? lock_contended+0xc40/0xc40
       add_grhead.isra.33+0x280/0x380
       add_grec+0x5ca/0xff0
       ? mld_sendpack+0xf40/0xf40
       ? lock_downgrade+0x690/0x690
       mld_send_initial_cr.part.34+0xb9/0x180
       ipv6_mc_dad_complete+0x15d/0x1b0
       addrconf_dad_completed+0x8d2/0xbb0
       ? lock_downgrade+0x690/0x690
       ? addrconf_rs_timer+0x660/0x660
       ? addrconf_dad_work+0x73c/0x10e0
       addrconf_dad_work+0x73c/0x10e0
      
      Allowing high order page allocation could fix this problem.
      
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      020ef930
    • D
      NFC: nci: fix memory leak in nci_allocate_device · e0652f8b
      Dongliang Mu 提交于
      nfcmrvl_disconnect fails to free the hci_dev field in struct nci_dev.
      Fix this by freeing hci_dev in nci_free_device.
      
      BUG: memory leak
      unreferenced object 0xffff888111ea6800 (size 1024):
        comm "kworker/1:0", pid 19, jiffies 4294942308 (age 13.580s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 60 fd 0c 81 88 ff ff  .........`......
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000004bc25d43>] kmalloc include/linux/slab.h:552 [inline]
          [<000000004bc25d43>] kzalloc include/linux/slab.h:682 [inline]
          [<000000004bc25d43>] nci_hci_allocate+0x21/0xd0 net/nfc/nci/hci.c:784
          [<00000000c59cff92>] nci_allocate_device net/nfc/nci/core.c:1170 [inline]
          [<00000000c59cff92>] nci_allocate_device+0x10b/0x160 net/nfc/nci/core.c:1132
          [<00000000006e0a8e>] nfcmrvl_nci_register_dev+0x10a/0x1c0 drivers/nfc/nfcmrvl/main.c:153
          [<000000004da1b57e>] nfcmrvl_probe+0x223/0x290 drivers/nfc/nfcmrvl/usb.c:345
          [<00000000d506aed9>] usb_probe_interface+0x177/0x370 drivers/usb/core/driver.c:396
          [<00000000bc632c92>] really_probe+0x159/0x4a0 drivers/base/dd.c:554
          [<00000000f5009125>] driver_probe_device+0x84/0x100 drivers/base/dd.c:740
          [<000000000ce658ca>] __device_attach_driver+0xee/0x110 drivers/base/dd.c:846
          [<000000007067d05f>] bus_for_each_drv+0xb7/0x100 drivers/base/bus.c:431
          [<00000000f8e13372>] __device_attach+0x122/0x250 drivers/base/dd.c:914
          [<000000009cf68860>] bus_probe_device+0xc6/0xe0 drivers/base/bus.c:491
          [<00000000359c965a>] device_add+0x5be/0xc30 drivers/base/core.c:3109
          [<00000000086e4bd3>] usb_set_configuration+0x9d9/0xb90 drivers/usb/core/message.c:2164
          [<00000000ca036872>] usb_generic_driver_probe+0x8c/0xc0 drivers/usb/core/generic.c:238
          [<00000000d40d36f6>] usb_probe_device+0x5c/0x140 drivers/usb/core/driver.c:293
          [<00000000bc632c92>] really_probe+0x159/0x4a0 drivers/base/dd.c:554
      
      Reported-by: syzbot+19bcfc64a8df1318d1c3@syzkaller.appspotmail.com
      Fixes: 11f54f22 ("NFC: nci: Add HCI over NCI protocol support")
      Signed-off-by: NDongliang Mu <mudongliangabcd@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0652f8b
    • X
      tipc: skb_linearize the head skb when reassembling msgs · b7df21cf
      Xin Long 提交于
      It's not a good idea to append the frag skb to a skb's frag_list if
      the frag_list already has skbs from elsewhere, such as this skb was
      created by pskb_copy() where the frag_list was cloned (all the skbs
      in it were skb_get'ed) and shared by multiple skbs.
      
      However, the new appended frag skb should have been only seen by the
      current skb. Otherwise, it will cause use after free crashes as this
      appended frag skb are seen by multiple skbs but it only got skb_get
      called once.
      
      The same thing happens with a skb updated by pskb_may_pull() with a
      skb_cloned skb. Li Shuang has reported quite a few crashes caused
      by this when doing testing over macvlan devices:
      
        [] kernel BUG at net/core/skbuff.c:1970!
        [] Call Trace:
        []  skb_clone+0x4d/0xb0
        []  macvlan_broadcast+0xd8/0x160 [macvlan]
        []  macvlan_process_broadcast+0x148/0x150 [macvlan]
        []  process_one_work+0x1a7/0x360
        []  worker_thread+0x30/0x390
      
        [] kernel BUG at mm/usercopy.c:102!
        [] Call Trace:
        []  __check_heap_object+0xd3/0x100
        []  __check_object_size+0xff/0x16b
        []  simple_copy_to_iter+0x1c/0x30
        []  __skb_datagram_iter+0x7d/0x310
        []  __skb_datagram_iter+0x2a5/0x310
        []  skb_copy_datagram_iter+0x3b/0x90
        []  tipc_recvmsg+0x14a/0x3a0 [tipc]
        []  ____sys_recvmsg+0x91/0x150
        []  ___sys_recvmsg+0x7b/0xc0
      
        [] kernel BUG at mm/slub.c:305!
        [] Call Trace:
        []  <IRQ>
        []  kmem_cache_free+0x3ff/0x400
        []  __netif_receive_skb_core+0x12c/0xc40
        []  ? kmem_cache_alloc+0x12e/0x270
        []  netif_receive_skb_internal+0x3d/0xb0
        []  ? get_rx_page_info+0x8e/0xa0 [be2net]
        []  be_poll+0x6ef/0xd00 [be2net]
        []  ? irq_exit+0x4f/0x100
        []  net_rx_action+0x149/0x3b0
      
        ...
      
      This patch is to fix it by linearizing the head skb if it has frag_list
      set in tipc_buf_append(). Note that we choose to do this before calling
      skb_unshare(), as __skb_linearize() will avoid skb_copy(). Also, we can
      not just drop the frag_list either as the early time.
      
      Fixes: 45c8b7b1 ("tipc: allow non-linear first fragment buffer")
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7df21cf
  8. 15 5月, 2021 5 次提交
    • Y
      net: sched: fix tx action reschedule issue with stopped queue · dcad9ee9
      Yunsheng Lin 提交于
      The netdev qeueue might be stopped when byte queue limit has
      reached or tx hw ring is full, net_tx_action() may still be
      rescheduled if STATE_MISSED is set, which consumes unnecessary
      cpu without dequeuing and transmiting any skb because the
      netdev queue is stopped, see qdisc_run_end().
      
      This patch fixes it by checking the netdev queue state before
      calling qdisc_run() and clearing STATE_MISSED if netdev queue is
      stopped during qdisc_run(), the net_tx_action() is rescheduled
      again when netdev qeueue is restarted, see netif_tx_wake_queue().
      
      As there is time window between netif_xmit_frozen_or_stopped()
      checking and STATE_MISSED clearing, between which STATE_MISSED
      may set by net_tx_action() scheduled by netif_tx_wake_queue(),
      so set the STATE_MISSED again if netdev queue is restarted.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcad9ee9
    • Y
      net: sched: fix tx action rescheduling issue during deactivation · 102b55ee
      Yunsheng Lin 提交于
      Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
      qdisc before calling __qdisc_run(), which ultimately clear the
      STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED
      is set before clearing STATE_MISSED, there may be rescheduling
      of net_tx_action() at the end of qdisc_run_end(), see below:
      
      CPU0(net_tx_atcion)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
                .                   .                     .
                .            set STATE_MISSED             .
                .           __netif_schedule()            .
                .                   .           set STATE_DEACTIVATED
                .                   .                qdisc_reset()
                .                   .                     .
                .<---------------   .              synchronize_net()
      clear __QDISC_STATE_SCHED  |  .                     .
                .                |  .                     .
                .                |  .            some_qdisc_is_busy()
                .                |  .               return *false*
                .                |  .                     .
        test STATE_DEACTIVATED   |  .                     .
      __qdisc_run() *not* called |  .                     .
                .                |  .                     .
         test STATE_MISS         |  .                     .
       __netif_schedule()--------|  .                     .
                .                   .                     .
                .                   .                     .
      
      __qdisc_run() is not called by net_tx_atcion() in CPU0 because
      CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and
      STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule
      is called at the end of qdisc_run_end(), causing tx action
      rescheduling problem.
      
      qdisc_run() called by net_tx_action() runs in the softirq context,
      which should has the same semantic as the qdisc_run() called by
      __dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a
      synchronize_net() between STATE_DEACTIVATED flag being set and
      qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely
      bail out for the deactived lockless qdisc in net_tx_action(), and
      qdisc_reset() will reset all skb not dequeued yet.
      
      So add the rcu_read_lock() explicitly to protect the qdisc_run()
      and do the STATE_DEACTIVATED checking in net_tx_action() before
      calling qdisc_run_begin(). Another option is to do the checking in
      the qdisc_run_end(), but it will add unnecessary overhead for
      non-tx_action case, because __dev_queue_xmit() will not see qdisc
      with STATE_DEACTIVATED after synchronize_net(), the qdisc with
      STATE_DEACTIVATED can only be seen by net_tx_action() because of
      __netif_schedule().
      
      The STATE_DEACTIVATED checking in qdisc_run() is to avoid race
      between net_tx_action() and qdisc_reset(), see:
      commit d518d2ed ("net/sched: fix race between deactivation
      and dequeue for NOLOCK qdisc"). As the bailout added above for
      deactived lockless qdisc in net_tx_action() provides better
      protection for the race without calling qdisc_run() at all, so
      remove the STATE_DEACTIVATED checking in qdisc_run().
      
      After qdisc_reset(), there is no skb in qdisc to be dequeued, so
      clear the STATE_MISSED in dev_reset_queue() too.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      V8: Clearing STATE_MISSED before calling __netif_schedule() has
          avoid the endless rescheduling problem, but there may still
          be a unnecessary rescheduling, so adjust the commit log.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      102b55ee
    • Y
      net: sched: fix packet stuck problem for lockless qdisc · a90c57f2
      Yunsheng Lin 提交于
      Lockless qdisc has below concurrent problem:
          cpu0                 cpu1
           .                     .
      q->enqueue                 .
           .                     .
      qdisc_run_begin()          .
           .                     .
      dequeue_skb()              .
           .                     .
      sch_direct_xmit()          .
           .                     .
           .                q->enqueue
           .             qdisc_run_begin()
           .            return and do nothing
           .                     .
      qdisc_run_end()            .
      
      cpu1 enqueue a skb without calling __qdisc_run() because cpu0
      has not released the lock yet and spin_trylock() return false
      for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
      enqueued by cpu1 when calling dequeue_skb() because cpu1 may
      enqueue the skb after cpu0 calling dequeue_skb() and before
      cpu0 calling qdisc_run_end().
      
      Lockless qdisc has below another concurrent problem when
      tx_action is involved:
      
      cpu0(serving tx_action)     cpu1             cpu2
                .                   .                .
                .              q->enqueue            .
                .            qdisc_run_begin()       .
                .              dequeue_skb()         .
                .                   .            q->enqueue
                .                   .                .
                .             sch_direct_xmit()      .
                .                   .         qdisc_run_begin()
                .                   .       return and do nothing
                .                   .                .
       clear __QDISC_STATE_SCHED    .                .
       qdisc_run_begin()            .                .
       return and do nothing        .                .
                .                   .                .
                .            qdisc_run_end()         .
      
      This patch fixes the above data race by:
      1. If the first spin_trylock() return false and STATE_MISSED is
         not set, set STATE_MISSED and retry another spin_trylock() in
         case other CPU may not see STATE_MISSED after it releases the
         lock.
      2. reschedule if STATE_MISSED is set after the lock is released
         at the end of qdisc_run_end().
      
      For tx_action case, STATE_MISSED is also set when cpu1 is at the
      end if qdisc_run_end(), so tx_action will be rescheduled again
      to dequeue the skb enqueued by cpu2.
      
      Clear STATE_MISSED before retrying a dequeuing when dequeuing
      returns NULL in order to reduce the overhead of the second
      spin_trylock() and __netif_schedule() calling.
      
      Also clear the STATE_MISSED before calling __netif_schedule()
      at the end of qdisc_run_end() to avoid doing another round of
      dequeuing in the pfifo_fast_dequeue().
      
      The performance impact of this patch, tested using pktgen and
      dummy netdev with pfifo_fast qdisc attached:
      
       threads  without+this_patch   with+this_patch      delta
          1        2.61Mpps            2.60Mpps           -0.3%
          2        3.97Mpps            3.82Mpps           -3.7%
          4        5.62Mpps            5.59Mpps           -0.5%
          8        2.78Mpps            2.77Mpps           -0.3%
         16        2.22Mpps            2.22Mpps           -0.0%
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Tested-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a90c57f2
    • J
      tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT · 974271e5
      Jim Ma 提交于
      In tls_sw_splice_read, checkout MSG_* is inappropriate, should use
      SPLICE_*, update tls_wait_data to accept nonblock arguments instead
      of flags for recvmsg and splice.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Signed-off-by: NJim Ma <majinjing3@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      974271e5
    • H
      Revert "net:tipc: Fix a double free in tipc_sk_mcast_rcv" · 75016891
      Hoang Le 提交于
      This reverts commit 6bf24dc0.
      Above fix is not correct and caused memory leak issue.
      
      Fixes: 6bf24dc0 ("net:tipc: Fix a double free in tipc_sk_mcast_rcv")
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Acked-by: NTung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75016891
  9. 14 5月, 2021 3 次提交
    • S
      netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version · f0b3d338
      Stefano Brivio 提交于
      Arturo reported this backtrace:
      
      [709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0
      [709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod
      [709732.358941]  pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common
      [709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
      [709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020
      [709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0
      [709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb
      [709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202
      [709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001
      [709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003
      [709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462
      [709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960
      [709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660
      [709732.358990] FS:  0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000
      [709732.358993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0
      [709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [709732.359003] PKRU: 55555554
      [709732.359005] Call Trace:
      [709732.359009]  <IRQ>
      [709732.359035]  nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables]
      [709732.359046]  ? sched_clock+0x5/0x10
      [709732.359054]  ? sched_clock_cpu+0xc/0xb0
      [709732.359061]  ? record_times+0x16/0x80
      [709732.359068]  ? plist_add+0xc1/0x100
      [709732.359073]  ? psi_group_change+0x47/0x230
      [709732.359079]  ? skb_clone+0x4d/0xb0
      [709732.359085]  ? enqueue_task_rt+0x22b/0x310
      [709732.359098]  ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en]
      [709732.359102]  ? packet_rcv+0x40/0x4a0
      [709732.359121]  nft_lookup_eval+0x59/0x160 [nf_tables]
      [709732.359133]  nft_do_chain+0x350/0x500 [nf_tables]
      [709732.359152]  ? nft_lookup_eval+0x59/0x160 [nf_tables]
      [709732.359163]  ? nft_do_chain+0x364/0x500 [nf_tables]
      [709732.359172]  ? fib4_rule_action+0x6d/0x80
      [709732.359178]  ? fib_rules_lookup+0x107/0x250
      [709732.359184]  nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat]
      [709732.359193]  nf_nat_inet_fn+0xea/0x210 [nf_nat]
      [709732.359202]  nf_nat_ipv4_out+0x14/0xa0 [nf_nat]
      [709732.359207]  nf_hook_slow+0x44/0xc0
      [709732.359214]  ip_output+0xd2/0x100
      [709732.359221]  ? __ip_finish_output+0x210/0x210
      [709732.359226]  ip_forward+0x37d/0x4a0
      [709732.359232]  ? ip4_key_hashfn+0xb0/0xb0
      [709732.359238]  ip_sublist_rcv_finish+0x4f/0x60
      [709732.359243]  ip_sublist_rcv+0x196/0x220
      [709732.359250]  ? ip_rcv_finish_core.isra.22+0x400/0x400
      [709732.359255]  ip_list_rcv+0x137/0x160
      [709732.359264]  __netif_receive_skb_list_core+0x29b/0x2c0
      [709732.359272]  netif_receive_skb_list_internal+0x1a6/0x2d0
      [709732.359280]  gro_normal_list.part.156+0x19/0x40
      [709732.359286]  napi_complete_done+0x67/0x170
      [709732.359298]  bnxt_poll+0x105/0x190 [bnxt_en]
      [709732.359304]  ? irqentry_exit+0x29/0x30
      [709732.359309]  ? asm_common_interrupt+0x1e/0x40
      [709732.359315]  net_rx_action+0x144/0x3c0
      [709732.359322]  __do_softirq+0xd5/0x29c
      [709732.359329]  asm_call_irq_on_stack+0xf/0x20
      [709732.359332]  </IRQ>
      [709732.359339]  do_softirq_own_stack+0x37/0x40
      [709732.359346]  irq_exit_rcu+0x9d/0xa0
      [709732.359353]  common_interrupt+0x78/0x130
      [709732.359358]  asm_common_interrupt+0x1e/0x40
      [709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel]
      [709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d
      [709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246
      [709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50
      [709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0
      [709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000
      [709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58
      [709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38
      [709732.359395]  ? crc_43+0x1e/0x1e [crc32c_intel]
      [709732.359403]  ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel]
      [709732.359419]  ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2]
      [709732.359425]  ? irq_exit_rcu+0x3e/0xa0
      [709732.359447]  ? kjournald2+0xbd/0x270 [jbd2]
      [709732.359454]  ? finish_wait+0x80/0x80
      [709732.359470]  ? commit_timeout+0x10/0x10 [jbd2]
      [709732.359476]  ? kthread+0x116/0x130
      [709732.359481]  ? kthread_park+0x80/0x80
      [709732.359488]  ? ret_from_fork+0x1f/0x30
      [709732.359494] ---[ end trace 081a19978e5f09f5 ]---
      
      that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq
      that interrupted a kthread, also using the FPU.
      
      That's exactly the reason why irq_fpu_usable() is there: use it, and
      if we can't use the FPU, fall back to the non-AVX2 version of the
      lookup operation, i.e. nft_pipapo_lookup().
      Reported-by: NArturo Borrero Gonzalez <arturo@netfilter.org>
      Cc: <stable@vger.kernel.org> # 5.6.x
      Fixes: 7400b063 ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f0b3d338
    • R
      netfilter: flowtable: Remove redundant hw refresh bit · c07531c0
      Roi Dayan 提交于
      Offloading conns could fail for multiple reasons and a hw refresh bit is
      set to try to reoffload it in next sw packet.
      But it could be in some cases and future points that the hw refresh bit
      is not set but a refresh could succeed.
      Remove the hw refresh bit and do offload refresh if requested.
      There won't be a new work entry if a work is already pending
      anyway as there is the hw pending bit.
      
      Fixes: 8b3646d6 ("net/sched: act_ct: Support refreshing the flow table entries")
      Signed-off-by: NRoi Dayan <roid@nvidia.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c07531c0
    • T
      openvswitch: meter: fix race when getting now_ms. · e4df1b0c
      Tao Liu 提交于
      We have observed meters working unexpected if traffic is 3+Gbit/s
      with multiple connections.
      
      now_ms is not pretected by meter->lock, we may get a negative
      long_delta_ms when another cpu updated meter->used, then:
          delta_ms = (u32)long_delta_ms;
      which will be a large value.
      
          band->bucket += delta_ms * band->rate;
      then we get a wrong band->bucket.
      
      OpenVswitch userspace datapath has fixed the same issue[1] some
      time ago, and we port the implementation to kernel datapath.
      
      [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/
      
      Fixes: 96fbc13d ("openvswitch: Add meter infrastructure")
      Signed-off-by: NTao Liu <thomas.liu@ucloud.cn>
      Suggested-by: NIlya Maximets <i.maximets@ovn.org>
      Reviewed-by: NIlya Maximets <i.maximets@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4df1b0c
  10. 13 5月, 2021 2 次提交
  11. 12 5月, 2021 13 次提交