1. 09 Apr, 2021 40 commits
    • stmmac: intel: Fixes clock registration error seen for multiple interfaces · 96e1cedc
      Committed by Wong Vee Khee
      stable inclusion
      from stable-5.10.24
      commit 9c4136081cc2076ca981e68001b3cb8f53800a94
      bugzilla: 51348
      
      --------------------------------
      
      commit 8eb37ab7 upstream.
      
      Issue seen when enumerating multiple Intel mGbE interfaces in EHL.
      
      [    6.898141] intel-eth-pci 0000:00:1d.2: enabling device (0000 -> 0002)
      [    6.900971] intel-eth-pci 0000:00:1d.2: Fail to register stmmac-clk
      [    6.906434] intel-eth-pci 0000:00:1d.2: User ID: 0x51, Synopsys ID: 0x52
      
      We fix it by making the clock name unique, following the format
      stmmac-pci_name(pci_dev), so that we can differentiate the clocks of
      these Intel mGbE interfaces on the EHL platform as follows:
      
        /sys/kernel/debug/clk/stmmac-0000:00:1d.1
        /sys/kernel/debug/clk/stmmac-0000:00:1d.2
        /sys/kernel/debug/clk/stmmac-0000:00:1e.4
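
      The naming scheme can be illustrated with a small model. This is only
      a sketch: ClockRegistry and register_stmmac_clk are hypothetical
      stand-ins for the kernel clock framework, which is assumed here to
      reject a second clock registered under an existing name, and the old
      fixed name "stmmac-clk" is inferred from the log line above.

```python
class ClockRegistry:
    """Toy stand-in for a clock framework that requires unique names."""

    def __init__(self):
        self.names = set()

    def register(self, name):
        # Assumption: registering an already-used name fails, as the
        # "Fail to register stmmac-clk" log line suggests.
        if name in self.names:
            return False
        self.names.add(name)
        return True


def register_stmmac_clk(registry, pci_name, unique):
    # Old behaviour: one fixed name shared by every interface.
    # New behaviour: the PCI device name is appended to make it unique.
    name = f"stmmac-{pci_name}" if unique else "stmmac-clk"
    return registry.register(name)


devices = ["0000:00:1d.1", "0000:00:1d.2", "0000:00:1e.4"]

old = ClockRegistry()
old_results = [register_stmmac_clk(old, d, unique=False) for d in devices]

new = ClockRegistry()
new_results = [register_stmmac_clk(new, d, unique=True) for d in devices]
```

      In this model only the first interface registers successfully with
      the shared name, while all three succeed with per-device names.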
      
      Fixes: 58da0cfa ("net: stmmac: create dwmac-intel.c to contain all Intel platform")
      Signed-off-by: Wong Vee Khee <vee.khee.wong@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Co-developed-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: stmmac: Fix VLAN filter delete timeout issue in Intel mGBE SGMII · f4995f98
      Committed by Ong Boon Leong
      stable inclusion
      from stable-5.10.24
      commit d78f23ef304060608bc9b1627f303842f75d7029
      bugzilla: 51348
      
      --------------------------------
      
      commit 9a7b3950 upstream.
      
      For the Intel mGbE controller, the MAC VLAN filter delete operation will
      time out if the serdes power-down sequence happens first during driver
      remove(), with the message below.
      
      [82294.764958] intel-eth-pci 0000:00:1e.4 eth2: stmmac_dvr_remove: removing driver
      [82294.778677] intel-eth-pci 0000:00:1e.4 eth2: Timeout accessing MAC_VLAN_Tag_Filter
      [82294.779997] intel-eth-pci 0000:00:1e.4 eth2: failed to kill vid 0081/0
      [82294.947053] intel-eth-pci 0000:00:1d.2 eth1: stmmac_dvr_remove: removing driver
      [82295.002091] intel-eth-pci 0000:00:1d.1 eth0: stmmac_dvr_remove: removing driver
      
      Therefore, we delay the serdes power-down to be after unregister_netdev()
      which triggers the VLAN filter delete.
      
      Fixes: b9663b7c ("net: stmmac: Enable SERDES power up/down sequence")
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • cipso,calipso: resolve a number of problems with the DOI refcounts · a2ebebf3
      Committed by Paul Moore
      stable inclusion
      from stable-5.10.24
      commit 85178d76febd30a745b7d947dbd9751919d0fa5b
      bugzilla: 51348
      
      --------------------------------
      
      commit ad5d07f4 upstream.
      
      The current CIPSO and CALIPSO refcounting scheme for the DOI
      definitions is a bit flawed in that we:
      
      1. Don't correctly match gets/puts in netlbl_cipsov4_list().
      2. Decrement the refcount on each attempt to remove the DOI from the
         DOI list, only removing it from the list once the refcount drops
         to zero.
      
      This patch fixes these problems by adding the missing "puts" to
      netlbl_cipsov4_list() and introduces a more conventional, i.e.
      not-buggy, refcounting mechanism to the DOI definitions.  Upon the
      addition of a DOI to the DOI list, it is initialized with a refcount
      of one, removing a DOI from the list removes it from the list and
      drops the refcount by one; "gets" and "puts" behave as expected with
      respect to refcounts, increasing and decreasing the DOI's refcount by
      one.
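
      The refcounting rules described above can be sketched as a toy model.
      The Doi and DoiList classes and their method names are hypothetical
      and only mirror the prose; they are not the NetLabel API.

```python
class Doi:
    def __init__(self, doi):
        self.doi = doi
        self.refcount = 0
        self.freed = False


class DoiList:
    def __init__(self):
        self.entries = {}

    def add(self, doi_def):
        # Being on the list counts as exactly one reference.
        doi_def.refcount = 1
        self.entries[doi_def.doi] = doi_def

    def get(self, doi):
        doi_def = self.entries.get(doi)
        if doi_def is not None:
            doi_def.refcount += 1
        return doi_def

    def put(self, doi_def):
        doi_def.refcount -= 1
        if doi_def.refcount == 0:
            doi_def.freed = True

    def remove(self, doi):
        # Unlink right away, then drop the list's own reference once.
        doi_def = self.entries.pop(doi, None)
        if doi_def is None:
            return False
        self.put(doi_def)
        return True


dois = DoiList()
d = Doi(3)
dois.add(d)
user = dois.get(3)               # a user takes a reference: refcount 2
first_remove = dois.remove(3)    # unlinked, refcount 1, still alive
second_remove = dois.remove(3)   # already unlinked, no extra decrement
alive_after_remove = not d.freed
dois.put(user)                   # last reference dropped
```

      A repeated remove no longer decrements the count, so the definition is
      released only when the last user calls put.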
      
      Fixes: b1edeb10 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
      Fixes: d7cce015 ("netlabel: Add support for removing a CALIPSO DOI.")
      Reported-by: syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • netdevsim: init u64 stats for 32bit hardware · db715b39
      Committed by Hillf Danton
      stable inclusion
      from stable-5.10.24
      commit e03ed1190d56f983c582c0853499d4c3a2cfa410
      bugzilla: 51348
      
      --------------------------------
      
      commit 863a42b2 upstream.
      
      Init the u64 stats in order to avoid the lockdep prints on the 32bit
      hardware like
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 0 PID: 4695 Comm: syz-executor.0 Not tainted 5.11.0-rc5-syzkaller #0
       Hardware name: ARM-Versatile Express
       Backtrace:
       [<826fc5b8>] (dump_backtrace) from [<826fc82c>] (show_stack+0x18/0x1c arch/arm/kernel/traps.c:252)
       [<826fc814>] (show_stack) from [<8270d1f8>] (__dump_stack lib/dump_stack.c:79 [inline])
       [<826fc814>] (show_stack) from [<8270d1f8>] (dump_stack+0xa8/0xc8 lib/dump_stack.c:120)
       [<8270d150>] (dump_stack) from [<802bf9c0>] (assign_lock_key kernel/locking/lockdep.c:935 [inline])
       [<8270d150>] (dump_stack) from [<802bf9c0>] (register_lock_class+0xabc/0xb68 kernel/locking/lockdep.c:1247)
       [<802bef04>] (register_lock_class) from [<802baa2c>] (__lock_acquire+0x84/0x32d4 kernel/locking/lockdep.c:4711)
       [<802ba9a8>] (__lock_acquire) from [<802be840>] (lock_acquire.part.0+0xf0/0x554 kernel/locking/lockdep.c:5442)
       [<802be750>] (lock_acquire.part.0) from [<802bed10>] (lock_acquire+0x6c/0x74 kernel/locking/lockdep.c:5415)
       [<802beca4>] (lock_acquire) from [<81560548>] (seqcount_lockdep_reader_access include/linux/seqlock.h:103 [inline])
       [<802beca4>] (lock_acquire) from [<81560548>] (__u64_stats_fetch_begin include/linux/u64_stats_sync.h:164 [inline])
       [<802beca4>] (lock_acquire) from [<81560548>] (u64_stats_fetch_begin include/linux/u64_stats_sync.h:175 [inline])
       [<802beca4>] (lock_acquire) from [<81560548>] (nsim_get_stats64+0xdc/0xf0 drivers/net/netdevsim/netdev.c:70)
       [<8156046c>] (nsim_get_stats64) from [<81e2efa0>] (dev_get_stats+0x44/0xd0 net/core/dev.c:10405)
       [<81e2ef5c>] (dev_get_stats) from [<81e53204>] (rtnl_fill_stats+0x38/0x120 net/core/rtnetlink.c:1211)
       [<81e531cc>] (rtnl_fill_stats) from [<81e59d58>] (rtnl_fill_ifinfo+0x6d4/0x148c net/core/rtnetlink.c:1783)
       [<81e59684>] (rtnl_fill_ifinfo) from [<81e5ceb4>] (rtmsg_ifinfo_build_skb+0x9c/0x108 net/core/rtnetlink.c:3798)
       [<81e5ce18>] (rtmsg_ifinfo_build_skb) from [<81e5d0ac>] (rtmsg_ifinfo_event net/core/rtnetlink.c:3830 [inline])
       [<81e5ce18>] (rtmsg_ifinfo_build_skb) from [<81e5d0ac>] (rtmsg_ifinfo_event net/core/rtnetlink.c:3821 [inline])
       [<81e5ce18>] (rtmsg_ifinfo_build_skb) from [<81e5d0ac>] (rtmsg_ifinfo+0x44/0x70 net/core/rtnetlink.c:3839)
       [<81e5d068>] (rtmsg_ifinfo) from [<81e45c2c>] (register_netdevice+0x664/0x68c net/core/dev.c:10103)
       [<81e455c8>] (register_netdevice) from [<815608bc>] (nsim_create+0xf8/0x124 drivers/net/netdevsim/netdev.c:317)
       [<815607c4>] (nsim_create) from [<81561184>] (__nsim_dev_port_add+0x108/0x188 drivers/net/netdevsim/dev.c:941)
       [<8156107c>] (__nsim_dev_port_add) from [<815620d8>] (nsim_dev_port_add_all drivers/net/netdevsim/dev.c:990 [inline])
       [<8156107c>] (__nsim_dev_port_add) from [<815620d8>] (nsim_dev_probe+0x5cc/0x750 drivers/net/netdevsim/dev.c:1119)
       [<81561b0c>] (nsim_dev_probe) from [<815661dc>] (nsim_bus_probe+0x10/0x14 drivers/net/netdevsim/bus.c:287)
       [<815661cc>] (nsim_bus_probe) from [<811724c0>] (really_probe+0x100/0x50c drivers/base/dd.c:554)
       [<811723c0>] (really_probe) from [<811729c4>] (driver_probe_device+0xf8/0x1c8 drivers/base/dd.c:740)
       [<811728cc>] (driver_probe_device) from [<81172fe4>] (__device_attach_driver+0x8c/0xf0 drivers/base/dd.c:846)
       [<81172f58>] (__device_attach_driver) from [<8116fee0>] (bus_for_each_drv+0x88/0xd8 drivers/base/bus.c:431)
       [<8116fe58>] (bus_for_each_drv) from [<81172c6c>] (__device_attach+0xdc/0x1d0 drivers/base/dd.c:914)
       [<81172b90>] (__device_attach) from [<8117305c>] (device_initial_probe+0x14/0x18 drivers/base/dd.c:961)
       [<81173048>] (device_initial_probe) from [<81171358>] (bus_probe_device+0x90/0x98 drivers/base/bus.c:491)
       [<811712c8>] (bus_probe_device) from [<8116e77c>] (device_add+0x320/0x824 drivers/base/core.c:3109)
       [<8116e45c>] (device_add) from [<8116ec9c>] (device_register+0x1c/0x20 drivers/base/core.c:3182)
       [<8116ec80>] (device_register) from [<81566710>] (nsim_bus_dev_new drivers/net/netdevsim/bus.c:336 [inline])
       [<8116ec80>] (device_register) from [<81566710>] (new_device_store+0x178/0x208 drivers/net/netdevsim/bus.c:215)
       [<81566598>] (new_device_store) from [<8116fcb4>] (bus_attr_store+0x2c/0x38 drivers/base/bus.c:122)
       [<8116fc88>] (bus_attr_store) from [<805b4b8c>] (sysfs_kf_write+0x48/0x54 fs/sysfs/file.c:139)
       [<805b4b44>] (sysfs_kf_write) from [<805b3c90>] (kernfs_fop_write_iter+0x128/0x1ec fs/kernfs/file.c:296)
       [<805b3b68>] (kernfs_fop_write_iter) from [<804d22fc>] (call_write_iter include/linux/fs.h:1901 [inline])
       [<805b3b68>] (kernfs_fop_write_iter) from [<804d22fc>] (new_sync_write fs/read_write.c:518 [inline])
       [<805b3b68>] (kernfs_fop_write_iter) from [<804d22fc>] (vfs_write+0x3dc/0x57c fs/read_write.c:605)
       [<804d1f20>] (vfs_write) from [<804d2604>] (ksys_write+0x68/0xec fs/read_write.c:658)
       [<804d259c>] (ksys_write) from [<804d2698>] (__do_sys_write fs/read_write.c:670 [inline])
       [<804d259c>] (ksys_write) from [<804d2698>] (sys_write+0x10/0x14 fs/read_write.c:667)
       [<804d2688>] (sys_write) from [<80200060>] (ret_fast_syscall+0x0/0x2c arch/arm/mm/proc-v7.S:64)
      
      Fixes: 83c9e13a ("netdevsim: add software driver for testing offloads")
      Reported-by: syzbot+e74a6857f2d0efe3ad81@syzkaller.appspotmail.com
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: usb: qmi_wwan: allow qmimux add/del with master up · 19b7b69a
      Committed by Daniele Palmas
      stable inclusion
      from stable-5.10.24
      commit 6ed0a2cafd1f08a243123df094aa8479590112bf
      bugzilla: 51348
      
      --------------------------------
      
      commit 6c59cff3 upstream.
      
      There's no reason for preventing the creation and removal
      of qmimux network interfaces when the underlying interface
      is up.
      
      This makes qmi_wwan mux implementation more similar to the
      rmnet one, simplifying userspace management of the same
      logical interfaces.
      
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Reported-by: Aleksander Morgado <aleksander@aleksander.es>
      Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: dsa: sja1105: fix SGMII PCS being forced to SPEED_UNKNOWN instead of SPEED_10 · 1b757456
      Committed by Vladimir Oltean
      stable inclusion
      from stable-5.10.24
      commit 565b2d3ae20256be43df960206cdd1d8d479c325
      bugzilla: 51348
      
      --------------------------------
      
      commit 053d8ad1 upstream.
      
      When using MLO_AN_PHY or MLO_AN_FIXED, the MII_BMCR of the SGMII PCS is
      read before resetting the switch so it can be reprogrammed afterwards.
      This works for the speeds of 1Gbps and 100Mbps, but not for 10Mbps,
      because the BMCR_SPEED10 mask is actually 0, so AND-ing anything with it
      is false, therefore that last branch is dead code.
      
      Do what others do (genphy_read_status_fixed, phy_mii_ioctl) and just
      remove the check for SPEED_10, let it fall into the default case.
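
      A minimal sketch of the dead branch follows. The BMCR mask values are
      assumed to match the kernel's include/uapi/linux/mii.h; the two decode
      functions are illustrative and are not the driver code. None stands in
      for SPEED_UNKNOWN.

```python
# MII BMCR speed-select masks (assumed to match linux/mii.h).
BMCR_SPEED1000 = 0x0040
BMCR_SPEED100 = 0x2000
BMCR_SPEED10 = 0x0000


def decode_speed_buggy(bmcr):
    if bmcr & BMCR_SPEED1000:
        return 1000
    elif bmcr & BMCR_SPEED100:
        return 100
    elif bmcr & BMCR_SPEED10:  # x & 0 is always 0: never taken
        return 10
    return None  # stands in for SPEED_UNKNOWN


def decode_speed_fixed(bmcr):
    if bmcr & BMCR_SPEED1000:
        return 1000
    elif bmcr & BMCR_SPEED100:
        return 100
    return 10  # default case: neither speed bit set means 10 Mbps


bmcr_10mbps = 0x0000
buggy = decode_speed_buggy(bmcr_10mbps)
fixed = decode_speed_fixed(bmcr_10mbps)
```

      Both variants agree for 1Gbps and 100Mbps; only the 10Mbps case
      differs.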
      
      Fixes: ffe10e67 ("net: dsa: sja1105: Add support for the SGMII port")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: mscc: ocelot: properly reject destination IP keys in VCAP IS1 · 64c28022
      Committed by Vladimir Oltean
      stable inclusion
      from stable-5.10.24
      commit 719611e806deea598088541bd4509a3735d29c92
      bugzilla: 51348
      
      --------------------------------
      
      commit f1becbed upstream.
      
      An attempt is made to warn the user about the fact that VCAP IS1 cannot
      offload keys matching on destination IP (at least given the current half
      key format), but sadly that warning fails miserably in practice, due to
      the fact that it operates on an uninitialized "match" variable. We must
      first decode the keys from the flow rule.
      
      Fixes: 75944fda ("net: mscc: ocelot: offload ingress skbedit and vlan actions to VCAP IS1")
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: sched: avoid duplicates in classes dump · 3bfa215f
      Committed by Maximilian Heyne
      stable inclusion
      from stable-5.10.24
      commit 2809a5ca962e96397d9504414a1140a69fe5e138
      bugzilla: 51348
      
      --------------------------------
      
      commit bfc25605 upstream.
      
      This is a follow up of commit ea327469 ("net: sched: avoid
      duplicates in qdisc dump") which has fixed the issue only for the qdisc
      dump.
      
      The duplicate printing also occurs when dumping the classes via
        tc class show dev eth0
      
      Fixes: 59cc1f61 ("net: sched: convert qdisc linked list to hashtable")
      Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • nexthop: Do not flush blackhole nexthops when loopback goes down · 9edc9d71
      Committed by Ido Schimmel
      stable inclusion
      from stable-5.10.24
      commit 9c61f1e1c40e85e5db4154eba36711d166a38d34
      bugzilla: 51348
      
      --------------------------------
      
      commit 76c03bf8 upstream.
      
      As far as user space is concerned, blackhole nexthops do not have a
      nexthop device and therefore should not be affected by the
      administrative or carrier state of any netdev.
      
      However, when the loopback netdev goes down all the blackhole nexthops
      are flushed. This happens because internally the kernel associates
      blackhole nexthops with the loopback netdev.
      
      This behavior is both confusing to those not familiar with kernel
      internals and also diverges from the legacy API where blackhole IPv4
      routes are not flushed when the loopback netdev goes down:
      
       # ip route add blackhole 198.51.100.0/24
       # ip link set dev lo down
       # ip route show 198.51.100.0/24
       blackhole 198.51.100.0/24
      
      Blackhole IPv6 routes are flushed, but at least user space knows that
      they are associated with the loopback netdev:
      
       # ip -6 route show 2001:db8:1::/64
       blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
      
      Fix this by only flushing blackhole nexthops when the loopback netdev is
      unregistered.
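
      The intended behaviour can be sketched as a filter over netdev events.
      The dictionary layout, the event strings and flush_nexthops are
      hypothetical and only model the rule stated above, not the kernel's
      nexthop code.

```python
def flush_nexthops(nexthops, dev, event):
    """Return the nexthops that survive `event` on netdev `dev`."""
    kept = []
    for nh in nexthops:
        if nh["dev"] != dev:
            kept.append(nh)  # unrelated device: untouched
        elif nh["blackhole"] and event != "unregister":
            kept.append(nh)  # blackhole survives down/carrier changes
    return kept


nexthops = [
    # Internally the blackhole nexthop is tied to the loopback netdev.
    {"id": 1, "dev": "lo", "blackhole": True},
    {"id": 2, "dev": "eth0", "blackhole": False},
    {"id": 3, "dev": "lo", "blackhole": False},
]

after_down = [nh["id"] for nh in flush_nexthops(nexthops, "lo", "down")]
after_unreg = [nh["id"] for nh in flush_nexthops(nexthops, "lo", "unregister")]
```

      Taking lo down keeps the blackhole nexthop, while unregistering lo
      still removes it.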
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reported-by: Donald Sharp <sharpd@nvidia.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: stmmac: fix incorrect DMA channel intr enable setting of EQoS v4.10 · 025d472a
      Committed by Ong Boon Leong
      stable inclusion
      from stable-5.10.24
      commit 87b7b19d6e1dabbd12344b2784b78ea8b4992f6f
      bugzilla: 51348
      
      --------------------------------
      
      commit 879c348c upstream.
      
      We introduce dwmac410_dma_init_channel() here for both EQoS v4.10 and
      above which use different DMA_CH(n)_Interrupt_Enable bit definitions for
      NIE and AIE.
      
      Fixes: 48863ce5 ("stmmac: add DMA support for GMAC 4.xx")
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Ramesh Babu B <ramesh.babu.b@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net/mlx4_en: update moderation when config reset · 24d7b86f
      Committed by Kevin(Yudong) Yang
      stable inclusion
      from stable-5.10.24
      commit 6b0d3ae1051bdca4acf91a66be64d42d1c0f577b
      bugzilla: 51348
      
      --------------------------------
      
      commit 00ff801b upstream.
      
      This patch fixes a bug where the moderation config is not applied
      when mlx4_en_reset_config() is called. For example, turning on rx
      timestamping calls mlx4_en_reset_config(), causing the NIC to forget
      the previous moderation config.
      
      This fix is in phase with a previous fix:
      commit 79c54b6b ("net/mlx4_en: Fix TX moderation info loss
      after set_ringparam is called")
      
      Tested: Before this patch, on a host with NIC using mlx4, run
      netserver and stream TCP to the host at full utilization.
      $ sar -I SUM 1
                       INTR    intr/s
      14:03:56          sum  48758.00
      
      After rx hwtstamp is enabled:
      $ sar -I SUM 1
      14:10:38          sum 317771.00
      We see that moderation is not working properly: about 7x more
      interrupts are issued.

      After the patch, with rx hwtstamp turned on, the rate of interrupts
      is as expected:
      $ sar -I SUM 1
      14:52:11          sum  49332.00
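
      As a quick check of the figures quoted above, the ratios can be
      computed directly. The variable names are illustrative; the numbers
      are the intr/s values from the three sar runs.

```python
baseline = 48758.00            # before enabling rx hwtstamp
without_fix = 317771.00        # rx hwtstamp enabled, unpatched
with_fix = 49332.00            # rx hwtstamp enabled, patched

ratio_without_fix = without_fix / baseline
ratio_with_fix = with_fix / baseline

print(f"unpatched: {ratio_without_fix:.2f}x, patched: {ratio_with_fix:.2f}x")
```

      The unpatched rate is roughly 6.5 times the baseline, which the
      message rounds to 7x; the patched rate is within about 1% of it.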
      
      Fixes: 79c54b6b ("net/mlx4_en: Fix TX moderation info loss after set_ringparam is called")
      Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      CC: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: ethernet: mtk-star-emac: fix wrong unmap in RX handling · 8f07278d
      Committed by Biao Huang
      stable inclusion
      from stable-5.10.24
      commit fa0bc09db49bf4875d9a8c88813fe2b87c1059bb
      bugzilla: 51348
      
      --------------------------------
      
      commit 95b39f07 upstream.
      
      mtk_star_dma_unmap_rx() should unmap the dma_addr of old skb rather than
      that of new skb.
      Assign new_dma_addr to desc_data.dma_addr after all handling of old skb
      ends to avoid unexpected receive side error.
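
      The ordering problem can be modelled in a few lines. FakeDma and
      replace_rx_buffer are hypothetical helpers, not the driver's
      functions; they only show why the descriptor must keep the old address
      until the old buffer has been unmapped.

```python
class FakeDma:
    """Tracks which addresses are currently mapped."""

    def __init__(self):
        self.mapped = set()
        self.next_addr = 0x1000

    def map(self):
        addr = self.next_addr
        self.next_addr += 0x1000
        self.mapped.add(addr)
        return addr

    def unmap(self, addr):
        self.mapped.remove(addr)


def replace_rx_buffer(dma, desc, unmap_old_first):
    new_addr = dma.map()
    if unmap_old_first:
        dma.unmap(desc["dma_addr"])   # unmaps the old buffer
        desc["dma_addr"] = new_addr   # assign only after old handling ends
    else:
        desc["dma_addr"] = new_addr
        dma.unmap(desc["dma_addr"])   # wrong: unmaps the new buffer


good_dma = FakeDma()
good_desc = {"dma_addr": good_dma.map()}
replace_rx_buffer(good_dma, good_desc, unmap_old_first=True)

bad_dma = FakeDma()
bad_old_addr = bad_dma.map()
bad_desc = {"dma_addr": bad_old_addr}
replace_rx_buffer(bad_dma, bad_desc, unmap_old_first=False)
```

      With the wrong order the descriptor ends up pointing at an unmapped
      buffer while the old mapping is never released.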
      
      Fixes: f96e9641 ("net: ethernet: mtk-star-emac: fix error path in RX handling")
      Signed-off-by: Biao Huang <biao.huang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • net: enetc: keep RX ring consumer index in sync with hardware · 684db2b4
      Committed by Vladimir Oltean
      stable inclusion
      from stable-5.10.24
      commit 1cdd008902d4e32f270e8fdb3239db6412f0a90b
      bugzilla: 51348
      
      --------------------------------
      
      commit 3a5d12c9 upstream.
      
      The RX rings have a producer index owned by hardware, where newly
      received frame buffers are placed, and a consumer index owned by
      software, where newly allocated buffers are placed, in expectation of
      hardware being able to place frame data in them.
      
      Hardware increments the producer index when a frame is received, however
      it is not allowed to increment the producer index to match the consumer
      index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
      BDs. Whenever the producer index matches the value of the consumer
      index, the ring has no unprocessed received frames and all BDs in the
      ring have been initialized/prepared by software, i.e. hardware owns all
      BDs in the ring.
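
      The index rule above can be written as a one-line helper. The function
      name hw_free_slots is hypothetical; the formula simply restates that
      the producer index may advance until it is one short of the consumer
      index.

```python
def hw_free_slots(producer, consumer, length):
    """Frames hardware may still accept before the producer index would
    have to catch up with the consumer index (modulo the ring length)."""
    return (consumer - producer - 1) % length


full_ring = hw_free_slots(0, 0, 512)      # indices equal: all BDs usable
stalled_ring = hw_free_slots(14, 15, 512) # producer + 1 == consumer
```

      With equal indices the helper yields length - 1, matching the
      statement that the ring can hold at most RBLENR[LENGTH]-1 received
      BDs.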
      
      The code uses the next_to_clean variable to keep track of the producer
      index, and the next_to_use variable to keep track of the consumer index.
      
      The RX rings are seeded from enetc_refill_rx_ring, which is called from
      two places:
      
      1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
         i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:
      
      .ndo_open
      -> enetc_open
         -> enetc_setup_bdrs
            -> enetc_setup_rxbdr
               -> enetc_refill_rx_ring
      
      2. then during the data path processing, it is refilled with 16 buffers
         at a time:
      
      enetc_msix
      -> napi_schedule
         -> enetc_poll
            -> enetc_clean_rx_ring
               -> enetc_refill_rx_ring
      
      There is just one problem: the initial seeding done during .ndo_open
      updates just the producer index (ENETC_RBPIR) with 0, and the software
      next_to_clean and next_to_use variables. Notably, it will not update the
      consumer index to make the hardware aware of the newly added buffers.
      
      Wait, what? So how does it work?
      
      Well, the reset values of the producer index and of the consumer index
      of a ring are both zero. As per the description in the second paragraph,
      it means that the ring is full of buffers waiting for hardware to put
      frames in them, which by coincidence is almost true, because we have in
      fact seeded 511 buffers into the ring.
      
      But will the hardware attempt to access the 512th entry of the ring,
      which has an invalid BD in it? Well, no, because in order to do that, it
      would have to first populate the first 511 entries, and the NAPI
      enetc_poll will kick in by then. Eventually, after 16 processed slots
      have become available in the RX ring, enetc_clean_rx_ring will call
      enetc_refill_rx_ring and then will [ finally ] update the consumer index
      with the new software next_to_use variable. From now on, the
      next_to_clean and next_to_use variables are in sync with the producer
      and consumer ring indices.
      
      So the day is saved, right? Well, not quite. Freeing the memory
      allocated for the rings is done in:
      
      enetc_close
      -> enetc_clear_bdrs
         -> enetc_clear_rxbdr
            -> this just disables the ring
      -> enetc_free_rxtx_rings
         -> enetc_free_rx_ring
            -> sets next_to_clean and next_to_use to 0
      
      but again, nothing is committed to the hardware producer and consumer
      indices (yay!). The assumption is that the ring is disabled, so the
      indices don't matter anyway, and it's the responsibility of the "open"
      code path to set those up.
      
      .. Except that the "open" code path does not set those up properly.
      
      While initially, things almost work, during subsequent enetc_close ->
      enetc_open sequences, we have problems. To be precise, the enetc_open
      that is subsequent to enetc_close will again refill the ring with 511
      entries, but it will leave the consumer index untouched. Untouched
      means, of course, equal to the value it had before disabling the ring
      and draining the old buffers in enetc_close.
      
      But as mentioned, enetc_setup_rxbdr will at least update the producer
      index though, through this line of code:
      
      	enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);
      
      so at this stage we'll have:
      
      next_to_clean=0 (in hardware 0)
      next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)
      
      Again, the next_to_clean and producer index are in sync and set to
      correct values, so the driver manages to limp on. Eventually, 16 ring
      entries will be consumed by enetc_poll, and the savior
      enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
      update the hardware consumer ring based upon the new next_to_use.
      
      So.. it works?
      Well, by coincidence, it almost does, but there's a circumstance where
      enetc_clean_rx_ring won't be there to save us. If the previous value of
      the consumer index was 15, there's a problem, because the NAPI poll
      sequence will only issue a refill when 16 or more buffers have been
      consumed.
      
      It's easiest to illustrate this with an example:
      
      ip link set eno0 up
      ip addr add 192.168.100.1/24 dev eno0
      ping 192.168.100.1 -c 20 # ping this port from another board
      ip link set eno0 down
      ip link set eno0 up
      ping 192.168.100.1 -c 20 # ping it again from the same other board
      
      One by one:
      
      1. ip link set eno0 up
      -> calls enetc_setup_rxbdr:
         -> calls enetc_refill_rx_ring(511 buffers)
         -> next_to_clean=0 (in hw 0)
         -> next_to_use=511 (in hw 0)
      
      2. ping 192.168.100.1 -c 20 # ping this port from another board
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)
      
      20 packets transmitted, 20 packets received, 0% packet loss
      
      3. ip link set eno0 down
      enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)
      
      4. ip link set eno0 up
      -> calls enetc_setup_rxbdr:
         -> calls enetc_refill_rx_ring(511 buffers)
         -> next_to_clean=0 (in hw 0)
         -> next_to_use=511 (in hw 15)
      
      5. ping 192.168.100.1 -c 20 # ping it again from the same other board
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)
      
      20 packets transmitted, 12 packets received, 40% packet loss
      
      And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
      to 15 for that to happen), no nothing. The hardware enters the condition where
      the producer (14) + 1 is equal to the consumer (15) index, which makes it
      believe it has no more free buffers to put packets in, so it starts discarding
      them:
      
      ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
      NIC statistics:
           Rx ring  0 discarded frames: 8
      
      To summarize: if the interface receives between 16 and 32 (mod 512)
      frames and then there is a link flap, the port will eventually die with
      no way to recover. If it receives fewer than 16 (mod 512) frames, the
      initial NAPI poll [ before the link flap ] will not update the consumer
      index in hardware (it will remain zero), which is fine when the buffers
      are later reinitialized. If more than 32 (mod 512) frames are received,
      the initial NAPI poll has the chance to refill the ring twice, updating
      the consumer index to at least 32. After the link flap the consumer
      index is still wrong, but the post-flap NAPI poll gets a chance to
      refill the ring once (because it passes through cleaned_cnt=15), which
      brings the consumer index back in sync with next_to_use.
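
      The index arithmetic in the log can be replayed with a small userspace
      model. This is only a sketch under stated assumptions: struct
      rx_ring_model, ring_open() and ring_receive() are hypothetical names, the
      refill threshold of 16 is inferred from the enetc_refill_rx_ring(16) line
      in the log, and the model counts frames of any kind, not pings.

```c
#include <assert.h>
#include <stdbool.h>

#define BD_COUNT 512 /* ring size seen in the log */
#define BUNDLE   16  /* refill threshold inferred from the log */

/* Simplified model of one RX ring; all names are illustrative. */
struct rx_ring_model {
        int next_to_clean; /* software: next descriptor to process */
        int next_to_use;   /* software: next descriptor to give to hardware */
        int cleaned_cnt;   /* descriptors processed since the last refill */
        int hw_producer;   /* hardware: advanced for every stored frame */
        int hw_consumer;   /* hardware: changes only when software writes it */
};

/*
 * Model of bringing the interface up: seed BD_COUNT - 1 buffers and
 * restart the hardware producer index. The consumer index is rewritten
 * only when sync_consumer is true, which is what the fix adds.
 */
static void ring_open(struct rx_ring_model *r, bool sync_consumer)
{
        r->next_to_clean = 0;
        r->next_to_use = BD_COUNT - 1;
        r->cleaned_cnt = 0;
        r->hw_producer = 0;
        if (sync_consumer)
                r->hw_consumer = r->next_to_use;
}

/* Offer `frames` frames to the ring and return how many were accepted. */
static int ring_receive(struct rx_ring_model *r, int frames)
{
        int accepted = 0;

        assert(frames >= 0);
        for (int i = 0; i < frames; i++) {
                /* Hardware sees a full ring when producer + 1 == consumer. */
                if ((r->hw_producer + 1) % BD_COUNT == r->hw_consumer)
                        continue; /* frame discarded */

                r->hw_producer = (r->hw_producer + 1) % BD_COUNT;
                r->next_to_clean = (r->next_to_clean + 1) % BD_COUNT;
                accepted++;

                /* Refill in bundles and publish the new consumer index. */
                if (++r->cleaned_cnt >= BUNDLE) {
                        r->next_to_use = (r->next_to_use + BUNDLE) % BD_COUNT;
                        r->hw_consumer = r->next_to_use;
                        r->cleaned_cnt = 0;
                }
        }
        return accepted;
}
```

      Starting from a zeroed struct, ring_open(&r, false) followed by 22 frames
      leaves hw_producer at 22 and hw_consumer at 15, as in the log. A second
      ring_open(&r, false) then accepts only 14 of the next 20 frames, while
      the same sequence with sync_consumer set to true accepts all 20.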
      
      The solution to this problem is actually simple: we just need to write
      next_to_use into the hardware consumer index at enetc_open time, which
      always brings it back in sync after an initial buffer seeding process.
      
      The simpler thing would be to put the write to the consumer index into
      enetc_refill_rx_ring directly, but there are issues with the MDIO
      locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
      top-level and we use the unlocked enetc_wr_reg_hot, whereas in
      enetc_open, the enetc_lock_mdio() is not taken at the top level, but
      instead by each individual enetc_wr_reg, so we are forced to put an
      additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
      the code is left as a refactoring exercise.
      
      Fixes: d4fd0404 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      684db2b4
    • V
      net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr · 3829c6cf
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit 5317365401119e88268d61691d298704ca7286c4
      bugzilla: 51348
      
      --------------------------------
      
      commit 96a5223b upstream.
      
      The Station Interface Receive Interrupt Detect Register (SIRXIDR)
      contains a 16-bit wide mask of 'interrupt detected' events for each ring
      associated with a port. Bit i is write-1-to-clear for RX ring i.
      
      I have no explanation whatsoever how this line of code came to be
      inserted in the blamed commit. I checked the downstream versions of that
      patch and none of them have it.
      
      The somewhat comical aspect of it is that we're writing a binary number
      to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring).
      Since the RX rings have 512 buffer descriptors, we end up writing 511 to
      this register, which is 0x1ff, so we are effectively clearing the
      'interrupt detected' event for rings 0-8.
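
      The arithmetic behind "511 is 0x1ff, so rings 0-8" can be checked with a
      short sketch. bd_unused() below is a hypothetical helper that models the
      usual "free descriptors minus one reserved slot" calculation for a ring;
      it is an assumption about the shape of enetc_bd_unused(), not a copy of
      it, and print_ring_bits() is likewise illustrative.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Free descriptors in a ring, keeping one slot unused so that a full
 * ring and an empty ring can be told apart. Hypothetical model.
 */
static int bd_unused(int bd_count, int next_to_clean, int next_to_use)
{
        assert(bd_count > 0);
        if (next_to_clean > next_to_use)
                return next_to_clean - next_to_use - 1;
        return bd_count + next_to_clean - next_to_use - 1;
}

/* Print every ring whose bit is set in a 16-bit mask; return the count. */
static int print_ring_bits(unsigned int value)
{
        int count = 0;

        for (int ring = 0; ring < 16; ring++) {
                if (value & (1u << ring)) {
                        printf("bit set for RX ring %d\n", ring);
                        count++;
                }
        }
        return count;
}
```

      For a freshly initialized 512-entry ring, bd_unused(512, 0, 0) gives 511,
      and print_ring_bits(511) reports nine bits, for rings 0 through 8.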
      
      This register is not what is used for interrupt handling though - it
      only provides a summary for the entire SI. The hardware provides one
      separate Interrupt Detect Register per RX ring, which auto-clears upon
      read. So there doesn't seem to be any adverse effect caused by this
      bogus write.
      
      There is, however, one reason why this should be handled as a bugfix:
      next_to_clean _should_ be committed to hardware, just not to that
      register, and this was obscuring the fact that it wasn't. This is fixed
      in the next patch, and removing the bogus line now allows the fix patch
      to be backported beyond that point.
      
      Fixes: fd5736bf ("enetc: Workaround for MDIO register access issue")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3829c6cf
    • V
      net: enetc: force the RGMII speed and duplex instead of operating in inband mode · d0a90192
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit 63876df5615edfe94291409eb862f4570e2f4ffc
      bugzilla: 51348
      
      --------------------------------
      
      commit c76a9721 upstream.
      
      The ENETC port 0 MAC supports in-band status signaling coming from a PHY
      when operating in RGMII mode, and this feature is enabled by default.
      
      It has been reported that RGMII is broken in fixed-link, and that is not
      surprising considering the fact that no PHY is attached to the MAC in
      that case, but a switch.
      
      This brings us to the topic of the patch: the enetc driver should not
      have enabled the optional in-band status signaling for RGMII
      unconditionally, but should have forced the speed and duplex to what was
      resolved by phylink.
      
      Note that phylink does not accept the RGMII modes as valid for in-band
      signaling, and these operate a bit differently than 1000base-x and SGMII
      (notably there is no clause 37 state machine so no ACK required from the
      MAC, instead the PHY sends extra code words on RXD[3:0] whenever it is
      not transmitting something else, so it should be safe to leave a PHY
      with this option unconditionally enabled even if we ignore it). The spec
      talks about this here:
      https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/138/RGMIIv1_5F00_3.pdf
      
      Fixes: 71b77a7a ("enetc: Migrate to PHYLINK and PCS_LYNX")
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      d0a90192
    • V
      net: enetc: don't disable VLAN filtering in IFF_PROMISC mode · 4b000964
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit 5732688c8411b1d29a3676819c279236b0a0ec5b
      bugzilla: 51348
      
      --------------------------------
      
      commit a74dbce9 upstream.
      
      Quoting from the blamed commit:
      
          In promiscuous mode, it is more intuitive that all traffic is received,
          including VLAN tagged traffic. It appears that it is necessary to set
          the flag in PSIPVMR for that to be the case, so VLAN promiscuous mode is
          also temporarily enabled. On exit from promiscuous mode, the setting
          made by ethtool is restored.
      
      Intuitive or not, there isn't any definition issued by a standards body
      which says that promiscuity has anything to do with VLAN filtering - it
      only has to do with accepting packets regardless of destination MAC address.
      
      In fact people are already trying to use this misunderstanding/bug of
      the enetc driver as a justification to transform promiscuity into
      something it never was about: accepting every packet (maybe that would
      be the "rx-all" netdev feature?):
      https://lore.kernel.org/netdev/20201110153958.ci5ekor3o2ekg3ky@ipetronik.com/
      
      This is relevant because there are use cases in the kernel (such as
      tc-flower rules with the protocol 802.1Q and a vlan_id key) which do not
      (yet) use the vlan_vid_add API to be compatible with VLAN-filtering NICs
      such as enetc, so for those, disabling rx-vlan-filter is currently the
      only right solution to make these setups work:
      https://lore.kernel.org/netdev/CA+h21hoxwRdhq4y+w8Kwgm74d4cA0xLeiHTrmT-VpSaM7obhkg@mail.gmail.com/
      The blamed patch has unintentionally introduced one more way for this to
      work, which is to enable IFF_PROMISC, however this is non-portable
      because port promiscuity is not meant to disable VLAN filtering.
      Therefore, it could invite people to write broken scripts for enetc, and
      then wonder why they are broken when migrating to other drivers that
      don't handle promiscuity in the same way.
      
      Fixes: 7070eea5 ("enetc: permit configuration of rx-vlan-filter with ethtool")
      Cc: Markus Blöchl <Markus.Bloechl@ipetronik.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      4b000964
    • V
      net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets · fded4910
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit d56e3f8d289bdc70378f84efab166ad38022532e
      bugzilla: 51348
      
      --------------------------------
      
      commit 827b6fd0 upstream.
      
      When the enetc ports have rx-vlan-offload enabled, they report a TPID of
      ETH_P_8021Q regardless of what was actually in the packet. When
      rx-vlan-offload is disabled, packets have the proper TPID. Fix this
      inconsistency by finishing the TODO left in the code.
      
      Fixes: d4fd0404 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      fded4910
    • V
      net: enetc: take the MDIO lock only once per NAPI poll cycle · cfce9813
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit bf9c564716a13dde6a990d3b02c27cd6e39608bf
      bugzilla: 51348
      
      --------------------------------
      
      commit 6d36ecdb upstream.
      
      The workaround for the ENETC MDIO erratum caused a performance
      degradation of 82 Kpps (seen with IP forwarding of two 1Gbps streams of
      64B packets). This is due to excessive locking and unlocking in the fast
      path, which can be avoided.
      
      By taking the MDIO read-side lock only once per NAPI poll cycle, we are
      able to regain 54 Kpps (65%) of the performance hit. The rest of the
      performance degradation comes from the TX data path, but unfortunately
      it doesn't look like we can optimize that away easily: even with
      netdev_xmit_more(), there isn't any skb batching done that would let us
      take the MDIO lock less often than once per packet.
      
      We need to change the register accessor type for enetc_get_tx_tstamp,
      because it now runs under the enetc_lock_mdio as per the new call path
      detailed below:
      
      enetc_msix
      -> napi_schedule
         -> enetc_poll
            -> enetc_lock_mdio
            -> enetc_clean_tx_ring
               -> enetc_get_tx_tstamp
            -> enetc_clean_rx_ring
            -> enetc_unlock_mdio
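
      The difference between locking around every register access and locking
      once around the whole poll cycle can be illustrated with a counting
      model. This is a hedged sketch: struct lock_model and the functions below
      are hypothetical stand-ins that only count acquisitions, and they do not
      reproduce the real enetc accessors or the kernel lock type.

```c
#include <assert.h>

/* Hypothetical stand-in for the MDIO lock: it only counts acquisitions. */
struct lock_model {
        int held;
        int acquisitions;
};

static void model_lock(struct lock_model *l)
{
        assert(!l->held);
        l->held = 1;
        l->acquisitions++;
}

static void model_unlock(struct lock_model *l)
{
        assert(l->held);
        l->held = 0;
}

/* Accessor that takes the lock itself for each register access. */
static void reg_access_locked(struct lock_model *l)
{
        model_lock(l);
        /* register access would happen here */
        model_unlock(l);
}

/* "Hot" accessor: the caller must already hold the lock. */
static void reg_access_hot(const struct lock_model *l)
{
        assert(l->held);
        /* register access would happen here */
}

/* One poll cycle that locks around every access; returns acquisitions. */
static int poll_lock_per_access(int accesses)
{
        struct lock_model l = { 0, 0 };

        for (int i = 0; i < accesses; i++)
                reg_access_locked(&l);
        return l.acquisitions;
}

/* One poll cycle that holds the lock across all accesses. */
static int poll_lock_once(int accesses)
{
        struct lock_model l = { 0, 0 };

        model_lock(&l);
        for (int i = 0; i < accesses; i++)
                reg_access_hot(&l);
        model_unlock(&l);
        return l.acquisitions;
}
```

      For 64 register accesses in one cycle, poll_lock_per_access(64) takes the
      lock 64 times while poll_lock_once(64) takes it once. The assert in
      reg_access_hot() mirrors why an accessor running under the poll-level
      lock has to be the variant that does not lock by itself.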
      
      Fixes: fd5736bf ("enetc: Workaround for MDIO register access issue")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cfce9813
    • V
      net: enetc: don't overwrite the RSS indirection table when initializing · 88db41d4
      Vladimir Oltean committed
      stable inclusion
      from stable-5.10.24
      commit dfaf418dfff819aaa5e6a945bb8efd38d53b6eb9
      bugzilla: 51348
      
      --------------------------------
      
      commit c646d10d upstream.
      
      After the blamed patch, all RX traffic gets hashed to CPU 0 because the
      hashing indirection table set up in:
      
      enetc_pf_probe
      -> enetc_alloc_si_resources
         -> enetc_configure_si
            -> enetc_setup_default_rss_table
      
      is overwritten later in:
      
      enetc_pf_probe
      -> enetc_init_port_rss_memory
      
      which zero-initializes the entire port RSS table in order to avoid ECC errors.
      
      The trouble really is that enetc_init_port_rss_memory really needs
      enetc_alloc_si_resources to be called, because it depends upon
      enetc_alloc_cbdr and enetc_setup_cbdr. But that whole enetc_configure_si
      thing could have been better thought out, it has nothing to do in a
      function called "alloc_si_resources", especially since its counterpart,
      "free_si_resources", does nothing to unwind the configuration of the SI.
      
      The point is, we need to pull enetc_configure_si out of
      enetc_alloc_si_resources, and move it after enetc_init_port_rss_memory.
      This allows us to set up the default RSS indirection table after
      initializing the memory.
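
      The ordering problem can be reproduced in a few lines. The sketch below
      is an illustration under assumptions: the table size of 64 and the
      round-robin default distribution are hypothetical choices, and the
      function names only mirror the roles of the driver functions named above.

```c
#include <assert.h>
#include <string.h>

#define RSS_TABLE_SIZE 64 /* hypothetical size, for illustration only */

/* Spread table entries round-robin over the RX rings (assumed default). */
static void setup_default_rss_table(unsigned char *table, int num_rings)
{
        assert(num_rings > 0 && num_rings <= 256);
        for (int i = 0; i < RSS_TABLE_SIZE; i++)
                table[i] = (unsigned char)(i % num_rings);
}

/* Zero the whole table, as the memory initialization is described to do. */
static void init_rss_memory(unsigned char *table)
{
        memset(table, 0, RSS_TABLE_SIZE);
}

/* Count how many distinct rings the table can steer traffic to. */
static int rings_in_use(const unsigned char *table, int num_rings)
{
        int used = 0;

        for (int ring = 0; ring < num_rings; ring++) {
                for (int i = 0; i < RSS_TABLE_SIZE; i++) {
                        if (table[i] == ring) {
                                used++;
                                break;
                        }
                }
        }
        return used;
}

/* Run both steps in either order and report the rings still reachable. */
static int probe_model(int configure_after_init, int num_rings)
{
        unsigned char table[RSS_TABLE_SIZE];

        if (configure_after_init) {
                init_rss_memory(table);
                setup_default_rss_table(table, num_rings);
        } else {
                setup_default_rss_table(table, num_rings);
                init_rss_memory(table);
        }
        return rings_in_use(table, num_rings);
}
```

      With two rings, probe_model(0, 2) returns 1 because the zeroing step
      wipes the default table and every entry points at ring 0, while
      probe_model(1, 2) returns 2.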
      
      Fixes: 07bf34a5 ("net: enetc: initialize the RFS and RSS memories")
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      88db41d4
    • S
      sh_eth: fix TRSCER mask for SH771x · 2b4b2b80
      Sergey Shtylyov committed
      stable inclusion
      from stable-5.10.24
      commit 4ea379733555d652acadb05112a3365e5059f6f4
      bugzilla: 51348
      
      --------------------------------
      
      commit 8c91bc3d upstream.
      
      According to the SH7710, SH7712, SH7713 Group User's Manual: Hardware,
      Rev. 3.00, the TRSCER register actually has only bit 7 valid (and named
      differently), with all the other bits reserved. Apparently, this was not
      the case with some early revisions of the manual, as we have the other
      bits declared (and set) in the original driver. Follow suit and add
      the explicit sh_eth_cpu_data::trscer_err_mask initializer for SH771x...
      
      Fixes: 86a74ff2 ("net: sh_eth: add support for Renesas SuperH Ethernet")
      Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      2b4b2b80
    • D
      net: dsa: tag_rtl4_a: fix egress tags · 8de05457
      DENG Qingfang committed
      stable inclusion
      from stable-5.10.24
      commit 68277f69a8734a444a05dce9f78ce79c1225d08d
      bugzilla: 51348
      
      --------------------------------
      
      commit 9eb8bc59 upstream.
      
      Commit 86dd9868 has several issues, but was accepted too soon
      before anyone could take a look.
      
      - Double free. dsa_slave_xmit() will free the skb if the xmit function
        returns NULL, but the skb is already freed by eth_skb_pad(). Use
        __skb_put_padto() to avoid that.
      - Unnecessary allocation. It has been done by DSA core since commit
        a3b0b647.
      - A u16 pointer points to skb data. It should be __be16 for network
        byte order.
      - Typo in comments. "numer" -> "number".
      
      Fixes: 86dd9868 ("net: dsa: tag_rtl4_a: Support also egress tags")
      Signed-off-by: DENG Qingfang <dqfext@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8de05457
    • J
      docs: networking: drop special stable handling · 8157b471
      Jakub Kicinski committed
      stable inclusion
      from stable-5.10.24
      commit 389055e7b97048c7ecd6066cdac2c703bae493bc
      bugzilla: 51348
      
      --------------------------------
      
      commit dbbe7c96 upstream.
      
      Leave it to Greg.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8157b471
    • L
      Revert "mm, slub: consider rest of partial list if acquire_slab() fails" · b36ca11e
      Linus Torvalds committed
      stable inclusion
      from stable-5.10.24
      commit e1759160877a06082a9323dfb9437abfbe4af2d3
      bugzilla: 51348
      
      --------------------------------
      
      commit 9b1ea29b upstream.
      
      This reverts commit 8ff60eb0.
      
      The kernel test robot reports a huge performance regression due to the
      commit, and the reason seems fairly straightforward: when there is
      contention on the page list (which is what causes acquire_slab() to
      fail), we do _not_ want to just loop and try again, because that will
      transfer the contention to the 'n->list_lock' spinlock we hold, and
      just make things even worse.
      
      This is admittedly likely a problem only on big machines - the kernel
      test robot report comes from a 96-thread dual socket Intel Xeon Gold
      6252 setup, but the regression there really is quite noticeable:
      
         -47.9% regression of stress-ng.rawpkt.ops_per_sec
      
      and the commit that was marked as being fixed (7ced3719: "slub:
      Acquire_slab() avoid loop") actually did the loop exit early very
      intentionally (the hint being that "avoid loop" part of that commit
      message), exactly to avoid this issue.
      
      The correct thing to do may be to pick some kind of reasonable middle
      ground: instead of breaking out of the loop on the very first sign of
      contention, or trying over and over and over again, the right thing may
      be to re-try _once_, and then give up on the second failure (or pick
      your favorite value for "once"..).
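
      The "retry once, then give up" middle ground can be sketched as a
      bounded retry loop. This is a generic illustration, not SLUB code:
      try_acquire_fn, acquire_with_retries() and fail_first_n() are
      hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical acquire callback: returns true on success. */
typedef bool (*try_acquire_fn)(void *ctx);

/*
 * Try once, then retry at most max_retries more times. The number of
 * attempts actually made is stored in *attempts.
 */
static bool acquire_with_retries(try_acquire_fn try_acquire, void *ctx,
                                 int max_retries, int *attempts)
{
        assert(max_retries >= 0);
        *attempts = 0;
        for (int i = 0; i <= max_retries; i++) {
                (*attempts)++;
                if (try_acquire(ctx))
                        return true;
        }
        return false; /* give up instead of spinning under contention */
}

/* Test double: fails as many times as the counter in ctx says. */
static bool fail_first_n(void *ctx)
{
        int *remaining = ctx;

        if (*remaining > 0) {
                (*remaining)--;
                return false;
        }
        return true;
}
```

      With max_retries set to 1, a resource that fails once is acquired on the
      second attempt, and one that fails twice is abandoned after two attempts.
      With max_retries set to 0 the loop behaves like the early exit described
      above.
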
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/
      Cc: Jann Horn <jannh@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      b36ca11e
    • P
      cifs: return proper error code in statfs(2) · 9ae15b79
      Paulo Alcantara committed
      stable inclusion
      from stable-5.10.24
      commit 3d0bbd97eb6f32bcc1365252aa04a8984bab5007
      bugzilla: 51348
      
      --------------------------------
      
      commit 14302ee3 upstream.
      
      In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
      use its return value rather than always returning 0.  Instead, use rc
      variable as it is properly set to 0 in case there is no
      server->ops->queryfs.
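
      The pattern being fixed can be shown with a minimal userspace model. The
      names below (struct fs_ops_model, statfs_model() and the two callbacks)
      are hypothetical; only the control flow follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with an optional queryfs callback. */
struct fs_ops_model {
        int (*queryfs)(void);
};

/* Return the callback's result when it exists, 0 when it does not. */
static int statfs_model(const struct fs_ops_model *ops)
{
        int rc = 0;

        assert(ops != NULL);
        if (ops->queryfs)
                rc = ops->queryfs();

        return rc; /* propagate the error instead of a hard-coded 0 */
}

/* Example callbacks used to exercise both outcomes. */
static int queryfs_fails(void)
{
        return -EIO;
}

static int queryfs_ok(void)
{
        return 0;
}
```

      A table without a callback yields 0, one using queryfs_ok() yields 0, and
      one using queryfs_fails() yields -EIO rather than silently reporting
      success.
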
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      9ae15b79
    • C
      mount: fix mounting of detached mounts onto targets that reside on shared mounts · 3ef215f4
      Christian Brauner committed
      stable inclusion
      from stable-5.10.24
      commit 36e1efcdc54274d03e67ed6a9d5c1c2a2e77e947
      bugzilla: 51348
      
      --------------------------------
      
      commit ee2e3f50 upstream.
      
      Creating a series of detached mounts, attaching them to the filesystem,
      and unmounting them can be used to trigger an integer overflow in
      ns->mounts causing the kernel to block any new mounts in count_mounts()
      and returning ENOSPC because it falsely assumes that the maximum number
      of mounts in the mount namespace has been reached, i.e. it thinks it
      can't fit the new mounts into the mount namespace anymore.
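
      The ENOSPC symptom can be sketched with a simplified model of a limit
      check on an unsigned counter. Everything here is an assumption made for
      illustration: struct mnt_ns_model, count_mounts_model() and
      cycles_until_enospc() are hypothetical names, the check is a reduced
      version of what count_mounts() is described as doing, and the faulty
      accounting is modelled as the counter ending each attach/unmount cycle
      one lower than it should.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced view of the per-namespace mount accounting. */
struct mnt_ns_model {
        unsigned int mounts;
        unsigned int pending_mounts;
};

/* Refuse new mounts when the (possibly wrapped) total exceeds max. */
static int count_mounts_model(const struct mnt_ns_model *ns,
                              unsigned int new_mounts, unsigned int max)
{
        unsigned int sum;

        assert(ns != NULL);
        sum = ns->mounts + ns->pending_mounts;
        if (sum < ns->mounts || max < sum || new_mounts > max - sum)
                return -ENOSPC;
        return 0;
}

/*
 * Count how many faulty cycles pass before the check starts failing,
 * assuming each cycle leaves the counter one lower than it should be.
 */
static unsigned int cycles_until_enospc(unsigned int start, unsigned int max)
{
        struct mnt_ns_model ns = { start, 0 };
        unsigned int cycles = 0;

        while (count_mounts_model(&ns, 1, max) == 0) {
                ns.mounts--; /* wraps to UINT_MAX once it passes zero */
                cycles++;
        }
        return cycles;
}
```

      In this model, a namespace that starts with 50 mounts and a limit of
      100000 begins refusing mounts after 51 cycles: the counter reaches zero,
      wraps around to a huge value, and the limit check then reports that the
      namespace is full.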
      
      Depending on the number of mounts in your system, this can be reproduced
      on any kernel that supports open_tree() and move_mount() by compiling
      and running the following program:
      
        /* SPDX-License-Identifier: LGPL-2.1+ */
      
        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <getopt.h>
        #include <limits.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mount.h>
        #include <sys/stat.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        /* open_tree() */
        #ifndef OPEN_TREE_CLONE
        #define OPEN_TREE_CLONE 1
        #endif
      
        #ifndef OPEN_TREE_CLOEXEC
        #define OPEN_TREE_CLOEXEC O_CLOEXEC
        #endif
      
        #ifndef __NR_open_tree
                #if defined __alpha__
                        #define __NR_open_tree 538
                #elif defined _MIPS_SIM
                        #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                                #define __NR_open_tree 4428
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                                #define __NR_open_tree 6428
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                                #define __NR_open_tree 5428
                        #endif
                #elif defined __ia64__
                        #define __NR_open_tree (428 + 1024)
                #else
                        #define __NR_open_tree 428
                #endif
        #endif
      
        /* move_mount() */
        #ifndef MOVE_MOUNT_F_EMPTY_PATH
        #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
        #endif
      
        #ifndef __NR_move_mount
                #if defined __alpha__
                        #define __NR_move_mount 539
                #elif defined _MIPS_SIM
                        #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                                #define __NR_move_mount 4429
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                                #define __NR_move_mount 6429
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                                #define __NR_move_mount 5429
                        #endif
                #elif defined __ia64__
                        #define __NR_move_mount (429 + 1024)
                #else
                        #define __NR_move_mount 429
                #endif
        #endif
      
        static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
        {
                return syscall(__NR_open_tree, dfd, filename, flags);
        }
      
        static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
                                         const char *to_pathname, unsigned int flags)
        {
                return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
        }
      
        static bool is_shared_mountpoint(const char *path)
        {
                bool shared = false;
                FILE *f = NULL;
                char *line = NULL;
                int i;
                size_t len = 0;
      
                f = fopen("/proc/self/mountinfo", "re");
                if (!f)
                        return 0;
      
                while (getline(&line, &len, f) > 0) {
                        char *slider1, *slider2;
      
                        for (slider1 = line, i = 0; slider1 && i < 4; i++)
                                slider1 = strchr(slider1 + 1, ' ');
      
                        if (!slider1)
                                continue;
      
                        slider2 = strchr(slider1 + 1, ' ');
                        if (!slider2)
                                continue;
      
                        *slider2 = '\0';
                        if (strcmp(slider1 + 1, path) == 0) {
                                /* This is the path. Is it shared? */
                                slider1 = strchr(slider2 + 1, ' ');
                                if (slider1 && strstr(slider1, "shared:")) {
                                        shared = true;
                                        break;
                                }
                        }
                }
                fclose(f);
                free(line);
      
                return shared;
        }
      
        static void usage(void)
        {
                const char *text = "mount-new [--recursive] <base-dir>\n";
                fprintf(stderr, "%s", text);
                _exit(EXIT_SUCCESS);
        }
      
        #define exit_usage(format, ...)                              \
                ({                                                   \
                        fprintf(stderr, format "\n", ##__VA_ARGS__); \
                        usage();                                     \
                })
      
        #define exit_log(format, ...)                                \
                ({                                                   \
                        fprintf(stderr, format "\n", ##__VA_ARGS__); \
                        exit(EXIT_FAILURE);                          \
                })
      
        static const struct option longopts[] = {
                {"help",        no_argument,            0,      'a'},
                { NULL,         no_argument,            0,       0 },
        };
      
        int main(int argc, char *argv[])
        {
                int exit_code = EXIT_SUCCESS, index = 0;
                int dfd, fd_tree, new_argc, ret;
                char *base_dir;
                char *const *new_argv;
                char target[PATH_MAX];
      
                while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
                        switch (ret) {
                        case 'a':
                                /* fallthrough */
                        default:
                                usage();
                        }
                }
      
                new_argv = &argv[optind];
                new_argc = argc - optind;
                if (new_argc < 1)
                        exit_usage("Missing base directory\n");
                base_dir = new_argv[0];
      
                if (*base_dir != '/')
                        exit_log("Please specify an absolute path");
      
                /* Ensure that target is a shared mountpoint. */
                if (!is_shared_mountpoint(base_dir))
                        exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);
      
                dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
                if (dfd < 0)
                        exit_log("%m - Failed to open base directory \"%s\"", base_dir);
      
                ret = mkdirat(dfd, "detached-move-mount", 0755);
                if (ret < 0)
                        exit_log("%m - Failed to create required temporary directories");
      
                ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
                if (ret < 0 || (size_t)ret >= sizeof(target))
                        exit_log("%m - Failed to assemble target path");
      
                /*
                 * Having a mount table with 10000 mounts is already quite excessive
                 * and should account even for weird test systems.
                 */
                for (size_t i = 0; i < 10000; i++) {
                        fd_tree = sys_open_tree(dfd, "detached-move-mount",
                                                OPEN_TREE_CLONE |
                                                OPEN_TREE_CLOEXEC |
                                                AT_EMPTY_PATH);
                        if (fd_tree < 0) {
                                fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
      
                        ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
                        if (ret < 0) {
                                if (errno == ENOSPC)
                                        fprintf(stderr, "%m - Buggy mount counting");
                                else
                                        fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
                        close(fd_tree);
      
                        ret = umount2(target, MNT_DETACH);
                        if (ret < 0) {
                                fprintf(stderr, "%m - Failed to unmount %s", target);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
                }
      
                (void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
                close(dfd);
      
                exit(exit_code);
        }
      
      and wait for the kernel to refuse any new mounts by returning ENOSPC.
      How many iterations are needed depends on the number of mounts in your
      system. Assuming you have something like 50 mounts on a standard system
      it should be almost instantaneous.
      
      The root cause is that detached mounts aren't handled correctly when
      source and target mount are identical and reside on a shared mount.
      This produces a broken mount tree in which the detached source itself
      is propagated, something propagation prevents for regular bind-mounts
      and new mounts. This ultimately leads to a miscalculation of the number
      of mounts in the mount namespace.
      
      Detached mounts created via
      open_tree(fd, path, OPEN_TREE_CLONE)
      are essentially like an unattached new mount, or an unattached
      bind-mount. They can then later on be attached to the filesystem via
      move_mount() which calls into attach_recursive_mount(). Part of
      attaching it to the filesystem is making sure that mounts get correctly
      propagated in case the destination mountpoint is MS_SHARED, i.e. is a
      shared mountpoint. This is done by calling into propagate_mnt() which
      walks the list of peers calling propagate_one() on each mount in this
      list making sure it receives the propagation event.
      The propagate_one() function skips both new mounts and bind mounts so
      that they are not propagated "into themselves". Both are identified by
      checking whether the mount is already attached to any mount namespace in
      mnt->mnt_ns. This is what the IS_MNT_NEW() helper is responsible for.
      
      However, detached mounts have an anonymous mount namespace attached to
      them, stashed in mnt->mnt_ns, which means that IS_MNT_NEW() doesn't
      realize they need to be skipped. The mount therefore propagates "into
      itself", breaking the mount table and causing a disconnect between the
      number of mounts recorded as being beneath or reachable from the target
      mountpoint and the number of mounts actually recorded/counted in
      ns->mounts. This ultimately causes an overflow, which in turn prevents
      any new mounts via the ENOSPC issue.
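
      The difference between a namespace-pointer check and a detached-aware
      check can be sketched with a user-space toy model. This is not the
      kernel code: the struct and function names below are hypothetical
      stand-ins, and the only behaviour taken from the description is that a
      detached mount carries an anonymous namespace in its mnt_ns field.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_mnt_ns {
	bool anonymous;            /* true for the namespace of a detached tree */
};

struct toy_mount {
	struct toy_mnt_ns *mnt_ns; /* NULL while the mount is brand new */
};

/* Check that only looks at whether a namespace pointer is set. */
bool toy_is_new_by_pointer(const struct toy_mount *m)
{
	return m->mnt_ns == NULL;
}

/* Check that also treats mounts in an anonymous namespace as "new". */
bool toy_is_new_or_detached(const struct toy_mount *m)
{
	return m->mnt_ns == NULL || m->mnt_ns->anonymous;
}
```

      In this model a detached mount is missed by the pointer-only check but
      is skipped by the detached-aware one, while fully attached mounts are
      treated the same by both.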
      
      So teach propagation to handle detached mounts by making it aware of
      them. I've been tracking this issue down for the last couple of days and
      then verifying that the fix is correct by
      unmounting everything in my current mount table leaving only /proc and
      /sys mounted and running the reproducer above overnight verifying the
      number of mounts counted in ns->mounts. With this fix the counts are
      correct and the ENOSPC issue can't be reproduced.
      
      This change will only have an effect on mounts created with the new
      mount API since detached mounts cannot be created with the old mount API
      so regressions are extremely unlikely.
      
      Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com
      Fixes: 2db154b3 ("vfs: syscall: Add move_mount(2) to move mounts around")
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3ef215f4
    • C
      powerpc/603: Fix protection of user pages mapped with PROT_NONE · f0404980
      Christophe Leroy committed
      stable inclusion
      from stable-5.10.24
      commit aa1258d91455a75474d0541f746537c9bb0484c3
      bugzilla: 51348
      
      --------------------------------
      
      commit c119565a upstream.
      
      On book3s/32, page protection is defined by the PP bits in the PTE
      which provide the following protection depending on the access
      keys defined in the matching segment register:
      - PP 00 means RW with key 0 and N/A with key 1.
      - PP 01 means RW with key 0 and RO with key 1.
      - PP 10 means RW with both key 0 and key 1.
      - PP 11 means RO with both key 0 and key 1.
      
      Since the implementation of kernel userspace access protection,
      PP bits have been set as follows:
      - PP00 for pages without _PAGE_USER
      - PP01 for pages with _PAGE_USER and _PAGE_RW
      - PP11 for pages with _PAGE_USER and without _PAGE_RW
      
      For kernelspace segments, kernel accesses are performed with key 0
      and user accesses are performed with key 1. As PP00 is used for
      non _PAGE_USER pages, user can't access kernel pages not flagged
      _PAGE_USER while kernel can.
      
      For userspace segments, both kernel and user accesses are performed
      with key 0, therefore pages not flagged _PAGE_USER are still
      accessible to the user.
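
      The rules above can be modelled in a few lines of C to show why a page
      without _PAGE_USER stays reachable from a userspace segment. The enum
      and function names are illustrative only (they are not kernel symbols);
      the tables are copied from the PP and key descriptions above.

```c
#include <stdbool.h>

enum toy_access { TOY_NONE, TOY_RO, TOY_RW };

/* Protection granted by the PP bits for a given access key. */
enum toy_access toy_pp_access(unsigned int pp, unsigned int key)
{
	switch (pp & 3) {
	case 0:  return key ? TOY_NONE : TOY_RW; /* PP 00 */
	case 1:  return key ? TOY_RO : TOY_RW;   /* PP 01 */
	case 2:  return TOY_RW;                  /* PP 10 */
	default: return TOY_RO;                  /* PP 11 */
	}
}

/* PP bits chosen from the PTE flags, as listed above. */
unsigned int toy_pp_for_flags(bool page_user, bool page_rw)
{
	if (!page_user)
		return 0;           /* PP00 */
	return page_rw ? 1 : 3;     /* PP01 or PP11 */
}

/* Key used for an access: only user accesses to kernel segments get key 1. */
unsigned int toy_access_key(bool user_segment, bool user_mode)
{
	return (!user_segment && user_mode) ? 1 : 0;
}
```

      For a userspace page without _PAGE_USER, toy_pp_for_flags() yields
      PP00 and toy_access_key() yields key 0, so toy_pp_access() reports RW
      even though the page is meant to be inaccessible.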
      
      This shouldn't be an issue, because userspace is expected to be
      accessible to the user. But unlike most other architectures, powerpc
      implements PROT_NONE protection by removing _PAGE_USER flag instead of
      flagging the page as not valid. This means that pages in userspace
      that are not flagged _PAGE_USER shall remain inaccessible.
      
      To get the expected behaviour, just mimic other architectures in the
      TLB miss handler by checking _PAGE_USER permission on userspace
      accesses as if it was the _PAGE_PRESENT bit.
      
      Note that this problem only affects 603 cores. The 604+ have
      a hash table, and the hash_page() function already implements the
      verification of _PAGE_USER permission on userspace pages.
      
      Fixes: f342adca ("powerpc/32s: Prepare Kernel Userspace Access Protection")
      Cc: stable@vger.kernel.org # v5.2+
      Reported-by: Christoph Plattner <christoph.plattner@thalesgroup.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/4a0c6e3bb8f0c162457bf54d9bc6fd8d7b55129f.1612160907.git.christophe.leroy@csgroup.eu
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      f0404980
    • L
      mt76: dma: do not report truncated frames to mac80211 · de914b17
      Lorenzo Bianconi committed
      stable inclusion
      from stable-5.10.24
      commit e36d276dd4be6085b2f830dbb24e4746ec4a042b
      bugzilla: 51348
      
      --------------------------------
      
      commit d0bd52c5 upstream.
      
      Commit b102f0c5 ("mt76: fix array overflow on receiving too many
      fragments for a packet") fixes a possible OOB access but it introduces a
      memory leak since the pending frame is not released to page_frag_cache
      if the frag array of skb_shared_info is full. Commit 93a1d479
      ("mt76: dma: fix a possible memory leak in mt76_add_fragment()") fixes
      the issue but does not free the truncated skb that is forwarded to
      mac80211 layer. Fix the leftover issue by discarding truncated skbs as well.
      
      Fixes: 93a1d479 ("mt76: dma: fix a possible memory leak in mt76_add_fragment()")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Link: https://lore.kernel.org/r/a03166fcc8214644333c68674a781836e0f57576.1612697217.git.lorenzo@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      de914b17
    • J
      ibmvnic: always store valid MAC address · adf9ab95
      Jiri Wiesner committed
      stable inclusion
      from stable-5.10.24
      commit 1e343b2e7b9678f199df9693a3548e9a4ab98488
      bugzilla: 51348
      
      --------------------------------
      
      commit 67eb2114 upstream.
      
      The last change to ibmvnic_set_mac(), 8fc3672a, meant to prevent
      users from setting an invalid MAC address on an ibmvnic interface
      that has not been brought up yet. The change also prevented the
      requested MAC address from being stored by the adapter object for an
      ibmvnic interface when the state of the ibmvnic interface is
      VNIC_PROBED - that is after probing has finished but before the
      ibmvnic interface is brought up. The MAC address stored by the
      adapter object is used and sent to the hypervisor for checking when
      an ibmvnic interface is brought up.
      
      The ibmvnic driver ignoring the requested MAC address when in
      VNIC_PROBED state caused LACP bonds (bonds in 802.3ad mode) with more
      than one slave to malfunction. The bonding code must be able to
      change the MAC address of its slaves before they are brought up
      during enslaving. The inability of kernels with 8fc3672a to set
      the MAC addresses of bonding slaves is observable in the output of
      "ip address show". The MAC addresses of the slaves are the same as
      the MAC address of the bond on a working system whereas the slaves
      retain their original MAC addresses on a system with a malfunctioning
      LACP bond.
      
      Fixes: 8fc3672a ("ibmvnic: fix ibmvnic_set_mac")
      Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      adf9ab95
    • M
      ibmvnic: Fix possibly uninitialized old_num_tx_queues variable warning. · f963c2cf
      Michal Suchanek committed
      stable inclusion
      from stable-5.10.24
      commit 57ac75f8d241b3d13b77d223214be025f18df8a1
      bugzilla: 51348
      
      --------------------------------
      
      commit 6881b07f upstream.
      
      GCC 7.5 reports:
      ../drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_reset_init':
      ../drivers/net/ethernet/ibm/ibmvnic.c:5373:51: warning: 'old_num_tx_queues' may be used uninitialized in this function [-Wmaybe-uninitialized]
      ../drivers/net/ethernet/ibm/ibmvnic.c:5373:6: warning: 'old_num_rx_queues' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      The variables are initialized only if (reset) and used only if (reset &&
      something), so this is a false positive. However, there is no reason
      not to initialize the variables unconditionally, which avoids the warning.
      
      Fixes: 635e442f ("ibmvnic: merge ibmvnic_reset_init and ibmvnic_init")
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      f963c2cf
    • M
      libbpf: Clear map_info before each bpf_obj_get_info_by_fd · 5929b475
      Maciej Fijalkowski committed
      stable inclusion
      from stable-5.10.24
      commit 2f6f72ee9a98811f80b604f54b00dd3dd7fa75eb
      bugzilla: 51348
      
      --------------------------------
      
      commit 2b2aedab upstream.
      
      xsk_lookup_bpf_maps, based on prog_fd, checks whether the current prog
      has a reference to an XSKMAP. A BPF prog can include insns that work on
      various BPF maps, and this is covered by iterating through map_ids.

      The bpf_map_info that is passed to bpf_obj_get_info_by_fd for filling
      needs to be cleared at each iteration so that it doesn't contain any
      outdated fields. That clearing is currently missing in the function of
      interest.
      
      To fix that, zero-init map_info via memset before each
      bpf_obj_get_info_by_fd call.
      
      Also, since this area of the code is being touched and strcmp is
      generally considered harmful, convert it to strncmp and pass the size
      of the name array of the current map_info.
      
      While at it, do s/continue/break/ once we have found the xsks_map to
      terminate the search.
      
      Fixes: 5750902a ("libbpf: proper XSKMAP cleanup")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Link: https://lore.kernel.org/bpf/20210303185636.18070-4-maciej.fijalkowski@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      5929b475
    • M
      samples, bpf: Add missing munmap in xdpsock · 3a91b98c
      Maciej Fijalkowski committed
      stable inclusion
      from stable-5.10.24
      commit f126147970a11eb4a686d30bd0740de3de2cd6c8
      bugzilla: 51348
      
      --------------------------------
      
      commit 6bc66998 upstream.
      
      We mmap the umem region, but we never munmap it.
      Add the missing call at the end of the cleanup.
      
      Fixes: 3945b37a ("samples/bpf: use hugepages in xdpsock app")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Link: https://lore.kernel.org/bpf/20210303185636.18070-3-maciej.fijalkowski@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3a91b98c
    • Y
      selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier · c95dcb7b
      Yauheni Kaliuta committed
      stable inclusion
      from stable-5.10.24
      commit 4d2cdb2ded60a6aae748ac61ae3919a3b037f26c
      bugzilla: 51348
      
      --------------------------------
      
      commit 6185266c upstream.
      
      The verifier test labelled "valid read map access into a read-only array
      2" calls the bpf_csum_diff() helper and checks its return value. However,
      architecture implementations of csum_partial() (which is what the helper
      uses) differ in whether they fold the return value to 16 bit or not. For
      example, x86 version has ...
      
      	if (unlikely(odd)) {
      		result = from32to16(result);
      		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
      	}
      
      ... while generic lib/checksum.c does:
      
      	result = from32to16(result);
      	if (odd)
      		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
      
      This makes the helper return different values on different architectures,
      breaking the test on non-x86. To fix this, add an additional instruction
      to always mask the return value to 16 bits, and update the expected return
      value accordingly.
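
      The effect of the two tails quoted above can be reproduced in isolation.
      The sketch below is a user-space illustration, not the kernel sources:
      the function names finish_x86_style() and finish_generic_style() are
      made up here, and from32to16() is written as the usual two-step
      end-around-carry fold.

```c
#include <stdint.h>

/* Fold a 32-bit ones'-complement accumulator into 16 bits. */
uint16_t from32to16(uint32_t x)
{
	x = (x & 0xffff) + (x >> 16); /* add high half into low half */
	x = (x & 0xffff) + (x >> 16); /* add the possible carry back in */
	return (uint16_t)x;
}

/* Swap the two bytes of a 16-bit value held in a 32-bit variable. */
static uint32_t swap_bytes16(uint32_t r)
{
	return ((r >> 8) & 0xff) | ((r & 0xff) << 8);
}

/* Tail in the style of the first snippet: fold only for odd buffers. */
uint32_t finish_x86_style(uint32_t result, int odd)
{
	if (odd) {
		result = from32to16(result);
		result = swap_bytes16(result);
	}
	return result;
}

/* Tail in the style of the second snippet: always fold, swap if odd. */
uint32_t finish_generic_style(uint32_t result, int odd)
{
	result = from32to16(result);
	if (odd)
		result = swap_bytes16(result);
	return result;
}
```

      For an even buffer and an accumulator such as 0x0001fffe, the first
      tail returns the raw 32-bit value while the second returns the folded
      16-bit value, so a test that compares the raw return value sees
      different numbers even though both denote the same checksum once folded.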
      
      Fixes: fb2abb73 ("bpf, selftest: test {rd, wr}only flags and direct value access")
      Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210228103017.320240-1-yauheni.kaliuta@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      c95dcb7b
    • H
      selftests/bpf: No need to drop the packet when there is no geneve opt · 15c659c2
      Hangbin Liu committed
      stable inclusion
      from stable-5.10.24
      commit 4fa0ece2e0eb3740c6bfbf4f8121068248bb4295
      bugzilla: 51348
      
      --------------------------------
      
      commit 557c223b upstream.
      
      In the bpf geneve tunnel test we set a geneve option on the tx side. On
      the rx side we only call bpf_skb_get_tunnel_opt(). Since commit 9c2e14b4
      ("ip_tunnels: Set tunnel option flag when tunnel metadata is present")
      geneve_rx() will not add the TUNNEL_GENEVE_OPT flag if there is no geneve
      option, which causes bpf_skb_get_tunnel_opt() to return ENOENT and
      _geneve_get_tunnel() in test_tunnel_kern.c to drop the packet.

      As it should be valid for bpf_skb_get_tunnel_opt() to return an error
      when there is no tunnel option, there is no need to drop the packet and
      break all geneve rx traffic. Just set opt_class to 0 in this test and
      keep returning TC_ACT_OK.
      
      Fixes: 933a741e ("selftests/bpf: bpf tunnel test.")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: William Tu <u9012063@gmail.com>
      Link: https://lore.kernel.org/bpf/20210224081403.1425474-1-liuhangbin@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      15c659c2
    • I
      selftests/bpf: Use the last page in test_snprintf_btf on s390 · 67f8c3c8
      Ilya Leoshkevich committed
      stable inclusion
      from stable-5.10.24
      commit 7653656be252abd7d2d3f16152188623de5be4f8
      bugzilla: 51348
      
      --------------------------------
      
      commit 42a382a4 upstream.
      
      test_snprintf_btf fails on s390, because NULL points to a readable
      struct lowcore there. Fix by using the last page instead.
      
      Error message example:
      
          printing fffffffffffff000 should generate error, got (361)
      
      Fixes: 076a95f5 ("selftests/bpf: Add bpf_snprintf_btf helper tests")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210227051726.121256-1-iii@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      67f8c3c8
    • G
      net: phy: fix save wrong speed and duplex problem if autoneg is on · 541e3645
      Guangbin Huang committed
      stable inclusion
      from stable-5.10.24
      commit 6aa23829949c2c0912e82866aeab4fd591595235
      bugzilla: 51348
      
      --------------------------------
      
      commit d9032dba upstream.
      
      If the phy uses the generic driver and autoneg is on, entering the command
      "ethtool -s eth0 speed 50" does not actually change the phy speed, but
      the command "ethtool eth0" shows the speed as 50Mb/s because
      phydev->speed has been set to 50 and is not updated later.

      The duplex setting has the same problem.

      However, if autoneg is on, the phy only changes speed and duplex according
      to phydev->advertising, not phydev->speed and phydev->duplex. So in this
      case, phydev->speed and phydev->duplex don't need to be set in
      phy_ethtool_ksettings_set().
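
      The intended behaviour can be sketched with a toy model. The struct and
      function below are hypothetical and much simpler than the real phylib
      code; they only show the rule stated above, namely that a requested
      speed and duplex are stored only when autoneg is off.

```c
/* Hypothetical, simplified view of the cached PHY settings. */
struct toy_phy {
	int autoneg; /* 1 = autonegotiation enabled */
	int speed;   /* last known link speed in Mb/s */
	int duplex;  /* 0 = half, 1 = full */
};

/*
 * Apply a settings request. With autoneg enabled the requested speed and
 * duplex are ignored, so the cached values keep reflecting the real link.
 */
void toy_ksettings_set(struct toy_phy *phy, int autoneg, int speed, int duplex)
{
	phy->autoneg = autoneg;
	if (!autoneg) {
		phy->speed = speed;
		phy->duplex = duplex;
	}
}
```

      With this rule, a request for speed 50 while autoneg is on leaves the
      cached speed untouched, so a later query does not report a value the
      link never used.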
      
      Fixes: 51e2a384 ("PHY: Avoid unnecessary aneg restarts")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      541e3645
    • J
      net: always use icmp{,v6}_ndo_send from ndo_start_xmit · 0165d324
      Jason A. Donenfeld committed
      stable inclusion
      from stable-5.10.24
      commit 91796b65563bd3fd0efe4fb56d6ee1c5c6006eb0
      bugzilla: 51348
      
      --------------------------------
      
      commit 4372339e upstream.
      
      There were a few remaining tunnel drivers that didn't receive the prior
      conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
      memory corruption (see ee576c47 ("net: icmp: pass zeroed opts from
      icmp{,v6}_ndo_send before sending") for details), it is even more
      imperative to have these all converted. So this commit goes through the
      remaining cases that I could find and does a boring translation to the
      ndo variety.
      
      The Fixes: line below is the merge that originally added
      icmp{,v6}_ndo_send and converted the first batch of icmp{,v6}_send
      users. The rationale then for the change applies equally to this patch.
      It's just that these drivers were left out of the initial conversion
      because these network devices are hiding in net/ rather than in
      drivers/net/.
      
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Fixes: 803381f9 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0165d324
    • V
      netfilter: x_tables: gpf inside xt_find_revision() · 034017c8
      Vasily Averin committed
      stable inclusion
      from stable-5.10.24
      commit 8abbf7e53e179b16dc48c40cecc6c86240ca026c
      bugzilla: 51348
      
      --------------------------------
      
      commit 8e24eddd upstream.
      
      nested target/match_revfn() calls work with xt[NFPROTO_UNSPEC] lists
      without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload
      and cause host to crash:
      
      general protection fault: 0000 [#1]
      Modules linked in: ... [last unloaded: xt_cluster]
      CPU: 0 PID: 542455 Comm: iptables
      RIP: 0010:[<ffffffff8ffbd518>]  [<ffffffff8ffbd518>] strcmp+0x18/0x40
      RDX: 0000000000000003 RSI: ffff9a5a5d9abe10 RDI: dead000000000111
      R13: ffff9a5a5d9abe10 R14: ffff9a5a5d9abd8c R15: dead000000000100
      (VvS: %R15 -- &xt_match,  %RDI -- &xt_match.name,
      xt_cluster unregister match in xt[NFPROTO_UNSPEC].match list)
      Call Trace:
       [<ffffffff902ccf44>] match_revfn+0x54/0xc0
       [<ffffffff902ccf9f>] match_revfn+0xaf/0xc0
       [<ffffffff902cd01e>] xt_find_revision+0x6e/0xf0
       [<ffffffffc05a5be0>] do_ipt_get_ctl+0x100/0x420 [ip_tables]
       [<ffffffff902cc6bf>] nf_getsockopt+0x4f/0x70
       [<ffffffff902dd99e>] ip_getsockopt+0xde/0x100
       [<ffffffff903039b5>] raw_getsockopt+0x25/0x50
       [<ffffffff9026c5da>] sock_common_getsockopt+0x1a/0x20
       [<ffffffff9026b89d>] SyS_getsockopt+0x7d/0xf0
       [<ffffffff903cbf92>] system_call_fastpath+0x25/0x2a
      
      Fixes: 656caff2 ("netfilter 04/09: x_tables: fix match/target revision lookup")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      034017c8
    • F
      netfilter: nf_nat: undo erroneous tcp edemux lookup · 6c980f9a
      Florian Westphal committed
      stable inclusion
      from stable-5.10.24
      commit 42402bd84530d3761b97775c10762fde28d5b2f9
      bugzilla: 51348
      
      --------------------------------
      
      commit 03a3ca37 upstream.
      
      Under extremely rare conditions TCP early demux will retrieve the wrong
      socket.
      
      1. local machine establishes a connection to a remote server, S, on port
         p.
      
         This gives:
         laddr:lport -> S:p
         ... both in tcp and conntrack.
      
      2. local machine establishes a connection to host H, on port p2.
         2a. TCP stack chooses same laddr:lport, so we have
         laddr:lport -> H:p2 from TCP point of view.
         2b. There is a destination NAT rewrite in place, translating
              H:p2 to S:p.  This results in following conntrack entries:
      
         I)  laddr:lport -> S:p  (origin)  S:p -> laddr:lport (reply)
         II) laddr:lport -> H:p2 (origin)  S:p -> laddr:lport2 (reply)
      
         NAT engine has rewritten laddr:lport to laddr:lport2 to map
         the reply packet to the correct origin.
      
         When server sends SYN/ACK to laddr:lport2, the PREROUTING hook
         will undo the SNAT transformation, rewriting IP header to
         S:p -> laddr:lport
      
         This causes TCP early demux to associate the skb with the TCP socket
         of the first connection.
      
         The INPUT hook will then reverse the DNAT transformation, rewriting
         the IP header to H:p2 -> laddr:lport.
      
      Because packet ends up with the wrong socket, the new connection
      never completes: originator stays in SYN_SENT and conntrack entry
      remains in SYN_RECV until timeout, and responder retransmits SYN/ACK
      until it gives up.
      
      To resolve this, orphan the skb after the input rewrite:
      Because the source IP address changed, the socket must be incorrect.
      We can't move the DNAT undo to prerouting due to backwards
      compatibility: doing so would make iptables/nftables rules no longer
      match the way they did.

      After orphan, the packet will be handed to the next protocol layer
      (tcp, udp, ...) and that will repeat the socket lookup just as if
      early demux was disabled.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      6c980f9a
    • E
      tcp: add sanity tests to TCP_QUEUE_SEQ · bc4f1468
      Eric Dumazet committed
      stable inclusion
      from stable-5.10.24
      commit 046f3c1c2ff450fb7ae53650e9a95e0074a61f3e
      bugzilla: 51348
      
      --------------------------------
      
      commit 8811f4a9 upstream.
      
      Qingyu Li reported a syzkaller bug where the repro
      changes RCV SEQ _after_ restoring data in the receive queue.
      
      mprotect(0x4aa000, 12288, PROT_READ)    = 0
      mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
      mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
      mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
      sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
      setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
      setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
      recvfrom(3, NULL, 20, 0, NULL, NULL)    = -1 ECONNRESET (Connection reset by peer)
      
      syslog shows:
      [  111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
      [  111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
      
      This should not be allowed. TCP_QUEUE_SEQ should only be used
      when queues are empty.
      
      This patch fixes this case, and the tx path as well.
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
      Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      bc4f1468
    • A
      tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE) · fe30893d
      Arjun Roy committed
      stable inclusion
      from stable-5.10.24
      commit e95ebe1ed6abc259b897abc1f92622504750747c
      bugzilla: 51348
      
      --------------------------------
      
      commit 2107d45f upstream.
      
      getsockopt(TCP_ZEROCOPY_RECEIVE) has a bug where we read a
      user-provided "len" field of type signed int, and then compare the
      value to the result of an "offsetofend" operation, which is unsigned.
      
      Negative values provided by the user will be promoted to large
      positive numbers; thus checking that len < offsetofend() will return
      false when the intention was that it return true.
      
      Note that while len is originally checked for negative values earlier
      on in do_tcp_getsockopt(), subsequent calls to get_user() re-read the
      value from userspace which may have changed in the meantime.
      
      Therefore, re-add the check for negative values after the call to
      get_user in the handler code for TCP_ZEROCOPY_RECEIVE.
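
      The comparison problem is easy to reproduce outside the kernel. The
      sketch below is an illustration under assumptions: struct toy_zc and the
      TOY_OFFSETOFEND macro are hypothetical stand-ins, not the real uapi
      struct or kernel macro, and the cast in the first function spells out
      the conversion the compiler performs implicitly when an int is compared
      with an unsigned size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument struct, used only to get a realistic size. */
struct toy_zc {
	uint64_t address;
	uint32_t length;
	uint32_t recv_skip_hint;
};

/* Offset of the first byte after MEMBER; the result has type size_t. */
#define TOY_OFFSETOFEND(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Signed len compared against an unsigned size: negatives slip through. */
int toy_len_too_short_buggy(int len)
{
	/* The cast makes the implicit int -> size_t conversion visible. */
	return (size_t)len < TOY_OFFSETOFEND(struct toy_zc, length);
}

/* Same check with an explicit test for negative values first. */
int toy_len_too_short_fixed(int len)
{
	if (len < 0)
		return 1;
	return (size_t)len < TOY_OFFSETOFEND(struct toy_zc, length);
}
```

      With len set to -1 the first function reports that the length is large
      enough, because -1 becomes a very large unsigned value, while the second
      rejects it. For non-negative lengths both functions agree.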
      
      Fixes: c8856c05 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Link: https://lore.kernel.org/r/20210225232628.4033281-1-arjunroy.kdev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      fe30893d