1. 02 Oct, 2019 20 commits
  2. 01 Oct, 2019 14 commits
    • J
      mac80211: keep BHs disabled while calling drv_tx_wake_queue() · d8dec42b
      Johannes Berg committed
      Drivers typically expect this, since it holds in almost all cases
      where this is called (i.e. from the TX path). Also, the code in mac80211
      itself (if the driver calls ieee80211_tx_dequeue()) expects this as it
      uses this_cpu_ptr() without additional protection.
      
      This should fix various reports of the problem:
      https://bugzilla.kernel.org/show_bug.cgi?id=204127
      https://lore.kernel.org/linux-wireless/CAN5HydrWb3o_FE6A1XDnP1E+xS66d5kiEuhHfiGKkLNQokx13Q@mail.gmail.com/
      https://lore.kernel.org/lkml/nycvar.YFH.7.76.1909111238470.473@cbobk.fhfr.pm/
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
      Reported-by: Aaron Hill <aa1ronham@gmail.com>
      Reported-by: Lukas Redlinger <rel+kernel@agilox.net>
      Reported-by: Oleksii Shevchuk <alxchk@gmail.com>
      Fixes: 21a5d4c3 ("mac80211: add stop/start logic for software TXQs")
      Link: https://lore.kernel.org/r/1569928763-I3e8838c5ecad878e59d4a94eb069a90f6641461a@changeid
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      d8dec42b
    • M
      mac80211: fix txq null pointer dereference · 8ed31a26
      Miaoqing Pan committed
      If the interface type is P2P_DEVICE or NAN, reading the file
      '/sys/kernel/debug/ieee80211/phyx/netdev:wlanx/aqm' triggers a
      NULL pointer dereference, because for those interface types the
      pointer sdata->vif.txq is NULL.
      
      Unable to handle kernel NULL pointer dereference at virtual address 00000011
      CPU: 1 PID: 30936 Comm: cat Not tainted 4.14.104 #1
      task: ffffffc0337e4880 task.stack: ffffff800cd20000
      PC is at ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211]
      LR is at ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211]
      [...]
      Process cat (pid: 30936, stack limit = 0xffffff800cd20000)
      [...]
      [<ffffff8000b7cd00>] ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211]
      [<ffffff8000b7c414>] ieee80211_if_read+0x60/0xbc [mac80211]
      [<ffffff8000b7ccc4>] ieee80211_if_read_aqm+0x28/0x30 [mac80211]
      [<ffffff80082eff94>] full_proxy_read+0x2c/0x48
      [<ffffff80081eef00>] __vfs_read+0x2c/0xd4
      [<ffffff80081ef084>] vfs_read+0x8c/0x108
      [<ffffff80081ef494>] SyS_read+0x40/0x7c
      Signed-off-by: Miaoqing Pan <miaoqing@codeaurora.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/1569549796-8223-1-git-send-email-miaoqing@codeaurora.org
      [trim useless data from commit message]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      8ed31a26
    • M
      nl80211: fix null pointer dereference · b501426c
      Miaoqing Pan committed
      If the interface is not in MESH mode, the command 'iw wlanx mpath del'
      causes a kernel panic.
      
      The root cause is a NULL pointer access in mpp_flush_by_proxy(), as the
      pointer 'sdata->u.mesh.mpp_paths' is NULL for non-MESH interfaces.
      
      Unable to handle kernel NULL pointer dereference at virtual address 00000068
      [...]
      PC is at _raw_spin_lock_bh+0x20/0x5c
      LR is at mesh_path_del+0x1c/0x17c [mac80211]
      [...]
      Process iw (pid: 4537, stack limit = 0xd83e0238)
      [...]
      [<c021211c>] (_raw_spin_lock_bh) from [<bf8c7648>] (mesh_path_del+0x1c/0x17c [mac80211])
      [<bf8c7648>] (mesh_path_del [mac80211]) from [<bf6cdb7c>] (extack_doit+0x20/0x68 [compat])
      [<bf6cdb7c>] (extack_doit [compat]) from [<c05c309c>] (genl_rcv_msg+0x274/0x30c)
      [<c05c309c>] (genl_rcv_msg) from [<c05c25d8>] (netlink_rcv_skb+0x58/0xac)
      [<c05c25d8>] (netlink_rcv_skb) from [<c05c2e14>] (genl_rcv+0x20/0x34)
      [<c05c2e14>] (genl_rcv) from [<c05c1f90>] (netlink_unicast+0x11c/0x204)
      [<c05c1f90>] (netlink_unicast) from [<c05c2420>] (netlink_sendmsg+0x30c/0x370)
      [<c05c2420>] (netlink_sendmsg) from [<c05886d0>] (sock_sendmsg+0x70/0x84)
      [<c05886d0>] (sock_sendmsg) from [<c0589f4c>] (___sys_sendmsg.part.3+0x188/0x228)
      [<c0589f4c>] (___sys_sendmsg.part.3) from [<c058add4>] (__sys_sendmsg+0x4c/0x70)
      [<c058add4>] (__sys_sendmsg) from [<c0208c80>] (ret_fast_syscall+0x0/0x44)
      Code: e2822c02 e2822001 e5832004 f590f000 (e1902f9f)
      ---[ end trace bbd717600f8f884d ]---
      Signed-off-by: Miaoqing Pan <miaoqing@codeaurora.org>
      Link: https://lore.kernel.org/r/1569485810-761-1-git-send-email-miaoqing@codeaurora.org
      [trim useless data from commit message]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      b501426c
    • J
      cfg80211: initialize on-stack chandefs · f43e5210
      Johannes Berg committed
      In a few places we don't properly initialize on-stack chandefs,
      resulting in non-zero EDMG data, which broke things.
      
      Additionally, in a few places we rely on the driver to init the
      data completely, but perhaps we shouldn't, as non-EDMG drivers
      may not initialize the EDMG data. Initialize it there as well.
      
      Cc: stable@vger.kernel.org
      Fixes: 2a38075c ("nl80211: Add support for EDMG channels")
      Reported-by: Dmitry Osipenko <digetx@gmail.com>
      Tested-by: Dmitry Osipenko <digetx@gmail.com>
      Link: https://lore.kernel.org/r/1569239475-I2dcce394ecf873376c386a78f31c2ec8b538fa25@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      f43e5210
    • J
      cfg80211: validate SSID/MBSSID element ordering assumption · 242b0931
      Johannes Berg committed
      The code copying the data assumes that the SSID element is
      before the MBSSID element, but since the data is untrusted
      from the AP, this cannot be guaranteed.
      
      Validate that this is indeed the case and ignore the MBSSID
      otherwise, to avoid having to deal with both cases for the
      copy of data that should be between them.
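
The ordering check described above can be illustrated with a minimal Python sketch. This is not the kernel code: it assumes elements are encoded as (ID, length, payload) with element ID 0 for SSID and 71 for Multiple BSSID, and the function names are hypothetical.

```python
WLAN_EID_SSID = 0
WLAN_EID_MULTIPLE_BSSID = 71


def find_element(ies: bytes, eid: int):
    """Return the offset of the first well-formed element with ID `eid`, or None."""
    pos = 0
    while pos + 2 <= len(ies):
        length = ies[pos + 1]
        if pos + 2 + length > len(ies):
            return None  # truncated element: stop parsing
        if ies[pos] == eid:
            return pos
        pos += 2 + length
    return None


def mbssid_usable(ies: bytes) -> bool:
    """Accept the MBSSID element only if an SSID element precedes it."""
    ssid = find_element(ies, WLAN_EID_SSID)
    mbssid = find_element(ies, WLAN_EID_MULTIPLE_BSSID)
    if ssid is None or mbssid is None:
        return False
    return ssid < mbssid


# SSID first, then MBSSID: accepted.
good = bytes([0, 3]) + b"lab" + bytes([71, 1, 2])
# MBSSID first: the MBSSID element is ignored.
bad = bytes([71, 1, 2]) + bytes([0, 3]) + b"lab"
```

Rejecting the unexpected order keeps the copy logic simple, since only one layout of the data between the two elements has to be handled.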
      
      Cc: stable@vger.kernel.org
      Fixes: 0b8fb823 ("cfg80211: Parsing of Multiple BSSID information in scanning")
      Link: https://lore.kernel.org/r/1569009255-I1673911f5eae02964e21bdc11b2bf58e5e207e59@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      242b0931
    • J
      nl80211: validate beacon head · f88eb7c0
      Johannes Berg committed
      We currently don't validate the beacon head, i.e. the header,
      fixed part and elements that are to go in front of the TIM
      element. This means that the variable elements there can be
      malformed, e.g. have a length exceeding the buffer size, but
      most downstream code from this assumes that this has already
      been checked.
      
      Add the necessary checks to the netlink policy.
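
As a rough illustration of the kind of check involved, the Python sketch below walks the variable elements of a beacon head and rejects any element whose length runs past the buffer. It is a simplified model, not the netlink policy code: the 36-byte fixed part (24-byte management header plus 12 bytes of fixed beacon fields) and the function name are assumptions made for the example.

```python
# Assumed layout: 24-byte management header + timestamp (8),
# beacon interval (2) and capability (2).
BEACON_FIXED_LEN = 24 + 12


def beacon_head_valid(head: bytes) -> bool:
    """Return True if every element after the fixed part fits in the buffer."""
    if len(head) < BEACON_FIXED_LEN:
        return False
    pos = BEACON_FIXED_LEN
    while pos < len(head):
        if pos + 2 > len(head):
            return False  # no room for the ID and length bytes
        length = head[pos + 1]
        if pos + 2 + length > len(head):
            return False  # element claims more data than the buffer holds
        pos += 2 + length
    return True


fixed = bytes(BEACON_FIXED_LEN)
well_formed = fixed + bytes([0, 3]) + b"lab"
malformed = fixed + bytes([0, 200]) + b"lab"  # length 200, only 3 bytes present
```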
      
      Cc: stable@vger.kernel.org
      Fixes: ed1b6cc7 ("cfg80211/nl80211: add beacon settings")
      Link: https://lore.kernel.org/r/1569009255-I7ac7fbe9436e9d8733439eab8acbbd35e55c74ef@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      f88eb7c0
    • V
      net: sched: taprio: Fix potential integer overflow in taprio_set_picos_per_byte · 68ce6688
      Vladimir Oltean committed
      The speed divisor is used in a context expecting an s64, but it is
      evaluated using 32-bit arithmetic.
      
      To avoid that happening, instead of multiplying by 1,000,000 in the
      first place, simplify the fraction and do a standard 32 bit division
      instead.
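
The effect can be sketched in Python by simulating 32-bit wraparound explicitly. The expressions below are an assumption about the shape of the computation (picoseconds per byte for a link speed in Mbps), not a copy of the kernel code, and all names are hypothetical.

```python
def wrap_s32(value: int) -> int:
    """Reduce an integer to the range of a signed 32-bit C int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


PSEC_PER_SEC = 10**12


def picos_per_byte_overflowing(speed_mbps: int) -> int:
    # Divisor computed in 32-bit arithmetic: speed * 1,000,000 wraps
    # once the product exceeds the signed 32-bit range.
    divisor = wrap_s32(speed_mbps * 1000 * 1000)
    return (PSEC_PER_SEC * 8) // divisor


def picos_per_byte_simplified(speed_mbps: int) -> int:
    # 8e12 / (speed * 1e6) simplifies to 8e6 / speed, so every
    # intermediate value fits comfortably in 32 bits.
    return (1_000_000 * 8) // speed_mbps


print(picos_per_byte_simplified(100))     # 100 Mbps
print(picos_per_byte_simplified(10000))   # 10 Gbps
print(picos_per_byte_overflowing(10000))  # wrong: divisor wrapped
```

At 100 Mbps both versions agree; at 10 Gbps the product 10000 * 1,000,000 no longer fits in 32 bits, so only the simplified form gives the expected 800 ps per byte.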
      
      Fixes: f04b514c ("taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte")
      Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68ce6688
    • N
      net: dsa: sja1105: Prevent leaking memory · 68501df9
      Navid Emamdoost committed
      In sja1105_static_config_upload, in two cases memory is leaked: when
      static_config_buf_prepare_for_upload fails and when sja1105_inhibit_tx
      fails. In both cases config_buf should be released.
      
      Fixes: 8aa9ebcc ("net: dsa: Introduce driver for NXP SJA1105 5-port L2 switch")
      Fixes: 1a4c6940 ("net: dsa: sja1105: Prevent PHY jabbering during switch reset")
      Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68501df9
    • V
      net: dsa: sja1105: Ensure PTP time for rxtstamp reconstruction is not in the past · b6f2494d
      Vladimir Oltean committed
      Sometimes the PTP synchronization on the switch 'jumps':
      
        ptp4l[11241.155]: rms    8 max   16 freq -21732 +/-  11 delay   742 +/-   0
        ptp4l[11243.157]: rms    7 max   17 freq -21731 +/-  10 delay   744 +/-   0
        ptp4l[11245.160]: rms 33592410 max 134217731 freq +192422 +/- 8530253 delay   743 +/-   0
        ptp4l[11247.163]: rms 811631 max 964131 freq +10326 +/- 557785 delay   743 +/-   0
        ptp4l[11249.166]: rms 261936 max 533876 freq -304323 +/- 126371 delay   744 +/-   0
        ptp4l[11251.169]: rms 48700 max 57740 freq -20218 +/- 30532 delay   744 +/-   0
        ptp4l[11253.171]: rms 14570 max 30163 freq  -5568 +/- 7563 delay   742 +/-   0
        ptp4l[11255.174]: rms 2914 max 3440 freq -22001 +/- 1667 delay   744 +/-   1
        ptp4l[11257.177]: rms  811 max 1710 freq -22653 +/- 451 delay   744 +/-   1
        ptp4l[11259.180]: rms  177 max  218 freq -21695 +/-  89 delay   741 +/-   0
        ptp4l[11261.182]: rms   45 max   92 freq -21677 +/-  32 delay   742 +/-   0
        ptp4l[11263.186]: rms   14 max   32 freq -21733 +/-  11 delay   742 +/-   0
        ptp4l[11265.188]: rms    9 max   14 freq -21725 +/-  12 delay   742 +/-   0
        ptp4l[11267.191]: rms    9 max   16 freq -21727 +/-  13 delay   742 +/-   0
        ptp4l[11269.194]: rms    6 max   15 freq -21726 +/-   9 delay   743 +/-   0
        ptp4l[11271.197]: rms    8 max   15 freq -21728 +/-  11 delay   743 +/-   0
        ptp4l[11273.200]: rms    6 max   12 freq -21727 +/-   8 delay   743 +/-   0
        ptp4l[11275.202]: rms    9 max   17 freq -21720 +/-  11 delay   742 +/-   0
        ptp4l[11277.205]: rms    9 max   18 freq -21725 +/-  12 delay   742 +/-   0
      
      Background: the switch only offers partial RX timestamps (24 bits) and
      it is up to the driver to read the PTP clock to fill those timestamps up
      to 64 bits. But the PTP clock readout needs to happen quickly enough (in
      0.135 seconds, in fact), otherwise the PTP clock will wrap around 24
      bits, a condition which cannot be detected.
      
      Looking at the 'max 134217731' value on output line 3, one can see that
      in hex it is 0x8000003. Because the PTP clock resolution is 8 ns,
      that means 0x1000000 in ticks, which is exactly 2^24. So indeed this is
      a PTP clock wraparound, but the reason might be surprising.
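
The arithmetic in the two paragraphs above can be checked with a few lines of Python. The constant names are chosen for this example only; the numbers come from the commit message.

```python
TS_BITS = 24   # width of the partial RX timestamp
TICK_NS = 8    # PTP clock resolution in nanoseconds

wrap_ticks = 1 << TS_BITS          # ticks until the 24-bit value wraps
wrap_ns = wrap_ticks * TICK_NS     # same interval in nanoseconds
observed_max_ns = 134217731        # 'max' value from ptp4l output line 3

print(hex(wrap_ticks))             # 0x1000000
print(wrap_ns / 1e9)               # wrap interval in seconds, about 0.134
print(hex(observed_max_ns))        # 0x8000003
print(observed_max_ns - wrap_ns)   # 3
```

The observed offset is therefore one full 24-bit wrap (134217728 ns) plus 3 ns of ordinary jitter, and the wrap interval is where the 0.135 s deadline comes from.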
      
      What is going on is that sja1105_tstamp_reconstruct(priv, now, ts)
      expects a "now" time that is later than the "ts" was snapshotted at.
      This, of course, is obvious: we read the PTP time _after_ the partial RX
      timestamp was received. However, the workqueue is processing frames from
      a skb queue and reuses the same PTP time, read once at the beginning.
      Normally the skb queue only contains one frame and all goes well. But
      when the skb queue contains two frames, the second frame that gets
      dequeued might have been partially timestamped by the RX MAC _after_ we
      had read our PTP time initially.
      
      The code was originally like that due to concerns that SPI access for
      PTP time readout is a slow process, and we are time-constrained anyway
      (aka: premature optimization). But some timing analysis reveals that the
      time spent until the RX timestamp is completely reconstructed is 1 order
      of magnitude lower than the 0.135 s deadline even under worst-case
      conditions. So we can afford to read the PTP time for each frame in the
      RX timestamping queue, which of course ensures that the full PTP time is
      in the partial timestamp's future.
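
A small Python model shows why a stale "now" produces an error of exactly one wrap. The reconstruction logic here is an assumption based on the description above (take the upper bits from "now", and subtract one wrap if the low bits of "now" are not ahead of the partial timestamp); it is a sketch, not the driver's sja1105_tstamp_reconstruct().

```python
TS_BITS = 24
MASK = (1 << TS_BITS) - 1


def tstamp_reconstruct(now: int, ts_partial: int) -> int:
    """Extend a 24-bit partial timestamp using a clock reading taken after it."""
    full = (now & ~MASK) | ts_partial
    if (now & MASK) <= ts_partial:
        # Assumes the low bits wrapped between the timestamp and the readout.
        full -= MASK + 1
    return full


true_ts = 0x5000200            # full time at which the frame was stamped
partial = true_ts & MASK       # what the switch actually reports

late_now = 0x5000300           # clock read after the frame was stamped
stale_now = 0x5000100          # clock read before the frame was stamped

print(hex(tstamp_reconstruct(late_now, partial)))   # 0x5000200
print(hex(tstamp_reconstruct(stale_now, partial)))  # 0x4000200
```

With the stale reading, the result lands 2^24 ticks in the past, which is the jump seen in the ptp4l log. Reading the clock once per frame guarantees the "late" case.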
      
      Fixes: f3097be2 ("net: dsa: sja1105: Add a state machine for RX timestamping")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6f2494d
    • D
      Merge tag 'ieee802154-for-davem-2019-09-28' of... · 3755ee22
      David S. Miller committed
      Merge tag 'ieee802154-for-davem-2019-09-28' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
      
      Stefan Schmidt says:
      
      ====================
      pull-request: ieee802154 for net 2019-09-28
      
      An update from ieee802154 for your *net* tree.
      
      Three driver fixes. Navid Emamdoost fixed a memory leak on an error
      path in the ca8210 driver, Johan Hovold fixed a use-after-free found
      by syzbot in the atusb driver and Christophe JAILLET makes sure
      __skb_put_data is used instead of memcpy in the mcr20a driver.
      
      I switched from branches to tags here to be pulled from. So far not
      annotated and not signed. Once I have fixed my scripts they should
      contain these messages as annotations. If you want them signed as well
      just tell me. If there are any problems let me know.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3755ee22
    • M
      net: Unpublish sk from sk_reuseport_cb before call_rcu · 8c7138b3
      Martin KaFai Lau committed
      The "reuse->sock[]" array is shared by multiple sockets.  The going away
      sk must unpublish itself from "reuse->sock[]" before making call_rcu()
      call.  However, this unpublish-action is currently done after a grace
      period and it may cause use-after-free.
      
      The fix is to move reuseport_detach_sock() to sk_destruct().
      Due to the above reason, any socket with sk_reuseport_cb has
      to go through the rcu grace period before freeing it.
      
      It is a rather old bug (~3 yrs).  The Fixes tag is not necessarily
      the right commit but it is the one that introduced the SOCK_RCU_FREE
      logic and this fix depends on it.
      
      Fixes: a4298e45 ("net: add SOCK_RCU_FREE socket flag")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c7138b3
    • H
      erspan: remove the incorrect mtu limit for erspan · 0e141f75
      Haishuang Yan committed
      The erspan driver calls ether_setup(). After commit 61e84623
      ("net: centralize net_device min/max MTU checking"), the range
      of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
      
      This prevents the dev mtu of the erspan device from exceeding
      1500, a limit that is not correct for an ipgre tap device.
      
      Tested:
      Before patch:
      # ip link set erspan0 mtu 1600
      Error: mtu greater than device maximum.
      After patch:
      # ip link set erspan0 mtu 1600
      # ip -d link show erspan0
      21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN
      mode DEFAULT group default qlen 1000
          link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e141f75
    • E
      sch_cbq: validate TCA_CBQ_WRROPT to avoid crash · e9789c7c
      Eric Dumazet committed
      syzbot reported a crash in cbq_normalize_quanta() caused
      by an out of range cl->priority.
      
      iproute2 enforces this check, but malicious users do not.
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN PTI
      Modules linked in:
      CPU: 1 PID: 26447 Comm: syz-executor.1 Not tainted 5.3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:cbq_normalize_quanta.part.0+0x1fd/0x430 net/sched/sch_cbq.c:902
      RSP: 0018:ffff8801a5c333b0 EFLAGS: 00010206
      RAX: 0000000020000003 RBX: 00000000fffffff8 RCX: ffffc9000712f000
      RDX: 00000000000043bf RSI: ffffffff83be8962 RDI: 0000000100000018
      RBP: ffff8801a5c33420 R08: 000000000000003a R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000002ef
      R13: ffff88018da95188 R14: dffffc0000000000 R15: 0000000000000015
      FS:  00007f37d26b1700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004c7cec CR3: 00000001bcd0a006 CR4: 00000000001626f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       [<ffffffff83be9d57>] cbq_normalize_quanta include/net/pkt_sched.h:27 [inline]
       [<ffffffff83be9d57>] cbq_addprio net/sched/sch_cbq.c:1097 [inline]
       [<ffffffff83be9d57>] cbq_set_wrr+0x2d7/0x450 net/sched/sch_cbq.c:1115
       [<ffffffff83bee8a7>] cbq_change_class+0x987/0x225b net/sched/sch_cbq.c:1537
       [<ffffffff83b96985>] tc_ctl_tclass+0x555/0xcd0 net/sched/sch_api.c:2329
       [<ffffffff83a84655>] rtnetlink_rcv_msg+0x485/0xc10 net/core/rtnetlink.c:5248
       [<ffffffff83cadf0a>] netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2510
       [<ffffffff83a7db6d>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5266
       [<ffffffff83cac2c6>] netlink_unicast_kernel net/netlink/af_netlink.c:1324 [inline]
       [<ffffffff83cac2c6>] netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1350
       [<ffffffff83cacd4a>] netlink_sendmsg+0x89a/0xd50 net/netlink/af_netlink.c:1939
       [<ffffffff8399d46e>] sock_sendmsg_nosec net/socket.c:673 [inline]
       [<ffffffff8399d46e>] sock_sendmsg+0x12e/0x170 net/socket.c:684
       [<ffffffff8399f1fd>] ___sys_sendmsg+0x81d/0x960 net/socket.c:2359
       [<ffffffff839a2d05>] __sys_sendmsg+0x105/0x1d0 net/socket.c:2397
       [<ffffffff839a2df9>] SYSC_sendmsg net/socket.c:2406 [inline]
       [<ffffffff839a2df9>] SyS_sendmsg+0x29/0x30 net/socket.c:2404
       [<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305
       [<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9789c7c
    • M
      net: dsa: qca8k: Use up to 7 ports for all operations · 7ae6d93c
      Michal Vokáč committed
      The QCA8K family supports up to 7 ports. So use the existing
      QCA8K_NUM_PORTS define to allocate the switch structure and limit all
      operations with the switch ports.
      
      This was not an issue until commit 0394a63a ("net: dsa: enable and
      disable all ports") disabled all unused ports. Since the unused ports 7-11
      are outside of the correct register range on this switch some registers
      were rewritten with invalid content.
      
      Fixes: 6b93fb46 ("net-next: dsa: add new driver for qca8xxx family")
      Fixes: a0c02161 ("net: dsa: variable number of ports")
      Fixes: 0394a63a ("net: dsa: enable and disable all ports")
      Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ae6d93c
  3. 29 Sep, 2019 6 commits
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 02dc96ef
      Linus Torvalds committed
      Pull networking fixes from David Miller:
      
       1) Sanity check URB networking device parameters to avoid divide by
          zero, from Oliver Neukum.
      
       2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
          don't work properly. Longer term this needs a better fix tho. From
          Vijay Khemka.
      
       3) Small fixes to selftests (use ping when ping6 is not present, etc.)
          from David Ahern.
      
       4) Bring back rt_uses_gateway member of struct rtable, its semantics
          were not well understood and trying to remove it broke things. From
          David Ahern.
      
       5) Move usbnet sanity checking, ignore endpoints with invalid
          wMaxPacketSize. From Bjørn Mork.
      
       6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.
      
       7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
          Alex Vesker, and Yevgeny Kliteynik
      
       8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.
      
       9) Fix crash when removing sch_cbs entry while offloading is enabled,
          from Vinicius Costa Gomes.
      
      10) Signedness bug fixes, generally in looking at the result given by
          of_get_phy_mode() and friends. From Dan Carpenter.
      
      11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.
      
      12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.
      
      13) Fix quantization code in tcp_bbr, from Kevin Yang.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
        net: tap: clean up an indentation issue
        nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
        tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
        sk_buff: drop all skb extensions on free and skb scrubbing
        tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
        mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
        Documentation: Clarify trap's description
        mlxsw: spectrum: Clear VLAN filters during port initialization
        net: ena: clean up indentation issue
        NFC: st95hf: clean up indentation issue
        net: phy: micrel: add Asym Pause workaround for KSZ9021
        net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
        ptp: correctly disable flags on old ioctls
        lib: dimlib: fix help text typos
        net: dsa: microchip: Always set regmap stride to 1
        nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
        nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
        net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
        vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
        net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
        ...
      02dc96ef
    • L
      Merge branch 'hugepage-fallbacks' (hugepage patches from David Rientjes) · edf445ad
      Linus Torvalds committed
      Merge hugepage allocation updates from David Rientjes:
       "We (mostly Linus, Andrea, and myself) have been discussing offlist how
        to implement a sane default allocation strategy for hugepages on NUMA
        platforms.
      
        With these reverts in place, the page allocator will happily allocate
        a remote hugepage immediately rather than try to make a local hugepage
        available. This incurs a substantial performance degradation when
        memory compaction would have otherwise made a local hugepage
        available.
      
        This series reverts those reverts and attempts to propose a more sane
        default allocation strategy specifically for hugepages. Andrea
        acknowledges this is likely to fix the swap storms that he originally
        reported that resulted in the patches that removed __GFP_THISNODE from
        hugepage allocations.
      
        The immediate goal is to return 5.3 to the behavior the kernel has
        implemented over the past several years so that remote hugepages are
        not immediately allocated when local hugepages could have been made
        available because the increased access latency is untenable.
      
        The next goal is to introduce a sane default allocation strategy for
        hugepages allocations in general regardless of the configuration of
        the system so that we prevent thrashing of local memory when
        compaction is unlikely to succeed and can prefer remote hugepages over
        remote native pages when the local node is low on memory."
      
      Note on timing: this reverts the hugepage VM behavior changes that got
      introduced fairly late in the 5.3 cycle, and that fixed a huge
      performance regression for certain loads that had been around since
      4.18.
      
      Andrea had this note:
      
       "The regression of 4.18 was that it was taking hours to start a VM
        where 3.10 was only taking a few seconds, I reported all the details
        on lkml when it was finally tracked down in August 2018.
      
           https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/
      
        __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
        workload degrade like in the "current upstream" above. And it still
        would have been that bad as above until 5.3-rc5"
      
      where the bad behavior ends up happening as you fill up a local node,
      and without that change, you'd get into the nasty swap storm behavior
      due to compaction working overtime to make room for more memory on the
      nodes.
      
      As a result 5.3 got the two performance fix reverts in rc5.
      
      However, David Rientjes then noted that those performance fixes in turn
      regressed performance for other loads - although not quite to the same
      degree.  He suggested reverting the reverts and instead replacing them
      with two small changes to how hugepage allocations are done (patch
      descriptions rephrased by me):
      
       - "avoid expensive reclaim when compaction may not succeed": just admit
         that the allocation failed when you're trying to allocate a huge-page
         and compaction wasn't successful.
      
       - "allow hugepage fallback to remote nodes when madvised": when that
         node-local huge-page allocation failed, retry without forcing the
         local node.
      
      but by then I judged it too late to replace the fixes for a 5.3 release.
      So 5.3 was released with behavior that harked back to the pre-4.18 logic.
      
      But now we're in the merge window for 5.4, and we can see if this
      alternate model fixes not just the horrendous swap storm behavior, but
      also restores the performance regression that the late reverts caused.
      
      Fingers crossed.
      
      * emailed patches from David Rientjes <rientjes@google.com>:
        mm, page_alloc: allow hugepage fallback to remote nodes when madvised
        mm, page_alloc: avoid expensive reclaim when compaction may not succeed
        Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
        Revert "Revert "mm, thp: restore node-local hugepage allocations""
      edf445ad
    • D
      mm, page_alloc: allow hugepage fallback to remote nodes when madvised · 76e654cc
      David Rientjes committed
      For systems configured to always try hard to allocate transparent
      hugepages (thp defrag setting of "always") or for memory that has been
      explicitly madvised to MADV_HUGEPAGE, it is often better to fall back to
      remote memory to allocate the hugepage if the local allocation fails
      first.
      
      The point is to allow the initial call to __alloc_pages_node() to attempt
      to defragment local memory to make a hugepage available, if possible,
      rather than immediately falling back to remote memory.  Local hugepages will
      always have a better access latency than remote (huge)pages, so an attempt
      to make a hugepage available locally is always preferred.
      
      If memory compaction cannot be successful locally, however, it is likely
      better to fall back to remote memory.  This could take on two forms: either
      allow immediate fallback to remote memory or do per-zone watermark checks.
      It would be possible to fall back only when per-zone watermarks fail for
      order-0 memory, since that would require local reclaim for all subsequent
      faults so remote huge allocation is likely better than thrashing the local
      zone for large workloads.
      
      In this case, it is assumed that because the system is configured to try
      hard to allocate hugepages or the vma is advised to explicitly want to try
      hard for hugepages that remote allocation is better when local allocation
      and memory compaction have both failed.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76e654cc
    • D
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed · b39d0ee2
      David Rientjes committed
      Memory compaction has a couple of significant drawbacks as the allocation
      order increases, specifically:
      
       - isolate_freepages() is responsible for finding free pages to use as
         migration targets and is implemented as a linear scan of memory
         starting at the end of a zone,
      
       - failing order-0 watermark checks in memory compaction does not account
         for how far below the watermarks the zone actually is: to enable
         migration, there must be *some* free memory available.  Per the above,
         watermarks are not always sufficient if isolate_freepages() cannot
         find the free memory but it could require hundreds of MBs of reclaim to
         even reach this threshold (read: potentially very expensive reclaim with
         no indication compaction can be successful), and
      
       - if compaction at this order has failed recently so that it does not even
         run as a result of deferred compaction, looping through reclaim can often
         be pointless.
      
      For hugepage allocations, these are quite substantial drawbacks because
      these are very high order allocations (order-9 on x86) and falling back to
      doing reclaim can potentially be *very* expensive without any indication
      that compaction would even be successful.
      
      Reclaim itself is unlikely to free entire pageblocks and certainly no
      reliance should be put on it to do so in isolation (recall lumpy reclaim).
      This means we should avoid reclaim and simply fail hugepage allocation if
      compaction is deferred.
      
      It is also not helpful to thrash a zone by doing excessive reclaim if
      compaction may not be able to access that memory.  If order-0 watermarks
      fail and the allocation order is sufficiently large, it is likely better
      to fail the allocation rather than thrashing the zone.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b39d0ee2
    • D
      Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      David Rientjes committed
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19deb769
    • D
      Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      David Rientjes committed
      This reverts commit a8282608.
      
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
      
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
      
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
         and
      
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
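
      The "enabled" and "defrag" settings named in the list are exposed as
      sysfs files whose content marks the active choice with brackets, for
      example "always [madvise] never".  A minimal sketch of reading that
      choice follows; the helper names selected_mode() and
      read_selected_mode() are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) token of a line such as
 * "always [madvise] never" into out.  Returns 0 on success, or -1 if
 * there is no bracketed token or it does not fit into out.
 */
static int selected_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the first line of a sysfs file and extract its selected token. */
static int read_selected_mode(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int rc = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof line, f))
        rc = selected_mode(line, out, outlen);
    fclose(f);
    return rc;
}
```

      A caller could pass /sys/kernel/mm/transparent_hugepage/enabled or
      /sys/kernel/mm/transparent_hugepage/defrag as the path; the call
      returns -1 on systems where the file is absent.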
      
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
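
      For reference, this is roughly how userspace issues the advice.  It is a
      sketch rather than code from any particular workload: the helper names
      map_anon() and set_hugepage_advice() are hypothetical, and the feature
      macro is an assumption about building against glibc.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* assumed glibc: exposes MADV_HUGEPAGE, MAP_ANONYMOUS */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous, private, read-write region; returns NULL on failure. */
static void *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Request hugepages for a range (MADV_HUGEPAGE) or opt the range out
 * (MADV_NOHUGEPAGE).  Returns 0 on success or a negative errno value,
 * for instance -EINVAL when the kernel was built without THP support.
 */
static int set_hugepage_advice(void *addr, size_t len, int want_hugepages)
{
    int advice = want_hugepages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE;

    if (madvise(addr, len, advice) != 0)
        return -errno;
    return 0;
}
```

      Calling set_hugepage_advice(p, len, 0) and then
      set_hugepage_advice(p, len, 1) on the same range exercises the third
      purpose from the list: the second call undoes the earlier
      MADV_NOHUGEPAGE.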
      
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      intersocket.
      
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      back.
      
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
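
      The ordering argued for here, local defragmentation first and remote
      memory only afterwards, can be modelled with a toy example.  The struct,
      enum and function below are hypothetical and do not correspond to
      kernel data structures; they only make the order of the attempts
      explicit.

```c
#include <stdbool.h>

/* Hypothetical model of one NUMA node, for illustration only. */
struct node_state {
    bool has_free_hugepage;      /* a hugepage is available right now */
    bool compaction_can_succeed; /* defragmenting would produce one */
};

enum alloc_source {
    SRC_LOCAL,                  /* local hugepage, no extra work needed */
    SRC_LOCAL_AFTER_COMPACTION, /* local hugepage after defragmentation */
    SRC_REMOTE,                 /* hugepage taken from the remote node */
    SRC_NONE                    /* no hugepage, caller must fall back */
};

/*
 * Pick where a hugepage comes from when local compaction is attempted
 * before any remote allocation is considered.
 */
static enum alloc_source pick_hugepage_source(const struct node_state *local,
                                              const struct node_state *remote)
{
    if (local->has_free_hugepage)
        return SRC_LOCAL;
    if (local->compaction_can_succeed)
        return SRC_LOCAL_AFTER_COMPACTION;
    if (remote->has_free_hugepage)
        return SRC_REMOTE;
    return SRC_NONE;
}
```

      With a fragmented but compactable local node and a remote node that has
      a free hugepage, this model still picks the local node, which is the
      behavior the message argues for.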
      
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac79f78d