1. 16 Jun, 2019 7 commits
  2. 15 Jun, 2019 21 commits
    • D
      Merge branch 'tcp-add-three-static-keys' · 35fc07ae
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: add three static keys
      
      Recent addition of per TCP socket rx/tx cache brought
      regressions for some workloads, as reported by Feng Tang.
      
      It seems better to make them opt-in, before we adopt better
      heuristics.
      
      The last patch adds high_order_alloc_disable sysctl
      to ask TCP sendmsg() to exclusively use order-0 allocations,
      as mm layer has specific optimizations.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35fc07ae
    • E
      net: add high_order_alloc_disable sysctl/static key · ce27ec60
      Eric Dumazet committed
      Since linux-3.7 (commit 5640f768 "net: use a per task frag
      allocator"), TCP sendmsg() has preferred order-3 allocations.
      
      While it gives good results in most cases, we had reports
      that heavy use of TCP over loopback was hitting spinlock
      contention in page allocation/freeing.
      
      This commit adds a sysctl so that admins can opt in
      to order-0 allocations. Hopefully the mm layer will optimize
      order-3 allocations in the future, since they could give us
      a nice boost (see the first 8 lines of the following benchmark).
      
      The following benchmark shows a win when more than 8 TCP_STREAM
      threads are running (56-core x86 server in my tests):
      
      for thr in {1..30}
      do
       sysctl -wq net.core.high_order_alloc_disable=0
       T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
       sysctl -wq net.core.high_order_alloc_disable=1
       T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
       echo $thr:$T0:$T1
      done
      
      1: 49979: 37267
      2: 98745: 76286
      3: 141088: 110051
      4: 177414: 144772
      5: 197587: 173563
      6: 215377: 208448
      7: 241061: 234087
      8: 267155: 263373
      9: 295069: 297402
      10: 312393: 335213
      11: 340462: 368778
      12: 371366: 403954
      13: 412344: 443713
      14: 426617: 473580
      15: 474418: 507861
      16: 503261: 538539
      17: 522331: 563096
      18: 532409: 567084
      19: 550824: 605240
      20: 525493: 641988
      21: 564574: 665843
      22: 567349: 690868
      23: 583846: 710917
      24: 588715: 736306
      25: 603212: 763494
      26: 604083: 792654
      27: 602241: 796450
      28: 604291: 797993
      29: 611610: 833249
      30: 577356: 841062
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce27ec60
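
The benchmark table in the commit above can be read more easily as relative gains. The following is a minimal sketch in Python that uses a handful of rows copied from that table; the helper names `relative_gain` and `first_win` are hypothetical and exist only for this illustration.

```python
# A few rows copied from the benchmark table in the commit message:
# (threads, result with high_order_alloc_disable=0,
#           result with high_order_alloc_disable=1)
RESULTS = [
    (1, 49979, 37267),
    (8, 267155, 263373),
    (9, 295069, 297402),
    (16, 503261, 538539),
    (30, 577356, 841062),
]


def relative_gain(t0, t1):
    """Gain of the order-0 run (t1) over the order-3 run (t0), as a fraction."""
    return (t1 - t0) / t0


def first_win(results):
    """Lowest thread count in `results` where order-0 beats order-3, or None."""
    for threads, t0, t1 in results:
        if t1 > t0:
            return threads
    return None


if __name__ == "__main__":
    for threads, t0, t1 in RESULTS:
        print(f"{threads:2d} threads: {relative_gain(t0, t1):+.1%}")
    print("first win at", first_win(RESULTS), "threads")
```

On these rows the crossover falls at 9 threads, which matches the "more than 8" statement in the commit message, and the 30-thread row works out to a gain of roughly 46%.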
    • E
      tcp: add tcp_tx_skb_cache sysctl · 0b7d7f6b
      Eric Dumazet committed
      Feng Tang reported a performance regression after introduction
      of per TCP socket tx/rx caches, for TCP over loopback (netperf)
      
      There is a high chance the regression is caused by a change in
      how well the 32 KB per-thread page (current->task_frag) can
      be recycled, and by the lack of pcp caches for order-3 pages.
      
      I could not reproduce the regression myself, as all CPUs were
      spinning on the mm spinlocks for page allocs/freeing, regardless
      of enabling or disabling the per TCP socket caches.
      
      It seems best to disable the feature by default, and let
      admins enable it.
      
      The MM layer either needs to provide scalable order-3 page
      allocations, or could attempt a trylock on zone->lock if
      the caller only attempts to get a high-order page and is
      able to fall back to order-0 ones in case of pressure.
      
      Tests were run on a 56-core host (112 hyper-threads):
      
      -	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
         - 35.49% queued_spin_lock_slowpath
      	  - 18.18% get_page_from_freelist
      		 - __alloc_pages_nodemask
      			- 18.18% alloc_pages_current
      				 skb_page_frag_refill
      				 sk_page_frag_refill
      				 tcp_sendmsg_locked
      				 tcp_sendmsg
      				 inet_sendmsg
      				 sock_sendmsg
      				 __sys_sendto
      				 __x64_sys_sendto
      				 do_syscall_64
      				 entry_SYSCALL_64_after_hwframe
      				 __libc_send
      	  + 17.31% __free_pages_ok
      +	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
      +	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
      +	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
      +	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
      +	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
      	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave
      
      Fixes: 472c2e07 ("tcp: add one skb cache for tx")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b7d7f6b
    • E
      tcp: add tcp_rx_skb_cache sysctl · ede61ca4
      Eric Dumazet committed
      Instead of relying on rps_needed, it is safer to use a separate
      static key, since we do not want to enable TCP rx_skb_cache
      by default. This feature can cause a huge increase in memory
      usage on hosts with millions of sockets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ede61ca4
    • E
      sysctl: define proc_do_static_key() · a8e11e5c
      Eric Dumazet committed
      Convert proc_dointvec_minmax_bpf_stats() into a more generic
      helper, since we are going to use jump labels more often.
      
      Note that sysctl_bpf_stats_enabled is removed, since
      it is no longer needed/used.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8e11e5c
    • H
      hv_netvsc: Set probe mode to sync · 9a33629b
      Haiyang Zhang committed
      For better consistency of synthetic NIC names, we set the probe mode to
      PROBE_FORCE_SYNCHRONOUS. So the names can be aligned with the vmbus
      channel offer sequence.
      
      Fixes: af0a5646 ("use the new async probing feature for the hyperv drivers")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a33629b
    • V
      net: sched: flower: don't call synchronize_rcu() on mask creation · 99815f50
      Vlad Buslov committed
      The current flower mask creation code assumes that the temporary mask used
      when inserting a new filter is stack allocated. To prevent a race condition
      with the data path, synchronize_rcu() is called every time fl_create_new_mask()
      replaces the temporary stack-allocated mask. As reported by Jiri, this
      increases the runtime of creating 20000 flower classifiers from 4 seconds to
      163 seconds. However, this design is no longer necessary since the temporary
      mask was converted to be dynamically allocated by commit 2cddd201
      ("net/sched: cls_flower: allocate mask dynamically in fl_change()").
      
      Remove the synchronize_rcu() calls from the mask creation code. Instead, refactor
      fl_change() to always deallocate the temporary mask after an RCU grace period.
      
      Fixes: 195c234d ("net: sched: flower: handle concurrent mask insertion")
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Tested-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99815f50
    • A
      net: dsa: fix warning same module names · f0c03ee0
      Anders Roxell committed
      When building with CONFIG_NET_DSA_REALTEK_SMI and CONFIG_REALTEK_PHY
      enabled as loadable modules, we see the following warning:
      
      warning: same module names found:
        drivers/net/phy/realtek.ko
        drivers/net/dsa/realtek.ko
      
      Rework so the driver name is realtek-smi instead of realtek.
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0c03ee0
    • N
      sctp: Free cookie before we memdup a new one · ce950f10
      Neil Horman committed
      Based on comments from Xin, even after fixes for our recent syzbot
      report of cookie memory leaks, it is possible to get a resend of an INIT
      chunk which would lead to us leaking cookie memory.
      
      To ensure that we don't leak cookie memory, free any previously
      allocated cookie first.
      
      Change notes
      v1->v2
      update subsystem tag in subject (davem)
      repeat kfree check for peer_random and peer_hmacs (xin)
      
      v2->v3
      net->sctp
      also free peer_chunks
      
      v3->v4
      fix subject tags
      
      v4->v5
      remove cut line
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      CC: Xin Long <lucien.xin@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce950f10
    • R
      net: dsa: microchip: Don't try to read stats for unused ports · 6bb9e376
      Robert Hancock committed
      If some of the switch ports were not listed in the device tree, due to
      being unused, the ksz_mib_read_work function ended up accessing a NULL
      dp->slave pointer and causing an oops. Skip checking statistics for any
      unused ports.
      
      Fixes: 7c6ff470 ("net: dsa: microchip: add MIB counter reading support")
      Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
      Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bb9e376
    • D
      Merge branch 'qmi_wwan-fix-QMAP-handling' · 2309f517
      David S. Miller committed
      Reinhard Speyerer says:
      
      ====================
      qmi_wwan: fix QMAP handling
      
      This series addresses the following issues observed when using the
      QMAP support of the qmi_wwan driver:
      
      1. The QMAP code in the qmi_wwan driver is based on the CodeAurora
         GobiNet driver ([1], [2]) which does not process QMAP padding
         in the RX path correctly. This causes qmimux_rx_fixup() to pass
         incorrect data to the IP stack when padding is used.
      
      2. qmimux devices currently lack proper network device usage statistics.
      
      3. RCU stalls on device disconnect with QMAP activated like this
      
         # echo Y > /sys/class/net/wwan0/qmi/raw_ip
         # echo 1 > /sys/class/net/wwan0/qmi/add_mux
         # echo 2 > /sys/class/net/wwan0/qmi/add_mux
         # echo 3 > /sys/class/net/wwan0/qmi/add_mux
      
         have been observed in certain setups:
      
         [ 2273.676593] option1 ttyUSB16: GSM modem (1-port) converter now disconnected from ttyUSB16
         [ 2273.676617] option 6-1.2:1.0: device disconnected
         [ 2273.676774] WARNING: CPU: 1 PID: 141 at kernel/rcu/tree_plugin.h:342 rcu_note_context_switch+0x2a/0x3d0
         [ 2273.676776] Modules linked in: option qmi_wwan cdc_mbim cdc_ncm qcserial cdc_wdm usb_wwan sierra sierra_net usbnet mii edd coretemp iptable_mangle ip6_tables iptable_filter ip_tables cdc_acm dm_mod dax iTCO_wdt evdev iTCO_vendor_support sg ftdi_sio usbserial e1000e ptp pps_core i2c_i801 ehci_pci button lpc_ich i2c_core mfd_core uhci_hcd ehci_hcd rtc_cmos usbcore usb_common sd_mod fan ata_piix thermal
         [ 2273.676817] CPU: 1 PID: 141 Comm: kworker/1:1 Not tainted 4.19.38-rsp-1 #1
         [ 2273.676819] Hardware name: Not Applicable   Not Applicable  /CX-GS/GM45-GL40             , BIOS V1.11      03/23/2011
         [ 2273.676828] Workqueue: usb_hub_wq hub_event [usbcore]
         [ 2273.676832] EIP: rcu_note_context_switch+0x2a/0x3d0
         [ 2273.676834] Code: 55 89 e5 57 56 89 c6 53 83 ec 14 89 45 f0 e8 5d ff ff ff 89 f0 64 8b 3d 24 a6 86 c0 84 c0 8b 87 04 02 00 00 75 7a 85 c0 7e 7a <0f> 0b 80 bf 08 02 00 00 00 0f 84 87 00 00 00 e8 b2 e2 ff ff bb dc
         [ 2273.676836] EAX: 00000001 EBX: f614bc00 ECX: 00000001 EDX: c0715b81
         [ 2273.676838] ESI: 00000000 EDI: f18beb40 EBP: f1a3dc20 ESP: f1a3dc00
         [ 2273.676840] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010002
         [ 2273.676842] CR0: 80050033 CR2: b7e97230 CR3: 2f9c4000 CR4: 000406b0
         [ 2273.676843] Call Trace:
         [ 2273.676847]  ? preempt_count_add+0xa5/0xc0
         [ 2273.676852]  __schedule+0x4e/0x4f0
         [ 2273.676855]  ? __queue_work+0xf1/0x2a0
         [ 2273.676858]  ? _raw_spin_lock_irqsave+0x14/0x40
         [ 2273.676860]  ? preempt_count_add+0x52/0xc0
         [ 2273.676862]  schedule+0x33/0x80
         [ 2273.676865]  _synchronize_rcu_expedited+0x24e/0x280
         [ 2273.676867]  ? rcu_accelerate_cbs_unlocked+0x70/0x70
         [ 2273.676871]  ? wait_woken+0x70/0x70
         [ 2273.676873]  ? rcu_accelerate_cbs_unlocked+0x70/0x70
         [ 2273.676875]  ? _synchronize_rcu_expedited+0x280/0x280
         [ 2273.676877]  synchronize_rcu_expedited+0x22/0x30
         [ 2273.676881]  synchronize_net+0x25/0x30
         [ 2273.676885]  dev_deactivate_many+0x133/0x230
         [ 2273.676887]  ? preempt_count_add+0xa5/0xc0
         [ 2273.676890]  __dev_close_many+0x4d/0xc0
         [ 2273.676892]  ? skb_dequeue+0x40/0x50
         [ 2273.676895]  dev_close_many+0x5d/0xd0
         [ 2273.676898]  rollback_registered_many+0xbf/0x4c0
         [ 2273.676901]  ? raw_notifier_call_chain+0x1a/0x20
         [ 2273.676904]  ? call_netdevice_notifiers_info+0x23/0x60
         [ 2273.676906]  ? netdev_master_upper_dev_get+0xe/0x70
         [ 2273.676908]  rollback_registered+0x1f/0x30
         [ 2273.676911]  unregister_netdevice_queue+0x47/0xb0
         [ 2273.676915]  qmimux_unregister_device+0x1f/0x30 [qmi_wwan]
         [ 2273.676917]  qmi_wwan_disconnect+0x5d/0x90 [qmi_wwan]
         ...
         [ 2273.677001] ---[ end trace 0fcc5f88496b485a ]---
         [ 2294.679136] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
         [ 2294.679140] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-1): P141
         [ 2294.679144] rcu: 	(detected by 0, t=21002 jiffies, g=265857, q=8446)
         [ 2294.679148] kworker/1:1     D    0   141      2 0x80000000
      
      In addition the permitted QMAP mux_id value range is extended for
      compatibility with ip(8) and the rmnet driver.
      
      Reinhard
      
      [1]: https://portland.source.codeaurora.org/patches/quic/gobi
      [2]: https://portland.source.codeaurora.org/quic/qsdk/oss/lklm/gobinet/
      ====================
      Tested-by: Daniele Palmas <dnlplm@gmail.com>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2309f517
    • R
      qmi_wwan: extend permitted QMAP mux_id value range · 36815b41
      Reinhard Speyerer committed
      Permit mux_id values up to 254 to be used in qmimux_register_device()
      for compatibility with ip(8) and the rmnet driver.
      
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36815b41
    • R
      qmi_wwan: avoid RCU stalls on device disconnect when in QMAP mode · a8fdde1c
      Reinhard Speyerer committed
      Switch qmimux_unregister_device() and qmi_wwan_disconnect() to
      use unregister_netdevice_queue() and unregister_netdevice_many()
      instead of unregister_netdevice(). This avoids RCU stalls which
      have been observed on device disconnect in certain setups otherwise.
      
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8fdde1c
    • R
      qmi_wwan: add network device usage statistics for qmimux devices · 44f82312
      Reinhard Speyerer committed
      Add proper network device usage statistics for qmimux devices
      instead of reporting all-zero values for them.
      
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      44f82312
    • R
      qmi_wwan: add support for QMAP padding in the RX path · 61356088
      Reinhard Speyerer committed
      The QMAP code in the qmi_wwan driver is based on the CodeAurora GobiNet
      driver which does not process QMAP padding in the RX path correctly.
      Add support for QMAP padding to qmimux_rx_fixup() according to the
      description of the rmnet driver.
      
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61356088
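
To illustrate what handling QMAP padding in the RX path involves, here is a minimal user-space sketch in Python. It is not the driver code. The 4-byte header layout used here (one byte whose top bit marks a command frame and whose low 6 bits give the padding length, one byte of mux_id, and a big-endian 16-bit length that includes the padding) is an assumption based on the MAP header description in the rmnet documentation, and the function name `split_qmap_frames` is hypothetical.

```python
import struct

# Assumed header: flags/pad byte, mux_id byte, big-endian 16-bit length.
QMAP_HDR_LEN = 4
QMAP_CMD_FLAG = 0x80   # assumed: top bit marks a command (non-data) frame
QMAP_PAD_MASK = 0x3F   # assumed: low 6 bits carry the padding length


def split_qmap_frames(buf):
    """Return a list of (mux_id, payload) for the data frames in `buf`.

    The length field is assumed to cover payload plus trailing padding,
    so the padding bytes are stripped before the payload is returned.
    """
    frames = []
    offset = 0
    while offset + QMAP_HDR_LEN <= len(buf):
        pad_byte, mux_id, pkt_len = struct.unpack_from("!BBH", buf, offset)
        start = offset + QMAP_HDR_LEN
        end = start + pkt_len
        if end > len(buf):
            raise ValueError("truncated QMAP frame")
        offset = end
        if pad_byte & QMAP_CMD_FLAG:
            continue  # skip command frames
        pad_len = pad_byte & QMAP_PAD_MASK
        if pkt_len == 0 or pad_len >= pkt_len:
            continue  # nothing left after removing padding
        frames.append((mux_id, bytes(buf[start:end - pad_len])))
    return frames
```

Without the `pad_len` subtraction, the trailing padding bytes would be handed on as part of the packet, which is the kind of incorrect data the cover letter describes.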
    • A
      bpf, x64: fix stack layout of JITed bpf code · fe8d9571
      Alexei Starovoitov committed
      Since commit 177366bf, %rbp no longer points to the %rbp of the
      previous stack frame. That broke frame-pointer-based stack unwinding.
      This commit is a partial revert of it.
      Note that the location of tail_call_cnt is fixed, since the verifier
      enforces MAX_BPF_STACK stack size for programs with tail calls.
      
      Fixes: 177366bf ("bpf: change x86 JITed program stack layout")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fe8d9571
    • T
      bpf, devmap: Add missing RCU read lock on flush · 86723c86
      Toshiaki Makita committed
      .ndo_xdp_xmit() assumes it is called under RCU. For example, virtio_net
      uses RCU to detect that it has set up the resources for tx. The assumption
      was accidentally broken when the bulk queue was introduced in devmap.
      
      Fixes: 5d053f9d ("bpf: devmap prepare xdp frames for bulking")
      Reported-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      86723c86
    • T
      bpf, devmap: Add missing bulk queue free · edabf4d9
      Toshiaki Makita committed
      dev_map_free() forgot to free the bulk queue when freeing its entries.
      
      Fixes: 5d053f9d ("bpf: devmap prepare xdp frames for bulking")
      Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      edabf4d9
    • T
      bpf, devmap: Fix premature entry free on destroying map · d4dd153d
      Toshiaki Makita committed
      dev_map_free() waits for the flush_needed bitmap to be empty in order to
      ensure all flush operations have completed before freeing its entries.
      However, the corresponding clear_bit() was called before using the
      entries, so the entries could be used after free.
      
      All access to the entries needs to be done before clearing the bit.
      It seems commit a5e2da6e ("bpf: netdev is never null in
      __dev_map_flush") accidentally changed the clear_bit() and memory access
      order.
      
      Note that the problem happens only in __dev_map_flush(), not in
      dev_map_flush_old(). dev_map_flush_old() is called only after nulling
      out the corresponding netdev_map entry, so dev_map_free() never frees
      the entry thus no such race happens there.
      
      Fixes: a5e2da6e ("bpf: netdev is never null in __dev_map_flush")
      Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d4dd153d
    • D
      Merge tag 'mac80211-for-davem-2019-06-14' of... · 2a2af5e6
      David S. Miller committed
      Merge tag 'mac80211-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      Various fixes, all over:
       * a few memory leaks
       * fixes for management frame protection security
         and A2/A3 confusion (affecting TDLS as well)
       * build fix for certificates
       * etc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a2af5e6
    • R
      net: phylink: further mac_config documentation improvements · 4add7009
      Russell King - ARM Linux admin committed
      While reviewing the DPAA2 work, it has become apparent that we need
      better documentation about which members of the phylink link state
      structure are valid in the mac_config call.  Improve this
      documentation.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4add7009
  3. 14 Jun, 2019 12 commits