1. 09 12月, 2017 23 次提交
    • R
      cxgb4: collect on-chip memory information · 123e25c4
      Rahul Lakkireddy 提交于
      Collect memory layout of various on-chip memory regions.  Move code
      for collecting on-chip memory information to cudbg_lib.c and update
      cxgb4_debugfs.c to use the common function.  Also include
      cudbg_entity.h before cudbg_lib.h to avoid adding cudbg entity
      structure forward declarations in cudbg_lib.h.
      Signed-off-by: NRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: NGanesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      123e25c4
    • D
      Merge branch 'veth-and-GSO-maximums' · 62fd8b18
      David S. Miller 提交于
      Stephen Hemminger says:
      
      ====================
      veth and GSO maximums
      
      This is the more general way to solving the issue of GSO limits
      not being set correctly for containers on Azure. If a GSO packet
      is sent to host that exceeds the limit (reported by NDIS), then
      the host is forced to do segmentation in software which has noticeable
      performance impact.
      
      The core rtnetlink infrastructure already has the messages and
      infrastructure to allow changing gso limits. With an updated iproute2
      the following already works:
        # ip li set dev dummy0 gso_max_size 30000
      
      These patches are about making it easier with veth.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62fd8b18
    • S
      veth: set peer GSO values · 72d24955
      Stephen Hemminger 提交于
      When new veth is created, and GSO values have been configured
      on one device, clone those values to the peer.
      
      For example:
         # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2
      
      This should create vm1 <--> vm2 with both having GSO maximum
      size set to 65530.
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72d24955
    • S
      rtnetlink: allow GSO maximums to be set on device creation · 46e6b992
      Stephen Hemminger 提交于
      Netlink device already allows changing GSO sizes with
      ip set command. The part that is missing is allowing overriding
      GSO settings on device creation.
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46e6b992
    • N
      net: stmmac: fix broken dma_interrupt handling for multi-queues · 5a6a0445
      Niklas Cassel 提交于
      There is nothing that says that number of TX queues == number of RX
      queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue.
      
      This code is obviously wrong:
      for (chan = 0; chan < tx_channel_count; chan++) {
          struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];
      
      priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an
      uninitialized napi_struct to __napi_schedule(), causing us to
      crash in net_rx_action(), because napi_struct->poll is zero.
      
      [12846.759880] Unable to handle kernel NULL pointer dereference at virtual address 00000000
      [12846.768014] pgd = (ptrval)
      [12846.770742] [00000000] *pgd=39ec7831, *pte=00000000, *ppte=00000000
      [12846.777023] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM
      [12846.782942] Modules linked in:
      [12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted 4.15.0-rc2-00285-gf5fb5f2f39a7 #36
      [12846.794177] Hardware name: Axis ARTPEC-6 Platform
      [12846.798879] task: (ptrval) task.stack: (ptrval)
      [12846.803407] PC is at 0x0
      [12846.805942] LR is at net_rx_action+0x274/0x43c
      [12846.810383] pc : [<00000000>]    lr : [<80bff064>]    psr: 200e0113
      [12846.816648] sp : b90d9ae8  ip : b90d9ae8  fp : b90d9b44
      [12846.821871] r10: 00000008  r9 : 0013250e  r8 : 00000100
      [12846.827094] r7 : 0000012c  r6 : 00000000  r5 : 00000001  r4 : bac84900
      [12846.833619] r3 : 00000000  r2 : b90d9b08  r1 : 00000000  r0 : bac84900
      
      Since each DMA channel can be used for rx and tx simultaneously,
      the current code should probably be rewritten so that napi_struct is
      embedded in a new struct stmmac_channel.
      That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue
      where we got the IRQ, instead of looping through all tx queues.
      This is also how the xgbe driver does it (another driver for this IP).
      
      Fixes: c22a3f48 ("net: stmmac: adding multiple napi mechanism")
      Signed-off-by: NNiklas Cassel <niklas.cassel@axis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a6a0445
    • E
      net: dsa: lan9303: Protect ALR operations with mutex · 2e8d243e
      Egil Hjelmeland 提交于
      ALR table operations are a sequence of related register operations which
      should be protected from concurrent access. The alr_cache should also be
      protected. Add alr_mutex doing that.
      Signed-off-by: NEgil Hjelmeland <privat@egil-hjelmeland.no>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e8d243e
    • J
      net: sched: fix use-after-free in tcf_block_put_ext · df45bf84
      Jiri Pirko 提交于
      Since the block is freed with last chain being put, once we reach the
      end of iteration of list_for_each_entry_safe, the block may be
      already freed. I'm hitting this only by creating and deleting clsact:
      
      [  202.171952] ==================================================================
      [  202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
      [  202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796
      [  202.194508]
      [  202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
      [  202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
      [  202.213613] Call Trace:
      [  202.216369]  dump_stack+0xda/0x169
      [  202.220192]  ? dma_virt_map_sg+0x147/0x147
      [  202.224790]  ? show_regs_print_info+0x54/0x54
      [  202.229691]  ? tcf_chain_destroy+0x1dc/0x250
      [  202.234494]  print_address_description+0x83/0x3d0
      [  202.239781]  ? tcf_block_put_ext+0x240/0x390
      [  202.244575]  kasan_report+0x1ba/0x460
      [  202.248707]  ? tcf_block_put_ext+0x240/0x390
      [  202.253518]  tcf_block_put_ext+0x240/0x390
      [  202.258117]  ? tcf_chain_flush+0x290/0x290
      [  202.262708]  ? qdisc_hash_del+0x82/0x1a0
      [  202.267111]  ? qdisc_hash_add+0x50/0x50
      [  202.271411]  ? __lock_is_held+0x5f/0x1a0
      [  202.275843]  clsact_destroy+0x3d/0x80 [sch_ingress]
      [  202.281323]  qdisc_destroy+0xcb/0x240
      [  202.285445]  qdisc_graft+0x216/0x7b0
      [  202.289497]  tc_get_qdisc+0x260/0x560
      
      Fix this by holding the block also by chain 0 and put chain 0
      explicitly, out of the list_for_each_entry_safe loop at the very
      end of tcf_block_put_ext.
      
      Fixes: efbf7897 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()")
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df45bf84
    • D
      Merge branch 'lockless-qdisc-series' · fc8b81a5
      David S. Miller 提交于
      John Fastabend says:
      
      ====================
      lockless qdisc series
      
      This series adds support for building lockless qdiscs. This is
      the result of noticing the qdisc lock is a common hot-spot in
      perf analysis of the Linux network stack, especially when testing
      with high packet per second rates. However, nothing is free and
      most qdiscs rely on the qdisc lock for their data structures so
      each qdisc must be converted on a case by case basis. In this
      series, to kick things off, we make pfifo_fast, mq, and mqprio
      lockless. Follow up series can address additional qdiscs as needed.
      For example sch_tbf might be useful. To allow this the lockless
      design is an opt-in flag. In some future utopia we convert all
      qdiscs and we get to drop this case analysis, but in order to
      make progress we live in the real-world.
      
      There are also a handful of optimizations I have behind this
      series and a few code cleanups that I couldn't figure out how
      to fit neatly into this series with out increasing the patch
      count. Once this is in additional patches can address this. The
      most notable is in skb_dequeue we can push the consumer lock
      out a bit and consume multiple skbs off the skb_array in pfifo
      fast per iteration. Ideally we could push arrays of packets at
      drivers as well but we would need the infrastructure for this.
      The other notable improvement is to do less locking in the
      overrun cases where bad tx queue list and gso_skb are being
      hit. Although, nice in theory in practice this is the error
      case and I haven't found a benchmark where this matters yet.
      
      For testing...
      
      My first test case uses multiple containers (via cilium) where
      multiple client containers use 'wrk' to benchmark connections with
      a server container running lighttpd. Where lighttpd is configured
      to use multiple threads, one per core. Additionally this test has
      a proxy agent running so all traffic takes an extra hop through a
      proxy container. In these cases each TCP packet traverses the egress
      qdisc layer at least four times and the ingress qdisc layer an
      additional four times. This makes for a good stress test IMO, perf
      details below.
      
      The other micro-benchmark I run is injecting packets directly into
      qdisc layer using pktgen. This uses the benchmark script,
      
       ./pktgen_bench_xmit_mode_queue_xmit.sh
      
      Benchmarks taken in two cases, "base" running latest net-next no
      changes to qdisc layer and "qdisc" tests run with qdisc lockless
      updates. Numbers reported in req/sec. All virtual 'veth' devices
      run with pfifo_fast in the qdisc test case.
      
      `wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`
      
      conns    16      32     64   1024
      -----------------------------------------------
      base:   18831  20201  21393  29151
      qdisc:  19309  21063  23899  29265
      
      notice in all cases we see performance improvement when running
      with qdisc case.
      
      Microbenchmarks using pktgen are as follows,
      
      `pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000
      
      base(mq):          2.1Mpps
      base(pfifo_fast):  2.1Mpps
      qdisc(mq):         2.6Mpps
      qdisc(pfifo_fast): 2.6Mpps
      
      notice numbers are the same for mq and pfifo_fast because only
      testing a single thread here. In both tests we see a nice bump
      in performance gain. The key with 'mq' is it is already per
      txq ring so contention is minimal in the above cases. Qdiscs
      such as tbf or htb which have more contention will likely show
      larger gains when/if lockless versions are implemented.
      
      Thanks to everyone who helped with this work especially Daniel
      Borkmann, Eric Dumazet and Willem de Bruijn for discussing the
      design and reviewing versions of the code.
      
      Changes from the RFC: dropped a couple patches off the end,
      fixed a bug with skb_queue_walk_safe not unlinking skb in all
      cases, fixed a lockdep splat with pfifo_fast_destroy not calling
      *_bh lock variant, addressed _most_ of Willem's comments, there
      was a bug in the bulk locking (final patch) of the RFC series.
      
      @Willem, I left out lockdep annotation for a follow on series
      to add lockdep more completely, rather than just in code I
      touched.
      
      Comments and feedback welcome.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc8b81a5
    • J
      net: sched: pfifo_fast use skb_array · c5ad119f
      John Fastabend 提交于
      This converts the pfifo_fast qdisc to use the skb_array data structure
      and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
      the lockless bit that can be a child of a qdisc requiring locking. So
      we add logic to clear the lock bit on initialization in these cases when
      the qdisc graft operation occurs.
      
      This also removes the logic used to pick the next band to dequeue from
      and instead just checks a per priority array for packets from top priority
      to lowest. This might need to be a bit more clever but seems to work
      for now.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5ad119f
    • J
      net: skb_array: expose peek API · 4a86a4cf
      John Fastabend 提交于
      This adds a peek routine to skb_array.h for use with qdisc.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a86a4cf
    • J
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio · ce679e8d
      John Fastabend 提交于
      The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
      called independently for enqueue and dequeue operations. However
      statistics are aggregated and pushed up to the "master" qdisc.
      
      This patch adds support for any of the sub-qdiscs to be per cpu
      statistic qdiscs. To handle this case add a check when calculating
      stats and aggregate the per cpu stats if needed.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce679e8d
    • J
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq · b01ac095
      John Fastabend 提交于
      The sch_mq qdisc creates a sub-qdisc per tx queue which are then
      called independently for enqueue and dequeue operations. However
      statistics are aggregated and pushed up to the "master" qdisc.
      
      This patch adds support for any of the sub-qdiscs to be per cpu
      statistic qdiscs. To handle this case add a check when calculating
      stats and aggregate the per cpu stats if needed.
      
      Also exports __gnet_stats_copy_queue() to use as a helper function.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b01ac095
    • J
      net: sched: helpers to sum qlen and qlen for per cpu logic · 7e66016f
      John Fastabend 提交于
      Add qdisc qlen helper routines for lockless qdiscs to use.
      
      The qdisc qlen is no longer used in the hotpath but it is reported
      via stats query on the qdisc so it still needs to be tracked. This
      adds the per cpu operations needed along with a helper to return
      the summation of per cpu stats.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e66016f
    • J
      net: sched: check for frozen queue before skb_bad_txq check · fd8e8d1a
      John Fastabend 提交于
      I can not think of any reason to pull the bad txq skb off the qdisc if
      the txq we plan to send this on is still frozen. So check for frozen
      queue first and abort before dequeuing either skb_bad_txq skb or
      normal qdisc dequeue() skb.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd8e8d1a
    • J
      net: sched: use skb list for skb_bad_tx · 70e57d5e
      John Fastabend 提交于
      Similar to how gso is handled use skb list for skb_bad_tx this is
      required with lockless qdiscs because we may have multiple cores
      attempting to push skbs into skb_bad_tx concurrently
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70e57d5e
    • J
      net: sched: drop qdisc_reset from dev_graft_qdisc · 7bbde83b
      John Fastabend 提交于
      In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
      operation is called on the old qdisc. The destroy operation will wait
      a rcu grace period and call qdisc_rcu_free(). At which point
      gso_cpu_skb is free'd along with all stats so no need to zero stats
      and gso_cpu_skb from the graft operation itself.
      
      Further after dropping the qdisc locks we can not continue to call
      qdisc_reset before waiting an rcu grace period so that the qdisc is
      detached from all cpus. By removing the qdisc_reset() here we get
      the correct property of waiting an rcu grace period and letting the
      qdisc_destroy operation clean up the qdisc correctly.
      
      Note, a refcnt greater than 1 would cause the destroy operation to
      be aborted however if this ever happened the reference to the qdisc
      would be lost and we would have a memory leak.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bbde83b
    • J
      net: sched: explicit locking in gso_cpu fallback · a53851e2
      John Fastabend 提交于
      This work is preparing the qdisc layer to support egress lockless
      qdiscs. If we are running the egress qdisc lockless in the case we
      overrun the netdev, for whatever reason, the netdev returns a busy
      error code and the skb is parked on the gso_skb pointer. With many
      cores all hitting this case at once its possible to have multiple
      sk_buffs here so we turn gso_skb into a queue.
      
      This should be the edge case and if we see this frequently then
      the netdev/qdisc layer needs to back off.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a53851e2
    • J
      net: sched: a dflt qdisc may be used with per cpu stats · d59f5ffa
      John Fastabend 提交于
      Enable dflt qdisc support for per cpu stats before this patch a dflt
      qdisc was required to use the global statistics qstats and bstats.
      
      This adds a static flags field to qdisc_ops that is propagated
      into qdisc->flags in qdisc allocate call. This allows the allocation
      block to completely allocate the qdisc object so we don't have
      dangling allocations after qdisc init.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d59f5ffa
    • J
      net: sched: provide per cpu qstat helpers · 40bd0362
      John Fastabend 提交于
      The per cpu qstats support was added with per cpu bstat support which
      is currently used by the ingress qdisc. This patch adds a set of
      helpers needed to make other qdiscs that use qstats per cpu as well.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40bd0362
    • J
      net: sched: remove remaining uses for qdisc_qlen in xmit path · 29b86cda
      John Fastabend 提交于
      sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
      of the routine only check if it is zero or not. Simplify the logic so
      that we don't need to return an actual queue length value.
      
      This introduces a case now where sch_direct_xmit would have returned
      a qlen of zero but now it returns true. However in this case all
      call sites of sch_direct_xmit will implement a dequeue() and get
      a null skb and abort. This trades tracking qlen in the hotpath for
      an extra dequeue operation. Overall this seems to be good for
      performance.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29b86cda
    • J
      net: sched: allow qdiscs to handle locking · 6b3ba914
      John Fastabend 提交于
      This patch adds a flag for queueing disciplines to indicate the stack
      does not need to use the qdisc lock to protect operations. This can
      be used to build lockless scheduling algorithms and improving
      performance.
      
      The flag is checked in the tx path and the qdisc lock is only taken
      if it is not set. For now use a conditional if statement. Later we
      could be more aggressive if it proves worthwhile and use a static key
      or wrap this in a likely().
      
      Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
      for this is synchronizing a qlen counter across threads proves to
      cost more than doing the enqueue/dequeue operations when tested with
      pktgen.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b3ba914
    • J
      net: sched: cleanup qdisc_run and __qdisc_run semantics · 6c148184
      John Fastabend 提交于
      Currently __qdisc_run calls qdisc_run_end() but does not call
      qdisc_run_begin(). This makes it hard to track pairs of
      qdisc_run_{begin,end} across function calls.
      
      To simplify reading these code paths this patch moves begin/end calls
      into qdisc_run().
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c148184
    • T
      virtio_net: Disable interrupts if napi_complete_done rescheduled napi · fdaa767a
      Toshiaki Makita 提交于
      Since commit 39e6c820 ("net: solve a NAPI race") napi has been able
      to be rescheduled within napi_complete_done() even in non-busypoll case,
      but virtnet_poll() always enabled interrupts before complete, and when
      napi was rescheduled within napi_complete_done() it did not disable
      interrupts.
      This caused more interrupts when event idx is disabled.
      
      According to commit cbdadbbf ("virtio_net: fix race in RX VQ
      processing") we cannot place virtqueue_enable_cb_prepare() after
      NAPI_STATE_SCHED is cleared, so disable interrupts again if
      napi_complete_done() returned false.
      
      Tested with vhost-user of OVS 2.7 on host, which does not have the event
      idx feature.
      
      * Before patch:
      
      $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
      MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992    1472   60.00     32763206      0    6430.32
      212992           60.00     23384299           4589.56
      
      Interrupts on guest: 9872369
      Packets/interrupt:   2.37
      
      * After patch
      
      $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
      MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992    1472   60.00     32794646      0    6436.49
      212992           60.00     32793501           6436.27
      
      Interrupts on guest: 4941299
      Packets/interrupt:   6.64
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fdaa767a
  2. 08 12月, 2017 11 次提交
  3. 07 12月, 2017 6 次提交