  1. 05 Jul 2017, 1 commit
  2. 07 Apr 2017, 1 commit
  3. 13 Mar 2017, 1 commit
  4. 30 Dec 2016, 1 commit
    • net: dev_weight: TX/RX orthogonality · 3d48b53f
      Committed by Matthias Tafelmeier
      Often, adjusting one of TX/RX via sysctl introduces undesirable side
      effects on packet processing in the other half of the stack. There are
      cases that demand asymmetric, orthogonal configurability.
      
      This holds true especially for nodes where RPS with RFS on top is
      configured, which therefore use the 'old' dev_weight. This is quite a
      common base configuration nowadays, even with NICs that have superior
      processing support (e.g. aRFS).
      
      A good example use case is a node acting as a NoSQL database, handling a
      large number of tiny requests and sending rather fewer but larger packets
      as responses. A large budget and RX dev_weight are affordable for the
      requests, but as a side effect, having this large a number of packets
      processed on TX in one run can overwhelm drivers.
      
      This patch therefore exposes independent TX/RX configurability to
      userland via sysctl.
      Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d48b53f
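
      A minimal userspace sketch of the idea behind the commit above, assuming
      the per-direction knobs are exposed as
      /proc/sys/net/core/dev_weight_rx_bias and
      /proc/sys/net/core/dev_weight_tx_bias and that the effective budgets are
      dev_weight multiplied by the respective bias (both the knob names and the
      formula are assumptions here, not verified against the kernel source):

      /* Sketch only: models the per-direction budgets as dev_weight * bias. */
      #include <stdio.h>

      static long read_sysctl(const char *path, long fallback)
      {
              FILE *f = fopen(path, "r");
              long v = fallback;

              if (f && fscanf(f, "%ld", &v) != 1)
                      v = fallback;
              if (f)
                      fclose(f);
              return v;
      }

      int main(void)
      {
              long weight  = read_sysctl("/proc/sys/net/core/dev_weight", 64);
              long rx_bias = read_sysctl("/proc/sys/net/core/dev_weight_rx_bias", 1);
              long tx_bias = read_sysctl("/proc/sys/net/core/dev_weight_tx_bias", 1);

              printf("rx budget per softirq run: %ld\n", weight * rx_bias);
              printf("tx budget per qdisc run:   %ld\n", weight * tx_bias);
              return 0;
      }

      With such a split, a node of the NoSQL kind described above could raise
      only the RX budget while leaving the TX side untouched.
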
  5. 06 Dec 2016, 1 commit
    • net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Committed by Eric Dumazet
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32-bit kernels (WRITE_ONCE() on a 64-bit quantity
         is not supposed to work well)
      
      In this rewrite:
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in a net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory needs when estimators are not used, since
        we store a pointer instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c0d32fd
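
      A hedged userspace model of the 32-bit-safe read path mentioned above: the
      writer bumps a sequence counter around its update of the 64-bit pair, and
      readers retry until they observe an even, unchanged sequence. This is only
      a sketch of the seqcount idea with illustrative names, not the kernel's
      net_rate_estimator code, and it assumes a single writer:

      #include <stdatomic.h>
      #include <stdint.h>
      #include <stdio.h>

      struct rate_est {
              atomic_uint seq;        /* even = stable, odd = update in progress */
              uint64_t    bytes;
              uint64_t    packets;
      };

      static void est_update(struct rate_est *e, uint64_t b, uint64_t p)
      {
              atomic_fetch_add(&e->seq, 1);   /* becomes odd: update started */
              e->bytes += b;
              e->packets += p;
              atomic_fetch_add(&e->seq, 1);   /* becomes even: update finished */
      }

      static void est_read(struct rate_est *e, uint64_t *b, uint64_t *p)
      {
              unsigned int start;

              do {                            /* retry on concurrent update */
                      start = atomic_load(&e->seq);
                      *b = e->bytes;
                      *p = e->packets;
              } while ((start & 1) || start != atomic_load(&e->seq));
      }

      int main(void)
      {
              struct rate_est e = { 0 };
              uint64_t b, p;

              est_update(&e, 1500, 1);
              est_read(&e, &b, &p);
              printf("bytes=%llu packets=%llu\n",
                     (unsigned long long)b, (unsigned long long)p);
              return 0;
      }
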
  6. 19 Sep 2016, 3 commits
  7. 26 Aug 2016, 1 commit
  8. 11 Aug 2016, 1 commit
  9. 26 Jun 2016, 2 commits
    • net_sched: generalize bulk dequeue · 4d202a0d
      Committed by Eric Dumazet
      When qdisc bulk dequeue was added in linux-3.18 (commit
      5772e9a3 "qdisc: bulk dequeue support for qdiscs
      with TCQ_F_ONETXQUEUE"), it was constrained to some
      specific qdiscs.
      
      With some extra care, we can extend this to all qdiscs,
      so that typical traffic shaping solutions can benefit from
      small batches (8 packets in this patch).
      
      For example, HTB is often used on a multi-queue device,
      and bonding/team are multi-queue devices...
      
      The idea is to bulk-dequeue packets that map to the same transmit queue.
      
      This brings a 35 to 80 % performance increase for an HTB setup
      under pressure on top of bonding:
      
      1) NUMA node contention :   610,000 pps -> 1,110,000 pps
      2) No node contention   : 1,380,000 pps -> 1,930,000 pps
      
      Now we should work to add batches on the enqueue() side ;)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d202a0d
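
      A simplified sketch of the dequeue policy described above: keep pulling
      packets while they map to the same transmit queue as the first one and the
      small batch limit (8, as in the patch) is not exceeded. Mock types and
      names, not the kernel's dequeue_skb()/try_bulk_dequeue_skb():

      #include <stdio.h>
      #include <stddef.h>

      struct pkt {
              struct pkt *next;
              int         txq;        /* transmit queue this packet maps to */
      };

      /* Dequeue up to 'limit' packets that all map to the same TXQ. */
      static struct pkt *bulk_dequeue(struct pkt **head, int limit)
      {
              struct pkt *first = *head, *tail = first;

              if (!first)
                      return NULL;
              *head = first->next;

              while (--limit > 0 && *head && (*head)->txq == first->txq) {
                      tail->next = *head;
                      tail = *head;
                      *head = (*head)->next;
              }
              tail->next = NULL;      /* one batch, one TXQ, one doorbell */
              return first;
      }

      int main(void)
      {
              struct pkt p3 = { NULL, 1 }, p2 = { &p3, 0 }, p1 = { &p2, 0 };
              struct pkt *q = &p1, *batch = bulk_dequeue(&q, 8);
              int n = 0;

              for (; batch; batch = batch->next)
                      n++;
              printf("bulked %d packets for txq 0\n", n);   /* prints 2 */
              return 0;
      }
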
    • net_sched: drop packets after root qdisc lock is released · 520ac30f
      Committed by Eric Dumazet
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while the qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time when we would like dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ's goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue() implementations and to the qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prerequisite so that future batching in enqueue() can fly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      520ac30f
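
      A hedged sketch of the pattern this commit describes: the enqueue path no
      longer frees a dropped packet directly; it chains it onto a caller-provided
      to_free list, and the caller frees that list only after the qdisc lock has
      been released. Mock types, not the kernel's exact signatures:

      #include <stdlib.h>

      struct pkt { struct pkt *next; };

      /* Called with the qdisc lock held: just chain the drop, do not free here. */
      static void drop_deferred(struct pkt *p, struct pkt **to_free)
      {
              p->next = *to_free;
              *to_free = p;
      }

      /* Called by the enqueue caller after the qdisc lock has been released. */
      static void free_deferred(struct pkt *to_free)
      {
              while (to_free) {
                      struct pkt *next = to_free->next;

                      free(to_free);          /* kfree_skb() in the kernel */
                      to_free = next;
              }
      }

      int main(void)
      {
              struct pkt *p = malloc(sizeof(*p));
              struct pkt *to_free = NULL;

              if (p)
                      drop_deferred(p, &to_free);
              /* ...the qdisc lock would be released here... */
              free_deferred(to_free);
              return 0;
      }
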
  10. 16 Jun 2016, 1 commit
    • net_sched: add the ability to defer skb freeing · 1b5c5493
      Committed by Eric Dumazet
      Qdiscs are changed under RTNL protection, often while BH is blocked
      and the root qdisc spinlock is held.
      
      When lots of skbs need to be dropped, we free
      them under these locks, causing TX/RX freezes
      and, more generally, latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of qdisc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b5c5493
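
      A hedged userspace sketch of the deferral described above: while the big
      lock (RTNL in the kernel) is held, skbs to be freed are only queued on a
      list; the list is flushed with scheduling points right after the lock is
      dropped. Names and types are illustrative, not the kernel's:

      #include <sched.h>
      #include <stdlib.h>

      struct pkt { struct pkt *next; };

      static struct pkt *defer_list;          /* only touched with the lock held */

      /* rtnl_kfree_skbs()-like: callable only while the big lock is held. */
      static void defer_free(struct pkt *p)
      {
              p->next = defer_list;
              defer_list = p;
      }

      /* Called right after the big lock is released. */
      static void flush_deferred(void)
      {
              int batch = 0;

              while (defer_list) {
                      struct pkt *next = defer_list->next;

                      free(defer_list);       /* kfree_skb() in the kernel */
                      defer_list = next;
                      if (++batch % 64 == 0)
                              sched_yield();  /* crude stand-in for cond_resched() */
              }
      }

      int main(void)
      {
              struct pkt *p = malloc(sizeof(*p));

              if (p)
                      defer_free(p);          /* e.g. from a ->reset() handler */
              /* ...the big lock would be released here... */
              flush_deferred();
              return 0;
      }
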
  11. 10 Jun 2016, 1 commit
  12. 08 Jun 2016, 1 commit
  13. 07 Jun 2016, 1 commit
  14. 05 May 2016, 2 commits
    • net: remove dev->trans_start · 9b36627a
      Committed by Florian Westphal
      Previous patches removed all direct accesses to dev->trans_start,
      so change the netif_trans_update helper to update trans_start of
      netdev queue 0 instead, and then remove trans_start from struct net_device.
      
      AFAICS a lot of the netif_trans_update() invocations are now useless
      because they occur in ndo_start_xmit and the driver doesn't set LLTX
      (i.e. the stack already took care of the update).
      
      As I can't test any of them it seems better to just leave them alone.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b36627a
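
      A kernel-style sketch (with mock types, so it compiles stand-alone) of the
      helper change described above: with dev->trans_start gone, the helper
      stamps the trans_start of netdev queue 0 instead. This illustrates the
      described behaviour and is not a verbatim copy of the kernel helper:

      struct netdev_queue {
              unsigned long trans_start;
      };

      struct net_device {
              struct netdev_queue txq[1];     /* real devices can have many TXQs */
      };

      static unsigned long jiffies;           /* stand-in for the kernel tick count */

      /* Update the last-transmit timestamp; it now lives on queue 0, not the device. */
      static inline void netif_trans_update(struct net_device *dev)
      {
              struct netdev_queue *txq = &dev->txq[0];

              if (txq->trans_start != jiffies)
                      txq->trans_start = jiffies;
      }
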
    • treewide: replace dev->trans_start update with helper · 860e9538
      Committed by Florian Westphal
      Replace all trans_start updates with the netif_trans_update helper.
      The change was done via spatch:
      
      @@
      struct net_device *d;
      @@
      - d->trans_start = jiffies
      + netif_trans_update(d)
      
      Compile tested only.
      
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux1394-devel@lists.sourceforge.net
      Cc: linux-rdma@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: MPT-FusionLinux.pdl@broadcom.com
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-can@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-omap@vger.kernel.org
      Cc: linux-hams@vger.kernel.org
      Cc: linux-usb@vger.kernel.org
      Cc: linux-wireless@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: b.a.t.m.a.n@lists.open-mesh.org
      Cc: linux-bluetooth@vger.kernel.org
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Acked-by: Antonio Quartulli <a@unstable.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      860e9538
  15. 27 Apr 2016, 1 commit
  16. 14 Apr 2016, 1 commit
  17. 04 Mar 2016, 1 commit
  18. 06 Jan 2016, 1 commit
  19. 04 Dec 2015, 1 commit
    • net_sched: fix qdisc_tree_decrease_qlen() races · 4eaf3b84
      Committed by Eric Dumazet
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not: Daniele
      reported underflow errors:
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking: concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal
      as soon as it meets an mq/mqprio qdisc child.
      
      The second problem can be fixed by RCU protection.
      Qdiscs are already freed after an RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use the appropriate RCU list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
      Reported-by: Daniele Fucini <dfucini@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4eaf3b84
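
      A hedged sketch of the first fix: the qlen adjustment walks towards the
      root but stops once it reaches a qdisc flagged as sitting directly under
      an mq/mqprio root, so those roots (which fold stats from their children
      themselves) are never touched. Mock structures and flag value; only the
      flag name mirrors the commit:

      #define FLAG_NOPARENT  0x1              /* stand-in for TCQ_F_NOPARENT */

      struct qdisc {
              struct qdisc *parent;
              unsigned int  flags;
              unsigned int  qlen;
      };

      static void tree_decrease_qlen(struct qdisc *sch, unsigned int n)
      {
              while (sch) {
                      sch->qlen -= n;
                      if (sch->flags & FLAG_NOPARENT)
                              break;          /* never fold into mq/mqprio roots */
                      sch = sch->parent;
              }
      }
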
  20. 28 Aug 2015, 3 commits
  21. 18 Aug 2015, 1 commit
  22. 10 Oct 2014, 1 commit
    • net_sched: restore qdisc quota fairness limits after bulk dequeue · b8358d70
      Committed by Jesper Dangaard Brouer
      Restore the quota fairness between qdiscs that we broke with commit
      5772e9a3 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").
      
      Before that commit, the quota in __qdisc_run() was in packets, as
      dequeue_skb() would only dequeue a single packet; that assumption
      broke with bulk dequeue.
      
      We choose not to account for the number of packets inside the TSO/GSO
      packets (accessible via "skb_gso_segs"), as the previous fairness
      also had this "defect". Thus, GSO/TSO packets count as a single
      packet.
      
      Furthermore, we choose to slack on accuracy by allowing a bulk
      dequeue in try_bulk_dequeue_skb() to exceed the "packets" limit,
      limited only by the BQL byte limit.  This is done because BQL prefers
      to get its full budget for appropriate feedback from TX completion.
      
      In the future, we might consider reworking this further and, if it allows,
      switching to a time-based model, as suggested by Eric. Right now, we only
      restore the old semantics.
      
      Joint work with Eric, Hannes, Daniel and Jesper.  Hannes wrote the
      first patch in cooperation with Daniel and Jesper.  Eric rewrote the
      patch.
      
      Fixes: 5772e9a3 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8358d70
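
      A small sketch of the restored accounting: a __qdisc_run()-style loop that
      charges the quota with the number of packets actually sent per bulk instead
      of one per iteration, and that may slightly overshoot as described above.
      The stub below fakes a queue of 20 packets; names are illustrative:

      #include <stdio.h>

      static int pending = 20;                /* pretend 20 packets are queued */

      /* Stand-in for qdisc_restart(): transmits a bulk of up to 8 packets. */
      static int qdisc_restart_stub(void)
      {
              int bulk = pending < 8 ? pending : 8;

              pending -= bulk;
              return bulk;
      }

      static void qdisc_run_quota(int quota)
      {
              int packets;

              while ((packets = qdisc_restart_stub()) > 0) {
                      quota -= packets;       /* was effectively "quota--" before */
                      if (quota <= 0) {
                              printf("quota exhausted, yield to other qdiscs\n");
                              break;
                      }
              }
      }

      int main(void)
      {
              qdisc_run_quota(16);            /* two bulks of 8 exhaust the quota */
              return 0;
      }
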
  23. 08 Oct 2014, 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping the dst in validate_xmit_skb() is certainly too late in case
      the packet was queued by CPU X but dequeued by CPU Y.
      
      The logical point to take care of drop/force is in __dev_queue_xmit(),
      before even taking the qdisc lock.
      
      As Julian Anastasov pointed out, the need for skb_dst() might come from
      some packet schedulers or classifiers.
      
      This patch adds a new helper to cleanly express the needs of various
      drivers or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call the
      following helper in their setup instead of the prior:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02875878
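
      A kernel-style sketch of the two-bit scheme described above, with made-up
      flag values and a mock struct: one releasable bit that bonding/team may
      rebuild, plus a permanent bit that the helper also clears so the releasable
      bit is never rebuilt for such a device. Treat it as an illustration of the
      commit text, not as the exact netdevice.h code:

      #define IFF_XMIT_DST_RELEASE       0x1  /* may be rebuilt by bonding/team */
      #define IFF_XMIT_DST_RELEASE_PERM  0x2  /* permanent, blocks the rebuild */

      struct net_device {
              unsigned int priv_flags;
      };

      /* Drivers needing skb_dst() in ndo_start_xmit() call this in their setup. */
      static inline void netif_keep_dst(struct net_device *dev)
      {
              dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
      }
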
  24. 04 Oct 2014, 3 commits
    • qdisc: validate skb without holding lock · 55a93b3e
      Committed by Eric Dumazet
      Validation of an skb can be pretty expensive:
      
      GSO segmentation and/or checksum computations.
      
      We can do this without holding qdisc lock, so that other cpus
      can queue additional packets.
      
      The trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() can either validate a fresh skb list
      or directly use an old one.
      
      Tested on a 40Gbit NIC (8 TX queues) with 200 concurrent flows on a
      48-thread host.
      
      Turning TSO on or off had no effect on throughput, only a few more CPU
      cycles. Lock contention on the qdisc lock disappeared.
      
      Same when disabling TX checksum offload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55a93b3e
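
      A hedged sketch of the trick described above: a boolean tells the transmit
      path whether the skb list is fresh (validate it, outside the qdisc lock) or
      consists of already-validated requeued packets (use it as-is). Mock types;
      the real change is in sch_direct_xmit() and validate_xmit_skb():

      #include <stdbool.h>
      #include <stddef.h>

      struct pkt { struct pkt *next; };

      /* Stand-in for validation work: GSO segmentation, checksums, ... */
      static struct pkt *validate_list(struct pkt *list)
      {
              return list;                    /* pretend validation succeeded */
      }

      /* Requeued packets pass validate=false and skip the expensive work. */
      static int direct_xmit(struct pkt *list, bool validate)
      {
              if (validate)                   /* done without holding the qdisc lock */
                      list = validate_list(list);

              /* ...hand 'list' to the driver under the TXQ lock... */
              return list != NULL;
      }
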
    • qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0
      Committed by Jesper Dangaard Brouer
      The TSO and GSO segmented packets already benefit from bulking
      on their own.
      
      The TSO packets have always taken advantage of updating the
      tailptr only once for a large packet.
      
      The GSO segmented packets have recently taken advantage of the
      bulking xmit_more API, via merge commit 53fda7f7 ("Merge
      branch 'xmit_list'"), specifically via commit 7f2e870f ("net:
      Move main gso loop out of dev_hard_start_xmit() into helper.")
      allowing qdisc requeue of the remaining list, and via commit
      ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames.").
      
      This patch allows further bulking of TSO/GSO packets together
      when dequeueing from the qdisc.
      
      Testing:
       Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
      netperf-wrapper. Bulking several TSO packets shows no performance
      regression (requeues were in the area of 32 requeues/sec).
      
      Bulking several GSOs does show a small regression or a very small
      improvement (requeues were in the area of 8000 requeues/sec).
      
       Using ixgbe at 10Gbit/s with GSO bulking, we can measure some additional
      latency. The base case, which is "normal" GSO bulking, sees the high-prio
      queue delay vary between 0.38ms and 0.47ms.  Bulking several GSOs
      together results in a stable high-prio queue delay of 0.50ms.
      
       Using igb at 100Mbit/s with GSO bulking shows an improvement.
      The base case sees the high-prio queue delay vary between 2.23ms and 2.35ms.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      808e7ac0
    • qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3
      Committed by Jesper Dangaard Brouer
      Based on DaveM's recent API work on dev_hard_start_xmit(), which allows
      sending/processing an entire skb list.
      
      This patch implements qdisc bulk dequeue, by allowing multiple packets
      to be dequeued in dequeue_skb().
      
      The optimization principle for this is twofold: (1) amortize
      locking cost and (2) avoid the expensive tailptr update for notifying HW.
       (1) Several packets are dequeued while holding the qdisc root_lock,
      amortizing locking cost over several packets.  The dequeued SKB list is
      processed under the TXQ lock in dev_hard_start_xmit(), thus also
      amortizing the cost of the TXQ lock.
       (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
      API to delay the HW tailptr update, which also reduces the cost per
      packet.
      
      One restriction of the new API is that every SKB must belong to the
      same TXQ.  This patch takes the easy way out by restricting bulk
      dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
      the qdisc has only a single TXQ attached.
      
      Some detail about the flow: dev_hard_start_xmit() will process the skb
      list and transmit packets individually towards the driver (see
      xmit_one()).  In case the driver stops midway through the list, the
      remaining skb list is returned by dev_hard_start_xmit().  In
      sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
      
      To avoid overshooting the HW limits, which results in requeuing, the
      patch limits the amount of bytes dequeued, based on the driver's BQL
      limits.  In effect, bulking will only happen for BQL-enabled drivers.
      
      Small amounts of extra HoL blocking (2x MTU / 0.24ms) were
      measured at 100Mbit/s with bulking of 8 packets, but the
      oscillating nature of the measurement indicates that something
      like sched latency might be causing this effect. More comparisons
      show that this oscillation occasionally goes away. Thus, we
      disregard this artifact completely and remove any "magic" bulking
      limit.
      
      For now, as a conservative approach, stop bulking when seeing TSO and
      segmented GSO packets.  They already benefit from bulking on their own.
      A followup patch adds this, to allow easier bisectability for finding
      regressions.
      
      Joint work with Hannes, Daniel and Florian.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5772e9a3
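
      A simplified sketch of the byte-limited bulking described above: the first
      packet is always dequeued, then bulking continues only while the driver's
      BQL byte budget has room, which is why bulking effectively happens only for
      BQL-enabled drivers. Mock structures, not the kernel's
      try_bulk_dequeue_skb():

      #include <stddef.h>

      struct pkt {
              struct pkt  *next;
              unsigned int len;
      };

      struct txq {
              struct pkt  *head;              /* qdisc queue feeding this TXQ */
              int          bql_budget;        /* bytes BQL still allows in flight */
      };

      static struct pkt *bulk_dequeue_bql(struct txq *q)
      {
              struct pkt *head = q->head, *tail = head;
              int budget;

              if (!head)
                      return NULL;
              q->head = head->next;
              budget = q->bql_budget - (int)head->len;

              while (budget > 0 && q->head) {  /* keep bulking within BQL budget */
                      struct pkt *p = q->head;

                      q->head = p->next;
                      budget -= (int)p->len;
                      tail->next = p;
                      tail = p;
              }
              tail->next = NULL;
              return head;    /* one list: one root_lock hold, one tailptr update */
      }
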
  25. 30 Sep 2014, 1 commit
    • net: sched: make bstats per cpu and estimator RCU safe · 22e0f8b9
      Committed by John Fastabend
      In order to run qdiscs without locking, statistics and estimators
      need to be handled correctly.
      
      To resolve this, make the bstats per-CPU. And because this is
      only needed for qdiscs that run without locks, which will not be the
      case for most qdiscs in the near future, only create percpu
      stats when a qdisc sets the TCQ_F_CPUSTATS flag.
      
      Next, because estimators use the bstats to calculate packets per
      second and bytes per second, the estimator code paths are updated
      to use the per-CPU statistics.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22e0f8b9
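
      A hedged model of the per-CPU byte/packet counters: the fast path bumps
      only its own CPU's slot with no qdisc lock, and the estimator / dump path
      folds all CPUs into one snapshot. A plain C array indexed by CPU id stands
      in for the kernel's real percpu allocations guarded by TCQ_F_CPUSTATS:

      #include <stdint.h>
      #include <stdio.h>

      #define NR_CPUS 8

      struct bstats {
              uint64_t bytes;
              uint64_t packets;
      };

      static struct bstats percpu_bstats[NR_CPUS];

      /* Fast path: each CPU touches only its own slot, so no qdisc lock is needed. */
      static void bstats_update(int cpu, unsigned int len)
      {
              percpu_bstats[cpu].bytes   += len;
              percpu_bstats[cpu].packets += 1;
      }

      /* Slow path (estimator, stats dump): fold all CPUs into one snapshot. */
      static struct bstats bstats_read(void)
      {
              struct bstats sum = { 0, 0 };

              for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                      sum.bytes   += percpu_bstats[cpu].bytes;
                      sum.packets += percpu_bstats[cpu].packets;
              }
              return sum;
      }

      int main(void)
      {
              bstats_update(0, 1500);
              bstats_update(3, 64);
              printf("bytes=%llu packets=%llu\n",
                     (unsigned long long)bstats_read().bytes,
                     (unsigned long long)bstats_read().packets);
              return 0;
      }
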
  26. 20 Sep 2014, 1 commit
  27. 14 Sep 2014, 2 commits
  28. 04 Sep 2014, 1 commit
  29. 03 Sep 2014, 1 commit
  30. 02 Sep 2014, 2 commits