1. 18 5月, 2017 4 次提交
  2. 22 4月, 2017 1 次提交
    • W
      net_sched: move the empty tp check from ->destroy() to ->delete() · 763dbf63
      WANG Cong 提交于
      We could have a race condition where in ->classify() path we
      dereference tp->root and meanwhile a parallel ->destroy() makes it
      a NULL. Daniel cured this bug in commit d9363774
      ("net, sched: respect rcu grace period on cls destruction").
      
      This happens when ->destroy() is called for deleting a filter to
      check if we are the last one in tp, this tp is still linked and
      visible at that time. The root cause of this problem is the semantic
      of ->destroy(), it does two things (for non-force case):
      
      1) check if tp is empty
      2) if tp is empty we could really destroy it
      
      and its caller, if cares, needs to check its return value to see if it
      is really destroyed. Therefore we can't unlink tp unless we know it is
      empty.
      
      As suggested by Daniel, we could actually move the test logic to ->delete()
      so that we can safely unlink tp after ->delete() tells us the last one is
      just deleted and before ->destroy().
      
      Fixes: 1e052be6 ("net_sched: destroy proto tp when all filters are gone")
      Cc: Roi Dayan <roid@mellanox.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      763dbf63
  3. 13 3月, 2017 1 次提交
  4. 11 2月, 2017 2 次提交
  5. 09 1月, 2017 4 次提交
    • W
      net-tc: convert tc_from to tc_from_ingress and tc_redirected · bc31c905
      Willem de Bruijn 提交于
      The tc_from field fulfills two roles. It encodes whether a packet was
      redirected by an act_mirred device and, if so, whether act_mirred was
      called on ingress or egress. Split it into separate fields.
      
      The information is needed by the special IFB loop, where packets are
      taken out of the normal path by act_mirred, forwarded to IFB, then
      reinjected at their original location (ingress or egress) by IFB.
      
      The IFB device cannot use skb->tc_at_ingress, because that may have
      been overwritten as the packet travels from act_mirred to ifb_xmit,
      when it passes through tc_classify on the IFB egress path. Cache this
      value in skb->tc_from_ingress.
      
      That field is valid only if a packet arriving at ifb_xmit came from
      act_mirred. Other packets can be crafted to reach ifb_xmit. These
      must be dropped. Set tc_redirected on redirection and drop all packets
      that do not have this bit set.
      
      Both fields are set only on cloned skbs in tc actions, so original
      packet sources do not have to clear the bit when reusing packets
      (notably, pktgen and octeon).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc31c905
    • W
      net-tc: convert tc_at to tc_at_ingress · 8dc07fdb
      Willem de Bruijn 提交于
      Field tc_at is used only within tc actions to distinguish ingress from
      egress processing. A single bit is sufficient for this purpose.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dc07fdb
    • W
      net-tc: convert tc_verd to integer bitfields · a5135bcf
      Willem de Bruijn 提交于
      Extract the remaining two fields from tc_verd and remove the __u16
      completely. TC_AT and TC_FROM are converted to equivalent two-bit
      integer fields tc_at and tc_from. Where possible, use existing
      helper skb_at_tc_ingress when reading tc_at. Introduce helper
      skb_reset_tc to clear fields.
      
      Not documenting tc_from and tc_at, because they will be replaced
      with single bit fields in follow-on patches.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5135bcf
    • W
      net-tc: extract skip classify bit from tc_verd · e7246e12
      Willem de Bruijn 提交于
      Packets sent by the IFB device skip subsequent tc classification.
      A single bit governs this state. Move it out of tc_verd in
      anticipation of removing that __u16 completely.
      
      The new bitfield tc_skip_classify temporarily uses one bit of a
      hole, until tc_verd is removed completely in a follow-up patch.
      
      Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
      With that many options, little value in documenting it.
      
      Introduce a helper function to deduplicate the logic in the two
      sites that check this bit.
      
      The field tc_skip_classify is set only in IFB on skbs cloned in
      act_mirred, so original packet sources do not have to clear the
      bit when reusing packets (notably, pktgen and octeon).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7246e12
  6. 06 12月, 2016 1 次提交
    • E
      net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Eric Dumazet 提交于
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c0d32fd
  7. 19 9月, 2016 2 次提交
  8. 26 8月, 2016 1 次提交
  9. 11 8月, 2016 1 次提交
  10. 26 6月, 2016 2 次提交
    • E
      net_sched: generalize bulk dequeue · 4d202a0d
      Eric Dumazet 提交于
      When qdisc bulk dequeue was added in linux-3.18 (commit
      5772e9a3 "qdisc: bulk dequeue support for qdiscs
      with TCQ_F_ONETXQUEUE"), it was constrained to some
      specific qdiscs.
      
      With some extra care, we can extend this to all qdiscs,
      so that typical traffic shaping solutions can benefit from
      small batches (8 packets in this patch).
      
      For example, HTB is often used on some multi queue device.
      And bonding/team are multi queue devices...
      
      Idea is to bulk-dequeue packets mapping to the same transmit queue.
      
      This brings between 35 and 80 % performance increase in HTB setup
      under pressure on a bonding setup :
      
      1) NUMA node contention :   610,000 pps -> 1,110,000 pps
      2) No node contention   : 1,380,000 pps -> 1,930,000 pps
      
      Now we should work to add batches on the enqueue() side ;)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d202a0d
    • E
      net_sched: drop packets after root qdisc lock is released · 520ac30f
      Eric Dumazet 提交于
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue(), and in qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      520ac30f
  11. 16 6月, 2016 1 次提交
    • E
      net_sched: add the ability to defer skb freeing · 1b5c5493
      Eric Dumazet 提交于
      qdisc are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of disc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b5c5493
  12. 11 6月, 2016 1 次提交
    • E
      net_sched: remove generic throttled management · 45f50bed
      Eric Dumazet 提交于
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one ore more packets are stuck in one or more throttled
      HTB class, a htb dequeue() performs two atomic operations
      to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
      lock is held.
      
      Removing this pair of atomic operations bring me a 8 % performance
      increase on 200 TCP_RR tests, in presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      disc_is_throttled() anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45f50bed
  13. 10 6月, 2016 1 次提交
  14. 09 6月, 2016 4 次提交
    • F
      sched: place state, next_sched and gso_skb in same cacheline again · c8945043
      Florian Westphal 提交于
      Earlier commits removed two members from struct Qdisc which places
      next_sched/gso_skb into a different cacheline than ->state.
      
      This restores the struct layout to what it was before the removal.
      Move the two members, then add an annotation so they all reside in the
      same cacheline.
      
      This adds a 16 byte hole after cpu_qstats.
      
      The hole could be closed but as it doesn't decrease total struct size just
      do it this way.
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8945043
    • F
      sched: remove qdisc->drop · a09ceb0e
      Florian Westphal 提交于
      after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
      more callers of ->drop() outside of other ->drop functions, i.e.
      nothing calls them.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a09ceb0e
    • F
      sched: remove qdisc_rehape_fail · c3a173d7
      Florian Westphal 提交于
      After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
      is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3a173d7
    • F
      cbq: remove TCA_CBQ_POLICE support · dd47c1fa
      Florian Westphal 提交于
      iproute2 doesn't implement any cbq option that results in this attribute
      being sent to kernel.
      
      To make use of it, user would have to
      
      - patch iproute2
      - add a class
      - attach a qdisc to the class (default pfifo doesn't work as
        q->handle is 0 and cbq_set_police() is a no-op in this case)
      - re-'add' the same class (tc class change ...) again
      - user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since
        this 'police' feature relies on its presence
      - the added qdisc must be one of bfifo, pfifo or netem
      
      If all of these conditions are met and _some_ leaf qdiscs, namely
      p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
      cbq, which will attempt to re-queue the skb into a different class
      as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
      
      [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
      
      This feature, which isn't documented or implemented in iproute2,
      and isn't implemented consistently (most qdiscs like sfq, codel, etc
      drop right away instead of attempting this reclassification) is the
      sole reason for the reshape_fail and __parent member in Qdisc struct.
      
      So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
      so userspace knows we don't support it, and then remove no-longer needed
      infrastructure in followup commit.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd47c1fa
  15. 08 6月, 2016 3 次提交
  16. 07 6月, 2016 1 次提交
  17. 17 5月, 2016 1 次提交
  18. 04 3月, 2016 1 次提交
  19. 01 3月, 2016 2 次提交
  20. 11 1月, 2016 1 次提交
  21. 04 12月, 2015 1 次提交
    • E
      net_sched: fix qdisc_tree_decrease_qlen() races · 4eaf3b84
      Eric Dumazet 提交于
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not : Daniele
      reported underflows errors :
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking : concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal,
      as soon as it meets a mq/mqprio qdisc children.
      
      Second problem can be fixed by RCU protection.
      Qdisc are already freed after RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use appropriate rcu list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
      Reported-by: NDaniele Fucini <dfucini@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eaf3b84
  22. 18 9月, 2015 2 次提交
    • A
      bpf: add bpf_redirect() helper · 27b29f63
      Alexei Starovoitov 提交于
      Existing bpf_clone_redirect() helper clones skb before redirecting
      it to RX or TX of destination netdev.
      Introduce bpf_redirect() helper that does that without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1
      The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
      that does bpf_clone_redirect() and performance is 2.0 Mpps
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison linux bridge in this setup is doing 2.1 Mpps
      and ixgbe rx + drop in ip_rcv - 7.8 Mpps
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27b29f63
    • D
      cls_bpf: introduce integrated actions · 045efa82
      Daniel Borkmann 提交于
      Often cls_bpf classifier is used with single action drop attached.
      Optimize this use case and let cls_bpf return both classid and action.
      For backwards compatibility reasons enable this feature under
      TCA_BPF_FLAG_ACT_DIRECT flag.
      
      Then more interesting programs like the following are easier to write:
      int cls_bpf_prog(struct __sk_buff *skb)
      {
        /* classify arp, ip, ipv6 into different traffic classes
         * and drop all other packets
         */
        switch (skb->protocol) {
        case htons(ETH_P_ARP):
          skb->tc_classid = 1;
          break;
        case htons(ETH_P_IP):
          skb->tc_classid = 2;
          break;
        case htons(ETH_P_IPV6):
          skb->tc_classid = 3;
          break;
        default:
          return TC_ACT_SHOT;
        }
      
        return TC_ACT_OK;
      }
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      045efa82
  23. 28 8月, 2015 1 次提交
  24. 09 7月, 2015 1 次提交