1. 23 Oct, 2018 2 commits
  2. 19 Oct, 2018 1 commit
    • P
      net: sched: Fix for duplicate class dump · 3c53ed8f
      Phil Sutter committed
      When dumping classes by parent, kernel would return classes twice:
      
      | # tc qdisc add dev lo root prio
      | # tc class show dev lo
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | # tc class show dev lo parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      
      This comes from qdisc_match_from_root() potentially returning the root
      qdisc itself if its handle matched. Though in that case, root's classes
      were already dumped a few lines above.
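
      The double dump described above can be reproduced with a small userspace toy model. This is only a sketch of the control flow, not kernel code: `Qdisc`, `match_from_root`, `dump_classes` and the `skip_root_match` switch are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Qdisc:
    handle: int                                   # major handle, e.g. 0x8001
    classes: list = field(default_factory=list)   # class names owned by this qdisc
    children: list = field(default_factory=list)  # child qdiscs


def match_from_root(root, handle):
    """Find a qdisc by handle; like the lookup in the commit, it may return the root."""
    if root.handle == handle:
        return root
    for child in root.children:
        if child.handle == handle:
            return child
    return None


def dump_classes(root, parent_handle=None, skip_root_match=True):
    """Dump the root's classes, then the classes of the qdisc matching parent_handle."""
    out = list(root.classes)          # the root is always dumped first
    if parent_handle is not None:
        q = match_from_root(root, parent_handle)
        # Without the "q is root" guard the root's classes are emitted twice.
        if q is not None and not (skip_root_match and q is root):
            out.extend(q.classes)
    return out
```

      With a root whose handle is 0x8001 and three classes, `dump_classes(root, 0x8001, skip_root_match=False)` yields six entries, while the guarded call yields three.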
      
      Fixes: cb395b20 ("net: sched: optimize class dumps")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 16 Oct, 2018 3 commits
    • E
      net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() · 7baf33bd
      Eric Dumazet committed
      With the new EDT model, sch_fq no longer has to special-case
      TCP pure ACKs, since their skb->tstamp allows them
      to be sent without pacing delay.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet committed
      sk_pacing_rate was introduced as a u32 field in 2013,
      effectively limiting per-flow pacing to 34Gbit.
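
      The 34Gbit figure follows from simple arithmetic, assuming the field counts bytes per second (which is consistent with the number quoted) and that the Mbps values printed by ss are decimal megabits. The helper names below are hypothetical and exist only for this worked example.

```python
U32_MAX = 2**32 - 1   # largest value a u32 field can hold
U64_MAX = 2**64 - 1   # largest value an unsigned long holds on a 64-bit host


def bytes_per_sec_to_gbit(rate):
    """Convert a rate in bytes per second to decimal gigabits per second."""
    return rate * 8 / 1e9


def mbps_to_bytes_per_sec(mbps):
    """Convert decimal megabits per second to bytes per second."""
    return int(mbps * 1e6 / 8)


def fits_in_u32(rate):
    """True if a bytes-per-second rate can be stored in a u32 field."""
    return rate <= U32_MAX
```

      `bytes_per_sec_to_gbit(U32_MAX)` is roughly 34.36, and the `pacing_rate 38705.0Mbps` in the ss sample of this commit message converts to 4838125000 bytes per second, which no longer fits in 32 bits.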
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate fields were already
      exported as 64bit, so the iproute2/ss command requires no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/sched: cls_api: add missing validation of netlink attributes · e331473f
      Davide Caratti committed
      Similarly to what has been done in 8b4c3cdd ("net: sched: Add policy
      validation for tc attributes"), fix classifier code to add validation of
      TCA_CHAIN and TCA_KIND netlink attributes.
      
      tested with:
       # ./tdc.py -c filter
      
      v2: Let sch_api and cls_api share nla_policy they have in common, thanks
          to David Ahern.
      v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
          by TC modules, thanks to Cong Wang.
          While at it, restore the 'Delete / get qdisc' comment to its original
          position, just above tc_get_qdisc() function prototype.
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 Oct, 2018 1 commit
  5. 09 Oct, 2018 13 commits
  6. 08 Oct, 2018 2 commits
    • A
      net: sched: cls_u32: fix hnode refcounting · 6d4c4077
      Al Viro committed
      cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
      via ->hlist and via ->tp_root together.  u32_destroy() drops the former
      and, in case when there had been links, leaves the sucker on the list.
      As a result, there's nothing to protect it from getting freed once links
      are dropped.
      That also makes the "is it busy" check incapable of catching the root
      hnode - it *is* busy (there's a reference from tp), but we don't see it as
      something separate.  "Is it our root?" check partially covers that, but
      the problem exists for others' roots as well.
      
      AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
      include oopsen, that is) would be this:
              * count tp->root and tp_c->hlist as separate references, i.e.
                have u32_init() set refcount to 2, not 1;
              * in u32_destroy() we always drop the former;
                in u32_destroy_hnode(), the latter.
      
      That way we have *all* references contributing to refcount.  List
      removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
      and in u32_destroy() in case of tc_u_common going away, along with
      everything reachable from it.  IOW, that way we know that
      u32_destroy_key() won't free something still on the list (or pointed to by
      someone's ->root).
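
      The difference between the two counting schemes can be illustrated with a toy userspace model. It is a sketch under stated assumptions rather than the cls_u32 code: `HNode` and `simulate` are hypothetical names, and the scenario is reduced to one link being taken, the filter being destroyed, and the link being dropped.

```python
class HNode:
    """Toy stand-in for a hash node that sits on a list and is reference-counted."""

    def __init__(self, initial_refcnt):
        self.refcnt = initial_refcnt
        self.on_list = True
        self.freed = False


def simulate(initial_refcnt):
    """Run link -> filter destroy -> link drop and report (freed, still_on_list).

    initial_refcnt=1 models one shared count for the root pointer and the list;
    initial_refcnt=2 models counting the two as separate references.
    """
    h = HNode(initial_refcnt)

    h.refcnt += 1                      # another node links to h

    h.refcnt -= 1                      # filter destroyed: drop the root reference
    only_list_ref_left = initial_refcnt - 1
    if h.refcnt == only_list_ref_left:
        h.on_list = False              # nobody else uses it: unlist and free
        h.freed = True

    if not h.freed:
        h.refcnt -= 1                  # the link is dropped later
        if h.refcnt == 0:
            h.freed = True             # freed without being taken off the list

    return h.freed, h.on_list
```

      In this model `simulate(1)` ends with the node freed while still on the list, which is the hazard the commit describes, whereas `simulate(2)` leaves it alive because the list membership holds its own reference.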
      
      Reproducer:
      
      tc qdisc add dev eth0 ingress
      tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
      plus 0 eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 200
      tc filter change dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
      eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 100
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • L
      net: sched: pie: fix coding style issues · ac4a02c5
      Leslie Monis committed
      Fix 5 warnings and 14 checks issued by checkpatch.pl:
      
      CHECK: Logical continuations should be on the previous line
      +	if ((q->vars.qdelay < q->params.target / 2)
      +	    && (q->vars.prob < MAX_PROB / 5))
      
      WARNING: line over 80 characters
      +		q->params.tupdate = usecs_to_jiffies(nla_get_u32(tb[TCA_PIE_TUPDATE]));
      
      CHECK: Blank lines aren't necessary after an open brace '{'
      +{
      +
      
      CHECK: braces {} should be used on all arms of this statement
      +			if (qlen < QUEUE_THRESHOLD)
      [...]
      +			else {
      [...]
      
      CHECK: Unbalanced braces around else statement
      +			else {
      
      CHECK: No space is necessary after a cast
      +	if (delta > (s32) (MAX_PROB / (100 / 2)) &&
      
      CHECK: Unnecessary parentheses around 'qdelay == 0'
      +	if ((qdelay == 0) && (qdelay_old == 0) && update_prob)
      
      CHECK: Unnecessary parentheses around 'qdelay_old == 0'
      +	if ((qdelay == 0) && (qdelay_old == 0) && update_prob)
      
      CHECK: Unnecessary parentheses around 'q->vars.prob == 0'
      +	if ((q->vars.qdelay < q->params.target / 2) &&
      +	    (q->vars.qdelay_old < q->params.target / 2) &&
      +	    (q->vars.prob == 0) &&
      +	    (q->vars.avg_dq_rate > 0))
      
      CHECK: Unnecessary parentheses around 'q->vars.avg_dq_rate > 0'
      +	if ((q->vars.qdelay < q->params.target / 2) &&
      +	    (q->vars.qdelay_old < q->params.target / 2) &&
      +	    (q->vars.prob == 0) &&
      +	    (q->vars.avg_dq_rate > 0))
      
      CHECK: Blank lines aren't necessary before a close brace '}'
      +
      +}
      
      CHECK: Comparison to NULL could be written "!opts"
      +	if (opts == NULL)
      
      CHECK: No space is necessary after a cast
      +			((u32) PSCHED_TICKS2NS(q->params.target)) /
      
      WARNING: line over 80 characters
      +	    nla_put_u32(skb, TCA_PIE_TUPDATE, jiffies_to_usecs(q->params.tupdate)) ||
      
      CHECK: Blank lines aren't necessary before a close brace '}'
      +
      +}
      
      CHECK: No space is necessary after a cast
      +		.delay		= ((u32) PSCHED_TICKS2NS(q->vars.qdelay)) /
      
      WARNING: Missing a blank line after declarations
      +	struct sk_buff *skb;
      +	skb = qdisc_dequeue_head(sch);
      
      WARNING: Missing a blank line after declarations
      +	struct pie_sched_data *q = qdisc_priv(sch);
      +	qdisc_reset_queue(sch);
      
      WARNING: Missing a blank line after declarations
      +	struct pie_sched_data *q = qdisc_priv(sch);
      +	q->params.tupdate = 0;
      Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 06 Oct, 2018 2 commits
    • K
      treewide: Replace more open-coded allocation size multiplications · 329e0989
      Kees Cook committed
      As done treewide earlier, this catches several more open-coded
      allocation size calculations that were added to the kernel during the
      merge window. This performs the following mechanical transformations
      using Coccinelle:
      
      	kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
      	kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
      	devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)
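
      Why the array-style helpers are preferable can be shown with a small model of unsigned size arithmetic. This is a sketch, not the kernel helpers themselves: a 64-bit size_t is assumed, and `open_coded_size` and `checked_array_size` are hypothetical names.

```python
SIZE_MAX = 2**64 - 1   # assumed size_t range on a 64-bit host


def open_coded_size(n, size):
    """Model `n * size` in unsigned C arithmetic: the product silently wraps."""
    return (n * size) & SIZE_MAX


def checked_array_size(n, size):
    """Model an overflow-checked multiplication: return None instead of wrapping."""
    total = n * size
    if total > SIZE_MAX:
        return None    # the caller would treat this as an allocation failure
    return total
```

      For example, `open_coded_size(2**63, 2)` wraps to 0, so an open-coded allocation would request far too little memory, while `checked_array_size(2**63, 2)` reports the overflow.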
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • D
      net: sched: Add policy validation for tc attributes · 8b4c3cdd
      David Ahern committed
      A number of TC attributes are processed without proper validation
      (e.g., length checks). Add a tca policy for all input attributes and use
      it when invoking nlmsg_parse.
      
      The 2 Fixes tags below cover the latest additions. The other attributes
      are a string (KIND), a nested attribute (OPTIONS, which does seem to be
      validated in most cases), used for dumps only, or a flag.
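
      What a length check against a policy buys can be sketched in userspace with a simplified attribute parser. This is an illustration only, not the kernel's parser: `parse_attrs`, `ATTR_CHAIN` and `POLICY` are hypothetical names, the attribute number is illustrative, and the policy is reduced to a minimum payload length per attribute type.

```python
import struct

NLA_HDRLEN = 4        # 16-bit length + 16-bit type
ATTR_CHAIN = 11       # illustrative attribute number (assumption)
POLICY = {ATTR_CHAIN: 4}   # minimum payload length in bytes per attribute type


def parse_attrs(buf, policy):
    """Walk length/type/value attributes and reject payloads shorter than the policy."""
    attrs = {}
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, off)
        if nla_len < NLA_HDRLEN or off + nla_len > len(buf):
            raise ValueError("malformed attribute header")
        payload = buf[off + NLA_HDRLEN:off + nla_len]
        min_len = policy.get(nla_type)
        if min_len is not None and len(payload) < min_len:
            raise ValueError("attribute %d too short" % nla_type)
        attrs[nla_type] = payload
        off += (nla_len + 3) & ~3      # attributes are padded to 4 bytes
    return attrs
```

      A well-formed 4-byte payload parses normally, while a 2-byte payload for the same attribute type is rejected before any code tries to read it as a 32-bit value.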
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Fixes: d47a6b0e ("net: sched: introduce ingress/egress block index attributes for qdisc")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 05 Oct, 2018 2 commits
    • C
      net_sched: convert idrinfo->lock from spinlock to a mutex · 95278dda
      Cong Wang committed
      In commit ec3ed293 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
      we moved fl_hw_destroy_tmplt() to a workqueue to avoid blocking
      with the spinlock held. Unfortunately, this causes a lot of
      trouble here:
      
      1. tcf_chain_destroy() could be called right after we queue the work
         but before the work runs. This is a use-after-free.
      
      2. The chain refcnt is already 0, we can't even just hold it again.
         We can check refcnt==1 but it is ugly.
      
      3. The chain with refcnt 0 is still visible in its block, which means
         it could be still found and used!
      
      4. The block has a refcnt too, we can't hold it without introducing a
         proper API either.
      
      We could make it work, but the end result is ugly. Instead of wasting
      time on reviewing it, let's just convert the troublesome spinlock to
      a mutex, which allows us to use non-atomic allocations too.
      
      Fixes: ec3ed293 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
      Reported-by: Ido Schimmel <idosch@idosch.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      tc: Add support for configuring the taprio scheduler · 5a781ccb
      Vinicius Costa Gomes committed
      This traffic scheduler allows traffic class states (transmission
      allowed/not allowed, in the simplest case) to be scheduled, according
      to a pre-generated time sequence. This is the basis of the IEEE
      802.1Qbv specification.
      
      Example configuration:
      
      tc qdisc replace dev enp3s0 parent root handle 100 taprio \
                num_tc 3 \
                map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
                queues 1@0 1@1 2@2 \
                base-time 1528743495910289987 \
                sched-entry S 01 300000 \
                sched-entry S 02 300000 \
                sched-entry S 04 300000 \
                clockid CLOCK_TAI
      
      The configuration format is similar to mqprio. The main difference is
      the presence of a schedule, built from multiple "sched-entry"
      definitions; each entry has the following format:
      
           sched-entry <CMD> <GATE MASK> <INTERVAL>
      
      The only supported <CMD> is "S", which means "SetGateStates",
      following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
      is a bitmask where each bit is associated with a traffic class, so
      bit 0 (the least significant bit) being "on" means that traffic class
      0 is "active" for that schedule entry. <INTERVAL> is a time duration
      in nanoseconds that specifies for how long that state defined by <CMD>
      and <GATE MASK> should be held before moving to the next entry.
      
      This schedule is circular, that is, after the last entry is executed
      it starts from the first one, indefinitely.
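
      The gate mask and the circular schedule described above can be modelled in a few lines. This is a sketch for illustration, not the scheduler's implementation: `active_classes` and `gate_mask_at` are hypothetical names, and a schedule is represented as a list of (gate mask, interval in nanoseconds) pairs.

```python
def active_classes(gate_mask):
    """Return the traffic classes whose bit is set; bit 0 is traffic class 0."""
    return [tc for tc in range(gate_mask.bit_length()) if (gate_mask >> tc) & 1]


def gate_mask_at(schedule, offset_ns):
    """Return the gate mask in force offset_ns after the schedule started.

    The schedule wraps around: after the last entry it restarts from the first.
    """
    cycle_time = sum(interval for _, interval in schedule)
    t = offset_ns % cycle_time
    for mask, interval in schedule:
        if t < interval:
            return mask
        t -= interval
    raise AssertionError("unreachable: t is always inside the cycle")


# The three entries from the example configuration in this commit message.
EXAMPLE_SCHEDULE = [(0x01, 300000), (0x02, 300000), (0x04, 300000)]
```

      With the example schedule, the mask is 0x01 at offset 0, 0x02 at 300000 ns, 0x04 at 899999 ns, and 0x01 again at 900000 ns, where the cycle restarts.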
      
      The other parameters can be defined as follows:
      
       - base-time: specifies the instant when the schedule starts; if
         'base-time' is a time in the past, the schedule will start at

             base-time + (N * cycle-time)

         where N is the smallest integer so the resulting time is greater
         than "now", and "cycle-time" is the sum of all the intervals of the
         entries in the schedule;
      
       - clockid: specifies the reference clock to be used;
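
      The base-time rule above amounts to a short calculation. The function below is a sketch of that formula only; `first_start` is a hypothetical name, all times are integers in nanoseconds, and a base-time in the future is assumed to be used as-is.

```python
def first_start(base_time, now, cycle_time):
    """Return the instant at which the schedule starts.

    If base_time lies in the future it is used directly. Otherwise the result is
    base_time + N * cycle_time for the smallest integer N that makes it
    strictly greater than now.
    """
    if base_time > now:
        return base_time
    n = (now - base_time) // cycle_time + 1
    return base_time + n * cycle_time
```

      For instance, with a base-time of 100, a cycle-time of 300 and "now" at 1000, N is 4 and the schedule starts at 1300; N = 3 would give exactly 1000, which is not greater than "now".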
      
      The parameters should be similar to what the IEEE 802.1Q family of
      specifications defines.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 02 Oct, 2018 3 commits
  10. 29 Sep, 2018 1 commit
  11. 26 Sep, 2018 8 commits
  12. 25 Sep, 2018 1 commit
  13. 22 Sep, 2018 1 commit