1. 27 Mar, 2020 1 commit
  2. 18 Mar, 2020 1 commit
    • net_sched: sch_fq: enable use of hrtimer slack · 583396f4
      Eric Dumazet committed
      Add a new attribute to control the fq qdisc hrtimer slack.
      
      Default is set to 10 usec.
      
      When/if packets are throttled, fq sets up an hrtimer that can
      lead to one interrupt per packet in the throttled queue.
      
      By using a timer slack, we allow better use of timer interrupts,
      by giving them a chance to call multiple timer callbacks
      at each hardware interrupt.
      
      Also, giving a slack allows FQ to dequeue batches of packets
      instead of a single one, thus increasing xmit_more efficiency.
      
      This has no negative effect on the rate a TCP flow can sustain,
      since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns).
      
      v2: added strict netlink checking (as feedback from Jakub Kicinski)
      
      Tested:
       1000 concurrent flows all using paced packets.
       1,000,000 packets sent per second.
      
      Before the patch :
      
      $ vmstat 2 10
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 60726784  23628 3485992    0    0   138     1  977  535  0 12 87  0  0
       0  0      0 60714700  23628 3485628    0    0     0     0 1568827 26462  0 22 78  0  0
       1  0      0 60716012  23628 3485656    0    0     0     0 1570034 26216  0 22 78  0  0
       0  0      0 60722420  23628 3485492    0    0     0     0 1567230 26424  0 22 78  0  0
       0  0      0 60727484  23628 3485556    0    0     0     0 1568220 26200  0 22 78  0  0
       2  0      0 60718900  23628 3485380    0    0     0    40 1564721 26630  0 22 78  0  0
       2  0      0 60718096  23628 3485332    0    0     0     0 1562593 26432  0 22 78  0  0
       0  0      0 60719608  23628 3485064    0    0     0     0 1563806 26238  0 22 78  0  0
       1  0      0 60722876  23628 3485236    0    0     0   130 1565874 26566  0 22 78  0  0
       1  0      0 60722752  23628 3484908    0    0     0     0 1567646 26247  0 22 78  0  0
      
      After the patch, slack of 10 usec, we can see a reduction of interrupts
      per second, and a small decrease of reported cpu usage.
      
      $ vmstat 2 10
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       1  0      0 60722564  23628 3484728    0    0   133     1  696  545  0 13 87  0  0
       1  0      0 60722568  23628 3484824    0    0     0     0 977278 25469  0 20 80  0  0
       0  0      0 60716396  23628 3484764    0    0     0     0 979997 25326  0 20 80  0  0
       0  0      0 60713844  23628 3484960    0    0     0     0 981394 25249  0 20 80  0  0
       2  0      0 60720468  23628 3484916    0    0     0     0 982860 25062  0 20 80  0  0
       1  0      0 60721236  23628 3484856    0    0     0     0 982867 25100  0 20 80  0  0
       1  0      0 60722400  23628 3484456    0    0     0     8 982698 25303  0 20 80  0  0
       0  0      0 60715396  23628 3484428    0    0     0     0 981777 25176  0 20 80  0  0
       0  0      0 60716520  23628 3486544    0    0     0    36 978965 27857  0 21 79  0  0
       0  0      0 60719592  23628 3486516    0    0     0    22 977318 25106  0 20 80  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
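The batching effect of a timer slack can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's hrtimer logic: the function name `count_wakeups` is hypothetical, the timer is assumed to fire exactly `slack_ns` after the earliest pending deadline, and every packet that is due by then is released in one batch. The model overstates the gain compared with the vmstat figures in the commit message.

```python
def count_wakeups(deadlines_ns, slack_ns):
    """Count timer firings needed to release every packet.

    Toy model (an assumption): the timer is armed for the earliest pending
    deadline, fires slack_ns later, and each firing releases every packet
    whose deadline has passed by then.
    """
    pending = sorted(deadlines_ns)
    wakeups = 0
    i = 0
    while i < len(pending):
        fire_at = pending[i] + slack_ns
        wakeups += 1
        # Release the whole batch that is due at this firing.
        while i < len(pending) and pending[i] <= fire_at:
            i += 1
    return wakeups


# 1,000,000 packets per second means one deadline every 1000 ns.
deadlines = [n * 1000 for n in range(1000)]
print(count_wakeups(deadlines, 0))       # no slack: one firing per packet
print(count_wakeups(deadlines, 10_000))  # 10 usec slack: packets leave in batches
```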
  3. 15 Mar, 2020 2 commits
    • net: sched: RED: Introduce an ECN nodrop mode · 0a7fad23
      Petr Machata committed
      When the RED Qdisc is currently configured to enable ECN, the RED algorithm
      is used to decide whether a certain SKB should be marked. If that SKB is
      not ECN-capable, it is early-dropped.
      
      It is also possible to keep all traffic in the queue, and just mark the
      ECN-capable subset of it, as appropriate under the RED algorithm. Some
      switches support this mode, and some installations make use of it.
      
      To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
      configured with this flag, non-ECT traffic is enqueued instead of being
      early-dropped.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
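The decision described in this commit message can be summarised as a small function. This is a minimal sketch, not the sch_red code: `red_action` and the two flag constants are illustrative names, and it assumes the nodrop flag only has an effect when ECN is also enabled.

```python
ECN = 0x1     # illustrative bit for "ECN marking enabled"
NODROP = 0x8  # illustrative bit for the new nodrop mode


def red_action(red_selects_packet, is_ect, flags):
    """Return what happens to a packet once RED has made its decision.

    red_selects_packet: True when the RED algorithm picked this packet for
    marking / early drop. is_ect: True when the packet is ECN-capable.
    """
    if not red_selects_packet:
        return "enqueue"
    if (flags & ECN) and is_ect:
        return "mark"
    # Assumption: nodrop is only meaningful together with ECN.
    if (flags & ECN) and (flags & NODROP):
        return "enqueue"
    return "drop"


print(red_action(True, False, ECN))           # non-ECT traffic, classic mode
print(red_action(True, False, ECN | NODROP))  # non-ECT traffic, nodrop mode
```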
    • net: sched: Allow extending set of supported RED flags · 14bc175d
      Petr Machata committed
      The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
      of global RED flags. These are passed in tc_red_qopt.flags. However none of
      these qdiscs validate the flag field, and just copy it over wholesale to
      internal structures, and later dump it back. (An exception is GRED, which
      does validate for VQs -- however not for the main setup.)
      
      A broken userspace can therefore configure a qdisc with arbitrary
      unsupported flags, and later expect to see the flags on qdisc dump. The
      current ABI therefore allows storage of several bits of custom data to
      qdisc instances of the types mentioned above. How many bits depends on
      which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
      flags ECN and HARDDROP, and the rest is not interpreted.
      
      If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
      and at the same time it needs to retain the possibility to store 6 bits of
      uninterpreted data. Likewise RED, which adds a new flag later in this
      patchset.
      
      To that end, this patch adds a new function, red_get_flags(), to split the
      passed flags of RED-like qdiscs to flags and user bits, and
      red_validate_flags() to validate the resulting configuration. It further
      adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
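The idea of splitting a legacy flags byte into interpreted flags and opaque user bits can be sketched as follows. This is an illustration only and not the kernel's red_get_flags() or red_validate_flags(): the helper names, the bit values and the 8-bit field width (two SFQ flags plus the six uninterpreted bits mentioned above) are assumptions.

```python
ECN = 0x1       # illustrative bit values
HARDDROP = 0x2
FIELD_MASK = 0xFF  # assumption: the legacy flags field is 8 bits wide


def split_flags(raw_flags, supported_mask):
    """Split a legacy flags byte into (interpreted flags, user bits)."""
    flags = raw_flags & supported_mask
    user_bits = raw_flags & ~supported_mask & FIELD_MASK
    return flags, user_bits


def flags_supported(flags, supported_mask):
    """Accept a configuration only if it uses no unsupported flag."""
    return (flags & ~supported_mask) == 0


# An SFQ-like qdisc that interprets ECN and HARDDROP only.
sfq_mask = ECN | HARDDROP
flags, user_bits = split_flags(0b10100011, sfq_mask)
print(bin(flags), bin(user_bits))
```

With this split, the user bits can be stored and dumped back unchanged, while only the interpreted flags go through validation.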
  4. 23 Jan, 2020 1 commit
  5. 19 Dec, 2019 1 commit
    • net: sch_ets: Add a new Qdisc · dcc68b4d
      Petr Machata committed
      Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
      PRIO-like in how it is configured, meaning one needs to specify how many
      bands there are, how many are strict and how many are dwrr, quanta for the
      latter, and priomap.
      
      The new Qdisc operates like the PRIO / DRR combo would when configured as
      per the standard. The strict classes, if any, are tried for traffic first.
      When there's no traffic in any of the strict queues, the ETS ones (if any)
      are treated in the same way as in DRR.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
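The dequeue order described above (strict bands first, then deficit round robin among the remaining bands) can be modelled in a few lines. This is a simplified sketch and not the sch_ets implementation: the class name `EtsSketch` is hypothetical, packets are represented only by their size in bytes, and all quanta are assumed to be positive.

```python
from collections import deque


class EtsSketch:
    def __init__(self, n_strict, quanta):
        self.n_strict = n_strict
        self.quanta = list(quanta)            # one quantum per DRR band
        self.bands = [deque() for _ in range(n_strict + len(quanta))]
        self.deficit = [0] * len(quanta)
        self.active = deque()                 # backlogged DRR bands, in service order

    def enqueue(self, band, size):
        queue = self.bands[band]
        if band >= self.n_strict and not queue:
            # A DRR band that becomes backlogged starts with a full quantum.
            i = band - self.n_strict
            self.deficit[i] = self.quanta[i]
            self.active.append(i)
        queue.append(size)

    def dequeue(self):
        # Strict bands are always tried first, lowest band number first.
        for band in range(self.n_strict):
            if self.bands[band]:
                return band, self.bands[band].popleft()
        # Otherwise serve the DRR bands according to their deficits.
        while self.active:
            i = self.active[0]
            queue = self.bands[self.n_strict + i]
            if queue[0] <= self.deficit[i]:
                self.deficit[i] -= queue[0]
                size = queue.popleft()
                if not queue:
                    self.active.popleft()
                return self.n_strict + i, size
            # Not enough deficit: top it up and move the band to the back.
            self.deficit[i] += self.quanta[i]
            self.active.rotate(-1)
        return None


ets = EtsSketch(n_strict=1, quanta=[300, 100])
for _ in range(6):
    ets.enqueue(1, 100)
for _ in range(2):
    ets.enqueue(2, 100)
ets.enqueue(0, 100)

order = []
while (item := ets.dequeue()) is not None:
    order.append(item[0])
print(order)
```

In this run the strict band is drained before anything else, and the two DRR bands then share the link roughly in the 3:1 ratio of their quanta.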
  6. 21 Nov, 2019 1 commit
  7. 17 Sep, 2019 1 commit
    • taprio: Add support for hardware offloading · 9c66d156
      Vinicius Costa Gomes committed
      This allows taprio to offload the schedule enforcement to capable
      network cards, resulting in more precise windows and less CPU usage.
      
      The gate mask acts on traffic classes (groups of queues of same
      priority), as specified in IEEE 802.1Q-2018, and following the existing
      taprio and mqprio semantics.
      It is up to the driver to perform conversion between tc and individual
      netdev queues if for some reason it needs to make that distinction.
      
      Full offload is requested from the network interface by specifying
      "flags 2" in the tc qdisc creation command, which in turn corresponds to
      the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.
      
      The important detail here is the clockid which is implicitly /dev/ptpN
      for full offload, and hence not configurable.
      
      A reference counting API is added to support the use case where Ethernet
      drivers need to keep the taprio offload structure locally (i.e. they are
      a multi-port switch driver, and configuring a port depends on the
      settings of other ports as well). The refcount_t variable is kept in a
      private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
      
      In the future, the private structure might also be expanded with a
      backpointer to taprio_sched *q, to implement the notification system
      described in the patch (of when admin became oper, or an error occurred,
      etc, so the offload can be monitored with 'tc qdisc show').
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 17 Jul, 2019 1 commit
  9. 10 Jul, 2019 1 commit
  10. 29 Jun, 2019 3 commits
    • taprio: Add support for txtime-assist mode · 4cfd5779
      Vedang Patel committed
      Currently, we are seeing non-critical packets being transmitted outside of
      their timeslice. We can confirm that the packets are being dequeued at the
      right time. So, the delay is induced on the hardware side.  The most likely
      reason is that the hardware queues are starving the lower priority queues.
      
      In order to improve the performance of taprio, we will be making use of the
      txtime feature provided by the ETF qdisc. For all the packets which do not
      have the SO_TXTIME option set, taprio will set the transmit timestamp (set
      in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit
      time for the packet is set to when the gate is open. If SO_TXTIME is set,
      the TAPrio qdisc will validate whether the timestamp (in skb->tstamp)
      occurs when the gate corresponding to skb's traffic class is open.
      
      The following two parameters are added to support this mode:
      - flags: used to enable txtime-assist mode. Will also be used to enable
        other modes (like hardware offloading) later.
      - txtime-delay: This indicates the minimum time it will take for the packet
        to hit the wire. This is useful in determining whether we can transmit
        the packet in the remaining time if the gate corresponding to the packet is
        currently open.
      
      An example configuration for enabling txtime-assist:
      
      tc qdisc replace dev eth0 parent root handle 100 taprio \
            num_tc 3 \
            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
            queues 1@0 1@0 1@0 \
            base-time 1558653424279842568 \
            sched-entry S 01 300000 \
            sched-entry S 02 300000 \
            sched-entry S 04 400000 \
            flags 0x1 \
            txtime-delay 40000 \
            clockid CLOCK_TAI

      tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \
            offload delta 200000 clockid CLOCK_TAI
      
      Note that all the traffic classes are mapped to the same queue.  This is
      only possible in taprio when txtime-assist is enabled. Also, note that the
      ETF Qdisc is enabled with offload mode set.
      
      In this mode, if the packet's traffic class is open and the complete packet
      can be transmitted, taprio will try to transmit the packet immediately.
      This will be done by setting skb->tstamp to current_time + the time delta
      indicated in the txtime-delay parameter. This parameter indicates the time
      taken (in software) for the packet to reach the network adapter.
      
      If the packet cannot be transmitted in the current interval or if the
      packet's traffic class is not currently transmitting, the skb->tstamp is set to
      the next available timestamp value. This is tracked in the next_launchtime
      parameter in the struct sched_entry.
      
      The behaviour w.r.t admin and oper schedules is not changed from what is
      present in software mode.
      
      The transmit time is already known in advance. So, we do not need the HR
      timers to advance the schedule and wake up the dequeue side of taprio.  So,
      the HR timer won't be run when this mode is enabled.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
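The timestamp choice described for txtime-assist mode can be written as a single decision. This is a rough sketch, not taprio's code: `pick_txtime` and all of its parameters are hypothetical, and the "complete packet can be transmitted" test is assumed to mean that the packet, launched after `txtime_delay_ns`, finishes before the current interval ends.

```python
def pick_txtime(now_ns, txtime_delay_ns, gate_open, interval_end_ns,
                tx_duration_ns, next_launchtime_ns):
    """Pick the transmit timestamp for a packet without SO_TXTIME.

    gate_open: whether the gate for the packet's traffic class is open now.
    next_launchtime_ns: next time at which that gate allows a launch.
    """
    earliest = now_ns + txtime_delay_ns
    # Assumption: the packet "fits" if it completes before the interval ends.
    if gate_open and earliest + tx_duration_ns <= interval_end_ns:
        return earliest
    return next_launchtime_ns


print(pick_txtime(1_000, 40, True, 2_000, 100, 5_000))   # sent right away
print(pick_txtime(1_950, 40, True, 2_000, 100, 5_000))   # does not fit, deferred
print(pick_txtime(1_000, 40, False, 2_000, 100, 5_000))  # gate closed, deferred
```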
    • etf: Add skip_sock_check · d14d2b20
      Vedang Patel committed
      Currently, etf expects a socket with SO_TXTIME option set for each packet
      it encounters. So, it will drop all other packets. But, in future
      commits we are planning to add functionality where the tstamp value will be set
      by another qdisc. Also, some packets which are generated from within the
      kernel (e.g. ICMP packets) do not have any socket associated with them.
      
      So, this commit adds support for skip_sock_check. When this option is set,
      etf will skip checking for a socket and other associated options for all
      skbs.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • etf: Don't use BIT() in UAPI headers. · 9903c8dc
      Vedang Patel committed
      The BIT() macro isn't exported as part of the UAPI interface. So, the
      compile-test to ensure they are self contained fails. So, use _BITUL()
      instead.
      Signed-off-by: Vedang Patel <vedang.patel@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 May, 2019 3 commits
    • taprio: Add support for cycle-time-extension · c25031e9
      Vinicius Costa Gomes committed
      IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
      last entry of a schedule before the start of a new schedule can be
      extended, so "too-short" entries can be avoided.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support for setting the cycle-time manually · 6ca6a665
      Vinicius Costa Gomes committed
      IEEE 802.1Q-2018 defines that the cycle-time of a schedule may be
      overridden, so the schedule is truncated to a determined "width".
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support adding an admin schedule · a3d43c0d
      Vinicius Costa Gomes committed
      The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
      operational?) and "Admin" ones. Up until now, 'taprio' only had
      support for the "Oper" one, added when the qdisc is created. This adds
      support for the "Admin" one, which allows the .change() operation to
      be supported.
      
      Just for clarification, some quick (and dirty) definitions, the "Oper"
      schedule is the currently (as in this instant) running one, and it's
      read-only. The "Admin" one is the one that the system configurator has
      installed, it can be changed, and it will be "promoted" to "Oper" when
      its 'base-time' is reached.
      
      The idea behind this patch is that calling something like the below,
      (after taprio is already configured with an initial schedule):
      
      $ tc qdisc change taprio dev IFACE parent root \
             base-time X \
             sched-entry <CMD> <GATES> <INTERVAL> \
             ...
      
      Will cause a new admin schedule to be created and programmed to be
      "promoted" to "Oper" at instant X. If an "Admin" schedule already
      exists, it will be overwritten with the new parameters.
      
      Up until now, there was some code that was added to ease the support
      of changing a single entry of a schedule, but was ultimately unused.
      Now that we have support for "change" with better thought-out
      semantics, updating a single entry seems to be less useful.
      
      So we remove what is in practice dead code, and return a "not
      supported" error if the user tries to use it. If changing a single
      entry would make the user's life easier we may resurrect this idea,
      but at this point, removing it simplifies the code.
      
      For now, only the schedule specific bits are allowed to be added for a
      new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
      cannot be modified.
      
      Example:
      
      $ tc qdisc change dev IFACE parent root handle 100 taprio \
            base-time $BASE_TIME \
            sched-entry S 00 500000 \
            sched-entry S 0f 500000 \
            clockid CLOCK_TAI
      
      The only change in the netlink API introduced by this change is the
      introduction of an "admin" type in the response to a dump request,
      that type allows userspace to separate the "oper" schedule from the
      "admin" schedule. If userspace doesn't support the "admin" type, it
      will only display the "oper" schedule.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
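The relationship between the "Oper" and "Admin" schedules can be modelled with two slots and a promotion step. This is a minimal sketch, assuming a hypothetical `TaprioSchedules` class that represents schedules as plain dictionaries; it is not the taprio data structure.

```python
class TaprioSchedules:
    def __init__(self, oper):
        self.oper = oper    # the running schedule, read-only for the user
        self.admin = None   # the pending schedule, if any

    def change(self, base_time_ns, entries):
        # A new change request overwrites any admin schedule that already exists.
        self.admin = {"base_time": base_time_ns, "entries": entries}

    def advance(self, now_ns):
        # The admin schedule is promoted to oper once its base-time is reached.
        if self.admin is not None and now_ns >= self.admin["base_time"]:
            self.oper, self.admin = self.admin, None
        return self.oper


sched = TaprioSchedules({"base_time": 0, "entries": [("S", 0x01, 300000)]})
sched.change(1_000, [("S", 0x00, 500000), ("S", 0x0F, 500000)])
print(sched.advance(999)["base_time"])    # still the initial schedule
print(sched.advance(1_000)["base_time"])  # admin promoted to oper
```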
  12. 04 Mar, 2019 1 commit
    • sch_cake: Permit use of connmarks as tin classifiers · 0b5c7efd
      Kevin Darbyshire-Bryant committed
      Add flag 'FWMARK' to enable use of firewall connmarks as tin selector.
      The connmark (skbuff->mark) needs to be in the range 1->tin_cnt, i.e.
      for diffserv3 the mark needs to be 1->3.
      
      Background
      
      Typically CAKE uses DSCP as the basis for tin selection.  DSCP values
      are relatively easily changed as part of the egress path, usually with
      iptables & the mangle table; ingress is more challenging.  CAKE is often
      used on the WAN interface of a residential gateway where passthrough of
      DSCP from the ISP is either missing or set to unhelpful values thus use
      of ingress DSCP values for tin selection isn't helpful in that
      environment.
      
      An approach to solving the ingress tin selection problem is to use
      CAKE's understanding of tc filters.  Naive tc filters could match on
      source/destination port numbers and force tin selection that way, but
      multiple filters don't scale particularly well as each filter must be
      traversed whether it matches or not. e.g. a simple example to map 3
      firewall marks to tins:
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3
      
      Another option is to use eBPF cls_act with tc filters e.g.
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o
      
      This has the disadvantages of a) needing someone to write & maintain
      the bpf program, b) a bpf toolchain to compile it and c) needing to
      hardcode the major number in the bpf program so it matches the cake
      instance (or forcing the cake instance to a particular major number)
      since the major number cannot be passed to the bpf program via tc
      command line.
      
      As already hinted at by the previous examples, it would be helpful
      to associate tins with something that survives the Internet path and
      ideally allows tin selection on both egress and ingress.  Netfilter's
      conntrack permits setting an identifying mark on a connection which
      can also be restored to an ingress packet with tc action connmark e.g.
      
      tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
      	match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1
      
      Since tc's connmark action has restored any connmark into skb->mark,
      any of the previous solutions are based upon it and in one form or
      another copy that mark to the skb->priority field where again CAKE
      picks this up.
      
      This change cuts out at least one of the (less intuitive &
      non-scalable) middlemen and permits direct access to skb->mark.
      Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 26 Feb, 2019 1 commit
  14. 17 Nov, 2018 2 commits
    • net: sched: gred: allow manipulating per-DP RED flags · 72111015
      Jakub Kicinski committed
      Allow users to set and dump RED flags (ECN enabled and harddrop)
      on per-virtual queue basis.  Validation of attributes is split
      from changes to make sure we won't have to undo previous operations
      when we find out configuration is invalid.
      
      The objective is to allow changing per-Qdisc parameters without
      overwriting the per-vq configured flags.
      
      Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and
      per-Qdisc flags will always get propagated to the virtual queues.
      
      New user space which wants to make use of per-vq flags should set
      per-Qdisc flags to 0 and then configure per-vq flags as it
      sees fit.  Once per-vq flags are set per-Qdisc flags can't be
      changed to non-zero.  Vice versa - if the per-Qdisc flags are
      non-zero the TCA_GRED_VQ_FLAGS attribute has to either be omitted
      or set to the same value as per-Qdisc flags.
      
      Update per-Qdisc parameters:
      per-Qdisc | per-VQ | result
              0 |      0 | all vq flags updated
              0 |  non-0 | error (vq flags in use)
          non-0 |      0 | -- impossible --
          non-0 |  non-0 | all vq flags updated
      
      Update per-VQ state (flags parameter not specified):
         no change to flags
      
      Update per-VQ state (flags parameter set):
      per-Qdisc | per-VQ | result
              0 |   any  | per-vq flags updated
          non-0 |      0 | -- impossible --
          non-0 |  non-0 | error (per-Qdisc flags in use)
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: gred: provide a better structured dump and expose stats · 80e22e96
      Jakub Kicinski committed
      Currently all GRED's virtual queue data is dumped in a single
      array in a single attribute.  This makes it pretty much impossible
      to add new fields.  In order to expose more detailed stats add a
      new set of attributes.  We can now expose the 64 bit value of bytesin
      and all the mark stats which were not part of the original design.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 12 Nov, 2018 1 commit
  16. 05 Oct, 2018 1 commit
    • tc: Add support for configuring the taprio scheduler · 5a781ccb
      Vinicius Costa Gomes committed
      This traffic scheduler allows traffic class states (transmission
      allowed/not allowed, in the simplest case) to be scheduled, according
      to a pre-generated time sequence. This is the basis of the IEEE
      802.1Qbv specification.
      
      Example configuration:
      
      tc qdisc replace dev enp3s0 parent root handle 100 taprio \
                num_tc 3 \
                map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
                queues 1@0 1@1 2@2 \
                base-time 1528743495910289987 \
                sched-entry S 01 300000 \
                sched-entry S 02 300000 \
                sched-entry S 04 300000 \
                clockid CLOCK_TAI
      
      The configuration format is similar to mqprio. The main difference is
      the presence of a schedule, built by multiple "sched-entry"
      definitions, each entry has the following format:
      
           sched-entry <CMD> <GATE MASK> <INTERVAL>
      
      The only supported <CMD> is "S", which means "SetGateStates",
      following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
      is a bitmask where each bit is associated with a traffic class, so
      bit 0 (the least significant bit) being "on" means that traffic class
      0 is "active" for that schedule entry. <INTERVAL> is a time duration
      in nanoseconds that specifies for how long that state defined by <CMD>
      and <GATE MASK> should be held before moving to the next entry.
      
      This schedule is circular, that is, after the last entry is executed
      it starts from the first one, indefinitely.
      
      The other parameters can be defined as follows:
      
       - base-time: specifies the instant when the schedule starts; if
         'base-time' is a time in the past, the schedule will start at

             base-time + (N * cycle-time)

         where N is the smallest integer so the resulting time is greater
         than "now", and "cycle-time" is the sum of all the intervals of the
         entries in the schedule;
      
       - clockid: specifies the reference clock to be used;
      
      The parameters should be similar to what the IEEE 802.1Q family of
      specifications defines.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
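The base-time rule and the circular schedule can be checked with a short calculation. This is a sketch under assumptions: `first_cycle_start` and `open_classes` are hypothetical helpers, schedule entries are (gate mask, interval in ns) pairs taken from the example configuration, and a base-time in the future is assumed to be used as is.

```python
def first_cycle_start(base_time_ns, cycle_time_ns, now_ns):
    """Return base-time + N * cycle-time for the smallest N landing after now."""
    if base_time_ns > now_ns:
        return base_time_ns  # assumption: a future base-time is used unchanged
    n = (now_ns - base_time_ns) // cycle_time_ns + 1
    return base_time_ns + n * cycle_time_ns


def open_classes(entries, offset_ns):
    """Return the traffic classes whose gate is open offset_ns into the schedule."""
    cycle_time = sum(interval for _, interval in entries)
    offset = offset_ns % cycle_time  # the schedule is circular
    for mask, interval in entries:
        if offset < interval:
            # Bit 0 (least significant) corresponds to traffic class 0.
            return [tc for tc in range(mask.bit_length()) if (mask >> tc) & 1]
        offset -= interval


# The three sched-entry lines from the example configuration.
entries = [(0x01, 300000), (0x02, 300000), (0x04, 300000)]
print(first_cycle_start(1_000, 900_000, 2_000_000))
print(open_classes(entries, 350_000))
```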
  17. 03 Sep, 2018 1 commit
    • net/sched: fix type of htb statistics · b9de3963
      Florent Fourcot committed
      tokens and ctokens are defined as s64 in the htb_class structure,
      and clamped to a 32-bit value during netlink dumps:
      
      cl->xstats.tokens = clamp_t(s64, PSCHED_NS2TICKS(cl->tokens),
                                  INT_MIN, INT_MAX);
      
      Defining it as u32 works because userspace (tc) prints it as a
      signed int, but a correct definition from the beginning is probably
      better.

      At the same time, the 'giants' structure member has been unused for years, so
      update the comment to mark it unused.
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
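The reason the u32 definition appeared to work can be reproduced outside the kernel. This is an illustrative sketch, with hypothetical helper names, of clamping a 64-bit value to the signed 32-bit range and of reading back as signed an integer that was stored in an unsigned 32-bit field.

```python
import struct

INT_MIN = -2**31
INT_MAX = 2**31 - 1


def clamp_to_s32(value):
    """Clamp a wide signed value into the signed 32-bit range."""
    return max(INT_MIN, min(INT_MAX, value))


def store_u32_read_s32(value):
    """Store a value in an unsigned 32-bit field, then print it as signed."""
    raw = struct.pack("<I", value & 0xFFFFFFFF)
    return struct.unpack("<i", raw)[0]


tokens = clamp_to_s32(-5)
print(tokens & 0xFFFFFFFF)         # what the u32 field holds
print(store_u32_read_s32(tokens))  # what a signed printout shows
```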
  18. 25 Jul, 2018 1 commit
    • net/sched: add skbprio scheduler · aea5f654
      Nishanth Devarajan committed
      Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
      according to their skb->priority field. Under congestion, already-enqueued lower
      priority packets will be dropped to make space available for higher priority
      packets. Skbprio was conceived as a solution for denial-of-service defenses that
      need to route packets with different priorities as a means to overcome DoS
      attacks.
      
      v5
      *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
      default sch->limit to 64.
      
      v4
      *Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
      page for Skbprio, in iproute2.
      
      v3
      *Drop max_limit parameter in struct skbprio_sched_data and instead use
      sch->limit.
      
      *Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
      qdisc (previously being referenced every time qdisc changes).
      
      *Move qdisc's detailed description from in-code to Documentation/networking.
      
      *When qdisc is saturated, enqueue incoming packet first before dequeueing
      lowest priority packet in queue - improves usage of call stack registers.
      
      *Introduce and use overlimit stat to keep track of number of dropped packets.
      
      v2
      *Use skb->priority field rather than DS field. Rename queueing discipline as
      SKB Priority Queue (previously Gatekeeper Priority Queue).
      
      *Queueing discipline is made classful to expose Skbprio's internal priority
      queues.
      Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
      Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
      Reviewed-by: Cody Doucette <doucette@bu.edu>
      Reviewed-by: Michel Machado <michel@digirati.com.br>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
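The drop policy described above (enqueue first, then evict a lowest-priority packet when the queue is over its limit) can be sketched as follows. This is a simplified model, not sch_skbprio itself: the class name is hypothetical, a larger number is assumed to mean a higher priority, and the newest packet of the lowest priority is assumed to be the one evicted.

```python
from collections import deque


class SkbprioSketch:
    def __init__(self, limit=64):  # 64 is the default limit named in v5
        self.limit = limit
        self.queues = {}           # priority -> FIFO of packets
        self.qlen = 0
        self.dropped = 0           # stands in for the overlimit stat

    def enqueue(self, priority, pkt):
        # Enqueue first, then evict if the qdisc is saturated.
        self.queues.setdefault(priority, deque()).append(pkt)
        self.qlen += 1
        if self.qlen > self.limit:
            lowest = min(self.queues)
            self.queues[lowest].pop()  # assumption: newest low-priority packet goes
            if not self.queues[lowest]:
                del self.queues[lowest]
            self.qlen -= 1
            self.dropped += 1

    def dequeue(self):
        if not self.queues:
            return None
        highest = max(self.queues)
        pkt = self.queues[highest].popleft()
        if not self.queues[highest]:
            del self.queues[highest]
        self.qlen -= 1
        return pkt


q = SkbprioSketch(limit=2)
q.enqueue(1, "low")
q.enqueue(5, "high")
q.enqueue(3, "mid")  # over the limit: "low" is evicted
print(q.dequeue(), q.dequeue(), q.dequeue(), q.dropped)
```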
  19. 11 Jul, 2018 1 commit
    • sched: Add Common Applications Kept Enhanced (cake) qdisc · 046f6fd5
      Toke Høiland-Jørgensen committed
      sch_cake targets the home router use case and is intended to squeeze the
      most bandwidth and latency out of even the slowest ISP links and routers,
      while presenting an API simple enough that even an ISP can configure it.
      
      Example of use on a cable ISP uplink:
      
      tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
      
      To shape a cable download link (ifb and tc-mirred setup elided)
      
      tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
      
      CAKE is filled with:
      
      * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
        derived Flow Queuing system, which autoconfigures based on the bandwidth.
      * A novel "triple-isolate" mode (the default) which balances per-host
        and per-flow FQ even through NAT.
      * A deficit-based shaper, that can also be used in an unlimited mode.
      * 8 way set associative hashing to reduce flow collisions to a minimum.
      * A reasonable interpretation of various diffserv latency/loss tradeoffs.
      * Support for zeroing diffserv markings for entering and exiting traffic.
      * Support for interacting well with Docsis 3.0 shaper framing.
      * Extensive support for DSL framing types.
      * Support for ack filtering.
      * Extensive statistics for measuring loss, ecn markings, latency
        variation.
      
      A paper describing the design of CAKE is available at
      https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
      International Symposium on Local and Metropolitan Area Networks (LANMAN).
      
      This patch adds the base shaper and packet scheduler, while subsequent
      commits add the optional (configurable) features. The full userspace API
      and most data structures are included in this commit, but options not
      understood in the base version will be ignored.
      
      Various in-progress versions have been available as an out-of-tree build for
      kernel versions going back to 3.10, as the embedded router world has been
      running a few years behind mainline Linux. A stable version has been
      generally available on lede-17.01 and later.
      
      sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
      in the sqm-scripts, with sane defaults and vastly simpler configuration.
      
      CAKE's principal author is Jonathan Morton, with contributions from
      Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
      Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
      and Loganaden Velvindron.
      
      Testing from Pete Heist, Georgios Amanakis, and the many other members of
      the cake@lists.bufferbloat.net mailing list.
      
      tc -s qdisc show dev eth2
       qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
       Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
        backlog 0b 0p requeues 12
        memory used: 1053008b of 15140Kb
        capacity estimate: 970Mbit
        min/max network layer size:           28 /    1500
        min/max overhead-adjusted size:       84 /    1538
        average network hdr offset:           14
                          Bulk  Best Effort        Voice
         thresh      62500Kbit        1Gbit      250Mbit
         target          5.0ms        5.0ms        5.0ms
         interval      100.0ms      100.0ms      100.0ms
         pk_delay          5us          5us          6us
         av_delay          3us          2us          2us
         sp_delay          2us          1us          1us
         backlog            0b           0b           0b
         pkts          3164050     25030267      9530280
         bytes      3227519915  35396974782  12879808898
         way_inds            0            8            0
         way_miss           21          366           25
         way_cols            0            0            0
         drops               5            0            1
         marks               0            0            0
         ack_drop            0            0            0
         sp_flows            1            3            0
         bk_flows            0            1            1
         un_flows            0            0            0
         max_len         68130        68130        68130
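The "min/max overhead-adjusted size" line in the statistics above can be reproduced from the configured `overhead 38 mpu 84` values. The following is a minimal sketch, assuming the adjustment is simply "add the per-packet overhead, then round up to the minimum packet unit" with no ATM or PTM cell framing (the qdisc is configured with `noatm`). The function name is hypothetical and is not taken from sch_cake.

```python
def overhead_adjusted_size(network_len: int, overhead: int = 38, mpu: int = 84) -> int:
    """Return the size used for shaping: payload plus framing overhead,
    but never less than the minimum packet unit (mpu).

    Assumption: no ATM/PTM cell compensation, matching the 'noatm' keyword.
    """
    return max(network_len + overhead, mpu)


# Values taken from the tc output above.
print(overhead_adjusted_size(28))    # 84   (28 + 38 = 66, raised to mpu)
print(overhead_adjusted_size(1500))  # 1538 (1500 + 38)
```

Both results match the `84 / 1538` pair reported by tc for network layer sizes of `28 / 1500`.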
      Tested-by: Pete Heist <peteheist@gmail.com>
      Tested-by: Georgios Amanakis <gamanakis@gmail.com>
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      046f6fd5
  20. 04 Jul, 2018 2 commits
    • J
      net/sched: Add HW offloading capability to ETF · 88cab771
      Committed by Jesus Sanchez-Palencia
      Add infra so etf qdisc supports HW offload of time-based transmission.
      
      For hw offload, the time sorted list is still used, so packets are
      always dequeued in order of txtime.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
      	   clockid CLOCK_REALTIME
      
      In this example, the Qdisc will use HW offload for the control of the
      transmission time through the network adapter. The hrtimer used for
      packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
      as reference and packets leave the Qdisc "delta" (100000) nanoseconds
      before their transmission time. Because this will be using HW offload and
      since dynamic clocks are not supported by the hrtimer, the system clock
      and the PHC clock must be synchronized for this mode to behave as
      expected.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88cab771
    • V
      net/sched: Introduce the ETF Qdisc · 25db26a9
      Committed by Vinicius Costa Gomes
      The ETF (Earliest TxTime First) qdisc uses the information added
      earlier in this series (the socket option SO_TXTIME and the new
      role of sk_buff->tstamp) to schedule packet transmission based
      on absolute time.
      
      For some workloads, just bandwidth enforcement is not enough, and
      precise control of the transmission of packets is necessary.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
                 clockid CLOCK_TAI
      
      In this example, the Qdisc will provide SW best-effort for the control
      of the transmission time to the network adapter, the time stamp in the
      socket will be in reference to the clockid CLOCK_TAI and packets
      will leave the qdisc "delta" (100000) nanoseconds before their
      transmission time.
      
      The ETF qdisc will buffer packets sorted by their txtime. It will drop
      packets on enqueue() if their skbuff clockid does not match the clock
      reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
      if it expires while being enqueued.
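The buffering and drop rules described so far can be sketched in a few lines of Python. This is an illustrative model only, not the kernel implementation: the class and field names are hypothetical, a binary heap stands in for whatever time-sorted structure the qdisc uses, and "expired" is assumed to mean that the packet's txtime is already in the past when dequeue runs.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Packet:
    txtime_ns: int                      # absolute transmission time
    clockid: str = field(compare=False)
    payload: bytes = field(compare=False, default=b"")


class EtfSketch:
    """Toy model of a txtime-sorted queue (hypothetical names throughout)."""

    def __init__(self, clockid: str = "CLOCK_TAI", delta_ns: int = 100_000):
        self.clockid = clockid
        self.delta_ns = delta_ns
        self.dropped = 0
        self._heap: List[Packet] = []

    def enqueue(self, pkt: Packet) -> bool:
        # Drop packets whose clock reference differs from the qdisc's.
        if pkt.clockid != self.clockid:
            self.dropped += 1
            return False
        heapq.heappush(self._heap, pkt)
        return True

    def dequeue(self, now_ns: int) -> Optional[Packet]:
        while self._heap:
            head = self._heap[0]
            if head.txtime_ns < now_ns:
                # Assumed meaning of "expired": txtime has already passed.
                heapq.heappop(self._heap)
                self.dropped += 1
                continue
            if head.txtime_ns - self.delta_ns <= now_ns:
                # Release the packet "delta" nanoseconds before its txtime.
                return heapq.heappop(self._heap)
            return None  # earliest packet is not due yet
        return None
```

With `delta_ns=100_000`, a packet whose txtime is 1,000,000 ns stays queued at `now_ns=800_000` and is released once `now_ns` reaches 900,000.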
      
      The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
      will dequeue a packet as soon as possible and change the skb timestamp
      to 'now' during etf_dequeue().
      
      Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
      to be configured, but it's been decided that usage of CLOCK_TAI should
      be enforced until we decide to allow for other clockids to be used.
      The rationale here is that PTP times are usually in the TAI scale, thus
      no other clocks should be necessary. For now, the qdisc will return
      EINVAL if any clocks other than CLOCK_TAI are used.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25db26a9
  21. 28 Jun, 2018 1 commit
    • Y
      netem: slotting with non-uniform distribution · 0a9fe5c3
      Committed by Yousuk Seung
      Extend slotting with support for non-uniform distributions. This is
      similar to netem's non-uniform distribution delay feature.
      
      Commit f043efeae2f1 ("netem: support delivering packets in delayed
      time slots") added the slotting feature to approximate the behaviors
      of media with packet aggregation but only supported a uniform
      distribution for delays between transmission attempts. Tests with TCP
      BBR with emulated wifi links with non-uniform distributions produced
      more useful results.
      
      Syntax:
         slot dist DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
            [bytes MAX_BYTES]
      
      The syntax and use of the distribution table is the same as in the
      non-uniform distribution delay feature. A file DISTRIBUTION must be
      present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
      NETEM_DIST_SCALE. A random value x is selected from the table, and the
      delay is computed as DELAY + ( x * JITTER ). Correlation between values
      is not supported.
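The formula DELAY + ( x * JITTER ) can be illustrated with a short Python sketch. It assumes that x is the selected table entry divided by NETEM_DIST_SCALE, and that NETEM_DIST_SCALE is 8192 as defined in the kernel's uapi header; the function name and the integer rounding are hypothetical simplifications rather than the kernel's exact arithmetic.

```python
import random
from typing import Sequence

NETEM_DIST_SCALE = 8192  # assumed scale factor of the distribution table


def slot_delay_ns(delay_ns: int, jitter_ns: int,
                  table: Sequence[int], rng: random.Random) -> int:
    """Pick one table entry at random and return DELAY + x * JITTER,
    where x = entry / NETEM_DIST_SCALE (no correlation between draws)."""
    entry = rng.choice(table)
    return delay_ns + (entry * jitter_ns) // NETEM_DIST_SCALE


# Tiny hand-made table: -1, 0 and +1 standard deviations, pre-scaled.
toy_table = [-NETEM_DIST_SCALE, 0, NETEM_DIST_SCALE]
rng = random.Random(0)
samples = [slot_delay_ns(800_000, 100_000, toy_table, rng) for _ in range(5)]
print(samples)  # each value is 700000, 800000 or 900000 ns
```

With the `normal` table shipped by iproute2, the same computation would yield slot delays centred on DELAY with a spread controlled by JITTER.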
      
      Examples:
        Normal distribution delay with mean = 800us and stdev = 100us.
        > tc qdisc add dev eth0 root netem slot dist normal 800us 100us
      
        Optionally set the max slot size in bytes and/or packets.
        > tc qdisc add dev eth0 root netem slot dist normal 800us 100us \
          bytes 64k packets 42
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a9fe5c3
  22. 16 Dec, 2017 1 commit
  23. 13 Nov, 2017 2 commits
    • D
      netem: support delivering packets in delayed time slots · 836af83b
      Committed by Dave Taht
      Slotting is a crude approximation of the behaviors of shared media such
      as cable, wifi, and LTE, which gather up a bunch of packets within a
      varying delay window and deliver them, relative to that, nearly all at
      once.
      
      It works within the existing loss, duplication, jitter and delay
      parameters of netem. Some amount of inherent latency must be specified,
      regardless.
      
      The new "slot" parameter specifies a minimum and maximum delay between
      transmission attempts.
      
      The "bytes" and "packets" parameters can be used to limit the amount of
      information transferred per slot.
      
      Examples of use:
      
      tc qdisc add dev eth0 root netem delay 200us \
               slot 800us 10ms bytes 64k packets 42
      
      A more correct example, using stacked netem instances and a packet limit
      to emulate a tail drop wifi queue with slots and variable packet
      delivery, with a 200Mbit isochronous underlying rate, and 20ms path
      delay:
      
      tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
               limit 10000
      tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
               slot 800us 10ms bytes 64k packets 42 limit 512
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      836af83b
    • D
      netem: add uapi to express delay and jitter in nanoseconds · 99803171
      Committed by Dave Taht
      netem userspace has long relied on a horrible /proc/net/psched hack
      to translate the current notion of "ticks" to nanoseconds.
      
      Expressing latency and jitter in well-defined nanoseconds instead
      increases the dynamic range of emulated delays and jitter in netem.
      
      It will also ease a transition where reducing a tick to nsec
      equivalence would constrain the max delay in prior versions of
      netem to only 4.3 seconds.
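The 4.3 second figure follows from simple arithmetic if the delay is held in a 32-bit unsigned field counting nanoseconds. The sketch below assumes that field width (the message itself only gives the resulting limit) and compares it with a signed 64-bit nanosecond value; the function name is hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def max_delay_seconds(field_bits: int, signed: bool = False) -> float:
    """Largest delay, in seconds, representable when one unit is 1 ns."""
    value_bits = field_bits - 1 if signed else field_bits
    return (2 ** value_bits - 1) / NSEC_PER_SEC


# Assumed 32-bit unsigned field with one tick equal to one nanosecond.
print(f"{max_delay_seconds(32):.2f} s")  # 4.29 s

# A signed 64-bit nanosecond value removes the practical limit.
years = max_delay_seconds(64, signed=True) / (365 * 86_400)
print(f"about {years:.0f} years")
```

The first result, roughly 4.29 s, is the "only 4.3 seconds" ceiling mentioned above.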
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99803171
  24. 08 Nov, 2017 1 commit
  25. 02 Nov, 2017 1 commit
    • G
      License cleanup: add SPDX license identifier to uapi header files with no license · 6f52b16c
      Committed by Greg Kroah-Hartman
      Many user space API headers are missing licensing information, which
      makes it hard for compliance tools to determine the correct license.
      
      By default, files without license information are under the default
      license of the kernel, which is GPLv2.  Marking them GPLv2 would exclude
      them from being included in non-GPLv2 code, which is obviously not
      intended. The user space API headers fall under the syscall exception,
      which is in the kernel's COPYING file:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      otherwise syscall usage would not be possible.
      
      Update the files which contain no license information with an SPDX
      license identifier.  The chosen identifier is 'GPL-2.0 WITH
      Linux-syscall-note' which is the officially assigned identifier for the
      Linux syscall exception.  SPDX license identifiers are a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f52b16c
  26. 28 Oct, 2017 1 commit
  27. 17 Oct, 2017 1 commit
  28. 14 Oct, 2017 1 commit
    • A
      mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Committed by Amritha Nambiar
      The offload types currently supported in mqprio are 0 (no offload) and
      1 (offload only TCs) by setting these values for the 'hw' option. If
      offloads are supported by setting the 'hw' option to 1, the default
      offload mode is 'dcb' where only the TC values are offloaded to the
      device. This patch introduces a new hardware offload mode called
      'channel' with 'hw' set to 1 in mqprio which makes full use of the
      mqprio options, the TCs, the queue configurations and the QoS parameters
      for the TCs. This is achieved through a new netlink attribute for the
      'mode' option which takes values such as 'dcb' (default) and 'channel'.
      The 'channel' mode also supports QoS attributes for traffic class such as
      minimum and maximum values for bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes associated
      with a traffic class. Currently, the shaper for bandwidth rate limiting
      is supported; it takes options such as minimum and maximum bandwidth
      rates, which are offloaded to the hardware in the 'channel' mode. The min
      and max limits for bandwidth rates are provided by the user along with
      the TCs and the queue configurations when creating the mqprio qdisc. The
      interface can be extended to support new HW shapers in the future through
      the 'shaper' attribute.
      
      This patch introduces a new data structure, 'tc_mqprio_qopt_offload', for
      offloading mqprio queue options; it is shared between the kernel and the
      device driver. It contains a copy of the existing data structure for
      mqprio queue options. The new data structure can be extended when adding
      new attributes for traffic classes, such as mode, shaper, and shaper
      parameters (bandwidth rate limits). The existing data structure for
      mqprio queue options will be shared between the kernel and userspace.
      
      Example:
        queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
        min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      4e8b86c0
  29. 16 Mar, 2017 1 commit
  30. 23 Sep, 2016 1 commit
    • E
      net_sched: sch_fq: account for schedule/timers drifts · fefa569a
      Committed by Eric Dumazet
      It looks like the following patch can make FQ very precise, even in VM
      or stressed hosts. It matters at high pacing rates.
      
      We take into account the difference between the time that was programmed
      when the last packet was sent and the current time (a drift of tens of
      usecs is often observed).
      
      Add an EWMA of the unthrottle latency to help diagnostics.
      
      This latency is the difference between the current time and the oldest
      packet in the delayed RB-tree. This accounts for the high resolution
      timer latency, but can be different under stress, as fq_check_throttled()
      can be opportunistically called from a dequeue() called after an
      enqueue() for a different flow.
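The two ideas in this message, compensating for timer drift and keeping an EWMA of the unthrottle latency, can be sketched as follows. This is a hedged illustration, not the sch_fq code: the function names are hypothetical, the 1/8 EWMA weight is an assumption chosen for the example, and capping the drift credit at half the inter-packet delay is likewise an assumption rather than something stated in the message.

```python
def next_packet_time(now_ns: int, planned_ns: int, pkt_delay_ns: int) -> int:
    """Schedule the next packet, crediting back the time by which the
    current packet was sent later than planned (the drift)."""
    drift_ns = max(0, now_ns - planned_ns)
    # Assumption: never credit more than half of the inter-packet delay.
    credit_ns = min(pkt_delay_ns // 2, drift_ns)
    return now_ns + pkt_delay_ns - credit_ns


def ewma_update(avg_ns: int, sample_ns: int, shift: int = 3) -> int:
    """Integer EWMA with weight 1/2**shift (1/8 assumed for illustration)."""
    return avg_ns - (avg_ns >> shift) + (sample_ns >> shift)


# A packet planned for t=1000 ns actually leaves at t=1050 ns. With a
# 200 ns pacing interval the next one is still scheduled for t=1200 ns,
# so the 50 ns drift does not accumulate.
print(next_packet_time(1_050, 1_000, 200))  # 1200

# Feeding a constant 8000 ns latency sample moves the average toward it.
avg = 0
for _ in range(3):
    avg = ewma_update(avg, 8_000)
print(avg)
```

Without the drift credit, each late wakeup would push every later packet back by the same amount, lowering the achieved rate at high pacing rates.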
      
      Tested:
      // Start a 10Gbit flow
      $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &
      
      Before patch :
      $ sar -n DEV 10 5 | grep eth0 | grep Average
      Average:         eth0  17106.04 756876.84   1102.75 1119049.02      0.00      0.00      0.52
      
      After patch :
      $ sar -n DEV 10 5 | grep eth0 | grep Average
      Average:         eth0  17867.00 800245.90   1151.77 1183172.12      0.00      0.00      0.52
      
      A new iproute2 tc can output the 'unthrottle latency' :
      
      $ tc -s qd sh dev eth0 | grep latency
        0 gc, 0 highprio, 32490767 throttled, 2382 ns latency
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fefa569a
  31. 21 Sep, 2016 1 commit
  32. 09 May, 2016 1 commit
    • E
      fq_codel: add memory limitation per queue · 95b58430
      Committed by Eric Dumazet
      On small embedded routers, one wants to control the maximal amount of
      memory used by fq_codel, instead of controlling the number of packets or
      bytes, since GRO/TSO make these impractical.
      
      Assuming skb->truesize is accurate, we have to keep track of
      skb->truesize sum for skbs in queue.
      
      This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute.
      
      I chose a default value of 32 MBytes, which looks reasonable even
      for heavy duty usages. (Prior fq_codel users should not be hurt
      when they upgrade their kernels)
      
      Two fields are added to tc_fq_codel_qd_stats to report :
       - Current memory usage
       - Number of drops caused by memory limits
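The accounting described above (a running sum of skb->truesize, a limit, and two counters) can be modelled with a small Python sketch. It is deliberately simplified and its names are hypothetical: a single FIFO stands in for fq_codel's per-flow queues, and dropping the oldest packet when the limit is exceeded is an assumption about drop policy made only to keep the example short.

```python
from collections import deque
from typing import Deque, Optional


class MemoryLimitedQueue:
    """Toy queue that tracks total truesize and enforces a memory limit."""

    def __init__(self, memory_limit: int = 32 * 1024 * 1024):
        self.memory_limit = memory_limit   # 32 MBytes default, as above
        self.memory_used = 0               # current sum of truesize
        self.drop_overmemory = 0           # drops caused by the limit
        self._queue: Deque[int] = deque()  # stores each packet's truesize

    def enqueue(self, truesize: int) -> None:
        self._queue.append(truesize)
        self.memory_used += truesize
        # Assumed policy: drop from the head until back under the limit.
        while self.memory_used > self.memory_limit and self._queue:
            self.memory_used -= self._queue.popleft()
            self.drop_overmemory += 1

    def dequeue(self) -> Optional[int]:
        if not self._queue:
            return None
        truesize = self._queue.popleft()
        self.memory_used -= truesize
        return truesize


q = MemoryLimitedQueue(memory_limit=4096)
for _ in range(3):
    q.enqueue(2048)
print(q.memory_used, q.drop_overmemory)  # 4096 1
```

The two counters correspond to the `memory_used` and `drop_overmemory` values that appear in the tc statistics for this qdisc.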
      
      # tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M
      ..
      # tc -s -d qd sh dev eth1
      qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024
       quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
       Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0
      requeues 21705223)
       rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223
        maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414
        ecn_mark 0 memory_used 4190048 drop_overmemory 4994406
        new_flows_len 1 old_flows_len 177
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Dave Täht <dave.taht@gmail.com>
      Cc: Sebastian Möller <moeller0@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95b58430