1. 02 5月, 2019 30 次提交
  2. 01 5月, 2019 10 次提交
    • D
      Merge branch 'net-sched-taprio-change-schedules' · 5b27aafa
      David S. Miller 提交于
      Vinicius Costa Gomes says:
      
      ====================
      net/sched: taprio change schedules
      
      Changes from RFC:
       - Removed the patches for taprio offloading, because of the lack of
         in-tree users;
       - Updated the links to point to the PATCH version of this series;
      
      Original cover letter:
      
      Overview
      --------
      
      This RFC has two objectives, it adds support for changing the running
      schedules during "runtime", explained in more detail later, and
      proposes an interface between taprio and the drivers for hardware
      offloading.
      
      These two different features are presented together so it's clear what
      the "final state" would look like. But after the RFC stage, they can
      be proposed (and reviewed) separately.
      
      Changing the schedules without disrupting traffic is important for
      handling dynamic use cases, for example, when streams are
      added/removed and when the network configuration changes.
      
      Hardware offloading support allows schedules to be more precise and
      have lower resource usage.
      
      Changing schedules
      ------------------
      
      The same as the other interfaces we proposed, we try to use the same
      concepts as the IEEE 802.1Q-2018 specification. So, for changing
      schedules, there are an "oper" (operational) and an "admin" schedule.
      The "admin" schedule is mutable and not in use, the "oper" schedule is
      immutable and is in use.
      
      That is, when the user first adds an schedule it is in the "admin"
      state, and it becomes "oper" when its base-time (basically when it
      starts) is reached.
      
      What this means is that now it's possible to create taprio with a schedule:
      
      $ tc qdisc add dev IFACE parent root handle 100 taprio \
            num_tc 3 \
            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
            queues 1@0 1@1 2@2 \
            base-time 10000000 \
            sched-entry S 03 300000 \
            sched-entry S 02 300000 \
            sched-entry S 06 400000 \
            clockid CLOCK_TAI
      
      And then, later, after the previous schedule is "promoted" to "oper",
      add a new ("admin") schedule to be used some time later:
      
      $ tc qdisc change dev IFACE parent root handle 100 taprio \
            base-time 1553121866000000000 \
            sched-entry S 02 500000 \
            sched-entry S 0f 400000 \
            clockid CLOCK_TAI
      
      When enabling the ability to change schedules, it makes sense to add
      two more defined knobs to schedules: "cycle-time" allows to truncate a
      cycle to some value, so it repeats after a well-defined value;
      "cycle-time-extension" controls how much an entry can be extended if
      it's the last one before the change of schedules, the reason is to
      avoid a very small cycle when transitioning from a schedule to
      another.
      
      With these, taprio in the software mode should provide a fairly
      complete implementation of what's defined in the Enhancements for
      Scheduled Traffic parts of the specification.
      
      Hardware offload
      ----------------
      
      Some workloads require better guarantees from their schedules than
      what's provided by the software implementation. This series proposes
      an interface for configuring schedules into compatible network
      controllers.
      
      This part is proposed together with the support for changing
      schedules, because it raises questions like, should the "qdisc" side
      be responsible of providing visibility into the schedules or should it
      be the driver?
      
      In this proposal, the driver is called passing the new schedule as
      soon as it is validated, and the "core" qdisc takes care of displaying
      (".dump()") the correct schedules at all times. It means that some
      logic would need to be duplicated in the driver, if the hardware
      doesn't have support for multiple schedules. But as taprio doesn't
      have enough information about the underlying controller to know how
      much in advance a schedule needs to be informed to the hardware, it
      feels like a fair compromise.
      
      The hardware offloading part of this proposal also tries to define an
      interface for frame-preemption and how it interacts with the
      scheduling of traffic, see Section 8.6.8.4 of IEEE 802.1Q-2018 for
      more information.
      
      One important difference between the qdisc interface and the
      qdisc-driver interface, is that the "gate mask" on the qdisc side
      references traffic classes, that is bit 0 of the gate mask means
      Traffic Class 0, and in the driver interface, it specifies the queues,
      that is bit 0 means queue 0. That is to say that taprio converts the
      references to traffic classes to references to queues before sending
      the offloading request to the driver.
      
      Request for help
      ----------------
      
      I would like that interested driver maintainers could take a look at
      the proposed interface and see if it's going to be too awkward for any
      particular device. Also, pointers to available documentation would be
      appreciated. The idea here is to start a discussion so we can have an
      interface that would work for multiple vendors.
      
      Links
      -----
      
      kernel patches:
      https://github.com/vcgomes/net-next/tree/taprio-add-support-for-change-v3
      
      iproute2 patches:
      https://github.com/vcgomes/iproute2/tree/taprio-add-support-for-change-v3
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b27aafa
    • V
      taprio: Add support for cycle-time-extension · c25031e9
      Vinicius Costa Gomes 提交于
      IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
      last entry of a schedule before the start of a new schedule can be
      extended, so "too-short" entries can be avoided.
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c25031e9
    • V
      taprio: Add support for setting the cycle-time manually · 6ca6a665
      Vinicius Costa Gomes 提交于
      IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be
      overridden, so the schedule is truncated to a determined "width".
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ca6a665
    • V
      taprio: Add support adding an admin schedule · a3d43c0d
      Vinicius Costa Gomes 提交于
      The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
      operational?) and "Admin" ones. Up until now, 'taprio' only had
      support for the "Oper" one, added when the qdisc is created. This adds
      support for the "Admin" one, which allows the .change() operation to
      be supported.
      
      Just for clarification, some quick (and dirty) definitions, the "Oper"
      schedule is the currently (as in this instant) running one, and it's
      read-only. The "Admin" one is the one that the system configurator has
      installed, it can be changed, and it will be "promoted" to "Oper" when
      it's 'base-time' is reached.
      
      The idea behing this patch is that calling something like the below,
      (after taprio is already configured with an initial schedule):
      
      $ tc qdisc change taprio dev IFACE parent root 	     \
           	   base-time X 	     	   	       	     \
           	   sched-entry <CMD> <GATES> <INTERVAL>	     \
      	   ...
      
      Will cause a new admin schedule to be created and programmed to be
      "promoted" to "Oper" at instant X. If an "Admin" schedule already
      exists, it will be overwritten with the new parameters.
      
      Up until now, there was some code that was added to ease the support
      of changing a single entry of a schedule, but was ultimately unused.
      Now, that we have support for "change" with more well thought
      semantics, updating a single entry seems to be less useful.
      
      So we remove what is in practice dead code, and return a "not
      supported" error if the user tries to use it. If changing a single
      entry would make the user's life easier we may ressurrect this idea,
      but at this point, removing it simplifies the code.
      
      For now, only the schedule specific bits are allowed to be added for a
      new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
      cannot be modified.
      
      Example:
      
      $ tc qdisc change dev IFACE parent root handle 100 taprio \
            base-time $BASE_TIME \
            sched-entry S 00 500000 \
            sched-entry S 0f 500000 \
            clockid CLOCK_TAI
      
      The only change in the netlink API introduced by this change is the
      introduction of an "admin" type in the response to a dump request,
      that type allows userspace to separate the "oper" schedule from the
      "admin" schedule. If userspace doesn't support the "admin" type, it
      will only display the "oper" schedule.
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3d43c0d
    • V
      taprio: Fix potencial use of invalid memory during dequeue() · 8c79f0ea
      Vinicius Costa Gomes 提交于
      Right now, this isn't a problem, but the next commit allows schedules
      to be added during runtime. When a new schedule transitions from the
      inactive to the active state ("admin" -> "oper") the previous one can
      be freed, if it's freed just after the RCU read lock is released, we
      may access an invalid entry.
      
      So, we should take care to protect the dequeue() flow, so all the
      places that access the entries are protected by the RCU read lock.
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c79f0ea
    • D
      Merge branch 'tcp-undo-congestion' · cd86972a
      David S. Miller 提交于
      Yuchung Cheng says:
      
      ====================
      undo congestion window on spurious SYN or SYNACK timeout
      
      Linux TCP currently uses the initial congestion window of 1 packet
      if multiple SYN or SYNACK timeouts per RFC6298. However such
      timeouts are often spurious on wireless or cellular networks that
      experience high delay variances (e.g. ramping up dormant radios or
      local link retransmission). Another case is when the underlying
      path is longer than the default SYN timeout (e.g. 1 second). In
      these cases starting the transfer with a minimal congestion window
      is detrimental to the performance for short flows.
      
      One naive approach is to simply ignore SYN or SYNACK timeouts and
      always use a larger or default initial window. This approach however
      risks pouring gas to the fire when the network is already highly
      congested. This is particularly true in data center where application
      could start thousands to millions of connections over a single or
      multiple hosts resulting in high SYN drops (e.g. incast).
      
      This patch-set detects spurious SYN and SYNACK timeouts upon
      completing the handshake via the widely-supported TCP timestamp
      options. Upon such events the sender reverts to the default
      initial window to start the data transfer so it gets best of both
      worlds. This patch-set supports this feature for both active and
      passive as well as Fast Open or regular connections.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd86972a
    • Y
      tcp: refactor setting the initial congestion window · 98fa6271
      Yuchung Cheng 提交于
      Relocate the congestion window initialization from tcp_init_metrics()
      to tcp_init_transfer() to improve code readability.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98fa6271
    • Y
      tcp: refactor to consolidate TFO passive open code · 6b94b1c8
      Yuchung Cheng 提交于
      Use a helper to consolidate two identical code block for passive TFO.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b94b1c8
    • Y
      tcp: undo cwnd on Fast Open spurious SYNACK retransmit · 794200d6
      Yuchung Cheng 提交于
      This patch makes passive Fast Open reverts the cwnd to default
      initial cwnd (10 packets) if the SYNACK timeout is spurious.
      
      Passive Fast Open uses a full socket during handshake so it can
      use the existing undo logic to detect spurious retransmission
      by recording the first SYNACK timeout in key state variable
      retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
      has sent some data before the timeout, the spurious timeout
      is detected by tcp_try_undo_recovery() in tcp_process_loss()
      in tcp_ack().
      
      But if the socket has not send any data yet, tcp_ack() does not
      execute the undo code since no data is acknowledged. The fix is to
      check such case explicitly after tcp_ack() during the ACK processing
      in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
      in case the server closes the socket before handshake completes.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      794200d6
    • Y
      tcp: lower congestion window on Fast Open SYNACK timeout · 8c3cfe19
      Yuchung Cheng 提交于
      TCP sender would use congestion window of 1 packet on the second SYN
      and SYNACK timeout except passive TCP Fast Open. This makes passive
      TFO too aggressive and unfair during congestion at handshake. This
      patch fixes this issue so TCP (fast open or not, passive or active)
      always conforms to the RFC6298.
      
      Note that tcp_enter_loss() is called only once during recurring
      timeouts.  This is because during handshake, high_seq and snd_una
      are the same so tcp_enter_loss() would incorrect set the undo state
      variables multiple times.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c3cfe19