1. 17 Oct 2017 (2 commits)
  2. 15 Oct 2017 (1 commit)
  3. 14 Oct 2017 (1 commit)
    • mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Amritha Nambiar committed
      The offload types currently supported in mqprio are 0 (no offload)
      and 1 (offload TCs only), set via the 'hw' option. When offloads
      are enabled with 'hw' set to 1, the default offload mode is 'dcb',
      where only the TC values are offloaded to the device. This patch
      introduces a new hardware offload mode called 'channel' (with 'hw'
      set to 1) which makes full use of the mqprio options: the TCs, the
      queue configurations and the QoS parameters for the TCs. This is
      achieved through a new netlink attribute for the 'mode' option,
      which takes values such as 'dcb' (default) and 'channel'. The
      'channel' mode also supports QoS attributes per traffic class,
      such as minimum and maximum bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes
      associated with a traffic class. Currently a shaper for bandwidth
      rate limiting is supported; it takes minimum and maximum bandwidth
      rates as options, which are offloaded to the hardware in 'channel'
      mode. The min and max bandwidth rate limits are provided by the
      user along with the TCs and the queue configurations when creating
      the mqprio qdisc. The interface can be extended to support new HW
      shapers in the future through the 'shaper' attribute.
      
      This patch also introduces a new data structure,
      'tc_mqprio_qopt_offload', for offloading mqprio queue options; it
      is shared between the kernel and the device driver. It contains a
      copy of the existing data structure for mqprio queue options and
      can be extended with new per-traffic-class attributes such as
      mode, shaper and shaper parameters (bandwidth rate limits). The
      existing data structure for mqprio queue options continues to be
      shared between the kernel and userspace.
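
      As a rough sketch of the shape this description implies (field
      names, types and the TC_QOPT_MAX_QUEUE bound are assumptions based
      on the text above, not a verbatim copy of the header):

        struct tc_mqprio_qopt_offload {
                struct tc_mqprio_qopt qopt; /* existing queue options,
                                             * shared with userspace */
                u16 mode;                   /* 'dcb' or 'channel'    */
                u16 shaper;                 /* e.g. bw_rlimit        */
                u32 flags;
                u64 min_rate[TC_QOPT_MAX_QUEUE]; /* per-TC min rate  */
                u64 max_rate[TC_QOPT_MAX_QUEUE]; /* per-TC max rate  */
        };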
      
      Example:
        # tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1 \
          queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit \
          min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  4. 13 Oct 2017 (5 commits)
  5. 12 Oct 2017 (5 commits)
  6. 11 Oct 2017 (2 commits)
  7. 10 Oct 2017 (1 commit)
  8. 08 Oct 2017 (9 commits)
    • net: phonet: mark phonet_protocol as const · 548ec114
      Lin Zhang committed
      The phonet_protocol structs don't need to be written by anyone and
      so can be marked as const.
      Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: take care of rt6_stats · 81eb8447
      Wei Wang committed
      Currently, most of the rt6_stats are not hooked up correctly. As the
      last part of this patch series, hook up all existing rt6_stats and add
      one new stat fib_rt_uncache to indicate the number of routes in the
      uncached list.
      For details of the stats, please refer to the comments added in
      include/net/ip6_fib.h.
      
      Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
      under a lock. So atomic_t is used for them.
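
      As a hedged sketch of what that implies for the stats structure
      (the field list is approximated from this description, not a
      verbatim copy):

        struct rt6_statistics {
                __u32    fib_nodes;            /* all fib6 nodes       */
                __u32    fib_route_nodes;      /* nodes holding routes */
                __u32    fib_rt_entries;       /* rt entries in table  */
                __u32    fib_rt_cache;         /* cached in exceptions */
                __u32    fib_discarded_routes;
                /* not always modified under a lock, hence atomic_t    */
                atomic_t fib_rt_alloc;         /* total routes alloced */
                atomic_t fib_rt_uncache;       /* in the uncached list */
        };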
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang committed
      With all the preparation work done in earlier patches, we are now
      ready to replace the rwlock with rcu and a spinlock in fib6_table.
      That means all fib6_node in fib6_table are now protected by rcu,
      and when freeing a fib6_node, call_rcu() is used to wait for the
      rcu grace period before releasing the memory.
      When accessing a fib6_node, the corresponding rcu APIs need to be
      used.
      All critical sections previously protected by the write lock are
      now protected by the per-table spin lock, and all sections
      previously protected by the read lock are now protected by
      rcu_read_lock(). A locking sketch follows the notes below.
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. fn->rr_ptr also needs to be rcu protected now: it is tagged
      with __rcu and rcu APIs are used in the corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select(), which is a reader
      thread. This makes the issue a bit complicated. A valid solution
      is to let rt6_select() grab tb6_lock if it decides to change it.
      As this is not the normal operation path and only happens when
      there is no valid neighbor cache for the route, the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even
      in the route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. This is because
      fib6_walk_continue() modifies the walker structure, as do
      fib6_repair_tree() and fib6_del_route(); to sync properly between
      them, fib6_walk_continue() needs to hold the lock. We may be able
      to improve the tree walk later to get rid of the need for holding
      the spin lock, but not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no
      longer set rt->dst.rt6_next to NULL, so that concurrent readers
      can continue traversing the list under rcu. However,
      rt->dst.rt6_next is only valid within the same rcu grace period;
      no one should access it afterwards.
      
      5. atomic_inc(rt->rt6i_ref) is now always performed before we
      publish the route (either by linking it to fn->leaf or by
      inserting it into the list pointed to by fn->leaf), because as
      soon as we publish the route, a reader thread may access it.
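
      As a minimal sketch of the resulting locking pattern (function
      names and signatures are approximations for illustration, not the
      patch itself):

        /* Reader side: lookups now run under RCU. */
        rcu_read_lock();
        fn = fib6_lookup(&table->tb6_root, daddr, saddr);
        /* ... use fn within the rcu read-side section ... */
        rcu_read_unlock();

        /* Writer side: modifications serialize on the per-table lock. */
        spin_lock_bh(&table->tb6_lock);
        err = fib6_add(&table->tb6_root, rt, info, mxc, extack);
        spin_unlock_bh(&table->tb6_lock);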
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: update fn_sernum after route is inserted to tree · bbd63f06
      Wei Wang committed
      fib6_add() currently calls fib6_add_1() to figure out which node
      should be used for the newly added route and then calls
      fib6_add_rt2node() to insert the route into that node. During the
      call to fib6_add_1(), fn_sernum is updated for all nodes that
      share the same prefix as the new route.
      This is not an issue in the current code because reader threads
      cannot access the tree while a writer thread is inserting a new
      route. However, that is no longer the case once we transition to
      RCU: a reader thread could see the new fn_sernum before the new
      route is inserted, and as a result its route lookup would return a
      stale route with the new fn_sernum.
      
      To solve this issue, we remove all updates of fn_sernum from
      fib6_add_1() and instead introduce a new function that updates
      fn_sernum for all related nodes; this function is called once the
      route has been successfully inserted into the tree.
      Also, smp_wmb() is used after a route is successfully inserted
      into the fib tree and right before the update of fn->sernum, and
      smp_rmb() is used right after fn->sernum is accessed in
      rt6_get_cookie_safe(). This guarantees that when the reader thread
      sees the new fn->sernum, the new route is already visible in the
      tree in memory.
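
      The barrier pairing described above looks roughly like this (a
      sketch; placement is illustrative):

        /* Writer, in fib6_add(): insert first, then bump the sernum. */
        rcu_assign_pointer(fn->leaf, rt);  /* route becomes visible   */
        smp_wmb();                         /* order insert vs. sernum */
        fn->fn_sernum = sernum;

        /* Reader, in rt6_get_cookie_safe(): sernum first, then tree. */
        cookie = fn->fn_sernum;
        smp_rmb();                         /* pairs with smp_wmb()    */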
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: hook up exception table to store dst cache · 2b760fcf
      Wei Wang committed
      This commit makes use of the exception hash table implementation
      to store dst caches created by pmtu discovery and ip redirect in
      the hash table under the rt6_info, and no longer inserts these
      routes into the fib6 tree.
      This makes the fib6 tree contain only statically configured
      routes, so it can now be protected by rcu instead of a rw lock.
      With this change, the route lookup related functions, after
      finding the rt6_info with the longest prefix, also need to search
      the exception table before backtracking.
      In the route delete function, if the route being deleted is not a
      dst cache, deleting it also needs to flush the whole hash table
      under it. If it is a dst cache, then only the cached dst in the
      hash table is deleted.
      
      Note: in fib6_walk_continue(), w->root now always points to a root
      node, considering that fib6_prune_clones() has been removed from
      the code. So we add a WARN_ON() to make sure w->root always points
      to a root node, and also remove the update of w->root in
      fib6_repair_tree(). This is a prerequisite for a later patch: it
      means w->root does not need to be rcu protected when replacing the
      rwlock with RCU. All prune related variables are removed as well,
      as they are no longer used.
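
      A sketch of the changed lookup flow (the helper names here are
      assumptions based on this series, not exact code):

        /* after finding the fib6_node with the longest prefix match */
        rt = rt6_device_match(net, fn_leaf, saddr, oif, flags);
        rt_cache = rt6_find_cached_rt(rt, daddr, saddr); /* exceptions */
        if (rt_cache)
                rt = rt_cache;  /* prefer the pmtu/redirect dst cache */
        /* otherwise backtrack up the tree as before */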
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare fib6_locate() for exception table · 38fbeeee
      Wei Wang committed
      fib6_locate() is used to find the fib6_node according to the passed in
      prefix address key. It currently tries to find the fib6_node with the
      exact match of the passed in key. However, when we move cached routes
      into the exception table, fib6_locate() will fail to find the fib6_node
      for it as the cached routes will be stored in the exception table under
      the fib6_node with the longest prefix match of the cache's dst addr key.
      This commit adds a new parameter to let the caller specify whether
      it needs an exact match or a longest prefix match.
      For now, all callers still request an exact match when calling
      fib6_locate(). This will change in a later commit, where the
      exception table is hooked up to store cached routes.
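
      A sketch of the extended signature (the exact parameter name is an
      assumption):

        struct fib6_node *fib6_locate(struct fib6_node *root,
                                      const struct in6_addr *daddr,
                                      int dst_len,
                                      const struct in6_addr *saddr,
                                      int src_len,
                                      bool exact_match);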
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prepare fib6_age() for exception table · c757faa8
      Wei Wang committed
      If all dst cache entries are stored in the exception table under
      the main route, we have to go through them during fib6_age() when
      doing garbage collection.
      Introduce a new function rt6_age_exception() which goes through
      all dst entries in the exception table and removes those that have
      expired. This function is called from fib6_age() so that all dst
      caches are also garbage collected.
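
      Schematically, the gc path now also reaps the exception table (a
      sketch; the helper's exact signature is an assumption):

        static int fib6_age(struct rt6_info *rt, void *arg)
        {
                struct fib6_gc_args *gc_args = arg;
                unsigned long now = jiffies;

                /* ... existing expiry checks on the main route ... */

                /* also drop expired dst caches under this route */
                rt6_age_exception(rt, gc_args, now);
                return 0;
        }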
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce a hash table to store dst cache · 35732d01
      Wei Wang committed
      Add a hash table into struct rt6_info in order to store dst caches
      created by pmtu discovery and ip redirect in the ipv6 routing
      code.
      APIs to add, delete, find and update dst cache entries in the hash
      table are implemented and will be used in later commits.
      This is preparatory work for moving all cached routes into the
      exception table instead of inserting them into the fib6 tree.
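
      A sketch of the store this implies (layout approximated from the
      description):

        struct rt6_exception_bucket {
                struct hlist_head  chain;
                int                depth;
        };

        struct rt6_exception {
                struct hlist_node  hlist;
                struct rt6_info    *rt6i;   /* the cached dst entry */
                unsigned long      stamp;   /* for gc/expiry        */
                struct rcu_head    rcu;
        };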
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce a new function fib6_update_sernum() · 180ca444
      Wei Wang committed
      This function takes a route as input and tries to update the
      sernum in the fib6_node this route is associated with. It will be
      used in a later commit when adding a cached route into the
      exception table under that route.
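
      A minimal sketch of what such a helper could look like at this
      point in the series (locking details are assumptions):

        void fib6_update_sernum(struct rt6_info *rt)
        {
                struct net *net = dev_net(rt->dst.dev);
                struct fib6_node *fn;

                write_lock_bh(&rt->rt6i_table->tb6_lock);
                fn = rt->rt6i_node;  /* node this route hangs off */
                if (fn)
                        fn->fn_sernum = fib6_new_sernum(net);
                write_unlock_bh(&rt->rt6i_table->tb6_lock);
        }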
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 07 Oct 2017 (3 commits)
    • tcp: implement rb-tree based retransmit queue · 75c119af
      Eric Dumazet committed
      Using a linear list to store all skbs in the write queue has been
      okay for quite a while: O(N) is not too bad when N < 500.
      
      Things get messy when N is on the order of 100,000: modern TCP
      stacks want 10Gbit+ of throughput even on flows with 200 ms RTT.
      
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
      
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      
      The sender has to process thousands of unfriendly SACKs,
      accumulating a huge socket backlog, burning a cpu and massively
      dropping packets.
      
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) The RTX queue is no longer part of the write queue: already
      sent skbs are stored in one rb-tree.

      2) Since reaching the head of the write queue no longer needs
      sk->sk_send_head, we add a union of sk_send_head and
      tcp_rtx_queue.
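
      Schematically, the union in (2) looks like this (a sketch of the
      idea; the surrounding struct sock layout is elided):

        union {
                struct sk_buff *sk_send_head;  /* non-TCP sockets    */
                struct rb_root tcp_rtx_queue;  /* rb-tree of sent    */
        };                                     /* skbs, keyed by seq */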
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: uninline tcp_write_queue_purge() · ac3f09ba
      Eric Dumazet committed
      Since the upcoming rtx rbtree will add some extra code,
      it is time to not inline this fat function anymore.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Convert icmpv6_push_pending_frames to void · 4e64b1ed
      Joe Perches committed
      commit cc71b7b0 ("net/ipv6: remove unused err variable on
      icmpv6_push_pending_frames") exposed that the return value of
      icmpv6_push_pending_frames is not used.

      Remove the now unnecessary int err declarations and uses.
      
      Miscellanea:
      
      o Remove unnecessary goto and out: labels
      o Realign arguments
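
      After the conversion the prototype is simply void (a sketch of the
      resulting declaration):

        void icmpv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
                                        struct icmp6hdr *thdr, int len);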
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 06 Oct 2017 (5 commits)
    • tcp: new list for sent but unacked skbs for RACK recovery · e2080072
      Eric Dumazet committed
      This patch adds a new queue (list) that tracks the sent but not yet
      acked or SACKed skbs for a TCP connection. The list is chronologically
      ordered by skb->skb_mstamp (the head is the oldest sent skb).
      
      This list will be used to optimize TCP RACK recovery, which checks
      an skb's timestamp to judge if it has been lost and needs to be
      retransmitted. Since the TCP write queue is ordered by sequence
      instead of sent time, RACK has to scan the write queue to catch
      all eligible packets for lost-retransmission detection, iterating
      through SACKed skbs repeatedly.
      
      Special care is needed for rare events:
      1. TCP repair fakes skb transmission, so the send queue needs to
         be adjusted.
      2. SACK reneging would require re-inserting SACKed skbs into the
         send queue. For now I believe it's not worth the complexity to
         make RACK work perfectly on SACK reneging, so we do nothing here.
      3. Fast Open: currently for non-TFO, send-queue correctly queues
         the pure SYN packet. For TFO which queues a pure SYN and
         then a data packet, send-queue only queues the data packet but
         not the pure SYN due to the structure of TFO code. This is okay
         because the SYN receiver would never respond with a SACK on a
         missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
      
      In order to not grow sk_buff, we use a union for the new list and
      the _skb_refdst/destructor fields. This is a bit complicated
      because we need to make sure _skb_refdst and destructor are
      properly zeroed before the skb is cloned/copied at transmit, and
      before being freed.
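
      The sk_buff union described above looks roughly like this (a
      sketch; the list member name is an assumption):

        union {
                struct {
                        unsigned long _skb_refdst;
                        void (*destructor)(struct sk_buff *skb);
                };
                struct list_head tcp_tsorted_anchor; /* sent-time order */
        };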
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: uniform the set up of sockets after successful connection · 27204aaa
      Wei Wang committed
      Currently in the TCP code, the initialization sequence for cached
      metrics, congestion control, BPF, etc., after a successful
      connection is very inconsistent. This introduces inconsistent
      behavior and is prone to bugs. The current call sequences are as
      follows:
      
      (1) for active case (tcp_finish_connect() case):
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      
      (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_mtup_init(sk);
              tcp_init_buffer_space(sk);
              tcp_init_metrics(sk);
      
      (3) for TFO passive case (tcp_fastopen_create_child()):
              inet_csk(child)->icsk_af_ops->rebuild_header(child);
              tcp_init_congestion_control(child);
              tcp_mtup_init(child);
              tcp_init_metrics(child);
              tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_buffer_space(child);
      
      This commit unifies the above paths to use the following sequence:
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      This sequence is the same as the (1) active case. We pick it
      because this order correctly allows BPF to override settings from
      the route, including the congestion control module and initial
      cwnd, and then allows the CC module to see those settings.
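
      Folding the sequence into a single helper gives something like the
      sketch below (the helper name tcp_init_transfer and the bpf_op
      parameter are assumptions based on this description):

        void tcp_init_transfer(struct sock *sk, int bpf_op)
        {
                struct inet_connection_sock *icsk = inet_csk(sk);

                tcp_mtup_init(sk);
                icsk->icsk_af_ops->rebuild_header(sk);
                tcp_init_metrics(sk);
                tcp_call_bpf(sk, bpf_op); /* ACTIVE or PASSIVE CB */
                tcp_init_congestion_control(sk);
                tcp_init_buffer_space(sk);
        }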
      Suggested-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • VSOCK: use TCP state constants for sk_state · 3b4477d2
      Stefan Hajnoczi committed
      There are two state fields: socket->state and sock->sk_state.  The
      socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
      sock->sk_state typically uses values that match TCP state constants
      (TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
      and instead uses SS_* constants for both fields.
      
      The sk_state field will be exposed to userspace through the vsock_diag
      interface for ss(8), netstat(8), and other programs.
      
      This patch switches sk_state to TCP state constants so that the meaning
      of this field is consistent with other address families.  Not just
      AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.
      
      The following mapping was used to convert the code:
      
        SS_FREE -> TCP_CLOSE
        SS_UNCONNECTED -> TCP_CLOSE
        SS_CONNECTING -> TCP_SYN_SENT
        SS_CONNECTED -> TCP_ESTABLISHED
        SS_DISCONNECTING -> TCP_CLOSING
        VSOCK_SS_LISTEN -> TCP_LISTEN
      
      In __vsock_create() the sk_state initialization was dropped because
      sock_init_data() already initializes sk_state to TCP_CLOSE.
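
      For illustration only (a hypothetical check, not code from the
      patch), state tests now read like other address families:

        static bool vsock_connected(const struct sock *sk)
        {
                return sk->sk_state == TCP_ESTABLISHED; /* was SS_CONNECTED */
        }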
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h · bf359b81
      Stefan Hajnoczi committed
      The vsock_diag.ko module will need to check socket table membership.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • VSOCK: export socket tables for sock_diag interface · 44f20980
      Stefan Hajnoczi committed
      The socket table symbols need to be exported from vsock.ko so that the
      vsock_diag.ko module will be able to traverse sockets.
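
      Schematically (symbol names assumed from the description; the diag
      module must walk both tables under the table lock):

        EXPORT_SYMBOL_GPL(vsock_bind_table);
        EXPORT_SYMBOL_GPL(vsock_connected_table);
        EXPORT_SYMBOL_GPL(vsock_table_lock);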
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 05 Oct 2017 (3 commits)
  12. 04 Oct 2017 (3 commits)
    • sctp: introduce round robin stream scheduler · ac1ed8b8
      Marcelo Ricardo Leitner committed
      This patch introduces the RFC Draft ndata section 3.2 Round Robin
      Scheduler (SCTP_SS_RR).
      
      It works by maintaining a list of enqueued streams and tracking
      the last one used to send data. When the datamsg is done, it
      switches to the next stream.
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
      Tested-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: introduce priority based stream scheduler · 637784ad
      Marcelo Ricardo Leitner committed
      This patch introduces RFC Draft ndata section 3.4 Priority Based
      Scheduler (SCTP_SS_PRIO).
      
      It works by having a struct sctp_stream_priority for each
      configured priority. This struct is then put on a queue ordered by
      priority if, and only if, there is a stream with data queued, so
      that dequeueing is very straightforward: either finish the current
      datamsg or simply dequeue from the highest queued priority, which
      is the next stream pointed to, and that's it.

      If multiple streams are assigned the same priority and have data
      queued, it will round robin amongst them while respecting datamsg
      boundaries (when not using idata chunks), to be reasonably fair.
      
      We intentionally don't maintain a list of priorities nor a list of
      all streams with the same priority, to save memory. The first
      would mean at least 2 extra pointers per priority (which, for 1000
      priorities, can mean 16kB) and the second would also mean 2 extra
      pointers, but per stream; as SCTP supports up to 65535 streams on
      a given asoc, that's 1MB. This has a cost when assigning a
      priority to a stream, as we have to find out whether the new
      priority is already in use and whether the old one can be freed,
      and also when tearing down.
      
      The new fields in struct sctp_stream_out_ext and sctp_stream are added
      under a union because that memory is to be shared with other schedulers.
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
      Tested-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: introduce stream scheduler foundations · 5bbbbe32
      Marcelo Ricardo Leitner committed
      This patch introduces the hooks necessary to do stream scheduling, as
      per RFC Draft ndata.  It also introduces the first scheduler, which is
      what we do today but now factored out: first come first served (FCFS).
      
      With stream scheduling we now have to track which chunk was
      enqueued on which stream and be able to select a chunk other than
      the one at the front of the main outqueue. So we introduce a list
      on the sctp_stream_out_ext structure for this purpose.
      
      We reuse the sctp_chunk->transmitted_list space for the list
      above, as a chunk cannot belong to both lists at the same time. By
      using a union there, we can have distinct names for these two
      uses.
      
      sctp_sched_ops are the operations expected to be implemented by
      each scheduler. The dequeueing is a bit particular to this
      implementation, but it matches how we dequeue packets today: we
      first dequeue, then check if the chunk fits the packet, and if
      not, we requeue it at the head. That is why we don't have a peek
      operation but a dequeue_done instead, which is called once the
      chunk can safely be considered as transmitted.
      
      The check removed from sctp_outq_flush is now performed by
      sctp_stream_outq_migrate, which is only called during assoc setup.
      (sctp_sendmsg() also checks for it)
      
      The only operation foreseen but not yet added here is a way to
      signal that a new packet is starting or that the packet is done,
      for the round robin per-packet scheduler; it is intentionally left
      to the patch that actually implements it.
      
      Support for I-DATA chunks, also described in this RFC, with user message
      interleaving is straightforward as it just requires the schedulers to
      probe for the feature and ignore datamsg boundaries when dequeueing.
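
      A sketch of the ops table this describes (the member set is
      approximated from the text; exact prototypes are assumptions):

        struct sctp_sched_ops {
                /* per-stream scheduler property (e.g. priority)      */
                int  (*set)(struct sctp_stream *stream, __u16 sid,
                            __u16 value, gfp_t gfp);
                int  (*get)(struct sctp_stream *stream, __u16 sid,
                            __u16 *value);
                /* scheduler state lifetime                           */
                int  (*init)(struct sctp_stream *stream);
                void (*free)(struct sctp_stream *stream);
                /* no peek: dequeue, then dequeue_done once the chunk
                 * can safely be considered transmitted (see above)   */
                void (*enqueue)(struct sctp_outq *q,
                                struct sctp_datamsg *msg);
                struct sctp_chunk *(*dequeue)(struct sctp_outq *q);
                void (*dequeue_done)(struct sctp_outq *q,
                                     struct sctp_chunk *ch);
        };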
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
      Tested-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>