1. 18 5月, 2017 30 次提交
  2. 17 5月, 2017 4 次提交
    • E
      tcp: internal implementation for pacing · 218af599
      Eric Dumazet 提交于
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implemening pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      218af599
    • P
      udp: keep the sk_receive_queue held when splicing · 6dfb4367
      Paolo Abeni 提交于
      On packet reception, when we are forced to splice the
      sk_receive_queue, we can keep the related lock held, so
      that we can avoid re-acquiring it, if fwd memory
      scheduling is required.
      
      v1 -> v2:
        the rx_queue_lock_held param in udp_rmem_release() is
        now a bool
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dfb4367
    • P
      udp: use a separate rx queue for packet reception · 2276f58a
      Paolo Abeni 提交于
      under udp flood the sk_receive_queue spinlock is heavily contended.
      This patch try to reduce the contention on such lock adding a
      second receive queue to the udp sockets; recvmsg() looks first
      in such queue and, only if empty, tries to fetch the data from
      sk_receive_queue. The latter is spliced into the newly added
      queue every time the receive path has to acquire the
      sk_receive_queue lock.
      
      The accounting of forward allocated memory is still protected with
      the sk_receive_queue lock, so udp_rmem_release() needs to acquire
      both locks when the forward deficit is flushed.
      
      On specific scenarios we can end up acquiring and releasing the
      sk_receive_queue lock multiple times; that will be covered by
      the next patch
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2276f58a
    • P
      net/sock: factor out dequeue/peek with offset code · 65101aec
      Paolo Abeni 提交于
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65101aec
  3. 16 5月, 2017 3 次提交
  4. 12 5月, 2017 3 次提交
    • X
      sctp: fix src address selection if using secondary addresses for ipv6 · dbc2b5e9
      Xin Long 提交于
      Commit 0ca50d12 ("sctp: fix src address selection if using secondary
      addresses") has fixed a src address selection issue when using secondary
      addresses for ipv4.
      
      Now sctp ipv6 also has the similar issue. When using a secondary address,
      sctp_v6_get_dst tries to choose the saddr which has the most same bits
      with the daddr by sctp_v6_addr_match_len. It may make some cases not work
      as expected.
      
      hostA:
        [1] fd21:356b:459a:cf10::11 (eth1)
        [2] fd21:356b:459a:cf20::11 (eth2)
      
      hostB:
        [a] fd21:356b:459a:cf30::2  (eth1)
        [b] fd21:356b:459a:cf40::2  (eth2)
      
      route from hostA to hostB:
        fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
      
      The expected path should be:
        fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
      But addr[2] matches addr[a] more bits than addr[1] does, according to
      sctp_v6_addr_match_len. It causes the path to be:
        fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
      
      This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
      As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
      if the saddr is in a dev instead.
      
      Note that for backwards compatibility, it will still do the addr_match_len
      check here when no optimal is found.
      Reported-by: NPatrick Talbert <ptalbert@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbc2b5e9
    • J
      tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Jon Paul Maloy 提交于
      The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
      to fulfil its task. The latter, in turn, is evaluating the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      844cf763
    • E
      net: sched: optimize class dumps · cb395b20
      Eric Dumazet 提交于
      In commit 59cc1f61 ("net: sched: convert qdisc linked list to
      hashtable") we missed the opportunity to considerably speed up
      tc_dump_tclass_root() if a qdisc handle is provided by user.
      
      Instead of iterating all the qdiscs, use qdisc_match_from_root()
      to directly get the one we look for.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb395b20