1. 09 10月, 2013 6 次提交
    • E
      udp: fix a typo in __udp4_lib_mcast_demux_lookup · f69b923a
      Eric Dumazet 提交于
      At this point sk might contain garbage.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f69b923a
    • E
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet 提交于
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe4208f
    • E
      tcp/dccp: remove twchain · 05dbc7b5
      Eric Dumazet 提交于
      TCP listener refactoring, part 3 :
      
      Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
      and parallel SYN processing.
      
      Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
      friend states) sockets, another for TIME_WAIT sockets only.
      
      As the hash table is sized to get at most one socket per bucket, it
      makes little sense to have separate twchain, as it makes the lookup
      slightly more complicated, and doubles hash table memory usage.
      
      If we make sure all socket types have the lookup keys at the same
      offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
      and ESTABLISHED sockets already have common lookup fields for IPv4.
      
      [ INET_TW_MATCH() is no longer needed ]
      
      I'll provide a follow-up to factorize IPv6 lookup as well, to remove
      INET6_TW_MATCH()
      
      This way, SYN_RECV pseudo sockets will be supported the same.
      
      A new sock_gen_put() helper is added, doing either a sock_put() or
      inet_twsk_put() [ and will support SYN_RECV later ].
      
      Note this helper should only be called in real slow path, when rcu
      lookup found a socket that was moved to another identity (freed/reused
      immediately), but could eventually be used in other contexts, like
      sock_edemux()
      
      Before patch :
      
      dmesg | grep "TCP established"
      
      TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
      
      After patch :
      
      TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05dbc7b5
    • S
      net: ipv4 only populate IP_PKTINFO when needed · fbf8866d
      Shawn Bohrer 提交于
      The since the removal of the routing cache computing
      fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
      packet received.  This has introduced a performance regression for some
      UDP workloads.
      
      This change skips populating the packet info for sockets that do not have
      IP_PKTINFO set.
      
      Benchmark results from a netperf UDP_RR test:
      Before 89789.68 transactions/s
      After  90587.62 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.63us RTT
      After  12.48us RTT
      Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fbf8866d
    • S
      udp: ipv4: Add udp early demux · 421b3885
      Shawn Bohrer 提交于
      The removal of the routing cache introduced a performance regression for
      some UDP workloads since a dst lookup must be done for each packet.
      This change caches the dst per socket in a similar manner to what we do
      for TCP by implementing early_demux.
      
      For UDP multicast we can only cache the dst if there is only one
      receiving socket on the host.  Since caching only works when there is
      one receiving socket we do the multicast socket lookup using RCU.
      
      For UDP unicast we only demux sockets with an exact match in order to
      not break forwarding setups.  Additionally since the hash chains may be
      long we only check the first socket to see if it is a match and not
      waste extra time searching the whole chain when we might not find an
      exact match.
      
      Benchmark results from a netperf UDP_RR test:
      Before 87961.22 transactions/s
      After  89789.68 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.97us RTT
      After  12.63us RTT
      Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      421b3885
    • S
      udp: Only allow busy read/poll on connected sockets · 005ec974
      Shawn Bohrer 提交于
      UDP sockets can receive packets from multiple endpoints and thus may be
      received on multiple receive queues.  Since packets packets can arrive
      on multiple receive queues we should not mark the napi_id for all
      packets.  This makes busy read/poll only work for connected UDP sockets.
      
      This additionally enables busy read/poll for UDP multicast packets as
      long as the socket is connected by moving the check into
      __udp_queue_rcv_skb().
      Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      005ec974
  2. 08 10月, 2013 1 次提交
  3. 05 10月, 2013 1 次提交
    • E
      tcp: do not forget FIN in tcp_shifted_skb() · 5e8a402f
      Eric Dumazet 提交于
      Yuchung found following problem :
      
       There are bugs in the SACK processing code, merging part in
       tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
       skbs FIN flag. When a receiver first SACK the FIN sequence, and later
       throw away ofo queue (e.g., sack-reneging), the sender will stop
       retransmitting the FIN flag, and hangs forever.
      
      Following packetdrill test can be used to reproduce the bug.
      
      $ cat sack-merge-bug.pkt
      `sysctl -q net.ipv4.tcp_fack=0`
      
      // Establish a connection and send 10 MSS.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +.000 bind(3, ..., ...) = 0
      +.000 listen(3, 1) = 0
      
      +.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.001 < . 1:1(0) ack 1 win 1024
      +.000 accept(3, ..., ...) = 4
      
      +.100 write(4, ..., 12000) = 12000
      +.000 shutdown(4, SHUT_WR) = 0
      +.000 > . 1:10001(10000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257
      +.000 > FP. 10001:12001(2000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
      // SACK reneg
      +.050 < . 1:1(0) ack 12001 win 257
      +0 %{ print "unacked: ",tcpi_unacked }%
      +5 %{ print "" }%
      
      First, a typo inverted left/right of one OR operation, then
      code forgot to advance end_seq if the merged skb carried FIN.
      
      Bug was added in 2.6.29 by commit 832d11c5
      ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e8a402f
  4. 04 10月, 2013 3 次提交
    • E
      tcp: shrink tcp6_timewait_sock by one cache line · 96f817fe
      Eric Dumazet 提交于
      While working on tcp listener refactoring, I found that it
      would really make things easier if sock_common could include
      the IPv6 addresses needed in the lookups, instead of doing
      very complex games to get their values (depending on sock
      being SYN_RECV, ESTABLISHED, TIME_WAIT)
      
      For this to happen, I need to be sure that tcp6_timewait_sock
      and tcp_timewait_sock consume same number of cache lines.
      
      This is possible if we only use 32bits for tw_ttd, as we remove
      one 32bit hole in inet_timewait_sock
      
      inet_tw_time_stamp() is defined and used, even if its current
      implementation looks like tcp_time_stamp : We might need finer
      resolution for tcp_time_stamp in the future.
      
      Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8
      
      After patch : sizeof(struct tcp6_timewait_sock) = 0xc0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96f817fe
    • P
      net: ipv4: Change variable type to bool · 34a6eda1
      Peter Senna Tschudin 提交于
      The variable fully_acked is only assigned the values true and false.
      Change its type to bool.
      
      The simplified semantic patch that find this problem is as
      follows (http://coccinelle.lip6.fr/):
      
      @exists@
      type T;
      identifier b;
      @@
      - T
      + bool
        b = ...;
        ... when any
        b = \(true\|false\)
      Signed-off-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34a6eda1
    • E
      inet: consolidate INET_TW_MATCH · 50805466
      Eric Dumazet 提交于
      TCP listener refactoring, part 2 :
      
      We can use a generic lookup, sockets being in whatever state, if
      we are sure all relevant fields are at the same place in all socket
      types (ESTABLISH, TIME_WAIT, SYN_RECV)
      
      This patch removes these macros :
      
       inet_addrpair, inet_addrpair, tw_addrpair, tw_portpair
      
      And adds :
      
       sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr
      
      Then, INET_TW_MATCH() is really the same than INET_MATCH()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      50805466
  5. 03 10月, 2013 4 次提交
    • E
      net: do not call sock_put() on TIMEWAIT sockets · 80ad1d61
      Eric Dumazet 提交于
      commit 3ab5aee7 ("net: Convert TCP & DCCP hash tables to use RCU /
      hlist_nulls") incorrectly used sock_put() on TIMEWAIT sockets.
      
      We should instead use inet_twsk_put()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80ad1d61
    • E
      tcp: sndbuf autotuning improvements · 6ae70532
      Eric Dumazet 提交于
      tcp_fixup_sndbuf() is underestimating initial send buffer requirements.
      
      It was not noticed because big GSO packets were escaping the limitation,
      but with smaller TSO packets (or TSO/GSO/SG off), application hits
      sk_sndbuf before having a chance to fill enough packets in socket write
      queue.
      
      - initial cwnd can be bigger than 10 for specific routes
      
      - SKB_TRUESIZE() is a bit under real needs in some cases,
        because of power-of-two rounding in kmalloc()
      
      - Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor
      
      - Extra cushion (application might react slowly to POLLOUT)
      
      tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
      calling tcp_init_buffer_space()
      
      Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
      instead of duplicating this stuff.
      
      Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
      descriptive.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae70532
    • B
      fib_trie: avoid a redundant bit judgement in inflate · bbe34cf8
      baker.zhang 提交于
      Because 'node' is the i'st child of 'oldnode',
      thus, here 'i' equals
      tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)
      
      we just get 1 more bit,
      and need not care the detail value of this bits.
      
      I apologize for the mistake.
      
      I generated the patch on a branch version,
      and did not notice the put_child has been changed.
      
      I have redone the test on HEAD version with my patch.
      
      two cases are used.
      case 1. inflate a node which has a leaf child node.
      case 2: inflate a node which has a an child node with skipped bits
      
      test env:
        ip link set eth0 up
        ip a add dev eth0 192.168.11.1/32
      here, we just focus on route table(MAIN),
      so I use a "192.168.11.1/32" address to simplify the test case.
      
      call trace:
      + fib_insert_node
      + + trie_rebalance
      + + + resize
      + + + + inflate
      
      Test case 1:  inflate a node which has a leaf child node.
      
      ===========================================================
      step 1. prepare a fib trie
      ------------------------------------------
        ip r a 192.168.0.0/24 via 192.168.11.1
        ip r a 192.168.1.0/24 via 192.168.11.1
      
      we get a fib trie.
      root@baker:~# cat /proc/net/fib_trie
      Main:
        +-- 192.168.0.0/23 1 0 0
         |-- 192.168.0.0
          /24 universe UNICAST
         |-- 192.168.1.0
          /24 universe UNICAST
      Local:
      .....
      
      step 2. Add the third route
      ------------------------------------------
      root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1
      
      A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.
      
      For function 'inflate':
      'inflate' is called with following trie.
        +-- 192.168.0.0/22 1 1 0 <=== tn node
          +-- 192.168.0.0/23 1 0 0    <== node a
              |-- 192.168.0.0
                /24 universe UNICAST
              |-- 192.168.1.0
                /24 universe UNICAST
            |-- 192.168.2.0          <== leaf(node b)
      
      When process node b, which is a leaf. here:
      i is 1,
      node key "192.168.2.0"
      oldnode is (pos:22, bits:1)
      
      unpatch source:
      tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
      it equals:
      tkey_extract_bits("192.168,2,0", 22 + 1, 1)
      
      thus got 0, and call put_child(tn, 2*i, node); <== 2*i=2.
      
      patched source:
      tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
      tkey_extract_bits("192.168,2,0", 22, 1 + 1)  <== get 2.
      
      Test case 2:  inflate a node which has a an child node with skipped bits
      ==========================================================================
      step 1. prepare a fib trie.
        ip link set eth0 up
        ip a add dev eth0 192.168.11.1/32
        ip r a 192.168.128.0/24 via 192.168.11.1
        ip r a 192.168.0.0/24  via 192.168.11.1
        ip r a 192.168.16.0/24   via 192.168.11.1
        ip r a 192.168.32.0/24  via 192.168.11.1
        ip r a 192.168.48.0/24  via 192.168.11.1
        ip r a 192.168.144.0/24   via 192.168.11.1
        ip r a 192.168.160.0/24   via 192.168.11.1
        ip r a 192.168.176.0/24   via 192.168.11.1
      
      check:
      root@baker:~# cat /proc/net/fib_trie
      Main:
        +-- 192.168.0.0/16 1 0 0
           +-- 192.168.0.0/18 2 0 0
              |-- 192.168.0.0
                 /24 universe UNICAST
              |-- 192.168.16.0
                 /24 universe UNICAST
              |-- 192.168.32.0
                 /24 universe UNICAST
              |-- 192.168.48.0
                 /24 universe UNICAST
           +-- 192.168.128.0/18 2 0 0
              |-- 192.168.128.0
                 /24 universe UNICAST
              |-- 192.168.144.0
                 /24 universe UNICAST
              |-- 192.168.160.0
                 /24 universe UNICAST
              |-- 192.168.176.0
                 /24 universe UNICAST
      Local:
        ...
      
      step 2. add a route to trigger inflate.
        ip r a 192.168.96.0/24   via 192.168.11.1
      
      This command will call serveral times inflate.
      In the first time, the fib_trie is:
      ________________________
      +-- 192.168.128.0/(16, 1) <== tn node
       +-- 192.168.0.0/(17, 1)  <== node a
        +-- 192.168.0.0/(18, 2)
         |-- 192.168.0.0
         |-- 192.168.16.0
         |-- 192.168.32.0
         |-- 192.168.48.0
        |-- 192.168.96.0
       +-- 192.168.128.0/(18, 2) <== node b.
        |-- 192.168.128.0
        |-- 192.168.144.0
        |-- 192.168.160.0
        |-- 192.168.176.0
      
      NOTE: node b is a interal node with skipped bits.
      here,
      i:1,
      node->key "192.168.128.0",
      oldnode:(pos:16, bits:1)
      so
      tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
      it equals:
      tkey_extract_bits("192.168,128,0", 16 + 1, 1) <=== 0
      
      tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits, 1)
      it equals:
      tkey_extract_bits("192.168,128,0", 16, 1+1) <=== 2
      
      2*i + 0 == 2, so the result is same.
      Signed-off-by: Nbaker.zhang <baker.kernel@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bbe34cf8
    • A
      tcp: Always set options to 0 before calling tcp_established_options · 5843ef42
      Andi Kleen 提交于
      tcp_established_options assumes opts->options is 0 before calling,
      as it read modify writes it.
      
      For the tcp_current_mss() case the opts structure is not zeroed,
      so this can be done with uninitialized values.
      
      This is ok, because ->options is not read in this path.
      But it's still better to avoid the operation on the uninitialized
      field. This shuts up a static code analyzer, and presumably
      may help the optimizer.
      
      Cc: netdev@vger.kernel.org
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5843ef42
  6. 02 10月, 2013 4 次提交
  7. 01 10月, 2013 4 次提交
    • S
      ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put · e2401654
      Salam Noureddine 提交于
      It is possible for the timer handlers to run after the call to
      ip_mc_down so use in_dev_put instead of __in_dev_put in the handler
      function in order to do proper cleanup when the refcnt reaches 0.
      Otherwise, the refcnt can reach zero without the in_device being
      destroyed and we end up leaking a reference to the net_device and
      see messages like the following,
      
      unregister_netdevice: waiting for eth0 to become free. Usage count = 1
      
      Tested on linux-3.4.43.
      Signed-off-by: NSalam Noureddine <noureddine@aristanetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2401654
    • E
      net ipv4: Convert ipv4.ip_local_port_range to be per netns v3 · 0bbf87d8
      Eric W. Biederman 提交于
      - Move sysctl_local_ports from a global variable into struct netns_ipv4.
      - Modify inet_get_local_port_range to take a struct net, and update all
        of the callers.
      - Move the initialization of sysctl_local_ports into
         sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
      
      v2:
      - Ensure indentation used tabs
      - Fixed ip.h so it applies cleanly to todays net-next
      
      v3:
      - Compile fixes of strange callers of inet_get_local_port_range.
        This patch now successfully passes an allmodconfig build.
        Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range
      Originally-by: NSamya <samya@twitter.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0bbf87d8
    • E
      tcp: TSQ can use a dynamic limit · c9eeec26
      Eric Dumazet 提交于
      When TCP Small Queues was added, we used a sysctl to limit amount of
      packets queues on Qdisc/device queues for a given TCP flow.
      
      Problem is this limit is either too big for low rates, or too small
      for high rates.
      
      Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
      auto sizing, it can better control number of packets in Qdisc/device
      queues.
      
      New limit is two packets or at least 1 to 2 ms worth of packets.
      
      Low rates flows benefit from this patch by having even smaller
      number of packets in queues, allowing for faster recovery,
      better RTT estimations.
      
      High rates flows benefit from this patch by allowing more than 2 packets
      in flight as we had reports this was a limiting factor to reach line
      rate. [ In particular if TX completion is delayed because of coalescing
      parameters ]
      
      Example for a single flow on 10Gbp link controlled by FQ/pacing
      
      14 packets in flight instead of 2
      
      $ tc -s -d qd
      qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 1024 quantum 3028 initial_quantum 15140
       Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
      requeues 6822476)
       rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
        2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
        2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
      
      Note that sk_pacing_rate is currently set to twice the actual rate, but
      this might be refined in the future when a flow is in congestion
      avoidance.
      
      Additional change : skb->destructor should be set to tcp_wfree().
      
      A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9eeec26
    • P
      ip_tunnel: Do not use stale inner_iph pointer. · d4a71b15
      Pravin B Shelar 提交于
      While sending packet skb_cow_head() can change skb header which
      invalidates inner_iph pointer to skb header. Following patch
      avoid using it. Found by code inspection.
      
      This bug was introduced by commit 0e6fbc5b (ip_tunnels: extend
      iptunnel_xmit()).
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4a71b15
  8. 30 9月, 2013 1 次提交
  9. 29 9月, 2013 4 次提交
    • E
      net: introduce SO_MAX_PACING_RATE · 62748f32
      Eric Dumazet 提交于
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by transport layer. Value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62748f32
    • F
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco 提交于
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
      Signed-off-by: NFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa661581
    • F
      ipv4: IP_TOS and IP_TTL can be specified as ancillary data · f02db315
      Francesco Fusco 提交于
      This patch enables the IP_TTL and IP_TOS values passed from userspace to
      be stored in the ipcm_cookie struct. Three fields are added to the struct:
      
      - the TTL, expressed as __u8.
        The allowed values are in the [1-255].
        A value of 0 means that the TTL is not specified.
      
      - the TOS, expressed as __s16.
        The allowed values are in the range [0,255].
        A value of -1 means that the TOS is not specified.
      
      - the priority, expressed as a char and computed when
        handling the ancillary data.
      Signed-off-by: NFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f02db315
    • E
      net: net_secret should not depend on TCP · 9a3bab6b
      Eric Dumazet 提交于
      A host might need net_secret[] and never open a single socket.
      
      Problem added in commit aebda156
      ("net: defer net_secret[] initialization")
      
      Based on prior patch from Hannes Frederic Sowa.
      Reported-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@strressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a3bab6b
  10. 24 9月, 2013 5 次提交
  11. 20 9月, 2013 2 次提交
  12. 18 9月, 2013 1 次提交
  13. 13 9月, 2013 1 次提交
  14. 07 9月, 2013 2 次提交
    • E
      tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Eric Dumazet 提交于
      TCP receive window handling is multi staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2
      
      This heuristic takes place in tcp_grow_window()
      
      Problem is : We currently call tcp_grow_window() only for in-order
      packets.
      
      This means that reorders or packet losses stop proper grow of
      rcv_win, and senders are unable to benefit from fast recovery,
      or proper reordering level detection.
      
      Really, a packet being stored in OFO queue is not a bad citizen.
      It should be part of the game as in-order packets.
      
      In our traces, we very often see sender is limited by linux small
      receive windows, even if linux hosts use autotuning (DRS) and should
      allow rcv_win to grow to ~3MB.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e4f1fc2
    • Y
      tcp: fix no cwnd growth after timeout · 16edfe7e
      Yuchung Cheng 提交于
      In commit 0f7cc9a3 "tcp: increase throughput when reordering is high",
      it only allows cwnd to increase in Open state. This mistakenly disables
      slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16edfe7e
  15. 06 9月, 2013 1 次提交