1. 01 Mar, 2014 3 commits
  2. 28 Feb, 2014 1 commit
  3. 27 Feb, 2014 8 commits
    • E
      tcp: switch rtt estimations to usec resolution · 740b0f18
      Eric Dumazet committed
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also requires this change for finer control
      and removal of bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'.
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility:
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
      
      In this example, the ss command displays an srtt of 32 usecs (10Gbit link):
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
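The figures in the ss sample can be cross-checked by hand. The sketch below is only an illustration: it assumes the send rate is cwnd * mss * 8 bits per smoothed RTT and that the pacing rate is twice that, which is what the sample numbers suggest rather than something the commit message states. The function names are hypothetical.

```python
# Values copied from the ss output above. The formulas are an assumption
# about how these numbers relate, not quoted from the patch.
MSS_BYTES = 1448
CWND_SEGMENTS = 10
SRTT_USEC = 32          # ss prints rtt:0.032 in milliseconds


def send_rate_mbps(cwnd, mss, srtt_usec):
    """Return cwnd * mss * 8 bits per smoothed RTT, in Mbit/s."""
    bits_per_rtt = cwnd * mss * 8
    return bits_per_rtt * 1_000_000 / srtt_usec / 1e6


def srtt_in_whole_msec(srtt_usec):
    """Show what an integer millisecond value would keep of the same RTT."""
    return srtt_usec // 1000


rate = send_rate_mbps(CWND_SEGMENTS, MSS_BYTES, SRTT_USEC)
print(f"send        {rate:.1f}Mbps")
print(f"pacing_rate {2 * rate:.1f}Mbps")
print(f"srtt at msec resolution: {srtt_in_whole_msec(SRTT_USEC)} ms")
```

With a whole-millisecond value the same RTT truncates to 0, which illustrates why microsecond resolution matters on such links.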
      
      Updated iproute2 ip command displays:

      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51
      
      Old binary displays:

      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740b0f18
    • H
      ipv6: yet another new IPV6_MTU_DISCOVER option IPV6_PMTUDISC_OMIT · 0b95227a
      Hannes Frederic Sowa committed
      This option has the same semantics as IP_PMTUDISC_OMIT for IPv4, which
      was recently introduced. It doesn't honor the path MTU discovered by the
      host but, in contrast to IPV6_PMTUDISC_INTERFACE, allows the generation of
      fragments if the packet size exceeds the MTU of the outgoing interface.
      
      Fixes: 93b36cf3 ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b95227a
    • H
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa committed
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option that does not honor
      any path MTU information and does not accept new ICMP notifications destined
      for the socket on which it is enabled. Outgoing fragmentation is still
      allowed when the packet size exceeds the outgoing interface MTU.

      As such, this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software,
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
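As a rough sketch of how an application might adopt the new mode, the helper below tries the OMIT mode and falls back to IP_PMTUDISC_DONT when the kernel rejects it. The numeric constants are the Linux UAPI values as I understand them, written out as an assumption because a given Python build may not export them; the helper name is hypothetical.

```python
import socket

# Linux UAPI values, spelled out as an assumption in case the running
# Python build does not export them from the socket module.
IP_MTU_DISCOVER = 10
IPV6_MTU_DISCOVER = 23
IP_PMTUDISC_DONT = 0      # IPV6_PMTUDISC_DONT has the same value
IP_PMTUDISC_OMIT = 5      # IPV6_PMTUDISC_OMIT has the same value


def set_pmtudisc_omit(sock):
    """Try the OMIT mode; fall back to DONT on kernels that reject it.

    Returns the mode that was applied.
    """
    if sock.family == socket.AF_INET6:
        level, optname = socket.IPPROTO_IPV6, IPV6_MTU_DISCOVER
    else:
        level, optname = socket.IPPROTO_IP, IP_MTU_DISCOVER
    try:
        sock.setsockopt(level, optname, IP_PMTUDISC_OMIT)
        return IP_PMTUDISC_OMIT
    except OSError:
        # Kernels without the new mode are expected to refuse the value.
        sock.setsockopt(level, optname, IP_PMTUDISC_DONT)
        return IP_PMTUDISC_DONT
```

A UDP name server would call such a helper once, right after creating its socket. Falling back to DONT mirrors the "drop-in replacement" wording above.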
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b346576
    • H
      ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment · 69647ce4
      Hannes Frederic Sowa committed
      ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
      is attached to the skb (in case of forwarding) or determines the mtu like
      we do in ip_finish_output, which actually checks if we should branch to
      ip_fragment. Thus use the same function to determine the mtu here, too.
      
      This is important for the introduction of IP_PMTUDISC_OMIT, where we
      want packets to be cut into pieces the size of the outgoing
      interface MTU. IPv6 already does this correctly.
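To make "pieces of the size of the outgoing interface MTU" concrete, here is a small model of IPv4 fragmentation arithmetic. It is an illustrative sketch only, not the kernel's ip_fragment logic: it assumes a 20-byte header without options, and the function name is hypothetical.

```python
def fragment_layout(payload_len, mtu, ip_header_len=20):
    """Split an IPv4 payload into (offset, length) pieces that fit in mtu.

    Every fragment except the last carries a multiple of 8 payload bytes,
    because the IPv4 fragment offset field counts 8-byte units.
    """
    per_fragment = (mtu - ip_header_len) & ~7
    if per_fragment <= 0:
        raise ValueError("mtu too small for the given header length")
    pieces = []
    offset = 0
    while offset < payload_len:
        length = min(per_fragment, payload_len - offset)
        pieces.append((offset, length))
        offset += length
    return pieces


# 4000 payload bytes leaving through an interface with a 1500-byte MTU.
print(fragment_layout(4000, 1500))
# -> [(0, 1480), (1480, 1480), (2960, 1040)]
```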

      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69647ce4
    • T
      neigh: probe application via netlink in NUD_PROBE · a960ff81
      Timo Teräs committed
      iproute2 arpd seems to expect this as there's code and comments
      to handle netlink probes with NUD_PROBE set. It is used to flush
      the arpd cached mappings.
      
      opennhrp instead turns off unicast probes (so it can handle all
      neighbour discovery). Without this change it will not see NUD_PROBE
      probes and cannot reconfirm the mapping. As a result, the neigh entry
      currently just fails, which can cause a few packets to be dropped until
      broadcast discovery is restarted.
      
      Earlier discussion on the subject:
      http://marc.info/?t=139305877100001&r=1&w=2

      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a960ff81
    • B
      ipv6: log src and dst along with "udp checksum is 0" · 84a3e72c
      Bjørn Mork committed
      These info messages are rather pointless without any means to identify
      the source of the bogus packets.  Logging the src and dst addresses and
      ports may help a bit.
      
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84a3e72c
    • A
      net: Add sysfs file for port number · 3f85944f
      Amir Vadai committed
      Add a sysfs file to enable user space to query the device
      port number used by a netdevice instance. This is needed for
      devices that have multiple ports on the same PCI function.
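A user-space query could look like the sketch below. The attribute name `dev_port` and its location under `/sys/class/net/<ifname>/` are assumptions, since the commit message does not name the file; the helper name is hypothetical.

```python
import os


def read_dev_port(ifname, sysfs_root="/sys/class/net"):
    """Return the port number of a netdevice, or None if it is not exposed.

    The attribute name "dev_port" is an assumption; the commit message
    above does not spell out the file name.
    """
    path = os.path.join(sysfs_root, ifname, "dev_port")
    try:
        with open(path) as attr:
            return int(attr.read().strip())
    except (OSError, ValueError):
        # Missing interface, older kernel, or unexpected file content.
        return None
```

Passing `sysfs_root` explicitly keeps the helper testable against a temporary directory.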

      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f85944f
    • F
      net: tcp: add mib counters to track zero window transitions · 8e165e20
      Florian Westphal committed
      Three counters are added:
      - one to track when we went from non-zero to zero window
      - one to track the reverse
      - one counter incremented when we want to announce zero window,
        but can't because we would shrink current window.
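The three counters can be read from user space once they are exported. The sketch below parses text in the header-line / value-line layout used by /proc/net/netstat. The counter names are an assumption, as the commit message does not list them, and the function names are hypothetical.

```python
# Assumed counter names; the commit message above does not list them.
ZERO_WINDOW_COUNTERS = (
    "TCPToZeroWindowAdv",
    "TCPFromZeroWindowAdv",
    "TCPWantZeroWindowAdv",
)


def parse_netstat(text):
    """Parse header-line / value-line pairs as used by /proc/net/netstat."""
    counters = {}
    lines = [line.split() for line in text.splitlines() if line.strip()]
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0]                       # e.g. "TcpExt:"
        for name, value in zip(names[1:], values[1:]):
            counters[prefix + name] = int(value)
    return counters


def zero_window_counters(text):
    """Pick the three zero-window counters out of the TcpExt section."""
    parsed = parse_netstat(text)
    return {name: parsed.get("TcpExt:" + name) for name in ZERO_WINDOW_COUNTERS}
```

On a live system the input would come from reading /proc/net/netstat. A counter that the running kernel does not export is reported as None.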

      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e165e20
  4. 25 Feb, 2014 17 commits
  5. 21 Feb, 2014 2 commits
  6. 20 Feb, 2014 3 commits
    • F
      tcp: use zero-window when free_space is low · 86c1a045
      Florian Westphal committed
      Currently the kernel tries to announce a zero window when free_space
      is below the current receiver mss estimate.
      
      When a sender is transmitting small packets and the reader consumes data
      slowly (or not at all), the receiver might be unable to shrink the receive
      window because

      a) we cannot withdraw already-committed receive window, and
      b) we have to round the current rwin up to a multiple of the wscale
         factor, else we would shrink the current window.
      
      This causes the receive buffer to fill up until the rmem limit is hit.
      When this happens, we start dropping packets.
      
      Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
      even if socket is not being read from.
      
      As we cannot avoid the "current_win is rounded up to multiple of mss"
      issue [we would violate a) above] at least try to prevent the receive buf
      growth towards tcp_rmem[2] limit by attempting to move to zero-window
      announcement when free_space becomes less than 1/16 of the current
      allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
      increase our chances to get a zero-window announcement out in time.
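The rule described above can be modelled in a few lines. This is a simplified sketch, not the kernel code: it ignores window-scale rounding and other details of the real window selection, and the function and parameter names are hypothetical.

```python
def should_announce_zero_window(free_space, allowed_space, mss):
    """Simplified decision from the description above.

    free_space    -- bytes still free in the receive buffer
    allowed_space -- current allowed receive buffer maximum, in bytes
    mss           -- current receiver MSS estimate, in bytes
    """
    below_mss = free_space < mss                          # the existing rule
    below_sixteenth = free_space < allowed_space // 16    # the added rule
    return below_mss or below_sixteenth


# With a 6 MiB maximum, the threshold is 393216 bytes rather than one MSS.
print(should_announce_zero_window(100000, 6 * 1024 * 1024, 1448))   # True
print(should_announce_zero_window(400000, 6 * 1024 * 1024, 1448))   # False
```

The larger the buffer maximum, the earlier the added rule fires relative to the MSS-based one, which is the effect the last sentence above describes.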
      
      Reproducer:
      On server:
      $ nc -l -p 12345
      <suspend it: CTRL-Z>
      
      Client:
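      #!/usr/bin/env python
      import socket
      import time

      # Connect to the suspended nc listener and disable Nagle so that each
      # small write goes out as its own segment.
      sock = socket.socket()
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      sock.connect(("192.168.4.1", 12345))
      while True:
          # 23-byte payload every 5 ms; a bytes literal works on Python 2 and 3.
          sock.send(b'A' * 23)
          time.sleep(0.005)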
      
      The socket buffer on the server side will grow until tcp_rmem[2] is hit,
      at which point the client retransmits data until -ETIMEDOUT:
      
      tcp_data_queue invokes tcp_try_rmem_schedule which will call
      tcp_prune_queue which calls tcp_clamp_window().  And that function will
      grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
      
      Thanks to Eric Dumazet for running regression tests.
      
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86c1a045
    • E
      tipc: failed transmissions should return error · 63fa01c1
      Erik Hugne committed
      When a message could not be sent out because the destination node
      or link could not be found, the full message size is returned from
      sendmsg() as if it had been sent successfully. An application will
      then get a false indication that it's making forward progress. This
      problem has existed since the initial commit in 2.6.16.
      
      We change this to return -ENETUNREACH if the message cannot be
      delivered due to the destination node/link being unavailable. We
      also get rid of the redundant tipc_reject_msg call since freeing
      the buffer and doing a tipc_port_iovec_reject accomplishes exactly
      the same thing.
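From the application's point of view, the change means a failed send can now be detected. A minimal sketch of a caller that relies on this, using a hypothetical helper name and any socket-like object with a `send` method:

```python
import errno


def send_or_report(sock, payload):
    """Send one message; return False if the destination is unreachable.

    Hypothetical helper: per the description above, a send to a missing
    node or link used to report the full message size, so callers could
    not tell the two cases apart.
    """
    try:
        sent = sock.send(payload)
    except OSError as exc:
        if exc.errno == errno.ENETUNREACH:
            return False
        raise
    return sent == len(payload)
```

Other errors are re-raised unchanged so that the caller still sees them.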

      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63fa01c1
    • H
      ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg · c8e6ad08
      Hannes Frederic Sowa committed
      If udp6_sendmsg decides to send the packet down the IPv4 udp_sendmsg
      path, because the destination is either of family AF_INET or an
      IPv4-mapped IPv6 address, we do not honor an IPv4-mapped IPv6 address
      that may have been specified in IPV6_PKTINFO.

      We can simply check for this option in ip_cmsg_send because no calls to
      ipv6 module functions are needed to do so.
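For readers unfamiliar with the ancillary data involved, the sketch below builds the payload of an IPV6_PKTINFO control message carrying an IPv4-mapped address. It assumes the Linux layout of struct in6_pktinfo (a 16-byte address followed by an int interface index); the helper names are hypothetical.

```python
import ipaddress
import struct


def v4_mapped(ipv4_text):
    """Return the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4_text)
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


def pack_in6_pktinfo(source_ipv4, ifindex=0):
    """Build the bytes of struct in6_pktinfo: 16-byte address + int ifindex."""
    return v4_mapped(source_ipv4).packed + struct.pack("@i", ifindex)


pktinfo = pack_in6_pktinfo("192.0.2.1")
print(len(pktinfo))   # 16 address bytes plus a 4-byte int
```

On Linux, such a buffer would typically be passed to `socket.sendmsg()` as an ancillary item of level `IPPROTO_IPV6` and type `IPV6_PKTINFO`, on an AF_INET6 UDP socket sending to an IPv4-mapped destination.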

      Reported-by: Gert Doering <gert@space.net>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8e6ad08
  7. 19 Feb, 2014 6 commits