1. 14 12月, 2017 1 次提交
  2. 11 11月, 2017 2 次提交
    • Y
      tcp: retire FACK loss detection · 713bafea
      Yuchung Cheng 提交于
      FACK loss detection has been disabled by default and the
      successor RACK subsumed FACK and can handle reordering better.
      This patch removes FACK to simplify TCP loss recovery.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: NPriyaranjan Jha <priyarjha@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      713bafea
    • M
      net: ipv6: sysctl to specify IPv6 ND traffic class · 2210d6b2
      Maciej Żenczykowski 提交于
      Add a per-device sysctl to specify the default traffic class to use for
      kernel originated IPv6 Neighbour Discovery packets.
      
      Currently this includes:
      
        - Router Solicitation (ICMPv6 type 133)
          ndisc_send_rs() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Solicitation (ICMPv6 type 135)
          ndisc_send_ns() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Advertisement (ICMPv6 type 136)
          ndisc_send_na() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Redirect (ICMPv6 type 137)
          ndisc_send_redirect() -> ndisc_send_skb() -> ip6_nd_hdr()
      
      and if the kernel ever gets around to generating RA's,
      it would presumably also include:
      
        - Router Advertisement (ICMPv6 type 134)
          (radvd daemon could pick up on the kernel setting and use it)
      
      Interface drivers may examine the Traffic Class value and translate
      the DiffServ Code Point into a link-layer appropriate traffic
      prioritization scheme.  An example of mapping IETF DSCP values to
      IEEE 802.11 User Priority values can be found here:
      
          https://tools.ietf.org/html/draft-ietf-tsvwg-ieee-802-11
      
      The expected primary use case is to properly prioritize ND over wifi.
      
      Testing:
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        0
        jzem22:~# echo -1 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 256 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 0 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# echo 255 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        255
        jzem22:~# echo 34 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        34
      
        jzem22:~# echo $[0xDC] > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# tcpdump -v -i eth0 icmp6 and src host jzem22.pgc and dst host fe80::1
        tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
        IP6 (class 0xdc, hlim 255, next-header ICMPv6 (58) payload length: 24)
        jzem22.pgc > fe80::1: [icmp6 sum ok] ICMP6, neighbor advertisement,
        length 24, tgt is jzem22.pgc, Flags [solicited]
      
      (based on original change written by Erik Kline, with minor changes)
      
      v2: fix 'suspicious rcu_dereference_check() usage'
          by explicitly grabbing the rcu_read_lock.
      
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NErik Kline <ek@google.com>
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2210d6b2
  3. 05 11月, 2017 1 次提交
    • P
      tcp: higher throughput under reordering with adaptive RACK reordering wnd · 1f255691
      Priyaranjan Jha 提交于
      Currently TCP RACK loss detection does not work well if packets are
      being reordered beyond its static reordering window (min_rtt/4).Under
      such reordering it may falsely trigger loss recoveries and reduce TCP
      throughput significantly.
      
      This patch improves that by increasing and reducing the reordering
      window based on DSACK, which is now supported in major TCP implementations.
      It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.
      
      - If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
        by srtt), since there is possibility that spurious retransmission was
        due to reordering delay longer than reo_wnd.
      
      - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
        no. of successful recoveries (accounts for full DSACK-based loss
        recovery undo). After that, reset it to default (min_rtt/4).
      
      - At max, reo_wnd is incremented only once per rtt. So that the new
        DSACK on which we are reacting, is due to the spurious retx (approx)
        after the reo_wnd has been updated last time.
      
      - reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
        absolute value to account for change in rtt.
      
      In our internal testing, we observed significant increase in throughput,
      in scenarios where reordering exceeds min_rtt/4 (previous static value).
      Signed-off-by: NPriyaranjan Jha <priyarjha@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f255691
  4. 03 11月, 2017 1 次提交
    • T
      ipv6: Implement limits on Hop-by-Hop and Destination options · 47d3d7ac
      Tom Herbert 提交于
      RFC 8200 (IPv6) defines Hop-by-Hop options and Destination options
      extension headers. Both of these carry a list of TLVs which is
      only limited by the maximum length of the extension header (2048
      bytes). By the spec a host must process all the TLVs in these
      options, however these could be used as a fairly obvious
      denial of service attack. I think this could in fact be
      a significant DOS vector on the Internet, one mitigating
      factor might be that many FWs drop all packets with EH (and
      obviously this is only IPv6) so an Internet wide attack might not
      be so effective (yet!).
      
      By my calculation, the worse case packet with TLVs in a standard
      1500 byte MTU packet that would be processed by the stack contains
      1282 invidual TLVs (including pad TLVS) or 724 two byte TLVs. I
      wrote a quick test program that floods a whole bunch of these
      packets to a host and sure enough there is substantial time spent
      in ip6_parse_tlv. These packets contain nothing but unknown TLVS
      (that are ignored), TLV padding, and bogus UDP header with zero
      payload length.
      
        25.38%  [kernel]                    [k] __fib6_clean_all
        21.63%  [kernel]                    [k] ip6_parse_tlv
         4.21%  [kernel]                    [k] __local_bh_enable_ip
         2.18%  [kernel]                    [k] ip6_pol_route.isra.39
         1.98%  [kernel]                    [k] fib6_walk_continue
         1.88%  [kernel]                    [k] _raw_write_lock_bh
         1.65%  [kernel]                    [k] dst_release
      
      This patch adds configurable limits to Destination and Hop-by-Hop
      options. There are three limits that may be set:
        - Limit the number of options in a Hop-by-Hop or Destination options
          extension header.
        - Limit the byte length of a Hop-by-Hop or Destination options
          extension header.
        - Disallow unrecognized options in a Hop-by-Hop or Destination
          options extension header.
      
      The limits are set in corresponding sysctls:
      
        ipv6.sysctl.max_dst_opts_cnt
        ipv6.sysctl.max_hbh_opts_cnt
        ipv6.sysctl.max_dst_opts_len
        ipv6.sysctl.max_hbh_opts_len
      
      If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed.
      The number of known TLVs that are allowed is the absolute value of
      this number.
      
      If a limit is exceeded when processing an extension header the packet is
      dropped.
      
      Default values are set to 8 for options counts, and set to INT_MAX
      for maximum length. Note the choice to limit options to 8 is an
      arbitrary guess (roughly based on the fact that the stack supports
      three HBH options and just one destination option).
      
      These limits have being proposed in draft-ietf-6man-rfc6434-bis.
      
      Tested (by Martin Lau)
      
      I tested out 1 thread (i.e. one raw_udp process).
      
      I changed the net.ipv6.max_dst_(opts|hbh)_number between 8 to 2048.
      With sysctls setting to 2048, the softirq% is packed to 100%.
      With 8, the softirq% is almost unnoticable from mpstat.
      
      v2;
        - Code and documention cleanup.
        - Change references of RFC2460 to be RFC8200.
        - Add reference to RFC6434-bis where the limits will be in standard.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47d3d7ac
  5. 20 9月, 2017 1 次提交
    • M
      ipv6: fix net.ipv6.conf.all interface DAD handlers · 35e015e1
      Matteo Croce 提交于
      Currently, writing into
      net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
      Fix handling of these flags by:
      
      - using the maximum of global and per-interface values for the
        accept_dad flag. That is, if at least one of the two values is
        non-zero, enable DAD on the interface. If at least one value is
        set to 2, enable DAD and disable IPv6 operation on the interface if
        MAC-based link-local address was found
      
      - using the logical OR of global and per-interface values for the
        optimistic_dad flag. If at least one of them is set to one, optimistic
        duplicate address detection (RFC 4429) is enabled on the interface
      
      - using the logical OR of global and per-interface values for the
        use_optimistic flag. If at least one of them is set to one,
        optimistic addresses won't be marked as deprecated during source address
        selection on the interface.
      
      While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
      drop inline, and let the compiler decide.
      
      Fixes: 7fd2561e ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates")
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35e015e1
  6. 30 8月, 2017 1 次提交
  7. 25 8月, 2017 1 次提交
    • J
      ipv6: Add sysctl for per namespace flow label reflection · 22b6722b
      Jakub Sitnicki 提交于
      Reflecting IPv6 Flow Label at server nodes is useful in environments
      that employ multipath routing to load balance the requests. As "IPv6
      Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error
      messages generated in response to a downstream packets from the server
      can be routed by a load balancer back to the original server without
      looking at transport headers, if the server applies the flow label
      reflection. This enables the Path MTU Discovery past the ECMP router in
      load-balance or anycast environments where each server node is reachable
      by only one path.
      
      Introduce a sysctl to enable flow label reflection per net namespace for
      all newly created sockets. Same could be earlier achieved only per
      socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
      socket option.
      
      [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01Signed-off-by: NJakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22b6722b
  8. 01 8月, 2017 1 次提交
  9. 19 7月, 2017 1 次提交
  10. 25 4月, 2017 1 次提交
    • W
      net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Wei Wang 提交于
      Middlebox firewall issues can potentially cause server's data being
      blackholed after a successful 3WHS using TFO. Following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation that the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure b/c the client can not detect such behavior to
      mitigate the situation (such as disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply timeout and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1hr at first. Then if it
      happens again, we disable it for 2h, then 4h, 8h, ...
      And we reset the timeout to 1hr if a client side TFO sockets not opened
      on loopback has successfully received data segs from server.
      And we examine this condition during close().
      
      The rational behind it is that when such firewall issue happens,
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or application running
      on the server should close the socket as it is not able to receive any
      response from client.
      In both cases, out of order FIN or RST will get received on the client
      given that the firewall will not block them as no data are in those
      frames.
      And we want to disable active TFO globally as it helps if the middle box
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          When setting it to 0, it means to disable the active disable logic.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf1ef3f0
  11. 25 3月, 2017 1 次提交
    • S
      net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org 提交于
      Certain system process significant unconnected UDP workload.
      It would be preferrable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system-
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecure and cache
      sizes. There will not much difference seen if entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dddb64bc
  12. 23 3月, 2017 1 次提交
  13. 22 3月, 2017 1 次提交
    • N
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov 提交于
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case, the flow dissector's field consistentification should handle the
      address order thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf4e0a3d
  14. 17 3月, 2017 1 次提交
    • S
      tcp: remove tcp_tw_recycle · 4396e461
      Soheil Hassas Yeganeh 提交于
      The tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), the tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine is not monotonically increasing,
      anymore.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4396e461
  15. 13 3月, 2017 1 次提交
  16. 31 1月, 2017 1 次提交
    • R
      net: Avoid receiving packets with an l3mdev on unbound UDP sockets · 63a6fff3
      Robert Shearman 提交于
      Packets arriving in a VRF currently are delivered to UDP sockets that
      aren't bound to any interface. TCP defaults to not delivering packets
      arriving in a VRF to unbound sockets. IP route lookup and socket
      transmit both assume that unbound means using the default table and
      UDP applications that haven't been changed to be aware of VRFs may not
      function correctly in this case since they may not be able to handle
      overlapping IP address ranges, or be able to send packets back to the
      original sender if required.
      
      So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
      being analgous to the existing tcp_l3mdev_accept, namely to allow a
      process to have a VRF-global listen socket. Have this default to off
      as this is the behaviour that users will expect, given that there is
      no explicit mechanism to set unmodified VRF-unaware application into a
      default VRF.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63a6fff3
  17. 25 1月, 2017 1 次提交
    • K
      Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen 提交于
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
      
      The use case for this change is to allow containerized processes to bind
      to priviliged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's priviliged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      Signed-off-by: NKrister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4548b683
  18. 14 1月, 2017 2 次提交
  19. 04 12月, 2016 1 次提交
  20. 03 12月, 2016 1 次提交
  21. 10 11月, 2016 1 次提交
  22. 24 8月, 2016 1 次提交
  23. 30 5月, 2016 1 次提交
  24. 26 4月, 2016 2 次提交
  25. 12 4月, 2016 1 次提交
    • D
      net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      David Ahern 提交于
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
      
      Example:
      
                           [h2]                   [h3]   15.0.0.5
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
               12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
                                      [TOR3]
                                        3|
                                         |
                                        [h1]  12.0.0.1
      
      host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
      
          root@h1:~# ip ro ls
          ...
          12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
          15.0.0.0/16
                  nexthop via 12.0.0.2  dev swp1 weight 1
                  nexthop via 12.0.0.3  dev swp1 weight 1
          ...
      
      If the link between tor3 and tor1 is down and the link between tor1
      and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
      ssh 15.0.0.5 gets the other. Connections that attempt to use the
      12.0.0.2 nexthop fail since that neighbor is not reachable:
      
          root@h1:~# ip neigh show
          ...
          12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
          12.0.0.2 dev swp1  FAILED
          ...
      
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6db4494
  26. 22 3月, 2016 2 次提交
  27. 26 2月, 2016 1 次提交
    • D
      net: ipv6: Make address flushing on ifdown optional · f1705ec1
      David Ahern 提交于
      Currently, all ipv6 addresses are flushed when the interface is configured
      down, including global, static addresses:
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          << nothing; all addresses have been flushed>>
      
      Add a new sysctl to make this behavior optional. The new setting defaults to
      flush all addresses to maintain backwards compatibility. When the set global
      addresses with no expire times are not flushed on an admin down. The sysctl
      is per-interface or system-wide for all interfaces
      
          $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
      or
          $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
      
      Will keep addresses on eth1 on an admin down.
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
              inet6 2100:1::2/120 scope global tentative
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
                 valid_lft forever preferred_lft forever
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1705ec1
  28. 11 2月, 2016 4 次提交
  29. 21 1月, 2016 1 次提交
  30. 19 12月, 2015 1 次提交
    • D
      net: Allow accepted sockets to be bound to l3mdev domain · 6dd9a14e
      David Ahern 提交于
      Allow accepted sockets to derive their sk_bound_dev_if setting from the
      l3mdev domain in which the packets originated. A sysctl setting is added
      to control the behavior which is similar to sk_mark and
      sysctl_tcp_fwmark_accept.
      
      This effectively allow a process to have a "VRF-global" listen socket,
      with child sockets bound to the VRF device in which the packet originated.
      A similar behavior can be achieved using sk_mark, but a solution using marks
      is incomplete as it does not handle duplicate addresses in different L3
      domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
      domain provides a complete solution.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dd9a14e
  31. 16 12月, 2015 1 次提交
  32. 10 11月, 2015 1 次提交
  33. 21 10月, 2015 1 次提交
    • Y
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng 提交于
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f41b1c5