1. 02 Oct, 2017 4 commits
  2. 01 Aug, 2017 1 commit
  3. 16 Jun, 2017 1 commit
    • tcp: ULP infrastructure · 734942cc
      Committed by Dave Watson
      Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
      sockets. Based on a similar infrastructure in tcp_cong.  The idea is that any
      ULP can add its own logic by changing the TCP proto_ops structure to its own
      methods.
      
      Example usage:
      
      setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
      
      modules will call:
      tcp_register_ulp(&tcp_tls_ulp_ops);
      
      to register/unregister their ulp, with an init function and name.
      
      A list of registered ulps will be returned by tcp_get_available_ulp, which is
      hooked up to /proc.  Example:
      
      $ cat /proc/sys/net/ipv4/tcp_available_ulp
      tls
      
      There is currently no functionality to remove or chain ULPs, but
      it should be possible to add these in the future if needed.
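
The registration and attach flow described above can be sketched as a toy user-space model. This is not kernel code: the `UlpRegistry` class, the dictionary standing in for a socket, and the `tls_init` hook are hypothetical names introduced only for illustration, and the space-separated listing format is an assumption.

```python
# Toy user-space model of a name-keyed ULP registry (illustration only,
# not the kernel implementation).
class UlpRegistry:
    def __init__(self):
        self._ops = {}  # ULP name -> init callback

    def register(self, name, init):
        """Register a ULP by name; duplicate names are rejected."""
        if name in self._ops:
            raise ValueError(f"ULP {name!r} already registered")
        self._ops[name] = init

    def unregister(self, name):
        """Remove a ULP from the registry if it is present."""
        self._ops.pop(name, None)

    def available(self):
        """List registered names (separator is an assumption)."""
        return " ".join(self._ops)

    def attach(self, sock, name):
        """Model of setsockopt(TCP_ULP): one ULP per socket, no chaining."""
        if sock.get("ulp") is not None:
            raise RuntimeError("socket already has a ULP attached")
        if name not in self._ops:
            raise LookupError(f"no such ULP: {name!r}")
        self._ops[name](sock)  # the ULP's init swaps in its own methods
        sock["ulp"] = name


def tls_init(sock):
    # Hypothetical init hook: replace the send path with the ULP's own.
    sock["sendmsg"] = "tls_sendmsg"


registry = UlpRegistry()
registry.register("tls", tls_init)

sock = {"ulp": None, "sendmsg": "tcp_sendmsg"}
registry.attach(sock, "tls")
print(registry.available(), sock["sendmsg"])
```

Attaching a second ULP to the same socket raises an error in this model, which mirrors the statement that removing or chaining ULPs is not currently supported.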
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 08 Jun, 2017 3 commits
  5. 25 Apr, 2017 2 commits
    • net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Committed by Wei Wang
      Middlebox firewall issues can potentially cause the server's data to be
      blackholed after a successful 3WHS using TFO. The following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation in which the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure because the client cannot detect such behavior to
      mitigate the situation (such as by disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply time out and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1hr at first. Then if it
      happens again, we disable it for 2h, then 4h, 8h, ...
      We reset the timeout to 1hr if a client side TFO socket not opened
      on loopback has successfully received data segments from the server.
      We examine this condition during close().
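
The backoff schedule described above (1h, 2h, 4h, 8h, ... with a reset on a successful non-loopback transfer) can be sketched as follows. This is a minimal model under stated assumptions, not the kernel logic: the class name, method names, and the use of plain integer seconds are all hypothetical.

```python
# Minimal model of the exponential "disable active TFO" backoff
# (illustration only; names and structure are assumptions).
class TfoBlackholeState:
    def __init__(self, initial_timeout_sec=3600):
        # 0 means the active-disable logic is switched off entirely.
        self.initial_timeout_sec = initial_timeout_sec
        self.disable_times = 0   # consecutive blackhole detections
        self.disabled_at = None  # time of the most recent detection

    def on_blackhole_suspected(self, now):
        """Called when a TFO client socket sees an out-of-order FIN or RST."""
        if self.initial_timeout_sec == 0:
            return
        self.disable_times += 1
        self.disabled_at = now

    def current_timeout(self):
        """initial, 2x initial, 4x initial, ... per consecutive detection."""
        if self.disable_times == 0:
            return 0
        return self.initial_timeout_sec * 2 ** (self.disable_times - 1)

    def active_tfo_allowed(self, now):
        if self.disable_times == 0:
            return True
        return now >= self.disabled_at + self.current_timeout()

    def on_data_received(self, via_loopback):
        """Checked at close(): a good non-loopback transfer resets the backoff."""
        if not via_loopback:
            self.disable_times = 0
            self.disabled_at = None


state = TfoBlackholeState()
state.on_blackhole_suspected(now=0)
print(state.current_timeout())      # first detection
state.on_blackhole_suspected(now=3600)
print(state.current_timeout())      # second detection doubles the timeout
```

Loopback connections are excluded from the reset in this sketch because they never cross the middlebox, so their success says nothing about the network path.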
      
      The rationale behind it is that when such a firewall issue happens, the
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or the application running
      on the server should close the socket as it is not able to receive any
      response from the client.
      In both cases, an out of order FIN or RST will be received on the client,
      given that the firewall will not block them as no data are in those
      frames.
      We want to disable active TFO globally as it helps if the middlebox
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          Setting it to 0 disables the active-disable logic.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add rcu locking when changing early demux · 58c4c6a3
      Committed by David Ahern
      systemd-sysctl is triggering a suspicious RCU usage message when
      net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
      a sysctl config file:
      
      [   33.896184] ===============================
      [   33.899558] [ ERR: suspicious RCU usage.  ]
      [   33.900624] 4.11.0-rc7+ #104 Not tainted
      [   33.901698] -------------------------------
      [   33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
      [   33.905724]
      other info that might help us debug this:
      
      [   33.907656]
      rcu_scheduler_active = 2, debug_locks = 0
      [   33.909288] 1 lock held by systemd-sysctl/143:
      [   33.910373]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48
      [   33.912407]
      stack backtrace:
      [   33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
      [   33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [   33.917870] Call Trace:
      [   33.918431]  dump_stack+0x81/0xb6
      [   33.919241]  lockdep_rcu_suspicious+0x10f/0x118
      [   33.920263]  proc_configure_early_demux+0x65/0x10a
      [   33.921391]  proc_udp_early_demux+0x3a/0x41
      
      add rcu locking to proc_configure_early_demux.
      
      Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 25 Mar, 2017 1 commit
    • net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      Committed by subashab@codeaurora.org
      Certain systems process a significant unconnected UDP workload.
      It would be preferable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system:
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecture and cache
      sizes. Little difference will be seen if the entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 22 Mar, 2017 1 commit
    • net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Committed by Nikolay Aleksandrov
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case. The flow dissector's field consistentification should handle the
      address order, thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fall back to fl4; that is, if skb is NULL, fl4 has to be set.
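
The difference between the two policy values can be illustrated with a short sketch. It is an assumption-laden model: the `Flow` tuple, the CRC32-based hash, and the modulo-based nexthop choice are stand-ins chosen for readability and are not the kernel's flow dissector hash or its weighted nexthop selection.

```python
# Illustrative model of L3 vs. L4 multipath hashing (not the kernel's hash).
import zlib
from collections import namedtuple

Flow = namedtuple("Flow", "src dst proto sport dport")


def multipath_hash(flow, policy):
    """policy 0: hash on addresses only; policy 1: hash on the 5-tuple."""
    if policy == 0:
        key = (flow.src, flow.dst)
    elif policy == 1:
        key = (flow.src, flow.dst, flow.proto, flow.sport, flow.dport)
    else:
        raise ValueError(f"unknown hash policy: {policy}")
    return zlib.crc32(repr(key).encode())


def pick_nexthop(flow, policy, nexthops):
    # Simple modulo selection; equal weights are assumed.
    return nexthops[multipath_hash(flow, policy) % len(nexthops)]


nexthops = ["12.0.0.2", "12.0.0.3"]
a = Flow("10.0.0.1", "10.0.0.2", 6, 40000, 80)
b = Flow("10.0.0.1", "10.0.0.2", 6, 40001, 80)

# Under policy 0 both connections share a path; under policy 1 they may not.
print(pick_nexthop(a, 0, nexthops) == pick_nexthop(b, 0, nexthops))
```

With policy 0, every connection between the same pair of hosts follows the same path; policy 1 lets separate connections between those hosts spread across the available nexthops.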
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 17 Mar, 2017 1 commit
    • tcp: remove tcp_tw_recycle · 4396e461
      Committed by Soheil Hassas Yeganeh
      The tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine are no longer monotonically
      increasing.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 31 Jan, 2017 1 commit
    • net: Avoid receiving packets with an l3mdev on unbound UDP sockets · 63a6fff3
      Committed by Robert Shearman
      Packets arriving in a VRF currently are delivered to UDP sockets that
      aren't bound to any interface. TCP defaults to not delivering packets
      arriving in a VRF to unbound sockets. IP route lookup and socket
      transmit both assume that unbound means using the default table and
      UDP applications that haven't been changed to be aware of VRFs may not
      function correctly in this case since they may not be able to handle
      overlapping IP address ranges, or be able to send packets back to the
      original sender if required.
      
      So add a sysctl, udp_l3mdev_accept, to control this behaviour, with it
      being analogous to the existing tcp_l3mdev_accept, namely to allow a
      process to have a VRF-global listen socket. Have this default to off
      as this is the behaviour that users will expect, given that there is
      no explicit mechanism to place an unmodified, VRF-unaware application
      into a default VRF.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 25 Jan, 2017 1 commit
    • Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Committed by Krister Johansen
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
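
The two rules stated above (ports below the start value are privileged, and the privileged range may not overlap the local port range) can be sketched as follows. The function names, the default values, and the accepted 0..65535 range are assumptions made for illustration rather than the kernel's actual checks.

```python
# Sketch of the privileged-port rules (illustration only; names and
# defaults are assumptions).
def is_privileged(port, unprivileged_port_start=1024):
    """A port is privileged if it lies below the first unprivileged port."""
    return port < unprivileged_port_start


def validate_unprivileged_port_start(new_start, local_port_range):
    """Reject values that would make privileged and local ranges overlap."""
    low, _high = local_port_range
    if not 0 <= new_start <= 65535:
        raise ValueError("port start out of range")
    if low < new_start:
        raise ValueError("privileged range would overlap local port range")
    return new_start


local_range = (32768, 60999)  # assumed local port range for the example
print(is_privileged(80))                                  # default start
print(is_privileged(80, unprivileged_port_start=0))       # no privileged ports
print(validate_unprivileged_port_start(0, local_range))
```

Setting the start to zero leaves no port below it, which is how the sketch models "disable all privileged ports".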
      
      The use case for this change is to allow containerized processes to bind
      to privileged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's privileged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 14 Jan, 2017 1 commit
    • tcp: remove thin_dupack feature · 4a7f6009
      Committed by Yuchung Cheng
      Thin stream DUPACK starts fast recovery on only one DUPACK
      provided the connection is a thin stream (i.e., low inflight).  But
      this older feature is now subsumed by RACK. If a connection
      receives only a single DUPACK, RACK would arm a reordering timer
      and soon start fast recovery instead of timing out if no further
      ACKs are received.
      
      The socket option (THIN_DUPACK) is kept as a nop for compatibility.
      Note that this patch does not change another thin-stream feature
      which enables linear RTO. Although it might be good to generalize
      that in the future (i.e., linear RTO for the first say 3 retries).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 10 Jan, 2017 1 commit
  13. 30 Dec, 2016 2 commits
  14. 28 Dec, 2016 1 commit
  15. 23 Oct, 2016 1 commit
    • ipv4: use the right lock for ping_group_range · 396a30cc
      Committed by WANG Cong
      This reverts commit a681574c
      ("ipv4: disable BH in set_ping_group_range()") because we never
      read ping_group_range in BH context (unlike local_port_range).
      
      Then, since we already have a lock for ping_group_range, those
      using ip_local_ports.lock for ping_group_range are clearly typos.
      
      We might consider sharing a single lock for both ping_group_range
      and local_port_range to save space, but that should be for
      net-next.
      
      Fixes: a681574c ("ipv4: disable BH in set_ping_group_range()")
      Fixes: ba6b918a ("ping: move ping_group_range out of CONFIG_SYSCTL")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 21 Oct, 2016 1 commit
  17. 24 May, 2016 1 commit
    • ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n · 049bbf58
      Committed by Ezequiel Garcia
      Commit fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      moves the default TTL assignment, and as a side effect IPv4 TTL now
      has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).
      
      The sysctl_ip_default_ttl is fundamental for IP to work properly,
      as it provides the TTL to be used as default. The default TTL may be
      used in ip_select_ttl, through the following flow:
      
        ip_select_ttl
          ip4_dst_hoplimit
            net->ipv4.sysctl_ip_default_ttl
      
      This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
      in net_init_net, called during ipv4's initialization.
      
      Without this commit, a kernel built without sysctl support will send
      all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
      with setsockopt).
      
      Given that a similar issue might appear with the other knobs that were
      namespaceified, this commit also moves them.
      
      Fixes: fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 12 Apr, 2016 1 commit
    • net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      Committed by David Ahern
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
      
      Example:
      
                           [h2]                   [h3]   15.0.0.5
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
               12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
                                      [TOR3]
                                        3|
                                         |
                                        [h1]  12.0.0.1
      
      host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
      
          root@h1:~# ip ro ls
          ...
          12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
          15.0.0.0/16
                  nexthop via 12.0.0.2  dev swp1 weight 1
                  nexthop via 12.0.0.3  dev swp1 weight 1
          ...
      
      If the links between tor3 and tor1 and between tor1 and tor2 are both
      down, then tor1 is effectively cut off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
      ssh 15.0.0.5 gets the other. Connections that attempt to use the
      12.0.0.2 nexthop fail since that neighbor is not reachable:
      
          root@h1:~# ip neigh show
          ...
          12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
          12.0.0.2 dev swp1  FAILED
          ...
      
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
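
The selection rule described above can be sketched as a small model. It is illustrative only: the function names, the dictionary used as a neighbor table, the set of states treated as "sane", and the fallback when every nexthop looks dead are all assumptions, not the kernel's implementation.

```python
# Sketch of neighbor-aware nexthop selection (illustration only).
# Assumption: these neighbor states count as "sane"; FAILED and
# INCOMPLETE do not.
USABLE_STATES = {"REACHABLE", "STALE", "DELAY", "PROBE", "PERMANENT", "NOARP"}


def nexthop_usable(nexthop, neigh_table, use_neigh):
    if not use_neigh:
        return True   # legacy behaviour: ignore neighbor information
    state = neigh_table.get(nexthop)
    if state is None:
        return True   # no knowledge about this nexthop, give it a shot
    return state in USABLE_STATES


def select_nexthop(nexthops, flow_hash, neigh_table, use_neigh=True):
    """Start at the hashed nexthop and skip ones known to be dead."""
    n = len(nexthops)
    for i in range(n):
        candidate = nexthops[(flow_hash + i) % n]
        if nexthop_usable(candidate, neigh_table, use_neigh):
            return candidate
    # Assumed fallback: if every nexthop looks dead, keep the hashed choice.
    return nexthops[flow_hash % n]


nexthops = ["12.0.0.2", "12.0.0.3"]
neigh = {"12.0.0.3": "REACHABLE", "12.0.0.2": "FAILED"}

print(select_nexthop(nexthops, 0, neigh))                   # skips 12.0.0.2
print(select_nexthop(nexthops, 0, neigh, use_neigh=False))  # legacy choice
```

With the neighbor table from the example, a flow that hashes to 12.0.0.2 is redirected to 12.0.0.3 when the toggle is on, and keeps the failing nexthop when it is off.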
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 17 Feb, 2016 3 commits
  20. 11 Feb, 2016 4 commits
  21. 08 Feb, 2016 8 commits