1. 06 5月, 2017 1 次提交
    • E
      tcp: randomize timestamps on syncookies · 84b114b9
      Eric Dumazet 提交于
      Whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing server jiffies value.
      
      Also, TSval sent by the server to a particular remote address vary
      depending on syncookies being sent or not, potentially triggering PAWS
      drops for innocent clients.
      
      Lets implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NFlorian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84b114b9
  2. 10 1月, 2017 1 次提交
  3. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  4. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  5. 28 4月, 2016 1 次提交
  6. 16 3月, 2016 1 次提交
    • P
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra 提交于
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care its too long?
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25528213
  7. 08 2月, 2016 1 次提交
  8. 19 12月, 2015 1 次提交
    • D
      net: Allow accepted sockets to be bound to l3mdev domain · 6dd9a14e
      David Ahern 提交于
      Allow accepted sockets to derive their sk_bound_dev_if setting from the
      l3mdev domain in which the packets originated. A sysctl setting is added
      to control the behavior which is similar to sk_mark and
      sysctl_tcp_fwmark_accept.
      
      This effectively allow a process to have a "VRF-global" listen socket,
      with child sockets bound to the VRF device in which the packet originated.
      A similar behavior can be achieved using sk_mark, but a solution using marks
      is incomplete as it does not handle duplicate addresses in different L3
      domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
      domain provides a complete solution.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dd9a14e
  9. 03 12月, 2015 1 次提交
  10. 13 10月, 2015 1 次提交
  11. 05 10月, 2015 1 次提交
  12. 30 9月, 2015 1 次提交
  13. 22 9月, 2015 1 次提交
    • Y
      tcp: usec resolution SYN/ACK RTT · 0f1c28ae
      Yuchung Cheng 提交于
      Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
      RTT is often measured as 0ms or sometimes 1ms, which would affect
      RTT estimation and min RTT samping used by some congestion control.
      
      This patch improves SYN/ACK RTT to be usec resolution if platform
      supports it. While the timestamping of SYN/ACK is done in request
      sock, the RTT measurement is carefully arranged to avoid storing
      another u64 timestamp in tcp_sock.
      
      For regular handshake w/o SYNACK retransmission, the RTT is sampled
      right after the child socket is created and right before the request
      sock is released (tcp_check_req() in tcp_minisocks.c)
      
      For Fast Open the child socket is already created when SYN/ACK was
      sent, the RTT is sampled in tcp_rcv_state_process() after processing
      the final ACK an right before the request socket is released.
      
      If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
      on TCP timestamps to measure the RTT. The sample is taken at the
      same place in tcp_rcv_state_process() after the timestamp values
      are validated in tcp_validate_incoming(). Note that we do not store
      TS echo value in request_sock for SYN-cookies, because the value
      is already stored in tp->rx_opt used by tcp_ack_update_rtt().
      
      One side benefit is that the RTT measurement now happens before
      initializing congestion control (of the passive side). Therefore
      the congestion control can use the SYN/ACK RTT.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f1c28ae
  14. 08 6月, 2015 1 次提交
  15. 25 3月, 2015 1 次提交
    • E
      tcp: fix ipv4 mapped request socks · 0144a81c
      Eric Dumazet 提交于
      ss should display ipv4 mapped request sockets like this :
      
      tcp    SYN-RECV   0      0  ::ffff:192.168.0.1:8080   ::ffff:192.0.2.1:35261
      
      and not like this :
      
      tcp    SYN-RECV   0      0  192.168.0.1:8080   192.0.2.1:35261
      
      We should init ireq->ireq_family based on listener sk_family,
      not the actual protocol carried by SYN packet.
      
      This means we can set ireq_family in inet_reqsk_alloc()
      
      Fixes: 3f66b083 ("inet: introduce ireq_family")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0144a81c
  16. 21 3月, 2015 1 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
  17. 18 3月, 2015 3 次提交
  18. 13 3月, 2015 2 次提交
  19. 05 11月, 2014 2 次提交
    • F
      net: allow setting ecn via routing table · f7b3bec6
      Florian Westphal 提交于
      This patch allows to set ECN on a per-route basis in case the sysctl
      tcp_ecn is not set to 1. In other words, when ECN is set for specific
      routes, it provides a tcp_ecn=1 behaviour for that route while the rest
      of the stack acts according to the global settings.
      
      One can use 'ip route change dev $dev $net features ecn' to toggle this.
      
      Having a more fine-grained per-route setting can be beneficial for various
      reasons, for example, 1) within data centers, or 2) local ISPs may deploy
      ECN support for their own video/streaming services [1], etc.
      
      There was a recent measurement study/paper [2] which scanned the Alexa's
      publicly available top million websites list from a vantage point in US,
      Europe and Asia:
      
      Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
      blamed to commit 255cac91 ("tcp: extend ECN sysctl to allow server-side
      only ECN") ;)); the break in connectivity on-path was found is about
      1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
      more common in the negotiation phase (and mostly seen in the Alexa
      middle band, ranks around 50k-150k): from 12-thousand hosts on which
      there _may_ be ECN-linked connection failures, only 79 failed with RST
      when _not_ failing with RST when ECN is not requested.
      
      It's unclear though, how much equipment in the wild actually marks CE
      when buffers start to fill up.
      
      We thought about a fallback to non-ECN for retransmitted SYNs as another
      global option (which could perhaps one day be made default), but as Eric
      points out, there's much more work needed to detect broken middleboxes.
      
      Two examples Eric mentioned are buggy firewalls that accept only a single
      SYN per flow, and middleboxes that successfully let an ECN flow establish,
      but later mark CE for all packets (so cwnd converges to 1).
      
       [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
       [2] http://ecn.ethz.ch/
      
      Joint work with Daniel Borkmann.
      
      Reference: http://thread.gmane.org/gmane.linux.network/335797Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7b3bec6
    • F
      syncookies: split cookie_check_timestamp() into two functions · f1673381
      Florian Westphal 提交于
      The function cookie_check_timestamp(), both called from IPv4/6 context,
      is being used to decode the echoed timestamp from the SYN/ACK into TCP
      options used for follow-up communication with the peer.
      
      We can remove ECN handling from that function, split it into a separate
      one, and simply rename the original function into cookie_decode_options().
      cookie_decode_options() just fills in tcp_option struct based on the
      echoed timestamp received from the peer. Anything that fails in this
      function will actually discard the request socket.
      
      While this is the natural place for decoding options such as ECN which
      commit 172d69e6 ("syncookies: add support for ECN") added, we argue
      that in particular for ECN handling, it can be checked at a later point
      in time as the request sock would actually not need to be dropped from
      this, but just ECN support turned off.
      
      Therefore, we split this functionality into cookie_ecn_ok(), which tells
      us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
      
      This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
      won't be enough anymore at that point; if the timestamp indicates ECN
      and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
      
      This would mean adding a route lookup to cookie_check_timestamp(), which
      we definitely want to avoid. As we already do a route lookup at a later
      point in cookie_{v4,v6}_check(), we can simply make use of that as well
      for the new cookie_ecn_ok() function w/o any additional cost.
      
      Joint work with Daniel Borkmann.
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1673381
  20. 31 10月, 2014 1 次提交
  21. 24 10月, 2014 1 次提交
  22. 18 10月, 2014 1 次提交
  23. 29 9月, 2014 1 次提交
  24. 28 8月, 2014 1 次提交
  25. 27 8月, 2014 1 次提交
  26. 28 6月, 2014 1 次提交
  27. 14 5月, 2014 1 次提交
    • L
      net: support marking accepting TCP sockets · 84f39b08
      Lorenzo Colitti 提交于
      When using mark-based routing, sockets returned from accept()
      may need to be marked differently depending on the incoming
      connection request.
      
      This is the case, for example, if different socket marks identify
      different networks: a listening socket may want to accept
      connections from all networks, but each connection should be
      marked with the network that the request came in on, so that
      subsequent packets are sent on the correct network.
      
      This patch adds a sysctl to mark TCP sockets based on the fwmark
      of the incoming SYN packet. If enabled, and an unmarked socket
      receives a SYN, then the SYN packet's fwmark is written to the
      connection's inet_request_sock, and later written back to the
      accepted socket when the connection is established.  If the
      socket already has a nonzero mark, then the behaviour is the same
      as it is today, i.e., the listening socket's fwmark is used.
      
      Black-box tested using user-mode linux:
      
      - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
        mark of the incoming SYN packet.
      - The socket returned by accept() is marked with the mark of the
        incoming SYN packet.
      - Tested with syncookies=1 and syncookies=2.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84f39b08
  28. 06 12月, 2013 1 次提交
  29. 20 10月, 2013 1 次提交
  30. 11 10月, 2013 1 次提交
  31. 10 10月, 2013 1 次提交
    • E
      inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet 提交于
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock
      
      This means there is no more inet6_request_sock IPv6 specific
      structure.
      
      Following inet_request_sock fields were renamed as they became
      macros to reference fields from struct sock_common.
      Prefix ir_ was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634fb979
  32. 24 9月, 2013 2 次提交
  33. 28 8月, 2013 1 次提交
  34. 18 3月, 2013 1 次提交
    • C
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch 提交于
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a2c6181
  35. 07 1月, 2013 1 次提交