1. 30 7月, 2015 1 次提交
  2. 10 7月, 2015 1 次提交
  3. 07 6月, 2015 1 次提交
  4. 04 6月, 2015 1 次提交
    • E
      tcp: remove redundant checks · 12e25e10
      Eric Dumazet 提交于
      tcp_v4_rcv() checks the following before calling tcp_v4_do_rcv():
      
      if (th->doff < sizeof(struct tcphdr) / 4)
          goto bad_packet;
      if (!pskb_may_pull(skb, th->doff * 4))
          goto discard_it;
      
      So following check in tcp_v4_do_rcv() is redundant
      and "goto csum_err;" is wrong anyway.
      
      if (skb->len < tcp_hdrlen(skb) || ...)
      	goto csum_err;
      
      A second check can be removed after no_tcp_socket label for same reason.
      
      Same tests can be removed in tcp_v6_do_rcv()
      
      Note : short tcp frames are not properly accounted in tcpInErrs MIB,
      because pskb_may_pull() failure simply drops incoming skb, we might
      fix this in a separate patch.
      Signed-off-by: NEric Dumazet  <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12e25e10
  5. 22 5月, 2015 1 次提交
    • M
      tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner 提交于
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use this number to have an idea on connection
      quality when compared against the retransmissions.
      
      RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
      
      These are a 32bit field each and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Join work with Eric Dumazet.
      
      Note that received SYN are accounted on the listener, but sent SYNACK
      are not accounted.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2efd055c
  6. 20 5月, 2015 1 次提交
    • D
      tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann 提交于
      This work as a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested from the RFC on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, it's rather a smaller percentage of sites where there would
      occur such additional setup latency: it was found in end of 2014 that ~56%
      of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
      ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
      when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. Recent related
      paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
      section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
      which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
      allows for disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: NBrian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave That <dave.taht@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49213555
  7. 06 5月, 2015 1 次提交
    • E
      tcp: provide SYN headers for passive connections · cd8ae852
      Eric Dumazet 提交于
      This patch allows a server application to get the TCP SYN headers for
      its passive connections.  This is useful if the server is doing
      fingerprinting of clients based on SYN packet contents.
      
      Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
      
      The first is used on a socket to enable saving the SYN headers
      for child connections. This can be set before or after the listen()
      call.
      
      The latter is used to retrieve the SYN headers for passive connections,
      if the parent listener has enabled TCP_SAVE_SYN.
      
      TCP_SAVED_SYN is read once, it frees the saved SYN headers.
      
      The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
      headers.
      
      Original patch was written by Tom Herbert, I changed it to not hold
      a full skb (and associated dst and conntracking reference).
      
      We have used such patch for about 3 years at Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd8ae852
  8. 24 4月, 2015 1 次提交
    • E
      inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet 提交于
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume req
      must be found, and we return a status, to conditionally release a
      refcount on the request sock.
      
      This also means tcp_check_req() in non fastopen case might or not
      consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
      to properly handle this.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b357a364
  9. 14 4月, 2015 1 次提交
    • E
      tcp/dccp: get rid of central timewait timer · 789f558c
      Eric Dumazet 提交于
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, code is ugly and source of huge latencies
      (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
      
      We can afford to use an extra 64 bytes per timewait sock and spread
      timewait load to all cpus to have better behavior.
      
      Tested:
      
      On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24)
      
      Before patch :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch :
      
      About 90% increase of throughput :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      if network utilization is 90% higher :
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      789f558c
  10. 10 4月, 2015 1 次提交
  11. 04 4月, 2015 2 次提交
  12. 30 3月, 2015 1 次提交
  13. 25 3月, 2015 4 次提交
  14. 24 3月, 2015 4 次提交
    • M
      tcp: prevent fetching dst twice in early demux code · d0c294c5
      Michal Kubeček 提交于
      On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()
      
              struct dst_entry *dst = sk->sk_rx_dst;
      
              if (dst)
                      dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
      
      to code reading sk->sk_rx_dst twice, once for the test and once for
      the argument of ip6_dst_check() (dst_check() is inline). This allows
      ip6_dst_check() to be called with null first argument, causing a crash.
      
      Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6
      TCP early demux code.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Fixes: c7109986 ("ipv6: Early TCP socket demux")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0c294c5
    • F
      inet: fix double request socket freeing · c6973669
      Fan Du 提交于
      Eric Hugne reported following error :
      
      I'm hitting this warning on latest net-next when i try to SSH into a machine
      with eth0 added to a bridge (but i think the problem is older than that)
      
      Steps to reproduce:
      node2 ~ # brctl addif br0 eth0
      [  223.758785] device eth0 entered promiscuous mode
      node2 ~ # ip link set br0 up
      [  244.503614] br0: port 1(eth0) entered forwarding state
      [  244.505108] br0: port 1(eth0) entered forwarding state
      node2 ~ # [  251.160159] ------------[ cut here ]------------
      [  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
      [  251.162077] Modules linked in:
      [  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
      [  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
      [  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
      [  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
      [  251.167203] Call Trace:
      [  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
      [  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
      [  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
      [  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
      [  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
      [  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
      [  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
      [  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
      [  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
      [  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
      [  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
      [  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
      [  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
      [  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
      [  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
      [  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
      [  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
      [  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
      [  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
      [  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
      [  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
      [  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
      [  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
      [  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
      [  259.542268] br0: port 1(eth0) entered forwarding state
      
      Remove the double calls to reqsk_put()
      
      [edumazet] :
      
      I got confused because reqsk_timer_handler() _has_ to call
      reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
      the timer handler holds a reference on req.
      Signed-off-by: NFan Du <fan.du@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NErik Hugne <erik.hugne@ericsson.com>
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6973669
    • E
      ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets · 26e37360
      Eric Dumazet 提交于
      tcp_v4_err() can restrict lookups to ehash table, and not to listeners.
      
      Note this patch creates the infrastructure, but this means that ICMP
      messages for request sockets are ignored until complete conversion.
      
      New tcp_req_err() helper is exported so that we can use it in IPv6
      in following patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26e37360
    • E
      net: convert syn_wait_lock to a spinlock · b2827053
      Eric Dumazet 提交于
      This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.
      
      We hold syn_wait_lock for such small sections, that it makes no sense to use
      a read/write lock. A spin lock is simply faster.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2827053
  15. 21 3月, 2015 2 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
    • E
      inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet 提交于
      When request sock are put in ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, following patch will get rid of big purge timer,
      so we want to delete a request sock without holding listener lock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52452c54
  16. 19 3月, 2015 2 次提交
  17. 17 3月, 2015 1 次提交
  18. 13 3月, 2015 2 次提交
  19. 07 3月, 2015 2 次提交
    • F
      ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Fan Du 提交于
      As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
      be armed once probing has converged. Once this timer expired,
      probing again to take advantage of any path PMTU change. The
      recommended probing interval is 10 minutes per RFC1981. Probing
      interval could be sysctled by sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested to implement pseudo timer based on 32bits
      jiffies tcp_time_stamp instead of using classic timer for such
      rare event.
      Signed-off-by: NFan Du <fan.du@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05cbc0db
    • F
      ipv4: Use binary search to choose tcp PMTU probe_size · 6b58e0a5
      Fan Du 提交于
      Current probe_size is chosen by doubling mss_cache,
      the probing process will end shortly with a sub-optimal
      mss size, and the link mtu will not be taken full
      advantage of, in return, this will make user to tweak
      tcp_base_mss with care.
      
      Use binary search to choose probe_size in a fine
      granularity manner, an optimal mss will be found
      to boost performance as its maxmium.
      
      In addition, introduce a sysctl_tcp_probe_threshold
      to control when probing will stop in respect to
      the width of search range.
      
      Test env:
      Docker instance with vxlan encapuslation(82599EB)
      iperf -c 10.0.0.24  -t 60
      
      before this patch:
      1.26 Gbits/sec
      
      After this patch: increase 26%
      1.59 Gbits/sec
      Signed-off-by: NFan Du <fan.du@intel.com>
      Acked-by: NJohn Heffner <johnwheffner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b58e0a5
  20. 10 2月, 2015 1 次提交
    • F
      ipv4: Namespecify TCP PMTU mechanism · b0f9ca53
      Fan Du 提交于
      Packetization Layer Path MTU Discovery works separately beside
      Path MTU Discovery at IP level, different net namespace has
      various requirements on which one to chose, e.g., a virutalized
      container instance would require TCP PMTU to probe an usable
      effective mtu for underlying tunnel, while the host would
      employ classical ICMP based PMTU to function.
      
      Hence making TCP PMTU mechanism per net namespace to decouple
      two functionality. Furthermore the probe base MSS should also
      be configured separately for each namespace.
      Signed-off-by: NFan Du <fan.du@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0f9ca53
  21. 02 2月, 2015 1 次提交
    • E
      ipv4: tcp: get rid of ugly unicast_sock · bdbbb852
      Eric Dumazet 提交于
      In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
      I tried to address contention on a socket lock, but the solution
      I chose was horrible :
      
      commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
      of TCP stack") addressed a selinux regression.
      
      commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
      took care of another regression.
      
      commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.
      
      commit 811230cd ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
      was another shot in the dark.
      
      Really, just use a proper socket per cpu, and remove the skb_orphan()
      call, to re-enable flow control.
      
      This solves a serious problem with FQ packet scheduler when used in
      hostile environments, as we do not want to allocate a flow structure
      for every RST packet sent in response to a spoofed packet.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdbbb852
  22. 06 1月, 2015 1 次提交
    • D
      net: tcp: add per route congestion control · 81164413
      Daniel Borkmann 提交于
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given its actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      to enforce the given algorithm so that applications in user space
      would not be allowed to overwrite that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81164413
  23. 10 12月, 2014 1 次提交
  24. 26 11月, 2014 1 次提交
  25. 12 11月, 2014 2 次提交
    • E
      net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet 提交于
      Alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of million of sockets into worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on socket structure which cpu delivered last packet
      is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use :
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as all networking stack should run
      on the appropriate cpu, without need to send IPI (RPS/RFS).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c8c56e1
    • E
      tcp: move sk_mark_napi_id() at the right place · 3d97379a
      Eric Dumazet 提交于
      sk_mark_napi_id() is used to record for a flow napi id of incoming
      packets for busypoll sake.
      We should do this only on established flows, not on listeners.
      
      This was 'working' by virtue of the socket cloning, but doing
      this on SYN packets in unecessary cache line dirtying.
      
      Even if we move sk_napi_id in the same cache line than sk_lock,
      we are working to make SYN processing lockless, so it is desirable
      to set sk_napi_id only for established flows.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d97379a
  26. 23 10月, 2014 1 次提交
    • S
      net: fix saving TX flow hash in sock for outgoing connections · 9e7ceb06
      Sathya Perla 提交于
      The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
      introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
      and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
      to set skb->hash which is used to spread flows across multiple TXQs.
      
      But, the above routines are invoked before the source port of the connection
      is created. Because of this all outgoing connections that just differ in the
      source port get hashed into the same TXQ.
      
      This patch fixes this problem for IPv4/6 by invoking the the above routines
      after the source port is available for the socket.
      
      Fixes: b73c3d0e("net: Save TX flow hash in sock and set in skbuf on xmit")
      Signed-off-by: NSathya Perla <sathya.perla@emulex.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e7ceb06
  27. 18 10月, 2014 2 次提交