1. Aug 17, 2018 (1 commit)
    • tcp, ulp: add alias for all ulp modules · 037b0b86
      By Daniel Borkmann
      Let's not turn the TCP ULP lookup into an arbitrary module loader,
      as we only intend to load ULP modules through this mechanism, not
      other unrelated kernel modules:
      
        [root@bar]# cat foo.c
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <linux/tcp.h>
        #include <linux/in.h>
      
        int main(void)
        {
            int sock = socket(PF_INET, SOCK_STREAM, 0);
            setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
            return 0;
        }
      
        [root@bar]# gcc foo.c -O2 -Wall
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        sctp                 1077248  4
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
        [root@bar]#
      
      Fix it by adding a module alias to TCP ULP modules, so that probing
      a module via request_module() is limited to tcp-ulp-[name].  Existing
      modules like kTLS will load fine given the tcp-ulp-tls alias, but
      others will fail to load:
      
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        [root@bar]#
      
      Sockmap is not affected by this since it is either built-in or not
      present at all.
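
      A minimal sketch of the alias scheme (macro body simplified; treat
      it as illustrative rather than the exact upstream definition):

        /* ULP modules declare an alias in a dedicated namespace: */
        #define MODULE_ALIAS_TCP_ULP(name) MODULE_ALIAS("tcp-ulp-" name)

        /* in the ULP module, e.g. kTLS: */
        MODULE_ALIAS_TCP_ULP("tls");

        /* the setsockopt(TCP_ULP) path then probes only that namespace,
         * so "sctp" maps to the nonexistent "tcp-ulp-sctp" alias: */
        request_module("tcp-ulp-%s", name);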
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. Aug 11, 2018 (1 commit)
    • tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      By Martin KaFai Lau
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk-specific info, it checks whether the sk has a recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per-sk "recent synq overflow event timestamp" works well
      for the non-SO_REUSEPORT use case.  However, it may cause random
      connection request rejects/discards when SO_REUSEPORT is used with
      syncookies, because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, there are usually multiple listening
      socks serving TCP connection requests destined to the same local
      IP:PORT.  There are cases where the TCP-ACK-COOKIE may not be
      received by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1}:
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         There are other orderings that will trigger a similar situation,
         but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 is selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true.  In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      Userspace may create and remove listening SO_REUSEPORT sockets as
      it sees fit, e.g. adding a new thread (and SO_REUSEPORT sock) to
      handle incoming requests, an old process stopping and a new process
      starting, etc.  With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      sockets leaving and joining a reuseport group make picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      Later patches will allow a bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and in selecting a
      particular SO_REUSEPORT sock as it sees fit, e.g. replacing the
      whole bpf reuseport_array in one map_update() by using map-in-map,
      without closing any sock.  Getting the syncookie check to work
      smoothly across socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set a newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of the reuseport group, but that would
      require iterating through reuse->socks[], OR pessimistically
      setting it to "now - TCP_SYNCOOKIE_VALID" when a sk joins a
      reuseport group.  However, neither of them handles an existing sk
      getting moved around reuse->socks[] such that the sk may not have
      ts_recent_stamp updated; unlikely under continuous synflood, but
      not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
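
      A hedged sketch of the update path described above (simplified;
      names follow the commit text):

        static inline void tcp_synq_overflow(const struct sock *sk)
        {
                unsigned int now = jiffies;

                if (sk->sk_reuseport) {
                        struct sock_reuseport *reuse;

                        reuse = rcu_dereference(sk->sk_reuseport_cb);
                        if (likely(reuse)) {
                                unsigned int last;

                                last = READ_ONCE(reuse->synq_overflow_ts);
                                /* write at most roughly once per second */
                                if (time_after32(now, last + HZ))
                                        WRITE_ONCE(reuse->synq_overflow_ts, now);
                                return;
                        }
                }
                /* no reuseport group: fall back to the per-sk stamp */
                if (time_after32(now, tcp_sk(sk)->rx_opt.ts_recent_stamp + HZ))
                        tcp_sk(sk)->rx_opt.ts_recent_stamp = now;
        }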
      
      A synflood test was done with 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change was observed.  Both before and
      after the change, throughput is ~9 Mpps in IPv4.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. Jul 22, 2018 (1 commit)
    • net/ipv6: Fix linklocal to global address with VRF · 24b711ed
      By David Ahern
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
      
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
      
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      
      (NOTE: UDP works fine if the PKTINFO has the address set to the global
      address and the ifindex is set to the index of eth1, with an LLA as
      the destination.)
      
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master.  If it is, then return the ifindex from rt6i_idev, similar
      to what is done for loopback.
      
      For TCP, restore the original tcp_v6_iif definition, which is needed
      in most places, and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability.  This latter check is only needed for socket
      lookups.
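
      A sketch of the two TCP helpers described above (per the commit
      text; simplified):

        static inline int tcp_v6_iif(const struct sk_buff *skb)
        {
                return TCP_SKB_CB(skb)->header.h6.iif;
        }

        /* variant for socket lookups, which must consider L3 slaves */
        static inline int tcp_v6_iif_l3_slave(const struct sk_buff *skb)
        {
                bool l3_slave = ipv6_l3mdev_skb(TCP_SKB_CB(skb)->header.h6.flags);

                return l3_slave ? skb->skb_iif : TCP_SKB_CB(skb)->header.h6.iif;
        }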
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Jul 21, 2018 (2 commits)
    • tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      By Yuchung Cheng
      Per the DCTCP RFC (RFC 8257, Section 3.2), the ACK reflecting the CE
      status change has to be sent immediately so the sender can respond
      quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously the DCTCP implementation might continue to delay the ACK.
      This patch fixes that and implements the RFC by forcing an immediate
      ACK.
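
      A hedged sketch of the CE=0 -> CE=1 handler after the fix
      (simplified; exact helper names may differ):

        static void dctcp_ce_state_0_to_1(struct sock *sk)
        {
                struct dctcp *ca = inet_csk_ca(sk);
                struct tcp_sock *tp = tcp_sk(sk);

                if (!ca->ce_state) {
                        /* If an ACK is delayed, first ACK what was seen
                         * under the old CE bit, then demand an immediate
                         * ACK for the new CE state. */
                        if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
                                __tcp_send_ack(sk, ca->prior_rcv_nxt);
                        tcp_enter_quickack_mode(sk, 1);
                }
                ca->prior_rcv_nxt = tp->rcv_nxt;
                ca->ce_state = 1;
                tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
        }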
      
      Tested with this packetdrill script provided by Larry Brakmo:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not cancel delay-AcK on DCTCP special ACK · 27cde44a
      By Yuchung Cheng
      Currently, when a DCTCP receiver delays an ACK and receives a
      data packet with a different CE mark from the previous one, it
      sends two immediate ACKs acking the previous and latest sequences
      respectively (for ECN accounting).

      Previously, sending the first ACK could cancel the delayed ACK
      timer (tcp_event_ack_sent).  This could subsequently prevent
      sending the second ACK to acknowledge the latest sequence
      (tcp_ack_snd_check).  The culprit is that tcp_send_ack() assumes
      it always acknowledges the latest sequence, which is not true
      for the first special ACK.
      
      The fix is to not make that assumption in tcp_send_ack and to check
      the actual ack sequence before cancelling the delayed ACK.  Further,
      it is safer to pass the ack sequence number into the tcp_send_ack
      routine as an explicit argument, instead of reading tp->rcv_nxt, to
      avoid future bugs like this.
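
      A sketch of the reshaped routines (per the commit text;
      simplified):

        /* the ACK sequence becomes an explicit argument ... */
        void tcp_send_ack(struct sock *sk)
        {
                __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
        }

        /* ... and the delayed-ACK cancel happens only for an ACK that
         * really acknowledges the latest sequence: */
        static void tcp_event_ack_sent(struct sock *sk, unsigned int pkts,
                                       u32 rcv_nxt)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                if (unlikely(rcv_nxt != tp->rcv_nxt))
                        return;  /* special ACK, e.g. DCTCP's prior-seq ACK */

                tcp_dec_quickack_mode(sk, pkts);
                inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
        }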
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Jul 14, 2018 (1 commit)
  6. Jul 13, 2018 (1 commit)
    • tcp: use monotonic timestamps for PAWS · cca9bab1
      By Arnd Bergmann
      Using get_seconds() for timestamps is deprecated since it can lead
      to overflows on 32-bit systems. While the interface generally doesn't
      overflow until year 2106, the specific implementation of the TCP PAWS
      algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
      overflow.
      
      A related problem is that the local timestamps in CLOCK_REALTIME form
      lead to unexpected behavior when settimeofday is called to set the system
      clock backwards or forwards by more than 24 days.
      
      While the first problem could be solved by using an overflow-safe method
      of comparing the timestamps, a nicer solution is to use a monotonic
      clocksource with ktime_get_seconds() that simply doesn't overflow (at
      least not until 136 years after boot) and that doesn't change during
      settimeofday().
      
      To make 32-bit and 64-bit architectures behave the same way here, and
      also save a few bytes in the tcp_options_received structure, I'm changing
      the type to a 32-bit integer, which is now safe on all architectures.
      
      Finally, the ts_recent_stamp field also (confusingly) gets used to store
      a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
      This is currently safe, but changing the type to 32-bit requires
      some small changes there to keep it working.
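
      A minimal sketch of the substitution (illustrative; the helper
      below is where the stamp is written):

        static void tcp_store_ts_recent(struct tcp_sock *tp)
        {
                tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
                /* monotonic seconds, truncated into the now-32-bit field,
                 * instead of a CLOCK_REALTIME time_t from get_seconds() */
                tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
        }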
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. Jul 12, 2018 (1 commit)
  8. Jul 08, 2018 (1 commit)
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      By John Fastabend
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost.  In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety.  Then, when v4.14 was merged into the v4.15-rc1 net-next
      tree, we lost the piece that renamed bpf_compute_data_pointers to
      the new function bpf_compute_data_end_sk_skb.  This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      when it conflicted with the following rename patch:
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage.  Using the bpf_skc_data_end mapping is not correct, because
      it expects a qdisc_skb_cb object, but at the sock layer this is not
      the case.  Even though it happens to work here, because we don't
      overwrite any in-use data at the socket layer and the cb structure
      is cleared later, this has the potential to create subtle issues.
      But, even more concretely, the filter.c access check uses tcp_skb_cb.
      
      By some act of chance, though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset we get 40 in
      either case, and everything continues to work.  Fix this mess and
      use the correct structures; it is unclear how long this would
      actually keep working once someone moves the structs around.
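
      A sketch of the restored helper (per the commit text; simplified):

        /* compute data_end through the tcp_skb_cb layout used at the
         * sock layer, not the qdisc_skb_cb-based bpf_skb_data_end: */
        static void bpf_compute_data_end_sk_skb(struct sk_buff *skb)
        {
                TCP_SKB_CB(skb)->bpf.data_end = skb->data + skb_headlen(skb);
        }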
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. Jun 29, 2018 (1 commit)
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      By Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. Jun 26, 2018 (1 commit)
    • net: Convert GRO SKB handling to list_head. · d4546c25
      By David Miller
      Manage pending per-NAPI GRO packets via list_head.
      
      Return an SKB pointer from the GRO receive handlers.  When GRO receive
      handlers return non-NULL, it means that this SKB needs to be completed
      at this time and removed from the NAPI queue.
      
      Several operations are greatly simplified by this transformation,
      especially timing out the oldest SKB in the list when gro_count
      exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
      in reverse order.
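
      A hedged sketch of the reshaped callback signature (per the commit
      text; simplified):

        /* GRO receive handlers now take the pending-packet list and
         * return the SKB to complete, or NULL to keep accumulating: */
        struct sk_buff *(*gro_receive)(struct list_head *head,
                                       struct sk_buff *skb);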
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. Jun 22, 2018 (1 commit)
  12. Jun 01, 2018 (1 commit)
  13. May 26, 2018 (1 commit)
  14. May 18, 2018 (6 commits)
  15. May 16, 2018 (1 commit)
  16. May 15, 2018 (1 commit)
    • bpf: sockmap, refactor sockmap routines to work with hashmap · e5cd3abc
      By John Fastabend
      This patch only refactors the existing sockmap code.  This will
      allow much of the psock initialization code path and bpf helper
      code to work for both sockmap bpf map types: the currently
      supported array-backed type, and the new hash-backed bpf map
      type, sockhash.

      Most of the fallout comes from three changes (see the sketch after
      the list):
      
        - Pushing bpf programs into an independent structure so we
          can use it from the htab struct in the next patch.
        - Generalizing helpers to use void *key instead of the hardcoded
          u32.
        - Instead of passing map/key through the metadata we now do
          the lookup inline.  This avoids storing the key in the metadata,
          which will be useful when keys can be longer than 4 bytes.  We
          rename the sk pointers to sk_redir at this point as well, to
          avoid any confusion between the current sk pointer and the
          redirect pointer sk_redir.
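
      A hedged sketch of the first change (structure name per the
      commit; layout illustrative):

        /* programs pulled out of the array-backed map struct, so the
         * upcoming sockhash htab struct can embed the same thing: */
        struct bpf_sock_progs {
                struct bpf_prog *bpf_tx_msg;
                struct bpf_prog *bpf_parse;
                struct bpf_prog *bpf_verdict;
        };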
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. May 12, 2018 (1 commit)
  18. May 01, 2018 (1 commit)
  19. Apr 17, 2018 (3 commits)
    • tcp: implement mmap() for zero copy receive · 93ab6cc6
      By Eric Dumazet
      With well-chosen MSS/MTU and suitable architectures, some networks
      can ensure the TCP payload exactly fits 4KB pages.
      
      Implement the mmap() system call so that applications can avoid
      copying data, without resorting to complex splice() games.
      
      Note that a successful mmap(X bytes) on a TCP socket consumes
      bytes, as if recvmsg() had been done (tp->copied += X).
      
      Only PROT_READ mappings are accepted, as skb page frags
      are fundamentally shared and read only.
      
      If tcp_mmap() finds data that is not a full page, or a patch of
      urgent data, -EINVAL is returned and no bytes are consumed.

      The application must fall back to recvmsg() to read the problematic
      sequence.
      
      mmap() won't block, regardless of whether the socket is in blocking
      or non-blocking mode.  If not enough bytes are in the receive queue,
      mmap() returns -EAGAIN, or -EIO if the socket is in a state where no
      more bytes can be added to the receive queue.
      
      An application might use SO_RCVLOWAT, poll() and/or ioctl(FIONREAD)
      to use mmap() efficiently.
      
      On the sender side, MSG_EOR might help to clearly separate unaligned
      headers and 4K-aligned chunks if necessary.
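
      A hedged user-space sketch of the receive path described above
      (error handling elided; consume() is a hypothetical callback):

        #include <sys/mman.h>
        #include <unistd.h>

        extern void consume(const void *p, size_t n);  /* hypothetical */

        static ssize_t read_chunk(int fd, size_t chunk)
        {
                void *p = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);

                if (p != MAP_FAILED) {          /* zero-copy path */
                        consume(p, chunk);
                        munmap(p, chunk);
                        return chunk;
                }
                /* -EINVAL/-EAGAIN/-EIO: fall back to a copying read */
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0)
                        consume(buf, n);
                return n;
        }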
      
      Tested:
      
      mlx4 (cx-3) 40Gbit NIC, with the tcp_mmap program provided in a
      following patch.  MTU set to 4168 (4096 bytes TCP payload,
      40 bytes IPv6 header, 32 bytes TCP header).
      
      Without mmap() (tcp_mmap -s)
      
      received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
        cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
      received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
        cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
      received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
        cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
      received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
        cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
      
      With mmap() on receiver (tcp_mmap -s -z)
      
      received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
        cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
        cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
        cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
      received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
        cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      By Eric Dumazet
      SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
      generated when enough bytes are available in the receive queue, after
      David's change (commit c7004482, "tcp: Respect SO_RCVLOWAT in
      tcp_poll().")

      But TCP still calls sk->sk_data_ready() for each chunk added to the
      receive queue, meaning the thread is awakened and goes back to sleep
      shortly after.
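
      A sketch of the deferred wakeup (per the commit's approach;
      simplified):

        static void tcp_data_ready(struct sock *sk)
        {
                const struct tcp_sock *tp = tcp_sk(sk);
                int avail = tp->rcv_nxt - tp->copied_seq;

                /* don't wake the reader below its SO_RCVLOWAT threshold,
                 * unless the socket is done (EOF/error must still wake) */
                if (avail < sk->sk_rcvlowat && !sock_flag(sk, SOCK_DONE))
                        return;

                sk->sk_data_ready(sk);
        }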
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      By Eric Dumazet
      Applications might use SO_RCVLOWAT on a TCP socket, hoping to
      receive one [E]POLLIN event only when a given amount of bytes is
      ready in the socket receive queue.

      The problem is that receive autotuning is not aware of this
      constraint, meaning sk_rcvbuf might be too small to allow all bytes
      to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
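
      A hedged sketch of the hook and a TCP override (simplified; the
      real override also respects sysctl limits):

        /* struct proto_ops gains: */
        int (*set_rcvlowat)(struct sock *sk, int val);

        int tcp_set_rcvlowat(struct sock *sk, int val)
        {
                sk->sk_rcvlowat = val ? : 1;

                /* let the autotuned sk_rcvbuf grow enough to actually
                 * hold val bytes */
                if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
                        sk->sk_rcvbuf = max_t(int, sk->sk_rcvbuf, 2 * val);

                return 0;
        }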
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. Mar 02, 2018 (1 commit)
    • tcp_bbr: better deal with suboptimal GSO (II) · dcb8c9b4
      By Eric Dumazet
      This is the second part of dealing with suboptimal device GSO
      parameters.  In the first patch (350c9f48, "tcp_bbr: better deal
      with suboptimal GSO") we dealt with devices having a low
      gso_max_segs.

      Some devices lower gso_max_size from 64KB to 16KB (r8152 is an
      example).
      
      In order to probe an optimal cwnd, we want BBR not to be sensitive
      to whatever GSO constraint a device may have.

      This patch removes the tso_segs_goal() CC callback in favor of
      min_tso_segs(), for CCs wanting to override
      sysctl_tcp_min_tso_segs.

      The next patch will remove bbr->tso_segs_goal since it does not
      have to be persistent.
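
      A sketch of the replacement callback (per the commit text; BBR's
      values illustrative):

        /* in struct tcp_congestion_ops, tso_segs_goal() is replaced by */
        u32 (*min_tso_segs)(struct sock *sk);

        /* BBR only enforces a floor; the stack derives the actual goal
         * from the pacing rate and sysctl_tcp_min_tso_segs: */
        static u32 bbr_min_tso_segs(struct sock *sk)
        {
                return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
        }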
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. Feb 15, 2018 (1 commit)
    • tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
      By Eric Dumazet
      배석진 reported that in some situations, packets for a given 5-tuple
      end up being processed by different CPUs.

      This involves RPS and fragmentation.

      배석진 is seeing packet drops when a SYN_RECV request socket is
      moved into ESTABLISHED state.  Other states are protected by the
      socket lock.

      This is caused by a CPU losing the race, and simply not caring enough.
      
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another CPU.
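
      A hedged sketch of the retry (simplified from the description;
      labels illustrative):

        nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
        if (!nsk) {
                reqsk_put(req);
                if (req_stolen) {
                        /* another CPU won the SYN_RECV -> ESTABLISHED
                         * race; redo the lookup instead of dropping */
                        goto lookup;
                }
                goto discard_and_relse;
        }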
      
      Note that the fragmentation could be avoided in the first place by
      use of a correct TCP MSS option in the SYN{ACK} packet, but this
      does not mean we cannot be more robust.
      
      Many thanks to 배석진 for a very detailed analysis.
      Reported-by: 배석진 <soukjin.bae@samsung.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. Feb 06, 2018 (2 commits)
    • bpf: sockmap, add sock close() hook to remove socks · 1aa12bdf
      By John Fastabend
      The selftests test_maps program was leaving dangling BPF sockmap
      programs around because not all psock elements were removed from
      the map.  The elements in turn hold a reference on the BPF program
      they are attached to, causing BPF programs to stay open even after
      test_maps has completed.
      
      The original intent was that sk_state_change() would be called
      when TCP socks went through TCP_CLOSE state.  However, because
      socks may be in SOCK_DEAD state, or the sock may be a listening
      socket, the event is not always triggered.
      
      To resolve this, use the ULP infrastructure and register our own
      proto close() handler.  This fixes the above case.
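
      A hedged sketch of the hook (names illustrative; the real code
      saves and restores the original proto handler):

        static void bpf_tcp_close(struct sock *sk, long timeout)
        {
                /* unlink this sock's psock from every map element that
                 * references it, dropping the prog references ... */
                smap_release_sock(sk);          /* illustrative */
                /* ... then chain to the saved original close() */
                saved_tcp_close(sk, timeout);
        }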
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: add a UID to use for ULP socket assignment · b11a632c
      By John Fastabend
      Create a UID field and enum that can be used to assign ULPs to
      sockets. This saves a set of string comparisons if the ULP id
      is known.
      
      For sockmap, which is added in the next patches, a ULP is used to
      hook into a TCP socket's close state.  In this case the ULP is
      added at map insert time, is known, and is set up on the kernel
      side, so the named lookup is not needed.  Because we don't want to
      expose psock internals to user-space socket options, a user-visible
      flag is also added.  For TLS this is set; for BPF it will be
      cleared.
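
      A sketch of the additions (per the commit text):

        enum {
                TCP_ULP_TLS,
                TCP_ULP_BPF,
        };

        struct tcp_ulp_ops {
                /* ... */
                int     uid;            /* saves string compares */
                bool    user_visible;   /* settable via setsockopt()? */
        };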
      
      Also remove the pr_notice; the user gets an error code back and
      should check that rather than rely on logs.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  23. Jan 26, 2018 (2 commits)
  24. Jan 20, 2018 (1 commit)
  25. Dec 14, 2017 (1 commit)
  26. Dec 08, 2017 (1 commit)
  27. Dec 05, 2017 (1 commit)
    • bpf: Add access to snd_cwnd and others in sock_ops · f19397a5
      By Lawrence Brakmo
      Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these
      fields are only valid if the socket associated with the sock_ops program
      call is a full socket, the field is_fullsock is also added to the
      bpf_sock_ops struct. If the socket is not a full socket, reading these
      fields returns 0.
      
      Note that in most cases it will not be necessary to check is_fullsock to
      know if there is a full socket. The context of the call, as specified by
      the 'op' field, can sometimes determine whether there is a full socket.
      
      The struct bpf_sock_ops has the following fields added:
      
        __u32 is_fullsock;      /* Some TCP fields are only valid if
                                 * there is a full socket. If not, the
                                 * fields read as zero.
                                 */
        __u32 snd_cwnd;
        __u32 srtt_us;          /* Averaged RTT << 3 in usecs */
      
      There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add
      read access to more 32 bit tcp_sock fields.
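
      A hedged usage sketch (a sock_ops program reading the new fields;
      section name and bpf_printk follow common libbpf conventions):

        SEC("sockops")
        int watch_cwnd(struct bpf_sock_ops *skops)
        {
                /* snd_cwnd/srtt_us read as zero without a full socket */
                if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
                    skops->is_fullsock)
                        bpf_printk("cwnd=%u srtt_us=%u\n",
                                   skops->snd_cwnd, skops->srtt_us >> 3);
                return 1;
        }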
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  28. Dec 04, 2017 (1 commit)
  29. Nov 28, 2017 (1 commit)
  30. Nov 19, 2017 (1 commit)
    • tcp: when scheduling TLP, time of RTO should account for current ACK · ed66dfaf
      By Neal Cardwell
      Fix the TLP scheduling logic so that when scheduling a TLP probe, we
      ensure that the estimated time at which an RTO would fire accounts for
      the fact that ACKs indicating forward progress should push back RTO
      times.
      
      After the following fix:
      
      df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      
      we had an unintentional behavior change in the following kind of
      scenario: suppose the RTT variance has been very low recently. Then
      suppose we send out a flight of N packets and our RTT is 100ms:
      
      t=0: send a flight of N packets
      t=100ms: receive an ACK for N-1 packets
      
      The response before df92c839 was:
        -> schedule a TLP for now + RTO_interval
      
      The response after df92c839 is:
        -> schedule a TLP for t=0 + RTO_interval
      
      Since RTO_interval = srtt + RTT_variance, this means that we have
      scheduled a TLP timer at a point in the future that only accounts for
      RTT_variance. If the RTT_variance term is small, this means that the
      timer fires soon.
      
      Before df92c839 this would not happen, because in that code, when
      we receive an ACK for a prefix of the flight, we did:
      
          1) Near the top of tcp_ack(), switch from TLP timer to RTO
             at write_queue_head->packet_tx_time + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                         tcp_rearm_rto(sk);
      
          2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                  if (flag & FLAG_ACKED) {
                         tcp_rearm_rto(sk);
      
          3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
             to TLP at now + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                         tcp_schedule_loss_probe(sk);
      
      In df92c839 we removed that 3-phase dance, and instead directly
      set the TLP timer once: we set the TLP timer in cases like this to
      write_queue_head->packet_tx_time + RTO_interval.  So if the RTT
      variance is small, this sets the TLP timer to fire quite soon.  If
      the ACK for the tail of the flight then takes longer than an RTT to
      arrive (often due to delayed ACKs), the TLP timer fires too quickly.
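
      A hedged sketch of the adjustment in tcp_schedule_loss_probe()
      (simplified; variable names follow the upstream patch):

        /* if this ACK advances forward progress, measure the RTO window
         * from now instead of from the head packet's xmit time: */
        rto_delta_us = advancing_rto ?
                        jiffies_to_usecs(inet_csk(sk)->icsk_rto) :
                        tcp_rto_delta_us(sk);   /* time left until RTO */
        if (rto_delta_us > 0)
                timeout = min_t(u32, timeout,
                                usecs_to_jiffies(rto_delta_us));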
      
      Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>