1. 02 8月, 2018 1 次提交
  2. 24 7月, 2018 5 次提交
  3. 21 7月, 2018 1 次提交
    • Y
      tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng 提交于
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously DCTCP implementation may continue to delay the ACK. This
      patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0496ef2
  4. 16 7月, 2018 1 次提交
  5. 15 7月, 2018 1 次提交
  6. 13 7月, 2018 1 次提交
    • A
      tcp: use monotonic timestamps for PAWS · cca9bab1
      Arnd Bergmann 提交于
      Using get_seconds() for timestamps is deprecated since it can lead
      to overflows on 32-bit systems. While the interface generally doesn't
      overflow until year 2106, the specific implementation of the TCP PAWS
      algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
      overflow.
      
      A related problem is that the local timestamps in CLOCK_REALTIME form
      lead to unexpected behavior when settimeofday is called to set the system
      clock backwards or forwards by more than 24 days.
      
      While the first problem could be solved by using an overflow-safe method
      of comparing the timestamps, a nicer solution is to use a monotonic
      clocksource with ktime_get_seconds() that simply doesn't overflow (at
      least not until 136 years after boot) and that doesn't change during
      settimeofday().
      
      To make 32-bit and 64-bit architectures behave the same way here, and
      also save a few bytes in the tcp_options_received structure, I'm changing
      the type to a 32-bit integer, which is now safe on all architectures.
      
      Finally, the ts_recent_stamp field also (confusingly) gets used to store
      a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
      This is currently safe, but changing the type to 32-bit requires
      some small changes there to keep it working.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cca9bab1
  7. 02 7月, 2018 1 次提交
  8. 01 7月, 2018 1 次提交
    • I
      tcp: prevent bogus FRTO undos with non-SACK flows · 1236f22f
      Ilpo Järvinen 提交于
      If SACK is not enabled and the first cumulative ACK after the RTO
      retransmission covers more than the retransmitted skb, a spurious
      FRTO undo will trigger (assuming FRTO is enabled for that RTO).
      The reason is that any non-retransmitted segment acknowledged will
      set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
      no indication that it would have been delivered for real (the
      scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
      case so the check for that bit won't help like it does with SACK).
      Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
      in tcp_process_loss.
      
      We need to use more strict condition for non-SACK case and check
      that none of the cumulatively ACKed segments were retransmitted
      to prove that progress is due to original transmissions. Only then
      keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
      non-SACK case.
      
      (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
      to better indicate its purpose but to keep this change minimal, it
      will be done in another patch).
      
      Besides burstiness and congestion control violations, this problem
      can result in RTO loop: When the loss recovery is prematurely
      undoed, only new data will be transmitted (if available) and
      the next retransmission can occur only after a new RTO which in case
      of multiple losses (that are not for consecutive packets) requires
      one RTO per loss to recover.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1236f22f
  9. 30 6月, 2018 1 次提交
  10. 28 6月, 2018 1 次提交
  11. 26 6月, 2018 1 次提交
  12. 22 6月, 2018 1 次提交
  13. 05 6月, 2018 1 次提交
  14. 01 6月, 2018 1 次提交
  15. 23 5月, 2018 2 次提交
  16. 18 5月, 2018 11 次提交
  17. 01 5月, 2018 1 次提交
  18. 24 4月, 2018 1 次提交
  19. 23 4月, 2018 3 次提交
    • Y
      net: init sk_cookie for inet socket · c6849a3a
      Yafang Shao 提交于
      With sk_cookie we can identify a socket, that is very helpful for
      traceing and statistic, i.e. tcp tracepiont and ebpf.
      So we'd better init it by default for inet socket.
      When using it, we just need call atomic64_read(&sk->sk_cookie).
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6849a3a
    • Y
      net: introduce a new tracepoint for tcp_rcv_space_adjust · 6163849d
      Yafang Shao 提交于
      tcp_rcv_space_adjust is called every time data is copied to user space,
      introducing a tcp tracepoint for which could show us when the packet is
      copied to user.
      
      When a tcp packet arrives, tcp_rcv_established() will be called and with
      the existed tracepoint tcp_probe we could get the time when this packet
      arrives.
      Then this packet will be copied to user, and tcp_rcv_space_adjust will
      be called and with this new introduced tracepoint we could get the time
      when this packet is copied to user.
      With these two tracepoints, we could figure out whether the user program
      processes this packet immediately or there's latency.
      
      Hence in the printk message, sk_cookie is printed as a key to relate
      tcp_rcv_space_adjust with tcp_probe.
      
      Maybe we could export sockfd in this new tracepoint as well, then we
      could relate this new tracepoint with epoll/read/recv* tracepoints, and
      finally that could show us the whole lifespan of this packet. But we
      could also implement that with pid as these functions are executed in
      process context.
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6163849d
    • J
      tcp: don't read out-of-bounds opsize · 7e5a206a
      Jann Horn 提交于
      The old code reads the "opsize" variable from out-of-bounds memory (first
      byte behind the segment) if a broken TCP segment ends directly after an
      opcode that is neither EOL nor NOP.
      
      The result of the read isn't used for anything, so the worst thing that
      could theoretically happen is a pagefault; and since the physmap is usually
      mostly contiguous, even that seems pretty unlikely.
      
      The following C reproducer triggers the uninitialized read - however, you
      can't actually see anything happen unless you put something like a
      pr_warn() in tcp_parse_md5sig_option() to print the opsize.
      
      ====================================
      #define _GNU_SOURCE
      #include <arpa/inet.h>
      #include <stdlib.h>
      #include <errno.h>
      #include <stdarg.h>
      #include <net/if.h>
      #include <linux/if.h>
      #include <linux/ip.h>
      #include <linux/tcp.h>
      #include <linux/in.h>
      #include <linux/if_tun.h>
      #include <err.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <string.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <assert.h>
      
      void systemf(const char *command, ...) {
        char *full_command;
        va_list ap;
        va_start(ap, command);
        if (vasprintf(&full_command, command, ap) == -1)
          err(1, "vasprintf");
        va_end(ap);
        printf("systemf: <<<%s>>>\n", full_command);
        system(full_command);
      }
      
      char *devname;
      
      int tun_alloc(char *name) {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd == -1)
          err(1, "open tun dev");
        static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
        strcpy(req.ifr_name, name);
        if (ioctl(fd, TUNSETIFF, &req))
          err(1, "TUNSETIFF");
        devname = req.ifr_name;
        printf("device name: %s\n", devname);
        return fd;
      }
      
      #define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))
      
      void sum_accumulate(unsigned int *sum, void *data, int len) {
        assert((len&2)==0);
        for (int i=0; i<len/2; i++) {
          *sum += ntohs(((unsigned short *)data)[i]);
        }
      }
      
      unsigned short sum_final(unsigned int sum) {
        sum = (sum >> 16) + (sum & 0xffff);
        sum = (sum >> 16) + (sum & 0xffff);
        return htons(~sum);
      }
      
      void fix_ip_sum(struct iphdr *ip) {
        unsigned int sum = 0;
        sum_accumulate(&sum, ip, sizeof(*ip));
        ip->check = sum_final(sum);
      }
      
      void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
        unsigned int sum = 0;
        struct {
          unsigned int saddr;
          unsigned int daddr;
          unsigned char pad;
          unsigned char proto_num;
          unsigned short tcp_len;
        } fakehdr = {
          .saddr = ip->saddr,
          .daddr = ip->daddr,
          .proto_num = ip->protocol,
          .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
        };
        sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
        sum_accumulate(&sum, tcp, tcp->doff*4);
        tcp->check = sum_final(sum);
      }
      
      int main(void) {
        int tun_fd = tun_alloc("inject_dev%d");
        systemf("ip link set %s up", devname);
        systemf("ip addr add 192.168.42.1/24 dev %s", devname);
      
        struct {
          struct iphdr ip;
          struct tcphdr tcp;
          unsigned char tcp_opts[20];
        } __attribute__((packed)) syn_packet = {
          .ip = {
            .ihl = sizeof(struct iphdr)/4,
            .version = 4,
            .tot_len = htons(sizeof(syn_packet)),
            .ttl = 30,
            .protocol = IPPROTO_TCP,
            /* FIXUP check */
            .saddr = IPADDR(192,168,42,2),
            .daddr = IPADDR(192,168,42,1)
          },
          .tcp = {
            .source = htons(1),
            .dest = htons(1337),
            .seq = 0x12345678,
            .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
            .syn = 1,
            .window = htons(64),
            .check = 0 /*FIXUP*/
          },
          .tcp_opts = {
            /* INVALID: trailing MD5SIG opcode after NOPs */
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 19
          }
        };
        fix_ip_sum(&syn_packet.ip);
        fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
        while (1) {
          int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
          if (write_res != sizeof(syn_packet))
            err(1, "packet write failed");
        }
      }
      ====================================
      
      Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e5a206a
  20. 20 4月, 2018 4 次提交