1. 18 6月, 2019 1 次提交
  2. 04 6月, 2019 1 次提交
  3. 19 3月, 2019 1 次提交
  4. 23 2月, 2019 1 次提交
  5. 31 1月, 2019 1 次提交
  6. 01 12月, 2018 1 次提交
  7. 04 11月, 2018 1 次提交
  8. 08 9月, 2018 1 次提交
  9. 04 8月, 2018 1 次提交
  10. 02 8月, 2018 5 次提交
  11. 31 7月, 2018 1 次提交
  12. 22 7月, 2018 1 次提交
  13. 19 7月, 2018 1 次提交
  14. 17 7月, 2018 1 次提交
  15. 13 7月, 2018 1 次提交
  16. 10 7月, 2018 1 次提交
  17. 08 7月, 2018 3 次提交
  18. 07 7月, 2018 1 次提交
  19. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  20. 22 6月, 2018 1 次提交
  21. 11 6月, 2018 1 次提交
  22. 26 5月, 2018 1 次提交
  23. 18 5月, 2018 1 次提交
    • E
      tcp: add SACK compression · 5d9f4262
      Eric Dumazet 提交于
      When TCP receives an out-of-order packet, it immediately sends
      a SACK packet, generating network load but also forcing the
      receiver to send 1-MSS pathological packets, increasing its
      RTX queue length/depth, and thus processing time.
      
      Wifi networks suffer from this aggressive behavior, but generally
      speaking, all these SACK packets add fuel to the fire when networks
      are under congestion.
      
      This patch adds a high resolution timer and tp->compressed_ack counter.
      
      Instead of sending a SACK, we program this timer with a small delay,
      based on RTT and capped to 1 ms :
      
      	delay = min ( 5 % of RTT, 1 ms)
      
      If subsequent SACKs need to be sent while the timer has not yet
      expired, we simply increment tp->compressed_ack.
      
      When timer expires, a SACK is sent with the latest information.
      Whenever an ACK is sent (if data is sent, or if in-order
      data is received) timer is canceled.
      
      Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
      if the sack blocks need to be shuffled, even if the timer has not
      expired.
      
      A new SNMP counter is added in the following patch.
      
      Two other patches add sysctls to allow changing the 1,000,000 and 44
      values that this commit hard-coded.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d9f4262
  24. 03 5月, 2018 1 次提交
    • E
      tcp: restore autocorking · 114f39fe
      Eric Dumazet 提交于
      When adding rb-tree for TCP retransmit queue, we inadvertently broke
      TCP autocorking.
      
      tcp_should_autocork() should really check if the rtx queue is not empty.
      
      Tested:
      
      Before the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      2682.85   2.47     1.59     3.618   2.329
      TcpExtTCPAutoCorking            33                 0.0
      
      // Same test, but forcing TCP_NODELAY
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      1408.75   2.44     2.96     6.802   8.259
      TcpExtTCPAutoCorking            1                  0.0
      
      After the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5472.46   2.45     1.43     1.761   1.027
      TcpExtTCPAutoCorking            361293             0.0
      
      // With TCP_NODELAY option
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5454.96   2.46     1.63     1.775   1.174
      TcpExtTCPAutoCorking            315448             0.0
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichael Wenig <mwenig@vmware.com>
      Tested-by: NMichael Wenig <mwenig@vmware.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichael Wenig <mwenig@vmware.com>
      Tested-by: NMichael Wenig <mwenig@vmware.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      114f39fe
  25. 02 5月, 2018 2 次提交
    • S
      tcp: send in-queue bytes in cmsg upon read · b75eba76
      Soheil Hassas Yeganeh 提交于
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer size to drain sockets. Knowing
      the size of receive queue length, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But, both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Suggested-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b75eba76
    • E
      tcp: fix TCP_REPAIR_QUEUE bound checking · bf2acc94
      Eric Dumazet 提交于
      syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
      with following C-repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
      sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
      	1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
      setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
      writev(3, [{"\270", 1}], 1)             = 1
      setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
      writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144
      
      The 3rd system call looks odd :
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      
      This patch makes sure bound checking is using an unsigned compare.
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf2acc94
  26. 30 4月, 2018 1 次提交
    • E
      tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82
      Eric Dumazet 提交于
      When adding tcp mmap() implementation, I forgot that socket lock
      had to be taken before current->mm->mmap_sem. syzbot eventually caught
      the bug.
      
      Since we can not lock the socket in tcp mmap() handler we have to
      split the operation in two phases.
      
      1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
        This operation does not involve any TCP locking.
      
      2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
       the transfert of pages from skbs to one VMA.
        This operation only uses down_read(&current->mm->mmap_sem) after
        holding TCP lock, thus solving the lockdep issue.
      
      This new implementation was suggested by Andy Lutomirski with great details.
      
      Benefits are :
      
      - Better scalability, in case multiple threads reuse VMAS
         (without mmap()/munmap() calls) since mmap_sem wont be write locked.
      
      - Better error recovery.
         The previous mmap() model had to provide the expected size of the
         mapping. If for some reason one part could not be mapped (partial MSS),
         the whole operation had to be aborted.
         With the tcp_zerocopy_receive struct, kernel can report how
         many bytes were successfuly mapped, and how many bytes should
         be read to skip the problematic sequence.
      
      - No more memory allocation to hold an array of page pointers.
        16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
      
      - skbs are freed while mmap_sem has been released
      
      Following patch makes the change in tcp_mmap tool to demonstrate
      one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
      
      Note that memcg might require additional changes.
      
      Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Cc: linux-mm@kvack.org
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05255b82
  27. 27 4月, 2018 1 次提交
  28. 20 4月, 2018 2 次提交
  29. 17 4月, 2018 3 次提交
    • E
      tcp: implement mmap() for zero copy receive · 93ab6cc6
      Eric Dumazet 提交于
      Some networks can make sure TCP payload can exactly fit 4KB pages,
      with well chosen MSS/MTU and architectures.
      
      Implement mmap() system call so that applications can avoid
      copying data without complex splice() games.
      
      Note that a successful mmap( X bytes) on TCP socket is consuming
      bytes, as if recvmsg() has been done. (tp->copied += X)
      
      Only PROT_READ mappings are accepted, as skb page frags
      are fundamentally shared and read only.
      
      If tcp_mmap() finds data that is not a full page, or a patch of
      urgent data, -EINVAL is returned, no bytes are consumed.
      
      Application must fallback to recvmsg() to read the problematic sequence.
      
      mmap() wont block,  regardless of socket being in blocking or
      non-blocking mode. If not enough bytes are in receive queue,
      mmap() would return -EAGAIN, or -EIO if socket is in a state
      where no other bytes can be added into receive queue.
      
      An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
      to efficiently use mmap()
      
      On the sender side, MSG_EOR might help to clearly separate unaligned
      headers and 4K-aligned chunks if necessary.
      
      Tested:
      
      mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
      MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
      
      Without mmap() (tcp_mmap -s)
      
      received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
        cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
      received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
        cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
      received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
        cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
      received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
        cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
      
      With mmap() on receiver (tcp_mmap -s -z)
      
      received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
        cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
        cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
        cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
      received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
        cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93ab6cc6
    • E
      tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      Eric Dumazet 提交于
      SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
      generated when enough bytes are available in receive queue, after
      David change (commit c7004482 "tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      But TCP still calls sk->sk_data_ready() for each chunk added in receive
      queue, meaning thread is awaken, and goes back to sleep shortly after.
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03f45c88
    • E
      tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet 提交于
      Applications might use SO_RCVLOWAT on TCP socket hoping to receive
      one [E]POLLIN event only when a given amount of bytes are ready in socket
      receive queue.
      
      Problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1361840
  30. 16 4月, 2018 1 次提交