    tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82
    Eric Dumazet committed
    When adding tcp mmap() implementation, I forgot that socket lock
    had to be taken before current->mm->mmap_sem. syzbot eventually caught
    the bug.
    
    Since we cannot lock the socket in the tcp mmap() handler, we have to
    split the operation into two phases.
    
    1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
      This operation does not involve any TCP locking.
    
    2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
      the transfer of pages from skbs to one VMA.
      This operation only uses down_read(&current->mm->mmap_sem) after
      holding TCP lock, thus solving the lockdep issue.
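
    The two phases above can be sketched from user space as follows. This is
    a minimal illustration, not the tcp_mmap tool itself: the helper names
    (zc_reserve, zc_map, zc_skip) and struct zc_receive are hypothetical, the
    struct is a local mirror assumed to match the kernel's
    struct tcp_zerocopy_receive (address, length, recv_skip_hint), and the
    fallback value 35 for TCP_ZEROCOPY_RECEIVE is an assumption for libc
    headers that do not define it yet.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35 /* assumed value, for headers lacking it */
#endif

/* Local mirror of the kernel's struct tcp_zerocopy_receive (assumed layout). */
struct zc_receive {
    uint64_t address;        /* in: address of the reserved mapping */
    uint32_t length;         /* in: bytes to map, out: bytes mapped */
    uint32_t recv_skip_hint; /* out: bytes to consume with a normal read */
};

/* Phase 1: mmap() on the socket only reserves VMA space. Returns NULL on error. */
static void *zc_reserve(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Phase 2: ask the kernel to move pages from skbs into the reserved VMA.
 * Returns the getsockopt() result; on success zc->length and
 * zc->recv_skip_hint describe what was mapped and what must be read. */
static int zc_map(int fd, void *addr, uint32_t len, struct zc_receive *zc)
{
    socklen_t zc_len = sizeof(*zc);

    assert(zc != NULL);
    zc->address = (uint64_t)(uintptr_t)addr;
    zc->length = len;
    zc->recv_skip_hint = 0;
    return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, zc, &zc_len);
}

/* Error recovery: consume up to skip_hint bytes that could not be mapped,
 * using a regular copying read. Returns bytes read, 0 if nothing to skip. */
static ssize_t zc_skip(int fd, void *buf, size_t buflen, uint32_t skip_hint)
{
    size_t want = skip_hint < buflen ? skip_hint : buflen;

    if (want == 0)
        return 0;
    return read(fd, buf, want);
}
```

    A receiver would call zc_reserve() once, then loop on zc_map(): process
    zc.length bytes at the mapped address, and pass zc.recv_skip_hint to
    zc_skip() before the next iteration.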
    
    This new implementation was suggested by Andy Lutomirski in great detail.
    
    Benefits are:
    
    - Better scalability, in case multiple threads reuse VMAs
       (without mmap()/munmap() calls) since mmap_sem won't be write locked.
    
    - Better error recovery.
       The previous mmap() model had to provide the expected size of the
       mapping. If for some reason one part could not be mapped (partial MSS),
       the whole operation had to be aborted.
       With the tcp_zerocopy_receive struct, the kernel can report how
       many bytes were successfully mapped, and how many bytes should
       be read to skip the problematic sequence.
    
    - No more memory allocation to hold an array of page pointers.
      16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
    
    - skbs are freed after mmap_sem has been released.
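
    The 32 KB figure for the page-pointer array can be reproduced with a
    short calculation. The sketch below assumes 4 KB pages and 8-byte
    pointers (typical for x86-64); page_array_bytes is a hypothetical helper
    used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for an array holding one pointer per page of a mapping.
 * page_size and ptr_size are parameters because both are platform
 * assumptions (4096 and 8 in the example from the text). */
static size_t page_array_bytes(size_t mapping_bytes, size_t page_size,
                               size_t ptr_size)
{
    size_t pages;

    assert(page_size != 0);
    pages = (mapping_bytes + page_size - 1) / page_size; /* round up */
    return pages * ptr_size;
}
```

    With a 16 MB mapping, 16 MB / 4 KB = 4096 pages, and 4096 * 8 bytes =
    32768 bytes, i.e. 32 KB.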
    
    The following patch changes the tcp_mmap tool to demonstrate
    one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
    
    Note that memcg might require additional changes.
    
    Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Suggested-by: Andy Lutomirski <luto@kernel.org>
    Cc: linux-mm@kvack.org
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>