1. 07 Nov 2019, 1 commit
2. 14 Oct 2019, 4 commits
• tcp: annotate tp->write_seq lockless reads · 0f317464
  Eric Dumazet committed
There are a few places where we fetch tp->write_seq while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use the corresponding WRITE_ONCE() to avoid
store-tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
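The same pattern applies to this and the next two entries (copied_seq, rcv_nxt). A minimal, self-contained sketch of the annotation pattern, with the two macros approximated the way the kernel defines them; the struct and helper names here are illustrative, not kernel code:

  #include <stdio.h>
  #include <stdint.h>

  /* Userspace approximations of the kernel macros: the volatile
   * access forces a single, untorn load or store of the field.
   */
  #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
  #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

  struct demo_tp {
          uint32_t write_seq;
  };

  /* Writer side: normally runs under the socket lock, so the plain
   * read of tp->write_seq inside the expression is fine.
   */
  static void advance_write_seq(struct demo_tp *tp, uint32_t len)
  {
          WRITE_ONCE(tp->write_seq, tp->write_seq + len);
  }

  /* Lockless reader: e.g. a poll() fast path on another cpu. */
  static uint32_t snapshot_write_seq(const struct demo_tp *tp)
  {
          return READ_ONCE(tp->write_seq);
  }

  int main(void)
  {
          struct demo_tp tp = { .write_seq = 100 };

          advance_write_seq(&tp, 1460);
          printf("write_seq=%u\n", snapshot_write_seq(&tp));
          return 0;
  }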
• tcp: annotate tp->copied_seq lockless reads · 7db48e98
  Eric Dumazet committed
There are a few places where we fetch tp->copied_seq while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use the corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
  Eric Dumazet committed
There are a few places where we fetch tp->rcv_nxt while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use the corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt).
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
  Eric Dumazet committed
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock:
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
The problem is that without an appropriate barrier, the compiler
might reload tp->fastopen_rsk and trigger a NULL dereference.

Request sockets are protected by RCU, so we can simply add
the missing annotations and barriers to solve the issue.

Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
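A hedged sketch of the read side after this change (simplified; demo_snd_una() is an illustrative stand-in for the tcp_v4_err()/tcp_v6_err() code paths):

  /* The write sides store the pointer with rcu_assign_pointer(), so
   * a lockless reader can use rcu_dereference(): the compiler is
   * then not allowed to reload tp->fastopen_rsk between the NULL
   * check and the dereference.
   */
  static u32 demo_snd_una(struct tcp_sock *tp)
  {
          struct request_sock *fastopen;

          fastopen = rcu_dereference(tp->fastopen_rsk);
          return fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
  }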
3. 27 Sep 2019, 3 commits
4. 31 Jul 2019, 1 commit
5. 12 Jul 2019, 1 commit
6. 09 Jul 2019, 1 commit
• ipv6: elide flowlabel check if no exclusive leases exist · 59c820b2
  Willem de Bruijn committed
      Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO.
      If not set, by default an autogenerated flowlabel is selected.
      
      Explicit flowlabels require a control operation per label plus a
      datapath check on every connection (every datagram if unconnected).
      This is particularly expensive on unconnected sockets multiplexing
      many flows, such as QUIC.
      
      In the common case, where no lease is exclusive, the check can be
      safely elided, as both lease request and check trivially succeed.
      Indeed, autoflowlabel does the same even with exclusive leases.
      
      Elide the check if no process has requested an exclusive lease.
      
fl6_sock_lookup previously returned either a reference to a lease or
NULL to denote failure. Modify it to return a real error and update
all callers. On a NULL return, they can use the label and will elide
the atomic_dec in fl6_sock_release.
      
      This is an optimization. Robust applications still have to revert to
      requesting leases if the fast path fails due to an exclusive lease.
      
Changes RFC->v1:
  - use static_key_false_deferred to rate limit jump label operations
  - call static_key_deferred_flush to stop timers on exit
  - move decrement out of RCU context
  - defer optimization also if opt data is associated with a lease
  - updated all fl6_sock_lookup callers, not just udp
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
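A hedged sketch of the elision machinery the changelog describes (a deferred static key; treat names and placement as illustrative rather than the exact patch):

  #include <linux/jump_label_ratelimit.h>

  /* Stays false until some process takes an exclusive lease; the HZ
   * rate limit bounds how often the jump label can be flipped.
   */
  static DEFINE_STATIC_KEY_DEFERRED_FALSE(ipv6_flowlabel_exclusive, HZ);

  static bool fl6_check_needed(void)
  {
          /* Common case: a patched-out branch, no lookup, no atomics. */
          return static_branch_unlikely(&ipv6_flowlabel_exclusive.key);
  }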
7. 02 Jul 2019, 1 commit
8. 15 Jun 2019, 1 commit
9. 13 Jun 2019, 1 commit
• tcp: add optional per socket transmit delay · a842fe14
  Eric Dumazet committed
      Adding delays to TCP flows is crucial for studying behavior
      of TCP stacks, including congestion control modules.
      
Linux offers the netem module, but it has impractical constraints:
- Need root access to change the qdisc
- Hard to set up on egress if combined with a non-trivial qdisc like FQ
- Single delay for all flows.
      
      EDT (Earliest Departure Time) adoption in TCP stack allows us
      to enable a per socket delay at a very small cost.
      
      Networking tools can now establish thousands of flows, each of them
      with a different delay, simulating real world conditions.
      
This requires the FQ packet scheduler or an EDT-enabled NIC.

This patch adds the TCP_TX_DELAY socket option, to set a delay in
      usec units.
      
        unsigned int tx_delay = 10000; /* 10 msec */
      
        setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
      
Note that FQ packet scheduler limits might need some tweaking:
      
      man tc-fq
      
PARAMETERS
   limit
       Hard limit on the real queue size. When this limit is
       reached, new packets are dropped. If the value is lowered,
       packets are dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum number of packets queued per
       flow. Default value is 100.
      
Use of the TCP_TX_DELAY option will increase the number of skbs in the FQ
qdisc, so packets would be dropped if either of the above limits is hit.

Use of a jump label makes this support cost nothing at runtime for hosts
never using the option.
      
Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch: we need to account for the fact that artificially
delayed skbs won't stop us from providing more skbs to feed the pipe
(netem uses skb_orphan_partial() for this purpose, but FQ cannot use
this trick).

Because of that, using big delays might very well trigger
old bugs in the TSO auto-defer logic and/or sndbuf-limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
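A slightly fuller, self-contained version of the snippet above (the fallback define mirrors the uapi value; the error handling and demo flow are illustrative):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef TCP_TX_DELAY
  #define TCP_TX_DELAY 37   /* from include/uapi/linux/tcp.h */
  #endif

  /* Apply an artificial transmit delay, in usec, to one socket.
   * Needs a v5.3+ kernel, with FQ (or an EDT-capable NIC) on egress.
   */
  static int set_tx_delay(int fd, unsigned int usec)
  {
          if (setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &usec, sizeof(usec)) < 0) {
                  perror("setsockopt(TCP_TX_DELAY)");
                  return -1;
          }
          return 0;
  }

  int main(void)
  {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          unsigned int tx_delay = 10000;   /* 10 msec */

          if (fd < 0 || set_tx_delay(fd, tx_delay))
                  return 1;
          printf("TCP_TX_DELAY set to %u usec\n", tx_delay);
          return 0;
  }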
10. 10 Jun 2019, 2 commits
11. 06 Jun 2019, 2 commits
12. 31 May 2019, 1 commit
13. 06 May 2019, 2 commits
14. 02 Apr 2019, 1 commit
• tcp: fix tcp_inet6_sk() for 32bit kernels · f5d54767
  Eric Dumazet committed
It turns out that struct ipv6_pinfo is not located where we think.

inet6_sk_generic() and tcp_inet6_sk() disagree by 4 bytes on 32bit kernels,
because struct tcp_sock has 8-byte alignment,
but the size of ipv6_pinfo is not a multiple of 8.

sizeof(struct ipv6_pinfo): 116 (not padded to 8)
      
      I actually first coded tcp_inet6_sk() as this patch does, but thought
      that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner.
      
As Julian told me: nobody should use tcp6_sock.inet6
directly; it should be accessed via tcp_inet6_sk() or inet6_sk().
      
      This happened when we added the first u64 field in struct tcp_sock.
      
      Fixes: 93a77c11 ("tcp: add tcp_inet6_sk() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
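A self-contained demonstration of the 4-byte disagreement, using stand-in structs with the sizes from the commit message (the real inet6_sk_generic() effectively back-computes the offset from the object size):

  #include <stdio.h>
  #include <stddef.h>

  struct pinfo_like { char data[116]; };   /* sizeof not a multiple of 8 */

  struct tcp6_like {
          struct { long long seq; } tcp __attribute__((aligned(8)));
          struct pinfo_like inet6;
  };

  int main(void)
  {
          size_t real = offsetof(struct tcp6_like, inet6);
          size_t computed = sizeof(struct tcp6_like) - sizeof(struct pinfo_like);

          /* Tail padding rounds sizeof() up to a multiple of 8, so
           * "object size minus sizeof(ipv6_pinfo)" lands 4 bytes past
           * the member's real offset: this prints 8 vs 12.
           */
          printf("real offset=%zu, back-computed offset=%zu\n", real, computed);
          return 0;
  }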
15. 24 Mar 2019, 1 commit
• tcp: add one skb cache for rx · 8b27dae5
  Eric Dumazet committed
Often, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on one cpu,
but freed on another.

This incurs high spinlock contention in the slab layer for small RPCs,
and also a high number of cache line ping-pongs for larger packets.

A full-size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() calls can be involved.

Moreover, performing the __kfree_skb() in the recvmsg() context
adds latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.

This patch, combined with the prior one, increases RPC
performance by about 10% on servers with a large number of cores.

(a tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
 instead of 8 Mpps)
      
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem:
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
 - CPU in the recvmsg() system call, essentially 100% busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by the socket lock,
its management is essentially free.
      
Note that if rps/rfs is used, we do not enable this feature, because
there is a high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
      
To properly handle this case, it seems we would need to record
on which cpu the skb was allocated, and use a different channel
to give skbs back to that cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
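A hedged sketch of the one-entry cache idea (field and helper names are illustrative; the real patch wires this into the socket receive and recvmsg paths):

  /* Consumer side, socket lock held: park the consumed skb instead
   * of freeing its memory on the "wrong" cpu.
   */
  static void demo_eat_skb(struct sock *sk, struct sk_buff *skb)
  {
          if (!sk->sk_rx_skb_cache) {
                  sk->sk_rx_skb_cache = skb;
                  return;
          }
          __kfree_skb(skb);
  }

  /* Rx side: take the parked skb so its memory is freed, and likely
   * reused, by the cpu that handles incoming packets.
   */
  static struct sk_buff *demo_take_cached(struct sock *sk)
  {
          struct sk_buff *skb = sk->sk_rx_skb_cache;

          sk->sk_rx_skb_cache = NULL;
          return skb;   /* may be NULL; the caller then allocates */
  }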
16. 20 Mar 2019, 2 commits
17. 26 Feb 2019, 1 commit
18. 28 Jan 2019, 1 commit
19. 16 Dec 2018, 1 commit
20. 09 Nov 2018, 1 commit
• net: Convert protocol error handlers from void to int · 32bbd879
  Stefano Brivio committed
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      patch.
      
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      
      This has no effect on existing error handling paths.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
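A hedged sketch of the new handler shape (demo_lookup() is a hypothetical stand-in for the real socket or tunnel lookup):

  /* Handlers now return 0 when they consumed the ICMP error, or a
   * negative errno so the caller can try the next candidate protocol.
   */
  static int demo_err_handler(struct sk_buff *skb, u32 info)
  {
          struct sock *sk = demo_lookup(skb);   /* hypothetical */

          if (!sk)
                  return -ENOENT;   /* valid packet, but not ours */

          /* ... report the error to the matching socket ... */
          return 0;
  }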
21. 22 Jul 2018, 1 commit
• net/ipv6: Fix linklocal to global address with VRF · 24b711ed
  David Ahern committed
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
      
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
      
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      
(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and the ifindex set to the index of eth1, with an LLA as the
destination.)
      
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master. If it is then return the ifindex from rt6i_idev similar
      to what is done for loopback.
      
      For TCP, restore the original tcp_v6_iif definition which is needed in
      most places and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability. This latter check is only needed for socket
      lookups.
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
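A sketch of the two helpers, reconstructed from the description above (close in shape to what the patch adds, but treat it as illustrative):

  /* General case: trust the ifindex that the IPv6 input path
   * recorded in the skb control block.
   */
  static inline int tcp_v6_iif(const struct sk_buff *skb)
  {
          return TCP_SKB_CB(skb)->header.h6.iif;
  }

  /* Socket lookups only: if the skb arrived on an L3 slave (a VRF
   * member device), use the slave's own ifindex instead.
   */
  static inline int tcp_v6_iif_l3_slave(const struct sk_buff *skb)
  {
          bool l3_slave = ipv6_l3mdev_skb(TCP_SKB_CB(skb)->header.h6.flags);

          return l3_slave ? skb->skb_iif : TCP_SKB_CB(skb)->header.h6.iif;
  }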
22. 15 Jun 2018, 1 commit
23. 01 Jun 2018, 1 commit
24. 16 May 2018, 2 commits
25. 11 May 2018, 1 commit
• tcp: Add mark for TIMEWAIT sockets · 00483690
  Jon Maxwell committed
This version incorporates some suggestions from Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
  races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
  statement.
- Factorize code, as the sk_fullsock() check is not necessary.
      
      Aidan McGurn from Openwave Mobility systems reported the following bug:
      
      "Marked routing is broken on customer deployment. Its effects are large
      increase in Uplink retransmissions caused by the client never receiving
      the final ACK to their FINACK - this ACK misses the mark and routes out
      of the incorrect route."
      
Currently, marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled, but not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).
      
Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then propagate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
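A hedged sketch of the two pieces described above (the reply-side variable name is illustrative):

  /* 1) Entering TIME_WAIT, in __inet_twsk_hashdance(): preserve the
   *    mark the application set with setsockopt(SO_MARK).
   */
  tw->tw_mark = sk->sk_mark;

  /* 2) Replying from TIME_WAIT: fwmark_reflect takes precedence over
   *    the stored mark, so netfilter reply rules still win.
   */
  reply_mark = IP4_REPLY_MARK(net, skb->mark) ?: tw->tw_mark;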
26. 31 Mar 2018, 1 commit
• bpf: Hooks for sys_connect · d74bad4e
  Andrey Ignatov committed
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
The patch provides a much more reliable in-kernel solution for the 2nd
part of the problem: making an outgoing connection from a desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
The local end of the connection can be bound to a desired IP using the
newly introduced BPF helper `bpf_bind()`. It allows binding only to an IP,
though, and doesn't support binding to a port, i.e. it leverages the
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use case for binding to a port.
      
As for the remote end (the `struct sockaddr *` passed by user), both parts
of it can be overridden: remote IP and remote port. It's useful if an
application inside a cgroup wants to connect to another application inside
the same cgroup or to itself, but knows nothing about the IP assigned to
the cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
IPv4 and IPv6 have separate attach types for the same reason as the
sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
fields when the user passes a sockaddr_in, since that would be out-of-bounds.
      
      == Implementation notes ==
      
The patch introduces a new field in `struct proto`: `pre_connect`, a
pointer to a function with the same signature as `connect` that is called
before it. The reason is that in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically,
`inet_dgram_connect` autobinds the socket before calling
`sk->sk_prot->connect`, and there is no way to call `bpf_bind()` from
hooks in e.g. `ip4_datagram_connect` or `ip6_datagram_connect`, since
it'd cause a double bind. On the other hand, `proto.pre_connect` provides
a flexible way to add BPF hooks for connect only for the protocols that
need them, and to call them at the desired time before `connect`. Since
`bpf_bind()` is allowed to bind only to an IP, and the autobind in
`inet_dgram_connect` binds only the port, there is no chance of a double
bind.

bpf_bind() sets `force_bind_address_no_port` to bind only to an IP,
regardless of the value of the `bind_address_no_port` socket field.
      
bpf_bind() sets `with_lock` to `false` when calling __inet_bind()
and __inet6_bind(), since all call sites where bpf_bind() is called
already hold the socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
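A hedged sketch of a program using these attach types (libbpf section naming; the addresses, the port, and the AF_INET fallback define are illustrative):

  #include <linux/bpf.h>
  #include <linux/in.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  #ifndef AF_INET
  #define AF_INET 2
  #endif

  SEC("cgroup/connect4")
  int rewrite_connect4(struct bpf_sock_addr *ctx)
  {
          /* Bind the local end to an IP (no port) via the new helper;
           * this relies on IP_BIND_ADDRESS_NO_PORT semantics.
           */
          struct sockaddr_in sa = {
                  .sin_family = AF_INET,
                  .sin_addr.s_addr = bpf_htonl(0x7f000001),   /* 127.0.0.1 */
          };

          if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
                  return 0;   /* reject the connect() */

          /* Override the destination the application asked for. */
          ctx->user_ip4 = bpf_htonl(0x7f000001);
          ctx->user_port = bpf_htons(4040);
          return 1;   /* allow */
  }

  char _license[] SEC("license") = "GPL";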
27. 28 Mar 2018, 1 commit
28. 20 Feb 2018, 1 commit
• net: Convert tcpv6_net_ops · fef65a2c
  Kirill Tkhai committed
These pernet_operations create and destroy the net::ipv6.tcp_sk
socket, which is used in tcp_v6_send_response() only. It looks
like foreign pernet_operations don't want to set up an ipv6 connection
inside a destroyed net, so this socket may be created and destroyed
in parallel with anything else. inet_twsk_purge() is also safe
for that, as described in the patch for tcp_sk_ops. So, it's possible
to mark them as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
29. 15 Feb 2018, 1 commit
• tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
  Eric Dumazet committed
배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUs.

This involves RPS, and fragmentation.

배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISHED state. Other states are protected by the socket lock.
      
      This is caused by a CPU losing the race, and simply not caring enough.
      
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another cpu.
      
Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we cannot be more robust.

Many thanks to 배석진 for a very detailed analysis.
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
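A hedged sketch of the retry, modeled on the receive-path shape the commit touches (control flow simplified; req_stolen mirrors the flag the patch threads through tcp_check_req()):

  /* In tcp_v4_rcv()/tcp_v6_rcv(), request-socket path: */
  nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
  if (!nsk) {
          reqsk_put(req);
          if (req_stolen) {
                  /* Another cpu owned the request socket and promoted
                   * it to ESTABLISHED; redo the lookup instead of
                   * dropping. The spin_lock()/spin_unlock() pair in
                   * inet_ehash_insert() and reqsk_put() provide the
                   * needed barriers, so the new socket will be found
                   * unless it has already been accepted and closed.
                   */
                  goto lookup;
          }
          goto discard_and_relse;
  }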
30. 08 Feb 2018, 1 commit