1. 24 2月, 2021 1 次提交
  2. 03 2月, 2021 2 次提交
  3. 28 1月, 2021 1 次提交
  4. 21 1月, 2021 1 次提交
  5. 03 12月, 2020 1 次提交
  6. 25 8月, 2020 1 次提交
  7. 20 7月, 2020 1 次提交
  8. 08 7月, 2020 1 次提交
  9. 24 6月, 2020 2 次提交
    • E
      udp: move gro declarations to net/udp.h · 6db69328
      Eric Dumazet 提交于
      This removes following warnings :
        CC      net/ipv4/udp_offload.o
      net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
        504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
        584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      
        CHECK   net/ipv6/udp_offload.c
      net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
      net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/udp_offload.o
      net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
        115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
        148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6db69328
    • E
      net: move tcp gro declarations to net/tcp.h · 5521d95e
      Eric Dumazet 提交于
      This patch removes following (C=1 W=1) warnings for CONFIG_RETPOLINE=y :
      
      net/ipv4/tcp_offload.c:306:16: warning: symbol 'tcp4_gro_receive' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:306:17: warning: no previous prototype for 'tcp4_gro_receive' [-Wmissing-prototypes]
      net/ipv4/tcp_offload.c:319:29: warning: symbol 'tcp4_gro_complete' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:319:29: warning: no previous prototype for 'tcp4_gro_complete' [-Wmissing-prototypes]
        CHECK   net/ipv6/tcpv6_offload.c
      net/ipv6/tcpv6_offload.c:16:16: warning: symbol 'tcp6_gro_receive' was not declared. Should it be static?
      net/ipv6/tcpv6_offload.c:29:29: warning: symbol 'tcp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/tcpv6_offload.o
      net/ipv6/tcpv6_offload.c:16:17: warning: no previous prototype for 'tcp6_gro_receive' [-Wmissing-prototypes]
         16 | struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/tcpv6_offload.c:29:29: warning: no previous prototype for 'tcp6_gro_complete' [-Wmissing-prototypes]
         29 | INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5521d95e
  10. 20 5月, 2020 1 次提交
    • D
      bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Daniel Borkmann 提交于
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket, and therefore allows to avoid packet-based NAT
      given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
      
      This also allows to expose NodePort services on loopback addresses in the
      host namespace, for example. As another advantage, this also efficiently
      blocks bind requests for applications in the host namespace for exposed
      ports. However, one missing item is that we also need to perform reverse
      xlation for inet{,6}_getname() hooks such that we can return the service
      IP/port tuple back to the application instead of the remote peer address.
      
      The vast majority of applications does not bother about getpeername(), but
      in a few occasions we've seen breakage when validating the peer's address
      since it returns unexpectedly the backend tuple instead of the service one.
      Therefore, this trivial patch allows to customise and adds a getpeername()
      as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
      to address this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        >  Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      peer to the context similar as in inet{,6}_getname() fashion, but API-wise
      this is suboptimal as it always enforces programs having to test for ctx->peer
      which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
      Similarly, the checked return code is on tnum_range(1, 1), but if a use case
      comes up in future, it can easily be changed to return an error code instead.
      Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
      1b66d253
  11. 19 5月, 2020 1 次提交
  12. 09 5月, 2020 2 次提交
  13. 29 4月, 2020 1 次提交
    • R
      net: ipv4: add sysctl for nexthop api compatibility mode · 4f80116d
      Roopa Prabhu 提交于
      Current route nexthop API maintains user space compatibility
      with old route API by default. Dumps and netlink notifications
      support both new and old API format. In systems which have
      moved to the new API, this compatibility mode cancels some
      of the performance benefits provided by the new nexthop API.
      
      This patch adds new sysctl nexthop_compat_mode which is on
      by default but provides the ability to turn off compatibility
      mode allowing systems to run entirely with the new routing
      API. Old route API behaviour and support is not modified by this
      sysctl.
      
      Uses a single sysctl to cover both ipv4 and ipv6 following
      other sysctls. Covers dumps and delete notifications as
      suggested by David Ahern.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f80116d
  14. 21 4月, 2020 1 次提交
  15. 30 3月, 2020 1 次提交
  16. 13 3月, 2020 1 次提交
  17. 27 11月, 2019 1 次提交
    • M
      net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port) · 82f31ebf
      Maciej Żenczykowski 提交于
      Note that the sysctl write accessor functions guarantee that:
        net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
      invariant is maintained, and as such the max() in selinux hooks is actually spurious.
      
      ie. even though
        if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
      per logic is the same as
        if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) {
      it is actually functionally equivalent to:
        if (snum < low || snum > high) {
      which is equivalent to:
        if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
      even though the first clause is spurious.
      
      But we want to hold on to it in case we ever want to change what what
      inet_port_requires_bind_service() means (for example by changing
      it from a, by default, [0..1024) range to some sort of set).
      
      Test: builds, git 'grep inet_prot_sock' finds no other references
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82f31ebf
  18. 07 11月, 2019 1 次提交
  19. 20 8月, 2019 1 次提交
  20. 04 7月, 2019 2 次提交
  21. 31 5月, 2019 2 次提交
    • J
      net: don't clear sock->sk early to avoid trouble in strparser · 2b81f816
      Jakub Kicinski 提交于
      af_inet sets sock->sk to NULL which trips strparser over:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
      RIP: 0010:tcp_peek_len+0x10/0x60
      RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
      RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
      RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
      R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
      R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
      FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
       <IRQ>
       strp_data_ready+0x48/0x90
       tls_data_ready+0x22/0xd0 [tls]
       tcp_rcv_established+0x569/0x620
       tcp_v4_do_rcv+0x127/0x1e0
       tcp_v4_rcv+0xad7/0xbf0
       ip_protocol_deliver_rcu+0x2c/0x1c0
       ip_local_deliver_finish+0x41/0x50
       ip_local_deliver+0x6b/0xe0
       ? ip_protocol_deliver_rcu+0x1c0/0x1c0
       ip_rcv+0x52/0xd0
       ? ip_rcv_finish_core.isra.20+0x380/0x380
       __netif_receive_skb_one_core+0x7e/0x90
       netif_receive_skb_internal+0x42/0xf0
       napi_gro_receive+0xed/0x150
       nfp_net_poll+0x7a2/0xd30 [nfp]
       ? kmem_cache_free_bulk+0x286/0x310
       net_rx_action+0x149/0x3b0
       __do_softirq+0xe3/0x30a
       ? handle_irq_event_percpu+0x6a/0x80
       irq_exit+0xe8/0xf0
       do_IRQ+0x85/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0010:cpuidle_enter_state+0xbc/0x450
      
      To avoid this issue set sock->sk after sk_prot->close.
      My grepping and testing did not discover any code which
      would depend on the current behaviour.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Reported-by: NDavid Beckett <david.beckett@netronome.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b81f816
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 · 2874c5fd
      Thomas Gleixner 提交于
      Based on 1 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 3029 file(s).
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NAllison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2874c5fd
  22. 20 4月, 2019 1 次提交
  23. 02 4月, 2019 1 次提交
  24. 24 3月, 2019 1 次提交
    • E
      tcp: add one skb cache for rx · 8b27dae5
      Eric Dumazet 提交于
      Often times, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.
      
      This means the incoming skb had to be allocated on a cpu,
      but freed on another.
      
      This incurs a high spinlock contention in slab layer for small rpc,
      but also a high number of cache line ping pongs for larger packets.
      
      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() can be involved.
      
      More over performing the __kfree_skb() in the recvmsg() context
      adds a latency for user applications, and increase probability
      of trapping them in backlog processing, since the BH handler
      might found the socket owned by the user.
      
      This patch, combined with the prior one increases the rpc
      performance by about 10 % on servers with large number of cores.
      
      (tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
       instead of 8 Mpps)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem :
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
       - CPU in recvmsg() system call, essentially 100 % busy copying out
        data to user space.
      
      Having at most one skb in a per-socket cache has very little risk
      of memory exhaustion, and since it is protected by socket lock,
      its management is essentially free.
      
      Note that if rps/rfs is used, we do not enable this feature, because
      there is high chance that the same cpu is handling both the recvmsg()
      system call and the TCP rx path, but that another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu skb was allocated, and use a different channel
      to give skbs back to this cpu.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b27dae5
  25. 21 2月, 2019 1 次提交
  26. 16 12月, 2018 1 次提交
  27. 08 11月, 2018 2 次提交
  28. 14 9月, 2018 2 次提交
  29. 02 8月, 2018 2 次提交
  30. 15 7月, 2018 1 次提交
    • A
      bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB · f333ee0c
      Andrey Ignatov 提交于
      Add new TCP-BPF callback that is called on listen(2) right after socket
      transition to TCP_LISTEN state.
      
      It fills the gap for listening sockets in TCP-BPF. For example BPF
      program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening
      and track later transition from TCP_LISTEN to TCP_CLOSE with
      BPF_SOCK_OPS_STATE_CB callback.
      
      Before there was no way to do it with TCP-BPF and other options were
      much harder to work with. E.g. socket state tracking can be done with
      tracepoints (either raw or regular) but they can't be attached to cgroup
      and their lifetime has to be managed separately.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f333ee0c
  31. 04 7月, 2018 1 次提交
    • E
      net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree 提交于
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17266ee9
  32. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43