1. 23 Sep 2021, 1 commit
  2. 24 Aug 2021, 1 commit
    • bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum · 6fc88c35
      Authored by Dave Marchevsky
      Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
      attach types and a function to map bpf_attach_type values to the new
      enum. Inspired by netns_bpf_attach_type.
      
      Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
      possible.  Functionality is unchanged as attach_type_to_prog_type
      switches in bpf/syscall.c were preventing non-cgroup programs from
      making use of the invalid cgroup_bpf array slots.
      
      As a result, struct cgroup_bpf uses 504 fewer bytes relative to when its
      arrays were sized using MAX_BPF_ATTACH_TYPE.
      
      bpf_cgroup_storage is notably not migrated as struct
      bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
      member which is not meant to be opaque. Similarly, bpf_cgroup_link
      continues to report its bpf_attach_type member to userspace via fdinfo
      and bpf_link_info.
      
      To ease disambiguation, bpf_attach_type variables are renamed from
      'type' to 'atype' when changed to cgroup_bpf_attach_type.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
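
      A minimal sketch of the pattern this describes: a dense enum plus a mapping
      switch. The names and the set of values below are abridged and illustrative,
      not the exact definitions from the kernel headers.

        /* Dense enum covering only the cgroup attach points, so the
         * per-cgroup program arrays can be sized by the (much smaller)
         * MAX_CGROUP_BPF_ATTACH_TYPE instead of MAX_BPF_ATTACH_TYPE.
         */
        enum cgroup_bpf_attach_type {
                CGROUP_BPF_ATTACH_TYPE_INVALID = -1,
                CGROUP_INET_INGRESS = 0,
                CGROUP_INET_EGRESS,
                CGROUP_INET_SOCK_CREATE,
                /* ... one value per valid cgroup attach point ... */
                MAX_CGROUP_BPF_ATTACH_TYPE
        };

        /* Map a uapi bpf_attach_type to the internal enum; non-cgroup
         * types map to the invalid marker so callers can reject them.
         */
        static inline enum cgroup_bpf_attach_type
        to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
        {
                switch (attach_type) {
                case BPF_CGROUP_INET_INGRESS:
                        return CGROUP_INET_INGRESS;
                case BPF_CGROUP_INET_EGRESS:
                        return CGROUP_INET_EGRESS;
                case BPF_CGROUP_INET_SOCK_CREATE:
                        return CGROUP_INET_SOCK_CREATE;
                /* ... */
                default:
                        return CGROUP_BPF_ATTACH_TYPE_INVALID;
                }
        }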
  3. 23 Jul 2021, 1 commit
    • net: socket: rework compat_ifreq_ioctl() · 29c49648
      Authored by Arnd Bergmann
      compat_ifreq_ioctl() is one of the last users of copy_in_user() and
      compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
      arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
      couple of socket family specific interpretations.
      
      The current implementation works correctly when calling dev_ioctl(),
      inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
      and packet_ioctl(). The ioctl handlers for x25, netrom, rose and ax25 do
      not interpret the arguments and only block the corresponding commands,
      so they do not care.
      
      For af_inet6 and af_decnet however, the compat conversion is slightly
      incorrect: it copies more data than the native handler accesses, as
      both of them use a structure that is shorter than ifreq.
      
      Replace the copy_in_user() conversion with a pair of accessor functions
      to read and write the ifreq data in place with the correct length where
      needed, while leaving the other ones to copy the (already compatible)
      structures directly.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
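
      A rough sketch of the accessor idea (the helper name and handling below are
      illustrative, not the kernel's actual API): in a compat syscall only the
      shorter 32-bit layout is copied, so each ioctl sees an ifreq of the length
      its caller actually provided, with no compat_alloc_user_space() bounce buffer.

        /* Illustrative only: read an ifreq from userspace in place,
         * honouring the 32-bit layout when called from a compat task.
         * Pointer members (e.g. ifr_data) still need explicit widening
         * in the ioctl-specific code.
         */
        static int get_ifreq_in_place(struct ifreq *ifr, void __user *arg)
        {
                memset(ifr, 0, sizeof(*ifr));

                if (in_compat_syscall())
                        return copy_from_user(ifr, arg,
                                              sizeof(struct compat_ifreq)) ?
                               -EFAULT : 0;

                return copy_from_user(ifr, arg, sizeof(*ifr)) ? -EFAULT : 0;
        }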
  4. 10 Jun 2021, 1 commit
    • inet: annotate data race in inet_send_prepare() and inet_dgram_connect() · dcd01eea
      Authored by Eric Dumazet
      Both functions are known to be racy when reading inet_num, as we do not
      want to grab locks for the common case where the socket has already been
      bound. The race is resolved in inet_autobind() by reading inet_num again
      under the socket lock.
      
      syzbot reported:
      BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port
      
      write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
       udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
       udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
       inet_autobind net/ipv4/af_inet.c:183 [inline]
       inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
       inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x9db4
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
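
      The usual way to annotate such an intentionally lockless read is
      READ_ONCE()/WRITE_ONCE(); a simplified sketch of the pattern in
      inet_send_prepare(), not the exact hunk of this commit:

        /* inet_num is written under the socket lock (bind/autobind) but
         * read here without it.  READ_ONCE() documents the data race for
         * KCSAN and keeps the compiler from tearing or re-reading it;
         * inet_autobind() re-checks inet_num under lock_sock().
         */
        if (!READ_ONCE(inet_sk(sk)->inet_num) && inet_autobind(sk))
                return -EAGAIN;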
  5. 02 Jun 2021, 1 commit
  6. 18 May 2021, 1 commit
  7. 02 Apr 2021, 1 commit
  8. 24 Feb 2021, 1 commit
  9. 03 Feb 2021, 2 commits
  10. 28 Jan 2021, 1 commit
  11. 21 Jan 2021, 1 commit
  12. 03 Dec 2020, 1 commit
  13. 25 Aug 2020, 1 commit
  14. 20 Jul 2020, 1 commit
  15. 08 Jul 2020, 1 commit
  16. 24 Jun 2020, 2 commits
    • udp: move gro declarations to net/udp.h · 6db69328
      Authored by Eric Dumazet
      This removes the following warnings:
        CC      net/ipv4/udp_offload.o
      net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
        504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
        584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      
        CHECK   net/ipv6/udp_offload.c
      net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
      net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/udp_offload.o
      net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
        115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
        148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
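
      The fix is simply to declare these entry points once in a shared header so
      that both the definitions and their callers see the same prototypes; roughly
      (signatures taken from the warnings quoted above, placement per the subject):

        /* net/udp.h: shared prototypes for the GRO entry points, which
         * silences -Wmissing-prototypes at the definition sites.
         */
        struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb);
        int udp4_gro_complete(struct sk_buff *skb, int nhoff);
        struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb);
        int udp6_gro_complete(struct sk_buff *skb, int nhoff);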
    • net: move tcp gro declarations to net/tcp.h · 5521d95e
      Authored by Eric Dumazet
      This patch removes the following (C=1 W=1) warnings for CONFIG_RETPOLINE=y:
      
      net/ipv4/tcp_offload.c:306:16: warning: symbol 'tcp4_gro_receive' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:306:17: warning: no previous prototype for 'tcp4_gro_receive' [-Wmissing-prototypes]
      net/ipv4/tcp_offload.c:319:29: warning: symbol 'tcp4_gro_complete' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:319:29: warning: no previous prototype for 'tcp4_gro_complete' [-Wmissing-prototypes]
        CHECK   net/ipv6/tcpv6_offload.c
      net/ipv6/tcpv6_offload.c:16:16: warning: symbol 'tcp6_gro_receive' was not declared. Should it be static?
      net/ipv6/tcpv6_offload.c:29:29: warning: symbol 'tcp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/tcpv6_offload.o
      net/ipv6/tcpv6_offload.c:16:17: warning: no previous prototype for 'tcp6_gro_receive' [-Wmissing-prototypes]
         16 | struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/tcpv6_offload.c:29:29: warning: no previous prototype for 'tcp6_gro_complete' [-Wmissing-prototypes]
         29 | INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 20 May 2020, 1 commit
    • bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Authored by Daniel Borkmann
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket and therefore avoids packet-based NAT, since in
      the connect/sendmsg/recvmsg hooks we only need to translate sock addresses.
      
      This also makes it possible to expose NodePort services on loopback
      addresses in the host namespace, for example. As another advantage, it
      efficiently blocks bind requests for applications in the host namespace
      on exposed ports. However, one missing piece is reverse translation for
      the inet{,6}_getname() hooks, so that we can return the service IP/port
      tuple to the application instead of the remote peer address.

      The vast majority of applications do not bother about getpeername(), but
      on a few occasions we have seen breakage when validating the peer's
      address, since it unexpectedly returns the backend tuple instead of the
      service one. Therefore, this patch adds getpeername() as well as
      getsockname() BPF cgroup hooks for both IPv4 and IPv6 in order to address
      this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      the peer to the context, similar to the inet{,6}_getname() fashion, but
      API-wise this is suboptimal as it forces every program to test for ctx->peer,
      which can easily be missed; hence the BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME
      split. Similarly, the checked return code is tnum_range(1, 1) for now, but if
      a use case comes up in the future, it can easily be changed to allow returning
      an error code instead. Helper and ctx member access is the same as with the
      connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
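
      A minimal example of what a program for the new attach point might look like.
      The section name follows libbpf conventions and the VIP/backend addresses are
      the illustrative ones from the example above, so treat this as a sketch rather
      than a reference implementation.

        // SPDX-License-Identifier: GPL-2.0
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        /* Reverse-translate the backend tuple back to the service VIP
         * before it is reported to the application via getpeername().
         */
        SEC("cgroup/getpeername4")
        int rev_xlate_peer4(struct bpf_sock_addr *ctx)
        {
                /* 10.0.0.10:80 (backend)  ->  1.2.3.4:80 (service) */
                if (ctx->user_ip4 == bpf_htonl(0x0a00000a) &&
                    ctx->user_port == bpf_htons(80))
                        ctx->user_ip4 = bpf_htonl(0x01020304);

                return 1;   /* the verifier expects a return value of 1 here */
        }

        char _license[] SEC("license") = "GPL";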
  18. 19 May 2020, 1 commit
  19. 09 May 2020, 2 commits
  20. 29 Apr 2020, 1 commit
    • net: ipv4: add sysctl for nexthop api compatibility mode · 4f80116d
      Authored by Roopa Prabhu
      The current route nexthop API maintains user space compatibility
      with the old route API by default. Dumps and netlink notifications
      support both the new and old API formats. On systems which have
      moved to the new API, this compatibility mode cancels some
      of the performance benefits provided by the new nexthop API.

      This patch adds a new sysctl, nexthop_compat_mode, which is on
      by default but provides the ability to turn off compatibility
      mode, allowing systems to run entirely with the new routing
      API. Old route API behaviour and support is not modified by this
      sysctl.

      A single sysctl covers both ipv4 and ipv6, following other
      sysctls. It covers dumps and delete notifications, as
      suggested by David Ahern.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
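
      Assuming the knob is exposed as net.ipv4.nexthop_compat_mode (the path is
      inferred from the description above and may differ), switching a system over
      could look like this small sketch:

        #include <stdio.h>

        /* Turn nexthop compatibility mode off; writing "1" restores the
         * default behaviour with old-style dumps and notifications.
         */
        int main(void)
        {
                FILE *f = fopen("/proc/sys/net/ipv4/nexthop_compat_mode", "w");

                if (!f) {
                        perror("nexthop_compat_mode");
                        return 1;
                }
                fputs("0\n", f);
                return fclose(f) ? 1 : 0;
        }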
  21. 21 Apr 2020, 1 commit
  22. 30 Mar 2020, 1 commit
  23. 13 Mar 2020, 1 commit
  24. 27 Nov 2019, 1 commit
    • net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port) · 82f31ebf
      Authored by Maciej Żenczykowski
      Note that the sysctl write accessor functions guarantee that the invariant
        net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
      is maintained, and as such the max() in the selinux hooks is actually spurious.
      
      i.e. even though
        if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
      is, by pure logic, the same as
        if ((snum < inet_prot_sock(sock_net(sk)) || snum < low) || snum > high) {
      it is actually functionally equivalent to:
        if (snum < low || snum > high) {
      which is equivalent to:
        if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
      even though the first clause is spurious.
      
      But we want to hold on to it in case we ever want to change what
      inet_port_requires_bind_service() means (for example by changing it from
      the default [0..1024) range to some sort of set).
      
      Test: builds; 'git grep inet_prot_sock' finds no other references
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
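
      The renamed predicate is essentially the old comparison folded into a helper;
      a simplified sketch (the real definition may differ in detail):

        /* "Does binding to this port require the privileged bind
         * service?"  Replaces open-coded 'port < inet_prot_sock(net)'.
         */
        static inline bool inet_port_requires_bind_service(struct net *net,
                                                           unsigned short port)
        {
                return port < net->ipv4.sysctl_ip_prot_sock;
        }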
  25. 07 Nov 2019, 1 commit
  26. 20 Aug 2019, 1 commit
  27. 04 Jul 2019, 2 commits
  28. 31 May 2019, 2 commits
    • net: don't clear sock->sk early to avoid trouble in strparser · 2b81f816
      Authored by Jakub Kicinski
      af_inet sets sock->sk to NULL before the protocol close runs, which trips up strparser:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
      RIP: 0010:tcp_peek_len+0x10/0x60
      RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
      RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
      RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
      R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
      R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
      FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
       <IRQ>
       strp_data_ready+0x48/0x90
       tls_data_ready+0x22/0xd0 [tls]
       tcp_rcv_established+0x569/0x620
       tcp_v4_do_rcv+0x127/0x1e0
       tcp_v4_rcv+0xad7/0xbf0
       ip_protocol_deliver_rcu+0x2c/0x1c0
       ip_local_deliver_finish+0x41/0x50
       ip_local_deliver+0x6b/0xe0
       ? ip_protocol_deliver_rcu+0x1c0/0x1c0
       ip_rcv+0x52/0xd0
       ? ip_rcv_finish_core.isra.20+0x380/0x380
       __netif_receive_skb_one_core+0x7e/0x90
       netif_receive_skb_internal+0x42/0xf0
       napi_gro_receive+0xed/0x150
       nfp_net_poll+0x7a2/0xd30 [nfp]
       ? kmem_cache_free_bulk+0x286/0x310
       net_rx_action+0x149/0x3b0
       __do_softirq+0xe3/0x30a
       ? handle_irq_event_percpu+0x6a/0x80
       irq_exit+0xe8/0xf0
       do_IRQ+0x85/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0010:cpuidle_enter_state+0xbc/0x450
      
      To avoid this issue, clear sock->sk only after sk_prot->close has run.
      My grepping and testing did not discover any code which would depend on
      the current behaviour.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Reported-by: David Beckett <david.beckett@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
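
      A simplified sketch of the reordering in inet_release() (linger/timeout and
      multicast cleanup omitted): sock->sk is now cleared only after the protocol
      close callback has finished, so code that runs during close, such as
      strparser via tcp_peek_len(), still finds a valid pointer.

        int inet_release(struct socket *sock)
        {
                struct sock *sk = sock->sk;

                if (sk) {
                        long timeout = 0;       /* SO_LINGER handling omitted */

                        sk->sk_prot->close(sk, timeout);
                        sock->sk = NULL;        /* previously cleared before close() */
                }
                return 0;
        }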
    • treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 · 2874c5fd
      Authored by Thomas Gleixner
      Based on 1 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 3029 file(s).
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Allison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  29. 20 Apr 2019, 1 commit
  30. 02 Apr 2019, 1 commit
  31. 24 Mar 2019, 1 commit
    • tcp: add one skb cache for rx · 8b27dae5
      Authored by Eric Dumazet
      Often, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.

      This means the incoming skb had to be allocated on one cpu,
      but freed on another.
      
      This incurs high spinlock contention in the slab layer for small rpc
      workloads, and also a high number of cache line ping-pongs for larger
      packets.

      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() calls can be involved.

      Moreover, performing the __kfree_skb() in the recvmsg() context
      adds latency for user applications and increases the probability
      of trapping them in backlog processing, since the BH handler
      might find the socket owned by the user.

      This patch, combined with the prior one, increases rpc
      performance by about 10% on servers with a large number of cores.

      (A tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
       instead of 8 Mpps.)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem:

       - the CPU handling the NIC rx interrupts, feeding the receive queue,
         and (after this patch) freeing the skbs that were consumed;

       - the CPU in the recvmsg() system call, essentially 100% busy copying
         data out to user space.

      Having at most one skb in a per-socket cache carries very little risk
      of memory exhaustion, and since the cache is protected by the socket
      lock, its management is essentially free.

      Note that if rps/rfs is used, we do not enable this feature, because
      there is a high chance that the same cpu handles both the recvmsg()
      system call and the TCP rx path, while another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu the skb was allocated, and use a different channel
      to give skbs back to that cpu.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
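
      An illustrative sketch of the caching idea. The helper names are made up and
      the field name is assumed to be sk->sk_rx_skb_cache; the real hooks live in
      the skb consume and rx paths. recvmsg() parks the consumed skb under the
      socket lock, and the rx-side cpu, which allocated it, frees it later.

        /* Park at most one consumed rx skb per socket instead of freeing
         * it from the recvmsg() cpu.
         */
        static void sk_rx_cache_park(struct sock *sk, struct sk_buff *skb)
        {
                if (!sk->sk_rx_skb_cache) {     /* protected by the socket lock */
                        sk->sk_rx_skb_cache = skb;
                        return;
                }
                __kfree_skb(skb);
        }

        /* Called from the rx path: free the parked skb on the cpu that
         * allocated it, avoiding remote slab frees and cache line
         * ping-pong.
         */
        static void sk_rx_cache_reap(struct sock *sk)
        {
                struct sk_buff *skb = sk->sk_rx_skb_cache;

                if (skb) {
                        sk->sk_rx_skb_cache = NULL;
                        __kfree_skb(skb);
                }
        }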
  32. 21 Feb 2019, 1 commit
  33. 16 Dec 2018, 1 commit
  34. 08 Nov 2018, 2 commits