1. 10 11月, 2019 4 次提交
    • E
      net: use skb_queue_empty_lockless() in poll() handlers · eaf548fe
      Eric Dumazet 提交于
      [ Upstream commit 3ef7cf57c72f32f61e97f8fa401bc39ea1f1a5d4 ]
      
      Many poll() handlers are lockless. Using skb_queue_empty_lockless()
      instead of skb_queue_empty() is more appropriate.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eaf548fe
    • E
      udp: use skb_queue_empty_lockless() · afa1f5e9
      Eric Dumazet 提交于
      [ Upstream commit 137a0dbe3426fd7bcfe3f8117b36a87b3590e4eb ]
      
      syzbot reported a data-race [1].
      
      We should use skb_queue_empty_lockless() to document that we are
      not ensuring a mutual exclusion and silence KCSAN.
      
      [1]
      BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb
      
      write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0:
       __skb_insert include/linux/skbuff.h:1852 [inline]
       __skb_queue_before include/linux/skbuff.h:1958 [inline]
       __skb_queue_tail include/linux/skbuff.h:1991 [inline]
       __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470
       __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline]
       udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057
       udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074
       udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233
       __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300
       udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1:
       skb_queue_empty include/linux/skbuff.h:1494 [inline]
       __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653
       udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
       ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
       do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
       __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
       __do_sys_recvmmsg net/socket.c:2703 [inline]
       __se_sys_recvmmsg net/socket.c:2696 [inline]
       __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      afa1f5e9
    • E
      udp: fix data-race in udp_set_dev_scratch() · a8a5adbb
      Eric Dumazet 提交于
      [ Upstream commit a793183caa9afae907a0d7ddd2ffd57329369bf5 ]
      
      KCSAN reported a data-race in udp_set_dev_scratch() [1]
      
      The issue here is that we must not write over skb fields
      if skb is shared. A similar issue has been fixed in commit
      89c22d8c ("net: Fix skb csum races when peeking")
      
      While we are at it, use a helper only dealing with
      udp_skb_scratch(skb)->csum_unnecessary, as this allows
      udp_set_dev_scratch() to be called once and thus inlined.
      
      [1]
      BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg
      
      write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1:
       udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308
       __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556
       first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579
       udp_poll+0xea/0x110 net/ipv4/udp.c:2720
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       do_select+0x7d0/0x1020 fs/select.c:534
       core_sys_select+0x381/0x550 fs/select.c:677
       do_pselect.constprop.0+0x11d/0x160 fs/select.c:759
       __do_sys_pselect6 fs/select.c:784 [inline]
       __se_sys_pselect6 fs/select.c:769 [inline]
       __x64_sys_pselect6+0x12e/0x170 fs/select.c:769
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0:
       udp_skb_csum_unnecessary include/net/udp.h:358 [inline]
       udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310
       inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592
       sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
       ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
       do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
       __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
       __do_sys_recvmmsg net/socket.c:2703 [inline]
       __se_sys_recvmmsg net/socket.c:2696 [inline]
       __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8a5adbb
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 0cfaf03c
      Eric Dumazet 提交于
      [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]
      
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0cfaf03c
  2. 08 10月, 2019 2 次提交
  3. 21 9月, 2019 1 次提交
  4. 03 7月, 2019 2 次提交
    • M
      bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro · 79c6a8c0
      Martin KaFai Lau 提交于
      commit 257a525fe2e49584842c504a92c27097407f778f upstream.
      
      When the commit a6024562 ("udp: Add GRO functions to UDP socket")
      added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
      the reuseport_select_sock() assumption that skb->data is pointing
      to the transport header.
      
      This patch follows an earlier __udp6_lib_err() fix by
      passing a NULL skb to avoid calling the reuseport's bpf_prog.
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      79c6a8c0
    • D
      bpf: fix unconnected udp hooks · 613bc37f
      Daniel Borkmann 提交于
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      613bc37f
  5. 31 1月, 2019 1 次提交
  6. 04 11月, 2018 1 次提交
    • S
      net: udp: fix handling of CHECKSUM_COMPLETE packets · 4fb0dc97
      Sean Tranchetti 提交于
      [ Upstream commit db4f1be3 ]
      
      Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
      incorrect for any packet that has an incorrect checksum value.
      
      udp4/6_csum_init() will both make a call to
      __skb_checksum_validate_complete() to initialize/validate the csum
      field when receiving a CHECKSUM_COMPLETE packet. When this packet
      fails validation, skb->csum will be overwritten with the pseudoheader
      checksum so the packet can be fully validated by software, but the
      skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
      the stack can later warn the user about their hardware spewing bad
      checksums. Unfortunately, leaving the SKB in this state can cause
      problems later on in the checksum calculation.
      
      Since the the packet is still marked as CHECKSUM_COMPLETE,
      udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
      from skb->csum instead of adding it, leaving us with a garbage value
      in that field. Once we try to copy the packet to userspace in the
      udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
      to checksum the packet data and add it in the garbage skb->csum value
      to perform our final validation check.
      
      Since the value we're validating is not the proper checksum, it's possible
      that the folded value could come out to 0, causing us not to drop the
      packet. Instead, we believe that the packet was checksummed incorrectly
      by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
      to warn the user with netdev_rx_csum_fault(skb->dev);
      
      Unfortunately, since this is the UDP path, skb->dev has been overwritten
      by skb->dev_scratch and is no longer a valid pointer, so we end up
      reading invalid memory.
      
      This patch addresses this problem in two ways:
      	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
      	   from skb_copy_and_csum_datagram_msg(). Since this gets called
      	   from the UDP path where skb->dev has been overwritten, we have
      	   no way of knowing if the pointer is still valid. Also for the
      	   sake of consistency with the other uses of
      	   netdev_rx_csum_fault(), don't attempt to call it if the
      	   packet was checksummed by software.
      
      	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
      	   If we receive a packet that's CHECKSUM_COMPLETE that fails
      	   verification (i.e. skb->csum_valid == 0), check who performed
      	   the calculation. It's possible that the checksum was done in
      	   software by the network stack earlier (such as Netfilter's
      	   CONNTRACK module), and if that says the checksum is bad,
      	   we can drop the packet immediately instead of waiting until
      	   we try and copy it to userspace. Otherwise, we need to
      	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
      	   no longer contains the full packet checksum after the
      	   call to __skb_checksum_validate_complete().
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Fixes: c84d9490 ("udp: copy skb->truesize in the first cache line")
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fb0dc97
  7. 08 10月, 2018 1 次提交
    • J
      udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      Jiri Kosina 提交于
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned static inline __skb_recv_udp() from being a trivial helper around
      __skb_recv_datagram() into a UDP specific implementaion, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      Rationale (one of those) why this is actually "technically correct" thing
      to do: __skb_recv_udp() used to be an inline wrapper around
      __skb_recv_datagram(), which itself (still, and correctly so, I believe)
      is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e823644
  8. 17 9月, 2018 1 次提交
  9. 11 8月, 2018 2 次提交
    • M
      bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Martin KaFai Lau 提交于
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" during load time
      instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8217ca65
    • M
      bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau 提交于
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
      from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
      point,  it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
      clear if the lower layer is only ipv4 and ipv6 in the future and
      will it not touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      sk->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it does , it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
  10. 07 7月, 2018 2 次提交
    • W
      ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6 · 678ca42d
      Willem de Bruijn 提交于
      skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
      after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
      
      The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
      individual protocols that support tx timestamps call this function
      and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
      called in __ip6_append_data.
      
      There is no need to store both tx_flags and ts_flags in the cookie
      as one is derived from the other. Convert when setting up the cork
      and remove the redundant field. This is similar to IPv6, only have
      the conversion happen only once per datagram, in ip(6)_setup_cork.
      
      Also change __ip6_append_data to match __ip_append_data. Only update
      tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
      redundant: only valid protocols can have non-zero cork->tx_flags.
      
      After this change the IPv4 and IPv6 logic is the same.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      678ca42d
    • W
      ipv4: ipcm_cookie initializers · 35178206
      Willem de Bruijn 提交于
      Initialize the cookie in one location to reduce code duplication and
      avoid bugs from inconsistent initialization, such as that fixed in
      commit 9887cba1 ("ip: limit use of gso_size to udp").
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35178206
  11. 04 7月, 2018 1 次提交
  12. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  13. 09 6月, 2018 1 次提交
    • P
      udp: fix rx queue len reported by diag and proc interface · 6c206b20
      Paolo Abeni 提交于
      After commit 6b229cf7 ("udp: add batching to udp_rmem_release()")
      the sk_rmem_alloc field does not measure exactly anymore the
      receive queue length, because we batch the rmem release. The issue
      is really apparent only after commit 0d4a6608 ("udp: do rmem bulk
      free even if the rx sk queue is empty"): the user space can easily
      check for an empty socket with not-0 queue length reported by the 'ss'
      tool or the procfs interface.
      
      We need to use a custom UDP helper to report the correct queue length,
      taking into account the forward allocation deficit.
      
      Reported-by: trevor.francis@46labs.com
      Fixes: 6b229cf7 ("UDP: add batching to udp_rmem_release()")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c206b20
  14. 05 6月, 2018 1 次提交
  15. 28 5月, 2018 1 次提交
    • A
      bpf: Hooks for sys_sendmsg · 1cedee13
      Andrey Ignatov 提交于
      In addition to already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
      that provides access to socket itlself (properties like family, type,
      protocol) and user-passed `struct sockaddr *` so that BPF program can
      override destination IP and port for system calls such as sendto(2) or
      sendmsg(2) and/or assign source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 correspondingly.
      
      UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
      sys_connect hooks, i.e. to prevent reading from / writing to e.g.
      user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
      
      The difference with already existing hooks is sys_sendmsg are
      implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time since socket either was already connected
      and has source/destination set or wasn't connected and call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by sys_connect hooks that can override
      source/destination at connect time and use fast-path later, i.e. these
      hooks don't affect UDP fast-path.
      
      Rewriting source IP is implemented differently than that in sys_connect
      hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
      just bind socket to desired local IP address since source IP can be set
      on per-packet basis by using ancillary data (cmsg(3)). So no matter if
      socket is bound or not, source IP has to be rewritten on every call to
      sys_sendmsg.
      
      To do so two new fields are added to UAPI `struct bpf_sock_addr`;
      * `msg_src_ip4` to set source IPv4 for UDPv4;
      * `msg_src_ip6` to set source IPv6 for UDPv6.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1cedee13
  16. 26 5月, 2018 1 次提交
  17. 24 5月, 2018 1 次提交
  18. 16 5月, 2018 2 次提交
  19. 12 5月, 2018 1 次提交
  20. 11 5月, 2018 2 次提交
  21. 09 5月, 2018 1 次提交
  22. 02 5月, 2018 1 次提交
    • W
      udp: disable gso with no_check_tx · a8c744a8
      Willem de Bruijn 提交于
      Syzbot managed to send a udp gso packet without checksum offload into
      the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
      triggered the skb_warn_bad_offload.
      
        RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
         skb_gso_segment include/linux/netdevice.h:4038 [inline]
         validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
         __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
         dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
      
      UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
      udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
      catch and fail in this case.
      
      After the integrity checks jump directly to the CHECKSUM_PARTIAL case
      to avoid reading the no_check_tx flags again (a TOCTTOU race).
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8c744a8
  23. 27 4月, 2018 3 次提交
    • W
      udp: add gso segment cmsg · 2e8de857
      Willem de Bruijn 提交于
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e8de857
    • W
      udp: generate gso with UDP_SEGMENT · bec1f6f6
      Willem de Bruijn 提交于
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate and send at once the payload of multiple datagrams with
      the same destination.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall drops
      increases from 1470 to 61818.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bec1f6f6
    • W
      udp: expose inet cork to udp · 1cd7884d
      Willem de Bruijn 提交于
      UDP segmentation offload needs access to inet_cork in the udp layer.
      Pass the struct to ip(6)_make_skb instead of allocating it on the
      stack in that function itself.
      
      This patch is a noop otherwise.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cd7884d
  24. 31 3月, 2018 1 次提交
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
  25. 28 3月, 2018 1 次提交
  26. 27 3月, 2018 1 次提交
  27. 24 3月, 2018 1 次提交
  28. 17 3月, 2018 1 次提交
  29. 17 2月, 2018 1 次提交
    • A
      udplite: fix partial checksum initialization · 15f35d49
      Alexey Kodanev 提交于
      Since UDP-Lite is always using checksum, the following path is
      triggered when calculating pseudo header for it:
      
        udp4_csum_init() or udp6_csum_init()
          skb_checksum_init_zero_check()
            __skb_checksum_validate_complete()
      
      The problem can appear if skb->len is less than CHECKSUM_BREAK. In
      this particular case __skb_checksum_validate_complete() also invokes
      __skb_checksum_complete(skb). If UDP-Lite is using partial checksum
      that covers only part of a packet, the function will return bad
      checksum and the packet will be dropped.
      
      It can be fixed if we skip skb_checksum_init_zero_check() and only
      set the required pseudo header checksum for UDP-Lite with partial
      checksum before udp4_csum_init()/udp6_csum_init() functions return.
      
      Fixes: ed70fcfc ("net: Call skb_checksum_init in IPv4")
      Fixes: e4f45b7f ("net: Call skb_checksum_init in IPv6")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15f35d49