1. 26 7月, 2020 1 次提交
  2. 25 7月, 2020 2 次提交
  3. 22 7月, 2020 2 次提交
  4. 20 7月, 2020 1 次提交
  5. 18 7月, 2020 2 次提交
  6. 14 7月, 2020 1 次提交
  7. 31 3月, 2020 1 次提交
  8. 15 1月, 2020 1 次提交
  9. 31 10月, 2019 1 次提交
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 7170a977
      Eric Dumazet 提交于
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7170a977
  10. 03 10月, 2019 2 次提交
  11. 16 9月, 2019 1 次提交
  12. 14 9月, 2019 1 次提交
    • W
      ip: support SO_MARK cmsg · c6af0c22
      Willem de Bruijn 提交于
      Enable setting skb->mark for UDP and RAW sockets using cmsg.
      
      This is analogous to existing support for TOS, TTL, txtime, etc.
      
      Packet sockets already support this as of commit c7d39e32
      ("packet: support per-packet fwmark for af_packet sendmsg").
      
      Similar to other fields, implement by
      1. initialize the sockcm_cookie.mark from socket option sk_mark
      2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
      3. initialize inet_cork.mark from sockcm_cookie.mark
      4. initialize each (usually just one) skb->mark from inet_cork.mark
      
      Step 1 is handled in one location for most protocols by ipcm_init_sk
      as of commit 35178206 ("ipv4: ipcm_cookie initializers").
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6af0c22
  13. 09 7月, 2019 1 次提交
    • W
      ipv6: elide flowlabel check if no exclusive leases exist · 59c820b2
      Willem de Bruijn 提交于
      Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO.
      If not set, by default an autogenerated flowlabel is selected.
      
      Explicit flowlabels require a control operation per label plus a
      datapath check on every connection (every datagram if unconnected).
      This is particularly expensive on unconnected sockets multiplexing
      many flows, such as QUIC.
      
      In the common case, where no lease is exclusive, the check can be
      safely elided, as both lease request and check trivially succeed.
      Indeed, autoflowlabel does the same even with exclusive leases.
      
      Elide the check if no process has requested an exclusive lease.
      
      fl6_sock_lookup previously returns either a reference to a lease or
      NULL to denote failure. Modify to return a real error and update
      all callers. On return NULL, they can use the label and will elide
      the atomic_dec in fl6_sock_release.
      
      This is an optimization. Robust applications still have to revert to
      requesting leases if the fast path fails due to an exclusive lease.
      
      Changes RFC->v1:
        - use static_key_false_deferred to rate limit jump label operations
          - call static_key_deferred_flush to stop timers on exit
        - move decrement out of RCU context
        - defer optimization also if opt data is associated with a lease
        - updated all fp6_sock_lookup callers, not just udp
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59c820b2
  14. 06 7月, 2019 1 次提交
  15. 15 6月, 2019 2 次提交
  16. 07 6月, 2019 1 次提交
    • D
      bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann 提交于
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      983695fa
  17. 06 6月, 2019 1 次提交
  18. 04 6月, 2019 2 次提交
  19. 31 5月, 2019 1 次提交
  20. 06 5月, 2019 2 次提交
  21. 13 4月, 2019 1 次提交
  22. 09 4月, 2019 1 次提交
  23. 23 2月, 2019 2 次提交
  24. 18 1月, 2019 1 次提交
  25. 17 1月, 2019 1 次提交
  26. 05 1月, 2019 1 次提交
    • A
      bpf: Fix [::] -> [::1] rewrite in sys_sendmsg · e8e36984
      Andrey Ignatov 提交于
      sys_sendmsg has supported unspecified destination IPv6 (wildcard) for
      unconnected UDP sockets since 876c7f41. When [::] is passed by user as
      destination, sys_sendmsg rewrites it with [::1] to be consistent with
      BSD (see "BSD'ism" comment in the code).
      
      This didn't work when cgroup-bpf was enabled though since the rewrite
      [::] -> [::1] happened before passing control to cgroup-bpf block where
      fl6.daddr was updated with passed by user sockaddr_in6.sin6_addr (that
      might or might not be changed by BPF program). That way if user passed
      [::] as dst IPv6 it was first rewritten with [::1] by original code from
      876c7f41, but then rewritten back with [::] by cgroup-bpf block.
      
      It happened even when BPF_CGROUP_UDP6_SENDMSG program was not present
      (CONFIG_CGROUP_BPF=y was enough).
      
      The fix is to apply BSD'ism after cgroup-bpf block so that [::] is
      replaced with [::1] no matter where it came from: passed by user to
      sys_sendmsg or set by BPF_CGROUP_UDP6_SENDMSG program.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Reported-by: NNitin Rawat <nitin.rawat@intel.com>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      e8e36984
  27. 15 12月, 2018 1 次提交
    • P
      net: udp6: prefer listeners bound to an address · 23b0269e
      Peter Oskolkov 提交于
      A relatively common use case is to have several IPs configured
      on a host, and have different listeners for each of them. We would
      like to add a "catch all" listener on addr_any, to match incoming
      connections not served by any of the listeners bound to a specific
      address.
      
      However, port-only lookups can match addr_any sockets when sockets
      listening on specific addresses are present if so_reuseport flag
      is set. This patch eliminates lookups into port-only hashtable,
      as lookups by (addr,port) tuple are easily available.
      
      In addition, compute_score() is tweaked to _not_ match
      addr_any sockets to specific addresses, as hash collisions
      could result in the unwanted behavior described above.
      
      Tested: the patch compiles; full test in the last patch in this
      patchset. Existing reuseport_* selftests also pass.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23b0269e
  28. 17 11月, 2018 1 次提交
  29. 10 11月, 2018 1 次提交
  30. 09 11月, 2018 3 次提交
    • S
      udp: Support for error handlers of tunnels with arbitrary destination port · e7cc0824
      Stefano Brivio 提交于
      ICMP error handling is currently not possible for UDP tunnels not
      employing a receiving socket with local destination port matching the
      remote one, because we have no way to look them up.
      
      Add an err_handler tunnel encapsulation operation that can be exported by
      tunnels in order to pass the error to the protocol implementing the
      encapsulation. We can't easily use a lookup function as we did for VXLAN
      and GENEVE, as protocol error handlers, which would be in turn called by
      implementations of this new operation, handle the errors themselves,
      together with the tunnel lookup.
      
      Without a socket, we can't be sure which encapsulation error handler is
      the appropriate one: encapsulation handlers (the ones for FoU and GUE
      introduced in the next patch, e.g.) will need to check the new error codes
      returned by protocol handlers to figure out if errors match the given
      encapsulation, and, in turn, report this error back, so that we can try
      all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.
      
      v2:
      - Name all arguments in err_handler prototypes (David Miller)
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7cc0824
    • S
      net: Convert protocol error handlers from void to int · 32bbd879
      Stefano Brivio 提交于
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      patch.
      
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      
      This has no effect on existing error handling paths.
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32bbd879
    • S
      udp: Handle ICMP errors for tunnels with same destination port on both endpoints · a36e185e
      Stefano Brivio 提交于
      For both IPv4 and IPv6, if we can't match errors to a socket, try
      tunnels before ignoring them. Look up a socket with the original source
      and destination ports as found in the UDP packet inside the ICMP payload,
      this will work for tunnels that force the same destination port for both
      endpoints, i.e. VXLAN and GENEVE.
      
      Actually, lwtunnels could break this assumption if they are configured by
      an external control plane to have different destination ports on the
      endpoints: in this case, we won't be able to trace ICMP messages back to
      them.
      
      For IPv6 redirect messages, call ip6_redirect() directly with the output
      interface argument set to the interface we received the packet from (as
      it's the very interface we should build the exception on), otherwise the
      new nexthop will be rejected. There's no such need for IPv4.
      
      Tunnels can now export an encap_err_lookup() operation that indicates a
      match. Pass the packet to the lookup function, and if the tunnel driver
      reports a matching association, continue with regular ICMP error handling.
      
      v2:
      - Added newline between network and transport header sets in
        __udp{4,6}_lib_err_encap() (David Miller)
      - Removed redundant skb_reset_network_header(skb); in
        __udp4_lib_err_encap()
      - Removed redundant reassignment of iph in __udp4_lib_err_encap()
        (Sabrina Dubroca)
      - Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
        won't work with lwtunnels configured to use asymmetric ports. By the way,
        it's VXLAN, not VxLAN (Jiri Benc)
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a36e185e