1. 29 May, 2020 3 commits
  2. 26 May, 2020 1 commit
    • tcp: allow traceroute -Mtcp for unpriv users · 45af29ca
      Eric Dumazet authored
      Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
      
      $ traceroute -Mtcp 8.8.8.8
      You do not have enough privileges to use this traceroute method.
      
      $ traceroute -n -Mudp 8.8.8.8
      traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
       1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
       2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
       3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
       4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
       5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
       6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
       7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
       8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
       9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C
      
      We can easily support traceroute over TCP, by queueing an error message
      into socket error queue.
      
      Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
      enable this feature, and that the error message is only queued
      while in SYN_SENT state.
      
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
      setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
      setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
      recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
              msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
              msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                         {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
              msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
      
      Suggested-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45af29ca
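
The IP_RECVERR/IPV6_RECVERR control message seen in the strace output above carries a `struct sock_extended_err` followed by the socket address of the node that sent the ICMP error. A minimal sketch of decoding that payload follows, assuming the Linux layout from `<linux/errqueue.h>`; the function name `parse_recverr` and the returned dict keys are illustrative choices, not part of the commit.

```python
import socket
import struct

# struct sock_extended_err from <linux/errqueue.h>:
#   __u32 ee_errno; __u8 ee_origin, ee_type, ee_code, ee_pad;
#   __u32 ee_info; __u32 ee_data;
SOCK_EE_FMT = "=IBBBBII"
SOCK_EE_LEN = struct.calcsize(SOCK_EE_FMT)  # 16 bytes
SO_EE_ORIGIN_ICMP = 2   # values from <linux/errqueue.h>
SO_EE_ORIGIN_ICMP6 = 3


def parse_recverr(cmsg_data):
    """Decode the payload of an IP_RECVERR / IPV6_RECVERR control message.

    Hypothetical helper: returns the error fields and, when a socket
    address follows the header, the address of the "offender" (the node
    that generated the ICMP error, i.e. the traceroute hop).
    """
    ee_errno, origin, icmp_type, icmp_code, _pad, info, _data = (
        struct.unpack_from(SOCK_EE_FMT, cmsg_data)
    )
    result = {
        "errno": ee_errno,
        "origin": origin,
        "type": icmp_type,
        "code": icmp_code,
        "info": info,
        "offender": None,
    }
    offender = cmsg_data[SOCK_EE_LEN:]
    if len(offender) >= 2:
        (family,) = struct.unpack_from("=H", offender)
        if family == socket.AF_INET and len(offender) >= 8:
            # sockaddr_in: family(2) port(2) addr(4)
            result["offender"] = socket.inet_ntop(socket.AF_INET, offender[4:8])
        elif family == socket.AF_INET6 and len(offender) >= 24:
            # sockaddr_in6: family(2) port(2) flowinfo(4) addr(16)
            result["offender"] = socket.inet_ntop(socket.AF_INET6, offender[8:24])
    return result
```

The bytes passed in would be the `cmsg_data` of the RECVERR entry returned by `recvmsg()` when called with the MSG_ERRQUEUE flag. The sizes agree with the trace: a 16-byte cmsg header, 16 bytes of `sock_extended_err` and a 28-byte `sockaddr_in6` add up to the reported `cmsg_len=60`.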
  3. 23 May, 2020 5 commits
  4. 22 May, 2020 1 commit
    • net: don't return invalid table id error when we fall back to PF_UNSPEC · 41b4bd98
      Sabrina Dubroca authored
      In case we can't find a ->dumpit callback for the requested
      (family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
      in the same situation as if userspace had requested a PF_UNSPEC
      dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
      the registered RTM_GETROUTE handlers.
      
      The requested table id may or may not exist for all of those
      families. commit ae677bbb ("net: Don't return invalid table id
      error when dumping all families") fixed the problem when userspace
      explicitly requests a PF_UNSPEC dump, but missed the fallback case.
      
      For example, when we pass ipv6.disable=1 to a kernel with
      CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
      the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
      rtnl_dump_all, and listing IPv6 routes will unexpectedly print:
      
        # ip -6 r
        Error: ipv4: MR table does not exist.
        Dump terminated
      
      commit ae677bbb introduced the dump_all_families variable, which
      gets set when userspace requests a PF_UNSPEC dump. However, we can't
      simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
      fallback case to get dump_all_families == true, because some message
      types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
      PF_UNSPEC handler and use the family to filter in the kernel what is
      dumped to userspace. We would then export more entries, which userspace
      would have to filter. iproute does that, but other programs may not.
      
      Instead, this patch removes dump_all_families and updates the
      RTM_GETROUTE handlers to check if the family that is being dumped is
      their own. When it's not, which covers both the intentional PF_UNSPEC
      dumps (as dump_all_families did) and the fallback case, ignore the
      missing table id error.
      
      Fixes: cb167893 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41b4bd98
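
The rule described in this commit (a handler ignores a missing table id unless the dump was addressed to its own family) can be modelled in a few lines. This is a simplified sketch, not kernel code: `dump_table` and its parameters are hypothetical names, and returning `-ENOENT` for the own-family case is an assumption about the error the handlers report.

```python
import errno

# Linux address-family values.
PF_UNSPEC, PF_INET, PF_INET6 = 0, 2, 10


def dump_table(handler_family, dumped_family, table_exists):
    """Model one RTM_GETROUTE handler; return 0 or a negative errno."""
    if table_exists:
        return 0  # the real handler would dump the routes here
    if dumped_family != handler_family:
        # The dump was not addressed to this family: either an explicit
        # PF_UNSPEC request or the fallback from an unregistered family.
        # A missing table is not an error in that case.
        return 0
    # The caller asked this family for a table that does not exist.
    return -errno.ENOENT
```

With this rule, the `ip -6 r` case above (a PF_INET6 request that falls back to the all-families dump and reaches the IPv4 multicast handler) yields 0 instead of "MR table does not exist".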
  5. 21 May, 2020 8 commits
  6. 20 May, 2020 6 commits
    • ipv6: use ->ndo_tunnel_ctl in addrconf_set_dstaddr · 8e3db0bb
      Christoph Hellwig authored
      Use the new ->ndo_tunnel_ctl instead of overriding the address limit
      and using ->ndo_do_ioctl just to do a pointless user copy.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e3db0bb
    • ipv6: streamline addrconf_set_dstaddr · 68ad6886
      Christoph Hellwig authored
      Factor out an addrconf_set_sit_dstaddr helper for the actual work if we
      found a SIT device, and only hold the rtnl lock around the device lookup
      and that new helper, as there is no point in holding it over a
      copy_from_user call.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68ad6886
    • ipv6: stub out even more of addrconf_set_dstaddr if SIT is disabled · f0988460
      Christoph Hellwig authored
      There is no point in copying the structure from userspace or looking up
      a device if SIT support is disabled, as we'll eventually return
      -ENODEV anyway.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0988460
    • sit: impement ->ndo_tunnel_ctl · f60fe2df
      Christoph Hellwig authored
      Implement the ->ndo_tunnel_ctl method, and use ip_tunnel_ioctl to
      handle userspace requests for the SIOCGETTUNNEL, SIOCADDTUNNEL,
      SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f60fe2df
    • sit: refactor ipip6_tunnel_ioctl · fd5d687b
      Christoph Hellwig authored
      Split the ioctl handler into one function per command instead of having
      all the logic sit in one giant switch statement.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd5d687b
    • bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Daniel Borkmann authored
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket, and therefore allows to avoid packet-based NAT
      given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
      
      This also allows to expose NodePort services on loopback addresses in the
      host namespace, for example. As another advantage, this also efficiently
      blocks bind requests for applications in the host namespace for exposed
      ports. However, one missing item is that we also need to perform reverse
      xlation for inet{,6}_getname() hooks such that we can return the service
      IP/port tuple back to the application instead of the remote peer address.
      
      The vast majority of applications do not care about getpeername(), but
      on a few occasions we've seen breakage when validating the peer's address,
      since it unexpectedly returns the backend tuple instead of the service one.
      Therefore, this trivial patch adds a getpeername() as well as a
      getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address
      this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        >  Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      peer to the context similar as in inet{,6}_getname() fashion, but API-wise
      this is suboptimal as it always enforces programs having to test for ctx->peer
      which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
      Similarly, the checked return code is on tnum_range(1, 1), but if a use case
      comes up in future, it can easily be changed to return an error code instead.
      Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
      1b66d253
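
The forward and reverse translation described in this commit can be illustrated with a toy model. It is plain Python rather than a BPF program, the hook function names are hypothetical, and the service table simply reuses the 1.2.3.4:80 => 10.0.0.10:80 entry from the example above.

```python
# Hypothetical service table: frontend (service) tuple -> backend tuple.
SERVICES = {("1.2.3.4", 80): ("10.0.0.10", 80)}

# Reverse map used for getpeername(): backend tuple -> frontend tuple.
REVERSE = {backend: frontend for frontend, backend in SERVICES.items()}


def connect_hook(dst):
    """Forward translation at connect(): service address -> backend."""
    return SERVICES.get(dst, dst)


def getpeername_hook(peer):
    """Reverse translation at getpeername(): backend -> service address."""
    return REVERSE.get(peer, peer)


# The application connects to the service address ...
backend = connect_hook(("1.2.3.4", 80))
# ... the socket is really connected to the backend, but getpeername()
# reports the service tuple the application asked for.
reported_peer = getpeername_hook(backend)
```

A single global reverse map is a simplification: it only works while each backend belongs to exactly one service, so a real implementation would need to keep the reverse mapping per socket.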
  7. 19 May, 2020 2 commits
  8. 17 May, 2020 2 commits
  9. 14 May, 2020 4 commits
  10. 13 May, 2020 1 commit
    • netlabel: cope with NULL catmap · eead1c2e
      Paolo Abeni authored
      The cipso and calipso code can set the MLS_CAT attribute on
      successful parsing, even if the corresponding catmap has
      not been allocated, as per current configuration and external
      input.
      
      Later, selinux code tries to access the catmap if the MLS_CAT flag
      is present via netlbl_catmap_getlong(). That may cause null ptr
      dereference while processing incoming network traffic.
      
      Address the issue by setting the MLS_CAT flag only if the catmap is
      really allocated. Additionally, let netlbl_catmap_getlong() cope
      with a NULL catmap.
      Reported-by: Matthew Sheets <matthew.sheets@gd-ms.com>
      Fixes: 4b8feff2 ("netlabel: fix the horribly broken catmap functions")
      Fixes: ceba1832 ("calipso: Set the calipso socket label to match the secattr.")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eead1c2e
  11. 10 May, 2020 1 commit
  12. 09 May, 2020 4 commits
  13. 08 May, 2020 1 commit
    • Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu" · 09454fd0
      Maciej Żenczykowski authored
      This reverts commit 19bda36c:
      
      | ipv6: add mtu lock check in __ip6_rt_update_pmtu
      |
      | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
      | It leaded to that mtu lock doesn't really work when receiving the pkt
      | of ICMPV6_PKT_TOOBIG.
      |
      | This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
      | did in __ip_rt_update_pmtu.
      
      The above reasoning is incorrect.  IPv6 *requires* icmp based pmtu to work.
      There's already a comment to this effect elsewhere in the kernel:
      
        $ git grep -p -B1 -A3 'RTAX_MTU lock'
        net/ipv6/route.c=4813=
      
        static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
        ...
          /* In IPv6 pmtu discovery is not optional,
             so that RTAX_MTU lock cannot disable it.
             We still use this lock to block changes
             caused by addrconf/ndisc.
          */
      
      This reverts to the pre-4.9 behaviour.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Fixes: 19bda36c ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09454fd0
  14. 07 May, 2020 1 commit
    • seg6: fix SRH processing to comply with RFC8754 · 0cb7498f
      Ahmed Abdelsalam authored
      The Segment Routing Header (SRH) which defines the SRv6 dataplane is defined
      in RFC8754.
      
      RFC8754 (section 4.1) defines the SR source node behavior which encapsulates
      packets into an outer IPv6 header and SRH. The SR source node encodes the
      full list of Segments that defines the packet path in the SRH. Then, the
      first segment from list of Segments is copied into the Destination address
      of the outer IPv6 header and the packet is sent to the first hop in its path
      towards the destination.
      
      If the Segment list has only one segment, the SR source node can omit the SRH
      as the only segment is added in the destination address.
      
      RFC8754 (section 4.1.1) defines the Reduced SRH, when a source does not
      require the entire SID list to be preserved in the SRH. A reduced SRH does
      not contain the first segment of the related SR Policy (the first segment is
      the one already in the DA of the IPv6 header), and the Last Entry field is
      set to n-2, where n is the number of elements in the SR Policy.
      
      RFC8754 (section 4.3.1.1) defines the SRH processing and the logic to
      validate the SRH (S09, S10, S11) which works for both reduced and
      non-reduced behaviors.
      
      This patch updates seg6_validate_srh() to validate the SRH as per RFC8754.
      Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cb7498f
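
The S09/S10 checks from RFC 8754 section 4.3.1.1 that this commit refers to reduce to two comparisons on SRH fields. The sketch below covers only those two checks; it is not the kernel's seg6_validate_srh(), and the function name is a hypothetical one chosen for illustration.

```python
def srh_fields_valid(hdr_ext_len, last_entry, segments_left):
    """Apply the S09/S10 checks of RFC 8754 section 4.3.1.1.

    hdr_ext_len is expressed in 8-octet units, not counting the first
    8 octets of the header; each 128-bit segment occupies two units.
    """
    max_last_entry = hdr_ext_len // 2 - 1    # S09
    if last_entry > max_last_entry:          # S10, first condition
        return False
    if segments_left > last_entry + 1:       # S10, second condition
        return False
    return True
```

For an SR Policy of n = 3 segments, a non-reduced SRH carries three segments (Hdr Ext Len 6, Last Entry 2, Segments Left 2), while a reduced SRH carries two (Hdr Ext Len 4, Last Entry n-2 = 1, Segments Left 2). Both pass, because Segments Left may exceed Last Entry by one.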