1. 31 1月, 2018 1 次提交
    • P
      netfilter: on sockopt() acquire sock lock only in the required scope · 3f34cfae
      Paolo Abeni 提交于
      Syzbot reported several deadlocks in the netfilter area caused by
      rtnl lock and socket lock being acquired with a different order on
      different code paths, leading to backtraces like the following one:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0-rc9+ #212 Not tainted
      ------------------------------------------------------
      syzkaller041579/3682 is trying to acquire lock:
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
      include/net/sock.h:1463 [inline]
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
      do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
      
      but task is already holding lock:
        (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (rtnl_mutex){+.+.}:
              __mutex_lock_common kernel/locking/mutex.c:756 [inline]
              __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
              mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
              rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
              register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
              tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
              xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
              check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
              find_check_entry.isra.7+0x935/0xcf0
      net/ipv6/netfilter/ip6_tables.c:580
              translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
              do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
              do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
              nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
              nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
              ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      -> #0 (sk_lock-AF_INET6){+.+.}:
              lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
              lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
              lock_sock include/net/sock.h:1463 [inline]
              do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
              ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(rtnl_mutex);
                                      lock(sk_lock-AF_INET6);
                                      lock(rtnl_mutex);
         lock(sk_lock-AF_INET6);
      
        *** DEADLOCK ***
      
      1 lock held by syzkaller041579/3682:
        #0:  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      The problem, as Florian noted, is that nf_setsockopt() is always
      called with the socket held, even if the lock itself is required only
      for very tight scopes and only for some operation.
      
      This patch addresses the issues moving the lock_sock() call only
      where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
      does not need anymore to acquire both locks.
      
      Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
      Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3f34cfae
  2. 22 12月, 2017 1 次提交
    • S
      net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
      Shaohua Li 提交于
      sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
      If sockopt doesn't set autoflowlabel, outcome packets from the hosts are
      supposed to not include flowlabel. This is true for normal packet, but
      not for reset packet.
      
      The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
      we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
      changed, so the sock will keep the old behavior in terms of auto
      flowlabel. Reset packet is suffering from this problem, because reset
      packet is sent from a special control socket, which is created at boot
      time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
      socket will always have its ipv6_pinfo.autoflowlabel set, even after
      user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always
      have flowlabel. Normal sock created before sysctl setting suffers from
      the same issue. We can't even turn off autoflowlabel unless we kill all
      socks in the hosts.
      
      To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
      autoflowlabel setting from user, otherwise we always call
      ip6_default_np_autolabel() which has the new settings of sysctl.
      
      Note, this changes behavior a little bit. Before commit 42240901
      (ipv6: Implement different admin modes for automatic flow labels), the
      autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
      existing connection will change autoflowlabel behavior. After that
      commit, autoflowlabel behavior is sticky in the whole life of the sock.
      With this patch, the behavior isn't sticky again.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      513674b5
  3. 30 9月, 2017 1 次提交
    • M
      net-ipv6: add support for sockopt(SOL_IPV6, IPV6_FREEBIND) · 84e14fe3
      Maciej Żenczykowski 提交于
      So far we've been relying on sockopt(SOL_IP, IP_FREEBIND) being usable
      even on IPv6 sockets.
      
      However, it turns out it is perfectly reasonable to want to set freebind
      on an AF_INET6 SOCK_RAW socket - but there is no way to set any SOL_IP
      socket option on such a socket (they're all blindly errored out).
      
      One use case for this is to allow spoofing src ip on a raw socket
      via sendmsg cmsg.
      
      Tested:
        built, and booted
        # python
        >>> import socket
        >>> SOL_IP = socket.SOL_IP
        >>> SOL_IPV6 = socket.IPPROTO_IPV6
        >>> IP_FREEBIND = 15
        >>> IPV6_FREEBIND = 78
        >>> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0)
        >>> s.getsockopt(SOL_IP, IP_FREEBIND)
        0
        >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
        0
        >>> s.setsockopt(SOL_IPV6, IPV6_FREEBIND, 1)
        >>> s.getsockopt(SOL_IP, IP_FREEBIND)
        1
        >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
        1
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84e14fe3
  4. 30 8月, 2017 1 次提交
    • X
      ipv6: do not set sk_destruct in IPV6_ADDRFORM sockopt · e8d411d2
      Xin Long 提交于
      ChunYu found a kernel warn_on during syzkaller fuzzing:
      
      [40226.038539] WARNING: CPU: 5 PID: 23720 at net/ipv4/af_inet.c:152 inet_sock_destruct+0x78d/0x9a0
      [40226.144849] Call Trace:
      [40226.147590]  <IRQ>
      [40226.149859]  dump_stack+0xe2/0x186
      [40226.176546]  __warn+0x1a4/0x1e0
      [40226.180066]  warn_slowpath_null+0x31/0x40
      [40226.184555]  inet_sock_destruct+0x78d/0x9a0
      [40226.246355]  __sk_destruct+0xfa/0x8c0
      [40226.290612]  rcu_process_callbacks+0xaa0/0x18a0
      [40226.336816]  __do_softirq+0x241/0x75e
      [40226.367758]  irq_exit+0x1f6/0x220
      [40226.371458]  smp_apic_timer_interrupt+0x7b/0xa0
      [40226.376507]  apic_timer_interrupt+0x93/0xa0
      
      The warn_on happned when sk->sk_rmem_alloc wasn't 0 in inet_sock_destruct.
      As after commit f970bd9e ("udp: implement memory accounting helpers"),
      udp has changed to use udp_destruct_sock as sk_destruct where it would
      udp_rmem_release all rmem.
      
      But IPV6_ADDRFORM sockopt sets sk_destruct with inet_sock_destruct after
      changing family to PF_INET. If rmem is not 0 at that time, and there is
      no place to release rmem before calling inet_sock_destruct, the warn_on
      will be triggered.
      
      This patch is to fix it by not setting sk_destruct in IPV6_ADDRFORM sockopt
      any more. As IPV6_ADDRFORM sockopt only works for tcp and udp. TCP sock has
      already set it's sk_destruct with inet_sock_destruct and UDP has set with
      udp_destruct_sock since they're created.
      
      Fixes: f970bd9e ("udp: implement memory accounting helpers")
      Reported-by: NChunYu Wang <chunwang@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8d411d2
  5. 04 7月, 2017 1 次提交
  6. 30 6月, 2017 1 次提交
  7. 31 12月, 2016 1 次提交
    • D
      net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern 提交于
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bb387c5
  8. 25 12月, 2016 1 次提交
  9. 10 11月, 2016 1 次提交
  10. 04 11月, 2016 1 次提交
    • W
      ipv6: add IPV6_RECVFRAGSIZE cmsg · 0cc0aa61
      Willem de Bruijn 提交于
      When reading a datagram or raw packet that arrived fragmented, expose
      the maximum fragment size if recorded to allow applications to
      estimate receive path MTU.
      
      At this point, the field is only recorded when ipv6 connection
      tracking is enabled. A follow-up patch will record this field also
      in the ipv6 input path.
      
      Tested using the test for IP_RECVFRAGSIZE plus
      
        ip netns exec to ip addr add dev veth1 fc07::1/64
        ip netns exec from ip addr add dev veth0 fc07::2/64
      
        ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
        ip netns exec from nc -q 1 -u fc07::1 6000 < payload
      
      Both with and without enabling connection tracking
      
        ip6tables -A INPUT -m state --state NEW -p udp -j LOG
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cc0aa61
  11. 21 10月, 2016 1 次提交
  12. 28 6月, 2016 1 次提交
  13. 04 5月, 2016 1 次提交
    • W
      ipv6: add new struct ipcm6_cookie · 26879da5
      Wei Wang 提交于
      In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
      variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
      functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
      This is not a good practice and makes it hard to add new parameters.
      This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
      ipv4 and include the above mentioned variables. And we only pass the
      pointer to this structure to corresponding functions. This makes it easier
      to add new parameters in the future and makes the function cleaner.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26879da5
  14. 08 4月, 2016 1 次提交
  15. 05 4月, 2016 1 次提交
  16. 03 12月, 2015 1 次提交
  17. 01 4月, 2015 1 次提交
  18. 21 3月, 2015 1 次提交
  19. 19 3月, 2015 2 次提交
  20. 26 1月, 2015 1 次提交
    • E
      ipv6: tcp: fix race in IPV6_2292PKTOPTIONS · 1dc7b90f
      Eric Dumazet 提交于
      IPv6 TCP sockets store in np->pktoptions skbs, and use skb_set_owner_r()
      to charge the skb to socket.
      
      It means that destructor must be called while socket is locked.
      
      Therefore, we cannot use skb_get() or atomic_inc(&skb->users)
      to protect ourselves : kfree_skb() might race with other users
      manipulating sk->sk_forward_alloc
      
      Fix this race by holding socket lock for the duration of
      ip6_datagram_recv_ctl()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1dc7b90f
  21. 10 12月, 2014 1 次提交
  22. 25 8月, 2014 2 次提交
  23. 16 7月, 2014 1 次提交
  24. 08 7月, 2014 1 次提交
    • T
      ipv6: Implement automatic flow label generation on transmit · cb1ce2ef
      Tom Herbert 提交于
      Automatically generate flow labels for IPv6 packets on transmit.
      The flow label is computed based on skb_get_hash. The flow label will
      only automatically be set when it is zero otherwise (i.e. flow label
      manager hasn't set one). This supports the transmit side functionality
      of RFC 6438.
      
      Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
      system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
      functionality per socket.
      
      By default, auto flowlabels are disabled to avoid possible conflicts
      with flow label manager, however if this feature proves useful we
      may want to enable it by default.
      
      It should also be noted that FreeBSD has already implemented automatic
      flow labels (including the sysctl and socket option). In FreeBSD,
      automatic flow labels default to enabled.
      
      Performance impact:
      
      Running super_netperf with 200 flows for TCP_RR and UDP_RR for
      IPv6. Note that in UDP case, __skb_get_hash will be called for
      every packet with explains slight regression. In the TCP case
      the hash is saved in the socket so there is no regression.
      
      Automatic flow labels disabled:
      
        TCP_RR:
          86.53% CPU utilization
          127/195/322 90/95/99% latencies
          1.40498e+06 tps
      
        UDP_RR:
          90.70% CPU utilization
          118/168/243 90/95/99% latencies
          1.50309e+06 tps
      
      Automatic flow labels enabled:
      
        TCP_RR:
          85.90% CPU utilization
          128/199/337 90/95/99% latencies
          1.40051e+06
      
        UDP_RR
          92.61% CPU utilization
          115/164/236 90/95/99% latencies
          1.4687e+06
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb1ce2ef
  25. 02 7月, 2014 1 次提交
    • E
      inet: move ipv6only in sock_common · 9fe516ba
      Eric Dumazet 提交于
      When an UDP application switches from AF_INET to AF_INET6 sockets, we
      have a small performance degradation for IPv4 communications because of
      extra cache line misses to access ipv6only information.
      
      This can also be noticed for TCP listeners, as ipv6_only_sock() is also
      used from __inet_lookup_listener()->compute_score()
      
      This is magnified when SO_REUSEPORT is used.
      
      Move ipv6only into struct sock_common so that it is available at
      no extra cost in lookups.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fe516ba
  26. 27 2月, 2014 1 次提交
  27. 20 1月, 2014 2 次提交
  28. 16 1月, 2014 1 次提交
  29. 19 12月, 2013 1 次提交
  30. 13 12月, 2013 1 次提交
  31. 10 12月, 2013 2 次提交
  32. 09 11月, 2013 1 次提交
  33. 09 10月, 2013 1 次提交
    • E
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet 提交于
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe4208f
  34. 01 2月, 2013 1 次提交
  35. 19 11月, 2012 1 次提交
    • E
      net: Allow userns root to control ipv6 · af31f412
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is hardware NIC
      that has been explicity moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed while
      resource control is left unchanged.
      
      Allow the SIOCSIFADDR ioctl to add ipv6 addresses.
      Allow the SIOCDIFADDR ioctl to delete ipv6 addresses.
      Allow the SIOCADDRT ioctl to add ipv6 routes.
      Allow the SIOCDELRT ioctl to delete ipv6 routes.
      
      Allow creation of ipv6 raw sockets.
      
      Allow setting the IPV6_JOIN_ANYCAST socket option.
      Allow setting the IPV6_FL_A_RENEW parameter of the IPV6_FLOWLABEL_MGR
      socket option.
      
      Allow setting the IPV6_TRANSPARENT socket option.
      Allow setting the IPV6_HOPOPTS socket option.
      Allow setting the IPV6_RTHDRDSTOPTS socket option.
      Allow setting the IPV6_DSTOPTS socket option.
      Allow setting the IPV6_IPSEC_POLICY socket option.
      Allow setting the IPV6_XFRM_POLICY socket option.
      
      Allow sending packets with the IPV6_2292HOPOPTS control message.
      Allow sending packets with the IPV6_2292DSTOPTS control message.
      Allow sending packets with the IPV6_RTHDRDSTOPTS control message.
      
      Allow setting the multicast routing socket options on non multicast
      routing sockets.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, and SIOCDELTUNNEL ioctls for
      setting up, changing and deleting tunnels over ipv6.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, SIOCDELTUNNEL ioctls for
      setting up, changing and deleting ipv6 over ipv4 tunnels.
      
      Allow the SIOCADDPRL, SIOCDELPRL, SIOCCHGPRL ioctls for adding,
      deleting, and changing the potential router list for ISATAP tunnels.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af31f412
  36. 14 11月, 2012 1 次提交