1. 23 3月, 2018 4 次提交
  2. 17 2月, 2018 1 次提交
  3. 15 2月, 2018 1 次提交
    • P
      netfilter: drop outermost socket lock in getsockopt() · 01ea306f
      Paolo Abeni 提交于
      The Syzbot reported a possible deadlock in the netfilter area caused by
      rtnl lock, xt lock and socket lock being acquired with a different order
      on different code paths, leading to the following backtrace:
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0+ #301 Not tainted
      ------------------------------------------------------
      syzkaller233489/4179 is trying to acquire lock:
        (rtnl_mutex){+.+.}, at: [<0000000048e996fd>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      but task is already holding lock:
        (&xt[i].mutex){+.+.}, at: [<00000000328553a2>]
      xt_find_table_lock+0x3e/0x3e0 net/netfilter/x_tables.c:1041
      
      which lock already depends on the new lock.
      ===
      
      Since commit 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock
      only in the required scope"), we already acquire the socket lock in
      the innermost scope, where needed. In such commit I forgot to remove
      the outer-most socket lock from the getsockopt() path, this commit
      addresses the issues dropping it now.
      
      v1 -> v2: fix bad subj, added relavant 'fixes' tag
      
      Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
      Fixes: 202f59af ("netfilter: ipt_CLUSTERIP: do not hold dev")
      Fixes: 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock only in the required scope")
      Reported-by: syzbot+ddde1c7b7ff7442d7f2d@syzkaller.appspotmail.com
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      01ea306f
  4. 31 1月, 2018 1 次提交
    • P
      netfilter: on sockopt() acquire sock lock only in the required scope · 3f34cfae
      Paolo Abeni 提交于
      Syzbot reported several deadlocks in the netfilter area caused by
      rtnl lock and socket lock being acquired with a different order on
      different code paths, leading to backtraces like the following one:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0-rc9+ #212 Not tainted
      ------------------------------------------------------
      syzkaller041579/3682 is trying to acquire lock:
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
      include/net/sock.h:1463 [inline]
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
      do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
      
      but task is already holding lock:
        (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (rtnl_mutex){+.+.}:
              __mutex_lock_common kernel/locking/mutex.c:756 [inline]
              __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
              mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
              rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
              register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
              tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
              xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
              check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
              find_check_entry.isra.7+0x935/0xcf0
      net/ipv6/netfilter/ip6_tables.c:580
              translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
              do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
              do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
              nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
              nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
              ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      -> #0 (sk_lock-AF_INET6){+.+.}:
              lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
              lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
              lock_sock include/net/sock.h:1463 [inline]
              do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
              ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(rtnl_mutex);
                                      lock(sk_lock-AF_INET6);
                                      lock(rtnl_mutex);
         lock(sk_lock-AF_INET6);
      
        *** DEADLOCK ***
      
      1 lock held by syzkaller041579/3682:
        #0:  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      The problem, as Florian noted, is that nf_setsockopt() is always
      called with the socket held, even if the lock itself is required only
      for very tight scopes and only for some operation.
      
      This patch addresses the issues moving the lock_sock() call only
      where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
      does not need anymore to acquire both locks.
      
      Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
      Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3f34cfae
  5. 26 1月, 2018 1 次提交
    • D
      net/ipv4: Allow send to local broadcast from a socket bound to a VRF · 9515a2e0
      David Ahern 提交于
      Message sends to the local broadcast address (255.255.255.255) require
      uc_index or sk_bound_dev_if to be set to an egress device. However,
      responses or only received if the socket is bound to the device. This
      is overly constraining for processes running in an L3 domain. This
      patch allows a socket bound to the VRF device to send to the local
      broadcast address by using IP_UNICAST_IF to set the egress interface
      with packet receipt handled by the VRF binding.
      
      Similar to IP_MULTICAST_IF, relax the constraint on setting
      IP_UNICAST_IF if a socket is bound to an L3 master device. In this
      case allow uc_index to be set to an enslaved if sk_bound_dev_if is
      an L3 master device and is the master device for the ifindex.
      
      In udp and raw sendmsg, allow uc_index to override the oif if
      uc_index master device is oif (ie., the oif is an L3 master and the
      index is an L3 slave).
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9515a2e0
  6. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  7. 16 9月, 2017 1 次提交
    • D
      net: ipv4: fix l3slave check for index returned in IP_PKTINFO · cbea8f02
      David Ahern 提交于
      rt_iif is only set to the actual egress device for the output path. The
      recent change to consider the l3slave flag when returning IP_PKTINFO
      works for local traffic (the correct device index is returned), but it
      broke the more typical use case of packets received from a remote host
      always returning the VRF index rather than the original ingress device.
      Update the fixup to consider l3slave and rt_iif actually getting set.
      
      Fixes: 1dfa7639 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbea8f02
  8. 14 8月, 2017 1 次提交
  9. 07 8月, 2017 2 次提交
  10. 30 6月, 2017 1 次提交
  11. 01 5月, 2017 1 次提交
  12. 18 4月, 2017 2 次提交
    • W
      net-timestamp: avoid use-after-free in ip_recv_error · 1862d620
      Willem de Bruijn 提交于
      Syzkaller reported a use-after-free in ip_recv_error at line
      
          info->ipi_ifindex = skb->dev->ifindex;
      
      This function is called on dequeue from the error queue, at which
      point the device pointer may no longer be valid.
      
      Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
      pointer is valid or NULL. Store it in temporary storage skb->cb.
      
      It is safe to reference skb->dev here, as called from device drivers
      or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
      in that case it is NULL and ifindex is set to 0 (invalid).
      
      Do not return a pktinfo cmsg if ifindex is 0. This maintains the
      current behavior of not returning a cmsg if skb->dev was NULL.
      
      On dequeue, the ipv4 path will cast from sock_exterr_skb to
      in_pktinfo. Both have ifindex as their first element, so no explicit
      conversion is needed. This is by design, introduced in commit
      0b922b7a ("net: original ingress device index in PKTINFO"). For
      ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
      
      Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1862d620
    • W
      ipv4: fix a deadlock in ip_ra_control · 1215e51e
      WANG Cong 提交于
      Similar to commit 87e9f031
      ("ipv4: fix a potential deadlock in mcast getsockopt() path"),
      there is a deadlock scenario for IP_ROUTER_ALERT too:
      
             CPU0                    CPU1
             ----                    ----
        lock(rtnl_mutex);
                                     lock(sk_lock-AF_INET);
                                     lock(rtnl_mutex);
        lock(sk_lock-AF_INET);
      
      Fix this by always locking RTNL first on all setsockopt() paths.
      
      Note, after this patch ip_ra_lock is no longer needed either.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1215e51e
  13. 22 2月, 2017 1 次提交
  14. 05 2月, 2017 1 次提交
  15. 05 1月, 2017 1 次提交
  16. 31 12月, 2016 1 次提交
    • D
      net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern 提交于
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bb387c5
  17. 30 12月, 2016 1 次提交
    • W
      net: fix incorrect original ingress device index in PKTINFO · f0c16ba8
      Wei Zhang 提交于
      When we send a packet for our own local address on a non-loopback
      interface (e.g. eth0), due to the change had been introduced from
      commit 0b922b7a ("net: original ingress device index in PKTINFO"), the
      original ingress device index would be set as the loopback interface.
      However, the packet should be considered as if it is being arrived via the
      sending interface (eth0), otherwise it would break the expectation of the
      userspace application (e.g. the DHCPRELEASE message from dhcp_release
      binary would be ignored by the dnsmasq daemon, since it come from lo which
      is not the interface dnsmasq bind to)
      
      Fixes: 0b922b7a ("net: original ingress device index in PKTINFO")
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NWei Zhang <asuka.com@163.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0c16ba8
  18. 25 12月, 2016 1 次提交
  19. 24 12月, 2016 1 次提交
  20. 08 11月, 2016 1 次提交
  21. 04 11月, 2016 1 次提交
    • W
      ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Willem de Bruijn 提交于
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that cases matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ecc248
  22. 27 10月, 2016 1 次提交
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet 提交于
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10df8e61
  23. 09 9月, 2016 1 次提交
  24. 17 5月, 2016 1 次提交
  25. 12 5月, 2016 1 次提交
    • D
      net: original ingress device index in PKTINFO · 0b922b7a
      David Ahern 提交于
      Applications such as OSPF and BFD need the original ingress device not
      the VRF device; the latter can be derived from the former. To that end
      add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
      the skb control buffer similar to IPv6. From there the pktinfo can just
      pull it from cb with the PKTINFO_SKB_CB cast.
      
      The previous patch moving the skb->dev change to L3 means nothing else
      is needed for IPv6; it just works.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b922b7a
  26. 26 4月, 2016 1 次提交
  27. 14 4月, 2016 1 次提交
  28. 08 4月, 2016 1 次提交
  29. 05 4月, 2016 1 次提交
  30. 17 2月, 2016 1 次提交
  31. 13 2月, 2016 1 次提交
  32. 11 2月, 2016 1 次提交
  33. 05 11月, 2015 1 次提交
  34. 24 6月, 2015 1 次提交
  35. 07 6月, 2015 1 次提交
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da