1. 25 7月, 2015 1 次提交
  2. 22 7月, 2015 2 次提交
  3. 16 7月, 2015 2 次提交
  4. 11 7月, 2015 1 次提交
    • P
      net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets · 8220ea23
      Phil Sutter 提交于
      Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
      sockopt", I am not happy with the limitations it causes for socket
      analysing code in userspace. Exporting the value only if it is set makes
      it hard for userspace to decide whether the option is not set or the
      kernel does not support exporting the option at all.
      
      >From an auditor's perspective, the interesting question for listening
      AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
      the unexpected case. This patch allows to answer this question reliably.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8220ea23
  5. 09 7月, 2015 2 次提交
  6. 02 7月, 2015 1 次提交
    • F
      netfilter: arptables: use percpu jumpstack · 3bd22997
      Florian Westphal 提交于
      commit 482cfc31 ("netfilter: xtables: avoid percpu ruleset duplication")
      
      Unlike ip and ip6tables, arp tables were never converted to use the percpu
      jump stack.
      
      It still uses the rule blob to store return address, which isn't safe
      anymore since we now share this blob among all processors.
      
      Because there is no TEE support for arptables, we don't need to cope
      with reentrancy, so we can use loocal variable to hold stack offset.
      
      Fixes: 482cfc31 ("netfilter: xtables: avoid percpu ruleset duplication")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3bd22997
  7. 29 6月, 2015 1 次提交
    • A
      ipv4: fix RCU lockdep warning from linkdown changes · 96ac5cc9
      Andy Gospodarek 提交于
      The following lockdep splat was seen due to the wrong context for
      grabbing in_dev.
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 Not tainted
      -------------------------------
      include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      2 locks held by ip/403:
       #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81453305>] rtnl_lock+0x17/0x19
       #1:  ((inetaddr_chain).rwsem){.+.+.+}, at: [<ffffffff8105c327>] __blocking_notifier_call_chain+0x35/0x6a
      
      stack backtrace:
      CPU: 2 PID: 403 Comm: ip Not tainted 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244
       0000000000000001 ffff8800b189b728 ffffffff8150a542 ffffffff8107a8b3
       ffff880037bbea40 ffff8800b189b758 ffffffff8107cb74 ffff8800379dbd00
       ffff8800bec85800 ffff8800bf9e13c0 00000000000000ff ffff8800b189b7d8
      Call Trace:
       [<ffffffff8150a542>] dump_stack+0x4c/0x6e
       [<ffffffff8107a8b3>] ? up+0x39/0x3e
       [<ffffffff8107cb74>] lockdep_rcu_suspicious+0xf7/0x100
       [<ffffffff814b63c3>] fib_dump_info+0x227/0x3e2
       [<ffffffff814b6624>] rtmsg_fib+0xa6/0x116
       [<ffffffff814b978f>] fib_table_insert+0x316/0x355
       [<ffffffff814b362e>] fib_magic+0xb7/0xc7
       [<ffffffff814b4803>] fib_add_ifaddr+0xb1/0x13b
       [<ffffffff814b4d09>] fib_inetaddr_event+0x36/0x90
       [<ffffffff8105c086>] notifier_call_chain+0x4c/0x71
       [<ffffffff8105c340>] __blocking_notifier_call_chain+0x4e/0x6a
       [<ffffffff8105c370>] blocking_notifier_call_chain+0x14/0x16
       [<ffffffff814a7f50>] __inet_insert_ifa+0x1a5/0x1b3
       [<ffffffff814a894d>] inet_rtm_newaddr+0x350/0x35f
       [<ffffffff81457866>] rtnetlink_rcv_msg+0x17b/0x18a
       [<ffffffff8107e7c3>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff8146965f>] ? netlink_deliver_tap+0x1cb/0x1f7
       [<ffffffff814576eb>] ? rtnl_newlink+0x72a/0x72a
      ...
      
      This patch resolves that splat.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96ac5cc9
  8. 24 6月, 2015 4 次提交
    • P
      net: inet_diag: export IPV6_V6ONLY sockopt · 20462155
      Phil Sutter 提交于
      For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
      exported to userspace. It indicates whether a socket bound to in6addr_any
      listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
      listed by e.g. 'ss -l -4'.
      
      This patch is accompanied by an appropriate one for iproute2 to enable
      the additional information in 'ss -e'.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20462155
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek 提交于
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eeb075f
    • A
      net: track link-status of ipv4 nexthops · 8a3d0316
      Andy Gospodarek 提交于
      Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
      reachable via an interface where carrier is off.  No action is taken,
      but additional flags are passed to userspace to indicate carrier status.
      
      This also includes a cleanup to fib_disable_ip to more clearly indicate
      what event made the function call to replace the more cryptic force
      option previously used.
      
      v2: Split out kernel functionality into 2 patches, this patch simply
      sets and clears new nexthop flag RTNH_F_LINKDOWN.
      
      v3: Cleanups suggested by Alex as well as a bug noticed in
      fib_sync_down_dev and fib_sync_up when multipath was not enabled.
      
      v5: Whitespace and variable declaration fixups suggested by Dave.
      
      v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
      all.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a3d0316
    • J
      ip: report the original address of ICMP messages · 34b99df4
      Julian Anastasov 提交于
      ICMP messages can trigger ICMP and local errors. In this case
      serr->port is 0 and starting from Linux 4.0 we do not return
      the original target address to the error queue readers.
      Add function to define which errors provide addr_offset.
      With this fix my ping command is not silent anymore.
      
      Fixes: c247f053 ("ip: fix error queue empty skb handling")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34b99df4
  9. 23 6月, 2015 2 次提交
    • C
      tcp: Do not call tcp_fastopen_reset_cipher from interrupt context · dfea2aa6
      Christoph Paasch 提交于
      tcp_fastopen_reset_cipher really cannot be called from interrupt
      context. It allocates the tcp_fastopen_context with GFP_KERNEL and
      calls crypto_alloc_cipher, which allocates all kind of stuff with
      GFP_KERNEL.
      
      Thus, we might sleep when the key-generation is triggered by an
      incoming TFO cookie-request which would then happen in interrupt-
      context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:
      
      [   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
      [   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
      [   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
      [   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
      [   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
      [   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
      [   36.012494] Call Trace:
      [   36.012953]  <IRQ>  [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
      [   36.014085]  [<ffffffff810967d3>] ___might_sleep+0x103/0x170
      [   36.015117]  [<ffffffff81096892>] __might_sleep+0x52/0x90
      [   36.016117]  [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
      [   36.017266]  [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
      [   36.018485]  [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
      [   36.019679]  [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
      [   36.020884]  [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
      [   36.022058]  [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
      [   36.023118]  [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
      [   36.024185]  [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
      [   36.025327]  [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
      [   36.026410]  [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
      [   36.027556]  [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
      [   36.028784]  [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
      [   36.029832]  [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
      [   36.030936]  [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
      [   36.031875]  [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
      [   36.032953]  [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
      [   36.034065]  [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
      [   36.035069]  [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
      [   36.035963]  [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
      [   36.036950]  [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
      [   36.037847]  [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
      [   36.038994]  [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
      [   36.040033]  [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
      [   36.041025]  [<ffffffff81611482>] net_rx_action+0x122/0x310
      [   36.042007]  [<ffffffff81076743>] __do_softirq+0x103/0x2f0
      [   36.042978]  [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30
      
      This patch moves the call to tcp_fastopen_init_key_once to the places
      where a listener socket creates its TFO-state, which always happens in
      user-context (either from the setsockopt, or implicitly during the
      listen()-call)
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Fixes: 222e83d2 ("tcp: switch tcp_fastopen key generation to net_get_random_once")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfea2aa6
    • H
      inet_diag: Remove _bh suffix in inet_diag_dump_reqs(). · 3b188443
      Hiroaki SHIMODA 提交于
      inet_diag_dump_reqs() is called from inet_diag_dump_icsk() with BH
      disabled. So no need to disable BH in inet_diag_dump_reqs().
      Signed-off-by: NHiroaki Shimoda <shimoda.hiroaki@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b188443
  10. 22 6月, 2015 2 次提交
  11. 17 6月, 2015 1 次提交
    • P
      netfilter: don't use module_init/exit in core IPV4 code · 55331060
      Paul Gortmaker 提交于
      The file net/ipv4/netfilter.o is created based on whether
      CONFIG_NETFILTER is set.  However that is defined as a bool, and
      hence this file with the core netfilter hooks will never be
      modular.  So using module_init as an alias for __initcall can be
      somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.  Also add an inclusion of init.h, as
      that was previously implicit here in the netfilter.c file.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of subsys_initcall (which
      seems to make sense for netfilter code) will thus change this
      registration from level 6-device to level 4-subsys (i.e. slightly
      earlier).  However no observable impact of that small difference
      has been observed during testing, or is expected. (i.e. the
      location of the netfilter messages in dmesg remains unchanged
      with respect to all the other surrounding messages.)
      
      As for the module_exit, rather than replace it with __exitcall,
      we simply remove it, since it appears only UML does anything
      with those, and even for UML, there is no relevant cleanup
      to be done here.
      
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netfilter-devel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      55331060
  12. 16 6月, 2015 3 次提交
  13. 15 6月, 2015 2 次提交
  14. 13 6月, 2015 1 次提交
  15. 12 6月, 2015 8 次提交
  16. 11 6月, 2015 3 次提交
    • K
      tcp: add CDG congestion control · 2b0a8c9e
      Kenneth Klette Jonassen 提交于
      CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
      the TCP sender in order to [1]:
      
        o Use the delay gradient as a congestion signal.
        o Back off with an average probability that is independent of the RTT.
        o Coexist with flows that use loss-based congestion control, i.e.,
          flows that are unresponsive to the delay signal.
        o Tolerate packet loss unrelated to congestion. (Disabled by default.)
      
      Its FreeBSD implementation was presented for the ICCRG in July 2012;
      slides are available at http://www.ietf.org/proceedings/84/iccrg.html
      
      Running the experiment scenarios in [1] suggests that our implementation
      achieves more goodput compared with FreeBSD 10.0 senders, although it also
      causes more queueing delay for a given backoff factor.
      
      The loss tolerance heuristic is disabled by default due to safety concerns
      for its use in the Internet [2, p. 45-46].
      
      We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
      the probability of slow start overshoot.
      
      [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
          delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
      [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
          MSc thesis. Department of Informatics, University of Oslo, 2015.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: David Hayes <davihay@ifi.uio.no>
      Cc: Andreas Petlund <apetlund@simula.no>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
      Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b0a8c9e
    • K
      tcp: export tcp_enter_cwr() · 7782ad8b
      Kenneth Klette Jonassen 提交于
      Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this
      function so that CDG can be compiled as a module.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: David Hayes <davihay@ifi.uio.no>
      Cc: Andreas Petlund <apetlund@simula.no>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
      Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7782ad8b
    • E
      net: tcp: dctcp_update_alpha() fixes. · f9c2ff22
      Eric Dumazet 提交于
      dctcp_alpha can be read by from dctcp_get_info() without
      synchro, so use WRITE_ONCE() to prevent compiler from using
      dctcp_alpha as a temporary variable.
      
      Also, playing with small dctcp_shift_g (like 1), can expose
      an overflow with 32bit values shifted 9 times before divide.
      
      Use an u64 field to avoid this problem, and perform the divide
      only if acked_bytes_ecn is not zero.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9c2ff22
  17. 08 6月, 2015 2 次提交
  18. 07 6月, 2015 2 次提交
    • E
      tcp: remove redundant checks II · 98da81a4
      Eric Dumazet 提交于
      For same reasons than in commit 12e25e10 ("tcp: remove redundant
      checks"), we can remove redundant checks done for timewait sockets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98da81a4
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da