1. 18 12月, 2016 1 次提交
    • T
      inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Tom Herbert 提交于
      A user may call listen with binding an explicit port with the intent
      that the kernel will assign an available port to the socket. In this
      case inet_csk_get_port does a port scan. For such sockets, the user may
      also set soreuseport with the intent a creating more sockets for the
      port that is selected. The problem is that the initial socket being
      opened could inadvertently choose an existing and unreleated port
      number that was already created with soreuseport.
      
      This patch adds a boolean parameter to inet_bind_conflict that indicates
      rather soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (snum
      argument is nonzero), and is false if port scan is done.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0643ee4f
  2. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  3. 11 2月, 2016 1 次提交
    • C
      soreuseport: fast reuseport TCP socket selection · c125e80b
      Craig Gallek 提交于
      This change extends the fast SO_REUSEPORT socket lookup implemented
      for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
      receive address are additionally added to an array for faster
      random access.  This means that only a single socket from the group
      must be found in the listener list before any socket in the group can
      be used to receive a packet.  Previously, every socket in the group
      needed to be considered before handing off the incoming packet.
      
      This feature also exposes the ability to use a BPF program when
      selecting a socket from a reuseport group.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c125e80b
  4. 05 1月, 2016 1 次提交
    • C
      soreuseport: fast reuseport UDP socket selection · e32ea7e7
      Craig Gallek 提交于
      Include a struct sock_reuseport instance when a UDP socket binds to
      a specific address for the first time with the reuseport flag set.
      When selecting a socket for an incoming UDP packet, use the information
      available in sock_reuseport if present.
      
      This required adding an additional field to the UDP source address
      equality function to differentiate between exact and wildcard matches.
      The original use case allowed wildcard matches when checking for
      existing port uses during bind.  The new use case of adding a socket
      to a reuseport group requires exact address matching.
      
      Performance test (using a machine with 2 CPU sockets and a total of
      48 cores):  Create reuseport groups of varying size.  Use one socket
      from this group per user thread (pinning each thread to a different
      core) calling recvmmsg in a tight loop.  Record number of messages
      received per second while saturating a 10G link.
        10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
        20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
        40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)
      
      This work is based off a similar implementation written by
      Ying Cai <ycai@google.com> for implementing policy-based reuseport
      selection.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e32ea7e7
  5. 04 12月, 2015 1 次提交
    • E
      ipv6: kill sk_dst_lock · 6bd4f355
      Eric Dumazet 提交于
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() share same implementation.
      
      sk_setup_caps() being called with socket lock being held or not,
      we have to use sk_dst_set() instead of __sk_dst_set()
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
      from another cpu doing a concurrent ip6_dst_store()
      
      Doing the dst dereference before doing the install is needed to make
      sure no use after free would trigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bd4f355
  6. 03 12月, 2015 1 次提交
  7. 03 10月, 2015 2 次提交
  8. 30 9月, 2015 1 次提交
  9. 26 9月, 2015 1 次提交
  10. 24 3月, 2015 1 次提交
  11. 21 3月, 2015 2 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
    • E
      inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet 提交于
      When request sock are put in ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, following patch will get rid of big purge timer,
      so we want to delete a request sock without holding listener lock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52452c54
  12. 25 8月, 2014 2 次提交
  13. 14 5月, 2014 1 次提交
    • L
      net: support marking accepting TCP sockets · 84f39b08
      Lorenzo Colitti 提交于
      When using mark-based routing, sockets returned from accept()
      may need to be marked differently depending on the incoming
      connection request.
      
      This is the case, for example, if different socket marks identify
      different networks: a listening socket may want to accept
      connections from all networks, but each connection should be
      marked with the network that the request came in on, so that
      subsequent packets are sent on the correct network.
      
      This patch adds a sysctl to mark TCP sockets based on the fwmark
      of the incoming SYN packet. If enabled, and an unmarked socket
      receives a SYN, then the SYN packet's fwmark is written to the
      connection's inet_request_sock, and later written back to the
      accepted socket when the connection is established.  If the
      socket already has a nonzero mark, then the behaviour is the same
      as it is today, i.e., the listening socket's fwmark is used.
      
      Black-box tested using user-mode linux:
      
      - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
        mark of the incoming SYN packet.
      - The socket returned by accept() is marked with the mark of the
        incoming SYN packet.
      - Tested with syncookies=1 and syncookies=2.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84f39b08
  14. 16 4月, 2014 1 次提交
  15. 06 12月, 2013 1 次提交
  16. 11 10月, 2013 1 次提交
  17. 10 10月, 2013 1 次提交
    • E
      inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet 提交于
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock
      
      This means there is no more inet6_request_sock IPv6 specific
      structure.
      
      Following inet_request_sock fields were renamed as they became
      macros to reference fields from struct sock_common.
      Prefix ir_ was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634fb979
  18. 09 10月, 2013 1 次提交
    • E
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet 提交于
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe4208f
  19. 09 3月, 2013 1 次提交
  20. 06 3月, 2013 1 次提交
  21. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  22. 30 1月, 2013 1 次提交
    • E
      ipv6: Fix inet6_csk_bind_conflict so it builds with user namespaces enabled · 243bb4c6
      Eric W. Biederman 提交于
      When attempting to build linux-next with user namespaces enabled I ran
      into this fun build error.
      
        CC      net/ipv6/inet6_connection_sock.o
      .../net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_bind_conflict’:
      .../net/ipv6/inet6_connection_sock.c:37:12: error: incompatible types when initializing type ‘int’ using
       type ‘kuid_t’
      .../net/ipv6/inet6_connection_sock.c:54:30: error: incompatible type for argument 1 of ‘uid_eq’
      .../include/linux/uidgid.h:48:20: note: expected ‘kuid_t’ but argument is of type ‘int’
      make[3]: *** [net/ipv6/inet6_connection_sock.o] Error 1
      make[2]: *** [net/ipv6] Error 2
      make[2]: *** Waiting for unfinished jobs....
      
      Using kuid_t instead of int to hold the uid fixes this.
      
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      243bb4c6
  23. 24 1月, 2013 1 次提交
    • T
      soreuseport: TCP/IPv6 implementation · 5ba24953
      Tom Herbert 提交于
      Motivation for soreuseport would be something like a web server
      binding to port 80 running with multiple threads, where each thread
      might have it's own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers. 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener thread
      can easily become the bottleneck with high connection turn-over rate.
      In case #2, the proportion of connections accepted per thread tends
      to be uneven under high connection load (assuming simple event loop:
      while (1) { accept(); process() }, wakeup does not promote fairness
      among the sockets.  We have seen the  disproportion to be as high
      as 3:1 ratio between thread accepting most connections and the one
      accepting the fewest.  With so_reusport the distribution is
      uniform.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ba24953
  24. 21 11月, 2012 1 次提交
  25. 19 9月, 2012 1 次提交
  26. 18 7月, 2012 1 次提交
  27. 17 7月, 2012 1 次提交
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
  28. 16 7月, 2012 1 次提交
  29. 29 6月, 2012 2 次提交
  30. 15 4月, 2012 1 次提交
    • A
      tcp: bind() use stronger condition for bind_conflict · aacd9289
      Alex Copot 提交于
      We must try harder to get unique (addr, port) pairs when
      doing port autoselection for sockets with SO_REUSEADDR
      option set.
      
      We achieve this by adding a relaxation parameter to
      inet_csk_bind_conflict. When 'relax' parameter is off
      we return a conflict whenever the current searched
      pair (addr, port) is not unique.
      
      This tries to address the problems reported in patch:
      	8d238b25
      	Revert "tcp: bind() fix when many ports are bound"
      
      Tests where ran for creating and binding(0) many sockets
      on 100 IPs. The results are, on average:
      
      	* 60000 sockets, 600 ports / IP:
      		* 0.210 s, 620 (IP, port) duplicates without patch
      		* 0.219 s, no duplicates with patch
      	* 100000 sockets, 1000 ports / IP:
      		* 0.371 s, 1720 duplicates without patch
      		* 0.373 s, no duplicates with patch
      	* 200000 sockets, 2000 ports / IP:
      		* 0.766 s, 6900 duplicates without patch
      		* 0.768 s, no duplicates with patch
      	* 500000 sockets, 5000 ports / IP:
      		* 2.227 s, 41500 duplicates without patch
      		* 2.284 s, no duplicates with patch
      Signed-off-by: NAlex Copot <alex.mihai.c@gmail.com>
      Signed-off-by: NDaniel Baluta <dbaluta@ixiacom.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aacd9289
  31. 24 11月, 2011 1 次提交
  32. 23 11月, 2011 1 次提交
  33. 27 10月, 2011 1 次提交
  34. 01 8月, 2011 1 次提交
  35. 09 5月, 2011 1 次提交
    • D
      inet: Pass flowi to ->queue_xmit(). · d9d8da80
      David S. Miller 提交于
      This allows us to acquire the exact route keying information from the
      protocol, however that might be managed.
      
      It handles all of the possibilities, from the simplest case of storing
      the key in inet->cork.fl to the more complex setup SCTP has where
      individual transports determine the flow.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d9d8da80
  36. 14 4月, 2011 1 次提交