1. 15 2月, 2017 1 次提交
    • S
      net: Add a skb_gro_flush_final helper. · 5f114163
      Steffen Klassert 提交于
      Add a skb_gro_flush_final helper to prepare for  consuming
      skbs in call_gro_receive. We will extend this helper to not
      touch the skb if the skb is consumed by a gro callback with
      a followup patch. We need this to handle the upcomming IPsec
      ESP callbacks as they reinject the skb to the napi_gro_receive
      asynchronous. The handler is used in all gro_receive functions
      that can call the ESP gro handlers.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      5f114163
  2. 09 2月, 2017 4 次提交
  3. 04 2月, 2017 2 次提交
  4. 03 2月, 2017 2 次提交
  5. 02 2月, 2017 6 次提交
  6. 31 1月, 2017 2 次提交
  7. 30 1月, 2017 2 次提交
  8. 26 1月, 2017 5 次提交
    • W
      net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau 提交于
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3979ad7e
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • W
      net/tcp-fastopen: refactor cookie check logic · 065263f4
      Wei Wang 提交于
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called else where in later changes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065263f4
    • J
      tcp: correct memory barrier usage in tcp_check_space() · 56d80622
      Jason Baron 提交于
      sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
      Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.
      
      Fixes: 3c715127 ("tcp: add memory barriers to write space paths")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56d80622
    • E
      tcp: reduce skb overhead in selected places · 60b1af33
      Eric Dumazet 提交于
      tcp_add_backlog() can use skb_condense() helper to get better
      gains and less SKB_TRUESIZE() magic. This only happens when socket
      backlog has to be used.
      
      Some attacks involve specially crafted out of order tiny TCP packets,
      clogging the ofo queue of (many) sockets.
      Then later, expensive collapse happens, trying to copy all these skbs
      into single ones.
      This unfortunately does not work if each skb has no neighbor in TCP
      sequence order.
      
      By using skb_condense() if the skb could not be coalesced to a prior
      one, we defeat these kind of threats, potentially saving 4K per skb
      (or more, since this is one page fragment).
      
      A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
      in skb->head, meaning the copy done by skb_condense() is limited to
      about 200 bytes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60b1af33
  9. 25 1月, 2017 2 次提交
  10. 21 1月, 2017 2 次提交
  11. 20 1月, 2017 1 次提交
    • A
      tcp: initialize max window for a new fastopen socket · 0dbd7ff3
      Alexey Kodanev 提交于
      Found that if we run LTP netstress test with large MSS (65K),
      the first attempt from server to send data comparable to this
      MSS on fastopen connection will be delayed by the probe timer.
      
      Here is an example:
      
           < S  seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
           > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
           < .  ack 1 win 342 length 0
      
      Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
      as well as in 'size_goal'. This results the segment not queued for
      transmition until all the data copied from user buffer. Then, inside
      __tcp_push_pending_frames(), it breaks on send window test and
      continues with the check probe timer.
      
      Fragmentation occurs in tcp_write_wakeup()...
      
      +0.2 > P. seq 1:43777 ack 1 win 342 length 43776
           < .  ack 43777, win 1365 length 0
           > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
           ...
      
      This also contradicts with the fact that we should bound to the half
      of the window if it is large.
      
      Fix this flaw by correctly initializing max_window. Before that, it
      could have large values that affect further calculations of 'size_goal'.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0dbd7ff3
  12. 19 1月, 2017 8 次提交
    • D
      lwtunnel: fix autoload of lwt modules · 9ed59592
      David Ahern 提交于
      Trying to add an mpls encap route when the MPLS modules are not loaded
      hangs. For example:
      
          CONFIG_MPLS=y
          CONFIG_NET_MPLS_GSO=m
          CONFIG_MPLS_ROUTING=m
          CONFIG_MPLS_IPTUNNEL=m
      
          $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
      
      The ip command hangs:
      root       880   826  0 21:25 pts/0    00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
      
          $ cat /proc/880/stack
          [<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134
          [<ffffffff81065efc>] __request_module+0x27b/0x30a
          [<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178
          [<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4
          [<ffffffff814ae451>] fib_table_insert+0x90/0x41f
          [<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52
          ...
      
      modprobe is trying to load rtnl-lwt-MPLS:
      
      root       881     5  0 21:25 ?        00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS
      
      and it hangs after loading mpls_router:
      
          $ cat /proc/881/stack
          [<ffffffff81441537>] rtnl_lock+0x12/0x14
          [<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179
          [<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router]
          [<ffffffff81000471>] do_one_initcall+0x8e/0x13f
          [<ffffffff81119961>] do_init_module+0x5a/0x1e5
          [<ffffffff810bd070>] load_module+0x13bd/0x17d6
          ...
      
      The problem is that lwtunnel_build_state is called with rtnl lock
      held preventing mpls_init from registering.
      
      Given the potential references held by the time lwtunnel_build_state it
      can not drop the rtnl lock to the load module. So, extract the module
      loading code from lwtunnel_build_state into a new function to validate
      the encap type. The new function is called while converting the user
      request into a fib_config which is well before any table, device or
      fib entries are examined.
      
      Fixes: 745041e2 ("lwtunnel: autoload of lwt modules")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ed59592
    • A
      netfilter: ipt_CLUSTERIP: fix build error without procfs · 3fd0b634
      Arnd Bergmann 提交于
      We can't access c->pde if CONFIG_PROC_FS is disabled:
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function 'clusterip_config_find_get':
      net/ipv4/netfilter/ipt_CLUSTERIP.c:147:9: error: 'struct clusterip_config' has no member named 'pde'
      
      This moves the check inside of another #ifdef.
      
      Fixes: 6c5d5cfb ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3fd0b634
    • J
      inet: reset tb->fastreuseport when adding a reuseport sk · 637bc8bb
      Josef Bacik 提交于
      If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
      never set it again.  Which means that in the future if we end up adding a bunch
      of reuseport sk's to that tb we'll have to do the expensive scan every time.
      Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
      so we know what comparison to make, and the ipv6 only setting so we can make
      sure to compare with new sockets appropriately.  Once one sk has made it onto
      the list we know that there are no potential bind conflicts on the owners list
      that match that sk's rcv_addr.  So copy the sk's information into our bind
      bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
      an extra check for subsequent reuseport sockets and skip the expensive bind
      conflict check.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      637bc8bb
    • J
      inet: split inet_csk_get_port into two functions · 289141b7
      Josef Bacik 提交于
      inet_csk_get_port does two different things, it either scans for an open port,
      or it tries to see if the specified port is available for use.  Since these two
      operations have different rules and are basically independent lets split them
      into two different functions to make them both more readable.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      289141b7
    • J
      inet: don't check for bind conflicts twice when searching for a port · 6cd66616
      Josef Bacik 提交于
      This is just wasted time, we've already found a tb that doesn't have a bind
      conflict, and we don't drop the head lock so scanning again isn't going to give
      us a different answer.  Instead move the tb->reuse setting logic outside of the
      found_tb path and put it in the success: path.  Then make it so that we don't
      goto again if we find a bind conflict in the found_tb path as we won't reach
      this anymore when we are scanning for an ephemeral port.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6cd66616
    • J
      inet: kill smallest_size and smallest_port · b9470c27
      Josef Bacik 提交于
      In inet_csk_get_port we seem to be using smallest_port to figure out where the
      best place to look for a SO_REUSEPORT sk that matches with an existing set of
      SO_REUSEPORT's.  However if we get to the logic
      
      if (smallest_size != -1) {
      	port = smallest_port;
      	goto have_port;
      }
      
      we will do a useless search, because we would have already done the
      inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
      would have gone to found_tb and succeeded.  Since this logic makes us do yet
      another trip through inet_csk_bind_conflict for a port we know won't work just
      delete this code and save us the time.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9470c27
    • J
      inet: drop ->bind_conflict · aa078842
      Josef Bacik 提交于
      The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
      is how they check the rcv_saddr, so delete this call back and simply
      change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa078842
    • J
      inet: collapse ipv4/v6 rcv_saddr_equal functions into one · fe38d2a1
      Josef Bacik 提交于
      We pass these per-protocol equal functions around in various places, but
      we can just have one function that checks the sk->sk_family and then do
      the right comparison function.  I've also changed the ipv4 version to
      not cast to inet_sock since it is unneeded.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe38d2a1
  13. 18 1月, 2017 2 次提交
    • J
      tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
      Jason Baron 提交于
      Using a Mac OSX box as a client connecting to a Linux server, we have found
      that when certain applications (such as 'ab'), are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e40f4c9
    • G
      net: ping: Use right format specifier to avoid type casting · a7ef6715
      Gao Feng 提交于
      The inet_num is u16, so use %hu instead of casting it to int. And
      the sk_bound_dev_if is int actually, so it needn't cast to int.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7ef6715
  14. 17 1月, 2017 1 次提交