1. 22 5月, 2015 4 次提交
    • E
      tcp: improve REUSEADDR/NOREUSEADDR cohabitation · 946f9eb2
      Eric Dumazet 提交于
      inet_csk_get_port() randomization effort tends to spread
      sockets on all the available range (ip_local_port_range)
      
      This is unfortunate because SO_REUSEADDR sockets have
      less requirements than non SO_REUSEADDR ones.
      
      If an application uses SO_REUSEADDR hint, it is to try to
      allow source ports being shared.
      
      So instead of picking a random port number in ip_local_port_range,
      lets try first in first half of the range.
      
      This gives more chances to use upper half of the range for the
      sockets with strong requirements (not using SO_REUSEADDR)
      
      Note this patch does not add a new sysctl, and only changes
      the way we try to pick port number.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: NFlavio Leitner <fbl@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946f9eb2
    • E
      inet_hashinfo: remove bsocket counter · f5af1f57
      Eric Dumazet 提交于
      We no longer need bsocket atomic counter, as inet_csk_get_port()
      calls bind_conflict() regardless of its value, after commit
      2b05ad33 ("tcp: bind() fix autoselection to share ports")
      
      This patch removes overhead of maintaining this counter and
      double inet_csk_get_port() calls under pressure.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: NFlavio Leitner <fbl@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5af1f57
    • J
      tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440
      Jason Baron 提交于
      We currently rely on the setting of SOCK_NOSPACE in the write()
      path to ensure that we wake up any epoll edge trigger waiters when
      acks return to free space in the write queue. However, if we fail
      to allocate even a single skb in the write queue, we could end up
      waiting indefinitely.
      
      Fix this by explicitly issuing a wakeup when we detect the condition
      of an empty write queue and a return value of -EAGAIN. This allows
      userspace to re-try as we expect this to be a temporary failure.
      
      I've tested this approach by artificially making
      sk_stream_alloc_skb() return NULL periodically. In that case,
      epoll edge trigger waiters will hang indefinitely in epoll_wait()
      without this patch.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce5ec440
    • E
      tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet 提交于
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule :
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block TCP socket from sending new data.
      If no further ACK is coming, this hang would be definitive, and socket
      has no chance to effectively reduce its memory usage.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb934478
  2. 20 5月, 2015 2 次提交
    • D
      tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann 提交于
      This work as a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested from the RFC on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, it's rather a smaller percentage of sites where there would
      occur such additional setup latency: it was found in end of 2014 that ~56%
      of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
      ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
      when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. Recent related
      paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
      section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
      which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
      allows for disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: NBrian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave That <dave.taht@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49213555
    • E
      tcp: Return error instead of partial read for saved syn headers · aea0929e
      Eric B Munson 提交于
      Currently the getsockopt() requesting the cached contents of the syn
      packet headers will fail silently if the caller uses a buffer that is
      too small to contain the requested data.  Rather than fail silently and
      discard the headers, getsockopt() should return an error and report the
      required size to hold the data.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aea0929e
  3. 19 5月, 2015 3 次提交
  4. 18 5月, 2015 7 次提交
  5. 16 5月, 2015 1 次提交
    • P
      netfilter: x_tables: add context to know if extension runs from nft_compat · 55917a21
      Pablo Neira Ayuso 提交于
      Currently, we have four xtables extensions that cannot be used from the
      xt over nft compat layer. The problem is that they need real access to
      the full blown xt_entry to validate that the rule comes with the right
      dependencies. This check was introduced to overcome the lack of
      sufficient userspace dependency validation in iptables.
      
      To resolve this problem, this patch introduces a new field to the
      xt_tgchk_param structure that tell us if the extension is run from
      nft_compat context.
      
      The three affected extensions are:
      
      1) CLUSTERIP, this target has been superseded by xt_cluster. So just
         bail out by returning -EINVAL.
      
      2) TCPMSS. Relax the checking when used from nft_compat. If used with
         the wrong configuration, it will corrupt !syn packets by adding TCP
         MSS option.
      
      3) ebt_stp. Relax the check to make sure it uses the reserved
         destination MAC address for STP.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Tested-by: NArturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      55917a21
  6. 15 5月, 2015 1 次提交
  7. 14 5月, 2015 6 次提交
  8. 13 5月, 2015 1 次提交
  9. 11 5月, 2015 3 次提交
  10. 10 5月, 2015 2 次提交
  11. 06 5月, 2015 3 次提交
  12. 05 5月, 2015 1 次提交
    • L
      net: Export IGMP/MLD message validation code · 9afd85c9
      Linus Lüssing 提交于
      With this patch, the IGMP and MLD message validation functions are moved
      from the bridge code to IPv4/IPv6 multicast files. Some small
      refactoring was done to enhance readibility and to iron out some
      differences in behaviour between the IGMP and MLD parsing code (e.g. the
      skb-cloning of MLD messages is now only done if necessary, just like the
      IGMP part always did).
      
      Finally, these IGMP and MLD message validation functions are exported so
      that not only the bridge can use it but batman-adv later, too.
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9afd85c9
  13. 04 5月, 2015 4 次提交
  14. 03 5月, 2015 1 次提交
  15. 02 5月, 2015 1 次提交