1. 25 May, 2015 2 commits
  2. 23 May, 2015 3 commits
    • ipv4: fill in table id when replacing a route · d4e64c29
      Michal Kubeček committed
      When replacing an IPv4 route, the tb_id member of the new fib_alias
      structure is not set in the replace code path, so the new route is
      ignored.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Avoid crashing in ip_error · 381c759d
      Eric W. Biederman committed
      ip_error does not check if in_dev is NULL before dereferencing it.
      
      The following sequence of calls is possible:
      CPU A                          CPU B
      ip_rcv_finish
          ip_route_input_noref()
              ip_route_input_slow()
                                     inetdev_destroy()
          dst_input()
      
      With the result that a network device can be destroyed while processing
      an input packet.
      
      A crash was triggered with only unicast packets in flight, and
      forwarding enabled on the only network device.   The error condition
      was created by the removal of the network device.
      
      As such it is likely that the error code was -EHOSTUNREACH, and the
      action taken by ip_error (if in_dev had been accessible) would have
      been to not increment any counters and to have tried and likely failed
      to send an icmp error as the network device is going away.
      
      Therefore handle this weird case by just dropping the packet if
      !in_dev.  It will result in dropping the packet sooner, and will not
      result in an actual change of behavior.
      
      Fixes: 251da413 ("ipv4: Cache ip_error() routes even when not forwarding.")
      Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Eric Dumazet committed
      Taking socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused as all TCP sockets share same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
      arches get optimized, as these are nops for them.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 22 May, 2015 5 commits
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner committed
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use these numbers to get an idea of connection
      quality when compared against the retransmissions.
      
      RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.
      
      Each is a 32-bit field and can be fetched either with the TCP_INFO
      getsockopt(), if one has a handle on a TCP socket, or from the inet_diag
      netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Joint work with Eric Dumazet.
      
      Note that received SYN are accounted on the listener, but sent SYNACK
      are not accounted.
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
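A minimal userspace sketch of reading the two counters described in the tcp_info commit above through the TCP_INFO getsockopt(). The byte offsets are an assumption derived from the usual layout of struct tcp_info (eight one-byte fields, then 24 __u32 fields, then four __u64 fields before tcpi_segs_out and tcpi_segs_in); the helper names parse_segs and read_segs are hypothetical. Kernels that predate this commit return a shorter structure, which the sketch reports as None.

```python
import socket
import struct

# Assumed offsets inside struct tcp_info: 8 one-byte fields, 24 __u32 fields
# and 4 __u64 fields come before tcpi_segs_out, which is followed by tcpi_segs_in.
SEGS_OUT_OFFSET = 8 + 24 * 4 + 4 * 8   # 136
SEGS_IN_OFFSET = SEGS_OUT_OFFSET + 4   # 140


def parse_segs(tcp_info: bytes):
    """Return (segs_out, segs_in), or None if the buffer is too short to hold them."""
    if len(tcp_info) < SEGS_IN_OFFSET + 4:
        return None
    segs_out, segs_in = struct.unpack_from("=II", tcp_info, SEGS_OUT_OFFSET)
    return segs_out, segs_in


def read_segs(sock: socket.socket):
    """Fetch TCP_INFO from a connected TCP socket (Linux only) and parse it."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 256)
    return parse_segs(raw)
```

Comparing segs_out from read_segs() against the retransmission counters in the same structure gives the rough connection-quality signal the commit message mentions.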
    • tcp: improve REUSEADDR/NOREUSEADDR cohabitation · 946f9eb2
      Eric Dumazet committed
      inet_csk_get_port() randomization effort tends to spread
      sockets on all the available range (ip_local_port_range)
      
      This is unfortunate because SO_REUSEADDR sockets have
      fewer requirements than non-SO_REUSEADDR ones.
      
      If an application uses SO_REUSEADDR hint, it is to try to
      allow source ports being shared.
      
      So instead of picking a random port number in ip_local_port_range,
      let's first try the first half of the range.
      
      This gives more chances to use upper half of the range for the
      sockets with strong requirements (not using SO_REUSEADDR)
      
      Note this patch does not add a new sysctl, and only changes
      the way we try to pick port number.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
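The port selection idea in the REUSEADDR commit above can be modeled in a few lines. This is a simplified sketch, not the kernel's inet_csk_get_port(): the function name candidate_port_range is hypothetical, and it only covers the first search attempt described in the commit message.

```python
def candidate_port_range(low: int, high: int, reuseaddr: bool):
    """Return the (low, high) range searched first for an automatic port pick.

    SO_REUSEADDR sockets start in the lower half of ip_local_port_range,
    which leaves the upper half more likely to be free for sockets that
    cannot share a port.
    """
    if reuseaddr:
        half = low + ((high - low) >> 1)
        return low, half
    return low, high
```

With a range of 32768 to 60999, a SO_REUSEADDR socket would first search 32768 to 46883, while other sockets still search the whole range.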
    • inet_hashinfo: remove bsocket counter · f5af1f57
      Eric Dumazet committed
      We no longer need bsocket atomic counter, as inet_csk_get_port()
      calls bind_conflict() regardless of its value, after commit
      2b05ad33 ("tcp: bind() fix autoselection to share ports")
      
      This patch removes overhead of maintaining this counter and
      double inet_csk_get_port() calls under pressure.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440
      Jason Baron committed
      We currently rely on the setting of SOCK_NOSPACE in the write()
      path to ensure that we wake up any epoll edge trigger waiters when
      acks return to free space in the write queue. However, if we fail
      to allocate even a single skb in the write queue, we could end up
      waiting indefinitely.
      
      Fix this by explicitly issuing a wakeup when we detect the condition
      of an empty write queue and a return value of -EAGAIN. This allows
      userspace to re-try as we expect this to be a temporary failure.
      
      I've tested this approach by artificially making
      sk_stream_alloc_skb() return NULL periodically. In that case,
      epoll edge trigger waiters will hang indefinitely in epoll_wait()
      without this patch.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet committed
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule :
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block TCP socket from sending new data.
      If no further ACK is coming, this hang would be permanent, and the socket
      has no chance to effectively reduce its memory usage.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 20 May, 2015 5 commits
    • netfilter: ensure number of counters is >0 in do_replace() · 1086bbe9
      Dave Jones committed
      After improving setsockopt() coverage in trinity, I started triggering
      vmalloc failures pretty reliably from this code path:
      
      warn_alloc_failed+0xe9/0x140
      __vmalloc_node_range+0x1be/0x270
      vzalloc+0x4b/0x50
      __do_replace+0x52/0x260 [ip_tables]
      do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
      nf_setsockopt+0x65/0x90
      ip_setsockopt+0x61/0xa0
      raw_setsockopt+0x16/0x60
      sock_common_setsockopt+0x14/0x20
      SyS_setsockopt+0x71/0xd0
      
      It turns out we don't validate that the num_counters field in the
      struct we pass in from userspace is initialized.
      
      The same problem also exists in ebtables, arptables, ipv6, and the
      compat variants.
      Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
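The validation added by the netfilter commit above amounts to rejecting a zero counter count before any allocation is attempted. A minimal model of that check follows; the function name check_num_counters is hypothetical and the choice of EINVAL as the error code is an assumption, since the commit message does not state it.

```python
import errno


def check_num_counters(num_counters: int) -> int:
    """Return 0 if the count is usable, or a negative errno in kernel style.

    A zero count from userspace is rejected up front so that no
    zero-sized counter array is requested from the allocator.
    """
    if num_counters == 0:
        return -errno.EINVAL  # assumed error code
    return 0
```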
    • tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann committed
      This work is a follow-up to commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds the RFC 3168 section 6.1.1.1 fallback for outgoing
      ECN connections. In other words, it adds a retry with a non-ECN-setup
      SYN packet on the first timeout, as suggested by the RFC:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, only a small percentage of sites would see such additional
      setup latency: at the end of 2014, ~56% of IPv4 and 65% of IPv6 servers
      on the Alexa 1 million list were found to negotiate ECN (aka tcp_ecn=2
      default), while 0.42% of these webservers fail to connect when ECN
      negotiation is attempted (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. A recent related
      paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC 3168
      section 6.1.1.1 fallback on timeout. For users who explicitly do not want
      this, which can be the case in data center deployments, we add a
      net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
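The client-side behaviour in the schematic views of the ECN fallback commit above can be summarised as a small decision function. This is an illustrative model only: syn_flags is a hypothetical name, the per-route ECN setting is ignored, and the fallback is assumed to apply to every SYN retransmission after the first timeout, as the quoted RFC text permits.

```python
def syn_flags(retransmit: int, tcp_ecn: int, ecn_fallback: bool = True) -> frozenset:
    """Return the TCP flags a client would put on its SYN.

    retransmit   -- 0 for the initial SYN, 1 for the first retransmission, ...
    tcp_ecn      -- value of net.ipv4.tcp_ecn (only 1 requests ECN on outgoing SYNs)
    ecn_fallback -- value of net.ipv4.tcp_ecn_fallback
    """
    wants_ecn = tcp_ecn == 1
    fell_back = ecn_fallback and retransmit >= 1
    if wants_ecn and not fell_back:
        return frozenset({"SYN", "ECE", "CWR"})
    return frozenset({"SYN"})
```

With the fallback disabled, every retransmission keeps ECE and CWR set, which corresponds to the third schematic where the middlebox drops each attempt.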
    • tcp: don't over-send F-RTO probes · b7b0ed91
      Yuchung Cheng committed
      After sending the new data packets to probe (step 2), F-RTO may
      incorrectly send more probes if the next ACK advances SND_UNA and
      does not sack new packet. However F-RTO RFC 5682 probes at most
      once. This bug may cause sender to always send new data instead of
      repairing holes, inducing longer HoL blocking on the receiver for
      the application.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: only undo on partial ACKs in CA_Loss · da34ac76
      Yuchung Cheng committed
      Undo based on TCP timestamps should only happen on ACKs that advance
      SND_UNA, according to the Eifel algorithm in RFC 3522:
      
      Section 3.2:
      
        (4) If the value of the Timestamp Echo Reply field of the
            acceptable ACK's Timestamps option is smaller than the
            value of RetransmitTS, then proceed to step (5),
      
      Section Terminology:
         We use the term 'acceptable ACK' as defined in [RFC793].  That is an
         ACK that acknowledges previously unacknowledged data.
      
      This is because upon receiving an out-of-order packet, the receiver
      returns the last timestamp that advances RCV_NXT, not the current
      timestamp of the packet in the DUPACK. Without checking the flag,
      the DUPACK will cause tcp_packet_delayed() to return true and
      tcp_try_undo_loss() will revert cwnd reduction.
      
      Note that we check the condition in CA_Recovery already by only
      calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
      tcp_try_undo_recovery() if snd_una crosses high_seq.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
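The rule described in the CA_Loss undo commit above combines two conditions: the ACK must advance SND_UNA, and its echoed timestamp must predate the retransmission. A simplified sketch, assuming plain integer timestamps with no wraparound handling; may_undo_in_loss is a hypothetical name and not a kernel function.

```python
def may_undo_in_loss(snd_una_advanced: bool, tsecr: int, retrans_stamp: int) -> bool:
    """Decide whether a timestamp-based undo is allowed in CA_Loss.

    snd_una_advanced -- True if the ACK acknowledges previously unacknowledged data
    tsecr            -- Timestamp Echo Reply value carried by the ACK
    retrans_stamp    -- timestamp of the first retransmission (RetransmitTS)
    """
    # A DUPACK echoes the last in-order timestamp, so its TSecr can look
    # "old" even though the loss was real; only acceptable ACKs count.
    return snd_una_advanced and tsecr < retrans_stamp
```

A DUPACK carrying an old TSecr therefore no longer reverts the cwnd reduction, while a partial ACK with the same TSecr still does.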
    • tcp: Return error instead of partial read for saved syn headers · aea0929e
      Eric B Munson committed
      Currently the getsockopt() requesting the cached contents of the syn
      packet headers will fail silently if the caller uses a buffer that is
      too small to contain the requested data.  Rather than fail silently and
      discard the headers, getsockopt() should return an error and report the
      required size to hold the data.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
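The getsockopt() contract described in the saved syn headers commit above can be modeled as follows. This is a sketch of the behaviour from the caller's point of view, not the kernel code: get_saved_syn is a hypothetical name, and the use of EINVAL is an assumption because the commit message only says that an error is returned together with the required size.

```python
import errno


def get_saved_syn(saved_syn: bytes, buf_len: int):
    """Return (ret, data, reported_len) for a request with a buf_len-byte buffer.

    If the buffer is too small, nothing is copied, an error is returned and
    reported_len tells the caller how large the buffer must be.
    """
    needed = len(saved_syn)
    if buf_len < needed:
        return -errno.EINVAL, b"", needed  # assumed error code
    return 0, saved_syn, needed
```

A caller can retry with a buffer of reported_len bytes after the first failure instead of silently losing the headers.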
  5. 19 May, 2015 3 commits
  6. 18 May, 2015 8 commits
  7. 16 May, 2015 1 commit
    • netfilter: x_tables: add context to know if extension runs from nft_compat · 55917a21
      Pablo Neira Ayuso committed
      Currently, we have four xtables extensions that cannot be used from the
      xt over nft compat layer. The problem is that they need real access to
      the full blown xt_entry to validate that the rule comes with the right
      dependencies. This check was introduced to overcome the lack of
      sufficient userspace dependency validation in iptables.
      
      To resolve this problem, this patch introduces a new field to the
      xt_tgchk_param structure that tells us if the extension is run from
      nft_compat context.
      
      The three affected extensions are:
      
      1) CLUSTERIP, this target has been superseded by xt_cluster. So just
         bail out by returning -EINVAL.
      
      2) TCPMSS. Relax the checking when used from nft_compat. If used with
         the wrong configuration, it will corrupt !syn packets by adding TCP
         MSS option.
      
      3) ebt_stp. Relax the check to make sure it uses the reserved
         destination MAC address for STP.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
  8. 15 May, 2015 2 commits
  9. 14 May, 2015 6 commits
  10. 13 May, 2015 1 commit
  11. 11 May, 2015 3 commits
  12. 10 May, 2015 1 commit