1. 21 1月, 2016 1 次提交
  2. 15 1月, 2016 1 次提交
  3. 11 1月, 2016 1 次提交
  4. 23 12月, 2015 2 次提交
  5. 16 12月, 2015 1 次提交
  6. 15 12月, 2015 1 次提交
    • E
      net: fix IP early demux races · 5037e9ef
      Eric Dumazet 提交于
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst to skb, after a lookup in TCP/UDP
      sockets.
      
      When socket cache is not properly set, we want to store into
      sk->sk_dst_cache the dst for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      
      Problem is this acquisition is simply using an atomic_inc(),
      which works well, unless the dst was queued for destruction from
      dst_release() noticing dst refcount went to zero, if DST_NOCACHE
      was set on dst.
      
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      
      This patch, being a stable candidate, adds two new helpers, and use
      them only from IP early demux problematic paths.
      
      It might be possible to merge in net-next skb_dst_force() and
      skb_dst_force_safe(), but I prefer having the smallest patch for stable
      kernels : Maybe some skb_dst_force() callers do not expect skb->dst
      can suddenly be cleared.
      
      Can probably be backported back to linux-3.6 kernels
      Reported-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Tested-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5037e9ef
  7. 04 12月, 2015 1 次提交
    • E
      ipv6: kill sk_dst_lock · 6bd4f355
      Eric Dumazet 提交于
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() share same implementation.
      
      sk_setup_caps() being called with socket lock being held or not,
      we have to use sk_dst_set() instead of __sk_dst_set()
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
      from another cpu doing a concurrent ip6_dst_store()
      
      Doing the dst dereference before doing the install is needed to make
      sure no use after free would trigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bd4f355
  8. 03 12月, 2015 2 次提交
  9. 16 11月, 2015 1 次提交
  10. 06 11月, 2015 2 次提交
  11. 03 11月, 2015 1 次提交
    • E
      tcp/dccp: fix ireq->pktopts race · ce105008
      Eric Dumazet 提交于
      IPv6 request sockets store a pointer to skb containing the SYN packet
      to be able to transfer it to full blown socket when 3WHS is done
      (ireq->pktopts -> np->pktoptions)
      
      As explained in commit 5e0724d0 ("tcp/dccp: fix hashdance race for
      passive sessions"), we must transfer the skb only if we won the
      hashdance race, if multiple cpus receive the 'ack' packet completing
      3WHS at the same time.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce105008
  12. 23 10月, 2015 1 次提交
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
  13. 19 10月, 2015 1 次提交
  14. 16 10月, 2015 1 次提交
  15. 14 10月, 2015 1 次提交
    • E
      tcp/dccp: fix behavior of stale SYN_RECV request sockets · 4bdc3d66
      Eric Dumazet 提交于
      When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
      become stale, meaning 3WHS can not complete.
      
      But current behavior is wrong :
      incoming packets finding such stale sockets are dropped.
      
      We need instead to cleanup the request socket and perform another
      lookup :
      - Incoming ACK will give a RST answer,
      - SYN rtx might find another listener if available.
      - We expedite cleanup of request sockets and old listener socket.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bdc3d66
  16. 13 10月, 2015 1 次提交
  17. 03 10月, 2015 6 次提交
    • E
      tcp: do not lock listener to process SYN packets · e994b2f0
      Eric Dumazet 提交于
      Everything should now be ready to finally allow SYN
      packets processing without holding listener lock.
      
      Tested:
      
      3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.
      
      Next bottleneck is the refcount taken on listener,
      that could be avoided if we remove SLAB_DESTROY_BY_RCU
      strict semantic for listeners, and use regular RCU.
      
          13.18%  [kernel]  [k] __inet_lookup_listener
           9.61%  [kernel]  [k] tcp_conn_request
           8.16%  [kernel]  [k] sha_transform
           5.30%  [kernel]  [k] inet_reqsk_alloc
           4.22%  [kernel]  [k] sock_put
           3.74%  [kernel]  [k] tcp_make_synack
           2.88%  [kernel]  [k] ipt_do_table
           2.56%  [kernel]  [k] memcpy_erms
           2.53%  [kernel]  [k] sock_wfree
           2.40%  [kernel]  [k] tcp_v4_rcv
           2.08%  [kernel]  [k] fib_table_lookup
           1.84%  [kernel]  [k] tcp_openreq_init_rwin
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e994b2f0
    • E
      tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065
      Eric Dumazet 提交于
      If a listen backlog is very big (to avoid syncookies), then
      the listener sk->sk_wmem_alloc is the main source of false
      sharing, as we need to touch it twice per SYNACK re-transmit
      and TX completion.
      
      (One SYN packet takes listener lock once, but up to 6 SYNACK
      are generated)
      
      By attaching the skb to the request socket, we remove this
      source of contention.
      
      Tested:
      
       listen(fd, 10485760); // single listener (no SO_REUSEPORT)
       16 RX/TX queue NIC
       Sustain a SYNFLOOD attack of ~320,000 SYN per second,
       Sending ~1,400,000 SYNACK per second.
       Perf profiles now show listener spinlock being next bottleneck.
      
          20.29%  [kernel]  [k] queued_spin_lock_slowpath
          10.06%  [kernel]  [k] __inet_lookup_established
           5.12%  [kernel]  [k] reqsk_timer_handler
           3.22%  [kernel]  [k] get_next_timer_interrupt
           3.00%  [kernel]  [k] tcp_make_synack
           2.77%  [kernel]  [k] ipt_do_table
           2.70%  [kernel]  [k] run_timer_softirq
           2.50%  [kernel]  [k] ip_finish_output
           2.04%  [kernel]  [k] cascade
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca6fb065
    • E
      tcp/dccp: install syn_recv requests into ehash table · 079096f1
      Eric Dumazet 提交于
      In this patch, we insert request sockets into TCP/DCCP
      regular ehash table (where ESTABLISHED and TIMEWAIT sockets
      are) instead of using the per listener hash table.
      
      ACK packets find SYN_RECV pseudo sockets without having
      to find and lock the listener.
      
      In nominal conditions, this halves pressure on listener lock.
      
      Note that this will allow for SO_REUSEPORT refinements,
      so that we can select a listener using cpu/numa affinities instead
      of the prior 'consistent hash', since only SYN packets will
      apply this selection logic.
      
      We will shrink listen_sock in the following patch to ease
      code review.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ying Cai <ycai@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      079096f1
    • E
      tcp: get_openreq[46]() changes · aa3a0c8c
      Eric Dumazet 提交于
      When request sockets are no longer in a per listener hash table
      but on regular TCP ehash, we need to access listener uid
      through req->rsk_listener
      
      get_openreq6() also gets a const for its request socket argument.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa3a0c8c
    • E
      tcp: cleanup tcp_v[46]_inbound_md5_hash() · ba8e275a
      Eric Dumazet 提交于
      We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.
      Also add const attribute to the socket, as it might be the
      unlocked listener for SYN packets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba8e275a
    • E
      tcp: call sk_mark_napi_id() on the child, not the listener · 38cb5245
      Eric Dumazet 提交于
      This fixes a typo : We want to store the NAPI id on child socket.
      Presumably nobody really uses busy polling, on short lived flows.
      
      Fixes: 3d97379a ("tcp: move sk_mark_napi_id() at the right place")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38cb5245
  18. 30 9月, 2015 8 次提交
  19. 26 9月, 2015 3 次提交
  20. 25 9月, 2015 1 次提交
  21. 18 9月, 2015 1 次提交
    • E
      tcp: provide skb->hash to synack packets · 58d607d3
      Eric Dumazet 提交于
      In commit b73c3d0e ("net: Save TX flow hash in sock and set in skbuf
      on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
      
      We'd like to provide one as well for SYNACK packets, so that all packets
      of a given flow share same txhash, to later enable bonding driver to
      also use skb->hash to perform slave selection.
      
      Note that a SYNACK retransmit shuffles the tx hash, as Tom did
      in commit 265f94ff ("net: Recompute sk_txhash on negative routing
      advice") for established sockets.
      
      This has nice effect making TCP flows resilient to some kind of black
      holes, even at connection establish phase.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58d607d3
  22. 11 8月, 2015 1 次提交
    • E
      inet: fix possible request socket leak · 3257d8b1
      Eric Dumazet 提交于
      In commit b357a364 ("inet: fix possible panic in
      reqsk_queue_unlink()"), I missed fact that tcp_check_req()
      can return the listener socket in one case, and that we must
      release the request socket refcount or we leak it.
      
      Tested:
      
       Following packetdrill test template shows the issue
      
      0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
      +.002 < . 1:1(0) ack 21 win 2920
      +0    > R 21:21(0)
      
      Fixes: b357a364 ("inet: fix possible panic in reqsk_queue_unlink()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3257d8b1
  23. 30 7月, 2015 1 次提交