1. 15 12月, 2012 1 次提交
    • C
      inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock · e337e24d
      Christoph Paasch 提交于
      If in either of the above functions inet_csk_route_child_sock() or
      __inet_inherit_port() fails, the newsk will not be freed:
      
      unreferenced object 0xffff88022e8a92c0 (size 1592):
        comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
        hex dump (first 32 bytes):
          0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
          02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
          [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
          [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
          [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
          [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
          [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
          [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
          [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
          [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
          [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
          [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
          [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
          [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
          [<ffffffff814cee68>] ip_rcv+0x217/0x267
          [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
          [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
      
      This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
      a single sock_put() is not enough to free the memory. Additionally, things
      like xfrm, memcg, cookie_values,... may have been initialized.
      We have to free them properly.
      
      This is fixed by forcing a call to tcp_done(), ending up in
      inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
      because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
      xfrm,...
      
      Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
      force it entering inet_csk_destroy_sock. To avoid the warning in
      inet_csk_destroy_sock, inet_num has to be set to 0.
      As inet_csk_destroy_sock does a dec on orphan_count, we first have to
      increase it.
      
      Calling tcp_done() allows us to remove the calls to
      tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
      
      A similar approach is taken for dccp by calling dccp_done().
      
      This is in the kernel since 093d2823 (tproxy: fix hash locking issue
      when using port redirection in __inet_inherit_port()), thus since
      version >= 2.6.37.
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e337e24d
  2. 04 11月, 2012 1 次提交
    • E
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet 提交于
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6c022a4
  3. 09 10月, 2012 1 次提交
    • J
      ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov 提交于
      Add new flag to remember when route is via gateway.
      We will use it to allow rt_gateway to contain address of
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches per-destination route
      without DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. By this way we force the neighbour
      resolving to work with the routed destination but we
      can use different address in the packet, feature needed
      for IPVS-DR where original packet for virtual IP is routed
      via route to real IP.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      155e8336
  4. 07 9月, 2012 1 次提交
    • E
      tcp: fix TFO regression · 7ab4551f
      Eric Dumazet 提交于
      Fengguang Wu reported various panics and bisected to commit
      8336886f (tcp: TCP Fast Open Server - support TFO listeners)
      
      Fix this by making sure socket is a TCP socket before accessing TFO data
      structures.
      
      [  233.046014] kfree_debugcheck: out of range ptr ea6000000bb8h.
      [  233.047399] ------------[ cut here ]------------
      [  233.048393] kernel BUG at /c/kernel-tests/src/stable/mm/slab.c:3074!
      [  233.048393] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [  233.048393] Modules linked in:
      [  233.048393] CPU 0
      [  233.048393] Pid: 3929, comm: trinity-watchdo Not tainted 3.6.0-rc3+
      #4192 Bochs Bochs
      [  233.048393] RIP: 0010:[<ffffffff81169653>]  [<ffffffff81169653>]
      kfree_debugcheck+0x27/0x2d
      [  233.048393] RSP: 0018:ffff88000facbca8  EFLAGS: 00010092
      [  233.048393] RAX: 0000000000000031 RBX: 0000ea6000000bb8 RCX:
      00000000a189a188
      [  233.048393] RDX: 000000000000a189 RSI: ffffffff8108ad32 RDI:
      ffffffff810d30f9
      [  233.048393] RBP: ffff88000facbcb8 R08: 0000000000000002 R09:
      ffffffff843846f0
      [  233.048393] R10: ffffffff810ae37c R11: 0000000000000908 R12:
      0000000000000202
      [  233.048393] R13: ffffffff823dbd5a R14: ffff88000ec5bea8 R15:
      ffffffff8363c780
      [  233.048393] FS:  00007faa6899c700(0000) GS:ffff88001f200000(0000)
      knlGS:0000000000000000
      [  233.048393] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  233.048393] CR2: 00007faa6841019c CR3: 0000000012c82000 CR4:
      00000000000006f0
      [  233.048393] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [  233.048393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [  233.048393] Process trinity-watchdo (pid: 3929, threadinfo
      ffff88000faca000, task ffff88000faec600)
      [  233.048393] Stack:
      [  233.048393]  0000000000000000 0000ea6000000bb8 ffff88000facbce8
      ffffffff8116ad81
      [  233.048393]  ffff88000ff588a0 ffff88000ff58850 ffff88000ff588a0
      0000000000000000
      [  233.048393]  ffff88000facbd08 ffffffff823dbd5a ffffffff823dbcb0
      ffff88000ff58850
      [  233.048393] Call Trace:
      [  233.048393]  [<ffffffff8116ad81>] kfree+0x5f/0xca
      [  233.048393]  [<ffffffff823dbd5a>] inet_sock_destruct+0xaa/0x13c
      [  233.048393]  [<ffffffff823dbcb0>] ? inet_sk_rebuild_header
      +0x319/0x319
      [  233.048393]  [<ffffffff8231c307>] __sk_free+0x21/0x14b
      [  233.048393]  [<ffffffff8231c4bd>] sk_free+0x26/0x2a
      [  233.048393]  [<ffffffff825372db>] sctp_close+0x215/0x224
      [  233.048393]  [<ffffffff810d6835>] ? lock_release+0x16f/0x1b9
      [  233.048393]  [<ffffffff823daf12>] inet_release+0x7e/0x85
      [  233.048393]  [<ffffffff82317d15>] sock_release+0x1f/0x77
      [  233.048393]  [<ffffffff82317d94>] sock_close+0x27/0x2b
      [  233.048393]  [<ffffffff81173bbe>] __fput+0x101/0x20a
      [  233.048393]  [<ffffffff81173cd5>] ____fput+0xe/0x10
      [  233.048393]  [<ffffffff810a3794>] task_work_run+0x5d/0x75
      [  233.048393]  [<ffffffff8108da70>] do_exit+0x290/0x7f5
      [  233.048393]  [<ffffffff82707415>] ? retint_swapgs+0x13/0x1b
      [  233.048393]  [<ffffffff8108e23f>] do_group_exit+0x7b/0xba
      [  233.048393]  [<ffffffff8108e295>] sys_exit_group+0x17/0x17
      [  233.048393]  [<ffffffff8270de10>] tracesys+0xdd/0xe2
      [  233.048393] Code: 59 01 5d c3 55 48 89 e5 53 41 50 0f 1f 44 00 00 48
      89 fb e8 d4 b0 f0 ff 84 c0 75 11 48 89 de 48 c7 c7 fc fa f7 82 e8 0d 0f
      57 01 <0f> 0b 5f 5b 5d c3 55 48 89 e5 0f 1f 44 00 00 48 63 87 d8 00 00
      [  233.048393] RIP  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
      [  233.048393]  RSP <ffff88000facbca8>
      Reported-by: NFengguang Wu <wfg@linux.intel.com>
      Tested-by: NFengguang Wu <wfg@linux.intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: "H.K. Jerry Chu" <hkchu@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ab4551f
  5. 01 9月, 2012 1 次提交
    • J
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu 提交于
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8336886f
  6. 22 8月, 2012 1 次提交
  7. 21 7月, 2012 2 次提交
  8. 18 7月, 2012 1 次提交
  9. 17 7月, 2012 1 次提交
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
  10. 16 7月, 2012 1 次提交
    • D
      ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f
      David S. Miller 提交于
      This abstracts away the call to dst_ops->update_pmtu() so that we can
      transparently handle the fact that, in the future, the dst itself can
      be invalidated by the PMTU update (when we have non-host routes cached
      in sockets).
      
      So we try to rebuild the socket cached route after the method
      invocation if necessary.
      
      This isn't used by SCTP because it needs to cache dsts per-transport,
      and thus will need it's own local version of this helper.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80d0a69f
  11. 11 7月, 2012 1 次提交
  12. 23 6月, 2012 1 次提交
  13. 02 6月, 2012 1 次提交
    • E
      tcp: do not create inetpeer on SYNACK message · 7433819a
      Eric Dumazet 提交于
      Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
      larger and larger, using lots of memory and cpu time.
      
      tcp_v4_send_synack()
      ->inet_csk_route_req()
       ->ip_route_output_flow()
        ->rt_set_nexthop()
         ->rt_init_metrics()
          ->inet_getpeer( create = true)
      
      This is a side effect of commit a4daad6b (net: Pre-COW metrics for
      TCP) added in 2.6.39
      
      Possible solution :
      
      Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS
      
      Before patch :
      
      # grep peer /proc/slabinfo
      inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0
      
      Samples: 41K of event 'cycles', Event count (approx.): 30716565122
      +  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
      +   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
      +   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
      +   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
      +   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
      +   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
      +   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
      +   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
      +   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
      +   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
      +   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
      +   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
      +   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
      +   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
      +   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
      +   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
      +   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7433819a
  14. 22 4月, 2012 1 次提交
  15. 16 4月, 2012 1 次提交
  16. 15 4月, 2012 3 次提交
  17. 26 1月, 2012 2 次提交
  18. 12 12月, 2011 1 次提交
  19. 09 11月, 2011 1 次提交
  20. 24 5月, 2011 1 次提交
  21. 19 5月, 2011 1 次提交
  22. 09 5月, 2011 1 次提交
  23. 29 4月, 2011 2 次提交
    • D
      ipv4: Get route daddr from flow key in inet_csk_route_req(). · 072d8c94
      David S. Miller 提交于
      Now that output route lookups update the flow with
      destination address selection, we can fetch it from
      fl4->daddr instead of rt->rt_dst
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      072d8c94
    • E
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet 提交于
      We lack proper synchronization to manipulate inet->opt ip_options
      
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      
      Another thread can change inet->opt pointer and free old one under us.
      
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6d8bd05
  24. 14 4月, 2011 1 次提交
  25. 31 3月, 2011 1 次提交
  26. 13 3月, 2011 4 次提交
  27. 03 3月, 2011 1 次提交
  28. 02 3月, 2011 2 次提交
  29. 12 1月, 2011 1 次提交
  30. 10 12月, 2010 1 次提交
    • E
      net: optimize INET input path further · 68835aba
      Eric Dumazet 提交于
      Followup of commit b178bb3d (net: reorder struct sock fields)
      
      Optimize INET input path a bit further, by :
      
      1) moving sk_refcnt close to sk_lock.
      
      This reduces number of dirtied cache lines by one on 64bit arches (and
      64 bytes cache line size).
      
      2) moving inet_daddr & inet_rcv_saddr at the beginning of sk
      
      (same cache line than hash / family / bound_dev_if / nulls_node)
      
      This reduces number of accessed cache lines in lookups by one, and dont
      increase size of inet and timewait socks.
      inet and tw sockets now share same place-holder for these fields.
      
      Before patch :
      
      offsetof(struct sock, sk_refcnt) = 0x10
      offsetof(struct sock, sk_lock) = 0x40
      offsetof(struct sock, sk_receive_queue) = 0x60
      offsetof(struct inet_sock, inet_daddr) = 0x270
      offsetof(struct inet_sock, inet_rcv_saddr) = 0x274
      
      After patch :
      
      offsetof(struct sock, sk_refcnt) = 0x44
      offsetof(struct sock, sk_lock) = 0x48
      offsetof(struct sock, sk_receive_queue) = 0x68
      offsetof(struct inet_sock, inet_daddr) = 0x0
      offsetof(struct inet_sock, inet_rcv_saddr) = 0x4
      
      compute_score() (udp or tcp) now use a single cache line per ignored
      item, instead of two.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68835aba
  31. 18 11月, 2010 1 次提交