1. 09 Jun 2012 (1 commit)
  2. 07 Jun 2012 (1 commit)
  3. 04 Jun 2012 (4 commits)
  4. 02 Jun 2012 (2 commits)
    •
      tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Authored by Eric Dumazet
      While testing how Linux behaves under a SYNFLOOD attack on a multiqueue
      device (ixgbe), I found that SYNACK messages were dropped at the Qdisc
      level because we send them all on a single queue.

      The obvious choice is to reflect the incoming SYN packet's @queue_mapping
      into the SYNACK packet.

      Under stress, my machine could only send 25,000 SYNACKs per second (for
      200,000 incoming SYNs per second). NIC: ixgbe with 16 rx/tx queues.

      After the patch, not a single SYNACK is dropped.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fff32699
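      [A minimal sketch of the idea; an editor's illustration, not the literal
      patch. skb_get_queue_mapping()/skb_set_queue_mapping() are the standard
      sk_buff accessors, while the wrapper and variable names are made up.]

      #include <linux/skbuff.h>

      /* Copy the RX queue selected for the incoming SYN onto the outgoing
       * SYNACK, so replies are spread across the NIC TX queues instead of
       * all landing (and being dropped) on a single Qdisc queue. */
      static void synack_reflect_queue_mapping(const struct sk_buff *syn_skb,
                                               struct sk_buff *synack_skb)
      {
              skb_set_queue_mapping(synack_skb, skb_get_queue_mapping(syn_skb));
      }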
    •
      tcp: do not create inetpeer on SYNACK message · 7433819a
      Authored by Eric Dumazet
      Another problem during a SYNFLOOD/DDoS attack is the inetpeer cache
      growing larger and larger, using lots of memory and CPU time.
      
      tcp_v4_send_synack()
      ->inet_csk_route_req()
       ->ip_route_output_flow()
        ->rt_set_nexthop()
         ->rt_init_metrics()
          ->inet_getpeer( create = true)
      
      This is a side effect of commit a4daad6b (net: Pre-COW metrics for
      TCP) added in 2.6.39
      
      Possible solution:

      Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS.

      Before patch:
      
      # grep peer /proc/slabinfo
      inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0
      
      Samples: 41K of event 'cycles', Event count (approx.): 30716565122
      +  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
      +   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
      +   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
      +   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
      +   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
      +   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
      +   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
      +   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
      +   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
      +   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
      +   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
      +   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
      +   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
      +   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
      +   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
      +   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
      +   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7433819a
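      [Editor's sketch of the proposed change; the exact code in
      inet_csk_route_req() differs, but the idea is to mask the flag before
      the flow is initialized for the route lookup.]

      /* In inet_csk_route_req(): build the flow flags from the listener
       * socket as before, but drop the pre-COW metrics request so the
       * SYNACK route lookup never ends up in inet_getpeer(create = true).
       * The result is then passed to flowi4_init_output() as usual. */
      __u8 flow_flags = inet_sk_flowi_flags(sk) & ~FLOWI_FLAG_PRECOW_METRICS;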
  5. 30 May 2012 (1 commit)
    •
      memcg: decrement static keys at real destroy time · 3f134619
      Authored by Glauber Costa
      We call the destroy function when a cgroup starts to be removed, such as
      by a rmdir event.
      
      However, because of our reference counters, some objects are still
      inflight.  Right now, we are decrementing the static_keys at destroy()
      time, meaning that if we get rid of the last static_key reference, some
      objects will still have charges, but the code to properly uncharge them
      won't be run.
      
      This becomes a problem especially if it is ever enabled again, because
      new charges will then be added on top of the stale ones, making it
      pretty much impossible to keep the accounting consistent.
      
      We just need to be careful with the static branch activation: since there
      is no particular preferred order of their activation, we need to make sure
      that we only start using it after all call sites are active.  This is
      achieved by having a per-memcg flag that is only updated after
      static_key_slow_inc() returns.  At this time, we are sure all sites are
      active.
      
      This is made per-memcg, not global, for a reason: it also has the effect
      of making socket accounting more consistent.  The first memcg to be
      limited will trigger static_key() activation, and therefore accounting;
      but then all the others would be accounted no matter what.  After this
      patch, only limited memcgs will have their sockets accounted.
      
      [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                                  document enum sock_flag_bits,
                                  convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                                  redo tcp_update_limit() comment to 80 cols]
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f134619
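      [Editor's sketch of the activation ordering described above; the key and
      flag names below are assumptions for illustration, not necessarily the
      identifiers used in the final patch.]

      #include <linux/jump_label.h>
      #include <linux/bitops.h>

      extern struct static_key memcg_socket_limit_enabled;   /* assumed name */
      enum { MEMCG_SOCK_ACTIVE };                             /* illustrative bit */

      static void memcg_sock_limit_activate(unsigned long *memcg_flags)
      {
              /* Patch in every static-branch call site first ... */
              static_key_slow_inc(&memcg_socket_limit_enabled);
              /* ... and only after static_key_slow_inc() returns (all call
               * sites active) mark this memcg as charging, so no charge can
               * land before the code that will uncharge it is reachable. */
              set_bit(MEMCG_SOCK_ACTIVE, memcg_flags);
      }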
  6. 27 May 2012 (1 commit)
  7. 24 May 2012 (3 commits)
    •
      tcp: take care of overlaps in tcp_try_coalesce() · 1ca7ee30
      Authored by Eric Dumazet
      Sergio Correia reported the following warning:
      
      WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
      
      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
      
      It appears TCP coalescing, and more specifically commit b081f85c
      (net: implement tcp coalescing in tcp_queue_rcv()), should take care of
      possible segment overlaps in the receive queue. This was already handled
      properly for the out_of_order_queue by the caller.

      For example, the segment at the tail of the queue has sequence 1000-2000,
      and we add a segment with sequence 1500-2500.
      This can happen in the case of retransmits.
      
      In this case, just don't do the coalescing.
      Reported-by: Sergio Correia <lists@uece.net>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Sergio Correia <lists@uece.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ca7ee30
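      [Editor's sketch of the guard this adds: variable names follow the
      from/to convention of tcp_try_coalesce(), but treat this as an
      illustration rather than the verbatim hunk.]

      /* In tcp_try_coalesce(): only merge when the new segment starts exactly
       * where the tail segment ends; an overlapping retransmit is left alone
       * and goes through the normal receive-queue path instead. */
      if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
              return false;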
    •
      ipv4: fix the rcu race between free_fib_info and ip_route_output_slow · e49cc0da
      Authored by Yanmin Zhang
      We hit a kernel OOPS.
      
      <3>[23898.789643] BUG: sleeping function called from invalid context at
      /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
      <3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
      Thread-6683
      <4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
      3.0.8-137685-ge7742f9 #1
      <4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.904225] Call Trace:
      <4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
      <4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
      <4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
      <4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
      <4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
      <4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
      <4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
      <4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
      <4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
      <4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
      <4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
      <4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
      <4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
      <4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
      <4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50
      
      free_fib_info() resets nexthop_nh->nh_dev to NULL before releasing fi,
      while another CPU might still be accessing fi. Fix it by delaying the
      release.

      With the patch, we ran MTBF testing on an Android mobile device for
      12 hours and didn't trigger the issue.

      Thanks to Eric for the very detailed review and for checking the issue.
      Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e49cc0da
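      [Editor's sketch of "delaying the release" via RCU; the real
      free_fib_info() also releases per-nexthop state, and the rcu_head field
      and callback names here are assumptions.]

      /* Defer the teardown to an RCU callback, so readers that looked the
       * fib_info up under rcu_read_lock() (e.g. from ip_route_output_slow())
       * never see nh_dev cleared, or the structure freed, under them. */
      static void free_fib_info_rcu(struct rcu_head *head)
      {
              struct fib_info *fi = container_of(head, struct fib_info, rcu);

              /* ...drop per-nexthop device references here, as before... */
              kfree(fi);
      }

      void free_fib_info(struct fib_info *fi)
      {
              call_rcu(&fi->rcu, free_fib_info_rcu);
      }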
    •
      mm: add a low limit to alloc_large_system_hash · 31fe62b9
      Authored by Tim Bird
      The UDP stack needs a minimum hash size for proper operation, and it also
      uses alloc_large_system_hash() for proper NUMA distribution of its hash
      tables and automatic sizing depending on available system memory.

      In some low-memory situations, udp_table_init() must ignore the
      alloc_large_system_hash() result and reallocate a bigger memory area.

      As we cannot easily free the old hash table, we leak it, and kmemleak can
      issue a warning.
      
      This patch adds a low limit parameter to alloc_large_system_hash() to
      solve this problem.
      
      We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
      allocation.
      Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
      Reported-by: Tim Bird <tim.bird@am.sony.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31fe62b9
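      [Editor's sketch of the resulting call site; the scale and high-limit
      values are illustrative, with UDP_HTABLE_SIZE_MIN passed as the new
      low-limit argument as described above.]

      /* udp_table_init(): let alloc_large_system_hash() size the table from
       * available memory, but never hand back fewer than UDP_HTABLE_SIZE_MIN
       * slots, so the caller no longer has to re-allocate (and leak). */
      table->hash = alloc_large_system_hash("UDP",
                                            sizeof(struct udp_hslot),
                                            uhash_entries,
                                            21,                  /* scale */
                                            0,                   /* flags */
                                            &table->log,
                                            &table->mask,
                                            UDP_HTABLE_SIZE_MIN, /* low limit */
                                            64 * 1024);          /* high limit */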
  8. 20 May 2012 (4 commits)
  9. 18 May 2012 (3 commits)
    •
      ip_frag: struct inet_frags match() method returns a bool · cbc264ca
      Authored by Eric Dumazet
      - match() method returns a boolean
      - return (A && B && C && D) -> return A && B && C && D
      - fix indentation
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      cbc264ca
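      [Editor's sketch of the conversion pattern on the IPv4 side; field names
      follow ip4_frag_match() but are not guaranteed to be verbatim.]

      static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
      {
              struct ipq *qp = container_of(q, struct ipq, q);
              struct ip4_create_arg *arg = a;

              /* bool return, no extra parentheses around the chained test */
              return qp->id == arg->iph->id &&
                     qp->saddr == arg->iph->saddr &&
                     qp->daddr == arg->iph->daddr &&
                     qp->protocol == arg->iph->protocol &&
                     qp->user == arg->user;
      }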
    •
      tcp: do_tcp_sendpages() must try to push data out on oom conditions · bad115cf
      Authored by Willy Tarreau
      Since recent changes on TCP splicing (starting with commits 2f533844
      "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
      tcp_sendpages() should call tcp_push() once"), I started seeing
      massive stalls when forwarding traffic between two sockets using
      splice() when pipe buffers were larger than socket buffers.
      
      The latest changes (net: netdev_alloc_skb() use build_skb()) made the
      problem even more apparent.

      The reason seems to be that if do_tcp_sendpages() fails on an
      out-of-memory condition without being able to send at least one byte,
      tcp_push() is not called and the buffers cannot be flushed.

      After applying the attached patch, I cannot reproduce the stalls at all,
      and the data rate is perfectly stable and steady under any condition
      that previously caused the problem to be permanent.
      
      The issue seems to have been there since before the kernel migrated to
      git, which makes me think that the stalls I occasionally experienced
      with tux during stress-tests years ago were probably related to the
      same issue.
      
      This issue was first encountered on 3.0.31 and 3.2.17, so please backport
      to -stable.
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
      bad115cf
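      [Editor's sketch of the fix; the label and variable names mirror
      do_tcp_sendpages(), but this is an illustration of the idea, not the
      exact hunk.]

      wait_for_memory:
              /* Before sleeping for socket memory, push out whatever has
               * already been queued so it cannot sit in the write queue
               * forever while we are stuck waiting under memory pressure. */
              if (copied)
                      tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

              err = sk_stream_wait_memory(sk, &timeo);
              if (err != 0)
                      goto do_error;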
    •
      tcp: bool conversions · a2a385d6
      Authored by Eric Dumazet
      bool conversions where possible.
      
      __inline__ -> inline
      
      space cleanups
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2a385d6
  10. 17 May 2012 (1 commit)
  11. 16 May 2012 (4 commits)
  12. 11 May 2012 (4 commits)
  13. 09 May 2012 (1 commit)
    •
      netfilter: remove ip_queue support · d16cf20e
      Authored by Pablo Neira Ayuso
      This patch removes ip_queue support, which was marked as obsolete
      years ago. The nfnetlink_queue module provides a more advanced
      user-space packet queueing mechanism.
      
      This patch also removes capability code included in SELinux that
      refers to ip_queue. Otherwise, we break compilation.
      
      Several warnings have been sent regarding this to the mailing list
      in the past month, without anyone raising a hand to stop it with
      a strong argument.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d16cf20e
  14. 08 May 2012 (1 commit)
  15. 05 May 2012 (1 commit)
    •
      tcp: be more strict before accepting ECN negociation · bd14b1b2
      Authored by Eric Dumazet
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      transfers.

      Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
      disable TCP ECN negotiation if we happen to receive mangled ECT bits in
      the SYN packet.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd14b1b2
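      [Editor's sketch of the stricter check, written against the
      TCP_ECN_create_request() helper; treat the field and sysctl names as
      assumptions for illustration.]

      /* Agree to ECN only if the SYN asked for it (ECE+CWR set) AND its IP
       * header carried Not-ECT, as RFC 3168 (6.1.1) requires for SYN packets.
       * A SYN with mangled/ECT-marked bits therefore gets no ECN. */
      static inline void TCP_ECN_create_request(struct request_sock *req,
                                                const struct sk_buff *skb)
      {
              const struct tcphdr *th = tcp_hdr(skb);

              if (sysctl_tcp_ecn && th->ece && th->cwr &&
                  INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield))
                      inet_rsk(req)->ecn_ok = 1;
      }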
  16. 04 May 2012 (1 commit)
  17. 03 May 2012 (7 commits)