1. 21 May 2013 (1 commit)
    • rps: selective flow shedding during softnet overflow · 99bbc707
      Willem de Bruijn authored
      A cpu executing the network receive path sheds packets when its input
      queue grows to netdev_max_backlog. A single high rate flow (such as a
      spoofed source DoS) can exceed a single cpu's processing rate and will
      degrade throughput of other flows hashed onto the same cpu.
      
      This patch adds a more fine-grained hashtable. If the netdev backlog
      is above a threshold, IRQ cpus track the ratio of total traffic of
      each flow (using 4096 buckets, configurable). The ratio is measured
      by counting the number of packets per flow over the last 256 packets
      from the source cpu. Any flow that occupies a large fraction of this
      window (set at 50%) will see its packets dropped while the backlog is
      above the threshold.
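      
      A minimal user-space sketch of this accounting, assuming illustrative
      names (FLOW_BUCKETS, HISTORY_LEN, flow_over_limit); it is not the
      kernel's actual flow-limit code, only the counting scheme described
      above:
      
          #include <stdbool.h>
          #include <stdint.h>
          
          #define FLOW_BUCKETS 4096   /* per-flow hash buckets (configurable) */
          #define HISTORY_LEN  256    /* packets remembered per input cpu */
          
          /* Assumes a zero-initialized struct. */
          struct flow_limit {
              uint16_t history[HISTORY_LEN]; /* bucket hit by each recent packet */
              uint32_t count[FLOW_BUCKETS];  /* hits per bucket in the window */
              uint32_t pos;                  /* next slot in the circular history */
          };
          
          /* Returns true if this packet's flow takes more than half of the
           * recent window and should be dropped while the backlog is above
           * its threshold. */
          static bool flow_over_limit(struct flow_limit *fl, uint32_t flow_hash)
          {
              uint16_t bucket = flow_hash & (FLOW_BUCKETS - 1);
              uint16_t oldest = fl->history[fl->pos];
          
              if (fl->count[oldest] > 0)          /* slide the window */
                  fl->count[oldest]--;
              fl->history[fl->pos] = bucket;
              fl->pos = (fl->pos + 1) % HISTORY_LEN;
              fl->count[bucket]++;
          
              return fl->count[bucket] > HISTORY_LEN / 2;   /* the 50% cutoff */
          }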
      
      Tested:
      Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
      kernel receive (RPS) on cpu0 and application threads on cpus 2--7
      each handling 20k req/s. Throughput halves when hit with a 400 kpps
      antagonist storm. With this patch applied, antagonist overload is
      dropped and the server processes its complete load.
      
      The patch is effective when kernel receive processing is the
      bottleneck. The above RPS scenario is an extreme case, but the same is
      reached with RFS and sufficient kernel processing (iptables, packet
      socket tap, ..).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99bbc707
  2. 20 May 2013 (2 commits)
    • tcp: remove bad timeout logic in fast recovery · 3e59cb0d
      Yuchung Cheng authored
      tcp_timeout_skb() was intended to trigger fast recovery on timeout,
      unfortunately in reality it often causes spurious retransmission
      storms during fast recovery. The particular sign is a fast retransmit
      over the highest sacked sequence (SND.FACK).
      
      Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
      to avoid spurious timeout: when SND.UNA advances the sender re-arms
      RTO and extends the timeout by icsk_rto. The sender does not subtract
      the time already elapsed since the packet at SND.UNA was sent.
      
      But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
      tcp_fastretrans_alert(), then tcp_timeout_skb() will mark as lost any
      packet sent before the icsk_rto interval, including ones above the
      highest sacked sequence. Most likely a large part of the scoreboard
      will be marked.
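      
      A sketch of the kind of time-based loss marking being removed; the
      names (pkt, mark_old_packets_lost) are hypothetical, not the kernel's
      scoreboard code, but they show why packets beyond the highest sacked
      sequence get swept up:
      
          #include <stdbool.h>
          #include <stdint.h>
          
          struct pkt {
              uint32_t seq;
              uint64_t sent_ms;   /* transmit timestamp */
              bool     lost;
          };
          
          static void mark_old_packets_lost(struct pkt *q, int n,
                                            uint64_t now_ms, uint64_t rto_ms)
          {
              for (int i = 0; i < n; i++) {
                  /* Everything sent more than one RTO ago is declared lost,
                   * including packets above SND.FACK, which is what feeds
                   * the spurious retransmissions described above. */
                  if (now_ms - q[i].sent_ms > rto_ms)
                      q[i].lost = true;
              }
          }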
      
      If most packets are not lost then the subsequent DUPACKs with new SACK
      blocks will cause the sender to continue to retransmit packets beyond
      SND.FACK spuriously. Even if only one packet is lost the sender may
      falsely retransmit almost the entire window.
      
      The situation becomes common in the world of bufferbloat: the RTT
      continues to grow as the queue builds up but RTTVAR remains small and
      close to the minimum 200ms. If a data packet is lost and the DUPACK
      triggered by the next data packet is slightly delayed, then a spurious
      retransmission storm forms.
      
      As the original comment on tcp_timeout_skb() suggests: the usefulness
      of this feature is questionable. It also wastes cycles walking the
      sack scoreboard and is actually harmful because of false recovery.
      
      It's time to remove this.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e59cb0d
    • ipv6: add support of peer address · caeaba79
      Nicolas Dichtel authored
      This patch adds support for a peer address to IPv6. For example, it is
      possible to specify the remote end of a 6inY tunnel.
      This was already possible in IPv4:
       ip addr add ip1 peer ip2 dev dev1
      
      The peer address is specified with IFA_ADDRESS and the local address with
      IFA_LOCAL (as explained in include/uapi/linux/if_addr.h).
      Note that the API is not changed, because before this patch it was not
      possible to specify two different addresses in IFA_LOCAL and IFA_ADDRESS.
      There is a small change for the dump: if the peer is different from ::,
      IFA_ADDRESS will contain the peer address instead of the local address.
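      
      A user-space sketch of a request using this layout, assuming raw
      rtnetlink and illustrative names (addr_req, add_attr, build_peer_request);
      error handling and the actual send are omitted:
      
          #include <string.h>
          #include <sys/socket.h>
          #include <netinet/in.h>
          #include <linux/netlink.h>
          #include <linux/rtnetlink.h>
          #include <linux/if_addr.h>
          
          struct addr_req {
              struct nlmsghdr  nh;
              struct ifaddrmsg ifa;
              char             attrs[64];   /* room for the two addresses */
          };
          
          static void add_attr(struct nlmsghdr *nh, unsigned short type,
                               const void *data, unsigned short len)
          {
              struct rtattr *rta = (struct rtattr *)
                      ((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));
          
              rta->rta_type = type;
              rta->rta_len  = RTA_LENGTH(len);
              memcpy(RTA_DATA(rta), data, len);
              nh->nlmsg_len = NLMSG_ALIGN(nh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
          }
          
          /* Local address goes in IFA_LOCAL, peer address in IFA_ADDRESS. */
          static void build_peer_request(struct addr_req *req, int ifindex,
                                         const struct in6_addr *local,
                                         const struct in6_addr *peer)
          {
              memset(req, 0, sizeof(*req));
              req->nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
              req->nh.nlmsg_type  = RTM_NEWADDR;
              req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
              req->ifa.ifa_family    = AF_INET6;
              req->ifa.ifa_prefixlen = 128;
              req->ifa.ifa_index     = ifindex;
          
              add_attr(&req->nh, IFA_LOCAL,   local, sizeof(*local));
              add_attr(&req->nh, IFA_ADDRESS, peer,  sizeof(*peer));
          }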
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      caeaba79
  3. 18 May 2013 (1 commit)
  4. 17 May 2013 (1 commit)
    • tcp: speedup tcp_fixup_rcvbuf() · d2cf4367
      Eric Dumazet authored
      tcp_fixup_rcvbuf() contains a loop to estimate initial socket
      rcv space needed for a given mss. With large MTU (like 64K on lo),
      we can loop ~500 times and consume a lot of cpu cycles.
      
      perf top of 200 concurrent netperf -t TCP_CRR
      
      5.62%  netperf  [kernel.kallsyms]  [k] tcp_init_buffer_space
      1.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock
      1.55%  netperf  [kernel.kallsyms]  [k] kmem_cache_free
      1.51%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb
      1.50%  netperf  [kernel.kallsyms]  [k] tcp_ack
      
      Let's use a 100% factor and remove the loop.
      
      100% is needed anyway for tcp_adv_win_scale=1
      default value, and is also the maximum factor.
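      
      Illustrative arithmetic only, with hypothetical helpers; this is not
      the kernel's tcp_fixup_rcvbuf(). With tcp_adv_win_scale=1 the advertised
      window is half of the socket space, so covering one MSS needs a 100%
      overhead factor, which can be applied directly instead of being found
      by searching in small steps:
      
          static unsigned int win_from_space(unsigned int space)
          {
              return space - space / 2;           /* tcp_adv_win_scale = 1 */
          }
          
          static unsigned int rcvspace_loop(unsigned int mss)
          {
              unsigned int space = 0;
          
              while (win_from_space(space) < mss) /* old: iterative search, */
                  space += 128;                   /* many rounds at a 64K MSS */
              return space;
          }
          
          static unsigned int rcvspace_direct(unsigned int mss)
          {
              return 2 * mss;                     /* new: 100% factor, no loop */
          }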
      
      Refs: commit b49960a0
            ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2cf4367
  5. 12 May 2013 (4 commits)
  6. 09 May 2013 (5 commits)
  7. 08 May 2013 (1 commit)
  8. 07 May 2013 (2 commits)
  9. 06 May 2013 (6 commits)
    • netpoll: inverted down_trylock() test · a3dbbc2b
      Dan Carpenter authored
      The test was inverted: down_trylock() has the opposite return
      convention from mutex_trylock() (it returns 0 on success).
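      
      For reference, a minimal kernel-style illustration of the two return
      conventions (the bug came from treating one like the other):
      
          #include <linux/semaphore.h>
          #include <linux/mutex.h>
          
          /* Initialization omitted; only the return conventions matter here. */
          static struct semaphore sem;
          static struct mutex lock;
          
          static void trylock_conventions(void)
          {
              /* down_trylock() returns 0 when the semaphore was acquired. */
              if (down_trylock(&sem))
                  return;              /* non-zero: did NOT get it */
              up(&sem);
          
              /* mutex_trylock() returns 1 when the mutex was acquired. */
              if (!mutex_trylock(&lock))
                  return;              /* zero: did NOT get it */
              mutex_unlock(&lock);
          }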
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3dbbc2b
    • rps_dev_flow_table_release(): no need to delay vfree() · 243198d0
      Al Viro authored
      The same story as with the fib_trie patch: vfree() from RCU callbacks
      is legitimate now.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      243198d0
    • fib_trie: no need to delay vfree() · 00203563
      Al Viro authored
      Now that vfree() can be called from interrupt contexts, there's no
      need to play games with schedule_work() to escape calling vfree()
      from RCU callbacks.
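      
      A sketch of the pattern this enables, with hypothetical names
      (big_table, big_table_replace); the point is simply that the RCU
      callback may call vfree() directly instead of deferring to a
      workqueue:
      
          #include <linux/kernel.h>
          #include <linux/rcupdate.h>
          #include <linux/vmalloc.h>
          
          struct big_table {
              struct rcu_head rcu;
              unsigned long   data[];
          };
          
          /* Runs in softirq context; vfree() is safe there now. */
          static void big_table_free_rcu(struct rcu_head *head)
          {
              vfree(container_of(head, struct big_table, rcu));
          }
          
          /* Publish a replacement and free the old table after a grace period. */
          static void big_table_replace(struct big_table __rcu **slot,
                                        struct big_table *new_tbl)
          {
              struct big_table *old = rcu_dereference_protected(*slot, 1);
          
              rcu_assign_pointer(*slot, new_tbl);
              if (old)
                  call_rcu(&old->rcu, big_table_free_rcu);
          }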
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      00203563
    • net: frag, fix race conditions in LRU list maintenance · b56141ab
      Konstantin Khlebnikov authored
      This patch fixes a race between inet_frag_lru_move() and inet_frag_lru_add()
      which was introduced in commit 3ef0eb0d
      ("net: frag, move LRU list maintenance outside of rwlock").
      
      One cpu has already added a new fragment queue to the hash, but not yet
      to the LRU. Another cpu finds it in the hash and tries to move it to the
      end of the LRU. This leads to a NULL pointer dereference inside
      list_move_tail().
      
      Another possible race condition is between inet_frag_lru_move() and
      inet_frag_lru_del(): the move can happen after the deletion.
      
      This patch initializes the LRU list head before adding the fragment to
      the hash, and inet_frag_lru_move() does not touch it if it is empty.
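      
      A sketch of the two sides of the fix, with simplified types; it is not
      the actual inet_frag code, but it shows the init-before-publish step
      and the list_empty() guard:
      
          #include <linux/list.h>
          #include <linux/spinlock.h>
          
          struct frag_queue_sketch {
              struct list_head lru_list;
              /* ... hash linkage, timer, fragments ... */
          };
          
          /* Creator: link the LRU node to itself before the queue becomes
           * visible in the hash, so concurrent lookups never see garbage. */
          static void frag_publish(struct frag_queue_sketch *q)
          {
              INIT_LIST_HEAD(&q->lru_list);
              /* ... then insert q into the hash table ... */
          }
          
          /* Mover: only touch entries that are actually on the LRU; an empty
           * node means the adder has not linked it yet, or it was deleted. */
          static void frag_lru_move(spinlock_t *lru_lock, struct list_head *lru,
                                    struct frag_queue_sketch *q)
          {
              spin_lock(lru_lock);
              if (!list_empty(&q->lru_list))
                  list_move_tail(&q->lru_list, lru);
              spin_unlock(lru_lock);
          }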
      
      I saw this kernel oops two times in a couple of days.
      
      [119482.128853] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
      [119482.140221] Oops: 0000 [#1] SMP
      [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
      [119482.152692] CPU 3
      [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
      [119482.161478] RIP: 0010:[<ffffffff812ede89>]  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.166004] RSP: 0018:ffff880216d5db58  EFLAGS: 00010207
      [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
      [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
      [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
      [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
      [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
      [119482.194140] FS:  00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      [119482.198928] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
      [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
      [119482.223113] Stack:
      [119482.228004]  ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
      [119482.233038]  ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
      [119482.238083]  00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
      [119482.243090] Call Trace:
      [119482.248009]  [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10
      [119482.252921]  [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0
      [119482.257803]  [<ffffffff8154485b>] nf_iterate+0x8b/0xa0
      [119482.262658]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.267527]  [<ffffffff815448e4>] nf_hook_slow+0x74/0x130
      [119482.272412]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.277302]  [<ffffffff8155d068>] ip_rcv+0x268/0x320
      [119482.282147]  [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0
      [119482.286998]  [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60
      [119482.291826]  [<ffffffff8151a650>] process_backlog+0xa0/0x160
      [119482.296648]  [<ffffffff81519f29>] net_rx_action+0x139/0x220
      [119482.301403]  [<ffffffff81053707>] __do_softirq+0xe7/0x220
      [119482.306103]  [<ffffffff81053868>] run_ksoftirqd+0x28/0x40
      [119482.310809]  [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0
      [119482.315515]  [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40
      [119482.320219]  [<ffffffff8106d870>] kthread+0xc0/0xd0
      [119482.324858]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.329460]  [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0
      [119482.334057]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
      [119482.343787] RIP  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.348675]  RSP <ffff880216d5db58>
      [119482.353493] CR2: 0000000000000000
      
      Oops happened on this path:
      ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b56141ab
    • SUNRPC: Refactor gssx_dec_option_array() to kill uninitialized warning · 9fd40c5a
      Geert Uytterhoeven authored
      net/sunrpc/auth_gss/gss_rpc_xdr.c: In function ‘gssx_dec_option_array’:
      net/sunrpc/auth_gss/gss_rpc_xdr.c:258: warning: ‘creds’ may be used uninitialized in this function
      
      Return early if count is zero, to make it clearer to the compiler (and the
      casual reviewer) that no more processing is done.
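      
      The shape of the refactoring, in a generic self-contained form (the
      names are not the SUNRPC ones): bailing out before any allocation
      makes it obvious that the pointer is never read uninitialized.
      
          #include <stdlib.h>
          
          static int decode_array(const int *in, int *out, unsigned int count)
          {
              int *scratch;
              unsigned int i;
          
              if (count == 0)
                  return 0;            /* early return: scratch never needed */
          
              scratch = malloc(count * sizeof(*scratch));
              if (!scratch)
                  return -1;
          
              for (i = 0; i < count; i++)
                  out[i] = scratch[i] = in[i];
          
              free(scratch);
              return 0;
          }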
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      9fd40c5a
    • tcp: do not expire TCP fastopen cookies · efeaa555
      Eric Dumazet authored
      TCP metric cache expires entries after one hour.
      
      This probably makes sense for TCP RTT/RTTVAR/CWND, but not
      for TCP fastopen cookies.
      
      It's better to try the previous cookie. If it turns out to be obsolete,
      the server will send us a new cookie anyway.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      efeaa555
  10. 04 May 2013 (8 commits)
  11. 03 May 2013 (5 commits)
  12. 02 May 2013 (4 commits)
    • libceph: create source file "net/ceph/snapshot.c" · 4f0dcb10
      Alex Elder authored
      This creates a new source file "net/ceph/snapshot.c" to contain
      utility routines related to ceph snapshot contexts.  The main
      motivation was to define ceph_create_snap_context() as a common way
      to create these structures, but I've moved the definitions of
      ceph_get_snap_context() and ceph_put_snap_context() there too.
      (The benefit of inlining those is very small, and I'd rather
      keep this collection of functions together.)
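      
      A sketch of what such a shared, reference-counted context looks like;
      the names (snap_ctx_sketch) and the use of kref are illustrative, not
      the actual libceph definitions:
      
          #include <linux/kernel.h>
          #include <linux/types.h>
          #include <linux/kref.h>
          #include <linux/slab.h>
          
          struct snap_ctx_sketch {
              struct kref kref;
              u32         num_snaps;
              u64         snaps[];       /* snapshot ids */
          };
          
          static struct snap_ctx_sketch *snap_ctx_create(u32 num_snaps, gfp_t gfp)
          {
              struct snap_ctx_sketch *sc;
          
              sc = kzalloc(sizeof(*sc) + (size_t)num_snaps * sizeof(sc->snaps[0]),
                           gfp);
              if (!sc)
                  return NULL;
              kref_init(&sc->kref);
              sc->num_snaps = num_snaps;
              return sc;
          }
          
          static struct snap_ctx_sketch *snap_ctx_get(struct snap_ctx_sketch *sc)
          {
              if (sc)
                  kref_get(&sc->kref);
              return sc;
          }
          
          static void snap_ctx_release(struct kref *kref)
          {
              kfree(container_of(kref, struct snap_ctx_sketch, kref));
          }
          
          static void snap_ctx_put(struct snap_ctx_sketch *sc)
          {
              if (sc)
                  kref_put(&sc->kref, snap_ctx_release);
          }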
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      4f0dcb10
    • libceph: fix byte order mismatch · 9ef1ee5a
      Alex Elder authored
      A WATCH op includes an object version.  The version that's supplied
      is incorrectly byte-swapped in osd_req_op_watch_init(), where it's first
      assigned (it's been this way since that code was first added).
      
      The result is that the version sent to the osd is wrong, because
      that value gets byte-swapped again in osd_req_encode_op().  This
      is the source of a sparse warning related to improper byte order in
      the assignment.
      
      The approach of using the version to avoid a race is deprecated
      (see http://tracker.ceph.com/issues/3871), and the watch parameter
      is no longer even examined by the osd.  So fix the assignment in
      osd_req_op_watch_init() so it no longer does the byte swap.
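      
      A minimal illustration of the single-conversion rule, using
      hypothetical struct names rather than the real osd request types:
      keep the value in host byte order where it is staged, and convert it
      exactly once when it is encoded for the wire.
      
          #include <linux/types.h>
          #include <asm/byteorder.h>
          
          struct watch_op_sketch { u64    ver; };   /* host byte order */
          struct wire_op_sketch  { __le64 ver; };   /* on-the-wire format */
          
          static void op_init(struct watch_op_sketch *op, u64 version)
          {
              op->ver = version;                 /* no swap here */
          }
          
          static void op_encode(struct wire_op_sketch *wire,
                                const struct watch_op_sketch *op)
          {
              wire->ver = cpu_to_le64(op->ver);  /* one well-typed conversion */
          }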
      
      This resolves:
          http://tracker.ceph.com/issues/3847
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      9ef1ee5a
    • libceph: support pages for class request data · 6c57b554
      Alex Elder authored
      Add the ability to provide an array of pages as outbound request
      data for object class method calls.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      6c57b554
    • libceph: fix two messenger bugs · a51b272e
      Alex Elder authored
      This patch makes four small changes in the ceph messenger.
      
      While getting copyup functionality working I found two bugs in the
      messenger.  Existing paths through the code did not trigger these
      problems, but they're fixed here:
          - In ceph_msg_data_pagelist_cursor_init(), the cursor's
            last_piece field was being checked against the length
            supplied.  This was OK until commit ccba6d98 ("libceph:
            implement multiple data items in a message").  That commit
            changed the cursor init routines to allow lengths to be
            supplied that exceed the size of the current data item.
            Because of this, we have to use the assigned cursor resid
            field rather than the provided length in determining whether
            the cursor points to the last piece of a data item (see the
            sketch after this list).
          - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
            catching attempts to add page data to a message if the message
            already had data assigned to it. That was OK until that same
            commit, at which point it was fine for messages to have
            multiple data items. It slipped through because that BUG_ON()
            call was present twice in that function. (You can never be too
            careful.)
      
      In addition two other minor things are changed:
          - In ceph_msg_data_cursor_init(), the local variable "data" was
            getting assigned twice.
          - In ceph_msg_data_advance(), it was assumed that the
            type-specific advance routine would set new_piece to true
            after it advanced past the last piece. That may have been
            fine, but since we check for that case we might as well set it
            explicitly in ceph_msg_data_advance().
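      
      A simplified sketch of the first fix (hypothetical field names, not
      the real ceph_msg_data_cursor): whether the cursor sits on the last
      piece is judged against the residual of the current data item, not
      against the caller-supplied total length.
      
          #include <stdbool.h>
          #include <stddef.h>
          
          struct data_cursor_sketch {
              size_t resid;       /* bytes left in the *current* data item */
              size_t piece_len;   /* bytes available in the current piece */
              bool   last_piece;
          };
          
          static void cursor_set_last_piece(struct data_cursor_sketch *c)
          {
              /* The supplied length may span several data items, so it says
               * nothing about where the current item ends; resid does. */
              c->last_piece = c->resid <= c->piece_len;
          }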
      
      This resolves:
          http://tracker.ceph.com/issues/4762
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      a51b272e