1. 26 5月, 2013 1 次提交
    • E
      ip_tunnel: fix kernel panic with icmp_dest_unreach · a6222602
      Eric Dumazet 提交于
      Daniel Petre reported crashes in icmp_dst_unreach() with following call
      graph:
      
      #3 [ffff88003fc03938] __stack_chk_fail at ffffffff81037f77
      #4 [ffff88003fc03948] icmp_send at ffffffff814d5fec
      #5 [ffff88003fc03ae8] ipv4_link_failure at ffffffff814a1795
      #6 [ffff88003fc03af8] ipgre_tunnel_xmit at ffffffff814e7965
      #7 [ffff88003fc03b78] dev_hard_start_xmit at ffffffff8146e032
      #8 [ffff88003fc03bc8] sch_direct_xmit at ffffffff81487d66
      #9 [ffff88003fc03c08] __qdisc_run at ffffffff81487efd
      #10 [ffff88003fc03c48] dev_queue_xmit at ffffffff8146e5a7
      #11 [ffff88003fc03c88] ip_finish_output at ffffffff814ab596
      
      Daniel found a similar problem mentioned in
       http://lkml.indiana.edu/hypermail/linux/kernel/1007.0/00961.html
      
      And indeed this is the root cause : skb->cb[] contains data fooling IP
      stack.
      
      We must clear IPCB in ip_tunnel_xmit() sooner in case dst_link_failure()
      is called. Or else skb->cb[] might contain garbage from GSO segmentation
      layer.
      
      A similar fix was tested on linux-3.9, but gre code was refactored in
      linux-3.10. I'll send patches for stable kernels as well.
      
      Many thanks to Daniel for providing reports, patches and testing !
      Reported-by: NDaniel Petre <daniel.petre@rcs-rds.ro>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6222602
  2. 24 5月, 2013 1 次提交
    • E
      tcp: xps: fix reordering issues · 547669d4
      Eric Dumazet 提交于
      commit 3853b584 ("xps: Improvements in TX queue selection")
      introduced ooo_okay flag, but the condition to set it is slightly wrong.
      
      In our traces, we have seen ACK packets being received out of order,
      and RST packets sent in response.
      
      We should test if we have any packets still in host queue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      547669d4
  3. 23 5月, 2013 1 次提交
    • N
      tcp: bug fix in proportional rate reduction. · 35f079eb
      Nandita Dukkipati 提交于
      This patch is a fix for a bug triggering newly_acked_sacked < 0
      in tcp_ack(.).
      
      The bug is triggered by sacked_out decreasing relative to prior_sacked,
      but packets_out remaining the same as pior_packets. This is because the
      snapshot of prior_packets is taken after tcp_sacktag_write_queue() while
      prior_sacked is captured before tcp_sacktag_write_queue(). The problem
      is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment)
      adjusts the pcount for packets_out and sacked_out (MSS change or other
      reason). As a result, this delta in pcount is reflected in
      (prior_sacked - sacked_out) but not in (prior_packets - packets_out).
      
      This patch does the following:
      1) initializes prior_packets at the start of tcp_ack() so as to
      capture the delta in packets_out created by tcp_fragment.
      2) introduces a new "previous_packets_out" variable that snapshots
      packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
      correctly computed as before.
      3) Computes pkts_acked using previous_packets_out, and computes
      newly_acked_sacked using prior_packets.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35f079eb
  4. 20 5月, 2013 1 次提交
  5. 17 5月, 2013 1 次提交
    • E
      tcp: gso: do not generate out of order packets · 6ff50cd5
      Eric Dumazet 提交于
      GSO TCP handler has following issues :
      
      1) ooo_okay from original GSO packet is duplicated to all segments
      2) segments (but the last one) are orphaned, so transmit path can not
      get transmit queue number from the socket. This happens if GSO
      segmentation is done before stacked device for example.
      
      Result is we can send packets from a given TCP flow to different TX
      queues (if using multiqueue NICS). This generates OOO problems and
      spurious SACK & retransmits.
      
      Fix this by keeping socket pointer set for all segments.
      
      This means that every segment must also have a destructor, and the
      original gso skb truesize must be split on all segments, to keep
      precise sk->sk_wmem_alloc accounting.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ff50cd5
  6. 15 5月, 2013 2 次提交
  7. 12 5月, 2013 1 次提交
  8. 09 5月, 2013 1 次提交
  9. 06 5月, 2013 3 次提交
    • A
      fib_trie: no need to delay vfree() · 00203563
      Al Viro 提交于
      Now that vfree() can be called from interrupt contexts, there's no
      need to play games with schedule_work() to escape calling vfree()
      from RCU callbacks.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00203563
    • K
      net: frag, fix race conditions in LRU list maintenance · b56141ab
      Konstantin Khlebnikov 提交于
      This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add()
      which was introduced in commit 3ef0eb0d
      ("net: frag, move LRU list maintenance outside of rwlock")
      
      One cpu already added new fragment queue into hash but not into LRU.
      Other cpu found it in hash and tries to move it to the end of LRU.
      This leads to NULL pointer dereference inside of list_move_tail().
      
      Another possible race condition is between inet_frag_lru_move() and
      inet_frag_lru_del(): move can happens after deletion.
      
      This patch initializes LRU list head before adding fragment into hash and
      inet_frag_lru_move() doesn't touches it if it's empty.
      
      I saw this kernel oops two times in a couple of days.
      
      [119482.128853] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
      [119482.140221] Oops: 0000 [#1] SMP
      [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
      [119482.152692] CPU 3
      [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
      [119482.161478] RIP: 0010:[<ffffffff812ede89>]  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.166004] RSP: 0018:ffff880216d5db58  EFLAGS: 00010207
      [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
      [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
      [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
      [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
      [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
      [119482.194140] FS:  00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      [119482.198928] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
      [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
      [119482.223113] Stack:
      [119482.228004]  ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
      [119482.233038]  ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
      [119482.238083]  00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
      [119482.243090] Call Trace:
      [119482.248009]  [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10
      [119482.252921]  [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0
      [119482.257803]  [<ffffffff8154485b>] nf_iterate+0x8b/0xa0
      [119482.262658]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.267527]  [<ffffffff815448e4>] nf_hook_slow+0x74/0x130
      [119482.272412]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.277302]  [<ffffffff8155d068>] ip_rcv+0x268/0x320
      [119482.282147]  [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0
      [119482.286998]  [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60
      [119482.291826]  [<ffffffff8151a650>] process_backlog+0xa0/0x160
      [119482.296648]  [<ffffffff81519f29>] net_rx_action+0x139/0x220
      [119482.301403]  [<ffffffff81053707>] __do_softirq+0xe7/0x220
      [119482.306103]  [<ffffffff81053868>] run_ksoftirqd+0x28/0x40
      [119482.310809]  [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0
      [119482.315515]  [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40
      [119482.320219]  [<ffffffff8106d870>] kthread+0xc0/0xd0
      [119482.324858]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.329460]  [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0
      [119482.334057]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
      [119482.343787] RIP  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.348675]  RSP <ffff880216d5db58>
      [119482.353493] CR2: 0000000000000000
      
      Oops happened on this path:
      ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b56141ab
    • E
      tcp: do not expire TCP fastopen cookies · efeaa555
      Eric Dumazet 提交于
      TCP metric cache expires entries after one hour.
      
      This probably make sense for TCP RTT/RTTVAR/CWND, but not
      for TCP fastopen cookies.
      
      Its better to try previous cookie. If it appears to be obsolete,
      server will send us new cookie anyway.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efeaa555
  10. 04 5月, 2013 2 次提交
  11. 02 5月, 2013 1 次提交
  12. 30 4月, 2013 3 次提交
  13. 25 4月, 2013 1 次提交
  14. 20 4月, 2013 2 次提交
  15. 19 4月, 2013 4 次提交
    • F
      netfilter: xt_rpfilter: depend on raw or mangle table · d37d6968
      Florian Westphal 提交于
      rpfilter is only valid in raw/mangle PREROUTING, i.e.
      RPFILTER=y|m is useless without raw or mangle table support.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d37d6968
    • F
      netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too · f83a7ea2
      Florian Westphal 提交于
      Alex Efros reported rpfilter module doesn't match following packets:
      IN=br.qemu SRC=192.168.2.1 DST=192.168.2.255 [ .. ]
      (netfilter bugzilla #814).
      
      Problem is that network stack arranges for the locally generated broadcasts
      to appear on the interface they were sent out, so the IFF_LOOPBACK check
      doesn't trigger.
      
      As -m rpfilter is restricted to PREROUTING, we can check for existing
      rtable instead, it catches locally-generated broad/multicast case, too.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f83a7ea2
    • E
      tcp: introduce TCPSpuriousRtxHostQueues SNMP counter · 0e280af0
      Eric Dumazet 提交于
      Host queues (Qdisc + NIC) can hold packets so long that TCP can
      eventually retransmit a packet before the first transmit even left
      the host.
      
      Its not clear right now if we could avoid this in the first place :
      
      - We could arm RTO timer not at the time we enqueue packets, but
        at the time we TX complete them (tcp_wfree())
      
      - Cancel the sending of the new copy of the packet if prior one
        is still in queue.
      
      This patch adds instrumentation so that we can at least see how
      often this problem happens.
      
      TCPSpuriousRtxHostQueues SNMP counter is incremented every time
      we detect the fast clone is not yet freed in tcp_transmit_skb()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e280af0
    • P
      netfilter: add my copyright statements · f229f6ce
      Patrick McHardy 提交于
      Add copyright statements to all netfilter files which have had significant
      changes done by myself in the past.
      
      Some notes:
      
      - nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
        Core Team when it got split out of nf_conntrack_core.c. The copyrights
        even state a date which lies six years before it was written. It was
        written in 2005 by Harald and myself.
      
      - net/ipv{4,6}/netfilter.c, net/netfitler/nf_queue.c were missing copyright
        statements. I've added the copyright statement from net/netfilter/core.c,
        where this code originated
      
      - for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
        it to give the wrong impression
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f229f6ce
  16. 17 4月, 2013 1 次提交
    • E
      net: drop dst before queueing fragments · 97599dc7
      Eric Dumazet 提交于
      Commit 4a94445c (net: Use ip_route_input_noref() in input path)
      added a bug in IP defragmentation handling, as non refcounted
      dst could escape an RCU protected section.
      
      Commit 64f3b9e2 (net: ip_expire() must revalidate route) fixed
      the case of timeouts, but not the general problem.
      
      Tom Parkin noticed crashes in UDP stack and provided a patch,
      but further analysis permitted us to pinpoint the root cause.
      
      Before queueing a packet into a frag list, we must drop its dst,
      as this dst has limited lifetime (RCU protected)
      
      When/if a packet is finally reassembled, we use the dst of the very
      last skb, still protected by RCU and valid, as the dst of the
      reassembled packet.
      
      Use same logic in IPv6, as there is no need to hold dst references.
      Reported-by: NTom Parkin <tparkin@katalix.com>
      Tested-by: NTom Parkin <tparkin@katalix.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97599dc7
  17. 16 4月, 2013 1 次提交
  18. 15 4月, 2013 2 次提交
  19. 14 4月, 2013 1 次提交
  20. 13 4月, 2013 1 次提交
    • E
      tcp: GSO should be TSQ friendly · d6a4a104
      Eric Dumazet 提交于
      I noticed that TSQ (TCP Small queues) was less effective when TSO is
      turned off, and GSO is on. If BQL is not enabled, TSQ has then no
      effect.
      
      It turns out the GSO engine frees the original gso_skb at the time the
      fragments are generated and queued to the NIC.
      
      We should instead call the tcp_wfree() destructor for the last fragment,
      to keep the flow control as intended in TSQ. This effectively limits
      the number of queued packets on qdisc + NIC layers.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6a4a104
  21. 12 4月, 2013 2 次提交
  22. 10 4月, 2013 3 次提交
  23. 09 4月, 2013 3 次提交
  24. 08 4月, 2013 1 次提交