1. 27 Jan, 2013 1 commit
  2. 24 Jan, 2013 2 commits
    • soreuseport: UDP/IPv4 implementation · ba418fa3
      Committed by Tom Herbert
      Allow multiple UDP sockets to bind to the same port.
      
      The motivation for soreuseport would be something like a DNS server.  An
      alternative would be to recv on the same socket from multiple threads.
      As in the case of TCP, the load across these threads tends to be
      disproportionate and we also see a lot of contention on the socket lock.
      Note that SO_REUSEADDR already allows multiple UDP sockets to bind to
      the same port, however there is no provision to prevent hijacking and
      nothing to distribute packets across all the sockets sharing the same
      bound port.  This patch does not change the semantics of SO_REUSEADDR,
      but provides usable functionality of it for unicast.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba418fa3
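A minimal user-space sketch of what this commit enables, assuming a Linux kernel and Python build that expose `socket.SO_REUSEPORT`. The helper name `bind_reuseport_udp` is hypothetical and only serves the illustration; it is not part of the patch.

```python
import socket


def bind_reuseport_udp(port, count, host="127.0.0.1"):
    """Bind `count` UDP sockets to the same (host, port) using SO_REUSEPORT.

    Passing port 0 lets the kernel choose a free port for the first socket;
    the remaining sockets then reuse that port.
    """
    socks = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option has to be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]
        socks.append(s)
    return socks


if __name__ == "__main__":
    sockets = bind_reuseport_udp(0, 3)
    print("all bound to port", sockets[0].getsockname()[1])
    for s in sockets:
        s.close()
```

Each worker thread or process would then call `recvfrom()` on its own socket, so the workers no longer contend on one shared socket.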
    • soreuseport: TCP/IPv4 implementation · da5e3630
      Committed by Tom Herbert
      Allow multiple listener sockets to bind to the same port.
      
      Motivation for soreuseport would be something like a web server
      binding to port 80 running with multiple threads, where each thread
      might have its own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers. 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener thread
      can easily become the bottleneck with high connection turn-over rate.
      In case #2, the proportion of connections accepted per thread tends
      to be uneven under high connection load (assuming simple event loop:
      while (1) { accept(); process() }), because wakeup does not promote
      fairness among the sockets.  We have seen the disproportion to be as
      high as a 3:1 ratio between the thread accepting the most connections
      and the one accepting the fewest.  With SO_REUSEPORT the distribution
      is uniform.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da5e3630
  3. 23 Jan, 2013 5 commits
  4. 22 Jan, 2013 1 commit
    • mcast: add multicast proxy support (IPv4 and IPv6) · 660b26dc
      Committed by Nicolas Dichtel
      This patch adds support for proxy multicast, i.e. being able to build a static
      multicast tree. It adds support for (*,*) and (*,G) entries.

      The user should define a (*,*) entry, which is not used for real forwarding.
      This entry defines the upstream in iif and contains all interfaces from the
      static tree in its oifs. It will be used to forward packets upstream when they
      come from an interface belonging to the static tree.
      The user should then define (*,G) entries to build the static tree. Note that
      the upstream interface must be part of oifs: packets are sent to all oifs
      interfaces except the input interface. This ensures that the whole static
      tree is always reached, even if the packet is not coming from the upstream
      interface.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: David L Stevens <dlstevens@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      660b26dc
  5. 18 Jan, 2013 1 commit
    • net: increase fragment memory usage limits · c2a93660
      Committed by Jesper Dangaard Brouer
      Increase the memory usage limits for incomplete
      IP fragments.
      
      Arguing for new thresh high/low values:
      
       High threshold = 4 MBytes
       Low  threshold = 3 MBytes
      
      The fragmentation memory accounting code tries to account for the
      real memory usage by measuring both the size of the frag queue struct
      (inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's truesize.

      We want to be able to handle/hold on to enough fragments to ensure
      good performance, without letting incomplete fragments hurt
      scalability by causing the number of inet_frag_queue to grow too much
      (resulting in longer searches for frag queues).

      For IPv4, how much memory does the largest frag consume?

      The maximum fragment size is 64K, which is approx 44 fragments with
      MTU(1500) sized packets. Sizeof(struct ipq) is 200.  A 1500 byte
      packet results in a truesize of 2944 (not 2048 as I first assumed).
      
        (44*2944)+200 = 129736 bytes
      
      The current default high thresh of 262144 bytes, is obviously
      problematic, as only two 64K fragments can fit in the queue at the
      same time.
      
      How many 64K fragments can we fit into 4 MBytes?

        4*2^20/((44*2944)+200) = 32.33 fragments in queues

      An attacker could send a separate/distinct fake fragment packet per
      queue, causing us to allocate one inet_frag_queue per packet, and thus
      attacking the hash table and its lists.

      How many frag queues do we need to store, and given a current hash size
      of 64, what is the average list length?
      
      Using one MTU sized fragment per inet_frag_queue, each consuming
      (2944+200) 3144 bytes.
      
        4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length
      
      An attacker could send small fragments; the smallest packet I could send
      resulted in a truesize of 896 bytes (I'm a little surprised by this).

        4*2^20/(896+200)  = 3827 frag queues -> 59 avg list length

      When increasing these numbers, we also need to follow up with
      improvements that are going to help scalability.  Simply increasing
      the hash size is not enough, as the current implementation does not
      have per hash bucket locking.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2a93660
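The sizing argument in this message can be rechecked with a few lines of arithmetic. The sketch below only restates the figures quoted in the message (truesize values, sizeof(struct ipq), hash size of 64); the function and constant names are illustrative assumptions, not kernel identifiers.

```python
# All constants are the figures quoted in the commit message.
HIGH_THRESH = 4 * 2**20      # proposed high threshold, bytes
OLD_HIGH_THRESH = 262144     # previous default high threshold, bytes
IPQ_SIZE = 200               # sizeof(struct ipq)
TRUESIZE_MTU = 2944          # truesize of a 1500 byte packet
TRUESIZE_SMALL = 896         # truesize of the smallest packet observed
FRAGS_PER_64K = 44           # MTU-sized fragments in a 64K datagram
HASH_BUCKETS = 64            # current hash size


def queue_cost(fragments, truesize):
    """Memory charged for one frag queue holding `fragments` packets."""
    return fragments * truesize + IPQ_SIZE


def queues_that_fit(thresh, cost):
    """How many frag queues of the given cost fit under a threshold."""
    return thresh / cost


def avg_list_length(queues):
    """Average hash chain length if queues spread evenly over the buckets."""
    return queues / HASH_BUCKETS


if __name__ == "__main__":
    full = queue_cost(FRAGS_PER_64K, TRUESIZE_MTU)
    print("cost of one 64K datagram:", full)
    print("64K datagrams under old limit:", queues_that_fit(OLD_HIGH_THRESH, full))
    print("64K datagrams under new limit:", queues_that_fit(HIGH_THRESH, full))
    for truesize in (TRUESIZE_MTU, TRUESIZE_SMALL):
        n = queues_that_fit(HIGH_THRESH, queue_cost(1, truesize))
        print("truesize", truesize, "queues", n, "avg list", avg_list_length(n))
```

The last loop models the attack case described above, where every fake fragment allocates its own queue.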
  6. 12 Jan, 2013 1 commit
  7. 11 Jan, 2013 3 commits
    • tcp: accept RST without ACK flag · 7b514a88
      Committed by Eric Dumazet
      Commit c3ae62af (tcp: should drop incoming frames without ACK flag
      set) introduced a regression in the handling of RST messages.

      RST should be allowed to come even without ACK bit set. We validate
      the RST by checking the exact sequence, as requested by RFC 793 and
      RFC 5961 section 3.2, in tcp_validate_incoming().
      Reported-by: Eric Wong <normalperson@yhbt.net>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Eric Wong <normalperson@yhbt.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b514a88
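The sequence check this message refers to can be sketched as follows. This is a simplified model of the RST handling rule in RFC 5961 section 3.2, not the kernel's tcp_validate_incoming(); the function name `classify_rst` and the returned strings are assumptions made for the example. Note that the ACK flag plays no part in the decision.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def classify_rst(seg_seq, rcv_nxt, rcv_wnd):
    """Decide what to do with an incoming RST segment.

    Returns "reset" when the sequence number is exactly the next expected
    one, "challenge_ack" when it is merely inside the receive window, and
    "drop" when it is outside the window.
    """
    # Distance from RCV.NXT, computed modulo 2**32 so wraparound is handled.
    offset = (seg_seq - rcv_nxt) % SEQ_MOD
    if offset == 0:
        return "reset"
    if offset < rcv_wnd:
        return "challenge_ack"
    return "drop"


if __name__ == "__main__":
    print(classify_rst(1000, 1000, 500))
    print(classify_rst(1200, 1000, 500))
    print(classify_rst(5000, 1000, 500))
```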
    • tcp: fix splice() and tcp collapsing interaction · f26845b4
      Committed by Eric Dumazet
      Under unusual circumstances, TCP collapse can split a big GRO TCP packet
      while it's being used in a splice(socket->pipe) operation.
      
      skb_splice_bits() releases the socket lock before calling
      splice_to_pipe().
      
      [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
      [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
      [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13
      
      To fix this problem, we must eat skbs in tcp_recv_skb().
      
      Remove the inline keyword from tcp_recv_skb() definition since
      it has three call sites.
      Reported-by: Christian Becker <c.becker@traviangames.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f26845b4
    • tcp: splice: fix an infinite loop in tcp_read_sock() · ff905b1e
      Committed by Eric Dumazet
      Commit 02275a2e (tcp: don't abort splice() after small transfers)
      introduced a regression.
      
      [   83.843570] INFO: rcu_sched self-detected stall on CPU
      [   83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
      [   83.844582] Task dump for CPU 6:
      [   83.844584] netperf         R  running task        0  8966   8952 0x0000000c
      [   83.844587]  0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
      [   83.844589]  000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
      [   83.844592]  ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
      [   83.844594] Call Trace:
      [   83.844596]  [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0
      [   83.844601]  [<ffffffff815ad449>] ? schedule+0x29/0x70
      [   83.844606]  [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50
      [   83.844610]  [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260
      [   83.844613]  [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0
      [   83.844615]  [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250
      [   83.844618]  [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30
      [   83.844622]  [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0
      [   83.844627]  [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0
      [   83.844630]  [<ffffffff8119745b>] ? putname+0x2b/0x40
      [   83.844633]  [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0
      [   83.844636]  [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b
      
      If recv_actor() returns 0, we should stop immediately,
      because looping won't give a chance to drain the pipe.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff905b1e
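The rule stated in this message (stop as soon as the actor consumes nothing) can be modelled in user space. The sketch below is a toy model under stated assumptions: `read_sock` stands in for the receive loop, `Pipe` for a bounded consumer, and neither mirrors the kernel data structures.

```python
from collections import deque


class Pipe:
    """A bounded consumer: accepts bytes until its capacity is reached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def write(self, buf):
        room = self.capacity - len(self.data)
        n = min(room, len(buf))
        self.data += buf[:n]
        return n  # 0 means the pipe is full


def read_sock(queue, recv_actor):
    """Feed queued buffers to recv_actor and return the bytes consumed."""
    copied = 0
    while queue:
        used = recv_actor(queue[0])
        if used <= 0:
            # The consumer took nothing. Retrying the same buffer would
            # spin forever, because nothing drains the pipe in this loop.
            break
        copied += used
        if used == len(queue[0]):
            queue.popleft()
        else:
            queue[0] = queue[0][used:]
    return copied


if __name__ == "__main__":
    q = deque([b"abcd", b"efgh"])
    print(read_sock(q, Pipe(6).write), list(q))
```

Without the `used <= 0` check, the loop would keep offering the leftover bytes to a full pipe.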
  8. 10 Jan, 2013 1 commit
  9. 09 Jan, 2013 1 commit
  10. 07 Jan, 2013 2 commits
  11. 05 Jan, 2013 1 commit
  12. 29 Dec, 2012 1 commit
  13. 27 Dec, 2012 2 commits
  14. 25 Dec, 2012 1 commit
  15. 22 Dec, 2012 3 commits
    • ipv4: arp: fix a lockdep splat in arp_solicit() · 9650388b
      Committed by Eric Dumazet
      Yan Burman reported the following lockdep warning:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.7.0+ #24 Not tainted
      ---------------------------------------------
      swapper/1/0 is trying to acquire lock:
        (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
      
      but task is already holding lock:
        (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280
      
      other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&n->lock);
         lock(&n->lock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
      4 locks held by swapper/1/0:
        #0:  (((&n->timer))){+.-...}, at: [<ffffffff8104b350>] call_timer_fn+0x0/0x1c0
        #1:  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>] dev_queue_xmit+0x0/0x5d0
        #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>] ip_finish_output+0x13e/0x640
      
      stack backtrace:
      Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
      Call Trace:
        <IRQ>  [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
        [<ffffffff8108d570>] __lock_acquire+0x440/0xc30
        [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8108ddf5>] lock_acquire+0x95/0x140
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
        [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
        [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
        [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
        [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
        [<ffffffff813cb9a0>] ip_output+0x80/0xf0
        [<ffffffff813ca368>] ip_local_out+0x28/0x80
        [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
        [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
        [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
        [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
        [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
        [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
        [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
        [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
        [<ffffffff813f5788>] arp_xmit+0x58/0x60
        [<ffffffff813f59db>] arp_send+0x3b/0x40
        [<ffffffff813f6424>] arp_solicit+0x204/0x280
        [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
        [<ffffffff8139f515>] neigh_probe+0x45/0x70
        [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
        [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
        [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
        [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
        [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
        [<ffffffff81043e51>] __do_softirq+0x101/0x280
        [<ffffffff814518cc>] call_softirq+0x1c/0x30
        [<ffffffff81003b65>] do_softirq+0x85/0xc0
        [<ffffffff81043a7e>] irq_exit+0x9e/0xc0
        [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
        [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
        <EOI>  [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
        [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
        [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
        [<ffffffff81441127>] start_secondary+0x1b2/0x1b6
      
      The bug comes from arp_solicit() releasing the neigh lock only after
      arp_send(). In the case of vxlan, we eventually need to write lock a
      neigh lock later.

      It's a false positive, but we can get rid of it without lockdep
      annotations.

      We can instead use the neigh_ha_snapshot() helper.
      Reported-by: Yan Burman <yanb@mellanox.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9650388b
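The general idea behind the fix (copy the hardware address first, then transmit without holding the neighbour lock) can be illustrated with a user-space analogy. This is only a sketch: a plain `threading.Lock` stands in for the kernel locking, and `Neighbour`, `ha_snapshot` and `solicit` are hypothetical names rather than the kernel functions.

```python
import threading


class Neighbour:
    """Toy neighbour entry: a hardware address protected by a lock."""

    def __init__(self, ha):
        self.lock = threading.Lock()
        self.ha = ha


def ha_snapshot(neigh):
    """Return a private copy of the hardware address."""
    with neigh.lock:
        return bytes(neigh.ha)


def solicit(neigh, send):
    """Send a probe without holding neigh.lock across the send call."""
    dst_ha = ha_snapshot(neigh)
    # The lock is already released here, so the transmit path is free to
    # take a neighbour lock itself without nesting inside ours.
    send(dst_ha)


if __name__ == "__main__":
    n = Neighbour(b"\xaa\xbb\xcc\xdd\xee\xff")
    solicit(n, lambda ha: print("probe to", ha.hex(":")))
```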
    • ip_gre: fix possible use after free · f7e75ba1
      Committed by Eric Dumazet
      Once skb_realloc_headroom() is called, tiph might point to freed memory.
      
      Cache tiph->ttl value before the reallocation, to avoid unexpected
      behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Isaku Yamahata <yamahata@valinux.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7e75ba1
    • ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally · 412ed947
      Committed by Isaku Yamahata
      ipgre_tunnel_xmit() parses the network header as IP unconditionally.
      But transmitted packets are not always IP packets. For example, such a packet
      can be sent through a packet socket with sockaddr_ll.sll_protocol set.
      So make the function check whether skb->protocol is IP.
      Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      412ed947
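The guard described here amounts to checking the protocol before interpreting the bytes as an IPv4 header. A minimal sketch, assuming a hypothetical helper `inner_ttl` that takes the EtherType and the raw network header; it is an illustration of the check, not the driver code.

```python
ETH_P_IP = 0x0800   # EtherType for IPv4
ETH_P_ARP = 0x0806  # EtherType for ARP, used here as a non-IP example
IPV4_MIN_HEADER = 20
IPV4_TTL_OFFSET = 8


def inner_ttl(protocol, network_header):
    """Return the IPv4 TTL of the inner packet, or None if it is not IPv4."""
    if protocol != ETH_P_IP:
        return None  # do not parse a non-IP payload as an IP header
    if len(network_header) < IPV4_MIN_HEADER:
        return None  # too short to contain an IPv4 header
    return network_header[IPV4_TTL_OFFSET]


if __name__ == "__main__":
    header = bytes([0x45, 0, 0, 20, 0, 0, 0, 0, 64, 47]) + bytes(10)
    print(inner_ttl(ETH_P_IP, header), inner_ttl(ETH_P_ARP, header))
```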
  16. 17 Dec, 2012 2 commits
    • netfilter: nf_nat: Also handle non-ESTABLISHED routing changes in MASQUERADE · c65ef8dc
      Committed by Andrew Collins
      Since (a0ecb85a netfilter: nf_nat: Handle routing changes in MASQUERADE
      target), the MASQUERADE target handles routing changes which affect
      the output interface of a connection, but only for ESTABLISHED
      connections.  It is also possible for NEW connections which
      already have a conntrack entry to be affected by routing changes.
      
      This adds a check to drop entries in the NEW+conntrack state
      when the oif has changed.
      Signed-off-by: Andrew Collins <bsderandrew@gmail.com>
      Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c65ef8dc
    • netfilter: ip[6]t_REJECT: fix wrong transport header pointer in TCP reset · c6f40899
      Committed by Mukund Jampala
      The problem occurs when iptables constructs the tcp reset packet.
      It doesn't initialize the pointer to the tcp header within the skb.
      When the skb is passed to the ixgbe driver for transmit, the ixgbe
      driver attempts to access the tcp header and crashes.
      Currently, other drivers (such as our 1G e1000e or igb drivers) don't
      access the tcp header on transmit unless the TSO option is turned on.
      
      <1>BUG: unable to handle kernel NULL pointer dereference at 0000000d
      <1>IP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
      <4>*pdpt = 0000000085e5d001 *pde = 0000000000000000
      <0>Oops: 0000 [#1] SMP
      [...]
      <4>Pid: 0, comm: swapper Tainted: P            2.6.35.12 #1 Greencity/Thurley
      <4>EIP: 0060:[<d081621c>] EFLAGS: 00010246 CPU: 16
      <4>EIP is at ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
      <4>EAX: c7628820 EBX: 00000007 ECX: 00000000 EDX: 00000000
      <4>ESI: 00000008 EDI: c6882180 EBP: dfc6b000 ESP: ced95c48
      <4> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      <0>Process swapper (pid: 0, ti=ced94000 task=ced73bd0 task.ti=ced94000)
      <0>Stack:
      <4> cbec7418 c779e0d8 c77cc888 c77cc8a8 0903010a 00000000 c77c0008 00000002
      <4><0> cd4997c0 00000010 dfc6b000 00000000 d0d176c9 c77cc8d8 c6882180 cbec7318
      <4><0> 00000004 00000004 cbec7230 cbec7110 00000000 cbec70c0 c779e000 00000002
      <0>Call Trace:
      <4> [<d0d176c9>] ? 0xd0d176c9
      <4> [<d0d18a4d>] ? 0xd0d18a4d
      <4> [<411e243e>] ? dev_hard_start_xmit+0x218/0x2d7
      <4> [<411f03d7>] ? sch_direct_xmit+0x4b/0x114
      <4> [<411f056a>] ? __qdisc_run+0xca/0xe0
      <4> [<411e28b0>] ? dev_queue_xmit+0x2d1/0x3d0
      <4> [<411e8120>] ? neigh_resolve_output+0x1c5/0x20f
      <4> [<411e94a1>] ? neigh_update+0x29c/0x330
      <4> [<4121cf29>] ? arp_process+0x49c/0x4cd
      <4> [<411f80c9>] ? nf_hook_slow+0x3f/0xac
      <4> [<4121ca8d>] ? arp_process+0x0/0x4cd
      <4> [<4121ca8d>] ? arp_process+0x0/0x4cd
      <4> [<4121c6d5>] ? T.901+0x38/0x3b
      <4> [<4121c918>] ? arp_rcv+0xa3/0xb4
      <4> [<4121ca8d>] ? arp_process+0x0/0x4cd
      <4> [<411e1173>] ? __netif_receive_skb+0x32b/0x346
      <4> [<411e19e1>] ? netif_receive_skb+0x5a/0x5f
      <4> [<411e1ea9>] ? napi_skb_finish+0x1b/0x30
      <4> [<d0816eb4>] ? ixgbe_xmit_frame_ring+0x1564/0x2260 [ixgbe]
      <4> [<41013468>] ? lapic_next_event+0x13/0x16
      <4> [<410429b2>] ? clockevents_program_event+0xd2/0xe4
      <4> [<411e1b03>] ? net_rx_action+0x55/0x127
      <4> [<4102da1a>] ? __do_softirq+0x77/0xeb
      <4> [<4102dab1>] ? do_softirq+0x23/0x27
      <4> [<41003a67>] ? do_IRQ+0x7d/0x8e
      <4> [<41002a69>] ? common_interrupt+0x29/0x30
      <4> [<41007bcf>] ? mwait_idle+0x48/0x4d
      <4> [<4100193b>] ? cpu_idle+0x37/0x4c
      <0>Code: df 09 d7 0f 94 c2 0f b6 d2 e9 e7 fb ff ff 31 db 31 c0 e9 38
      ff ff ff 80 78 06 06 0f 85 3e fb ff ff 8b 7c 24 38 8b 8f b8 00 00 00
      <0f> b6 51 0d f6 c2 01 0f 85 27 fb ff ff 80 e2 02 75 0d 8b 6c 24
      <0>EIP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] SS:ESP
      Signed-off-by: Mukund Jampala <jbmukund@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c6f40899
  17. 15 Dec, 2012 1 commit
    • inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock · e337e24d
      Committed by Christoph Paasch
      If in either of the above functions inet_csk_route_child_sock() or
      __inet_inherit_port() fails, the newsk will not be freed:
      
      unreferenced object 0xffff88022e8a92c0 (size 1592):
        comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
        hex dump (first 32 bytes):
          0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
          02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
          [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
          [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
          [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
          [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
          [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
          [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
          [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
          [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
          [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
          [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
          [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
          [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
          [<ffffffff814cee68>] ip_rcv+0x217/0x267
          [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
          [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
      
      This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
      a single sock_put() is not enough to free the memory. Additionally, things
      like xfrm, memcg, cookie_values,... may have been initialized.
      We have to free them properly.
      
      This is fixed by forcing a call to tcp_done(), ending up in
      inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
      because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...
      
      Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
      force it entering inet_csk_destroy_sock. To avoid the warning in
      inet_csk_destroy_sock, inet_num has to be set to 0.
      As inet_csk_destroy_sock does a dec on orphan_count, we first have to
      increase it.
      
      Calling tcp_done() allows us to remove the calls to
      tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
      
      A similar approach is taken for dccp by calling dccp_done().
      
      This bug has been in the kernel since 093d2823 (tproxy: fix hash locking issue
      when using port redirection in __inet_inherit_port()), that is, since
      version 2.6.37.
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e337e24d
  18. 11 Dec, 2012 2 commits
  19. 10 Dec, 2012 4 commits
    • inet_diag: validate port comparison byte code to prevent unsafe reads · 5e1f5420
      Committed by Neal Cardwell
      Add logic to verify that a port comparison byte code operation
      actually has the second inet_diag_bc_op from which we read the port
      for such operations.
      
      Previously the code blindly referenced op[1] without first checking
      whether a second inet_diag_bc_op struct could fit there. So a
      malicious user could make the kernel read 4 bytes beyond the end of
      the bytecode array by claiming to have a whole port comparison byte
      code (2 inet_diag_bc_op structs) when in fact the bytecode was not
      long enough to hold both.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e1f5420
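The length check described in this message can be sketched in user space. The layout below assumes the 4-byte inet_diag_bc_op structure (one byte `code`, one byte `yes`, a 16-bit `no`), with the port carried in the `no` field of the second op; the helper names and the opcode value used in the usage lines are illustrative assumptions.

```python
import struct

# Assumed layout of one op: code (u8), yes (u8), no (u16), native byte order.
BC_OP = struct.Struct("=BBH")


def valid_port_comparison(bytecode, offset):
    """True if two whole ops fit in the bytecode starting at `offset`."""
    return len(bytecode) - offset >= 2 * BC_OP.size


def read_port(bytecode, offset):
    """Read the port operand of a port comparison, refusing short input."""
    if not valid_port_comparison(bytecode, offset):
        raise ValueError("bytecode too short for a port comparison")
    _code, _yes, port = BC_OP.unpack_from(bytecode, offset + BC_OP.size)
    return port


if __name__ == "__main__":
    # Hypothetical opcode value 2 for the comparison, then the operand op.
    ops = BC_OP.pack(2, 8, 12) + BC_OP.pack(0, 0, 80)
    print(read_port(ops, 0))
```

A truncated buffer holding only the first op is rejected instead of being read past its end.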
    • inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() · f67caec9
      Committed by Neal Cardwell
      Add logic to check the address family of the user-supplied conditional
      and the address family of the connection entry. We now do not do
      prefix matching of addresses from different address families (AF_INET
      vs AF_INET6), except for the previously existing support for having an
      IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
      maintains as-is).
      
      This change is needed for two reasons:
      
      (1) The addresses are different lengths, so comparing a 128-bit IPv6
      prefix match condition to a 32-bit IPv4 connection address can cause
      us to unwittingly walk off the end of the IPv4 address and read
      garbage or oops.
      
      (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
      simple bit-wise comparison of the prefixes is not meaningful, and
      would lead to bogus results (except for the IPv4-mapped IPv6 case,
      which this commit maintains).
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f67caec9
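The matching policy described in this message (same family only, plus the IPv4 prefix against IPv4-mapped IPv6 case) can be sketched with the standard `ipaddress` module. The function names are hypothetical and the code is a model of the rule, not of inet_diag_bc_run().

```python
import ipaddress


def bits_match(a, b, prefix_len):
    """Compare the first `prefix_len` bits of two packed addresses."""
    full, rem = divmod(prefix_len, 8)
    if a[:full] != b[:full]:
        return False
    if rem:
        mask = (0xFF << (8 - rem)) & 0xFF
        return (a[full] & mask) == (b[full] & mask)
    return True


def prefix_match(cond_addr, prefix_len, conn_addr):
    """Match a connection address against a prefix condition."""
    cond = ipaddress.ip_address(cond_addr)
    conn = ipaddress.ip_address(conn_addr)
    # A prefix longer than the condition's address would read past its end.
    if prefix_len > cond.max_prefixlen:
        return False
    if cond.version == conn.version:
        return bits_match(cond.packed, conn.packed, prefix_len)
    # Only cross-family case allowed: IPv4 prefix vs IPv4-mapped IPv6.
    if cond.version == 4 and conn.ipv4_mapped is not None:
        return bits_match(cond.packed, conn.ipv4_mapped.packed, prefix_len)
    return False


if __name__ == "__main__":
    print(prefix_match("10.0.0.0", 8, "::ffff:10.1.2.3"))
    print(prefix_match("2001:db8::", 32, "10.0.0.1"))
```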
    • inet_diag: validate byte code to prevent oops in inet_diag_bc_run() · 405c0059
      Committed by Neal Cardwell
      Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
      operations.
      
      Previously we did not validate the inet_diag_hostcond, address family,
      address length, and prefix length. So a malicious user could make the
      kernel read beyond the end of the bytecode array by claiming to have a
      whole inet_diag_hostcond when the bytecode was not long enough to
      contain a whole inet_diag_hostcond of the given address family. Or
      they could make the kernel read up to about 27 bytes beyond the end of
      a connection address by passing a prefix length that exceeded the
      length of addresses of the given family.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      405c0059
    • inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state · 1c95df85
      Committed by Neal Cardwell
      Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
      instantiated for IPv4 traffic and in the SYN-RECV state were actually
      created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
      means that for such connections inet6_rsk(req) returns a pointer to a
      random spot in memory up to roughly 64KB beyond the end of the
      request_sock.
      
      With this bug, for a server using AF_INET6 TCP sockets and serving
      IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
      inet_diag_fill_req() causing an oops or the export to user space of 16
      bytes of kernel memory as a garbage IPv6 address, depending on where
      the garbage inet6_rsk(req) pointed.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c95df85
  20. 09 Dec, 2012 1 commit
  21. 08 Dec, 2012 2 commits
    • tcp: bug fix Fast Open client retransmission · 93b174ad
      Committed by Yuchung Cheng
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments loss recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and gets 3 dupacks.

      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further, the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retransmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93b174ad
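Why the retransmission needs a loop can be seen from a small model: if the server's MSS is smaller than the one used for the SYN-data, the unacknowledged remainder no longer fits in one segment. The helper `segments_to_retransmit` is a hypothetical illustration and does not reflect the kernel's skb handling.

```python
def segments_to_retransmit(syn_data, acked, mss):
    """Split the unacknowledged part of the SYN-data into MSS-sized pieces.

    `acked` is the number of payload bytes the SYN-ACK acknowledged and
    `mss` is the MSS announced by the server.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    remaining = syn_data[acked:]
    # One pass per segment: a single send would leave the tail behind.
    return [remaining[i:i + mss] for i in range(0, len(remaining), mss)]


if __name__ == "__main__":
    data = bytes(1000)  # SYN-data sized for a cached MSS of 1000
    sizes = [len(s) for s in segments_to_retransmit(data, 0, 400)]
    print(sizes)
```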
    • ipv4/route/rtnl: get mcast attributes when dst is multicast · 8caaf7b6
      Committed by Nicolas Dichtel
      Commit f1ce3062 (ipv4: Remove 'rt_dst' from 'struct rtable') removes the
      call to ipmr_get_route(), which will get multicast parameters of the route.
      
      I revert the part of the patch that removes this call. I think the goal was only
      to get rid of the rt_dst field.

      The patch is only compile-tested. My first idea was to remove ipmr_get_route()
      because rt_fill_info() was the only user, but it seems the previous patch cleans
      the code a bit too much ;-)
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8caaf7b6
  22. 05 Dec, 2012 2 commits