1. 03 Jul, 2008 1 commit
    • tcp: de-bloat a bit with factoring NET_INC_STATS_BH out · 40b215e5
      Pavel Emelyanov committed
      There are some places in TCP that select one MIB index to
      bump snmp statistics like this:
      
      	if (<something>)
      		NET_INC_STATS_BH(<some_id>);
      	else if (<something_else>)
      		NET_INC_STATS_BH(<some_other_id>);
      	...
      	else
      		NET_INC_STATS_BH(<default_id>);
      
      or in a more tricky but still similar way.
      
      On the other hand, this NET_INC_STATS_BH is a camouflaged
      increment of percpu variable, which is not that small.
      
      Factoring those cases out de-bloats 235 bytes on non-preemptible
      i386 config and drives parts of the code into 80 columns.
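The factoring described above can be sketched in user-space C. The counter array, the index names, and the `stat_inc()` helper are hypothetical stand-ins for the kernel's per-CPU SNMP counters and `NET_INC_STATS_BH`; only the shape of the transformation is taken from the commit message.

```c
/* Hypothetical MIB indices, standing in for real LINUX_MIB_* values. */
enum mib_idx { MIB_FIRST, MIB_SECOND, MIB_DEFAULT, MIB_MAX };

/* Stand-in for the per-CPU statistics array. */
static unsigned long mib_counters[MIB_MAX];

/* Stand-in for NET_INC_STATS_BH(). In the kernel this expands to a
 * per-CPU increment, so every call site carries its own copy. */
static inline void stat_inc(enum mib_idx idx)
{
	mib_counters[idx]++;
}

/* Before: one increment expansion per branch. */
static void account_before(int a, int b)
{
	if (a)
		stat_inc(MIB_FIRST);
	else if (b)
		stat_inc(MIB_SECOND);
	else
		stat_inc(MIB_DEFAULT);
}

/* After: the branches only select an index, and a single
 * increment expansion remains at the end. */
static void account_after(int a, int b)
{
	enum mib_idx idx;

	if (a)
		idx = MIB_FIRST;
	else if (b)
		idx = MIB_SECOND;
	else
		idx = MIB_DEFAULT;

	stat_inc(idx);
}
```

Both functions bump the same counter for the same inputs; the second form simply emits the increment code once, which is where the size reduction would come from.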
      
      add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
      function                                     old     new   delta
      tcp_fastretrans_alert                       1437    1424     -13
      tcp_dsack_set                                137     124     -13
      tcp_xmit_retransmit_queue                    690     676     -14
      tcp_try_undo_recovery                        283     265     -18
      tcp_sacktag_write_queue                     1550    1515     -35
      tcp_update_reordering                        162     106     -56
      tcp_retransmit_timer                         990     904     -86
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Jul, 2008 1 commit
  3. 28 Jun, 2008 4 commits
  4. 20 Jun, 2008 2 commits
  5. 18 Jun, 2008 3 commits
    • udp: sk_drops handling · cb61cb9b
      Eric Dumazet committed
      In commits 33c732c3 ([IPV4]: Add raw
      drops counter) and a92aa318 ([IPV6]:
      Add raw drops counter), Wang Chen added raw drops counter for
      /proc/net/raw & /proc/net/raw6
      
      This patch adds this capability to UDP sockets too (/proc/net/udp &
      /proc/net/udp6).
      
      This means that 'RcvbufErrors' errors found in /proc/net/snmp can also
      be examined for each udp socket.
      
      # grep Udp: /proc/net/snmp
      Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
      Udp: 23971006 75 899420 16390693 146348 0
      
      # cat /proc/net/udp
       sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt  ---
      uid  timeout inode ref pointer drops
       75: 00000000:02CB 00000000:0000 07 00000000:00000000 00:00000000 00000000  ---
        0        0 2358 2 ffff81082a538c80 0
      111: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000  ---
        0        0 2286 2 ffff81042dd35c80 146348
      
      In this example, only port 111 (0x006F) was flooded by messages that
      the user program could not read fast enough. 146348 messages were lost.
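A minimal sketch of how a monitoring tool might pull the per-socket counter out of such a line. It assumes the local address is the second field (hex `ADDR:PORT`) and that `drops` is the last field, as in the sample above; the function name `parse_udp_line` is hypothetical, and each socket is expected on a single unwrapped line.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one data line of /proc/net/udp.
 * Stores the local port and the drops counter on success.
 * Returns 0 on success, -1 if the line does not look like a socket entry
 * (for example the header line). */
static int parse_udp_line(const char *line, unsigned int *port,
			  unsigned long *drops)
{
	unsigned int slot, addr, p;
	const char *last;
	char *end;

	/* "  sl: ADDR:PORT ..." with ADDR and PORT in hexadecimal. */
	if (sscanf(line, " %u: %x:%x", &slot, &addr, &p) != 3)
		return -1;

	/* The drops counter is the last whitespace-separated field. */
	last = strrchr(line, ' ');
	if (last == NULL)
		return -1;

	*drops = strtoul(last + 1, &end, 10);
	if (end == last + 1)
		return -1;	/* no digits after the last space */

	*port = p;
	return 0;
}
```

Fed the second socket line of the example, this yields port 111 and a drops value of 146348.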
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xfrm: fix fragmentation for ipv4 xfrm tunnel · fe833fca
      Steffen Klassert committed
      When generating the ip header for the transformed packet we just copy
      the frag_off field of the ip header from the original packet to the ip
      header of the new generated packet. If we receive a packet as a chain
      of fragments, all but the last of the new generated packets have the
      IP_MF flag set. We have to mask the frag_off field to only keep the
      IP_DF flag from the original packet. This got lost with git commit
      36cf9acf ("[IPSEC]: Separate
      inner/outer mode processing on output")
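The masking step can be illustrated with a small user-space sketch. The constants mirror the IPv4 flag bits from RFC 791 and are defined locally under hypothetical `SKETCH_` names; `outer_frag_off()` is an illustrative helper, not the function changed by the patch.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

/* Bits of the IPv4 frag_off field in host byte order (RFC 791). */
#define SKETCH_IP_DF	0x4000u	/* don't fragment */
#define SKETCH_IP_MF	0x2000u	/* more fragments */

/* Derive the outer header's frag_off from the inner header's value.
 * Both values are in network byte order. Only the DF bit is carried
 * over; MF and the fragment offset are dropped so that the new packet
 * is not mistaken for a fragment. */
static uint16_t outer_frag_off(uint16_t inner_frag_off)
{
	return inner_frag_off & htons(SKETCH_IP_DF);
}
```

For an inner packet that is a middle fragment (MF set, non-zero offset), the result is 0; for one with DF set, only DF survives.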
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_nat: fix RCU races · 68b80f11
      Patrick McHardy committed
      Fix three ct_extend/NAT extension related races:
      
      - When cleaning up the extension area and removing it from the bysource hash,
        the nat->ct pointer must not be set to NULL since it may still be used in
        a RCU read side
      
      - When replacing a NAT extension area in the bysource hash, the nat->ct
        pointer must be assigned before performing the replacement
      
      - When reallocating extension storage in ct_extend, the old memory must
        not be freed immediately since it may still be used by a RCU read side
      
      Possibly fixes https://bugzilla.redhat.com/show_bug.cgi?id=449315
      and/or http://bugzilla.kernel.org/show_bug.cgi?id=10875
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 17 Jun, 2008 9 commits
  7. 15 Jun, 2008 1 commit
  8. 13 Jun, 2008 1 commit
    • tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller committed
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, he noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 12 Jun, 2008 5 commits
  10. 11 Jun, 2008 3 commits
  11. 10 Jun, 2008 4 commits
  12. 06 Jun, 2008 1 commit
  13. 05 Jun, 2008 5 commits
    • tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits. · 293ad604
      Octavian Purdila committed
      skb_splice_bits temporarily drops the socket lock while iterating over
      the socket queue in order to break a reverse locking condition which
      happens with sendfile. This, however, opens a window of opportunity
      for tcp_collapse() to aggregate skbs and thus potentially free the
      current skb used in skb_splice_bits and tcp_read_sock.
      
      This patch fixes the problem by (re-)getting the same "logical skb"
      after the lock has been temporarily dropped.
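The idea of re-getting a "logical skb" can be sketched as a lookup by sequence number instead of by pointer. The `struct seg` type and `find_seg()` helper below are hypothetical user-space stand-ins for skbs on the receive queue; the sketch only illustrates why a sequence number stays valid across a collapse while a saved pointer may not.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an skb on a receive queue. */
struct seg {
	uint32_t seq;		/* sequence number of the first byte */
	uint32_t len;		/* payload length in bytes */
	struct seg *next;
};

/* Find the queued segment that holds byte `seq` and report the offset
 * of that byte inside it. Returns NULL if the byte is not queued.
 * The unsigned subtraction keeps the comparison wrap-safe. */
static struct seg *find_seg(struct seg *head, uint32_t seq, uint32_t *off)
{
	struct seg *s;

	for (s = head; s != NULL; s = s->next) {
		uint32_t delta = seq - s->seq;

		if (delta < s->len) {
			*off = delta;
			return s;
		}
	}
	return NULL;
}
```

A reader that remembers only the sequence number it has reached can call the lookup again after re-taking the lock; if two segments were merged in the meantime, it gets the merged segment with an adjusted offset rather than a dangling pointer.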
      
      Based on idea and initial patch from Evgeniy Polyakov.
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Increment OUTRSTS in tcp_send_active_reset() · 26af65cb
      Sridhar Samudrala committed
      TCP "resets sent" counter is not incremented when a TCP Reset is 
      sent via tcp_send_active_reset().
      Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • raw: Raw socket leak. · 22dd4850
      Denis V. Lunev committed
      The program below just leaks the raw kernel socket
      
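      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>

      int main(void) {
              int fd = socket(PF_INET, SOCK_RAW, IPPROTO_UDP);
              struct sockaddr_in addr;

              memset(&addr, 0, sizeof(addr));
              inet_aton("127.0.0.1", &addr.sin_addr);
              addr.sin_family = AF_INET;
              addr.sin_port = htons(2048);
              /* MSG_MORE corks the packet; the program then exits without sending it. */
              sendto(fd,  "a", 1, MSG_MORE, (struct sockaddr *)&addr, sizeof(addr));
              return 0;
      }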
      
      Corked packet is allocated via sock_wmalloc which holds the owner socket,
      so one should uncork it and flush all pending data on close. Do this in the
      same way as in UDP.
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix skb vs fack_count out-of-sync condition · a6604471
      Ilpo Järvinen committed
      This bug is able to corrupt fackets_out in very rare cases.
      In order for this to cause corruption:
        1) DSACK in the middle of previous SACK block must be generated.
        2) In order to take that particular branch, part or all of the
           DSACKed segment must already be SACKed so that we have that
           in cache in the first place.
        3) The new info must be top enough so that fackets_out will be
           updated on this iteration.
      ...then fack_count is updated while skb wasn't, then we walk again
      that particular segment thus updating fack_count twice for
      a single skb and finally that value is assigned to fackets_out
      by tcp_sacktag_one.
      
      It is safe to call tcp_sacktag_one just once for a segment (at
      DSACK), no need to call again for plain SACK.
      
      Potential problems from the miscount are limited to premature entry
      to recovery and to an inflated reordering metric (which could even
      cancel each other out in the luckiest scenarios :-)).
      Both are quite insignificant in worst case too and there exists
      also code to reset them (fackets_out once sacked_out becomes zero
      and reordering metric on RTO).
      
      This has been reported by a number of people, but because it occurred
      quite rarely, it has been very evasive. Andy Furniss was able to
      get it to occur a couple of times so that a bit more info was
      collected about the problem using a debug patch, though it still
      required a lot of checking around. Thanks also to others who have
      tried to help here.
      
      This is listed as Bugzilla #10346. The bug was introduced by
      me in commit 68f8353b ([TCP]: Rewrite SACK block processing & 
      sack_recv_cache use), I probably thought back then that there's
      need to scan that entry twice or didn't dare to make it go
      through it just once there. Going through twice would have
      required restoring fack_count after the walk but as noted above,
      I chose to drop the additional walk step altogether here.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV6]: inet_sk(sk)->cork.opt leak · 36d926b9
      Denis V. Lunev committed
      IPv6 UDP sockets with IPv4 mapped address actually use udp_sendmsg to send
      the data. In this case ip_flush_pending_frames should be called instead
      of ip6_flush_pending_frames.
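A user-space sketch of the dispatch this implies: the flush routine should follow the address family that was actually used for the pending data, not the socket's own family. The `enum flush_path`, `pending_family_for()` and `pick_flush_path()` names are hypothetical; `IN6_IS_ADDR_V4MAPPED` is the standard macro from `<netinet/in.h>`.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Which flush routine would apply to the corked data (illustrative). */
enum flush_path { FLUSH_NONE, FLUSH_IPV4, FLUSH_IPV6 };

/* An IPv6 socket sending to an IPv4-mapped destination (::ffff:a.b.c.d)
 * ends up on the IPv4 send path, so the pending data is IPv4 data. */
static int pending_family_for(const struct in6_addr *daddr)
{
	return IN6_IS_ADDR_V4MAPPED(daddr) ? AF_INET : AF_INET6;
}

/* Choose the flush routine from the family of the pending data. */
static enum flush_path pick_flush_path(int pending_family)
{
	if (pending_family == AF_INET)
		return FLUSH_IPV4;
	if (pending_family == AF_INET6)
		return FLUSH_IPV6;
	return FLUSH_NONE;	/* nothing corked */
}
```

With a destination of `::ffff:127.0.0.1` the sketch selects the IPv4 flush path, while `::1` selects the IPv6 one.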
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>