1. 13 Jun, 2008 1 commit
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller committed
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
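      
      The ordering rule above can be illustrated with a small userspace
      sketch. This is not the kernel's locking code: the rank names, the
      lock_ctx structure and both functions are hypothetical, and the
      sketch only checks acquisition order instead of taking real locks.
      It assumes the listener lock must always be taken before the child
      lock, as the message describes.

```c
#include <assert.h>

/* Hypothetical lock ranks: a lower rank must be taken before a higher one. */
enum lock_rank { RANK_LISTENER = 0, RANK_CHILD = 1, RANK_COUNT };

/* Per-thread record of which ranks are currently held (one bit per rank). */
struct lock_ctx {
	unsigned int held;
};

/* Returns 0 on success, -1 if taking `rank` now would invert the order. */
int ctx_acquire(struct lock_ctx *ctx, enum lock_rank rank)
{
	assert(rank < RANK_COUNT);
	/* Any held lock of equal or higher rank means we are out of order. */
	if (ctx->held >> rank)
		return -1;
	ctx->held |= 1u << rank;
	return 0;
}

/* Drops a rank that the caller currently holds. */
void ctx_release(struct lock_ctx *ctx, enum lock_rank rank)
{
	assert(ctx->held & (1u << rank));
	ctx->held &= ~(1u << rank);
}
```

      A path that already holds RANK_CHILD and then asks for
      RANK_LISTENER gets -1 from ctx_acquire(), which is the inversion
      that would be needed to touch listener state from the child's
      context.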
      
      Next is a problem noticed by Vitaliy Gusev, who noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Jun, 2008 2 commits
  3. 05 Jun, 2008 7 commits
    • O
      tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits. · 293ad604
      Octavian Purdila committed
      skb_splice_bits temporarily drops the socket lock while iterating over
      the socket queue in order to break a reverse locking condition which
      happens with sendfile. This, however, opens a window of opportunity
      for tcp_collapse() to aggregate skbs and thus potentially free the
      current skb used in skb_splice_bits and tcp_read_sock.
      
      This patch fixes the problem by (re-)getting the same "logical skb"
      after the lock has been temporarily dropped.
      
      Based on idea and initial patch from Evgeniy Polyakov.
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      tcp: Increment OUTRSTS in tcp_send_active_reset() · 26af65cb
      Sridhar Samudrala committed
      The TCP "resets sent" counter is not incremented when a TCP Reset is
      sent via tcp_send_active_reset().
      Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      raw: Raw socket leak. · 22dd4850
      Denis V. Lunev committed
      The program below just leaks the raw kernel socket:
      
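      #include <string.h>
      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      
      int main(void) {
              /* Raw sockets need CAP_NET_RAW, so run this as root. */
              int fd = socket(PF_INET, SOCK_RAW, IPPROTO_UDP);
              struct sockaddr_in addr;
      
              memset(&addr, 0, sizeof(addr));
              inet_aton("127.0.0.1", &addr.sin_addr);
              addr.sin_family = AF_INET;
              addr.sin_port = htons(2048);
              /* MSG_MORE corks the packet; it is never flushed before exit. */
              sendto(fd,  "a", 1, MSG_MORE, (struct sockaddr *)&addr, sizeof(addr));
              return 0;
      }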
      
      The corked packet is allocated via sock_wmalloc, which holds a reference
      to the owner socket, so one should uncork it and flush all pending data
      on close. Do this in the same way as in UDP.
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      tcp: fix skb vs fack_count out-of-sync condition · a6604471
      Ilpo Järvinen committed
      This bug is able to corrupt fackets_out in very rare cases.
      In order for this to cause corruption:
        1) DSACK in the middle of previous SACK block must be generated.
        2) In order to take that particular branch, part or all of the
           DSACKed segment must already be SACKed so that we have that
           in cache in the first place.
        3) The new info must be top enough so that fackets_out will be
           updated on this iteration.
      ...then fack_count is updated while skb wasn't, then we walk again
      that particular segment thus updating fack_count twice for
      a single skb and finally that value is assigned to fackets_out
      by tcp_sacktag_one.
      
      It is safe to call tcp_sacktag_one just once for a segment (at
      DSACK), no need to call again for plain SACK.
      
      Potential problems from the miscount are limited to premature entry
      to recovery and to an inflated reordering metric (which could even
      cancel each other out in the luckiest scenarios :-)).
      Both are quite insignificant even in the worst case, and there is
      also code to reset them (fackets_out once sacked_out becomes zero
      and reordering metric on RTO).
      
      This has been reported by a number of people, but because it occurred
      quite rarely, it has been very elusive. Andy Furniss was able to
      get it to occur a couple of times so that a bit more info was
      collected about the problem using a debug patch, though it still
      required a lot of checking around. Thanks also to others who have
      tried to help here.
      
      This is listed as Bugzilla #10346. The bug was introduced by
      me in commit 68f8353b ([TCP]: Rewrite SACK block processing & 
      sack_recv_cache use), I probably thought back then that there's
      need to scan that entry twice or didn't dare to make it go
      through it just once there. Going through twice would have
      required restoring fack_count after the walk but as noted above,
      I chose to drop the additional walk step altogether here.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      [IPV6]: inet_sk(sk)->cork.opt leak · 36d926b9
      Denis V. Lunev committed
      IPv6 UDP sockets with IPv4-mapped addresses actually use udp_sendmsg to
      send the data. In this case ip_flush_pending_frames should be called
      instead of ip6_flush_pending_frames.
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
    • I
      tcp: Fix inconsistency source (CA_Open only when !tcp_left_out(tp)) · 8aca6cb1
      Ilpo Järvinen committed
      It is possible that this skip path causes TCP to end up in an
      invalid state where ca_state was left to CA_Open while some
      segments already came into sacked_out. If next valid ACK doesn't
      contain new SACK information TCP fails to enter into
      tcp_fastretrans_alert(). Thus at least high_seq is set
      incorrectly to a too high seqno because some new data segments
      could be sent in between (and also, limited transmit is not
      being correctly invoked there). Reordering in both directions
      can easily cause this situation to occur.
      
      I guess we would want to use tcp_moderate_cwnd(tp) there as well
      as it may be possible to use this to trigger oversized burst to
      network by sending an old ACK with huge amount of SACK info, but
      I'm a bit unsure about its effects (mainly to FlightSize), so to
      be on the safe side I just currently fixed it minimally to keep
      TCP's state consistent (obviously, such nasty ACKs have been
      possible this far). Though it seems that FlightSize is already
      underestimated by some amount, so probably on the long term we
      might want to trigger recovery there too, if appropriate, to make
      the FlightSize calculation resemble reality at the time when the
      losses were discovered (but such change scares me too much now
      and requires some more thinking anyway how to do that as it
      likely involves some code shuffling).
      
      This bug was found by Brian Vowell while running my TCP debug
      patch to find cause of another TCP issue (fackets_out
      miscount).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 04 Jun, 2008 3 commits
  5. 22 May, 2008 3 commits
    • R
      net: The world is not perfect patch. · 071f92d0
      Rami Rosen committed
      Unless there are any objections, I suggest considering the
      following patch, which simply removes the code for
      -DI_WISH_WORLD_WERE_PERFECT in the three methods which use it.
      
      The compilation errors we get when using -DI_WISH_WORLD_WERE_PERFECT
      show that this code has not been built or used for a really long time.
      Signed-off-by: Rami Rosen <ramirose@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/ipv4/arp.c: Use common hex_asc helpers · 51f82a2b
      Denis Cheng committed
      Here the local hexbuf is a duplicate of global const char hex_asc from
      lib/hexdump.c, except the hex letters' cases:
      
      	const char hexbuf[] = "0123456789ABCDEF";
      
      	const char hex_asc[] = "0123456789abcdef";
      
      and for printing HW addresses the case of the hex letters is not significant.
      
      Thanks to Harvey Harrison for introducing the hex_asc_hi/hex_asc_lo helpers.
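      
      As a rough userspace illustration of how such helpers are used, the
      sketch below re-creates the lowercase table and the two nibble
      macros locally and formats an address with them. The function
      format_hw_addr() and its signature are hypothetical and are not
      taken from net/ipv4/arp.c.

```c
#include <assert.h>
#include <stddef.h>

/* Local userspace copy of the lookup table and helpers (for this sketch). */
static const char hex_asc[] = "0123456789abcdef";
#define hex_asc_lo(x) hex_asc[((x) & 0x0f)]
#define hex_asc_hi(x) hex_asc[((x) & 0xf0) >> 4]

/*
 * Hypothetical helper: write `len` address bytes into `out` as
 * colon-separated lowercase hex. Returns the string length.
 */
size_t format_hw_addr(const unsigned char *addr, size_t len,
		      char *out, size_t out_size)
{
	size_t i, pos = 0;

	/* Each byte needs two digits plus either ':' or the final NUL. */
	assert(len > 0 && out_size >= len * 3);
	for (i = 0; i < len; i++) {
		out[pos++] = hex_asc_hi(addr[i]);
		out[pos++] = hex_asc_lo(addr[i]);
		out[pos++] = (i + 1 < len) ? ':' : '\0';
	}
	return pos - 1;
}
```

      Because the table is lowercase, output changes from "1A" to "1a"
      style digits, which is the case difference the message says does
      not matter for HW addresses.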
      Signed-off-by: Denis Cheng <crquan@gmail.com>
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      tcp: TCP connection times out if ICMP frag needed is delayed · 7d227cd2
      Sridhar Samudrala committed
      We are seeing an issue with TCP in handling an ICMP frag needed
      message that is received after net.ipv4.tcp_retries1 retransmits.
      The default value of retries1 is 3. So if the path mtu changes
      and ICMP frag needed is lost for the first 3 retransmits or if
      it gets delayed until 3 retransmits are done, TCP doesn't update
      MSS correctly and continues to retransmit the original message
      until it times out after tcp_retries2 retransmits.
      
      I am seeing this issue even with the latest 2.6.25.4 kernel.
      
      In tcp_retransmit_timer(), when the retransmits counter exceeds the
      tcp_retries1 value, the dst cache entry of the socket is reset.
      At this time, if we receive an ICMP frag needed message, the
      dst entry gets updated with the new MTU, but the TCP socket's
      dst_cache entry remains NULL.
      
      So the next time when we try to retransmit after the ICMP frag
      needed is received, tcp_retransmit_skb() gets called. Here the
      cur_mss value is calculated at the start of the routine with
      a NULL sk_dst_cache. Instead we should call tcp_current_mss after
      the rebuild_header that caches the dst entry with the updated mtu.
      Also the rebuild_header should be called before tcp_fragment
      so that skb is fragmented if the mss goes down.
      Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 21 May, 2008 1 commit
    • H
      ipsec: Use the correct ip_local_out function · 1ac06e03
      Herbert Xu committed
      Because the IPsec output function xfrm_output_resume does its
      own dst_output call it should always call __ip_local_out
      instead of ip_local_out as the latter may invoke dst_output
      directly.  Otherwise the return values from nf_hook and dst_output
      may clash as they both use the value 1 but for different purposes.
      
      When that clash occurs this can cause a packet to be used after
      it has been freed which usually leads to a crash.  Because the
      offending value is only returned from dst_output with qdiscs
      such as HTB, this bug is normally not visible.
      
      Thanks to Marco Berizzi for his perseverance in tracking this
      down.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 14 May, 2008 1 commit
  8. 13 May, 2008 3 commits
    • I
      tcp FRTO: work-around inorder receivers · 79d44516
      Ilpo Järvinen committed
      If receiver consumes segments successfully only in-order, FRTO
      fallback to conventional recovery produces RTO loop because
      FRTO's forward transmissions will always get dropped and need to
      be resent, yet by default they're not marked as lost (which are
      the only segments we will retransmit in CA_Loss).
      
      The price to pay for this is occasionally retransmitting the
      forward transmission(s) unnecessarily. SACK blocks help
      a bit to avoid this, so it's mainly a concern for the NewReno case,
      though SACK is not fully immune either.
      
      This change has a side-effect of fixing a SACKFRTO problem where
      it didn't have snd_nxt of the RTO time available anymore when
      fallback became necessary (this problem would have only occurred
      when RTO would occur for two or more segments and ECE arrives
      in step 3; no need to figure out how to fix that unless the
      TODO item of selective behavior is considered in future).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Damon L. Chesser <damon@damtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      tcp FRTO: Fix fallback to conventional recovery · a1c1f281
      Ilpo Järvinen committed
      It seems that commit 009a2e3e ("[TCP] FRTO: Improve
      interoperability with other undo_marker users") ran into
      another land-mine which caused fallback to conventional
      recovery to break:
      
      1. Cumulative ACK arrives after FRTO retransmission
      2. tcp_try_to_open sees zero retrans_out, clears retrans_stamp
         which should be kept like in CA_Loss state it would be
      3. undo_marker change allowed tcp_packet_delayed to return
         true because of the cleared retrans_stamp once FRTO is
         terminated causing LossUndo to occur, which means all loss
         markings FRTO made are reverted.
      
      This means that the conventional recovery basically recovered
      one loss per RTT, which is not that efficient. It was far from
      obvious that the undo_marker change broke something like
      this; I had quite a long session tracking it down because the
      bug is so unintuitive (luckily I had a trivial
      reproducer at hand and I was also able to learn to use kprobes
      in the process as well :-)).
      
      This, together with the NewReno+FRTO fix and the FRTO in-order
      workaround, fixes Damon's problems; this and the first
      mentioned are enough to fix Bugzilla #10063.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Damon L. Chesser <damon@damtek.com>
      Tested-by: Sebastian Hyrwall <zibbe@cisko.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net: Allow netdevices to specify needed head/tailroom · f5184d26
      Johannes Berg committed
      This patch adds needed_headroom/needed_tailroom members to struct
      net_device and updates many places that allocate skbs to use them. Not
      all of them can be converted though, and I'm sure I missed some (I
      mostly grepped for LL_RESERVED_SPACE).
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 08 May, 2008 2 commits
    • J
      net/ipv4: correct RFC 1122 section reference in comment · c67fa027
      J.H.M. Dassen (Ray) committed
      RFC 1122 does not have a section 3.1.2.2. The requirement to silently
      discard datagrams with a bad checksum is in section 3.2.1.2 instead.
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611
      Signed-off-by: J.H.M. Dassen (Ray) <jdassen@debian.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      tcp FRTO: SACK variant is erroneously used with NewReno · 62ab2227
      Ilpo Järvinen committed
      Note: there's actually another bug in FRTO's SACK variant, which
      is what causes the failure in the NewReno case because of the error
      that's fixed here. I'll fix the SACK case separately (it's
      a separate bug really, though related, but in order to fix that
      I need to audit tp->snd_nxt usage a bit).
      
      There were two places where SACK variant of FRTO is getting
      incorrectly used even if SACK wasn't negotiated by the TCP flow.
      This leads to incorrect setting of frto_highmark with NewReno
      if a previous recovery was interrupted by another RTO.
      
      An eventual fallback to conventional recovery then incorrectly
      considers one or couple of segments as forward transmissions
      though they weren't, which then are not LOST marked during
      fallback making them "non-retransmittable" until the next RTO.
      In a bad case, those segments are really lost and are the only
      one left in the window. Thus TCP needs another RTO to continue.
      The next FRTO, however, could again repeat the same events
      making the progress of the TCP flow extremely slow.
      
      In order for these events to occur at all, FRTO must occur
      again in FRTO's step 3 while the key segments must be lost as
      well, which is not too likely in practice. It seems to occur most
      frequently with some small devices such as network printers
      that *seem* to accept TCP segments only in-order. In cases
      where key segments weren't lost, things get automatically
      resolved because those wrongly marked segments don't need to be
      retransmitted in order to continue.
      
      I found a reproducer after digging up relevant reports (few
      reports in total, none at netdev or lkml I know of); some
      cases seemed to indicate middlebox issues, which now seems
      to be a false assumption some people had made. Bugzilla
      #10063 _might_ be related. Damon L. Chesser <damon@damtek.com>
      had a reproducible case and was kind enough to tcpdump it
      for me. With the tcpdump log it was quite trivial to figure
      out.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 05 May, 2008 2 commits
  11. 03 May, 2008 1 commit
  12. 02 May, 2008 2 commits
  13. 01 May, 2008 2 commits
    • R
      rename div64_64 to div64_u64 · 6f6d6a1a
      Roman Zippel committed
      Rename div64_64 to div64_u64 to make it consistent with the other divide
      functions, so it clearly includes the type of the divide.  Move its definition
      to math64.h as currently no architecture overrides the generic implementation.
      They can still override it of course, but the duplicated declarations are
      avoided.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      net: fix returning void-valued expression warnings · ab59859d
      Harvey Harrison committed
      drivers/net/8390.c:37:2: warning: returning void-valued expression
      drivers/net/bnx2.c:1635:3: warning: returning void-valued expression
      drivers/net/xen-netfront.c:1806:2: warning: returning void-valued expression
      net/ipv4/tcp_hybla.c:105:3: warning: returning void-valued expression
      net/ipv4/tcp_vegas.c:171:3: warning: returning void-valued expression
      net/ipv4/tcp_veno.c:123:3: warning: returning void-valued expression
      net/sysctl_net.c:85:2: warning: returning void-valued expression
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Acked-by: Alan Cox <alan@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 30 Apr, 2008 3 commits
    • L
      tcp: Overflow bug in Vegas · 15913114
      Lachlan Andrew committed
      From: Lachlan Andrew <lachlan.andrew@gmail.com>
      
      There is an overflow bug in net/ipv4/tcp_vegas.c for large BDPs
      (e.g. 400Mbit/s, 400ms).  The multiplication (old_wnd *
      vegas->baseRTT) << V_PARAM_SHIFT overflows a u32.
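      
      The wrap-around can be reproduced in userspace. The sketch below
      is not the code from tcp_vegas.c: both function names are
      hypothetical, V_PARAM_SHIFT is assumed to be 1, the window is
      assumed to be counted in segments and the RTTs in microseconds.
      With those assumptions, a 400 Mbit/s, 400 ms path holds roughly
      13333 segments of 1500 bytes, and 13333 * 400000 already exceeds
      2^32.

```c
#include <assert.h>
#include <stdint.h>

#define V_PARAM_SHIFT 1	/* assumed value for this sketch */

/* 32-bit arithmetic: the product wraps modulo 2^32 before the shift. */
uint32_t vegas_target_u32(uint32_t old_wnd, uint32_t base_rtt, uint32_t rtt)
{
	assert(rtt != 0);
	return ((old_wnd * base_rtt) << V_PARAM_SHIFT) / rtt;
}

/* Widening one operand first keeps the whole intermediate in 64 bits. */
uint64_t vegas_target_u64(uint32_t old_wnd, uint32_t base_rtt, uint32_t rtt)
{
	assert(rtt != 0);
	return (((uint64_t)old_wnd * base_rtt) << V_PARAM_SHIFT) / rtt;
}
```

      For small windows the two versions agree; for the large-BDP case
      the 32-bit version returns a target far below the 64-bit one.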
      
      [ Fix tcp_veno.c too, it has similar calculations. -DaveM ]
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • K
      [IPv4] UFO: prevent generation of chained skb destined to UFO device · be9164e7
      Kostya B committed
      Problem: ip_append_data() could wrongly generate a chained skb for
      devices which support UFO.  When sk_write_queue is not empty
      (e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
      of the queued skb, a new chained skb is created.
      
      I would normally assume UFO device should get data in nr_frags and not
      in frag_list.  Later the udp4_hwcsum_outgoing() resets csum to NONE
      and skb_gso_segment() has oops.
      
      Proposal:
      1. Even if length is less than mtu, employ ip_ufo_append_data()
      and append data to the __existing__ skb in the sk_write_queue.
      
      2. ip_ufo_append_data() is fixed due to a wrong manipulation of
      peek-ing and later enqueue-ing of the same skb.  Now, enqueuing is
      always performed, because on error the further
      ip_flush_pending_frames() would release the queued skb.
      Signed-off-by: Kostya B <bkostya@hotmail.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      ipv4: annotate a few functions __init in ipconfig.c · 45e741b8
      Sam Ravnborg committed
      A few functions are only used from __init context.
      So annotate these with __init for consistency and silence
      the following warnings:
      
      WARNING: net/ipv4/built-in.o(.text+0x2a876): Section mismatch
               in reference from the function ic_bootp_init() to
               the variable .init.data:bootp_packet_type
      WARNING: net/ipv4/built-in.o(.text+0x2a907): Section mismatch
               in reference from the function ic_bootp_cleanup() to
               the variable .init.data:bootp_packet_type
      
      Note: The warnings only appear with CONFIG_DEBUG_SECTION_MISMATCH=y
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 29 Apr, 2008 7 commits
    • H
      Remove duplicated unlikely() in IS_ERR() · 801678c5
      Hirofumi Nakagawa committed
      Some drivers have duplicated unlikely() macros.  IS_ERR() already has
      unlikely() in itself.
      
      This patch cleans up such pointless code.
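      
      Why the extra unlikely() is redundant can be seen from a small
      userspace approximation of the macros. This is a sketch modelled
      on the kernel's err.h, not a copy of it; check_handle() is a
      hypothetical caller and __builtin_expect is a GCC/Clang builtin.

```c
#include <assert.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)	/* GCC/Clang builtin */

/* The branch hint already lives inside the error-value test. */
#define IS_ERR_VALUE(x) \
	unlikely((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Encode a negative errno in the top page of the address space. */
void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

long IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

/* Hypothetical caller showing the cleaned-up form. */
int check_handle(const void *handle)
{
	/* Before the cleanup: if (unlikely(IS_ERR(handle))) */
	if (IS_ERR(handle))
		return -1;
	return 0;
}
```

      Wrapping the call in a second unlikely() only repeats a hint that
      IS_ERR() has already applied to the comparison.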
      Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Jeff Garzik <jeff@garzik.org>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jaroslav Kysela <perex@perex.cz>
      Cc: Takashi Iwai <tiwai@suse.de>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      netfilter: nf_conntrack: padding breaks conntrack hash on ARM · 443a70d5
      Philip Craig committed
      commit 0794935e "[NETFILTER]: nf_conntrack: optimize hash_conntrack()"
      results in ARM platforms hashing uninitialised padding.  This padding
      doesn't exist on other architectures.
      
      Fix this by replacing NF_CT_TUPLE_U_BLANK() with memset() to ensure
      everything is initialised.  There were only 4 bytes that
      NF_CT_TUPLE_U_BLANK() wasn't clearing anyway (or 12 bytes on ARM).
      Signed-off-by: Philip Craig <philipc@snapgear.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      ipv4: Update MTU to all related cache entries in ip_rt_frag_needed() · 0010e465
      Timo Teras committed
      Add struct net_device parameter to ip_rt_frag_needed() and update MTU to
      cache entries where ifindex is specified. This is similar to what is
      already done in ip_rt_redirect().
      Signed-off-by: Timo Teras <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: Add compat support for getsockopt (MCAST_MSFILTER) · 42908c69
      David L Stevens committed
      This patch adds support for getsockopt for MCAST_MSFILTER for
      both IPv4 and IPv6. It depends on the previous setsockopt patch,
      and uses the same method.
      Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      ipvs: fix oops in backup for fwmark conn templates · 2ad17def
      Julian Anastasov committed
      Fixes bug http://bugzilla.kernel.org/show_bug.cgi?id=10556
      where conn templates with protocol=IPPROTO_IP can oops the backup box.
      
      The result from ip_vs_proto_get() should be checked because the
      protocol value can be invalid or unsupported in backup. But
      for a valid message we should not fail for templates which use
      IPPROTO_IP. Also, add checks to validate message limits and
      connection state. Show state NONE for templates using IPPROTO_IP.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      netfilter: {nfnetlink,ip,ip6}_queue: fix skb_over_panic when enlarging packets · 9a732ed6
      Arnaud Ebalard committed
      While reinjecting *bigger* modified versions of IPv6 packets using
      libnetfilter_queue, things work fine on a 2.6.24 kernel (2.6.22 too)
      but I get the following on recent kernels (2.6.25, trace below is
      against today's net-2.6 git tree):
      
      skb_over_panic: text:c04fddb0 len:696 put:632 head:f7592c00 data:f7592c00 tail:0xf7592eb8 end:0xf7592e80 dev:eth0
      ------------[ cut here ]------------
      invalid opcode: 0000 [#1] PREEMPT 
      Process sendd (pid: 3657, ti=f6014000 task=f77c31d0 task.ti=f6014000)
      Stack: c071e638 c04fddb0 000002b8 00000278 f7592c00 f7592c00 f7592eb8 f7592e80 
             f763c000 f6bc5200 f7592c40 f6015c34 c04cdbfc f6bc5200 00000278 f6015c60 
             c04fddb0 00000020 f72a10c0 f751b420 00000001 0000000a 000002b8 c065582c 
      Call Trace:
       [<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
       [<c04cdbfc>] ? skb_put+0x3c/0x40
       [<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
       [<c04fd115>] ? nfnetlink_rcv_msg+0xf5/0x160
       [<c04fd03e>] ? nfnetlink_rcv_msg+0x1e/0x160
       [<c04fd020>] ? nfnetlink_rcv_msg+0x0/0x160
       [<c04f8ed7>] ? netlink_rcv_skb+0x77/0xa0
       [<c04fcefc>] ? nfnetlink_rcv+0x1c/0x30
       [<c04f8c73>] ? netlink_unicast+0x243/0x2b0
       [<c04cfaba>] ? memcpy_fromiovec+0x4a/0x70
       [<c04f9406>] ? netlink_sendmsg+0x1c6/0x270
       [<c04c8244>] ? sock_sendmsg+0xc4/0xf0
       [<c011970d>] ? set_next_entity+0x1d/0x50
       [<c0133a80>] ? autoremove_wake_function+0x0/0x40
       [<c0118f9e>] ? __wake_up_common+0x3e/0x70
       [<c0342fbf>] ? n_tty_receive_buf+0x34f/0x1280
       [<c011d308>] ? __wake_up+0x68/0x70
       [<c02cea47>] ? copy_from_user+0x37/0x70
       [<c04cfd7c>] ? verify_iovec+0x2c/0x90
       [<c04c837a>] ? sys_sendmsg+0x10a/0x230
       [<c011967a>] ? __dequeue_entity+0x2a/0xa0
       [<c011970d>] ? set_next_entity+0x1d/0x50
       [<c0345397>] ? pty_write+0x47/0x60
       [<c033d59b>] ? tty_default_put_char+0x1b/0x20
       [<c011d2e9>] ? __wake_up+0x49/0x70
       [<c033df99>] ? tty_ldisc_deref+0x39/0x90
       [<c033ff20>] ? tty_write+0x1a0/0x1b0
       [<c04c93af>] ? sys_socketcall+0x7f/0x260
       [<c0102ff9>] ? sysenter_past_esp+0x6a/0x91
       [<c05f0000>] ? snd_intel8x0m_probe+0x270/0x6e0
       =======================
      Code: 00 00 89 5c 24 14 8b 98 9c 00 00 00 89 54 24 0c 89 5c 24 10 8b 40 50 89 4c 24 04 c7 04 24 38 e6 71 c0 89 44 24 08 e8 c4 46 c5 ff <0f> 0b eb fe 55 89 e5 56 89 d6 53 89 c3 83 ec 0c 8b 40 50 39 d0 
      EIP: [<c04ccdfc>] skb_over_panic+0x5c/0x60 SS:ESP 0068:f6015bf8
      
      
      Looking at the code, I ended up in the nfqnl_mangle() function (called by
      nfqnl_recv_verdict()) which performs a call to skb_copy_expand() due to
      the increased size of data passed to the function. AFAICT, it should ask
      for 'diff' instead of 'diff - skb_tailroom(e->skb)'. Because the
      resulting sk_buff has not enough space to support the skb_put(skb, diff)
      call a few lines later, this results in the call to skb_over_panic().
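      
      The arithmetic behind this can be shown with a toy model. The
      struct and functions below are hypothetical stand-ins for sk_buff,
      skb_copy_expand() and skb_put(); they track byte counts only, and
      they assume the copy receives exactly the tailroom that was
      requested, whereas a real allocation may round up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an sk_buff: only the byte counts matter here. */
struct toy_skb {
	size_t headroom, len, tailroom;
};

/* Models skb_copy_expand(): the copy gets exactly the requested rooms. */
struct toy_skb toy_copy_expand(struct toy_skb old, size_t new_headroom,
			       size_t new_tailroom)
{
	struct toy_skb copy = { new_headroom, old.len, new_tailroom };
	return copy;
}

/* Models skb_put(): -1 marks the case that ends in skb_over_panic(). */
int toy_put(struct toy_skb *skb, size_t n)
{
	assert(skb != NULL);
	if (n > skb->tailroom)
		return -1;
	skb->len += n;
	skb->tailroom -= n;
	return 0;
}

/* Grow the packet by diff bytes; use_fix selects the tailroom request. */
int toy_mangle(struct toy_skb skb, size_t diff, int use_fix)
{
	if (diff > skb.tailroom) {
		size_t want = use_fix ? diff : diff - skb.tailroom;

		skb = toy_copy_expand(skb, use_fix ? skb.headroom : 0, want);
	}
	return toy_put(&skb, diff);
}
```

      With 8 bytes of tailroom and diff = 632, the old request yields a
      copy with 624 bytes of tailroom, which is still short of the 632
      bytes that the put needs; requesting diff itself succeeds.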
      
      The patch below asks for allocation of a copy with enough space for
      mangled packet and the same amount of headroom as old sk_buff. While
      looking at how the regression appeared (e2b58a67), I noticed the same
      pattern in ipq_mangle_ipv6() and ipq_mangle_ipv4(). The patch corrects
      those locations too.
      
      Tested with bigger reinjected IPv6 packets (nfqnl_mangle() path), things
      are ok (2.6.25 and today's net-2.6 git tree).
      Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      tcp: Limit cwnd growth when deferring for GSO · 246eb2af
      John Heffner committed
      This fixes inappropriately large cwnd growth on sender-limited flows
      when GSO is enabled, limiting cwnd growth to 64k.
      Signed-off-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>