1. 29 Oct 2005, 1 commit
    • [IPv4/IPv6]: UFO Scatter-gather approach · e89e9cf5
      Committed by Ananda Raju
      Attached is a kernel patch for the UDP Fragmentation Offload (UFO) feature.
      
      1. This patch incorporates the review comments by Jeff Garzik.
      2. Renamed USO to UFO (UDP Fragmentation Offload).
      3. Added UDP sendfile support with UFO.
      
      This patch uses the scatter-gather feature of skb to generate large UDP
      datagrams. Below is a "how-to" on the changes required in a network
      device driver to use the UFO interface.
      
      UDP Fragmentation Offload (UFO) Interface:
      -------------------------------------------
      UFO is a feature wherein the Linux kernel network stack offloads the IP
      fragmentation of large UDP datagrams to hardware. This reduces the
      stack's overhead of fragmenting large UDP datagrams into MTU-sized
      packets.
      
      1) Drivers indicate their capability of UFO using
      dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG
      
      NETIF_F_HW_CSUM is required for UFO over IPv6.
      
      2) A UFO packet is submitted for transmission through the driver's xmit
      routine. A UFO packet will have a non-zero value for
      
      "skb_shinfo(skb)->ufo_size"
      
      skb_shinfo(skb)->ufo_size indicates the length of the data part of each
      IP fragment going out of the adapter after IP fragmentation by the
      hardware.
      
      skb->data will contain the MAC/IP/UDP headers, and
      skb_shinfo(skb)->frags[] contains the data payload. skb->ip_summed will
      be set to CHECKSUM_HW, indicating that the hardware has to do the
      checksum calculation. The hardware should compute the UDP checksum of
      the complete datagram and also the IP header checksum of each
      fragmented IP packet.
      
      For IPv6, UFO provides the fragment identification in
      skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID when
      generating IPv6 fragments.
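      
      A minimal sketch (not part of the patch) of how a driver xmit routine
      might consume these fields; foo_start_xmit() and foo_hw_set_ufo() are
      hypothetical names:
      
      static int foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
      {
      	unsigned int mss = skb_shinfo(skb)->ufo_size;
      
      	if (mss) {
      		/* UFO packet: skb->data holds the MAC/IP/UDP headers,
      		 * skb_shinfo(skb)->frags[] holds the payload.  Program the
      		 * hardware to emit IP fragments carrying 'mss' bytes of data
      		 * each, computing the UDP checksum of the whole datagram and
      		 * the IP header checksum of each fragment. */
      		foo_hw_set_ufo(dev, mss, skb_shinfo(skb)->ip6_frag_id);
      	}
      	/* ... map skb->data and each frag for DMA and start the send ... */
      	return 0;
      }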
      Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
      Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
  2. 28 Oct 2005, 1 commit
    • [TCP]: Clear stale pred_flags when snd_wnd changes · 2ad41065
      Committed by Herbert Xu
      This bug is responsible for causing the infamous "Treason uncloaked"
      messages that have been popping up everywhere since the printk was
      added.  It has usually been blamed on foreign operating systems.
      However, some of those reports implicate Linux, as both systems are
      running Linux or the TCP connection is going across the loopback
      interface.
      
      In fact, there really is a bug in the Linux TCP header prediction code
      that's been there since at least 2.1.8.  This bug was tracked down with
      help from Dale Blount.
      
      The effect of this bug ranges from harmless "Treason uncloaked"
      messages to hung/aborted TCP connections.  The details of the bug
      and fix are as follows.
      
      When snd_wnd is updated, we only update pred_flags if
      tcp_fast_path_check succeeds.  When it fails (for example,
      when our rcvbuf is used up), we will leave pred_flags with
      an out-of-date snd_wnd value.
      
      When the out-of-date pred_flags happens to match the next incoming
      packet we will again hit the fast path and use the current snd_wnd
      which will be wrong.
      
      In the case of the treason messages, it just happens that the snd_wnd
      cached in pred_flags is zero while tp->snd_wnd is non-zero.  Therefore
      when a zero-window packet comes in we incorrectly conclude that the
      window is non-zero.
      
      In fact if the peer continues to send us zero-window pure ACKs we
      will continue making the same mistake.  It's only when the peer
      transmits a zero-window packet with data attached that we get a
      chance to snap out of it.  This is what triggers the treason
      message at the next retransmit timeout.
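      
      A simplified sketch of the idea in the fix (not the verbatim patch;
      nwin stands for the newly received window value): when snd_wnd changes,
      invalidate the cached prediction flags so the fast path cannot match
      against a stale window, then let the usual check rebuild them if the
      conditions still hold.
      
      	if (tp->snd_wnd != nwin) {
      		tp->snd_wnd = nwin;
      		tp->pred_flags = 0;		/* drop the stale fast-path state */
      		tcp_fast_path_check(sk, tp);	/* re-enable if still eligible */
      	}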
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
  3. 26 Oct 2005, 5 commits
  4. 23 Oct 2005, 1 commit
  5. 21 Oct 2005, 1 commit
  6. 14 Oct 2005, 2 commits
  7. 13 Oct 2005, 1 commit
  8. 11 Oct 2005, 11 commits
  9. 09 Oct 2005, 1 commit
  10. 06 Oct 2005, 1 commit
  11. 05 Oct 2005, 3 commits
  12. 04 Oct 2005, 5 commits
    • [IPV4]: Update icmp sysctl docs and disable broadcast ECHO/TIMESTAMP by default · 7ce31246
      Committed by David S. Miller
      It's not a good idea to be smurf'able by default.
      The few people who need this can turn it on.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnl · e5ed6399
      Committed by Herbert Xu
      The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
      introduces __in_dev_get_rcu() to cover the second case.
      
      1) RCU with refcnt should use in_dev_get().
      2) RCU without refcnt should use __in_dev_get_rcu().
      3) All others must hold RTNL and use __in_dev_get_rtnl().
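      
      A hedged usage sketch of the first two rules (illustrative only, not
      part of the patch):
      
      	struct in_device *in_dev;
      
      	/* Rule 2: RCU without a refcount -- __in_dev_get_rcu() */
      	rcu_read_lock();
      	in_dev = __in_dev_get_rcu(dev);
      	if (in_dev) {
      		/* valid only inside this RCU read-side section */
      	}
      	rcu_read_unlock();
      
      	/* Rule 1: RCU with a refcount -- in_dev_get() takes a
      	 * reference, so the pointer may outlive the read-side section */
      	in_dev = in_dev_get(dev);
      	if (in_dev) {
      		/* ... use in_dev ... */
      		in_dev_put(in_dev);	/* drop the reference when done */
      	}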
      
      There is one exception in net/ipv4/route.c which is in fact a pre-existing
      race condition.  I've marked it as such so that we remember to fix it.
      
      This patch is based on suggestions and prior work by Suzanne Wood and
      Paul McKenney.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4]: Fix "Proxy ARP seems broken" · 444fc8fc
      Committed by Herbert Xu
      Meelis Roos <mroos@linux.ee> wrote:
      > RK> My firewall setup relies on proxyarp working.  However, with 2.6.14-rc3,
      > RK> it appears to be completely broken.  The firewall is 212.18.232.186,
      > 
      > Same here with some kernel between 14-rc2 and 14-rc3 - no response to
      > ARP on a proxyarp gateway. Sorry, no exact revision and no more debugging
      > yet since it's a production gateway.
      
      The breakage is caused by the change to use the CB area for flagging
      whether a packet has been queued due to proxy_delay.  This area gets
      cleared every time arp_rcv gets called.  Unfortunately packets delayed
      due to proxy_delay also go through arp_rcv when they are reprocessed.
      
      In fact, I can't think of a reason why delayed proxy packets should go
      through netfilter again at all.  So the easiest solution is to bypass
      that and go straight to arp_process.
      
      This is essentially what would've happened before netfilter support
      was added to ARP.
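      
      A minimal sketch of that approach, assuming (as in net/ipv4/arp.c)
      that the handler re-run for proxy-delayed packets is parp_redo():
      
      	static void parp_redo(struct sk_buff *skb)
      	{
      		/* Reprocess directly: skip arp_rcv() and the NF_ARP hooks,
      		 * so the CB area is not clobbered on the second pass. */
      		arp_process(skb);
      	}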
      
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> 
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
    • [INET]: speedup inet (tcp/dccp) lookups · 81c3d547
      Committed by Eric Dumazet
      Arnaldo and I agreed it could be applied now, because I have other
      pending patches depending on this one (Thank you Arnaldo)
      
      (The other important patch moves skc_refcnt into a separate cache line,
      so that SMP/NUMA performance doesn't suffer from cache-line ping-pong.)
      
      1) First some performance data :
      --------------------------------
      
      tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()
      
      The most time critical code is :
      
      sk_for_each(sk, node, &head->chain) {
           if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
               goto hit; /* You sunk my battleship! */
      }
      
      The sk_for_each() does use prefetch() hints, but only the beginning
      of "struct sock" is prefetched.
      
      As INET_MATCH's first comparison uses inet_sk(__sk)->daddr, which is
      far away from the beginning of "struct sock", it has to bring a cold
      cache line into the CPU cache.  Each iteration has to touch at least
      2 cache lines.
      
      This can be problematic if some chains are very long.
      
      2) The goal
      -----------
      
      The idea I had is to change things so that INET_MATCH() may return
      FALSE in 99% of cases only using the data already in the CPU cache,
      using one cache line per iteration.
      
      3) Description of the patch
      ---------------------------
      
      Adds a new 'unsigned int skc_hash' field to 'struct sock_common',
      filling a 32-bit hole on 64-bit platforms.
      
      struct sock_common {
      	unsigned short		skc_family;
      	volatile unsigned char	skc_state;
      	unsigned char		skc_reuse;
      	int			skc_bound_dev_if;
      	struct hlist_node	skc_node;
      	struct hlist_node	skc_bind_node;
      	atomic_t		skc_refcnt;
      +	unsigned int		skc_hash;
      	struct proto		*skc_prot;
      };
      
      Store the full hash in this 32-bit field, not masked by (ehash_size -
      1). Using this full hash as the first comparison done in INET_MATCH
      permits us to immediately skip an element without touching a second
      cache line in case of a miss.
      
      Suppress the sk_hashent/tw_hashent fields, since skc_hash (aliased to
      sk_hash and tw_hash) already contains the slot number if we mask it
      with (ehash_size - 1).
      
      File include/net/inet_hashtables.h
      
      64-bit platforms:
      #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
           (((__sk)->sk_hash == (__hash))                 &&  \
           ((*((__u64 *)&(inet_sk(__sk)->daddr))) == (__cookie))   &&  \
           ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))   &&  \
           (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
      
      32-bit platforms:
      #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
           (((__sk)->sk_hash == (__hash))                 &&  \
           (inet_sk(__sk)->daddr          == (__saddr))   &&  \
           (inet_sk(__sk)->rcv_saddr      == (__daddr))   &&  \
           (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
      
      
      - Adds a prefetch(head->chain.first) in
      __inet_lookup_established()/__tcp_v4_check_established(),
      __inet6_lookup_established()/__tcp_v6_check_established() and
      __dccp_v4_check_established() to bring the first element of the list
      into cache before the {read|write}_lock(&head->lock).
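      
      A hedged sketch of the resulting lookup loop (simplified from the
      description above; local variable names are assumed):
      
      	prefetch(head->chain.first);	/* warm the first element */
      	read_lock(&head->lock);
      	sk_for_each(sk, node, &head->chain) {
      		/* sk_hash rejects almost all non-matching sockets using
      		 * only the already-cached line; the address/port words
      		 * are read only when the full hash matches. */
      		if (INET_MATCH(sk, hash, acookie, saddr, daddr, ports, dif))
      			goto hit;
      	}
      	read_unlock(&head->lock);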
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Fix packet timestamping. · 325ed823
      Committed by Herbert Xu
      I've found the problem in general.  It affects any 64-bit
      architecture.  The problem occurs when you change the system time.
      
      Suppose that when you boot, your system clock is ahead by a day.
      This gets recorded in skb_tv_base.  You then wind the clock back
      by a day.  From that point onwards the offset will be negative, which
      essentially overflows the 32-bit variables it is stored in.
      
      In fact, why don't we just store the real time stamp in those 32-bit
      variables? After all, we're not going to overflow for quite a while
      yet.
      
      When we do overflow, we'll need a better solution of course.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 30 Sep 2005, 2 commits
    • [TCP]: Don't over-clamp window in tcp_clamp_window() · 09e9ec87
      Committed by Alexey Kuznetsov
      From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      
      Handle better the case where the sender sends full sized
      frames initially, then moves to a mode where it trickles
      out small amounts of data at a time.
      
      This known problem is even mentioned in the comments
      above tcp_grow_window() in tcp_input.c, specifically:
      
      ...
       * The scheme does not work when sender sends good segments opening
       * window and then starts to feed us spagetti. But it should work
       * in common situations. Otherwise, we have to rely on queue collapsing.
      ...
      
      When the sender gives full sized frames, the "struct sk_buff" overhead
      of each packet is small, so we'll advertise a larger window.
      If the sender moves to a mode where small segments are sent, this
      ratio becomes tilted to the other extreme and we start overrunning
      the socket buffer space.
      
      tcp_clamp_window() tries to address this, but its clamping of
      tp->window_clamp is a wee bit too aggressive for this particular case.
      
      Fix confirmed by Ion Badulescu.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [TCP]: Revert 6b251858 · 01ff367e
      Committed by David S. Miller
      But retain the comment fix.
      
      Alexey Kuznetsov has explained the situation as follows:
      
      --------------------
      
      I think the fix is incorrect. Look, the RFC function init_cwnd(mss)
      is not continuous: e.g. for mss=1095 it needs an initial window of
      1095*4, but for mss=1096 it is 1096*3. We do not know exactly what mss
      the sender used for its calculations. If we advertised 1096 (and
      calculated an initial window of 3*1096), the sender could limit it to
      some value < 1096 and then it would need a window of his_mss*4 > 3*1096
      to send its initial burst.
      
      See?
      
      So, the honest function for the initial rcv_wnd derived from
      tcp_init_cwnd() is:
      
      	init_rcv_wnd(mss)=
      	  min { init_cwnd(mss1)*mss1 for mss1 <= mss }
      
      It is something like:
      
      	if (mss < 1096)
      		return mss*4;
      	if (mss < 1096*2)
      		return 1096*4;
      	return mss*2;
      
      (I just scribbled a graph on a piece of paper; it is difficult to see
      or explain without it.)
      
      I selected it differently, giving more window than is strictly
      required.  The initial receive window must be large enough to allow a
      sender following the RFC (or just setting the initial cwnd to 2) to
      send the initial burst.  But beyond that it is arbitrary, so I decided
      to give slack space of one segment.
      
      Actually, the logic was:
      
      If mss is low/normal (<= ethernet), set the window to receive more than
      the initial burst allowed by the RFC under the worst conditions,
      i.e. mss*4.  This gives slack space of 1 segment for ethernet frames.
      
      For MSSes slightly larger than an ethernet frame, take 3.  Try to give
      slack space of 1 frame again.
      
      If mss is huge, force 2*mss. No slack space.
      
      The value 1460*3 is really confusing. The minimal one is 1096*2, but
      beyond that it is an arbitrary value. It was meant to be ~4096. 1460*3
      is just the magic number from the RFC; 1460*3 = 1095*4 is the magic
      :-), so I guess hands typed this by themselves.
      
      --------------------
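      
      A hedged illustration (not from the patch) of the discontinuity Alexey
      describes, assuming the RFC 3390 initial window
      min(4*MSS, max(2*MSS, 4380 bytes)):
      
      	static unsigned int init_cwnd_segs(unsigned int mss)
      	{
      		unsigned int cap = 2 * mss > 4380 ? 2 * mss : 4380;
      		unsigned int bytes = 4 * mss < cap ? 4 * mss : cap;
      
      		return bytes / mss;	/* 1095 -> 4 segments, 1096 -> 3 */
      	}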
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 29 Sep 2005, 1 commit
  15. 27 Sep 2005, 1 commit
    • [NETFILTER]: Fix invalid module autoloading by splitting iptable_nat · 188bab3a
      Committed by Harald Welte
      When you've enabled conntrack and NAT as a module (standard case in all
      distributions), and you've also enabled the new conntrack netlink
      interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
      This causes a huge performance penalty, since every packet then
      traverses the NAT code, even if you don't want it.
      
      This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
      iptables frontend (iptable_nat.ko).  Therefore, ip_conntrack_netlink.ko
      will only pull in ip_nat.ko, not the frontend.  ip_nat.ko will "only"
      allocate some resources, but not affect runtime performance.
      
      This separation is also a nice step in anticipation of new packet filters
      (nf-hipac, ipset, pkttables) being able to use the NAT core.
      Signed-off-by: Harald Welte <laforge@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 25 Sep 2005, 2 commits
  17. 23 Sep 2005, 1 commit
    • [NETFILTER] Fix conntrack event cache deadlock/oops · 1dfbab59
      Committed by Harald Welte
      This patch fixes a number of bugs.  It cannot reasonably be split into
      multiple fixes, since all the bugs interact with each other and affect
      the same function:
      
      Bug #1:
      The event cache code cannot be called while a lock is held.  Therefore, the
      call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
      moved outside of the locked section.  This fixes a number of 2.6.14-rcX
      oops and deadlock reports.
      
      Bug #2:
      We used to call ct_add_counters() for unconfirmed connections without
      holding a lock.  Since the add operations are not atomic, we could race
      with another CPU.
      
      Bug #3:
      ip_ct_refresh_acct() lost REFRESH events in some cases where a refresh
      (and the corresponding event) is desired, but no accounting shall be
      performed.  Both events and accounting implicitly depended on the skb
      parameter being non-null.  We now re-introduce a non-accounting
      "ip_ct_refresh()" variant to explicitly state the desired behaviour.
      Signed-off-by: Harald Welte <laforge@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>