1. 30 Jun 2006, 1 commit
    • [NET]: Added GSO header verification · 576a30eb
      Committed by Herbert Xu
      When GSO packets come from an untrusted source (e.g., a Xen guest domain),
      we need to verify the header integrity before passing it to the hardware.
      
      Since the first step in GSO is to verify the header, we can reuse that
      code by adding a new bit to gso_type: SKB_GSO_DODGY.  Packets with this
      bit set can only be fed directly to devices with the corresponding bit
      NETIF_F_GSO_ROBUST.  If the device doesn't have that bit, then the skb
      is fed to the GSO engine which will allow the packet to be sent to the
      hardware if it passes the header check.
      
      This patch changes the sg flag to a full features flag.  The same method
      can be used to implement TSO ECN support.  We simply have to mark packets
      with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
      NETIF_F_TSO_ECN can accept them.  The GSO engine can either fully segment
      the packet, or segment the first MTU and pass the rest to the hardware for
      further segmentation.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
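      As a rough sketch (hypothetical helper name; the flag semantics are as
      described above, but this is not the patch's own code), the resulting
      dispatch rule looks like:

        #include <linux/netdevice.h>
        #include <linux/skbuff.h>

        /* A DODGY skb may only bypass the software GSO engine if the
         * device promises to verify headers itself; otherwise it goes
         * through the GSO engine, whose first step is the header check. */
        static int gso_skb_ok_for_device(struct sk_buff *skb,
                                         struct net_device *dev)
        {
                if (!(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY))
                        return 1;       /* header came from a trusted source */
                return (dev->features & NETIF_F_GSO_ROBUST) != 0;
        }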
  2. 26 Jun 2006, 1 commit
    • [NET]: Fix CHECKSUM_HW GSO problems. · 0718bcc0
      Committed by Herbert Xu
      Fix checksum problems in the GSO code path for CHECKSUM_HW packets.
      
      The ipv4 TCP pseudo header checksum has to be adjusted for GSO
      segmented packets.
      
      The adjustment is needed because the length field in the pseudo-header
      changes.  However, because we have the inequality oldlen > newlen, we
      know that delta = (u16)~oldlen + newlen is still a 16-bit quantity.
      This also means that htonl(delta) + th->check still fits in 32 bits.
      Therefore we don't have to use csum_add for this operation.
      
      This is based on a patch by Michael Chan <mchan@broadcom.com>.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
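      In sketch form (illustrative names, not the verbatim patch), the
      per-segment fixup is just an add and a single fold:

        #include <linux/tcp.h>
        #include <net/checksum.h>

        /* oldlen/newlen: the original and the per-segment values of the
         * pseudo-header length field (TCP header plus payload). */
        static void gso_fix_pseudo_check(struct tcphdr *th,
                                         u16 oldlen, u16 newlen)
        {
                /* oldlen > newlen, so this still fits in 16 bits */
                u32 delta = (u16)~oldlen + newlen;

                /* th->check (16 bits) + htonl(delta) cannot overflow a
                 * u32, so a plain add and one csum_fold suffice --
                 * no csum_add() needed */
                th->check = ~csum_fold((__force __wsum)
                                       ((__force u32)th->check +
                                        (__force u32)htonl(delta)));
        }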
  3. 23 Jun 2006, 2 commits
    • [NET]: Add software TSOv4 · f4c50d99
      Committed by Herbert Xu
      This patch adds the GSO implementation for IPv4 TCP.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Merge TSO/UFO fields in sk_buff · 7967168c
      Committed by Herbert Xu
      Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
      going to scale if we add any more segmentation methods (e.g., DCCP).  So
      let's merge them.
      
      They were used to indicate the protocol of a packet.  That role has
      been subsumed by the new gso_type field.  This is essentially a set of
      netdev feature bits (shifted by 16 bits) that are required to process a
      specific skb.  As such it's easy to tell whether a given device can
      process a GSO skb: you simply AND the gso_type field with the netdev's
      features field.
      
      I've made gso_type a conjunction.  The idea is that you have a base type
      (e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
      For example, if we add a hardware TSO type that supports ECN, they would
      declare NETIF_F_TSO | NETIF_F_TSO_ECN.  All TSO packets with CWR set would
      have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
      packets would be SKB_GSO_TCPV4.  This means that only the CWR packets need
      to be emulated in software.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
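      The resulting feature test is a one-liner; roughly (an illustrative
      helper, with the shift written out as the literal 16 mentioned above):

        /* Shift the skb's required gso_type bits up into NETIF_F_*
         * position and check that the device advertises all of them. */
        static inline int dev_can_gso(struct net_device *dev,
                                      struct sk_buff *skb)
        {
                int required = skb_shinfo(skb)->gso_type << 16;

                return (dev->features & required) == required;
        }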
  4. 18 Jun 2006, 4 commits
  5. 04 May 2006, 1 commit
    • [TCP]: Fix sock_orphan dead lock · 75c2d907
      Committed by Herbert Xu
      Calling sock_orphan inside bh_lock_sock in tcp_close can lead to
      deadlocks.  For example, the inet_diag code holds sk_callback_lock
      without disabling BH.  If an inbound packet arrives during that
      admittedly tiny window, it will cause a deadlock on bh_lock_sock.
      Another possible path would be through sock_wfree if the network
      device driver frees the tx skb in process context with BH enabled.
      
      We can fix this by moving sock_orphan out of bh_lock_sock.
      
      The tricky bit is to work out when we need to destroy the socket
      ourselves and when it has already been destroyed by someone else.
      
      By moving sock_orphan before the release_sock we can solve this
      problem.  This is because as long as we own the socket lock its
      state cannot change.
      
      So we simply record the socket state before the release_sock
      and then check the state again after we regain the socket lock.
      If the socket state has transitioned to TCP_CLOSE in the meantime,
      we know that the socket has been destroyed.  Otherwise the socket is
      still ours to keep.
      
      Note that I've also moved the increment of the orphan count forward.
      This may look like a problem, as we're increasing it even if the socket
      is just about to be destroyed, in which case it'll be decreased again.
      However, this simply enlarges a window that already exists.  This also
      shifts the orphan count test by one.
      
      Considering what the orphan count is meant to do this is no big deal.
      
      This problem was discovered by Ingo Molnar using his lock validator.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
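      As a condensed sketch of tcp_close() after the change (not the verbatim
      patch; error paths omitted):

        sock_orphan(sk);                        /* moved out of bh_lock_sock */
        atomic_inc(sk->sk_prot->orphan_count);  /* increment moved forward */

        state = sk->sk_state;   /* can't change while we own the lock */
        release_sock(sk);

        /* BHs run here: an inbound packet or a sock_wfree from a driver
         * may now destroy the socket without deadlocking. */

        local_bh_disable();
        bh_lock_sock(sk);

        /* Did the socket get destroyed while we were unlocked? */
        if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
                goto out;       /* yes -- someone else cleaned it up */
        /* no -- the socket is still ours to keep or destroy */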
  6. 26 Mar 2006, 1 commit
    • [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications · f348d70a
      Committed by Davide Libenzi
      Implement the half-closed device notification by adding a new POLLRDHUP
      (and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since
      changing the existing POLLHUP handling, which does not correctly report
      half-closed devices, was considered too risky, this implementation
      leaves the current POLLHUP reporting unchanged and simply adds a new
      bit that is set in the few places where it makes sense.  The same thing
      was discussed and conceptually agreed quite some time ago:
      
      http://lkml.org/lkml/2003/7/12/116
      
      Since this new event bit is added to the existing Linux poll
      infrastructure, even the existing poll/select system calls will be able
      to use it.  As for the existing POLLHUP handling, the patch leaves it
      as is.  The pollrdhup-2.6.16.rc5-0.10.diff defines POLLRDHUP for all
      the existing archs and sets the bit in the six relevant files.  The
      other attached diff is the simple change required to sys/epoll.h to add
      the EPOLLRDHUP definition.
      
      There is "a stupid program" to test POLLRDHUP delivery here:
      
       http://www.xmailserver.org/pollrdhup-test.c
      
      It tests poll(2), but since the delivery is the same, epoll(2) will work equally well.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
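      From userspace the new bit is used like any other poll event; a minimal
      sketch (mine, not the linked test program):

        #define _GNU_SOURCE     /* glibc guards POLLRDHUP behind this */
        #include <poll.h>
        #include <stdio.h>

        /* Block until the peer half-closes (shutdown(SHUT_WR)) or closes
         * its end of the connected socket fd. */
        static void wait_for_peer_shutdown(int fd)
        {
                struct pollfd pfd = { .fd = fd, .events = POLLRDHUP };

                if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLRDHUP))
                        printf("peer shut down its writing side\n");
        }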
  7. 25 Mar 2006, 2 commits
  8. 21 Mar 2006, 3 commits
  9. 04 Jan 2006, 2 commits
  10. 30 Nov 2005, 2 commits
    • [NET]: Add const markers to various variables. · 9b5b5cff
      Committed by Arjan van de Ven
      The patch below marks various variables const in net/; the goal is to
      move them to the .rodata section so that they can't false-share
      cachelines with things that get written to, as well as potentially
      helping gcc a bit with optimisations.  (These were found using a gcc
      patch that warns about such variables.)
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
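      The change itself is mechanical; for a pointer array (a made-up
      example, not one of the variables the patch touches) it reads:

        /* Before (lands in .data, may false-share with written data):
         *
         *     static char *state_names[] = { "ESTABLISHED", "SYN_SENT" };
         *
         * After -- the pointers and the strings are both read-only, so
         * the whole array can be placed in .rodata: */
        static const char *const state_names[] = { "ESTABLISHED", "SYN_SENT" };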
    • [IPV4] tcp/route: Another look at hash table sizes · 18955cfc
      Committed by Mike Stroyan
        The tcp_ehash hash table gets too big on systems with really big memory.
      It is worse on systems with pages larger than 4KB.  It wastes memory that
      could be better used.  It also makes the netstat command slow because reading
      /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.
      
        The default value should not be larger for larger page sizes.  It seems
      that the effect of page size is an unintended error dating back a long
      time.  I also wonder if the default value really should be a larger
      fraction of memory for systems with more memory.  While systems with
      really big ram can afford more space for hash tables, it is not clear to
      me that they benefit from increasing the allocation ratio for this table.
      
        The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
      mm/page_alloc.c:alloc_large_system_hash.
      
      tcp_init calls alloc_large_system_hash passing parameters-
          bucketsize=sizeof(struct tcp_ehash_bucket)
          numentries=thash_entries
          scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
          limit=0
      
      On i386, PAGE_SHIFT is 12 for a page size of 4K
      On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K
      
      The num_physpages test above makes the allocation take a larger fraction
      of the total memory on systems with larger memory.  The threshold size
      for a i386 system is 512MB.  For an ia64 system with 16KB pages the
      threshold is 2GB.
      
      For smaller memory systems-
      On i386, scale = (27 - 12) = 15
      On ia64, scale = (27 - 14) = 13
      For larger memory systems-
      On i386, scale = (25 - 12) = 13
      On ia64, scale = (25 - 14) = 11
      
        For the rest of this discussion, I'll just track the larger memory case.
      
        The default behavior has numentries=thash_entries=0, so the allocated
      size is determined by either scale or by the default limit of 1/16 of
      total memory.
      
      In alloc_large_system_hash-
      |	numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
      |	numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
      |	numentries >>= 20 - PAGE_SHIFT;
      |	numentries <<= 20 - PAGE_SHIFT;
      
        At this point, numentries is pages for all of memory, rounded up to the
      nearest megabyte boundary.
      
      |	/* limit to 1 bucket per 2^scale bytes of low memory */
      |	if (scale > PAGE_SHIFT)
      |		numentries >>= (scale - PAGE_SHIFT);
      |	else
      |		numentries <<= (PAGE_SHIFT - scale);
      
      On i386, numentries >>= (13 - 12), so numentries is 1/8192 of
      total memory in bytes.
      On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
      total memory in bytes.
      
      |        log2qty = long_log2(numentries);
      |
      |        do {
      |                size = bucketsize << log2qty;
      
      bucketsize is 16, so size is 16 times numentries, rounded
      down to a power of two.
      
      On i386, size is 1/512 of bytes of total memory.
      On ia64, size is 1/128 of bytes of total memory.
      
      For smaller systems the results are
      On i386, size is 1/2048 of bytes of total memory.
      On ia64, size is 1/512 of bytes of total memory.
      
        The large page effect can be removed by just replacing
      the use of PAGE_SHIFT with a constant of 12 in the calls to
      alloc_large_system_hash.  That makes them more like the other uses of
      that function in fs/inode.c and fs/dcache.c.
      Signed-off-by: David S. Miller <davem@davemloft.net>
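      A quick worked check of those ratios (my own figures, assuming a
      hypothetical 4 GB machine):

        i386 (PAGE_SHIFT 12, scale 13): numentries = 2^32 >> 13 = 512 Ki
                                        size = 16 * 512 Ki =  8 MB = 1/512 of RAM
        ia64 (PAGE_SHIFT 14, scale 11): numentries = 2^32 >> 11 = 2 Mi
                                        size = 16 * 2 Mi   = 32 MB = 1/128 of RAM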
  11. 11 Nov 2005, 2 commits
  12. 06 Nov 2005, 1 commit
  13. 06 Sep 2005, 1 commit
  14. 02 Sep 2005, 2 commits
  15. 30 Aug 2005, 15 commits