1. 15 Jan 2009 (1 commit)
  2. 11 Jan 2009 (1 commit)
  3. 07 Jan 2009 (5 commits)
  4. 06 Jan 2009 (1 commit)
  5. 05 Jan 2009 (3 commits)
    • net: Fix for initial link state in 2.6.28 · 22604c86
      Authored by Michael Marineau
      From: Michael Marineau <mike@marineau.org>
      
      Commit b4730016 "Do not fire linkwatch
      events until the device is registered." was made as a workaround for
      drivers that call netif_carrier_off before registering the device.
      Unfortunately this causes these drivers to incorrectly report their
      link status as IF_OPER_UNKNOWN which can falsely set the IFF_RUNNING
      flag when the interface is first brought up. This issue was
      previously pointed out[1] but was dismissed with the claim that
      IFF_RUNNING is not related to the link status. From my digging,
      IFF_RUNNING, as reported to userspace, is based on the link state.
      It is set based on __LINK_STATE_START and IF_OPER_UP or
      IF_OPER_UNKNOWN. See [2], [3], and [4]. (Whether the kernel has
      IFF_RUNNING set in its internal flags is not reported to user space,
      so that may well be independent of the link; I don't know if and
      when it gets set.)
      
      The end result varies slightly depending on the driver. The two I
      tested were e1000e and b44. With e1000e, if the system is booted
      without a network cable attached, the interface will falsely report
      RUNNING when it is brought up, causing NetworkManager to attempt to
      start it and eventually time out. With b44, when the system is booted
      with a network cable attached and brought up with dhcpcd, it will time
      out the first time.
      
      The attached patch still sets the operstate variable correctly to
      IF_OPER_UP/DOWN/etc when linkwatch_fire_event is called, but then
      returns rather than skipping the linkwatch_fire_event call entirely
      as the previous fix did. (Sorry it isn't inline; I don't have a
      patch-friendly email client at the moment.)
      Signed-off-by: David S. Miller <davem@davemloft.net>
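
      A quick way to observe the symptom described above from userspace is to
      compare the RUNNING flag returned by the SIOCGIFFLAGS ioctl with the
      operstate the kernel exports in sysfs. The following is only an
      illustrative sketch (it is not part of the patch), and the interface
      name eth0 is an example:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>

      int main(void)
      {
          struct ifreq ifr;
          char state[32];
          FILE *f;
          int fd = socket(AF_INET, SOCK_DGRAM, 0);

          if (fd < 0)
              return 1;

          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
          if (ioctl(fd, SIOCGIFFLAGS, &ifr) == 0)
              printf("IFF_RUNNING: %s\n",
                     (ifr.ifr_flags & IFF_RUNNING) ? "yes" : "no");

          /* operstate should read "up", "down", "unknown", ... */
          f = fopen("/sys/class/net/eth0/operstate", "r");
          if (f) {
              if (fgets(state, sizeof(state), f))
                  printf("operstate: %s", state);
              fclose(f);
          }
          close(fd);
          return 0;
      }

      With the bug described above, an interface brought up without a cable
      can show IFF_RUNNING "yes" while operstate still reads "unknown".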
    • gro: Add page frag support · 5d38a079
      Authored by Herbert Xu
      This patch allows GRO to merge page frags (skb_shinfo(skb)->frags)
      in one skb, rather than using the less efficient frag_list.
      
      It also adds a new interface, napi_gro_frags, to allow drivers
      to inject page frags directly into the stack without allocating
      an skb.  This is intended to be the GRO equivalent of LRO's
      lro_receive_frags interface.
      
      The existing GSO interface can already handle page frags with
      or without an appended frag_list so nothing needs to be changed
      there.
      
      The merging itself is rather simple.  We store any new frag entries
      after the last existing entry, without checking whether the first
      new entry can be merged with the last existing entry.  Making this
      check would actually be easy but since no existing driver can
      produce contiguous frags anyway it would just be mental masturbation.
      
      If the total number of entries would exceed the capacity of a
      single skb, we simply resort to using frag_list as we do now.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
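
      The capacity decision described above can be pictured with a small
      standalone model: append the new packet's frags after the head skb's
      last entry while they fit, and only fall back to frag_list chaining when
      the per-skb frag array would overflow. This is a toy illustration, not
      kernel code; MAX_FRAGS merely stands in for the kernel's MAX_SKB_FRAGS.

      #include <stdio.h>

      #define MAX_FRAGS 18                /* stand-in for MAX_SKB_FRAGS */

      struct toy_skb {
          int nr_frags;                   /* frag entries held directly */
          int frag_list_pkts;             /* packets chained via frag_list */
      };

      /* Merge a packet carrying new_frags page fragments into the head skb. */
      static void gro_merge_frags(struct toy_skb *head, int new_frags)
      {
          if (head->nr_frags + new_frags <= MAX_FRAGS)
              head->nr_frags += new_frags;    /* store after the last entry */
          else
              head->frag_list_pkts++;         /* resort to frag_list */
      }

      int main(void)
      {
          struct toy_skb head = { .nr_frags = 1, .frag_list_pkts = 0 };
          int i;

          for (i = 0; i < 25; i++)
              gro_merge_frags(&head, 1);

          printf("direct frags: %d, frag_list packets: %d\n",
                 head.nr_frags, head.frag_list_pkts);
          return 0;
      }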
    • gro: Use gso_size to store MSS · b530256d
      Authored by Herbert Xu
      In order to allow GRO packets without frag_list at all, we need to
      store the MSS in the packet itself.  The obvious place is gso_size.
      The only thing to watch out for is that if the packet ends up not
      being GROed, we need to clear gso_size before pushing the packet
      into the stack.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
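
      A minimal sketch of the rule stated above, assuming only that the
      gso_size field of skb_shared_info is used as the MSS store (the real
      call sites live inside the GRO code itself):

      #include <linux/skbuff.h>

      /* On the first merge for a flow, remember its MSS in gso_size. */
      static inline void gro_store_mss(struct sk_buff *skb, unsigned short mss)
      {
          skb_shinfo(skb)->gso_size = mss;
      }

      /* A held packet that ends up not being GROed is an ordinary packet:
       * clear gso_size again before it is pushed into the stack. */
      static inline void gro_clear_mss(struct sk_buff *skb)
      {
          skb_shinfo(skb)->gso_size = 0;
      }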
  6. 30 Dec 2008 (2 commits)
  7. 27 Dec 2008 (1 commit)
  8. 26 Dec 2008 (1 commit)
  9. 23 Dec 2008 (1 commit)
  10. 18 Dec 2008 (3 commits)
  11. 16 Dec 2008 (5 commits)
    • ethtool: Add GGRO and SGRO ops · b240a0e5
      Authored by Herbert Xu
      This patch adds the ethtool ops to enable and disable GRO.  It also
      makes GRO depend on RX checksum offload much the same as how TSO
      depends on SG support.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
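
      The dependency on RX checksum offload mentioned above can be sketched
      roughly as below. The handler name is illustrative, and the era's
      ethtool_ops->get_rx_csum hook is assumed to be how a device reports its
      RX checksum setting:

      #include <linux/errno.h>
      #include <linux/netdevice.h>
      #include <linux/ethtool.h>

      /* Illustrative SGRO-style handler: refuse to enable GRO unless the
       * device has RX checksum offload turned on, mirroring how TSO is
       * gated on scatter-gather support. */
      static int example_set_gro(struct net_device *dev, u32 enable)
      {
          if (enable) {
              if (!dev->ethtool_ops->get_rx_csum ||
                  !dev->ethtool_ops->get_rx_csum(dev))
                  return -EINVAL;
              dev->features |= NETIF_F_GRO;
          } else {
              dev->features &= ~NETIF_F_GRO;
          }
          return 0;
      }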
    • net: Add skb_gro_receive · 71d93b39
      Authored by Herbert Xu
      This patch adds the helper skb_gro_receive to merge packets for
      GRO.  The current method is to allocate a new header skb and then
      chain the original packets to its frag_list.  This is done to
      make it easier to integrate into the existing GSO framework.
      
      In future as GSO is moved into the drivers, we can undo this and
      simply chain the original packets together.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add Generic Receive Offload infrastructure · d565b0a1
      Authored by Herbert Xu
      This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
      This is pretty similar to LRO except that this is protocol-independent.
      Instead of holding packets in an lro_mgr structure, they're now held in
      napi_struct.
      
      For drivers that intend to use this, they can set the NETIF_F_GRO bit and
      call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
      The latter will call napi_receive_skb automatically.  When napi_gro_receive
      is used, the driver must either call napi_complete/napi_rx_complete, or
      call napi_gro_flush in softirq context if the driver uses the primitives
      __napi_complete/__napi_rx_complete.
      
      Protocols will set the gro_receive and gro_complete function pointers in
      order to participate in this scheme.
      
      In addition to the packet, gro_receive will get a list of currently held
      packets.  Each packet in the list has a same_flow field which is non-zero
      if it is a potential match for the new packet.  Each packet that may
      match also has a flush field, which is non-zero if the held packet
      must not be merged with the new packet.
      
      Once gro_receive has determined that the new skb matches a held packet,
      the held packet may be processed immediately if the new skb cannot be
      merged with it.  In this case gro_receive should return the pointer to
      the existing skb in gro_list.  Otherwise the new skb should be merged into
      the existing packet and NULL should be returned, unless the new skb makes
      any further merges impossible (e.g., a FIN packet), in which case
      the merged skb should be returned.
      
      Whenever the skb is merged into an existing entry, the gro_receive
      function should set NAPI_GRO_CB(skb)->same_flow.  Note that if an skb
      merely matches an existing entry but can't be merged with it, then
      this shouldn't be set.
      
      If gro_receive finds it pointless to hold the new skb for future merging,
      it should set NAPI_GRO_CB(skb)->flush.
      
      Held packets will be flushed by napi_gro_flush which is called by
      napi_complete and napi_rx_complete.
      
      Currently held packets are stored in a singly linked list, just like LRO.
      The list is limited to a maximum of 8 entries.  In future, this may be
      expanded to use a hash table to allow more flows to be held for merging.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
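
      The driver-facing rules quoted above (advertise NETIF_F_GRO, feed
      received skbs to napi_gro_receive from the NAPI poll routine, and
      complete the poll so held packets are flushed) come out roughly as in
      this hedged sketch; example_rx_one() is a placeholder for however the
      driver pulls the next received skb off its ring:

      #include <linux/netdevice.h>
      #include <linux/skbuff.h>

      /* Placeholder: dequeue one received skb from the device ring, or NULL. */
      static struct sk_buff *example_rx_one(struct napi_struct *napi);

      static int example_poll(struct napi_struct *napi, int budget)
      {
          struct sk_buff *skb;
          int work = 0;

          while (work < budget && (skb = example_rx_one(napi)) != NULL) {
              /* Instead of netif_receive_skb(): lets GRO hold and merge. */
              napi_gro_receive(napi, skb);
              work++;
          }

          if (work < budget)
              napi_complete(napi);  /* also flushes packets GRO is holding */

          return work;
      }

      At probe time such a driver would also set NETIF_F_GRO in dev->features
      so that the feature can be toggled through the ethtool ops added above.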
    • net: Add frag_list support to GSO · 1a881f27
      Authored by Herbert Xu
      This patch allows GSO to handle frag_list in a limited way for the
      purposes of allowing packets merged by GRO to be refragmented on
      output.
      
      Most hardware won't (and isn't expected to) support handling GRO
      frag_list packets directly.  Therefore we will perform GSO in
      software for those cases.
      
      However, for drivers that can support it (such as virtual NICs) we
      may not have to segment the packets at all.
      
      Whether the added overhead of GRO/GSO is worthwhile for bridges
      and routers when weighed against the benefit of potentially
      increasing the MTU within the host is still an open question.
      However, for the case of host nodes this is undoubtedly a win.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add frag_list support to skb_segment · 89319d38
      Authored by Herbert Xu
      This patch adds limited support for handling frag_list packets in
      skb_segment.  The intention is to support GRO (Generic Receive Offload)
      packets which will be constructed by chaining normal packets using
      frag_list.
      
      As such we require all frag_list members terminate on exact MSS
      boundaries.  This is checked using BUG_ON.
      
      As there should only be one producer in the kernel of such packets,
      namely GRO, this requirement should not be difficult to maintain.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
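
      The MSS-boundary requirement stated above amounts to a check of roughly
      this shape (a hedged sketch; the real check sits inside skb_segment):

      #include <linux/skbuff.h>
      #include <linux/bug.h>

      /* Every frag_list member produced by GRO must cover a whole number of
       * MSS-sized segments; anything else indicates a bug in the producer. */
      static void check_frag_list_mss(struct sk_buff *skb, unsigned int mss)
      {
          struct sk_buff *frag;

          for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next)
              BUG_ON(frag->len % mss);
      }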
  12. 10 Dec 2008 (1 commit)
    • netpoll: fix race on poll_list resulting in garbage entry · 7b363e44
      Authored by Neil Horman
      A few months back a race was discussed between the netpoll napi service
      path and the fast path through net_rx_action:
      http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470
      
      A patch was submitted for that bug, but I think we missed a case.
      
      Consider the following scenario:
      
      INITIAL STATE
      CPU0 has one napi_struct A on its poll_list
      CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same
      napi_struct A that CPU0 has on its list
      
      
      
      CPU0						CPU1
      net_rx_action					poll_napi
      !list_empty (returns true)			locks poll_lock for A
      						 poll_one_napi
      						  napi->poll
      						   netif_rx_complete
      						    __napi_complete
      						    (removes A from poll_list)
      list_entry(list->next)
      
      
      In the above scenario, net_rx_action assumes that the per-cpu poll_list is
      exclusive to that cpu.  netpoll of course violates that, and because the netpoll
      path can dequeue from the poll list, it's possible for CPU0 to detect a non-empty
      list at the top of the while loop in net_rx_action, but have it become empty by
      the time it calls list_entry.  Since the poll_list isn't surrounded by any other
      structure, the returned data from that list_entry call in this situation is
      garbage, and any number of crashes can result based on what exactly that garbage
      is.
      
      Given that it's not feasible for performance reasons to place exclusive
      locks around each CPU's poll list to provide that mutual exclusion, I
      think the best solution is to modify the netpoll path in such a way that
      we continue to guarantee that the poll_list for a CPU is in fact
      exclusive to that CPU.  To do this I've implemented the patch below.  It
      adds an additional bit to the state field in the napi_struct.  When
      executing napi->poll from the netpoll path, this bit will be set.  When
      a driver calls netif_rx_complete, if that bit is set, it will not remove
      the napi_struct from the poll_list.  That work will be saved for the
      next iteration of net_rx_action.
      
      I've tested this and it seems to work well.  About the biggest drawback
      I can see to it is the fact that it might result in an extra loop
      through net_rx_action in the event that the device is actually contended
      for (i.e. the netpoll path performs all the needed work on the device,
      and the call to net_rx_action winds up doing nothing except removing the
      napi_struct from the poll_list).  However, I think this is probably a
      small price to pay, given that the alternative is a crash.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
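
      A hedged sketch of that mechanism follows. The bit name NAPI_STATE_NPSVC
      matches what later kernels call it and is an assumption here; the point
      is only that the netpoll service path marks the napi_struct while it
      runs ->poll, and the completion path then leaves the entry on the
      poll_list so that only net_rx_action ever dequeues it:

      #include <linux/netdevice.h>
      #include <linux/list.h>

      /* Netpoll side: flag the napi_struct while servicing it outside
       * net_rx_action. */
      static void example_poll_one_napi(struct napi_struct *napi, int budget)
      {
          set_bit(NAPI_STATE_NPSVC, &napi->state);
          napi->poll(napi, budget);
          clear_bit(NAPI_STATE_NPSVC, &napi->state);
      }

      /* Completion side: if the netpoll bit is set, keep the entry on the
       * poll_list; net_rx_action will remove it on its next pass. */
      static void example_napi_complete(struct napi_struct *napi)
      {
          if (!test_bit(NAPI_STATE_NPSVC, &napi->state))
              list_del_init(&napi->poll_list);
          clear_bit(NAPI_STATE_SCHED, &napi->state);
      }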
  13. 08 Dec 2008 (1 commit)
    • netdevice: Kill netdev->priv · b74ca3a8
      Authored by Wang Chen
      This is the last shot of this series.
      After removing all direct references to netdev->priv, I am killing
      "priv" in "struct net_device" and fixing the related comments/docs.
      
      No one is allowed to reference netdev->priv directly any more.
      If you want to reference the memory of the private data, use
      netdev_priv() instead.
      If the private data is not allocated by alloc_netdev(), use
      netdev->ml_priv to point to that memory after you create the
      private data.
      Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
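
      The convention described above, in rough form: size the private area
      when the device is allocated and always reach it through netdev_priv().
      struct example_priv and the "ex%d" name are placeholders, and the
      three-argument alloc_netdev() shown is the form from this era (later
      kernels added a name_assign_type argument):

      #include <linux/netdevice.h>
      #include <linux/etherdevice.h>

      struct example_priv {
          int link_up;                /* whatever per-device state is needed */
      };

      static void example_setup(struct net_device *dev)
      {
          ether_setup(dev);
      }

      static struct net_device *example_create(void)
      {
          struct net_device *dev;
          struct example_priv *priv;

          dev = alloc_netdev(sizeof(struct example_priv), "ex%d",
                             example_setup);
          if (!dev)
              return NULL;

          priv = netdev_priv(dev);    /* never dev->priv */
          priv->link_up = 0;
          return dev;
      }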
  14. 27 Nov 2008 (1 commit)
  15. 26 Nov 2008 (9 commits)
  16. 25 Nov 2008 (3 commits)
    • tcp: handle shift/merge of cloned skbs too · 0ace2856
      Authored by Ilpo Järvinen
      This caused me to reproducibly get:
      
        tcpdump: pcap_loop: recvfrom: Bad address
      
      Happens occasionally when I tcpdump my for-looped test xfers:
        while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done
      
      Rest of the relevant commands:
        ethtool -K eth0 tso off
        tc qdisc add dev eth0 root netem drop 4%
        tcpdump -n -s0 -i eth0 -w sacklog.all
      
      Running net-next under kvm, the connection goes to the same host
      (basically just out of kvm). The connection itself works ok
      and data gets sent without corruption even with a large
      number of tests, while tcpdump usually fails within fewer than
      5 tests.
      
      Whether it only happens because of this change or not, I
      don't know for sure, but it's the only thing with which
      I've seen that error. The non-cloned variant works without it
      for a much longer time. I've yet to debug where the error
      actually comes from.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Authored by Ilpo Järvinen
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      Then we're in trouble when cleanup work for them has to be done
      when a large cumulative ACK comes. Try to return to the pre-split
      state already while more and more SACK info gets discovered, by
      combining newly discovered SACK areas with the previous skb if
      that is SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) Write queue has fewer skbs to process (affects everything
         which has to walk the queue past the sacked areas)
      3) Write queue is consistent the whole time, so no other part
         of TCP has to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) clean_rtx_queue can release most of the pages using a single
         put_page instead of the previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too, which allows construction of skbs
      that are even larger than what TSO split them to, and it handles
      the hole-on-every-nth-segment patterns that often occur during slow
      start overshoot pretty nicely. Though for this to be really useful, a
      retransmission would also have to get lost, since cumulative ACKs
      advance one hole at a time in the most typical case.
      
      TODO: handle upwards-only merging. That should be rather easy
      when the segment is fully SACKed, but I'm leaving that as a future
      work item (it won't make a very large difference anyway since
      the current approach already covers quite a lot of the normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous, due to reordering => having the timestamp
           of the last segment there just skews things further off
           rather than doing any good, since the ack got triggered by
           one of the holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem).
           Anyway, it has nothing to do with this change then.
      
      I chose to route some abnormal-looking cases with goto noop;
      some could be handled differently (e.g., by stopping the
      walk at that skb). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as a whole) could cause some macroscale
      regression (globally) because of cache misses that are taken over
      the round-trip time, but it very likely ends up better because of
      many fewer (local) cache misses for the other write queue walkers
      and for the big recovery-clearing cumulative ACK.
      
      Worth noting that these benefits would be very easy to get also
      without TSO/GSO being on, as long as the data is in pages so that
      we can merge them. Currently I won't let that happen, because
      DSACK splitting at a fragment would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragmenting is
      avoided, some conditions can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable number of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is left as future work, since I got repeatable
      EFAULTs from tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing the previous SACK block, fewer allocs for new skbs
         are done; basically a new alloc is needed only when a new hole
         is detected and when the previous skb runs out of frag space
      
      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is one RTT
      long); otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gen_estimator: Fix gen_kill_estimator() lookups · 4db0acf3
      Authored by Jarek Poplawski
      gen_kill_estimator()'s linear list lookups are very slow, and soft
      lockups were reported, e.g. while deleting a large number of HTB
      classes. Here is another try at fixing this problem: this time
      internally, with an rbtree, similar to Jamal's hashing idea IIRC.
      (Looking for the next hits could still be optimized, but it's
      really fast as it is.)
      Reported-by: Badalian Vyacheslav <slavon@bigtelecom.ru>
      Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
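
      The lookup structure hinted at above can be pictured as an rbtree keyed
      on the address of the stats block each estimator watches, so killing an
      estimator becomes an O(log n) search instead of a linear scan. The
      structure and field names below are illustrative, not the ones actually
      used in net/core/gen_estimator.c:

      #include <linux/rbtree.h>

      struct est_entry {
          struct rb_node node;
          const void *bstats;        /* key: the stats block being estimated */
      };

      static struct rb_root est_root = RB_ROOT;

      /* Find the estimator attached to a given stats block, if any. */
      static struct est_entry *est_find(const void *bstats)
      {
          struct rb_node *p = est_root.rb_node;

          while (p) {
              struct est_entry *e = rb_entry(p, struct est_entry, node);

              if ((unsigned long)bstats < (unsigned long)e->bstats)
                  p = p->rb_left;
              else if ((unsigned long)bstats > (unsigned long)e->bstats)
                  p = p->rb_right;
              else
                  return e;
          }
          return NULL;
      }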
  17. 22 Nov 2008 (1 commit)