1. 18 Feb, 2009 1 commit
    • net: Kill skb_truesize_check(), it only catches false-positives. · 92a0acce
      David S. Miller authored
      A long time ago we had bugs, primarily in TCP, where we would modify
      skb->truesize (for TSO queue collapsing) in ways which would corrupt
      the socket memory accounting.
      
      skb_truesize_check() was added in order to try and catch this error
      more systematically.
      
      However this debugging check has morphed into a Frankenstein of sorts
      and these days it does nothing other than catch false-positives.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 30 Jan, 2009 2 commits
    • net: Fix OOPS in skb_seq_read(). · 71b3346d
      Shyam Iyer authored
      It oopsed for me in skb_seq_read(). addr2line said it was
      linux-2.6/net/core/skbuff.c:2228, which is this line:
      
      
      	while (st->frag_idx < skb_shinfo(st->cur_skb)->nr_frags) {
      
      
      I added some printks in there and it looks like we hit this:
      
              } else if (st->root_skb == st->cur_skb &&
                         skb_shinfo(st->root_skb)->frag_list) {
                      st->cur_skb = skb_shinfo(st->root_skb)->frag_list;
                      st->frag_idx = 0;
                      goto next_skb;
              }
      
      
      
      Actually I did some testing and added a few printks and found that the
      st->cur_skb->data was 0 and hence the ptr used by iscsi_tcp was null.
      This caused the kernel panic.
      
       	if (abs_offset < block_limit) {
      -		*data = st->cur_skb->data + abs_offset;
      +		*data = st->cur_skb->data + (abs_offset - st->stepped_offset);
      
      I enabled debug_tcp and, with a few printks, found that the code did
      not go to the next_skb label. The sequence it followed was this:
      
      It hit this if condition:
      
              if (st->cur_skb->next) {
                      st->cur_skb = st->cur_skb->next;
                      st->frag_idx = 0;
                      goto next_skb;
      
      And so, now the st pointer is shifted to the next skb whereas actually
      it should have hit the second else if first since the data is in the
      frag_list.
      
              else if (st->root_skb == st->cur_skb &&
                       skb_shinfo(st->root_skb)->frag_list) {
                      st->cur_skb = skb_shinfo(st->root_skb)->frag_list;
                      goto next_skb;
              }
      
      The attached patch reverses the two conditions, which fixes the issue
      for me on top of Herbert's patches.
      Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
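The branch ordering this commit describes can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct buf`, `struct walker`, and `walker_advance` are hypothetical names, not kernel APIs, and the model keeps just the one property the message argues for, namely that the root buffer's frag_list is tried before its `next` pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: a sibling link and a frag_list link. */
struct buf {
    int id;
    struct buf *next;
    struct buf *frag_list;
};

/* Hypothetical reader state: the root buffer and the current position. */
struct walker {
    struct buf *root;
    struct buf *cur;
};

/* Advance to the next buffer holding data; returns NULL when exhausted. */
struct buf *walker_advance(struct walker *w)
{
    assert(w->cur != NULL);
    if (w->cur == w->root && w->root->frag_list)
        w->cur = w->root->frag_list;   /* descend into frag_list first */
    else if (w->cur->next)
        w->cur = w->cur->next;         /* then follow sibling links */
    else
        w->cur = NULL;
    return w->cur;
}
```

With the two branches swapped, a root buffer that has both a `next` and a `frag_list` would step to `next` and never visit the data held in its frag_list, which is the failure mode reported above.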
    • net: Fix frag_list handling in skb_seq_read · 95e3b24c
      Herbert Xu authored
      The frag_list handling was broken in skb_seq_read:
      
      1) We didn't add the stepped offset when looking at the head
      area of fragments other than the first.
      
      2) We didn't take the stepped offset away when setting the data
      pointer in the head area.
      
      3) The frag index wasn't reset.
      
      This patch fixes all three issues.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
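The stepped-offset arithmetic in points 1 and 2 can be sketched in user space. This is a simplified model, not the kernel function: `struct seg`, `struct seq_state`, and `seq_read` are hypothetical names, page frags are left out, and only the linear (head) areas of a chain of buffers are walked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one buffer's linear (head) area. */
struct seg {
    const unsigned char *data;
    size_t len;
    const struct seg *next;
};

/* Hypothetical reader state, loosely modelled on the fields named above. */
struct seq_state {
    const struct seg *cur;
    size_t stepped_offset;   /* bytes in buffers already walked past */
};

/*
 * Point *data at absolute offset abs_offset and return how many bytes
 * are readable there, or return 0 when the offset is past the end.
 */
size_t seq_read(struct seq_state *st, size_t abs_offset,
                const unsigned char **data)
{
    assert(abs_offset >= st->stepped_offset);
    while (st->cur) {
        /* Point 1: the limit of this head area includes the stepped offset. */
        size_t block_limit = st->stepped_offset + st->cur->len;

        if (abs_offset < block_limit) {
            /* Point 2: take the stepped offset away again for the pointer. */
            *data = st->cur->data + (abs_offset - st->stepped_offset);
            return block_limit - abs_offset;
        }
        st->stepped_offset = block_limit;
        st->cur = st->cur->next;
    }
    return 0;
}
```

For two 4-byte buffers, absolute offset 5 resolves to index 1 of the second buffer; without the subtraction it would index 5 bytes into a 4-byte area.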
  3. 21 Jan, 2009 1 commit
  4. 20 Jan, 2009 1 commit
    • net: Fix data corruption when splicing from sockets. · 8b9d3728
      Jarek Poplawski authored
      The trick in socket splicing where we try to convert the skb->data
      into a page based reference using virt_to_page() does not work so
      well.
      
      The idea is to pass the virt_to_page() reference via the pipe
      buffer, and refcount the buffer using a SKB reference.
      
      But if we are splicing from a socket to a socket (via sendpage)
      this doesn't work.
      
      The from side processing will grab the page (and SKB) references.
      The sendpage() calls will grab page references only, return, and
      then the from side processing completes and drops the SKB ref.
      
      The page based reference to skb->data is not enough to keep the
      kmalloc() buffer backing it from being reused.  Yet, that is
      all that the socket send side has at this point.
      
      This leads to data corruption if the skb->data buffer is reused
      by SLAB before the send side socket actually gets the TX packet
      out to the device.
      
      The fix employed here is to simply allocate a page and copy the
      skb->data bytes into that page.
      
      This will hurt performance, but there is no clear way to fix this
      properly without a copy at the present time, and it is important
      to get rid of the data corruption.
      
      With fixes from Herbert Xu.
      Tested-by: Willy Tarreau <w@1wt.eu>
      Foreseen-by: Changli Gao <xiaosuo@gmail.com>
      Diagnosed-by: Willy Tarreau <w@1wt.eu>
      Reported-by: Willy Tarreau <w@1wt.eu>
      Fixed-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
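The copy-instead-of-reference idea behind this fix can be shown with a user-space analogy. This is a sketch, not the kernel change: `linear_to_page_copy` and `SKETCH_PAGE_SIZE` are hypothetical, `malloc` stands in for page allocation, and overwriting the source buffer stands in for the allocator reusing it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/*
 * Copy len bytes of linear data into a freshly allocated page-sized
 * buffer that the consumer owns. Returns NULL if len does not fit or
 * allocation fails. The caller frees the result.
 */
unsigned char *linear_to_page_copy(const unsigned char *data, size_t len)
{
    unsigned char *page;

    assert(data != NULL);
    if (len > SKETCH_PAGE_SIZE)
        return NULL;
    page = malloc(SKETCH_PAGE_SIZE);
    if (!page)
        return NULL;
    memcpy(page, data, len);     /* the copy is the performance cost */
    return page;
}
```

Because the consumer holds its own copy, later reuse of the original buffer cannot change the bytes that are eventually sent, at the price of one `memcpy` per linear area.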
  5. 15 Jan, 2009 1 commit
  6. 05 Jan, 2009 2 commits
    • gro: Add page frag support · 5d38a079
      Herbert Xu authored
      This patch allows GRO to merge page frags (skb_shinfo(skb)->frags)
      in one skb, rather than using the less efficient frag_list.
      
      It also adds a new interface, napi_gro_frags to allow drivers
      to inject page frags directly into the stack without allocating
      an skb.  This is intended to be the GRO equivalent for LRO's
      lro_receive_frags interface.
      
      The existing GSO interface can already handle page frags with
      or without an appended frag_list so nothing needs to be changed
      there.
      
      The merging itself is rather simple.  We store any new frag entries
      after the last existing entry, without checking whether the first
      new entry can be merged with the last existing entry.  Making this
      check would actually be easy but since no existing driver can
      produce contiguous frags anyway it would just be mental masturbation.
      
      If the total number of entries would exceed the capacity of a
      single skb, we simply resort to using frag_list as we do now.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
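The merge rule described in this commit (append new frag entries after the last existing one, and fall back when capacity would be exceeded) can be sketched as follows. This is a user-space illustration under assumptions: `struct frag`, `struct pkt`, `merge_frags`, and `SKETCH_MAX_FRAGS` are hypothetical, and the capacity of 4 is chosen for readability rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_FRAGS 4   /* illustrative capacity, not the kernel value */

/* Hypothetical page fragment descriptor. */
struct frag {
    const void *page;
    unsigned int offset;
    unsigned int size;
};

/* Hypothetical packet holding only page frags. */
struct pkt {
    struct frag frags[SKETCH_MAX_FRAGS];
    unsigned int nr_frags;
    unsigned int len;
};

/*
 * Append src's frags after dst's last entry. No attempt is made to
 * coalesce the first new entry with the last existing one. Returns 1
 * on success, 0 when the caller must fall back (e.g. to a frag_list).
 */
int merge_frags(struct pkt *dst, const struct pkt *src)
{
    unsigned int i;

    assert(dst != src);
    if (dst->nr_frags + src->nr_frags > SKETCH_MAX_FRAGS)
        return 0;
    for (i = 0; i < src->nr_frags; i++)
        dst->frags[dst->nr_frags + i] = src->frags[i];
    dst->nr_frags += src->nr_frags;
    dst->len += src->len;
    return 1;
}
```

A return value of 0 leaves `dst` untouched, so the caller can chain the packet through a different mechanism without undoing anything.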
    • gro: Use gso_size to store MSS · b530256d
      Herbert Xu authored
      In order to allow GRO packets without frag_list at all, we need to
      store the MSS in the packet itself.  The obvious place is gso_size.
      The only thing to watch out for is if the packet ends up not being
      GRO then we need to clear gso_size before pushing the packet into
      the stack.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 16 Dec, 2008 2 commits
    • net: Add skb_gro_receive · 71d93b39
      Herbert Xu authored
      This patch adds the helper skb_gro_receive to merge packets for
      GRO.  The current method is to allocate a new header skb and then
      chain the original packets to its frag_list.  This is done to
      make it easier to integrate into the existing GSO framework.
      
      In future as GSO is moved into the drivers, we can undo this and
      simply chain the original packets together.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add frag_list support to skb_segment · 89319d38
      Herbert Xu authored
      This patch adds limited support for handling frag_list packets in
      skb_segment.  The intention is to support GRO (Generic Receive Offload)
      packets which will be constructed by chaining normal packets using
      frag_list.
      
      As such, we require that all frag_list members terminate on exact MSS
      boundaries.  This is checked using BUG_ON.
      
      As there should only be one producer in the kernel of such packets,
      namely GRO, this requirement should not be difficult to maintain.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
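The MSS-boundary requirement stated in this commit can be expressed as a simple check over member lengths. This is a hedged sketch: `members_end_on_mss` is a hypothetical helper, not the kernel's BUG_ON condition, and it assumes that the final member may be shorter because it ends the packet.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 when every chained member except the last ends on a
 * multiple of mss, measured from the start of the payload, else 0.
 * Assumption: the last member may be short, since it ends the packet.
 */
int members_end_on_mss(const unsigned int *lens, size_t n, unsigned int mss)
{
    unsigned int end = 0;
    size_t i;

    assert(mss > 0);
    for (i = 0; i < n; i++) {
        end += lens[i];
        if (i + 1 < n && end % mss != 0)
            return 0;
    }
    return 1;
}
```

With an MSS of 1448, members of 1448, 1448, and 500 bytes pass the check, while 1448, 1000, and 1448 fail because the second member ends in the middle of a segment.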
  8. 26 Nov, 2008 2 commits
  9. 25 Nov, 2008 2 commits
    • tcp: handle shift/merge of cloned skbs too · 0ace2856
      Ilpo Järvinen authored
      This caused me to get repeatably:
      
        tcpdump: pcap_loop: recvfrom: Bad address
      
      Happens occasionally when I tcpdump my for-looped test xfers:
        while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done
      
      Rest of the relevant commands:
        ethtool -K eth0 tso off
        tc qdisc add dev eth0 root netem drop 4%
        tcpdump -n -s0 -i eth0 -w sacklog.all
      
      Running net-next under kvm, connection goes to the same host
      (basically just out of kvm). The connection itself works ok
      and data gets sent without corruption even with a large
      number of tests while tcpdump fails usually within less than
      5 tests.
      
      Whether it only happens because of this change or not, I
      don't know for sure, but it's the only thing with which
      I've seen that error. The non-cloned variant works w/o it
      for a much longer time. I have yet to debug where the error
      actually comes from.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen authored
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      We then run into trouble when the cleanup work for them has to be
      done as a large cumulative ACK arrives. Try to return to the
      pre-split state while more and more SACK info gets discovered, by
      combining newly discovered SACK areas with the previous skb if
      that one is SACKed as well.
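The merge-with-previous idea in the paragraph above can be sketched on a plain array of sequence ranges. This is an illustration only: `struct seg` and `sack_and_merge` are hypothetical, and the real code operates on skbs in the write queue with many more conditions than shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry covering sequence range [start, end). */
struct seg {
    unsigned int start;
    unsigned int end;
    int sacked;
};

/*
 * Mark segs[idx] as SACKed. If the previous entry is SACKed and
 * contiguous, fold this entry into it so the queue gets shorter.
 * Returns the new number of entries.
 */
size_t sack_and_merge(struct seg *segs, size_t n, size_t idx)
{
    size_t i;

    assert(idx < n);
    segs[idx].sacked = 1;
    if (idx == 0 || !segs[idx - 1].sacked ||
        segs[idx - 1].end != segs[idx].start)
        return n;                       /* nothing to merge with */

    segs[idx - 1].end = segs[idx].end;  /* grow the previous entry */
    for (i = idx; i + 1 < n; i++)       /* close the gap in the array */
        segs[i] = segs[i + 1];
    return n - 1;
}
```

Each newly SACKed range that touches an already SACKed predecessor shrinks the queue by one entry, so later walkers have fewer entries to visit.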
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) Write queue has fewer skbs to process (affects everything
         which has to walk in the queue past the sacked areas)
      3) Write queue is consistent the whole time, so no other parts
         of TCP have to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using a single
         put_page instead of the previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too, which allows construction of skbs
      that are even larger than what TSO split them to, and it handles
      the hole-on-every-nth patterns that often occur during slow start
      overshoot pretty nicely. For this to be really useful, though,
      a retransmission would also have to get lost, since cumulative ACKs
      advance one hole at a time in the most typical case.
      
      TODO: handle upwards only merging. That should be rather easy
      when a segment is fully sacked but I'm leaving that as a future
      work item (it won't make a very large difference anyway since
      this current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous is due to reordering => having the timestamp
           of the last segment there is just skewing things more off
           than does some good since the ack got triggered by one of
           the holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem). Anyway,
           it has nothing to do with this change then.
      
      I chose to route some abnormal-looking cases with goto noop;
      some could be handled differently (e.g., by stopping the
      walking at that skb but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as a whole) could cause some macroscale
      regression (global) because of cache misses that are taken over
      the round-trip time, but it very likely gets better because of far
      fewer (local) cache misses per other write queue walkers and the
      big recovery clearing cumulative ack.
      
      It is worth noting that these benefits would be very easy to get also
      without TSO/GSO being on, as long as the data is in pages so that
      we can merge them. Currently I won't let that happen because
      DSACK splitting at a fragment would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragments are
      avoided, we have some conditions that can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable number of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is considered future work instead, since I repeatedly
      got EFAULT from tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing the previous SACK block, fewer allocs for new skbs
         are done; basically a new alloc is needed only when a new hole
         is detected and when the previous skb runs out of frags space
      
      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 11 Nov, 2008 1 commit
  11. 02 Nov, 2008 1 commit
  12. 01 Nov, 2008 1 commit
  13. 29 Oct, 2008 1 commit
  14. 14 Oct, 2008 1 commit
  15. 08 Oct, 2008 1 commit
  16. 01 Oct, 2008 2 commits
  17. 16 Aug, 2008 1 commit
  18. 30 Jul, 2008 1 commit
  19. 26 Jul, 2008 1 commit
  20. 17 Jul, 2008 1 commit
  21. 15 Jul, 2008 2 commits
    • net: refactor tcp splice receive path to improve readability · 2870c43d
      Octavian Purdila authored
      - move all of the details on offsets, lengths and buffers into a
      single function instead of doing these operations from multiple places
      
      - use a bottom up approach: try to avoid details in the high level
      functions, introduce them gradually as we go deeper in the function
      call stack
      
      With helpful feedback from Jarek Poplawski.
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Acked-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: Don't store VLAN tag in cb · 6aa895b0
      Patrick McHardy authored
      Use a real skb member to store the VLAN tag to avoid clashes with qdiscs,
      which are allowed to use the cb area themselves. As currently only real
      devices that consume the skb set the NETIF_F_HW_VLAN_TX flag, no explicit
      invalidation is necessary.
      
      The new member fills a hole on 64 bit, the skb layout changes from:
      
              __u32                      mark;                 /*   172     4 */
              sk_buff_data_t             transport_header;     /*   176     4 */
              sk_buff_data_t             network_header;       /*   180     4 */
              sk_buff_data_t             mac_header;           /*   184     4 */
              sk_buff_data_t             tail;                 /*   188     4 */
              /* --- cacheline 3 boundary (192 bytes) --- */
              sk_buff_data_t             end;                  /*   192     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
      to
      
              __u32                      mark;                 /*   172     4 */
              __u16                      vlan_tci;             /*   176     2 */
      
              /* XXX 2 bytes hole, try to pack */
      
              sk_buff_data_t             transport_header;     /*   180     4 */
              sk_buff_data_t             network_header;       /*   184     4 */
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
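The 2-byte hole shown after `vlan_tci` in the second listing follows from ordinary alignment rules and can be reproduced with `offsetof` in user space. This is a sketch with assumptions: `struct layout_sketch` is a hypothetical three-field cut-down, not the real `struct sk_buff`, and the result assumes an ABI where `uint32_t` is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down struct with three of the fields listed above. */
struct layout_sketch {
    uint32_t mark;
    uint16_t vlan_tci;
    uint32_t transport_header;
};

/* Padding bytes the compiler inserts between vlan_tci and the next field. */
size_t hole_after_vlan_tci(void)
{
    size_t end_of_tci = offsetof(struct layout_sketch, vlan_tci) +
                        sizeof(uint16_t);
    size_t start_of_next = offsetof(struct layout_sketch, transport_header);

    assert(start_of_next >= end_of_tci);
    return start_of_next - end_of_tci;
}
```

On such an ABI the function returns 2, matching the "2 bytes hole" comment in the listing: a 16-bit member followed by a 32-bit member leaves two bytes of padding.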
  22. 28 Jun, 2008 1 commit
    • tcp: fix for splice receive when used with software LRO · db43a282
      Octavian Purdila authored
      If an skb has nr_frags set to zero but its frag_list is not empty (as
      it can happen if software LRO is enabled), and a previous
      tcp_read_sock has consumed the linear part of the skb, then
      __skb_splice_bits:
      
      (a) incorrectly reports an error and
      
      (b) forgets to update the offset to account for the linear part
      
      Either of the two problems will cause the subsequent __skb_splice_bits
      call (the one that handles the frag_list skbs) to either skip data,
      or, if the unadjusted offset is greater than the size of the next skb
      in the frag_list, make tcp_splice_read loop forever.
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
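The offset adjustment in point (b) can be sketched as a lookup over a linear part followed by chained members. This is a user-space model under assumptions: `struct pos` and `locate` are hypothetical names, and page frags are omitted to match the nr_frags == 0 case described above.

```c
#include <assert.h>
#include <stddef.h>

/* Result of a lookup: member 0 is the linear part, 1..n are chained members. */
struct pos {
    size_t member;
    size_t offset;
    int found;
};

/*
 * Locate byte `offset` in a packet with a linear part of linear_len
 * bytes followed by n chained members of the given lengths.
 */
struct pos locate(size_t linear_len, const size_t *member_lens, size_t n,
                  size_t offset)
{
    struct pos p = { 0, 0, 0 };
    size_t i;

    assert(member_lens != NULL || n == 0);
    if (offset < linear_len) {
        p.offset = offset;
        p.found = 1;
        return p;
    }
    offset -= linear_len;   /* account for the already consumed linear part */
    for (i = 0; i < n; i++) {
        if (offset < member_lens[i]) {
            p.member = i + 1;
            p.offset = offset;
            p.found = 1;
            return p;
        }
        offset -= member_lens[i];
    }
    return p;               /* offset lies past the end of the packet */
}
```

Without the subtraction of `linear_len`, an offset of 50 into a packet with a 50-byte linear part would be applied to the first chained member as 50 rather than 0, which is the data-skipping behaviour the commit describes.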
  23. 20 Jun, 2008 1 commit
  24. 12 Jun, 2008 1 commit
  25. 05 Jun, 2008 1 commit
  26. 04 May, 2008 1 commit
  27. 14 Apr, 2008 2 commits
  28. 29 Mar, 2008 2 commits
  29. 28 Mar, 2008 3 commits