1. 11 Jun, 2016 2 commits
  2. 09 Jun, 2016 2 commits
  3. 03 May, 2016 1 commit
    • netem: Segment GSO packets on enqueue · 6071bd1a
      Neil Horman committed
      This was recently reported to me, and reproduced on the latest net kernel,
      when attempting to run netperf from a host that had a netem qdisc attached
      to the egress interface:
      
      [  788.073771] ---------------------[ cut here ]---------------------------
      [  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
      [  788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
      data_len=0 gso_size=1448 gso_type=1 ip_summed=3
      [  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
      ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
      glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
      i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
      pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
      sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
      i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
      crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
      serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
      dm_mod
      [  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G        W
      ------------   3.10.0-327.el7.x86_64 #1
      [  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
      [  788.542260]  ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
      ffffffff816351f1
      [  788.576332]  ffff880437c036a8 ffffffff8107b200 ffff880633e74200
      ffff880231674000
      [  788.611943]  0000000000000001 0000000000000003 0000000000000000
      ffff880437c03710
      [  788.647241] Call Trace:
      [  788.658817]  <IRQ>  [<ffffffff816351f1>] dump_stack+0x19/0x1b
      [  788.686193]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
      [  788.713803]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
      [  788.741314]  [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
      [  788.767018]  [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
      [  788.796117]  [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
      [  788.823392]  [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
      [  788.854487]  [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
      [  788.880870]  [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
      ...
      
      The problem occurs because netem is not prepared to handle GSO packets (as it
      uses skb_checksum_help in its enqueue path, which cannot manipulate these
      frames).
      
      The solution I think is to simply segment the skb in a similar fashion to the
      way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
      When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
      the first segment, and enqueue the remaining ones.
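
      A minimal sketch of that segmentation step, in the spirit of the helper
      this patch adds (error handling abbreviated; this is the shape of the
      fix, not the verbatim patch):

        static struct sk_buff *netem_segment(struct sk_buff *skb,
                                             struct Qdisc *sch)
        {
                struct sk_buff *segs;
                netdev_features_t features = netif_skb_features(skb);

                /* ask the stack to split the GSO superpacket into
                 * MTU-sized frames, as validate_xmit_skb() would do
                 */
                segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
                if (IS_ERR_OR_NULL(segs)) {
                        qdisc_drop(skb, sch);   /* cannot segment: drop */
                        return NULL;
                }
                consume_skb(skb);       /* superpacket no longer needed */
                return segs;            /* list of segments: corrupt the
                                         * first, enqueue the rest as-is */
        }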
      
      Tested successfully by myself on the latest net kernel, to which this applies.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netem@lists.linux-foundation.org
      CC: eric.dumazet@gmail.com
      CC: stephen@networkplumber.org
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 26 Apr, 2016 1 commit
  5. 01 Mar, 2016 2 commits
  6. 12 May, 2015 1 commit
  7. 08 Apr, 2015 1 commit
    • netem: Fixes byte backlog accounting for the first of two chained netem instances · 0ad2a836
      Beshay, Joseph committed
      Fixes byte backlog accounting for the first of two chained netem instances.
      The reported byte backlog now corresponds to the number of queued packets.
      
      When two netem instances are chained, for instance to apply rate and queue
      limitation followed by packet delay, the number of backlogged bytes reported
      by the first netem instance is wrong. It reports the sum of bytes in the queues
      of the first and second netem. The first netem reports the correct number of
      backlogged packets but not bytes. This is shown in the example below.
      
      Consider a chain of two netem schedulers created using the following commands:
      
      $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
      $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms
      
      Start an iperf session to send packets out on the specified interface and
      monitor the backlog using tc:
      
      $ tc -s qdisc show dev veth2
      
      Output using unpatched netem:
      	qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
      	 Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
      	 backlog 172694b 73p requeues 0
      	qdisc netem 2: parent 1: limit 1000 delay 50.0ms
      	 Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
      	 backlog 63588b 42p requeues 0
      
      The interface used to produce this output has an MTU of 1500. The output for
      backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
      the total number of sent bytes and packets. By dividing the number of sent
      bytes by the number of sent packets, we get an average packet size of ~=1504.
      If we divide the number of backlogged bytes by packets, we get ~=2365. This is
      due to the first netem incorrectly counting the 63588b which are in netem 2's
      queue as being in its own queue. To verify this is the case, we subtract them
      from the reported value and divide by the number of packets as follows:
      	172694 - 63588 = 109106 bytes actually backlogged in netem 1
      	109106 / 73 packets ~= 1494 bytes (which matches our MTU)
      
      The root cause is that the byte accounting is not done at the
      same time as the packet accounting. The solution is to update the backlog
      value every time the packet queue is updated.
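
      The idea, sketched against struct Qdisc's fields (an illustration of
      the accounting rule, not the full patch):

        /* keep byte and packet accounting in lockstep: whenever an skb
         * enters or leaves this qdisc's own queue, adjust the byte
         * backlog by that same skb's length
         */
        static void netem_queue_tail(struct Qdisc *sch, struct sk_buff *skb)
        {
                __skb_queue_tail(&sch->q, skb);            /* packet count */
                sch->qstats.backlog += qdisc_pkt_len(skb); /* byte count   */
        }

      Before the fix, bytes were added and removed on a different schedule
      than packets, so a child qdisc's queue could leak into the parent's
      byte backlog.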
      Signed-off-by: Joseph D Beshay <joseph.beshay@utdallas.edu>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 04 Nov, 2014 1 commit
  9. 30 Sep, 2014 1 commit
  10. 05 Jun, 2014 1 commit
  11. 18 Feb, 2014 1 commit
  12. 14 Feb, 2014 3 commits
  13. 22 Jan, 2014 1 commit
    • reciprocal_divide: update/correction of the algorithm · 809fa972
      Hannes Frederic Sowa committed
      Jakub Zawadzki noticed that some divisions by reciprocal_divide()
      were not correct [1][2]. He could also show this with BPF code:
      after divisions are transformed into reciprocal_value() for runtime
      invariance, which can be passed to reciprocal_divide() later on, the
      reverse in a BPF dump ended up with a different, off-by-one K in
      some situations.
      
      This has been fixed by Eric Dumazet in commit aee636c4
      ("bpf: do not use reciprocal divide"). This follow-up patch
      improves reciprocal_value() and reciprocal_divide() to work in
      all cases by using Granlund and Montgomery method, so that also
      future use is safe and without any non-obvious side-effects.
      Known problems with the old implementation were that division by 1
      always returned 0 and some off-by-ones when the dividend and divisor
      were very large. This did not seem to be problematic with its
      current users, as far as we can tell. Eric Dumazet checked the
      slab usage; we cannot say so with certainty in the case of flex_array.
      Still, in order to fix that, we propose an extension of the
      original implementation from commit 6a2d7a95, resp. [3][4],
      using the algorithm proposed in "Division by Invariant Integers
      Using Multiplication" [5] by Torbjörn Granlund and Peter L.
      Montgomery. That is, pseudocode for q = n/d, where q, n, d are in
      the u32 universe:
      
      1) Initialization:
      
        int l = ceil(log_2 d)
        uword m' = floor((1<<32)*((1<<l)-d)/d)+1
        int sh_1 = min(l,1)
        int sh_2 = max(l-1,0)
      
      2) For q = n/d, all uword:
      
        uword t = (n*m')>>32
        q = (t+((n-t)>>sh_1))>>sh_2
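
      A self-contained userspace rendering of that pseudocode (type and
      function names here are illustrative, not the kernel's header):

        #include <stdint.h>
        #include <stdio.h>

        struct recip { uint32_t m; uint8_t sh1, sh2; };

        static int ceil_log2(uint32_t d)        /* l = ceil(log_2 d), d >= 1 */
        {
                int l = 0;

                while ((1ULL << l) < d)
                        l++;
                return l;
        }

        static struct recip reciprocal_value(uint32_t d)
        {
                int l = ceil_log2(d);
                uint64_t m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;

                return (struct recip){
                        .m   = (uint32_t)m,
                        .sh1 = l < 1 ? l : 1,           /* min(l, 1)     */
                        .sh2 = l > 1 ? l - 1 : 0,       /* max(l - 1, 0) */
                };
        }

        static uint32_t reciprocal_divide(uint32_t n, struct recip R)
        {
                uint32_t t = (uint32_t)(((uint64_t)n * R.m) >> 32);

                return (t + ((n - t) >> R.sh1)) >> R.sh2;
        }

        int main(void)
        {
                /* spot-check the two failure modes of the old version:
                 * division by 1, and a very large dividend/divisor
                 */
                printf("%u\n", reciprocal_divide(7, reciprocal_value(1)));
                printf("%u\n", reciprocal_divide(0xffffffffu,
                                                 reciprocal_value(3)));
                return 0;       /* expect 7 and 1431655765 */
        }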
      
      The assembler implementation from Agner Fog [6] also helped a lot
      while implementing. We have tested the implementation on x86_64,
      ppc64, i686, s390x; on x86_64/Haswell we're still at half the latency
      compared to a normal divide.
      
      Joint work with Daniel Borkmann.
      
        [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
        [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
        [3] https://gmplib.org/~tege/division-paper.pdf
        [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
        [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
        [6] http://www.agner.org/optimize/asmlib.zip
      Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 20 Jan, 2014 1 commit
  15. 15 Jan, 2014 1 commit
  16. 01 Jan, 2014 2 commits
  17. 11 Dec, 2013 1 commit
  18. 01 Dec, 2013 3 commits
  19. 26 Oct, 2013 1 commit
    • netem: markov loss model transition fix · 4a3ad7b3
      Hagen Paul Pfeifer committed
      The transition from markov state "3 => lost packets within a burst
      period" to "1 => successfully transmitted packets within a gap period"
      has no *additional* loss event. The loss already happened for the
      transition from 1 -> 3; this additional loss will make things go wild.
      
      E.g. transition probabilities:
      
      p13:   10%
      p31:  100%
      
      Expected:
      
      Ploss = p13 / (p13 + p31)
      Ploss = ~9.09%
      
      ... but it isn't. Even worse: we get a double loss - each time.
      So simply don't return true to indicate loss; rather, break and return
      false.
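
      A trimmed sketch of the corrected step, reduced to the two states
      discussed above (the real model has four states; names and macros here
      are illustrative only):

        #include <stdbool.h>

        #define P13 0.1         /* gap -> burst transition probability */
        #define P31 1.0         /* burst -> gap transition probability */

        /* one packet's step through the chain; true means "lost" */
        static bool loss_step(int *state, double rnd)
        {
                switch (*state) {
                case 1:                         /* gap period, transmitting */
                        if (rnd < P13) {
                                *state = 3;
                                return true;    /* the loss happens here */
                        }
                        return false;
                case 3:                         /* burst period, losing */
                        if (rnd < P31) {
                                *state = 1;     /* leaving the burst: the fix
                                                 * is to NOT report a second
                                                 * loss on this transition */
                                break;
                        }
                        return true;            /* still inside the burst */
                }
                return false;
        }

      With P13 = 0.1 and P31 = 1.0, every excursion into state 3 now costs
      exactly one packet, so the long-run loss rate converges to
      p13 / (p13 + p31) ~= 9.09%, as expected.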
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Stefano Salsano <stefano.salsano@uniroma2.it>
      Cc: Fabio Ludovici <fabio.ludovici@yahoo.it>
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 12 Oct, 2013 2 commits
  21. 01 Aug, 2013 1 commit
    • netem: Introduce skb_orphan_partial() helper · f2f872f9
      Eric Dumazet committed
      Commit 547669d4 ("tcp: xps: fix reordering issues") added
      unexpected reorders when netem is used in an MQ setup on a high
      performance test bed.
      
      ETH=eth0
      tc qd del dev $ETH root 2>/dev/null
      tc qd add dev $ETH root handle 1: mq
      for i in `seq 1 32`
      do
       tc qd add dev $ETH parent 1:$i netem delay 100ms
      done
      
      As all TCP packets are orphaned by netem, the TCP stack believes it
      can set skb->ooo_okay on all packets.

      In order to allow producers to send more packets, we want to
      keep sk_wmem_alloc from reaching the sk_sndbuf limit.

      We can do that by accounting one byte per skb in netem queues,
      so that the TCP stack is not fooled too much.
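
      The helper's idea, sketched (the kernel version lives in
      net/core/sock.c and also covers tcp_wfree; this is the shape, not the
      verbatim code):

        static void skb_orphan_partial(struct sk_buff *skb)
        {
                if (skb->destructor == sock_wfree) {
                        /* hand back all but one byte of the charge now... */
                        atomic_sub(skb->truesize - 1,
                                   &skb->sk->sk_wmem_alloc);
                        /* ...and let the destructor release the last byte,
                         * so each queued skb still costs the socket 1 byte
                         */
                        skb->truesize = 1;
                } else {
                        skb_orphan(skb); /* unknown destructor: full orphan */
                }
        }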
      
      Tested:
      
      With the above MQ/netem setup, scaling the number of concurrent flows
      gives linear results and no reorders/retransmits
      
      lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
       do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
      n:1 198.46
      n:10 2002.69
      n:20 4000.98
      n:30 6006.35
      n:40 8020.93
      n:50 10032.3
      n:60 12081.9
      n:70 13971.3
      n:80 16009.7
      n:90 17117.3
      n:100 17425.5
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 04 Jul, 2013 1 commit
  23. 02 Jul, 2013 1 commit
    • netem: use rb tree to implement the time queue · aec0a40a
      Eric Dumazet committed
      The following typical setup, implementing a ~100 ms RTT and a large
      amount of reordering, has very poor performance because netem
      implements the time queue using a linked list.
      -----------------------------------------------------------
      ETH=eth0
      IFB=ifb0
      modprobe ifb
      ip link set dev $IFB up
      tc qdisc add dev $ETH ingress 2>/dev/null
      tc filter add dev $ETH parent ffff: \
         protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress \
         redirect dev $IFB
      ethtool -K $ETH gro off tso off gso off
      tc qdisc add dev $IFB root netem delay 50ms 10ms limit 100000
      tc qd add dev $ETH root netem delay 50ms limit 100000
      ---------------------------------------------------------
      
      Switch the netem time queue to an rb tree, so this kind of setup can
      work at high speed.
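
      The core of the change, sketched (skb_to_node() and node_time() stand
      in for the patch's cb-embedding helpers; not the verbatim code):

        /* insert an skb into a time-ordered rbtree keyed by its scheduled
         * transmit time, replacing the O(n) linked-list walk
         */
        static void tfifo_enqueue(struct rb_root *root, struct sk_buff *nskb,
                                  u64 tnext)
        {
                struct rb_node **p = &root->rb_node, *parent = NULL;

                while (*p) {
                        parent = *p;
                        if (tnext >= node_time(parent))
                                p = &parent->rb_right;  /* later/equal */
                        else
                                p = &parent->rb_left;   /* earlier */
                }
                rb_link_node(skb_to_node(nskb), parent, p);
                rb_insert_color(skb_to_node(nskb), root);
                /* dequeue peeks rb_first() for the earliest time_to_send */
        }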
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 30 Jan, 2013 1 commit
    • netem: fix delay calculation in rate extension · a13d3104
      Johannes Naab committed
      The delay calculation with the rate extension introduced in v3.3 does
      not work properly if other packets are still queued for transmission.
      For the delay calculation to work, both delay types (latency and the
      delay introduced by rate limitation) have to be handled differently.
      The latency delay for a packet can overlap with the delay of other
      packets. The delay introduced by the rate, however, is separate, and
      can only start once all other rate-introduced delays have finished.

      The latency delay comes from the same distribution for each packet;
      the rate delay depends on the packet size.
      
      .: latency delay
      -: rate delay
      x: additional delay we have to wait since another packet is currently
         transmitted
      
        .....----                    Packet 1
          .....xx------              Packet 2
                     .....------     Packet 3
          ^^^^^
          latency stacks
               ^^
               rate delay doesn't stack
                     ^^
                     latency stacks
      
        -----> time
      
      When a packet is enqueued, we first consider the latency delay. If
      other packets are already queued, we can reduce the latency delay
      until the last packet in the queue is sent; however, the latency delay
      cannot be <0, since this would mean that the rate is overcommitted.
      The new reference point is the time at which the last packet will be
      sent. To find the time when the packet should be sent, the
      rate-introduced delay has to be added on top of that.
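
      In code terms, the corrected enqueue-time computation looks roughly
      like this (all names illustrative, not netem's actual identifiers):

        u64 now = current_time();
        s64 latency = sample_latency_jitter(); /* per-packet distribution */
        struct sk_buff *last = peek_tail(&queue);

        if (last) {
                /* latency may overlap with time already spent queued,
                 * but never goes below 0: the rate is not overcommitted
                 */
                latency -= time_to_send(last) - now;
                if (latency < 0)
                        latency = 0;
                /* new reference point: when the last queued packet leaves */
                now = time_to_send(last);
        }
        /* the rate delay is serialized and always stacks on top */
        set_time_to_send(skb, now + latency + rate_delay(skb->len));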
      Signed-off-by: Johannes Naab <jn@stusta.de>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 17 Jul, 2012 1 commit
    • netem: refine early skb orphaning · 5a308f40
      Eric Dumazet committed
      netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
      or any mechanism relying on socket sk_wmem_alloc feedback.
      
      Ideally, we should perform this orphaning after the rate module and
      before the delay module, to mimic what happens on a real link:

      skb orphaning is indeed normally done at TX completion, before the
      transit on the link.
      
      +-------+   +--------+  +---------------+  +-----------------+
      + Qdisc +---> Device +--> TX completion +--> links / hops    +->
      +       +   +  xmit  +  + skb orphaning +  + propagation     +
      +-------+   +--------+  +---------------+  +-----------------+
            < rate limiting >                  < delay, drops, reorders >
      
      If netem is used without the delay feature (only drops, reorders, rate
      limiting), then we should avoid early skb orphaning, to keep pressure
      on sockets as long as packets are still in the qdisc queue.

      Ideally, netem should be refactored to implement the delay module
      as the last stage. The current algorithm merges the two phases
      (rate limiting + delay), so it's not correct.
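
      The resulting guard is tiny; conceptually (a sketch of the condition
      described above, using netem's config fields for the delay feature,
      not the verbatim patch):

        /* orphan early only when netem will actually hold packets back;
         * otherwise keep skb->sk accounted so TSQ-style backpressure
         * still reaches the sender while the skb sits in the qdisc
         */
        if (q->latency || q->jitter)
                skb_orphan(skb);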
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Mark Gordon <msg@google.com>
      Cc: Andreas Terzis <aterzis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 09 Jul, 2012 1 commit
    • netem: add limitation to reordered packets · 960fb66e
      Eric Dumazet committed
      Fix two netem bugs:

      1) When a frame was dropped by tfifo_enqueue(), the drop counter
         was incremented twice.

      2) When reordering is triggered, we enqueue a packet without
         checking the queue limit. This can OOM pretty fast when repeated
         enough: since the skbs are orphaned, no socket limit can help in
         this situation (see the sketch below).
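
      Conceptually, the reorder path now checks the limit before inserting
      (a sketch, not the verbatim patch):

        /* even the "send at head" reorder path must respect the configured
         * queue limit, or orphaned skbs can pile up without bound
         */
        if (skb_queue_len(&sch->q) >= sch->limit)
                return qdisc_drop(skb, sch);    /* counted once, not twice */

        __skb_queue_head(&sch->q, skb);         /* reordered: jump the queue */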
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mark Gordon <msg@google.com>
      Cc: Andreas Terzis <aterzis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 02 May, 2012 1 commit
  28. 01 May, 2012 1 commit
  29. 02 Apr, 2012 1 commit
  30. 20 Feb, 2012 1 commit
  31. 10 Feb, 2012 1 commit