1. 15 11月, 2017 2 次提交
  2. 13 11月, 2017 3 次提交
    • D
      netem: support delivering packets in delayed time slots · 836af83b
      Dave Taht 提交于
      Slotting is a crude approximation of the behaviors of shared media such
      as cable, wifi, and LTE, which gather up a bunch of packets within a
      varying delay window and deliver them, relative to that, nearly all at
      once.
      
      It works within the existing loss, duplication, jitter and delay
      parameters of netem. Some amount of inherent latency must be specified,
      regardless.
      
      The new "slot" parameter specifies a minimum and maximum delay between
      transmission attempts.
      
      The "bytes" and "packets" parameters can be used to limit the amount of
      information transferred per slot.
      
      Examples of use:
      
      tc qdisc add dev eth0 root netem delay 200us \
               slot 800us 10ms bytes 64k packets 42
      
      A more correct example, using stacked netem instances and a packet limit
      to emulate a tail drop wifi queue with slots and variable packet
      delivery, with a 200Mbit isochronous underlying rate, and 20ms path
      delay:
      
      tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
               limit 10000
      tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
               slot 800us 10ms bytes 64k packets 42 limit 512
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      836af83b
    • D
      netem: add uapi to express delay and jitter in nanoseconds · 99803171
      Dave Taht 提交于
      netem userspace has long relied on a horrible /proc/net/psched hack
      to translate the current notion of "ticks" to nanoseconds.
      
      Expressing latency and jitter instead, in well defined nanoseconds,
      increases the dynamic range of emulated delays and jitter in netem.
      
      It will also ease a transition where reducing a tick to nsec
      equivalence would constrain the max delay in prior versions of
      netem to only 4.3 seconds.
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      99803171
    • D
      netem: convert to qdisc_watchdog_schedule_ns · 112f9cb6
      Dave Taht 提交于
      Upgrade the internal netem scheduler to use nanoseconds rather than
      ticks throughout.
      
      Convert to and from the std "ticks" userspace api automatically,
      while allowing for finer grained scheduling to take place.
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      112f9cb6
  3. 07 10月, 2017 1 次提交
  4. 26 9月, 2017 1 次提交
    • E
      sch_netem: faster rb tree removal · 3aa605f2
      Eric Dumazet 提交于
      While running TCP tests involving netem storing millions of packets,
      I had the idea to speed up tfifo_reset() and did experiments.
      
      I tried the rbtree_postorder_for_each_entry_safe() method that is
      used in skb_rbtree_purge() but discovered it was slower than the
      current tfifo_reset() method.
      
      I measured time taken to release skbs with three occupation levels :
      10^4, 10^5 and 10^6 skbs with three methods :
      
      1) (current 'naive' method)
      
      	while ((p = rb_first(&q->t_root))) {
      		struct sk_buff *skb = netem_rb_to_skb(p);
      
      		rb_erase(p, &q->t_root);
      		rtnl_kfree_skbs(skb, skb);
      	}
      
      2) Use rb_next() instead of rb_first() in the loop :
      
      	p = rb_first(&q->t_root);
      	while (p) {
      		struct sk_buff *skb = netem_rb_to_skb(p);
      
      		p = rb_next(p);
      		rb_erase(&skb->rbnode, &q->t_root);
      		rtnl_kfree_skbs(skb, skb);
      	}
      
      3) "optimized" method using rbtree_postorder_for_each_entry_safe()
      
      	struct sk_buff *skb, *next;
      
      	rbtree_postorder_for_each_entry_safe(skb, next,
      					     &q->t_root, rbnode) {
                     rtnl_kfree_skbs(skb, skb);
      	}
      	q->t_root = RB_ROOT;
      
      Results :
      
      method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  10000 skbs in 541846 ns (54 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)
      
      method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  100000 skbs in 5942456 ns (59 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)
      
      method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  1000000 skbs in 82619635 ns (82 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)
      
      Method 2) is simply faster, probably because it maintains a smaller
      working size set.
      
      Note that this is the method we use in tcp_ofo_queue() already.
      
      I will also change skb_rbtree_purge() in a second patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3aa605f2
  5. 20 9月, 2017 1 次提交
    • E
      net: sk_buff rbnode reorg · bffa72cf
      Eric Dumazet 提交于
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan using an RB tree for TCP retransmit queue to speedup SACK
      processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bffa72cf
  6. 31 8月, 2017 1 次提交
    • N
      sch_netem: avoid null pointer deref on init failure · 634576a1
      Nikolay Aleksandrov 提交于
      netem can fail in ->init due to missing options (either not supplied by
      user-space or used as a default qdisc) causing a timer->base null
      pointer deref in its ->destroy() and ->reset() callbacks.
      
      Reproduce:
      $ sysctl net.core.default_qdisc=netem
      $ ip l set ethX up
      
      Crash log:
      [ 1814.846943] BUG: unable to handle kernel NULL pointer dereference at (null)
      [ 1814.847181] IP: hrtimer_active+0x17/0x8a
      [ 1814.847270] PGD 59c34067
      [ 1814.847271] P4D 59c34067
      [ 1814.847337] PUD 37374067
      [ 1814.847403] PMD 0
      [ 1814.847468]
      [ 1814.847582] Oops: 0000 [#1] SMP
      [ 1814.847655] Modules linked in: sch_netem(O) sch_fq_codel(O)
      [ 1814.847761] CPU: 3 PID: 1573 Comm: ip Tainted: G           O 4.13.0-rc6+ #62
      [ 1814.847884] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 1814.848043] task: ffff88003723a700 task.stack: ffff88005adc8000
      [ 1814.848235] RIP: 0010:hrtimer_active+0x17/0x8a
      [ 1814.848407] RSP: 0018:ffff88005adcb590 EFLAGS: 00010246
      [ 1814.848590] RAX: 0000000000000000 RBX: ffff880058e359d8 RCX: 0000000000000000
      [ 1814.848793] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880058e359d8
      [ 1814.848998] RBP: ffff88005adcb5b0 R08: 00000000014080c0 R09: 00000000ffffffff
      [ 1814.849204] R10: ffff88005adcb660 R11: 0000000000000020 R12: 0000000000000000
      [ 1814.849410] R13: ffff880058e359d8 R14: 00000000ffffffff R15: 0000000000000001
      [ 1814.849616] FS:  00007f733bbca740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
      [ 1814.849919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1814.850107] CR2: 0000000000000000 CR3: 0000000059f0d000 CR4: 00000000000406e0
      [ 1814.850313] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1814.850518] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1814.850723] Call Trace:
      [ 1814.850875]  hrtimer_try_to_cancel+0x1a/0x93
      [ 1814.851047]  hrtimer_cancel+0x15/0x20
      [ 1814.851211]  qdisc_watchdog_cancel+0x12/0x14
      [ 1814.851383]  netem_reset+0xe6/0xed [sch_netem]
      [ 1814.851561]  qdisc_destroy+0x8b/0xe5
      [ 1814.851723]  qdisc_create_dflt+0x86/0x94
      [ 1814.851890]  ? dev_activate+0x129/0x129
      [ 1814.852057]  attach_one_default_qdisc+0x36/0x63
      [ 1814.852232]  netdev_for_each_tx_queue+0x3d/0x48
      [ 1814.852406]  dev_activate+0x4b/0x129
      [ 1814.852569]  __dev_open+0xe7/0x104
      [ 1814.852730]  __dev_change_flags+0xc6/0x15c
      [ 1814.852899]  dev_change_flags+0x25/0x59
      [ 1814.853064]  do_setlink+0x30c/0xb3f
      [ 1814.853228]  ? check_chain_key+0xb0/0xfd
      [ 1814.853396]  ? check_chain_key+0xb0/0xfd
      [ 1814.853565]  rtnl_newlink+0x3a4/0x729
      [ 1814.853728]  ? rtnl_newlink+0x117/0x729
      [ 1814.853905]  ? ns_capable_common+0xd/0xb1
      [ 1814.854072]  ? ns_capable+0x13/0x15
      [ 1814.854234]  rtnetlink_rcv_msg+0x188/0x197
      [ 1814.854404]  ? rcu_read_unlock+0x3e/0x5f
      [ 1814.854572]  ? rtnl_newlink+0x729/0x729
      [ 1814.854737]  netlink_rcv_skb+0x6c/0xce
      [ 1814.854902]  rtnetlink_rcv+0x23/0x2a
      [ 1814.855064]  netlink_unicast+0x103/0x181
      [ 1814.855230]  netlink_sendmsg+0x326/0x337
      [ 1814.855398]  sock_sendmsg_nosec+0x14/0x3f
      [ 1814.855584]  sock_sendmsg+0x29/0x2e
      [ 1814.855747]  ___sys_sendmsg+0x209/0x28b
      [ 1814.855912]  ? do_raw_spin_unlock+0xcd/0xf8
      [ 1814.856082]  ? _raw_spin_unlock+0x27/0x31
      [ 1814.856251]  ? __handle_mm_fault+0x651/0xdb1
      [ 1814.856421]  ? check_chain_key+0xb0/0xfd
      [ 1814.856592]  __sys_sendmsg+0x45/0x63
      [ 1814.856755]  ? __sys_sendmsg+0x45/0x63
      [ 1814.856923]  SyS_sendmsg+0x19/0x1b
      [ 1814.857083]  entry_SYSCALL_64_fastpath+0x23/0xc2
      [ 1814.857256] RIP: 0033:0x7f733b2dd690
      [ 1814.857419] RSP: 002b:00007ffe1d3387d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [ 1814.858238] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f733b2dd690
      [ 1814.858445] RDX: 0000000000000000 RSI: 00007ffe1d338820 RDI: 0000000000000003
      [ 1814.858651] RBP: ffff88005adcbf98 R08: 0000000000000001 R09: 0000000000000003
      [ 1814.858856] R10: 00007ffe1d3385a0 R11: 0000000000000246 R12: 0000000000000002
      [ 1814.859060] R13: 000000000066f1a0 R14: 00007ffe1d3408d0 R15: 0000000000000000
      [ 1814.859267]  ? trace_hardirqs_off_caller+0xa7/0xcf
      [ 1814.859446] Code: 10 55 48 89 c7 48 89 e5 e8 45 a1 fb ff 31 c0 5d c3
      31 c0 c3 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 8b
      45 30 <4c> 8b 20 41 8b 5c 24 38 31 c9 31 d2 48 c7 c7 50 8e 1d 82 41 89
      [ 1814.860022] RIP: hrtimer_active+0x17/0x8a RSP: ffff88005adcb590
      [ 1814.860214] CR2: 0000000000000000
      
      Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation")
      Fixes: 0fbbeb1b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634576a1
  7. 26 8月, 2017 1 次提交
    • W
      net_sched: remove tc class reference counting · 143976ce
      WANG Cong 提交于
      For TC classes, their ->get() and ->put() are always paired, and the
      reference counting is completely useless, because:
      
      1) For class modification and dumping paths, we already hold RTNL lock,
         so all of these ->get(),->change(),->put() are atomic.
      
      2) For filter bindiing/unbinding, we use other reference counter than
         this one, and they should have RTNL lock too.
      
      3) For ->qlen_notify(), it is special because it is called on ->enqueue()
         path, but we already hold qdisc tree lock there, and we hold this
         tree lock when graft or delete the class too, so it should not be gone
         or changed until we release the tree lock.
      
      Therefore, this patch removes ->get() and ->put(), but:
      
      1) Adds a new ->find() to find the pointer to a class by classid, no
         refcnt.
      
      2) Move the original class destroy upon the last refcnt into ->delete(),
         right after releasing tree lock. This is fine because the class is
         already removed from hash when holding the lock.
      
      For those who also use ->put() as ->unbind(), just rename them to reflect
      this change.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      143976ce
  8. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  9. 14 4月, 2017 1 次提交
  10. 17 3月, 2017 1 次提交
    • N
      netem: apply correct delay when rate throttling · 5080f39e
      Nik Unger 提交于
      I recently reported on the netem list that iperf network benchmarks
      show unexpected results when a bandwidth throttling rate has been
      configured for netem. Specifically:
      
      1) The measured link bandwidth *increases* when a higher delay is added
      2) The measured link bandwidth appears higher than the specified limit
      3) The measured link bandwidth for the same very slow settings varies significantly across
        machines
      
      The issue can be reproduced by using tc to configure netem with a
      512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
      veth pair between network namespaces, and then using iperf (or any
      other network benchmarking tool) to test throughput. Complete detailed
      instructions are in the original email chain here:
      https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
      
      There appear to be two underlying bugs causing these effects:
      
      - The first issue causes long delays when the rate is slow and no
        delay is configured (e.g., "rate 512kbit"). This is because SKBs are
        not orphaned when no delay is configured, so orphaning does not
        occur until *after* the rate-induced delay has been applied. For
        this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
        dramatically increases the measured bandwidth.
      
      - The second issue is that rate-induced delays are not correctly
        applied, allowing SKB delays to occur in parallel. The indended
        approach is to compute the delay for an SKB and to add this delay to
        the end of the current queue. However, the code does not detect
        existing SKBs in the queue due to improperly testing sch->q.qlen,
        which is nonzero even when packets exist only in the
        rbtree. Consequently, new SKBs do not wait for the current queue to
        empty. When packet delays vary significantly (e.g., if packet sizes
        are different), then this also causes unintended reordering.
      
      I modified the code to expect a delay (and orphan the SKB) when a rate
      is configured. I also added some defensive tests that correctly find
      the latest scheduled delivery time, even if it is (unexpectedly) for a
      packet in sch->q. I have tested these changes on the latest kernel
      (4.11.0-rc1+) and the iperf / ping test results are as expected.
      Signed-off-by: NNik Unger <njunger@uwaterloo.ca>
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5080f39e
  11. 09 1月, 2017 2 次提交
    • W
      net-tc: convert tc_from to tc_from_ingress and tc_redirected · bc31c905
      Willem de Bruijn 提交于
      The tc_from field fulfills two roles. It encodes whether a packet was
      redirected by an act_mirred device and, if so, whether act_mirred was
      called on ingress or egress. Split it into separate fields.
      
      The information is needed by the special IFB loop, where packets are
      taken out of the normal path by act_mirred, forwarded to IFB, then
      reinjected at their original location (ingress or egress) by IFB.
      
      The IFB device cannot use skb->tc_at_ingress, because that may have
      been overwritten as the packet travels from act_mirred to ifb_xmit,
      when it passes through tc_classify on the IFB egress path. Cache this
      value in skb->tc_from_ingress.
      
      That field is valid only if a packet arriving at ifb_xmit came from
      act_mirred. Other packets can be crafted to reach ifb_xmit. These
      must be dropped. Set tc_redirected on redirection and drop all packets
      that do not have this bit set.
      
      Both fields are set only on cloned skbs in tc actions, so original
      packet sources do not have to clear the bit when reusing packets
      (notably, pktgen and octeon).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc31c905
    • W
      net-tc: convert tc_verd to integer bitfields · a5135bcf
      Willem de Bruijn 提交于
      Extract the remaining two fields from tc_verd and remove the __u16
      completely. TC_AT and TC_FROM are converted to equivalent two-bit
      integer fields tc_at and tc_from. Where possible, use existing
      helper skb_at_tc_ingress when reading tc_at. Introduce helper
      skb_reset_tc to clear fields.
      
      Not documenting tc_from and tc_at, because they will be replaced
      with single bit fields in follow-on patches.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5135bcf
  12. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  13. 21 12月, 2016 1 次提交
  14. 19 9月, 2016 3 次提交
  15. 29 6月, 2016 1 次提交
  16. 26 6月, 2016 1 次提交
    • E
      net_sched: drop packets after root qdisc lock is released · 520ac30f
      Eric Dumazet 提交于
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue(), and in qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      520ac30f
  17. 24 6月, 2016 1 次提交
  18. 16 6月, 2016 1 次提交
  19. 11 6月, 2016 2 次提交
  20. 09 6月, 2016 2 次提交
  21. 03 5月, 2016 1 次提交
    • N
      netem: Segment GSO packets on enqueue · 6071bd1a
      Neil Horman 提交于
      This was recently reported to me, and reproduced on the latest net kernel,
      when attempting to run netperf from a host that had a netem qdisc attached
      to the egress interface:
      
      [  788.073771] ---------------------[ cut here ]---------------------------
      [  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
      [  788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
      data_len=0 gso_size=1448 gso_type=1 ip_summed=3
      [  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
      ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
      glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
      i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
      pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
      sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
      i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
      crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
      serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
      dm_mod
      [  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G        W
      ------------   3.10.0-327.el7.x86_64 #1
      [  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
      [  788.542260]  ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
      ffffffff816351f1
      [  788.576332]  ffff880437c036a8 ffffffff8107b200 ffff880633e74200
      ffff880231674000
      [  788.611943]  0000000000000001 0000000000000003 0000000000000000
      ffff880437c03710
      [  788.647241] Call Trace:
      [  788.658817]  <IRQ>  [<ffffffff816351f1>] dump_stack+0x19/0x1b
      [  788.686193]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
      [  788.713803]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
      [  788.741314]  [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
      [  788.767018]  [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
      [  788.796117]  [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
      [  788.823392]  [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
      [  788.854487]  [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
      [  788.880870]  [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
      ...
      
      The problem occurs because netem is not prepared to handle GSO packets (as it
      uses skb_checksum_help in its enqueue path, which cannot manipulate these
      frames).
      
      The solution I think is to simply segment the skb in a simmilar fashion to the
      way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
      When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
      the first segment, and enqueue the remaining ones.
      
      tested successfully by myself on the latest net kernel, to which this applies
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netem@lists.linux-foundation.org
      CC: eric.dumazet@gmail.com
      CC: stephen@networkplumber.org
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6071bd1a
  22. 26 4月, 2016 1 次提交
  23. 01 3月, 2016 2 次提交
  24. 12 5月, 2015 1 次提交
  25. 08 4月, 2015 1 次提交
    • B
      netem: Fixes byte backlog accounting for the first of two chained netem instances · 0ad2a836
      Beshay, Joseph 提交于
      Fixes byte backlog accounting for the first of two chained netem instances.
      Bytes backlog reported now corresponds to the number of queued packets.
      
      When two netem instances are chained, for instance to apply rate and queue
      limitation followed by packet delay, the number of backlogged bytes reported
      by the first netem instance is wrong. It reports the sum of bytes in the queues
      of the first and second netem. The first netem reports the correct number of
      backlogged packets but not bytes. This is shown in the example below.
      
      Consider a chain of two netem schedulers created using the following commands:
      
      $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
      $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms
      
      Start an iperf session to send packets out on the specified interface and
      monitor the backlog using tc:
      
      $ tc -s qdisc show dev veth2
      
      Output using unpatched netem:
      	qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
      	 Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
      	 backlog 172694b 73p requeues 0
      	qdisc netem 2: parent 1: limit 1000 delay 50.0ms
      	 Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
      	 backlog 63588b 42p requeues 0
      
      The interface used to produce this output has an MTU of 1500. The output for
      backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
      the total number of sent bytes and packets. By dividing the number of sent
      bytes by the number of sent packets, we get an average packet size of ~=1504.
      If we divide the number of backlogged bytes by packets, we get ~=2365. This is
      due to the first netem incorrectly counting the 63588b which are in netem 2's
      queue as being in its own queue. To verify this is the case, we subtract them
      from the reported value and divide by the number of packets as follows:
      	172694 - 63588 = 109106 bytes actualled backlogged in netem 1
      	109106 / 73 packets ~= 1494 bytes (which matches our MTU)
      
      The root cause is that the byte accounting is not done at the
      same time with packet accounting. The solution is to update the backlog value
      every time the packet queue is updated.
      Signed-off-by: NJoseph D Beshay <joseph.beshay@utdallas.edu>
      Acked-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ad2a836
  26. 04 11月, 2014 1 次提交
  27. 30 9月, 2014 1 次提交
  28. 05 6月, 2014 1 次提交
  29. 18 2月, 2014 1 次提交
  30. 14 2月, 2014 2 次提交