1. 30 10月, 2020 1 次提交
    • A
      netem: fix zero division in tabledist · eadd1bef
      Aleksandr Nogikh 提交于
      Currently it is possible to craft a special netlink RTM_NEWQDISC
      command that can result in jitter being equal to 0x80000000. It is
      enough to set the 32 bit jitter to 0x02000000 (it will later be
      multiplied by 2^6) or just set the 64 bit jitter via
      TCA_NETEM_JITTER64. This causes an overflow during the generation of
      uniformly distributed numbers in tabledist(), which in turn leads to
      division by zero (sigma != 0, but sigma * 2 is 0).
      
      The related fragment of code needs 32-bit division - see commit
      9b0ed891 ("netem: remove unnecessary 64 bit modulus"), so switching to
      64 bit is not an option.
      
      Fix the issue by keeping the value of jitter within the range that can
      be adequately handled by tabledist() - [0;INT_MAX]. As negative std
      deviation makes no sense, take the absolute value of the passed value
      and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit
      arithmetic in order to prevent overflows.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NAleksandr Nogikh <nogikh@google.com>
      Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      eadd1bef
  2. 17 7月, 2020 1 次提交
  3. 30 6月, 2020 1 次提交
    • P
      net: sched: Pass root lock to Qdisc_ops.enqueue · aebe4426
      Petr Machata 提交于
      A following patch introduces qevents, points in qdisc algorithm where
      packet can be processed by user-defined filters. Should this processing
      lead to a situation where a new packet is to be enqueued on the same port,
      holding the root lock would lead to deadlocks. To solve the issue, qevent
      handler needs to unlock and relock the root lock when necessary.
      
      To that end, add the root lock argument to the qdisc op enqueue, and
      propagate throughout.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aebe4426
  4. 01 3月, 2020 1 次提交
  5. 20 10月, 2019 2 次提交
  6. 27 9月, 2019 1 次提交
    • E
      sch_netem: fix rcu splat in netem_enqueue() · 159d2c7d
      Eric Dumazet 提交于
      qdisc_root() use from netem_enqueue() triggers a lockdep warning.
      
      __dev_queue_xmit() uses rcu_read_lock_bh() which is
      not equivalent to rcu_read_lock() + local_bh_disable_bh as far
      as lockdep is concerned.
      
      WARNING: suspicious RCU usage
      5.3.0-rc7+ #0 Not tainted
      -----------------------------
      include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syz-executor427/8855:
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
       #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838
      
      stack backtrace:
      CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
       qdisc_root include/net/sch_generic.h:492 [inline]
       netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
       __dev_xmit_skb net/core/dev.c:3527 [inline]
       __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
       ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
       ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
       udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
       udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:657
       ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
       __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
       do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      159d2c7d
  7. 21 9月, 2019 1 次提交
  8. 19 6月, 2019 2 次提交
    • J
      net: netem: fix use after free and double free with packet corruption · 3e14c383
      Jakub Kicinski 提交于
      Brendan reports that the use of netem's packet corruption capability
      leads to strange crashes.  This seems to be caused by
      commit d66280b1 ("net: netem: use a list in addition to rbtree")
      which uses skb->next pointer to construct a fast-path queue of
      in-order skbs.
      
      Packet corruption code has to invoke skb_gso_segment() in case
      of skbs in need of GSO.  skb_gso_segment() returns a list of
      skbs.  If next pointers of the skbs on that list do not get cleared
      fast path list may point to freed skbs or skbs which are also on
      the RB tree.
      
      Let's say skb gets segmented into 3 frames:
      
      A -> B -> C
      
      A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's
      next pointer didn't get cleared so we have:
      
      h t
      |/
      A -> B -> C
      
      Now if B and C get also get enqueued successfully all is fine, because
      tfifo_enqueue() will overwrite the list in order.  IOW:
      
      Enqueue B:
      
      h    t
      |    |
      A -> B    C
      
      Enqueue C:
      
      h         t
      |         |
      A -> B -> C
      
      But if B and C get reordered we may end up with:
      
      h t            RB tree
      |/                |
      A -> B -> C       B
                         \
                          C
      
      Or if they get dropped just:
      
      h t
      |/
      A -> B -> C
      
      where A and B are already freed.
      
      To reproduce either limit has to be set low to cause freeing of
      segs or reorders have to happen (due to delay jitter).
      
      Note that we only have to mark the first segment as not on the
      list, "finish_segs" handling of other frags already does that.
      
      Another caveat is that qdisc_drop_all() still has to free all
      segments correctly in case of drop of first segment, therefore
      we re-link segs before calling it.
      
      v2:
       - re-link before drop, v1 was leaking non-first segs if limit
         was hit at the first seg
       - better commit message which lead to discovering the above :)
      Reported-by: NBrendan Galloway <brendan.galloway@netronome.com>
      Fixes: d66280b1 ("net: netem: use a list in addition to rbtree")
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e14c383
    • J
      net: netem: fix backlog accounting for corrupted GSO frames · 177b8007
      Jakub Kicinski 提交于
      When GSO frame has to be corrupted netem uses skb_gso_segment()
      to produce the list of frames, and re-enqueues the segments one
      by one.  The backlog length has to be adjusted to account for
      new frames.
      
      The current calculation is incorrect, leading to wrong backlog
      lengths in the parent qdisc (both bytes and packets), and
      incorrect packet backlog count in netem itself.
      
      Parent backlog goes negative, netem's packet backlog counts
      all non-first segments twice (thus remaining non-zero even
      after qdisc is emptied).
      
      Move the variables used to count the adjustment into local
      scope to make 100% sure they aren't used at any stage in
      backports.
      
      Fixes: 6071bd1a ("netem: Segment GSO packets on enqueue")
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      177b8007
  9. 31 5月, 2019 1 次提交
  10. 28 4月, 2019 2 次提交
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg 提交于
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8cb08174
    • M
      netlink: make nla_nest_start() add NLA_F_NESTED flag · ae0be8de
      Michal Kubecek 提交于
      Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
      netlink based interfaces (including recently added ones) are still not
      setting it in kernel generated messages. Without the flag, message parsers
      not aware of attribute semantics (e.g. wireshark dissector or libmnl's
      mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
      the structure of their contents.
      
      Unfortunately we cannot just add the flag everywhere as there may be
      userspace applications which check nlattr::nla_type directly rather than
      through a helper masking out the flags. Therefore the patch renames
      nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
      as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
      are rewritten to use nla_nest_start().
      
      Except for changes in include/net/netlink.h, the patch was generated using
      this semantic patch:
      
      @@ expression E1, E2; @@
      -nla_nest_start(E1, E2)
      +nla_nest_start_noflag(E1, E2)
      
      @@ expression E1, E2; @@
      -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
      +nla_nest_start(E1, E2)
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae0be8de
  11. 01 3月, 2019 1 次提交
    • S
      net: netem: fix skb length BUG_ON in __skb_to_sgvec · 5845f706
      Sheng Lan 提交于
      It can be reproduced by following steps:
      1. virtio_net NIC is configured with gso/tso on
      2. configure nginx as http server with an index file bigger than 1M bytes
      3. use tc netem to produce duplicate packets and delay:
         tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
      4. continually curl the nginx http server to get index file on client
      5. BUG_ON is seen quickly
      
      [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
      [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
      [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
      [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
      [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
      [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
      [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
      [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
      [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
      [10258690.372094] Call Trace:
      [10258690.372094]  <IRQ>
      [10258690.372094]  skb_to_sgvec+0x11/0x40
      [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
      [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
      [10258690.372094]  sch_direct_xmit+0xff/0x260
      [10258690.372094]  __qdisc_run+0x15e/0x4e0
      [10258690.372094]  net_tx_action+0x137/0x210
      [10258690.372094]  __do_softirq+0xd6/0x2a9
      [10258690.372094]  irq_exit+0xde/0xf0
      [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
      [10258690.372094]  apic_timer_interrupt+0xf/0x20
      [10258690.372094]  </IRQ>
      
      In __skb_to_sgvec(), the skb->len is not equal to the sum of the skb's
      linear data size and nonlinear data size, thus BUG_ON triggered.
      Because the skb is cloned and a part of nonlinear data is split off.
      
      Duplicate packet is cloned in netem_enqueue() and may be delayed
      some time in qdisc. When qdisc len reached the limit and returns
      NET_XMIT_DROP, the skb will be retransmit later in write queue.
      the skb will be fragmented by tso_fragment(), the limit size
      that depends on cwnd and mss decrease, the skb's nonlinear
      data will be split off. The length of the skb cloned by netem
      will not be updated. When we use virtio_net NIC and invoke skb_to_sgvec(),
      the BUG_ON trigger.
      
      To fix it, netem returns NET_XMIT_SUCCESS to upper stack
      when it clones a duplicate packet.
      
      Fixes: 35d889d1 ("sch_netem: fix skb leak in netem_enqueue()")
      Signed-off-by: NSheng Lan <lansheng@huawei.com>
      Reported-by: NQin Ji <jiqin.ji@huawei.com>
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5845f706
  12. 06 12月, 2018 1 次提交
  13. 30 11月, 2018 1 次提交
    • C
      net: Prevent invalid access to skb->prev in __qdisc_drop_all · 9410d386
      Christoph Paasch 提交于
      __qdisc_drop_all() accesses skb->prev to get to the tail of the
      segment-list.
      
      With commit 68d2f84a ("net: gro: properly remove skb from list")
      the skb-list handling has been changed to set skb->next to NULL and set
      the list-poison on skb->prev.
      
      With that change, __qdisc_drop_all() will panic when it tries to
      dereference skb->prev.
      
      Since commit 992cba7e ("net: Add and use skb_list_del_init().")
      __list_del_entry is used, leaving skb->prev unchanged (thus,
      pointing to the list-head if it's the first skb of the list).
      This will make __qdisc_drop_all modify the next-pointer of the list-head
      and result in a panic later on:
      
      [   34.501053] general protection fault: 0000 [#1] SMP KASAN PTI
      [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
      [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
      [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
      [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 04
      [   34.507060] RSP: 0018:ffff8883af507930 EFLAGS: 00010202
      [   34.507761] RAX: 0000000000000007 RBX: ffff8883970b2c80 RCX: 1ffff11072e165a6
      [   34.508640] RDX: 1ffff11075867008 RSI: ffff8883ac338040 RDI: 0000000000000038
      [   34.509493] RBP: ffff8883af5079d0 R08: ffff8883970b2d40 R09: 0000000000000062
      [   34.510346] R10: 0000000000000034 R11: 0000000000000000 R12: 0000000000000000
      [   34.511215] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8883ac338008
      [   34.512082] FS:  0000000000000000(0000) GS:ffff8883af500000(0000) knlGS:0000000000000000
      [   34.513036] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   34.513741] CR2: 000055ccc3e9d020 CR3: 00000003abf32000 CR4: 00000000000006e0
      [   34.514593] Call Trace:
      [   34.514893]  <IRQ>
      [   34.515157]  napi_gro_receive+0x93/0x150
      [   34.515632]  receive_buf+0x893/0x3700
      [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
      [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
      [   34.517153]  ? __stable_node_chain+0x4d0/0x850
      [   34.517684]  ? kfree+0x9a/0x180
      [   34.518067]  ? __kasan_slab_free+0x171/0x190
      [   34.518582]  ? detach_buf+0x1df/0x650
      [   34.519061]  ? lapic_next_event+0x5a/0x90
      [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
      [   34.520093]  virtnet_poll+0x2df/0xd60
      [   34.520533]  ? receive_buf+0x3700/0x3700
      [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
      [   34.521631]  ? htb_dequeue+0x1817/0x25f0
      [   34.522107]  ? sch_direct_xmit+0x142/0xf30
      [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
      [   34.523155]  net_rx_action+0x2f6/0xc50
      [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
      [   34.524126]  ? kasan_check_read+0x11/0x20
      [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
      [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
      [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
      [   34.526130]  ? apic_ack_irq+0x9e/0xe0
      [   34.526567]  __do_softirq+0x188/0x4b5
      [   34.527015]  irq_exit+0x151/0x180
      [   34.527417]  do_IRQ+0xdb/0x150
      [   34.527783]  common_interrupt+0xf/0xf
      [   34.528223]  </IRQ>
      
      This patch makes sure that skb->prev is set to NULL when entering
      netem_enqueue.
      
      Cc: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Fixes: 68d2f84a ("net: gro: properly remove skb from list")
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9410d386
  14. 12 11月, 2018 1 次提交
  15. 26 9月, 2018 1 次提交
  16. 11 9月, 2018 2 次提交
  17. 28 6月, 2018 1 次提交
    • Y
      netem: slotting with non-uniform distribution · 0a9fe5c3
      Yousuk Seung 提交于
      Extend slotting with support for non-uniform distributions. This is
      similar to netem's non-uniform distribution delay feature.
      
      Commit f043efeae2f1 ("netem: support delivering packets in delayed
      time slots") added the slotting feature to approximate the behaviors
      of media with packet aggregation but only supported a uniform
      distribution for delays between transmission attempts. Tests with TCP
      BBR with emulated wifi links with non-uniform distributions produced
      more useful results.
      
      Syntax:
         slot dist DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
            [bytes MAX_BYTES]
      
      The syntax and use of the distribution table is the same as in the
      non-uniform distribution delay feature. A file DISTRIBUTION must be
      present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
      NETEM_DIST_SCALE. A random value x is selected from the table and it
      takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
      supported.
      
      Examples:
        Normal distribution delay with mean = 800us and stdev = 100us.
        > tc qdisc add dev eth0 root netem slot dist normal 800us 100us
      
        Optionally set the max slot size in bytes and/or packets.
        > tc qdisc add dev eth0 root netem slot dist normal 800us 100us \
          bytes 64k packets 42
      Signed-off-by: NYousuk Seung <ysseung@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a9fe5c3
  18. 08 3月, 2018 1 次提交
    • A
      sch_netem: fix skb leak in netem_enqueue() · 35d889d1
      Alexey Kodanev 提交于
      When we exceed current packets limit and we have more than one
      segment in the list returned by skb_gso_segment(), netem drops
      only the first one, skipping the rest, hence kmemleak reports:
      
      unreferenced object 0xffff880b5d23b600 (size 1024):
        comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
        hex dump (first 32 bytes):
          00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000d8a19b9d>] __alloc_skb+0xc9/0x520
          [<000000001709b32f>] skb_segment+0x8c8/0x3710
          [<00000000c7b9bb88>] tcp_gso_segment+0x331/0x1830
          [<00000000c921cba1>] inet_gso_segment+0x476/0x1370
          [<000000008b762dd4>] skb_mac_gso_segment+0x1f9/0x510
          [<000000002182660a>] __skb_gso_segment+0x1dd/0x620
          [<00000000412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
          [<0000000005d3b2a9>] __dev_queue_xmit+0x1167/0x2120
          [<00000000fc5f7327>] ip_finish_output2+0x998/0xf00
          [<00000000d309e9d3>] ip_output+0x1aa/0x2c0
          [<000000007ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
          [<0000000042d2a45f>] tcp_write_xmit+0x4d4/0x58c0
          [<0000000056a44199>] tcp_tasklet_func+0x3d9/0x540
          [<0000000013d06d02>] tasklet_action+0x1ca/0x250
          [<00000000fcde0b8b>] __do_softirq+0x1b4/0x5a3
          [<00000000e7ed027c>] irq_exit+0x1e2/0x210
      
      Fix it by adding the rest of the segments, if any, to skb 'to_free'
      list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
      because they can be useful in the future if we need to drop segmented
      GSO packets in other places.
      
      Fixes: 6071bd1a ("netem: Segment GSO packets on enqueue")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35d889d1
  19. 08 2月, 2018 1 次提交
    • M
      sch_netem: Bug fixing in calculating Netem interval · 043e337f
      Md. Islam 提交于
      In Kernel 4.15.0+, Netem does not work properly.
      
      Netem setup:
      
      tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms
      
      Result:
      
      PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
      64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms
      64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms
      64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms
      64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms
      64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
      64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms
      64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms
      64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms
      64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms
      64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms
      
      Patch:
      
      (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the
      patch, I found following output which is desirable.
      
      PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
      64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms
      64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms
      64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms
      64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms
      64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms
      64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
      64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms
      64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms
      64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms
      64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms
      64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms
      Reviewed-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      043e337f
  20. 22 12月, 2017 3 次提交
  21. 15 11月, 2017 2 次提交
  22. 13 11月, 2017 3 次提交
    • D
      netem: support delivering packets in delayed time slots · 836af83b
      Dave Taht 提交于
      Slotting is a crude approximation of the behaviors of shared media such
      as cable, wifi, and LTE, which gather up a bunch of packets within a
      varying delay window and deliver them, relative to that, nearly all at
      once.
      
      It works within the existing loss, duplication, jitter and delay
      parameters of netem. Some amount of inherent latency must be specified,
      regardless.
      
      The new "slot" parameter specifies a minimum and maximum delay between
      transmission attempts.
      
      The "bytes" and "packets" parameters can be used to limit the amount of
      information transferred per slot.
      
      Examples of use:
      
      tc qdisc add dev eth0 root netem delay 200us \
               slot 800us 10ms bytes 64k packets 42
      
      A more correct example, using stacked netem instances and a packet limit
      to emulate a tail drop wifi queue with slots and variable packet
      delivery, with a 200Mbit isochronous underlying rate, and 20ms path
      delay:
      
      tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
               limit 10000
      tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
               slot 800us 10ms bytes 64k packets 42 limit 512
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      836af83b
    • D
      netem: add uapi to express delay and jitter in nanoseconds · 99803171
      Dave Taht 提交于
      netem userspace has long relied on a horrible /proc/net/psched hack
      to translate the current notion of "ticks" to nanoseconds.
      
      Expressing latency and jitter instead, in well defined nanoseconds,
      increases the dynamic range of emulated delays and jitter in netem.
      
      It will also ease a transition where reducing a tick to nsec
      equivalence would constrain the max delay in prior versions of
      netem to only 4.3 seconds.
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      99803171
    • D
      netem: convert to qdisc_watchdog_schedule_ns · 112f9cb6
      Dave Taht 提交于
      Upgrade the internal netem scheduler to use nanoseconds rather than
      ticks throughout.
      
      Convert to and from the std "ticks" userspace api automatically,
      while allowing for finer grained scheduling to take place.
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      112f9cb6
  23. 07 10月, 2017 1 次提交
  24. 26 9月, 2017 1 次提交
    • E
      sch_netem: faster rb tree removal · 3aa605f2
      Eric Dumazet 提交于
      While running TCP tests involving netem storing millions of packets,
      I had the idea to speed up tfifo_reset() and did experiments.
      
      I tried the rbtree_postorder_for_each_entry_safe() method that is
      used in skb_rbtree_purge() but discovered it was slower than the
      current tfifo_reset() method.
      
      I measured time taken to release skbs with three occupation levels :
      10^4, 10^5 and 10^6 skbs with three methods :
      
      1) (current 'naive' method)
      
      	while ((p = rb_first(&q->t_root))) {
      		struct sk_buff *skb = netem_rb_to_skb(p);
      
      		rb_erase(p, &q->t_root);
      		rtnl_kfree_skbs(skb, skb);
      	}
      
      2) Use rb_next() instead of rb_first() in the loop :
      
      	p = rb_first(&q->t_root);
      	while (p) {
      		struct sk_buff *skb = netem_rb_to_skb(p);
      
      		p = rb_next(p);
      		rb_erase(&skb->rbnode, &q->t_root);
      		rtnl_kfree_skbs(skb, skb);
      	}
      
      3) "optimized" method using rbtree_postorder_for_each_entry_safe()
      
      	struct sk_buff *skb, *next;
      
      	rbtree_postorder_for_each_entry_safe(skb, next,
      					     &q->t_root, rbnode) {
                     rtnl_kfree_skbs(skb, skb);
      	}
      	q->t_root = RB_ROOT;
      
      Results :
      
      method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  10000 skbs in 541846 ns (54 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)
      
      method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  100000 skbs in 5942456 ns (59 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)
      
      method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
      method_2:rb_first; while (p) { p = rb_next(p); ...}  1000000 skbs in 82619635 ns (82 ns per skb)
      method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)
      
      Method 2) is simply faster, probably because it maintains a smaller
      working size set.
      
      Note that this is the method we use in tcp_ofo_queue() already.
      
      I will also change skb_rbtree_purge() in a second patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3aa605f2
  25. 20 9月, 2017 1 次提交
    • E
      net: sk_buff rbnode reorg · bffa72cf
      Eric Dumazet 提交于
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan using an RB tree for TCP retransmit queue to speedup SACK
      processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bffa72cf
  26. 31 8月, 2017 1 次提交
    • N
      sch_netem: avoid null pointer deref on init failure · 634576a1
      Nikolay Aleksandrov 提交于
      netem can fail in ->init due to missing options (either not supplied by
      user-space or used as a default qdisc) causing a timer->base null
      pointer deref in its ->destroy() and ->reset() callbacks.
      
      Reproduce:
      $ sysctl net.core.default_qdisc=netem
      $ ip l set ethX up
      
      Crash log:
      [ 1814.846943] BUG: unable to handle kernel NULL pointer dereference at (null)
      [ 1814.847181] IP: hrtimer_active+0x17/0x8a
      [ 1814.847270] PGD 59c34067
      [ 1814.847271] P4D 59c34067
      [ 1814.847337] PUD 37374067
      [ 1814.847403] PMD 0
      [ 1814.847468]
      [ 1814.847582] Oops: 0000 [#1] SMP
      [ 1814.847655] Modules linked in: sch_netem(O) sch_fq_codel(O)
      [ 1814.847761] CPU: 3 PID: 1573 Comm: ip Tainted: G           O 4.13.0-rc6+ #62
      [ 1814.847884] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 1814.848043] task: ffff88003723a700 task.stack: ffff88005adc8000
      [ 1814.848235] RIP: 0010:hrtimer_active+0x17/0x8a
      [ 1814.848407] RSP: 0018:ffff88005adcb590 EFLAGS: 00010246
      [ 1814.848590] RAX: 0000000000000000 RBX: ffff880058e359d8 RCX: 0000000000000000
      [ 1814.848793] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880058e359d8
      [ 1814.848998] RBP: ffff88005adcb5b0 R08: 00000000014080c0 R09: 00000000ffffffff
      [ 1814.849204] R10: ffff88005adcb660 R11: 0000000000000020 R12: 0000000000000000
      [ 1814.849410] R13: ffff880058e359d8 R14: 00000000ffffffff R15: 0000000000000001
      [ 1814.849616] FS:  00007f733bbca740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
      [ 1814.849919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1814.850107] CR2: 0000000000000000 CR3: 0000000059f0d000 CR4: 00000000000406e0
      [ 1814.850313] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1814.850518] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1814.850723] Call Trace:
      [ 1814.850875]  hrtimer_try_to_cancel+0x1a/0x93
      [ 1814.851047]  hrtimer_cancel+0x15/0x20
      [ 1814.851211]  qdisc_watchdog_cancel+0x12/0x14
      [ 1814.851383]  netem_reset+0xe6/0xed [sch_netem]
      [ 1814.851561]  qdisc_destroy+0x8b/0xe5
      [ 1814.851723]  qdisc_create_dflt+0x86/0x94
      [ 1814.851890]  ? dev_activate+0x129/0x129
      [ 1814.852057]  attach_one_default_qdisc+0x36/0x63
      [ 1814.852232]  netdev_for_each_tx_queue+0x3d/0x48
      [ 1814.852406]  dev_activate+0x4b/0x129
      [ 1814.852569]  __dev_open+0xe7/0x104
      [ 1814.852730]  __dev_change_flags+0xc6/0x15c
      [ 1814.852899]  dev_change_flags+0x25/0x59
      [ 1814.853064]  do_setlink+0x30c/0xb3f
      [ 1814.853228]  ? check_chain_key+0xb0/0xfd
      [ 1814.853396]  ? check_chain_key+0xb0/0xfd
      [ 1814.853565]  rtnl_newlink+0x3a4/0x729
      [ 1814.853728]  ? rtnl_newlink+0x117/0x729
      [ 1814.853905]  ? ns_capable_common+0xd/0xb1
      [ 1814.854072]  ? ns_capable+0x13/0x15
      [ 1814.854234]  rtnetlink_rcv_msg+0x188/0x197
      [ 1814.854404]  ? rcu_read_unlock+0x3e/0x5f
      [ 1814.854572]  ? rtnl_newlink+0x729/0x729
      [ 1814.854737]  netlink_rcv_skb+0x6c/0xce
      [ 1814.854902]  rtnetlink_rcv+0x23/0x2a
      [ 1814.855064]  netlink_unicast+0x103/0x181
      [ 1814.855230]  netlink_sendmsg+0x326/0x337
      [ 1814.855398]  sock_sendmsg_nosec+0x14/0x3f
      [ 1814.855584]  sock_sendmsg+0x29/0x2e
      [ 1814.855747]  ___sys_sendmsg+0x209/0x28b
      [ 1814.855912]  ? do_raw_spin_unlock+0xcd/0xf8
      [ 1814.856082]  ? _raw_spin_unlock+0x27/0x31
      [ 1814.856251]  ? __handle_mm_fault+0x651/0xdb1
      [ 1814.856421]  ? check_chain_key+0xb0/0xfd
      [ 1814.856592]  __sys_sendmsg+0x45/0x63
      [ 1814.856755]  ? __sys_sendmsg+0x45/0x63
      [ 1814.856923]  SyS_sendmsg+0x19/0x1b
      [ 1814.857083]  entry_SYSCALL_64_fastpath+0x23/0xc2
      [ 1814.857256] RIP: 0033:0x7f733b2dd690
      [ 1814.857419] RSP: 002b:00007ffe1d3387d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [ 1814.858238] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f733b2dd690
      [ 1814.858445] RDX: 0000000000000000 RSI: 00007ffe1d338820 RDI: 0000000000000003
      [ 1814.858651] RBP: ffff88005adcbf98 R08: 0000000000000001 R09: 0000000000000003
      [ 1814.858856] R10: 00007ffe1d3385a0 R11: 0000000000000246 R12: 0000000000000002
      [ 1814.859060] R13: 000000000066f1a0 R14: 00007ffe1d3408d0 R15: 0000000000000000
      [ 1814.859267]  ? trace_hardirqs_off_caller+0xa7/0xcf
      [ 1814.859446] Code: 10 55 48 89 c7 48 89 e5 e8 45 a1 fb ff 31 c0 5d c3
      31 c0 c3 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 8b
      45 30 <4c> 8b 20 41 8b 5c 24 38 31 c9 31 d2 48 c7 c7 50 8e 1d 82 41 89
      [ 1814.860022] RIP: hrtimer_active+0x17/0x8a RSP: ffff88005adcb590
      [ 1814.860214] CR2: 0000000000000000
      
      Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation")
      Fixes: 0fbbeb1b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634576a1
  27. 26 8月, 2017 1 次提交
    • W
      net_sched: remove tc class reference counting · 143976ce
      WANG Cong 提交于
      For TC classes, their ->get() and ->put() are always paired, and the
      reference counting is completely useless, because:
      
      1) For class modification and dumping paths, we already hold RTNL lock,
         so all of these ->get(),->change(),->put() are atomic.
      
      2) For filter bindiing/unbinding, we use other reference counter than
         this one, and they should have RTNL lock too.
      
      3) For ->qlen_notify(), it is special because it is called on ->enqueue()
         path, but we already hold qdisc tree lock there, and we hold this
         tree lock when graft or delete the class too, so it should not be gone
         or changed until we release the tree lock.
      
      Therefore, this patch removes ->get() and ->put(), but:
      
      1) Adds a new ->find() to find the pointer to a class by classid, no
         refcnt.
      
      2) Move the original class destroy upon the last refcnt into ->delete(),
         right after releasing tree lock. This is fine because the class is
         already removed from hash when holding the lock.
      
      For those who also use ->put() as ->unbind(), just rename them to reflect
      this change.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      143976ce
  28. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  29. 14 4月, 2017 1 次提交
  30. 17 3月, 2017 1 次提交
    • N
      netem: apply correct delay when rate throttling · 5080f39e
      Nik Unger 提交于
      I recently reported on the netem list that iperf network benchmarks
      show unexpected results when a bandwidth throttling rate has been
      configured for netem. Specifically:
      
      1) The measured link bandwidth *increases* when a higher delay is added
      2) The measured link bandwidth appears higher than the specified limit
      3) The measured link bandwidth for the same very slow settings varies significantly across
        machines
      
      The issue can be reproduced by using tc to configure netem with a
      512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
      veth pair between network namespaces, and then using iperf (or any
      other network benchmarking tool) to test throughput. Complete detailed
      instructions are in the original email chain here:
      https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
      
      There appear to be two underlying bugs causing these effects:
      
      - The first issue causes long delays when the rate is slow and no
        delay is configured (e.g., "rate 512kbit"). This is because SKBs are
        not orphaned when no delay is configured, so orphaning does not
        occur until *after* the rate-induced delay has been applied. For
        this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
        dramatically increases the measured bandwidth.
      
      - The second issue is that rate-induced delays are not correctly
        applied, allowing SKB delays to occur in parallel. The indended
        approach is to compute the delay for an SKB and to add this delay to
        the end of the current queue. However, the code does not detect
        existing SKBs in the queue due to improperly testing sch->q.qlen,
        which is nonzero even when packets exist only in the
        rbtree. Consequently, new SKBs do not wait for the current queue to
        empty. When packet delays vary significantly (e.g., if packet sizes
        are different), then this also causes unintended reordering.
      
      I modified the code to expect a delay (and orphan the SKB) when a rate
      is configured. I also added some defensive tests that correctly find
      the latest scheduled delivery time, even if it is (unexpectedly) for a
      packet in sch->q. I have tested these changes on the latest kernel
      (4.11.0-rc1+) and the iperf / ping test results are as expected.
      Signed-off-by: NNik Unger <njunger@uwaterloo.ca>
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5080f39e
  31. 09 1月, 2017 1 次提交
    • W
      net-tc: convert tc_from to tc_from_ingress and tc_redirected · bc31c905
      Willem de Bruijn 提交于
      The tc_from field fulfills two roles. It encodes whether a packet was
      redirected by an act_mirred device and, if so, whether act_mirred was
      called on ingress or egress. Split it into separate fields.
      
      The information is needed by the special IFB loop, where packets are
      taken out of the normal path by act_mirred, forwarded to IFB, then
      reinjected at their original location (ingress or egress) by IFB.
      
      The IFB device cannot use skb->tc_at_ingress, because that may have
      been overwritten as the packet travels from act_mirred to ifb_xmit,
      when it passes through tc_classify on the IFB egress path. Cache this
      value in skb->tc_from_ingress.
      
      That field is valid only if a packet arriving at ifb_xmit came from
      act_mirred. Other packets can be crafted to reach ifb_xmit. These
      must be dropped. Set tc_redirected on redirection and drop all packets
      that do not have this bit set.
      
      Both fields are set only on cloned skbs in tc actions, so original
      packet sources do not have to clear the bit when reusing packets
      (notably, pktgen and octeon).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc31c905