1. 05 May 2020, 3 commits
    • net: partially revert dynamic lockdep key changes · 1a33e10e
      Authored by Cong Wang
      This patch reverts the following commits:
      
      commit 064ff66e
      "bonding: add missing netdev_update_lockdep_key()"
      
      commit 53d37497
      "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"
      
      commit 1f26c0d3
      "net: fix kernel-doc warning in <linux/netdevice.h>"
      
      commit ab92d68f
      "net: core: add generic lockdep keys"
      
      but keeps the addr_list_lock_key because we still lock
      addr_list_lock nestedly on stacked devices. Unlike xmit_lock,
      this is safe because we don't take addr_list_lock on any fast
      path.
      
      Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a33e10e
    • net_sched: sch_fq: add horizon attribute · 39d01050
      Authored by Eric Dumazet
      QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN,
      to efficiently pace UDP packets.
      
      As far as sch_fq is concerned, we need to add safety checks, so
      that a buggy application does not fill the qdisc with packets
      having delivery time far in the future.
      
      This patch adds a configurable horizon (default: 10 seconds),
      and a configurable policy when a packet is beyond the horizon
      at enqueue() time:
      - either drop the packet (default policy)
      - or cap its delivery time to the horizon.
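      A minimal user-space sketch of this enqueue-time check (illustrative
      only; the names and types are assumptions, not the actual sch_fq code):

          /* horizon_check.c - sketch of the horizon policy: drop or cap */
          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>

          #define NSEC_PER_SEC 1000000000ULL

          struct fq_horizon_cfg {
                  uint64_t horizon_ns;   /* default would be 10 * NSEC_PER_SEC */
                  bool horizon_drop;     /* default policy: drop beyond-horizon packets */
          };

          /* Returns true if the packet should be dropped (horizon_drops);
           * otherwise *tstamp_ns may have been capped (horizon_caps). */
          static bool fq_packet_beyond_horizon(const struct fq_horizon_cfg *cfg,
                                               uint64_t now_ns, uint64_t *tstamp_ns)
          {
                  if (*tstamp_ns <= now_ns + cfg->horizon_ns)
                          return false;                   /* within the horizon */
                  if (cfg->horizon_drop)
                          return true;                    /* default: drop */
                  *tstamp_ns = now_ns + cfg->horizon_ns;  /* cap policy */
                  return false;
          }

          int main(void)
          {
                  struct fq_horizon_cfg cfg = { 10 * NSEC_PER_SEC, true };
                  uint64_t now = 0, t = 30 * NSEC_PER_SEC; /* 30s in the future */

                  printf("drop=%d\n", fq_packet_beyond_horizon(&cfg, now, &t));
                  return 0;
          }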
      
      $ tc -s -d qd sh dev eth0
      qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024
       orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit
       refill_delay 40.0ms timer_slack 10.000us horizon 10.000s
       Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6)
       backlog 0b 0p requeues 6
        flows 1191 (inactive 1177 throttled 0)
        gc 0 highprio 0 throttled 692 latency 11.480us
        pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0
      
      v2: fixed an overflow on 32bit kernels in fq_init(), reported
          by kbuild test robot <lkp@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39d01050
    • net: sched: fallback to qdisc noqueue if default qdisc setup fail · bf6dba76
      Authored by Jesper Dangaard Brouer
      Currently if the default qdisc setup/init fails, the device ends up with
      qdisc "noop", which causes all TX packets to get dropped.
      
      With the introduction of sysctl net/core/default_qdisc it is possible
      to change the default qdisc to a more advanced one, which opens up the
      possibility that Qdisc_ops->init() can fail.
      
      This patch detects this kind of failure and falls back to
      qdisc "noqueue", whose init is so simple that it will not fail.
      This allows the interface to continue functioning.
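      A rough user-space sketch of the fallback idea (assumed names; the real
      change presumably lives in the default-qdisc attach path of
      net/sched/sch_generic.c):

          /* noqueue_fallback.c - try the configured default qdisc's init,
           * fall back to "noqueue" if it fails; illustrative only. */
          #include <stdio.h>

          struct fake_qdisc_ops {
                  const char *id;
                  int (*init)(void);      /* 0 on success, negative errno on failure */
          };

          static int fancy_init(void)   { return -12; /* pretend -ENOMEM */ }
          static int noqueue_init(void) { return 0;   /* so simple it cannot fail */ }

          static struct fake_qdisc_ops fancy_ops   = { "fq_codel", fancy_init };
          static struct fake_qdisc_ops noqueue_ops = { "noqueue",  noqueue_init };

          static const struct fake_qdisc_ops *attach_default(void)
          {
                  if (fancy_ops.init() == 0)
                          return &fancy_ops;
                  /* default qdisc init failed: fall back instead of leaving the
                   * device on "noop", which would drop all TX packets */
                  fprintf(stderr, "%s init failed, falling back to %s\n",
                          fancy_ops.id, noqueue_ops.id);
                  noqueue_ops.init();
                  return &noqueue_ops;
          }

          int main(void)
          {
                  printf("attached: %s\n", attach_default()->id);
                  return 0;
          }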
      
      V2:
      As this also captures memory failures, which are transient, the
      device is not kept in IFF_NO_QUEUE state.  This allows the net_device
      to retry the default qdisc assignment.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf6dba76
  2. 04 May 2020, 5 commits
  3. 02 May 2020, 2 commits
    • net: schedule: add action gate offloading · d29bdd69
      Authored by Po Liu
      Add the gate action to the flow action entry. In tc_setup_flow_action(),
      add the gate parameters to the flow_action_entry array that is provided
      to the driver.
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d29bdd69
    • net: qos: introduce a gate control flow action · a51c328d
      Authored by Po Liu
      Introduce an ingress frame gate control flow action.
      The tc gate action works like this:
      assume there is a gate that allows specified ingress frames to be
      passed at specific time slots and dropped at other specific time
      slots. The tc filter selects the ingress frames, and the tc gate
      action specifies in which time slots these frames can be passed to
      the device and in which time slots they are dropped.
      The tc gate action provides an entry list that tells how long the
      gate stays open and how long it stays closed. The gate action also
      assigns a start time that tells when the entry list starts. The
      driver then repeats the gate entry list cyclically.
      For the software simulation, the gate action requires the user to
      assign a clock type.
      
      Below is a configuration example in user space. The tc filter matches
      a stream with source IP address 192.168.0.20, and the gate action owns
      two time slots: one lasts 200ms with the gate open, letting frames
      pass; the other lasts 100ms with the gate closed, dropping frames.
      When the ingress frames reach a total of more than 8000000 bytes, the
      excess frames are dropped within that 200000000ns time slot.
      
      > tc qdisc add dev eth0 ingress
      
      > tc filter add dev eth0 parent ffff: protocol ip \
      	   flower src_ip 192.168.0.20 \
      	   action gate index 2 clockid CLOCK_TAI \
      	   sched-entry open 200000000 -1 8000000 \
      	   sched-entry close 100000000 -1 -1
      
      > tc chain del dev eth0 ingress chain 0
      
      "sched-entry" follow the name taprio style. Gate state is
      "open"/"close". Follow with period nanosecond. Then next item is internal
      priority value means which ingress queue should put. "-1" means
      wildcard. The last value optional specifies the maximum number of
      MSDU octets that are permitted to pass the gate during the specified
      time interval.
      Base-time is not set will be 0 as default, as result start time would
      be ((N + 1) * cycletime) which is the minimal of future time.
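      As a worked example of that rule, a small sketch (assumed names, not
      the kernel implementation) that picks the first cycle start at or
      after "now":

          /* start_time.c - start = base + (N + 1) * cycletime, the smallest
           * such value in the future; illustrative only. */
          #include <stdint.h>
          #include <stdio.h>

          static uint64_t gate_start_time(uint64_t base, uint64_t cycle, uint64_t now)
          {
                  if (base >= now)
                          return base;
                  /* N = number of whole cycles already elapsed since base */
                  uint64_t n = (now - base) / cycle;
                  return base + (n + 1) * cycle;
          }

          int main(void)
          {
                  /* base-time unset (0), cycletime 300000000ns (200ms open +
                   * 100ms close), pretend "now" is 1000000123ns */
                  printf("%llu\n", (unsigned long long)
                         gate_start_time(0, 300000000ULL, 1000000123ULL));
                  return 0;
          }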
      
      The example below filters a stream with destination MAC address
      10:00:80:00:00:00 and IP protocol ICMP, followed by the gate action.
      The gate action runs with a single close time slot, which means the
      gate is always kept closed. The total cycle time is 200000000ns. The
      base-time is used to calculate the start time as:

       1357000000000 + (N + 1) * cycletime

      The smallest such value lying in the future becomes the start time.
      The cycletime here is 200000000ns for this case.
      
      > tc filter add dev eth0 parent ffff:  protocol ip \
      	   flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
      	   action gate index 12 base-time 1357000000000 \
      	   sched-entry close 200000000 -1 -1 \
      	   clockid CLOCK_TAI
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a51c328d
  4. 01 May 2020, 1 commit
  5. 25 April 2020, 1 commit
  6. 24 April 2020, 1 commit
  7. 23 April 2020, 2 commits
    • sched: etf: do not assume all sockets are full blown · a1211bf9
      Authored by Eric Dumazet
      skb->sk does not always point to a full-blown socket; we need to use
      sk_fullsock() before accessing fields which only make sense on a full
      socket.
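      A sketch of the shape of such a guard (kernel context; the helper name
      is made up and this is not the literal patch):

          static bool etf_sock_wants_errors(const struct sk_buff *skb)
          {
                  struct sock *sk = skb->sk;

                  /* skb->sk may be a request or timewait sock; only full
                   * sockets carry the SO_TXTIME error-reporting state. */
                  return sk && sk_fullsock(sk) && sk->sk_txtime_report_errors;
          }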
      
      BUG: KASAN: use-after-free in report_sock_error+0x286/0x300 net/sched/sch_etf.c:141
      Read of size 1 at addr ffff88805eb9b245 by task syz-executor.5/9630
      
      CPU: 1 PID: 9630 Comm: syz-executor.5 Not tainted 5.7.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
       __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
       kasan_report+0x33/0x50 mm/kasan/common.c:625
       report_sock_error+0x286/0x300 net/sched/sch_etf.c:141
       etf_enqueue_timesortedlist+0x389/0x740 net/sched/sch_etf.c:170
       __dev_xmit_skb net/core/dev.c:3710 [inline]
       __dev_queue_xmit+0x154a/0x30a0 net/core/dev.c:4021
       neigh_hh_output include/net/neighbour.h:499 [inline]
       neigh_output include/net/neighbour.h:508 [inline]
       ip6_finish_output2+0xfb5/0x25b0 net/ipv6/ip6_output.c:117
       __ip6_finish_output+0x442/0xab0 net/ipv6/ip6_output.c:143
       ip6_finish_output+0x34/0x1f0 net/ipv6/ip6_output.c:153
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip6_output+0x239/0x810 net/ipv6/ip6_output.c:176
       dst_output include/net/dst.h:435 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip6_xmit+0xe1a/0x2090 net/ipv6/ip6_output.c:280
       tcp_v6_send_synack+0x4e7/0x960 net/ipv6/tcp_ipv6.c:521
       tcp_rtx_synack+0x10d/0x1a0 net/ipv4/tcp_output.c:3916
       inet_rtx_syn_ack net/ipv4/inet_connection_sock.c:669 [inline]
       reqsk_timer_handler+0x4c2/0xb40 net/ipv4/inet_connection_sock.c:763
       call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1405
       expire_timers kernel/time/timer.c:1450 [inline]
       __run_timers kernel/time/timer.c:1774 [inline]
       __run_timers kernel/time/timer.c:1741 [inline]
       run_timer_softirq+0x623/0x1600 kernel/time/timer.c:1787
       __do_softirq+0x26c/0x9f7 kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x192/0x1d0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:546 [inline]
       smp_apic_timer_interrupt+0x19e/0x600 arch/x86/kernel/apic/apic.c:1140
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
       </IRQ>
      RIP: 0010:des_encrypt+0x157/0x9c0 lib/crypto/des.c:792
      Code: 85 22 06 00 00 41 31 dc 41 8b 4d 04 44 89 e2 41 83 e4 3f 4a 8d 3c a5 60 72 72 88 81 e2 3f 3f 3f 3f 48 89 f8 48 c1 e8 03 31 d9 <0f> b6 34 28 48 89 f8 c1 c9 04 83 e0 07 83 c0 03 40 38 f0 7c 09 40
      RSP: 0018:ffffc90003b5f6c0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff10e4e55 RBX: 00000000d2f846d0 RCX: 00000000d2f846d0
      RDX: 0000000012380612 RSI: ffffffff839863ca RDI: ffffffff887272a8
      RBP: dffffc0000000000 R08: ffff888091d0a380 R09: 0000000000800081
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
      R13: ffff8880a8ae8078 R14: 00000000c545c93e R15: 0000000000000006
       cipher_crypt_one crypto/cipher.c:75 [inline]
       crypto_cipher_encrypt_one+0x124/0x210 crypto/cipher.c:82
       crypto_cbcmac_digest_update+0x1b5/0x250 crypto/ccm.c:830
       crypto_shash_update+0xc4/0x120 crypto/shash.c:119
       shash_ahash_update+0xa3/0x110 crypto/shash.c:246
       crypto_ahash_update include/crypto/hash.h:547 [inline]
       hash_sendmsg+0x518/0xad0 crypto/algif_hash.c:102
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:672
       ____sys_sendmsg+0x308/0x7e0 net/socket.c:2362
       ___sys_sendmsg+0x100/0x170 net/socket.c:2416
       __sys_sendmmsg+0x195/0x480 net/socket.c:2506
       __do_sys_sendmmsg net/socket.c:2535 [inline]
       __se_sys_sendmmsg net/socket.c:2532 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2532
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x45c829
      Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f6d9528ec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000004fc080 RCX: 000000000045c829
      RDX: 0000000000000001 RSI: 0000000020002640 RDI: 0000000000000004
      RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000008d7 R14: 00000000004cb7aa R15: 00007f6d9528f6d4
      
      Fixes: 4b15c707 ("net/sched: Make etf report drops on error_queue")
      Fixes: 25db26a9 ("net/sched: Introduce the ETF Qdisc")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1211bf9
    • net/sched: act_ct: update nf_conn_acct for act_ct SW offload in flowtable · beb97d3a
      Authored by wenxu
      With act_ct SW offload in the flowtable, the counter of the conntrack
      entry is never updated. So update the nf_conn_acct counter in the
      act_ct flowtable software offload path.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      beb97d3a
  8. 08 April 2020, 1 commit
  9. 04 April 2020, 1 commit
  10. 02 April 2020, 1 commit
    • net_sched: add a temporary refcnt for struct tcindex_data · 304e0242
      Authored by Cong Wang
      Although we intentionally use an ordered workqueue for all tc
      filter works, the ordering is not guaranteed by RCU work,
      given that tcf_queue_work() is essentially a call_rcu().
      
      This problem is demonstrated by Thomas:
      
        CPU 0:
          tcf_queue_work()
            tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
      
        -> Migration to CPU 1
      
        CPU 1:
           tcf_queue_work(&p->rwork, tcindex_destroy_work);
      
      so the 2nd work could be queued before the 1st one, which leads
      to a use-after-free.
      
      Enforcing this order in RCU work is hard, as it would require changing
      RCU code too. Fortunately we can work around this problem in the
      tcindex filter by taking a temporary refcnt: we only take the refcnt
      right before we begin to destroy it. This simplifies the code a lot,
      as a full refcnt would require many more changes in tcindex_set_parms().
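      A user-space sketch of that temporary-refcnt pattern (made-up names,
      not the tcindex code): each pending destroy work holds one reference,
      and whichever runs last frees the shared data, so their order no
      longer matters.

          /* tmp_refcnt.c - "last put frees" pattern; illustrative only */
          #include <stdio.h>
          #include <stdlib.h>

          struct shared_data {
                  int refcnt;
                  /* ... filter state ... */
          };

          static void shared_get(struct shared_data *d) { d->refcnt++; }

          static void shared_put(struct shared_data *d)
          {
                  if (--d->refcnt == 0) {
                          free(d);
                          puts("freed");
                  }
          }

          /* both deferred works drop one reference when they finish */
          static void destroy_rexts_work(struct shared_data *d) { shared_put(d); }
          static void destroy_work(struct shared_data *d)       { shared_put(d); }

          int main(void)
          {
                  struct shared_data *d = calloc(1, sizeof(*d));

                  /* right before destruction starts, take one ref per pending work */
                  shared_get(d);
                  shared_get(d);

                  /* the works may run in either order; d stays valid for both */
                  destroy_work(d);
                  destroy_rexts_work(d);
                  return 0;
          }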
      
      Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
      Fixes: 3d210534 ("net_sched: fix a race condition in tcindex_destroy()")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      304e0242
  11. 31 March 2020, 3 commits
    • bpf: Add socket assign support · cf7fbe66
      Authored by Joe Stringer
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (eg by CIDR):
      
        $ ip route add local default dev lo
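      A minimal BPF classifier sketch using the new helper (illustrative; the
      tuple below is a placeholder that would normally be filled from the
      parsed packet headers):

          /* sk_assign_example.c - build with clang -O2 -target bpf -c */
          #include <linux/bpf.h>
          #include <linux/pkt_cls.h>
          #include <bpf/bpf_helpers.h>

          SEC("classifier")
          int assign_to_proxy(struct __sk_buff *skb)
          {
                  struct bpf_sock_tuple tuple = {};  /* placeholder tuple */
                  struct bpf_sock *sk;

                  /* look up the (possibly non-local) listener to steer to */
                  sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                         BPF_F_CURRENT_NETNS, 0);
                  if (!sk)
                          return TC_ACT_OK;

                  /* the helper takes its own reference to sk for this skb */
                  bpf_sk_assign(skb, sk, 0);
                  bpf_sk_release(sk);
                  return TC_ACT_OK;
          }

          char _license[] SEC("license") = "GPL";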
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66
    • net: sched: expose HW stats types per action used by drivers · 93a129eb
      Authored by Jiri Pirko
      It may be up to the driver (in case ANY HW stats is passed) to select
      which type of HW stats it is going to use. Add infrastructure to
      expose this information to the user.
      
      $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
      $ tc -s filter show dev enp3s0np1 ingress
      filter protocol ip pref 1 flower chain 0
      filter protocol ip pref 1 flower chain 0 handle 0x1
        eth_type ipv4
        dst_ip 192.168.1.1
        in_hw in_hw_count 2
              action order 1: gact action drop
               random type none pass val 0
               index 1 ref 1 bind 1 installed 10 sec used 10 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats immediate     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93a129eb
    • net: introduce nla_put_bitfield32() helper and use it · 8953b077
      Authored by Jiri Pirko
      Introduce a helper that takes a value and a selector, packs them into
      a struct, and puts it into a netlink message.
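      A sketch of what such a helper might look like (kernel context; an
      assumption about its shape, not necessarily the exact patch):

          static inline int nla_put_bitfield32(struct sk_buff *skb, int attrtype,
                                               __u32 value, __u32 selector)
          {
                  struct nla_bitfield32 tmp = { value, selector, };

                  return nla_put(skb, attrtype, sizeof(tmp), &tmp);
          }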
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8953b077
  12. 28 March 2020, 1 commit
  13. 27 March 2020, 5 commits
  14. 26 March 2020, 1 commit
    • net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build · 2c64605b
      Authored by Pablo Neira Ayuso
      net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
          net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
            pkt->skb->tc_redirected = 1;
                    ^~
          net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
            pkt->skb->tc_from_ingress = 1;
                    ^~
      
      To avoid a direct dependency with tc actions from netfilter, wrap the
      redirect bits around CONFIG_NET_REDIRECT and move helpers to
      include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
      only existing client of these bits in the tree.
      
      This patch adds skb_set_redirected(), which sets the redirected bit
      on the skbuff; it specifies whether the packet was redirected from
      ingress and resets the timestamp (the timestamp reset was originally
      missing in the netfilter bugfix).
      
      Fixes: bcfabee1 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
      Reported-by: noreply@ellerman.id.au
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c64605b
  15. 25 March 2020, 1 commit
    • net: cbs: Fix software cbs to consider packet sending time · 961d0e5b
      Authored by Zh-yuan Ye
      Currently the software CBS does not consider the packet sending time
      when depleting the credits. It caused the throughput to be
      Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|) where
      Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope
      is expected. In order to fix the issue above, this patch takes the time
      when the packet sending completes into account by moving the anchor time
      variable "last" ahead to the send completion time upon transmission and
      adding wait when the next dequeue request comes before the send
      completion time of the previous packet.
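      A small user-space sketch of the send-completion accounting idea
      (assumed names and units, not the sch_cbs code):

          /* cbs_send_time.c - move the anchor "last" to the send completion
           * time and make the next dequeue wait for it; illustrative only. */
          #include <stdint.h>
          #include <stdio.h>

          #define NSEC_PER_SEC 1000000000ULL

          /* time, in ns, the packet occupies the wire at port_rate bytes/sec */
          static uint64_t transmit_time_ns(uint64_t len_bytes, uint64_t port_rate)
          {
                  if (port_rate == 0)        /* guard, see the V2->V3 note below */
                          return 0;
                  return len_bytes * NSEC_PER_SEC / port_rate;
          }

          int main(void)
          {
                  uint64_t last = 0;         /* previous send completion time */
                  uint64_t now  = 1000;      /* pretend clock, ns */
                  uint64_t len  = 1500;      /* bytes */
                  uint64_t rate = 125000000; /* 1 Gbit/s expressed in bytes/sec */

                  /* a dequeue arriving before "last" has to wait */
                  if (now < last)
                          printf("wait %llu ns\n", (unsigned long long)(last - now));

                  /* advance the anchor to when this packet finishes sending */
                  last = (now > last ? now : last) + transmit_time_ns(len, rate);
                  printf("last = %llu ns\n", (unsigned long long)last);
                  return 0;
          }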
      
      changelog:
      V2->V3:
       - remove unnecessary whitespace cleanup
       - add the checks if port_rate is 0 before division
      
      V1->V2:
       - combine variable "send_completed" into "last"
       - add the comment for estimate of the packet sending
      
      Fixes: 585d763a ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
      Signed-off-by: Zh-yuan Ye <ye.zh-yuan@socionext.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      961d0e5b
  16. 24 March 2020, 1 commit
  17. 20 March 2020, 1 commit
  18. 19 March 2020, 2 commits
    • net: sched: Fix hw_stats_type setting in pedit loop · 2c4b58dc
      Authored by Petr Machata
      In the commit referenced below, hw_stats_type of an entry is set for every
      entry that corresponds to a pedit action. However, the assignment is only
      done after the entry pointer is bumped, and therefore could overwrite
      memory outside of the entries array.
      
      The reason for this positioning may have been that the current entry's
      hw_stats_type is already set above, before the action-type dispatch.
      However, if there are no more actions, the assignment is wrong. And if
      there are, the next round of the for_each_action loop will make the
      assignment before the action-type dispatch anyway.
      
      Therefore fix this issue by simply reordering the two lines.
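      Schematically (illustrative, following the description above rather
      than the literal diff):

          /* before (buggy): pointer advanced first, so the write can land one
           * element past the last valid entry */
          entry++;
          entry->hw_stats_type = hw_stats_type;

          /* after (fixed): set the type on the current entry, then advance */
          entry->hw_stats_type = hw_stats_type;
          entry++;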
      
      Fixes: 74522e7b ("net: sched: set the hw_stats_type in pedit loop")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c4b58dc
    • net/sched: act_ct: Fix leak of ct zone template on replace · dd2af104
      Authored by Paul Blakey
      Currently, on replace, the previous action instance's params are
      swapped with newly allocated params. The old params are only freed
      (via kfree_rcu), without releasing the allocated ct zone template
      related to them.
      
      Call tcf_ct_params_free (via call_rcu) for the old params, so that
      the ct zone template is released as well.
      
      Fixes: b57dc7c1 ("net/sched: Introduce action ct")
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd2af104
  19. 18 March 2020, 4 commits
    • net_sched: sch_fq: enable use of hrtimer slack · 583396f4
      Authored by Eric Dumazet
      Add a new attribute to control the fq qdisc hrtimer slack.
      
      Default is set to 10 usec.
      
      When/if packets are throttled, fq sets up an hrtimer that can
      lead to one interrupt per packet in the throttled queue.
      
      By using a timer slack, we allow better use of timer interrupts,
      by giving them a chance to call multiple timer callbacks
      at each hardware interrupt.
      
      Also, giving a slack allows FQ to dequeue batches of packets
      instead of a single one, thus increasing xmit_more efficiency.
      
      This has no negative effect on the rate a TCP flow can sustain,
      since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns)
      
      v2: added strict netlink checking (as feedback from Jakub Kicinski)
      
      Tested:
       1000 concurrent flows all using paced packets.
       1,000,000 packets sent per second.
      
      Before the patch :
      
      $ vmstat 2 10
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 60726784  23628 3485992    0    0   138     1  977  535  0 12 87  0  0
       0  0      0 60714700  23628 3485628    0    0     0     0 1568827 26462  0 22 78  0  0
       1  0      0 60716012  23628 3485656    0    0     0     0 1570034 26216  0 22 78  0  0
       0  0      0 60722420  23628 3485492    0    0     0     0 1567230 26424  0 22 78  0  0
       0  0      0 60727484  23628 3485556    0    0     0     0 1568220 26200  0 22 78  0  0
       2  0      0 60718900  23628 3485380    0    0     0    40 1564721 26630  0 22 78  0  0
       2  0      0 60718096  23628 3485332    0    0     0     0 1562593 26432  0 22 78  0  0
       0  0      0 60719608  23628 3485064    0    0     0     0 1563806 26238  0 22 78  0  0
       1  0      0 60722876  23628 3485236    0    0     0   130 1565874 26566  0 22 78  0  0
       1  0      0 60722752  23628 3484908    0    0     0     0 1567646 26247  0 22 78  0  0
      
      After the patch, slack of 10 usec, we can see a reduction of interrupts
      per second, and a small decrease of reported cpu usage.
      
      $ vmstat 2 10
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       1  0      0 60722564  23628 3484728    0    0   133     1  696  545  0 13 87  0  0
       1  0      0 60722568  23628 3484824    0    0     0     0 977278 25469  0 20 80  0  0
       0  0      0 60716396  23628 3484764    0    0     0     0 979997 25326  0 20 80  0  0
       0  0      0 60713844  23628 3484960    0    0     0     0 981394 25249  0 20 80  0  0
       2  0      0 60720468  23628 3484916    0    0     0     0 982860 25062  0 20 80  0  0
       1  0      0 60721236  23628 3484856    0    0     0     0 982867 25100  0 20 80  0  0
       1  0      0 60722400  23628 3484456    0    0     0     8 982698 25303  0 20 80  0  0
       0  0      0 60715396  23628 3484428    0    0     0     0 981777 25176  0 20 80  0  0
       0  0      0 60716520  23628 3486544    0    0     0    36 978965 27857  0 21 79  0  0
       0  0      0 60719592  23628 3486516    0    0     0    22 977318 25106  0 20 80  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      583396f4
    • net_sched: do not reprogram a timer about to expire · b88948fb
      Authored by Eric Dumazet
      qdisc_watchdog_schedule_range_ns() can use the newly added slack
      and avoid rearming the hrtimer a bit earlier than the current
      value. This patch has no effect if delta_ns parameter
      is zero.
      
      Note that this means the max slack is potentially doubled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b88948fb
    • net_sched: add qdisc_watchdog_schedule_range_ns() · efe074c2
      Authored by Eric Dumazet
      Some packet schedulers might want to add a slack
      when programming hrtimers. This can reduce number
      of interrupts and increase batch sizes and thus
      give good xmit_more savings.
      
      This commit adds qdisc_watchdog_schedule_range_ns()
      helper, with an extra delta_ns parameter.
      
      Legacy qdisc_watchdog_schedule_ns() becomes an inline
      passing a zero slack.
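      The legacy wrapper would then look roughly like this (sketch, kernel
      context):

          static inline void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd,
                                                        u64 expires)
          {
                  /* no slack: keep the old behaviour */
                  qdisc_watchdog_schedule_range_ns(wd, expires, 0ULL);
          }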
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      efe074c2
    • net: rename flow_action_hw_stats_types* -> flow_action_hw_stats* · 53eca1f3
      Authored by Jakub Kicinski
      flow_action_hw_stats_types_check() helper takes one of the
      FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
      the arguments to the opening bracket of the helper there
      is no way to call this helper and stay under 80 characters.
      
      Remove the "types" part from the new flow_action helpers
      and enum values.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53eca1f3
  20. 16 March 2020, 2 commits
    • net: sched: set the hw_stats_type in pedit loop · 74522e7b
      Authored by Jiri Pirko
      For a single pedit action, multiple offload entries may be used. Set the
      hw_stats_type to all of them.
      
      Fixes: 44f86580 ("sched: act: allow user to specify type of HW stats for a filter")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74522e7b
    • net_sched: cls_route: remove the right filter from hashtable · ef299cc3
      Authored by Cong Wang
      route4_change() allocates a new filter and copies values from
      the old one. After the new filter is inserted into the hash
      table, the old filter should be removed and freed, as the final
      step of the update.
      
      However, the current code mistakenly removes the new one. This
      looks apparently wrong to me, and it causes double "free" and
      use-after-free too, as reported by syzbot.
      
      Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com
      Fixes: 1109c005 ("net: sched: RCU cls_route")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef299cc3
  21. 15 March 2020, 1 commit
    • net: sched: RED: Introduce an ECN nodrop mode · 0a7fad23
      Authored by Petr Machata
      When the RED Qdisc is currently configured to enable ECN, the RED algorithm
      is used to decide whether a certain SKB should be marked. If that SKB is
      not ECN-capable, it is early-dropped.
      
      It is also possible to keep all traffic in the queue, and just mark the
      ECN-capable subset of it, as appropriate under the RED algorithm. Some
      switches support this mode, and some installations make use of it.
      
      To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
      configured with this flag, non-ECT traffic is enqueued instead of being
      early-dropped.
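      A compact user-space sketch of the resulting enqueue-time policy
      (illustrative decision table, not the sch_red code):

          /* red_nodrop.c - where TC_RED_NODROP changes the outcome */
          #include <stdbool.h>
          #include <stdio.h>

          enum verdict { ENQUEUE, MARK_AND_ENQUEUE, EARLY_DROP };

          static enum verdict red_decide(bool algo_wants_mark, bool ecn_mode,
                                         bool skb_is_ect, bool nodrop)
          {
                  if (!algo_wants_mark)
                          return ENQUEUE;            /* RED accepted the packet */
                  if (ecn_mode && skb_is_ect)
                          return MARK_AND_ENQUEUE;   /* mark ECN-capable traffic */
                  if (nodrop)
                          return ENQUEUE;            /* TC_RED_NODROP: keep non-ECT */
                  return EARLY_DROP;                 /* classic behaviour */
          }

          int main(void)
          {
                  /* non-ECT packet, ECN mode on, nodrop on -> enqueued */
                  printf("%d\n", red_decide(true, true, false, true));
                  return 0;
          }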
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a7fad23