提交 · c5ad119fb6c09b0297446be05bd66602fa564758 · openeuler / Kernel

09 12月, 2017 15 次提交

net: sched: pfifo_fast use skb_array · c5ad119f

由 John Fastabend 提交于 12月 07, 2017

This converts the pfifo_fast qdisc to use the skb_array data structure
and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
the lockless bit that can be a child of a qdisc requiring locking. So
we add logic to clear the lock bit on initialization in these cases when
the qdisc graft operation occurs.

This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c5ad119f

net: skb_array: expose peek API · 4a86a4cf

由 John Fastabend 提交于 12月 07, 2017

This adds a peek routine to skb_array.h for use with qdisc.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4a86a4cf

net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio · ce679e8d

由 John Fastabend 提交于 12月 07, 2017

The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce679e8d

net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq · b01ac095

由 John Fastabend 提交于 12月 07, 2017

The sch_mq qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also exports __gnet_stats_copy_queue() to use as a helper function.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b01ac095

net: sched: helpers to sum qlen and qlen for per cpu logic · 7e66016f

由 John Fastabend 提交于 12月 07, 2017

Add qdisc qlen helper routines for lockless qdiscs to use.

The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed along with a helper to return
the summation of per cpu stats.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7e66016f

net: sched: check for frozen queue before skb_bad_txq check · fd8e8d1a

由 John Fastabend 提交于 12月 07, 2017

I can not think of any reason to pull the bad txq skb off the qdisc if
the txq we plan to send this on is still frozen. So check for frozen
queue first and abort before dequeuing either skb_bad_txq skb or
normal qdisc dequeue() skb.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd8e8d1a

net: sched: use skb list for skb_bad_tx · 70e57d5e

由 John Fastabend 提交于 12月 07, 2017

Similar to how gso is handled use skb list for skb_bad_tx this is
required with lockless qdiscs because we may have multiple cores
attempting to push skbs into skb_bad_tx concurrently
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70e57d5e

net: sched: drop qdisc_reset from dev_graft_qdisc · 7bbde83b

由 John Fastabend 提交于 12月 07, 2017

In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
a rcu grace period and call qdisc_rcu_free(). At which point
gso_cpu_skb is free'd along with all stats so no need to zero stats
and gso_cpu_skb from the graft operation itself.

Further after dropping the qdisc locks we can not continue to call
qdisc_reset before waiting an rcu grace period so that the qdisc is
detached from all cpus. By removing the qdisc_reset() here we get
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.

Note, a refcnt greater than 1 would cause the destroy operation to
be aborted however if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7bbde83b

net: sched: explicit locking in gso_cpu fallback · a53851e2

由 John Fastabend 提交于 12月 07, 2017

This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless in the case we
overrun the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once its possible to have multiple
sk_buffs here so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a53851e2

net: sched: a dflt qdisc may be used with per cpu stats · d59f5ffa

由 John Fastabend 提交于 12月 07, 2017

Enable dflt qdisc support for per cpu stats before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d59f5ffa

net: sched: provide per cpu qstat helpers · 40bd0362

由 John Fastabend 提交于 12月 07, 2017

The per cpu qstats support was added with per cpu bstat support which
is currently used by the ingress qdisc. This patch adds a set of
helpers needed to make other qdiscs that use qstats per cpu as well.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40bd0362

net: sched: remove remaining uses for qdisc_qlen in xmit path · 29b86cda

由 John Fastabend 提交于 12月 07, 2017

sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case now where sch_direct_xmit would have returned
a qlen of zero but now it returns true. However in this case all
call sites of sch_direct_xmit will implement a dequeue() and get
a null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29b86cda

net: sched: allow qdiscs to handle locking · 6b3ba914

由 John Fastabend 提交于 12月 07, 2017

This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improving
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6b3ba914

net: sched: cleanup qdisc_run and __qdisc_run semantics · 6c148184

由 John Fastabend 提交于 12月 07, 2017

Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To simplify reading these code paths this patch moves begin/end calls
into qdisc_run().
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c148184

virtio_net: Disable interrupts if napi_complete_done rescheduled napi · fdaa767a

由 Toshiaki Makita 提交于 12月 07, 2017

Since commit 39e6c820 ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32763206      0    6430.32
212992           60.00     23384299           4589.56

Interrupts on guest: 9872369
Packets/interrupt:   2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32794646      0    6436.49
212992           60.00     32793501           6436.27

Interrupts on guest: 4941299
Packets/interrupt:   6.64
Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fdaa767a

08 12月, 2017 11 次提交

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 62cd2770

由 David S. Miller 提交于 12月 08, 2017

Alexei Starovoitov says:

====================
pull-request: bpf-next 2017-12-07

The following pull-request contains BPF updates for your net-next tree.

The main changes are:

1) Detailed documentation of BPF development process from Daniel.

2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
   from Lawrence.

3) Minor follow up for bpf_skb_set_tunnel_key() from William.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

62cd2770

tuntap: fix possible deadlock when fail to register netdev · 124da8f6

由 Jason Wang 提交于 12月 08, 2017

Private destructor could be called when register_netdev() fail with
rtnl lock held. This will lead deadlock in tun_free_netdev() who tries
to hold rtnl_lock. Fixing this by switching to use spinlock to
synchronize.

Fixes: 96f84061 ("tun: add eBPF based queue selection method")
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

124da8f6

Merge branch 'smc-fixes-next' · 66c5c5b5

由 David S. Miller 提交于 12月 07, 2017

Ursula Braun says:

====================
smc: fixes 2017-12-07

here are some smc-patches. The initial 4 patches are cleanups.
Patch 5 gets rid of ib_post_sends in tasklet context to avoid peer drops due
to out-of-order receivals.
Patch 6 makes sure, the Linux SMC code understands variable sized CLC proposal
messages built according to RFC7609.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

66c5c5b5

smc: support variable CLC proposal messages · e7b7a64a