Commits · cf44012810ccdd8fd947518e965cb04b7b8498be · openeuler / Kernel

24 Feb, 2016 14 commits

mac80211: fix unnecessary frame drops in mesh fwding · cf440128

Committed by Michal Kazior on Jan 25, 2016

ieee80211_queue_stopped() expects a hw queue
number but it was given a raw WMM AC number instead.

This could cause frame drops and problems with
traffic in some cases, most notably if the driver
doesn't map AC numbers to queue numbers 1:1 and
uses only ieee80211_stop_queues() and
ieee80211_wake_queue() without ever calling
ieee80211_wake_queues().

On ath10k it was possible to hit this problem in
the following case:

  1. wlan0 uses queue 0
     (ath10k maps queues per vif)
  2. offchannel uses queue 15
  3. queues 1-14 are unused
  4. ieee80211_stop_queues()
  5. ieee80211_wake_queue(q=0)
  6. ieee80211_wake_queue(q=15)
     (other queues are not woken up because both
      driver and mac80211 know other queues are
      unused)
  7. ieee80211_rx_h_mesh_fwding()
  8. ieee80211_select_queue_80211() returns 2
  9. ieee80211_queue_stopped(q=2) returns true
 10. frame is dropped (oops!)
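
The sequence above can be reproduced in a small user-space model. This is only a sketch under stated assumptions: `struct vif_model`, `struct hw_model`, `hw_queue_stopped()` and the two `fwd_blocked_*()` functions are hypothetical names, not mac80211 APIs. The only facts taken from the commit message are that the stopped check expects a hw queue number and that a driver may map AC numbers to hw queues in a non-1:1 way.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ACS 4 /* WMM access categories */

/* Hypothetical per-vif table: the hw queue the driver assigned to each AC. */
struct vif_model {
    uint8_t hw_queue[NUM_ACS];
};

/* Hypothetical device state: bit n set means hw queue n is stopped. */
struct hw_model {
    uint32_t stopped_mask;
};

static bool hw_queue_stopped(const struct hw_model *hw, unsigned int queue)
{
    return (hw->stopped_mask >> queue) & 1u;
}

/* Buggy check: passes the raw AC number where a hw queue number is expected. */
static bool fwd_blocked_buggy(const struct hw_model *hw,
                              const struct vif_model *vif, unsigned int ac)
{
    (void)vif; /* the per-vif mapping is ignored, which is the bug */
    return hw_queue_stopped(hw, ac);
}

/* Fixed check: translates the AC to the vif's hw queue before asking. */
static bool fwd_blocked_fixed(const struct hw_model *hw,
                              const struct vif_model *vif, unsigned int ac)
{
    return hw_queue_stopped(hw, vif->hw_queue[ac]);
}
```

With all ACs of the vif mapped to hw queue 0, all 16 queues stopped, and only queues 0 and 15 woken, the buggy check reports AC 2 as blocked while the fixed check does not.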

Fixes: d3c1597b ("mac80211: fix forwarded mesh frame queue mapping")
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

cf440128

mac80211: fix txq queue related crashes · 2a58d42c

Committed by Michal Kazior on Jan 21, 2016

The driver can access the queue simultaneously
while mac80211 tears down the interface. Without
spinlock protection this could lead to corrupting
sk_buff_head and subsequently to an invalid
pointer dereference.

Fixes: ba8c3d6f ("mac80211: add an intermediate software queue implementation")
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

2a58d42c

mac80211: mesh_plink: remove redundant sta_info check · b8631c00

Committed by Sunil Shahu on Jan 21, 2016

Remove the unnecessary "if" statement and merge it into the previous "if" block.
Signed-off-by: Sunil Shahu <shshahu@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

b8631c00

rfkill: Remove obsolete "claim" sysfs interface · e2a35e89

Committed by João Paulo Rechi Vita on Jan 19, 2016

This was scheduled to be removed in 2012 by:

 commit 69c86373
 Author: florian@mickler.org <florian@mickler.org>
 Date:   Wed Feb 24 12:05:16 2010 +0100

     Document the rfkill sysfs ABI

     This moves sysfs ABI info from Documentation/rfkill.txt to the
     ABI subfolder and reformats it.

     This also schedules the deprecated sysfs parts to be removed in
     2012 (claim file) and 2014 (state file).
Signed-off-by: Florian Mickler <florian@mickler.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

e2a35e89

rfkill: remove/inline __rfkill_set_hw_state · 1926e260

Committed by João Paulo Rechi Vita on Jan 19, 2016

__rfkill_set_hw_state() is only used in rfkill_set_hw_state(), and
neither of them is long or complicated, so merging the two makes the code
easier to read.
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

1926e260

rfkill: use variable instead of duplicating the expression · f3e7fae2

Committed by João Paulo Rechi Vita on Jan 19, 2016

The RFKILL_BLOCK_SW value has just been saved to prev, so there is no need to
check it again in the if expression. This makes the code a little easier to read.
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

f3e7fae2

cfg80211: Fix some linguistics in Kconfig · 573a2b51

Committed by Ola Olsson on Jan 10, 2016

Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

573a2b51

rfkill: disentangle polling pause and suspend · dd21dfc6

Committed by Johannes Berg on Jan 20, 2016

When suspended while polling is paused, polling will erroneously
resume at resume time. Fix this by tracking pause and suspend in
separate state variables and adding the necessary checks.

Clarify the documentation on this as well.
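
As an illustration of keeping the two conditions apart, here is a minimal user-space sketch. It assumes a hypothetical `struct poll_state` and helper names that do not come from the rfkill code; the point is only that polling runs when neither flag is set, so a system resume cannot undo an explicit pause.

```c
#include <stdbool.h>

/* Hypothetical state: pause and suspend are tracked independently. */
struct poll_state {
    bool paused;    /* polling was paused explicitly */
    bool suspended; /* the system is suspended */
    bool running;   /* polling work is currently active */
};

/* Polling may only run when it is neither paused nor suspended. */
static void poll_update(struct poll_state *s)
{
    s->running = !s->paused && !s->suspended;
}

static void poll_pause(struct poll_state *s)    { s->paused = true;     poll_update(s); }
static void poll_unpause(struct poll_state *s)  { s->paused = false;    poll_update(s); }
static void system_suspend(struct poll_state *s) { s->suspended = true;  poll_update(s); }
static void system_resume(struct poll_state *s)  { s->suspended = false; poll_update(s); }
```

With a single shared flag, `system_resume()` would restart polling unconditionally; with two flags, a pause issued before suspend is still in effect afterwards.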
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

dd21dfc6

mac80211: refactor HT/VHT to chandef code · 8ac3c704

Committed by Johannes Berg on Dec 18, 2015

The station MLME and IBSS/mesh ones use entirely different
code for interpreting HT and VHT operation elements. Change
the code that interprets them a bit - it now modifies an
existing chandef - and use it also in the MLME code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

8ac3c704

cfg80211: add more warnings for inconsistent ops · de3bb771

Committed by Ola Olsson on Dec 16, 2015

Print a warning whenever an expected callback function
lacks an implementation.
Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

de3bb771

mac80211: limit the A-MSDU Tx based on peer's capabilities · 506bcfa8

Committed by Emmanuel Grumbach on Dec 13, 2015

In VHT, the specification allows limiting the number of
MSDUs in an A-MSDU in the Extended Capabilities IE. There
is also a limitation on the byte size in the VHT IE.
In HT, the only limitation is on the byte size.
Parse the capabilities from the peer and make them
available to the driver.

In HT, there is another limitation when a BA agreement
is active: the byte size can't be greater than 4095.
This is not enforced here.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

506bcfa8

mac80211: Recalc min chandef when station is associated · a7201a6c

Committed by Ilan Peer on Dec 13, 2015

The minimum chandef bandwidth calculation was done only in case
a new station was inserted (or when an existing station was removed).
However, it is possible that stations are inserted before they are
associated, e.g., when FULL_AP_CLIENT_STATE is supported and user
space adds stations unassociated.

Fix this by calling ieee80211_recalc_min_chandef() whenever
a station transitions into or out of the associated state, and only
consider stations marked as associated.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

a7201a6c

mac80211: allow drivers to report (non-)monitor frames · 17883048

Committed by Grzegorz Bajorski on Dec 11, 2015

Some drivers offload some frames internally (e.g.
AddBa). Reporting such frames to mac80211 would
only confuse MLME. However it would be useful to
be able to pass such frames to monitor interfaces
for sniffing purposes, e.g. when running AP +
monitor.

To do that, allow drivers to tell mac80211 whether
a given frame should be:
 - processed but not delivered to any monitor vif
 - not processed but delivered to monitor vifs
   only
Signed-off-by: Grzegorz Bajorski <grzegorz.bajorski@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

17883048

mac80211: support hw managing reorder logic · 412a6d80

Committed by Sara Sharon on Dec 08, 2015

Enable the driver to manage the reordering logic itself.
This is needed, for example, for the iwlwifi driver, which
will support hardware-assisted reordering.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

412a6d80

14 Jan, 2016 6 commits

mac80211: pass block ack session timeout to driver · 50ea05ef

Committed by Sara Sharon on Dec 30, 2015

Currently mac80211 does not inform the driver of the block
ack session timeout when starting an RX aggregation session.
Drivers that manage the reorder buffer need to know this
parameter.
Seeing that there are now too many arguments for the
drv_ampdu_action() function, wrap them inside a structure.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

50ea05ef

cfg80211/mac80211: use to_delayed_work · a85a7e28

Committed by Geliang Tang on Jan 01, 2016

Use to_delayed_work() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

a85a7e28

mac80211: pass RX aggregation window size to driver · fad47186

Committed by Sara Sharon on Dec 08, 2015

Currently mac80211 does not inform the driver of the window
size when starting an RX aggregation session.
To enable managing the reorder buffer in the driver or hardware,
the window size is needed.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

fad47186

mac80211: add flag for duplication check · f9cfa5f3

Committed by Sara Sharon on Dec 08, 2015

Add an option for the driver to check for packet duplication
by itself.
This is needed, for example, by the iwlwifi driver, which
parallelizes the RX path and does the duplication check
per queue.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

f9cfa5f3

mac80211: process and save VHT MU-MIMO group frame · 23a1f8d4

Committed by Sara Sharon on Dec 08, 2015

The Group ID Management frame is an Action frame of
category VHT. It is transmitted by the AP to assign
or change the user position of a STA for one or more
group IDs.
Process and save the group membership data. Notify
the underlying driver of changes.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

23a1f8d4

cfg80211: remove CFG80211_REG_DEBUG · c799ba6e

Committed by Johannes Berg on Dec 11, 2015

Instead of having this Kconfig option, which just *floods* the
kernel log,
 * remove the per-channel prints that are fairly useless anyway
 * convert the conditional printing to pr_debug()
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

c799ba6e

13 Jan, 2016 4 commits

net: bpf: reject invalid shifts · 229394e8

Committed by Rabin Vincent on Jan 12, 2016

On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shift
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
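
The validity rule stated above (a constant shift must be non-negative and smaller than the register size) can be written as a one-line predicate. This is a stand-alone sketch: `shift_imm_valid()` is a hypothetical helper name, not the function used in the verifier or the classic BPF checker.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a constant shift amount is acceptable for a register of
 * `regsize` bits (32 or 64). Negative values and values >= regsize are
 * rejected, matching the rule described in the commit message.
 */
static bool shift_imm_valid(int32_t imm, unsigned int regsize)
{
    return imm >= 0 && (uint32_t)imm < regsize;
}
```

Rejecting such programs at load time means no JIT ever has to encode an out-of-range shift.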
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

229394e8

net: netlink: Fix multicast group storage allocation for families with more than one group · ccdf6ce6

Committed by Matti Vaittinen on Jan 11, 2016

Multicast groups are stored in a global buffer. The check for the needed buffer size
incorrectly compares the buffer size to the first id for the family. This means that
for families with more than one mcast id one may allocate too small a buffer
and end up writing the rest of the groups to unallocated memory. Fix the
buffer size check to compare the allocated space to the last mcast id for the
family.

Tested on ARM using kernel 3.14
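
The difference between the two size checks can be shown with plain arithmetic. The sketch below is a simplified model, assuming a buffer with `capacity` slots indexed by group id; the function and parameter names are hypothetical and do not mirror the netlink source.

```c
#include <stdbool.h>

/* Buggy check: only the first id of the family is compared to the capacity. */
static bool needs_grow_buggy(unsigned int first_id, unsigned int n_groups,
                             unsigned int capacity)
{
    (void)n_groups; /* the remaining ids are never considered */
    return first_id >= capacity;
}

/* Fixed check: the whole id range [first_id, first_id + n_groups) must fit. */
static bool needs_grow_fixed(unsigned int first_id, unsigned int n_groups,
                             unsigned int capacity)
{
    return first_id + n_groups > capacity;
}
```

For a capacity of 64, a family with first id 62 and 4 groups passes the buggy check even though ids 64 and 65 fall outside the buffer.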
Signed-off-by: Matti Vaittinen <matti.vaittinen@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ccdf6ce6

net: bpf: reject invalid shifts · 06928b38

Committed by Rabin Vincent on Jan 12, 2016

On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shift
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

06928b38

phonet: properly unshare skbs in phonet_rcv() · 7aaed57c

Committed by Eric Dumazet on Jan 12, 2016

Ivaylo Dimitrov reported a regression caused by commit 7866a621
("dev: add per net_device packet type chains").

skb->dev becomes NULL and we crash in __netif_receive_skb_core().

Before the above commit, different kinds of bugs or corruption could happen
without a major crash.

But the root cause is that phonet_rcv() can queue an skb without checking
whether the skb is shared.

Many thanks to Ivaylo Dimitrov for his help, diagnosis and tests.
Reported-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Tested-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7aaed57c

12 Jan, 2016 7 commits

udp: disallow UFO for sockets with SO_NO_CHECK option · 40ba3302

Committed by Michal Kubeček on Jan 11, 2016

Commit acf8dd0a ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same for SOCK_DGRAM sockets with the SO_NO_CHECK option, albeit
for a slightly different reason: while such a socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
a bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, the SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with the UDP_NO_CHECK6_TX option.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

40ba3302

net: pktgen: fix null ptr deref in skb allocation · 3de03596

Committed by John Fastabend on Jan 10, 2016

Fix a possible null pointer dereference that may occur when calling
skb_reserve() on a null skb.

Fixes: 879c7220 ("net: pktgen: Observe needed_headroom of the device")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3de03596

bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key · c6c33454

Committed by Daniel Borkmann on Jan 11, 2016

After IPv6 support has recently been added to metadata dst and related
encaps, add support for populating/reading it from an eBPF program.

Commit d3aa45ce ("bpf: add helpers to access tunnel metadata") started
with initial IPv4-only support back then (due to IPv6 metadata support
not being available yet).

To stay compatible with older programs, we need to test for the passed
structure size. Also TOS and TTL support from the ip_tunnel_info key has
been added. Tested with vxlan devs in collect meta data mode with IPv4,
IPv6 and in compat mode over different network namespaces.
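
The idea of testing the passed structure size to stay compatible with older programs can be sketched as follows. Everything here is a simplified assumption: `struct tunnel_key_v1`, `struct tunnel_key_v2` and `copy_tunnel_key()` are hypothetical stand-ins, not the UAPI `struct bpf_tunnel_key` or the helper implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical old layout: IPv4 only. */
struct tunnel_key_v1 {
    uint32_t tunnel_id;
    uint32_t remote_ipv4;
};

/* Hypothetical new layout: adds IPv6, TOS and TTL at the end. */
struct tunnel_key_v2 {
    uint32_t tunnel_id;
    uint32_t remote_ipv4;
    uint32_t remote_ipv6[4];
    uint8_t  tunnel_tos;
    uint8_t  tunnel_ttl;
};

/*
 * Copy the key into a caller buffer of `size` bytes. The size tells us which
 * layout the caller was built against; unknown sizes are rejected.
 */
static int copy_tunnel_key(void *dst, size_t size,
                           const struct tunnel_key_v2 *src)
{
    if (size == sizeof(struct tunnel_key_v2)) {
        memcpy(dst, src, sizeof(*src));
        return 0;
    }
    if (size == sizeof(struct tunnel_key_v1)) {
        /* Compat path: fill only the fields the old layout knows about. */
        struct tunnel_key_v1 old = {
            .tunnel_id = src->tunnel_id,
            .remote_ipv4 = src->remote_ipv4,
        };
        memcpy(dst, &old, sizeof(old));
        return 0;
    }
    return -EINVAL;
}
```

An older caller passing the smaller size keeps working unchanged, while a newer caller gets the additional fields.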
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6c33454

bpf: export helper function flags and reject invalid ones · 781c53bc

Committed by Daniel Borkmann on Jan 11, 2016

Export flags used by eBPF helper functions through UAPI, so they can be
used by programs (instead of them redefining all flags each time or just
using the hard-coded values). It also gives a better overview of which flags
are used where, and we can further get rid of the extra macros defined in
filter.c. Moreover, reject invalid flags.
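
Rejecting invalid flags usually comes down to masking against the set of known bits. The sketch below assumes two made-up flag names (`HELPER_F_RECOMPUTE_CSUM`, `HELPER_F_INVALIDATE_HASH`) and a hypothetical `helper_check_flags()`; they only illustrate the pattern and are not the exported UAPI definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits accepted by a helper. */
#define HELPER_F_RECOMPUTE_CSUM  (1ULL << 0)
#define HELPER_F_INVALIDATE_HASH (1ULL << 1)
#define HELPER_F_ALLOWED \
    (HELPER_F_RECOMPUTE_CSUM | HELPER_F_INVALIDATE_HASH)

/* Any bit outside the allowed mask makes the call fail with -EINVAL. */
static int helper_check_flags(uint64_t flags)
{
    if (flags & ~HELPER_F_ALLOWED)
        return -EINVAL;
    return 0;
}
```

Failing on unknown bits keeps them free for future use, since no existing program can come to depend on them being silently ignored.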
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

781c53bc

sched,cls_flower: set key address type when present · 66530bdf

Committed by Jamal Hadi Salim on Jan 10, 2016

Only when user space passes the addresses should we consider their
presence.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

66530bdf

tcp_yeah: don't set ssthresh below 2 · 83d15e70

Committed by Neal Cardwell on Jan 11, 2016

For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).

tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction was as big as or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
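
The arithmetic behind both failure modes, and the effect of a floor of 2, can be checked in isolation. This is a user-space sketch with hypothetical function names; it models only the subtraction and the clamp, not the rest of tcp_yeah.

```c
#include <stdint.h>

/* Without a floor: the unsigned result wraps when reduction > cwnd. */
static uint32_t ssthresh_no_floor(uint32_t cwnd, uint32_t reduction)
{
    return cwnd - reduction;
}

/* With a floor of 2: compute in a wider signed type, then clamp. */
static uint32_t ssthresh_with_floor(uint32_t cwnd, uint32_t reduction)
{
    int64_t v = (int64_t)cwnd - (int64_t)reduction;

    return v > 2 ? (uint32_t)v : 2u;
}
```

With cwnd = 10 and reduction = 12, the unclamped version yields 4294967294 (the "near 4 billion" case), and with reduction = 10 it yields 0; the clamped version returns 2 in both cases.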

Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

83d15e70

sctp: fix use-after-free in pr_debug statement · 649621e3

Committed by Marcelo Ricardo Leitner on Jan 08, 2016

Dmitry Vyukov reported a use-after-free in the code expanded by the
macro debug_post_sfx, which is caused by the use of the asoc pointer
after it was freed within sctp_side_effect() scope.

This patch fixes it by allowing sctp_side_effect to clear that asoc
pointer when the TCB is freed.

As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT case
because it will trigger DELETE_TCB too on that same loop.

Also, there were places issuing SCTP_CMD_INIT_FAILED and ASSOC_FAILED
but returning SCTP_DISPOSITION_CONSUME, which would fool the scheme
above. Fix it by returning SCTP_DISPOSITION_ABORT instead.

The macro is already prepared to handle such NULL pointer.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

649621e3

11 Jan, 2016 9 commits

net/rtnetlink: remove unused sz_idx variable · 617cfc75

Committed by Alexander Kuleshov on Jan 10, 2016

The sz_idx variable is defined in rtnetlink_rcv_msg(), but
not used anywhere. Let's remove it.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

617cfc75

unix: properly account for FDs passed over unix sockets · 712f4aad

Committed by Willy Tarreau on Jan 10, 2016

It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.
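
The accounting idea can be modelled with a per-user counter that is raised when descriptors are queued on a socket and lowered when they are received. The sketch below is an assumption-laden illustration: `struct user_model`, `send_fds()` and `receive_fds()` are hypothetical, and the exact point at which the limit is compared may differ from the kernel change.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-user accounting record. */
struct user_model {
    unsigned long inflight; /* FDs currently queued on unix sockets */
    unsigned long fd_limit; /* the user's configured FD limit */
    bool privileged;        /* privileged users are exempt from the check */
};

/* Queue n descriptors; refuse if that would exceed the user's limit. */
static int send_fds(struct user_model *u, unsigned long n)
{
    if (!u->privileged && u->inflight + n > u->fd_limit)
        return -ETOOMANYREFS;
    u->inflight += n;
    return 0;
}

/* Descriptors that are received (or dropped) leave the in-flight count. */
static void receive_fds(struct user_model *u, unsigned long n)
{
    u->inflight -= n;
}
```

Because the count follows the user rather than the open file table, closing the local copies after sending no longer hides the descriptors from the limit.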

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

712f4aad

openvswitch: update kernel doc for struct vport · c5420eb1

Committed by Jean Sacren on Jan 09, 2016

commit be4ace6e ("openvswitch: Move dev pointer into vport itself")

The commit above added @dev and moved @rcu to the bottom of struct
vport, but the change was not reflected in the kernel doc. So let's
update the kernel doc as well.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5420eb1

openvswitch: fix struct geneve_port member name · 2f7066ad

Committed by Jean Sacren on Jan 09, 2016

commit 6b001e68 ("openvswitch: Use Geneve device.")

The commit above introduced 'port_no' as the name for the member of
struct geneve_port. The correct name should be 'dst_port' as described
in the kernel doc. Let's fix that member name and all the pertinent
instances so that both doc and code would be consistent.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f7066ad

openvswitch: clean up unused function · 5ea03042

Committed by Jean Sacren on Jan 09, 2016

commit 6b001e68 ("openvswitch: Use Geneve device.")

The commit above deleted the only call site of ovs_tunnel_route_lookup()
and now that function is not used any more. So let's delete the function
definition as well.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

5ea03042

ipv6: tcp: add rcu locking in tcp_v6_send_synack() · 3e4006f0

Committed by Eric Dumazet on Jan 08, 2016

When the first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock().

Fixes: 45f6fad8 ("ipv6: add complete rcu protection around np->opt")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e4006f0

net: add scheduling point in recvmmsg/sendmmsg · a78cb84c

Committed by Eric Dumazet on Jan 08, 2016

Applications often have to reduce the number of datagrams
they receive or send per system call to avoid starvation problems.

Really the kernel should take care of this by using cond_resched(),
so that applications can experiment with bigger batch sizes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a78cb84c

ipv6: always flag an address that failed DAD with DADFAILED · 3d171f39

Committed by Lubomir Rintel on Jan 08, 2016

Userspace needs to know why the address is being removed so that it can
perhaps obtain a new address.

Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d171f39

net, sched: add clsact qdisc · 1f211a1b

Committed by Daniel Borkmann on Jan 07, 2016

This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to the prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid an extra cacheline miss for moderate loads). The egress
part is modelled similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both ingress and egress classifiers can be configured
as in the below example. ingress qdisc supports attaching classifiers to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. An alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, and the latter would require
registering the ingress qdisc once again under a different alias. So this really calls
for a minimal, cleaner approach: clsact gets Qdisc_ops and Qdisc_class_ops
of its own that share the callbacks used by both.

Example, adding qdisc:

# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc. works analogously by specifying ingress/egress):

# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f211a1b
