Commits · a33a4c73810589f80b8a37477e1b28b4c1d61913 · openanolis / cloud-kernel

10 Nov, 2016 37 commits

sfc: enable 4-tuple RSS hashing for UDP · a33a4c73

Authored by Edward Cree on Nov 03, 2016

This improves UDP spreading, and also slightly improves GRO performance
of encapsulated TCP on 7000 series NICs.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
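
Illustration (not part of the commit): a minimal Python sketch of why hashing on the 4-tuple can spread UDP traffic better than hashing on addresses alone. It does not reproduce the sfc driver or the NIC's real hash; `zlib.crc32` is a stand-in, and `pick_queue`, `NUM_QUEUES`, the addresses and the ports are all made up for the example.

```python
import zlib

NUM_QUEUES = 8  # hypothetical number of RX queues


def pick_queue(src_ip, dst_ip, src_port=None, dst_port=None):
    """Map a flow to an RX queue; ports are hashed only when given (4-tuple)."""
    key = f"{src_ip}>{dst_ip}"
    if src_port is not None and dst_port is not None:
        key += f":{src_port}>{dst_port}"
    # crc32 is only a stand-in for the hardware hash function.
    return zlib.crc32(key.encode()) % NUM_QUEUES


# 64 UDP flows between the same two hosts, differing only in source port.
flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 4789) for i in range(64)]

# Addresses only: every flow produces the same key, hence the same queue.
two_tuple_queues = {pick_queue(s, d) for s, d, _, _ in flows}

# Addresses plus ports: the flows can land on different queues.
four_tuple_queues = {pick_queue(s, d, sp, dp) for s, d, sp, dp in flows}
```

With address-only hashing all flows between two hosts share one queue; including the ports lets them fan out.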

Merge branch 'mlx5-SRIOV-offload-tunnel_key-set-release' · 04b206b8

Authored by David S. Miller on Nov 09, 2016

Saeed Mahameed says:

====================
Mellanox 100G SRIOV offloads tunnel_key set/release

From Hadar Hen Zion:

This series further enhances the SRIOV TC offloads of mlx5 to handle the
TC tunnel_key release and set actions.

This serves a common use-case in virtualization systems where the virtual
switch encapsulates packets (tunnel_key set action) sent from VMs with
outer headers corresponding to the local/remote host IPs and decapsulates
(tunnel_key release) the outer headers before the packets are received by the
VM.

We use the new E-Switch switchdev mode and TC tunnel_key set/release
action to achieve that also in SW defined SRIOV environments by
offloading TC rules that contain these actions along with forwarding
(TC mirred/redirect action) the packets.

The first six patches are adding the needed support in flow dissector,
flower and tc for offloading tunnel_key actions:
    - The first three patches are adding the needed help functions
      and enums
    - The next three patches in the series are adding UDP port attribute
      to tunnel_key release and set actions.

The addition of UDP ports would allow the HW driver to make sure they are
given (say) a VXLAN tunnel to offload (mlx5e uses that).

Patches 7-10 are mlx5 preparations for tunnel_key actions offloads support.

Patch #11 adds mlx5e support to offload tunnel_key release action, and the
last two patches (#12-13) add mlx5e support to tc tunnel_key set action.

Currently, in order to offload the tc tunnel_key release action, the tc rule
should be placed on top of the mlx5e offloading (uplink) interface instead
of the shared tunnel interface. The resolution from the tunnel interface
to the HW netdevice will be implemented in a follow-up series.

This series was generated against commit
94edc86b ("Merge branch 'dwmac-sti-refactor-cleanup'")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Add basic TC tunnel set action for SRIOV offloads · a54e20b4

Authored by Hadar Hen Zion on Nov 07, 2016

In mlx5 HW, encapsulation is offloaded by the steering rule having an
index into an encapsulation table containing the entire set of headers
to be added by the HW. The driver sets these headers in a buffer when we
are offloading the action.

The code maintains an mlx5_encap_entry for each encap header it has
encountered when attempting to offload a TC tunnel set action.

This entry maintains a linked list of all the flows sharing the same
encap header; when the last flow is removed from the list, the encap
entry is removed.

The actual encap_header is allocated by the driver in the hardware only
if we have layer two neighbour info when the encap entry is created.
While the flow is in the driver, the driver holds a reference on the
neighbour.

When a new flow with encap action is inserted, the code first checks if
the required encap entry exists according to the tunnel set parameters.
If it does the encap is shared, otherwise a new mlx5_encap_entry is
created.

TC action parsing implementation in the driver assumes that the tunnel set
action is provided in the same order set by the user, e.g. before the
mirred_redirect action.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
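
Illustration (not part of the commit): the sharing scheme described above, where flows with the same tunnel set parameters reuse one encap entry and the entry goes away with its last flow, can be sketched in Python as follows. This is a simplified model, not the mlx5e code; `EncapTable`, `attach`, `detach` and the parameter names are hypothetical, and neighbour handling and hardware allocation are left out.

```python
from dataclasses import dataclass, field


@dataclass
class EncapEntry:
    key: tuple                                  # normalized tunnel parameters
    flows: list = field(default_factory=list)   # flows sharing this header


class EncapTable:
    def __init__(self):
        self.entries = {}

    def attach(self, flow_id, tunnel_params):
        """Reuse the entry matching tunnel_params, or create a new one."""
        key = tuple(sorted(tunnel_params.items()))
        entry = self.entries.get(key)
        if entry is None:
            entry = EncapEntry(key)
            self.entries[key] = entry
        entry.flows.append(flow_id)
        return entry

    def detach(self, flow_id, entry):
        """Drop the flow; remove the entry once no flow uses it."""
        entry.flows.remove(flow_id)
        if not entry.flows:
            del self.entries[entry.key]


table = EncapTable()
vxlan = {"src_ip": "192.0.2.1", "dst_ip": "192.0.2.2", "dst_port": 4789, "key": 42}
e1 = table.attach("flow-a", vxlan)
e2 = table.attach("flow-b", dict(vxlan))  # same parameters, so the entry is shared
```

Detaching "flow-a" leaves the entry in place for "flow-b"; detaching "flow-b" as well empties the table.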

net/mlx5e: Add ndo_udp_tunnel_add to VF representors · 4a25730e

Authored by Hadar Hen Zion on Nov 07, 2016

By implementing this ndo, the host stack will set the vxlan udp port
also to VF representor netdevices. This will allow the TC offload code
in the driver when it gets a tunnel key set action to identify the UDP
port as vxlan, and hence the rule will be a candidate for offloading.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Add TC tunnel release action for SRIOV offloads · bbd00f7e

Authored by Hadar Hen Zion on Nov 07, 2016

Enhance the parsing of offloaded TC rules to set HW matching on outer
(encapsulation) headers.
Parse TC tunnel release action and set it as mlx5 decap action when the
required capabilities are supported.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Support encap id when setting new steering entry · 66958ed9

Authored by Hadar Hen Zion on Nov 07, 2016

In order to support steering rules which add encapsulation headers,
encap_id parameter is needed.

Add a new mlx5_flow_act struct which holds the action-related parameters:
action, flow_tag and encap_id. Use the mlx5_flow_act struct when adding a new
steering rule.
This patch doesn't change any functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Add creation flags when adding new flow table · c9f1b073

Authored by Hadar Hen Zion on Nov 07, 2016

When creating flow tables, allow the caller to specify creation flags.
Currently no flags are used and as such this patch doesn't add any new
functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Check max encap header size capability · 43f93839

Authored by Hadar Hen Zion on Nov 07, 2016

Instead of comparing to a const value, check the value of max encap
header size capability as reported by the Firmware.

Fixes: 575ddf58 ('net/mlx5: Introduce alloc_encap and dealloc_encap commands')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Move alloc/dealloc encap commands declarations to common header file · ae9f83ac

Authored by Hadar Hen Zion on Nov 07, 2016

The alloc and dealloc encap commands will be used in the mlx5e driver,
as such, declare them in a common header file.

Also, rename the functions: mlx5_cmd_{de}alloc_encap is replaced with
mlx5_encap_{de}alloc.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: act_tunnel_key: Add UDP dst port option · 75bfbca0

Authored by Hadar Hen Zion on Nov 07, 2016

The current tunnel set action supports only IP addresses and key
options. Add UDP dst port option.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/dst: Add dst port to dst_metadata utility functions · 24ba898d

Authored by Hadar Hen Zion on Nov 07, 2016

Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: cls_flower: Add UDP port to tunnel parameters · f4d997fd

Authored by Hadar Hen Zion on Nov 07, 2016

The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: cls_flower: Allow setting encapsulation fields as used key · 519d1052

Authored by Hadar Hen Zion on Nov 07, 2016

When encapsulation field is set, mark it as used key for the flow
dissector. This will be used by offloading drivers.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add enums for encapsulation keys · 9ba6a9a9

Authored by Hadar Hen Zion on Nov 07, 2016

New encapsulation keys were added to the flower classifier, which allow
classification according to outer (encapsulation) headers attributes
such as key and IP addresses.
In order to expose those attributes outside flower, add
corresponding enums in the flow dissector.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_key · 9ce183b4

Authored by Hadar Hen Zion on Nov 07, 2016

Needed for drivers to pick the relevant action when offloading tunnel
key act.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: add missing check for uid_range in rule_exists. · 35b80733

Authored by Lorenzo Colitti on Nov 07, 2016

Without this check, it is not possible to create two rules that
are identical except for their UID ranges. For example:

root@net-test:/# ip rule add prio 1000 lookup 300
root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300
RTNETLINK answers: File exists
root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100
root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200
root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100
RTNETLINK answers: File exists

Tested: https://android-review.googlesource.com/#/c/299980/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
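
Illustration (not part of the commit): the effect of the missing check can be modelled with a small Python sketch that replays the five `ip rule add` requests from the transcript above. The kernel's rule_exists() compares more fields than this; `Rule`, `add_all` and the `check_uid_range` switch are hypothetical names used only to contrast the two behaviours.

```python
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    prio: int
    table: int
    uid_range: Optional[Tuple[int, int]] = None


def rule_exists(rules, new, check_uid_range):
    """Return True if an equivalent rule is already installed."""
    for r in rules:
        if r.prio != new.prio or r.table != new.table:
            continue
        # The check the commit describes as missing: differing UID ranges
        # mean the rules are not duplicates.
        if check_uid_range and r.uid_range != new.uid_range:
            continue
        return True
    return False


def add_all(requests, check_uid_range):
    rules, results = [], []
    for req in requests:
        if rule_exists(rules, req, check_uid_range):
            results.append("File exists")
        else:
            rules.append(req)
            results.append("ok")
    return results


requests = [
    Rule(1000, 300),
    Rule(1000, 300, (100, 200)),
    Rule(1000, 100, (100, 199)),
    Rule(1000, 200, (200, 299)),
    Rule(1000, 100, (300, 399)),
]
before = add_all(requests, check_uid_range=False)
after = add_all(requests, check_uid_range=True)
```

Without the UID range comparison, the second and fifth requests are rejected, as in the transcript; with it, all five are accepted.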

qed: Prevent stack corruption on MFW interaction · bb480242

Authored by Mintz, Yuval on Nov 06, 2016

The driver uses a union for copying data to and from the management firmware
when interacting with it.
The problem is that the function always copies sizeof(union), while commit
2edbff8d ("qed: Learn resources from management firmware") casts
a union element which is of smaller size [24 bytes instead of 88 bytes].

Also, the union contains some inappropriate elements which increase its
size [should have been 32-bytes]. While this shouldn't corrupt other
PF messages to the MFW [as management firmware enforces permissions so
that each PF is allowed to write only to its own mailbox] we fix this
here as well.

Fixes: 2edbff8d ("qed: Learn resources from management firmware")
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
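
Illustration (not part of the commit): the size mismatch behind this bug can be shown with Python's `ctypes`. The layouts below are invented so that the sizes match the 24 and 88 bytes quoted in the message; `ResourceInfo`, `LargestMember`, `MailboxData` and `safe_copy_len` are hypothetical names, and the sketch does not reproduce how the driver was actually fixed.

```python
import ctypes


class ResourceInfo(ctypes.Structure):
    """Hypothetical 24-byte member (6 x 32-bit words)."""
    _fields_ = [("words", ctypes.c_uint32 * 6)]


class LargestMember(ctypes.Structure):
    """Hypothetical 88-byte member (22 x 32-bit words)."""
    _fields_ = [("words", ctypes.c_uint32 * 22)]


class MailboxData(ctypes.Union):
    # A union is as large as its largest member.
    _fields_ = [("resource", ResourceInfo), ("largest", LargestMember)]


def safe_copy_len(dst_type):
    """Number of bytes that fit in a dst_type buffer when copying from the union."""
    return min(ctypes.sizeof(dst_type), ctypes.sizeof(MailboxData))


# Copying sizeof(union) into a buffer sized for the small member would
# write this many bytes past its end.
overrun = ctypes.sizeof(MailboxData) - ctypes.sizeof(ResourceInfo)
```

A copy of `sizeof(MailboxData)` bytes into a stack variable sized for the 24-byte member overruns it by 64 bytes, which is the kind of stack corruption the title refers to.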

net: 3com: typhoon: fix typhoon_get_link_ksettings · b12ab9b1

Authored by Philippe Reynes on Nov 06, 2016

When moving from typhoon_get_settings to typhoon_get_link_ksettings
in commit f7a5537c ("net: 3com: typhoon: use new api
ethtool_{get|set}_link_ksettings"), we used a local variable supported
but we forgot to update the struct ethtool_link_ksettings with
this value.

We also initialize advertising to zero, because otherwise it may
be uninitialized if no case of the switch (tp->xcvr_select) is used.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: xgbe: use new api ethtool_{get|set}_link_ksettings · 90fdd04e

Authored by Philippe Reynes on Nov 06, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: amd: pcnet32: use new api ethtool_{get|set}_link_ksettings · ea74df81

Authored by Philippe Reynes on Nov 06, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: amd8111e: use new api ethtool_{get|set}_link_ksettings · 1435003c

Authored by Philippe Reynes on Nov 05, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: alteon: acenic: use new api ethtool_{get|set}_link_ksettings · d17970d7

Authored by Philippe Reynes on Nov 05, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Acked-by: Jes Sorensen <Jes.Sorensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: adaptec: starfire: use new api ethtool_{get|set}_link_ksettings · f1cd5aa0

Authored by Philippe Reynes on Nov 05, 2016

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-dwmac-rk-PM' · 35887d32

Authored by David S. Miller on Nov 09, 2016

Joachim Eastwood says:

====================
stmmac: dwmac-rk: convert to standard PM/remove functions

This patch set aims to remove the init/exit callbacks from the
dwmac-rk driver and instead use standard PM callbacks. Eventually
the init/exit callbacks will be deprecated and removed from all
drivers dwmac-* except for dwmac-generic. Drivers will be refactored
to use standard PM and remove callbacks.

This conversion was pretty straightforward, but it would be really nice
if some chromium people could test suspend/resume with this patch set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "net: stmmac: allow to split suspend/resume from init/exit callbacks" · 5a3c7805

Authored by Joachim Eastwood on Nov 05, 2016

Instead of adding hooks inside stmmac_platform it is better to just use
the standard PM callbacks within the specific dwmac-driver. This is only
used by the dwmac-rk driver.

This reverts commit cecbc556 ("stmmac: allow to split suspend/resume
from init/exit callbacks").
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-rk: absorb rk_gmac_init into probe · 07a5e769

Authored by Joachim Eastwood on Nov 05, 2016

Since the rk_gmac_init() only calls another function move this
function call into probe so rk_gmac_init() can be removed.

Since commit cecbc556 ("stmmac: allow to split suspend/resume
from init/exit callbacks") the init hook is no longer used in
dwmac-rk so this can be removed.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-rk: turn exit into standard driver remove callback · 0de8c4c9

Authored by Joachim Eastwood on Nov 05, 2016

Convert the exit hook into a standard driver remove function as
the hook doesn't really buy us anything extra.

Eventually the exit hook will be deprecated in favor of the driver
remove function.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-rk: turn resume/suspend into standard PM callbacks · 5619468a

Authored by Joachim Eastwood on Nov 05, 2016

Use standard PM resume/suspend callbacks instead of the hooks in
stmmac_platform. This gives the driver more control and flexibility
when implementing PM functionality. The hooks in stmmac_platform
also don't buy us anything extra.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp_get_info-locking' · c68d7f1b

Authored by David S. Miller on Nov 09, 2016

Eric Dumazet says:

====================
tcp: tcp_get_info() locking changes

This short series prepares tcp_get_info() for more detailed information.

In order to not slow down the fast path, our goal is to use the normal
socket spinlock instead of custom synchronization.

All we need to ensure is that tcp_get_info() is not called with the
ehash lock held, which might deadlock, since packet processing would acquire
the spinlocks in the reverse order.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: no longer hold ehash lock while calling tcp_get_info() · 67db3e4b

Authored by Eric Dumazet on Nov 04, 2016

We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.

We would like to add more instrumentation points for TCP, and
avoiding grabbing the socket lock in tcp_get_info() was too costly.

Being able to lock the socket allows us to provide a consistent set
of fields.

inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.

We can remove syncp added in commit d654976c
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: shortcut listeners in tcp_get_info() · ccbf3bfa

Authored by Eric Dumazet on Nov 04, 2016

Being lockless in tcp_get_info() is hard, because we need to add
specific synchronization in TCP fast path, like seqcount.

The following patch will change inet_diag_dump_icsk() to no longer
hold any lock for non listeners, so that we can properly acquire the
socket lock in tcp_get_info() and let it return more consistent counters.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Meson-GXL-internal-phy' · 721ad321

Authored by David S. Miller on Nov 09, 2016

Neil Armstrong says:

====================
ARM64: Add Internal PHY support for Meson GXL

The Amlogic Meson GXL SoCs have an internal RMII PHY that is muxed with the
external RGMII pins.

In order to support switching between the two PHY links, extended register
size support for mdio-mux-mmioreg must be added.

The DT related patches submitted as RFC in [3] will be sent in a separate
patchset due to multiple patchsets and DTSI migrations.

Changes since v2 RFC patchset at : [3]
 - Change phy Kconfig/Makefile alphabetic order
 - GXL dtsi cleanup

Changes since original RFC patchset at : [2]
 - Remove meson8b experimental phy switching
 - Switch to mdio-mux-mmioreg with extended size support
 - Add internal phy support for S905x and p231
 - Add external PHY support for p230

[1] http://lkml.kernel.org/r/1477932286-27482-1-git-send-email-narmstrong@baylibre.com
[2] http://lkml.kernel.org/r/1477060838-14164-1-git-send-email-narmstrong@baylibre.com
[3] http://lkml.kernel.org/r/1477932987-27871-1-git-send-email-narmstrong@baylibre.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add Meson GXL Internal PHY driver · 7334b3e4

Authored by Neil Armstrong on Nov 04, 2016

Add driver for the Internal RMII PHY found in the Amlogic Meson GXL SoCs.

This PHY seems to only implement some standard registers and needs some
workarounds to provide autoneg values from vendor registers.

Some magic values are currently used to configure the PHY, and this is a
temporary setup until clarification about these register names and
register fields is provided by Amlogic.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mdio-mux-mmioreg: Add support for 16bit and 32bit register sizes · 9a4c8037

Authored by Neil Armstrong on Nov 04, 2016

In order to support PHY switching on Amlogic GXL SoCs, add support for
16-bit and 32-bit register sizes.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
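
Illustration (not part of the commit): a register-based mux switch with a configurable register width can be sketched as a masked read-modify-write. The Python below models memory as a `bytearray` and is not the mdio-mux-mmioreg driver; `mux_switch`, `FORMATS` and the little-endian layout are assumptions made for the example.

```python
import struct

# struct formats for 8-, 16- and 32-bit little-endian registers
# (little-endian is an assumption of this sketch).
FORMATS = {1: "<B", 2: "<H", 4: "<I"}


def mux_switch(mem, offset, size, mask, value):
    """Set the masked bits of a size-byte register to value; return the new contents."""
    fmt = FORMATS[size]  # raises KeyError for unsupported sizes
    if mask >> (8 * size):
        raise ValueError("mask does not fit in the register")
    (old,) = struct.unpack_from(fmt, mem, offset)
    new = (old & ~mask) | (value & mask)
    if new != old:
        # Write back only when the selected bits actually change.
        struct.pack_into(fmt, mem, offset, new)
    return new


# 32-bit register at offset 4: change only the low byte.
mem32 = bytearray(8)
struct.pack_into("<I", mem32, 4, 0xFFFF0000)
result32 = mux_switch(mem32, 4, 4, 0x000000FF, 0x12)

# 16-bit register at offset 0: select a value in bits 8-9.
mem16 = bytearray(2)
result16 = mux_switch(mem16, 0, 2, 0x0300, 0x0100)
```

The same function serves all three widths because only the access format changes; the mask logic is identical.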

Merge branch 'rds-tcp-fixes' · ddc5e157

Authored by David S. Miller on Nov 09, 2016

Sowmini Varadhan says:

====================
RDS: TCP: bug fixes

A couple of bug fixes identified during testing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: TCP: start multipath acceptor loop at 0 · 117d15bb

Authored by Sowmini Varadhan on Nov 04, 2016

The for() loop in rds_tcp_accept_one() assumes that the 0'th
rds_tcp_conn_path is UP and starts multipath accepts at index 1.
But this assumption may not always be true: if the 0'th path
has failed (ERROR or DOWN state) an incoming connection request
should be used to resurrect this path.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
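
Illustration (not part of the commit): the effect of the loop's starting index can be sketched in Python. This is a simplified model of path selection, not the RDS code; `pick_path` and the state strings are hypothetical.

```python
UP, DOWN, ERROR = "UP", "DOWN", "ERROR"


def pick_path(path_states, start):
    """Return the index of the first path at or after start that needs a
    connection (DOWN or ERROR), or None if every such path is UP."""
    for i in range(start, len(path_states)):
        if path_states[i] in (DOWN, ERROR):
            return i
    return None


# Path 0 has failed while the other paths are still up.
states = [ERROR, UP, UP, UP]

skipping_zero = pick_path(states, start=1)   # path 0 is never considered
from_zero = pick_path(states, start=0)       # path 0 can be resurrected
```

Starting at index 1 finds no usable path even though path 0 is waiting for a new connection; starting at 0 selects it.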

RDS: TCP: report addr/port info based on TCP socket in rds-info · 1ac507d4

Authored by Sowmini Varadhan on Nov 04, 2016

The socket argument passed to rds_tcp_tc_info() is a PF_RDS socket,
so it is incorrect to report the address port info based on
rds_getname() as part of TCP state report.

Invoke inet_getname() for the t_sock associated with the
rds_tcp_connection instead.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Nov, 2016 3 commits

sock: do not set sk_err in sock_dequeue_err_skb · f5f99309

Authored by Soheil Hassas Yeganeh on Nov 03, 2016

Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'IFF_NO_QUEUE-semantics' · 5f7f7502

Authored by David S. Miller on Nov 07, 2016

Jesper Dangaard Brouer says:

====================
qdisc and tx_queue_len cleanups for IFF_NO_QUEUE devices

This patchset is a cleanup for IFF_NO_QUEUE devices.  It will
hopefully help userspace get a more consistent behavior when attaching
qdisc to such virtual devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device · 84c46dd8

Authored by Jesper Dangaard Brouer on Nov 03, 2016

It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc; this was caught and fixed in [2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole and
forgot to reset the tx_queue_len prior to attaching a qdisc, let's
catch the misconfiguration on the kernel side.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
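
Illustration (not part of the commit): why inheriting a zero tx_queue_len is a problem can be seen in a toy Python model of a packet FIFO whose limit is copied from the device. This is not kernel code; `Pfifo`, `attach_qdisc`, the `guard` switch and `ASSUMED_DEFAULT_LEN` are hypothetical, and the guard shown is only one possible way to catch the misconfiguration.

```python
from collections import deque

ASSUMED_DEFAULT_LEN = 1000  # assumption for illustration only


class Pfifo:
    """Toy packet FIFO whose limit is copied from the device's tx_queue_len."""

    def __init__(self, tx_queue_len):
        self.limit = tx_queue_len
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, pkt):
        if len(self.queue) < self.limit:
            self.queue.append(pkt)
            return True
        self.dropped += 1
        return False


def attach_qdisc(tx_queue_len, guard):
    """Attach a Pfifo; with guard set, a zero length is replaced by a default."""
    if guard and tx_queue_len == 0:
        tx_queue_len = ASSUMED_DEFAULT_LEN
    return Pfifo(tx_queue_len)


unguarded = attach_qdisc(0, guard=False)
unguarded_ok = unguarded.enqueue("pkt")   # limit 0: the packet is dropped

guarded = attach_qdisc(0, guard=True)
guarded_ok = guarded.enqueue("pkt")       # sane limit: the packet is queued
```

With a limit of zero the queue can never hold a packet, so every enqueue is a drop; catching the zero at attach time avoids that.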
