提交 · a4ea5d43c807be28545625c1e0641905022fa0d1 · openeuler / Kernel

09 4月, 2019 8 次提交

ipv4: Add support to fib_config for IPv6 gateway · a4ea5d43

由 David Ahern 提交于 4月 05, 2019

Add support for an IPv6 gateway to fib_config. Since a gateway is either
IPv4 or IPv6, make it a union with fc_gw4 where fc_gw_family decides
which address is in use. Update current checks on family and gw4 to
handle ipv6 as well.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4ea5d43

ipv4: Add support to rtable for ipv6 gateway · 0f5f7d7b

由 David Ahern 提交于 4月 05, 2019

Add support for an IPv6 gateway to rtable. Since a gateway is either
IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
which address is in use.

When dumping the route data, encode an ipv6 nexthop using RTA_VIA.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0f5f7d7b

ipv4: Prepare fib_config for IPv6 gateway · f35b794b

由 David Ahern 提交于 4月 05, 2019

Similar to rtable, fib_config needs to allow the gateway to be either an
IPv4 or an IPv6 address. To that end, rename fc_gw to fc_gw4 to mean an
IPv4 address and add fc_gw_family. Checks on 'is a gateway set' are changed
to see if fc_gw_family is set. In the process prepare the code for a
fc_gw_family == AF_INET6.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f35b794b

ipv4: Prepare rtable for IPv6 gateway · 1550c171

由 David Ahern 提交于 4月 05, 2019

To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1550c171

net: Replace nhc_has_gw with nhc_gw_family · bdf00467

由 David Ahern 提交于 4月 05, 2019

Allow the gateway in a fib_nh_common to be from a different address
family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
Now nhc_family is used to know if the nh_common is part of a fib_nh
or fib6_nh (used for container_of to get to route family specific data),
and nhc_gw_family represents the address family for the gateway.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bdf00467

ipv6: Add neighbor helpers that use the ipv6 stub · 71df5777

由 David Ahern 提交于 4月 05, 2019

Add ipv6 helpers to handle ndisc references via the stub. Update
bpf_ipv6_fib_lookup to use __ipv6_neigh_lookup_noref_stub instead of
the open code ___neigh_lookup_noref with the stub.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

71df5777

ipv6: Add fib6_nh_init and release to stubs · 1aefd3de

由 David Ahern 提交于 4月 05, 2019

Add fib6_nh_init and fib6_nh_release to ipv6_stubs. If fib6_nh_init fails,
callers should not invoke fib6_nh_release, so there is no reason to have
a dummy stub for the IPv6 is not enabled case.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1aefd3de

datagram: remove rendundant 'peeked' argument · fd69c399

由 Paolo Abeni 提交于 4月 08, 2019

After commit a297569f ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.

Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd69c399

05 4月, 2019 3 次提交

net: devlink: introduce devlink_compat_switch_id_get() helper · 7e1146e8

由 Jiri Pirko 提交于 4月 03, 2019

Introduce devlink_compat_switch_id_get() helper which fills up switch_id
according to passed netdev pointer. Call it directly from
dev_get_port_parent_id() as a fallback when ndo_get_port_parent_id
is not defined for given netdev.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7e1146e8

net: devlink: extend port attrs for switch ID · bec5267c

由 Jiri Pirko 提交于 4月 03, 2019

Extend devlink_port_attrs_set() to pass switch ID for ports which are
part of switch and store it in port attrs. For other ports, this is
NULL.

Note that this allows the driver to group devlink ports into one or more
switches according to the actual topology.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bec5267c

net: devlink: convert devlink_port_attrs bools to bits · 407dd706

由 Jiri Pirko 提交于 4月 03, 2019

In order to save space in the struct, convert bools to bits.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

407dd706

04 4月, 2019 2 次提交

ipv6: Flip to fib_nexthop_info · c0a72077

由 David Ahern 提交于 4月 02, 2019

Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code.
Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4
versions. Update fib_nexthop_info for IPv6 linkdown check and
RTA_GATEWAY for AF_INET6.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c0a72077

ipv4: Add fib_nh_common to fib_result · eba618ab

由 David Ahern 提交于 4月 02, 2019

Most of the ipv4 code only needs data from fib_nh_common. Add
fib_nh_common selection to fib_result and update users to use it.

Right now, fib_nh_common in fib_result will point to a fib_nh struct
that is embedded within a fib_info:

        fib_info  --> fib_nh
                      fib_nh
                      ...
                      fib_nh
                        ^
    fib_result->nhc ----+

Later, nhc can point to a fib_nh within a nexthop struct:

        fib_info --> nexthop --> fib_nh
                                   ^
    fib_result->nhc ---------------+

or for a nexthop group:

        fib_info --> nexthop --> nexthop --> fib_nh
                                 nexthop --> fib_nh
                                 ...
                                 nexthop --> fib_nh
                                               ^
    fib_result->nhc ---------------------------+

In all cases nhsel within fib_result will point to which leg in the
multipath route is used.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eba618ab

02 4月, 2019 4 次提交

net: dsa: read mac address from DT for slave device · a2c7023f

由 Xiaofei Shen 提交于 3月 29, 2019

Before creating a slave netdevice, get the mac address from DTS and
apply in case it is valid.
Signed-off-by: NXiaofei Shen <xiaofeis@codeaurora.org>
Signed-off-by: NVinod Koul <vkoul@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a2c7023f

net: sched: introduce and use qdisc tree flush/purge helpers · e5f0e8f8

由 Paolo Abeni 提交于 3月 28, 2019

The same code to flush qdisc tree and purge the qdisc queue
is duplicated in many places and in most cases it does not
respect NOLOCK qdisc: the global backlog len is used and the
per CPU values are ignored.

This change addresses the above, factoring-out the relevant
code and using the helpers introduced by the previous patch
to fetch the correct backlog len.

Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e5f0e8f8

net: sched: introduce and use qstats read helpers · 5dd431b6

由 Paolo Abeni 提交于 3月 28, 2019

Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5dd431b6

vrf: check accept_source_route on the original netdevice · 8c83f2df

由 Stephen Suryaputra 提交于 4月 01, 2019

Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.

v2->v3:
- Simplify by passing the original netdevice down the stack (per David
  Ahern).
Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8c83f2df

30 3月, 2019 11 次提交

ipv6: Move ipv6 stubs to a separate header file · 3616d08b

由 David Ahern 提交于 3月 22, 2019

The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3616d08b

net: Use common nexthop init and release helpers · 979e276e

由 David Ahern 提交于 3月 27, 2019

With fib_nh_common in place, move common initialization and release
code into helpers used by both ipv4 and ipv6. For the moment, the init
is just the lwt encap and the release is both the netdev reference and
the the lwt state reference. More will be added later.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

979e276e

net: Add fib_nh_common and update fib_nh and fib6_nh · f1741730

由 David Ahern 提交于 3月 27, 2019

Add fib_nh_common struct with common nexthop attributes. Convert
fib_nh and fib6_nh to use it. Use macros to move existing
fib_nh_* references to the new nh_common.nhc_*.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f1741730

ipv6: Rename fib6_nh entries · ad1601ae

由 David Ahern 提交于 3月 27, 2019

Rename fib6_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, gateway, flags, and lwtstate are common
with all nexthop definitions. In some places new temporary variables
are declared or local variables renamed to maintain line lengths.

Rename only; no functional change intended.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad1601ae

ipv4: Rename fib_nh entries · b75ed8b1

由 David Ahern 提交于 3月 27, 2019

Rename fib_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, oif, gateway, flags, scope, lwtstate,
nh_weight and nh_upper_bound are common with all nexthop definitions.
In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really
long lines.

Rename only; no functional change intended.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b75ed8b1

ipv6: Refactor fib6_ignore_linkdown · 6d3d07b4

由 David Ahern 提交于 3月 27, 2019

fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.

In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d3d07b4

ipv6: Move gateway checks to a fib6_nh setting · 2b2450ca

由 David Ahern 提交于 3月 27, 2019

The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 address the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.

While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2b2450ca

ipv6: Create cleanup helper for fib6_nh · dac7d0f2

由 David Ahern 提交于 3月 27, 2019

Move the fib6_nh cleanup code to a new helper, fib6_nh_release.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dac7d0f2

ipv6: Create init helper for fib6_nh · 83c44251

由 David Ahern 提交于 3月 27, 2019

Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.

To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

83c44251

ipv4: Create cleanup helper for fib_nh · faa041a4

由 David Ahern 提交于 3月 27, 2019

Move the fib_nh cleanup code from free_fib_info_rcu into a new helper,
fib_nh_release. Move classid accounting into fib_nh_release which is
called per fib_nh to make accounting symmetrical with fib_nh_init.
Export the helper to allow for use with nexthop objects in the
future.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

faa041a4

ipv4: Create init helper for fib_nh · e4516ef6

由 David Ahern 提交于 3月 27, 2019

Consolidate the fib_nh initialization which is duplicated between
fib_create_info for single path and fib_get_nhs for multipath.
Export the helper to allow for use with nexthop objects in the
future.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e4516ef6

29 3月, 2019 5 次提交

netns: provide pure entropy for net_hash_mix() · 355b9855

由 Eric Dumazet 提交于 3月 27, 2019

net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)

I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.

Also provide entropy regardless of CONFIG_NET_NS.

Fixes: 0b441916 ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NAmit Klein <aksecurity@gmail.com>
Reported-by: NBenny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

355b9855

netfilter: Export nf_ct_{set,destroy}_timeout() · 717700d1

由 Yi-Hung Wei 提交于 3月 26, 2019

This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: NYi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

717700d1

net: devlink: remove unused devlink_port_get_phys_port_name() function · 14c03ac4

由 Jiri Pirko 提交于 3月 28, 2019

Now it is unused, remove it.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

14c03ac4

net: devlink: introduce devlink_compat_phys_port_name_get() · af3836df

由 Jiri Pirko 提交于 3月 28, 2019

Introduce devlink_compat_phys_port_name_get() helper that
gets the physical port name for specified netdevice
according to devlink port attributes.
Call this helper from dev_get_phys_port_name()
in case ndo_get_phys_port_name is not defined.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

af3836df

net: replace ndo_get_devlink with ndo_get_devlink_port · 5dc37bb9

由 Jiri Pirko 提交于 3月 28, 2019

Follow-up patch is going to need a devlink port instance according to
a netdev. Devlink port instance should be always available when devlink
is used. So change the recently introduced ndo_get_devlink to
ndo_get_devlink_port. With that, adjust the wrapper for the only
user to get devlink pointer.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NMichal Kubecek <mkubecek@suse.cz>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5dc37bb9

28 3月, 2019 2 次提交

inet: switch IP ID generator to siphash · df453700

由 Eric Dumazet 提交于 3月 27, 2019

According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NAmit Klein <aksecurity@gmail.com>
Reported-by: NBenny Pinkas <benny@pinkas.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

df453700

tcp: fix zerocopy and notsent_lowat issues · 4f661542

由 Eric Dumazet 提交于 3月 26, 2019

My recent patch had at least three problems :

1) TX zerocopy wants notification when skb is acknowledged,
   thus we need to call skb_zcopy_clear() if the skb is
   cached into sk->sk_tx_skb_cache

2) Some applications might expect precise EPOLLOUT
   notifications, so we need to update sk->sk_wmem_queued
   and call sk_mem_uncharge() from sk_wmem_free_skb()
   in all cases. The SOCK_QUEUE_SHRUNK flag must also be set.

3) Reuse of saved skb should have used skb_cloned() instead
  of simply checking if the fast clone has been freed.

Fixes: 472c2e07 ("tcp: add one skb cache for tx")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f661542

25 3月, 2019 2 次提交

net: devlink: select NET_DEVLINK from drivers · f6b19b35

由 Jiri Pirko 提交于 3月 24, 2019

Some drivers are becoming more dependent on NET_DEVLINK being selected
in configuration. With upcoming compat functions, the behavior would be
wrong in case devlink was not compiled in. So make the drivers select
NET_DEVLINK and rely on the functions being there, not just stubs.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f6b19b35

net: devlink: add port type spinlock · b8f97554

由 Jiri Pirko 提交于 3月 24, 2019

Add spinlock to protect port type and type_dev pointer consistency.
Without that, userspace may see inconsistent type and type_dev
combinations.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
v1->v2:
- rebased
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8f97554

24 3月, 2019 3 次提交

tcp: add one skb cache for rx · 8b27dae5

由 Eric Dumazet 提交于 3月 22, 2019

Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
 instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

 - CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

 - CPU in recvmsg() system call, essentially 100 % busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8b27dae5

tcp: add one skb cache for tx · 472c2e07

由 Eric Dumazet 提交于 3月 22, 2019

On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.

    20.69%  [kernel]       [k] queued_spin_lock_slowpath
     5.64%  [kernel]       [k] _raw_spin_lock
     3.83%  [kernel]       [k] syscall_return_via_sysret
     3.48%  [kernel]       [k] __entry_text_start
     1.76%  [kernel]       [k] __netif_receive_skb_core
     1.64%  [kernel]       [k] __fget

For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.

In many cases, ACK packets are handled by another cpus, and this unfortunately
incurs heavy costs for slab layer.

This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.

We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

472c2e07

net: convert rps_needed and rfs_needed to new static branch api · dc05360f

由 Eric Dumazet 提交于 3月 22, 2019

We prefer static_branch_unlikely() over static_key_false() these days.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dc05360f

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功