1. 10 Nov 2020 (4 commits)
  2. 08 Nov 2020 (1 commit)
  3. 07 Nov 2020 (13 commits)
  4. 05 Nov 2020 (1 commit)
    • tcp: propagate MPTCP skb extensions on xmit splits · 5a369ca6
      Committed by Paolo Abeni
      When the TCP stack splits a packet on the write queue, the tail
      half currently loses the associated skb extensions, and will not
      carry the DSM on the wire.
      
      The above does not cause functional problems and is allowed by
      the RFC, but it interacts badly with GRO and RX coalescing, as
      possible candidates for aggregation will carry different TCP options.
      
      This change tries to improve the MPTCP behavior by propagating the
      skb extensions on split.
      
      Additionally, we must prevent the MPTCP stack from updating the
      mapping after the split occurs: that would both violate the RFC and
      fool the reader.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5a369ca6
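      A minimal sketch of the idea (not the exact upstream diff; the helper name
      and the skb_ext_copy() call are assumptions for illustration): when the
      write-queue split creates the tail half, copy the MPTCP extension over so
      the DSM still reaches the wire.

          /* Illustrative sketch: propagate the MPTCP extension when the TCP
           * split carves @skb into @skb + @buff.  Helper name and
           * skb_ext_copy() are assumed here, not confirmed by this log.
           */
          static inline void mptcp_skb_ext_copy(struct sk_buff *to,
                                                const struct sk_buff *from)
          {
                  if (skb_ext_exist(from, SKB_EXT_MPTCP))
                          skb_ext_copy(to, from);
          }

          /* ...called from the split path once the new tail skb is set up: */
          mptcp_skb_ext_copy(buff, skb);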
  5. 03 Nov 2020 (1 commit)
  6. 01 Nov 2020 (2 commits)
  7. 31 Oct 2020 (4 commits)
  8. 30 Oct 2020 (1 commit)
    • netfilter: use actual socket sk rather than skb sk when routing harder · 46d6c5ae
      Committed by Jason A. Donenfeld
      If netfilter changes the packet mark when mangling, the packet is
      rerouted using the route_me_harder set of functions. Prior to this
      commit, there's one big difference between route_me_harder and the
      ordinary initial routing functions, described in the comment above
      __ip_queue_xmit():
      
         /* Note: skb->sk can be different from sk, in case of tunnels */
         int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
      
      That function goes on to correctly make use of sk->sk_bound_dev_if,
      rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
      tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
      It will make some transformations to that packet, and then it will send
      the encapsulated packet out of a *new* socket. That new socket will
      basically always have a different sk_bound_dev_if (otherwise there'd be
      a routing loop). So for the purposes of routing the encapsulated packet,
      the routing information as it pertains to the socket should come from
      that socket's sk, rather than the packet's original skb->sk. For that
      reason __ip_queue_xmit() and related functions all do the right thing.
      
      One might argue that all tunnels should just call skb_orphan(skb) before
      transmitting the encapsulated packet into the new socket. But tunnels do
      *not* do this -- and this is wisely avoided in skb_scrub_packet() too --
      because features like TSQ rely on skb->destructor() being called when
      that buffer space is truly available again. Calling skb_orphan(skb) too
      early would result in buffers filling up unnecessarily and accounting
      info being all wrong. Instead, additional routing must take into account
      the new sk, just as __ip_queue_xmit() notes.
      
      So, this commit addresses the problem by fishing the correct sk out of
      state->sk -- it's already set properly in the call to nf_hook() in
      __ip_local_out(), which receives the sk as part of its normal
      functionality. So we make sure to plumb state->sk through the various
      route_me_harder functions, and then make correct use of it following the
      example of __ip_queue_xmit().
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      46d6c5ae
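      A hedged sketch of the shape of the fix (signatures simplified, not the
      literal upstream patch): give the reroute helper an explicit socket
      argument and fill it from the hook state, so routing uses the
      encapsulating socket's sk_bound_dev_if rather than skb->sk's.

          /* Illustrative: the reroute helper takes the socket explicitly... */
          int ip_route_me_harder(struct net *net, struct sock *sk,
                                 struct sk_buff *skb, unsigned int addr_type);

          /* ...and callers pass the sk that nf_hook() recorded in the hook
           * state (set in __ip_local_out()), not skb->sk:
           */
          err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);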
  9. 24 Oct 2020 (1 commit)
  10. 23 Oct 2020 (1 commit)
  11. 20 Oct 2020 (1 commit)
    • nexthop: Fix performance regression in nexthop deletion · df6afe2f
      Committed by Ido Schimmel
      While insertion of 16k nexthops all using the same netdev ('dummy10')
      takes less than a second, deletion takes about 130 seconds:
      
      # time -p ip -b nexthop.batch
      real 0.29
      user 0.01
      sys 0.15
      
      # time -p ip link set dev dummy10 down
      real 131.03
      user 0.06
      sys 0.52
      
      This is because of repeated calls to synchronize_rcu() whenever a
      nexthop is removed from a nexthop group:
      
      # /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
      ...
          b'finish_task_switch'
          b'schedule'
          b'schedule_timeout'
          b'wait_for_completion'
          b'__wait_rcu_gp'
          b'synchronize_rcu.part.0'
          b'synchronize_rcu'
          b'__remove_nexthop'
          b'remove_nexthop'
          b'nexthop_flush_dev'
          b'nh_netdev_event'
          b'raw_notifier_call_chain'
          b'call_netdevice_notifiers_info'
          b'__dev_notify_flags'
          b'dev_change_flags'
          b'do_setlink'
          b'__rtnl_newlink'
          b'rtnl_newlink'
          b'rtnetlink_rcv_msg'
          b'netlink_rcv_skb'
          b'rtnetlink_rcv'
          b'netlink_unicast'
          b'netlink_sendmsg'
          b'____sys_sendmsg'
          b'___sys_sendmsg'
          b'__sys_sendmsg'
          b'__x64_sys_sendmsg'
          b'do_syscall_64'
          b'entry_SYSCALL_64_after_hwframe'
          -                ip (277)
              126554955
      
      Since nexthops are always deleted under RTNL, synchronize_net() can be
      used instead. It will call synchronize_rcu_expedited() which only blocks
      for several microseconds as opposed to multiple milliseconds like
      synchronize_rcu().
      
      With this patch deletion of 16k nexthops takes less than a second:
      
      # time -p ip link set dev dummy10 down
      real 0.12
      user 0.00
      sys 0.04
      
      Tested with fib_nexthops.sh which includes torture tests that prompted
      the initial change:
      
      # ./fib_nexthops.sh
      ...
      Tests passed: 134
      Tests failed:   0
      
      Fixes: 90f33bff ("nexthops: don't modify published nexthop groups")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      df6afe2f
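      The essence of the change, sketched (not the literal diff): since the
      removal path runs under RTNL, the grace-period wait can use
      synchronize_net(), which upgrades to the expedited variant when RTNL is
      held.

          /* Before: one full RCU grace period per nexthop removed from a group */
          synchronize_rcu();

          /* After: under RTNL, synchronize_net() calls synchronize_rcu_expedited(),
           * so each removal waits microseconds instead of milliseconds.
           */
          synchronize_net();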
  12. 17 Oct 2020 (1 commit)
  13. 15 Oct 2020 (1 commit)
    • ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) · e1e84eb5
      Committed by Mathieu Desnoyers
      As per RFC792, ICMP errors should be sent to the source host.
      
      However, in configurations with Virtual Routing and Forwarding tables,
      looking up which routing table to use is currently done by using the
      destination net_device.
      
      commit 9d1a6c4e ("net: icmp_route_lookup should use rt dev to
      determine L3 domain") changes the interface passed to
      l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
      to skb_dst(skb_in)->dev. This effectively uses the destination device
      rather than the source device for choosing which routing table should be
      used to lookup where to send the ICMP error.
      
      Therefore, if the source and destination interfaces are within separate
      VRFs, or one in the global routing table and the other in a VRF, looking
      up the source host in the destination interface's routing table will
      fail if the destination interface's routing table contains no route to
      the source host.
      
      One observable effect of this issue is that traceroute does not work in
      the following cases:
      
      - Route leaking between global routing table and VRF
      - Route leaking between VRFs
      
      Preferably use the source device's routing table when sending ICMP error
      messages. If no source device is set, fall back to the destination
      device's routing table. Otherwise, use the main routing table (index 0).
      
      [ It has been pointed out that a similar issue may exist with ICMP
        errors triggered when forwarding between network namespaces. It would
        be worthwhile to investigate, but is outside of the scope of this
        investigation. ]
      
      [ It has also been pointed out that a similar issue exists with
        unreachable / fragmentation needed messages, which can be triggered by
        changing the MTU of eth1 in r1 to 1400 and running:
      
        ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
      
        Some investigation points to raw_icmp_error() and raw_err() as being
        involved in this last scenario. The focus of this patch is TTL expired
        ICMP messages, which go through icmp_route_lookup.
        Investigation of failure modes related to raw_icmp_error() is beyond
        this investigation's scope. ]
      
      Fixes: 9d1a6c4e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
      Link: https://tools.ietf.org/html/rfc792
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e1e84eb5
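      A hedged sketch of the selection order described above (the helper name is
      illustrative): prefer the source device, fall back to the destination
      device, and leave NULL to mean the main table.

          /* Illustrative: pick the device whose L3 domain / routing table is
           * used to route the ICMP error back towards the source host.
           */
          static struct net_device *icmp_route_lookup_dev(const struct sk_buff *skb_in)
          {
                  if (skb_in->dev)
                          return skb_in->dev;              /* source device */
                  if (skb_dst(skb_in))
                          return skb_dst(skb_in)->dev;     /* destination device */
                  return NULL;                             /* main table (index 0) */
          }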
  14. 14 Oct 2020 (4 commits)
  15. 11 Oct 2020 (1 commit)
  16. 09 Oct 2020 (1 commit)
    • xfrm: interface: fix the priorities for ipip and ipv6 tunnels · 7fe94612
      Committed by Xin Long
      As Nicolas noticed in his setup, when the xfrm_interface module is
      installed, the standard IP tunnels break when receiving packets.
      
      This is caused by the xfrm interface's IP tunnel handlers having a higher
      priority: they process incoming packets via xfrm_input(), which drops the
      packets and returns 0 when anything goes wrong.
      
      Rather than changing xfrm_input(), this patch adjusts the priority of the
      IP tunnel handlers in the xfrm interface, so that packets reach xfrmi's
      handlers after the others', as the other handlers do not drop packets
      they cannot process.
      
      Note that IPCOMP also defines its own IPIP tunnel handler and it calls
      xfrm_input() as well, so we must make its priority lower than xfrmi's,
      which means having xfrmi loaded would still break IPCOMP. We may seek
      another way to fix it in xfrm_input() in the future.
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Fixes: da9bbf05 ("xfrm: interface: support IPIP and IPIP6 tunnels processing with .cb_handler")
      Fixes: d7b360c2 ("xfrm: interface: support IP6IP6 and IP6IP tunnels processing with .cb_handler")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      7fe94612
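      A sketch of what the priority adjustment looks like (the priority value
      and the field layout follow the upstream style but are illustrative, not
      the exact upstream numbers): register xfrmi's IPIP handler below the
      plain IP tunnel and IPCOMP handlers so those get the packet first.

          static struct xfrm_tunnel xfrmi_ipip_handler __read_mostly = {
                  .handler     = xfrmi4_rcv_tunnel,
                  .cb_handler  = xfrmi_rcv_cb,
                  .err_handler = xfrmi4_err,
                  .priority    = -2,   /* illustrative: below ipip and ipcomp */
          };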
  17. 06 Oct 2020 (2 commits)
    • ipv4: use dev_sw_netstats_rx_add() · 560b50cf
      Committed by Fabian Frederick
      Use the new helper for updating the device's software netstats.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      560b50cf
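      For context, a sketch of what the helper replaces in receive paths (the
      hand-written form is shown for illustration):

          /* Open-coded per-CPU stats update that the helper consolidates: */
          struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);

          u64_stats_update_begin(&tstats->syncp);
          tstats->rx_packets++;
          tstats->rx_bytes += skb->len;
          u64_stats_update_end(&tstats->syncp);

          /* With the new helper this collapses to a single call: */
          dev_sw_netstats_rx_add(dev, skb->len);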
    • tcp: fix receive window update in tcp_add_backlog() · 86bccd03
      Committed by Eric Dumazet
      We got reports from GKE customers of flows being reset by netfilter
      conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
      
      Traces seemed to suggest that ACK packets were dropped by the
      packet capture, or more likely that ACKs were received in the
      wrong order.
      
       wscale=7, SYN and SYNACK not shown here.
      
       This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
       New right edge of the window -> 51359321+1871*128=51598809
      
       09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
       09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
       09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       The receiver now allows the sender to send 606*128=77568 bytes from seq 51521241 :
       New right edge of the window -> 51521241+606*128=51598809
      
       09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
      
       09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
       09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
      
       09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
      
       netfilter conntrack is not happy and sends RST
      
       09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
       09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
      
       Now imagine the ACKs were delivered out of order and tcp_add_backlog() set the window based on the wrong packet.
       New right edge of the window -> 51521241+859*128=51631193
      
      Normally the TCP stack handles OOO packets just fine, but it
      turns out tcp_add_backlog() does not: it can update the window
      field of the aggregated packet even if the ACK sequence
      of the last received packet is too old.
      
      Many thanks to Alexandre Ferrieux for independently reporting the issue
      and suggesting a fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86bccd03
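      A hedged sketch of the kind of guard this implies in the coalescing path
      (not the literal upstream diff): only refresh the aggregated packet's
      window when the newly arrived ACK is at least as recent as the one
      already recorded on the tail.

          /* Inside tcp_add_backlog() coalescing, with 'tail' the aggregated skb,
           * 'thtail' its TCP header and 'th' the header of the new segment:
           */
          if (!before(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq)) {
                  TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
                  thtail->window = th->window;   /* only newer ACKs move the window */
          }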