Commits · 9a301cafc8619c7f30032d314da6e65d9d913d57 · openeuler / Kernel

26 Apr, 2021 19 commits

xprtrdma: Move fr_linv_done field to struct rpcrdma_mr · 9a301caf

Committed by Chuck Lever on Apr 19, 2021

Clean up: Move more of struct rpcrdma_frwr into its parent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Move cqe to struct rpcrdma_mr · e10fa96d

Committed by Chuck Lever on Apr 19, 2021

Clean up.

- Simplify variable initialization in the completion handlers.

- Move another field out of struct rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Move fr_cid to struct rpcrdma_mr · 0a26d10e

Committed by Chuck Lever on Apr 19, 2021

Clean up (for several purposes):

- The MR's cid is initialized sooner so that tracepoints can show
  something reasonable even if the MR is never posted.
- The MR's res.id doesn't change so the cid won't change either.
  Initializing the cid once is sufficient.
- struct rpcrdma_frwr is going away soon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Remove the RPC/RDMA QP event handler · e1648eb2

Committed by Chuck Lever on Apr 19, 2021

Clean up: The handler only recorded a trace event. If indeed no
action is needed by the RPC/RDMA consumer, then the event can be
ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation · 4ddd0fc3

Committed by Chuck Lever on Apr 19, 2021

The Send signaling logic is a little subtle, so add some
observability around it. For every xprtrdma_mr_fastreg event, there
should be an xprtrdma_mr_localinv or xprtrdma_mr_reminv event.

When these tracepoints are enabled, we can see exactly when an MR is
DMA-mapped, registered, invalidated (either locally or remotely) and
then DMA-unmapped.

kworker/u25:2-190 [000] 787.979512: xprtrdma_mr_map: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979515: xprtrdma_chunk_read: task:351@5 pos=148 5608@0x8679e0c8f6f56000:0x00000503 (last)
kworker/u25:2-190 [000] 787.979519: xprtrdma_marshal: task:351@5 xid=0x8679e0c8: hdr=52 xdr=148/5608/0 read list/inline
kworker/u25:2-190 [000] 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979526: xprtrdma_post_send: task:351@5 cq.id=0 cid=73 (2 SGEs)

...

kworker/5:1H-219 [005] 787.980567: xprtrdma_wc_receive: cq.id=1 cid=161 status=SUCCESS (0/0x0) received=164
kworker/5:1H-219 [005] 787.980571: xprtrdma_post_recvs: peer=[192.168.100.55]:20049 r_xprt=0xffff8884974d4000: 0 new recvs, 70 active (rc 0)
kworker/5:1H-219 [005] 787.980573: xprtrdma_reply: task:351@5 xid=0x8679e0c8 credits=64
kworker/5:1H-219 [005] 787.980576: xprtrdma_mr_reminv: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/5:1H-219 [005] 787.980577: xprtrdma_mr_unmap: mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)

Note that I've moved the xprtrdma_post_send tracepoint so that event
always appears after the xprtrdma_mr_fastreg tracepoint. Otherwise
the event log looks counterintuitive (FastReg is always supposed to
happen before Send).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
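
The pairing rule stated in the commit above (every xprtrdma_mr_fastreg event should be followed by an xprtrdma_mr_localinv or xprtrdma_mr_reminv event) can be checked mechanically against a captured trace. The following is a minimal sketch, not part of the patch: the function name and the regular expression are illustrative, and the line layout is assumed from the sample output quoted in the commit message.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample trace lines in the commit message:
#   "... 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 ..."
EVENT_RE = re.compile(r":\s+(xprtrdma_mr_\w+):.*?\bmr\.id=(\d+)")

INVALIDATE_EVENTS = ("xprtrdma_mr_localinv", "xprtrdma_mr_reminv")


def unmatched_fastregs(lines):
    """Return {mr_id: count} for FastReg events with no invalidation event."""
    pending = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not an xprtrdma_mr_* event that carries an mr.id
        event, mr_id = match.group(1), int(match.group(2))
        if event == "xprtrdma_mr_fastreg":
            pending[mr_id] += 1
        elif event in INVALIDATE_EVENTS:
            pending[mr_id] -= 1
    return {mr_id: n for mr_id, n in pending.items() if n > 0}
```

Feeding it the lines of a trace capture returns an empty dict when every registration was invalidated; any MR id left in the result was registered more often than it was invalidated within the captured window.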

xprtrdma: Avoid Send Queue wrapping · b3ce7a25

Committed by Chuck Lever on Apr 19, 2021

Send WRs can be signalled or unsignalled. A signalled Send WR
always has a matching Send completion, while an unsignalled Send
has a completion only if the Send WR fails.

xprtrdma has a Send accounting mechanism that is designed to reduce
the number of signalled Send WRs. This in turn mitigates the
interrupt rate of the underlying device.

RDMA consumers can't leave all Sends unsignaled, however, because
providers rely on Send completions to maintain their Send Queue head
and tail pointers. xprtrdma counts the number of unsignaled Send WRs
that have been posted to ensure that Sends are signalled often
enough to prevent the Send Queue from wrapping.

This mechanism neglected to account for FastReg WRs, which are
posted on the Send Queue but never signalled. As a result, the
Send Queue wrapped on occasion, resulting in duplicate completions
of FastReg and LocalInv WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
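
To illustrate the accounting gap described in the commit above, here is a small model, assuming a fixed budget of unsignaled Send Queue entries between signaled Sends. The class, its field names and the budget value are hypothetical and do not mirror the kernel's data structures; the only point is that FastReg WRs consume Send Queue entries and so have to be charged against the same budget.

```python
class SendAccounting:
    """Toy model: signal a Send once enough unsignaled WRs have been posted."""

    def __init__(self, batch):
        self.batch = batch      # hypothetical budget of unsignaled queue entries
        self.unsignaled = 0     # entries posted since the last signaled Send

    def post(self, n_fastreg):
        """Account for one Send WR preceded by n_fastreg FastReg WRs.

        Returns True if this Send WR should be posted as signaled.
        """
        # FastReg WRs are never signaled themselves, but each one still
        # occupies a Send Queue entry, so it is counted here as well.
        self.unsignaled += n_fastreg + 1
        if self.unsignaled >= self.batch:
            self.unsignaled = 0
            return True
        return False
```

With a budget of 4 and one FastReg WR per RPC, every second Send is signaled. A model that counted only the Send WRs would signal every fourth Send while eight queue entries had actually been consumed, which is the kind of undercount that lets the queue wrap.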

xprtrdma: Do not wake RPC consumer on a failed LocalInv · 8a053433

Committed by Chuck Lever on Apr 19, 2021

Throw away any reply where the LocalInv flushes or could not be
posted. The registered memory region is in an unknown state until
the disconnect completes.

rpcrdma_xprt_disconnect() will find and release the MR. No need to
put it back on the MR free list in this case.

The client retransmits pending RPC requests once it reestablishes a
fresh connection, so a replacement reply should be forthcoming on
the next connection instance.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Do not recycle MR after FastReg/LocalInv flushes · e4b52ca0

Committed by Chuck Lever on Apr 19, 2021

Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.

I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Clarify use of barrier in frwr_wc_localinv_done() · 44438ad9

Committed by Chuck Lever on Apr 19, 2021

Clean up: The comment and the placement of the memory barrier are
confusing. Humans want to read the function statements from head
to tail.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Rename frwr_release_mr() · f912af77

Committed by Chuck Lever on Apr 19, 2021

Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: rpcrdma_mr_pop() already does list_del_init() · 1363e638

Committed by Chuck Lever on Apr 19, 2021

The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.

Fixes: 84756894 ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Delete rpcrdma_recv_buffer_put() · c35ca60d

Committed by Chuck Lever on Apr 19, 2021

Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Fix cwnd update ordering · 35d8b10a

Committed by Chuck Lever on Apr 19, 2021

After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.

The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().

Fixes: 2ae50ad6 ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve locking around rpcrdma_rep creation · 9e3ca33b

Committed by Chuck Lever on Apr 19, 2021

Defensive clean up: Protect the rb_all_reps list during rep
creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve commentary around rpcrdma_reps_unmap() · 8b5292be

Committed by Chuck Lever on Apr 19, 2021

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve locking around rpcrdma_rep destruction · eaf86e8c

Committed by Chuck Lever on Apr 24, 2021

Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.

Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Put flushed Receives on free list instead of destroying them · 5030c9a9

Committed by Chuck Lever on Apr 19, 2021

Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Do not refresh Receive Queue while it is draining · 15788d1d

Committed by Chuck Lever on Apr 19, 2021

Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.
Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
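
The rule in the commit above, that the Receive handler must stop refilling the queue before the drain sentinel is posted, can be sketched as a small state model. This is only an illustration under stated assumptions: the class, the `draining` flag and the refill count are hypothetical names, not the fields the patch uses.

```python
class ReceiveQueue:
    """Toy model of a Receive Queue that must not be refilled while draining."""

    def __init__(self, posted=0):
        self.posted = posted    # Receive WRs currently posted
        self.draining = False   # hypothetical flag, set before the drain begins

    def start_drain(self):
        # Must happen before the drain sentinel is posted (ib_drain_rq() in
        # the commit message), so that no Receive is posted behind it.
        self.draining = True

    def on_receive_completion(self, success, refill=1):
        self.posted -= 1        # the completed Receive WR has left the queue
        # Successful completions can still arrive after a disconnect, so
        # "success" alone is not a safe trigger for posting more Receives.
        if success and not self.draining:
            self.posted += refill
```

Before the drain starts, each successful completion keeps the queue topped up. Once `start_drain()` has been called, completions only ever shrink `posted`, so nothing is left to flush after the sentinel completes.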

xprtrdma: Avoid Receive Queue wrapping · 32e6b681

Committed by Chuck Lever on Apr 19, 2021

Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.

This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.

Fixes: e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
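
The sizing problem in the commit above comes down to simple arithmetic. The sketch below is a simplified model, not the kernel's calculation: it assumes the client may post up to its credit limit plus one extra batch of Receives, and that the queue depth is the credit limit plus the backchannel allowance. The function names and the values 128 and 7 are illustrative; only the backchannel allowance of 32 comes from the commit message.

```python
def recv_queue_depth(credits, backward_wrs, recv_batch=0):
    """Depth of the Receive Queue in this simplified model."""
    return credits + backward_wrs + recv_batch


def may_wrap(queue_depth, credits, recv_batch):
    """True if the client could post more Receives than the queue can hold."""
    return credits + recv_batch > queue_depth


CREDITS = 128      # illustrative credit limit
RECV_BATCH = 7     # illustrative size of the extra Receive batch

# With the backchannel allowance of 32, the extra batch happens to fit.
with_backchannel = may_wrap(recv_queue_depth(CREDITS, 32), CREDITS, RECV_BATCH)

# Without it, the queue is exactly CREDITS deep and the batch overflows it.
without_backchannel = may_wrap(recv_queue_depth(CREDITS, 0), CREDITS, RECV_BATCH)

# Sizing the queue for the batch as well removes the risk in both cases.
sized_for_batch = may_wrap(recv_queue_depth(CREDITS, 0, RECV_BATCH), CREDITS, RECV_BATCH)
```

In this model `with_backchannel` is False, `without_backchannel` is True and `sized_for_batch` is False, which matches the commit's observation that the problem is hidden whenever the backchannel allowance is compiled in.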

14 Apr, 2021 4 commits

SUNRPC: Handle major timeout in xprt_adjust_timeout() · 09252177

Committed by Chris Dion on Apr 04, 2021

Currently if a major timeout value is reached, but the minor value has
not been reached, an ETIMEDOUT will not be sent back to the caller.
This can occur if the v4 server is not responding to requests and
retrans is configured larger than the default of two.

For example, a TCP mount with a configured timeout value of 50 and a
retransmission count of 3 to a v4 server which is not responding:

1. Initial value and increment set to 5s, maxval set to 20s, retries at 3
2. Major timeout is set to 20s, minor timeout set to 5s initially
3. xprt_adjust_timeout() is called after 5s, retry with 10s timeout,
   minor timeout is bumped to 10s
4. And again after another 10s, 15s total time with minor timeout set
   to 15s
5. After 20s total time xprt_adjust_timeout() is called as major timeout is
   reached, but skipped because the minor timeout is not reached
       - After this time the cpu spins continually calling
         xprt_adjust_timeout() and returning 0 for 10 seconds.
         As seen on perf sched:
         39243.913182 [0005]  mount.nfs[3794] 4607.938      0.017   9746.863
6. This continues until the 15s minor timeout condition is reached (in
   this case for 10 seconds). After which the ETIMEDOUT is processed
   back to the caller, the cpu spinning stops, and normal operations
   continue

Fixes: 7de62bc0 ("SUNRPC dont update timeout value on connection reset")
Signed-off-by: Chris Dion <Christopher.Dion@dell.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
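
The timeline in the numbered steps of the commit above can be reproduced with a few lines of arithmetic. This is a sketch of the scenario only, not of the kernel function: `spin_window` and its parameters are hypothetical names, and the model assumes each minor timeout grows by a fixed increment after it expires.

```python
def spin_window(initial, increment, retries, major):
    """Return (start, end) in seconds of the interval in which the major
    timeout has expired but the current minor timeout has not, or None."""
    elapsed = 0
    timeout = initial
    for _ in range(retries):
        expiry = elapsed + timeout
        if elapsed < major < expiry:
            # The major timeout falls inside this minor interval, so nothing
            # reports the timeout to the caller until the minor one expires.
            return (major, expiry)
        elapsed = expiry
        timeout += increment
    return None


# Values from the example: 5s initial value and increment, 3 retries,
# 20s major timeout.
window = spin_window(initial=5, increment=5, retries=3, major=20)
```

For the example's values the minor timeouts expire at 5s, 15s and 30s of total time, so `window` is `(20, 30)`: the ten seconds during which the commit message reports the CPU spinning.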

SUNRPC: Remove trace_xprt_transmit_queued · 6cf23783

Committed by Chuck Lever on Mar 31, 2021

This tracepoint can crash when dereferencing snd_task because
when some transports connect, they put a cookie in that field
instead of a pointer to an rpc_task.

BUG: KASAN: use-after-free in trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
Read of size 2 at addr ffff8881a83bd3a0 by task git/331872

CPU: 11 PID: 331872 Comm: git Tainted: G S                5.12.0-rc2-00007-g3ab6e585a7f9 #1453
Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Call Trace:
 dump_stack+0x9c/0xcf
 print_address_description.constprop.0+0x18/0x239
 kasan_report+0x174/0x1b0
 trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
 xprt_prepare_transmit+0x8e/0xc1 [sunrpc]
 call_transmit+0x4d/0xc6 [sunrpc]

Fixes: 9ce07ae5 ("SUNRPC: Replace dprintk() call site in xprt_prepare_transmit")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Add tracepoint that fires when an RPC is retransmitted · e936a597

Committed by Chuck Lever on Mar 31, 2021

A separate tracepoint can be left enabled all the time to capture
rare but important retransmission events. So for example:

kworker/u26:3-568 [009] 156.967933: xprt_retransmit: task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2

Or, for example, enable all nfs and nfs4 tracepoints, and set up a
trigger to disable tracing when xprt_retransmit fires to capture
everything that leads up to it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
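
If the xprt_retransmit tracepoint from the commit above is left enabled, its output can be post-processed with a small parser. The sketch below assumes the field layout of the single sample line quoted in the commit message; the group names (`clid` in particular) and the function name are illustrative guesses rather than names defined by the tracepoint.

```python
import re

# Layout assumed from the sample line:
#   "... 156.967933: xprt_retransmit: task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2"
RETRANSMIT_RE = re.compile(
    r"(?P<ts>\d+\.\d+): xprt_retransmit: "
    r"task:(?P<task>\d+)@(?P<clid>\d+) "
    r"xid=(?P<xid>0x[0-9a-f]+) "
    r"(?P<prog>\S+) (?P<proc>\S+) "
    r"ntrans=(?P<ntrans>\d+)"
)


def parse_retransmit(line):
    """Return the fields of an xprt_retransmit trace line, or None."""
    match = RETRANSMIT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = float(fields["ts"])
    fields["ntrans"] = int(fields["ntrans"])
    return fields
```

Lines that are not xprt_retransmit events return None, so the function can be mapped over a whole trace capture and the results filtered.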

SUNRPC: Move fault injection call sites · 7638e0bf

Committed by Chuck Lever on Mar 31, 2021

I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some providers, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.

The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.

Fixes: 4a068258 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

05 Apr, 2021 3 commits

SUNRPC: Ensure the transport backchannel association · 98b5cee3

Committed by Benjamin Coddington on Mar 22, 2021

If the server sends CB_ calls on a connection that is not associated
with the backchannel, refuse to process the call and shut down the
connection.  This avoids a NULL dereference crash in
xprt_complete_bc_request().  There's not much more we can do in this
situation unless we want to look into allowing all connections to be
associated with the fore and back channel.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

sunrpc: honor rpc_task's timeout value in rpcb_create() · 6b996476

Committed by Eryu Guan on Mar 22, 2021

Currently rpcbind client is created without setting rpc timeout (thus
using the default value). But if the rpc_task already has a customized
timeout in its tk_client field, it's also ignored.

Let's use the same timeout setting in rpc_task->tk_client->cl_timeout
for rpcbind connection.
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Set TCP_CORK until the transmit queue is empty · d737e5d4

Committed by Trond Myklebust on Feb 09, 2021

When we have multiple RPC requests queued up, it makes sense to set the
TCP_CORK option while the transmit queue is non-empty.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

24 Mar, 2021 2 commits

net: bridge: don't notify switchdev for local FDB addresses · 6ab4c311

Committed by Vladimir Oltean on Mar 22, 2021

As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:

ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local

and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.

There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result as adding a 'local' entry:

bridge fdb add dev swp0 00:01:02:03:04:05 master

I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443-1-olteanv@gmail.com/
If you're a user reading this and this is what you want, just use:

bridge fdb add dev swp0 00:01:02:03:04:05 master static

Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.

Fixes: 6b26b51b ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
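
The classification rule described in the commit above, where 'local' is implied unless 'static' is given and local entries are not announced to switchdev, can be written down as a short decision function. This is a simplified sketch based only on the flags mentioned in the commit message; the function names are hypothetical and other FDB flags are ignored.

```python
def fdb_entry_kind(flags):
    """Classify a 'bridge fdb add' request from its flag words.

    Per the commit message, 'local' is implicit when 'static' is absent.
    """
    return "static" if "static" in flags else "local"


def should_notify_switchdev(flags):
    """After the patch, only non-local entries are announced to drivers."""
    return fdb_entry_kind(flags) != "local"


# The three command variants quoted in the commit message:
explicit_local = should_notify_switchdev({"master", "local"})
implicit_local = should_notify_switchdev({"master"})
explicit_static = should_notify_switchdev({"master", "static"})
```

Only the 'static' variant produces a notification in this model, which corresponds to the commit's advice to add 'static' when the address is meant to be forwarded.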

net/sched: act_ct: clear post_ct if doing ct_clear · 8ca1b090

Committed by Marcelo Ricardo Leitner on Mar 22, 2021

Invalid detection works with two distinct moments: act_ct tries to find
a conntrack entry and set post_ct true, indicating that that was
attempted. Then, when flow dissector tries to dissect CT info and no
entry is there, it knows that it was tried and no entry was found, and
synthesizes/sets
                  key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                                  TCA_FLOWER_KEY_CT_FLAGS_INVALID;
mimicking what OVS does.

OVS has this a bit more streamlined, as it recomputes the key after
trying to find a conntrack entry for it.

Issue here is, when we have 'tc action ct clear', it didn't clear
post_ct, causing a subsequent match on 'ct_state -trk' to fail, due to
the above. The fix, thus, is to clear it.

Reproducer rules:
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 0 \
	protocol ip flower ip_proto tcp ct_state -trk \
	action ct zone 1 pipe \
	action goto chain 2
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 2 \
	protocol ip flower \
	action ct clear pipe \
	action goto chain 4
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 4 \
	protocol ip flower ct_state -trk \
	action mirred egress redirect dev enp130s0f1np1_0

With the fix, the 3rd rule matches, like it does with OVS kernel
datapath.

Fixes: 7baf2429 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
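
The interaction between post_ct and the synthesized ct_state in the commit above can be modelled in a few lines. This sketch is an assumption-laden simplification: the bit values are placeholders rather than the kernel's constants, and the class and function names are hypothetical stand-ins for act_ct, 'ct clear' and the flow dissector.

```python
TRACKED = 0x1   # placeholder for TCA_FLOWER_KEY_CT_FLAGS_TRACKED
INVALID = 0x2   # placeholder for TCA_FLOWER_KEY_CT_FLAGS_INVALID


class Packet:
    def __init__(self):
        self.ct_entry = None    # conntrack entry attached to the packet, if any
        self.post_ct = False    # set once a conntrack lookup has been attempted


def act_ct(pkt, entry=None):
    """Model of 'action ct': attempt a lookup and remember that it happened."""
    pkt.ct_entry = entry
    pkt.post_ct = True


def act_ct_clear(pkt):
    """Model of 'action ct clear' with the fix applied."""
    pkt.ct_entry = None
    pkt.post_ct = False         # the fix: forget that a lookup was attempted


def dissect_ct_state(pkt):
    """Model of the flow dissector's ct_state synthesis."""
    if pkt.ct_entry is not None:
        return TRACKED
    if pkt.post_ct:
        # A lookup was tried and found nothing: report tracked + invalid.
        return TRACKED | INVALID
    return 0                    # untracked, so a '-trk' match succeeds
```

Running `act_ct` and then `act_ct_clear` on a packet leaves `dissect_ct_state` returning 0, so the third reproducer rule's `ct_state -trk` match succeeds. If `act_ct_clear` left `post_ct` set, the result would be `TRACKED | INVALID` and the match would fail, which is the behaviour the commit fixes.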

23 Mar, 2021 2 commits

net: dsa: don't assign an error value to tag_ops · e0c755a4

Committed by George McCollister on Mar 22, 2021

Use a temporary variable to hold the return value from
dsa_tag_driver_get() instead of assigning it to dst->tag_ops. Leaving
an error value in dst->tag_ops can result in dereferencing an invalid
pointer when a deferred switch configuration happens later.

Fixes: 357f203b ("net: dsa: keep a copy of the tagging protocol in the DSA switch tree")
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipconfig: ic_dev can be NULL in ic_close_devs · a50a151e

Committed by Vladimir Oltean on Mar 22, 2021

ic_close_devs contains a generalization of the logic to not close a
network interface if it's the host port for a DSA switch. This logic is
disguised behind an iteration through the lowers of ic_dev in
ic_close_devs.

When no interface for ipconfig can be found, ic_dev is NULL, and
ic_close_devs:
- dereferences a NULL pointer when assigning selected_dev
- would attempt to search through the lower interfaces of a NULL
  net_device pointer

So we should protect against that case.

The "lower_dev" iterator variable was shortened to "lower" in order to
keep the 80 character limit.

Fixes: f68cbaed ("net: ipconfig: avoid use-after-free in ic_close_devs")
Fixes: 46acf7bd ("Revert "net: ipv4: handle DSA enabled master network devices"")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Heiko Thiery <heiko.thiery@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21 Mar, 2021 1 commit

can: isotp: tx-path: zero initialize outgoing CAN frames · b5f020f8

Committed by Oliver Hartkopp on Mar 19, 2021

Commit d4eb538e ("can: isotp: TX-path: ensure that CAN frame flags are
initialized") ensured the TX flags to be properly set for outgoing CAN
frames.

In fact the root cause of the issue results from a missing initialization
of outgoing CAN frames created by isotp. This is no problem on the CAN bus
as the CAN driver only picks the correctly defined content from the struct
can(fd)_frame. But when the outgoing frames are monitored (e.g. with
candump) we potentially leak some bytes in the unused content of
struct can(fd)_frame.

Fixes: e057dd3f ("can: add ISO 15765-2:2016 transport protocol")
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20210319100619.10858-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

20 Mar, 2021 2 commits

selinux: vsock: Set SID for socket returned by accept() · 1f935e8e

Committed by David Brazdil on Mar 19, 2021

For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.

Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.

Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: move sk_route_caps check and set into sctp_outq_flush_transports · 8ff0b1f0

Committed by Xin Long on Mar 19, 2021

The sk's sk_route_caps is set in sctp_packet_config, and later it
only needs to change when traversing the transport_list in a loop,
as the dst might be changed in the tx path.

So move sk_route_caps check and set into sctp_outq_flush_transports
from sctp_packet_transmit. This also fixes a dst leak reported by
Chen Yi:

  https://bugzilla.kernel.org/show_bug.cgi?id=212227

Calling sk_setup_caps() in sctp_packet_transmit may also set the
sk_route_caps for the ctrl sock in a netns. When the netns is being
deleted, the ctrl sock is released later than the dst dev is deleted,
which causes the dev's deletion to hang with this dmesg error:

  unregister_netdevice: waiting for xxx to become free. Usage count = 1
Reported-by: Chen Yi <yiche@redhat.com>
Fixes: bcd623d8 ("sctp: call sk_setup_caps in sctp_packet_transmit instead")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Mar, 2021 2 commits

net: check all name nodes in __dev_alloc_name · 6c015a22

Committed by Jiri Bohac on Mar 18, 2021

__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.

Since commit ff927412 ("net: introduce name_node struct to be used in
hashlist") network devices may have alternate names.  __dev_alloc_name()
does not take these alternate names into account, possibly generating a
name that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

    # rmmod dummy 2>/dev/null
    # ip link property add dev lo altname dummy0
    # modprobe dummy numdummies=1
    modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff927412 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
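
The fix in the commit above amounts to treating alternate names as taken when searching for the first free device number. A minimal sketch of that search follows, assuming devices are represented as a mapping from primary name to a list of alternate names; the function name and data layout are illustrative, and the kernel's bitmap-based search and its size limit are not modelled.

```python
def alloc_name(pattern, devices):
    """Return the first unused name for a pattern such as "dummy%d".

    devices maps each device's primary name to a list of its alternate names.
    """
    taken = set()
    for name, altnames in devices.items():
        taken.add(name)
        taken.update(altnames)   # the fix: alternate names count as taken too
    index = 0
    while pattern % index in taken:
        index += 1
    return pattern % index


# The situation from the reproducer: "lo" carries the altname "dummy0".
name = alloc_name("dummy%d", {"lo": ["dummy0"]})
```

Here `name` is `"dummy1"`, the outcome the commit message says modprobe should produce instead of failing.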

ipv6: weaken the v4mapped source check · dcc32f4f

Committed by Jakub Kicinski on Mar 17, 2021

This reverts commit 6af1799a.

Commit 6af1799a ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on the public Internet. As the commit pointed out,

  https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02

lists potential issues.

Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.

Since the issue which sparked the addition of the check was with
TCP and request_socks in particular, push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.

Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.

Fixes: 6af1799a ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
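
The narrowed check described in the commit above can be expressed as a predicate over the source address and the upper-layer protocol. This is a user-space sketch using Python's standard ipaddress module, not the kernel code; the function name and the protocol strings are hypothetical.

```python
import ipaddress

# Per the commit message, the check now applies to TCPv6 and DCCP only.
CHECKED_PROTOCOLS = {"tcp", "dccp"}


def drop_v4mapped_source(src, protocol):
    """Return True if a packet with this IPv6 source address is dropped."""
    addr = ipaddress.IPv6Address(src)
    is_v4mapped = addr.ipv4_mapped is not None   # true for ::ffff:a.b.c.d
    return protocol in CHECKED_PROTOCOLS and is_v4mapped
```

With this predicate a packet from `::ffff:192.0.2.1` is dropped for "tcp" but accepted for "udp", which is the restored UDPv6 behaviour the commit describes.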

18 Mar, 2021 5 commits

netfilter: nftables: skip hook overlap logic if flowtable is stale · 86fe2c19

Committed by Pablo Neira Ayuso on Mar 17, 2021

If the flowtable has been previously removed in this batch, skip the
hook overlap checks. This fixes spurious EEXIST errors when removing and
adding the flowtable in the same batch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: Make sure GC works periodically in idle system · 740b486a

Committed by Yinjun Zhang on Mar 17, 2021

Currently flowtable's GC work is initialized as deferrable, which
means GC cannot run on time when the system is idle. As a result, a
hardware-offloaded flow may be deleted on timeout, because its
last-used time is not updated in time.

Resolve it by initializing the GC work as delayed work instead of
deferrable.

Fixes: c29f74e0 ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: allow to update flowtable flags · 7b35582c

Committed by Pablo Neira Ayuso on Mar 17, 2021

Honor flowtable flags from the control update path. Do not allow
toggling hardware offload support, though.

Fixes: 8bb69f3b ("netfilter: nf_tables: add flowtable offload control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nftables: report EOPNOTSUPP on unsupported flowtable flags · 7e6136f1

Committed by Pablo Neira Ayuso on Mar 17, 2021

Error was not set accordingly.

Fixes: 8bb69f3b ("netfilter: nf_tables: add flowtable offload control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: Fix gre tunneling over ipv6 · 8b2030b4

Committed by Ludovic Senecaux on Mar 04, 2021

This fix permits GRE connections to be tracked within ip6tables rules.
Signed-off-by: Ludovic Senecaux <linuxludo@free.fr>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

openeuler / Kernel synced successfully nearly 2 years ago