Commits · 9eeb3aa33ae005526f672b394c1791578463513f · openeuler / Kernel

22 Oct, 2021 1 commit

bpf: Add bpf_skc_to_unix_sock() helper · 9eeb3aa3

Committed by Hengqi Chen on Oct 21, 2021

The helper is used in tracing programs to cast a socket
pointer to a unix_sock pointer.
The return value could be NULL if the casting is illegal.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211021134752.1223426-2-hengqi.chen@gmail.com

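The checked cast described in the bpf_skc_to_unix_sock() commit can be illustrated with a small userspace model. This is only a sketch: the types and the function name below are hypothetical stand-ins, and the assumption that an "illegal" cast means "the socket family is not AF_UNIX" is ours, not something the commit message spells out.

```c
#include <stddef.h>

/* Userspace stand-ins for kernel types; every name here is hypothetical. */
#define MODEL_AF_UNIX 1
#define MODEL_AF_INET 2

struct model_sock {
    int family;              /* plays the role of the socket family field */
};

struct model_unix_sock {
    struct model_sock sk;    /* base socket is the first member */
    unsigned long inode;     /* an example unix-specific field */
};

/* Checked downcast: returns NULL unless the socket is a unix socket. */
static struct model_unix_sock *model_skc_to_unix_sock(struct model_sock *sk)
{
    if (sk == NULL || sk->family != MODEL_AF_UNIX)
        return NULL;
    /* Valid because 'sk' is the first member of struct model_unix_sock. */
    return (struct model_unix_sock *)sk;
}
```

A caller is expected to test the result against NULL before touching any unix-specific field, which mirrors the "return value could be NULL" note in the commit message.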

19 Oct, 2021 1 commit

bpf: Rename BTF_KIND_TAG to BTF_KIND_DECL_TAG · 223f903e

Committed by Yonghong Song on Oct 12, 2021

Patch set [1] introduced BTF_KIND_TAG to allow tagging
declarations for struct/union, struct/union field, var, func
and func arguments and these tags will be encoded into
dwarf. They are also encoded to btf by llvm for the bpf target.

After BTF_KIND_TAG was introduced, we intended to use it
for kernel __user attributes. But kernel __user is actually
a type attribute. Upstream and internal discussion showed
that it is not a good idea to mix declaration attributes and
type attributes. So we proposed to introduce btf_type_tag
as a type attribute and to rename the existing btf_tag to
btf_decl_tag ([2]).

This patch renames BTF_KIND_TAG to BTF_KIND_DECL_TAG and some
other declarations with *_tag to *_decl_tag to make it clear
that the tag is for declarations. In the future, BTF_KIND_TYPE_TAG
might be introduced per [3].

 [1] https://lore.kernel.org/bpf/20210914223004.244411-1-yhs@fb.com/
 [2] https://reviews.llvm.org/D111588
 [3] https://reviews.llvm.org/D111199

Fixes: b5ea834d ("bpf: Support for new btf kind BTF_KIND_TAG")
Fixes: 5b84bd10 ("libbpf: Add support for BTF_KIND_TAG")
Fixes: 5c07f2fe ("bpftool: Add support for BTF_KIND_TAG")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211012164838.3345699-1-yhs@fb.com

09 Oct, 2021 1 commit

bpf: Support writable context for bare tracepoint · 65223741

Committed by Hou Tao on Oct 04, 2021

Commit 9df1c28b ("bpf: add writable context for raw tracepoints")
supports writable context for tracepoints, but it lacks support
for bare tracepoints, which have no associated trace event.

A bare tracepoint is defined by DECLARE_TRACE(), so add a corresponding
DECLARE_TRACE_WRITABLE() macro that generates a definition in the
__bpf_raw_tp_map section for a bare tracepoint, in a similar way to
DEFINE_TRACE_WRITABLE().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211004094857.30868-2-hotforest@gmail.com

06 Oct, 2021 4 commits

bpf: selftests: Add selftests for module kfunc support · c48e51c8

Committed by Kumar Kartikeya Dwivedi on Oct 02, 2021

This adds selftests that test the success and failure paths for module
kfuncs (in the presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add a module BTF ID set from bpf_testmod.

This also introduces a couple of test cases to the verifier selftests for
validating whether we get an error or not, depending on whether an invalid
kfunc call remains after elimination of unreachable instructions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com

bpf: Enable TCP congestion control kfunc from modules · 0e32dfc8

Committed by Kumar Kartikeya Dwivedi on Oct 02, 2021

This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementations set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.

This lifts the restriction that they must be compiled as built-in
objects, so they can be loaded as modules if required. Also modify
Makefile.modfinal to call resolve_btfids for each module.

Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.

See following commits for background on use of:

 CONFIG_X86 ifdef:
 569c484f (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)

 CONFIG_DYNAMIC_FTRACE ifdef:
 7aae231a (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com

bpf: btf: Introduce helpers for dynamic BTF set registration · 14f267d9

Committed by Kumar Kartikeya Dwivedi on Oct 02, 2021

This adds helpers for registering btf_id_set from modules and the
bpf_check_mod_kfunc_call callback that can be used to look them up.

With in-kernel sets, the way this is supposed to work is that the in-kernel
callback looks up the BTF id within the in-kernel kfunc whitelist, and then
defers to the dynamic BTF set lookup if it doesn't find it. If there is
no in-kernel BTF id set, this callback can be used directly.

Also fix includes for btf.h and bpfptr.h so that they can be included in
isolation. This is in preparation for their usage in the tcp_bbr, tcp_cubic
and tcp_dctcp modules in the next patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-4-memxor@gmail.com

bpf: Introduce BPF support for kernel module function calls · 2357672c

Committed by Kumar Kartikeya Dwivedi on Oct 02, 2021

This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxiliary struct for
bpf_prog, and loaded as needed.

The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.

We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.

Another change is to the check_kfunc_call callback, which now includes a
struct module * pointer. This is to be used in a later patch so that the
kfunc_id and module pointer are matched for dynamically registered BTF
sets from loadable modules, so that the same kfunc_id in two modules doesn't
lead to check_kfunc_call succeeding. For the duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com

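The offset-indexed descriptor table described in the module kfunc commit can be sketched in userspace as a sorted array with binary search. This is a rough model only: the struct and function names are hypothetical, the integer btf_fd stands in for a resolved module BTF, and the only facts taken from the commit message are that offset 0 is reserved for vmlinux BTF, that the table is sorted by offset, and that it holds at most 256 module BTFs.

```c
#include <stdlib.h>

#define MAX_KFUNC_BTFS 256      /* limit quoted in the commit message */

struct btf_desc {
    short offset;               /* insn->off value, an index into fd_array */
    int btf_fd;                 /* stand-in for the resolved module BTF */
};

struct btf_tab {
    struct btf_desc descs[MAX_KFUNC_BTFS];
    size_t nr;
};

static int cmp_offset(const void *a, const void *b)
{
    const struct btf_desc *da = a, *db = b;

    return (da->offset > db->offset) - (da->offset < db->offset);
}

/* Returns the descriptor for 'offset', or NULL if it was never added. */
static struct btf_desc *btf_tab_find(struct btf_tab *tab, short offset)
{
    struct btf_desc key = { .offset = offset };

    return bsearch(&key, tab->descs, tab->nr, sizeof(key), cmp_offset);
}

/* Adds a module BTF; offset 0 is rejected because it denotes vmlinux BTF. */
static int btf_tab_add(struct btf_tab *tab, short offset, int btf_fd)
{
    if (offset <= 0 || tab->nr == MAX_KFUNC_BTFS)
        return -1;
    if (btf_tab_find(tab, offset))
        return 0;               /* one descriptor per offset */
    tab->descs[tab->nr].offset = offset;
    tab->descs[tab->nr].btf_fd = btf_fd;
    tab->nr++;
    /* Keep the array sorted so that lookups can use binary search. */
    qsort(tab->descs, tab->nr, sizeof(tab->descs[0]), cmp_offset);
    return 0;
}
```

Re-sorting on every insert keeps the sketch short; with a bound of 256 entries the cost is negligible.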

05 Oct, 2021 7 commits

mlx4: constify args for const dev_addr · ebb1fdb5

Committed by Jakub Kicinski on Oct 04, 2021

netdev->dev_addr will become const soon. Make sure all
functions which pass it around mark appropriate args
as const.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: replace mlx4_u64_to_mac() with u64_to_ether_addr() · 1bb96a07

Committed by Jakub Kicinski on Oct 04, 2021

mlx4_u64_to_mac() predates the common helper but doesn't
make the argument constant.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: replace mlx4_mac_to_u64() with ether_addr_to_u64() · ded6e16b

Committed by Jakub Kicinski on Oct 04, 2021

mlx4_mac_to_u64() predates and opencodes ether_addr_to_u64().
It doesn't make the argument constant so it'll be problematic
when dev->dev_addr becomes a const. Convert to the generic helper.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

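For readers unfamiliar with the generic helpers named in the two mlx4 conversion commits above, the following userspace sketch re-implements the same big-endian packing with standard C types. It is meant as an illustration of the conversion and of the const-qualified input argument, not as a copy of the kernel header.

```c
#include <stdint.h>

#define ETH_ALEN 6

/* Pack a 6-byte MAC address into the low 48 bits of a u64, MSB first. */
static uint64_t ether_addr_to_u64(const uint8_t *addr)
{
    uint64_t u = 0;

    for (int i = 0; i < ETH_ALEN; i++)
        u = u << 8 | addr[i];
    return u;
}

/* Unpack the low 48 bits of a u64 into a 6-byte MAC address. */
static void u64_to_ether_addr(uint64_t u, uint8_t *addr)
{
    for (int i = ETH_ALEN - 1; i >= 0; i--) {
        addr[i] = u & 0xff;
        u >>= 8;
    }
}
```

Because the input of ether_addr_to_u64() is const-qualified, a caller can pass a read-only address buffer without a cast, which is the property the commit messages care about.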

netlink: remove netlink_broadcast_filtered · 549017aa

Committed by Florian Westphal on Oct 05, 2021

No users in tree since commit a3498436 ("netns: restrict uevents"),
so remove this functionality.

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Enable single IRQ for PCI Function · f891b7cd

Committed by Shay Drory on Aug 01, 2021

Prior to this patch the driver requires two IRQs to function properly,
one required IRQ for control and at least one required IRQ for IO.

This requirement can be relaxed to one as the driver now allows
sharing of IRQs, so control and IO EQs can share the same irq.

This is needed to support a large number of VFs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Shift control IRQ to the last index · 3663ad34

Committed by Shay Drory on Aug 19, 2021

Control IRQ is the first IRQ vector. This complicates handling of
completion IRQs, as we need to offset them by one.
In the next patch, there are scenarios where completion and control EQs
will share the same IRQ, for example functions with a single IRQ. To ease
such scenarios, we shift the control IRQ to the end of the IRQ array.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

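The index arithmetic behind the control IRQ shift can be sketched as follows. The helper names are hypothetical and are not mlx5 driver functions; the sketch only assumes what the commit message states, namely that the control IRQ used to be the first vector and now sits at the end of the IRQ array.

```c
/* Old layout: control IRQ at index 0, so completion vector i is IRQ i + 1. */
static int old_comp_irq_index(int comp_vec)
{
    return comp_vec + 1;
}

/* New layout: completion vector i maps directly to IRQ i. */
static int new_comp_irq_index(int comp_vec)
{
    return comp_vec;
}

/* New layout: the control IRQ is the last entry of an nvec-sized array. */
static int new_ctrl_irq_index(int nvec)
{
    return nvec - 1;
}
```

With a single-IRQ function (nvec equal to 1), the control IRQ and completion vector 0 both resolve to index 0, so the shared-IRQ case needs no special offset handling.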

net/mlx5: Bridge, mark reg_c1 when pushing VLAN · 5249001d

Committed by Vlad Buslov on Sep 01, 2021

On ingress VLAN push, also assign the value 0x7FE to the reg_c1 tunnel id+opts
bits (tunnel id 0, which is not a valid tunnel id, and option 0x7FE, which
was reserved by one of the previous patches in the series). In the following
patch the reg value is matched on egress miss to restore the packet to its
original state by removing the VLAN before passing it to the software data
path.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

04 Oct, 2021 9 commits

net: phylink: add phylink_set_10g_modes() helper · a2c27a61

Committed by Russell King (Oracle) on Oct 04, 2021

Add a helper for setting 10Gigabit modes, so we have one central
place that sets all appropriate 10G modes for a driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Update the TCP active termination 2 MSL timer ("TIME_WAIT") · a64aa0a8

Committed by Prabhakar Kushwaha on Oct 04, 2021

Initialize 2 MSL timeout value used for the TCP TIME_WAIT state to
non-zero default.

This patch also removes magic number from qedi/qedi_main.c.
Reviewed-by: Manish Rangankar <mrangankar@marvell.com>
Signed-off-by: Nikolay Assa <nassa@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Update TCP silly-window-syndrome timeout for iwarp, scsi · 3a6f5d0c

Committed by Nikolay Assa on Oct 04, 2021

Update TCP silly-window-syndrome timeout, for the cases where
initiator's small TCP window size prevents FW from transmitting
packets on the connection. Timeout causes FW to retransmit
window probes if needed, preventing I/O stall if initiator ignores
first window probe.
Reviewed-by: Manish Rangankar <mrangankar@marvell.com>
Signed-off-by: Nikolay Assa <nassa@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Update qed_hsi.h for fw 8.59.1.0 · fe40a830

Committed by Prabhakar Kushwaha on Oct 04, 2021

The qed_hsi.h has been updated to support new FW version 8.59.1.0 with
changes.
 - Updates FW HSI (Hardware Software interface) structures.
 - Addition/update in function declaration and defines as per HSI.
 - Add generic infrastructure for FW error reporting as part of
   common event queue handling.
 - Move malicious VF error reporting to FW error reporting
   infrastructure.
 - Move consolidation queue initialization from FW context to ramrod
   message.

qed_hsi.h header file changes lead to change in many files to ensure
compilation.

This patch also fixes the existing checkpatch warnings and few important
checks.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Update common_hsi for FW ver 8.59.1.0 · 484563e2

Committed by Prabhakar Kushwaha on Oct 04, 2021

The common_hsi.h has been updated for FW version 8.59.1.0 with below
changes.
  - FW and Tools version.
  - New structures related to search table, packet duplication.
  - Structure for doorbell address for legacy mode without DEM.
  - Enhanced union rdma_eqe_data for RoCE Suspend Event Data.
  - New defines.

This patch also fixes the existing checkpatch warnings and few important
checks.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Remove e4_ and _e4 from FW HSI · fb09a1ed

Committed by Shai Malin on Oct 04, 2021

The existing qed/qede/qedr/qedi/qedf code uses chip-specific naming in
structures, functions, variables and defines in the FW HSI (Hardware
Software Interface).

The new FW version introduced a generic naming convention in the HSI,
in which the same code will be used across different versions
for simpler maintainability. It also eases providing support for
new features.

With this patch, every "_e4" or "e4_" prefix or suffix is no longer
needed and is removed.

Reviewed-by: Manish Rangankar <mrangankar@marvell.com>
Reviewed-by: Javed Hasan <jhasan@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Fix kernel-doc warnings · 19198e4e

Committed by Prabhakar Kushwaha on Oct 04, 2021

This patch fixes all the qed and qede kernel-doc warnings
according to the guidelines that are described in
Documentation/doc-guide/kernel-doc.rst.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: ioam: Add support for the ip6ip6 encapsulation · 8cb3bf8b

Committed by Justin Iurman on Oct 03, 2021

This patch adds support for the ip6ip6 encapsulation by providing three encap
modes: inline, encap and auto.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: ioam: Distinguish input and output for hop-limit · 52d03786

Committed by Justin Iurman on Oct 03, 2021

This patch anticipates the support for the IOAM insertion inside in-transit
packets, by making a difference between input and output in order to determine
the right value for its hop-limit (inherited from the IPv6 hop-limit).

Input case: happens before ip6_forward, the IPv6 hop-limit is not decremented
yet -> decrement the IOAM hop-limit to reflect the new hop inside the trace.

Output case: happens after ip6_forward, the IPv6 hop-limit has already been
decremented -> keep the same value for the IOAM hop-limit.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

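The input/output distinction for the IOAM hop-limit boils down to a one-line rule. The function below is a hypothetical sketch of that rule as the commit message describes it; it is not the kernel's IOAM code, and the function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Input path: runs before ip6_forward, so the IPv6 hop-limit has not been
 * decremented yet and the IOAM hop-limit must account for the new hop.
 * Output path: runs after ip6_forward, so the value is used unchanged.
 */
static uint8_t ioam_hop_limit(uint8_t ipv6_hop_limit, bool is_input)
{
    return is_input ? (uint8_t)(ipv6_hop_limit - 1) : ipv6_hop_limit;
}
```

For a packet arriving with hop-limit 64, the input path sees 64 and the output path sees 63, and both record 63 in the trace.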

02 Oct, 2021 4 commits

ethernet: use eth_hw_addr_set() instead of ether_addr_copy() · f3956ebb

Committed by Jakub Kicinski on Oct 01, 2021

Convert Ethernet from ether_addr_copy() to eth_hw_addr_set():

  @@
  expression dev, np;
  @@
  - ether_addr_copy(dev->dev_addr, np)
  + eth_hw_addr_set(dev, np)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: write full VLAN TCI in the injection header · e8c07229

Committed by Vladimir Oltean on Oct 01, 2021

The VLAN TCI contains more than the VLAN ID, it also has the VLAN PCP
and Drop Eligibility Indicator.

If the ocelot driver is going to write the VLAN header inside the DSA
tag, it could just as well write the entire TCI.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

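As background for the TCI commit above, the 16-bit VLAN Tag Control Information field defined by IEEE 802.1Q holds a 3-bit PCP, a 1-bit DEI and a 12-bit VLAN ID. The helpers below are a userspace sketch with hypothetical names, showing what "writing the full TCI" means compared with writing only the VLAN ID.

```c
#include <stdint.h>

/* TCI layout: PCP in bits 15..13, DEI in bit 12, VID in bits 11..0. */
static uint16_t vlan_tci_pack(uint8_t pcp, uint8_t dei, uint16_t vid)
{
    return (uint16_t)(((pcp & 0x7u) << 13) | ((dei & 0x1u) << 12) |
                      (vid & 0x0fffu));
}

static uint8_t vlan_tci_pcp(uint16_t tci)
{
    return (uint8_t)(tci >> 13);
}

static uint8_t vlan_tci_dei(uint16_t tci)
{
    return (uint8_t)((tci >> 12) & 0x1u);
}

static uint16_t vlan_tci_vid(uint16_t tci)
{
    return tci & 0x0fffu;
}
```

Masking a TCI down to its low 12 bits discards the priority and drop-eligibility bits, which is the information loss the commit avoids.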

net: mscc: ocelot: support egress VLAN rewriting via VCAP ES0 · de5bbb6f

Committed by Vladimir Oltean on Oct 01, 2021

Currently the ocelot driver does support the 'vlan modify' action, but
in the ingress chain, and it is offloaded to VCAP IS1. This action
changes the classified VLAN before the packet enters the bridging
service, and the bridging works with the classified VLAN modified by
VCAP IS1.

That is good for some use cases, but there are others where the VLAN
must be modified at the stage of the egress port, after the packet has
exited the bridging service. One example is simulating IEEE 802.1CB
active stream identification filters ("active" means that not only the
rule matches on a packet flow, but it is also able to change some
headers). For example, a stream is replicated on two egress ports, but
they must have different VLAN IDs on egress ports A and B.

This seems like a task for the VCAP ES0, but that currently only
supports pushing the ES0 tag A, which is specified in the rule. Pushing
another VLAN header is not what we want, but rather overwriting the
existing one.

It looks like when we push the ES0 tag A, it is actually possible to not
only take the ES0 tag A's value from the rule itself (VID_A_VAL), but
derive it from the following formula:

ES0_TAG_A = Classified VID + VID_A_VAL

In other words, ES0_TAG_A can be used to increment the VLAN ID that the
packet was already classified to by a given value, and the packet
will have this value as an outer VLAN tag. This new VLAN ID value then
gets stripped on egress (or not) according to the value of the native
VLAN from the bridging service.

While the hardware will happily increment the classified VLAN ID for all
packets that match the ES0 rule, in practice this would be rather
insane, so we only allow this kind of ES0 action if the ES0 filter
contains a VLAN ID too, so as to restrict the matching on a known
classified VLAN. If we program VID_A_VAL with the delta between the
desired final VLAN (ES0_TAG_A) and the classified VLAN, we obtain the
desired behavior.

It doesn't look like it is possible with the tc-vlan action to modify
the VLAN ID but not the PCP. In hardware it is possible to leave the PCP
to the classified value, but we unconditionally program it to overwrite
it with the PCP value from the rule.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

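The formula ES0_TAG_A = Classified VID + VID_A_VAL from the commit message can be turned into a small worked example. The helper names are hypothetical, and the wrap-around at 12 bits is an assumption made here so that a desired VLAN ID below the classified one can still be expressed; the commit message does not say how the hardware treats overflow.

```c
#include <stdint.h>

#define VID_MASK 0x0fffu    /* VLAN IDs are 12 bits wide */

/* Delta to program as VID_A_VAL so that the egress tag equals desired_vid. */
static uint16_t es0_vid_a_val(uint16_t classified_vid, uint16_t desired_vid)
{
    return (uint16_t)((desired_vid - classified_vid) & VID_MASK);
}

/* What the hardware formula yields for a given classified VID and delta. */
static uint16_t es0_tag_a(uint16_t classified_vid, uint16_t vid_a_val)
{
    return (uint16_t)((classified_vid + vid_a_val) & VID_MASK);
}
```

For example, rewriting classified VLAN 100 to 200 on egress needs a delta of 100. Matching on the classified VLAN in the filter, as the commit requires, is what makes a single fixed delta meaningful.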

Bluetooth: Rename driver .prevent_wake to .wakeup · 4539ca67

Committed by Luiz Augusto von Dentz on Oct 01, 2021

The prevent_wake logic is backward, since what it is really checking is
whether the device may wake up the system, not whether it will prevent
the system from being awakened.

Also, looking at how other subsystems name the entry power/wakeup,
this also renames force_prevent_wake to force_wakeup in the vhci driver.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

01 Oct, 2021 2 commits

devlink: report maximum number of snapshots with regions · a70e3f02

Committed by Jacob Keller on Sep 30, 2021

Each region has an independently configurable number of maximum
snapshots. This information is not reported to userspace, making it not
very discoverable. Fix this by adding a new
DEVLINK_ATTR_REGION_MAX_SNAPSHOTS attribute which is used to report this
maximum.

Ex:

  $ devlink region
  pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1
  pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10
  pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1
  pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10

This information enables users to understand why a new region command
may fail due to having too many existing snapshots.

Reported-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, xdp, docs: Correct some English grammar and spelling · 6bbc7103

Committed by Kev Jackson on Sep 30, 2021

Header DOC on include/net/xdp.h contained a few English grammar and
spelling errors.

Signed-off-by: Kev Jackson <foamdino@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/YVVaWmKqA8l9Tm4J@kev-VirtualBox

30 Sep, 2021 6 commits

af_unix: fix races in sk_peer_pid and sk_peer_cred accesses · 35306eb2

Committed by Eric Dumazet on Sep 29, 2021

Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.

In order to fix this issue, this patch adds a new spinlock that needs
to be used whenever these fields are read or written.

Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
reading sk->sk_peer_pid which makes no sense, as this field
is only possibly set by AF_UNIX sockets.
We will have to clean this in a separate patch.
This could be done by reverting b48596d1 "Bluetooth: L2CAP: Add get_peer_pid callback"
or implementing what was truly expected.

Fixes: 109f6e39 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: snmp: inline snmp_get_cpu_field() · 59f09ae8

Committed by Eric Dumazet on Sep 29, 2021

This trivial function is called ~90,000 times on 256-CPU hosts
when reading /proc/net/netstat. And this number keeps inflating.

Inlining it saves many cycles.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: adjust rcv_ssthresh according to sk_reserved_mem · 053f3684

Committed by Wei Wang on Sep 29, 2021

When user sets SO_RESERVE_MEM socket option, in order to utilize the
reserved memory when in memory pressure state, we adjust rcv_ssthresh
according to the available reserved memory for the socket, instead of
using 4 * advmss always.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

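A rough sketch of the rcv_ssthresh adjustment under memory pressure is given below. It is a simplified model, not the kernel function: the name is hypothetical, the unused reserved memory is treated directly as window bytes, and using it as a floor above the 4 * advmss clamp is an assumption about how "according to the available reserved memory" is realized.

```c
#include <stdint.h>

/*
 * Model of the adjustment:
 *  - clamp rcv_ssthresh to 4 * advmss (the previous behaviour);
 *  - if the socket still has unused reserved memory, allow rcv_ssthresh
 *    to stay as large as that reservation (assumption of this sketch).
 */
static uint32_t adjust_rcv_ssthresh(uint32_t rcv_ssthresh, uint32_t advmss,
                                    uint32_t unused_reserved)
{
    uint32_t clamp = 4u * advmss;
    uint32_t v = rcv_ssthresh < clamp ? rcv_ssthresh : clamp;

    if (unused_reserved > v)
        v = unused_reserved;
    return v;
}
```

With advmss of 1460 and no reservation the result is capped at 5840 bytes, while a socket with 16384 bytes of unused reserved memory keeps a threshold of 16384.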

tcp: adjust sndbuf according to sk_reserved_mem · ca057051

Committed by Wei Wang on Sep 29, 2021

If user sets SO_RESERVE_MEM socket option, in order to fully utilize the
reserved memory in memory pressure state on the tx path, we modify the
logic in sk_stream_moderate_sndbuf() to set sk_sndbuf according to
available reserved memory, instead of MIN_SOCK_SNDBUF, and adjust it
when new data is acked.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add new socket option SO_RESERVE_MEM · 2bb2f5fb

Committed by Wei Wang on Sep 29, 2021

This socket option provides a mechanism for users to reserve a certain
amount of memory for the socket to use. When this option is set, the kernel
charges the user-specified amount of memory to memcg, as well as to
sk_forward_alloc. This amount of memory is not reclaimable and is
available in sk_forward_alloc for this socket.
With this socket option set, the networking stack spends fewer cycles
doing forward alloc and reclaim, which should lead to better system
performance, at the cost of an amount of pre-allocated and
unreclaimable memory, even under memory pressure.

Note:
This socket option is only available when memory cgroup is enabled, and we
require this reserved memory to be charged to the user's memcg. We hope
this prevents misbehaving users from abusing this feature to reserve a
large amount on certain sockets and cause unfairness for others.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

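From userspace, the new option would be set with an ordinary setsockopt() call. The sketch below assumes an int-sized byte count as the option value and falls back to the asm-generic option number 73 when the libc headers do not define SO_RESERVE_MEM yet; some architectures use their own socket option numbering, so that fallback is not universal. The wrapper name is hypothetical.

```c
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73   /* asm-generic value; assumption for older headers */
#endif

/*
 * Ask the kernel to reserve 'bytes' of memory for the socket.
 * Returns 0 on success and -1 on failure with errno set, for instance
 * when the running kernel lacks the option or memory cgroups are disabled.
 */
static int reserve_socket_memory(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM, &bytes, sizeof(bytes));
}
```

A caller should treat failure as non-fatal, since the option depends on memcg being enabled, as the note in the commit message points out.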

net: introduce and use lock_sock_fast_nested() · 49054556

Committed by Paolo Abeni on Sep 29, 2021

Syzkaller reported a false positive deadlock involving
the nl socket lock and the subflow socket lock:

MPTCP: kernel_bind error, err=-98
============================================
WARNING: possible recursive locking detected
5.15.0-rc1-syzkaller #0 Not tainted
--------------------------------------------
syz-executor998/6520 is trying to acquire lock:
ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738

but task is already holding lock:
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(k-sk_lock-AF_INET);
  lock(k-sk_lock-AF_INET);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by syz-executor998/6520:
 #0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
 #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
 #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
 #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
 #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720

stack backtrace:
CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
 check_deadlock kernel/locking/lockdep.c:2987 [inline]
 validate_chain kernel/locking/lockdep.c:3776 [inline]
 __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
 lock_acquire kernel/locking/lockdep.c:5625 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
 lock_sock_fast+0x36/0x100 net/core/sock.c:3229
 mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
 __sock_release net/socket.c:649 [inline]
 sock_release+0x87/0x1b0 net/socket.c:677
 mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
 mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
 genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
 netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:724
 sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
 kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
 kernel_sendpage net/socket.c:3501 [inline]
 sock_sendpage+0xe5/0x140 net/socket.c:1003
 pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
 splice_from_pipe_feed fs/splice.c:418 [inline]
 __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
 splice_from_pipe fs/splice.c:597 [inline]
 generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
 do_splice_from fs/splice.c:767 [inline]
 direct_splice_actor+0x110/0x180 fs/splice.c:936
 splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
 do_splice_direct+0x1b3/0x280 fs/splice.c:979
 do_sendfile+0xae9/0x1240 fs/read_write.c:1249
 __do_sys_sendfile64 fs/read_write.c:1314 [inline]
 __se_sys_sendfile64 fs/read_write.c:1300 [inline]
 __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f215cb69969
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000

The problem originates from an incorrect lock annotation in the mptcp
code and is only visible since commit 2dcb96ba ("net: core: Correct
the sock::sk_lock.owned lockdep annotations"), but it has been present
since the initial implementation of port-based endpoint support.

This patch addresses the issue by introducing a nested variant of
lock_sock_fast() and using it in the relevant code path.

Fixes: 1729cf18 ("mptcp: create the listening socket for new port")
Fixes: 2dcb96ba ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

29 Sep, 2021 5 commits

mctp: Add tracepoints for tag/key handling · 4f9e1ba6

Committed by Jeremy Kerr on Sep 29, 2021

The tag allocation, release and bind events are somewhat opaque outside
the kernel; this change adds a few tracepoints to assist in
instrumentation and debugging.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: Implement a timeout for tags · 7b14e15a

Committed by Jeremy Kerr on Sep 29, 2021

Currently, an MCTP (local-eid,remote-eid,tag) tuple is allocated to a
socket on send, and only expires when the socket is closed.

This change introduces a tag timeout, freeing the tuple after a fixed
expiry - currently six seconds. This is greater than (but close to) the
max response timeout in upper-layer bindings.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

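The tag timeout can be pictured as an expiry timestamp attached to each allocated tuple. The sketch below is a userspace model with hypothetical names; only the six-second figure comes from the commit message, and the millisecond clock passed in by the caller is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_TIMEOUT_MS 6000u    /* six seconds, per the commit message */

struct tag_key {
    uint64_t expiry_ms;         /* absolute time at which the tag lapses */
    bool valid;
};

/* Called when the tuple is allocated on send. */
static void tag_key_arm(struct tag_key *k, uint64_t now_ms)
{
    k->expiry_ms = now_ms + TAG_TIMEOUT_MS;
    k->valid = true;
}

/* Returns true if this call released the tag because it had expired. */
static bool tag_key_expire(struct tag_key *k, uint64_t now_ms)
{
    if (k->valid && now_ms >= k->expiry_ms) {
        k->valid = false;
        return true;
    }
    return false;
}
```

In this model a tag allocated at t = 1000 ms is still held at 6999 ms and is released at 7000 ms, without waiting for the socket to be closed.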

mctp: Add refcounts to mctp_dev · 43f55f23

Committed by Jeremy Kerr on Sep 29, 2021

Currently, we tie the struct mctp_dev lifetime to the underlying struct
net_device, and hold/put that device as a proxy for a separate mctp_dev
refcount. This works because we're not holding any references to the
mctp_dev that are different from the netdev lifetime.

In a future change we'll break that assumption though, as we'll need to
hold mctp_dev references in a workqueue, which might live past the
netdev unregister notification.

In order to support that, this change introduces a refcount on the
mctp_dev, currently taken by the net_device->mctp_ptr reference, and
released on netdev unregister events. We can then use this for future
references that might outlast the net device.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

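The lifetime rule introduced for mctp_dev can be sketched with a plain reference counter. This is a userspace model using C11 atomics and hypothetical names, not the kernel refcount API: the initial reference plays the role of the one owned by the net_device pointer, and any additional holder (such as deferred work) takes its own.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_dev {
    atomic_int refs;
    bool freed;                 /* set when the last reference is dropped */
};

/* The initial reference belongs to the owning net device pointer. */
static void model_dev_init(struct model_dev *d)
{
    atomic_init(&d->refs, 1);
    d->freed = false;
}

static void model_dev_hold(struct model_dev *d)
{
    atomic_fetch_add(&d->refs, 1);
}

/* Drops one reference; the object is released when the count hits zero. */
static void model_dev_put(struct model_dev *d)
{
    if (atomic_fetch_sub(&d->refs, 1) == 1)
        d->freed = true;
}
```

If a worker holds a reference when the netdev unregisters, the unregister path's put leaves the object alive, and the worker's later put is the one that frees it.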

mctp: locking, lifetime and validity changes for sk_keys · 73c61845

Committed by Jeremy Kerr on Sep 29, 2021

We will want to invalidate sk_keys in a future change, which will
require a boolean flag to mark invalidated items in the socket & net
namespace lists. We'll also need to take a reference to keys, held over
non-atomic contexts, so we need a refcount on keys also.

This change adds a validity flag (currently always true) and refcount to
struct mctp_sk_key. With a refcount on the keys, using RCU no longer
makes much sense; we have exact indications on the lifetime of keys. So,
we also change the RCU list traversal to a locked implementation.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: micrel: Add support for LAN8804 PHY · 7c2dcfa2

Committed by Horatiu Vultur on Sep 28, 2021

The LAN8804 PHY has the same features as the LAN8814 PHY, except that it
doesn't support 1588, SyncE or Q-USGMII.

This PHY is found inside the LAN966X switches.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openeuler / Kernel: synced successfully more than 1 year ago
