1. 22 April 2021 (3 commits)
    • KVM: SVM: Add KVM_SEND_UPDATE_DATA command · d3d1af85
      Brijesh Singh authored
      The command is used for encrypting the guest memory region using the encryption
      context created with KVM_SEV_SEND_START.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <d6a6ea740b0c668b30905ae31eac5ad7da048bb3.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d3d1af85
    • KVM: SVM: Add KVM_SEV SEND_START command · 4cfdd47d
      Brijesh Singh authored
      The command is used to create an outgoing SEV guest encryption context.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <2f1686d0164e0f1b3d6a41d620408393e0a48376.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4cfdd47d
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman authored
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54526d1f
  2. 20 April 2021 (1 commit)
    • KVM: x86: Add capability to grant VM access to privileged SGX attribute · fe7e9488
      Sean Christopherson authored
      Add a capability, KVM_CAP_SGX_ATTRIBUTE, that can be used by userspace
      to grant a VM access to a privileged attribute, with args[0] holding a
      file handle to a valid SGX attribute file.
      
      The SGX subsystem restricts access to a subset of enclave attributes to
      provide additional security for an uncompromised kernel, e.g. to prevent
      malware from using the PROVISIONKEY to ensure its nodes are running
      inside a genuine SGX enclave and/or to obtain a stable fingerprint.
      
      To prevent userspace from circumventing such restrictions by running an
      enclave in a VM, KVM restricts guest access to privileged attributes by
      default.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <0b099d65e933e068e3ea934b0523bab070cb8cea.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe7e9488
  3. 17 April 2021 (1 commit)
  4. 04 March 2021 (1 commit)
    • net: l2tp: reduce log level of messages in receive path, add counter instead · 3e59e885
      Matthias Schiffer authored
      Commit 5ee759cd ("l2tp: use standard API for warning log messages")
      changed a number of warnings about invalid packets in the receive path
      so that they are always shown, instead of only when a special L2TP debug
      flag is set. Even with rate limiting these warnings can easily cause
      significant log spam - potentially triggered by a malicious party
      sending invalid packets on purpose.
      
      In addition these warnings were noticed by projects like Tunneldigger [1],
      which uses L2TP for its data path, but implements its own control
      protocol (which is sufficiently different from L2TP data packets that it
      would always be passed up to userspace even with future extensions of
      L2TP).
      
      Some of the warnings were already redundant, as l2tp_stats has a counter
      for these packets. This commit adds one additional counter for invalid
      packets that are passed up to userspace. Packets with an unknown session are
      not counted as invalid, as there is nothing wrong with the format of
      these packets.
      
      With the additional counter, all of these messages are either redundant
      or benign, so we reduce them to pr_debug_ratelimited().
      
      [1] https://github.com/wlanslovenija/tunneldigger/issues/160
      
      Fixes: 5ee759cd ("l2tp: use standard API for warning log messages")
      Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e59e885
  5. 03 March 2021 (1 commit)
    • KVM: x86/xen: Add support for vCPU runstate information · 30b5c851
      David Woodhouse authored
      This is how Xen guests do steal time accounting. The hypervisor records
      the amount of time spent in each of running/runnable/blocked/offline
      states.
      
      In the Xen accounting, a vCPU is still in state RUNSTATE_running while
      in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
      does the state become RUNSTATE_blocked. In KVM this means that even when
      the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.
      
      The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
      amount of time to the blocked state and subtract it from the running
      state.
      
      The state_entry_time corresponds to get_kvmclock_ns() at the time the
      vCPU entered the current state, and the total times of all four states
      should always add up to state_entry_time.
      Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      30b5c851
  6. 28 February 2021 (1 commit)
  7. 27 February 2021 (1 commit)
  8. 25 February 2021 (2 commits)
    • numa balancing: migrate on fault among multiple bound nodes · bda420b9
      Huang Ying authored
      Now, NUMA balancing can only optimize the page placement among the NUMA
      nodes if the default memory policy is used, because an explicitly
      specified memory policy should take precedence.  But this seems too strict in
      some situations.  For example, on a system with 4 NUMA nodes, if the
      memory of an application is bound to the node 0 and 1, NUMA balancing can
      potentially migrate the pages between the node 0 and 1 to reduce
      cross-node accessing without breaking the explicit memory binding policy.
      
      So in this patch, we add MPOL_F_NUMA_BALANCING mode flag to
      set_mempolicy() when mode is MPOL_BIND.  With the flag specified, NUMA
      balancing will be enabled within the thread to optimize the page placement
      within the constraints of the specified memory binding policy.  With the
      newly added flag, the NUMA balancing control mechanism becomes,
      
       - sysctl knob numa_balancing can enable/disable the NUMA balancing
         globally.
      
       - even if sysctl numa_balancing is enabled, the NUMA balancing will be
         disabled for the memory areas or applications with the explicit
         memory policy by default.
      
       - MPOL_F_NUMA_BALANCING can be used to enable the NUMA balancing for
         the applications when specifying the explicit memory policy
         (MPOL_BIND).
      
      Various page placement optimizations based on NUMA balancing can be
      done with these flags.  As the first step, in this patch, if the memory of
      the application is bound to multiple nodes (MPOL_BIND) and, in the hint
      page fault handler, the accessing node is in the policy nodemask, the
      kernel will try to migrate the page to the accessing node to reduce
      cross-node accessing.
      
      If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
      application on an old kernel version without its support, set_mempolicy()
      will return -1 and errno will be set to EINVAL.  The application can use
      this behavior to run on both old and new kernel versions.
      
      And if the MPOL_F_NUMA_BALANCING flag is specified for the mode other than
      MPOL_BIND, set_mempolicy() will return -1 and errno will be set to EINVAL
      as before.  Because we don't support optimization based on the NUMA
      balancing for these modes.
      
      In the previous version of the patch, we tried to reuse MPOL_MF_LAZY for
      mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it seems not a good
      API/ABI for the purpose of the patch.
      
      Because it's not clear whether it's necessary to enable NUMA balancing
      for a specific memory area inside an application, we only add the flag
      at the thread level (set_mempolicy()) instead of the memory area level
      (mbind()).  We can do that when it becomes necessary.
      
      To test the patch, we run a test case as follows on a 4-node machine with
      192 GB memory (48 GB per node).
      
      1. Change pmbench memory accessing benchmark to call set_mempolicy()
         to bind its memory to node 1 and 3 and enable NUMA balancing.  Some
         related code snippets are as follows,
      
            #include <numa.h>
            #include <numaif.h>
            #include <errno.h>
            #include <stdio.h>
            #include <stdlib.h>

            struct bitmask *bmp;
            int ret;

            bmp = numa_parse_nodestring("1,3");
            ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
                                bmp->maskp, bmp->size + 1);
            /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
            if (ret < 0 && errno == EINVAL)
                    ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
            if (ret < 0) {
                    perror("Failed to call set_mempolicy");
                    exit(-1);
            }
      
      2. Run a memory eater on node 3 to use 40 GB memory before running pmbench.
      
      3. Run pmbench with 64 processes, the working-set size of each process
         is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB.  The
         CPU and the memory (as in step 1.) of all pmbench processes is bound
         to node 1 and 3. So, after CPU usage is balanced, some pmbench
         processes running on the CPUs of node 3 will access the memory of
         node 1.
      
      4. After the pmbench processes run for 100 seconds, kill the memory
         eater.  Now it's possible for some pmbench processes to migrate
         their pages from node 1 to node 3 to reduce cross-node accessing.
      
      Test results show that, with the patch, the pages can be migrated from
      node 1 to node 3 after killing the memory eater, and the pmbench score
      can increase about 17.5%.
      
      Link: https://lkml.kernel.org/r/20210120061235.148637-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda420b9
    • bpf: Remove blank line in bpf helper description comment · a7c9c25a
      Hangbin Liu authored
      Commit 34b2021c ("bpf: Add BPF-helper for MTU checking") added an extra
      blank line in bpf helper description. This will make bpf_helpers_doc.py stop
      building bpf_helper_defs.h immediately after bpf_check_mtu(), which will
      affect functions added in the future.
      
      Fixes: 34b2021c ("bpf: Add BPF-helper for MTU checking")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210223131457.1378978-1-liuhangbin@gmail.com
      a7c9c25a
  9. 24 February 2021 (1 commit)
    • io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS · 1c0aa1fa
      Jens Axboe authored
      A few reasons to do this:
      
      - The naming of the manager and worker has changed. That's a user-visible
        change, so it makes sense to flag it.
      
      - Opening certain files that use ->signal (like /proc/self or /dev/tty)
        now works, and the flag tells the application upfront that this is the
        case.
      
      - Related to the above, using signalfd will now work as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1c0aa1fa
  10. 23 February 2021 (3 commits)
  11. 17 February 2021 (4 commits)
    • cxl/mem: Add set of informational commands · 57ee605b
      Ben Widawsky authored
      Add initial set of formal commands beyond basic identify and command
      enumeration.
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v2)
      Link: https://lore.kernel.org/r/20210217040958.1354670-8-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      57ee605b
    • cxl/mem: Enable commands via CEL · 472b1ce6
      Ben Widawsky authored
      CXL devices identified by the memory-device class code must implement
      the Device Command Interface (described in 8.2.9 of the CXL 2.0 spec).
      While the driver already maintains a list of commands it supports, there
      is still a need to be able to distinguish between commands that the
      driver knows about from commands that are optionally supported by the
      hardware.
      
      The Command Effects Log (CEL) is specified in the CXL 2.0 specification.
      The CEL is one of two types of logs, the other being vendor specific.
      They are distinguished in hardware/spec via UUID. The CEL is useful for
      2 things:
      1. Determine which optional commands are supported by the CXL device.
      2. Enumerate any vendor specific commands
      
      The CEL is used by the driver to determine which commands are available
      in the hardware and therefore which commands userspace is allowed to
      execute. The set of enabled commands might be a subset of commands which
      are advertised in UAPI via CXL_MEM_SEND_COMMAND IOCTL.
      
      With the CEL enabling comes an internal flag to indicate a base set of
      commands that are enabled regardless of CEL. Such commands are required
      for basic interaction with the hardware and thus can be useful in debug
      cases, for example if the CEL is corrupted.
      
      The implementation leaves the statically defined table of commands and
      supplements it with a bitmap to determine commands that are enabled.
      This organization was chosen for the following reasons:
      - Smaller memory footprint. Doesn't need a table per device.
      - Reduce memory allocation complexity.
      - Fixed command IDs to opcode mapping for all devices makes development
        and debugging easier.
      - Certain helpers are easily achievable, like cxl_for_each_cmd().
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v3)
      Link: https://lore.kernel.org/r/20210217040958.1354670-7-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      472b1ce6
    • cxl/mem: Add a "RAW" send command · 13237183
      Ben Widawsky authored
      The CXL memory device send interface will have a number of supported
      commands. The raw command is not such a command. Raw commands allow
      userspace to send a specified opcode to the underlying hardware and
      bypass all driver checks on the command. The primary use for this
      command is to [begrudgingly] allow undocumented vendor specific hardware
      commands.
      
      While not the main motivation, it also allows prototyping new hardware
      commands without a driver patch and rebuild.
      
      While this all sounds very powerful it comes with a couple of caveats:
      1. Bug reports using raw commands will not get the same level of
         attention as bug reports using supported commands (via taint).
      2. Supported commands will be rejected by the RAW command.
      
      With this comes a new debugfs knob to allow full access to your toes with
      your weapon of choice.
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Ariel Sibley <Ariel.Sibley@microchip.com>
      Link: https://lore.kernel.org/r/20210217040958.1354670-6-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      13237183
    • cxl/mem: Add basic IOCTL interface · 583fa5e7
      Ben Widawsky authored
      Add a straightforward IOCTL that provides a mechanism for userspace to
      query the supported memory device commands. CXL commands as they appear
      to userspace are described as part of the UAPI kerneldoc. The command
      list returned via this IOCTL will contain the full set of commands that
      the driver supports, however, some of those commands may not be
      available for use by userspace.
      
      Memory device commands first appear in the CXL 2.0 specification. They
      are submitted through a mailbox mechanism specified in the CXL 2.0
      specification.
      
      The send command allows userspace to issue mailbox commands directly to
      the hardware. The list of available commands to send are the output of
      the query command. The driver verifies basic properties of the command
      and possibly inspects the input (or output) payload to determine whether
      or not the command is allowed (or might taint the kernel).
      
      Reported-by: kernel test robot <lkp@intel.com> # bug in earlier revision
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20210217040958.1354670-5-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      583fa5e7
  12. 16 February 2021 (3 commits)
    • mptcp: add local addr info in mptcp_info · 0caf3ada
      Geliang Tang authored
      Add mptcpi_local_addr_used and mptcpi_local_addr_max in struct mptcp_info.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0caf3ada
    • binfmt_misc: pass binfmt_misc flags to the interpreter · 2347961b
      Laurent Vivier authored
      It can be useful to the interpreter to know which flags are in use.
      
      For instance, knowing whether preserve-argv[0] is in use would
      allow it to skip the pathname argument.
      
      This patch uses an unused auxiliary vector, AT_FLAGS, to add a
      flag informing the interpreter whether preserve-argv[0] is enabled.
      
      Note by Helge Deller:
      The real-world user of this patch is qemu-user, which needs to know
      if it has to preserve the argv[0]. See Debian bug #970460.
      Signed-off-by: Laurent Vivier <laurent@vivier.eu>
      Reviewed-by: YunQiang Su <ysu@wavecomp.com>
      URL: http://bugs.debian.org/970460
      Signed-off-by: Helge Deller <deller@gmx.de>
      2347961b
    • netfilter: nftables: introduce table ownership · 6001a930
      Pablo Neira Ayuso authored
      A userspace daemon like firewalld might need to monitor for netlink
      updates to detect its ruleset removal by the (global) flush ruleset
      command to ensure ruleset persistency. This adds extra complexity from
      userspace and, for a short time, the firewall policy is not in
      place.
      
      This patch adds the NFT_TABLE_F_OWNER flag, which allows a userspace
      program to exclusively own the tables that it creates.
      
      Tables that are owned...
      
      - can only be updated and removed by the owner, non-owners hit EPERM if
        they try to update it or remove it.
      - are destroyed when the owner closes the netlink socket or the process
        is gone (implicit netlink socket closure).
      - are skipped by the global flush ruleset command.
      - are listed in the global ruleset.
      
      The userspace process that sets the NFT_TABLE_F_OWNER flag needs to
      keep the netlink socket open.
      
      A new NFTA_TABLE_OWNER netlink attribute specifies the netlink port ID
      to identify the owner from userspace.
      
      This patch also updates error reporting when an unknown table flag is
      specified to change it from EINVAL to EOPNOTSUPP given that EINVAL is
      usually reserved to report for malformed netlink messages to userspace.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6001a930
  13. 15 February 2021 (2 commits)
  14. 13 February 2021 (3 commits)
  15. 12 February 2021 (5 commits)
  16. 11 February 2021 (3 commits)
  17. 10 February 2021 (1 commit)
  18. 09 February 2021 (4 commits)