提交 · 1fb7f8973f51ca1a7ffe61a2c779ed15f57f3d82 · openeuler / Kernel

26 3月, 2021 1 次提交

RDMA: Support more than 255 rdma ports · 1fb7f897

由 Mark Bloch 提交于 3月 01, 2021

Current code uses many different types when dealing with a port of a RDMA
device: u8, unsigned int and u32. Switch to u32 to clean up the logic.

This allows us to make (at least) the core view consistent and use the
same type. Unfortunately not all places can be converted. Many uverbs
functions expect port to be u8 so keep those places in order not to break
UAPIs. HW/Spec defined values must also not be changed.

With the switch to u32 we now can support devices with more than 255
ports. U32_MAX is reserved to make control logic a bit easier to deal
with. As a device with U32_MAX ports probably isn't going to happen any
time soon this seems like a non issue.

When a device with more than 255 ports is created uverbs will report the
RDMA device as having 255 ports as this is the max currently supported.

The verbs interface is not changed yet because the IBTA spec limits the
port size in too many places to be u8 and all applications that relies in
verbs won't be able to cope with this change. At this stage, we are
extending the interfaces that are using vendor channel solely

Once the limitation is lifted mlx5 in switchdev mode will be able to have
thousands of SFs created by the device. As the only instance of an RDMA
device that reports more than 255 ports will be a representor device and
it exposes itself as a RAW Ethernet only device CM/MAD/IPoIB and other
ULPs aren't effected by this change and their sysfs/interfaces that are
exposes to userspace can remain unchanged.

While here cleanup some alignment issues and remove unneeded sanity
checks (mainly in rdmavt),

Link: https://lore.kernel.org/r/20210301070420.439400-1-leon@kernel.orgSigned-off-by: NMark Bloch <mbloch@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

1fb7f897

24 3月, 2021 2 次提交

RDMA/hns: Support to query firmware version · 847d19a4

由 Lang Cheng 提交于 3月 16, 2021

Implement the ops named get_dev_fw_str to support ib_get_device_fw_str().

Link: https://lore.kernel.org/r/1615882161-53827-1-git-send-email-liweihang@huawei.comSigned-off-by: NLang Cheng <chenglang@huawei.com>
Signed-off-by: NWeihang Li <liweihang@huawei.com>
Reviewed-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

847d19a4

RDMA/mlx5: Create ODP EQ only when ODP MR is created · ad50294d

由 Shay Drory 提交于 3月 14, 2021

There is no need to create the ODP EQ if the user doesn't use ODP MRs.
Hence, create it only when the first ODP MR is created. This EQ will be
destroyed only when the device is unloaded.
This will decrease the number of EQs created per device. for example: If
we creates 1K devices (SF/VF/etc'), than we will decrease the num of EQs
by 1K.

Link: https://lore.kernel.org/r/20210314125418.179716-1-leon@kernel.orgSigned-off-by: NShay Drory <shayd@nvidia.com>
Reviewed-by: NMaor Gottlieb <maorg@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

ad50294d

23 3月, 2021 1 次提交

RDMA/hns: Fix memory corruption when allocating XRCDN · 783cf673

由 Weihang Li 提交于 3月 22, 2021

It's incorrect to cast the type of pointer to xrcdn from (u32 *) to
(unsigned long *), then pass it into hns_roce_bitmap_alloc(), this will
lead to a memory corruption.

Fixes: 32548870 ("RDMA/hns: Add support for XRC on HIP09")
Link: https://lore.kernel.org/r/1616381069-51759-1-git-send-email-liweihang@huawei.comReported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NWeihang Li <liweihang@huawei.com>
Reviewed-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

783cf673

22 3月, 2021 4 次提交

RDMA/cma: Remove unused leftovers in cma code · 87115951

由 Gal Pressman 提交于 3月 14, 2021

Commit ee1c60b1 ("IB/SA: Modify SA to implicitly cache Class Port
info") removed the class_port_info_context struct usage, remove a couple
of leftovers.

Link: https://lore.kernel.org/r/20210314143427.76101-1-galpress@amazon.comSigned-off-by: NGal Pressman <galpress@amazon.com>
Reviewed-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

87115951

RDMA: Delete not-used static inline functions · fdb68dd3

由 Leon Romanovsky 提交于 3月 14, 2021

Perform mass deletion of static inline functions that are not used.

Link: https://lore.kernel.org/r/20210314133908.291945-3-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

fdb68dd3

RDMA: Fix kernel-doc compilation warnings · ae360f41

由 Leon Romanovsky 提交于 3月 14, 2021

This patch fixes bunch of kernel-doc compilation warnings like below:

drivers/infiniband/hw/i40iw/i40iw_cm.c:4372: warning: expecting prototype for i40iw_ifdown_notify(). Prototype was for i40iw_if_notify() instead

Link: https://lore.kernel.org/r/20210314133908.291945-2-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

ae360f41

RDMA/mlx5: Add missing returned error check of mlx5_ib_dereg_mr · b5486430

由 Leon Romanovsky 提交于 3月 14, 2021

Fix the following smatch error:

drivers/infiniband/hw/mlx5/mr.c:1950 mlx5_ib_dereg_mr() error: uninitialized symbol 'rc'.

Fixes: e6fb246c ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
Link: https://lore.kernel.org/r/20210314082250.10143-1-leon@kernel.orgReported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

b5486430

12 3月, 2021 13 次提交

RDMA/mlx5: Allow larger pages in DevX umem · 7610ab57

由 Jason Gunthorpe 提交于 3月 04, 2021

The umem DMA list calculation was locked at 4k pages due to confusion
around how this API works and is used when larger pages are present.

The conclusion is:

 - umem's cannot extend past what is mapped into the process, so creating
   a lage page size and referring to a sub-range is not allowed

 - umem's must always have a page offset of zero, except for sub PAGE_SIZE
   umems

 - The feature of umem_offset to create multiple objects inside a umem
   is buggy and isn't used anyplace. Thus we can assume all users of the
   current API have umem_offset == 0 as well

Provide a new page size calculator that limits the DMA list to the VA
range and enforces umem_offset == 0.

Allow user space to specify the page sizes which it can accept, this
bitmap must be derived from the intended use of the umem, based on
per-usage HW limitations.

Link: https://lore.kernel.org/r/20210304130501.1102577-4-leon@kernel.orgSigned-off-by: NYishai Hadas <yishaih@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

7610ab57

IB/core: Split uverbs_get_const/default to consider target type · 2904bb37

由 Yishai Hadas 提交于 3月 04, 2021

Change uverbs_get_const/uverbs_get_const_default to work properly with
both signed/unsigned parameters.

Current APIs mix s64 and u64 which leads to incorrect check when u64
value was supplied and its upper bit was set. In that case
uverbs_get_const() / uverbs_get_const_default() lower bound check may
fail unexpectedly, target is unsigned (lower bound is 0) but value
became negative as of the s64 usage.

Split to have two different APIs, no change to callers as the required
API will be called internally according to the target type.

Link: https://lore.kernel.org/r/20210304130501.1102577-3-leon@kernel.orgSigned-off-by: NYishai Hadas <yishaih@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

2904bb37

IB/core: Drop WARN_ON() from ib_umem_find_best_pgsz() · 3f32dc0f

由 Yishai Hadas 提交于 3月 04, 2021

The WARN_ON() issued as part of ib_umem_find_best_pgsz() blocked cases
when only page sizes larger than PAGE_SIZE were set, drop it to enable
those cases.

In addition, there is no need to have a specific check for zero
pgsz_bitmap, the function will do its job and return 0 at the end if
nothing match will be found.

Link: https://lore.kernel.org/r/20210304130501.1102577-2-leon@kernel.orgSigned-off-by: NYishai Hadas <yishaih@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

3f32dc0f

RDMA/mlx5: Fix mlx5 rates to IB rates map · 6fe6e568

由 Mark Zhang 提交于 3月 04, 2021

Correct the map between mlx5 rates and corresponding ib rates, as they
don't always have a fixed offset between them.

Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Link: https://lore.kernel.org/r/20210304124517.1100608-4-leon@kernel.orgSigned-off-by: NMark Zhang <markzhang@nvidia.com>
Reviewed-by: NMaor Gottlieb <maorg@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

6fe6e568

RDMA/mlx5: Fix query RoCE port · 7852546f

由 Maor Gottlieb 提交于 3月 04, 2021

mlx5_is_roce_enabled returns the devlink RoCE init value, therefore it
should be used only when driver is loaded. Instead we just need to read
the roce_en field.

In addition, rename mlx5_is_roce_enabled to mlx5_is_roce_init_enabled.

Fixes: 7a58779e ("IB/mlx5: Improve query port for representor port")
Link: https://lore.kernel.org/r/20210304124517.1100608-2-leon@kernel.orgSigned-off-by: NMaor Gottlieb <maorg@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

7852546f

RDMA/mlx5: Rename mlx5_mr_cache_invalidate() to revoke_mr() · 14d05b55

由 Jason Gunthorpe 提交于 3月 04, 2021

Now that this is only used in a few places in mr.c give it a sensible
name. It has nothing to do with the cache and can be invoked on any
MR. DMA is stopped and the user cannot touch the MR any further once it
completes.

Link: https://lore.kernel.org/r/20210304120745.1090751-5-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

14d05b55

RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr() · e6fb246c

由 Jason Gunthorpe 提交于 3月 04, 2021

Now that the SRCU stuff has been removed the entire MR destroy logic can
be made a lot simpler. Currently there are many different ways to destroy a
MR and it makes it really hard to do this task correctly. Route all
destruction through mlx5_ib_dereg_mr() and make it work for all
situations.

Since it turns out all the different MR types do basically the same thing
this removes a lot of knowledge of MR internals from ODP and leaves ODP
just exporting an operation to clean up children.

This fixes a few weird corner cases bugs and firmly uses the correct
ordering of the MR destruction:
 - Stop parallel access to the mkey via the ODP xarray
 - Stop DMA
 - Release the umem
 - Clean up ODP children
 - Free/Recycle the MR

Link: https://lore.kernel.org/r/20210304120745.1090751-4-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

e6fb246c

RDMA/mlx5: Use a union inside mlx5_ib_mr · f18ec422

由 Jason Gunthorpe 提交于 3月 04, 2021

The struct mlx5_ib_mr can be used for three different things, but only one
at a time:

 - In the user MR cache
 - As a kernel MR
 - As a user MR

Overlay the three things into a single union with the following rules:

 - If the mr is found on the cache_ent->head list then it is a cache MR
   and umem == NULL. The entire union is zero after the MR is removed from
   the cache.

 - If umem != NULL or type == IB_MR_TYPE_USER then it is a user MR.

 - If umem == NULL then it is a kernel MR

This reduces the size of struct mlx5_ib_mr to 552 bytes from 702.

The only place the three flows overlap in the code is during dereg, so add
a few extra checks along there.

Link: https://lore.kernel.org/r/20210304120745.1090751-3-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

f18ec422

RDMA/mlx5: Zero out ODP related items in the mlx5_ib_mr · a639e667

由 Jason Gunthorpe 提交于 3月 04, 2021

All of the ODP code assumes when it calls mlx5_mr_cache_alloc() the ODP
related fields are zero'd. This is true if the MR was just allocated, but
if the MR is recycled through the cache then the values are never zero'd.

This causes a bug in the odp_stats, they don't reset when the MR is
reallocated, also is_odp_implicit is never 0'd.

So we can use memset on a block of the mlx5_ib_mr reorganize the structure
to put all the data that can be zero'd by the cache at the end.

It is organized as an anonymous struct because the next patch will make
this a union.

Delete the unused smr_info. Don't set the kernel only desc_size on the
user path. No longer any need to zero mr->parent before freeing it, the
memset() will get it now.

Fixes: a3de94e3 ("IB/mlx5: Introduce ODP diagnostic counters")
Link: https://lore.kernel.org/r/20210304120745.1090751-2-leon@kernel.orgSigned-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

a639e667

RDMA/hns: Add support for XRC on HIP09 · 32548870

由 Wenpeng Liang 提交于 3月 04, 2021

The HIP09 supports XRC transport service, it greatly saves the number of
QPs required to connect all processes in a large cluster.

Link: https://lore.kernel.org/r/1614826558-35423-1-git-send-email-liweihang@huawei.comSigned-off-by: NWenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: NWeihang Li <liweihang@huawei.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

32548870

RDMA/rtrs-clt: Use rdma_event_msg in log · c33d516a

由 Jack Wang 提交于 2月 22, 2021

It's easier to understand a string instead of enum.

Link: https://lore.kernel.org/r/20210222141551.54345-2-jinpu.wang@cloud.ionos.comSigned-off-by: NJack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

c33d516a

RDMA/rtrs: Use new shared CQ mechanism · 3b89e92c

由 Jack Wang 提交于 2月 22, 2021

Have the driver use shared CQs which provids a ~10%-20% improvement during
test.

Instead of opening a CQ for each QP per connection, a CQ for each QP will
be provided by the RDMA core driver that will be shared between the QPs on
that core reducing interrupt overhead.

Link: https://lore.kernel.org/r/20210222141551.54345-1-jinpu.wang@cloud.ionos.comSigned-off-by: NJack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

3b89e92c

RDMA/core: Remove unused req_ncomp_notif device operation · f675ba12

由 Gal Pressman 提交于 3月 11, 2021

The request_ncomp_notif device operation and function are unused, remove
them.

Link: https://lore.kernel.org/r/20210311150921.23726-1-galpress@amazon.comSigned-off-by: NGal Pressman <galpress@amazon.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

f675ba12

11 3月, 2021 2 次提交

RDMA/iwcm: Allow AFONLY binding for IPv6 addresses · e35ecb46

由 Bernard Metzler 提交于 2月 19, 2021

Binding IPv6 address/port to AF_INET6 domain only is provided via
rdma_set_afonly(), but was not signalled to the provider. Applications
like NFS/RDMA bind the same port to both IPv4 and IPv6 addresses
simultaneously and thus rely on it working correctly.

Link: https://lore.kernel.org/r/20210219143441.1068-1-bmt@zurich.ibm.comTested-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NBenjamin Coddington <bcodding@redhat.com>
Signed-off-by: NBernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

e35ecb46

RDMA/hns: Use new SQ doorbell register for HIP09 · 0f00571f

由 Lang Cheng 提交于 2月 23, 2021

HIP09 uses new address space to map SQ doorbell registers, the doorbell of
each QP is isolated based on the size of 64KB, which can improve the
performance in concurrency scenarios.

Link: https://lore.kernel.org/r/1614082833-23130-1-git-send-email-liweihang@huawei.comSigned-off-by: NLang Cheng <chenglang@huawei.com>
Signed-off-by: NWeihang Li <liweihang@huawei.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

0f00571f

06 3月, 2021 3 次提交

RDMA/rxe: Fix errant WARN_ONCE in rxe_completer() · 545c4ab4

由 Bob Pearson 提交于 3月 04, 2021

In rxe_comp.c in rxe_completer() the function free_pkt() did not clear skb
which triggered a warning at 'done:' and could possibly at 'exit:'. The
WARN_ONCE() calls are not actually needed. The call to free_pkt() is
moved to the end to clearly show that all skbs are freed.

Fixes: 899aba89 ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
Link: https://lore.kernel.org/r/20210304192048.2958-1-rpearson@hpe.comSigned-off-by: NBob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

545c4ab4

RDMA/rxe: Fix extra deref in rxe_rcv_mcast_pkt() · 5e4a7ccc

由 Bob Pearson 提交于 3月 04, 2021

rxe_rcv_mcast_pkt() dropped a reference to ib_device when no error
occurred causing an underflow on the reference counter. This code is
cleaned up to be clearer and easier to read.

5e4a7ccc

RDMA/rxe: Fix missed IB reference counting in loopback · 21e27ac8

由 Bob Pearson 提交于 3月 04, 2021

When the noted patch below extending the reference taken by
rxe_get_dev_from_net() in rxe_udp_encap_recv() until each skb is freed it
was not matched by a reference in the loopback path resulting in
underflows.

21e27ac8

05 3月, 2021 11 次提交

nvmet: model_number must be immutable once set · d9f273b7

由 Max Gurtovoy 提交于 2月 17, 2021

In case we have already established connection to nvmf target, it
shouldn't be allowed to change the model_number. E.g. if someone will
identify ctrl and get model_number of "my_model" later on will change
the model_numbel via configfs to "my_new_model" this will break the NVMe
specification for "Get Log Page – Persistent Event Log" that refers to
Model Number as: "This field contains the same value as reported in the
Model Number field of the Identify Controller data structure, bytes
63:24."

Although it doesn't mentioned explicitly that this field can't be
changed, we can assume it.

So allow setting this field only once: using configfs or in the first
identify ctrl operation.
Signed-off-by: NMax Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d9f273b7

nvme-fabrics: fix kato initialization · 32feb6de

由 Martin George 提交于 2月 11, 2021

Currently kato is initialized to NVME_DEFAULT_KATO for both
discovery & i/o controllers. This is a problem specifically
for non-persistent discovery controllers since it always ends
up with a non-zero kato value. Fix this by initializing kato
to zero instead, and ensuring various controllers are assigned
appropriate kato values as follows:

non-persistent controllers  - kato set to zero
persistent controllers      - kato set to NVMF_DEV_DISC_TMO
                              (or any positive int via nvme-cli)
i/o controllers             - kato set to NVME_DEFAULT_KATO
                              (or any positive int via nvme-cli)
Signed-off-by: NMartin George <marting@netapp.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

32feb6de

nvme-hwmon: Return error code when registration fails · 78570f88

由 Daniel Wagner 提交于 2月 12, 2021

The hwmon pointer wont be NULL if the registration fails. Though the
exit code path will assign it to ctrl->hwmon_device. Later
nvme_hwmon_exit() will try to free the invalid pointer. Avoid this by
returning the error code from hwmon_device_register_with_info().

Fixes: ed7770f6 ("nvme/hwmon: rework to avoid devm allocation")
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

78570f88

nvme-pci: add quirks for Lexar 256GB SSD · 6e6a6828

由 Pascal Terjan 提交于 2月 23, 2021

Add the NVME_QUIRK_NO_NS_DESC_LIST and NVME_QUIRK_IGNORE_DEV_SUBNQN
quirks for this buggy device.

Reported and tested in https://bugs.mageia.org/show_bug.cgi?id=28417Signed-off-by: NPascal Terjan <pterjan@google.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

6e6a6828

nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state · dc22c1c0

由 Zoltán Böszörményi 提交于 2月 21, 2021

My 2TB SKC2000 showed the exact same symptoms that were provided
in 538e4a8c ("nvme-pci: avoid the deepest sleep state on
Kingston A2000 SSDs"), i.e. a complete NVME lockup that needed
cold boot to get it back.

According to some sources, the A2000 is simply a rebadged
SKC2000 with a slightly optimized firmware.

Adding the SKC2000 PCI ID to the quirk list with the same workaround
as the A2000 made my laptop survive a 5 hours long Yocto bootstrap
buildfest which reliably triggered the SSD lockup previously.
Signed-off-by: NZoltán Böszörményi <zboszor@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

dc22c1c0

nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST. · 5e112d3f

由 Julian Einwag 提交于 2月 16, 2021

The kernel fails to fully detect these SSDs, only the character devices
are present:

[   10.785605] nvme nvme0: pci function 0000:04:00.0
[   10.876787] nvme nvme1: pci function 0000:81:00.0
[   13.198614] nvme nvme0: missing or invalid SUBNQN field.
[   13.198658] nvme nvme1: missing or invalid SUBNQN field.
[   13.206896] nvme nvme0: Shutdown timeout set to 20 seconds
[   13.215035] nvme nvme1: Shutdown timeout set to 20 seconds
[   13.225407] nvme nvme0: 16/0/0 default/read/poll queues
[   13.233602] nvme nvme1: 16/0/0 default/read/poll queues
[   13.239627] nvme nvme0: Identify Descriptors failed (8194)
[   13.246315] nvme nvme1: Identify Descriptors failed (8194)

Adding the NVME_QUIRK_NO_NS_DESC_LIST fixes this problem.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=205679Signed-off-by: NJulian Einwag <jeinwag-nvme@marcapo.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>

5e112d3f

scsi: iscsi: Verify lengths on passthrough PDUs · f9dbdf97

由 Chris Leech 提交于 2月 23, 2021

Open-iSCSI sends passthrough PDUs over netlink, but the kernel should be
verifying that the provided PDU header and data lengths fall within the
netlink message to prevent accessing beyond that in memory.

Cc: stable@vger.kernel.org
Reported-by: NAdam Nichols <adam@grimm-co.com>
Reviewed-by: NLee Duncan <lduncan@suse.com>
Reviewed-by: NMike Christie <michael.christie@oracle.com>
Signed-off-by: NChris Leech <cleech@redhat.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>

f9dbdf97

scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE · ec98ea70

由 Chris Leech 提交于 2月 23, 2021

As the iSCSI parameters are exported back through sysfs, it should be
enforcing that they never are more than PAGE_SIZE (which should be more
than enough) before accepting updates through netlink.

Change all iSCSI sysfs attributes to use sysfs_emit().

Cc: stable@vger.kernel.org
Reported-by: NAdam Nichols <adam@grimm-co.com>
Reviewed-by: NLee Duncan <lduncan@suse.com>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NMike Christie <michael.christie@oracle.com>
Signed-off-by: NChris Leech <cleech@redhat.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>

ec98ea70

scsi: iscsi: Restrict sessions and handles to admin capabilities · 688e8128

由 Lee Duncan 提交于 2月 23, 2021

Protect the iSCSI transport handle, available in sysfs, by requiring
CAP_SYS_ADMIN to read it. Also protect the netlink socket by restricting
reception of messages to ones sent with CAP_SYS_ADMIN. This disables
normal users from being able to end arbitrary iSCSI sessions.

Cc: stable@vger.kernel.org
Reported-by: NAdam Nichols <adam@grimm-co.com>
Reviewed-by: NChris Leech <cleech@redhat.com>
Reviewed-by: NMike Christie <michael.christie@oracle.com>
Signed-off-by: NLee Duncan <lduncan@suse.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>

688e8128

dm verity: fix FEC for RS roots unaligned to block size · df7b59ba

由 Milan Broz 提交于 2月 23, 2021

Optional Forward Error Correction (FEC) code in dm-verity uses
Reed-Solomon code and should support roots from 2 to 24.

The error correction parity bytes (of roots lengths per RS block) are
stored on a separate device in sequence without any padding.

Currently, to access FEC device, the dm-verity-fec code uses dm-bufio
client with block size set to verity data block (usually 4096 or 512
bytes).

Because this block size is not divisible by some (most!) of the roots
supported lengths, data repair cannot work for partially stored parity
bytes.

This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
where we can be sure that the full parity data is always available.
(There cannot be partial FEC blocks because parity must cover whole
sectors.)

Because the optional FEC starting offset could be unaligned to this
new block size, we have to use dm_bufio_set_sector_offset() to
configure it.

The problem is easily reproduced using veritysetup, e.g. for roots=13:

  # create verity device with RS FEC
  dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
  veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash

  # create an erasure that should be always repairable with this roots setting
  dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none

  # try to read it through dm-verity
  veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
  dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
  # wait for possible recursive recovery in kernel
  udevadm settle
  veritysetup close test

With this fix, errors are properly repaired.
  device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
  ...

Without it, FEC code usually ends on unrecoverable failure in RS decoder:
  device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
  ...

This problem is present in all kernels since the FEC code's
introduction (kernel 4.5).

It is thought that this problem is not visible in Android ecosystem
because it always uses a default RS roots=2.

Depends-on: a14e5ec6 ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
Signed-off-by: NMilan Broz <gmazyland@gmail.com>
Tested-by: NJérôme Carretero <cJ-ko@zougloub.eu>
Reviewed-by: NSami Tolvanen <samitolvanen@google.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

df7b59ba

dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size · a14e5ec6

由 Mikulas Patocka 提交于 2月 23, 2021

dm_bufio_get_device_size returns the device size in blocks. Before
returning the value, we must subtract the nubmer of starting
sectors. The number of starting sectors may not be divisible by block
size.

Note that currently, no target is using dm_bufio_set_sector_offset and
dm_bufio_get_device_size simultaneously, so this change has no effect.
However, an upcoming dm-verity-fec fix needs this change.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Reviewed-by: NMilan Broz <gmazyland@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

a14e5ec6

04 3月, 2021 3 次提交

iommu/vt-d: Fix status code for Allocate/Free PASID command · 444d66a2

由 Zenghui Yu 提交于 2月 27, 2021

As per Intel vt-d spec, Rev 3.0 (section 10.4.45 "Virtual Command Response
Register"), the status code of "No PASID available" error in response to
the Allocate PASID command is 2, not 1. The same for "Invalid PASID" error
in response to the Free PASID command.

We will otherwise see confusing kernel log under the command failure from
guest side. Fix it.

Fixes: 24f27d32 ("iommu/vt-d: Enlightened PASID allocation")
Signed-off-by: NZenghui Yu <yuzenghui@huawei.com>
Acked-by: NLu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210227073909.432-1-yuzenghui@huawei.comSigned-off-by: NJoerg Roedel <jroedel@suse.de>

444d66a2

iommu: Don't use lazy flush for untrusted device · 82c3cefb

由 Lu Baolu 提交于 2月 25, 2021

The lazy IOTLB flushing setup leaves a time window, in which the device
can still access some system memory, which has already been unmapped by
the device driver. It's not suitable for untrusted devices. A malicious
device might use this to attack the system by obtaining data that it
shouldn't obtain.

Fixes: c588072b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: NLu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210225061454.2864009-1-baolu.lu@linux.intel.comSigned-off-by: NJoerg Roedel <jroedel@suse.de>

82c3cefb

iommu/tegra-smmu: Fix mc errors on tegra124-nyan · 765a9d1d

由 Nicolin Chen 提交于 2月 18, 2021

Commit 25938c73 ("iommu/tegra-smmu: Rework tegra_smmu_probe_device()")
removed certain hack in the tegra_smmu_probe() by relying on IOMMU core to
of_xlate SMMU's SID per device, so as to get rid of tegra_smmu_find() and
tegra_smmu_configure() that are typically done in the IOMMU core also.

This approach works for both existing devices that have DT nodes and other
devices (like PCI device) that don't exist in DT, on Tegra210 and Tegra3
upon testing. However, Page Fault errors are reported on tegra124-Nyan:

tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40:
EMEM address decode error (SMMU translation error [--S])
tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40:
Page fault (SMMU translation error [--S])

After debugging, I found that the mentioned commit changed some function
callback sequence of tegra-smmu's, resulting in enabling SMMU for display
client before display driver gets initialized. I couldn't reproduce exact
same issue on Tegra210 as Tegra124 (arm-32) differs at arch-level code.

Actually this Page Fault is a known issue, as on most of Tegra platforms,
display gets enabled by the bootloader for the splash screen feature, so
it keeps filling the framebuffer memory. A proper fix to this issue is to
1:1 linear map the framebuffer memory to IOVA space so the SMMU will have
the same address as the physical address in its page table. Yet, Thierry
has been working on the solution above for a year, and it hasn't merged.

Therefore, let's partially revert the mentioned commit to fix the errors.

The reason why we do a partial revert here is that we can still set priv
in ->of_xlate() callback for PCI devices. Meanwhile, devices existing in
DT, like display, will go through tegra_smmu_configure() at the stage of
bus_set_iommu() when SMMU gets probed(), as what it did before we merged
the mentioned commit.

Once we have the linear map solution for framebuffer memory, this change
can be cleaned away.

[Big thank to Guillaume who reported and helped debugging/verification]

Fixes: 25938c73 ("iommu/tegra-smmu: Rework tegra_smmu_probe_device()")
Reported-by: NGuillaume Tucker <guillaume.tucker@collabora.com>
Signed-off-by: NNicolin Chen <nicoleotsuka@gmail.com>
Tested-by: NGuillaume Tucker <guillaume.tucker@collabora.com>
Acked-by: NThierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20210218220702.1962-1-nicoleotsuka@gmail.comSigned-off-by: NJoerg Roedel <jroedel@suse.de>

765a9d1d

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功