提交 · 84e4c204b6a0e81f56bd7d1254123390ef0498c8 · openeuler / Kernel

27 5月, 2020 10 次提交

nvme: disable streams when get stream params failed · 84e4c204

由 Wu Bo 提交于 5月 13, 2020

Disable streams again if getting the stream params fails.
Signed-off-by: NWu Bo <wubo40@huawei.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

84e4c204

nvme-fc: print proper nvme-fc devloss_tmo value · 614fc1c0

由 Martin George 提交于 5月 12, 2020

The nvme-fc devloss_tmo is computed as the min of either the
ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
value. So correct this by printing the min value instead.
Signed-off-by: NMartin George <marting@netapp.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

614fc1c0

nvme-pci: make sure write/poll_queues less or equal then cpu count · 9c9e76d5

由 Weiping Zhang 提交于 5月 09, 2020

Check module parameter write/poll_queues before using it to catch
too large values.

Reproducer:

modprobe -r nvme
modprobe nvme write_queues=`nproc`
echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller

[  657.069000] ------------[ cut here ]------------
[  657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0
[  657.069056]  dm_region_hash dm_log dm_mod
[  657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G        W         5.6.0+ #8
[  657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[  657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0
[  657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff
[  657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202
[  657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000
[  657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000
[  657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718
[  657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064
[  657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001
[  657.069071] FS:  0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000
[  657.069072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0
[  657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  657.069073] PKRU: 55555554
[  657.069074] Call Trace:
[  657.069080]  __pci_enable_msix_range+0x233/0x5a0
[  657.069085]  ? kernfs_put+0xec/0x190
[  657.069086]  pci_alloc_irq_vectors_affinity+0xbb/0x130
[  657.069089]  nvme_reset_work+0x6e6/0xeab [nvme]
[  657.069093]  ? __switch_to_asm+0x34/0x70
[  657.069094]  ? __switch_to_asm+0x40/0x70
[  657.069095]  ? nvme_irq_check+0x30/0x30 [nvme]
[  657.069098]  process_one_work+0x1a7/0x370
[  657.069101]  worker_thread+0x1c9/0x380
[  657.069102]  ? max_active_store+0x80/0x80
[  657.069103]  kthread+0x112/0x130
[  657.069104]  ? __kthread_parkme+0x70/0x70
[  657.069105]  ret_from_fork+0x35/0x40
[  657.069106] ---[ end trace f4f06b7d24513d06 ]---
[  657.077110] nvme nvme0: 95/1/0 default/read/poll queues
Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

9c9e76d5

nvmet-tcp: move send/recv error handling in the send/recv methods instead of call-sites · 0236d343

由 Sagi Grimberg 提交于 5月 18, 2020

Have routines handle errors and just bail out of the poll loop.
This simplifies the code and will help as we may enhance the poll
loop logic and these are somewhat in the way.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

0236d343

nvmet-tcp: set MSG_EOR if we send last payload in the batch · f381ab1f

由 Sagi Grimberg 提交于 5月 12, 2020

when trying to send the pdu data digest, we should set this
flag.
Reported-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f381ab1f

nvmet-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send · 4eea8043

由 Sagi Grimberg 提交于 5月 04, 2020

We can signal the stack that this is not the last page coming and the
stack can build a larger tso segment, so go ahead and use it.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4eea8043

nvme-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send · 5bb052d7

由 Sagi Grimberg 提交于 5月 04, 2020

We can signal the stack that this is not the last page coming and the
stack can build a larger tso segment, so go ahead and use it.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5bb052d7

nvmet: mark nvmet_ana_state static · 74255969

由 Christoph Hellwig 提交于 5月 12, 2020

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

74255969

nvmet: replace kstrndup() with kmemdup_nul() · 09bb8986

由 Chen Zhou 提交于 5月 08, 2020

It is more efficient to use kmemdup_nul() if the size is known exactly.

The doc in kernel:
"Note: Use kmemdup_nul() instead if the size is known exactly."
Signed-off-by: NChen Zhou <chenzhou10@huawei.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

09bb8986

block/floppy: fix contended case in floppy_queue_rq() · 263c6158

由 Jiri Kosina 提交于 5月 26, 2020

Since the switch of floppy driver to blk-mq, the contended (fdc_busy) case
in floppy_queue_rq() is not handled correctly.

In case we reach floppy_queue_rq() with fdc_busy set (i.e. with the floppy
locked due to another request still being in-flight), we put the request
on the list of requests and return BLK_STS_OK to the block core, without
actually scheduling delayed work / doing further processing of the
request. This means that processing of this request is postponed until
another request comes and passess uncontended.

Which in some cases might actually never happen and we keep waiting
indefinitely. The simple testcase is

	for i in `seq 1 2000`; do echo -en $i '\r'; blkid --info /dev/fd0 2> /dev/null; done

run in quemu. That reliably causes blkid eventually indefinitely hanging
in __floppy_read_block_0() waiting for completion, as the BIO callback
never happens, and no further IO is ever submitted on the (non-existent)
floppy device. This was observed reliably on qemu-emulated device.

Fix that by not queuing the request in the contended case, and return
BLK_STS_RESOURCE instead, so that blk core handles the request
rescheduling and let it pass properly non-contended later.

Fixes: a9f38e1d ("floppy: convert to blk-mq")
Cc: stable@vger.kernel.org
Tested-by: NLibor Pechacek <lpechacek@suse.cz>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

263c6158

25 5月, 2020 1 次提交

loop: remove redundant assignment to variable error · d29b92f5

由 Colin Ian King 提交于 5月 24, 2020

The variable error is being assigned a value that is never
read so the assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d29b92f5

21 5月, 2020 14 次提交

block: remove ioctl_by_bdev · 3783daeb

由 Christoph Hellwig 提交于 5月 19, 2020

No callers left.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3783daeb

s390/dasd: remove ioctl_by_bdev calls · 26d7e28e

由 Stefan Haberland 提交于 5月 19, 2020

The IBM partition parser requires device type specific information only
available to the DASD driver to correctly register partitions. The
current approach of using ioctl_by_bdev with a fake user space pointer
is discouraged.

Fix this by replacing IOCTL calls with direct in-kernel function calls.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NStefan Haberland <sth@linux.ibm.com>
Reviewed-by: NJan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

26d7e28e

dasd: refactor dasd_ioctl_information · 9353848c

由 Christoph Hellwig 提交于 5月 19, 2020

Prepare for in-kernel callers of this functionality.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
[sth@de.ibm.com: remove leftover kfree]
Signed-off-by: NStefan Haberland <sth@linux.ibm.com>
Reviewed-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: NJan Hoeppner <hoeppner@linux.ibm.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9353848c

loop: Add LOOP_CONFIGURE ioctl · 3448914e

由 Martijn Coenen 提交于 5月 13, 2020

This allows userspace to completely setup a loop device with a single
ioctl, removing the in-between state where the device can be partially
configured - eg the loop device has a backing file associated with it,
but is reading from the wrong offset.

Besides removing the intermediate state, another big benefit of this
ioctl is that LOOP_SET_STATUS can be slow; the main reason for this
slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to
freeze the associated queue; this requires waiting for RCU
synchronization, which I've measured can take about 15-20ms on this
device on average.

In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can
also be used to:
- Set the correct block size immediately by setting
  loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE)
- Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO
  in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO)
- Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY
  in loop_config.info.lo_flags

Here's setting up ~70 regular loop devices with an offset on an x86
Android device, using LOOP_SET_FD and LOOP_SET_STATUS:

vsoc_x86:/system/apex # time for i in `seq 30 100`;
do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done
    0m03.40s real     0m00.02s user     0m00.03s system

Here's configuring ~70 devices in the same way, but using a modified
losetup that uses the new LOOP_CONFIGURE ioctl:

vsoc_x86:/system/apex # time for i in `seq 30 100`;
do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done
    0m01.94s real     0m00.01s user     0m00.01s system
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3448914e

loop: Clean up LOOP_SET_STATUS lo_flags handling · faf1d254

由 Martijn Coenen 提交于 5月 13, 2020

LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in
particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas
LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this
explicit by updating the UAPI to include the flags that can be
set/cleared using this ioctl.

The implementation can then blindly take over the passed in flags,
and use the previous flags for those flags that can't be set / cleared
using LOOP_SET_STATUS.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

faf1d254

loop: Rework lo_ioctl() __user argument casting · 571fae6e

由 Martijn Coenen 提交于 5月 13, 2020

In preparation for a new ioctl that needs to copy_from_user(); makes the
code easier to read as well.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

571fae6e

loop: Move loop_set_status_from_info() and friends up · 62ab466c

由 Martijn Coenen 提交于 5月 13, 2020

So we can use it without forward declaration. This is a separate commit
to make it easier to verify that this is just a move, without functional
modifications.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

62ab466c

loop: Factor out configuring loop from status · 0c3796c2

由 Martijn Coenen 提交于 5月 13, 2020

Factor out this code into a separate function, so it can be reused by
other code more easily.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0c3796c2

loop: Remove figure_loop_size() · 0a6ed1b5

由 Martijn Coenen 提交于 5月 13, 2020

This function was now only used by loop_set_capacity(). Just open code
the remaining code in the caller instead.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0a6ed1b5

loop: Refactor loop_set_status() size calculation · b0bd158d

由 Martijn Coenen 提交于 5月 13, 2020

figure_loop_size() calculates the loop size based on the passed in
parameters, but at the same time it updates the offset and sizelimit
parameters in the loop device configuration. That is a somewhat
unexpected side effect of a function with this name, and it is only only
needed by one of the two callers of this function - loop_set_status().

Move the lo_offset and lo_sizelimit assignment back into loop_set_status(),
and use the newly factored out functions to validate and apply the newly
calculated size. This allows us to get rid of figure_loop_size() in a
follow-up commit.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b0bd158d

loop: Switch to set_capacity_revalidate_and_notify() · 716ad098

由 Martijn Coenen 提交于 5月 13, 2020

This was recently added to block/genhd.c, and takes care of both
updating the capacity and notifying userspace of the new size.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

716ad098

loop: Factor out setting loop device size · 5795b6f5

由 Martijn Coenen 提交于 5月 13, 2020

This code is used repeatedly.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5795b6f5

loop: Remove sector_t truncation checks · 083a6a50

由 Martijn Coenen 提交于 5月 13, 2020

sector_t is now always u64, so we don't need to check for truncation.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

083a6a50

loop: Call loop_config_discard() only after new config is applied · 7c5014b0

由 Martijn Coenen 提交于 5月 13, 2020

loop_set_status() calls loop_config_discard() to configure discard for
the loop device; however, the discard configuration depends on whether
the loop device uses encryption, and when we call it the encryption
configuration has not been updated yet. Move the call down so we apply
the correct discard configuration based on the new configuration.
Signed-off-by: NMartijn Coenen <maco@android.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7c5014b0

17 5月, 2020 1 次提交

block/swim3: use set_current_state macro · c6516565

由 Xu Wang 提交于 5月 07, 2020

Use set_current_state macro instead of current->state = TASK_RUNNING.
Signed-off-by: NXu Wang <vulab@iscas.ac.cn>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c6516565

14 5月, 2020 14 次提交

Merge branch 'md-next' of... · 8fd2b980

由 Jens Axboe 提交于 5月 13, 2020

Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.8/drivers

Pull MD changes from Song.

* 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid1: Replace zero-length array with flexible-array
  md: add a newline when printing parameter 'start_ro' by sysfs
  md: stop using ->queuedata
  md/raid1: release pending accounting for an I/O only after write-behind is also finished
  md: remove redundant memalloc scope API usage
  raid5: update code comment of scribble_alloc()
  raid5: remove gfp flags from scribble_alloc()
  md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
  md: remove the extra line for ->hot_add_disk
  md: flush md_rdev_misc_wq for HOT_ADD_DISK case
  md: don't flush workqueue unconditionally in md_open
  md: add new workqueue for delete rdev
  md: add checkings before flush md_misc_wq

8fd2b980

md/raid1: Replace zero-length array with flexible-array · 358369f0

由 Gustavo A. R. Silva 提交于 5月 07, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: NSong Liu <songliubraving@fb.com>

358369f0

md: add a newline when printing parameter 'start_ro' by sysfs · 3f99980c

由 Xiongfeng Wang 提交于 5月 11, 2020

Add a missing newline when printing module parameter 'start_ro' by
sysfs.
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3f99980c

md: stop using ->queuedata · e4fc5a74

由 Christoph Hellwig 提交于 5月 08, 2020

Pointer to mddev is already available in private_data.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

e4fc5a74

md/raid1: release pending accounting for an I/O only after write-behind is also finished · c91114c2

由 David Jeffery 提交于 1月 27, 2020

When using RAID1 and write-behind, md can deadlock when errors occur. With
write-behind, r1bio structs can be accounted by raid1 as queued but not
counted as pending. The pending count is dropped when the original bio is
returned complete but write-behind for the r1bio may still be active.

This breaks the accounting used in some conditions to know when the raid1
md device has reached an idle state. It can result in calls to
freeze_array deadlocking. freeze_array will never complete from a negative
"unqueued" value being calculated due to a queued count larger than the
pending count.

To properly account for write-behind, move the call to allow_barrier from
call_bio_endio to raid_end_bio_io. When using write-behind, md can call
call_bio_endio before all write-behind I/O is complete. Using
raid_end_bio_io for the point to call allow_barrier will release the
pending count at a point where all I/O for an r1bio, even write-behind, is
done.
Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

c91114c2

md: remove redundant memalloc scope API usage · 3024ba2d

由 Coly Li 提交于 4月 09, 2020

In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save()
and memalloc_noio_restore() are used when allocating memory by calling
mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in
raid array suspend context, it is unncessary to explicitly call them
around mempool_create_kmalloc_pool() any longer.

This patch removes the redundant memalloc scope APIs in
mddev_create_serial_pool().
Signed-off-by: NColy Li <colyli@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3024ba2d

raid5: update code comment of scribble_alloc() · 7f8a30e5

由 Coly Li 提交于 4月 09, 2020

Code comments of scribble_alloc() is outdated for a while. This patch
update the comments in function header for the new parameter list.
Suggested-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NColy Li <colyli@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

7f8a30e5

raid5: remove gfp flags from scribble_alloc() · ba54d4d4

由 Coly Li 提交于 4月 09, 2020

Using GFP_NOIO flag to call scribble_alloc() from resize_chunk() does
not have the expected behavior. kvmalloc_array() inside scribble_alloc()
which receives the GFP_NOIO flag will eventually call kmalloc_node() to
allocate physically continuous pages.

Now we have memalloc scope APIs in mddev_suspend()/mddev_resume() to
prevent memory reclaim I/Os during raid array suspend context, calling
to kvmalloc_array() with GFP_KERNEL flag may avoid deadlock of recursive
I/O as expected.

This patch removes the useless gfp flags from parameters list of
scribble_alloc(), and call kvmalloc_array() with GFP_KERNEL flag. The
incorrect GFP_NOIO flag does not exist anymore.

Fixes: b330e6a4 ("md: convert to kvmalloc")
Suggested-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NColy Li <colyli@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

ba54d4d4

md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9

由 Coly Li 提交于 4月 09, 2020

In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
flag, then it is sent into kvmalloc_array() inside scribble_alloc().

The problem is kvmalloc_array() eventually calls kvmalloc_node() which
does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
kmalloc_node() is called indeed to allocate physically continuous
pages. When system memory is under heavy pressure, and the requesting
size is large, there is high probability that allocating continueous
pages will fail.

But simply using GFP_KERNEL flag to call kvmalloc_array() is also
progblematic. In the code path where scribble_alloc() is called, the
raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspend raid array, deadlock will happen.

What is desired here is to allocate non-physically (a.k.a virtually)
continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
to use the mmealloc sceope APIs to restrict memory reclaim I/O in
allocating context, specifically to call memalloc_noio_save() when
suspend the raid array and to call memalloc_noio_restore() when
resume the raid array.

This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os during the raid array
is suspended. The benifit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is, no matter which
md raid array type (personality), we are sure the deadlock by recursive
memory reclaim I/O won't happen on the suspending context.

Please notice that the memalloc scope APIs only take effect on the raid
array suspending context, if the memory allocation is from another new
created kthread after raid array suspended, the recursive memory reclaim
I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
used for the critical section where the raid metadata is modifying,
creating a kthread to allocate memory inside the critical section is
queer and very probably being buggy.

Fixes: b330e6a4 ("md: convert to kvmalloc")
Suggested-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NColy Li <colyli@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

78f57ef9

md: remove the extra line for ->hot_add_disk · 3f79cc22

由 Guoqing Jiang 提交于 4月 04, 2020

It is not not necessary to add a newline for them since they don't exceed
80 characters, and it is not intutive to distinguish ->hot_add_disk() from
hot_add_disk() too.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3f79cc22

md: flush md_rdev_misc_wq for HOT_ADD_DISK case · 78b990cf

由 Guoqing Jiang 提交于 4月 04, 2020

Since rdev->kobj is removed asynchronously, it is possible that the
rdev->kobj still exists when try to add the rdev again after rdev
is removed. But this path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
-> bind_rdev_to_array missed it.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

78b990cf

md: don't flush workqueue unconditionally in md_open · f6766ff6

由 Guoqing Jiang 提交于 4月 04, 2020

We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.

kernel: [  154.522645] ======================================================
kernel: [  154.522647] WARNING: possible circular locking dependency detected
kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
kernel: [  154.522651] ------------------------------------------------------
kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [  154.522673]
kernel: [  154.522673] but task is already holding lock:
kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522691]
kernel: [  154.522691] which lock already depends on the new lock.
kernel: [  154.522691]
kernel: [  154.522694]
kernel: [  154.522694] the existing dependency chain (in reverse order) is:
kernel: [  154.522696]
kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [  154.522704]        __mutex_lock+0x87/0x950
kernel: [  154.522706]        __blkdev_get+0x79/0x590
kernel: [  154.522708]        blkdev_get+0x65/0x140
kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522735]        vfs_write+0xad/0x1a0
kernel: [  154.522737]        ksys_write+0xa4/0xe0
kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522749]
kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [  154.522752]        __mutex_lock+0x87/0x950
kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522763]        vfs_write+0xad/0x1a0
kernel: [  154.522765]        ksys_write+0xa4/0xe0
kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522770]
kernel: [  154.522770] -> #2 (kn->count#253){++++}:
kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
kernel: [  154.522778]        kernfs_remove+0x1f/0x30
kernel: [  154.522780]        kobject_del+0x28/0x60
kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
kernel: [  154.522788]        worker_thread+0x2d/0x3d0
kernel: [  154.522793]        kthread+0x117/0x130
kernel: [  154.522795]        ret_from_fork+0x3a/0x50
kernel: [  154.522796]
kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [  154.522800]        process_one_work+0x27e/0x5f0
kernel: [  154.522802]        worker_thread+0x2d/0x3d0
kernel: [  154.522804]        kthread+0x117/0x130
kernel: [  154.522806]        ret_from_fork+0x3a/0x50
kernel: [  154.522807]
kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522823]        __blkdev_get+0xea/0x590
kernel: [  154.522825]        blkdev_get+0x65/0x140
kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
kernel: [  154.522831]        path_openat+0x567/0xcc0
kernel: [  154.522834]        do_filp_open+0x9b/0x110
kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
kernel: [  154.522838]        do_sys_open+0x57/0x80
kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522844]
kernel: [  154.522844] other info that might help us debug this:
kernel: [  154.522844]
kernel: [  154.522846] Chain exists of:
kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [  154.522846]
kernel: [  154.522850]  Possible unsafe locking scenario:
kernel: [  154.522850]
kernel: [  154.522852]        CPU0                    CPU1
kernel: [  154.522853]        ----                    ----
kernel: [  154.522854]   lock(&bdev->bd_mutex);
kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
kernel: [  154.522858]                                lock(&bdev->bd_mutex);
kernel: [  154.522860]   lock((wq_completion)md_misc);
kernel: [  154.522861]
kernel: [  154.522861]  *** DEADLOCK ***
kernel: [  154.522861]
kernel: [  154.522864] 1 lock held by mdadm/2482:
kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522868]
kernel: [  154.522868] stack backtrace:
kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [  154.522878] Call Trace:
kernel: [  154.522881]  dump_stack+0x8f/0xcb
kernel: [  154.522884]  check_noncircular+0x194/0x1b0
kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522910]  __blkdev_get+0xea/0x590
kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522914]  blkdev_get+0x65/0x140
kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
kernel: [  154.522921]  path_openat+0x567/0xcc0
kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
kernel: [  154.522926]  do_filp_open+0x9b/0x110
kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
kernel: [  154.522944]  do_sys_open+0x57/0x80
kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae

And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

f6766ff6

md: add new workqueue for delete rdev · cc1ffe61

由 Guoqing Jiang 提交于 4月 04, 2020

Since the purpose of call flush_workqueue in new_dev_store is to ensure
md_delayed_delete() has completed, so we should check rdev->del_work is
pending or not.

To suppress lockdep warning, we have to check mddev->del_work while
md_delayed_delete is attached to rdev->del_work, so it is not aligned
to the purpose of flush workquee. So a new workqueue is needed to avoid
the awkward situation, and introduce a new func flush_rdev_wq to flush
the new workqueue after check if there was pending work.

Also like new_dev_store, ADD_NEW_DISK ioctl has the same purpose to flush
workqueue while it holds bdev->bd_mutex, so make the same change applies
to the ioctl to avoid similar lock issue.

And md_delayed_delete actually wants to delete rdev, so rename the function
to rdev_delayed_delete.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

cc1ffe61

md: add checkings before flush md_misc_wq · 21e0958e

由 Guoqing Jiang 提交于 4月 04, 2020

Coly reported possible circular locking dependencyi with LOCKDEP enabled,
quote the below info from the detailed report [1].

[ 1607.673903] Chain exists of:
[ 1607.673903]   kn->count#256 --> (wq_completion)md_misc -->
(work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946]  Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780]        CPU0                    CPU1
[ 1607.952980]        ----                    ----
[ 1608.007173]   lock((work_completion)(&rdev->del_work));
[ 1608.069690]                                lock((wq_completion)md_misc);
[ 1608.149887]                                lock((work_completion)(&rdev->del_work));
[ 1608.242563]   lock(kn->count#256);
[ 1608.283238]
[ 1608.283238]  *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152]  #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
process_one_work+0x42b/0xb30
[ 1608.512399]  #1: ffff888a1d3b7e10
((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]

Since works (rdev->del_work and mddev->del_work) are queued in md_misc_wq,
then lockdep_map lock is held if either of them are running, then both of
them try to hold kernfs lock by call kobject_del. Then if new_dev_store
or array_state_store are triggered by write to the related sysfs node, so
the write operation gets kernfs lock, but need the lockdep_map because all
of them would trigger flush_workqueue(md_misc_wq) finally, then the same
lockdep_map lock is needed.

To suppress the lockdep warnning, we should flush the workqueue in case the
related work is pending. And several works are attached to md_misc_wq, so
we need to check which work should be checked:

1. for __md_stop_writes, the purpose of call flush workqueue is ensure sync
thread is started if it was starting, so check mddev->del_work is pending
or not since md_start_sync is attached to mddev->del_work.

2. __md_stop flushes md_misc_wq to ensure event_work is done, check the
event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't
need the kernfs lock.

3. both new_dev_store (holds kernfs lock) and ADD_NEW_DISK ioctl (holds the
bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
completed, this case will be handled in next patch.

4. md_open flushes workqueue to ensure the previous md is disappeared, but
it holds bdev->bd_mutex then try to flush workqueue, so it is better to
check mddev->del_work as well to avoid potential lock issue, this will be
done in another patch.

[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2

Cc: Coly Li <colyli@suse.de>
Reported-by: NColy Li <colyli@suse.de>
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

21e0958e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功