1. 15 November 2022, 18 commits
    • nvme: implement the DEAC bit for the Write Zeroes command · 1b96f862
      Christoph Hellwig authored
      While the specification allows devices to either deallocate data
      or to actually write zeroes on any Write Zeroes command, many SSDs
      only do the sensible thing and deallocate data when the DEAC bit
      is specified.  Set it when it is supported and the caller doesn't
      explicitly opt out of deallocation.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • nvme: identify-namespace without CAP_SYS_ADMIN · e4fbcf32
      Kanchan Joshi authored
      Allow all identify-namespace variants (CNS 00h, 05h and 08h) without
      requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is
      needed to form IO commands for passthrough interface.
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fine-granular CAP_SYS_ADMIN for nvme io commands · 855b7717
      Kanchan Joshi authored
      Currently both io and admin commands are kept under a
      coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely.
      
      $ ls -l /dev/ng*
      crw-rw-rw- 1 root root 242, 0 Sep  9 19:20 /dev/ng0n1
      crw------- 1 root root 242, 1 Sep  9 19:20 /dev/ng0n2
      
      In the example above, ng0n1 appears as if it may allow unprivileged
      read/write operations, but it does not: it behaves the same as ng0n2.
      
      This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
      control for io-commands.
      If CAP_SYS_ADMIN is present, nothing else is checked as before.
      Otherwise, the following rules are in place:
      - no admin command is allowed
      - vendor-specific and fabric commands are not allowed
      - io-commands that can write are allowed if matching FMODE_WRITE
      permission is present
      - io-commands that read are allowed
      
      Add a helper nvme_cmd_allowed that implements above policy.
      Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
      for any decision making.
      Since the file open mode factors into any approval/denial, change
      various places to keep the file-mode information handy.
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-fc: improve memory usage in nvme_fc_rcv_ls_req() · cf3d0084
      Christophe JAILLET authored
      sizeof( struct nvmefc_ls_rcv_op ) = 64
      sizeof( union nvmefc_ls_requests ) = 1024
      sizeof( union nvmefc_ls_responses ) = 128
      
      So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
      kzalloc() is called.
      
      Because of the way memory allocations are performed, 2048 bytes are
      allocated. So about 800 bytes are wasted for each request.
      
      Switch to 3 distinct memory allocations, in order to:
         - save these 800 bytes
         - avoid zeroing this extra memory
         - make sure that memory is properly aligned in case of DMA access
          ("fc_dma_map_single(lsop->rspbuf)" just a few lines below)
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet: only allocate a single slab for bvecs · fa8f9ac4
      Christoph Hellwig authored
      There is no need to have a separate slab cache for each namespace,
      and having separate ones creates duplicate debugfs file names as well.
      
      Fixes: d5eff33e ("nvmet: add simple file backed ns support")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • nvmet: force reconnect when number of queue changes · 2be2cd52
      Daniel Wagner authored
      In order to test queue number changes we need to make sure that the
      host reconnects, because according to the spec the number of queues
      is only allowed to change while the host is disconnected from the
      target.
      
      The initial idea was to disable and re-enable the ports and have the
      host wait until the KATO timer expires, triggering error recovery.
      However, the host would then see a DNR reply when trying to
      reconnect, and because of the DNR bit the connection is dropped
      completely: according to the spec there is no point in retrying with
      the same parameters.
      
      We can force the host to reconnect by deleting all controllers. The
      host will observe any newly posted request fail and thus start the
      error recovery, but this time without the DNR bit set.
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet: use try_cmpxchg in nvmet_update_sq_head · bbf5410b
      Uros Bizjak authored
      Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
      nvmet_update_sq_head.  x86 CMPXCHG instruction returns success in ZF flag, so
      this change saves a compare after cmpxchg (and related move instruction in
      front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • Merge branch 'md-next' of... · 5626196a
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
      
      Pull MD fixes from Song.
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid1: stop mdx_raid1 thread when raid1 array run failed
        md/raid5: use bdev_write_cache instead of open coding it
        md: fix a crash in mempool_free
        md/raid0, raid10: Don't set discard sectors for request queue
        md/bitmap: Fix bitmap chunk size overflow issues
        md: introduce md_ro_state
        md: factor out __md_set_array_info()
        lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE
        raid5-cache: use try_cmpxchg in r5l_wake_reclaim
        drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()
    • md/raid1: stop mdx_raid1 thread when raid1 array run failed · b611ad14
      Jiang Li authored
      When we assemble an array with only inactive disks, running the raid1
      array fails, but the mdx_raid1 thread is not stopped even though the
      associated resources have been released. This causes a NULL pointer
      dereference when we power off.
      
      This causes the following Oops:
          [  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
          [  287.594762] #PF: supervisor read access in kernel mode
          [  287.599912] #PF: error_code(0x0000) - not-present page
          [  287.605061] PGD 0 P4D 0
          [  287.607612] Oops: 0000 [#1] SMP NOPTI
          [  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G     U            5.10.146 #0
          [  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
          [  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
          [  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
          [  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
          [  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
          [  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
          [  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
          [  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
          [  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
          [  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
          [  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
          [  287.713033] Call Trace:
          [  287.715498]  raid1d+0x6c/0xbbb [raid1]
          [  287.719256]  ? __schedule+0x1ff/0x760
          [  287.722930]  ? schedule+0x3b/0xb0
          [  287.726260]  ? schedule_timeout+0x1ed/0x290
          [  287.730456]  ? __switch_to+0x11f/0x400
          [  287.734219]  md_thread+0xe9/0x140 [md_mod]
          [  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
          [  287.742601]  ? wait_woken+0x80/0x80
          [  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
          [  287.751064]  kthread+0x11a/0x140
          [  287.754300]  ? kthread_park+0x90/0x90
          [  287.757974]  ret_from_fork+0x1f/0x30
      
      In fact, when the raid1 array fails to run, we need to call
      md_unregister_thread() before raid1_free().
      Signed-off-by: Jiang Li <jiang.li@ugreen.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid5: use bdev_write_cache instead of open coding it · ad831a16
      Christoph Hellwig authored
      Use the bdev_write_cache instead of two equivalent open coded checks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix a crash in mempool_free · 341097ee
      Mikulas Patocka authored
      There's a crash in mempool_free when running the lvm test
      shell/lvchange-rebuild-raid.sh.
      
      The reason for the crash is this:
      * super_written calls atomic_dec_and_test(&mddev->pending_writes) and
        wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
        and bio_put(bio).
      * so, the process that waited on sb_wait and that is woken up is racing
        with bio_put(bio).
      * if the process wins the race, it calls bioset_exit before bio_put(bio)
        is executed.
      * bio_put(bio) attempts to free a bio into a destroyed bio set - causing
        a crash in mempool_free.
      
      We fix this bug by moving bio_put before atomic_dec_and_test.
      
      We also move rdev_dec_pending before atomic_dec_and_test as suggested by
      Neil Brown.
      
      The function md_end_flush has a similar bug - we must call bio_put before
      we decrement the number of in-progress bios.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 11557f0067 P4D 11557f0067 PUD 0
       Oops: 0002 [#1] PREEMPT SMP
       CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
       Workqueue: kdelayd flush_expired_bios [dm_delay]
       RIP: 0010:mempool_free+0x47/0x80
       Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
       RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
       RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
       RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
       R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
       R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
       FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
       Call Trace:
        <TASK>
        clone_endio+0xf4/0x1c0 [dm_mod]
        clone_endio+0xf4/0x1c0 [dm_mod]
        __submit_bio+0x76/0x120
        submit_bio_noacct_nocheck+0xb6/0x2a0
        flush_expired_bios+0x28/0x2f [dm_delay]
        process_one_work+0x1b4/0x300
        worker_thread+0x45/0x3e0
        ? rescuer_thread+0x380/0x380
        kthread+0xc2/0x100
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
       Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
       CR2: 0000000000000000
       ---[ end trace 0000000000000000 ]---
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid0, raid10: Don't set discard sectors for request queue · 8e1a2279
      Xiao Ni authored
      It should use disk_stack_limits to get a proper max_discard_sectors
      rather than setting a value by stack drivers.
      
      And there is a bug: if all member disks are rotational devices,
      raid0/raid10 still set max_discard_sectors. So although the member
      devices are not ssd/nvme, raid0/raid10 export the wrong value, and
      __blkdev_issue_discard reports warning messages when running
      mkfs.xfs, like this:
      
      [ 4616.022599] ------------[ cut here ]------------
      [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0
      [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0
      [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7
      [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246
      [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080
      [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080
      [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000
      [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000
      [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080
      [ 4616.213259] FS:  00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000
      [ 4616.222298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0
      [ 4616.236689] Call Trace:
      [ 4616.239428]  blkdev_issue_discard+0x52/0xb0
      [ 4616.244108]  blkdev_common_ioctl+0x43c/0xa00
      [ 4616.248883]  blkdev_ioctl+0x116/0x280
      [ 4616.252977]  __x64_sys_ioctl+0x8a/0xc0
      [ 4616.257163]  do_syscall_64+0x5c/0x90
      [ 4616.261164]  ? handle_mm_fault+0xc5/0x2a0
      [ 4616.265652]  ? do_user_addr_fault+0x1d8/0x690
      [ 4616.270527]  ? do_syscall_64+0x69/0x90
      [ 4616.274717]  ? exc_page_fault+0x62/0x150
      [ 4616.279097]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [ 4616.284748] RIP: 0033:0x7f9a55398c6b
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/bitmap: Fix bitmap chunk size overflow issues · 45552111
      Florian-Ewald Mueller authored
      - limit bitmap chunk size internal u64 variable to values not overflowing
        the u32 bitmap superblock structure variable stored on persistent media
      - assign bitmap chunk size internal u64 variable from unsigned values to
        avoid possible sign extension artifacts when assigning from a s32 value
      
      The bug has been there since at least kernel 4.0.
      Steps to reproduce it:
      1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2
         -n2 /dev/rnbd1 /dev/rnbd2
      2: resize member devices rnbd1 and rnbd2 to 8 TB
      3: mdadm --grow /dev/mdx --size=max
      
      The bitmap_chunksize will overflow without patch.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: introduce md_ro_state · f97a5528
      Ye Bin authored
      Introduce md_ro_state for mddev->ro, so it is easy to understand.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: factor out __md_set_array_info() · 2f6d261e
      Ye Bin authored
      Factor out __md_set_array_info(). No functional change.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE · 42271ca3
      Giulio Benetti authored
      RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.
      Signed-off-by: Giulio Benetti <giulio.benetti@benettiengineering.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • raid5-cache: use try_cmpxchg in r5l_wake_reclaim · 9487a0f6
      Uros Bizjak authored
      Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
      r5l_wake_reclaim.  x86 CMPXCHG instruction returns success in ZF flag, so
      this change saves a compare after cmpxchg (and related move instruction in
      front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      
      Cc: Song Liu <song@kernel.org>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • drivers/md/md-bitmap: check the return value of md_bitmap_get_counter() · 3bd548e5
      Li Zhong authored
      Check the return value of md_bitmap_get_counter() in case it returns
      NULL pointer, which will result in a null pointer dereference.
      
      v2: update the check to include other dereference
      Signed-off-by: Li Zhong <floridsleeves@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. 11 November 2022, 3 commits
    • sbitmap: Use single per-bitmap counting to wake up queued tags · 4f8126bb
      Gabriel Krisman Bertazi authored
      sbitmap suffers from code complexity, as demonstrated by recent fixes,
      and eventual lost wake ups on nested I/O completion.  The latter happens,
      from what I understand, due to the non-atomic nature of the updates to
      wait_cnt, which needs to be subtracted and eventually reset when equal
      to zero.  This two step process can eventually miss an update when a
      nested completion happens to interrupt the CPU in between the wait_cnt
      updates.  This is very hard to fix, as shown by the recent changes to
      this code.
      
      The code complexity arises mostly from the corner cases to avoid missed
      wakes in this scenario.  In addition, the handling of wake_batch
      recalculation plus the synchronization with sbq_queue_wake_up is
      non-trivial.
      
      This patchset implements the idea originally proposed by Jan [1], which
      removes the need for the two-step updates of wait_cnt.  This is done by
      tracking the number of completions and wakeups in always increasing,
      per-bitmap counters.  Instead of having to reset the wait_cnt when it
      reaches zero, we simply keep counting, and attempt to wake up N threads
      in a single wait queue whenever there is enough space for a batch.
      Waking up less than wake_batch shouldn't be a problem, because we
      haven't changed the conditions for wake up, and the existing batch
      calculation guarantees at least enough remaining completions to wake up
      a batch for each queue at any time.
      
      Performance-wise, one should expect very similar performance to the
      original algorithm for the case where there is no queueing.  In both the
      old algorithm and this implementation, the first thing is to check
      ws_active, which bails out if there is no queueing to be managed. In the
      new code, we took care to avoid accounting completions and wakeups when
      there is no queueing, to not pay the cost of atomic operations
      unnecessarily, since it doesn't skew the numbers.
      
      For more interesting cases, where there is queueing, we need to take
      into account the cross-communication of the atomic operations.  I've
      been benchmarking by running parallel fio jobs against a single hctx
      nullb in different hardware queue depth scenarios, and verifying both
      IOPS and queueing.
      
      Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
      jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
      varying only the hardware queue length per test.
      
      queue size   2                4                8                16                32                64
      6.1-rc2      1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
      patched      1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
      
      The following is a similar experiment, ran against a nullb with a single
      bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
      parallel fio jobs operating on the same device
      
      queue size   2                4                8                16                32                 64
      6.1-rc2      1081.0K (2.3K)   957.2K (1.5K)    1699.1K (5.7K)   6178.2K (124.6K)  12227.9K (37.7K)   13286.6K (92.9K)
      patched      1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)   11893.6K (17.5K)   12385.6K (18.4K)
      
      It has also survived blktests and a 12h-stress run against nullb. I also
      ran the code against nvme and a scsi SSD, and I didn't observe
      performance regression in those. If there are other tests you think I
      should run, please let me know and I will follow up with results.
      
      [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Liu Song <liusong@linux.alibaba.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: simplify blk_mq_realloc_tag_set_tags · ee9d5521
      Christoph Hellwig authored
      Use set->nr_hw_queues for the current number of tags, and remove the
      duplicate set->nr_hw_queues update in the caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: remove blk_mq_alloc_tag_set_tags · 5ee20298
      Christoph Hellwig authored
      There is no point in trying to share any code with the realloc case when
      all that is needed by the initial tagset allocation is a simple
      kcalloc_node.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 10 November 2022, 14 commits
  4. 07 November 2022, 1 commit
  5. 02 November 2022, 4 commits