提交 · daaca3522a8e67c46e39ef09c1d542e866f85f3b · openeuler / Kernel

15 3月, 2022 1 次提交

block: release rq qos structures for queue without disk · daaca352

由 Ming Lei 提交于 3月 14, 2022

blkcg_init_queue() may add rq qos structures to request queue, previously
blk_cleanup_queue() calls rq_qos_exit() to release them, but commit
8e141f9e ("block: drain file system I/O on del_gendisk")
moves rq_qos_exit() into del_gendisk(), so memory leak is caused
because queues may not have disk, such as un-present scsi luns, nvme
admin queue, ...

Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back.

BTW, v5.18 won't need this patch any more since we move
blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release
handler, and patches have been in for-5.18/block.

Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Fixes: 8e141f9e ("block: drain file system I/O on del_gendisk")
Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

daaca352

09 3月, 2022 1 次提交

block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection · 0a5aa8d1

由 Shin'ichiro Kawasaki 提交于 3月 08, 2022

Commit 9d497e29 ("block: don't protect submit_bio_checks by
q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle
calls out of q_usage_counter protection. However, these functions require
q_usage_counter protection. The blk_mq_attempt_bio_merge call without
the protection resulted in blktests block/005 failure with KASAN null-
ptr-deref or use-after-free at bio merge. The rq_qos_throttle call
without the protection caused kernel hang at qos throttle.

To fix the failures, move the blk_mq_attempt_bio_merge and
rq_qos_throttle calls back to q_usage_counter protection.

Fixes: 9d497e29 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.comReviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0a5aa8d1

28 2月, 2022 1 次提交

blktrace: fix use after free for struct blk_trace · 30939293

由 Yu Kuai 提交于 2月 28, 2022

When tracing the whole disk, 'dropped' and 'msg' will be created
under 'q->debugfs_dir' and 'bt->dir' is NULL, thus blk_trace_free()
won't remove those files. What's worse, the following UAF can be
triggered because of accessing stale 'dropped' and 'msg':

==================================================================
BUG: KASAN: use-after-free in blk_dropped_read+0x89/0x100
Read of size 4 at addr ffff88816912f3d8 by task blktrace/1188

CPU: 27 PID: 1188 Comm: blktrace Not tainted 5.17.0-rc4-next-20220217+ #469
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-4
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 print_address_description.constprop.0.cold+0xab/0x381
 ? blk_dropped_read+0x89/0x100
 ? blk_dropped_read+0x89/0x100
 kasan_report.cold+0x83/0xdf
 ? blk_dropped_read+0x89/0x100
 kasan_check_range+0x140/0x1b0
 blk_dropped_read+0x89/0x100
 ? blk_create_buf_file_callback+0x20/0x20
 ? kmem_cache_free+0xa1/0x500
 ? do_sys_openat2+0x258/0x460
 full_proxy_read+0x8f/0xc0
 vfs_read+0xc6/0x260
 ksys_read+0xb9/0x150
 ? vfs_write+0x3d0/0x3d0
 ? fpregs_assert_state_consistent+0x55/0x60
 ? exit_to_user_mode_prepare+0x39/0x1e0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fbc080d92fd
Code: ce 20 00 00 75 10 b8 00 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 1
RSP: 002b:00007fbb95ff9cb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fbb95ff9dc0 RCX: 00007fbc080d92fd
RDX: 0000000000000100 RSI: 00007fbb95ff9cc0 RDI: 0000000000000045
RBP: 0000000000000045 R08: 0000000000406299 R09: 00000000fffffffd
R10: 000000000153afa0 R11: 0000000000000293 R12: 00007fbb780008c0
R13: 00007fbb78000938 R14: 0000000000608b30 R15: 00007fbb780029c8
 </TASK>

Allocated by task 1050:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 do_blk_trace_setup+0xcb/0x410
 __blk_trace_setup+0xac/0x130
 blk_trace_ioctl+0xe9/0x1c0
 blkdev_ioctl+0xf1/0x390
 __x64_sys_ioctl+0xa5/0xe0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 1050:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 __kasan_slab_free+0x103/0x180
 kfree+0x9a/0x4c0
 __blk_trace_remove+0x53/0x70
 blk_trace_ioctl+0x199/0x1c0
 blkdev_common_ioctl+0x5e9/0xb30
 blkdev_ioctl+0x1a5/0x390
 __x64_sys_ioctl+0xa5/0xe0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88816912f380
 which belongs to the cache kmalloc-96 of size 96
The buggy address is located 88 bytes inside of
 96-byte region [ffff88816912f380, ffff88816912f3e0)
The buggy address belongs to the page:
page:000000009a1b4e7c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0f
flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0000200 ffffea00044f1100 dead000000000002 ffff88810004c780
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88816912f280: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff88816912f300: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff88816912f380: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                                                    ^
 ffff88816912f400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff88816912f480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================

Fixes: c0ea5760 ("blktrace: remove debugfs file dentries from struct blk_trace")
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220228034354.4047385-1-yukuai3@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

30939293

24 2月, 2022 1 次提交

Merge tag 'nvme-5.17-2022-02-24' of git://git.infradead.org/nvme into block-5.17 · b2750f14

由 Jens Axboe 提交于 2月 24, 2022

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.17

 - send H2CData PDUs based on MAXH2CDATA (Varun Prakash)
 - fix passthrough to namespaces with unsupported features (me)"

* tag 'nvme-5.17-2022-02-24' of git://git.infradead.org/nvme:
  nvme-tcp: send H2CData PDUs based on MAXH2CDATA
  nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info
  nvme: don't return an error from nvme_configure_metadata

b2750f14

23 2月, 2022 3 次提交

nvme-tcp: send H2CData PDUs based on MAXH2CDATA · c2700d28

由 Varun Prakash 提交于 1月 22, 2022

As per NVMe/TCP specification (revision 1.0a, section 3.6.2.3)
Maximum Host to Controller Data length (MAXH2CDATA): Specifies the
maximum number of PDU-Data bytes per H2CData PDU in bytes. This value
is a multiple of dwords and should be no less than 4,096.

Current code sets H2CData PDU data_length to r2t_length,
it does not check MAXH2CDATA value. Fix this by setting H2CData PDU
data_length to min(req->h2cdata_left, queue->maxh2cdata).

Also validate MAXH2CDATA value returned by target in ICResp PDU,
if it is not a multiple of dword or if it is less than 4096 return
-EINVAL from nvme_tcp_init_connection().
Signed-off-by: NVarun Prakash <varun@chelsio.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c2700d28

nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info · 602e57c9

由 Christoph Hellwig 提交于 2月 16, 2022

Commit e7d65803 ("nvme-multipath: revalidate paths during rescan")
introduced the NVME_NS_READY flag, which nvme_path_is_disabled() uses
to check if a path can be used or not.  We also need to set this flag
for devices that fail the ZNS feature validation and which are available
through passthrough devices only to that they can be used in multipathing
setups.

Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan")
Reported-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NDaniel Wagner <dwagner@suse.de>
Tested-by: NKanchan Joshi <joshi.k@samsung.com>

602e57c9

nvme: don't return an error from nvme_configure_metadata · 363f6368

由 Christoph Hellwig 提交于 2月 16, 2022

When a fabrics controller claims to support an invalidate metadata
configuration we already warn and disable metadata support.  No need to
also return an error during revalidation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NDaniel Wagner <dwagner@suse.de>
Tested-by: NKanchan Joshi <joshi.k@samsung.com>

363f6368

22 2月, 2022 1 次提交

block: clear iocb->private in blkdev_bio_end_io_async() · bb49c6fa

由 Stefano Garzarella 提交于 2月 11, 2022

iocb_bio_iopoll() expects iocb->private to be cleared before
releasing the bio.

We already do this in blkdev_bio_end_io(), but we forgot in the
recently added blkdev_bio_end_io_async().

Fixes: 54a88eb8 ("block: add single bio async direct IO helper")
Cc: asml.silence@gmail.com
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211090136.44471-1-sgarzare@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

bb49c6fa

17 2月, 2022 3 次提交

block/wbt: fix negative inflight counter when remove scsi device · e92bc4cd

由 Laibin Qiu 提交于 1月 22, 2022

Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.

The following is the scenario that triggered the problem.

T1                          T2                           T3
                            elevator_switch_mq
                            bfq_init_queue
                            wbt_disable_default <= Set
                            rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
                                                         scsi_remove_device
                                                         sd_remove
                                                         del_gendisk
                                                         blk_unregister_queue
                                                         elv_unregister_queue
                                                         wbt_enable_default
                                                         <= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.

Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.

Fixes: 76a80408 ("blk-wbt: make sure throttle is enabled properly")

Remove oneline stale comment, and kill one oneshot local variable.
Signed-off-by: NMing Lei <ming.lei@rehdat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e92bc4cd

block: fix surprise removal for drivers calling blk_set_queue_dying · 7a5428dc

由 Christoph Hellwig 提交于 2月 17, 2022

Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9e that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.

Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.

Fixes: 8e141f9e ("block: drain file system I/O on del_gendisk")
Reported-by: NMarkus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

7a5428dc

block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern · cc8f7fe1

由 Haimin Zhang 提交于 2月 16, 2022

Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize
the buffer of a bio.
Signed-off-by: NHaimin Zhang <tcs.kernel@gmail.com>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216084038.15635-1-tcs.kernel@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

cc8f7fe1

12 2月, 2022 2 次提交

block: loop:use kstatfs.f_bsize of backing file to set discard granularity · 06582bc8

由 Ming Lei 提交于 1月 26, 2022

If backing file's filesystem has implemented ->fallocate(), we think the
loop device can support discard, then pass sb->s_blocksize as
discard_granularity. However, some underlying FS, such as overlayfs,
doesn't set sb->s_blocksize, and causes discard_granularity to be set as
zero, then the warning in __blkdev_issue_discard() is triggered.

Christoph suggested to pass kstatfs.f_bsize as discard granularity, and
this way is fine because kstatfs.f_bsize means 'Optimal transfer block
size', which still matches with definition of discard granularity.

So fix the issue by setting discard_granularity as kstatfs.f_bsize if it
is available, otherwise claims discard isn't supported.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Reported-by: NPei Zhang <pezhang@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126035830.296465-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

06582bc8

block: Add handling for zone append command in blk_complete_request · a12821d5

由 Pankaj Raghav 提交于 2月 11, 2022

Zone append command needs special handling to update the bi_sector
field in the bio struct with the actual position of the data in the
device. It is stored in __sector field of the request struct.

Fixes: 5581a5dd ("block: add completion handler for fast path")
Signed-off-by: NPankaj Raghav <p.raghav@samsung.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NAdam Manzanares <a.manzanares@samsung.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220211093425.43262-2-p.raghav@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a12821d5

11 2月, 2022 1 次提交

loop: revert "make autoclear operation asynchronous" · bf23747e

由 Tetsuo Handa 提交于 2月 11, 2022

The kernel test robot is reporting that xfstest which does

  umount ext2 on xfs
  umount xfs

sequence started failing, for commit 322c4293 ("loop: make
autoclear operation asynchronous") removed a guarantee that fput() of
backing file is processed before lo_release() from close() returns to
user mode.

And syzbot is reporting that deferring destroy_workqueue() from
__loop_clr_fd() to a WQ context did not help [1]. Revert that commit.

Link: https://syzkaller.appspot.com/bug?extid=831661966588c802aae9 [1]
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Acked-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reported-by: Nsyzbot <syzbot+831661966588c802aae9@syzkaller.appspotmail.com>
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/20220211071554.3424-1-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NJens Axboe <axboe@kernel.dk>

bf23747e

10 2月, 2022 1 次提交

Merge tag 'nvme-5.17-2022-02-10' of git://git.infradead.org/nvme into block-5.17 · 93e2c52d

由 Jens Axboe 提交于 2月 10, 2022

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.17

 - nvme-tcp: fix bogus request completion when failing to send AER
   (Sagi Grimberg)
 - add the missing nvme_complete_req tracepoint for batched completion
   (Bean Huo)"

* tag 'nvme-5.17-2022-02-10' of git://git.infradead.org/nvme:
  nvme-tcp: fix bogus request completion when failing to send AER
  nvme: add nvme_complete_req tracepoint for batched completion

93e2c52d

09 2月, 2022 2 次提交

nvme-tcp: fix bogus request completion when failing to send AER · 63573807

由 Sagi Grimberg 提交于 2月 07, 2022

AER is not backed by a real request, hence we should not incorrectly
assume that when failing to send a nvme command, it is a normal request
but rather check if this is an aer and if so complete the aer (similar
to the normal completion path).

Cc: stable@vger.kernel.org
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

63573807

nvme: add nvme_complete_req tracepoint for batched completion · 00e757b6

由 Bean Huo 提交于 2月 08, 2022

Add NVMe request completion trace in nvme_complete_batch_req() because
nvme:nvme_complete_req tracepoint is missing in case of request batched
completion.
Signed-off-by: NBean Huo <beanhuo@micron.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

00e757b6

04 2月, 2022 3 次提交

block: bio-integrity: Advance seed correctly for larger interval sizes · b13e0c71

由 Martin K. Petersen 提交于 2月 03, 2022

Commit 309a62fa ("bio-integrity: bio_integrity_advance must update
integrity seed") added code to update the integrity seed value when
advancing a bio. However, it failed to take into account that the
integrity interval might be larger than the 512-byte block layer
sector size. This broke bio splitting on PI devices with 4KB logical
blocks.

The seed value should be advanced by bio_integrity_intervals() and not
the number of sectors.

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
Fixes: 309a62fa ("bio-integrity: bio_integrity_advance must update integrity seed")
Tested-by: NDmitry Ivanov <dmitry.ivanov2@hpe.com>
Reported-by: NAlexey Lyashkov <alexey.lyashkov@hpe.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220204034209.4193-1-martin.petersen@oracle.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b13e0c71

Merge tag 'nvme-5.17-2022-02-03' of git://git.infradead.org/nvme into block-5.17 · e8db8c9c

由 Jens Axboe 提交于 2月 03, 2022

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.17

 - fix a use-after-free in rdm and tcp controller reset (Sagi Grimberg)
 - fix the state check in nvmf_ctlr_matches_baseopts (Uday Shankar)"

* tag 'nvme-5.17-2022-02-03' of git://git.infradead.org/nvme:
  nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()
  nvme-rdma: fix possible use-after-free in transport error_recovery work
  nvme-tcp: fix possible use-after-free in transport error_recovery work
  nvme: fix a possible use-after-free in controller reset during load

e8db8c9c

Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.17 · aace2b7a

由 Jens Axboe 提交于 2月 03, 2022

Pull MD fix from Song:

"Please consider pulling the following fix on top of your block-5.17
 branch. It fixes a NULL ptr deref case with nowait."

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix NULL pointer deref with nowait but no mddev->queue

aace2b7a

03 2月, 2022 2 次提交

nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts() · 6a51abde

由 Uday Shankar 提交于 1月 20, 2022

Controller deletion/reset, immediately followed by or concurrent with
a reconnect, is hard failing the connect attempt resulting in a
complete loss of connectivity to the controller.

In the connect request, fabrics looks for an existing controller with
the same address components and aborts the connect if a controller
already exists and the duplicate connect option isn't set. The match
routine filters out controllers that are dead or dying, so they don't
interfere with the new connect request.

When NVME_CTRL_DELETING_NOIO was added, it missed updating the state
filters in the nvmf_ctlr_matches_baseopts() routine. Thus, when in this
new state, it's seen as a live controller and fails the connect request.

Correct by adding the DELETING_NIO state to the match checks.

Fixes: ecca390e ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")
Cc: <stable@vger.kernel.org> # v5.7+
Signed-off-by: NUday Shankar <ushankar@purestorage.com>
Reviewed-by: NJames Smart <jsmart2021@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

6a51abde

md: fix NULL pointer deref with nowait but no mddev->queue · 0f9650bd

由 Song Liu 提交于 2月 02, 2022

Leon reported NULL pointer deref with nowait support:

[   15.123761] device-mapper: raid: Loading target version 1.15.1
[   15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[   15.124192] device-mapper: raid: Choosing default region size of 4MiB
[   15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060
[   15.129530] #PF: supervisor write access in kernel mode
[   15.129533] #PF: error_code(0x0002) - not-present page
[   15.129535] PGD 0 P4D 0
[   15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e
[   15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021
[   15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20
[   15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \
       00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \
       31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00
[   15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202
[   15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000
[   15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d
[   15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000
[   15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070
[   15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001
[   15.129572] FS:  00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000
[   15.129575] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0
[   15.129580] Call Trace:
[   15.129582]  <TASK>
[   15.129584]  md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531]
[   15.129597]  raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63]
[   15.129604]  ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129615]  dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129625]  table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129635]  ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129644]  ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129655]  dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129663]  __x64_sys_ioctl+0x8e/0xd0
[   15.129667]  do_syscall_64+0x5c/0x90
[   15.129672]  ? syscall_exit_to_user_mode+0x23/0x50
[   15.129675]  ? do_syscall_64+0x69/0x90
[   15.129677]  ? do_syscall_64+0x69/0x90
[   15.129679]  ? syscall_exit_to_user_mode+0x23/0x50
[   15.129682]  ? do_syscall_64+0x69/0x90
[   15.129684]  ? do_syscall_64+0x69/0x90
[   15.129686]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   15.129689] RIP: 0033:0x7fa96ecd559b
[   15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \
    c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \
    ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48
[   15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[   15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b
[   15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003
[   15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27
[   15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610
[   15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530
[   15.129709]  </TASK>

This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT
Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check.

Fixes: f51d46d0 ("md: add support for REQ_NOWAIT")
Reported-by: NLeon Möller <jkhsjdhjs@totally.rip>
Tested-by: NLeon Möller <jkhsjdhjs@totally.rip>
Cc: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: NSong Liu <song@kernel.org>

0f9650bd

02 2月, 2022 4 次提交

block: fix DIO handling regressions in blkdev_read_iter() · 3e1f941d

由 Ilya Dryomov 提交于 2月 01, 2022

Commit ceaa7625 ("block: move direct_IO into our own read_iter
handler") introduced several regressions for bdev DIO:

1. read spanning EOF always returns 0 instead of the number of bytes
   read.  This is because "count" is assigned early and isn't updated
   when the iterator is truncated:

     $ lsblk -o name,size /dev/vdb
     NAME SIZE
     vdb    1G
     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 0/4194304 bytes at offset 1070596096
     0.000000 bytes, 0 ops; 0.0007 sec (0.000000 bytes/sec and 0.0000 ops/sec)

     instead of

     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 3145728/4194304 bytes at offset 1070596096
     3 MiB, 1 ops; 0.0007 sec (3.865 GiB/sec and 1319.2612 ops/sec)

2. truncated iterator isn't reexpanded
3. iterator isn't reverted on blkdev_direct_IO() error
4. zero size read no longer skips atime update

Fixes: ceaa7625 ("block: move direct_IO into our own read_iter handler")
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220201100420.25875-1-idryomov@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3e1f941d

nvme-rdma: fix possible use-after-free in transport error_recovery work · b6bb1722

由 Sagi Grimberg 提交于 2月 01, 2022

While nvme_rdma_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

b6bb1722

nvme-tcp: fix possible use-after-free in transport error_recovery work · ff9fc7eb

由 Sagi Grimberg 提交于 2月 01, 2022

While nvme_tcp_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.
Tested-by: NChris Leech <cleech@redhat.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

ff9fc7eb

nvme: fix a possible use-after-free in controller reset during load · 0fa0f99f

由 Sagi Grimberg 提交于 2月 01, 2022

Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl
readiness for AER submission. This may lead to a use-after-free
condition that was observed with nvme-tcp.

The race condition may happen in the following scenario:
1. driver executes its reset_ctrl_work
2. -> nvme_stop_ctrl - flushes ctrl async_event_work
3. ctrl sends AEN which is received by the host, which in turn
   schedules AEN handling
4. teardown admin queue (which releases the queue socket)
5. AEN processed, submits another AER, calling the driver to submit
6. driver attempts to send the cmd
==> use-after-free

In order to fix that, add ctrl state check to validate the ctrl
is actually able to accept the AER submission.

This addresses the above race in controller resets because the driver
during teardown should:
1. change ctrl state to RESETTING
2. flush async_event_work (as well as other async work elements)

So after 1,2, any other AER command will find the
ctrl state to be RESETTING and bail out without submitting the AER.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

0fa0f99f

29 1月, 2022 3 次提交

dm: properly fix redundant bio-based IO accounting · b879f915

由 Mike Snitzer 提交于 1月 28, 2022

Record the start_time for a bio but defer the starting block core's IO
accounting until after IO is submitted using bio_start_io_acct_time().

This approach avoids the need to mess around with any of the
individual IO stats in response to a bio_split() that follows bio
submission.
Reported-by: NBud Brown <bubrown@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Depends-on: e45c47d1 ("block: add bio_start_io_acct_time() to control start_time")
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-4-snitzer@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b879f915

dm: revert partial fix for redundant bio-based IO accounting · f524d9c9

由 Mike Snitzer 提交于 1月 28, 2022

Reverts a1e1cb72 ("dm: fix redundant IO accounting for bios that
need splitting") because it was too narrow in scope (only addressed
redundant 'sectors[]' accounting and not ios, nsecs[], etc).

Cc: stable@vger.kernel.org
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-3-snitzer@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

f524d9c9

block: add bio_start_io_acct_time() to control start_time · e45c47d1

由 Mike Snitzer 提交于 1月 28, 2022

bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e45c47d1

28 1月, 2022 1 次提交

blk-mq: Fix wrong wakeup batch configuration which will cause hang · 10825410

由 Laibin Qiu 提交于 1月 27, 2022

Commit 180dccb0 ("blk-mq: fix tag_get wait task can't be
awakened") will recalculate wake_batch when incrementing or decrementing
active_queues to avoid wake_batch > hctx_max_depth. At the same time, in
order to not affect performance as much as possible, the minimum wakeup
batch is set to 4. But when the QD is small (such as QD=1), if inc or dec
active_queues increases wakeup batch, that can lead to a hang:

Fix this problem with the following strategies:
QD          :  >= 32 | < 32
---------------------------------
wakeup batch:  8~4   | 3~1

Fixes: 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened")
Link: https://lore.kernel.org/linux-block/78cafe94-a787-e006-8851-69906f0c2128@huawei.com/T/#tReported-by: NAlex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com>
Tested-by: NAlex Xu (Hello71) <alex_y_xu@yahoo.ca>
Link: https://lore.kernel.org/r/20220127100047.1763746-1-qiulaibin@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

10825410

27 1月, 2022 3 次提交

Merge tag 'nvme-5.17-2022-01-27' of git://git.infradead.org/nvme into block-5.17 · 3c8cef9f

由 Jens Axboe 提交于 1月 27, 2022

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.17

 - add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs (Wu Zheng)
 - remove the unneeded ret variable in nvmf_dev_show (Changcheng Deng)"

* tag 'nvme-5.17-2022-01-27' of git://git.infradead.org/nvme:
  nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show
  nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs

3c8cef9f

nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show · a5f3851b

由 Changcheng Deng 提交于 1月 07, 2022

Remove unneeded variable and directly return 0.
Reported-by: NZeal Robot <zealci@zte.com.cn>
Signed-off-by: NChangcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a5f3851b

nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs · 25e58af4

由 Wu Zheng 提交于 6月 21, 2021

The Intel P4500/P4600 SSDs do not report a subsystem NQN despite claiming
compliance to a standards version where reporting one is required.

Add the IGNORE_DEV_SUBNQN quirk to not fail the initialization of a
second such SSDs in a system.
Signed-off-by: NZheng Wu <wu.zheng@intel.com>
Signed-off-by: NYe Jinhe <jinhe.ye@intel.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

25e58af4

26 1月, 2022 1 次提交

blk-mq: fix missing blk_account_io_done() in error path · 592ee119

由 Yu Kuai 提交于 1月 26, 2022

If blk_mq_request_issue_directly() failed from
blk_insert_cloned_request(), the request will be accounted start.
Currently, blk_insert_cloned_request() is only called by dm, and such
request won't be accounted done by dm.

In normal path, io will be accounted start from blk_mq_bio_to_request(),
when the request is allocated, and such io will be accounted done from
__blk_mq_end_request_acct() whether it succeeded or failed. Thus add
blk_account_io_done() to fix the problem.
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126012132.3111551-1-yukuai3@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

592ee119

24 1月, 2022 1 次提交

block: fix memory leak in disk_register_independent_access_ranges · 83114df3

由 Miaoqian Lin 提交于 1月 20, 2022

kobject_init_and_add() takes reference even when it fails.
According to the doc of kobject_init_and_add()

   If this function returns an error, kobject_put() must be called to
   properly clean up the memory associated with the object.

Fix this issue by adding kobject_put().
Callback function blk_ia_ranges_sysfs_release() in kobject_put()
can handle the pointer "iars" properly.

Fixes: a2247f19 ("block: Add independent access ranges support")
Signed-off-by: NMiaoqian Lin <linmq006@gmail.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220120101025.22411-1-linmq006@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

83114df3

23 1月, 2022 5 次提交

Merge tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · dd81e1c7

由 Linus Torvalds 提交于 1月 23, 2022

Pull powerpc fixes from Michael Ellerman:

 - A series of bpf fixes, including an oops fix and some codegen fixes.

 - Fix a regression in syscall_get_arch() for compat processes.

 - Fix boot failure on some 32-bit systems with KASAN enabled.

 - A couple of other build/minor fixes.

Thanks to Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa,
Johan Almbladh, Maxime Bizon, Naveen N. Rao, and Nicholas Piggin.

* tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Mask SRR0 before checking against the masked NIP
  powerpc/perf: Only define power_pmu_wants_prompt_pmi() for CONFIG_PPC64
  powerpc/32s: Fix kasan_init_region() for KASAN
  powerpc/time: Fix build failure due to do_hard_irq_enable() on PPC32
  powerpc/audit: Fix syscall_get_arch()
  powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
  tools/bpf: Rename 'struct event' to avoid naming conflict
  powerpc/bpf: Update ldimm64 instructions during extra pass
  powerpc32/bpf: Fix codegen for bpf-to-bpf calls
  bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()

dd81e1c7

Merge tag 'irq_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ac5a9bb6

由 Linus Torvalds 提交于 1月 23, 2022

Pull irq fix from Borislav Petkov:
 "A single use-after-free fix in the PCI MSI irq domain allocation path"

* tag 'irq_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  PCI/MSI: Prevent UAF in error path

ac5a9bb6

Merge tag 'sched_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 10c64a0f

由 Linus Torvalds 提交于 1月 23, 2022

Pull scheduler fixes from Borislav Petkov:
 "A bunch of fixes: forced idle time accounting, utilization values
  propagation in the sched hierarchies and other minor cleanups and
  improvements"

* tag 'sched_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/sched: Remove dl_boosted flag comment
  sched: Avoid double preemption in __cond_resched_*lock*()
  sched/fair: Fix all kernel-doc warnings
  sched/core: Accounting forceidle time for all tasks except idle task
  sched/pelt: Relax the sync of load_sum with load_avg
  sched/pelt: Relax the sync of runnable_sum with runnable_avg
  sched/pelt: Continue to relax the sync of util_sum with util_avg
  sched/pelt: Relax the sync of util_sum with util_avg
  psi: Fix uaf issue when psi trigger is destroyed while being polled

10c64a0f

Merge tag 'perf_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0f9e0422

由 Linus Torvalds 提交于 1月 23, 2022

Pull perf fixes from Borislav Petkov:

 - Add support for accessing the general purpose counters on Alder Lake
   via MMIO

 - Add new LBR format v7 support which is v5 modulo TSX

 - Fix counter enumeration on Alder Lake hybrids

 - Overhaul how context time updates are done and get rid of
   perf_event::shadow_ctx_time.

 - The usual amount of fixes: event mask correction, supported event
   types reporting, etc.

* tag 'perf_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/perf: Avoid warning for Arch LBR without XSAVE
  perf/x86/intel/uncore: Add IMC uncore support for ADL
  perf/x86/intel/lbr: Add static_branch for LBR INFO flags
  perf/x86/intel/lbr: Support LBR format V7
  perf/x86/rapl: fix AMD event handling
  perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX
  perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake
  perf: Fix perf_event_read_local() time

0f9e0422

L

Linux 5.17-rc1 · e783362e
由 Linus Torvalds 提交于 1月 23, 2022

e783362e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功