- 21 11月, 2019 1 次提交
-
-
由 Dmitry Monakhov 提交于
In most cases blk_tracing is not active, but bfq_log_bfqq macro generate pid_str unconditionally, which result in significant overhead. ## Test modprobe null_blk echo bfq > /sys/block/nullb0/queue/scheduler fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \ --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k # Results | | baseline | w/ patch | gain | | iops | 113.19K | 126.42K | +11% | Acked-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NDmitry Monakhov <dmonakhov@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 19 11月, 2019 1 次提交
-
-
由 Revanth Rajashekar 提交于
In function 'activate_lsp', rather than hard-coding the short atom header(0x83), we need to let the function 'add_short_atom_header' append the header based on the parameter being appended. The parameter has been defined in Section 3.1.2.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_Single_User_Mode_v1-00_r1-00-Final.pdfReviewed-by: NJon Derrick <jonathan.derrick@intel.com> Signed-off-by: NRevanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 11月, 2019 2 次提交
-
-
由 Tejun Heo 提交于
Currently, cgroup rstat is supported only on cgroup2 hierarchy and rstat functions shouldn't be called on cgroup1 cgroups. While converting blk-cgroup core statistics to rstat, f7331648 ("blk-cgroup: reimplement basic IO stats using cgroup rstat") accidentally ended up calling cgroup_rstat_updated() on cgroup1 cgroups causing crashes. Longer term, we probably should add cgroup1 support to rstat but for now let's mask the call directly. Fixes: f7331648 ("blk-cgroup: reimplement basic IO stats using cgroup rstat") Tested-by: NFaiz Abbas <faiz_abbas@ti.com> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
trigger_softirq() is always invoked as a SMP-function call which is always invoked with disables interrupts. Don't disable interrupt in trigger_softirq() because interrupts are already disabled. Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 11月, 2019 2 次提交
-
-
由 John Garry 提交于
Since the only caller of this function has been deleted, delete this one also. Signed-off-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 John Garry 提交于
These functions are not referenced, so delete them. Signed-off-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 11月, 2019 9 次提交
-
-
由 Ming Lei 提交于
64K PAGE_SIZE is popular on ARM64 or other ARCHs, and 64K has been big enough to break some devices probably, so change the logic to split bio if the only bvec's length is > SZ_4K instead of PAGE_SIZE. Fixes: fa532287 (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
Some device may set segment boundary as PAGE_SIZE - 1. If the bvec crosses pages, and meantime its length is <= PAGE_SIZE, we still need to split the bvec into 2 segments. Fixes this issue by still splitting bio if the single bvec crosses pages. Reported-by: Nkernel test robot <lkp@intel.com> Fixes: fa532287 (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blkg_rwstat is now only used by bfq-iosched and blk-throtl when on cgroup1. Let's move it into its own files and gate it behind a config option. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blk-cgroup has been using blkg_rwstat to track basic IO stats. Unfortunately, reading recursive stats scales badly as itinvolves walking all descendants. On systems with a huge number of cgroups (dead or alive), this can lead to substantial CPU cost when reading IO stats. This patch reimplements basic IO stats using cgroup rstat which uses more memory but makes recursive stat reading O(# descendants which have been active since last reading) instead of O(# descendants). * blk-cgroup core no longer uses sync/async stats. Introduce new stat enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}. * Add blkg_iostat[_set] which encapsulates byte and io stats, last values for propagation delta calculation and u64_stats_sync for correctness on 32bit archs. * Update the new percpu stat counters directly and implement blkcg_rstat_flush() to implement propagation. * blkg_print_stat() can now bring the stats up to date by calling cgroup_rstat_flush() and print them instead of directly summing up all descendants. * It now allocates 96 bytes per cpu. It used to be 40 bytes. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Cc: Daniel Xu <dlxu@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
These don't have users anymore. Remove them. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
When used on cgroup1, blk-throtl uses the blkg->stat_bytes and ->stat_ios from blk-cgroup core to populate four stat knobs. blk-cgroup core is moving away from blkg_rwstat to improve scalability and won't be able to support this usage. It isn't like the sharing gains all that much. Let's break them out to dedicated rwstat counters which are updated when on cgroup1. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios from blk-cgroup core to populate six stat knobs. blk-cgroup core is moving away from blkg_rwstat to improve scalability and won't be able to support this usage. It isn't like the sharing gains all that much. Let's break it out to dedicated rwstat counters which are updated when on cgroup1. This makes use of bfqg_*rwstat*() helpers outside of CONFIG_BFQ_CGROUP_DEBUG. Move them out. v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG. The next patch will use them from !DEBUG path and this makes it easy to move them out of the ifdef block. This is pure code reorganization. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Pull on for-linus to resolve what otherwise would have been a conflict with the cgroups rstat patchset from Tejun. * for-linus: (942 commits) blkcg: make blkcg_print_stat() print stats only for online blkgs nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload iocost: don't nest spin_lock_irq in ioc_weight_write() io_uring: ensure we clear io_kiocb->result before each issue um-ubd: Entrust re-queue to the upper layers nvme-multipath: remove unused groups_only mode in ana log nvme-multipath: fix possible io hang after ctrl reconnect io_uring: don't touch ctx in setup after ring fd install io_uring: Fix leaked shadow_req Linux 5.4-rc5 riscv: cleanup do_trap_break nbd: verify socket is supported during setup ata: libahci_platform: Fix regulator_get_optional() misuse nbd: handle racing with error'ed out commands nbd: protect cmd->status with cmd->lock io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: used cached copies of sq->dropped and cq->overflow ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157 ...
-
- 07 11月, 2019 5 次提交
-
-
由 Ajay Joshi 提交于
Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE to allow applications to control the condition of zones on a zoned block device through the execution of the REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations. Contains contributions from Matias Bjorling, Hans Holmberg, Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig. Reviewed-by: NJavier González <javier@javigon.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAjay Joshi <ajay.joshi@wdc.com> Signed-off-by: NMatias Bjorling <matias.bjorling@wdc.com> Signed-off-by: NHans Holmberg <hans.holmberg@wdc.com> Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ajay Joshi 提交于
Zoned block devices (ZBC and ZAC devices) allow an explicit control over the condition (state) of zones. The operations allowed are: * Open a zone: Transition to open condition to indicate that a zone will actively be written * Close a zone: Transition to closed condition to release the drive resources used for writing to a zone * Finish a zone: Transition an open or closed zone to the full condition to prevent write operations To enable this control for in-kernel zoned block device users, define the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH as well as the generic function blkdev_zone_mgmt() for submitting these operations on a range of zones. This results in blkdev_reset_zones() removal and replacement with this new zone magement function. Users of blkdev_reset_zones() (f2fs and dm-zoned) are updated accordingly. Contains contributions from Matias Bjorling, Hans Holmberg, Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig. Reviewed-by: NJavier González <javier@javigon.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAjay Joshi <ajay.joshi@wdc.com> Signed-off-by: NMatias Bjorling <matias.bjorling@wdc.com> Signed-off-by: NHans Holmberg <hans.holmberg@wdc.com> Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
There is no need for the function __blkdev_reset_all_zones() as REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones() bio loop with an early break from the loop. This patch removes this function and modifies blkdev_reset_zones(), simplifying the code. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
REQ_OP_ZONE_RESET operations cannot be merged as these bios and requests do not have a size and are never sequential due to the zone start sector position required for their execution. As a result, there is no point in using a plug around blkdev_reset_zones() bio issuing loop. This patch removes this unnecessary plugging. Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NJavier González <javier@javigon.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blkcg_print_stat() iterates blkgs under RCU and doesn't test whether the blkg is online. This can call into pd_stat_fn() on a pd which is still being initialized leading to an oops. The heaviest operation - recursively summing up rwstat counters - is already done while holding the queue_lock. Expand queue_lock to cover the other operations and skip the blkg if it isn't online yet. The online state is protected by both blkcg and queue locks, so this guarantees that only online blkgs are processed. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NRoman Gushchin <guro@fb.com> Cc: Josef Bacik <jbacik@fb.com> Fixes: 903d23f0 ("blk-cgroup: allow controllers to output their own stats") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 06 11月, 2019 3 次提交
-
-
由 Jan Kara 提交于
With transition to blk-mq, the elevator= kernel argument was removed as it makes less and less sense with the current variety of devices. Since this may surprise some users and there are advices on the Internet that still suggest to use it, let's at least warn if the parameter is used. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvme由 Jens Axboe 提交于
Pull NVMe fixes from Keith: "We have a few late nvme fixes for a couple device removal kernel crashes, and a compat fix for a new ioctl introduced during this merge window." * 'nvme-5.4-rc7' of git://git.infradead.org/nvme: nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload
-
由 Charles Machalow 提交于
Changing nvme_passthru_cmd64 to add a field: rsvd2. This field is an explicit marker for the padding space added on certain platforms as a result of the enlargement of the result field from 32 bit to 64 bits in size, and fixes differences in struct size when using compat ioctl for 32-bit binaries on 64-bit architecture. Fixes: 65e68edc ("nvme: allow 64-bit results in passthru commands") Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NCharles Machalow <csm10495@gmail.com> [changelog] Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 05 11月, 2019 3 次提交
-
-
由 Anton Eidelman 提交于
nvme_mpath_clear_ctrl_paths() iterates through the ctrl->namespaces list while holding ctrl->scan_lock. This does not seem to be the correct way of protecting from concurrent list modification. Specifically, nvme_scan_work() sorts ctrl->namespaces AFTER unlocking scan_lock. This may result in the following (rare) crash in ctrl disconnect during scan_work: BUG: kernel NULL pointer dereference, address: 0000000000000050 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core] ... Call Trace: nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core] nvme_remove_namespaces+0x35/0xe0 [nvme_core] nvme_do_delete_ctrl+0x47/0x90 [nvme_core] nvme_sysfs_delete+0x49/0x60 [nvme_core] dev_attr_store+0x17/0x30 sysfs_kf_write+0x3e/0x50 kernfs_fop_write+0x11e/0x1a0 __vfs_write+0x1b/0x40 vfs_write+0xb9/0x1a0 ksys_write+0x67/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f8d02bfb154 Fix: After taking scan_lock in nvme_mpath_clear_ctrl_paths() down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe. This will not cause deadlocks because taking scan_lock never happens while holding the namespaces_rwsem. Moreover, scan work downs namespaces_rwsem in the same order. Alternative: sort ctrl->namespaces in nvme_scan_work() while still holding the scan_lock. This would leave nvme_mpath_clear_ctrl_paths() without correct protection against ctrl->namespaces modification by anyone other than scan_work. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Max Gurtovoy 提交于
In case there are controllers that are not associated with any RDMA device (e.g. during unsuccessful reconnection) and the user will unload the module, these controllers will not be freed and will access already freed memory. The same logic appears in other fabric drivers as well. Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset") Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Christoph Hellwig 提交于
__blk_queue_split() adds significant overhead for small I/O operations. Add a shortcut to avoid it for cases where we know we never need to split. Based on a patch from Ming Lei. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 04 11月, 2019 4 次提交
-
-
由 Ming Lei 提交于
8962842c ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") avoids sysfs buffer overflow, and reserves one character for line break. However, the last snprintf() doesn't get correct 'size' parameter passed in, so fixed it. Fixes: 8962842c ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Revanth Rajashekar 提交于
This patch introduces Opal Datastore UID. The generic read/write table ioctl can use this UID to access the Opal Datastore. Reviewed-by: NScott Bauer <sbauer@plzdonthack.me> Reviewed-by: NJon Derrick <jonathan.derrick@intel.com> Signed-off-by: NRevanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Revanth Rajashekar 提交于
This feature gives the user RW access to any opal table with admin1 authority. The flags described in the new structure determines if the user wants to read/write the data. Flags are checked for valid values in order to allow future features to be added to the ioctl. The user can provide the desired table's UID. Also, the ioctl provides a size and offset field and internally will loop data accesses to return the full data block. Read overrun is prevented by the initiator's sec_send_recv() backend. The ioctl provides a private field with the intention to accommodate any future expansions to the ioctl. Reviewed-by: NScott Bauer <sbauer@plzdonthack.me> Reviewed-by: NJon Derrick <jonathan.derrick@intel.com> Signed-off-by: NRevanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Revanth Rajashekar 提交于
This patch refactors the existing "write_shadowmbr" func and creates a new generalized function "generic_table_write_data", to write data to any opal table. Also, a few cleanups are included in this patch. Reviewed-by: NScott Bauer <sbauer@plzdonthack.me> Reviewed-by: NJon Derrick <jonathan.derrick@intel.com> Signed-off-by: NRevanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 03 11月, 2019 2 次提交
-
-
由 Jan Kara 提交于
Currently, block device size in not updated on second and further open for block devices where partition scan is disabled. This is particularly annoying for example for DVD drives as that means block device size does not get updated once the media is inserted into a drive if the device is already open when inserting the media. This is actually always the case for example when pktcdvd is in use. Fix the problem by revalidating block device size on every open even for devices with partition scan disabled. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jan Kara 提交于
Factor out code handling revalidation of bdev on disk change into a common helper. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 11月, 2019 1 次提交
-
-
由 Ming Lei 提交于
It is reported that sysfs buffer overflow can be triggered if the system has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of hctx via /sys/block/$DEV/mq/$N/cpu_list. Use snprintf to avoid the potential buffer overflow. This version doesn't change the attribute format, and simply stops showing CPU numbers if the buffer is going to overflow. Cc: stable@vger.kernel.org Fixes: 676141e4("blk-mq: don't dump CPU -> hw queue map on driver load") Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 11月, 2019 2 次提交
-
-
由 John Garry 提交于
Since commit 97889f9a ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue() is never checked, so make it return void, which very marginally simplifies the code. Reviewed-by: NBob Liu <bob.liu@oracle.com> Signed-off-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dan Carpenter 提交于
This code causes a static analysis warning: block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq' We disable IRQs in blkg_conf_prep() and re-enable them in blkg_conf_finish(). IRQ disable/enable should not be nested because that means the IRQs will be enabled at the first unlock instead of the second one. Fixes: 7caa4715 ("blkcg: implement blk-iocost") Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 10月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
We use io_kiocb->result == -EAGAIN as a way to know if we need to re-submit a polled request, as -EAGAIN reporting happens out-of-line for IO submission failures. This field is cleared when we originally allocate the request, but it isn't reset when we retry the submission from async context. This can cause issues where we think something needs a re-issue, but we're really just reading stale data. Reset ->result whenever we re-prep a request for polled submission. Cc: stable@vger.kernel.org Fixes: 9e645e11 ("io_uring: add support for sqe links") Reported-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 30 10月, 2019 1 次提交
-
-
由 Anton Ivanov 提交于
Fixes crashes due to ubd requeue logic conflicting with the block-mq logic. Crash is reproducible in 5.0 - 5.3. Fixes: 53766def ("um: Clean-up command processing in UML UBD driver") Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: NAnton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 29 10月, 2019 2 次提交
-
-
由 Anton Eidelman 提交于
groups_only mode in nvme_read_ana_log() is no longer used: remove it. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Anton Eidelman 提交于
The following scenario results in an IO hang: 1) ctrl completes a request with NVME_SC_ANA_TRANSITION. NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered. 2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl. This fails because ctrl disconnects. Therefore nvme_update_ns_ana_state() is not called and NVME_NS_ANA_PENDING bit in ns->flags is not cleared. 3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls nvme_read_ana_log(ctrl, groups_only=true). However, nvme_update_ana_state() does not update namespaces because nr_nsids = 0 (due to groups_only mode). 4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK. Result: The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set. Consequently ctrl will never be considered a viable path by __nvme_find_path(). IO will hang if ctrl is the only or the last path to the namespace. More generally, while ctrl is reconnecting, its ANA state may change. And because nvme_mpath_init() requests ANA log in groups_only mode, these changes are not propagated to the existing ctrl namespaces. This may result in a mal-function or an IO hang. Solution: nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false. This will not harm the new ctrl case (no namespaces present), and will make sure the ANA state of namespaces gets updated after reconnect. Note: Another option would be for nvme_mpath_init() to invoke nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 28 10月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
syzkaller reported an issue where it looks like a malicious app can trigger a use-after-free of reading the ctx ->sq_array and ->rings value right after having installed the ring fd in the process file table. Defer ring fd installation until after we're done reading those values. Fixes: 75b28aff ("io_uring: allocate the two rings together") Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com Signed-off-by: NJens Axboe <axboe@kernel.dk>
-