- 15 11月, 2022 6 次提交
-
-
由 Florian-Ewald Mueller 提交于
- limit bitmap chunk size internal u64 variable to values not overflowing the u32 bitmap superblock structure variable stored on persistent media - assign bitmap chunk size internal u64 variable from unsigned values to avoid possible sign extension artifacts when assigning from a s32 value The bug has been there since at least kernel 4.0. Steps to reproduce it: 1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2 2 resize member device rnbd1 and rnbd2 to 8 TB 3 mdadm --grow /dev/mdx --size=max The bitmap_chunksize will overflow without patch. Cc: stable@vger.kernel.org Signed-off-by: NFlorian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: NJack Wang <jinpu.wang@ionos.com> Signed-off-by: NSong Liu <song@kernel.org>
-
由 Ye Bin 提交于
Introduce md_ro_state for mddev->ro, so it is easy to understand. Signed-off-by: NYe Bin <yebin10@huawei.com> Signed-off-by: NSong Liu <song@kernel.org>
-
由 Ye Bin 提交于
Factor out __md_set_array_info(). No functional change. Signed-off-by: NYe Bin <yebin10@huawei.com> Signed-off-by: NSong Liu <song@kernel.org>
-
由 Giulio Benetti 提交于
RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it. Signed-off-by: NGiulio Benetti <giulio.benetti@benettiengineering.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSong Liu <song@kernel.org>
-
由 Uros Bizjak 提交于
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in r5l_wake_reclaim. 86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Cc: Song Liu <song@kernel.org> Signed-off-by: NUros Bizjak <ubizjak@gmail.com> Signed-off-by: NSong Liu <song@kernel.org>
-
由 Li Zhong 提交于
Check the return value of md_bitmap_get_counter() in case it returns NULL pointer, which will result in a null pointer dereference. v2: update the check to include other dereference Signed-off-by: NLi Zhong <floridsleeves@gmail.com> Signed-off-by: NSong Liu <song@kernel.org>
-
- 11 11月, 2022 3 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
sbitmap suffers from code complexity, as demonstrated by recent fixes, and eventual lost wake ups on nested I/O completion. The later happens, from what I understand, due to the non-atomic nature of the updates to wait_cnt, which needs to be subtracted and eventually reset when equal to zero. This two step process can eventually miss an update when a nested completion happens to interrupt the CPU in between the wait_cnt updates. This is very hard to fix, as shown by the recent changes to this code. The code complexity arises mostly from the corner cases to avoid missed wakes in this scenario. In addition, the handling of wake_batch recalculation plus the synchronization with sbq_queue_wake_up is non-trivial. This patchset implements the idea originally proposed by Jan [1], which removes the need for the two-step updates of wait_cnt. This is done by tracking the number of completions and wakeups in always increasing, per-bitmap counters. Instead of having to reset the wait_cnt when it reaches zero, we simply keep counting, and attempt to wake up N threads in a single wait queue whenever there is enough space for a batch. Waking up less than batch_wake shouldn't be a problem, because we haven't changed the conditions for wake up, and the existing batch calculation guarantees at least enough remaining completions to wake up a batch for each queue at any time. Performance-wise, one should expect very similar performance to the original algorithm for the case where there is no queueing. In both the old algorithm and this implementation, the first thing is to check ws_active, which bails out if there is no queueing to be managed. In the new code, we took care to avoid accounting completions and wakeups when there is no queueing, to not pay the cost of atomic operations unnecessarily, since it doesn't skew the numbers. For more interesting cases, where there is queueing, we need to take into account the cross-communication of the atomic operations. I've been benchmarking by running parallel fio jobs against a single hctx nullb in different hardware queue depth scenarios, and verifying both IOPS and queueing. Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, varying only the hardware queue length per test. queue size 2 4 8 16 32 64 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) The following is a similar experiment, ran against a nullb with a single bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 parallel fio jobs operating on the same device queue size 2 4 8 16 32 64 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) It has also survived blktests and a 12h-stress run against nullb. I also ran the code against nvme and a scsi SSD, and I didn't observe performance regression in those. If there are other tests you think I should run, please let me know and I will follow up with results. [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/ Cc: Hugh Dickins <hughd@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Liu Song <liusong@linux.alibaba.com> Suggested-by: NJan Kara <jack@suse.cz> Signed-off-by: NGabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Use set->nr_hw_queues for the current number of tags, and remove the duplicate set->nr_hw_queues update in the caller. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
There is no point in trying to share any code with the realloc case when all that is needed by the initial tagset allocation is a simple kcalloc_node. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 10 11月, 2022 14 次提交
-
-
由 Khazhismel Kumykov 提交于
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker detection. Suggested-by: NJan Kara <jack@suse.cz> Signed-off-by: NKhazhismel Kumykov <khazhy@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Khazhismel Kumykov 提交于
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: 8ef3fc3a ("block, bfq: make shared queues inherit wakers") Signed-off-by: NKhazhismel Kumykov <khazhy@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Böhmwalder 提交于
(Sort of) cherry-picked from the out-of-tree drbd9 branch. Original commit message by Joel Colledge: This simplifies drbd_submit_peer_request by removing most of the arguments. It also makes the treatment of the op better aligned with that in struct bio. Determine fault_type dynamically using information which is already available instead of passing it in as a parameter. Note: The opf in receive_rs_deallocated was changed from REQ_OP_WRITE_ZEROES to REQ_OP_DISCARD. This was required in the out-of-tree module, and does not matter in-tree. The opf is ignored anyway in drbd_submit_peer_request, since the discard/zero-out is decided by the EE_TRIM flag. Signed-off-by: NJoel Colledge <joel.colledge@linbit.com> Signed-off-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221109133453.51652-4-christoph.boehmwalder@linbit.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Philipp Reisner 提交于
The discard_granularity describes the minimum unit of a discard. If that is larger than the maximal discard size, we need to disable discards completely. Reviewed-by: NJoel Colledge <joel.colledge@linbit.com> Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221109133453.51652-3-christoph.boehmwalder@linbit.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Böhmwalder 提交于
We currently only set q->limits.max_discard_sectors, but that is not enough. Another field, max_hw_discard_sectors, was introduced in commit 0034af03 ("block: make /sys/block/<dev>/queue/discard_max_bytes writeable"). The difference is that max_discard_sectors can be changed from user space via sysfs, while max_hw_discard_sectors is the "hardware" upper limit. So use this helper, which sets both. This is also a fixup for commit 998e9cbc ("drbd: cleanup decide_on_discard_support"): if discards are not supported, that does not necessarily mean we also want to disable write_zeroes. Fixes: 998e9cbc ("drbd: cleanup decide_on_discard_support") Reviewed-by: NJoel Colledge <joel.colledge@linbit.com> Signed-off-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221109133453.51652-2-christoph.boehmwalder@linbit.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Add documentation for the p2pmem/allocate binary file which allows for allocating p2pmem buffers in userspace for passing to drivers that support them. (Currently only O_DIRECT to NVMe devices.) Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-10-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Create a sysfs bin attribute called "allocate" under the existing "p2pmem" group. The only allowable operation on this file is the mmap() call. When mmap() is called on this attribute, the kernel allocates a chunk of memory from the genalloc and inserts the pages into the VMA. The dev_pagemap .page_free callback will indicate when these pages are no longer used and they will be put back into the genalloc. On device unbind, remove the sysfs file before the memremap_pages are cleaned up. This ensures unmap_mapping_range() is called on the files inode and no new mappings can be created. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Acked-by: NBjorn Helgaas <bhelgaas@google.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-9-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the NVMe passthru requests to use P2PDMA pages. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-8-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the O_DIRECT path in iomap based filesystems and direct to block devices. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-7-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Consecutive zone device pages should not be merged into the same sgl or bvec segment with other types of pages or if they belong to different pgmaps. Otherwise getting the pgmap of a given segment is not possible without scanning the entire segment. This helper returns true either if both pages are not zone device pages or both pages are zone device pages with the same pgmap. Factor out the check for page mergability into a pages_are_mergable() helper and add a check with zone_device_pages_are_mergeable(). Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-6-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Consecutive zone device pages should not be merged into the same sgl or bvec segment with other types of pages or if they belong to different pgmaps. Otherwise getting the pgmap of a given segment is not possible without scanning the entire segment. This helper returns true either if both pages are not zone device pages or both pages are zone device pages with the same pgmap. Add a helper to determine if zone device pages are mergeable and use this helper in page_is_mergeable(). Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-5-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags() which take a flags argument that is passed to get_user_pages_fast(). This is so that FOLL_PCI_P2PDMA can be passed when appropriate. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221021174116.7200-4-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to allow obtaining P2PDMA pages. If GUP is called without the flag and a P2PDMA page is found, it will return an error in try_grab_page() or try_grab_folio(). The check is safe to do before taking the reference to the page in both cases seeing the page should be protected by either the appropriate ptl or mmap_lock; or the gup fast guarantees preventing TLB flushes. try_grab_folio() has one call site that WARNs on failure and cannot actually deal with the failure of this function (it seems it will get into an infinite loop). Expand the comment there to document a couple more conditions on why it will not fail. FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set. This is to copy fsdax until pgmap refcounts are fixed (see the link below for more information). Link: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.caSigned-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-3-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
In order to add checks for P2PDMA memory into try_grab_page(), expand the error return from a bool to an int/error code. Update all the callsites handle change in usage. Also remove the WARN_ON_ONCE() call at the callsites seeing there already is a WARN_ON_ONCE() inside the function if it fails. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NDan Williams <dan.j.williams@intel.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221021174116.7200-2-logang@deltatee.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 07 11月, 2022 1 次提交
-
-
由 Yang Li 提交于
Remove the description of @required_features in elevator_match() to clear the below warning: block/elevator.c:103: warning: Excess function parameter 'required_features' description in 'elevator_match' Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2734 Fixes: ffb86425 ("block: don't check for required features in elevator_match") Reported-by: NAbaci Robot <abaci@linux.alibaba.com> Signed-off-by: NYang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20221107062255.2685-1-yang.lee@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 11月, 2022 16 次提交
-
-
由 Jinlong Chen 提交于
if-else is more readable than goto here. Signed-off-by: NJinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/d3306fa4e92dc9cc614edc8f1802686096bafef2.1667356813.git.nickyc975@zju.edu.cnSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jinlong Chen 提交于
Use goto-style error handling like we do elsewhere in the kernel. Signed-off-by: NJinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/bbbc2d9b17b137798c7fb92042141ca4cbbc58cc.1667356813.git.nickyc975@zju.edu.cnSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Chao Leng 提交于
All controller namespaces share the same tagset, so we can use this interface which does the optimal operation for parallel quiesce based on the tagset type(e.g. blocking tagsets and non-blocking tagsets). nvme connect_q should not be quiesced when quiesce tagset, so set the QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q. Currently we use NVME_NS_STOPPED to ensure pairing quiescing and unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED. In addition, we never really quiesce a single namespace. It is a better choice to move the flag from ns to ctrl. Signed-off-by: NChao Leng <lengchao@huawei.com> [hch: rebased on top of prep patches] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Chao Leng 提交于
Drivers that have shared tagsets may need to quiesce potentially a lot of request queues that all share a single tagset (e.g. nvme). Add an interface to quiesce all the queues on a given tagset. This interface is useful because it can speedup the quiesce by doing it in parallel. Because some queues should not need to be quiesced (e.g. the nvme connect_q) when quiescing the tagset, introduce a QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to ski quiescing a particular queue. Signed-off-by: NChao Leng <lengchao@huawei.com> [hch: simplify for the per-tag_set srcu_struct] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
All I/O submissions have fairly similar latencies, and a tagset-wide quiesce is a fairly common operation. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de [axboe: fix whitespace] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
For submit_bio based queues there is no (S)RCU critical section during I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done, so skip doing any synchronization. No non-mq driver should be calling this, but for now we have core callers that unconditionally call into it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
apple_nvme_reset_work schedules apple_nvme_remove, to be called, which will call apple_nvme_disable and unquiesce the I/O queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme_remove_dead_ctrl schedules nvme_remove to be called, which will call nvme_dev_disable and unquiesce the I/O queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme_kill_queues does two things: 1) mark the gendisk of all namespaces dead 2) unquiesce all I/O queues These used to be be intertwined due to block layer issues, but aren't any more. So move the unquiscing of the I/O queues into the callers, and rename the rest of the function to the now more descriptive nvme_mark_namespaces_dead. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
None of the callers of nvme_kill_queues needs it to unquiesce the admin queues, as all of them already do it themselves: 1) nvme_reset_work explicit call nvme_start_admin_queue toward the beginning of the function. The extra call to nvme_start_admin_queue in nvme_reset_work this won't do anything as NVME_CTRL_ADMIN_Q_STOPPED will already be cleared. 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at the very beginning of the function if the PCIe device was not present, which is the precondition for the call to nvme_kill_queues. nvme_dev_disable already calls nvme_start_admin_queue toward the end of the function when the shutdown flag is set to true, so the admin queue is already enabled at this point. 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver, which will end up in nvme_remove, which calls nvme_dev_disable with the shutdown flag. This case will call nvme_start_admin_queue a bit later than before. 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl above. 5) nvme_remove_namespaces only calls nvme_kill_queues when the controller is in the DEAD state. That can only happen in the PCIe driver, and only from nvme_remove. See item 2) above for the conditions there. So it is safe to just remove the call to nvme_start_admin_queue in nvme_kill_queues without replacement. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
At the point where namespaces are marked dead, the controller is in a non-live state and we won't get pass the identify commands. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The NVME_NS_DEAD check only made sense when we revalidated namespaces in nvme_passthrough_end for commands that affected the namespace inventory. These days NVME_NS_DEAD is only set during reset or when tearing down namespaces, and we always remove all namespaces right after that. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The call to nvme_remove_invalid_namespaces made sense when nvme_passthru_end revalidated all namespaces and had to remove those that didn't exist any more. Since we don't revalidate from nvme_passthru_end now, this call is entirely spurious. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The code to create, update or delete a tagset and namespaces in nvme_reset_work is a bit convoluted. Refactor it with a two high-level conditionals for first probe vs reset and I/O queues vs no I/O queues to make the code flow more clear. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de [axboe: fix whitespace issue] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme and xen-blkfront are already doing this to stop buffered writes from creating dirty pages that can't be written out later. Move it to the common code. This also removes the comment about the ordering from nvme, as bd_mutex not only is gone entirely, but also hasn't been used for locking updates to the disk size long before that, and thus the ordering requirement documented there doesn't apply any more. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-