- 11 11月, 2022 1 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
sbitmap suffers from code complexity, as demonstrated by recent fixes, and eventual lost wake ups on nested I/O completion. The later happens, from what I understand, due to the non-atomic nature of the updates to wait_cnt, which needs to be subtracted and eventually reset when equal to zero. This two step process can eventually miss an update when a nested completion happens to interrupt the CPU in between the wait_cnt updates. This is very hard to fix, as shown by the recent changes to this code. The code complexity arises mostly from the corner cases to avoid missed wakes in this scenario. In addition, the handling of wake_batch recalculation plus the synchronization with sbq_queue_wake_up is non-trivial. This patchset implements the idea originally proposed by Jan [1], which removes the need for the two-step updates of wait_cnt. This is done by tracking the number of completions and wakeups in always increasing, per-bitmap counters. Instead of having to reset the wait_cnt when it reaches zero, we simply keep counting, and attempt to wake up N threads in a single wait queue whenever there is enough space for a batch. Waking up less than batch_wake shouldn't be a problem, because we haven't changed the conditions for wake up, and the existing batch calculation guarantees at least enough remaining completions to wake up a batch for each queue at any time. Performance-wise, one should expect very similar performance to the original algorithm for the case where there is no queueing. In both the old algorithm and this implementation, the first thing is to check ws_active, which bails out if there is no queueing to be managed. In the new code, we took care to avoid accounting completions and wakeups when there is no queueing, to not pay the cost of atomic operations unnecessarily, since it doesn't skew the numbers. For more interesting cases, where there is queueing, we need to take into account the cross-communication of the atomic operations. I've been benchmarking by running parallel fio jobs against a single hctx nullb in different hardware queue depth scenarios, and verifying both IOPS and queueing. Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, varying only the hardware queue length per test. queue size 2 4 8 16 32 64 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) The following is a similar experiment, ran against a nullb with a single bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 parallel fio jobs operating on the same device queue size 2 4 8 16 32 64 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) It has also survived blktests and a 12h-stress run against nullb. I also ran the code against nvme and a scsi SSD, and I didn't observe performance regression in those. If there are other tests you think I should run, please let me know and I will follow up with results. [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/ Cc: Hugh Dickins <hughd@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Liu Song <liusong@linux.alibaba.com> Suggested-by: NJan Kara <jack@suse.cz> Signed-off-by: NGabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 12 10月, 2022 2 次提交
-
-
由 Jason A. Donenfeld 提交于
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done by hand, covering things that coccinelle could not do on its own. Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NYury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext2, ext4, and sbitmap Acked-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
-
由 Jason A. Donenfeld 提交于
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NYury Norov <yury.norov@gmail.com> Reviewed-by: NKP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: NJakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
-
- 30 9月, 2022 1 次提交
-
-
由 Hugh Dickins 提交于
Commit 4acb8341 ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fc ("sbitmap: fix possible io hung due to lost wakeup") to avoid the high system time and freezes which that had introduced. Now okay on the NVME laptop, but 4acb8341 is a disaster for heavy swapping (kernel builds in low memory) on another: soon locking up in sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling around with waitqueue_active() but wait_cnt 0 . Here is a backtrace, showing the common pattern of outer sbitmap_queue_wake_up() interrupted before setting wait_cnt 0 back to wake_batch (in some cases other CPUs are idle, in other cases they're spinning for a lock in dd_bio_merge()): sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault See how the process-context sbitmap_queue_wake_up() has been interrupted, after bringing wait_cnt down to 0 (and in this example, after doing its wakeups), before advancing wake_index and refilling wake_cnt: an interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck. I have almost no grasp of all the possible sbitmap races, and their consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0, so it is better if sbq_wake_ptr() skips on to the next ws in that case: which fixes the lockup and shows no adverse consequence for me. The check for wait_cnt being 0 is obviously racy, and ultimately can lead to lost wakeups: for example, when there is only a single waitqueue with waiters. However, lost wakeups are unlikely to matter in these cases, and a proper fix requires redesign (and benchmarking) of the batched wakeup code: so let's plug the hole with this bandaid for now. Signed-off-by: NHugh Dickins <hughd@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 12 9月, 2022 1 次提交
-
-
由 Keith Busch 提交于
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679Signed-off-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 9月, 2022 2 次提交
-
-
由 Uros Bizjak 提交于
Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications, e.g. an extra memory read can be avoided in the loop. No functional change intended. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NUros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20220908151200.9993-1-ubizjak@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jan Kara 提交于
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone else waking the waiter on the waitqueue (so the waitqueue becomes empty), it exits without reseting wait_cnt to wake_batch number. Once wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new waiters resulting in possible deadlocks or busyloops. Fix the problem by making sure we reset wait_cnt even if we didn't wake up anybody in the end. Fixes: 040b83fc ("sbitmap: fix possible io hung due to lost wakeup") Reported-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 04 9月, 2022 1 次提交
-
-
由 Jens Axboe 提交于
This reverts commit 16ede669. This is causing issues with CPU stalls on my test box, revert it for now until we understand what is going on. It looks like infinite looping off sbitmap_queue_wake_up(), but hard to tell with a lot of CPUs hitting this issue and the console scrolling infinitely. Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 9月, 2022 1 次提交
-
-
由 Keith Busch 提交于
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679Signed-off-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 8月, 2022 1 次提交
-
-
由 Liu Song 提交于
If "nr + nr_tags <= map_depth", then the value of nr_tags will not be greater than map_depth, so no additional comparison is required. Signed-off-by: NLiu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/1661483653-27326-1-git-send-email-liusong@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 23 8月, 2022 1 次提交
-
-
由 Yu Kuai 提交于
There are two problems can lead to lost wakeup: 1) invalid wakeup on the wrong waitqueue: For example, 2 * wake_batch tags are put, while only wake_batch threads are woken: __sbq_wake_up atomic_cmpxchg -> reset wait_cnt __sbq_wake_up -> decrease wait_cnt ... __sbq_wake_up -> wait_cnt is decreased to 0 again atomic_cmpxchg sbq_index_atomic_inc -> increase wake_index wake_up_nr -> wake up and waitqueue might be empty sbq_index_atomic_inc -> increase again, one waitqueue is skipped wake_up_nr -> invalid wake up because old wakequeue might be empty To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'. 2) 'wait_cnt' can be decreased while waitqueue is empty As pointed out by Jan Kara, following race is possible: CPU1 CPU2 __sbq_wake_up __sbq_wake_up sbq_wake_ptr() sbq_wake_ptr() -> the same wait_cnt = atomic_dec_return() /* decreased to 0 */ sbq_index_atomic_inc() /* move to next waitqueue */ atomic_set() /* reset wait_cnt */ wake_up_nr() /* wake up on the old waitqueue */ wait_cnt = atomic_dec_return() /* * decrease wait_cnt in the old * waitqueue, while it can be * empty. */ Fix the problem by waking up before updating 'wake_index' and 'wait_cnt'. With this patch, noted that 'wait_cnt' is still decreased in the old empty waitqueue, however, the wakeup is redirected to a active waitqueue, and the extra decrement on the old empty waitqueue is not handled. Fixes: 88459642 ("blk-mq: abstract tag allocation out into sbitmap library") Signed-off-by: NYu Kuai <yukuai3@huawei.com> Reviewed-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 6月, 2022 1 次提交
-
-
由 wuchi 提交于
1. Getting next index before continue branch. 2. Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits. Signed-off-by: Nwuchi <wuchi.zero@gmail.com> Reviewed-by: NMartin Wilck <mwilck@suse.com> Link: https://lore.kernel.org/r/20220605145835.26916-1-wuchi.zero@gmail.com Fixes: 9672b0d4 ("sbitmap: add __sbitmap_queue_get_batch()") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 22 3月, 2022 1 次提交
-
-
由 Ming Lei 提交于
sbitmap has been used in scsi for replacing atomic operations on sdev->device_busy, so IOPS on some fast scsi storage can be improved. However, sdev->device_busy can be changed in fast path, so we have to allocate the sb->map statically. sdev->device_busy has been capped to 1024, but some drivers may configure the default depth as < 8, then cause each sbitmap word to hold only one bit. Finally 1024 * 128( sizeof(sbitmap_word)) bytes is needed for sb->map, given it is order 5 allocation, sometimes it may fail. Avoid the issue by using kvzalloc_node() for allocating sb->map. Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220316012708.354668-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 2月, 2022 2 次提交
-
-
由 John Garry 提交于
Since __sbitmap_queue_get_shallow() was introduced in commit c05e6673 ("sbitmap: add sbitmap_get_shallow() operation"), it has not been used. Delete __sbitmap_queue_get_shallow() and rename public __sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd to have public __foo but no foo at all. Signed-off-by: NJohn Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
Only the last sbitmap_word can have different depth, and all the others must have same depth of 1U << sb->shift, so not necessary to store it in sbitmap_word, and it can be retrieved easily and efficiently by adding one internal helper of __map_depth(sb, index). Remove 'depth' field from sbitmap_word, then the annotation of ____cacheline_aligned_in_smp for 'word' isn't needed any more. Not see performance effect when running high parallel IOPS test on null_blk. This way saves us one cacheline(usually 64 words) per each sbitmap_word. Cc: Martin Wilck <martin.wilck@suse.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NMartin Wilck <mwilck@suse.com> Reviewed-by: NJohn Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20220110072945.347535-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 28 1月, 2022 1 次提交
-
-
由 Laibin Qiu 提交于
Commit 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened") will recalculate wake_batch when incrementing or decrementing active_queues to avoid wake_batch > hctx_max_depth. At the same time, in order to not affect performance as much as possible, the minimum wakeup batch is set to 4. But when the QD is small (such as QD=1), if inc or dec active_queues increases wakeup batch, that can lead to a hang: Fix this problem with the following strategies: QD : >= 32 | < 32 --------------------------------- wakeup batch: 8~4 | 3~1 Fixes: 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened") Link: https://lore.kernel.org/linux-block/78cafe94-a787-e006-8851-69906f0c2128@huawei.com/T/#tReported-by: NAlex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com> Tested-by: NAlex Xu (Hello71) <alex_y_xu@yahoo.ca> Link: https://lore.kernel.org/r/20220127100047.1763746-1-qiulaibin@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 1月, 2022 1 次提交
-
-
由 Laibin Qiu 提交于
In case of shared tags, there might be more than one hctx which allocates from the same tags, and each hctx is limited to allocate at most: hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U); tag idle detection is lazy, and may be delayed for 30sec, so there could be just one real active hctx(queue) but all others are actually idle and still accounted as active because of the lazy idle detection. Then if wake_batch is > hctx_max_depth, driver tag allocation may wait forever on this real active hctx. Fix this by recalculating wake_batch when inc or dec active_queues. Fixes: 0d2602ca ("blk-mq: improve support for shared tags maps") Suggested-by: NMing Lei <ming.lei@redhat.com> Suggested-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com> Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 10月, 2021 1 次提交
-
-
由 Jens Axboe 提交于
KCSAN complaints about the sbitmap hint update: ================================================================== BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1: sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606 blk_mq_put_tag+0x82/0x90 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541 blk_complete_reqs block/blk-mq.c:584 [inline] blk_done_softirq+0x69/0x90 block/blk-mq.c:589 __do_softirq+0x12c/0x26e kernel/softirq.c:558 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164 kthread+0x262/0x280 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0: sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606 blk_mq_put_tag+0x82/0x90 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541 blk_complete_reqs block/blk-mq.c:584 [inline] blk_done_softirq+0x69/0x90 block/blk-mq.c:589 __do_softirq+0x12c/0x26e kernel/softirq.c:558 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164 kthread+0x262/0x280 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 value changed: 0x00000035 -> 0x00000044 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 ================================================================== which is a data race, but not an important one. This is just updating the percpu alloc hint, and the reader of that hint doesn't ever require it to be valid. Just annotate it with data_race() to silence this one. Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com Acked-by: NMarco Elver <elver@google.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 19 10月, 2021 1 次提交
-
-
由 Jens Axboe 提交于
sbitmap currently only supports clearing tags one-by-one, add a helper that allows the caller to pass in an array of tags to clear. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 10月, 2021 1 次提交
-
-
由 Jens Axboe 提交于
The block layer tag allocation batching still calls into sbitmap to get each tag, but we can improve on that. Add __sbitmap_queue_get_batch(), which returns a mask of tags all at once, along with an offset for those tags. An example return would be 0xff, where bits 0..7 are set, with tag_offset == 128. The valid tags in this case would be 128..135. A batch is specific to an individual sbitmap_map, hence it cannot be larger than that. The requested number of tags is automatically reduced to the max that can be satisfied with a single map. On failure, 0 is returned. Caller should fall back to single tag allocation at that point/ Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 09 7月, 2021 1 次提交
-
-
由 Zhen Lei 提交于
Fix some spelling mistakes in comments: permanentely ==> permanently wont ==> won't remaning ==> remaining succed ==> succeed shouldnt ==> shouldn't alpha-numeric ==> alphanumeric storeing ==> storing funtion ==> function documenation ==> documentation Determin ==> Determine intepreted ==> interpreted ammount ==> amount obious ==> obvious interupts ==> interrupts occured ==> occurred asssociated ==> associated taking into acount ==> taking into account squence ==> sequence stil ==> still contiguos ==> contiguous matchs ==> matches Link: https://lkml.kernel.org/r/20210607072555.12416-1-thunder.leizhen@huawei.comSigned-off-by: NZhen Lei <thunder.leizhen@huawei.com> Reviewed-by: NJacob Keller <jacob.e.keller@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2021 5 次提交
-
-
由 Ming Lei 提交于
Move code for calculating default shift into a public helper which can be used by SCSI. Link: https://lore.kernel.org/r/20210122023317.687987-7-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: NSumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Ming Lei 提交于
SCSI's .device_busy will be converted to sbitmap and sbitmap_weight is needed. Export the helper. The only existing user of sbitmap_weight() uses it to find out how many bits are set and not cleared. Align sbitmap_weight() meaning with this usage model. Link: https://lore.kernel.org/r/20210122023317.687987-6-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: NSumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Ming Lei 提交于
Allocation hint should have belonged to sbitmap. Also, when sbitmap's depth is high and there is no need to use mulitple wakeup queues, user can benefit from percpu allocation hint too. Move allocation hint into sbitmap, then SCSI device queue can benefit from allocation hint when converting to plain sbitmap. Convert vhost/scsi.c to use sbitmap allocation with percpu alloc hint. This is more efficient than the previous approach. Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: virtualization@lists.linux-foundation.org Tested-by: NSumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Ming Lei 提交于
Add helpers for updating allocation hint so that we can avoid duplicate code. Prepare for moving allocation hint into sbitmap. Link: https://lore.kernel.org/r/20210122023317.687987-4-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Tested-by: NSumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Ming Lei 提交于
Currently the allocation round_robin info is maintained by sbitmap_queue. However, bit allocation really belongs to sbitmap. Move it there. Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: virtualization@lists.linux-foundation.org Tested-by: NSumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
- 08 12月, 2020 4 次提交
-
-
由 Pavel Begunkov 提交于
__sbitmap_get_word() doesn't warp if it's starting from the beginning (i.e. initial hint is 0). Instead of stashing the original hint just set @wrap accordingly. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
sbitmap_deferred_clear() does CAS loop to propagate cleared bits, replace it with equivalent atomic bitwise and. That's slightly faster and makes wait-free instead of lock-free as before. The atomic can be relaxed (i.e. barrier-less) because following sbitmap_get*() deal with synchronisation, see comments in sbitmap_queue_clear(). It's ok to cast to atomic_long_t, that's what bitops/lock.h does. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
map->swap_lock protects map->cleared from concurrent modification, however sbitmap_deferred_clear() is already atomically drains it, so it's guaranteed to not loose bits on concurrent sbitmap_deferred_clear(). A one threaded tag heavy test on top of nullbk showed ~1.5% t-put increase, and 3% -> 1% cycle reduction of sbitmap_get() according to perf. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Because of spinlocks and atomics sbitmap_deferred_clear() have to reload &sb->map[index] on each access even though the map address won't change. Pass in sbitmap_word instead of {sb, index}, so it's cached in a variable. It also improves code generation of sbitmap_find_bit_in_index(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Reviewed-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 7月, 2020 1 次提交
-
-
由 John Garry 提交于
sbitmap works by maintaining separate bitmaps of set and cleared bits. The set bits are cleared in a batch, to save the burden of continuously locking the "word" map to unset. sbitmap_bitmap_show() only shows the set bits (in "word"), which is not too much use, so mask out the cleared bits. Fixes: ea86ea2c ("sbitmap: ammortize cost of clearing bits") Signed-off-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 12月, 2019 1 次提交
-
-
由 David Jeffery 提交于
Under heavy loads where the kyber I/O scheduler hits the token limits for its scheduling domains, kyber can become stuck. When active requests complete, kyber may not be woken up leaving the I/O requests in kyber stuck. This stuck state is due to a race condition with kyber and the sbitmap functions it uses to run a callback when enough requests have completed. The running of a sbt_wait callback can race with the attempt to insert the sbt_wait. Since sbitmap_del_wait_queue removes the sbt_wait from the list first then sets the sbq field to NULL, kyber can see the item as not on a list but the call to sbitmap_add_wait_queue will see sbq as non-NULL. This results in the sbt_wait being inserted onto the wait list but ws_active doesn't get incremented. So the sbitmap queue does not know there is a waiter on a wait list. Since sbitmap doesn't think there is a waiter, kyber may never be informed that there are domain tokens available and the I/O never advances. With the sbt_wait on a wait list, kyber believes it has an active waiter so cannot insert a new waiter when reaching the domain's full state. This race can be fixed by only adding the sbt_wait to the queue if the sbq field is NULL. If sbq is not NULL, there is already an action active which will trigger the re-running of kyber. Let it run and add the sbt_wait to the wait list if still needing to wait. Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NDavid Jeffery <djeffery@redhat.com> Reported-by: NJohn Pittman <jpittman@redhat.com> Tested-by: NJohn Pittman <jpittman@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 11月, 2019 1 次提交
-
-
由 John Garry 提交于
Since the only caller of this function has been deleted, delete this one also. Signed-off-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 7月, 2019 1 次提交
-
-
由 Pavel Begunkov 提交于
cmpxchg() with an immediate value could be replaced with less expensive xchg(). The same true if new value don't _depend_ on the old one. In the second block, atomic_cmpxchg() return value isn't checked, so after atomic_cmpxchg() -> atomic_xchg() conversion it could be replaced with atomic_set(). Comparison with atomic_read() in the second chunk was left as an optimisation (if that was the initial intention). Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 05 6月, 2019 1 次提交
-
-
由 Thomas Gleixner 提交于
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license v2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see https www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 2 file(s). Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAlexios Zavras <alexios.zavras@intel.com> Reviewed-by: NArmijn Hemel <armijn@tjaldur.nl> Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org> Reviewed-by: NAllison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190530000435.923873561@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 24 5月, 2019 1 次提交
-
-
由 Andrea Parri 提交于
This barrier only applies to the read-modify-write operations; in particular, it does not apply to the atomic_set() primitive. Replace the barrier with an smp_mb(). Fixes: 6c0ca7ae ("sbitmap: fix wakeup hang after sbq resize") Cc: stable@vger.kernel.org Reported-by: N"Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 3月, 2019 1 次提交
-
-
由 Ming Lei 提交于
Inside sbitmap_queue_clear(), once the clear bit is set, it will be visiable to allocation path immediately. Meantime READ/WRITE on old associated instance(such as request in case of blk-mq) may be out-of-order with the setting clear bit, so race with re-allocation may be triggered. Adds one memory barrier for ordering READ/WRITE of the freed associated instance with setting clear bit for avoiding race with re-allocation. The following kernel oops triggerd by block/006 on aarch64 may be fixed: [ 142.330954] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000330 [ 142.338794] Mem abort info: [ 142.341554] ESR = 0x96000005 [ 142.344632] Exception class = DABT (current EL), IL = 32 bits [ 142.350500] SET = 0, FnV = 0 [ 142.353544] EA = 0, S1PTW = 0 [ 142.356678] Data abort info: [ 142.359528] ISV = 0, ISS = 0x00000005 [ 142.363343] CM = 0, WnR = 0 [ 142.366305] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000002a3c51c0 [ 142.372983] [0000000000000330] pgd=0000000000000000, pud=0000000000000000 [ 142.379777] Internal error: Oops: 96000005 [#1] SMP [ 142.384613] Modules linked in: null_blk ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_umad scsi_transport_iscsi ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core sbsa_gwdt crct10dif_ce ghash_ce ipmi_ssif sha2_ce ipmi_devintf sha256_arm64 sg sha1_ce ipmi_msghandler ip_tables xfs libcrc32c mlx5_core sdhci_acpi mlxfw ahci_platform at803x sdhci libahci_platform qcom_emac mmc_core hdma hdma_mgmt i2c_dev [last unloaded: null_blk] [ 142.429753] CPU: 7 PID: 1983 Comm: fio Not tainted 5.0.0.cki #2 [ 142.449458] pstate: 00400005 (nzcv daif +PAN -UAO) [ 142.454239] pc : __blk_mq_free_request+0x4c/0xa8 [ 142.458830] lr : blk_mq_free_request+0xec/0x118 [ 142.463344] sp : ffff00003360f6a0 [ 142.466646] x29: ffff00003360f6a0 x28: ffff000010e70000 [ 142.471941] x27: ffff801729a50048 x26: 0000000000010000 [ 142.477232] x25: ffff00003360f954 x24: ffff7bdfff021440 [ 142.482529] x23: 0000000000000000 x22: 00000000ffffffff [ 142.487830] x21: ffff801729810000 x20: 0000000000000000 [ 142.493123] x19: ffff801729a50000 x18: 0000000000000000 [ 142.498413] x17: 0000000000000000 x16: 0000000000000001 [ 142.503709] x15: 00000000000000ff x14: ffff7fe000000000 [ 142.509003] x13: ffff8017dcde09a0 x12: 0000000000000000 [ 142.514308] x11: 0000000000000001 x10: 0000000000000008 [ 142.519597] x9 : ffff8017dcde09a0 x8 : 0000000000002000 [ 142.524889] x7 : ffff8017dcde0a00 x6 : 000000015388f9be [ 142.530187] x5 : 0000000000000001 x4 : 0000000000000000 [ 142.535478] x3 : 0000000000000000 x2 : 0000000000000000 [ 142.540777] x1 : 0000000000000001 x0 : ffff00001041b194 [ 142.546071] Process fio (pid: 1983, stack limit = 0x000000006460a0ea) [ 142.552500] Call trace: [ 142.554926] __blk_mq_free_request+0x4c/0xa8 [ 142.559181] blk_mq_free_request+0xec/0x118 [ 142.563352] blk_mq_end_request+0xfc/0x120 [ 142.567444] end_cmd+0x3c/0xa8 [null_blk] [ 142.571434] null_complete_rq+0x20/0x30 [null_blk] [ 142.576194] blk_mq_complete_request+0x108/0x148 [ 142.580797] null_handle_cmd+0x1d4/0x718 [null_blk] [ 142.585662] null_queue_rq+0x60/0xa8 [null_blk] [ 142.590171] blk_mq_try_issue_directly+0x148/0x280 [ 142.594949] blk_mq_try_issue_list_directly+0x9c/0x108 [ 142.600064] blk_mq_sched_insert_requests+0xb0/0xd0 [ 142.604926] blk_mq_flush_plug_list+0x16c/0x2a0 [ 142.609441] blk_flush_plug_list+0xec/0x118 [ 142.613608] blk_finish_plug+0x3c/0x4c [ 142.617348] blkdev_direct_IO+0x3b4/0x428 [ 142.621336] generic_file_read_iter+0x84/0x180 [ 142.625761] blkdev_read_iter+0x50/0x78 [ 142.629579] aio_read.isra.6+0xf8/0x190 [ 142.633409] __io_submit_one.isra.8+0x148/0x738 [ 142.637912] io_submit_one.isra.9+0x88/0xb8 [ 142.642078] __arm64_sys_io_submit+0xe0/0x238 [ 142.646428] el0_svc_handler+0xa0/0x128 [ 142.650238] el0_svc+0x8/0xc [ 142.653104] Code: b9402a63 f9000a7f 3100047f 540000a0 (f9419a81) [ 142.659202] ---[ end trace 467586bc175eb09d ]--- Fixes: ea86ea2c ("sbitmap: ammortize cost of clearing bits") Reported-and-bisected_and_tested-by: Yi Zhang <yi.zhang@redhat.com> Cc: Yi Zhang <yi.zhang@redhat.com> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 15 1月, 2019 2 次提交
-
-
由 Ming Lei 提交于
Because we may call blk_mq_get_driver_tag() directly from blk_mq_dispatch_rq_list() without holding any lock, then HARDIRQ may come and the above DEADLOCK is triggered. Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to fix this issue by using 'spin_lock_bh', which isn't enough because we complete request from hardirq context direclty in case of multiqueue. Cc: Clark Williams <williams@redhat.com> Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Steven Rostedt (VMware) 提交于
The swap_lock used by sbitmap has a chain with locks taken from softirq, but the swap_lock is not protected from being preempted by softirqs. A chain exists of: sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock Where the sbq->ws[i].wait lock can be taken from softirq context, which means all locks below it in the chain must also be protected from softirqs. Reported-by: NClark Williams <williams@redhat.com> Fixes: 58ab5e32 ("sbitmap: silence bogus lockdep IRQ warning") Fixes: ea86ea2c ("sbitmap: amortize cost of clearing bits") Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 12月, 2018 1 次提交
-
-
由 Jens Axboe 提交于
After commit 5d2ee712, users of sbitmap that need wait queue handling must use the provided helpers. But we only added prepare_to_wait()/finish_wait() style helpers, add the equivalent add_wait_queue/list_del wrappers as we.. This is needed to ensure kyber plays by the sbitmap waitqueue rules. Tested-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-