- 27 12月, 2019 40 次提交
-
-
由 Mikulas Patocka 提交于
[ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ] Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and workqueue stalls") introduced a semaphore to limit the maximum number of in-flight kcopyd (COW) jobs. The implementation of this throttling mechanism is prone to a deadlock: 1. One or more threads write to the origin device causing COW, which is performed by kcopyd. 2. At some point some of these threads might reach the s->cow_count semaphore limit and block in down(&s->cow_count), holding a read lock on _origins_lock. 3. Someone tries to acquire a write lock on _origins_lock, e.g., snapshot_ctr(), which blocks because the threads at step (2) already hold a read lock on it. 4. A COW operation completes and kcopyd runs dm-snapshot's completion callback, which ends up calling pending_complete(). pending_complete() tries to resubmit any deferred origin bios. This requires acquiring a read lock on _origins_lock, which blocks. This happens because the read-write semaphore implementation gives priority to writers, meaning that as soon as a writer tries to enter the critical section, no readers will be allowed in, until all writers have completed their work. So, pending_complete() waits for the writer at step (3) to acquire and release the lock. This writer waits for the readers at step (2) to release the read lock and those readers wait for pending_complete() (the kcopyd thread) to signal the s->cow_count semaphore: DEADLOCK. The above was thoroughly analyzed and documented by Nikos Tsironis as part of his initial proposal for fixing this deadlock, see: https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html Fix this deadlock by reworking COW throttling so that it waits without holding any locks. Add a variable 'in_progress' that counts how many kcopyd jobs are running. A function wait_for_in_progress() will sleep if 'in_progress' is over the limit. It drops _origins_lock in order to avoid the deadlock. Reported-by: NGuruswamy Basavaiah <guru2018@gmail.com> Reported-by: NNikos Tsironis <ntsironis@arrikto.com> Reviewed-by: NNikos Tsironis <ntsironis@arrikto.com> Tested-by: NNikos Tsironis <ntsironis@arrikto.com> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls") Cc: stable@vger.kernel.org # v5.0+ Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()") Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Mikulas Patocka 提交于
[ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ] This simple refactoring moves code for modifying the semaphore cow_count into separate functions to prepare for changes that will extend these methods to provide for a more sophisticated mechanism for COW throttling. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Reviewed-by: NNikos Tsironis <ntsironis@arrikto.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mikulas Patocka 提交于
commit 13bd677a472d534bf100bab2713efc3f9e3f5978 upstream. GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being available and it fails if the mempool is exhausted and there is not enough memory. If we go down this path: map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT) we can see that map_bio() doesn't check the return value of mg_start(), and the bio is leaked. If we go down this path: map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell -> dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) -> mg_lock_writes -> mg_complete the bio is ended with an error - it is unacceptable because it could cause filesystem corruption if the machine ran out of memory temporarily. Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly wait until memory becomes available. mempool_alloc with GFP_NOIO can't fail, so remove the code paths that deal with allocation failure. Cc: stable@vger.kernel.org Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Song Liu 提交于
[ Upstream commit 3874d73e06c9b9dc15de0b7382fc223986d75571 ] The message should match the parameter, i.e. raid0.default_layout. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Cc: NeilBrown <neilb@suse.de> Reported-by: NIvan Topolsky <doktor.yak@gmail.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 NeilBrown 提交于
[ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ] If the drives in a RAID0 are not all the same size, the array is divided into zones. The first zone covers all drives, to the size of the smallest. The second zone covers all drives larger than the smallest, up to the size of the second smallest - etc. A change in Linux 3.14 unintentionally changed the layout for the second and subsequent zones. All the correct data is still stored, but each chunk may be assigned to a different device than in pre-3.14 kernels. This can lead to data corruption. It is not possible to determine what layout to use - it depends which kernel the data was written by. So we add a module parameter to allow the old (0) or new (1) layout to be specified, and refused to assemble an affected array if that parameter is not set. Fixes: 20d0189b ("block: Introduce new bio_split()") cc: stable@vger.kernel.org (3.14+) Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 NeilBrown 提交于
commit 480523feae581ab714ba6610388a3b4619a2f695 upstream. Since commit 4ad23a97 ("MD: use per-cpu counter for writes_pending"), set_in_sync() is substantially more expensive: it can wait for a full RCU grace period which can be 10s of milliseconds. So we should only call it when the cost is justified. md_check_recovery() currently calls set_in_sync() every time it finds anything to do (on non-external active arrays). For an array performing resync or recovery, this will be quite often. Each call will introduce a delay to the md thread, which can noticeable affect IO submission latency. In md_check_recovery() we only need to call set_in_sync() if 'safemode' was non-zero at entry, meaning that there has been not recent IO. So we save this "safemode was nonzero" state, and only call set_in_sync() if it was non-zero. This measurably reduces mean and maximum IO submission latency during resync/recovery. Reported-and-tested-by: NJack Wang <jinpu.wang@cloud.ionos.com> Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending") Cc: stable@vger.kernel.org (v4.12+) Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 NeilBrown 提交于
commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream. Until revalidate_disk() has completed, the size of a new md array will appear to be zero. So we shouldn't report, through array_state, that the array is active until that time. udev rules check array_state to see if the array is ready. As soon as it appear to be zero, fsck can be run. If it find the size to be zero, it will fail. So add a new flag to provide an interlock between do_md_run() and array_state_show(). This flag is set while do_md_run() is active and it prevents array_state_show() from reporting that the array is active. Before do_md_run() is called, ->pers will be NULL so array is definitely not active. After do_md_run() is called, revalidate_disk() will have run and the array will be completely ready. We also move various sysfs_notify*() calls out of md_run() into do_md_run() after MD_NOT_READY is cleared. This ensure the information is ready before the notification is sent. Prior to v4.12, array_state_show() was called with the mddev->reconfig_mutex held, which provided exclusion with do_md_run(). Note that MD_NOT_READY cleared twice. This is deliberate to cover both success and error paths with minimal noise. Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()") Cc: stable@vger.kernel.org (v4.12++) Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiao Ni 提交于
commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream. 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.") avoids rereading P when it can be computed from other members. However, this misses the chance to re-write the right data to P. This patch sets R5_ReadError if the re-read fails. Also, when re-read is skipped, we also missed the chance to reset rdev->read_errors to 0. It can fail the disk when there are many read errors on P member disk (other disks don't have read error) V2: upper layer read request don't read parity/Q data. So there is no need to consider such situation. This is Reported-by: kbuild test robot <lkp@intel.com> Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.") Cc: <stable@vger.kernel.org> #4.4+ Signed-off-by: NXiao Ni <xni@redhat.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ming Lei 提交于
[ Upstream commit 226b4fc7 ] SCSI maintains its own driver private data hooked off of each SCSI request, and the pridate data won't be freed after scsi_queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has fully dispatched them, due to a lower level SCSI driver's resource limitation identified in scsi_queue_rq(). Currently SCSI's per-request private data is leaked when the upper layer driver (dm-rq) frees and then retries these requests in response to BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE returns from scsi_queue_rq(). This usecase is so specialized that it doesn't warrant training an existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to account for freeing its driver private data -- doing so would add an extra branch for handling a special case that all other consumers of SCSI (and blk-mq) won't ever need to worry about. So the most pragmatic way forward is to delegate freeing SCSI driver private data to the upper layer driver (dm-rq). Do so by adding new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method from dm-rq. A following commit will implement the .cleanup_rq() hook in scsi_mq_ops. Cc: Ewan D. Milne <emilne@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: <stable@vger.kernel.org> Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Nigel Croxon 提交于
[ Upstream commit b76b4715 ] While MD continues to count read errors returned by the lower layer. If those errors are -EILSEQ, instead of -EIO, it should NOT increase the read_errors count. When RAID6 is set up on dm-integrity target that detects massive corruption, the leg will be ejected from the array. Even if the issue is correctable with a sector re-write and the array has necessary redundancy to correct it. The leg is ejected because it runs up the rdev->read_errors beyond conf->max_nr_stripes. The return status in dm-drypt when there is a data integrity error is -EILSEQ (BLK_STS_PROTECTION). Signed-off-by: NNigel Croxon <ncroxon@redhat.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Guoqing Jiang 提交于
[ Upstream commit 6ce220dd ] If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe could be set with STRIPE_ACTIVE by the handle_stripe function. And if error happens to the batch_head at the same time, break_stripe_batch_list is called, then below warning could happen (the same report in [1]), it means a member of batch list was set with STRIPE_ACTIVE. [7028915.431770] stripe state: 2001 [7028915.431815] ------------[ cut here ]------------ [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456] [...] [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G O 4.14.86-1-storage #4.14.86-1.2~deb9 [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018 [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456] [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000 [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456] [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286 [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000 [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458 [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4 [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00 [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001 [7028915.431906] FS: 0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000 [7028915.431907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0 [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7028915.431910] Call Trace: [7028915.431923] handle_stripe+0x8e7/0x2020 [raid456] [7028915.431930] ? __wake_up_common_lock+0x89/0xc0 [7028915.431935] handle_active_stripes.isra.58+0x35f/0x560 [raid456] [7028915.431939] raid5_do_work+0xc6/0x1f0 [raid456] Also commit 59fc630b ("RAID5: batch adjacent full stripe write") said "If a stripe is added to batch list, then only the first stripe of the list should be put to handle_list and run handle_stripe." So don't set STRIPE_HANDLE to stripe which is already in batch list, otherwise the stripe could be put to handle_list and run handle_stripe, then the above warning could be triggered. [1]. https://www.spinics.net/lists/raid/msg62552.htmlSigned-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yufen Yu 提交于
[ Upstream commit 07f1a685 ] When run test case: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force mdadm run fail with kernel message as follow: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) In fact, when active disk in raid1 array less than one, we need to return fail in raid1_run(). Reviewed-by: NNeilBrown <neilb@suse.de> Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Kent Overstreet 提交于
[ Upstream commit a22a9602 ] The race was when a thread using closure_sync() notices cl->s->done == 1 before the thread calling closure_put() calls wake_up_process(). Then, it's possible for that thread to return and exit just before wake_up_process() is called - so we're trying to wake up a process that no longer exists. rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier somewhere in the process teardown path. Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com> Acked-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Guoqing Jiang 提交于
[ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ] When a disk is added to array, the following path is called in mdadm. Manage_subdevs -> sysfs_freeze_array -> Manage_add -> sysfs_set_str(&info, NULL, "sync_action","idle") Then from kernel side, Manage_add invokes the path (add_new_disk -> validate_super = super_1_validate) to set In_sync flag. Since In_sync means "device is in_sync with rest of array", and the new added disk need to resync thread to help the synchronization of data. And md_reap_sync_thread would call spare_active to set In_sync for the new added disk finally. So don't set In_sync if array is in frozen. Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Guoqing Jiang 提交于
[ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ] When add one disk to array, the md_reap_sync_thread is responsible to activate the spare and set In_sync flag for the new member in spare_active(). But if raid1 has one member disk A, and disk B is added to the array. Then we offline A before all the datas are synchronized from A to B, obviously B doesn't have the latest data as A, but B is still marked with In_sync flag. So let's not call spare_active under the condition, otherwise B is still showed with 'U' state which is not correct. Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mikulas Patocka 提交于
[ Upstream commit 0c8e9c2d668278652af028c3cc068c65f66342f4 ] Commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 ("dm zoned: properly handle backing device failure") triggers a coverity warning: *** CID 1452808: Memory - illegal accesses (USE_AFTER_FREE) /drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio() 131 clone->bi_private = bioctx; 132 133 bio_advance(bio, clone->bi_iter.bi_size); 134 135 refcount_inc(&bioctx->ref); 136 generic_make_request(clone); >>> CID 1452808: Memory - illegal accesses (USE_AFTER_FREE) >>> Dereferencing freed pointer "clone". 137 if (clone->bi_status == BLK_STS_IOERR) 138 return -EIO; 139 140 if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone)) 141 zone->wp_block += nr_blocks; 142 The "clone" bio may be processed and freed before the check "clone->bi_status == BLK_STS_IOERR" - so this check can access invalid memory. Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") Cc: stable@vger.kernel.org Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Coly Li 提交于
[ Upstream commit cdca22bcbc64fc83dadb8d927df400a8d86ddabb ] Commit 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets to remove the original define of LIST_HEAD(journal), which makes the change no take effect. This patch removes redundant variable LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix working. Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") Reported-by: NJuha Aatrokoski <juha.aatrokoski@aalto.fi> Cc: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Coly Li 提交于
[ Upstream commit 50a260e8 ] There is a race between mca_reap(), btree_node_free() and journal code btree_flush_write(), which results very rare and strange deadlock or panic and are very hard to reproduce. Let me explain how the race happens. In btree_flush_write() one btree node with oldest journal pin is selected, then it is flushed to cache device, the select-and-flush is a two steps operation. Between these two steps, there are something may happen inside the race window, - The selected btree node was reaped by mca_reap() and allocated to other requesters for other btree node. - The slected btree node was selected, flushed and released by mca shrink callback bch_mca_scan(). When btree_flush_write() tries to flush the selected btree node, firstly b->write_lock is held by mutex_lock(). If the race happens and the memory of selected btree node is allocated to other btree node, if that btree node's write_lock is held already, a deadlock very probably happens here. A worse case is the memory of the selected btree node is released, then all references to this btree node (e.g. b->write_lock) will trigger NULL pointer deference panic. This race was introduced in commit cafe5635 ("bcache: A block layer cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU occupancy during journal"), which selected 128 btree nodes and flushed them one-by-one in a quite long time period. Such race is not easy to reproduce before. On a Lenovo SR650 server with 48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0 device assembled by 3 NVMe SSDs as backing device, this race can be observed around every 10,000 times btree_flush_write() gets called. Both deadlock and kernel panic all happened as aftermath of the race. The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It is set when selecting btree nodes, and cleared after btree nodes flushed. Then when mca_reap() selects a btree node with this bit set, this btree node will be skipped. Since mca_reap() only reaps btree node without BTREE_NODE_journal_flush flag, such race is avoided. Once corner case should be noticed, that is btree_node_free(). It might be called in some error handling code path. For example the following code piece from btree_split(), 2149 err_free2: 2150 bkey_put(b->c, &n2->key); 2151 btree_node_free(n2); 2152 rw_unlock(true, n2); 2153 err_free1: 2154 bkey_put(b->c, &n1->key); 2155 btree_node_free(n1); 2156 rw_unlock(true, n1); At line 2151 and 2155, the btree node n2 and n1 are released without mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here. If btree_node_free() is called directly in such error handling path, and the selected btree node has BTREE_NODE_journal_flush bit set, just delay for 1 us and retry again. In this case this btree node won't be skipped, just retry until the BTREE_NODE_journal_flush bit cleared, and free the btree node memory. Fixes: cafe5635 ("bcache: A block layer cache") Signed-off-by: NColy Li <colyli@suse.de> Reported-and-tested-by: Nkbuild test robot <lkp@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Coly Li 提交于
[ Upstream commit 41508bb7 ] When accessing or modifying BTREE_NODE_dirty bit, it is not always necessary to acquire b->write_lock. In bch_btree_cache_free() and mca_reap() acquiring b->write_lock is necessary, and this patch adds comments to explain why mutex_lock(&b->write_lock) is necessary for checking or clearing BTREE_NODE_dirty bit there. Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Coly Li 提交于
[ Upstream commit e5ec5f47 ] In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is always set no matter btree node is dirty or not. The code looks like this, if (btree_node_dirty(b)) btree_complete_write(b, btree_current_write(b)); clear_bit(BTREE_NODE_dirty, &b->flags); Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty bit is cleared, then it is unnecessary to clear the bit again. This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is true (the bit is set), to save a few CPU cycles. Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Milan Broz 提交于
[ Upstream commit 7a1cd7238fde6ab367384a4a2998cba48330c398 ] The information about tag size should not be printed without debug info set. Also print device major:minor in the error message to identify the device instance. Also use rate limiting and debug level for info about used crypto API implementaton. This is important because during online reencryption the existing message saturates syslog (because we are moving hotzone across the whole device). Cc: stable@vger.kernel.org Signed-off-by: NMilan Broz <gmazyland@gmail.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yufen Yu 提交于
[ Upstream commit 5de719e3 ] After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"), map_request() will requeue the tio when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. Thus, if device driver status is error, a tio may be requeued multiple times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io may be called multiple times, while type->end_io is only called when IO complete. In fact, even without commit 396eaf21, setup_clone() failure can also cause tio requeue and associated missed call to type->end_io. The service-time path selector selects path based on in_flight_size, which is increased by st_start_io() and decreased by st_end_io(). Missed calls to st_end_io() can lead to in_flight_size count error and will cause the selector to make the wrong choice. In addition, queue-length path selector will also be affected. To fix the problem, call type->end_io in ->release_clone_rq before tio requeue. map_info is passed to ->release_clone_rq() for map_request() error path that result in requeue. Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Cc: stable@vger.kernl.org Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Tang Junhui 提交于
[ Upstream commit 58ac323084ebf44f8470eeb8b82660f9d0ee3689 ] Stale && dirty keys can be produced in the follow way: After writeback in write_dirty_finish(), dirty keys k1 will replace by clean keys k2 ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key); ==>btree_insert_fn(struct btree_op *b_op, struct btree *b) ==>static int bch_btree_insert_node(struct btree *b, struct btree_op *op, struct keylist *insert_keys, atomic_t *journal_ref, Then two steps: A) update k1 to k2 in btree node memory; bch_btree_insert_keys(b, op, insert_keys, replace_key) B) Write the bset(contains k2) to cache disk by a 30s delay work bch_btree_leaf_dirty(b, journal_ref). But before the 30s delay work write the bset to cache device, these things happened: A) GC works, and reclaim the bucket k2 point to; B) Allocator works, and invalidate the bucket k2 point to, and increase the gen of the bucket, and place it into free_inc fifo; C) Until now, the 30s delay work still does not finish work, so in the disk, the key still is k1, it is dirty and stale (its gen is smaller than the gen of the bucket). and then the machine power off suddenly happens; D) When the machine power on again, after the btree reconstruction, the stale dirty key appear. In bch_extent_bad(), when expensive_debug_checks is off, it would treat the dirty key as good even it is stale keys, and it would cause bellow probelms: A) In read_dirty() it would cause machine crash: BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); B) It could be worse when reads hits stale dirty keys, it would read old incorrect data. This patch tolerate the existence of these stale && dirty keys, and treat them as bad key in bch_extent_bad(). (Coly Li: fix indent which was modified by sender's email client) Signed-off-by: NTang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Coly Li 提交于
[ Upstream commit 149d0efada7777ad5a5242b095692af142f533d8 ] In extents.c:bch_extent_bad(), number 96 is used as parameter to call btree_bug_on(). The purpose is to check whether stale gen value exceeds BUCKET_GC_GEN_MAX, so it is better to use macro BUCKET_GC_GEN_MAX to make the code more understandable. Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yang Yingliang 提交于
hulk inclusion category: bugfix bugzilla: 13971 CVE: NA ------------------------------------------------- This reverts commit abd4f00326f4d4865761833554d2404a0c9995bd. Use LTS patch instead of this patch. Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Dan Carpenter 提交于
[ Upstream commit e0702d90 ] This function is supposed to return error pointers so it matches the dmz_get_rnd_zone_for_reclaim() function. The current code could lead to a NULL dereference in dmz_do_reclaim() Fixes: b234c6d7 ("dm zoned: improve error handling in reclaim") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dmitry Fomichev 提交于
commit 75d66ffb upstream. dm-zoned is observed to lock up or livelock in case of hardware failure or some misconfiguration of the backing zoned device. This patch adds a new dm-zoned target function that checks the status of the backing device. If the request queue of the backing device is found to be in dying state or the SCSI backing device enters offline state, the health check code sets a dm-zoned target flag prompting all further incoming I/O to be rejected. In order to detect backing device failures timely, this new function is called in the request mapping path, at the beginning of every reclaim run and before performing any metadata I/O. The proper way out of this situation is to do dmsetup remove <dm-zoned target> and recreate the target when the problem with the backing device is resolved. Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dmitry Fomichev 提交于
commit d7428c50 upstream. Some errors are ignored in the I/O path during queueing chunks for processing by chunk works. Since at least these errors are transient in nature, it should be possible to retry the failed incoming commands. The fix - Errors that can happen while queueing chunks are carried upwards to the main mapping function and it now returns DM_MAPIO_REQUEUE for any incoming requests that can not be properly queued. Error logging/debug messages are added where needed. Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dmitry Fomichev 提交于
commit b234c6d7 upstream. There are several places in reclaim code where errors are not propagated to the main function, dmz_reclaim(). This function is responsible for unlocking zones that might be still locked at the end of any failed reclaim iterations. As the result, some device zones may be left permanently locked for reclaim, degrading target's capability to reclaim zones. This patch fixes these issues as follows - Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and dmz_reclaim_rnd_data() return error codes to the caller. dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid clashing with "struct dmz_reclaim" and is modified to return the error to the caller. dmz_get_zone_for_reclaim() now returns an error instead of NULL pointer and reclaim code checks for that error. Error logging/debug messages are added where necessary. Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mikulas Patocka 提交于
commit 1cfd5d33 upstream. If the sector number is too high, dm_table_find_target() should return a pointer to a zeroed dm_target structure (the caller should test it with dm_target_is_valid). However, for some table sizes, the code in dm_table_find_target() that performs btree lookup will access out of bound memory structures. Fix this bug by testing the sector number at the beginning of dm_table_find_target(). Also, add an "inline" keyword to the function dm_table_get_size() because this is a hot path. Fixes: 512875bd ("dm: table detect io beyond device") Cc: stable@vger.kernel.org Reported-by: NZhang Tao <kontais@zoho.com> Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Wenwen Wang 提交于
commit dc1a3e8e upstream. If rs_prepare_reshape() fails, no cleanup is executed, leading to leak of the raid_set structure allocated at the beginning of raid_ctr(). To fix this issue, go to the label 'bad' if the error occurs. Fixes: 11e47232 ("dm raid: stop keeping raid set frozen altogether") Cc: stable@vger.kernel.org Signed-off-by: NWenwen Wang <wenwen@cs.uga.edu> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mikulas Patocka 提交于
commit 5729b6e5 upstream. Fix a crash that was introduced by the commit 724376a0. The crash is reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468 When reading from the integrity device, the function dm_integrity_map_continue calls find_journal_node to find out if the location to read is present in the journal. Then, it calculates how many sectors are consecutively stored in the journal. Then, it locks the range with add_new_range and wait_and_add_new_range. The problem is that during wait_and_add_new_range, we hold no locks (we don't hold ic->endio_wait.lock and we don't hold a range lock), so the journal may change arbitrarily while wait_and_add_new_range sleeps. The code then goes to __journal_read_write and hits BUG_ON(journal_entry_get_sector(je) != logical_sector); because the journal has changed. In order to fix this bug, we need to re-check the journal location after wait_and_add_new_range. We restrict the length to one block in order to not complicate the code too much. Fixes: 724376a0 ("dm integrity: implement fair range locks") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dmitry Fomichev 提交于
commit d1fef414 upstream. This patch fixes a problem in dm-kcopyd that may leave jobs in complete queue indefinitely in the event of backing storage failure. This behavior has been observed while running 100% write file fio workload against an XFS volume created on top of a dm-zoned target device. If the underlying storage of dm-zoned goes to offline state under I/O, kcopyd sometimes never issues the end copy callback and dm-zoned reclaim work hangs indefinitely waiting for that completion. This behavior was traced down to the error handling code in process_jobs() function that places the failed job to complete_jobs queue, but doesn't wake up the job handler. In case of backing device failure, all outstanding jobs may end up going to complete_jobs queue via this code path and then stay there forever because there are no more successful I/O jobs to wake up the job handler. This patch adds a wake() call to always wake up kcopyd job wait queue for all I/O jobs that fail before dm_io() gets called for that job. The patch also sets the write error status in all sub jobs that are failed because their master job has failed. Fixes: b73c67c2 ("dm kcopyd: add sequential write feature") Cc: stable@vger.kernel.org Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mikulas Patocka 提交于
commit cf3591ef upstream. Revert the commit bd293d07. The proper fix has been made available with commit d0a255e7 ("loop: set PF_MEMALLOC_NOIO for the worker thread"). Note that the fix offered by commit bd293d07 doesn't really prevent the deadlock from occuring - if we look at the stacktrace reported by Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex - i.e. it has already successfully taken the mutex. Changing the mutex from mutex_lock to mutex_trylock won't help with deadlocks that happen afterwards. PID: 474 TASK: ffff8813e11f4600 CPU: 10 COMMAND: "kswapd0" #0 [ffff8813dedfb938] __schedule at ffffffff8173f405 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27 #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186 #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8 #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81 #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio] #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio] #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio] #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778 #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f #13 [ffff8813dedfbec0] kthread at ffffffff810a8428 #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242 Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: bd293d07 ("dm bufio: fix deadlock with loop device") Depends-on: d0a255e7 ("loop: set PF_MEMALLOC_NOIO for the worker thread") Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiongfeng Wang 提交于
hulk inclusion category: feature bugzilla: 20209 CVE: NA --------------------------- This patch modifies the dm-crypt layer to rely on the IV generation templates for generating IV. The dm-crypt layer won't divided the 'bio' into sectors and passes each sector to the crypto driver any more. The dm-crypt layer creates a scatterlist array to record all the data of the 'bio' and passes this scatterlist array to the crypto driver at one time. This crypto driver could be some accelerator driver which can process several sectors at one time. This patch is based on the patchset originally started by Binoy Jayan <binoy.jayan@linaro.org> ( crypto: Add IV generation algorithms https://patchwork.kernel.org/patch/9803469/ ) Signed-off-by: NBinoy Jayan <binoy.jayan@linaro.org> Signed-off-by: NXiongfeng Wang <xiongfeng.wang@linaro.org> Reviewed-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiongfeng Wang 提交于
hulk inclusion category: feature bugzilla: 20209 CVE: NA --------------------------- Currently, the IV generation algorithms are implemented in crypt layer. This patch implements these algorithms as templates, so that dm-crypt layer can be simplified, and also these algorithms can be implemented in hardware for performance. This patch is based on the patchset originally started by Binoy Jayan <binoy.jayan@linaro.org> ( crypto: Add IV generation algorithms https://patchwork.kernel.org/patch/9803469/ ) Signed-off-by: NBinoy Jayan <binoy.jayan@linaro.org> Signed-off-by: NXiongfeng Wang <xiongfeng.wang@linaro.org> Reviewed-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 ZhangXiaoxu 提交于
mainline inclusion from mainline-v5.3-rc6 commit ae148243d3f0816b37477106c05a2ec7d5f32614 category: bugfix bugzilla: 20701 CVE: NA ------------------------------------------------- In commit 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize"), we refactor the commit logic to a new function 'apply_bops'. But when that logic was replaced in out() the return value was not stored. This may lead out() returning a wrong value to the caller. Fixes: 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize") Cc: stable@vger.kernel.org Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 ZhangXiaoxu 提交于
mainline inclusion from mainline-v5.3-rc6 commit e4f9d6013820d1eba1432d51dd1c5795759aa77f category: bugfix bugzilla: 20701 CVE: NA ------------------------------------------------- When btree_split_beneath() splits a node to two new children, it will allocate two blocks: left and right. If right block's allocation failed, the left block will be unlocked and marked dirty. If this happened, the left block'ss content is zero, because it wasn't initialized with the btree struct before the attempot to allocate the right block. Upon return, when flushing the left block to disk, the validator will fail when check this block. Then a BUG_ON is raised. Fix this by completely initializing the left block before allocating and initializing the right block. Fixes: 4dcb8b57 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path") Cc: stable@vger.kernel.org Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mike Snitzer 提交于
mainline inclusion from mainline-v5.3-rc1 commit 54fa16ee category: bugfix bugzilla: 18564 CVE: NA ------------------------------------------------- Check if in fail_io mode at start of dm_pool_metadata_set_needs_check(). Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can crash in dm_bm_write_lock() while accessing the block manager object that was previously destroyed as part of a failed dm_pool_abort_metadata() that ultimately set fail_io to begin with. Also, update DMERR() message to more accurately describe superblock_lock() failure. Cc: stable@vger.kernel.org Reported-by: NZdenek Kabelac <zkabelac@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Conflicts: drivers/md/dm-thin-metadata.c Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: NYi Zhang <yi.zhang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Mike Snitzer 提交于
mainline inclusion from mainline-5.0-rc4 commit a1e1cb72d96491277ede8d257ce6b48a381dd336 category: bugfix bugzilla: 18695 CVE: NA --------------------------- The risk of redundant IO accounting was not taken into consideration when commit 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk") introduced IO splitting in terms of recursion via generic_make_request(). Fix this by subtracting the split bio's payload from the IO stats that were already accounted for by start_io_acct() upon dm_make_request() entry. This repeat oscillation of the IO accounting, up then down, isn't ideal but refactoring DM core's IO splitting to pre-split bios _before_ they are accounted turned out to be an excessive amount of change that will need a full development cycle to refine and verify. Before this fix: /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so bios are split on 32k boundaries. # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \ --iodepth=1 --ioengine=libaio --direct=1 --refill_buffers with debugging added: [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128 [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio: [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64 ... 16M written yet 136M (278528 * 512b) accounted: # cat /sys/block/dm-2/stat | awk '{ print $7 }' 278528 After this fix: 16M written and 16M (32768 * 512b) accounted: # cat /sys/block/dm-2/stat | awk '{ print $7 }' 32768 Conflicts: drivers/md/dm.c Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk") Cc: stable@vger.kernel.org # 4.16+ Reported-by: NBryan Gurney <bgurney@redhat.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-