- June 21, 2019: 3 commits
-
-
By Guoqing Jiang

There are now two places that need to handle a failure to destroy the bitmap, so move the common part shared by the bitmap_abort and abort labels.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
By Guoqing Jiang

Previously, we called rdev_init_wb to avoid potential data inconsistency when the array is created. Now we also need to call the function and create the mempool when a device is added or is newly flagged as "writemostly", so mddev_create_wb_pool is introduced and called accordingly. For safety, we mark an implicit GFP_NOIO allocation scope for the mempool creation during mddev_suspend/mddev_resume (a sketch of this follows below). Conversely, the mempool should be removed after a member device is removed or its "writemostly" flag is cleared, which is done by calling mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
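A minimal sketch of the allocation-scope idea described above. The helper name follows the commit message, but struct wb_info, NR_WB_INFOS, mddev->wb_info_pool and the exact placement of the GFP_NOIO scope are assumptions for illustration:

  /* Sketch only: create the write-behind mempool while the array is
   * suspended, under an implicit GFP_NOIO scope so the allocation cannot
   * recurse into I/O against the suspended device. */
  static void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev)
  {
      unsigned int noio_flag;

      if (!test_bit(WriteMostly, &rdev->flags) || mddev->wb_info_pool)
          return;

      mddev_suspend(mddev);
      noio_flag = memalloc_noio_save();
      /* NR_WB_INFOS and struct wb_info are illustrative names */
      mddev->wb_info_pool = mempool_create_kmalloc_pool(NR_WB_INFOS,
                                                        sizeof(struct wb_info));
      memalloc_noio_restore(noio_flag);
      mddev_resume(mddev);
  }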
-
By Guoqing Jiang

In write-behind mode we consider a write IO complete once it has reached all the non-write-mostly devices. That works fine for single-queue devices, but for a multiqueue device, if lots of IOs come from the upper layer, the write-behind device could issue them to different queues and, depending on each queue's delay, there is no guarantee that those IOs arrive in order. To address the issue, we check for collisions among write-behind IOs: an IO may only continue if there is no collision, otherwise it waits for the completion of the earlier colliding IO (a sketch of this check follows this entry). WBCollision is introduced for multiqueue devices operating in write-behind mode. This patch does not yet handle the cases below, which could also cause data inconsistency; they will be handled in later patches:

1. modifying max_write_behind via the write-backlog node;
2. adding or removing the array's bitmap dynamically;
3. changes to a member disk.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
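A rough sketch of the collision check; struct wb_info and the rdev fields wb_list, wb_list_lock and wb_io_wait are illustrative names, not necessarily the ones used by the patch:

  struct wb_info {
      sector_t lo, hi;          /* sector range of an in-flight write-behind IO */
      struct list_head list;
  };

  /* Atomically check for overlap and, if there is none, record the range.
   * Returns false when it collides with an IO that is already in flight. */
  static bool wb_try_add(struct md_rdev *rdev, struct wb_info *new)
  {
      struct wb_info *wi;
      bool added = true;

      spin_lock_irq(&rdev->wb_list_lock);
      list_for_each_entry(wi, &rdev->wb_list, list)
          if (new->hi > wi->lo && new->lo < wi->hi) {
              added = false;    /* collision with an earlier write-behind IO */
              break;
          }
      if (added)
          list_add(&new->list, &rdev->wb_list);
      spin_unlock_irq(&rdev->wb_list_lock);
      return added;
  }

  /* Issue path: only continue once the range no longer collides; the
   * completion path removes its entry and wakes rdev->wb_io_wait. */
  static void wb_wait_and_add(struct md_rdev *rdev, struct wb_info *new)
  {
      wait_event(rdev->wb_io_wait, wb_try_add(rdev, new));
  }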
-
- June 15, 2019: 2 commits
-
-
By Yufen Yu

This patch fixes a spelling typo and adds the necessary spaces in the code. In addition, it gets rid of an unnecessary 'if'.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Marcos Paulo de Souza

Commit c42d3240 ("md: return -ENODEV if rdev has no mddev assigned") changed rdev_attr_store to return -ENODEV when rdev->mddev is NULL; now do the same for rdev_attr_show.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
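A sketch of the resulting show path, mirroring what c42d3240 did for the store path (shape only, not a verbatim quote of the patch):

  static ssize_t rdev_attr_show(struct kobject *kobj, struct attribute *attr,
                                char *page)
  {
      struct rdev_sysfs_entry *entry =
          container_of(attr, struct rdev_sysfs_entry, attr);
      struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);

      if (!entry->show)
          return -EIO;
      if (!rdev->mddev)        /* rdev already detached from its array */
          return -ENODEV;
      return entry->show(rdev, page);
  }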
-
- May 24, 2019: 1 commit
-
-
By Thomas Gleixner

Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version you should have received a copy of the gnu general public license for example usr src linux copying if not write to the free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- April 17, 2019: 1 commit
-
-
By Pawel Baldysiak

Mdadm expects that setting a drive as faulty will fail with -EBUSY only if the operation would cause the RAID to fail. If that happens, it will try to stop the array. Currently -EBUSY might also be returned if the rdev is in the middle of the removal process, for example when there is a race with mdmon that has already requested the drive to be failed/removed. If the rdev does not contain an mddev, return -ENODEV instead, so the caller can distinguish between those two cases and behave accordingly.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
- April 11, 2019: 5 commits
-
-
By Christoph Hellwig

Sparse complains that it has no external declaration, and it turns out that it is never even used outside of md.c. So just mark it static and drop the export.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
By Christoph Hellwig

If we want to convert from a little-endian format, we need to cast to a little-endian type; otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
By Christoph Hellwig

If we want to convert from a little-endian format, we need to cast to a little-endian type; otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
By Christoph Hellwig

The on-disk value is little endian and we need to convert it to native endian before storing the value in the in-core structure.

Fixes: 7564beda ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
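The general pattern these three sparse/endianness commits enforce, shown on a made-up superblock field (the actual field touched by the fix is not reproduced here):

  struct example_ondisk_sb {
      __le64 resync_offset;    /* little endian on disk */
  };

  static void load_example(struct mddev *mddev,
                           const struct example_ondisk_sb *sb)
  {
      /* convert to native endian before storing in the in-core structure;
       * keeping the on-disk field typed as __le64 lets sparse catch
       * missing conversions */
      mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
  }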
-
By Yufen Yu

When doing a re-add, we need to ensure rdev->mddev->pers is not NULL, which avoids a potential NULL pointer dereference in the following add_bound_rdev().

Fixes: a6da4ef8 ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
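A sketch of the kind of guard the fix adds on the "re-add" path in state_store(); the surrounding condition is illustrative, not the exact diff:

  if (cmd_match(buf, "re-add")) {
      if (!rdev->mddev->pers) {
          /* array not running: add_bound_rdev() would dereference
           * a NULL mddev->pers, so refuse the re-add */
          err = -EINVAL;
      } else if (test_bit(Faulty, &rdev->flags) && rdev->raid_disk == -1 &&
                 rdev->saved_raid_disk >= 0) {
          clear_bit(Faulty, &rdev->flags);
          err = add_bound_rdev(rdev);
      } else {
          err = -EBUSY;
      }
  }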
-
- April 7, 2019: 1 commit
-
-
By Christoph Hellwig

Currently, support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has caused various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- April 2, 2019: 2 commits
-
-
By NeilBrown

Currently, if many flush requests are submitted to an md device in quick succession, they are serialized and it can take a long time to process them all. We don't really need to call flush all those times: a single flush call can satisfy all requests submitted before it started. So keep track of when the current flush started and when it finished, and allow any pending flush that was requested before the flush started to complete without waiting any more.

Test results from Xiao. The test is done on a raid10 device created from 4 SSDs; the tool is dbench.

1. The latest linux stable kernel

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Deltree                    768    10.509    78.305
 Flush                  2078376     0.013    10.094
 Close                 21787697     0.019    18.821
 LockX                    96580     0.007     3.184
 Mkdir                      384     0.008     0.062
 Rename                 1255883     0.191    23.534
 ReadX                 46495589     0.020    14.230
 WriteX                14790591     7.123    60.706
 Unlink                 5989118     0.440    54.551
 UnlockX                  96580     0.005     2.736
 FIND_FIRST            10393845     0.042    12.079
 SET_FILE_INFORMATION   2415558     0.129    10.088
 QUERY_FILE_INFORMATION 4711725     0.005     8.462
 QUERY_PATH_INFORMATION 26883327    0.032    21.715
 QUERY_FS_INFORMATION   4929409     0.010     8.238
 NTCreateX             29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs  max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Deltree                    256     8.326    36.761
 Flush                   693291     3.974   180.269
 Close                  7266404     0.009    36.929
 LockX                    32160     0.006     0.840
 Mkdir                      128     0.008     0.021
 Rename                  418755     0.063    29.945
 ReadX                 15498708     0.007     7.216
 WriteX                 4932310    22.482   267.928
 Unlink                 1997557     0.109    47.553
 UnlockX                  32160     0.004     1.110
 FIND_FIRST             3465791     0.036     7.320
 SET_FILE_INFORMATION    805825     0.015     1.561
 QUERY_FILE_INFORMATION 1570950     0.005     2.403
 QUERY_PATH_INFORMATION 8965483     0.013    14.277
 QUERY_FS_INFORMATION   1643626     0.009     3.314
 NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs  max_latency=267.939 ms

3. With patch1 and patch2

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Deltree                    768     9.570    54.588
 Flush                  2061354     0.666    15.102
 Close                 21604811     0.012    25.697
 LockX                    95770     0.007     1.424
 Mkdir                      384     0.008     0.053
 Rename                 1245411     0.096    12.263
 ReadX                 46103198     0.011    12.116
 WriteX                14667988     7.375    60.069
 Unlink                 5938936     0.173    30.905
 UnlockX                  95770     0.005     4.147
 FIND_FIRST            10306407     0.041    11.715
 SET_FILE_INFORMATION   2395987     0.048     7.640
 QUERY_FILE_INFORMATION 4672371     0.005     9.291
 QUERY_PATH_INFORMATION 26656735    0.018    19.719
 QUERY_FS_INFORMATION   4887940     0.010     7.654
 NTCreateX             29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs  max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
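A rough sketch of the batching idea, assuming illustrative field names (mddev->flush_bio marks the flush currently in progress; mddev->last_flush records when the most recently completed flush was started); this is not the verbatim patch:

  static bool md_flush_request_sketch(struct mddev *mddev, struct bio *bio)
  {
      ktime_t req_start = ktime_get_boottime();

      spin_lock_irq(&mddev->lock);
      /* wait until no flush is active, or until a flush that started
       * after this request arrived has completed */
      wait_event_lock_irq(mddev->sb_wait,
                          !mddev->flush_bio ||
                          ktime_after(mddev->last_flush, req_start),
                          mddev->lock);
      if (ktime_after(mddev->last_flush, req_start)) {
          /* a newer flush already completed: it covers this request,
           * so no additional flush needs to be issued */
          spin_unlock_irq(&mddev->lock);
          return false;   /* caller just submits the data portion */
      }
      mddev->flush_bio = bio;         /* become the active flush */
      spin_unlock_irq(&mddev->lock);

      /* ...submit PREFLUSH to member devices; on completion record the
       * start time in mddev->last_flush and wake mddev->sb_wait... */
      return true;
  }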
-
By NeilBrown

This reverts commit 5a409b4f. This patch has two problems.

1/ It makes multiple calls to submit_bio() from inside a make_request_fn. The bios thus submitted will be queued on current->bio_list and not submitted immediately. As the bios are allocated from a mempool, this can theoretically result in a deadlock: the entire pool of requests could be sitting in various ->bio_list queues while a subsequent mempool_alloc blocks waiting for one of them to be released.

2/ It aims to handle the case when there are many concurrent flush requests. It does so by submitting many requests in parallel, all of which are identical and so most of which do nothing useful. It would be more efficient to just send one lower-level request, but allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- January 14, 2019: 1 commit
-
-
By Marcos Paulo de Souza

bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing the returned data into a new variable.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- December 21, 2018: 2 commits
-
-
By Chengguang Xu

mempool_destroy() can handle a NULL pointer correctly, so there is no need to check for NULL before calling mempool_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
By Yue Haibing

Fixes the gcc '-Wunused-but-set-variable' warning:

drivers/md/md.c: In function 'md_integrity_add_rdev':
drivers/md/md.c:2149:24: warning: variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

It has not been used any more since commit 1501efad ("md/raid: only permit hot-add of compatible integrity profiles").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- December 10, 2018: 1 commit
-
-
By Mike Snitzer

All of the part_stat_* and related methods are used with preemption disabled, so there is no need to pass the cpu around to all of them. Just call smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
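A hedged before/after sketch of what this change looks like at a call site (illustrative; sgrp and nr_sectors stand in for whatever the caller already computed):

  /* before: the caller captured the cpu and passed it to every stat macro */
  cpu = part_stat_lock();
  part_stat_inc(cpu, &disk->part0, ios[sgrp]);
  part_stat_add(cpu, &disk->part0, sectors[sgrp], nr_sectors);
  part_stat_unlock();

  /* after: preemption is still disabled inside the section, so the macros
   * look up the current cpu themselves via smp_processor_id() */
  part_stat_lock();
  part_stat_inc(&disk->part0, ios[sgrp]);
  part_stat_add(&disk->part0, sectors[sgrp], nr_sectors);
  part_stat_unlock();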
-
- October 23, 2018: 2 commits
-
-
By Xiao Ni

flush_pool is leaked when the flush bio size is zero.

Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
By Jack Wang

I noticed a kmemleak report of a memory leak when running create/stop md in a loop; backtrace:

[<000000001ca975e7>] mempool_create_node+0x86/0xd0
[<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
[<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
[<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
[<000000004142cacf>] blkdev_ioctl+0x680/0xd00

The root cause is that we allocate mddev->flush_pool and mddev->flush_bio_pool in md_run, but do_md_stop does not call into md_stop, only __md_stop; moving the mempool_destroy calls to __md_stop fixes the problem for me.

The bug was introduced in 5a409b4f, so the fix should go to 4.18+.

Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- October 19, 2018: 4 commits
-
-
By Guoqing Jiang

We need to continue the reshape if it was interrupted on the original node, so the original node should call resync_bitmap in case the reshape is aborted. The BITMAP_NEEDS_SYNC message is then broadcast to the other nodes, and the node which continues the reshape should restart it from mddev->reshape_position instead of from the very beginning.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
By Guoqing Jiang

remove_and_add_spares is not needed if a reshape is happening on another node, because raid10_add_disk, called inside raid10_start_reshape, handles the role changes of the disk. Besides, remove_and_add_spares can't deal with the role change caused by the reshape.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
By Guoqing Jiang

We need to change the capacity on all nodes after one node finishes a reshape. As before, we can't change the capacity directly in md_do_sync; instead, the capacity should only be changed in update_size or upon receiving a CHANGE_CAPACITY msg. So the master node calls update_size after it completes the reshape in md_reap_sync_thread, but we need to skip ops->update_size if MD_CLOSING is set, since in that case the reshape could not be finished.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
By Guoqing Jiang

In a clustered raid10 scenario, we need to let all the nodes know that a new disk has been added to the array. The reshape caused by adding the new member only needs to happen on one node, but the other nodes should know about the change.

Since a reshape means reading data from somewhere (which is already used by the array) and writing it to an unused region, it is obviously bad if one node is reading data from an address while another node is writing to the same address. Considering we have already implemented suspending writes in the resyncing area, we can simply broadcast the address being read to the other nodes to avoid the trouble.

The master node calls reshape_request and then updates the sb during the reshape period. To avoid the above trouble, we call resync_info_update to send a RESYNC message in reshape_request. From a slave node's point of view, it then receives two types of messages:

1. RESYNCING message: the slave node adds the address (where the master node is reading data from) to its suspend list.
2. METADATA_UPDATED message: once the slave nodes know the reshape has started on the master node, it is time to update the reshape position and call start_reshape to follow the master node's steps.

After the reshape is done, only the reshape position needs to be updated, so the majority of the reshape work happens on the master node.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- October 15, 2018: 1 commit
-
-
By Shaohua Li

Commit d595567d (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- October 4, 2018: 1 commit
-
-
By NeilBrown

Commit 35bfc521 ("md: allow metadata update while suspending.") added support for allowing md_check_recovery() to still perform metadata updates while the array is entering the 'suspended' state. This is needed to allow the process of entering the state to complete.

Unfortunately, the patch doesn't really work. The test for "mddev->suspended" at the start of md_check_recovery() means that the function doesn't try to do anything at all while entering suspend. This patch moves the code that updates the metadata while suspending to *before* the test on mddev->suspended.

Reported-by: Jeff Mahoney <jeffm@suse.com>
Fixes: 35bfc521 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- October 2, 2018: 1 commit
-
-
By Shaohua Li

If we change the number of the array's devices after a device has been removed from the array, and then add the device back, we can see that the device is added with an active role instead of as the spare we expected. Please see the link below for details:
https://marc.info/?l=linux-raid&m=153736982015076&w=2

This is caused by the fact that we prefer to reuse the device's previous role, which is recorded in saved_raid_disk, but we should respect the new conf->raid_disks, since it could have changed after the device was removed.

Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- August 2, 2018: 1 commit
-
-
By Andy Shevchenko

The bitmap API (include/linux/bitmap.h) uses the 'bitmap' prefix for its methods, while the MD bitmap API is a special case. Add an 'md' prefix to it to avoid namespace collisions. No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
-
- July 25, 2018: 1 commit
-
-
By Christoph Hellwig

The function name mentioned doesn't exist, and the code next to it doesn't match the description either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- July 18, 2018: 2 commits
-
-
By Michael Callahan

Add and use a new op_stat_group() function for indexing partition stat fields rather than indexing them by rq_data_dir() or bio_data_dir(). This function works similarly to op_is_sync() in that it takes the request::cmd_flags or bio::bi_opf flags and determines which stats should be updated.

In addition, the second parameter to generic_start_io_acct() and generic_end_io_acct() is now a REQ_OP rather than simply a read or write bit, and the helpers use op_stat_group() on the parameter to determine the stat group.

Note that the partition in_flight counts are not part of the per-cpu statistics and as such are not indexed via this function; they are indexed by op_is_write() instead.

tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
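A short sketch of the helper and a typical call site of that era (illustrative; the cpu argument to the part_stat macros was still required then and is only removed by the December 2018 entry earlier in this log):

  /* as introduced: group a REQ_OP into the read or write stat bucket */
  static inline int op_stat_group(unsigned int op)
  {
      return op_is_write(op);
  }

  /* illustrative accounting in md's make_request path */
  const int sgrp = op_stat_group(bio_op(bio));
  int cpu = part_stat_lock();

  part_stat_inc(cpu, &mddev->gendisk->part0, ios[sgrp]);
  part_stat_add(cpu, &mddev->gendisk->part0, sectors[sgrp], bio_sectors(bio));
  part_stat_unlock();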
-
By Michael Callahan

Add a part_stat_read_accum macro to genhd.h to read and sum a stat field across its per-direction entries, for example to sum up the number of read and write sectors completed. Besides being a reasonable cleanup by itself, this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
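An illustrative shape of the macro and one md use of it (the exact definition in genhd.h may differ, e.g. in which direction indices it sums over):

  /* sum one per-direction stat field across reads and writes */
  #define part_stat_read_accum(part, field)                  \
      (part_stat_read(part, field[STAT_READ]) +              \
       part_stat_read(part, field[STAT_WRITE]))

  /* e.g. md's idle detection can use total (read + write) sectors */
  curr_events = (int)part_stat_read_accum(&disk->part0, sectors) -
                atomic_read(&disk->sync_io);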
-
- July 6, 2018: 1 commit
-
-
By Guoqing Jiang

When a resync or recovery is happening on one node, the other nodes don't show the appropriate info at the moment. For example, if you create an array on the master node without "--assume-clean" and then assemble the array on the slave nodes, you can see "resync=PENDING" when reading /proc/mdstat on the slave nodes. However, this info is confusing, since the "PENDING" status was introduced for starting an array in read-only mode.

We introduce a RESYNCING_REMOTE flag to indicate that the resync thread is running on a remote node. The flag is set when a node receives a RESYNCING msg, and we clear the REMOTE flag in the following cases:

1. the resync or recovery finishes on the master node, which means the slaves receive a msg with both lo and hi set to 0;
2. the node continues the resync/recovery in recover_bitmaps;
3. resync_finish is called.

Then we show accurate information in status_resync by checking the REMOTE flag along with the other conditions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- June 19, 2018: 1 commit
-
-
By Shaohua Li

We need to destroy the memory pool on failure.

Signed-off-by: Shaohua Li <shli@fb.com>
-
- June 8, 2018: 1 commit
-
-
By Kent Overstreet

Previously, mddev_put() had a couple of different paths for freeing an mddev, due to the fact that the kobject wasn't initialized when the mddev was first allocated. If we move the kobject_init() to when it's first allocated and just use kobject_add() later, we can clean all this up.

This also removes a hack in mddev_put() to avoid freeing biosets under a spinlock, which involved copying biosets on the stack after the recent bioset_init() changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- May 31, 2018: 1 commit
-
-
By Kent Overstreet

Convert md to embedded bio sets.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- May 22, 2018: 1 commit
-
-
By Xiao Ni

There is lock contention when many processes send flush bios to an md device, e.g. when creating many LVs on one raid device and running mkfs.xfs on each LV. Currently flush requests can only be handled sequentially: a request needs to wait for mddev->flush_bio to become NULL, otherwise it must take mddev->lock. This patch removes mddev->flush_bio and handles flush bios asynchronously.

I did a test with the command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    11165   167.595  5879.560
 Close                   107469     1.391  2231.094
 LockX                      384     0.003     0.019
 Rename                    5944     2.141  1856.001
 ReadX                   208121     0.003     0.074
 WriteX                   98259  1925.402 15204.895
 Unlink                   25198    13.264  3457.268
 UnlockX                    384     0.001     0.009
 FIND_FIRST               47111     0.012     0.076
 SET_FILE_INFORMATION     12966     0.007     0.065
 QUERY_FILE_INFORMATION   27921     0.004     0.085
 QUERY_PATH_INFORMATION  124650     0.005     5.766
 QUERY_FS_INFORMATION     22519     0.003     0.053
 NTCreateX               141086     4.291  2502.812

Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms

=================With the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                     4500   174.134   406.398
 Close                    48195     0.060   467.062
 LockX                      256     0.003     0.029
 Rename                    2324     0.026     0.360
 ReadX                    78846     0.004     0.504
 WriteX                   66832   562.775  1467.037
 Unlink                    5516     3.665  1141.740
 UnlockX                    256     0.002     0.019
 FIND_FIRST               16428     0.015     0.313
 SET_FILE_INFORMATION      6400     0.009     0.520
 QUERY_FILE_INFORMATION   17865     0.003     0.089
 QUERY_PATH_INFORMATION   47060     0.078   416.299
 QUERY_FS_INFORMATION      7024     0.004     0.032
 NTCreateX                55921     0.854  1141.452

Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms

The test was done on a raid1 device built from two rotational disks.

V5: V4 is more complicated than the version with a memory pool, so revert to the memory-pool version.

V4: use the address of fbio for hashing to choose a free flush info.

V3: Shaohua suggests a mempool is overkill. In v3 the memory is allocated while creating the raid device, and a simple bitmap records which resource is free. Also fix a bug from v2: flush_pending should be set to 1 at first.

V2: Neil pointed out two problems, one a counting error and one the return value when allocating memory fails.
1. Counting error: it is only safe to call rdev_dec_pending() on rdevs for which atomic_inc(&rdev->nr_pending) was previously called; if an rdev was added to the list between the start and end of the flush, this would do something bad. bio_chain is no longer used; a dedicated callback function is used for each flush bio instead.
2. Returning an IO error when kmalloc fails is wrong; v2 uses the mempool suggested by Neil.
3. Fixed some places pointed out by Guoqing.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- May 18, 2018: 1 commit
-
-
By Yufen Yu

We met a NULL pointer BUG as follows:

[ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[ 151.762039] Oops: 0000 [#1] SMP PTI
[ 151.762406] Modules linked in:
[ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[ 151.771272] Call Trace:
[ 151.771542]  md_ioctl+0x1df2/0x1e10
[ 151.771906]  ? __switch_to+0x129/0x440
[ 151.772295]  ? __schedule+0x244/0x850
[ 151.772672]  blkdev_ioctl+0x4bd/0x970
[ 151.773048]  block_ioctl+0x39/0x40
[ 151.773402]  do_vfs_ioctl+0xa4/0x610
[ 151.773770]  ? dput.part.23+0x87/0x100
[ 151.774151]  ksys_ioctl+0x70/0x80
[ 151.774493]  __x64_sys_ioctl+0x16/0x20
[ 151.774877]  do_syscall_64+0x5b/0x180
[ 151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

For raid6, when two disks of the array are offline, two spare disks can be added to the array. Before the spare disks' recovery completes, the system reboots and mdadm thinks it is OK to restart the degraded array via md_ioctl(). Since the disks in the raid6 are not only_parity(), raid5_run() will abort when there is no PPL feature and the 'start_dirty_degraded' parameter is not set; therefore mddev->pers is NULL. However, mddev->raid_disks has been set and is not cleared when raid5_run() aborts. md_ioctl() can then execute the 'HOT_REMOVE_DISK' command issued by mdadm to remove a disk, which finally causes a NULL pointer dereference in remove_and_add_spares().

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- May 2, 2018: 1 commit
-
-
By NeilBrown

If "re-add" is written to the "state" file for a device which is faulty, this has an effect similar to removing and re-adding the device. It should take up the same slot in the array that it previously had, and an accelerated (e.g. bitmap-based) rebuild should happen.

The slot that "it previously had" is determined by rdev->saved_raid_disk. However, this is not set when a device fails (only when a device is added), and it is cleared when resync completes. This means that "re-add" will normally work once, but may not work a second time.

This patch includes two fixes:
1/ when a device fails, record the ->raid_disk value in ->saved_raid_disk before clearing ->raid_disk;
2/ when "re-add" is written to a device for which ->saved_raid_disk is not set, fail.

I think this is suitable for stable, as the bug can cause re-adding a device to be forced into a full resync, which takes a lot longer and so puts data at more risk.

Cc: <stable@vger.kernel.org> (v4.1)
Fixes: 97f6cd39 ("md-cluster: re-add capabilities")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- April 9, 2018: 1 commit
-
-
By Guoqing Jiang

A device could become faulty while the clustered array is handling a METADATA_UPDATED msg, so we don't need to call read_rdev for this device.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-