- 01 Sep, 2016 1 commit
-
-
Committed by Shaohua Li
There is a potential deadlock in the superblock write. Discard could zero data, so before a discard we must make sure the superblock is updated to the new log tail. Updating the superblock (either by calling md_update_sb() directly or by relying on the md thread) requires holding the reconfig mutex. On the other hand, raid5_quiesce() is called with reconfig_mutex held. The first step of raid5_quiesce() is waiting for all IO to finish, and hence waiting for the reclaim thread, while the reclaim thread is calling this function and waiting for the reconfig mutex. So there is a deadlock. We work around this issue with a trylock. The downside of the solution is that we could miss a discard if we can't take the reconfig mutex. But this should happen rarely (mainly when the raid array is stopped), so a missed discard shouldn't be a big problem. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
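The trylock workaround described in this message can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct log_state`, `try_discard()` and the `atomic_flag`-based lock are hypothetical stand-ins for the md structures and for reconfig_mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the log state; not the kernel's structure. */
struct log_state {
    atomic_flag *reconfig_lock;     /* plays the role of reconfig_mutex */
    bool sb_points_to_new_tail;     /* superblock updated to new log tail */
    unsigned long discards_done;
    unsigned long discards_skipped;
};

/* Non-blocking acquire: returns true only if the lock was free. */
static bool lock_try(atomic_flag *l)
{
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

static void lock_release(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Returns true if the discard was issued, false if it was skipped. */
bool try_discard(struct log_state *log)
{
    assert(log != NULL);
    /* Never block here: the lock holder may itself be waiting for us. */
    if (!lock_try(log->reconfig_lock)) {
        log->discards_skipped++;    /* the accepted downside: a missed discard */
        return false;
    }
    log->sb_points_to_new_tail = true;  /* stand-in for the superblock update */
    lock_release(log->reconfig_lock);
    log->discards_done++;               /* stand-in for issuing the discard */
    return true;
}
```

When the lock is already held (as it would be during quiesce), `try_discard()` returns immediately instead of waiting, which is what breaks the wait cycle.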
-
- 08 Aug, 2016 1 commit
-
-
Committed by Jens Axboe
Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 08 Jun, 2016 3 commits
-
-
Committed by Mike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting that the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Mike Christie
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
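The idea of keeping an op code and flag bits in one word, with accessors instead of manual bit twiddling, can be sketched as follows. The 3-bit op width, the 32-bit word and the function names are assumptions for illustration; they are not the block layer's definitions of bio_set_op_attrs/bio_op.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: op code in the top OP_BITS bits, flags in the rest. */
#define OP_BITS   3u
#define OP_SHIFT  (32u - OP_BITS)
#define FLAG_MASK ((UINT32_C(1) << OP_SHIFT) - 1u)

/* Combine an op code and flag bits into one word. */
uint32_t pack_op_attrs(uint32_t op, uint32_t flags)
{
    assert(op < (UINT32_C(1) << OP_BITS));
    return (op << OP_SHIFT) | (flags & FLAG_MASK);
}

/* Extract the op code without disturbing the flags. */
uint32_t unpack_op(uint32_t opf)
{
    return opf >> OP_SHIFT;
}

/* Extract the flag bits without the op code. */
uint32_t unpack_flags(uint32_t opf)
{
    return opf & FLAG_MASK;
}
```

Code that ORs a raw op value into such a word would silently corrupt the flag bits, which is why going through accessors is the safer pattern.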
-
Committed by Mike Christie
This has callers of submit_bio/submit_bio_wait set bio->bi_rw instead of passing it in. This makes their usage the same as generic_make_request and consistent with how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 10 May, 2016 1 commit
-
-
Committed by Guoqing Jiang
Some code waits for a metadata update by:
1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared
If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING do so atomically; bit_clear_unless (suggested by Neil) is introduced for this purpose.
Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
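A "clear these bits unless those bits are set" operation, as named in this message, can be sketched in userspace with C11 atomics. This is an analogue under assumptions, not the kernel macro: the flag values and the helper names `request_update()` and `clear_bits_unless()` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the MD_CHANGE_* flags. */
#define CHANGE_DEVS    (1UL << 0)
#define CHANGE_CLEAN   (1UL << 1)
#define CHANGE_PENDING (1UL << 2)

/* Steps 1 and 2 in one atomic operation: flag the reason and PENDING together. */
void request_update(_Atomic unsigned long *flags, unsigned long reason)
{
    atomic_fetch_or(flags, reason | CHANGE_PENDING);
}

/*
 * Atomically clear the bits in 'clear' unless any bit in 'test' is set.
 * Returns true if the bits were cleared, false if 'test' bits blocked it.
 */
bool clear_bits_unless(_Atomic unsigned long *flags,
                       unsigned long clear, unsigned long test)
{
    unsigned long old = atomic_load(flags);

    do {
        if (old & test)
            return false;   /* a new request arrived; do not clear */
        /* on failure, 'old' is reloaded and the check is repeated */
    } while (!atomic_compare_exchange_weak(flags, &old, old & ~clear));
    return true;
}
```

Because the test and the clear happen in a single compare-and-swap, an updater can no longer observe "no update needed" and then clear the pending bit after a new request has been flagged.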
-
- 14 Apr, 2016 1 commit
-
-
Committed by Jens Axboe
Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 14 Jan, 2016 2 commits
-
-
Committed by Shaohua Li
Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Set MD_HAS_JOURNAL when an array is loaded or the journal is initialized. This is to avoid the flag being set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
- 06 Jan, 2016 6 commits
-
-
Committed by Christoph Hellwig
And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, and it allows us to get rid of the deadlock-prone GFP_NOFAIL allocation. shli: add missing mempool_destroy() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
We only have a limited number in flight, so use a page based mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
This allows us to make guaranteed forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Add support for journal disk hot add/remove. Mostly trivial checks in the md part. The raid5 part is a little tricky. For hot-remove, we can't wait for pending writes as it's called from raid5d. The wait would cause a deadlock. We simply fail the hot-remove. A hot-remove retry can succeed eventually, since if the journal disk is faulty all pending writes will be failed and finish. For hot-add, since an array supporting a journal but without a journal disk will be marked read-only, we are safe to hot add the journal without stopping IO (it should be read IO, while the journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Once the I/O has completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
- 01 Nov, 2015 22 commits
-
-
Committed by Shaohua Li
If a raid array is expected to have a journal (e.g., journal is set in the MD superblock feature map) and the array is started without a journal disk, start the array read-only. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
There are 3 places where the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining ones are log write and flush. When an IO error happens, we mark the log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too, and the corresponding daemon can choose, for example, to set the raid array read-only. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Since the superblock is updated infrequently, we do a simple trim of the log disk (a synchronous trim). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Simplify the bio completion handler by using bio chaining and submitting bios as soon as they are full. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Factor out code to reserve log space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
This is the only user, and keeping all code initializing the io_unit structure together improves readability. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Set up bi_sector properly when we allocate a bio instead of updating it at submission time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Split out a helper to allocate a bio for log writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Remove the only partially used local 'io' variable to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
For devices without a volatile write cache we don't need to send a FLUSH command to ensure writes are stable on disk, and thus can avoid the whole step of batching up bios for processing by the MD thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
After this series we won't necessarily have flushed the cache for these I/Os, so give the list a more neutral name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
There is no good reason to keep the I/O unit structures around after the stripe has been written back to the RAID array. The only information we need is the log sequence number, and the checkpoint offset of the highest successful writeback. Store those in the log structure, and free the IO units from __r5l_stripe_write_finished. Besides simplifying the code this also avoids having to keep the allocation for the I/O unit around for a potentially long time, as superblock updates that checkpoint the log do not happen very often. This also fixes the previously incorrect calculation of 'free' in r5l_do_reclaim as a side effect: previously it took the last unit which isn't checkpointed into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Move reclaim stop to quiesce handling, which is a safer place for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
There is a case where a stripe gets delayed forever:
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets the DELAYED bit set since construction can't run for the new bio (the stripe is locked since step 1)
Without the log, handle_stripe will call ops_run_io. After IO finishes, the stripe gets unlocked and the stripe will restart and run construction for the new bio. With the log, ops_run_io needs to run two times. If the DELAYED bit is set, the stripe can't enter the handle_list, so the second ops_run_io doesn't run, which leaves the stripe stalled.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Stripes could finish out of order. Hence r5l_move_io_unit_list() of __r5l_stripe_write_finished might not move any entry and leave the stripe_end_ios list empty. This applies on top of http://marc.info/?l=linux-raid&m=144122700510667 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
With the log enabled, a bio is written to raid disks after the bio is settled down in the log disk. The recovery guarantees we can recover the bio data from the log disk, so we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Christoph Hellwig
Just keep __r5l_set_io_unit_state as a small set-the-state wrapper, and remove r5l_set_io_unit_state entirely after moving the real functionality to the two callers that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
r5l_compress_stripe_end_list() can free an io_unit. This breaks the assumption that only the reclaimer can free an io_unit. We could add reference count based io_unit freeing, but since only reclaim can wait for an io_unit to reach the STRIPE_END state, we use a simple global wait queue here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
Before we write stripe data to raid disks, we must guarantee stripe data is settled down in the log disk. To do this, we flush the log disk cache and wait for the flush to finish. That wait introduces sleep time in the raid5d thread and impacts performance. This patch moves the log disk cache flush process to the stripe handling state machine, which removes the wait in raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Shaohua Li
crc32c has lower overhead with cpu acceleration. It's a shame I didn't use it in the first post, sorry. This changes the disk format, but we are still ok at the current stage. V2: delete unnecessary type conversion as pointed out by Bart Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
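For readers unfamiliar with the checksum named here, a plain software CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) can be sketched as below. This is the standard bit-at-a-time algorithm with the usual initial value and final inversion; the kernel's accelerated library may use different seed and inversion conventions, so treat this only as an illustration of the checksum itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C over 'len' bytes; slow but easy to verify. */
uint32_t crc32c_sw(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;     /* standard initial value */

    assert(data != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* shift right; XOR in the polynomial if the low bit was set */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;                    /* standard final inversion */
}
```

The well-known check value for the ASCII string "123456789" is 0xE3069283, which this routine reproduces.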
-
- 24 Oct, 2015 3 commits
-
-
Committed by Shaohua Li
This is the log recovery support. The process is quite straightforward. We scan the log and read all valid meta/data/parity into memory. If a stripe's data/parity checksum is correct, the stripe will be recovered. Otherwise, it's discarded and we don't scan the log further. The reclaim process guarantees that a stripe which starts to be flushed to raid disks has complete data/parity and a correct checksum. To recover a stripe, we just copy its data/parity to the corresponding raid disks. The tricky thing is the superblock update after recovery. We can't let the superblock point to the last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid, meta 3 could be valid. If the superblock points to meta 1, we write a new valid meta 2n. If a crash happens again, the new recovery will start from meta 1. Since meta 2n is valid, recovery will think meta 3 is valid, which is wrong. The solution is to create a new meta in meta2 with its seq == meta 1's seq + 10 and let the superblock point to meta2. Recovery will then not think meta 3 is a valid meta, because its seq is wrong. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
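The stale-block problem and the "seq + 10" fix described in this message can be illustrated with a toy scan. The sketch assumes a linear (non-circular) log and a hypothetical `struct meta`; the real recovery code reads blocks from disk and verifies checksums.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one meta block. */
struct meta {
    bool checksum_ok;
    uint64_t seq;
};

/*
 * Scan from slot 'start', expecting sequence number 'seq' there and
 * consecutive numbers afterwards. Returns how many blocks are accepted.
 */
size_t scan_log(const struct meta *log, size_t n, size_t start, uint64_t seq)
{
    size_t accepted = 0;

    assert(log != NULL);
    for (size_t i = start; i < n; i++, seq++) {
        if (!log[i].checksum_ok || log[i].seq != seq)
            break;              /* first bad block ends recovery */
        accepted++;
    }
    return accepted;
}
```

With slots holding seq 1 (valid), 2 (invalid) and 3 (valid), a scan from slot 0 accepts one block. If slot 1 is later rewritten as a valid block with seq 2, a rescan from slot 0 wrongly accepts the stale slot 2 as well. Rewriting slot 1 with seq 11 and starting the scan there accepts only that block, because slot 2 does not carry seq 12.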
-
Committed by Shaohua Li
This is the reclaim support for the raid5 log. A stripe write will have the following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appended to the log disk
3. flush the log disk cache
4. ops_run_io runs again and does the normal operation. stripe data/parity is written to raid array disks. raid core can return io to the upper layer.
5. flush the cache of all raid array disks
6. update the super block
7. log disk space used by the stripe can be reused
In practice, several stripes make up an io_unit and we will batch several io_units in different steps, but the whole process doesn't change. It's possible to return io just after data/parity hits the log disk, but then read IO would need to read from the log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim runs if there is a specific amount of reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim just frees log disk space; it doesn't impact data consistency. The size based forced reclaim is to make sure the log isn't too big, so recovery doesn't scan the log too much. Recovery makes sure raid disks and the log disk have the same data of a stripe. If a crash happens before step 4, recovery might or might not recover the stripe's data/parity depending on whether data/parity and its checksum match. In either case, this doesn't change the semantics of an IO write. After step 3, the stripe is guaranteed recoverable, because the stripe's data/parity is persistent in the log disk. In some cases, log disk content and raid disk content of a stripe are the same, but recovery will still copy log disk content to raid disks; this doesn't impact data consistency. Space reuse happens after the superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log means recovery can't find the meta at the head of the log.
If operations require the meta at the head to be persistent in the log, we must make sure the meta before it is persistent in the log too. The case is that stripe data/parity is in the log and we start writing the stripe to raid disks (before step 4). Stripe data/parity must be persistent in the log before we do the write to raid disks. The solution is that we strictly maintain the io_unit list order. In this case, we only write stripes of an io_unit to raid disks once the io_unit is the first one whose data/parity is in the log. The io_unit list order is important for other cases too. For example, some io_units are reclaimable and others are not. They can be mixed in the list, and we shouldn't reuse the space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
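The ordering rule in this message (an io_unit's stripes go to the raid disks only when every earlier io_unit is already in the log) amounts to taking the longest "in log" prefix of the ordered list. A minimal sketch, with a plain array of booleans standing in for the io_unit list and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * 'in_log[i]' says whether io_unit i (in list order) has its data/parity
 * persistent in the log. Returns how many leading io_units may have
 * their stripes written to the raid disks.
 */
size_t units_ready_for_raid(const bool *in_log, size_t n)
{
    size_t ready = 0;

    assert(in_log != NULL || n == 0);
    while (ready < n && in_log[ready])
        ready++;                /* stop at the first unit not yet in the log */
    return ready;
}
```

A later io_unit that finished its log write early still has to wait: for the states {in log, in log, not yet, in log} only the first two units are eligible, so a broken meta can never sit in front of a unit whose stripes already reached the raid disks.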
-
Committed by Shaohua Li
This introduces a simple log for raid5. Data/parity written to the raid array is first written to the log, then written to the raid array disks. If a crash happens, we can recover data from the log. This can speed up raid resync and fix the write hole issue. The log structure is pretty simple. Data/meta data is stored in block units, which are generally 4k. It has only one type of meta data block. The meta data block can track 3 types of data: stripe data, stripe parity and flush block. The MD superblock will point to the last valid meta data block. Each meta data block has a checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity in the meta data block, so meta data and stripe data/parity can be written to the log disk together. Otherwise, the meta data write must wait till the stripe data/parity write is finished. For stripe data, the meta data block will record the stripe data sector and size. Currently the size is always 4k. This meta data record could be made simpler if we just fix the write hole (e.g., we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, the meta data block will record the stripe sector. Its size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. The flush block indicates a stripe is in the raid array disks. Fixing the write hole doesn't need this type of meta data; it's for the caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
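The record types described in this message could be modelled roughly as below. Every field name, width and the struct layout are assumptions made for illustration; this is not the on-disk format defined by the kernel headers. Only the facts stated in the message are encoded: three payload types, a checksum and sequence number per meta block, 4k blocks, and 4k or 8k of parity with p stored first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_BLOCK_SIZE 4096u    /* "4k generally", per the message */

/* The three kinds of payload a meta data block can track. */
enum payload_type { PAYLOAD_DATA, PAYLOAD_PARITY, PAYLOAD_FLUSH };

/* Hypothetical per-payload record inside a meta data block. */
struct payload_record {
    uint16_t type;          /* one of enum payload_type */
    uint64_t sector;        /* stripe data sector or stripe sector */
    uint32_t size;          /* bytes of payload following in the log */
    uint32_t checksum[2];   /* data: [0]; parity: p in [0], q in [1] */
};

/* Hypothetical meta data block header. */
struct meta_block_header {
    uint32_t checksum;      /* checksum of the meta data block itself */
    uint64_t seq;           /* sequence number used by the recovery scan */
};

/* Parity payload size: one block for raid5 (p), two for raid6 (p then q). */
size_t parity_bytes(int raid_level)
{
    assert(raid_level == 5 || raid_level == 6);
    return (raid_level == 6 ? 2u : 1u) * (size_t)LOG_BLOCK_SIZE;
}
```

Keeping the payload checksums inside the meta record is what lets meta data and stripe data/parity be submitted together: recovery can tell from the checksums alone whether the payload behind a valid meta block actually reached the disk.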
-