Commits · a3dfbdaadba2612faf11f025b8156c36e3700247 · openeuler / Kernel

01 Nov, 2015 32 commits

MD: kick out journal disk if it's not fresh · a3dfbdaa

Committed by Song Liu on Oct 08, 2015

When the journal disk is faulty and we are reassembling the raid array,
the journal disk is old. We don't allow the journal disk to be added to
the raid array. Since the journal disk is missing from the array, raid5
will mark the array readonly.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
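The behavior described in this commit can be modeled roughly as follows. This is only a sketch: the `Disk` class, the `assemble` function, and the rule "a journal is stale when its event counter is lower than the newest data disk's counter" are illustrative assumptions, not the MD implementation.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    events: int            # superblock event counter (simplified model)
    is_journal: bool = False


def assemble(disks, expects_journal):
    """Return (accepted disk names, readonly flag) for a toy array."""
    # Assumption: freshness is judged against the newest non-journal disk.
    newest = max(d.events for d in disks if not d.is_journal)
    accepted = []
    journal_present = False
    for d in disks:
        if d.is_journal and d.events < newest:
            continue  # stale journal: do not add it to the array
        if d.is_journal:
            journal_present = True
        accepted.append(d.name)
    # An array that expects a journal but has none is started readonly.
    readonly = expects_journal and not journal_present
    return accepted, readonly


disks = [Disk("sda", 42), Disk("sdb", 42), Disk("sdj", 40, is_journal=True)]
accepted, readonly = assemble(disks, expects_journal=True)
```

Here the hypothetical journal `sdj` lags behind the data disks, so it is dropped and the array comes up readonly.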

raid5-cache: start raid5 readonly if journal is missing · 7dde2ad3

Committed by Shaohua Li on Oct 08, 2015

If a raid array is expected to have a journal (e.g., the journal bit is
set in the MD superblock feature map) and the array is started without a
journal disk, start the array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

MD: add new bit to indicate raid array with journal · a97b7896

Committed by Song Liu on Oct 08, 2015

If a raid array has the journal feature bit set, add a new bit to
indicate this. If the array is started without a journal disk present,
we know something is wrong.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: IO error handling · 6e74a9cf

Committed by Shaohua Li on Oct 08, 2015

There are 3 places where raid5-cache dispatches IO. A discard IO error
doesn't matter, so we ignore it. A superblock write IO error can be
handled in MD core. The remaining ones are log write and flush. When
such an IO error happens, we mark the log disk faulty and fail all write
IO. Read IO is still allowed to run. Userspace gets a notification too,
and the corresponding daemon can choose, for example, to set the raid
array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: journal disk can't be removed · c2bb6242

Committed by Shaohua Li on Oct 08, 2015

raid5-cache uses the journal disk's rdev->bdev and rdev->mddev in
several places. Don't allow the journal disk to disappear magically. On
the other hand, we do need to update the superblock of the other disks
to bump up ->events, so that next time the journal disk will be
identified as stale.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: add trim support for log · 4b482044

Committed by Shaohua Li on Oct 08, 2015

Since the superblock is updated infrequently, we do a simple trim of the
log disk (a synchronous trim).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

MD: fix info output for journal disk · 9efdca16

Committed by Shaohua Li on Oct 12, 2015

A journal disk can be faulty. The Journal and Faulty states aren't
mutually exclusive.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: use bio chaining · 6143e2ce

Committed by Christoph Hellwig on Oct 05, 2015

Simplify the bio completion handler by using bio chaining and submitting
bios as soon as they are full.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: small log->seq cleanup · 2b8ef16e

Committed by Christoph Hellwig on Oct 05, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: new helper: r5_reserve_log_entry · c1b99198

Committed by Christoph Hellwig on Oct 05, 2015

Factor out code to reserve log space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta · 51039cd0

Committed by Christoph Hellwig on Oct 05, 2015

This is the only user, and keeping all code initializing the io_unit
structure together improves readability.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: take rdev->data_offset into account early on · 1e932a37

Committed by Christoph Hellwig on Oct 05, 2015

Set up bi_sector properly when we allocate a bio instead of updating it
at submission time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: refactor bio allocation · b349feb3

Committed by Christoph Hellwig on Oct 05, 2015

Split out a helper to allocate a bio for log writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: clean up r5l_get_meta · 22581f58

Committed by Christoph Hellwig on Oct 05, 2015

Remove the only partially used local 'io' variable to simplify the code
flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: simplify state machine when caches flushes are not needed · 56fef7c6

Committed by Christoph Hellwig on Oct 05, 2015

For devices without a volatile write cache we don't need to send a FLUSH
command to ensure writes are stable on disk, and thus can avoid the whole
step of batching up bios for processing by the MD thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: factor out a helper to run all stripes for an I/O unit · d8858f43

Committed by Christoph Hellwig on Oct 05, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: rename flushed_ios to finished_ios · 04732f74

Committed by Christoph Hellwig on Oct 05, 2015

After this series we won't necessarily have flushed the cache for these
I/Os, so give the list a more neutral name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: free I/O units earlier · 17036461

Committed by Christoph Hellwig on Oct 05, 2015

There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array.  The only information
we need is the log sequence number, and the checkpoint offset of the
highest successful writeback.  Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.

Besides simplifying the code, this also avoids having to keep the
allocation for the I/O unit around for a potentially long time, as
superblock updates that checkpoint the log do not happen very often.

This also fixes the previously incorrect calculation of 'free' in
r5l_do_reclaim as a side effect: previously it took the last unit, which
isn't checkpointed, into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: move reclaim stop to quiesce · e6c033f7

Committed by Shaohua Li on Oct 04, 2015

Move reclaim stop to quiesce handling, which is a safer place for it.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md: show journal for journal disk in disk state sysfs · ac6096e9

Committed by Shaohua Li on Oct 04, 2015

The disk state sysfs entry of a journal disk should indicate that it is
a journal.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

skip match_mddev_units check for special roles · 0b020e85

Committed by Song Liu on Sep 03, 2015

match_mddev_units is used to check whether 2 RAID arrays share the
same disk(s). Arrays that share disk(s) will not do resync at the
same time, for better performance (fewer HDD seeks). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not participate in resync.

In this patch, match_mddev_units skips the check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
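The skip rule in this commit can be illustrated with a small model. It is a sketch under stated assumptions: the `Rdev` class and both function names are hypothetical, and a plain string stands in for the underlying physical disk that the kernel would compare.

```python
from dataclasses import dataclass


@dataclass
class Rdev:
    device: str            # name of the underlying physical disk (assumed model)
    raid_disk: int         # slot in the array; negative means no active slot
    faulty: bool = False
    journal: bool = False


def takes_part_in_resync(rdev):
    """Faulty, Journal, and slot-less (e.g. spare) disks are skipped."""
    return not (rdev.faulty or rdev.journal or rdev.raid_disk < 0)


def arrays_share_disk(array1, array2):
    """True if the two arrays share a physical disk that is used for resync."""
    devs1 = {r.device for r in array1 if takes_part_in_resync(r)}
    return any(r.device in devs1 for r in array2 if takes_part_in_resync(r))


# Two arrays whose journals sit on the same (hypothetical) SSD "sdj".
md0 = [Rdev("sda", 0), Rdev("sdb", 1), Rdev("sdj", -1, journal=True)]
md1 = [Rdev("sdc", 0), Rdev("sdd", 1), Rdev("sdj", -1, journal=True)]
shared = arrays_share_disk(md0, md1)
```

With the skip in place, the shared journal device no longer makes the two arrays count as sharing a disk, so in this model they would be free to resync concurrently.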

raid5-cache: don't delay stripe captured in log · 253f9fd4

Committed by Shaohua Li on Sep 04, 2015

There is a case where a stripe gets delayed forever:
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets the DELAYED bit
set, since construction can't run for the new bio (the stripe has been
locked since step 1)

Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked, and the stripe will restart and run construction
for the new bio. With log, ops_run_io needs to run twice. If the
DELAYED bit is set, the stripe can't enter the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: check stripe finish out of order · 85f2f9a4

Committed by Shaohua Li on Sep 04, 2015

Stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave the
stripe_end_ios list empty.

This applies on top of http://marc.info/?l=linux-raid&m=144122700510667

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md: skip resync for raid array with journal · bd18f646

Committed by Shaohua Li on Sep 02, 2015

If a raid array has a journal, the journal can guarantee consistency,
so we can skip resync after an unclean shutdown. The exceptions are raid
creation and user-initiated resync, for which we still do a raid resync.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: optimize FLUSH IO with log enabled · 828cbe98

Committed by Shaohua Li on Sep 02, 2015

With log enabled, a bio is written to the raid disks after it has
settled down in the log disk. Recovery guarantees that we can recover
the bio data from the log disk, so we skip FLUSH IO.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: move functionality out of __r5l_set_io_unit_state · 509ffec7

Committed by Christoph Hellwig on Sep 02, 2015

Just keep __r5l_set_io_unit_state as a small set-the-state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: fix a user-after-free bug · 0fd22b45

Committed by Shaohua Li on Sep 02, 2015

r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption that only the reclaimer can free an io_unit. We could add
reference-count-based io_unit freeing, but since only reclaim can wait
for an io_unit to reach the STRIPE_END state, we use a simple global
wait queue here.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: switching to state machine for log disk cache flush · a8c34f91

Committed by Shaohua Li on Sep 02, 2015

Before we write stripe data to the raid disks, we must guarantee that
the stripe data has settled down in the log disk. To do this, we flush
the log disk cache and wait for the flush to finish. That wait
introduces sleep time in the raid5d thread and impacts performance. This
patch moves the log disk cache flush process into the stripe handling
state machine, which removes the wait in raid5d.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: enable log for raid array with cache disk · 5c7e81c3

Committed by Shaohua Li on Aug 13, 2015

Now the log is safe to enable for a raid array with a cache disk.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: don't allow resize/reshape with cache(log) support · 713cf5a6

Committed by Shaohua Li on Aug 13, 2015

If cache(log) support is enabled, don't allow resize/reshape at the
current stage. In the future, we can flush all data from cache(log) to
raid before resize/reshape and then allow resize/reshape.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: disable batch with log enabled · 9c3e333d

Committed by Shaohua Li on Aug 13, 2015

With log enabled, r5l_write_stripe will add the stripe to the log. With
batch, several stripes are linked together, and the stripes must be in
the same state. With log, however, the log/reclaim unit is a stripe, so
we can't guarantee that the several stripes are in the same state.
Disable batch when log is enabled for now.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5-cache: use crc32c checksum · 5cb2fbd6

Committed by Shaohua Li on Oct 28, 2015

crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in the first post, sorry. This changes the disk format, but that
is still ok at the current stage.

V2: delete unnecessary type conversion as pointed out by Bart

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
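For readers unfamiliar with the checksum named in this commit, here is a bitwise reference sketch of CRC-32C (the Castagnoli polynomial) in its common standalone form, with an all-ones initial value and final XOR. It is not the kernel helper, which takes an explicit seed and may follow different conventions, and it shows none of the CPU acceleration the commit message refers to.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected Castagnoli polynomial


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: init 0xFFFFFFFF, reflected, final XOR 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell off.
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


# Standard check value for the ASCII string "123456789".
check = crc32c(b"123456789")  # 0xE3069283
```

A table-driven or hardware-assisted version computes the same value much faster; the bit-by-bit loop is used here only because it is easy to verify.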

24 Oct, 2015 8 commits

raid5: log recovery · 355810d1

Committed by Shaohua Li on Aug 13, 2015

This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recovered.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees that a stripe which starts to be flushed to raid
disks has complete data/parity and a correct checksum. To recover a
stripe, we just copy its data/parity to the corresponding raid disks.

The tricky thing is the superblock update after recovery. We can't let
the superblock point to the last valid meta block. The log might look
like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid, and meta 3 could be valid. Suppose
the superblock points to meta 1 and we write a new valid meta 2n. If a
crash happens again, the new recovery will start from meta 1. Since
meta 2n is valid, recovery will think meta 3 is valid, which is wrong.
The solution is to create a new meta in meta 2 with its seq == meta 1's
seq + 10 and let the superblock point to meta 2. Recovery will not think
meta 3 is a valid meta, because its seq is wrong.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
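The sequence-number argument in this commit can be checked with a toy scan. This is a sketch only: the log is modeled as a flat list of meta blocks (a real log is circular and holds data/parity between metas), `None` stands for a meta whose checksum fails, and the `scan` function and the concrete seq values are illustrative assumptions.

```python
def scan(log, start, seq):
    """Return the indices of meta blocks that recovery would accept.

    Scanning stops at the first meta that is invalid or whose sequence
    number is not the expected next one.
    """
    accepted = []
    pos = start
    while pos < len(log):
        meta = log[pos]
        if meta is None or meta["seq"] != seq:
            break
        accepted.append(pos)
        pos += 1
        seq += 1
    return accepted


# After a crash: meta 1 valid (seq 5), meta 2 broken, meta 3 stale but intact.
crashed = [{"seq": 5}, None, {"seq": 7}]

# Naive fix: overwrite meta 2 with seq 6, superblock still points to meta 1.
naive = [{"seq": 5}, {"seq": 6}, {"seq": 7}]

# Fix from the commit: new meta 2 gets seq 5 + 10, superblock points to it.
fixed = [{"seq": 5}, {"seq": 15}, {"seq": 7}]

first_recovery = scan(crashed, start=0, seq=5)    # [0]
naive_recovery = scan(naive, start=0, seq=5)      # [0, 1, 2]
fixed_recovery = scan(fixed, start=1, seq=15)     # [1]
```

In the naive case the stale meta 3 lines up with the expected sequence and is wrongly accepted; with the jump of 10 its seq no longer matches, so the scan stops at the new meta.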

raid5: log reclaim support · 0576b1c6

Committed by Shaohua Li on Aug 13, 2015

This is the reclaim support for the raid5 log. A stripe write has the
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appended to the log disk
3. flush log disk cache
4. ops_run_io runs again and does the normal operation. stripe
data/parity is written to the raid array disks. raid core can return io
to the upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes make up an io_unit and we will batch
several io_units in different steps, but the whole process doesn't
change.

It's possible to return IO just after data/parity hits the log disk, but
then read IO would need to read from the log disk. For simplicity, IO
return happens at step 4, where read IO can directly read from the raid
disks.

Currently reclaim runs if there is a specific amount of reclaimable
space (1/4 disk size or 10G) or we are out of space. Reclaim just frees
log disk space; it doesn't impact data consistency. The size-based
forced reclaim makes sure the log isn't too big, so recovery doesn't
have to scan too much log.

Recovery makes sure the raid disks and the log disk have the same data
for a stripe. If a crash happens before step 4, recovery might or might
not recover the stripe's data/parity, depending on whether the
data/parity and its checksum match. In either case, this doesn't change
the semantics of an IO write. After step 3, the stripe is guaranteed to
be recoverable, because the stripe's data/parity is persistent in the
log disk. In some cases, the log disk content and the raid disk content
of a stripe are the same, but recovery will still copy the log disk
content to the raid disks; this doesn't impact data consistency. Space
reuse happens after superblock update and cache flush.

There is one situation we want to avoid. A broken meta in the middle of
the log means recovery can't find the meta at the head of the log. If an
operation requires the meta at the head to be persistent in the log, we
must make sure the metas before it are persistent in the log too. The
case in question is when stripe data/parity is in the log and we start
writing the stripe to the raid disks (before step 4): stripe data/parity
must be persistent in the log before we do the write to the raid disks.
The solution is to strictly maintain the io_unit list order. In this
case, we only write the stripes of an io_unit to the raid disks once
that io_unit is the first one whose data/parity is in the log.

The io_unit list order is important for other cases too. For example,
some io_units are reclaimable and others are not. They can be mixed in
the list, and we shouldn't reuse the space of an unreclaimable io_unit.

Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
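The reclaim trigger described in this commit ("1/4 disk size or 10G", or out of space) might be modeled as below. The sketch makes several assumptions that the commit message does not spell out: sizes are counted in 512-byte sectors, "10G" is read as 10 GiB, and the threshold is taken as the smaller of the two values. The function and constant names are hypothetical.

```python
SECTOR_SIZE = 512
# Assumption: "10G" means 10 GiB, expressed here in sectors.
MAX_FREE_SPACE_SECTORS = 10 * 1024**3 // SECTOR_SIZE


def reclaim_threshold(log_size_sectors):
    """Reclaimable space that triggers reclaim: min(1/4 of log, 10 GiB)."""
    return min(log_size_sectors // 4, MAX_FREE_SPACE_SECTORS)


def should_reclaim(log_size_sectors, reclaimable_sectors, out_of_space=False):
    """Reclaim when the log is full or enough space has become reclaimable."""
    if out_of_space:
        return True
    return reclaimable_sectors >= reclaim_threshold(log_size_sectors)


small_log = 8 * 1024**3 // SECTOR_SIZE     # 8 GiB log device
large_log = 100 * 1024**3 // SECTOR_SIZE   # 100 GiB log device
small_threshold = reclaim_threshold(small_log)   # 2 GiB in sectors
large_threshold = reclaim_threshold(large_log)   # capped at 10 GiB in sectors
```

Capping the threshold keeps the amount of log that recovery has to scan bounded even when the log device is very large, which matches the motivation given in the commit message.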

raid5: add basic stripe log · f6bed0ef

Committed by Shaohua Li on Aug 13, 2015

This introduces a simple log for raid5. Data/parity written to the raid
array is first written to the log, then written to the raid array disks.
If a crash happens, we can recover data from the log. This can speed up
raid resync and fix the write hole issue.

The log structure is pretty simple. Data/meta data is stored in block
units, which are generally 4k. There is only one type of meta data
block. The meta data block can track 3 types of data: stripe data,
stripe parity and flush block. The MD superblock will point to the last
valid meta data block. Each meta data block has a checksum/seq number,
so recovery can scan the log correctly. We store a checksum of the
stripe data/parity in the meta data block, so meta data and stripe
data/parity can be written to the log disk together. Otherwise, the meta
data write would have to wait until the stripe data/parity write is
finished.

For stripe data, the meta data block will record the stripe data sector
and size. Currently the size is always 4k. This meta data record could
be made simpler if we just fixed the write hole (e.g., we could record
the data of a stripe's different disks together), but this format can be
extended to support caching in the future, which must record data
address/size.

For stripe parity, the meta data block will record the stripe sector.
Its size should be 4k (for raid5) or 8k (for raid6). We always store the
p parity first. This format should work for caching too.

A flush block indicates that a stripe is in the raid array disks. Fixing
the write hole doesn't need this type of meta data; it's for the caching
extension.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: add a new state for stripe log handling · b70abcb2

Committed by Shaohua Li on Aug 13, 2015

When a stripe finishes construction, we normally write the stripe to
raid in ops_run_io. With log, we do a bunch of other operations before
the stripe is written to raid, mainly writing the stripe to the log
disk, flushing the disk cache and so on. The operations are still driven
by raid5d and run in the stripe state machine. We introduce a new state
for such a stripe (trapped into log). The stripe is in this state from
the time it first enters ops_run_io (finishes construction) to the time
it is written to raid. Since we know the state is only for log, we
bypass other checks/operations in handle_stripe.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

raid5: export some functions · 6d036f7d

Committed by Shaohua Li on Aug 13, 2015

The next several patches use some raid5 functions; rename them with a
raid5 prefix and export them.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md: override md superblock recovery_offset for journal device · 3069aa8d

Committed by Shaohua Li on Aug 13, 2015

The journal device stores data in a log structure. We need to record the
log start. Here we override the md superblock recovery_offset for this
purpose. This field of a journal device is meaningless otherwise.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

MD: add a new disk role to present write journal device · bac624f3

Committed by Song Liu on Aug 13, 2015

The next patches will use a disk for raid5/6 journaling. We need a new
disk role to represent the journal device, and we add MD_FEATURE_JOURNAL
to feature_map for backward compatibility.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

MD: replace special disk roles with macros · c4d4c91b

Committed by Song Liu on Aug 13, 2015

Add the following two macros for special roles: spare and faulty

MD_DISK_ROLE_SPARE	0xffff
MD_DISK_ROLE_FAULTY	0xfffe

Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
and minimal value of special role.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
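As an illustration of how the role values from this commit partition the 16-bit role space, the sketch below classifies a role number. The three constants are taken from the commit message; the `classify_role` helper is hypothetical, and treating the boundary value 0xff00 itself as special is an assumption, since the message describes it as both the maximal regular role and the minimal special one.

```python
MD_DISK_ROLE_SPARE = 0xFFFF
MD_DISK_ROLE_FAULTY = 0xFFFE
MD_DISK_ROLE_MAX = 0xFF00


def classify_role(role: int) -> str:
    """Map a 16-bit disk role value to a human-readable description."""
    if role == MD_DISK_ROLE_SPARE:
        return "spare"
    if role == MD_DISK_ROLE_FAULTY:
        return "faulty"
    # Assumption: values at or above MD_DISK_ROLE_MAX are reserved for
    # special roles, including ones this sketch does not know about.
    if role >= MD_DISK_ROLE_MAX:
        return "special (unrecognized)"
    return f"active slot {role}"


roles = [classify_role(r) for r in (0, 3, 0xFFFE, 0xFFFF)]
```

Replacing the bare numbers with named constants keeps comparisons like these readable and leaves room below 0xfffe for further special roles.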

openeuler / Kernel · synced successfully more than 1 year ago