提交 · f45958756fef552436e4a63029a168495920026e · openanolis / cloud-kernel

26 3月, 2017 1 次提交

block: remove bio_clone_bioset_partial() · f4595875

由 Shaohua Li 提交于 3月 24, 2017

commit c18a1e09(block: introduce bio_clone_bioset_partial()) introduced
bio_clone_bioset_partial() for raid1 write behind IO. Now the write behind is
rewritten by Ming. We don't need the API any more, so revert the commit.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

f4595875

25 3月, 2017 14 次提交

md: raid10: avoid direct access to bvec table in handle_reshape_read_error · 2d06e3b7

由 Ming Lei 提交于 3月 17, 2017

All reshape I/O share pages from 1st copy device, so just use that pages
for avoiding direct access to bvec table in handle_reshape_read_error.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2d06e3b7

md: raid10: retrieve page from preallocated resync page array · cdb76be3

由 Ming Lei 提交于 3月 17, 2017

Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

cdb76be3

md: raid10: don't use bio's vec table to manage resync pages · f0250618

由 Ming Lei 提交于 3月 17, 2017

Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * copies
bytes per r10_bio, and it is fine because the inflight r10_bio for
resync shouldn't be much, as pointed by Shaohua.

Also bio_reset() in raid10_sync_request() and reshape_request()
are removed because all bios are freshly new now in these functions
and not necessary to reset any more.

This patch can be thought as cleanup too.
Suggested-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

f0250618

md: raid10: refactor code of read reshape's .bi_end_io · 81fa1520

由 Ming Lei 提交于 3月 17, 2017

reshape read request is a bit special and requires one extra
bio which isn't allocated from r10buf_pool.

Refactor the .bi_end_io for read reshape, so that we can use
raid10's resync page mangement approach easily in the following
patches.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

81fa1520

md: raid1: improve write behind · 841c1316

由 Ming Lei 提交于 3月 17, 2017

This patch improve handling of write behind in the following ways:

- introduce behind master bio to hold all write behind pages
- fast clone bios from behind master bio
- avoid to change bvec table directly
- use bio_copy_data() and make code more clean
Suggested-by: NShaohua Li <shli@fb.com>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

841c1316

md: raid1: move 'offset' out of loop · d8c84c4f

由 Ming Lei 提交于 3月 17, 2017

The 'offset' local variable can't be changed inside the loop, so
move it out.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d8c84c4f

block: introduce bio_copy_data_partial · 6f880285

由 Ming Lei 提交于 3月 17, 2017

Turns out we can use bio_copy_data in raid1's write behind,
and we can make alloc_behind_pages() more clean/efficient,
but we need to partial version of bio_copy_data().
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NJens Axboe <axboe@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

6f880285

md: raid1: use bio helper in process_checks() · 60928a91

由 Ming Lei 提交于 3月 17, 2017

Avoid to direct access to bvec table.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

60928a91

md: raid1: retrieve page from pre-allocated resync page array · 44cf0f4d

由 Ming Lei 提交于 3月 17, 2017

Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

44cf0f4d

md: raid1: don't use bio's vec table to manage resync pages · 98d30c58

由 Ming Lei 提交于 3月 17, 2017

Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * raid_disks
bytes per r1_bio, and it is fine because the inflight r1_bio for
resync shouldn't be much, as pointed by Shaohua.

Also the bio_reset() in raid1_sync_request() is removed because
all bios are freshly new now and not necessary to reset any more.

This patch can be thought as a cleanup too
Suggested-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

98d30c58

md: raid1: simplify r1buf_pool_free() · a7234234

由 Ming Lei 提交于 3月 17, 2017

This patch gets each page's reference of each bio for resync,
then r1buf_pool_free() gets simplified a lot.

The same policy has been taken in raid10's buf pool allocation/free
too.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a7234234

md: prepare for managing resync I/O pages in clean way · 513e2faa

由 Ming Lei 提交于 3月 17, 2017

Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.

So introduce helpers and new data structure for
managing resync I/O pages more cleanly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

513e2faa

md: move two macros into md.h · d8e29fbc

由 Ming Lei 提交于 3月 17, 2017

Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d8e29fbc

md: raid1/raid10: don't handle failure of bio_add_page() · c85ba149

由 Ming Lei 提交于 3月 17, 2017

All bio_add_page() is for adding one page into resync bio,
which is big enough to hold RESYNC_PAGES pages, and
the current bio_add_page() doesn't check queue limit any more,
so it won't fail at all.

remove unused label (shaohua)
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

c85ba149

24 3月, 2017 3 次提交

Z
md: fix several trivial typos in comments · 3560741e
由 Zhilong Liu 提交于 3月 15, 2017
```
Signed-off-by: NZhilong Liu <zlliu@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>
```
3560741e

md/raid10: refactor some codes from raid10_write_request · 27f26a0f

由 Guoqing Jiang 提交于 3月 20, 2017

Previously, we clone both bio and repl_bio in raid10_write_request,
then add the cloned bio to plug->pending or conf->pending_bio_list
based on plug or not, and most of the logics are same for the two
conditions.

So introduce raid10_write_one_disk for it, and use replacement parameter
to distinguish the difference. No functional changes in the patch.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

27f26a0f

raid5-ppl: silence a misleading warning message · 0b408baf

由 Dan Carpenter 提交于 3月 21, 2017

The "need_cache_flush" variable is never set to false.  When the
variable is true that means we print a warning message at the end of
the function.

Fixes: 3418d036 ("raid5-ppl: Partial Parity Log write logging implementation")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

0b408baf

23 3月, 2017 14 次提交

MD: use per-cpu counter for writes_pending · 4ad23a97

由 NeilBrown 提交于 3月 15, 2017

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed.  As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero.  md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode.  If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change.  We leverage that further by always rounding off the
jiffies to the timeout value.  This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set.  That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4ad23a97

percpu-refcount: support synchronous switch to atomic mode. · 210f7cdc

由 NeilBrown 提交于 3月 15, 2017

percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then
waits for it to complete.

Also export percpu_ref_switch_to_* so they can be used from modules.

This will be used in md/raid to count the number of pending write
requests to an array.
We occasionally need to check if the count is zero, but most often
we don't care.
We always want updates to the counter to be fast, as in some cases
we count every 4K page.
Signed-off-by: NNeilBrown <neilb@suse.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NShaohua Li <shli@fb.com>

210f7cdc

md: close a race with setting mddev->in_sync · 55cc39f3

由 NeilBrown 提交于 3月 15, 2017

If ->in_sync is being set just as md_write_start() is being called,
it is possible that set_in_sync() won't see the elevated
->writes_pending, and md_write_start() won't see the set ->in_sync.

To close this race, re-test ->writes_pending after setting ->in_sync,
and add memory barriers to ensure the increment of ->writes_pending
will be seen by the time of this second test, or the new ->in_sync
will be seen by md_write_start().

Add a spinlock to array_state_show() to ensure this temporary
instability is never visible from userspace.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

55cc39f3

md: factor out set_in_sync() · 6497709b

由 NeilBrown 提交于 3月 15, 2017

Three separate places in md.c check if the number of active
writes is zero and, if so, sets mddev->in_sync.

There are a few differences, but there shouldn't be:
- it is always appropriate to notify the change in
  sysfs_state, and there is no need to do this outside a
  spin-locked region.
- we never need to check ->recovery_cp.  The state of resync
  is not relevant for whether there are any pending writes
  or not (which is what ->in_sync reports).

So create set_in_sync() which does the correct tests and
makes the correct changes, and call this in all three
places.

Any behaviour changes here a minor and cosmetic.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

6497709b

md/raid5: don't test ->writes_pending in raid5_remove_disk · 84dd97a6

由 NeilBrown 提交于 3月 15, 2017

This test on ->writes_pending cannot be safe as the counter
can be incremented at any moment and cannot be locked against.

Change it to test conf->active_stripes, which at least
can be locked against.  More changes are still needed.

A future patch will change ->writes_pending, and testing it here will
be very inconvenient.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

84dd97a6

md/raid1: stop using bi_phys_segment · 37011e3a

由 NeilBrown 提交于 3月 15, 2017

Change to use bio->__bi_remaining to count number of r1bio attached
to a bio.
See precious raid10 patch for more details.

Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
used to measure different things, but were being compared.

This patch fixes another bug in that nr_pending previously did not
could write-behind requests, so behind writes could continue while
resync was happening.  How that nr_pending counts all r1_bio,
the resync cannot commence until the behind writes have completed.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

37011e3a

md/raid10: stop using bi_phys_segments · fd16f2e8

由 NeilBrown 提交于 3月 15, 2017

raid10 currently repurposes bi_phys_segments on each
incoming bio to count how many r10bio was used to encode the
request.

We need to know when the number of attached r10bio reaches
zero to:
1/ call bio_endio() when all IO on the bio is finished
2/ decrement ->nr_pending so that resync IO can proceed.

Now that the bio has its own __bi_remaining counter, that
can be used instead. We can call bio_inc_remaining to
increment the counter and call bio_endio() every time an
r10bio completes, rather than only when bi_phys_segments
reaches zero.

This addresses point 1, but not point 2.  bio_endio()
doesn't (and cannot) report when the last r10bio has
finished, so a different approach is needed.

So: instead of counting bios in ->nr_pending, count r10bios.
i.e. every time we attach a bio, increment nr_pending.
Every time an r10bio completes, decrement nr_pending.

Normally we only increment nr_pending after first checking
that ->barrier is zero, or some other non-trivial tests and
possible waiting.  When attaching multiple r10bios to a bio,
we only need the tests and the waiting once.  After the
first increment, subsequent increments can happen
unconditionally as they are really all part of the one
request.

So introduce inc_pending() which can be used when we know
that nr_pending is already elevated.

Note that this fixes a bug.  freeze_array() contains the line
	atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
which implies that the units for ->nr_pending, ->nr_queued and extra
are the same.
->nr_queue and extra count r10_bios, but prior to this patch,
->nr_pending counted bios.  If a bio ever resulted in multiple
r10_bios (due to bad blocks), freeze_array() would not work correctly.
Now it does.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

fd16f2e8

md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110

由 NeilBrown 提交于 3月 15, 2017

When raid1 or raid10 find they will need to allocate a new
r1bio/r10bio, in order to work around a known bad block, they
account for the allocation well before the allocation is
made.  This separation makes the correctness less obvious
and requires comments.

The accounting needs to be a little before: before the first
rXbio is submitted, but that is all.

So move the accounting down to where it makes more sense.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

6b6c8110

Revert "md/raid5: limit request size according to implementation limits" · 97d53438

由 NeilBrown 提交于 3月 15, 2017

This reverts commit e8d7c332.

Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
need to impose these limits.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

97d53438

md/raid5: remove over-loading of ->bi_phys_segments. · 0472a42b

由 NeilBrown 提交于 3月 15, 2017

When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.

There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.

So we only need to store one offset value.  This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.

Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

0472a42b

md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac

由 NeilBrown 提交于 3月 15, 2017

md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed.  It currently uses 16 bits of ->bi_phys_segments for
this purpose.

16 bits is only enough for 256M requests, and it is possible for a
single bio to be larger than this, which causes problems.  Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter.  So stop using
->bi_phys_segments, and instead use __bi_remaining.

This means we don't need to initialize the counter, as our caller
initializes it to '1'.  It also means we can call bio_endio() directly
as it tests this counter internally.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

016c76ac

md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2

由 NeilBrown 提交于 3月 15, 2017

We currently gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.

So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

bd83d0a2

md/raid5: simplfy delaying of writes while metadata is updated. · 16d997b7

由 NeilBrown 提交于 3月 15, 2017

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowleged.

Commit c3cce6cd ("md/raid5: ensure device failure recorded before
write request returns.")  added code for this, but it was
unnecessarily complicated.  We already had similar functionality for
handling updates to the bad-block-list, thanks to Commit de393cde
("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
the metadata and retry the stripe_head.
As this change can leave a stripe_head ready for handling immediately
after handle_active_stripes() returns, we change raid5_do_work() to
pause when MD_CHANGE_PENDING is set, so that it doesn't spin.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the state of devices, which will be used to make decisions.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

16d997b7

md/raid5: use md_write_start to count stripes, not bios · 49728050

由 NeilBrown 提交于 3月 15, 2017

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_start().
 Whenever we decrement bi_phys_segments on non-read requsts with
    raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

49728050

17 3月, 2017 8 次提交

md: move bitmap_destroy to the beginning of __md_stop · 48df498d

由 Guoqing Jiang 提交于 3月 14, 2017

Since we have switched to sync way to handle METADATA_UPDATED
msg for md-cluster, then process_metadata_update is depended
on mddev->thread->wqueue.

With the new change, clustered raid could possible hang if
array received a METADATA_UPDATED msg after array unregistered
mddev->thread, so we need to stop clustered raid (bitmap_destroy
-> bitmap_free -> md_cluster_stop) earlier than unregister
thread (mddev_detach -> md_unregister_thread).

And this change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()), we achieve this
by move bitmap_destroy to the beginning of __md_stop.

But we don't want to break the codes for waiting behind IO as
Shaohua mentioned, so introduce bitmap_wait_behind_writes to
call the codes, and call the new fun in both mddev_detach and
bitmap_destroy, then we will not break original behind IO code
and also fit the new condition well.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

48df498d

md/r5cache: generate R5LOG_PAYLOAD_FLUSH · ea17481f

由 Song Liu 提交于 3月 09, 2017

In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
log->current_io.

Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
quiesce.

Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
which is simpler.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ea17481f

md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery · 2d4f4687

由 Song Liu 提交于 3月 07, 2017

This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush
finish.

When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
will be dropped from recovery. This will reduce the number of stripes
to replay, and thus accelerate the recovery process.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2d4f4687

raid5-ppl: runtime PPL enabling or disabling · ba903a3e

由 Artur Paszkiewicz 提交于 3月 09, 2017

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array.  The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ba903a3e

raid5-ppl: support disk hot add/remove with PPL · 6358c239

由 Artur Paszkiewicz 提交于 3月 09, 2017

Add a function to modify the log by removing an rdev when a drive fails
or adding when a spare/replacement is activated as a raid member.

Removing a disk just clears the child log rdev pointer. No new stripes
will be accepted for this child log in ppl_write_stripe() and running io
units will be processed without writing PPL to the device.

Adding a disk sets the child log rdev pointer and writes an empty PPL
header.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

6358c239

raid5-ppl: load and recover the log · 4536bf9b

由 Artur Paszkiewicz 提交于 3月 09, 2017

Load the log from each disk when starting the array and recover if the
array is dirty.

The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we read it from the
header, verifying only if it matches on all disks, and use it later when
writing PPL.

In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.

Due to compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account
and also the logical sector size of the underlying drives.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4536bf9b

md: add sysfs entries for PPL · 664aed04

由 Artur Paszkiewicz 提交于 3月 09, 2017

Add 'consistency_policy' attribute for array. It indicates how the array
maintains consistency in case of unexpected shutdown.

Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
and size of the PPL space on the device. They can't be changed for
active members if the array is started and PPL is enabled, so in the
setter functions only basic checks are performed. More checks are done
in ppl_validate_rdev() when starting the log.

These attributes are writable to allow enabling PPL for external
metadata arrays and (later) to enable/disable PPL for a running array.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

664aed04

raid5-ppl: Partial Parity Log write logging implementation · 3418d036

由 Artur Paszkiewicz 提交于 3月 09, 2017

Implement the calculation of partial parity for a stripe and PPL write
logging functionality. The description of PPL is added to the
documentation. More details can be found in the comments in raid5-ppl.c.

Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.

Partial parity is the xor of not modified data chunks of a stripe and is
calculated as follows:

- reconstruct-write case:
  xor data from all not updated disks in a stripe

- read-modify-write case:
  xor old data and parity from all updated disks in a stripe

Implement it using the async_tx API and integrate into raid_run_ops().
It must be called when we still have access to old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored into sh->ppl_page.

Partial parity is not meaningful for full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when
stripe has STRIPE_FULL_WRITE.

Put the PPL metadata structures to md_p.h because userspace tools
(mdadm) will also need to read/write PPL.

Warn about using PPL with enabled disk volatile write-back cache for
now. It can be removed once disk cache flushing before writing PPL is
implemented.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

3418d036

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功