1. 09 Dec 2016, 3 commits
    • md/r5cache: after recovery, increase journal seq by 10000 · 3c6edc66
      Authored by Song Liu
      Currently, we increase journal entry seq by 10 after recovery.
      However, this is not sufficient in the following case.
      
      After crash the journal looks like
      
      | seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      If +1 is not valid, we drop all entries from +1 to +12 and
      write seq+10:
      
      | seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      However, if we write a big journal entry with seq+11, it will
      connect with some stale journal entry:
      
      | seq+0 | +10 |                     +11                 | +12 |
      
      To reduce the risk of this issue, we increase seq by 10000 instead.
      
      Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total
      stripe cache size is typically less than 2k, and several stripes can fit into
      one metadata block, so the total number of in-flight metadata blocks would be
      quite small, which means the total number of sequence numbers used should be
      quite small. A 10000 sequence number increase should be far more than safe.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      3c6edc66
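      A self-contained sketch of the idea in the commit above. The struct and the
      constant name are illustrative only, not the kernel's; the point is that the
      rewritten meta block jumps far past any sequence number the stale tail could
      still contain.

          #include <stdint.h>
          #include <stdio.h>

          #define SEQ_SKIP 10000          /* large jump past the stale region */

          struct recovery_ctx {
              uint64_t seq;               /* seq of the last valid entry found */
          };

          /* after the scan stops at the first invalid entry, continue
           * writing with a sequence number well beyond the dropped tail */
          static void prepare_rewrite(struct recovery_ctx *ctx)
          {
              ctx->seq += SEQ_SKIP;
          }

          int main(void)
          {
              struct recovery_ctx ctx = { .seq = 7 };
              prepare_rewrite(&ctx);
              printf("next journal seq: %llu\n", (unsigned long long)ctx.seq);
              return 0;
          }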
    • md/raid5-cache: fix crc in rewrite_data_only_stripes() · 5c88f403
      Authored by Song Liu
      r5l_recovery_create_empty_meta_block() creates the crc for the empty
      meta block. After the meta block is updated, we need to clear the
      checksum before recalculating it.
      
      Shaohua: moved checksum calculation out of
      r5l_recovery_create_empty_meta_block. We should calculate it after all fields
      are updated.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5c88f403
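      The fix above boils down to an ordering rule: compute the checksum last,
      over a block whose checksum field is zero. A self-contained sketch (the
      struct layout and toy_csum() are stand-ins, not the real r5l_meta_block or
      crc32c):

          #include <stddef.h>
          #include <stdint.h>

          struct meta_block {
              uint64_t seq;
              uint64_t position;
              uint32_t checksum;
              /* payload descriptors follow in the real on-disk format */
          };

          /* stand-in for the crc32c the journal actually uses */
          static uint32_t toy_csum(const void *buf, size_t len)
          {
              const uint8_t *p = buf;
              uint32_t c = 0;
              while (len--)
                  c = (c << 5) + c + *p++;
              return c;
          }

          /* clear the stored checksum, then recompute it over the fully
           * updated block; recomputing without clearing first is the bug */
          static void finalize_meta_block(struct meta_block *mb)
          {
              mb->checksum = 0;
              mb->checksum = toy_csum(mb, sizeof(*mb));
          }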
    • md/raid5-cache: no recovery is required when create super-block · d30dfeb9
      Authored by JackieLiu
      When creating the super-block information, we do not need to run the
      recovery stage; we only need to initialize some variables.
      Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
      Reviewed-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d30dfeb9
  2. 06 Dec 2016, 3 commits
  3. 03 Dec 2016, 1 commit
  4. 30 Nov 2016, 8 commits
  5. 28 Nov 2016, 2 commits
  6. 24 Nov 2016, 2 commits
  7. 23 Nov 2016, 6 commits
    • md/raid10: add failfast handling for writes. · 1919cbb2
      Authored by NeilBrown
      When writing to a FailFast device, we use MD_FAILFAST unless
      it is the only device being written to.  For
      resync/recovery, assume there was a working device to read
      from, so always use MD_FAILFAST.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal write fails, but the device cannot be marked
      Faulty (it must be the only one left), we queue for write error
      handling, which calls narrow_write_error() to write the block
      synchronously without any failfast flags.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      1919cbb2
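      A sketch of the write-side rule described above: only add the failfast hint
      when the device opted in and at least one other in-sync copy is also being
      written. The names are illustrative; the real code works on the md rdev
      structures and bio flags.

          #include <stdbool.h>

          struct wdev {
              bool fail_fast;             /* FailFast set on this member   */
              bool in_sync;               /* device holds up-to-date data  */
          };

          /* a failed failfast write may take the device out, so it is only
           * allowed when the array would still have another written copy */
          static bool want_failfast(const struct wdev *devs, int ndevs, int target)
          {
              int others = 0;

              for (int i = 0; i < ndevs; i++)
                  if (i != target && devs[i].in_sync)
                      others++;
              return devs[target].fail_fast && others > 0;
          }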
    • md/raid10: add failfast handling for reads. · 8d3ca83d
      Authored by NeilBrown
      If a device is marked FailFast, and it is not the only
      device we can read from, we mark the bio as MD_FAILFAST.
      
      If such a failfast read does fail, we don't try read repair but just
      allow the failure.
      
      If it was the last device, it doesn't get marked Faulty so
      the retry happens on the same device - this time without
      FAILFAST.  A subsequent failure will not retry but will just
      pass up the error.
      
      During resync we may use FAILFAST requests, and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case where there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      8d3ca83d
    • md/raid1: add failfast handling for writes. · 212e7eb7
      Authored by NeilBrown
      When writing to a FailFast device we use MD_FAILFAST unless
      it is the only device being written to.
      
      For resync/recovery, assume there was a working device to
      read from, so always use REQ_FAILFAST_DEV.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal failfast write fails, but the device cannot be
      failed (it must be the only one left), we queue for write error
      handling.  This will call narrow_write_error() to retry the
      write synchronously and without any FAILFAST flags.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      212e7eb7
    • md/raid1: add failfast handling for reads. · 2e52d449
      Authored by NeilBrown
      If a device is marked FailFast and it is not the only device
      we can read from, we mark the bio with REQ_FAILFAST_* flags.
      
      If this does fail, we don't try read repair but just allow
      failure.  If it was the last device, it does not get marked
      Faulty, so the retry happens on the same device - this time
      without FAILFAST.  A subsequent failure will not retry but
      will just pass up the error.
      
      During resync we may use FAILFAST requests and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case where there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      
      The new R1BIO_FailFast flag is set on read requests to suggest
      that a FAILFAST request might be acceptable.  The rdev needs
      to have FailFast set as well for the read to actually use
      REQ_FAILFAST_*.
      
      We need to know there are at least two working devices
      before we can set R1BIO_FailFast, so we mustn't stop looking
      at the first device we find.  So change the "min_pending == 0"
      handling to not exit early, but to always choose the
      best_pending_disk if min_pending == 0.
      
      The spinlocked region in raid1_error() is enlarged to ensure
      that if two bios, reading from two different devices, fail
      at the same time, then there is no risk that both devices
      will be marked faulty, leaving zero "In_sync" devices.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2e52d449
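      A self-contained model of the selection change described above: keep
      scanning past an idle disk so we learn whether a second readable copy
      exists, because only then is a failfast read acceptable. The struct is a
      simplification of the raid1 rdev bookkeeping, not the kernel's.

          #include <stdbool.h>

          struct rdev_model {
              bool readable;              /* device can service this read  */
              bool fail_fast;             /* FailFast flag on the device   */
              int  pending;               /* in-flight requests            */
          };

          /* pick the readable disk with the fewest pending requests, but do
           * not return early at pending == 0: count all readable copies so
           * the caller knows whether a failfast read is safe */
          static int choose_read_disk(const struct rdev_model *devs, int n,
                                      bool *use_failfast)
          {
              int best = -1, min_pending = -1, readable = 0;

              for (int i = 0; i < n; i++) {
                  if (!devs[i].readable)
                      continue;
                  readable++;
                  if (min_pending < 0 || devs[i].pending < min_pending) {
                      min_pending = devs[i].pending;
                      best = i;
                  }
              }
              *use_failfast = best >= 0 && readable > 1 && devs[best].fail_fast;
              return best;
          }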
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      Authored by NeilBrown
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      46533ff7
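      A sketch of the retry policy in the commit above, with a stub standing in
      for the actual superblock bio submission; the LastDev idea is modeled as a
      plain flag and everything here is illustrative.

          #include <stdbool.h>

          struct meta_dev {
              bool fail_fast;             /* device opted into failfast       */
              bool last_dev;              /* a failfast write already failed  */
          };

          /* placeholder for issuing the superblock write; pretend that only
           * the failfast attempt fails, to exercise the retry path */
          static bool submit_meta_write(struct meta_dev *d, bool failfast)
          {
              (void)d;
              return !failfast;
          }

          static bool write_metadata(struct meta_dev *d)
          {
              bool use_ff = d->fail_fast && !d->last_dev;

              if (submit_meta_write(d, use_ff))
                  return true;
              if (!use_ff)
                  return false;           /* a plain write really failed */

              /* the device cannot be failed (it must be the last one), so
               * remember that and retry the write without failfast */
              d->last_dev = true;
              return submit_meta_write(d, false);
          }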
    • md/failfast: add failfast flag for md to be used by some personalities. · 688834e6
      Authored by NeilBrown
      This patch just adds a 'failfast' per-device flag which can be stored
      in v0.90 or v1.x metadata.
      The flag is not used yet but the intent is that it can be used for
      mirrored (raid1/raid10) arrays where low latency is more important
      than keeping all devices on-line.
      
      Setting the flag for a device effectively gives permission for that
      device to be marked as Faulty and excluded from the array on the first
      error.  The underlying driver will be directed not to retry requests
      that result in failures.  There is a proviso that the device must not
      be marked Faulty if that would cause the array as a whole to fail; it
      may only be marked Faulty if the array remains functional, but
      degraded.
      
      Failures on read requests will cause the device to be marked
      as Faulty immediately so that further reads will avoid that
      device.  No attempt will be made to correct read errors by
      over-writing with the correct data.
      
      It is expected that if transient errors, such as cable unplug, are
      possible, then something in user-space will revalidate failed
      devices and re-add them when they appear to be working again.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      688834e6
  8. 19 Nov 2016, 14 commits
    • md/r5cache: handle FLUSH and FUA · 3bddb7f8
      Authored by Song Liu
      With the raid5 cache, we commit data from the journal device. When
      there is a flush request, we need to flush the journal device's cache.
      This was not needed with the plain raid5 journal, because we flush the
      journal before committing data to the raid disks.
      
      This is similar for FUA, except that we also need to flush the journal
      for FUA. Otherwise, corruption in earlier metadata will stop recovery
      from reaching the FUA data.
      
      Code slightly changed by Shaohua.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      3bddb7f8
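      A rough model of the flag handling described above, assuming only what the
      commit message states: with a write-back cache both FLUSH and FUA must make
      the journal device itself durable, since that is where the data (and the
      metadata recovery depends on) live. Names are illustrative.

          #include <stdbool.h>
          #include <stdio.h>

          struct cached_write {
              bool flush;                 /* request carried a preflush */
              bool fua;                   /* request carried FUA        */
          };

          /* placeholder for issuing a cache flush to the journal device */
          static void flush_journal_device(void)
          {
              puts("flush journal device cache");
          }

          /* data is committed from the journal, so FLUSH must flush the
           * journal device; FUA also needs the journal flushed so earlier
           * metadata cannot block recovery from reaching the FUA data */
          static void handle_flags(const struct cached_write *w)
          {
              if (w->flush || w->fua)
                  flush_journal_device();
          }

          int main(void)
          {
              struct cached_write w = { .flush = false, .fua = true };
              handle_flags(&w);
              return 0;
          }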
    • md/r5cache: r5cache recovery: part 2 · 5aabf7c4
      Authored by Song Liu
      1. In the previous patch, we:
            - added new data to r5l_recovery_ctx
            - added new functions to recover the write-back cache
         The new functions were not used in that patch, so it did not
         change the behavior of recovery.
      
      2. In this patch, we:
            - modify the main recovery procedure r5l_recovery_log() to call the
              new functions
            - remove the old functions
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5aabf7c4
    • md/r5cache: r5cache recovery: part 1 · b4c625c6
      Authored by Song Liu
      Recovery of the write-back cache follows different logic from the
      write-through-only cache. Specifically, for the write-back cache,
      recovery needs to scan through all active journal entries before
      flushing data out. Therefore, a large portion of the recovery logic is
      rewritten here.
      
      To make the diffs cleaner, we split the rewrite as follows:
      
      1. In this patch, we:
            - add new data to r5l_recovery_ctx
            - add new functions to recover the write-back cache
         The new functions are not used in this patch, so this patch does not
         change the behavior of recovery.
      
      2. In next patch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      
      With the cache feature, there are 2 different scenarios of recovery:
      1. Data-Parity stripe: a stripe with complete parity in journal.
      2. Data-Only stripe: a stripe with only data in journal (or partial
         parity).
      
      The code differentiates Data-Parity stripes from Data-Only stripes with
      the flag STRIPE_R5C_CACHING.
      
      For Data-Parity stripes, we use the same procedure as raid5 journal,
      where all the data and parity are replayed to the RAID devices.
      
      For Data-Only stripes, we need to calculate the complete parity and
      finish the full reconstruct write or RMW write. For simplicity, in
      the recovery, we load the stripe to stripe cache. Once the array is
      started, the stripe cache state machine will handle these stripes
      through normal write path.
      
      r5c_recovery_flush_log contains the main procedure of recovery. The
      recovery code first scans through the journal and loads data to
      stripe cache. The code keeps track of all these stripes in a list
      (using sh->lru and ctx->cached_list); stripes in the list are
      organized in the order of their first appearance on the journal.
      During the scan, the recovery code assesses each stripe as
      Data-Parity or Data-Only.
      
      During the scan, the array may run out of stripe cache. In these cases,
      the recovery code will also call raid5_set_cache_size to increase the
      stripe cache size. If the array still runs out of stripe cache
      because there isn't enough memory, the array will not assemble.
      
      At the end of scan, the recovery code replays all Data-Parity
      stripes, and sets proper states for Data-Only stripes. The recovery
      code also increases seq number by 10 and rewrites all Data-Only
      stripes to the journal. This is to avoid confusion after repeated
      crashes. More detail is given in raid5-cache.c before
      r5c_recovery_rewrite_data_only_stripes().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b4c625c6
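      A minimal model of the end-of-scan step described above: walk the cached
      stripes in journal order, replay the ones with complete parity, and keep
      the Data-Only ones in the stripe cache for the normal write path. The list
      and hooks are simplifications of sh->lru / ctx->cached_list, not kernel code.

          #include <stdbool.h>
          #include <stddef.h>

          struct scanned_stripe {
              bool has_full_parity;           /* complete parity in journal */
              struct scanned_stripe *next;    /* order of first appearance  */
          };

          /* placeholders for the two outcomes the commit describes */
          static void replay_to_raid_disks(struct scanned_stripe *sh) { (void)sh; }
          static void keep_in_stripe_cache(struct scanned_stripe *sh) { (void)sh; }

          /* Data-Parity stripes are replayed like the plain raid5 journal;
           * Data-Only stripes stay cached and are later rewritten to the
           * journal with a bumped sequence number */
          static void finish_recovery(struct scanned_stripe *cached_list)
          {
              for (struct scanned_stripe *sh = cached_list; sh; sh = sh->next) {
                  if (sh->has_full_parity)
                      replay_to_raid_disks(sh);
                  else
                      keep_in_stripe_cache(sh);
              }
          }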
    • md/r5cache: refactoring journal recovery code · 9ed988f5
      Authored by Song Liu
      1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
      2. pull the code that initializes r5l_meta_block from
         r5l_log_write_empty_meta_block() to a separate function
         r5l_recovery_create_empty_meta_block(), so that we can reuse this
         piece of code.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      9ed988f5
    • md/r5cache: sysfs entry journal_mode · 2c7da14b
      Authored by Song Liu
      With the write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2c7da14b
    • md/r5cache: write-out phase and reclaim support · a39f7afd
      Authored by Song Liu
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we prioritize reclaim of full stripe writes.
      To free up more journal space, we free the earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or enough stripes are cached for a full stripe write (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 of stripes are cached,
      or there are empty inactive lists), flush all full stripes. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, also flush some partial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically includes pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal_list. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      a39f7afd
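      A self-contained sketch of the space accounting above: the reserve is
      (raid_disks + 1) pages per cached stripe, and the TIGHT/CRITICAL flags
      compare free log space against 3x and 2x of that reserve. The field names
      model the commit's description, not the kernel structs.

          #include <stdbool.h>
          #include <stdint.h>

          #define PAGE_SECTORS 8          /* one 4K page in 512-byte sectors */

          struct log_model {
              int      raid_disks;
              uint64_t stripes_in_journal;    /* stripe_in_journal_count    */
              uint64_t free_space;            /* free sectors on the log    */
              bool     log_tight;
              bool     log_critical;
          };

          /* one meta page plus one page per member disk for every stripe
           * still sitting in the journal */
          static uint64_t required_to_flush_cache(const struct log_model *l)
          {
              return (uint64_t)(l->raid_disks + 1) * PAGE_SECTORS *
                     l->stripes_in_journal;
          }

          static void update_pressure_flags(struct log_model *l)
          {
              uint64_t need = required_to_flush_cache(l);

              l->log_tight    = l->free_space < 3 * need;
              l->log_critical = l->free_space < 2 * need;
          }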
    • md/r5cache: caching phase of r5cache · 1e6d690b
      Authored by Song Liu
      As described in the previous patch, the write-back cache operates in two
      phases: caching and writing-out. The caching phase works as follows:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase works as follows:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calculate parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements the caching phase. The cache is integrated with
      the stripe cache of raid456. It leverages the r5l_log code to write
      data to the journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, a write operation does not wait for parity calculation
      and write-out, so the write latency is lower (1 write to the journal
      device vs. read and then write to the raid disks). Also, r5cache will
      reduce RAID overhead (multiple IOs due to read-modify-write of
      parity) and provide more opportunities for full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at any time. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With the write-back cache, the read path must enter the state
         machine, which is a significant bottleneck for some workloads.
      4. There is no per-stripe checkpoint (with r5l_payload_flush) in
         the log, so the recovery code has to replay more data than necessary
         (sometimes all the log from last_checkpoint). This reduces the
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      1e6d690b
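      A toy model of the caching phase as described above: the write is only
      appended to the journal and completed, and the stripe then sits on the
      full- or partial-stripe list depending on how much of it is cached. All
      names here are illustrative stand-ins for the kernel structures.

          #include <stdio.h>

          struct cached_stripe {
              int data_disks;             /* data devices in the stripe        */
              int pages_cached;           /* data pages already in the journal */
          };

          /* placeholder for appending the data pages to the journal device */
          static void append_to_journal(struct cached_stripe *sh, int pages)
          {
              sh->pages_cached += pages;
          }

          /* caching phase: write to the journal, complete the bio, and defer
           * all parity work to the write-out phase; fully covered stripes go
           * to the full-stripe list, which reclaim prefers */
          static void cache_write(struct cached_stripe *sh, int new_pages)
          {
              append_to_journal(sh, new_pages);
              puts("bio_endio");

              if (sh->pages_cached >= sh->data_disks)
                  puts("move to r5c_full_stripe_list");
              else
                  puts("move to r5c_partial_stripe_list");
          }

          int main(void)
          {
              struct cached_stripe sh = { .data_disks = 4, .pages_cached = 0 };
              cache_write(&sh, 4);
              return 0;
          }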
    • md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Authored by Song Liu
      This patch adds a state machine for raid5-cache. With a log device, the
      raid456 array can operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      The existing raid5-cache code only has write-through mode. For the
      write-back cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In the writing-out phase, the stripe behaves as a stripe in write-through
      mode (R5C_MODE_WRITE_THROUGH).
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With the RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2ded3703
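      A compact model of the phase selection spelled out in the comment above;
      the enum and struct only mirror the commit's description and are not the
      kernel's types.

          #include <stdbool.h>

          enum journal_mode { NO_JOURNAL, WRITE_THROUGH, WRITE_BACK };

          struct stripe_model {
              bool caching;               /* models the STRIPE_R5C_CACHING bit */
          };

          /* no journal or a write-through journal means the stripe is always
           * in the writing-out phase; with a write-back journal the bit set
           * on a dirtied stripe keeps it in the caching phase */
          static bool stripe_is_caching(enum journal_mode mode,
                                        const struct stripe_model *sh)
          {
              return mode == WRITE_BACK && sh->caching;
          }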
    • md/r5cache: move some code to raid5.h · 937621c3
      Authored by Song Liu
      Move some defines and inline functions to raid5.h so they can be
      used in raid5-cache.c.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      937621c3
    • md/r5cache: Check array size in r5l_init_log · c757ec95
      Authored by Song Liu
      Currently, r5l_write_stripe checks meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks the maximal meta size of the array,
      which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If this is too big to fit in one page, r5l_init_log aborts.
      
      With the current metadata, r5l_log supports raid_disks up to 203.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      c757ec95
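      A sketch of the one-time size check described above. The structure sizes
      here are illustrative placeholders (they happen to reproduce the quoted
      203-disk limit for a 4K page, but the real limit comes from the on-disk
      r5l_meta_block and r5l_payload_data_parity layouts).

          #include <stddef.h>
          #include <stdio.h>

          #define LOG_PAGE_SIZE 4096

          /* illustrative sizes only, not the real on-disk structures */
          #define META_BLOCK_HDR_SIZE 32
          #define PAYLOAD_DESC_SIZE   20

          /* one meta block must be able to describe a data/parity payload
           * for every member disk, so check once at init time instead of
           * on every stripe write */
          static int check_meta_size(int raid_disks)
          {
              size_t need = META_BLOCK_HDR_SIZE +
                            (size_t)raid_disks * PAYLOAD_DESC_SIZE;

              if (need > LOG_PAGE_SIZE) {
                  fprintf(stderr, "meta block too large for %d disks\n",
                          raid_disks);
                  return -1;
              }
              return 0;
          }

          int main(void)
          {
              /* 203 fits with these placeholder sizes; 204 would not */
              return check_meta_size(203) == 0 ? 0 : 1;
          }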
    • md: add blktrace event for writes to superblock · 504634f6
      Authored by Shaohua Li
      A superblock write is an expensive operation. With raid5-cache, it can be
      called regularly. Add tracing to help with performance debugging.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: NeilBrown <neilb@suse.com>
      504634f6
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      Authored by NeilBrown
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening while
      looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      578b54ad
    • md/bitmap: add blktrace event for writes to the bitmap · 581dbd94
      Authored by NeilBrown
      We trace whenever bitmap_unplug() finds that it needs to write
      to the bitmap, or when bitmap_daemon_work() finds there is work
      to do.
      
      This makes it easier to correlate bitmap updates with data writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      581dbd94
    • md: add block tracing for bio_remapping · 109e3765
      Authored by NeilBrown
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      109e3765
  9. 18 Nov 2016, 1 commit