提交 · e8d7c33232e5fdfa761c3416539bc5b4acd12db5 · openanolis / cloud-kernel

30 11月, 2016 1 次提交

md/raid5: limit request size according to implementation limits · e8d7c332

由 Konstantin Khlebnikov 提交于 11月 27, 2016

Current implementation employ 16bit counter of active stripes in lower
bits of bio->bi_phys_segments. If request is big enough to overflow
this counter bio will be completed and freed too early.

Fortunately this not happens in default configuration because several
other limits prevent that: stripe_cache_size * nr_disks effectively
limits count of active stripes. And small max_sectors_kb at lower
disks prevent that during normal read/write operations.

Overflow easily happens in discard if it's enabled by module parameter
"devices_handle_discard_safely" and stripe_cache_size is set big enough.

This patch limits requests size with 256Mb - 8Kb to prevent overflows.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Shaohua Li <shli@kernel.org>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: NShaohua Li <shli@fb.com>

e8d7c332

28 11月, 2016 1 次提交

md/r5cache: handle alloc_page failure · d7bd398e

由 Song Liu 提交于 11月 23, 2016

RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d7bd398e

19 11月, 2016 7 次提交

md/r5cache: handle FLUSH and FUA · 3bddb7f8

由 Song Liu 提交于 11月 18, 2016

With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

slightly changed the code by Shaohua
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

3bddb7f8

md/r5cache: r5cache recovery: part 2 · 5aabf7c4

由 Song Liu 提交于 11月 17, 2016

1. In previous patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In this patchpatch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5aabf7c4

md/r5cache: sysfs entry journal_mode · 2c7da14b

由 Song Liu 提交于 11月 17, 2016

With write cache, journal_mode is the knob to switch between
write-back and write-through.

Below is an example:

root@virt-test:~/# cat /sys/block/md0/md/journal_mode
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
write-through [write-back]
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2c7da14b

md/r5cache: write-out phase and reclaim support · a39f7afd

由 Song Liu 提交于 11月 17, 2016

There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.

In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
   if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
   or cached stripes is enough for a full stripe (chunk size / 4k)
   (r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains new logic of reclaim.

For stripe cache:

When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.

For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).

This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a39f7afd

md/r5cache: caching phase of r5cache · 1e6d690b

由 Song Liu 提交于 11月 17, 2016

As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks

This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.

Writing-out phase of the cache is implemented in the next patch.

With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->orig_page and the old data
is read into it.  Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.

r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.

There are some known limitations of the cache implementation:

1. Write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
   is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
   the log, so recovery code has to replay more than necessary data
   (sometimes all the log from last_checkpoint). This reduces
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

1e6d690b

md/r5cache: State machine for raid5-cache write back mode · 2ded3703

由 Song Liu 提交于 11月 17, 2016

This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
  - write-back (R5C_MODE_WRITE_BACK)
  - write-through (R5C_MODE_WRITE_THROUGH)

Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.

With write-back cache, every stripe could operate in two different
phases:
  - caching
  - writing-out

In caching phase, the stripe handles writes as:
  - write to journal
  - return IO

In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.

STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.

Please note: this is a "no-op" patch for raid5-cache write-through
mode.

The following detailed explanation is copied from the raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * With rhe RAID cache, each stripe works in two phases:
 *      - caching phase
 *      - writing-out phase
 *
 * These two phases are controlled by bit STRIPE_R5C_CACHING:
 *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
 *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
 *
 * When there is no journal, or the journal is in write-through mode,
 * the stripe is always in writing-out phase.
 *
 * For write-back journal, the stripe is sent to caching phase on write
 * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
 * the write-out phase by clearing STRIPE_R5C_CACHING.
 *
 * Stripes in caching phase do not write the raid disks. Instead, all
 * writes are committed from the log device. Therefore, a stripe in
 * caching phase handles writes as:
 *      - write to log device
 *      - return IO
 *
 * Stripes in writing-out phase handle writes as:
 *      - calculate parity
 *      - write pending data and parity to journal
 *      - write data and parity to raid disks
 *      - return IO for pending writes
 */
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2ded3703

md/r5cache: move some code to raid5.h · 937621c3

由 Song Liu 提交于 11月 17, 2016

Move some define and inline functions to raid5.h, so they can be
used in raid5-cache.c
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

937621c3

08 11月, 2016 2 次提交

N
md/raid5: change printk() to pr_*() · cc6167b4
由 NeilBrown 提交于 11月 02, 2016
```
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>
```
cc6167b4

raid5: revert commit · 7adb072c

由 Tomasz Majchrzak 提交于 10月 26, 2016

Revert commit 11367799 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cd ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f88949
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.
Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7adb072c

22 9月, 2016 3 次提交

raid5: handle register_shrinker failure · 30c89465

由 Shaohua Li 提交于 9月 21, 2016

register_shrinker() now can fail. When it happens, shrinker.nr_deferred is
null. We use it to determine if unregister_shrinker is required.
Signed-off-by: NShaohua Li <shli@fb.com>

30c89465

raid5: fix to detect failure of register_shrinker · 6a0f53ff

由 Chao Yu 提交于 9月 20, 2016

register_shrinker can fail after commit 1d3d4437 ("vmscan: per-node
deferred work"), we should detect the failure of it, otherwise we may
fail to register shrinker after raid5 configuration was setup successfully.
Signed-off-by: NChao Yu <yuchao0@huawei.com>
Signed-off-by: NShaohua Li <shli@fb.com>

6a0f53ff

raid5: allow arbitrary max_hw_sectors · 1dffdddd

由 Shaohua Li 提交于 9月 08, 2016

raid5 will split bio to proper size internally, there is no point to use
underlayer disk's max_hw_sectors. In my qemu system, without the change,
the raid5 only receives 128k size bio, which reduces the chance of bio
merge sending to underlayer disks.
Signed-off-by: NShaohua Li <shli@fb.com>

1dffdddd

10 9月, 2016 1 次提交

raid5: fix a small race condition · c9445555

由 Shaohua Li 提交于 9月 08, 2016

commit 5f9d1fde(raid5: fix memory leak of bio integrity data)
moves bio_reset to bio_endio. But it introduces a small race condition.
It does bio_reset after raid5_release_stripe, which could make the
stripe reusable and hence reuse the bio just before bio_reset. Moving
bio_reset before raid5_release_stripe is called should fix the race.
Reported-and-tested-by: NStefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: NShaohua Li <shli@fb.com>

c9445555

07 9月, 2016 1 次提交

md/raid5: Convert to hotplug state machine · 29c6d1bb

由 Sebastian Andrzej Siewior 提交于 8月 18, 2016

Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160818125731.27256-10-bigeasy@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

29c6d1bb

01 9月, 2016 1 次提交

raid5: guarantee enough stripes to avoid reshape hang · ad5b0f76

由 Shaohua Li 提交于 8月 30, 2016

If there aren't enough stripes, reshape will hang. We have a check for
this in new reshape, but miss it for reshape resume, hence we could see
hang in reshape resume. This patch forces enough stripes existed if
reshape resumes.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ad5b0f76

25 8月, 2016 3 次提交

raid5: avoid unnecessary bio data set · 45c91d80

由 Shaohua Li 提交于 8月 22, 2016

bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.
Signed-off-by: NShaohua Li <shli@fb.com>

45c91d80

raid5: fix memory leak of bio integrity data · 5f9d1fde

由 Shaohua Li 提交于 8月 22, 2016

Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: NYi Zhang <yizhan@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5f9d1fde

r5cache: set MD_JOURNAL_CLEAN correctly · 486b0f7b

由 Song Liu 提交于 8月 19, 2016

Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

486b0f7b

08 8月, 2016 1 次提交

block: rename bio bi_rw to bi_opf · 1eff9d32

由 Jens Axboe 提交于 8月 05, 2016

Since commit 63a4cc24, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.
Signed-off-by: NJens Axboe <axboe@fb.com>

1eff9d32

06 8月, 2016 1 次提交

md: Prevent IO hold during accessing to faulty raid5 array · 11367799

由 Alexey Obitotskiy 提交于 8月 03, 2016

After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.

For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.

Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: NAlexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

11367799

02 8月, 2016 1 次提交

raid5: fix incorrectly counter of conf->empty_inactive_list_nr · ff00d3b4

由 ZhengYuan Liu 提交于 7月 28, 2016

The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: NZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: NShaohua Li <shli@fb.com>

ff00d3b4

21 7月, 2016 1 次提交

block: get rid of bio_rw and READA · 70246286

由 Christoph Hellwig 提交于 7月 19, 2016

These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMike Christie <mchristi@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

70246286

14 6月, 2016 5 次提交

md: reduce the number of synchronize_rcu() calls when multiple devices fail. · d787be40

由 NeilBrown 提交于 6月 02, 2016

Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.

As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices.  Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.

fix build error(Shaohua)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d787be40

md: be extra careful not to take a reference to a Faulty device. · f5b67ae8

由 NeilBrown 提交于 6月 02, 2016

It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.

Some places take a new reference after only check In_sync.  This should
be safe as the two are changed together.  However to make the code more
obviously safe, add checks for 'Faulty' as well.

Note: the actual rule is:
  Never increment nr_pending if  Faulty is set and Blocked is clear,
  never clear Faulty, and never set Blocked without holding a reference
  through nr_pending.

fix build error (Shaohua)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

f5b67ae8

N
md/raid5: add rcu protection to rdev accesses in raid5_status. · 5fd13351
由 NeilBrown 提交于 6月 02, 2016
```
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>
```
5fd13351

md/raid5: add rcu protection to rdev accesses in want_replace · 3f232d6a

由 NeilBrown 提交于 6月 02, 2016

Being in the middle of resync is no longer protection against failed
rdevs disappearing.  So add rcu protection.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

3f232d6a

md/raid5: add rcu protection to rdev accesses in handle_failed_sync. · e50d3992

由 NeilBrown 提交于 6月 02, 2016

The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

e50d3992

08 6月, 2016 3 次提交

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH · 28a8f0d3

由 Mike Christie 提交于 6月 05, 2016

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

28a8f0d3

block, drivers, fs: shrink bi_rw from long to int · 6296b960

由 Mike Christie 提交于 6月 05, 2016

We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6296b960

md: use bio op accessors · 796a5cf0

由 Mike Christie 提交于 6月 05, 2016

Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

796a5cf0

26 5月, 2016 1 次提交

right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW · 41257580

由 Song Liu 提交于 5月 23, 2016

In current handle_stripe_dirtying, the code prefers rmw with
PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW.

This patch reverses this behavior.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

41257580

10 5月, 2016 2 次提交

md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13

由 Guoqing Jiang 提交于 5月 03, 2016

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

85ad1d13

md: raid5: add prerequisite to run underneath dm-raid · fe67d19a

由 Heinz Mauelshagen 提交于 5月 03, 2016

In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses.

This patch adds a missing conditional to the raid5 personality.
Signed-of-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

fe67d19a

30 4月, 2016 1 次提交

raid5: delete unnecessary warnning · b8a0b8e9

由 Shaohua Li 提交于 4月 29, 2016

If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page !=
orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the
SkipCopy flag and set page to orig_page. So the warning is unnecessary.
Reported-by: NJoey Liao <joeyliao@qnap.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b8a0b8e9

18 3月, 2016 1 次提交

md/raid5: Cleanup cpu hotplug notifier · 1d034e68

由 Anna-Maria Gleixner 提交于 3月 16, 2016

The raid456_cpu_notify() hotplug callback lacks handling of the
CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
buffer is leaked.

Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
to free the scratch buffer.

CC: Shaohua Li <shli@kernel.org>
CC: linux-raid@vger.kernel.org
Signed-off-by: NAnna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: NShaohua Li <shli@fb.com>

1d034e68

10 3月, 2016 2 次提交

md/raid5: output stripe state for debug · fb3229d5

由 Shaohua Li 提交于 3月 09, 2016

Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be
quite convenient if we know the stripe state. This is what this patch does.
Signed-off-by: NShaohua Li <shli@fb.com>

fb3229d5

md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list · 550da24f

由 NeilBrown 提交于 3月 09, 2016

break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.

It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.

However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another.  This does occasionally happen.

md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this could becomes incorrect and will never again return to zero.

md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero.  So when the above mention race
happens, those stripe_heads become blocked and never progress,
resulting is write to the array handing.

So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
Tested-by: NTom Weber <linux@junkyard.4t2.com>
Fixes: 1b956f7a ("md/raid5: be more selective about distributing flags across batch.")
Cc: stable@vger.kernel.org (v4.1 and later)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

550da24f

27 2月, 2016 1 次提交

RAID5: revert to fix a livelock · 6ab2a4b8

由 Shaohua Li 提交于 2月 25, 2016

Revert commit
e9e4c377(md/raid5: per hash value and exclusive wait_for_stripe)

The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
  max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
  wait_for_stripe[0]
The system live locks. The problem is active_stripes isn't a per-hash
count. Revert the patch makes the live lock go away.

Cc: stable@vger.kernel.org (v4.2+)
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: NShaohua Li <shli@fb.com>

6ab2a4b8

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功