Commits · 0c55e02259115c151e4835dd417cf41467bb02e2 · openeuler / raspberrypi-kernel

18 May, 2010 · 9 commits

md/raid5: improve consistency of error messages. · 0c55e022

Committed by NeilBrown on May 03, 2010

Many 'printk' messages from the raid456 module mention 'raid5' even
though it may be a 'raid6' or even 'raid4' array.  This can cause
confusion.
Also the actual array name is not always reported and when it is
it is not reported consistently.

So change all the messages to start:
    md/raid:%s:
where '%s' becomes e.g. md3 to identify the particular array.
Signed-off-by: NeilBrown <neilb@suse.de>
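
To illustrate the message format this commit settles on, here is a minimal sketch in Python. The helper name `md_raid_msg` and the sample message text are hypothetical; only the `md/raid:%s:` prefix comes from the commit message.

```python
def md_raid_msg(array_name: str, text: str) -> str:
    """Build a message with the 'md/raid:%s:' prefix described above.

    array_name is the array identifier (e.g. "md3"); text is the rest of
    the message. Both parameter names are illustrative assumptions.
    """
    return "md/raid:%s: %s" % (array_name, text)
```

Calling `md_raid_msg("md3", "not clean")` would yield a line that names the array up front, whatever the actual RAID level is.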

md/raid4: permit raid0 takeover · f1b29bca

Committed by Dan Williams on May 01, 2010

For consistency, allow raid4 to take over raid0 in addition to raid5 (with a
raid4 layout).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md: pass mddev to make_request functions rather than request_queue · 21a52c6d

Committed by NeilBrown on Apr 01, 2010

We used to pass the personality make_request function directly
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
a lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.
Signed-off-by: NeilBrown <neilb@suse.de>

md: remove ->changed and related code. · b821eaa5

Committed by NeilBrown on Mar 29, 2010

We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.

Now that we call revalidate_disk directly, which sets bd_invalidated,
that indirection is no longer needed and can be removed.
Signed-off-by: NeilBrown <neilb@suse.de>

md: move io accounting out of personalities into md_make_request · 49077326

Committed by NeilBrown on Mar 25, 2010

While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: small tidyup in raid5_align_endio · 2b7f2228

Committed by NeilBrown on Mar 25, 2010

Diving through ->queue to find mddev is unnecessarily complex - there
is an easier path to finding mddev, so use that.
Signed-off-by: NeilBrown <neilb@suse.de>

md: add support for raid5 to raid4 conversion · a78d38a1

Committed by NeilBrown on Mar 22, 2010

This is unlikely to be wanted, but we may as well provide it
for completeness.
Signed-off-by: NeilBrown <neilb@suse.de>

md:Add support for Raid0->Raid5 takeover · 54071b38

Committed by Trela Maciej on Mar 08, 2010

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

drivers/md: Remove unnecessary casts of void * · 7b92813c

Committed by H Hartley Sweeten on Mar 08, 2010

void pointers do not need to be cast to other pointer types.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>

17 May, 2010 · 1 commit

md: manage redundancy group in sysfs when changing level. · a64c876f

Committed by NeilBrown on Apr 14, 2010

Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.

This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.

When changing levels, we may need to add or remove attributes.  When
changing RAID5 -> RAID6, we both add and remove the same thing.  It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
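
The deferred-removal idea in this commit can be sketched as a small Python model. This is only an illustration under stated assumptions: the `Array` class, the function names, and the string group names are hypothetical, and the real kernel code works on sysfs attribute groups rather than a set of strings.

```python
class Array:
    """Hypothetical stand-in for the sysfs state of one md array."""

    def __init__(self):
        self.groups = set()     # attribute groups currently registered
        self.to_remove = None   # group whose removal is deferred until unlock


def stop_personality(array, group):
    # Removal has to wait until the lock is released, so only record it.
    array.to_remove = group


def start_personality(array, group):
    if array.to_remove == group:
        # The group being added is the one pending removal:
        # cancel the removal instead of adding it a second time.
        array.to_remove = None
    else:
        array.groups.add(group)


def unlock(array):
    # The deferred removal is carried out after the lock is dropped.
    if array.to_remove is not None:
        array.groups.discard(array.to_remove)
        array.to_remove = None
```

In this model a RAID5 -> RAID6 change stops and starts a personality with the same group, so the pending removal is simply cancelled and the group stays registered throughout.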

26 Feb, 2010 · 1 commit

block: Consolidate phys_segment and hw_segment limits · 8a78362c

Committed by Martin K. Petersen on Feb 26, 2010

Except for SCSI, no device drivers distinguish between physical and
hardware segment limits.  Consolidate the two into a single segment
limit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

17 Feb, 2010 · 1 commit

percpu: add __percpu sparse annotations to what's left · a29d8b8e

Committed by Tejun Heo on Feb 02, 2010

Add __percpu sparse annotations to places which didn't make it in one
of the previous patches.  All conversions are trivial.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Neil Brown <neilb@suse.de>

10 Feb, 2010 · 1 commit

md: fix some lockdep issues between md and sysfs. · ef286f6f

Committed by NeilBrown on Feb 09, 2010

======
This fix is related to
    http://bugzilla.kernel.org/show_bug.cgi?id=15142
but does not address that exact issue.
======

sysfs does not like attributes being removed while they are being
accessed (i.e. read or written) and waits for the access to complete.

As accessing some md attributes takes the same lock that is held while
removing those attributes, a deadlock can occur.

This patch addresses 3 issues in md that could lead to this deadlock.

Two relate to calling flush_scheduled_work while the lock is held.
This is probably a bad idea in general and as we use schedule_work to
delete various sysfs objects it is particularly bad.

In one case flush_scheduled_work is called from md_alloc (called by
md_probe) called from do_md_run which holds the lock.  This call is
only present to ensure that ->gendisk is set.  However we can be sure
that gendisk is always set (though possibly we couldn't when that code
was originally written).  This is because do_md_run is called in three
different contexts:
  1/ from md_ioctl.  This requires that md_open has succeeded, and it
     fails if ->gendisk is not set.
  2/ from writing a sysfs attribute.  This can only happen if the
     mddev has been registered in sysfs which happens in md_alloc
     after ->gendisk has been set.
  3/ from autorun_array which is only called by autorun_devices, which
     checks for ->gendisk to be set before calling autorun_array.
So the call to md_probe in do_md_run can be removed, and the check on
->gendisk can also go.


In the other case flush_scheduled_work is being called in do_md_stop,
purportedly to wait for all md_delayed_delete calls (which delete the
component rdevs) to complete.  However there really isn't any need to
wait for them - they have already been disconnected in all important
ways.

The third issue is that raid5->stop() removes some attribute names
while the lock is held.  There is already some infrastructure in place
to delay attribute removal until after the lock is released (using
schedule_work).  So extend that infrastructure to remove the
raid5_attrs_group.

This does not address all lockdep issues related to the sysfs
"s_active" lock.  The rest can be addressed by splitting that lockdep
context between symlinks and non-symlinks which hopefully will happen.
Signed-off-by: NeilBrown <neilb@suse.de>

09 Feb, 2010 · 1 commit

md: fix 'degraded' calculation when starting a reshape. · 9eb07c25

Committed by NeilBrown on Feb 09, 2010

This code was written long ago when it was not possible to
reshape a degraded array.  Now it is, so the current level of
degraded-ness needs to be taken into account.  Also newly added
devices should only reduce degradedness if they are deemed to be
in-sync.

In particular, if you convert a RAID5 to a RAID6, and increase the
number of devices at the same time, then the 5->6 conversion will
make the array degraded so the current code will produce a wrong
value for 'degraded' - "-1" to be precise.

If the reshape runs to completion end_reshape will calculate a correct
new value for 'degraded', but if a device fails during the reshape an
incorrect decision might be made based on the incorrect value of
"degraded".

This patch is suitable for 2.6.32-stable and if they are still open,
2.6.31-stable and 2.6.30-stable as well.

Cc: stable@kernel.org
Reported-by: Michael Evans <mjevans1983@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
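
The "-1" mentioned in this commit can be reproduced with a small arithmetic sketch in Python. The two functions below are a reading of the commit message, not the kernel code itself, and the concrete numbers (a 5-slot array already degraded by one, grown to 6 slots with two spares added, one of them in-sync) are an assumed scenario.

```python
def degraded_before_fix(old_disks, new_disks, spares_added):
    # Behaviour described as wrong: the existing degradation is ignored
    # and every added spare is counted as reducing it.
    return (new_disks - old_disks) - spares_added


def degraded_after_fix(current_degraded, old_disks, new_disks, in_sync_added):
    # Behaviour described as the fix: start from the current value and
    # count only added devices that are deemed in-sync.
    return current_degraded + (new_disks - old_disks) - in_sync_added
```

With the assumed scenario, `degraded_before_fix(5, 6, 2)` gives -1, while `degraded_after_fix(1, 5, 6, 1)` gives 1, which matches one device still needing recovery.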

14 Dec, 2009 · 4 commits

md: add MODULE_DESCRIPTION for all md related modules. · 0efb9e61

Committed by NeilBrown on Dec 14, 2009

Suggested by Oren Held <orenhe@il.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: don't complete make_request on barrier until writes are scheduled · 729a1866

Committed by NeilBrown on Dec 14, 2009

The post-barrier-flush is sent by md as soon as make_request on the
barrier write completes.  For raid5, the data might not be in the
per-device queues yet.  So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.

We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.
Signed-off-by: NeilBrown <neilb@suse.de>

md: support barrier requests on all personalities. · a2826aa9

Committed by NeilBrown on Dec 14, 2009

Previously barriers were only supported on RAID1.  This is because
other levels require synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeeds, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
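
The ordering described in this commit can be laid out as a short Python sketch. It only models the sequence of submissions; the function name, the event tuples, and the device names are hypothetical, and error handling and the RAID5 stripe-cache flush are left out.

```python
def handle_barrier(devices, payload):
    """Return the ordered list of submissions for one barrier request.

    devices: names of the active member devices (illustrative).
    payload: bytes of the barrier request; empty for an empty barrier.
    """
    events = []
    # 1. A zero-length barrier goes to every active device.
    events += [("barrier", dev, 0) for dev in devices]
    if payload:
        # 2. The original request is submitted with the barrier flag
        #    cleared, so it is modelled here as an ordinary write.
        events.append(("write", "array", len(payload)))
        # 3. A fresh round of zero-length barriers follows.
        events += [("barrier", dev, 0) for dev in devices]
    return events
```

For two devices and a non-empty request this produces two barriers, one write, and two more barriers, in that order.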

md/raid5: remove some sparse warnings. · 8553fe7e

Committed by NeilBrown on Dec 14, 2009

qd_idx is previously declared and given exactly the same value!
Signed-off-by: NeilBrown <neilb@suse.de>

13 Nov, 2009 · 2 commits

md/raid5: Allow dirty-degraded arrays to be assembled when only parity is degraded. · c148ffdc

Committed by NeilBrown on Nov 13, 2009

Normally it is not safe to allow a raid5 that is both dirty and
degraded to be assembled without explicit request from the admin, as
it can cause hidden data corruption.
This is because 'dirty' means that the parity cannot be trusted, and
'degraded' means that the parity needs to be used.

However, if the device that is missing contains only parity, then
there is no issue and assembly can continue.
This particularly applies when a RAID5 is being converted to a RAID6
and there is an unclean shutdown while the conversion is happening.

So check for whether the degraded space only contains parity, and
in that case, allow the assembly.
Signed-off-by: NeilBrown <neilb@suse.de>
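
The assembly rule in this commit reduces to a small decision function, sketched here in Python. The function and parameter names are hypothetical, and the caller is assumed to already know which slots hold only parity; working that out from the actual layout is not shown.

```python
def may_assemble(dirty, missing, parity_only, force=False):
    """Decide whether an array may be assembled.

    dirty:       True if parity cannot be trusted.
    missing:     set of missing device slots.
    parity_only: set of slots that hold nothing but parity (assumed input).
    force:       explicit request from the admin.
    """
    degraded = bool(missing)
    if not (dirty and degraded):
        return True
    if force:
        return True
    # Dirty and degraded: safe only if nothing but parity is missing.
    return missing <= parity_only
```

For a layout where parity rotates across all devices, `parity_only` would be empty, so a dirty and degraded array would still need the explicit request.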

Don't unconditionally set in_sync on newly added device in raid5_reshape · 7ef90146

Committed by NeilBrown on Nov 13, 2009

When a reshape finds that it can add spare devices into the array,
those devices might already be 'in_sync' if they are beyond the old
size of the array, or they might not if they are within the array.

The first case happens when we change an N-drive RAID5 to an
N+1-drive RAID5.
The second happens when we convert an N-drive RAID5 to an
N+1-drive RAID6.

So set the flag more carefully.
Also, ->recovery_offset is only meaningful when the flag is clear,
so only set it in that case.

This change needs the preceding two to ensure that the non-in_sync
device doesn't get evicted from the array when it is stopped, in the
case where v0.90 metadata is used.
Signed-off-by: NeilBrown <neilb@suse.de>
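
A minimal Python sketch of the "set the flag more carefully" rule follows. It is an interpretation of the commit message: the function name and the tuple return value are hypothetical, and starting recovery at offset 0 is an assumption.

```python
def place_spare(slot, old_raid_disks):
    """Return (in_sync, recovery_offset) for a spare placed in `slot`."""
    if slot >= old_raid_disks:
        # Beyond the old size of the array: nothing to rebuild, and
        # recovery_offset is not meaningful while the flag is set.
        return True, None
    # Within the old array: contents must be recovered (assumed from 0).
    return False, 0
```

Taking N = 4 as an example: growing a 4-drive RAID5 to 5 drives puts the spare in slot 4 of an array that had 4 slots, so it is in-sync. Converting to a 5-drive RAID6 puts it in slot 4 of an array that, in this model, already has 5 slots, so it is not.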

06 Nov, 2009 · 1 commit

md/raid5: make sure curr_sync_completes is uptodate when reshape starts · 8dee7211

Committed by NeilBrown on Nov 06, 2009

This value is visible through sysfs and is used by mdadm
when it manages a reshape (backing up data that is about to be
rearranged).  So it is important that it is always correct.
Currently it does not get updated properly when a reshape
starts which can cause problems when assembling an array
that is in the middle of being reshaped.

This is suitable for 2.6.31.y stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

20 Oct, 2009 · 1 commit

md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning · 6629542e

Committed by Dan Williams on Oct 19, 2009

Signed-off-by: Dan Williams <dan.j.williams@intel.com>

16 Oct, 2009 · 6 commits

md/async: don't pass a memory pointer as a page pointer. · 5dd33c9a

Committed by NeilBrown on Oct 16, 2009

md/raid6 passes a list of 'struct page *' to the async_tx routines,
which then either DMA map them for offload, or take the page_address
for CPU based calculations.

For RAID6 we sometimes leave 'blanks' in the list of pages.
For CPU based calcs, we want to treat these as a page of zeros.
For offloaded calculations, we simply don't pass a page to the
hardware.

Currently the 'blanks' are encoded as a pointer to
raid6_empty_zero_page.  This is a 4096 byte memory region, not a
'struct page'.  This is mostly handled correctly but is rather ugly.

So change the code to pass and expect a NULL pointer for the blanks.
When taking page_address of a page, we need to check for a NULL and
in that case use raid6_empty_zero_page.
Signed-off-by: NeilBrown <neilb@suse.de>
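
The NULL-for-blank convention in this commit can be modelled in Python with `None` standing in for the NULL pointer. This is a rough sketch: `page_address` here is a hypothetical helper rather than the kernel function, pages are modelled as `bytes` objects, and plain XOR is used only as a simple example of a CPU-based calculation.

```python
PAGE_SIZE = 4096
EMPTY_ZERO_PAGE = bytes(PAGE_SIZE)  # stand-in for raid6_empty_zero_page


def page_address(page):
    # A blank is passed as None; substitute the shared page of zeros.
    return EMPTY_ZERO_PAGE if page is None else page


def xor_pages(pages):
    """XOR a list of pages together, treating blanks as pages of zeros."""
    out = bytearray(PAGE_SIZE)
    for page in pages:
        data = page_address(page)
        for i in range(PAGE_SIZE):
            out[i] ^= data[i]
    return bytes(out)
```

The check for a blank lives in one place, at the point where the address is taken, so callers can pass the list through unchanged.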

md: Fix handling of raid5 array which is being reshaped to fewer devices. · 5e5e3e78

Committed by NeilBrown on Oct 16, 2009

When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the later, and hence smaller, number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.

This particularly affects assembling an array that was shut down while
in the middle of a reshape to fewer devices.

md.c needs a similar fix when interpreting the md metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
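
As a small illustration of the max() rule in this commit, here is a Python sketch. The function name is hypothetical; the two parameters mirror the `raid_disks` and `previous_raid_disks` fields named in the message.

```python
def required_disks(raid_disks, previous_raid_disks):
    """Number of device slots needed while a reshape is in progress.

    raid_disks is the 'new' count and previous_raid_disks the 'old' one;
    during a reshape both geometries may still be in use.
    """
    return max(raid_disks, previous_raid_disks)
```

When shrinking from 6 to 4 devices, `required_disks(4, 6)` is 6, whereas code that only looked at the new count would see 4.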

md: fix problems with RAID6 calculations for DDF. · e4424fee

Committed by NeilBrown on Oct 16, 2009

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid456: downlevel multicore operations to raid_run_ops · 417b8d4a

Committed by Dan Williams on Oct 16, 2009

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%).  However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe().  As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated.  This version still exhibits an
average 11% throughput loss for:

	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
	echo 1024 > /sys/block/md0/md/stripe_cache_size
	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

...but the results are at least stable and can be used as a base for
further multicore experimentation.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: initialize conf->device_lock earlier · f5efd45a

Committed by Dan Williams on Oct 16, 2009

Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

Revert "md: do not progress the resync process if the stripe was blocked" · 1442577b

Committed by NeilBrown on Oct 16, 2009

This reverts commit df10cfbc.

This patch was based on a misunderstanding and risks introducing a busy-wait loop.
So revert it.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

23 Sep, 2009 · 3 commits

md: report device as congested when suspended · 3fa841d7

Committed by NeilBrown on Sep 23, 2009

This should stop writeback from arriving while the device is
temporarily suspended.
Signed-off-by: NeilBrown <neilb@suse.de>

md: Improve name of threads created by md_register_thread · 0da3c619

Committed by NeilBrown on Sep 23, 2009

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless an alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
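
The naming rule in this commit can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the `mdX_name` pattern is inferred from the `mdX_raid5` example in the message.

```python
def thread_name(md_name, personality_name, alternate=None):
    """Name an md thread after the personality unless an alternate is given.

    md_name:          array name such as "md0" (illustrative).
    personality_name: e.g. "raid4", "raid5" or "raid6".
    alternate:        optional name such as "resync" or "reshape".
    """
    suffix = alternate if alternate else personality_name
    return "%s_%s" % (md_name, suffix)
```

Under this rule a raid6 array's management thread would be named after raid6 rather than raid5, while a resync thread keeps its own name.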

md: remove sparse warning "symbol xxx shadows an earlier one" · a9f326eb

Committed by NeilBrown on Sep 23, 2009

Rename some variables and remove some duplicate definitions
to avoid these warnings.  None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>

17 Sep, 2009 · 2 commits

md/raid6: cleanup ops_run_compute6_2 · 6c910a78

Committed by Dan Williams on Sep 16, 2009

Neil says:
	"It is correct as it stands, but the fact that every branch in
	 the 'if' part ends with a 'return' isn't immediately obvious,
	 so it is clearer if we are explicit about the if / then / else
	 structure."
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: eliminate BUG_ON with side effect · 2d6e4ecc

Committed by Dan Williams on Sep 16, 2009

As pointed out by Neil it should be possible to build a driver with all
BUG_ON statements deleted.  It's bad form to have a BUG_ON with a side
effect.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
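
The same principle can be shown by analogy in Python, where `assert` statements are removed when the interpreter runs with `-O`. This is only an analogy to BUG_ON, not kernel code, and the function name and queue are hypothetical.

```python
def take_next(queue):
    """Pop the next item, keeping the side effect out of the assertion.

    Writing `assert queue.pop(0) is not None` would make the pop vanish
    whenever assertions are stripped. Doing the work first and checking
    the result afterwards keeps the behaviour identical in both builds.
    """
    item = queue.pop(0)        # the side effect always happens
    assert item is not None    # the check itself has no side effect
    return item
```

With the check separated this way, deleting every assertion changes only what gets verified, not what the code does.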

11 Sep, 2009 · 1 commit

bio: first step in sanitizing the bio->bi_rw flag testing · 1f98a13f

Committed by Jens Axboe on Sep 11, 2009

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

09 Sep, 2009 · 1 commit

dmaengine: add fence support · 0403e382

Committed by Dan Williams on Sep 08, 2009

Some engines optimize operation by reading ahead in the descriptor chain
such that descriptor2 may start execution before descriptor1 completes.
If descriptor2 depends on the result from descriptor1 then a fence is
required (on descriptor2) to disable this optimization. The async_tx
api could implicitly identify dependencies via the 'depend_tx'
parameter, but that would constrain cases where the dependency chain
only specifies a completion order rather than a data dependency. So,
provide an ASYNC_TX_FENCE to explicitly identify data dependencies.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

30 Aug, 2009 · 5 commits

md/raid456: distribute raid processing over multiple cores · 07a3b417

Committed by Dan Williams on Aug 29, 2009

Now that the resources to handle stripe_head operations are allocated
percpu it is possible for raid5d to distribute stripe handling over
multiple cores.  This conversion also adds a call to cond_resched() in
the non-multicore case to prevent one core from getting monopolized for
raid operations.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: remove synchronous infrastructure · b774ef49

Committed by Yuri Tikhonov on Aug 29, 2009

These routines have been replaced by their asynchronous counterparts.
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: asynchronous handle_stripe6 · 6c0069c0

Committed by Yuri Tikhonov on Aug 29, 2009

1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
   raid_run_ops
2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
   (adds handling of Q parity)
3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
   operations
4/ Hook up raid_run_ops
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: asynchronous handle_parity_check6 · d82dfee0

Committed by Dan Williams on Jul 14, 2009

[ Based on an original patch by Yuri Tikhonov ]

Implement the state machine for handling the RAID-6 parities check and
repair functionality.  Note that the raid6 case does not need to check
for new failures, like raid5, as it will always writeback the correct
disks.  The raid5 case can be updated to check zero_sum_result to avoid
getting confused by new failures rather than retrying the entire check
operation.
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: asynchronous handle_stripe_dirtying6 · a9b39a74

Committed by Yuri Tikhonov on Aug 29, 2009

In the synchronous implementation of stripe dirtying we processed a
degraded stripe with one call to handle_stripe_dirtying6().  I.e.
compute the missing blocks from the other drives, then copy in the new
data and reconstruct the parities.

In the asynchronous case we do not perform stripe operations directly.
Instead, operations are scheduled with flags to be later serviced by
raid_run_ops.  So, for the degraded case the final reconstruction step
can only be carried out after all blocks have been brought up to date by
being read, or computed.  Like the raid5 case, schedule_reconstruction()
sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
through operation chaining can handle compute and reconstruct in a
single raid_run_ops pass.

[dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>