1. April 18, 2011 (2 commits)
    • md: provide generic support for handling unplug callbacks. · 97658cdd
      Committed by NeilBrown
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NeilBrown <neilb@suse.de>
      97658cdd
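      A minimal userspace sketch of the pattern described in the entry above: the
      submitter checks whether the device is currently plugged and, if so, defers
      work to the md thread that the unplug callback will wake.  The struct and
      function names are illustrative stand-ins, not the kernel API.

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative stand-in for the md device; only what the sketch needs. */
        struct mddev_model {
            int plug_cnt;       /* non-zero while an unplug callback is pending */
            int deferred_work;  /* work left for the md thread to pick up */
        };

        /* True if the device is plugged, i.e. the md thread is guaranteed to be
         * woken shortly, so some processing can safely be delayed until then. */
        static bool mddev_check_plugged_model(struct mddev_model *mddev)
        {
            return mddev->plug_cnt > 0;
        }

        static void make_request_model(struct mddev_model *mddev)
        {
            if (mddev_check_plugged_model(mddev))
                mddev->deferred_work++;   /* the unplug callback will handle it */
            else
                printf("no unplug callback expected: handling request now\n");
        }

        int main(void)
        {
            struct mddev_model m = { .plug_cnt = 1, .deferred_work = 0 };
            make_request_model(&m);   /* deferred */
            m.plug_cnt = 0;
            make_request_model(&m);   /* handled immediately */
            printf("deferred_work = %d\n", m.deferred_work);
            return 0;
        }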
    • md - remove old plugging code. · 482c0834
      Committed by NeilBrown
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      will restore the plugging functionality.
      Signed-off-by: NeilBrown <neilb@suse.de>
      482c0834
  2. March 31, 2011 (1 commit)
  3. February 24, 2011 (1 commit)
    • md: Fix - again - partition detection when array becomes active · f0b4f7e2
      Committed by NeilBrown
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed, and hence when we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequent removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      f0b4f7e2
  4. January 31, 2011 (1 commit)
    • md: Remove the AllReserved flag for component devices. · f21e9ff7
      Committed by NeilBrown
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved,
      so even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NeilBrown <neilb@suse.de>
      f21e9ff7
  5. January 14, 2011 (3 commits)
    • md: separate meta and data devs · a6ff7e08
      Committed by Jonathan Brassow
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will be on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a6ff7e08
    • md-new-param-to_sync_page_io · ccebd4c4
      Committed by Jonathan Brassow
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      ccebd4c4
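      A rough sketch of the idea behind the new parameter: a metadata/data flag on
      a sync_page_io-style helper lets later code route the two kinds of operation
      to different devices.  The types and the helper below are simplified
      stand-ins, not the actual md code.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        typedef uint64_t sector_t;

        /* Simplified component-device descriptor: data and metadata may live on
         * different block devices (meta_bdev == NULL means they share one). */
        struct rdev_model {
            const char *data_bdev;
            const char *meta_bdev;
        };

        /* Model of a sync_page_io-style helper that takes the rdev rather than a
         * bare bdev, plus a flag saying whether this is a metadata operation. */
        static bool sync_page_io_model(struct rdev_model *rdev, sector_t sector,
                                       int size, bool metadata_op)
        {
            const char *target = (metadata_op && rdev->meta_bdev)
                                     ? rdev->meta_bdev
                                     : rdev->data_bdev;
            printf("%s op: %d bytes at sector %llu on %s\n",
                   metadata_op ? "metadata" : "data",
                   size, (unsigned long long)sector, target);
            return true;
        }

        int main(void)
        {
            struct rdev_model rdev = { "/dev/sda", "/dev/sdb" };
            sync_page_io_model(&rdev, 8, 4096, true);     /* superblock update */
            sync_page_io_model(&rdev, 2048, 4096, false); /* ordinary data I/O */
            return 0;
        }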
    • md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      Committed by NeilBrown
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop().
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
      0ca69886
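      A small model of the gating described above: requests are refused until the
      array has been marked ready, with a release/acquire pair standing in for the
      memory ordering the kernel code relies on.  Names are illustrative.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct mddev_model {
            atomic_bool ready;   /* set only once initialisation has finished */
        };

        /* Everything a request needs must be published before 'ready' can be
         * observed, hence the release store here and the acquire load below. */
        static void md_run_model(struct mddev_model *mddev)
        {
            /* ... set up the personality, queues, etc. ... */
            atomic_store_explicit(&mddev->ready, true, memory_order_release);
        }

        static int md_make_request_model(struct mddev_model *mddev)
        {
            if (!atomic_load_explicit(&mddev->ready, memory_order_acquire))
                return -1;   /* too early, e.g. a partition-table probe */
            return 0;        /* safe to hand off to the personality */
        }

        int main(void)
        {
            struct mddev_model m;

            atomic_init(&m.ready, false);
            printf("before run: %d\n", md_make_request_model(&m));
            md_run_model(&m);
            printf("after run:  %d\n", md_make_request_model(&m));
            return 0;
        }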
  6. October 28, 2010 (2 commits)
    • md: use separate bio pool for each md device. · a167f663
      Committed by NeilBrown
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a167f663
    • md: change type of first arg to sync_page_io. · 2b193363
      Committed by NeilBrown
      Currently sync_page_io takes a 'bdev'.
      Every caller passes 'rdev->bdev'.
      We will soon want another field out of the rdev in sync_page_io,
      so just pass the rdev itself rather than the bdev taken out of it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2b193363
  7. September 10, 2010 (1 commit)
    • md: implement REQ_FLUSH/FUA support · e9c7469b
      Committed by Tejun Heo
      This patch converts md to support REQ_FLUSH/FUA instead of the now
      deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
      changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and its users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  FUA bit is propagated through
        biodrain and stripe reconstruction such that all the updated parts
        of the stripe are written out with FUA writes if any of the dirtying
        writes was FUA.  preread_active_stripes handling in make_request()
        is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      e9c7469b
  8. August 30, 2010 (1 commit)
    • md: resolve confusion of MD_CHANGE_CLEAN · 070dc6dd
      Committed by NeilBrown
      MD_CHANGE_CLEAN is used for two different purposes and this leads to
      confusion.
      One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
      not used for anything else, so have MD_CHANGE_PENDING take over that
      purpose fully.
      
      The two purposes are:
       1/ tell md_update_sb that an update is needed and that it is just a
         clean/dirty transition.
       2/ tell user-space that a transition from clean to dirty is pending
          (something wants to write), and tell the kernel (by clearing the
          flag) that the transition is OK.

      The first purpose remains with MD_CHANGE_CLEAN, the second is moved
      fully to MD_CHANGE_PENDING.

      This means that various places which conditionally set or cleared
      MD_CHANGE_CLEAN no longer need to be conditional.
      Signed-off-by: NeilBrown <neilb@suse.de>
      070dc6dd
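      A toy illustration of splitting one overloaded flag into two single-purpose
      bits, in the style of the kernel's bit operations; the bit numbers and
      helper names are invented for the sketch.

        #include <stdbool.h>
        #include <stdio.h>

        /* Two separate bits, each with exactly one meaning. */
        enum {
            CHANGE_CLEAN   = 0,  /* superblock update needed: clean/dirty only */
            CHANGE_PENDING = 1,  /* clean->dirty transition awaiting an ack */
        };

        static void set_bit_model(int nr, unsigned long *flags)
        {
            *flags |= 1UL << nr;
        }

        static void clear_bit_model(int nr, unsigned long *flags)
        {
            *flags &= ~(1UL << nr);
        }

        static bool test_bit_model(int nr, const unsigned long *flags)
        {
            return *flags & (1UL << nr);
        }

        int main(void)
        {
            unsigned long sb_flags = 0;

            /* something wants to write: announce the pending transition ... */
            set_bit_model(CHANGE_PENDING, &sb_flags);
            /* ... and note that a clean/dirty superblock update is needed */
            set_bit_model(CHANGE_CLEAN, &sb_flags);

            /* the transition is acknowledged, so the pending bit is cleared */
            if (test_bit_model(CHANGE_PENDING, &sb_flags))
                clear_bit_model(CHANGE_PENDING, &sb_flags);

            printf("flags now: %#lx\n", sb_flags);
            return 0;
        }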
  9. August 8, 2010 (2 commits)
    • md: fix another deadlock with removing sysfs attributes. · bb4f1e9d
      Committed by NeilBrown
      Moving the deletion of sysfs attributes from reconfig_mutex to
      open_mutex didn't really help, as a process can try to take
      open_mutex while holding reconfig_mutex, so the same deadlock can
      happen, just requiring one more process to be involved in the chain.

      It looks like I cannot easily use locking to wait for the sysfs
      deletion to complete, so don't.

      The only things that we cannot do while the deletions are still
      pending are other things which can change the sysfs namespace: run,
      takeover, stop.  Each of these can fail with -EBUSY.
      So set a flag while doing a sysfs deletion, and fail run, takeover,
      stop if that flag is set.

      This is suitable for 2.6.35.x

      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      bb4f1e9d
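      A compact model of the approach above: a flag is set for the duration of the
      sysfs deletion, and the operations that would change the sysfs namespace
      bail out with -EBUSY while it is set.  The names and the error constant are
      illustrative.

        #include <stdbool.h>
        #include <stdio.h>

        #define EBUSY_MODEL 16   /* stand-in for the kernel's EBUSY */

        struct mddev_model {
            bool sysfs_active;   /* a sysfs deletion is scheduled but unfinished */
        };

        static int do_md_run_model(struct mddev_model *mddev)
        {
            if (mddev->sysfs_active)
                return -EBUSY_MODEL;   /* would change the sysfs namespace */
            /* ... start the array ... */
            return 0;
        }

        int main(void)
        {
            struct mddev_model m = { .sysfs_active = true };

            printf("run while deletion pending: %d\n", do_md_run_model(&m));
            m.sysfs_active = false;   /* the deletion has completed */
            printf("run afterwards:             %d\n", do_md_run_model(&m));
            return 0;
        }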
    • block: unify flags for struct bio and struct request · 7b6d91da
      Committed by Christoph Hellwig
      Remove the current bio flags and reuse the request flags for the bio, too.
      This makes it easier to trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superfluous RW in them.

      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      7b6d91da
  10. July 26, 2010 (8 commits)
    • md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. · e384e585
      Committed by NeilBrown
      This allows md/raid5 to fully work as a dm target.
      
      Normally md uses a 'filemap' which contains a list of pages of bits
      each of which may be written separately.
      dm-log uses an all-or-nothing approach to writing the log, so
      when using a dm-log, ->filemap is NULL and the flags normally stored
      in filemap_attr are stored in ->logattrs instead.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e384e585
    • md/bitmap: clean up plugging calls. · b63d7c2e
      Committed by NeilBrown
      1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
        arrays with no queue attached.
      
      2/ Don't bother plugging the queue when we set a bit in the bitmap.
         The reason for this was to encourage as many bits as possible to
         get set before we unplug and write stuff out.
         However every personality already plugs the queue after
         bitmap_startwrite either directly (raid1/raid10) or by setting
         STRIPE_BIT_DELAY which causes the queue to be plugged later
         (raid5).
      Signed-off-by: NeilBrown <neilb@suse.de>
      b63d7c2e
    • md/bitmap: white space clean up and similar. · ac2f40be
      Committed by NeilBrown
      Fixed some whitespace problems.
      Fixed some checkpatch.pl complaints.
      Replaced kmalloc ... memset(0) with kzalloc.
      Fixed an unlikely memory leak on an error path.
      Reformatted a number of 'if/else' sets, sometimes
      replacing goto with an else clause.
      Removed some old comments and commented-out code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ac2f40be
    • md/plug: optionally use plugger to unplug an array during resync/recovery. · 252ac522
      Committed by NeilBrown
      If an array doesn't have a 'queue' then md_do_sync cannot
      unplug it.
      In that case it will have a 'plugger', so make that available
      to the mddev, and use it to unplug the array if needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      252ac522
    • md/raid5: add simple plugging infrastructure. · 2ac87401
      Committed by NeilBrown
      md/raid5 uses the plugging infrastructure provided by the block layer
      and 'struct request_queue'.  However when we plug raid5 under dm there
      is no request queue so we cannot use that.
      
      So create a similar infrastructure that is much lighter weight and use
      it for raid5.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2ac87401
    • md: add support for raising dm events. · 768a418d
      Committed by NeilBrown
      dm uses scheduled work to raise events to user-space.
      So allow an md device to have work_structs and schedule them on an error.
      Signed-off-by: NeilBrown <neilb@suse.de>
      768a418d
    • md: export various start/stop interfaces · 390ee602
      Committed by NeilBrown
      Export entry points for starting and stopping md arrays.
      This will be used by a module to make md/raid5 work under
      dm.
      Also stop calling md_stop_writes from md_stop, as that won't
      work well with dm - it will want to call the two separately.
      Signed-off-by: NeilBrown <neilb@suse.de>
      390ee602
    • md: split out md_rdev_init · e8bb9a83
      Committed by NeilBrown
      This functionality will be needed separately in a subsequent patch, so
      split it into its own exported function.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e8bb9a83
  11. July 21, 2010 (1 commit)
  12. June 24, 2010 (1 commit)
    • md: fix handling of array level takeover that re-arranges devices. · e93f68a1
      Committed by NeilBrown
      Most array level changes leave the list of devices largely unchanged,
      possibly causing one at the end to become redundant.
      However conversions between RAID0 and RAID10 need to renumber
      all devices (except 0).
      
      This renumbering is currently being done in the ->run method when the
      new personality takes over.  However this is too late as the common
      code in md.c might already have invalidated some of the devices if
      they had a ->raid_disk number that appeared too high.
      
      Moving it into the ->takeover method is too early as the array is
      still active at that time and wrong ->raid_disk numbers could cause
      confusion.
      
      So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
      the new raid_disk number.
      Now the common code knows exactly which devices need to be renumbered,
      and which can be invalidated, and can do it all at a convenient time
      when the array is suspended.
      It can also update some symlinks in sysfs which previously were not being
      updated correctly.
      Reported-by: Maciej Trela <maciej.trela@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e93f68a1
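      An illustrative sketch of carrying the future slot in a separate field and
      applying it in one pass at a safe moment; the structures are simplified
      stand-ins for the real rdev handling.

        #include <stdio.h>

        struct rdev_model {
            int raid_disk;      /* current slot in the array */
            int new_raid_disk;  /* slot the new personality wants, -1 to drop */
        };

        /* Done by the common code in one pass, at a safe moment while the array
         * is suspended, so every device is renumbered (or dropped) consistently. */
        static void apply_takeover_model(struct rdev_model *rdevs, int n)
        {
            for (int i = 0; i < n; i++) {
                if (rdevs[i].new_raid_disk < 0)
                    printf("slot %d: device removed\n", rdevs[i].raid_disk);
                else if (rdevs[i].new_raid_disk != rdevs[i].raid_disk)
                    printf("slot %d: renumbered to %d\n",
                           rdevs[i].raid_disk, rdevs[i].new_raid_disk);
                rdevs[i].raid_disk = rdevs[i].new_raid_disk;
            }
        }

        int main(void)
        {
            /* e.g. a RAID0 -> RAID10 conversion renumbering all but slot 0 */
            struct rdev_model rdevs[] = {
                { .raid_disk = 0, .new_raid_disk = 0 },
                { .raid_disk = 1, .new_raid_disk = 2 },
                { .raid_disk = 2, .new_raid_disk = 4 },
            };

            apply_takeover_model(rdevs, 3);
            return 0;
        }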
  13. May 18, 2010 (5 commits)
    • md: simplify updating of event count to sometimes avoid updating spares. · a8707c08
      Committed by NeilBrown
      When updating the event count for a simple clean <-> dirty transition,
      we try to avoid updating the spares so they can safely spin-down.
      As the event_counts across an array must be +/- 1, this means
      decrementing the event_count on a dirty->clean transition.
      This is not always safe, so we have to avoid the times when it is unsafe.
      We currently do this with a misguided idea about it being safe or
      not depending on whether the event_count is odd or even.  This
      approach only works reliably in a few common instances, but easily
      falls down.
      
      So instead, simply keep internal state concerning whether it is safe
      or not, and always assume it is not safe when an array is first
      assembled.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a8707c08
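      A small model of the replacement logic: rather than inferring safety from
      whether the event count is odd or even, an explicit flag records whether a
      decrement is currently safe, and it starts out pessimistic when the array is
      assembled.  Field names are illustrative.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct mddev_model {
            uint64_t events;
            bool can_decrease_events;  /* false when freshly assembled */
        };

        /* Clean <-> dirty superblock update: decrement rather than increment
         * when that is known to be safe, so spun-down spares stay untouched. */
        static void update_events_model(struct mddev_model *m, bool dirty_to_clean)
        {
            if (dirty_to_clean && m->can_decrease_events) {
                m->events--;
                m->can_decrease_events = false;  /* one decrement per increment */
            } else {
                m->events++;
                m->can_decrease_events = true;
            }
        }

        int main(void)
        {
            struct mddev_model m = { .events = 100, .can_decrease_events = false };

            update_events_model(&m, true);   /* not yet safe: goes up to 101 */
            update_events_model(&m, false);  /* dirty: 102, decrement now allowed */
            update_events_model(&m, true);   /* clean: back to 101, spares skipped */
            printf("events = %llu\n", (unsigned long long)m.events);
            return 0;
        }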
    • md: pass mddev to make_request functions rather than request_queue · 21a52c6d
      Committed by NeilBrown
      We used to pass the personality make_request function directly
      to the block layer, so the first argument had to be a queue.
      But now we have the intermediary md_make_request, so it makes
      a lot more sense to pass a struct mddev_s.
      It makes it possible to have an mddev without its own queue too.
      Signed-off-by: NeilBrown <neilb@suse.de>
      21a52c6d
    • md: remove ->changed and related code. · b821eaa5
      Committed by NeilBrown
      We set ->changed to 1 and call check_disk_change at the end
      of md_open so that bd_invalidated would be set and thus
      partition rescan would happen appropriately.
      
      Now that we call revalidate_disk directly, which sets bd_invalidated,
      that indirection is no longer needed and can be removed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b821eaa5
    • md: discard StateChanged device flag. · c0cc75f8
      Committed by NeilBrown
      This was needed when sysfs files could only be 'notified'
      from process context.  Now that we have sysfs_notify_dirent,
      we can call it directly from an interrupt.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c0cc75f8
    • md: remove some dead fields from mddev_s · ee8b81b0
      Committed by NeilBrown
      These fields have never been used.
      commit 4b6d287f
      added them, but also added identical fields to bitmap_super_s,
      and only used the latter.

      So remove these unused fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ee8b81b0
  14. May 17, 2010 (1 commit)
    • md: manage redundancy group in sysfs when changing level. · a64c876f
      Committed by NeilBrown
      Some levels expect the 'redundancy group' to be present,
      others don't.
      So when we change level of an array we might need to
      add or remove this group.
      
      This requires fixing up the current practice of overloading ->private
      to indicate (when ->pers == NULL) that something needs to be removed.
      So create a new ->to_remove to fill that role.
      
      When changing levels, we may need to add or remove attributes.  When
      changing RAID5 -> RAID6, we both add and remove the same thing.  It is
      important to catch this and optimise it out as the removal is delayed
      until a lock is released, so trying to add immediately would cause
      problems.
      
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      a64c876f
  15. December 14, 2009 (9 commits)
    • raid: improve MD/raid10 handling of correctable read errors. · 1e50915f
      Committed by Robert Becker
      We've noticed severe lasting performance degradation of our raid
      arrays when we have drives that yield large amounts of media errors.
      The raid10 module will queue each failed read for retry, and also
      will attempt to call fix_read_error() to perform the read recovery.
      Read recovery is performed while the array is frozen, so repeated
      recovery attempts can degrade the performance of the array for
      extended periods of time.
      
      With this patch I propose adding a per md device max number of
      corrected read attempts.  Each rdev will maintain a count of
      read correction attempts in the rdev->read_errors field (not
      used currently for raid10). When we enter fix_read_error()
      we'll check to see when the last read error occurred, and
      divide the read error count by 2 for every hour since the
      last read error. If at that point our read error count
      exceeds the read error threshold, we'll fail the raid device.
      
      In addition, this patch adds sysfs nodes (get/set) for
      the per-md max_read_errors attribute and the rdev->read_errors
      attribute, and adds some printk's to indicate when
      fix_read_error fails to repair an rdev.

      For testing I used debugfs->fail_make_request to inject
      IO errors to the rdev while doing IO to the raid array.
      Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      1e50915f
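      A standalone sketch of the decay rule described above: the remembered error
      count is halved for every full hour since the last read error, the new error
      is added, and the result is compared against the per-array limit.  Names are
      illustrative, not the raid10 code.

        #include <stdio.h>
        #include <time.h>

        struct rdev_model {
            unsigned int read_errors;
            time_t last_read_error;
        };

        /* Returns nonzero if the device should now be failed. */
        static int note_read_error_model(struct rdev_model *rdev,
                                         unsigned int max_read_errors, time_t now)
        {
            unsigned int hours = 0;

            if (rdev->last_read_error)
                hours = (unsigned int)((now - rdev->last_read_error) / 3600);

            /* halve the remembered count once per hour of error-free operation */
            if (hours >= 8 * sizeof(rdev->read_errors))
                rdev->read_errors = 0;
            else
                rdev->read_errors >>= hours;

            rdev->last_read_error = now;
            rdev->read_errors++;

            return rdev->read_errors > max_read_errors;
        }

        int main(void)
        {
            struct rdev_model rdev = { .read_errors = 16, .last_read_error = 0 };
            time_t now = time(NULL);

            rdev.last_read_error = now - 2 * 3600;   /* last error two hours ago */
            if (note_read_error_model(&rdev, 20, now))
                printf("failing device\n");
            else
                printf("device kept, read_errors now %u\n", rdev.read_errors);
            return 0;
        }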
    • md: Support write-intent bitmaps with externally managed metadata. · ece5cff0
      Committed by NeilBrown
      In this case, the metadata must not be in the same
      sector as the bitmap.
      md will not read/write any bitmap metadata.  Configuration must be
      done via sysfs, and when a recovery makes the array non-degraded
      again, writing 'true' to 'bitmap/can_clear' will allow bits in
      the bitmap to be cleared again.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ece5cff0
    • md: support updating bitmap parameters via sysfs. · 43a70507
      Committed by NeilBrown
      A new attribute directory 'bitmap' in 'md' is created which
      contains files for configuring the bitmap.
      'location' identifies where the bitmap is: either 'none',
      'file', or a sector offset from the metadata.
      Writing 'location' can create or remove a bitmap.
      Adding a 'file' bitmap this way is not yet supported.
      'chunksize' and 'time_base' must be set before 'location'
      can be set.
      
      'chunksize' can be set before creating a bitmap, but is
      currently always over-ridden by the bitmap superblock.
      
      'time_base' and 'backlog' can be updated at any time.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      43a70507
    • md: factor out parsing of fixed-point numbers · 72e02075
      Committed by NeilBrown
      safe_delay_store can parse fixed point numbers (for fractions
      of a second).  We will want to do that for another sysfs
      file soon, so factor out the code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      72e02075
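      A self-contained sketch of the kind of fixed-point parsing being factored
      out: a string such as "1.375" becomes an integer number of milliseconds.
      The function name and the scale of three decimal places are illustrative.

        #include <ctype.h>
        #include <stdio.h>

        /* Parse "<int>[.<frac>]" into milliseconds; 0 on success, -1 on error. */
        static int parse_fixed_point_model(const char *buf, unsigned long *msec)
        {
            unsigned long value = 0;
            int decimals = -1;   /* stays -1 until a '.' has been seen */

            for (; *buf && !isspace((unsigned char)*buf); buf++) {
                if (*buf == '.' && decimals < 0) {
                    decimals = 0;
                    continue;
                }
                if (!isdigit((unsigned char)*buf))
                    return -1;
                value = value * 10 + (unsigned long)(*buf - '0');
                if (decimals >= 0)
                    decimals++;
            }
            if (decimals < 0)
                decimals = 0;
            while (decimals < 3) {   /* scale to thousandths of a second */
                value *= 10;
                decimals++;
            }
            while (decimals > 3) {
                value /= 10;
                decimals--;
            }
            *msec = value;
            return 0;
        }

        int main(void)
        {
            unsigned long msec;

            if (parse_fixed_point_model("1.375", &msec) == 0)
                printf("1.375 s -> %lu ms\n", msec);   /* prints 1375 */
            return 0;
        }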
    • md: support bitmap offset appropriate for external-metadata arrays. · f6af949c
      Committed by NeilBrown
      For md arrays where metadata is managed externally, the kernel does not
      know about a superblock, so the superblock offset is 0.
      If we want to have a write-intent-bitmap near the end of the
      devices of such an array, we should support a sector_t sized offset.
      We need the offset to be possibly negative, for when the bitmap is before
      the metadata, so use loff_t instead.

      Also add a sanity check that the bitmap does not overlap with the data.
      Signed-off-by: NeilBrown <neilb@suse.de>
      f6af949c
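      A short sketch of the overlap sanity check implied above, using a signed
      offset so the bitmap may sit before the metadata; the field names and units
      are illustrative.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Signed, like loff_t, so the bitmap may sit before the metadata. */
        typedef int64_t offset_model_t;

        /* All values in 512-byte sectors; 'offset' is relative to the superblock. */
        static bool bitmap_overlaps_data_model(offset_model_t sb_start,
                                               offset_model_t offset,
                                               offset_model_t bitmap_sectors,
                                               offset_model_t data_start,
                                               offset_model_t data_sectors)
        {
            offset_model_t bm_start = sb_start + offset;
            offset_model_t bm_end = bm_start + bitmap_sectors;

            return bm_start < data_start + data_sectors && bm_end > data_start;
        }

        int main(void)
        {
            /* a bitmap placed 8 sectors past a superblock near the device's end */
            if (bitmap_overlaps_data_model(1953525008, 8, 64, 0, 1953525000))
                printf("rejected: bitmap overlaps data\n");
            else
                printf("layout accepted\n");
            return 0;
        }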
    • md: move offset, daemon_sleep and chunksize out of bitmap structure · 42a04b50
      Committed by NeilBrown
      ... and into bitmap_info.  These are all configuration parameters
      that need to be set before the bitmap is created.
      Signed-off-by: NeilBrown <neilb@suse.de>
      42a04b50
    • md: collect bitmap-specific fields into one structure. · c3d9714e
      Committed by NeilBrown
      In preparation for making bitmap fields configurable via sysfs,
      start tidying up by making a single structure to contain the
      configuration fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c3d9714e
    • md: support barrier requests on all personalities. · a2826aa9
      Committed by NeilBrown
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so needed
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty -  we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero length barriers succeed, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers are submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      a2826aa9
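      A condensed model of the sequencing described above: one round of
      zero-length barriers to every active device, then (if the original request
      carried data) the request itself with the barrier flag cleared, then a
      second round of zero-length barriers whose errors are ignored.  The helpers
      are stubs standing in for real device submission.

        #include <stdbool.h>
        #include <stdio.h>

        struct bio_model {
            int size;      /* 0 means a pure barrier carrying no data */
            bool barrier;
        };

        static void submit_zero_length_barrier_model(int dev)
        {
            printf("dev %d: zero-length barrier\n", dev);
        }

        static void submit_to_devices_model(struct bio_model *bio, int ndevs)
        {
            printf("submitting %d-byte request (barrier=%d) to %d devices\n",
                   bio->size, bio->barrier, ndevs);
        }

        static void handle_barrier_request_model(struct bio_model *bio, int ndevs)
        {
            /* 1. a zero-length barrier to every active member device */
            for (int d = 0; d < ndevs; d++)
                submit_zero_length_barrier_model(d);

            if (bio->size) {
                /* 2. the request itself, barrier flag cleared because a barrier
                 *    that partly succeeds across a striped layout would be too
                 *    hard to handle */
                bio->barrier = false;
                submit_to_devices_model(bio, ndevs);

                /* 3. a second round of zero-length barriers fences the request;
                 *    errors in this round are ignored */
                for (int d = 0; d < ndevs; d++)
                    submit_zero_length_barrier_model(d);
            }
        }

        int main(void)
        {
            struct bio_model bio = { .size = 4096, .barrier = true };

            handle_barrier_request_model(&bio, 3);
            return 0;
        }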
    • md/bitmap: protect against bitmap removal while being updated. · aa5cbd10
      Committed by NeilBrown
      A write intent bitmap can be removed from an array while the
      array is active.
      When this happens, all IO is suspended and flushed before the
      bitmap is removed.
      However it is possible that bitmap_daemon_work is still running to
      clear old bits from the bitmap.  If it is, it can dereference the
      bitmap after it has been freed.
      
      So introduce a new mutex to protect bitmap_daemon_work and get it
      before destroying a bitmap.
      
      This is suitable for any current -stable kernel.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      aa5cbd10
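      A minimal pthread-based illustration of the fix: the periodic worker and the
      destroy path take the same mutex, and the worker re-checks the pointer under
      the lock so it can never touch a freed bitmap.  All names are illustrative.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct bitmap_model {
            int pending_bits;
        };

        struct mddev_model {
            pthread_mutex_t bitmap_info_mutex;
            struct bitmap_model *bitmap;
        };

        static void bitmap_daemon_work_model(struct mddev_model *mddev)
        {
            pthread_mutex_lock(&mddev->bitmap_info_mutex);
            if (mddev->bitmap)                    /* may have been removed */
                mddev->bitmap->pending_bits = 0;  /* clear old bits */
            pthread_mutex_unlock(&mddev->bitmap_info_mutex);
        }

        static void bitmap_destroy_model(struct mddev_model *mddev)
        {
            pthread_mutex_lock(&mddev->bitmap_info_mutex);
            free(mddev->bitmap);                  /* worker cannot be mid-access */
            mddev->bitmap = NULL;
            pthread_mutex_unlock(&mddev->bitmap_info_mutex);
        }

        int main(void)
        {
            struct mddev_model m = { .bitmap = NULL };

            pthread_mutex_init(&m.bitmap_info_mutex, NULL);
            m.bitmap = calloc(1, sizeof(*m.bitmap));
            bitmap_daemon_work_model(&m);
            bitmap_destroy_model(&m);
            bitmap_daemon_work_model(&m);   /* safe: sees NULL, does nothing */
            printf("done\n");
            return 0;
        }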
  16. September 23, 2009 (1 commit)