Commits · e25e0920b5f0f2d46d16f14f7f51ccbfd0773671 · gsplhtlxg / clone-Linux

26 May, 2009 · 7 commits

Committed by NeilBrown on May 26, 2009

md has no need for the BKL - it does its own locking.
So md_ioctl doesn't need to be a locked_ioctl.
Signed-off-by: NeilBrown <neilb@suse.de>

b492b852

md: don't update curr_resync_completed without also updating reshape_position. · 7a91ee1f

Committed by NeilBrown on May 26, 2009

In order for the metadata to always be consistent, we mustn't update
curr_resync_completed without also updating reshape_position.

The reshape code updates both at the same time.  However, since commit
97e4f42d, the common md_do_sync will sometimes update
curr_resync_completed but is not in a position to update
reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.
Signed-off-by: NeilBrown <neilb@suse.de>

7a91ee1f

md: raid5: avoid sector values going negative when testing reshape progress. · 848b3182

Committed by NeilBrown on May 26, 2009

As sector_t is unsigned, we cannot afford to let 'safepos' etc go
negative.
So replace
   a -= b;
by
   a -= min(b,a);
Signed-off-by: NeilBrown <neilb@suse.de>

848b3182
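
The clamped subtraction from this commit message can be tried out in userspace. The following is only a minimal sketch: it assumes sector_t is a 64-bit unsigned type, and sub_clamped is a hypothetical helper name used for illustration, not a function from the md code.

```c
#include <stdint.h>

typedef uint64_t sector_t;  /* assumption: 64-bit unsigned sector type */

/*
 * Hypothetical helper: subtract b from a, stopping at zero instead of
 * wrapping around to a huge value. Equivalent to "a -= min(b, a)".
 */
static sector_t sub_clamped(sector_t a, sector_t b)
{
    sector_t step = (b < a) ? b : a;  /* min(b, a) */
    return a - step;
}
```

With a plain "a -= b", a position of 100 minus a step of 128 would wrap to a value near 2^64; the clamped form yields 0, which keeps later progress comparisons meaningful.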

md: export 'frozen' resync state through sysfs · b6a9ce68

Committed by NeilBrown on May 26, 2009

The md resync engine has a 'frozen' state which ensures that
no resync/recovery takes place.  This is used to avoid races.

Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.
Signed-off-by: NeilBrown <neilb@suse.de>

b6a9ce68

md: bitmap: improve bitmap maintenance code. · be512691

Committed by NeilBrown on May 26, 2009

The code for checking which bits in the bitmap can be cleared
has 2 problems:
 1/ it repeatedly takes and drops a spinlock, where it would make
    more sense to just hold on to it most of the time.
 2/ it doesn't make use of some opportunities to skip large sections
    of the bitmap

This patch fixes those.  It will only affect CPU consumption, not
correctness.
Signed-off-by: NeilBrown <neilb@suse.de>

be512691

md: improve errno return when setting array_size · 2b69c839

Committed by NeilBrown on May 26, 2009

Instead of always returning EINVAL if anything goes wrong
when setting the array size, add the option of
  E2BIG
if the size requested is too large.  This makes it easier
for user-space to be sure what went wrong.
Signed-off-by: NeilBrown <neilb@suse.de>

2b69c839
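
The idea of distinguishing "too large" from other failures can be sketched as below. This is an illustration only: check_array_size and its parameters are hypothetical names, and the rule that a zero size is invalid is an assumption made for the example, not something stated in the commit.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical validator: returns 0 on success or a negative errno,
 * following the usual kernel convention.
 */
static int check_array_size(uint64_t requested, uint64_t max_available)
{
    if (requested == 0)
        return -EINVAL;  /* assumption: a zero size is treated as malformed */
    if (requested > max_available)
        return -E2BIG;   /* well-formed, but more than can be provided */
    return 0;
}
```

A caller in user-space can then react differently to the two cases, for example by retrying with a smaller size only when E2BIG is reported.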

md: always update level / chunk_size / layout when writing v1.x metadata. · 62e1e389

Committed by NeilBrown on May 26, 2009

We previously didn't update these fields when writing the metadata
because they could never change.  They can now, so we had better write
them.
v0.90 metadata always updated these fields.
Signed-off-by: NeilBrown <neilb@suse.de>

62e1e389

07 May, 2009 · 7 commits

md: remove rd%d links immediately after stopping an array. · c4647292

Committed by NeilBrown on May 07, 2009

md maintains links in /sys/block/mdXX/md/ to identify which device has
which role in the array. e.g.
   rd2 -> dev-sda

indicates that the device with role '2' in the array is sda.

These links are only present when the array is active.  They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.

So move the removal earlier so they are consistently only present when
the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>

c4647292

md: remove ability to explicitly set an inactive array to 'clean'. · 5bf29597

Committed by NeilBrown on May 07, 2009

Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.

It is unnecessary because the same can be achieved by writing
'active'.  This activates an array, but it still remains 'clean'
until the first write.

It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').

Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races:  One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) an active array.  Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.

So just disable the use of 'clean' to activate an array.

This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it is suitable for -stable.
Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>

5bf29597

md: constify VFTs · 110518bc

Committed by Jan Engelhardt on May 07, 2009

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: NeilBrown <neilb@suse.de>

110518bc

md: tidy up status_resync to handle large arrays. · dd71cf6b

Committed by NeilBrown on May 07, 2009

Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
   now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
   the range of "unsigned long".

So
 - change "max_blocks" to "max_sectors", and store sector numbers
   in there and in 'resync'
 - Make 'rt' a 'sector_t' so it can temporarily hold the number of
   remaining sectors.
 - use sector_div rather than normal division.
 - change the magic '100' used to preserve precision to '32'.
   + making it a power of 2 makes division easier
   + it doesn't need to be as large as it was chosen when we averaged
     speed over the entire run.  Now we average speed over the last 30
     seconds or so.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>

dd71cf6b
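
The arithmetic described in this message (sector counts in 64 bits, a precision factor of 32) can be sketched in userspace as follows. This is a rough illustration, not the kernel's status_resync: the function name remaining_seconds and the parameter names db (sectors completed in the sampling window) and dt (window length in seconds) are assumptions, and plain 64-bit division stands in for sector_div.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate the seconds remaining in a resync.
 * The caller is assumed to guarantee resync <= max_sectors.
 */
static uint64_t remaining_seconds(uint64_t max_sectors, uint64_t resync,
                                  uint64_t db, uint64_t dt)
{
    uint64_t rt = max_sectors - resync;  /* sectors still to process */

    rt /= (db / 32 + 1);  /* divide by the rate scaled down by 32; +1 avoids /0 */
    rt *= dt;             /* scale by the length of the sampling window */
    rt >>= 5;             /* undo the factor of 32 (a power of 2, so a shift) */
    return rt;
}
```

Because 32 is a power of 2, the final correction is a shift rather than a division, and every intermediate value stays in a 64-bit type, so arrays beyond the range of a 32-bit "unsigned long" do not overflow.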

md: fix some (more) errors with bitmaps on devices larger than 2TB. · db305e50

Committed by NeilBrown on May 07, 2009

If a write intent bitmap covers more than 2TB, we sometimes work with
values beyond 32bit, so these need to be sector_t.  This patch
adds the required casts to some unsigned longs that are being shifted
up.

This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
member devices that are larger than 2TB.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Cc: stable@kernel.org

db305e50
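
Why the cast has to come before the shift can be shown with a small userspace sketch. Here uint32_t stands in for a 32-bit "unsigned long" and uint64_t for sector_t; both function names are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/*
 * Shift done in 32 bits: bits pushed past bit 31 are lost before the
 * result is widened to 64 bits for the return value.
 */
static uint64_t chunk_to_sector_narrow(uint32_t chunk, unsigned shift)
{
    return chunk << shift;
}

/* Cast first: the shift is done in 64 bits, so no bits are lost. */
static uint64_t chunk_to_sector_wide(uint32_t chunk, unsigned shift)
{
    return (uint64_t)chunk << shift;
}
```

For a chunk number of 2^23 and a shift of 10, the narrow version returns 0 while the wide version returns 2^33, which is the kind of silent truncation that only appears once the covered area is large enough.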

md/raid10: don't clear bitmap during recovery if array will still be degraded. · 18055569

Committed by NeilBrown on May 07, 2009

If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

18055569

md: fix loading of out-of-date bitmap. · b74fd282

Committed by NeilBrown on May 07, 2009

When md is loading a bitmap which it knows is out of date, it fills
each page with 1s and writes it back out again.  However the
write_page call makes use of bitmap->file_pages and
bitmap->last_page_size which haven't been set correctly yet.  So this
can sometimes fail.

Move the setting of file_pages and last_page_size to before the call
to write_page.

This bug can cause the assembly of an array to fail, thus making the
data inaccessible.  Hence I think it is a suitable candidate for
-stable.

Cc: stable@kernel.org
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>

b74fd282

20 Apr, 2009 · 1 commit

md: support bitmaps on RAID10 arrays larger than 2 terabytes · 1f593903

Committed by NeilBrown on Apr 20, 2009

.. and other arrays with components larger than 2 terabytes.

We use a "long" rather than a "sector_t" in part of the bitmap
size calculations, which is sad.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>

1f593903

17 Apr, 2009 · 1 commit

md: update sync_completed and reshape_position even more often. · c03f6a19

Committed by NeilBrown on Apr 17, 2009

There are circumstances when a user-space process might need to
"oversee" a resync/reshape process.  For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).

The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
  suspend IO to the first section and back it up
  set 'sync_max' to the end of the section
  wait for 'sync_completed' to reach that point
  resume IO on the first section and move on to the next section.

However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.

It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.

One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that).  This defeats some of the double buffering.

With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.

To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max.  This will usually be a good time for user
space to update sync_max.

If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops.  So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>

c03f6a19
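
The "half way" rule and the resulting exponentially increasing update frequency can be modelled with a few lines of C. This is a sketch of the rule as stated in the message, not the md implementation; both function names are hypothetical, and positions are treated as abstract units with completed <= sync_max.

```c
#include <stdint.h>

/* Position at which the next sync_completed update would be triggered. */
static uint64_t next_update_point(uint64_t completed, uint64_t sync_max)
{
    return completed + (sync_max - completed) / 2;
}

/*
 * Count the updates that occur if sync_max is never advanced, stopping
 * once the remaining gap has shrunk to a single unit.
 */
static unsigned updates_until_max(uint64_t completed, uint64_t sync_max)
{
    unsigned n = 0;

    while (sync_max - completed > 1) {
        completed = next_update_point(completed, sync_max);
        n++;
    }
    return n;
}
```

Each update halves the remaining gap, so a gap of 1024 units produces only about ten updates, and most of them are packed into the last few units before sync_max. If user-space advances sync_max at each update, the gap never shrinks that far.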

15 Apr, 2009 · 1 commit

block: move bio list helpers into bio.h · 8f3d8ba2

Committed by Christoph Hellwig on Apr 07, 2009

It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

8f3d8ba2

14 Apr, 2009 · 3 commits

md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0

Committed by NeilBrown on Apr 14, 2009

The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.

We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).

So:
 - make curr_resync_completed be uptodate a little more often,
   particularly when raid5 reshape updates status in the metadata
 - report curr_resync_completed in the sysfs file
 - allow poll/select to report all updates to md/sync_completed.

This makes sync_completed usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>

acb180b0

md: allow setting newly added device to 'in_sync' via sysfs. · 6d56e278

Committed by NeilBrown on Apr 14, 2009

When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.

So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>

6d56e278

md: tiny md.h cleanups · 63fe0817

Committed by Christoph Hellwig on Apr 14, 2009

 - update inclusion guard and make sure it covers the whole file
 - remove superfluous #ifdef CONFIG_BLOCK
 - make sure all required headers are included so that new users aren't
   required to include others before
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>

63fe0817

09 Apr, 2009 · 10 commits

dm kcopyd: fix callback race · 340cd444

Committed by Mikulas Patocka on Apr 09, 2009

If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.

dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()

Snapshots depend on the fact that callbacks are called from the single-threaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().

This patch fixes two problems:

1. job->fn being called from the thread that submitted the job (see above).

- Fix: hand over the completion callback to the kcopyd thread.

2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.

- Fix: pass job->write_err to the callback to report all error bits
  (it is done already in run_complete_job)

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

340cd444
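
The second problem, reporting the union of all sub-job errors rather than only the last one, can be illustrated with a small sketch. The struct and function below are hypothetical stand-ins, assuming a boolean read error and a bitmask of write errors; they are not the kcopyd data structures.

```c
#include <stdbool.h>

/* Hypothetical per-job status accumulated across all sub-jobs. */
struct job_status {
    bool read_err;            /* set if any sub-job failed to read */
    unsigned long write_err;  /* one bit per destination that failed */
};

/* Fold the result of one finished sub-job into the job-wide status. */
static void record_subjob(struct job_status *job, int read_err,
                          unsigned long write_err)
{
    if (read_err)
        job->read_err = true;
    job->write_err |= write_err;  /* union, so earlier errors are kept */
}
```

Passing the accumulated status to the completion callback means that an error in an early sub-job is still visible even when the final sub-job succeeds.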

dm kcopyd: prepare for callback race fix · 73830857

Committed by Mikulas Patocka on Apr 09, 2009

Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined.  These changes are needed by the next patch.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

73830857

dm: implement basic barrier support · af7e466a

Committed by Mikulas Patocka on Apr 09, 2009

Barriers are submitted to a worker thread that issues them in-order.

The thread is modified so that when it sees a barrier request it waits
for all pending IO before the request then submits the barrier and
waits for it.  (We must wait, otherwise it could be intermixed with
following requests.)

Errors from the barrier request are recorded in a per-device barrier_error
variable. There may be only one barrier request in progress at once.

For now, the barrier request is converted to a non-barrier request when
sending it to the underlying device.

This patch guarantees correct barrier behavior if the underlying device
doesn't perform write-back caching. The same requirement existed before
barriers were supported in dm.

Bottom layer barrier support (sending barriers by target drivers) and
handling devices with write-back caches will be done in further patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

af7e466a

dm: remove dm_request loop · 92c63902

Committed by Mikulas Patocka on Apr 09, 2009

Remove queue_io return value and a loop in dm_request.

IO may be submitted to a worker thread with queue_io().  queue_io() sets
DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
point on, requests are submitted from dm_request again. This will be used
for processing barriers.

Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread
even if DMF_QUEUE_IO_TO_THREAD was not set.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

92c63902

dm: rework queueing and suspension · 3b00b203

Committed by Mikulas Patocka on Apr 09, 2009

Rework shutting down on suspend and document the associated rules.

Drop write lock in __split_and_process_bio to allow more processing
concurrency.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

3b00b203

dm: simplify dm_request loop · 54d9a1b4

Committed by Alasdair G Kergon on Apr 09, 2009

Refactor the code in dm_request().

Require the new DMF_BLOCK_IO_FOR_SUSPEND flag on readahead bios we will
discard so we don't drop such bios while processing a barrier.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

54d9a1b4

dm: split DMF_BLOCK_IO flag into two · 1eb787ec

Committed by Alasdair G Kergon on Apr 09, 2009

Split the DMF_BLOCK_IO flag into two.

DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
device.  DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
worker thread for later processing.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

1eb787ec

dm: rearrange dm_wq_work · df12ee99

Committed by Alasdair G Kergon on Apr 09, 2009

Refactor dm_wq_work() to make a later patch more readable.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

df12ee99

dm: remove limited barrier support · 692d0eb9

Committed by Mikulas Patocka on Apr 09, 2009

Prepare for full barrier implementation: first remove the restricted support.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

692d0eb9

dm: add integrity support · 9c47008d

Committed by Martin K. Petersen on Apr 09, 2009

This patch provides support for data integrity passthrough in the device
mapper.

 - If one or more component devices support integrity, an integrity
   profile is preallocated for the DM device.

 - If all component devices have compatible profiles the DM device is
   flagged as capable.

 - Handle integrity metadata when splitting and cloning bios.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

9c47008d

07 Apr, 2009 · 1 commit

md/raid1: fix build breakage · 91a9e99d

Committed by Alexander Beregalov on Apr 07, 2009

Fix this build error:

  drivers/md/raid1.c: In function 'raid1_congested':
  drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared

BDI_write_congested was changed in commit 1faa16d2 ("block: change the
request allocation/congestion logic to be sync/async based")
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

91a9e99d

06 Apr, 2009 · 1 commit

md/raid1 - don't assume newly allocated bvecs are initialised. · 303a0e11

Committed by NeilBrown on Apr 06, 2009

Since commit d3f76110, newly allocated bvecs aren't initialised to
NULL, so we have
to be more careful about freeing a bio which only managed
to get a few pages allocated to it.  Otherwise the resync
process crashes.

This patch is appropriate for 2.6.29-stable.

Cc: stable@kernel.org
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Reported-by: Gabriele Tozzi <gabriele@tozzi.eu>
Signed-off-by: NeilBrown <neilb@suse.de>

303a0e11

03 Apr, 2009 · 8 commits

dm: set queue ordered mode · 99360b4c

Committed by Mikulas Patocka on Apr 02, 2009

Set queue ordered mode.  It doesn't really matter what we set here
because we don't ever put any requests on the queue.  But we need to set
something other than QUEUE_ORDERED_NONE so that __generic_make_request
passes barrier requests to us.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

99360b4c

dm: move wait queue declaration · b44ebeb0

Committed by Mikulas Patocka on Apr 02, 2009

Move wait queue declaration and unplug to dm_wait_for_completion.

The purpose is to minimize duplicate code in the further patches.

The patch reorders functions a little bit. It doesn't change any
functionality. For proper non-deadlock operation, add_wait_queue must
happen before set_current_state(interruptible) and before the test for
!atomic_read(&md->pending).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

b44ebeb0

dm: merge pushback and deferred bio lists · 022c2611

Committed by Mikulas Patocka on Apr 02, 2009

Merge pushback and deferred lists into one list - use deferred list
for both deferred and pushed-back bios.

This will be needed for proper support of barrier bios: it is impossible to
support ordering correctly with two lists because the requests on both lists
will be mixed up.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

022c2611

dm: allow uninterruptible wait for pending io · 401600df

Committed by Mikulas Patocka on Apr 02, 2009

Allow uninterruptible wait for pending IOs.

Add argument "interruptible" to dm_wait_for_completion that specifies
either interruptible or uninterruptible waiting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

401600df

dm: merge __flush_deferred_io into caller · ef208587

Committed by Mikulas Patocka on Apr 02, 2009

Merge __flush_deferred_io() into the only caller, dm_wq_work().

There's no need to have a function that has only one caller.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

ef208587

dm: move bio_io_error into __split_and_process_bio · f0b9a450

Committed by Mikulas Patocka on Apr 02, 2009

Move the bio_io_error() calls directly into __split_and_process_bio().

This avoids some code duplication in later patches.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

f0b9a450

dm: rename __split_bio · 8a53c28d

Committed by Mikulas Patocka on Apr 02, 2009

Rename __split_bio() to __split_and_process_bio() because it not only splits
the bio into several parts, but also submits them to target drivers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

8a53c28d

dm: remove unnecessary struct dm_wq_req · 53d5914f

Committed by Mikulas Patocka on Apr 02, 2009

Remove struct dm_wq_req and move "work" directly into struct mapped_device.

In the revised implementation, the thread will do just one type of work
(processing the queue).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

53d5914f