Commits · 449aad3e25358812c43afc60918c5ad3819488e7 · OpenHarmony / kernel_linux

03 Aug, 2009 7 commits

md: Use revalidate_disk to effect changes in size of device. · 449aad3e

Committed by NeilBrown on Aug 03, 2009

As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>

md: allow raid5_quiesce to work properly when reshape is happening. · 64bd660b

Committed by NeilBrown on Aug 03, 2009

The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.

However reshape keeps at least some stripe busy for an extended period
of time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: set reshape_position correctly when reshape starts. · e516402c

Committed by NeilBrown on Aug 03, 2009

As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect.  It is visible in sysfs and that
is all.

Signed-off-by: NeilBrown <neilb@suse.de>

md: Handle growth of v1.x metadata correctly. · 70471daf

Committed by NeilBrown on Aug 03, 2009

The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
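
To illustrate why 'sb_size' has to follow the device count in the v1.x metadata commit above, here is a minimal sketch in Python. It is a user-space model, not the kernel code: the 256-byte fixed header, the 2-byte role entry per device slot, the 512-byte sector and the helper name `superblock_bytes` are all assumptions chosen for illustration.

```python
SECTOR_BYTES = 512        # assumed sector size
FIXED_HEADER_BYTES = 256  # assumed size of the fixed part of the superblock
ROLE_ENTRY_BYTES = 2      # assumed size of one per-device role entry


def superblock_bytes(max_dev: int) -> int:
    """Bytes to write for max_dev role slots, rounded up to whole sectors."""
    raw = FIXED_HEADER_BYTES + ROLE_ENTRY_BYTES * max_dev
    sectors = (raw + SECTOR_BYTES - 1) // SECTOR_BYTES  # ceiling division
    return sectors * SECTOR_BYTES
```

Under these assumptions, growing from 128 to 129 device slots crosses a sector boundary, so a size computed for the smaller array would no longer cover the whole superblock.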

md: avoid array overflow with bad v1.x metadata · 3673f305

Committed by NeilBrown on Aug 03, 2009

We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.

Signed-off-by: NeilBrown <neilb@suse.de>
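
The range check described in the 'desc_nr' commit above can be sketched as follows. This is a Python illustration of the idea, not the kernel patch; `role_of` and `dev_roles` are hypothetical names.

```python
def role_of(desc_nr: int, dev_roles: list) -> int:
    """Return the role stored for desc_nr, refusing untrusted out-of-range values."""
    # Validate first. Negative values are rejected too, because Python would
    # otherwise silently index from the end of the list.
    if not 0 <= desc_nr < len(dev_roles):
        raise ValueError(f"desc_nr {desc_nr} is outside 0..{len(dev_roles) - 1}")
    return dev_roles[desc_nr]
```

The point is only the ordering: the value read from on-disk metadata is validated before it is ever used as an index.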

md: when a level change reduces the number of devices, remove the excess. · 3a981b03

Committed by NeilBrown on Aug 03, 2009

When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

md: Push down data integrity code to personalities. · ac5e7113

Committed by Andre Noll on Aug 03, 2009

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

31 Jul, 2009 1 commit

md/raid6: release spare page at ->stop() · 95fc17aa

Committed by Dan Williams on Jul 31, 2009

Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

24 Jul, 2009 3 commits

dm table: pass correct dev area size to device_area_is_valid · 5dea271b

Committed by Mike Snitzer on Jul 23, 2009

Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes).  An example of a
consequent incorrect error message is:

  device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
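
A small worked example of the dm-stripe case from the commit above, as a Python sketch. The helpers `area_fits` and `stripe_area_length` are hypothetical and much simpler than the kernel's device_area_is_valid(); sizes are in sectors and the numbers are made up.

```python
def area_fits(dev_sectors: int, start: int, length: int) -> bool:
    """True if the area [start, start + length) lies within the device."""
    return start + length <= dev_sectors


def stripe_area_length(target_len: int, num_stripes: int) -> int:
    """Sectors each underlying device contributes: ti->len / #stripes."""
    return target_len // num_stripes
```

With a 2048-sector target striped over two 1024-sector devices, checking each device against the full target length wrongly reports it as too small, while checking against the per-stripe length succeeds.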

dm: remove queue next_ordered workaround for barriers · a732c207

Committed by Mike Snitzer on Jul 23, 2009

This patch removes DM's bio-based vs request-based conditional setting
of next_ordered.  For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path).  For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa23

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm raid1: wake kmirrord when requeueing delayed bios after remote recovery · 69885683

Committed by Mikulas Patocka on Jul 23, 2009

The recent commit 7513c2a7 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing).  Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
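
The wake-up rule stated in the dm raid1 commit above can be modelled roughly as below. This is a Python sketch with hypothetical names (`MirrorModel`, `queue_write`); a counter stands in for the call to wakeup_mirrord.

```python
class MirrorModel:
    """Toy model of a mirror set with a pending-writes list and a worker."""

    def __init__(self) -> None:
        self.writes = []   # stands in for ms->writes
        self.wakeups = 0   # counts simulated wake-ups of the worker

    def queue_write(self, bio) -> None:
        was_empty = not self.writes
        self.writes.append(bio)
        # Only the empty -> non-empty transition needs a wake-up; if the list
        # already had entries, whoever queued them is responsible for it.
        if was_empty:
            self.wakeups += 1
```

Skipping the wake-up on that transition is the failure mode the commit describes: the entries would stay on the list with nothing scheduled to process them.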

11 Jul, 2009 1 commit

Fix congestion_wait() sync/async vs read/write confusion · 8aa7e847

Committed by Jens Axboe on Jul 09, 2009

Commit 1faa16d2 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
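
The mix-up in the congestion_wait() commit above comes down to two pairs of constants that share the values 0 and 1. A minimal Python sketch, assuming the wait queues form a two-entry table indexed by sync/async (`congestion_queue` is a hypothetical helper, and BLK_RW_SYNC == 1 is taken as the counterpart of the BLK_RW_ASYNC == 0 quoted above):

```python
READ, WRITE = 0, 1                 # data direction
BLK_RW_ASYNC, BLK_RW_SYNC = 0, 1   # sync/async index


def congestion_queue(index: int) -> str:
    """Name of the wait queue selected by a sync/async index."""
    return ("async", "sync")[index]
```

Passing WRITE where a sync/async index is expected selects the sync queue, which is the opposite of what a caller waiting on asynchronous writeback wants.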

09 Jul, 2009 1 commit

Remove multiple KERN_ prefixes from printk formats · ad361c98

Committed by Joe Perches on Jul 06, 2009

Commit 5fd29d6c ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

01 Jul, 2009 7 commits

block: Create bip slabs with embedded integrity vectors · 7878cba9

Committed by Martin K. Petersen on Jun 26, 2009

This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>

md: use interruptible wait when duration is controlled by userspace. · e62e58a5

Committed by NeilBrown on Jul 01, 2009

User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limits, it should use an
interruptible wait so as not to add to the load average, and so as
not to trigger a warning if the wait goes on for too long.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: suspend shouldn't affect read requests. · a5c308d4

Committed by NeilBrown on Jul 01, 2009

md allows writes to regions of an array to be suspended temporarily.
This allows user-space to participate in aspects of reshape.
In particular, data can be copied with no risk of a race.
We should not be blocking read requests though, so don't.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

md: tidy up error paths in md_alloc · 0909dc44

Committed by NeilBrown on Jul 01, 2009

As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea.  So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.

Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: fix error path when duplicate name is found on md device creation. · 1ec22eb2

Committed by NeilBrown on Jul 01, 2009

When an md device is created by name (rather than number) we need to
check that the name is not already in use.  If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly created mddev.
This patch fixes that.

Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes. · b8d966ef

Committed by NeilBrown on Jul 01, 2009

If we try to modify one of the md/ sysfs files
  suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

md: Use new topology calls to indicate alignment and I/O sizes · 8f6c2e4b

Committed by Martin K. Petersen on Jul 01, 2009

Switch MD over to the new disk_stack_limits() function which checks for
alignment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

30 Jun, 2009 2 commits

dm table: fix blk_stack_limits arg to use bytes not sectors · ea9df47c

Committed by Mike Snitzer on Jun 30, 2009

The offset passed to blk_stack_limits() must be in bytes not sectors.
Fixes false warnings like the following:
device-mapper: table: 254:1: target device sda6 is misaligned

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Frans Pop <elendil@planet.nl>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
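
To show how a sector count passed as a byte offset can produce a false "misaligned" warning like the one in the commit above, here is a Python sketch. The alignment test is a deliberately simplified stand-in for what blk_stack_limits() does; `sectors_to_bytes`, `is_misaligned` and the 4096-byte physical block size are assumptions for illustration.

```python
SECTOR_SHIFT = 9  # one sector is 512 bytes


def sectors_to_bytes(sectors: int) -> int:
    """Convert a sector count into a byte offset."""
    return sectors << SECTOR_SHIFT


def is_misaligned(offset_bytes: int, physical_block_bytes: int) -> bool:
    """Simplified check: the offset must be a multiple of the physical block size."""
    return offset_bytes % physical_block_bytes != 0
```

A partition that starts at sector 8 begins exactly 4096 bytes into the disk and is aligned, but treating the raw value 8 as a byte offset makes it look misaligned.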

dm exception store: really fix type lookup · 874d2f61

Committed by Milan Broz on Jun 30, 2009

Fix exception store name handling.

We need to reference exception store by zero terminated string.

Fixes regression introduced in commit f6bd4eb7

Cc: Yi Yang <yi.y.yang@intel.com>
Cc: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

22 Jun, 2009 18 commits

dm mpath: change to be request based · f40c67f0

Committed by Kiyoshi Ueda on Jun 22, 2009

This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism.  The change in I/O queueing is described in detail
below.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O completion, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission.  So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.

On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  To resubmit the clone request, memory must be allocated for
clone bios, which adds overhead.
To avoid that overhead just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core to queue and remap
the original request of the clone request, since the only overhead of
queueing is freeing the memory for the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.

multipath_busy()
---------------------
The target driver returns "busy" only when both of the following hold:
  o The target driver would map I/Os if the map() function were called
  and
  o The mapped I/Os would wait on the underlying devices' queues due to
    congestion if the map() function were called now.

In other cases, the target driver doesn't return "busy".
Otherwise, dm core would hold on to the I/Os and the target driver
couldn't do what it wants
(e.g. the target driver can't map I/Os now, so it wants to kill them).

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: disable interrupt when taking map_lock · 523d9297

Committed by Kiyoshi Ueda on Jun 22, 2009

This patch disables interrupts when taking map_lock to avoid
lockdep warnings in request-based dm.

request-based dm takes map_lock after taking queue_lock with
interrupts disabled:
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)
while queue_lock could be (but isn't) taken in interrupt context.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: do not set QUEUE_ORDERED_DRAIN if request based · 5d67aa23

Committed by Kiyoshi Ueda on Jun 22, 2009

Request-based dm doesn't have barrier support yet.
So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table loading time,
the flag set is deferred until then.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: enable request based option · e6ee8c0b

Committed by Kiyoshi Ueda on Jun 22, 2009

This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are better suited to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since a bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table binding time.
  Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred until table loading time, since
  mempools for request-based dm are different from those for bio-based
  dm and the needed mempool type is fixed by the type of table.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  requires all target drivers to support the merge() function.
  Both will take time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
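
The stacking table in the request-based dm commit above reduces to a single rule, sketched here in Python. `can_stack` and the two string constants are hypothetical names used only for this illustration.

```python
BIO = "bio-based"
REQUEST = "request-based"


def can_stack(upper: str, lower: str) -> bool:
    """True if a device of type `upper` may sit on top of one of type `lower`."""
    # The only excluded combination is a request-based device on top of a
    # bio-based one, because a bio-based device cannot receive struct request.
    return not (upper == REQUEST and lower == BIO)
```

The other three combinations in the table are all allowed.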

dm: prepare for request based option · cec47e3d

Committed by Kiyoshi Ueda on Jun 22, 2009

This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
    make_request_fn: dm_make_request()
    prep_rq_fn:      dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md

  bio to request
  ==============
  md->queue->make_request_fn() (dm_make_request()) calls __make_request()
  for a bio submitted to the md.
  Then, the bio is kept in the queue as a new request or merged into
  another request in the queue if possible.

  Cloning and Mapping
  ===================
  Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
  when requests are dispatched after they are sorted by the I/O scheduler.

  dm_request_fn() checks busy state of underlying devices using
  target's busy() function and stops dispatching requests to keep them
  on the dm device's queue if busy.
  It helps better I/O merging, since no merge is done for a request
  once it is dispatched to underlying devices.

  Actual cloning and mapping are done in dm_prep_fn() and map_request()
  called from dm_request_fn().
  dm_prep_fn() clones not only request but also bios of the request
  so that dm can hold bio completion in error cases and prevent
  the bio submitter from noticing the error.
  (See the "Completion" section below for details.)

  After the cloning, the clone is mapped by target's map_rq() function
  and inserted to underlying device's queue using
  blk_insert_cloned_request().

  Completion
  ==========
  Request completion can be hooked by rq->end_io(), but then, all bios
  in the request will have been completed even in error cases, and the bio
  submitter will have noticed the error.
  To prevent the bio completion in error cases, request-based dm clones
  both bio and request and hooks both bio->bi_end_io() and rq->end_io():
      bio->bi_end_io(): end_clone_bio()
      rq->end_io():     end_clone_request()

  Summary of the request completion flow is below:
  blk_end_request() for a clone request
    => blk_update_request()
       => bio->bi_end_io() == end_clone_bio() for each clone bio
          => Free the clone bio
          => Success: Complete the original bio (blk_update_request())
             Error:   Don't complete the original bio
    => blk_finish_request()
       => rq->end_io() == end_clone_request()
          => blk_complete_request()
             => dm_softirq_done()
                => Free the clone request
                => Success: Complete the original request (blk_end_request())
                   Error:   Requeue the original request

  end_clone_bio() completes the original request on the size of
  the original bio in successful cases.
  Even if all bios in the original request are completed by that
  completion, the original request must not be completed yet to keep
  the ordering of request completion for the stacking.
  So end_clone_bio() uses blk_update_request() instead of
  blk_end_request().
  In error cases, end_clone_bio() doesn't complete the original bio.
  It just frees the cloned bio and gives over the error handling to
  end_clone_request().

  end_clone_request(), which is called with queue lock held, completes
  the clone request and the original request in a softirq context
  (dm_softirq_done()), which has no queue lock, to avoid a deadlock
  issue on submission of another request during the completion:
      - The submitted request may be mapped to the same device
      - Request submission requires queue lock, but the queue lock
        has been held by itself and it doesn't know that

  The clone request has no clone bio when dm_softirq_done() is called.
  So target drivers can't resubmit it, even in error cases.
  Instead, they can ask dm core to requeue and remap
  the original request in those cases.

  suspend
  =======
  Request-based dm treats stopping md->queue as suspending the md.
  For noflush suspend, it just stops md->queue.

  For flush suspend, it inserts a marker request at the tail of md->queue
  and dispatches all requests in md->queue until the marker comes to
  the front of md->queue.  Then it stops dispatching requests and waits
  for all dispatched requests to complete.
  After that, it completes the marker request, stops md->queue and
  wakes up the waiter on the suspend queue, md->wait.

  resume
  ======
  Starts md->queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
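
The flush-suspend sequence described in the commit above (marker at the tail, dispatch until the marker reaches the front) can be modelled with a plain queue. This Python sketch uses hypothetical names (`MARKER`, `flush_suspend`) and leaves out waiting for completions and the wake-up of md->wait.

```python
from collections import deque

MARKER = object()  # stands in for the marker request


def flush_suspend(queue: deque, late_arrivals=()) -> list:
    """Dispatch everything queued before the marker and return it in order."""
    queue.append(MARKER)          # marker goes to the tail of the queue
    queue.extend(late_arrivals)   # requests queued after the marker
    dispatched = []
    while queue[0] is not MARKER:
        dispatched.append(queue.popleft())
    queue.popleft()               # complete the marker, then stop the queue
    return dispatched
```

Requests that arrive after the marker stay on the queue, which models them being held until resume starts the queue again.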

dm raid1: add userspace log · f5db4af4

Committed by Jonathan Brassow on Jun 22, 2009

This patch contains a device-mapper mirror log module that forwards
requests to userspace for processing.

The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h.  Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' was chosen as the interface for
communication.

The first log implementations written in userspace - "clustered-disk"
and "clustered-core" - support clustered shared storage.   A userspace
daemon (in the LVM2 source code repository) uses openAIS/corosync to
process requests in an ordered fashion with the rest of the nodes in the
cluster so as to prevent log state corruption.  Other implementations
with no association to LVM or openAIS/corosync, are certainly possible.

(Imagine if two machines are writing to the same region of a mirror.
They would both mark the region dirty, but you need a cluster-aware
entity that can handle properly marking the region clean when they are
done.  Otherwise, you might clear the region when the first machine is
done, not the second.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: calculate queue limits during resume not load · 754c5fc7