Commits · 33659ebbae262228eef4e0fe990f393d1f0ed941 · openanolis / cloud-kernel

24 Jun, 2010 3 commits

md: fix raid10 takeover: use new_layout for setup_conf · f73ea873

Authored by Maciej Trela on Jun 16, 2010

Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover().  That
gets set in run().
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

f73ea873

md: fix handling of array level takeover that re-arranges devices. · e93f68a1

Authored by NeilBrown on Jun 15, 2010

Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over.  However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared too high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspended.
It can also update some symlinks in sysfs which previously were not
being updated correctly.
Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

e93f68a1
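
The two-phase renumbering described in this commit can be sketched in user-space C. This is a simplified illustration, not the kernel code: `struct member_dev`, `plan_raid0_to_raid10()` and `apply_renumbering()` are hypothetical names, and doubling the slot number is an assumed mapping, chosen because it leaves device 0 in place as the message describes.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's per-device record. */
struct member_dev {
    int raid_disk;      /* slot currently in use, -1 if none */
    int new_raid_disk;  /* slot requested by the new personality */
};

/* Takeover step: only record the wanted slot; the live slot is untouched,
 * so the still-active array never sees a wrong ->raid_disk. */
void plan_raid0_to_raid10(struct member_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].raid_disk >= 0)
            devs[i].new_raid_disk = devs[i].raid_disk * 2; /* assumed mapping */
}

/* Common-code step, meant to run while the array is suspended.
 * Returns the number of devices whose slot changed. */
int apply_renumbering(struct member_dev *devs, size_t n, int raid_disks)
{
    int changed = 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].raid_disk < 0)
            continue;                      /* not an active member */
        if (devs[i].new_raid_disk >= raid_disks)
            devs[i].new_raid_disk = -1;    /* no such slot: invalidate */
        if (devs[i].new_raid_disk == devs[i].raid_disk)
            continue;                      /* nothing to do */
        devs[i].raid_disk = devs[i].new_raid_disk;
        changed++;
    }
    return changed;
}
```

Because the planning step writes only `new_raid_disk`, the decision about which devices to renumber and which to invalidate is taken in one place, after suspension.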

md: raid10: Fix null pointer dereference in fix_read_error() · 0544a21d

Authored by Prasanna S. Panchamukhi on Jun 24, 2010

A NULL pointer dereference can occur when the driver is fixing
read errors/bad blocks and the disk is physically removed,
causing a system crash. This patch checks that
rcu_dereference() returns a valid rdev before accessing it in fix_read_error().

Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0544a21d

18 May, 2010 7 commits

md: Fix read balancing in RAID1 and RAID10 on drives > 2TB · af3a2cd6

Authored by NeilBrown on May 08, 2010

read_balance uses an "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync.  This has a
very small chance of returning wrong data.
Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

af3a2cd6
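
The truncation can be illustrated with a small user-space sketch. It assumes a build where "unsigned long" is 32 bits wide (modelled here with `uint32_t`) while sector numbers are 64 bits; the function names are hypothetical and not taken from the driver. With 512-byte sectors, 2^32 sectors is exactly the 2TB boundary.

```c
#include <stdint.h>

/* 64-bit sector number, as used for large devices. */
typedef uint64_t sector_t;

/* Models storing a sector number in a 32-bit "unsigned long":
 * everything above bit 31 is silently dropped. */
uint32_t sector_as_ulong32(sector_t s)
{
    return (uint32_t)s;
}

/* Absolute distance between a target sector and the current head
 * position, the kind of estimate read balancing relies on. */
uint64_t seek_distance(sector_t target, sector_t head_position)
{
    return target > head_position ? target - head_position
                                  : head_position - target;
}
```

For a target just past the boundary, for example sector 2^32 + 8, the truncated value is 8, so a device whose head sits near sector 16 looks like the closest choice even though the real target is roughly 2^32 sectors away.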

md/raid10: tidy up printk messages. · 128595ed

Authored by NeilBrown on May 03, 2010

All raid10 printk messages now start
   md/raid10:md-device-name:
Signed-off-by: NeilBrown <neilb@suse.de>

128595ed

md: pass mddev to make_request functions rather than request_queue · 21a52c6d

Authored by NeilBrown on Apr 01, 2010

We used to pass the personality make_request function directly
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
a lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.
Signed-off-by: NeilBrown <neilb@suse.de>

21a52c6d

md: move io accounting out of personalities into md_make_request · 49077326

Authored by NeilBrown on Mar 25, 2010

While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.
Signed-off-by: NeilBrown <neilb@suse.de>

49077326

md: Add support for Raid0->Raid10 takeover · dab8b292

Authored by Trela, Maciej on Mar 08, 2010

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

dab8b292

md: don't use mddev->raid_disks in raid0 or raid10 while array is active. · 84707f38

Authored by NeilBrown on Mar 16, 2010

In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active.  This is
part of the process of reshaping such an array.

This means that we cannot use this value while processing requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.
Signed-off-by: NeilBrown <neilb@suse.de>

84707f38

drivers/md: Remove unnecessary casts of void * · 7b92813c

Authored by H Hartley Sweeten on Mar 08, 2010

void pointers do not need to be cast to other pointer types.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7b92813c

30 Mar, 2010 1 commit

include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6

Authored by Tejun Heo on Mar 24, 2010

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability.  As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and tries to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

5a0e3ad6

16 Mar, 2010 1 commit

md: deal with merge_bvec_fn in component devices better. · 627a2d3c

Authored by NeilBrown on Mar 08, 2010

If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to.  Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org

627a2d3c

26 Feb, 2010 1 commit

block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors · 086fa5ff

Authored by Martin K. Petersen on Feb 26, 2010

The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compatibility.  This can
be removed after the merge window is closed.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

086fa5ff

14 Dec, 2009 7 commits

md: add MODULE_DESCRIPTION for all md related modules. · 0efb9e61

Authored by NeilBrown on Dec 14, 2009

Suggested by Oren Held <orenhe@il.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0efb9e61

raid: improve MD/raid10 handling of correctable read errors. · 1e50915f

Authored by Robert Becker on Dec 14, 2009

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt to call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts.  Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.
Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>

1e50915f
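
The decay rule described above (halve the count for every full hour since the last read error, then compare against the threshold) can be sketched as follows. This is a minimal user-space illustration under stated assumptions: `decay_read_errors()` and `should_fail_device()` are hypothetical names, and the elapsed time is passed in as plain seconds rather than read from a kernel clock.

```c
#include <limits.h>

/* Halve the stored count once per full hour since the last read error.
 * A right shift by N equals N successive integer halvings. */
unsigned int decay_read_errors(unsigned int read_errors,
                               unsigned long seconds_since_last_error)
{
    unsigned long hours = seconds_since_last_error / 3600;

    /* Shifting by the type's width or more is undefined in C, and the
     * count would have decayed to zero by then anyway. */
    if (hours >= sizeof(read_errors) * CHAR_BIT)
        return 0;
    return read_errors >> hours;
}

/* Nonzero when the decayed count still exceeds the per-array limit. */
int should_fail_device(unsigned int read_errors,
                       unsigned long seconds_since_last_error,
                       unsigned int max_read_errors)
{
    return decay_read_errors(read_errors, seconds_since_last_error)
           > max_read_errors;
}
```

With a limit of 20, a device that accumulated 42 errors and then stayed quiet for one hour decays to 21 and is still failed, while one with 40 errors decays to 20 and is kept.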

md/raid10: print more useful messages on device failure. · 67b8dc4b

Authored by Robert Becker on Dec 14, 2009

When we get a read error on a device in a RAID10, and attempting to
repair the error fails, print more useful messages about why it
failed.
Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>

67b8dc4b

md: remove needless setting of thread->timeout in raid10_quiesce · 9cd30fdc

Authored by NeilBrown on Dec 14, 2009

As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.
Signed-off-by: NeilBrown <neilb@suse.de>

9cd30fdc

md: change daemon_sleep to be in 'jiffies' rather than 'seconds'. · 1b04be96

Authored by NeilBrown on Dec 14, 2009

This removes a lot of multiplications by HZ.
Signed-off-by: NeilBrown <neilb@suse.de>

1b04be96

md: move offset, daemon_sleep and chunksize out of bitmap structure · 42a04b50

Authored by NeilBrown on Dec 14, 2009

... and into bitmap_info.  These are all configuration parameters
that need to be set before the bitmap is created.
Signed-off-by: NeilBrown <neilb@suse.de>

42a04b50

md: support barrier requests on all personalities. · a2826aa9

Authored by NeilBrown on Dec 14, 2009

Previously barriers were only supported on RAID1.  This is because
other levels require synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeeds, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers is submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>

a2826aa9
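
The three-step sequence in this message (zero-length barriers, then the request with its barrier flag cleared, then a second round of zero-length barriers) can be modelled in user-space C. This is only a sketch of the control flow: `struct dev`, `flush_all()` and `handle_barrier()` are hypothetical, everything runs synchronously, and no real I/O is issued.

```c
#include <stddef.h>

/* Hypothetical device model for illustration. */
struct dev {
    int active;        /* nonzero if the device takes part in the array */
    int flush_result;  /* what a zero-length barrier returns, 0 = success */
    int flushes_seen;  /* how many zero-length barriers were sent to it */
};

/* Send a zero-length barrier to every active device.
 * Returns the first error seen, or 0 if all succeeded. */
int flush_all(struct dev *devs, size_t n)
{
    int err = 0;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i].active)
            continue;
        devs[i].flushes_seen++;
        if (devs[i].flush_result != 0 && err == 0)
            err = devs[i].flush_result;
    }
    return err;
}

/* payload_len == 0 models an empty barrier.  *submitted is set to 1 when
 * the payload is passed on, which happens with the barrier flag cleared. */
int handle_barrier(struct dev *devs, size_t n, size_t payload_len,
                   int *submitted)
{
    int err;

    *submitted = 0;
    err = flush_all(devs, n);      /* first run of zero-length barriers */
    if (err)
        return err;                /* a barrier request is allowed to fail */
    if (payload_len == 0)
        return 0;                  /* empty barrier: nothing more to do */
    *submitted = 1;                /* payload goes out as a plain request */
    (void)flush_all(devs, n);      /* second run: errors are ignored */
    return 0;
}
```

Only the first run of barriers can fail the request, so a striped payload is never reported as partly succeeded and partly failed.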

16 Oct, 2009 2 commits

md: raid1/raid10: handle allocation errors during array setup. · ed9bfdf1

Authored by NeilBrown on Oct 16, 2009

Both raid1 and raid10 create a mempool during startup.
If the 'alloc' function for this mempool fails, unplug_slaves
is called.
If that happens when the pool is being initialised, unplug_slaves
will try to use the 'conf' structure that isn't filled in yet, and
badness will happen.

So ensure that unplug_slaves doesn't get called unless we know
that the conf structure is fully initialised.
Signed-off-by: NeilBrown <neilb@suse.de>

ed9bfdf1

md/raid1/raid10: add a cond_resched · 1d9d5241

Authored by NeilBrown on Oct 16, 2009

During 'check' of a raid1 or raid10 it is possible for the management
thread to spend a lot of time running 'memcmp' on blocks from
different devices, so make sure the thread has a chance to schedule.
raid5d already has a cond_resched (in process_stripe).
Reported-By: Lee Howard <faxguy@howardsilvan.com>
Signed-off-by: NeilBrown <neilb@suse.de>

1d9d5241

23 Sep, 2009 4 commits

md: raid-1/10: fix RW bits manipulation · 1ef04fef

Authored by Dmitry Monakhov on Sep 20, 2009

Jens recently changed the bio_rw_flagged() logic in
commit 1f98a13f. It now returns
bool instead of int. This broke the raid1/raid10 RW bits manipulation logic.
One visible result is a BUG_ON triggering due to an empty barrier
at scsi_lib.c:1108 scsi_setup_fs_cmnd()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NeilBrown <neilb@suse.de>

1ef04fef
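
One way an int-to-bool change can break flag manipulation is sketched below. The commit message does not spell out the mechanism, so this is an assumed failure mode shown for illustration: the bit position `RW_SYNCIO_BIT` and all function names are hypothetical.

```c
#include <stdbool.h>

#define RW_SYNCIO_BIT 4  /* hypothetical bit position, for illustration */

/* Old style test: returns the masked word, so the flag stays in place. */
unsigned long flag_masked(unsigned long rw, int bit)
{
    return rw & (1UL << bit);
}

/* New style test: returns only true or false. */
bool flag_set(unsigned long rw, int bit)
{
    return (rw & (1UL << bit)) != 0;
}

/* Code written for the old style reuses the result directly as a flag
 * word.  With a bool this yields bit 0 instead of the intended bit. */
unsigned long rebuild_buggy(unsigned long rw)
{
    return (unsigned long)flag_set(rw, RW_SYNCIO_BIT);
}

/* Shifting the bool back to its position restores the intended flag. */
unsigned long rebuild_fixed(unsigned long rw)
{
    return (unsigned long)flag_set(rw, RW_SYNCIO_BIT) << RW_SYNCIO_BIT;
}
```

A caller that ORs the buggy result into a new request sets an unrelated low bit and loses the original flag.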

md: report device as congested when suspended · 3fa841d7

Authored by NeilBrown on Sep 23, 2009

This should prevent writeback from arriving while the device is
temporarily suspended.
Signed-off-by: NeilBrown <neilb@suse.de>

3fa841d7

md: Improve name of threads created by md_register_thread · 0da3c619

Authored by NeilBrown on Sep 23, 2009

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless an alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0da3c619

md: remove sparse warning "symbol xxx shadows an earlier one" · a9f326eb

Authored by NeilBrown on Sep 23, 2009

Rename some variables and remove some duplicate definitions
to avoid these warnings.  None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>

a9f326eb

11 Sep, 2009 1 commit

bio: first step in sanitizing the bio->bi_rw flag testing · 1f98a13f

Authored by Jens Axboe on Sep 11, 2009

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

1f98a13f

03 Aug, 2009 1 commit

md: Push down data integrity code to personalities. · ac5e7113

Authored by Andre Noll on Aug 03, 2009

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

ac5e7113

01 Jul, 2009 1 commit

md: Use new topology calls to indicate alignment and I/O sizes · 8f6c2e4b

Authored by Martin K. Petersen on Jul 01, 2009

Switch MD over to the new disk_stack_limits() function which checks for
alignment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

8f6c2e4b

18 Jun, 2009 2 commits

md: Push down reconstruction log message to personality code. · 8c6ac868

Authored by Andre Noll on Jun 18, 2009

Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

8c6ac868

md: Make mddev->chunk_size sector-based. · 9d8f0363

Authored by Andre Noll on Jun 18, 2009

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics.  Since

	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
				  = is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

9d8f0363
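
The identity quoted in this message holds because shifting left by 9 multiplies by 512, which is itself a power of two, so the number of set bits does not change as long as the shift does not overflow. A small user-space check, with `is_power_of_2()` written out locally and `shift_preserves_power_of_2()` as a hypothetical helper:

```c
#include <stdbool.h>

/* True when exactly one bit is set. */
bool is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Checks is_power_of_2(s << 9) == is_power_of_2(s) for every sector
 * count up to max_sectors.  The caller must keep max_sectors small
 * enough that s << 9 cannot overflow unsigned long. */
bool shift_preserves_power_of_2(unsigned long max_sectors)
{
    for (unsigned long s = 0; s <= max_sectors; s++)
        if (is_power_of_2(s << 9) != is_power_of_2(s))
            return false;
    return true;
}
```

For example, a 64KiB chunk is 65536 bytes or 128 sectors, and both values pass the test, while 96 sectors (48KiB) fails in either unit.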

16 Jun, 2009 2 commits

md: raid10: chunk size check in run · 964e7913

Authored by raz ben yehuda on Jun 16, 2009

Have raid10 check the chunk size in its run method instead of in md.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>

964e7913

md: remove mddev_to_conf "helper" macro · 070ec55d

Authored by NeilBrown on Jun 16, 2009

Having a macro just to cast a void* isn't really helpful.
I would much rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.
Signed-off-by: NeilBrown <neilb@suse.de>

070ec55d

23 May, 2009 1 commit

block: Use accessor functions for queue limits · ae03bf63

Authored by Martin K. Petersen on May 22, 2009

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

ae03bf63

07 May, 2009 1 commit

md/raid10: don't clear bitmap during recovery if array will still be degraded. · 18055569

Authored by NeilBrown on May 07, 2009

If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

18055569

15 Apr, 2009 1 commit

block: move bio list helpers into bio.h · 8f3d8ba2

Authored by Christoph Hellwig on Apr 07, 2009

It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

8f3d8ba2

31 Mar, 2009 4 commits

md: 'array_size' sysfs attribute · b522adcd

Authored by Dan Williams on Mar 31, 2009

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
   a) If size is set before the array is running, do_md_run will fail
      if size is greater than the default size
   b) A reshape attempt that reduces the default size to less than the set
      array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
   kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size.  Resync/reshape operations
always follow the default size.

Finally, fix up other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b522adcd
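
The two overflow checks named in this message can be sketched in user-space C. This is not the kernel's strict_blocks_to_sectors(); `blocks_to_sectors_checked()` and `small_sector_t` are hypothetical, and the sector type is deliberately narrowed to 32 bits so that the narrowing check can be observed. One 1k-block is two 512-byte sectors.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical narrow sector type, so that the narrowing check can trip. */
typedef uint32_t small_sector_t;

/* Parse a decimal count of 1k-blocks and convert it to 512-byte sectors.
 * Returns 0 on success and -EINVAL on bad input or overflow. */
int blocks_to_sectors_checked(const char *buf, small_sector_t *sectors)
{
    char *end;
    unsigned long long blocks, doubled;
    small_sector_t narrowed;

    errno = 0;
    blocks = strtoull(buf, &end, 10);
    if (errno != 0 || end == buf || (*end != '\0' && *end != '\n'))
        return -EINVAL;              /* not a clean decimal number */

    /* Blocks-to-sectors overflow: doubling would lose the top bit. */
    if (blocks >> (sizeof(blocks) * CHAR_BIT - 1))
        return -EINVAL;
    doubled = blocks * 2;

    /* Narrowing overflow: the value must survive the round trip. */
    narrowed = (small_sector_t)doubled;
    if (narrowed != doubled)
        return -EINVAL;

    *sectors = narrowed;
    return 0;
}
```

Checking the top bit before doubling avoids relying on wraparound, and comparing the narrowed value with the wide one catches any sector type smaller than unsigned long long.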

md: centralize ->array_sectors modifications · 1f403624

Authored by Dan Williams on Mar 31, 2009

Get personalities out of the business of directly modifying
->array_sectors.  Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

1f403624

md: add 'size' as a personality method · 80c3a6ce

Authored by Dan Williams on Mar 17, 2009

In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested.  For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

80c3a6ce

md: enable suspend/resume of md devices. · 409c57f3

Authored by NeilBrown on Mar 31, 2009

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are already
'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>

409c57f3
