1. 21 Jul 2008, 7 commits
    • md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown authored
      All modifications and most access to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the call to export_rdev for that last case.
      
      Note also that ->same_set is sometimes used for lists other than
      mddev->disks (e.g. candidates).  In these cases rcu is not needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
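      A minimal sketch of the pattern described above, using the 2008-era md
      names (mddev_t, mdk_rdev_t, mddev->disks, rdev->same_set, rdev->flags).
      The rcu_head member ('rcu') and md_free_rdev_rcu() below are illustrative
      assumptions, not the exact code from the patch:

        static void md_free_rdev_rcu(struct rcu_head *head)
        {
                /* 'rcu' is an assumed rcu_head embedded in the rdev */
                mdk_rdev_t *rdev = container_of(head, mdk_rdev_t, rcu);

                kfree(rdev);
        }

        /* Lockless reader: safe against a concurrent add/remove. */
        static int count_in_sync_disks(mddev_t *mddev)
        {
                mdk_rdev_t *rdev;
                int cnt = 0;

                rcu_read_lock();
                list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
                        if (test_bit(In_sync, &rdev->flags))
                                cnt++;
                rcu_read_unlock();
                return cnt;
        }

        /* Writers still run under reconfig_mutex but use the RCU list helpers
         * and defer the free until all readers have finished. */
        static void remove_rdev_from_array(mdk_rdev_t *rdev)
        {
                list_del_rcu(&rdev->same_set);
                call_rcu(&rdev->rcu, md_free_rdev_rcu);
        }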
    • md: only count actual openers as access which prevent a 'stop' · f2ea68cf
      NeilBrown authored
      Open isn't the only thing that increments ->active.  e.g. reading
      /proc/mdstat will increment it briefly.  So to avoid false positives
      in testing for concurrent access, introduce a new counter that counts
      just the number of times the md device is open.
      Signed-off-by: NeilBrown <neilb@suse.de>
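      A hedged sketch of the idea: a dedicated open count next to ->active, so
      that stopping the array only fails when somebody really has it open.  The
      field name 'openers' and the is_open check are assumptions based on the
      description above, not a copy of the patch:

        /* md_open(): count a real opener; transient ->active users such as a
         * /proc/mdstat read never touch this counter. */
        atomic_inc(&mddev->openers);

        /* md_release(): */
        atomic_dec(&mddev->openers);

        /* do_md_stop(): refuse only if someone other than the stopping opener
         * still holds the device open. */
        if (atomic_read(&mddev->openers) > is_open) {
                printk(KERN_WARNING "md: %s still in use.\n", mdname(mddev));
                return -EBUSY;
        }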
    • md: Make mddev->array_size sector-based. · f233ea5c
      Andre Noll authored
      This patch renames the array_size field of struct mddev_s to array_sectors
      and converts all instances to use units of 512 byte sectors instead of 1k
      blocks.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
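      The unit change in a nutshell (a sketch; 'size' is a 1K-block count coming
      from the personality, and set_capacity() already expects 512-byte sectors):

        /* before: array_size in 1K blocks */
        mddev->array_size = size;
        set_capacity(mddev->gendisk, mddev->array_size << 1);

        /* after: array_sectors in 512-byte sectors */
        mddev->array_sectors = size * 2;
        set_capacity(mddev->gendisk, mddev->array_sectors);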
    • md: Make super_type->rdev_size_change() take sector-based sizes. · 15f4a5fd
      Andre Noll authored
      Also, change the type of the size parameter from unsigned long long to
      sector_t and rename it to num_sectors.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix check for overlapping devices. · d07bd3bc
      Andre Noll authored
      The checks in overlaps() expect all parameters either in block-based
      or sector-based quantities. However, its single caller passes two
      rdev->data_offset arguments as well as two rdev->size arguments, the
      former being sector counts while the latter are measured in 1K blocks.
      
      This could cause rdev_size_store() to accept an invalid size from user
      space. Fix it by passing only sector-based quantities to overlaps().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
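      A sketch of the check in consistent units: the helper compares two
      half-open sector ranges, and the caller converts rdev->size (1K blocks)
      to sectors before passing it in.  Names follow the description above
      rather than the exact md.c code:

        /* Do [s1, s1+l1) and [s2, s2+l2) intersect?  All arguments in sectors. */
        static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
        {
                return s1 < s2 + l2 && s2 < s1 + l1;
        }

        static int new_size_overlaps(mdk_rdev_t *rdev, mdk_rdev_t *rdev2)
        {
                /* data_offset is already a sector count; size is in 1K blocks */
                return overlaps(rdev->data_offset, rdev->size * 2,
                                rdev2->data_offset, rdev2->size * 2);
        }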
    • md: Tidy up rdev_size_store a bit: · d7027458
      Neil Brown authored
       - use strict_strtoull in place of simple_strtoull
       - use my_mddev in place of rdev->mddev (they have the same value)
      and more significantly,
       - don't adjust mddev->size to fit, rather reject changes which make
         rdev->size smaller than mddev->size
      
      Adjusting mddev->size is a hangover from bind_rdev_to_array which
      does a similar thing.  But it really is a better design to insist that
      mddev->size is set as required, then the rdev->sizes are set to allow
      for that.  The previous way invites confusion.
      Signed-off-by: NeilBrown <neilb@suse.de>
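      A condensed sketch of the resulting store path under the rules listed
      above (strict parsing, my_mddev, and rejecting a size that would be too
      small for the active array); not the full function:

        unsigned long long size;

        if (strict_strtoull(buf, 10, &size) < 0)
                return -EINVAL;                 /* reject trailing junk */
        if (my_mddev->pers && size < my_mddev->size)
                return -EINVAL;                 /* too small for the array */
        rdev->size = size;                      /* still in 1K blocks here */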
  2. 15 Jul 2008, 1 commit
  3. 11 Jul 2008, 10 commits
  4. 10 Jul 2008, 1 commit
  5. 08 Jul 2008, 7 commits
  6. 03 Jul 2008, 1 commit
    • Add bvec_merge_data to handle stacked devices and ->merge_bvec() · cc371e66
      Alasdair G Kergon authored
      When devices are stacked, one device's merge_bvec_fn may need to perform
      the mapping and then call one or more functions for its underlying devices.
      
      The following bio fields are used:
        bio->bi_sector
        bio->bi_bdev
        bio->bi_size
        bio->bi_rw  using bio_data_dir()
      
      This patch creates a new struct bvec_merge_data holding a copy of those
      fields to avoid having to change them directly in the struct bio when
      going down the stack only to have to change them back again on the way
      back up.  (And then when the bio gets mapped for real, the whole
      exercise gets repeated, but that's a problem for another day...)
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
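      For reference, the new structure and the callback it is passed to look
      roughly like this; a stacked driver fills a bvec_merge_data for the
      underlying device and calls that device's merge_bvec_fn without touching
      the original bio:

        struct bvec_merge_data {
                struct block_device     *bi_bdev;
                sector_t                bi_sector;
                unsigned int            bi_size;
                unsigned long           bi_rw;
        };

        typedef int (merge_bvec_fn) (struct request_queue *q,
                                     struct bvec_merge_data *bvm,
                                     struct bio_vec *biovec);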
  7. 02 Jul 2008, 1 commit
  8. 01 Jul 2008, 1 commit
    • md: resolve external metadata handling deadlock in md_allow_write · b5470dc5
      Dan Williams authored
      md_allow_write() marks the metadata dirty while holding mddev->lock and then
      waits for the write to complete.  For externally managed metadata this causes a
      deadlock as userspace needs to take the lock to communicate that the metadata
      update has completed.
      
      Change md_allow_write() in the 'external' case to start the 'mark active'
      operation and then return -EAGAIN.  The expected side effects while waiting
      for userspace to write 'active' to 'array_state' are: reshape is held off
      (the code currently handles -ENOMEM), some 'stripe_cache_size' change
      requests fail, some GET_BITMAP_FILE ioctl requests fall back to GFP_NOIO,
      and updates to 'raid_disks' fail.  Except for 'stripe_cache_size' changes,
      these failures can be mitigated by coordinating with mdmon.
      
      md_write_start() still prevents writes from occurring until the metadata
      handler has had a chance to take action as it unconditionally waits for
      MD_CHANGE_CLEAN to be cleared.
      
      [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
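      In practice a caller now looks roughly like this (a sketch; the error is
      simply propagated while userspace has not yet written 'active' to
      'array_state'):

        err = md_allow_write(mddev);
        if (err)
                return err;     /* -EAGAIN for external metadata: mdmon has not
                                 * marked the array active yet, retry later */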
  9. 28 Jun 2008, 11 commits
    • md: rationalize raid5 function names · 1fe797e6
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Commit a4456856 refactored some of the deep code paths in raid5.c into separate
      functions.  The names chosen at the time do not consistently indicate what is
      going to happen to the stripe.  So, update the names, and since a stripe is a
      cache element use cache semantics like fill, dirty, and clean.
      
      (also, fix up the indentation in fetch_block5)
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: handle operation chaining in raid5_run_ops · 7b3a871e
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Neil said:
      > At the end of ops_run_compute5 you have:
      >         /* ack now if postxor is not set to be run */
      >         if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run))
      >                 async_tx_ack(tx);
      >
      > It looks odd having that test there.  Would it fit in raid5_run_ops
      > better?
      
      The intended global interpretation is that raid5_run_ops can build a chain
      of xor and memcpy operations.  When MD registers the compute-xor it tells
      async_tx to keep the operation handle around so that another item in the
      dependency chain can be submitted. If we are just computing a block to
      satisfy a read then we can terminate the chain immediately.  raid5_run_ops
      gives a better context for this test since it cares about the entire chain.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
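      A sketch of where the test ends up; names follow the discussion above and
      the surrounding chain-building in raid5_run_ops is abbreviated:

        /* inside raid5_run_ops(), which sees the whole chain: */
        if (test_bit(STRIPE_OP_COMPUTE_BLK, &ops_request)) {
                tx = ops_run_compute5(sh);
                /* ack now if nothing else (e.g. a postxor) will be chained
                 * on top of this compute */
                if (tx && !test_bit(STRIPE_OP_POSTXOR, &ops_request))
                        async_tx_ack(tx);
        }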
    • md: replace R5_WantPrexor with R5_WantDrain, add 'prexor' reconstruct_states · d8ee0728
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Currently ops_run_biodrain and other locations have extra logic to determine
      which blocks are processed in the prexor and non-prexor cases.  This can be
      eliminated if handle_write_operations5 flags the blocks to be processed in all
      cases via R5_Wantdrain.  The presence of the prexor operation is tracked in
      sh->reconstruct_state.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states' · 600aa109
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Track the state of reconstruct operations (recalculating the parity block,
      usually due to incoming writes or as part of array expansion) with an
      explicit state variable.  This reduces the scope of the
      STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether a
      reconstruct operation has been requested via the ops_request field of
      struct stripe_head_state.
      
      This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
      the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
      not track the state of the operation.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
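      Combined with the R5_Wantdrain patch above, the explicit states end up
      looking roughly like this (a sketch; member names may differ slightly
      from the actual raid5.h):

        enum reconstruct_states {
                reconstruct_state_idle = 0,
                reconstruct_state_prexor_drain_run,     /* prexor + drain + postxor */
                reconstruct_state_drain_run,            /* drain + postxor */
                reconstruct_state_drain_result,
                reconstruct_state_prexor_drain_result,
                reconstruct_state_result,
        };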
    • md: replace STRIPE_OP_COMPUTE_BLK with STRIPE_COMPUTE_RUN · 976ea8d4
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Track the state of compute operations (recalculating a block from all the other
      blocks in a stripe) with a state flag.  This reduces the scope of the
      STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has
      been requested via the ops_request field of struct stripe_head_state.
      
      Note, the compute operation that is performed in the course of doing a 'repair'
      operation (check the parity block, recalculate it and write it back if the
      check result is not zero) is tracked separately with the 'check_state'
      variable.  Compute operations are held off while a 'check' is in progress,
      and by moving this check out to handle_issuing_new_read_requests5, the
      helper routine __handle_issuing_new_read_requests5 can be simplified.
      
      This is another step towards the removal of ops.{pending,ack,complete,count},
      i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the
      state of the operation.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: replace STRIPE_OP_BIOFILL with STRIPE_BIOFILL_RUN · 83de75cc
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Track the state of read operations (copying data from the stripe cache to bio
      buffers outside the lock) with a state flag.  Reduce the scope of the
      STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been
      requested via the ops_request field of struct stripe_head_state.
      
      This is another step towards the removal of ops.{pending,ack,complete,count},
      i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state
      of the operation.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: replace STRIPE_OP_CHECK with 'check_states' · ecc65c9b
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      The STRIPE_OP_* flags record the state of stripe operations which are
      performed outside the stripe lock.  Their use in indicating which
      operations need to be run is straightforward; however, interpolating what
      the next state of the stripe should be based on a given combination of
      these flags is not straightforward, and has led to bugs.  An easier to read
      implementation with minimal degrees of freedom is needed.
      
      Towards this goal, this patch introduces explicit states to replace what was
      previously interpolated from the STRIPE_OP_* flags.  For now this only converts
      the handle_parity_checks5 path, removing a user of the
      ops.{pending,ack,complete,count} fields of struct stripe_operations.
      
      This conversion also found a remaining issue with the current code.  There is
      a small window for a drive to fail between when we schedule a repair and when
      the parity calculation for that repair completes.  When this happens we will
      write back to 'failed_num' when we really want to write back to 'pd_idx'.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
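      A sketch of the explicit states that replace the interpolated flag
      combinations (member names may differ slightly from the actual patch):

        enum check_states {
                check_state_idle = 0,
                check_state_run,                /* parity check (xor zero-sum) in flight */
                check_state_check_result,       /* result ready: decide whether to repair */
                check_state_compute_run,        /* repair: parity recompute in flight */
                check_state_compute_result,     /* repaired parity ready to write back */
        };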
    • md: unify raid5/6 i/o submission · f0e43bcd
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      Let the raid6 path call ops_run_io to get pending i/o submitted.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: use stripe_head_state in ops_run_io() · c4e5ac0a
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      In handle_stripe after taking sh->lock we sample some bits into 's' (struct
      stripe_head_state):
      
      	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
      	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
      	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
      
      Use these values from 's' in ops_run_io() rather than re-sampling the bits.
      This ensures a consistent snapshot (as seen under sh->lock) is used.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
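      The practical effect is a signature change plus use of the snapshot
      instead of fresh test_bit() calls, along these lines (a sketch):

        static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s);

        /* in handle_stripe5(), after sampling sh->state under sh->lock: */
        ops_run_io(sh, &s);     /* reads s->syncing, s->expanding, s->expanded
                                 * rather than re-testing bits in sh->state */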
    • md: kill STRIPE_OP_IO flag · 2b7497f0
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      The R5_Want{Read,Write} flags already gate i/o.  So, this flag is
      superfluous and we can unconditionally call ops_run_io().
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
    • md: kill STRIPE_OP_MOD_DMA in raid5 offload · b203886e
      Dan Williams authored
      From: Dan Williams <dan.j.williams@intel.com>
      
      This micro-optimization allowed the raid code to skip a re-read of the
      parity block after checking parity.  It took advantage of the fact that
      xor-offload-engines have their own internal result buffer and can check
      parity without writing to memory.  Remove it for the following reasons:
      
      1/ It is a layering violation for MD to need to manage the DMA and
         non-DMA paths within async_xor_zero_sum
      2/ Bad precedent to toggle the 'ops' flags outside the lock
      3/ Hard to realize a performance gain as reads will not need an updated
         parity block and writes will dirty it anyway.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>