提交 · 1e302a929e2da6e8448e2058e4b07b07252b57fe · openeuler / Kernel

03 4月, 2009 24 次提交

dm snapshot: move status to exception store · 1e302a92

由 Jonathan Brassow 提交于 4月 02, 2009

Let the exception store types print out their status through
the new API, rather than having the snapshot code do it.

Adjust the buffer position to allow for the preceding DMEMIT in the
arguments to type->status().
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

1e302a92

dm snapshot: move ctr parsing to exception store · fee1998e

由 Jonathan Brassow 提交于 4月 02, 2009

First step of having the exception stores parse their own arguments -
generalizing the interface.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

fee1998e

dm snapshot: use DMEMIT macro for status · 2e4a31df

由 Jonathan Brassow 提交于 4月 02, 2009

Use DMEMIT in place of snprintf.  This makes it easier later when
other modules are helping to populate our status output.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

2e4a31df

dm snapshot: remove dm_snap header · ccc45ea8

由 Jonathan Brassow 提交于 4月 02, 2009

Move some of the last bits from dm-snap.h into dm-snap.c where they
belong and remove dm-snap.h.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

ccc45ea8

dm snapshot: remove dm_snap header use · 71fab00a

由 Jonathan Brassow 提交于 4月 02, 2009

Move useful functions out of dm-snap.h and stop using dm-snap.h.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

71fab00a

dm exception store: move cow pointer · 49beb2b8

由 Jonathan Brassow 提交于 4月 02, 2009

Move COW device from snapshot to exception store.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

49beb2b8

dm exception store: move chunk_fields · d0216849

由 Jonathan Brassow 提交于 4月 02, 2009

Move chunk fields from snapshot to exception store.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

d0216849

dm exception store: move dm_target pointer · 0cea9c78

由 Jonathan Brassow 提交于 4月 02, 2009

Move target pointer from snapshot to exception store.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

0cea9c78

dm exception store: introduce registry · 493df71c

由 Jonathan Brassow 提交于 4月 02, 2009

Move exception stores into a registry.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

493df71c

dm raid1: add is_remote_recovering hook for clusters · 7513c2a7

由 Jonathan Brassow 提交于 4月 02, 2009

The logging API needs an extra function to make cluster mirroring
possible.  This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster.  This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.

Cluster-aware log modules will implement this function.  Single
machine log modules will not.  So, there is no performance
penalty for single machine mirrors.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Acked-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

7513c2a7

dm exception store: separate type from instance · b2a11465

由 Jonathan Brassow 提交于 4月 02, 2009

Introduce struct dm_exception_store_type.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

b2a11465

dm log: remove struct dm_dirty_log_internal · ec44ab9d

由 Mike Snitzer 提交于 4月 02, 2009

Remove the 'dm_dirty_log_internal' structure.  The resulting cleanup
eliminates extra memory allocations.  Therefore exposing the internal
list_head to the external 'dm_dirty_log_type' structure is a worthwhile
compromise.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

ec44ab9d

dm log: use standard kernel module refcount · 84e67c93

由 Mike Snitzer 提交于 4月 02, 2009

Avoid private module usage accounting by removing 'use' from
dm_dirty_log_internal.  The standard module reference counting is
sufficient.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

84e67c93

dm crypt: use kzfree · b81d6cf7

由 Johannes Weiner 提交于 4月 02, 2009

Use kzfree() instead of memset() + kfree().
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

b81d6cf7

dm target: remove struct tt_internal · 45194e4f

由 Cheng Renquan 提交于 4月 02, 2009

The tt_internal is really just a list_head to manage registered target_type
in a double linked list,

Here embed the list_head into target_type directly,
1. to avoid kmalloc/kfree;
2. then tt_internal is really unneeded;

Cc: stable@kernel.org
Signed-off-by: NCheng Renquan <crquan@gmail.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Reviewed-by: NAlasdair G Kergon <agk@redhat.com>

45194e4f

dm table: fix upgrade mode race · 570b9d96

由 Alasdair G Kergon 提交于 4月 02, 2009

upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.

In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.

Fix this by not changing that field during the mode upgrade.

Cc: stable@kernel.org
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

570b9d96

dm: path selector use module refcount directly · aea90588

由 Jun'ichi Nomura 提交于 4月 02, 2009

Fix refcount corruption in dm-path-selector

Refcounting with non-atomic ops under shared lock will corrupt the counter
in multi-processor system and may trigger BUG_ON().
Use module refcount.
# same approach as dm-target-use-module-refcount-directly.patch here
# https://www.redhat.com/archives/dm-devel/2008-December/msg00075.html

Typical oops:
  kernel BUG at linux-2.6.29-rc3/drivers/md/dm-path-selector.c:90!
  Pid: 11148, comm: dmsetup Not tainted 2.6.29-rc3-nm #1
  dm_put_path_selector+0x4d/0x61 [dm_multipath]
  Call Trace:
   [<ffffffffa031d3f9>] free_priority_group+0x33/0xb3 [dm_multipath]
   [<ffffffffa031d4aa>] free_multipath+0x31/0x67 [dm_multipath]
   [<ffffffffa031d50d>] multipath_dtr+0x2d/0x32 [dm_multipath]
   [<ffffffffa015d6c2>] dm_table_destroy+0x64/0xd8 [dm_mod]
   [<ffffffffa015b73a>] __unbind+0x46/0x4b [dm_mod]
   [<ffffffffa015b79f>] dm_swap_table+0x60/0x14d [dm_mod]
   [<ffffffffa015f963>] dev_suspend+0xfd/0x177 [dm_mod]
   [<ffffffffa0160250>] dm_ctl_ioctl+0x24c/0x29c [dm_mod]
   [<ffffffff80288cd3>] ? get_page_from_freelist+0x49c/0x61d
   [<ffffffffa015f866>] ? dev_suspend+0x0/0x177 [dm_mod]
   [<ffffffff802bf05c>] vfs_ioctl+0x2a/0x77
   [<ffffffff802bf4f1>] do_vfs_ioctl+0x448/0x4a0
   [<ffffffff802bf5a0>] sys_ioctl+0x57/0x7a
   [<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b

Cc: stable@kernel.org
Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

aea90588

dm target: use module refcount directly · 5642b8a6

由 Cheng Renquan 提交于 4月 02, 2009

The tt_internal's 'use' field is superfluous: the module's refcount can do
the work properly.  An acceptable side-effect is that this increases the
reference counts reported by 'lsmod'.

Remove the superfluous test when removing a target module.

[Crash possible without this on SMP - agk]

Cc: stable@kernel.org
Signed-off-by: NCheng Renquan <crquan@gmail.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
Reviewed-by: NJonathan Brassow <jbrassow@redhat.com>

5642b8a6

dm snapshot: avoid having two exceptions for the same chunk · 35bf659b

由 Mikulas Patocka 提交于 4月 02, 2009

We need to check if the exception was completed after dropping the lock.

After regaining the lock, __find_pending_exception checks if the exception
was already placed into &s->pending hash.

But we don't check if the exception was already completed and placed into
&s->complete hash. If the process waiting in alloc_pending_exception was
delayed at this point because of a scheduling latency and the exception
was meanwhile completed, we'd miss that and allocate another pending
exception for already completed chunk.

It would lead to a situation where two records for the same chunk exist
and potential data corruption because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the
snapshot.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

35bf659b

dm snapshot: avoid dropping lock in __find_pending_exception · c6621392

由 Mikulas Patocka 提交于 4月 02, 2009

It is uncommon and bug-prone to drop a lock in a function that is called with
the lock held, so this is moved to the caller.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

c6621392

dm snapshot: refactor __find_pending_exception · 2913808e

由 Mikulas Patocka 提交于 4月 02, 2009

Move looking-up of a pending exception from __find_pending_exception to another
function.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

2913808e

dm io: make sync_io uninterruptible · b64b6bf4

由 Mikulas Patocka 提交于 4月 02, 2009

If someone sends signal to a process performing synchronous dm-io call,
the kernel may crash.

The function sync_io attempts to exit with -EINTR if it has pending signal,
however the structure "io" is allocated on stack, so already submitted io
requests end up touching unallocated stack space and corrupting kernel memory.

sync_io sets its state to TASK_UNINTERRUPTIBLE, so the signal can't break out
of io_schedule() --- however, if the signal was pending before sync_io entered
while (1) loop, the corruption of kernel memory will happen.

There is no way to cancel in-progress IOs, so the best solution is to ignore
signals at this point.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

b64b6bf4

dm raid1: switch read_record from kmalloc to slab to save memory · 95f8fac8

由 Mikulas Patocka 提交于 4月 02, 2009

With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
is significantly increased (the vector list takes 3072 bytes on 32-bit machines
and 4096 bytes on 64-bit machines).

The structure dm_raid1_read_record used to be allocated with kmalloc,
but kmalloc aligns the size on the next power-of-two so an object
slightly greater than 4096 will allocate 8192 bytes of memory and half of
that memory will be wasted.

This patch turns kmalloc into a slab cache which doesn't have this
padding so it will reduce the memory consumed.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

95f8fac8

dm: preserve bi_io_vec when resubmitting bios · a920f6b3

由 Mikulas Patocka 提交于 4月 02, 2009

Device mapper saves and restores various fields in the bio, but it doesn't save
bi_io_vec.  If the device driver modifies this after a partially successful
request, dm-raid1 and dm-multipath may attempt to resubmit a bio that has
bi_size inconsistent with the size of vector.

To make requests resubmittable in dm-raid1 and dm-multipath, we must save
and restore the bio vector as well.

To reduce the memory overhead involved in this, we do not save the pages in a
vector and use a 16-bit field size if the page size is less than 65536.

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

a920f6b3

17 3月, 2009 5 次提交

dm crypt: wait for endio to complete before destruction · b35f8caa

由 Milan Broz 提交于 3月 16, 2009

The following oops has been reported when dm-crypt runs over a loop device.

...
[   70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
...
[   70.381058] Call Trace:
[   70.381058]  [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
[   70.381058]  [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt]
[   70.381058]  [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt]
[   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
[   70.381058]  [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod]
[   70.381058]  [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod]
[   70.381058]  [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod]
[   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
[   70.381058]  [<c02bad86>] ? loop_thread+0x380/0x3b7
[   70.381058]  [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165
[   70.381058]  [<c013754f>] ? autoremove_wake_function+0x0/0x33
[   70.381058]  [<c02baa06>] ? loop_thread+0x0/0x3b7

When a table is being replaced, it waits for I/O to complete
before destroying the mempool, but the endio function doesn't
call mempool_free() until after completing the bio.

Fix it by swapping the order of those two operations.

The same problem occurs in dm.c with md referenced after dec_pending.
Again, we swap the order.

Cc: stable@kernel.org
Signed-off-by: NMilan Broz <mbroz@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

b35f8caa

dm crypt: fix kcryptd_async_done parameter · b2174eeb

由 Huang Ying 提交于 3月 16, 2009

In the async encryption-complete function (kcryptd_async_done), the
crypto_async_request passed in may be different from the one passed to
crypto_ablkcipher_encrypt/decrypt.  Only crypto_async_request->data is
guaranteed to be same as the one passed in.  The current
kcryptd_async_done uses the passed-in crypto_async_request directly
which may cause the AES-NI-based AES algorithm implementation to panic.

This patch fixes this bug by only using crypto_async_request->data,
which points to dm_crypt_request, the crypto_async_request passed in.
The original data (convert_context) is gotten from dm_crypt_request.

[mbroz@redhat.com: reworked]
Cc: stable@kernel.org
Signed-off-by: NHuang Ying <ying.huang@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NMilan Broz <mbroz@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

b2174eeb

dm io: respect BIO_MAX_PAGES limit · d659e6cc

由 Mikulas Patocka 提交于 3月 16, 2009

dm-io calls bio_get_nr_vecs to get the maximum number of pages to use
for a given device.  It allocates one additional bio_vec to use
internally but failed to respect BIO_MAX_PAGES, so fix this.

This was the likely cause of:
  https://bugzilla.redhat.com/show_bug.cgi?id=173153

Cc: stable@kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

d659e6cc

dm table: rework reference counting fix · f80a5570

由 Mikulas Patocka 提交于 3月 16, 2009

Fix an error introduced in dm-table-rework-reference-counting.patch.

When there is failure after table initialization, we need to use
dm_table_destroy, not dm_table_put, to free the table.

dm_table_put may be used only after dm_table_get.

Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Reviewed-by: NJonathan Brassow <jbrassow@redhat.com>
Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

f80a5570

dm ioctl: validate name length when renaming · bc0fd67f

由 Milan Broz 提交于 3月 16, 2009

When renaming a mapped device validate the length of the new name.

The rename ioctl accepted any correctly-terminated string enclosed
within the data passed from userspace.  The other ioctls enforce a
size limit of DM_NAME_LEN.  If the name is changed and becomes longer
than that, the device can no longer be addressed by name.

Fix it by properly checking for device name length (including
terminating zero).

Cc: stable@kernel.org
Signed-off-by: NMilan Broz <mbroz@redhat.com>
Reviewed-by: NJonathan Brassow <jbrassow@redhat.com>
Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

bc0fd67f

04 3月, 2009 1 次提交

md: fix deadlock when stopping arrays · 5fd3a17e

由 Dan Williams 提交于 3月 04, 2009

Resolve a deadlock when stopping redundant arrays, i.e. ones that
require a call to sysfs_remove_group when shutdown.  The deadlock is
summarized below:

Thread1                Thread2
-------                -------
read sysfs attribute   stop array
                       take mddev lock
                       sysfs_remove_group
sysfs_get_active
wait for mddev lock
                       wait for active

Sysrq-w:
--------
mdmon         S 00000017  2212  4163      1
  f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
  c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
  00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
Call Trace:
  [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58
  [<c06ef451>] __mutex_lock_common+0x1d9/0x338
  [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338
  [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a
  [<c0634553>] ? mddev_lock+0x14/0x16
  [<c0634553>] mddev_lock+0x14/0x16
  [<c0634eda>] md_attr_show+0x2a/0x49
  [<c04e9997>] sysfs_read_file+0x93/0xf9
mdadm         D 00000017  2812  4177      1
  f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
  c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
  00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
Call Trace:
  [<c06eed1b>] schedule_timeout+0x1b/0x95
  [<c06eed1b>] ? schedule_timeout+0x1b/0x95
  [<c06eeb97>] ? wait_for_common+0x34/0xdc
  [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145
  [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd
  [<c06eec03>] wait_for_common+0xa0/0xdc
  [<c0428c7c>] ? default_wake_function+0x0/0x12
  [<c06eeccc>] wait_for_completion+0x17/0x19
  [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1
  [<c04e920e>] sysfs_hash_and_remove+0x42/0x55
  [<c04eb4db>] sysfs_remove_group+0x57/0x86
  [<c0638086>] do_md_stop+0x13a/0x499

This has been there for a while, but is easier to trigger now that mdmon
is closely watching sysfs.

Cc: <stable@kernel.org>
Reported-by: NJacek Danecki <jacek.danecki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5fd3a17e

25 2月, 2009 3 次提交

md: avoid races when stopping resync. · 73d5c38a

由 NeilBrown 提交于 2月 25, 2009

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed.  So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Cc: stable@kernel.org
Signed-off-by: NNeilBrown <neilb@suse.de>

73d5c38a

md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. · 78200d45

由 NeilBrown 提交于 2月 25, 2009

For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
The both walk through the device addresses in order.

For raid10 it can be quite different.  resync follows the 'array'
address, and makes sure all copies are the same.  Recover walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particularly, it can cause bitmap-directed recovery of a raid10 to
not recovery some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.


Cc: stable@kernel.org
Signed-off-by: NNeilBrown <neilb@suse.de>

78200d45

md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. · 09b4068a

由 NeilBrown 提交于 2月 25, 2009

When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.

However if we choose to skip a chunk as it isn't flag, the code
currently skips the whole raid10-chunk, thus it might not recovery
some blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.



Cc: stable@kernel.org
Signed-off-by: NNeilBrown <neilb@suse.de>

09b4068a

18 2月, 2009 1 次提交

block: fix bad definition of BIO_RW_SYNC · 93dbb393

由 Jens Axboe 提交于 2月 16, 2009

We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

93dbb393

06 2月, 2009 3 次提交

md: Ensure an md array never has too many devices. · de01dfad

由 NeilBrown 提交于 2月 06, 2009

Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Cc: stable@kernel.org
Signed-off-by: NNeilBrown <neilb@suse.de>

de01dfad

md: Fix a bug in linear.c causing which_dev() to return the wrong device. · 852c8bf4

由 Andre Noll 提交于 2月 06, 2009

ab5bd5cb introduced the following
bug in linear software raid for large arrays on 32 bit machines:

which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.

Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.

The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.

As reported by Simon Kirby, this leads to errors such as

	linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464

Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.

Cc: stable@kernel.org
Signed-off-by: NAndre Noll <maan@systemlinux.org>
Signed-off-by: NNeilBrown <neilb@suse.de>

852c8bf4

md: Allow read error in a single drive raid1 to be passed up. · 4706b349

由 NeilBrown 提交于 2月 06, 2009

If a raid1 only has a single working device and gets a read error, 
we choose to simply return that error up to the filesystem (or whatever)
rather than failing the whole array.

However the codes doesn't quite do that.  We attempt a readbalance
which allocates the same drive, so we retry the read - indefinitely. 

Instead:  If read_balance in the error case chooses the same drive that just
failed, treat it as a failure and don't retry.
Signed-off-by: NNeilBrown <neilb@suse.de>

4706b349

09 1月, 2009 3 次提交

md: don't retry recovery of raid1 that fails due to error on source drive. · 4044ba58

由 NeilBrown 提交于 1月 09, 2009

If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.

So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.

Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.
Signed-off-by: NNeilBrown <neilb@suse.de>

4044ba58

md: Allow md devices to be created by name. · efeb53c0

由 NeilBrown 提交于 1月 09, 2009

Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.

Also, creating md devices by opening the device special file is a bit
awkward.

So this patch provides a new option for creating and naming devices.

Writing a name such as "md_home" to
    /sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created.  It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.

md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.

The name of the array must start 'md_' to avoid conflict with
other devices.
Signed-off-by: NNeilBrown <neilb@suse.de>

efeb53c0

md: make devices disappear when they are no longer needed. · d3374825

由 NeilBrown 提交于 1月 09, 2009

Currently md devices, once created, never disappear until the module
is unloaded.  This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed.  However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed.  However this has a
complication.
Between the call
   __blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
   __blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created.  We need to
ensure that this reference can not be used.  i.e. the ->open must
fail.

So:
 1/  in md_probe we set a flag in the mddev (hold_active) which
     indicates that the array should be treated as active, even
     though there are no references, and no appearance of activity.
     This is cleared by md_release when the device is closed if it
     is no longer needed.
     This ensures that the gendisk will survive between md_probe and
     md_open.

 2/  In md_open we check if the mddev we expect to open matches
     the gendisk that we did open.
     If there is a mismatch we return -ERESTARTSYS and modify
     __blkdev_get to retry from the top in that case.
     In the -ERESTARTSYS sys case we make sure to wait until
     the old gendisk (that we succeeded in opening) is really gone so
     we loop at most once.

Some udev configurations will always open an md device when it first
appears.   If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it.  This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).

As an array can be stopped by writing to a sysfs attribute
  echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects.  This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NNeilBrown <neilb@suse.de>

d3374825

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功