Commits · 851c30c9badfc6b294c98e887624bff53644ad21 · openeuler / raspberrypi-kernel

28 Aug, 2013 · 3 commits

raid5: offload stripe handle to workqueue · 851c30c9

Committed by Shaohua Li on Aug 28, 2013

This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.

raid5 handles requests (especially writes) in units of stripes. A stripe is
page-size aligned, one page long, and spans all disks. For a write to any disk
sector, raid5 runs a state machine for the corresponding stripe, which includes
reading some disks of the stripe, calculating parity, and writing some disks of
the stripe. The state machine currently runs in the raid5d thread. Since there
is only one thread, it doesn't scale well for high-speed storage. An obvious
solution is multi-threading.

To get better performance, we have some requirements:
a. Locality. A stripe corresponding to a request submitted from one cpu is
better handled by a thread on the local cpu or local node. The local cpu is
preferred but can sometimes become a bottleneck, for example when parity
calculation is too heavy. Running on the local node is more widely applicable.
b. Configurability. Different raid5 array setups might need different
configurations, especially the number of threads. More threads don't always
mean better performance because of lock contention.

My original implementation created some kernel threads. There were
interfaces to control which cpu's stripes each thread should handle, and
userspace could set the affinity of the threads. This provides the greatest
flexibility and configurability. But it's hard to use, and apparently a new
thread pool implementation is disfavored.

The recent workqueue improvements are quite promising. An unbound workqueue
will be bound to a numa node. If WQ_SYSFS is set on the workqueue, there are
sysfs options to set affinity. For example, we can include only one HT sibling
in the affinity. Since work is non-reentrant by default, we can control the
number of running threads by limiting the number of dispatched work_structs.

In this patch, I created several stripe worker groups. A group corresponds to
a numa node. Stripes from the cpus of one node are added to that node's group
list. The workqueue threads of a node only handle stripes of that node's worker
group. In this way, stripe handling has numa node locality. And as I said, we
can control the number of threads by limiting the number of dispatched
work_structs.

The work_struct callback function handles several stripes in one run. Typical
workqueue usage is to run one unit in each work_struct. In the raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock, because of reference
accounting and because the stripe isn't on any list, queuing a work_struct for
each stripe would make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct handled only one stripe, such a
block plug would be meaningless.
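
The batching idea above can be illustrated with a minimal userspace sketch. It
is not the patch's code: the names `worker_group`, `group_add` and
`group_run_once`, and the batch size `BATCH_MAX`, are hypothetical, and the
stripe state machine is reduced to setting a flag.

```c
#include <stddef.h>

#define BATCH_MAX 8    /* hypothetical per-run batch size */

struct stripe {
    struct stripe *next;
    int handled;
};

/* Hypothetical worker group: one per NUMA node. */
struct worker_group {
    struct stripe *head;
};

/* Queue a stripe on the group of the node it was submitted from. */
void group_add(struct worker_group *g, struct stripe *s)
{
    s->handled = 0;
    s->next = g->head;
    g->head = s;
}

/*
 * One work item run: handle up to BATCH_MAX stripes and return how many
 * were handled. In the kernel, blk_start_plug()/blk_finish_plug() would
 * bracket this loop so that the whole batch shares one plug.
 */
int group_run_once(struct worker_group *g)
{
    int n = 0;

    while (g->head && n < BATCH_MAX) {
        struct stripe *s = g->head;

        g->head = s->next;
        s->handled = 1;    /* stand-in for the stripe state machine */
        n++;
    }
    return n;
}
```

With ten queued stripes, the first run handles a full batch and the second run
handles the remainder, so only two work items are needed instead of ten.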

This implementation can't do very fine-grained configuration. But numa
binding is the most popular usage model and should be enough for most
workloads.

Note: since we have only one stripe queue, switching to multi-threading might
decrease the size of requests dispatched down to the lower layer. The impact
depends on the number of threads, raid configuration and workload. So
multi-threaded raid5 might not be suitable for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disable multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: fix stripe release order · d265d9dc

Committed by Shaohua Li on Aug 28, 2013

The patch "make release_stripe lockless" changes the order in which stripes are
released. Originally I thought the block layer could take care of request
merging, but it appears some requests are still not merged. It's easy to fix
the order.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: make release_stripe lockless · 773ca82f

Committed by Shaohua Li on Aug 27, 2013

release_stripe still suffers from heavy lock contention. Instead, we just add
the stripe to an llist without taking device_lock, and let the raid5d thread do
the real stripe release, which must hold device_lock anyway. In this way,
release_stripe doesn't hold any locks.
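
The lock-free push described above can be illustrated with a small userspace
sketch using C11 atomics. This is not the kernel's llist implementation; the
names `lhead`, `lnode`, `lpush` and `ltake_all` are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
    struct lnode *next;
};

struct lhead {
    _Atomic(struct lnode *) first;
};

/* Producer side: push without a lock, retrying the compare-and-swap. */
void lpush(struct lhead *h, struct lnode *n)
{
    struct lnode *old = atomic_load(&h->first);

    do {
        n->next = old;    /* 'old' is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Consumer side: detach the whole list in one atomic exchange. */
struct lnode *ltake_all(struct lhead *h)
{
    return atomic_exchange(&h->first, (struct lnode *)NULL);
}
```

Note that detaching everything at once yields the newest entry first, so a
consumer that cares about ordering has to reverse the list itself.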

The side effect is that the order of released stripes changes. But that doesn't
sound like a big deal, as stripes are never handled in order. And I thought the
block layer can already do nice request merging, which means order isn't that
important.

I kept the unplug release batch, which is unnecessary with this patch from the
point of view of avoiding lock contention, and actually if we delete it, the
stripe_head release_list and lru can share storage. But the unplug release batch
is also helpful for request merging. We can probably delay waking up raid5d till
unplug, but I'm still afraid of the case where raid5d is already running.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

27 Aug, 2013 · 5 commits

md: avoid deadlock when dirty buffers during md_stop. · 260fa034

Committed by NeilBrown on Aug 27, 2013

When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.

However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.

So do_md_stop() calls sync_blockdev().  However at this point it is
holding ->reconfig_mutex.  So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex.  So we deadlock.

We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.

So introduce new flag MD_STILL_CLOSED.  Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets cleared before we lock against further
opens.
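
The flag protocol can be sketched as a tiny userspace state machine. This is
only an illustration under simplifying assumptions (no locking, no real
flushing); `array_state`, `stop_prepare`, `on_open` and `stop_commit` are
hypothetical names, not functions from md.c.

```c
#include <stdbool.h>

struct array_state {
    bool still_closed;    /* models the MD_STILL_CLOSED flag */
    bool stopped;
};

/* Step 1: set the flag, then flush (the flush itself is omitted here). */
void stop_prepare(struct array_state *a)
{
    a->still_closed = true;
}

/* Any open of the device in between clears the flag. */
void on_open(struct array_state *a)
{
    a->still_closed = false;
}

/* Step 2: after locking out further opens, stop only if nobody opened. */
bool stop_commit(struct array_state *a)
{
    if (!a->still_closed)
        return false;    /* someone opened the device: abort the stop */
    a->stopped = true;
    return true;
}
```

A stop attempt succeeds only when no open happened between the flush and the
final check; otherwise the caller has to abort.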

It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl.  Just don't do that.
Signed-off-by: NeilBrown <neilb@suse.de>

md: Don't test all of mddev->flags at once. · 7a0a5355

Committed by NeilBrown on Aug 27, 2013

mddev->flags is mostly used to record if an update of the
metadata is needed.  Sometimes the whole field is tested
instead of just the important bits.  This makes it difficult
to introduce more state bits.

So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.
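
A minimal sketch of why a bare test of the whole field stops working once an
unrelated state bit is added. The bit numbers and names below are illustrative
only and are not claimed to match the kernel's definitions.

```c
/* Hypothetical bit layout: three "metadata needs update" bits plus one
 * unrelated state bit. */
enum {
    BIT_CHANGE_DEVS    = 0,
    BIT_CHANGE_CLEAN   = 1,
    BIT_CHANGE_PENDING = 2,
    BIT_STILL_CLOSED   = 4
};

#define UPDATE_SB_MASK ((1UL << BIT_CHANGE_DEVS) | \
                        (1UL << BIT_CHANGE_CLEAN) | \
                        (1UL << BIT_CHANGE_PENDING))

/* Bare test: any bit at all looks like a pending metadata update. */
int needs_sb_update_bare(unsigned long flags)
{
    return flags != 0;
}

/* Masked test: only the bits that really mean "update needed" count. */
int needs_sb_update_masked(unsigned long flags)
{
    return (flags & UPDATE_SB_MASK) != 0;
}
```

With only the unrelated state bit set, the bare test reports an update while
the masked test does not.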
Signed-off-by: NeilBrown <neilb@suse.de>

md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f

Committed by Dave Jones on Aug 19, 2013

Setting a variable to itself probably wasn't the intention here.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>

md: fix safe_mode buglet. · 275c51c4

Committed by NeilBrown on Aug 08, 2013

When we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.
Signed-off-by: NeilBrown <neilb@suse.de>

md: don't call md_allow_write in get_bitmap_file. · 60559da4

Committed by NeilBrown on Jul 16, 2013

There is no real need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.

Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.
Signed-off-by: NeilBrown <neilb@suse.de>

17 Aug, 2013 · 1 commit

dm cache: avoid conflicting remove_mapping() in mq policy · b936bf8b

Committed by Geert Uytterhoeven on Jul 26, 2013

On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

As mq_remove_mapping() already exists, and the local remove_mapping() is
used only once, inline it manually to avoid the conflict.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>

25 Jul, 2013 · 2 commits

md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66

Committed by NeilBrown on Jul 22, 2013

If a device in a RAID4/5/6 is being replaced while another is being
recovered, then the writes to the replacement device currently don't
happen, resulting in corruption when the replacement completes and the
new drive takes over.

This is because the replacement writes are only triggered when
's.replacing' is set and not when the similar 's.sync' is set (which
is the case during resync and recovery - it means all devices need to
be read).

So schedule those writes when s.sync is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the
replacement has happened as that is needed for recording that any
parity calculation is complete.  So introduce STRIPE_REPLACED to
record if the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set.
This has a similar effect to the "s.locked == 0" test.  The latter
ensures that no IO has been flagged but not started.  The former
checks whether any parity calculation has been flagged but not started.
We must wait for both of these to complete before triggering the
'replace'.

Add a similar test to the subsequent check for "are we finished yet".
This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
but it makes it more obvious that the REPLACE will happen before we
think we are finished.

Finally if a NeedReplace device is not UPTODATE then that is an
error.  We really must trigger a warning.

This bug was introduced in commit 9a3e1101
(md/raid5:  detect and handle replacements during recovery.)
which introduced replacement for raid5.
That was in 3.3-rc3, so any stable kernel since then would benefit
from this fix.

Cc: stable@vger.kernel.org (3.3+)
Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid10: remove use-after-free bug. · 0eb25bb0

Committed by NeilBrown on Jul 24, 2013

We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.

Here is one place I wasn't careful enough.  If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.

This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
oops, so fix is suitable for any -stable since then.

Cc: stable@vger.kernel.org (3.3+)
Signed-off-by: NeilBrown <neilb@suse.de>

18 Jul, 2013 · 3 commits

md/raid1: fix bio handling problems in process_checks() · 30bc9b53

Committed by NeilBrown on Jul 17, 2013

The recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.

The underlying device may have changed the bio in various ways using
bio_advance and these need to be undone not just for the 'sbio' which
is being copied to, but also the 'pbio' (primary) which is being
copied from.

So perform the reset on all bios that were read from and do it early.

This also ensures that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.

This fixes a crash during a 'check' of a RAID1 array.  The crash was
introduced in 3.10 so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Remove recent change which allows devices to skip recovery. · 5024c298

Committed by NeilBrown on Jul 17, 2013

commit 7ceb17e8
    md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that.  It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
 If the array has a bitmap, then the device will still need a bitmap
 based resync ('saved_raid_disk' is set under different conditions
 if a bitmap is present).
 If the array doesn't have a bitmap, then this is correct as long as
 nothing has been written to the array since the metadata was checked
 by ->validate_super.  However there is no locking to ensure that there
 was no write.

Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid10: fix two problems with RAID10 resync. · 7bb23c49

Committed by NeilBrown on Jul 16, 2013

1/ When a difference between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request_write.
   Commit 8be185f2 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTODATE on all appropriate bios.
   So we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

12 Jul, 2013 · 8 commits

bcache: Allocation kthread fixes · 79826c35

Committed by Kent Overstreet on Jul 10, 2013

The alloc kthread should've been using try_to_freeze() - and also there
was the potential for the alloc kthread to get woken up after it had
shut down, which would have been bad.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>

bcache: Fix GC_SECTORS_USED() calculation · 29ebf465

Committed by Kent Overstreet on Jul 11, 2013

Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

bcache: Journal replay fix · faa56736

Committed by Kent Overstreet on Jul 11, 2013

The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.

Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

bcache: Shutdown fix · 5caa52af

Committed by Kent Overstreet on Jul 10, 2013

Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

bcache: Fix a sysfs splat on shutdown · c9502ea4

Committed by Kent Overstreet on Jul 10, 2013

If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.

So keep track of whether we've removed stuff from sysfs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

bcache: Advertise that flushes are supported · 54d12f2b

Committed by Kent Overstreet on Jul 10, 2013

Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

bcache: check for allocation failures · d2a65ce2

Committed by Dan Carpenter on Jul 05, 2013

There is a missing NULL check after the kzalloc().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

bcache: Fix a dumb race · 6aa8f1a6

Committed by Kent Overstreet on Jul 10, 2013

In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is_ needed...
that check for a destructor was racy. Argh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

11 Jul, 2013 · 12 commits

dm: add switch target · 9d0eb0ab

Committed by Jim Ramsay on Jul 10, 2013

dm-switch is a new target that maps IO to underlying block devices
efficiently when there is a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.

Though we have developed this target for a specific storage device, Dell
EqualLogic, we have made an effort to keep it as general purpose as
possible in the hope that others may benefit.

Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.
Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: optimize reorder structure · 2a7faeb1

Committed by Mikulas Patocka on Jul 10, 2013

This reorder actually improves performance by 20% (from 39.1s to 32.8s)
on x86-64 quad core Opteron.

I have no explanation for this; possibly it makes some other entries
better cache-aligned.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: optimize use SRCU and RCU · 83d5e5b0

Committed by Mikulas Patocka on Jul 10, 2013

This patch removes "io_lock" and "map_lock" in struct mapped_device and
"holders" in struct dm_table and replaces these mechanisms with
sleepable-rcu.

Previously, the code would call "dm_get_live_table" and "dm_table_put" to
get and release table. Now, the code is changed to call "dm_get_live_table"
and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
dm_put_live_table unlocks it.

dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
dm_get_live_table/dm_put_live_table. These *_fast functions use
non-sleepable RCU, so the caller must not block between them.

If the code changes active or inactive dm table, it must call
dm_sync_table before destroying the old table.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm bufio: submit writes outside lock · 2480945c

Committed by Mikulas Patocka on Jul 10, 2013

This patch changes dm-bufio so that it submits write I/Os outside of the
lock. If the number of submitted buffers is greater than the number of
requests on the target queue, submit_bio blocks. We want to block outside
of the lock to improve latency of other threads that may need the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm cache: fix arm link errors with inline · 43aeaa29

Committed by Mikulas Patocka on Jul 10, 2013

Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
gcc 4.7 is OK.

gcc 4.6 creates a function block_div.part.8 that references __udivdi3 and
__umoddi3 and is never called. The references to __udivdi3 and
__umoddi3 cause a link failure.
Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm verity: use __ffs and __fls · 553d8fe0

Committed by Mikulas Patocka on Jul 10, 2013

This patch changes ffs() to __ffs() and fls() to __fls() which don't add
one to the result.
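
The off-by-one difference can be shown with a portable userspace sketch. The
helpers below are hypothetical stand-ins written for illustration, not the
kernel's implementations; the zero-based variants mimic `__ffs()`/`__fls()`
and, like them, give a meaningless result for an input of 0.

```c
/* One-based, like ffs(): 0 for input 0, else 1 + index of lowest set bit. */
int one_based_ffs(unsigned long word)
{
    int n = 1;

    if (!word)
        return 0;
    while (!(word & 1UL)) {
        word >>= 1;
        n++;
    }
    return n;
}

/* One-based, like fls(): 0 for input 0, else 1 + index of highest set bit. */
int one_based_fls(unsigned long word)
{
    int n = 0;

    while (word) {
        word >>= 1;
        n++;
    }
    return n;
}

/* Zero-based variants: plain bit indices; the caller must pass non-zero. */
unsigned zero_based_ffs(unsigned long word)
{
    return (unsigned)one_based_ffs(word) - 1;
}

unsigned zero_based_fls(unsigned long word)
{
    return (unsigned)one_based_fls(word) - 1;
}
```

For a power-of-two block size such as 4096, the zero-based form gives the
shift count 12 directly, where the one-based form needs a manual `- 1`.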
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm flakey: correct ctr alloc failure mesg · 75e3a0f5

Committed by Alasdair G Kergon on Jul 10, 2013

Remove the reference to the "linear" target from the error message
issued when allocation fails in the flakey target.

Cc: Robin Dong <sanbai@taobao.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm verity: remove pointless comparison · 5d8be843

Committed by Mikulas Patocka on Jul 10, 2013

Remove num < 0 test in verity_ctr because num is unsigned.
(Found by Coverity.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: use __GFP_HIGHMEM in __vmalloc · 220cd058

Committed by Mikulas Patocka on Jul 10, 2013

Use __GFP_HIGHMEM in __vmalloc.

Pages allocated with __vmalloc can be allocated in high memory that is not
directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
does. This patch reduces memory pressure slightly because pages can be
allocated in the high zone.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm verity: fix inability to use a few specific devices sizes · b1bf2de0

Committed by Mikulas Patocka on Jul 10, 2013

Fix a boundary condition that caused failure for certain device sizes.

The problem is reported at
  http://code.google.com/p/cryptsetup/issues/detail?id=160

For certain device sizes the number of hashes at a specific level was
calculated incorrectly.

It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.

The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.

The condition for the bug is:

Split the total number of data blocks (data_block_bits) into bit strings,
each hash_per_block_bits bits long. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_block_bits to base 2^hash_per_block_bits.

If there is some zero bit string below the most significant bit string and
at least one bit below this zero bit string is set, then the bug happens.
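
The condition above can be checked mechanically. The following userspace
sketch is one reading of that rule, not code from the kernel or veritysetup;
the function names `hash_per_block_bits` and `size_hits_boundary` are
hypothetical.

```c
#include <stdint.h>

/* rounddown(log2(metadata_block_size / hash_digest_size)) */
unsigned hash_per_block_bits(unsigned block_size, unsigned digest_size)
{
    unsigned ratio = block_size / digest_size;
    unsigned bits = 0;

    while (ratio > 1) {
        ratio >>= 1;
        bits++;
    }
    return bits;
}

/*
 * Walk the digits of data_blocks in base 2^bits, lowest digit first.
 * Return 1 if some digit below the most significant one is zero while
 * at least one bit below that zero digit is set, else 0.
 */
int size_hits_boundary(uint64_t data_blocks, unsigned bits)
{
    uint64_t mask;
    int seen_set_bit_below = 0;

    if (bits == 0 || bits >= 64)
        return 0;    /* outside the range this sketch handles */
    mask = (UINT64_C(1) << bits) - 1;

    /* Loop while the current digit is not the most significant one. */
    while (data_blocks >> bits) {
        uint64_t digit = data_blocks & mask;

        if (digit == 0 && seen_set_bit_below)
            return 1;
        if (digit != 0)
            seen_set_bit_below = 1;
        data_blocks >>= bits;
    }
    return 0;
}
```

For the example in this message (block size 4096, sha256 with 32-byte digests,
16385 blocks) the digit width is 7 bits and 16385 has the base-128 digits
1, 0, 1, so the condition holds. 16384 (digits 1, 0, 0) does not trigger it
because nothing is set below the zero digits.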

The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm ioctl: set noio flag to avoid __vmalloc deadlock · 1c0e883e

Committed by Mikulas Patocka on Jul 10, 2013

Set the noio flag while calling __vmalloc(), because it doesn't fully
respect gfp flags, to avoid a possible deadlock (see commit
502624bd).

This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm mpath: fix ioctl deadlock when no paths · 6c182cd8

Committed by Hannes Reinecke on Jul 10, 2013

When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
  and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
  hangs in dm_table_destroy() waiting for references to
  drop to zero.

With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

04 Jul, 2013 · 2 commits

md/raid10: fix bug which causes all RAID10 reshapes to move no data. · 13765120

Committed by NeilBrown on Jul 04, 2013

The recent commit:
commit 7e83ccbe
    md/raid10: Allow skipping recovery when clean arrays are assembled

causes raid10 to skip a recovery in certain cases where it is safe to
do so.  Unfortunately it also causes a reshape to be skipped which is
never safe.  The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moved so the
array will now contain garbage.
(If nothing is written, you can recover by simply performing the
reverse reshape which will also complete instantly).

Bug was introduced in 3.10, so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6

Committed by NeilBrown on Jul 04, 2013

There is a bug in 'check_reshape' in raid5.c.  It checks
that the new minimum number of devices is large enough (which is
good), but it does so also after the reshape has started (bad).

This is bad because
 - the calculation is now wrong as mddev->raid_disks has changed
   already, and
 - it is pointless because it is now too late to stop.

So only perform that test when reshape has not been committed to.
Signed-off-by: NeilBrown <neilb@suse.de>

03 Jul, 2013 · 1 commit

md/raid10: fix two bugs affecting RAID10 reshape. · 78eaa0d4

Committed by NeilBrown on Jul 02, 2013

1/ If a RAID10 is being reshaped to a smaller number of devices
 and is stopped while this is ongoing, then when the array is
 reassembled the 'mirrors' array will be allocated too small.
 This will lead to an access error or memory corruption.

2/ A sanity test performed when a reshaping RAID10 array is restarted
 is slightly incorrect.

Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.

Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>

02 Jul, 2013 · 3 commits

bcache: Use standard utility code · 8e51e414

Committed by Kent Overstreet on Jun 06, 2013

Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.

Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

bcache: Delete fuzz tester · f3059a54

Committed by Kent Overstreet on May 15, 2013

This code has rotted and it hasn't been used in ages anyways.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>

bcache: Document shrinker reserve better · 36c9ea98

Committed by Kent Overstreet on Jun 03, 2013

Signed-off-by: Kent Overstreet <kmo@daterainc.com>