1. 04 Feb 2015, 1 commit
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown authored
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region (see the sketch after this entry).
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock().  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5c675f83
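      To make the locking scheme concrete, here is a minimal sketch of the
      central dispatcher described above. The function and field names come
      from the commit text; the body is illustrative, not the literal patch.

          static int mddev_congested(struct mddev *mddev, int bits)
          {
                  struct md_personality *pers = mddev->pers;
                  int ret = 0;

                  rcu_read_lock();        /* pairs with synchronize_rcu() in mddev_suspend() */
                  if (mddev->suspended)
                          ret = 1;        /* personality is changing: "congested" is a safe guess */
                  else if (pers && pers->congested)
                          ret = pers->congested(mddev, bits);
                  rcu_read_unlock();
                  return ret;
          }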
  2. 14 Oct 2014, 2 commits
  3. 09 Oct 2014, 2 commits
  4. 22 Sep 2014, 8 commits
    • md/raid1: fix_read_error should act on all non-faulty devices. · b8cb6b4c
      NeilBrown authored
      If a device is being recovered it is not InSync and is not Faulty.
      
      If a read error is experienced on that device, fix_read_error()
      will be called, but it ignores non-InSync devices.  So it will
      neither fix the error nor fail the device.
      
      It is incorrect that fix_read_error() ignores non-InSync devices.
      It should only ignore Faulty devices.  So fix it.
      
      This became a bug when we allowed reading from a device that was being
      recovered.  It is suitable for any subsequent -stable kernel.
      
      Fixes: da8840a7
      Cc: stable@vger.kernel.org (v3.5+)
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b8cb6b4c
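      A sketch of the corrected skip test inside fix_read_error(); the shape
      is illustrative rather than the upstream diff:

          /* Illustrative: skip only Faulty devices, not all non-InSync ones. */
          struct md_rdev *rdev = conf->mirrors[d].rdev;

          if (rdev == NULL || test_bit(Faulty, &rdev->flags))
                  continue;       /* Faulty: ignore */
          /* A recovering device (not InSync, not Faulty) is now eligible,
           * so the error can be fixed or the device failed. */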
    • md/raid1: count resync requests in nr_pending. · 34e97f17
      NeilBrown authored
      Both normal IO and resync IO can be retried with reschedule_retry()
      and so be counted into ->nr_queued, but only normal IO gets counted in
      ->nr_pending.
      
      Before the recent improvement to RAID1 resync there could only
      possibly have been one or the other on the queue.  When handling a
      read failure it could only be normal IO.  So when handle_read_error()
      called freeze_array() the fact that freeze_array only compares
      ->nr_queued against ->nr_pending was safe.
      
      But now that these two types can interleave, we can have both normal
      and resync IO requests queued, so we need to count them both in
      nr_pending.
      
      This error can lead to freeze_array() hanging if there is a read
      error, so it is suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      34e97f17
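      A sketch of the accounting fix: resync requests now participate in
      ->nr_pending so that freeze_array()'s comparison stays valid. The
      locking details follow the message, not the literal diff.

          static void raise_barrier(struct r1conf *conf)
          {
                  spin_lock_irq(&conf->resync_lock);
                  /* ... wait until it is safe to start resync ... */
                  conf->nr_pending++;             /* count the resync request too */
                  spin_unlock_irq(&conf->resync_lock);
          }

          static void lower_barrier(struct r1conf *conf)
          {
                  spin_lock_irq(&conf->resync_lock);
                  conf->nr_pending--;             /* drop it when the resync IO finishes */
                  spin_unlock_irq(&conf->resync_lock);
          }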
    • md/raid1: update next_resync under resync_lock. · c2fd4c94
      NeilBrown authored
      raise_barrier() uses next_resync as part of its calculations, so it
      really should be updated first, instead of afterwards.
      
      next_resync is always used under resync_lock, so update it under the
      resync_lock too, just before it is used.  That is safest (see the
      sketch after this entry).

      This bug could cause normal IO and resync IO to interact badly, so
      it is suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      c2fd4c94
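      A sketch of the reordering, assuming raise_barrier() now receives the
      upcoming resync sector (the argument name is illustrative):

          static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
          {
                  spin_lock_irq(&conf->resync_lock);
                  /* next_resync is updated under resync_lock, immediately
                   * before the window calculations that consume it. */
                  conf->next_resync = sector_nr;
                  /* ... barrier/window calculations using conf->next_resync ... */
                  spin_unlock_irq(&conf->resync_lock);
          }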
    • md/raid1: Don't use next_resync to determine how far resync has progressed · 23554960
      NeilBrown authored
      next_resync is (approximately) the location for the next resync request.
      However it does *not* reliably determine the earliest location
      at which resync might be happening.
      This is because resync requests can complete out of order, and
      we only limit the number of current requests, not the distance
      from the earliest pending request to the latest.
      
      mddev->curr_resync_completed is a reliable indicator of the earliest
      position at which resync could be happening.  It is updated less
      frequently, but it is actually reliable, which is more important.
      
      So use it to determine if a write request is before the region
      being resynced and so safe from conflict.
      
      This error can allow resync IO to interfere with normal IO which
      could lead to data corruption. Hence: stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      23554960
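      A hedged sketch of the revised safety test for writes; the condition is
      assembled from the names in the message and is illustrative:

          /* A write wholly below curr_resync_completed cannot conflict with
           * in-flight resync: requests may complete out of order, so
           * next_resync over-estimates how far resync has really got. */
          if (conf->mddev->curr_resync_completed >= bio_end_sector(bio))
                  wait = false;   /* entirely inside the already-resynced region */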
    • md/raid1: make sure resync waits for conflicting writes to complete. · 2f73d3c5
      NeilBrown authored
      The resync/recovery process for raid1 was recently changed
      so that writes could happen in parallel with resync providing
      they were in different regions of the device.
      
      There is a problem though:  While a write request will always
      wait for conflicting resync to complete, a resync request
      will *not* always wait for conflicting writes to complete.
      
      Two changes are needed to fix this:
      
      1/ raise_barrier (which waits until it is safe to do resync)
         must wait until current_window_requests is zero
       2/ wait_barrier (which waits at the start of a new write request)
          must update current_window_requests if the request could
          possibly conflict with a concurrent resync.
      
      As concurrent writes and resync can lead to data loss,
      this patch is suitable for -stable.
      
      Fixes: 79ef3a8a
      Cc: stable@vger.kernel.org (v3.13+)
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      2f73d3c5
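      Change 1/ above, sketched with the field names from this series; the
      full wait condition upstream carries more clauses:

          /* Illustrative: before resyncing a region that writes may still
           * be using, wait for current_window_requests to drain to zero. */
          wait_event_lock_irq(conf->wait_barrier,
                              conf->current_window_requests == 0,
                              conf->resync_lock);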
    • md/raid1: clean up request counts properly in close_sync() · 669cc7ba
      NeilBrown authored
      If there are outstanding writes when close_sync is called,
      the change to ->start_next_window might cause them to
      decrement the wrong counter when they complete.  Fix this
      by merging the two counters into the one that will be decremented.
      
      Having an incorrect value in a counter can cause raise_barrier()
      to hang, so this is suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      669cc7ba
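      The merge of the two counters, sketched (details are illustrative):

          static void close_sync(struct r1conf *conf)
          {
                  /* ... existing barrier teardown ... */
                  spin_lock_irq(&conf->resync_lock);
                  conf->start_next_window = MaxSector;
                  /* Fold next-window requests into the counter that the
                   * outstanding writes will decrement on completion, so
                   * nothing decrements the wrong counter. */
                  conf->current_window_requests += conf->next_window_requests;
                  conf->next_window_requests = 0;
                  spin_unlock_irq(&conf->resync_lock);
          }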
    • md/raid1: be more cautious where we read-balance during resync. · c6d119cf
      NeilBrown authored
      commit 79ef3a8a made
      it possible for reads to happen concurrently with resync.
      This means that we need to be more careful where read_balancing
      is allowed during resync - we can no longer be sure that any
      resync that has already started will definitely finish.
      
      So restrict read_balancing to the region before recovery_cp, which is
      conservative but safe (see the sketch after this entry).
      
      This bug makes it possible to read from a device that doesn't
      have up-to-date data, so it can cause data corruption.
      So it is suitable for any kernel since 3.11.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      c6d119cf
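      A sketch of the conservative check in read_balance(); the condition is
      illustrative, built from the message:

          /* If any part of the read extends past recovery_cp, resync may not
           * have finished there: don't balance, read from the first disk
           * with valid data. */
          if (conf->mddev->recovery_cp < this_sector + sectors)
                  choose_first = 1;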
    • md/raid1: initialise start_next_window for READ case to avoid hang · f0cc9a05
      NeilBrown authored
      r1_bio->start_next_window is not initialised in the READ
      case, so allow_barrier may incorrectly decrement
      conf->current_window_requests, which can cause raise_barrier()
      to block forever.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f0cc9a05
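      The intent of the fix in one line (placement illustrative):

          /* READs take no window reference, so make that explicit before
           * allow_barrier() ever examines the r1_bio. */
          r1_bio->start_next_window = 0;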
  5. 31 Jul 2014, 1 commit
    • md/raid1,raid10: always abort recover on write error. · 2446dba0
      NeilBrown authored
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggered by normal IO (as opposed to
      recovery IO).
      
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggered the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.

      If we abort the recovery, the region being processed will be cancelled
      (bitmap bit not cleared) and the whole region will be retried.
      
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
      2446dba0
  6. 09 Apr 2014, 1 commit
  7. 05 Feb 2014, 1 commit
    • md/raid1: restore ability for check and repair to fix read errors. · 1877db75
      NeilBrown authored
      commit 30bc9b53
          md/raid1: fix bio handling problems in process_checks()

      moved the bio_reset() to a point before where BIO_UPTODATE is checked,
      so that the check now always reports that the bio is uptodate, even if
      it is not.

      This causes process_checks() to sometimes treat read errors as
      successful matches, so the good data isn't written out.
      
      This patch preserves the flag until it is needed.
      
      Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
      an even worse bug).  So suitable for any -stable since 3.10.
      Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: stable@vger.kernel.org (3.10+)
      Fixes: 30bc9b53
      Signed-off-by: NeilBrown <neilb@suse.de>
      1877db75
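      A sketch of the preserve-then-restore pattern the fix describes. In
      kernels of this era bio_reset() leaves BIO_UPTODATE set, so the failure
      state must be carried across by hand; names as in 3.x bio code:

          int uptodate = test_bit(BIO_UPTODATE, &sbio->bi_flags);

          bio_reset(sbio);
          if (!uptodate)
                  clear_bit(BIO_UPTODATE, &sbio->bi_flags);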
  8. 14 Jan 2014, 1 commit
    • md/raid1: fix request counting bug in new 'barrier' code. · 41a336e0
      NeilBrown authored
      The new iobarrier implementation in raid1 (which keeps normal writes
      and resync activity separate) counts every request that is not before
      the current resync point in either next_window_requests or
      current_window_requests.
      It flags that the request is counted by setting ->start_next_window.
      
      allow_barrier follows this model exactly and decrements one of the
      *_window_requests if and only if ->start_next_window is set.
      
      However wait_barrier(), which increments *_window_requests, uses a
      slightly different test for setting ->start_next_window (which is set
      from the return value of this function).
      So there is a possibility of the counts getting out of sync, and this
      leads to the resync hanging.
      
      So change wait_barrier() to return a non-zero value in exactly the
      same cases that it increments *_window_requests.
      
      Bug was introduced in 3.13-rc1.
      Reported-by: Bruno Wolff III <bruno@wolff.to>
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
      Fixes: 79ef3a8a
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      41a336e0
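      The invariant the fix restores, sketched; request_counted_in_window()
      is a hypothetical helper standing in for the real test:

          /* wait_barrier() must return non-zero (later stored in
           * r1_bio->start_next_window) in exactly the cases where it
           * incremented one of the *_window_requests counters, because
           * allow_barrier() decrements if and only if that field is set. */
          sector_t sector = 0;

          if (request_counted_in_window(conf, bio)) {     /* hypothetical helper */
                  conf->current_window_requests++;        /* or next_window_requests */
                  sector = conf->start_next_window;       /* non-zero marker */
          }
          return sector;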
  9. 24 Nov 2013, 1 commit
    • block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet authored
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      4f024f37
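      For orientation, the iterator bundles the fields that used to live
      directly in struct bio; a sketch (bi_bvec_done arrives in the later
      patch the message mentions):

          /* Accesses change from bio->bi_sector etc. to bio->bi_iter.bi_sector. */
          struct bvec_iter {
                  sector_t        bi_sector;      /* device address in 512-byte sectors */
                  unsigned int    bi_size;        /* residual I/O count in bytes */
                  unsigned int    bi_idx;         /* current index into bi_io_vec */
                  unsigned int    bi_bvec_done;   /* bytes completed in the current bvec */
          };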
  10. 19 Nov 2013, 4 commits
    • raid1: Rewrite the implementation of iobarrier. · 79ef3a8a
      majianpeng authored
      There is an iobarrier in raid1 because of contention between normal IO and
      resync IO.  It suspends all normal IO when resync/recovery happens.
      
      However if normal IO is outside the resync window, there is no contention.
      So this patch changes the barrier mechanism to only block IO that
      could contend with the resync that is currently happening.
      
      We partition the whole space into five parts.
      |---------|-----------|------------|----------------|-------|
              start   next_resync   start_next_window    end_window
      
      start + RESYNC_WINDOW = next_resync
      next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
      start_next_window + NEXT_NORMALIO_DISTANCE = end_window
      
      Firstly we introduce some concepts:
      
      1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
            same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
            So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
      2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
            and start_next_window.  It also indicates the distance between
            start_next_window and end_window.
            It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
            this turned out not to be optimal.
      3 - next_resync: the next sector at which we will do sync IO.
      4 - start: a position which is at most RESYNC_WINDOW before
            next_resync.
      5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
            beyond next_resync.  Normal-io after this position doesn't need to
            wait for resync-io to complete.
      6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
            next_resync.  This also doesn't need to wait, but is counted
            differently.
      7 - current_window_requests:  the count of normalIO between
            start_next_window and end_window.
      8 - next_window_requests: the count of normalIO after end_window.
      
      Normal IO will be partitioned into four types:

      NormIO1:  the end sector of the bio is smaller than or equal to start
      NormIO2:  the start sector of the bio is larger than or equal to end_window
      NormIO3:  the start sector of the bio is larger than or equal to
                start_next_window (but before end_window)
      NormIO4:  the bio lies between start and start_next_window

      |--------|-----------|--------------------|----------------|-------------|
               start   next_resync      start_next_window     end_window
        NormIO1   NormIO4            NormIO4                NormIO3      NormIO2
      
      For NormIO1, we don't need any io barrier.
      For NormIO4, we use a similar approach to the original iobarrier
          mechanism.  The normal IO and resync IO must be kept separate.
      For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
          and "next_window_requests". They indicate the count of active
          requests in the two windows.
          For these, we don't wait for resync io to complete.
      
      For the resync action, if there are NormIO4s, we must wait for them.
      If not, we can proceed.
      But if the resync action reaches start_next_window and
      current_window_requests > 0 (that is, there are NormIO3s), we must
      wait until current_window_requests becomes zero.
      When current_window_requests becomes zero, start_next_window also
      moves forward.  Then current_window_requests will be replaced by
      next_window_requests.

      There is the problem of when and how to change from NormIO2 to
      NormIO3.  Only then can the sync action progress.
      
      We add a field in struct r1conf "start_next_window".
      
      A: if start_next_window == MaxSector, it means there are no NormIO2/3.
         So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
      B: if current_window_requests == 0 && next_window_requests != 0, it
         means start_next_window moves to end_window
      
      There is another problem: how to differentiate between an old NormIO2
      (which has since become NormIO3) and a genuine NormIO2.
      For example, there are many bios which are NormIO2 and one bio which is
      NormIO3.  The NormIO3 completes first, so the NormIO2 bios become NormIO3.
      
      We add a field in struct r1bio "start_next_window".
      This is used to record the position conf->start_next_window when the call
      to wait_barrier() is made in make_request().
      
      In allow_barrier(), we check conf->start_next_window.
      If r1_bio->start_next_window == conf->start_next_window, it means
      there was no transition between NormIO2 and NormIO3.
      If r1_bio->start_next_window != conf->start_next_window, it means
      there was a transition between NormIO2 and NormIO3.  There can only
      have been one transition, so the bio must be an old NormIO2.
      
      For one bio, there may be many r1_bios.  So we make sure
      all the r1_bio->start_next_window values are the same.
      If we meet a blocked_dev in make_request(), we must call allow_barrier
      and then wait_barrier again, so the value of conf->start_next_window
      may change between the earlier and the later r1_bios.
      If the r1_bios for one bio end up with different start_next_window
      values, the accounting for that bio depends on the last r1_bio's value
      and will go wrong.  To avoid this, we must wait for the previous
      r1_bios to complete.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      79ef3a8a
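      The bookkeeping the rewrite adds, collected in one sketch; only the
      fields named in the message are shown, and their exact placement is
      illustrative:

          struct r1conf {
                  /* ... existing fields ... */
                  sector_t  start_next_window;       /* MaxSector: no NormIO2/3 active */
                  int       current_window_requests; /* NormIOs in [start_next_window, end_window) */
                  int       next_window_requests;    /* NormIOs beyond end_window */
          };

          struct r1bio {
                  /* ... existing fields ... */
                  sector_t  start_next_window;       /* conf->start_next_window at wait_barrier() time */
          };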
    • raid1: Add some macros to make the code clearer. · 8e005f7c
      majianpeng authored
      In a subsequent patch, we'll use some const parameters.
      Using macros will make the code clearer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8e005f7c
    • raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. · 07169fd4
      majianpeng authored
      
      We used to use raise_barrier to suspend normal IO while we reconfigure
      the array.  However raise_barrier will soon only suspend some normal
      IO, not all.  So we need something else.
      Change it to use freeze_array.
      But freeze_array not only suspends normal io, it also suspends
      resync io.
      For the places that called raise_barrier to reconfigure, that isn't a
      problem.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      07169fd4
    • raid1: Add a field array_frozen to indicate whether raid is in the freeze state. · b364e3d0
      majianpeng authored
      Because the following patch will rework the interaction between normal IO
      and resync IO, we add a field to indicate whether the array is in the
      frozen state.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b364e3d0
  11. 09 Nov 2013, 1 commit
  12. 24 Oct 2013, 1 commit
    • md: Fix skipping recovery for read-only arrays. · 61e4947c
      Lukasz Dorau authored
      Since:
              commit 7ceb17e8
              md: Allow devices to be re-added to a read-only array.
      
      spares are activated on a read-only array. For the raid1 and raid10
      personalities, this causes not-in-sync devices to be marked in-sync
      without checking whether recovery has finished.
      
      If a read-only array is degraded and one of its devices is not in-sync
      (because the array has been only partially recovered) recovery will be skipped.
      
      This patch adds a check that recovery has finished before marking a device
      in-sync for the raid1 and raid10 personalities. For the raid5 personality
      such a check is already present (at raid5.c:6029).
      
      Bug was introduced in 3.10 and causes data corruption.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      61e4947c
  13. 18 Jul 2013, 1 commit
    • md/raid1: fix bio handling problems in process_checks() · 30bc9b53
      NeilBrown authored
      The recent change to use bio_copy_data() in raid1 when repairing
      an array is faulty.

      The lower layers may have changed the bio in various ways using
      bio_advance(), and these changes need to be undone not just for the
      'sbio' which is being copied to, but also for the 'pbio' (primary)
      which is being copied from.
      
      So perform the reset on all bios that were read from and do it early.
      
      This also ensures that the sbio->bi_io_vec[j].bv_len passed to
      memcmp is correct.
      
      This fixes a crash during a 'check' of a RAID1 array.  The crash was
      introduced in 3.10 so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30bc9b53
  14. 14 Jun 2013, 1 commit
    • DM RAID: Add ability to restore transiently failed devices on resume · 9092c02d
      Jonathan Brassow authored
      
      This patch adds code to the resume function to check over the devices
      in the RAID array.  If any are found to be marked as failed and their
      superblocks can be read, an attempt is made to reintegrate them into
      the array.  This allows the user to refresh the array with a simple
      suspend and resume of the array - rather than having to load a
      completely new table, allocate and initialize all the structures and
      throw away the old instantiation.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      9092c02d
  15. 13 Jun 2013, 3 commits
    • md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      H. Peter Anvin authored
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drives behind a
      SAS controller, but there are probably a hundred other ways that this
      can happen, including drive firmware bugs.
      
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
      
      [neilb: added raid5]
      
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      5026d7a9
    • md/raid1,raid10: use freeze_array in place of raise_barrier in various places. · e2d59925
      NeilBrown authored
      Various places in raid1 and raid10 are calling raise_barrier when they
      really should call freeze_array.
      The former is only intended to be called from "make_request".
      The latter has extra checks for 'nr_queued' and makes a call to
      flush_pending_writes(), so it is safe to call it from within the
      management thread.
      
      Using raise_barrier will sometimes deadlock.  Using freeze_array
      should not.
      
      As 'freeze_array' currently expects one request to be pending (in
      handle_read_error - the only previous caller), we need to pass
      it the number of pending requests (extra) to ignore.
      
      The deadlock was made particularly noticeable by commits
      050b6615 (raid10) and 6b740b8d (raid1) which
      appeared in 3.4, so the fix is appropriate for any -stable
      kernel since then.
      
      This patch probably won't apply directly to some early kernels and
      will need to be applied by hand.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e2d59925
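      A sketch of the adjusted freeze_array(); the wait condition follows the
      message's wording, the rest is illustrative:

          /* 'extra' is how many in-flight requests the caller itself
           * accounts for and freeze_array() should therefore ignore. */
          static void freeze_array(struct r1conf *conf, int extra)
          {
                  spin_lock_irq(&conf->resync_lock);
                  /* ... raise the barrier ... */
                  wait_event_lock_irq_cmd(conf->wait_barrier,
                                          conf->nr_pending == conf->nr_queued + extra,
                                          conf->resync_lock,
                                          flush_pending_writes(conf));
                  spin_unlock_irq(&conf->resync_lock);
          }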
    • md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. · 3056e3ae
      Alex Lyakas authored
      
      Without that fix, the following scenario could happen:
      
      - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
      - Drive A fails
      - WRITE request arrives to the array. It is failed by drive A, so
      r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
      succeeds in writing it, so the same r1_bio is marked as
      R1BIO_Uptodate.
      - r1_bio arrives to handle_write_finished, badblocks are disabled,
      md_error()->error() does nothing because we don't fail the last drive
      of raid1
      - raid_end_bio_io()  calls call_bio_endio()
      - As a result, in call_bio_endio():
              if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                      clear_bit(BIO_UPTODATE, &bio->bi_flags);
      this code doesn't clear the BIO_UPTODATE flag, and the whole master
      WRITE succeeds, back to the upper layer.
      
      So we returned success to the upper layer, even though we had written
      the data onto the rebuilding drive only. But when we want to read the
      data back, we would not read from the rebuilding drive, so this data
      is lost.
      
      [neilb - applied identical change to raid10 as well]
      
      This bug can result in lost data, so it is suitable for any
      -stable kernel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      3056e3ae
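      The success criterion from the title, sketched as it would look in the
      write-completion path (illustrative; the flag names are in the message):

          /* Set R1BIO_Uptodate only when the completing device is both
           * non-Faulty and not mid-rebuild (i.e. In_sync), so a write that
           * only reached a rebuilding drive is not reported as successful. */
          if (test_bit(In_sync, &rdev->flags) &&
              !test_bit(Faulty, &rdev->flags))
                  set_bit(R1BIO_Uptodate, &r1_bio->state);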
  16. 30 Apr 2013, 2 commits
  17. 24 Mar 2013, 9 commits
    • block: Add bio_alloc_pages() · a0787606
      Kent Overstreet authored
      More utility code to replace stuff that's getting open coded.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      a0787606
    • block: Convert some code to bio_for_each_segment_all() · cb34e057
      Kent Overstreet authored
      More prep work for immutable bvecs:
      
      A few places in the code were either open coding or using the wrong
      version - fix.
      
      After we introduce the bvec iter, it'll no longer be possible to modify
      the biovec through bio_for_each_segment_all() - it doesn't increment a
      pointer to the current bvec, you pass in a struct bio_vec (not a
      pointer) which is updated with what the current biovec would be (taking
      into account bi_bvec_done and bi_size).
      
      So because of that it's more worthwhile to be consistent about
      bio_for_each_segment()/bio_for_each_segment_all() usage.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
      cb34e057
    • block: Add bio_for_each_segment_all() · d74c6d51
      Kent Overstreet authored
      __bio_for_each_segment() iterates bvecs from the specified index
      instead of bio->bv_idx.  Currently, the only usage is to walk all the
      bvecs after the bio has been advanced by specifying 0 index.
      
      For immutable bvecs, we need to split these apart;
      bio_for_each_segment() is going to have a different implementation.
      This will also help document the intent of code that's using it -
      bio_for_each_segment_all() is only legal to use for code that owns the
      bio.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Neil Brown <neilb@suse.de>
      CC: Boaz Harrosh <bharrosh@panasas.com>
      d74c6d51
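      Typical use, as the message describes it: only code that owns the bio
      may walk all its segments this way. A small illustrative example with
      the 3.x-era signature:

          struct bio_vec *bvec;
          unsigned int total = 0;
          int i;

          /* Walk every bvec the bio owns, regardless of how far the bio
           * has been advanced. */
          bio_for_each_segment_all(bvec, bio, i)
                  total += bvec->bv_len;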
    • raid1: use bio_copy_data() · d3b45c2a
      Kent Overstreet authored
      This doesn't really delete any code _yet_, but once immutable bvecs are
      done we can just delete the rest of the code in that loop.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      d3b45c2a
    • raid1: Refactor narrow_write_error() to not use bi_idx · b783863f
      Kent Overstreet authored
      More bi_idx removal. This code was just open coding bio_clone(). This
      could probably be further improved by using bio_advance() instead of
      skipping over null pages, but that'd be a larger rework.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      b783863f
    • raid1: use bio_reset() · 2aabaa65
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      2aabaa65
    • block: Add submit_bio_wait(), remove from md · 9e882242
      Kent Overstreet authored
      Random cleanup - this code was duplicated and it's not really specific
      to md.
      
      Also added the ability to return the actual error code.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      9e882242
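      Usage, with the signature as added here (the rw flag argument was
      dropped in much later kernels):

          /* Issue the bio and sleep until it completes; the error code
           * comes back directly instead of via a completion callback. */
          int ret = submit_bio_wait(WRITE, bio);

          if (ret)
                  pr_err("md: write failed: %d\n", ret);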
    • block: Use bio_sectors() more consistently · aa8b57aa
      Kent Overstreet authored
      Bunch of places in the code weren't using it where they could be -
      this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
      into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Ed Cashin <ecashin@coraid.com>
      aa8b57aa
    • block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet authored
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
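      The convenience macro itself, in its pre-bvec_iter form: the first
      sector past the end of the bio's data.

          #define bio_end_sector(bio)     ((bio)->bi_sector + bio_sectors((bio)))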