1. 23 Nov 2016, 1 commit
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      Committed by NeilBrown
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state, i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device, so we re-write without
      FAILFAST to improve the chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
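
      A minimal sketch of the policy described above. FailFast, LastDev and
      MD_FAILFAST are the flags named in the message; the helpers and the
      exact flow are illustrative, not the literal md.c code:

          /* Pick the flags for a superblock write to one member device. */
          static unsigned int sb_write_flags(struct md_rdev *rdev)
          {
                  if (test_bit(FailFast, &rdev->flags) &&
                      !test_bit(LastDev, &rdev->flags))
                          return MD_FAILFAST;   /* fail fast, md can retry */
                  return 0;
          }

          /* On completion of that write. */
          static bool sb_write_needs_retry(struct bio *bio, struct md_rdev *rdev)
          {
                  if (!bio->bi_error || !(bio->bi_opf & MD_FAILFAST))
                          return false;
                  /* The failfast write failed but the device was not failed,
                   * so it must be the last working device: remember that and
                   * have the caller re-issue the write without MD_FAILFAST. */
                  set_bit(LastDev, &rdev->flags);
                  return true;
          }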
  2. 19 Nov 2016, 2 commits
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      Committed by NeilBrown
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening
      while looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
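
      The messages go through the block layer's trace-message hook. A sketch
      of the kind of helper this adds (macro name and message text are
      illustrative), guarded so it is a no-op when md has no request queue,
      e.g. under dm-raid:

          #define raid10_log(md, fmt, args...)                               \
                  do {                                                       \
                          if ((md)->queue)                                   \
                                  blk_add_trace_msg((md)->queue,             \
                                                    "raid10 " fmt, ##args);  \
                  } while (0)

          /* e.g. just before blocking a request behind a resync barrier: */
          raid10_log(mddev, "wait barrier");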
    • md: add block tracing for bio_remapping · 109e3765
      Committed by NeilBrown
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errored.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
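
      The records are emitted with the block layer's remap tracepoint. A
      sketch of the call for a read that has been directed to a member
      device (read_bio and r10_bio are raid10 internals, shown for context
      only):

          if (mddev->gendisk)
                  trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
                                        read_bio,
                                        disk_devt(mddev->gendisk),
                                        r10_bio->sector);

      The mddev->gendisk test is what keeps this quiet under dm-raid, where
      md has no gendisk of its own.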
  3. 08 Nov 2016, 2 commits
  4. 25 Oct 2016, 1 commit
    • RAID10: ignore discard error · 579ed34f
      Committed by Shaohua Li
      This is the raid10 counterpart of the raid1 fix. If a write error
      occurs, raid10 will try to rewrite the bio in small chunks. If the
      rewrite fails, raid10 records the error in the bad block list.
      narrow_write_error always uses WRITE for the bio, but it could
      actually be a discard. Since a discard bio has no payload, writing
      it out causes various problems. A discard error isn't fatal, though,
      so we can safely ignore it. This is what this patch does.
      
      This issue has existed since discard support was added, but it was
      only exposed by the recent arbitrary bio size feature.
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
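
      A sketch of the check this adds in the rewrite-on-error path
      (simplified; narrow_write_error's bad-block bookkeeping is omitted):

          struct bio *bio = r10_bio->master_bio;

          if (bio_op(bio) == REQ_OP_DISCARD) {
                  /* A discard has no payload to re-write in smaller chunks,
                   * and a failed discard is not fatal, so just report
                   * success and skip the rewrite. */
                  return 1;
          }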
  5. 25 Aug 2016, 1 commit
  6. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portion. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
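
      In practice the change looks like this at call sites (a sketch, not a
      hunk from the patch):

          /* Before: op and flags were mixed together in bio->bi_rw.
           * After: the flag bits live in bio->bi_opf and the op is read
           * with bio_op(), so raw assignments to bi_rw no longer compile. */
          if ((bio->bi_opf & REQ_SYNC) && bio_op(bio) == REQ_OP_WRITE)
                  handle_sync_write(bio);   /* illustrative handler */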
  7. 31 Jul 2016, 1 commit
  8. 20 Jul 2016, 2 commits
    • raid10: improve random reads performance · 0e5313e2
      Committed by Tomasz Majchrzak
      RAID10 random read performance is lower than expected due to excessive
      spinlock utilisation, which is required mostly for rebuild/resync.
      Simplify allow_barrier, as it is in the IO path and encounters a lot
      of unnecessary contention.
      
      As lower_barrier just takes a lock in order to decrement a counter,
      convert that counter (nr_pending) into an atomic variable and remove
      the spinlock. There is also contention around wake_up (it takes a lock
      internally), so call it only when really needed. As wake_up is no
      longer called constantly, ensure that a process waiting to raise a
      barrier is notified when there are no more waiting IOs.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
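
      The hot-path change, reduced to its essentials (a sketch; the real
      allow_barrier also deals with freezing and queued-request accounting):

          /* Before: every completing request took conf->resync_lock. */
          spin_lock_irqsave(&conf->resync_lock, flags);
          conf->nr_pending--;
          spin_unlock_irqrestore(&conf->resync_lock, flags);
          wake_up(&conf->wait_barrier);

          /* After: a lock-free decrement, waking waiters only when the
           * last pending request completes. */
          if (atomic_dec_and_test(&conf->nr_pending))
                  wake_up(&conf->wait_barrier);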
    • md: use seconds granularity for error logging · 0e3ef49e
      Committed by Arnd Bergmann
      The md code stores the exact time of the last error in the
      last_read_error variable using a timespec structure. It only
      ever uses the seconds portion of that though, so we can
      use a scalar for it.
      
      There won't be an overflow in 2038 here, because it already
      used monotonic time and 32-bit is enough for that, but I've
      decided to use time64_t for consistency in the conversion.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
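
      The shape of the conversion (a sketch of the before/after, assuming
      the field lives in struct md_rdev as the message implies):

          /* Before: a full timespec, although only tv_sec was ever used. */
          struct timespec last_read_error;
          ktime_get_ts(&rdev->last_read_error);

          /* After: plain monotonic seconds; no 2038 problem even on 32-bit. */
          time64_t last_read_error;
          rdev->last_read_error = ktime_get_seconds();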
  9. 14 Jun 2016, 10 commits
  10. 09 Jun 2016, 1 commit
  11. 08 Jun 2016, 3 commits
  12. 10 May 2016, 2 commits
    • md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Committed by Guoqing Jiang
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING do so atomically;
      bit_clear_unless (suggested by Neil) is introduced for this purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
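
      bit_clear_unless(ptr, clear, test) atomically clears the 'clear' bits
      only if none of the 'test' bits are set, and reports whether it did.
      A simplified function with the same semantics (the kernel helper is a
      type-generic macro built on cmpxchg()):

          static bool clear_bits_unless(unsigned long *flags,
                                        unsigned long clear,
                                        unsigned long test)
          {
                  unsigned long old, new;

                  do {
                          old = READ_ONCE(*flags);
                          if (old & test)
                                  return false;   /* leave the flags alone */
                          new = old & ~clear;
                  } while (cmpxchg(flags, old, new) != old);

                  return true;
          }

      Roughly, md_update_sb() uses this to clear MD_CHANGE_PENDING only if
      no new MD_CHANGE_DEVS/MD_CHANGE_CLEAN request was flagged in the
      meantime; otherwise it loops and writes the superblock again.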
    • md: raid10: add prerequisite to run underneath dm-raid · 859644f0
      Committed by Heinz Mauelshagen
      When md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, so avoid accessing them.
      
      This patch adds two missing conditionals to the raid10 personality.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
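
      The guards are of this general form (the specific call and variables
      are illustrative, not the exact hunks of the patch):

          /* Under dm-raid the mddev owns neither a request queue nor a
           * gendisk, so both must be checked before use. */
          if (mddev->queue)
                  blk_queue_io_opt(mddev->queue, chunk_size * raid_disks);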
  13. 18 Mar 2016, 1 commit
  14. 21 Jan 2016, 1 commit
  15. 14 Jan 2016, 1 commit
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Committed by Dan Williams
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks to an array, or
      otherwise bringing spares online, if the device has an incompatible
      integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 ("md: suspend i/o during runtime
      blk_integrity_unregister"), introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one
      has been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
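
      A simplified sketch of the hot-add check (close to, but not exactly,
      md_integrity_add_rdev(); error handling trimmed):

          char name[BDEVNAME_SIZE];

          /* Nothing to enforce if the array has no integrity profile yet. */
          if (!mddev->gendisk || !blk_get_integrity(mddev->gendisk))
                  return 0;

          /* Refuse the new device if its profile differs from the array's. */
          if (blk_integrity_compare(mddev->gendisk, rdev->bdev->bd_disk) != 0) {
                  pr_err("%s: incompatible integrity profile for %s\n",
                         mdname(mddev), bdevname(rdev->bdev, name));
                  return -ENXIO;
          }
          return 0;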
  16. 18 Dec 2015, 1 commit
    • md/raid10: fix data corruption and crash during resync · cc578588
      Committed by Artur Paszkiewicz
      The commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data() but
      it doesn't work as intended. The source bio (fbio) is already processed,
      so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
      this, bio_copy_data() either does not copy anything, or worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to drives during resync and sometimes lockups/crashes in
      bio_copy_data():
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NeilBrown <neilb@suse.com>
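
      The fix is to rewind the source bio's iterator before it is used as
      the copy source, roughly:

          /* fbio has already completed, so its iterator is spent
           * (bi_size == 0, bi_idx == bi_vcnt).  Reset it to cover the
           * resynced range again before copying from it. */
          fbio->bi_iter.bi_size = r10_bio->sectors << 9;
          fbio->bi_iter.bi_idx = 0;

          bio_copy_data(tbio, fbio);    /* now copies the intended data */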
  17. 24 Oct 2015, 2 commits
    • md/raid10: fix the 'new' raid10 layout to work correctly. · 8bce6d35
      Committed by NeilBrown
      In Linux 3.9 we introduced a new 'far' layout for RAID10 which was
      supposed to rotate the replicas differently and so provide better
      resilience.  In particular it could survive more combinations of 2
      drive failures.
      
      Unfortunately, due to a coding error, this sometimes did what was
      wanted, sometimes improved things less than we hoped, and sometimes
      - in very unlikely circumstances - put multiple replicas on the same
      device, so the redundancy was harmed.
      
      No public user-space tool has created arrays using this layout so it
      is very unlikely that zero-redundancy arrays actually exist.  Probably
      no arrays using any form of the new layout exist.  But we cannot be
      certain.
      
      So use another bit in the 'layout' number and introduce a bug-fixed
      version of the layout.
      Also when assembling an array, if it has a zero-redundancy layout,
      give a warning.
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: don't clear bitmap bit when bad-block-list write fails. · c340702c
      Committed by NeilBrown
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R10BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-by: Nate Dailey <nate.dailey@stratus.com>
      Fixes: bd870a16 ("md/raid10:  Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  18. 22 Oct 2015, 1 commit
  19. 21 Oct 2015, 1 commit
  20. 12 Oct 2015, 1 commit
    • md-cluster: Use a small window for resync · c40f341f
      Committed by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to get curr_resync_completed up to date with the sector passed.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 09 Oct 2015, 1 commit
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Committed by Mikulas Patocka
      Commit 55ce74d4 ("md/raid1: ensure device failure recorded before
      write request returns") is causing a crash in the LVM2 testsuite test
      shell/lvchange-raid.sh. For me the crash is 100% reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves
      the list from conf->bio_end_io_list to tmp, then tests if tmp is
      non-empty and then incorrectly pops the bio from conf->bio_end_io_list
      (which is empty because the list was already moved).
      
      Raid-10 has a similar bug.
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
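
      The corrected pattern, reduced to its essentials (names follow the
      raid10 code; the raid1 version is analogous):

          LIST_HEAD(tmp);

          spin_lock_irqsave(&conf->device_lock, flags);
          list_splice_init(&conf->bio_end_io_list, &tmp);
          spin_unlock_irqrestore(&conf->device_lock, flags);

          while (!list_empty(&tmp)) {
                  /* The bug was taking entries from the original (now empty)
                   * conf->bio_end_io_list here instead of from &tmp. */
                  r10_bio = list_first_entry(&tmp, struct r10bio, retry_list);
                  list_del(&r10_bio->retry_list);
                  raid_end_bio_io(r10_bio);
          }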
  22. 02 Oct 2015, 1 commit
  23. 01 Sep 2015, 2 commits
    • md/raid10: ensure device failure recorded before write request returns. · 95af587e
      Committed by NeilBrown
      When a write to one of the legs of a RAID10 fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
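
      A sketch of the three steps in raid10 terms (simplified; the real code
      also rechecks the degraded state and handles the bitmap):

          /* 1. On a leg failure, request a metadata update and mark it
           *    pending so its completion can be detected with certainty. */
          set_bit(MD_CHANGE_DEVS, &mddev->flags);
          set_bit(MD_CHANGE_PENDING, &mddev->flags);
          md_wakeup_thread(mddev->thread);

          /* 2. Instead of ending the bio now, park it on a holding list. */
          list_add_tail(&r10_bio->retry_list, &conf->bio_end_io_list);

          /* 3. In raid10d, complete parked bios only once the superblock
           *    write has finished. */
          if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
                  /* splice bio_end_io_list and call raid_end_bio_io()
                   * on each entry */
          }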
    • md/raid10: fix a few typos in comments · 02ec5026
      Committed by NeilBrown
      Signed-off-by: NeilBrown <neilb@suse.com>