1. 08 Nov 2016, 1 commit
    • md: add bad block support for external metadata · 35b785f7
      Authored by Tomasz Majchrzak
      Add a new rdev flag which an external metadata handler can use to
      switch bad block support on or off. If a new bad block is
      encountered, notify it via the rdev 'unacknowledged_bad_blocks'
      sysfs file. If a bad block has been cleared, notify the update via
      the rdev 'bad_blocks' sysfs file.
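
      A kernel-style sketch of the notification described above (the flag
      name follows this commit; the exact call site is an assumption):

            /* on a new bad block, poke the sysfs file so the external
             * metadata handler can acknowledge it */
            if (test_bit(ExternalBbl, &rdev->flags))
                    sysfs_notify(&rdev->kobj, NULL, "unacknowledged_bad_blocks");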
      
      When bad block support is being removed, just clear the rdev flag;
      it is not necessary to reset the badblocks->shift field. If bad
      blocks are cleared or added at the same time, it is fine for those
      changes to be applied to the structure: the array is in the blocked
      state, and a drive which can no longer handle bad blocks will be
      removed from the array before it is unblocked.
      
      Simplify the state_show function by adding a separator at the end of
      each string and overwriting the last separator with a newline.
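
      A standalone illustration of that pattern (not the md.c code):
      append "flag," for each set flag, then overwrite the final separator
      with the newline.

            #include <stdio.h>

            int main(void)
            {
                    char buf[64];
                    int len = 0;

                    len += sprintf(buf + len, "faulty,");
                    len += sprintf(buf + len, "in_sync,");
                    len += sprintf(buf + len, "external_bbl,");
                    buf[len - 1] = '\n';  /* last ',' becomes the newline */
                    fputs(buf, stdout);   /* faulty,in_sync,external_bbl */
                    return 0;
            }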
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 29 Oct 2016, 3 commits
    • md: be careful not to leak internal curr_resync value into metadata. -- (all) · 1217e1d1
      Authored by NeilBrown
      mddev->curr_resync usually records where the current resync is up to,
      but during the starting phase it has some "magic" values.
      
       1 - means that the array is trying to start a resync, but has yielded
           to another array which shares physical devices, and also needs to
           start a resync
       2 - means the array is trying to start resync, but has found another
           array which shares physical devices and has already started resync.
      
       3 - means that resync has commenced, but it is possible that nothing
           has actually been resynced yet.
      
      It is important that this value not be visible to user-space, and
      particularly that it not get written to the metadata as the resync
      or recovery checkpoint.  In part this is because it may be slightly
      higher than the correct value, though that is very rare.  In part it
      is because it is not a multiple of 4K, and some devices only support
      4K-aligned accesses.
      
      There are two places where this value is propagated into either
      ->curr_resync_completed or ->recovery_cp or ->recovery_offset.
      These currently avoid propagating the values 1 and 2, but still
      allow 3 to leak through.
      
      Change them to only propagate the value if it is > 3.
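
      A standalone model of the fixed rule (illustrative names, not the
      md.c hunks):

            #include <stdint.h>
            #include <stdio.h>

            typedef uint64_t sector_t;

            /* Value safe to record as a checkpoint: curr_resync only
             * counts once it is past the magic startup values 1..3. */
            static sector_t checkpoint_to_record(sector_t curr_resync)
            {
                    return curr_resync > 3 ? curr_resync : 0;
            }

            int main(void)
            {
                    /* a magic value must not leak into metadata */
                    printf("%llu\n", (unsigned long long)checkpoint_to_record(3));    /* 0 */
                    /* a real, 4K-aligned progress value is fine */
                    printf("%llu\n", (unsigned long long)checkpoint_to_record(4096)); /* 4096 */
                    return 0;
            }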
      
      As this can cause an array to fail, the patch is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: Viswesh <viswesh.vichu@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid1: handle read error also in readonly mode · 7449f699
      Authored by Tomasz Majchrzak
      If a write is the first operation on a disk and it happens not to be
      aligned to the page size, the block layer sends a read request
      first. If that read fails, the disk is set as failed, because no
      attempt to fix the error is made while the array is in auto-readonly
      mode. Similarly, the disk is set as failed for a read-only array.
      
      Take the same approach as in raid10: don't fail the disk if the
      array is in readonly or auto-readonly mode. Try to redirect the
      request first and, if unsuccessful, return a read error.
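
      A standalone decision model of the new behaviour (names are
      illustrative, not raid1.c code):

            enum read_error_action {
                    TRY_FIX_THEN_FAIL, /* read-write: rewrite to repair; may fail disk */
                    REDIRECT_OR_ERROR  /* (auto-)readonly: try another mirror; report a
                                          read error only if no mirror can serve it */
            };

            static enum read_error_action on_read_error(int array_readonly)
            {
                    return array_readonly ? REDIRECT_OR_ERROR : TRY_FIX_THEN_FAIL;
            }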
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-cache: correct condition for empty metadata write · 9a8b27fa
      Authored by Shaohua Li
      As long as we recover at least one metadata block, we should write
      the empty metadata block as well. The original code could corrupt
      recovery if only one meta block is valid.
      Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
  3. 25 Oct 2016, 5 commits
    • md: report 'write_pending' state when array in sync · 16f88949
      Authored by Tomasz Majchrzak
      If there is a bad block on a disk and a recovery is performed from
      this disk, the same bad block is reported for the new disk. That
      involves setting the MD_CHANGE_PENDING flag in rdev_set_badblocks.
      For external metadata this flag is never cleared, because the array
      state is reported as 'clean'. A read request to a bad block in a
      RAID5 array then gets stuck waiting for the flag to be cleared - as
      per commit c3cce6cd ("md/raid5: ensure device failure recorded
      before write request returns.").
      
      The meaning of the MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags was
      clarified in commit 070dc6dd ("md: resolve confusion of
      MD_CHANGE_CLEAN"); however, the MD_CHANGE_PENDING flag has since
      been used in personality error handlers in a way that does not fully
      comply with its initial purpose. It was supposed to signal that a
      write request is about to start, but now it is also used to request
      a metadata update. Initially (in md_allow_write and md_write_start)
      the MD_CHANGE_PENDING flag was set and in_sync was set to 0 at the
      same time. Error handlers just set the flag without modifying
      in_sync. Since the sysfs array state is a single value, it now
      reports 'clean' when the MD_CHANGE_PENDING flag is set and in_sync
      is 1, and userspace has no idea it is expected to take any action.
      
      Swap the order in which the array state is checked so that
      'write_pending' is reported ahead of 'clean' ('write_pending' is a
      misleading name, but it is too late to rename it now).
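
      A standalone model of the reordered check (flag names mirror the
      description; this is illustrative, not the md.c hunk):

            #include <stdio.h>

            /* sysfs array state is a single value: after the fix,
             * write_pending is checked before clean */
            static const char *array_state(int in_sync, int change_pending)
            {
                    if (change_pending)
                            return "write_pending";
                    if (in_sync)
                            return "clean";
                    return "active";
            }

            int main(void)
            {
                    puts(array_state(1, 1)); /* write_pending: userspace must act */
                    puts(array_state(1, 0)); /* clean */
                    return 0;
            }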
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: write an empty meta-block when creating log super-block · 56056c2e
      Authored by Zhengyuan Liu
      If the superblock points to an invalid meta block, r5l_load_log will
      set create_super to true and create a new superblock. This path is
      taken every time if no write I/O has been issued to the array since
      it was created. Writing an empty meta block when the log superblock
      is first created avoids this unnecessary work.
      
      Another reason is the correctness of log recovery. Currently we have
      the code below to guarantee that log recovery is correct.
      
              if (ctx.seq > log->last_cp_seq + 1) {
                      int ret;

                      /* at least one meta block was replayed: seal the log
                       * with an empty meta block at a bumped sequence so
                       * stale blocks beyond it can never look valid */
                      ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                      if (ret)
                              return ret;
                      log->seq = ctx.seq + 11;
                      log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                      r5l_write_super(log, ctx.pos);
              } else {
                      /* nothing replayed beyond the checkpoint: resume in place */
                      log->log_start = ctx.pos;
                      log->seq = ctx.seq;
              }
      
      Suppose we have just created an array with a journal device, so
      log->log_start and log->last_checkpoint are both 0. We then write
      three meta blocks, all valid except the middle one, and a crash
      happens. After recovery, ctx.seq would equal log->last_cp_seq + 1
      and log->log_start would be set to the position of the invalid
      middle meta block, which leads to problems that are avoided with
      this patch.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: initialize next_checkpoint field before use · 28cd88e2
      Authored by Zhengyuan Liu
      This field was never initialized when we load or recover the log; it
      was only assigned once I/O to the raid disks had finished. So
      r5l_quiesce may use a wrong next_checkpoint to reclaim log space,
      which confuses the reclaimable-space calculation.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID10: ignore discard error · 579ed34f
      Authored by Shaohua Li
      This is the counterpart of the raid1 fix. If a write error occurs,
      raid10 will try to rewrite the bio in small chunks. If the rewrite
      fails, raid10 will record the error in a bad block.
      narrow_write_error always uses WRITE for the bio, but it could
      actually be a discard; since a discard bio has no payload, writing
      it out causes various issues. A discard error is not fatal, though,
      so we can safely ignore it. That is what this patch does.
      
      This issue has existed since discard support was added, but it is
      only exposed by the recent arbitrary bio size feature.
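
      A kernel-style sketch of the idea in the write-completion path (the
      condition shape is assumed from the description, not quoted from the
      patch):

            /* a failed discard is not fatal: don't record a bad block or
             * schedule narrow_write_error (which would rewrite the range
             * with payload-less WRITEs); just ignore the error */
            bool discard_error = bio->bi_error &&
                                 bio_op(bio) == REQ_OP_DISCARD;

            if (!uptodate && !discard_error)
                    set_bit(WriteErrorSeen, &rdev->flags);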
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID1: ignore discard error · e3f948cd
      Authored by Shaohua Li
      If a write error occurs, raid1 will try to rewrite the bio in small
      chunks. If the rewrite fails, raid1 will record the error in a bad
      block. narrow_write_error always uses WRITE for the bio, but it
      could actually be a discard; since a discard bio has no payload,
      writing it out causes various issues. A discard error is not fatal,
      though, so we can safely ignore it. That is what this patch does.
      
      This issue has existed since discard support was added, but it is
      only exposed by the recent arbitrary bio size feature.
      Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 24 Oct 2016, 1 commit
  5. 19 Oct 2016, 2 commits
  6. 18 Oct 2016, 1 commit
  7. 14 Oct 2016, 2 commits
  8. 12 Oct 2016, 2 commits
    • kthread: kthread worker API cleanup · 3989144f
      Authored by Petr Mladek
      It is good practice to prefix function names with the name of the
      subsystem.
      
      The kthread worker API is a mix of classic kthreads and workqueues.
      Each worker has a dedicated kthread that runs a generic function
      which processes queued work items. It is implemented as part of the
      kthread subsystem.
      
      This patch renames the existing kthread worker API to use the
      corresponding names from the workqueues API prefixed by kthread_ (a
      usage sketch follows the notes below):
      
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
      
      Note that the names of DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common that the "DEFINE_" prefix has
      precedence over the subsystem names.
      
      Note that the INIT() macros and init() functions use different
      naming schemes. There is no perfect solution; there are several
      reasons for this choice:
      
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
      
        + INIT() macros are used only in DEFINE() macros
      
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
      
        + There will also be kthread_destroy_worker() that will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again, it looks better if all
          functions use the same naming scheme.
      
        + There are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type(), regmap_init_mmio_clk().
      
        + It is not an argument but it was inconsistent even before.
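
      For orientation, a minimal module-style sketch of the renamed API
      (illustrative only; error handling trimmed):

            #include <linux/kthread.h>
            #include <linux/module.h>

            static struct kthread_worker worker;
            static struct kthread_work work;
            static struct task_struct *task;

            static void do_work(struct kthread_work *w)
            {
                    pr_info("work item ran\n");
            }

            static int __init demo_init(void)
            {
                    kthread_init_worker(&worker);       /* was init_kthread_worker() */
                    task = kthread_run(kthread_worker_fn, &worker, "demo_worker");
                    if (IS_ERR(task))
                            return PTR_ERR(task);
                    kthread_init_work(&work, do_work);  /* was init_kthread_work() */
                    kthread_queue_work(&worker, &work); /* was queue_kthread_work() */
                    kthread_flush_work(&work);          /* was flush_kthread_work() */
                    return 0;
            }

            static void __exit demo_exit(void)
            {
                    kthread_flush_worker(&worker);      /* was flush_kthread_worker() */
                    kthread_stop(task);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");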
      
      [arnd@arndb.de: fix linux-next merge conflict]
       Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm raid: fix compat_features validation · 5c33677c
      Authored by Andy Whitcroft
      In ecbfb9f1 ("dm raid: add raid level takeover support") a new
      compatible feature flag was added.  Validation for these
      compat_features was added, but it only passes for new raid mappings
      carrying this feature flag, which causes previously created raid
      mappings to fail on import.
      
      Check compat_features for the only valid combination.
      
      Fixes: ecbfb9f1 ("dm raid: add raid level takeover support")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  9. 04 Oct 2016, 1 commit
  10. 29 Sep 2016, 1 commit
    • dm mpath: always return reservation conflict without failing over · 8ff232c1
      Authored by Hannes Reinecke
      If dm-mpath encounters a reservation conflict, it should not fail
      the path (communication with the target is not affected) but should
      rather retry on another path.  However, in doing so we might induce
      a ping-pong between paths with no guarantee of any forward progress.
      And arguably a reservation conflict is an unexpected error, so we
      should pass it upwards to allow the application to take appropriate
      steps.
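
      A sketch of the resulting completion-path policy (one assumption on
      the errno: the block layer reports a SCSI reservation conflict as
      -EBADE):

            if (error == -EBADE)
                    return error;      /* path is healthy: pass the conflict up */
            if (error && !noretry_error(error))
                    fail_path(pgpath); /* genuine path error: fail over */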
      
      This change resolves a show-stopper problem seen with the pNFS SCSI
      layout, because without it it is trivial to hit
      reservation-conflict-based failover loops.
      
      Doubts were raised about the implications of this change relative to
      products like IBM's SVC.  But there is little point withholding a fix
      for Linux because a proprietary product may or may not have some issues
      in its implementation of how it interfaces with Linux.  In the future,
      if there is glaring evidence that this change is certainly problematic
      we can revisit it.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
  11. 22 Sep 2016, 21 commits
    • dm bufio: remove dm_bufio_cond_resched() · 7cd32674
      Authored by Peter Zijlstra
      Use cond_resched() like everybody else.
      
      Mikulas explained why dm_bufio_cond_resched() was introduced to begin
      with (hopefully cond_resched can be improved accordingly) here:
      https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
    • dm crypt: fix crash on exit · f659b100
      Authored by Rabin Vincent
      As the documentation for kthread_stop() says, "if threadfn() may call
      do_exit() itself, the caller must ensure task_struct can't go away".
      dm-crypt does not ensure this and therefore crashes when crypt_dtr()
      calls kthread_stop().  The crash is trivially reproducible by adding a
      delay before the call to kthread_stop() and just opening and closing a
      dm-crypt device.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
       task: ffff88003bd0df40 task.stack: ffff8800375b4000
       RIP: 0010: kthread_stop+0x52/0x300
       Call Trace:
        crypt_dtr+0x77/0x120
        dm_table_destroy+0x6f/0x120
        __dm_destroy+0x130/0x250
        dm_destroy+0x13/0x20
        dev_remove+0xe6/0x120
        ? dev_suspend+0x250/0x250
        ctl_ioctl+0x1fc/0x530
        ? __lock_acquire+0x24f/0x1b10
        dm_ctl_ioctl+0x13/0x20
        do_vfs_ioctl+0x91/0x6a0
        ? ____fput+0xe/0x10
        ? entry_SYSCALL_64_fastpath+0x5/0xbd
        ? trace_hardirqs_on_caller+0x151/0x1e0
        SyS_ioctl+0x41/0x70
        entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      This problem was introduced by bcbd94ff ("dm crypt: fix a possible
      hang due to race condition on exit").
      
      Looking at the description of that patch (excerpted below), it seems
      like the problem it addresses can be solved by just using
      set_current_state instead of __set_current_state, since we obviously
      need the memory barrier.
      
      | dm crypt: fix a possible hang due to race condition on exit
      |
      | A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
      | __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
      | It is possible that the processor reorders memory accesses so that
      | kthread_should_stop() is executed before __set_current_state().  If
      | such reordering happens, there is a possible race on thread
      | termination: [...]
      
      So this patch just reverts the aforementioned patch and changes the
      __set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...).  This
      fixes the crash and should also fix the potential hang.
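
      A kernel-style sketch of the resulting wait loop (the variable names
      are assumptions; only the set_current_state() call is what the patch
      mandates):

            /* set_current_state(), unlike __set_current_state(), implies
             * a memory barrier, so the TASK_INTERRUPTIBLE store cannot be
             * reordered after the kthread_should_stop() test below */
            set_current_state(TASK_INTERRUPTIBLE);
            __add_wait_queue(&write_thread_wait, &wait);
            spin_unlock_irq(&write_thread_lock);

            if (kthread_should_stop()) {
                    set_current_state(TASK_RUNNING);
                    remove_wait_queue(&write_thread_wait, &wait);
                    break;
            }

            schedule();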
      
      Fixes: bcbd94ff ("dm crypt: fix a possible hang due to race condition on exit")
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Rabin Vincent <rabinv@axis.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache metadata: switch to using the new cursor api for loading metadata · f177940a
      Authored by Joe Thornber
      This change offers a pretty significant performance improvement.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm array: introduce cursor api · fdd1315a
      Authored by Joe Thornber
      A more efficient way to iterate an array, thanks to prefetching
      (makes use of the new dm_btree_cursor_* api).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm btree: introduce cursor api · 7d111c81
      Authored by Joe Thornber
      This uses prefetching to speed up iteration through a btree.
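
      A sketch of the intended usage (the dm_btree_cursor_* names come
      from this series; the exact signatures and loop shape here are
      assumptions):

            struct dm_btree_cursor c;
            uint64_t key;
            __le64 value;
            int r;

            r = dm_btree_cursor_begin(info, root, true, &c); /* prefetch leaves */
            while (!r) {
                    r = dm_btree_cursor_get_value(&c, &key, &value);
                    if (r)
                            break;
                    /* ... consume key/value ... */
                    r = dm_btree_cursor_next(&c); /* -ENODATA ends the walk */
            }
            dm_btree_cursor_end(&c);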
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache policy smq: distribute entries to random levels when switching to smq · 9d1b404c
      Authored by Joe Thornber
      For smq the 32-bit 'hint' stores the multiqueue level that the entry
      should be stored in.  If a different policy was used previously and
      we then switch to smq, the hints will be invalid.  In that case we
      used to put all entries in the bottom level of the multiqueue and
      then redistribute them.  Redistribution is faster if we initially
      place entries with invalid hints in random levels.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: speed up writing of the hint array · 4e781b49
      Authored by Joe Thornber
      It's far quicker to always delete the hint array and recreate with
      dm_array_new() because we avoid the copying caused by mutation.
      
      This also simplifies the policy interface, replacing walk_hints()
      with the simpler get_hint().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm array: add dm_array_new() · dd6a77d9
      Authored by Joe Thornber
      dm_array_new() creates a new, populated array more efficiently than
      starting with an empty one and resizing.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • block: export bio_free_pages to other modules · 491221f8
      Authored by Guoqing Jiang
      bio_free_pages was introduced in commit 1dfa0f68
      ("block: add a helper to free bio bounce buffer pages");
      export it so that other modules can reuse the function.
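
      For reference, a sketch of the helper itself (shape of that era's
      bio API; not quoted verbatim from the tree):

            void bio_free_pages(struct bio *bio)
            {
                    struct bio_vec *bvec;
                    int i;

                    bio_for_each_segment_all(bvec, bio, i)
                            __free_page(bvec->bv_page);
            }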
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • raid5: handle register_shrinker failure · 30c89465
      Authored by Shaohua Li
      register_shrinker() can now fail. When that happens,
      shrinker.nr_deferred is NULL, so we use it to determine whether
      unregister_shrinker() is required.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: fix to detect failure of register_shrinker · 6a0f53ff
      Authored by Chao Yu
      register_shrinker can fail after commit 1d3d4437 ("vmscan: per-node
      deferred work"); we should detect that failure, otherwise we may
      silently fail to register the shrinker even though the raid5
      configuration was set up successfully.
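
      A sketch of the two pieces together (the nr_deferred test follows
      the previous commit; exact call sites are assumptions):

            /* setup_conf(): detect registration failure */
            if (register_shrinker(&conf->shrinker))
                    goto abort;

            /* free_conf(): nr_deferred is only non-NULL after a
             * successful registration */
            if (conf->shrinker.nr_deferred)
                    unregister_shrinker(&conf->shrinker);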
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: fix a potential deadlock · 90bcf133
      Authored by Shaohua Li
      lockdep reports a potential deadlock. Fix this by dropping the mutex
      before calling md_import_device (see the sketch after the lockdep
      trace below).
      
      [ 1137.126601] ======================================================
      [ 1137.127013] [ INFO: possible circular locking dependency detected ]
      [ 1137.127013] 4.8.0-rc4+ #538 Not tainted
      [ 1137.127013] -------------------------------------------------------
      [ 1137.127013] mdadm/16675 is trying to acquire lock:
      [ 1137.127013]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]
      but task is already holding lock:
      [ 1137.127013]  (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
      [ 1137.127013]
      which lock already depends on the new lock.
      
      [ 1137.127013]
      the existing dependency chain (in reverse order) is:
      [ 1137.127013]
      -> #1 (detected_devices_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
      [ 1137.127013]        [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
      [ 1137.127013]        [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
      [ 1137.127013]        [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
      [ 1137.127013]        [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      -> #0 (&bdev->bd_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]        [<ffffffff81244307>] blkdev_get+0x227/0x350
      [ 1137.127013]        [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
      [ 1137.127013]        [<ffffffff81a46d65>] lock_rdev+0x35/0x80
      [ 1137.127013]        [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
      [ 1137.127013]        [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
      [ 1137.127013]        [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      other info that might help us debug this:
      
      [ 1137.127013]  Possible unsafe locking scenario:
      
      [ 1137.127013]        CPU0                    CPU1
      [ 1137.127013]        ----                    ----
      [ 1137.127013]   lock(detected_devices_mutex);
      [ 1137.127013]                                lock(&bdev->bd_mutex);
      [ 1137.127013]                                lock(detected_devices_mutex);
      [ 1137.127013]   lock(&bdev->bd_mutex);
      [ 1137.127013]
       *** DEADLOCK ***
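
      The shape of the fix (a sketch consistent with the trace above, not
      the verbatim diff): drop detected_devices_mutex around the call that
      ends up taking bd_mutex.

            mutex_unlock(&detected_devices_mutex);  /* avoid bd_mutex inversion */
            rdev = md_import_device(dev, 0, 90);
            mutex_lock(&detected_devices_mutex);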
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/bitmap: fix wrong cleanup · f71f1cf9
      Authored by Shaohua Li
      If bitmap_create fails, the bitmap has already been cleaned up and
      the return value is an error pointer, so we must not do the cleanup
      again.
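
      A sketch of the calling pattern (illustrative, following the IS_ERR
      convention described above):

            bitmap = bitmap_create(mddev, -1);
            if (IS_ERR(bitmap)) {
                    rv = PTR_ERR(bitmap);
                    /* bitmap_create cleaned up after itself:
                     * do NOT free the bitmap again here */
            } else {
                    mddev->bitmap = bitmap;
            }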
      Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: allow arbitrary max_hw_sectors · 1dffdddd
      Authored by Shaohua Li
      raid5 splits bios to the proper size internally, so there is no
      point in inheriting the underlying disks' max_hw_sectors. In my qemu
      system, without this change the raid5 array only receives 128k bios,
      which reduces the chance of bio merging when requests are sent down
      to the underlying disks.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: make the resync lock interruptible as well · d6385db9
      Authored by Guoqing Jiang
      When one node is performing resync or recovery, other nodes can't
      get the resync lock and may block for a while before acquiring it,
      so the array can't be stopped immediately in this scenario.
      
      To allow the array to be stopped quickly, check MD_CLOSING in
      dlm_lock_sync_interruptible so that the lock request can be
      interrupted.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: introduce dlm_lock_sync_interruptible to fix task hangs · 7bcda714
      Authored by Guoqing Jiang
      When a node leaves the cluster, its bitmap needs to be synced by
      another node, so an "md*_recover" thread is triggered for that
      purpose. However, with the steps below we can produce a task hang in
      either B or C.
      
      1. Node A creates a resyncing cluster raid1 and it is assembled on
         the other two nodes (B and C).
      2. Stop the array on B and C.
      3. Stop the array on A.
      
      linux44:~ # ps aux|grep md|grep D
      root	5938	0.0  0.1  19852  1964 pts/0    D+   14:52   0:00 mdadm -S md0
      root	5939	0.0  0.0      0     0 ?        D    14:52   0:00 [md0_recover]
      
      linux44:~ # cat /proc/5939/stack
      [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster]
      [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster]
      [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod]
      [<ffffffff8107ad94>] kthread+0xb4/0xc0
      [<ffffffff8152a518>] ret_from_fork+0x58/0x90
      
      linux44:~ # cat /proc/5938/stack
      [<ffffffff8107afde>] kthread_stop+0x6e/0x120
      [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod]
      [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster]
      [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod]
      [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod]
      [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod]
      [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod]
      [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0
      [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40
      [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0
      [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0
      [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
      
      The problem is that recover_bitmaps can't reliably abort when its
      thread is unregistered, so dlm_lock_sync_interruptible is introduced
      to detect the thread's situation and fix the problem.
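
      A sketch of the idea (the wait condition is an assumption pieced
      together from this and the neighbouring patches):

            /* request the lock asynchronously; sync_ast wakes the waiter */
            ret = dlm_lock_async(res, mode); /* helper name is illustrative */
            if (ret)
                    return ret;
            /* wake up when sync_ast fires or when the thread must die */
            wait_event(res->sync_locking,
                       res->sync_locking_done ||
                       kthread_should_stop() ||
                       test_bit(MD_CLOSING, &mddev->flags));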
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: convert the completion to wait queue · fccb60a4
      Authored by Guoqing Jiang
      Previously we used a completion to synchronize between requesting
      the dlm lock and sync_ast. However, we would have to expose
      completion.wait and completion.done in dlm_lock_sync_interruptible
      (introduced later), which is not a normal use of a completion, so
      convert the related code to a wait queue.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: protect md_find_rdev_nr_rcu with rcu lock · 5f0aa21d
      Authored by Guoqing Jiang
      We need to use rcu_read_lock/unlock to avoid a potential race.
      Reported-by: Shaohua Li <shli@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: clean up cluster-related info · c20c33f0
      Authored by Guoqing Jiang
      cluster_info and bitmap_info.nodes also need to be cleared when the
      array is stopped.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: changes for MD_STILL_CLOSED flag · af8d8e6f
      Authored by Guoqing Jiang
      When stopping a clustered raid while it is pending on resync, the
      MD_STILL_CLOSED flag can get cleared because a udev rule is
      triggered and opens the mddev, so the array can't be stopped
      promptly and do_md_stop returns EBUSY.
      
      	mdadm -Ss                    md-raid-arrays.rules
      	set MD_STILL_CLOSED          md_open()
      	... ...                      clear MD_STILL_CLOSED
      	do_md_stop
      
      We make the changes below to resolve this issue (see the sketch
      after this list):
      
      1. Rename MD_STILL_CLOSED to MD_CLOSING, since it is set when
         stopping an array and means we are in the middle of stopping it.
      2. Let md_open return early if MD_CLOSING is set, so no other
         thread can open the array while one thread is trying to close it.
      3. There is no need to clear the CLOSING bit in md_open because
         step 1 ensures the bit is cleared, and then we also don't need to
         test the CLOSING bit in do_md_stop.
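
      A sketch of the md_open() change from point 2 (illustrative shape,
      not the verbatim diff):

            static int md_open(struct block_device *bdev, fmode_t mode)
            {
                    ...
                    if (test_bit(MD_CLOSING, &mddev->flags))
                            return -EBUSY; /* array is being stopped */
                    ...
            }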
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: remove some unnecessary dlm_unlock_sync calls · e3f924d3
      Authored by Guoqing Jiang
      Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to
      call dlm_unlock_sync before freeing a lock resource.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>