Commits · ca39f7502425d437cbf83d29d99b43bd61342858 · openeuler / Kernel

03 Aug, 2022 32 commits

Committed by Christoph Hellwig on Jul 19, 2022

Once a kobject is initialized, the containing object should not be
directly freed.  So delay initialization until it is added.  Also
remove the kobject_del call as the last put will remove the kobject as
well.  The explicit delete isn't needed here, and dropping it will
simplify further fixes.

With this, md_free no longer needs to check that ->gendisk is non-NULL
as it is always set by the time that kobject_init is called on
mddev->kobj.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ca39f750

md/raid5: Convert prepare_to_wait() to wait_woken() api · ee1aa06b

Committed by Logan Gunthorpe on Jul 07, 2022

raid5_get_active_stripe() can sleep in various situations and it
is called by make_stripe_request() while inside the
prepare_to_wait()/finish_wait() section. Nested waits like this are
not supported.

This was noticed while making other changes that add different sleeps
to raid5_get_active_stripe() that caused a WARNING with
CONFIG_DEBUG_ATOMIC_SLEEP.

No ill effects have been noticed with the code as is, but theoretically
a nested wait here could cause a deadlock, so it should be fixed.

To fix this, convert the prepare_to_wait() call to use wait_woken(),
which supports nested sleeps.

Link: https://lwn.net/Articles/628628/
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ee1aa06b

md/raid5: Fix sectors_to_do bitmap overflow in raid5_make_request() · b9f91d80

Committed by Logan Gunthorpe on Jul 07, 2022

For unaligned IOs that have nearly the maximum number of sectors, the
number of stripes will end up being one greater than the size of the
bitmap. When this happens, the last stripe in the IO will not be
processed as it should be, resulting in data corruption.

However, this is not normally seen when the backing block devices have
4K physical block sizes since the block layer will split the request
before that happens.

To fix this, increase the bitmap size by one bit and ensure the full
number of stripes are checked when calling find_first_bit().

Reported-by: David Sloan <David.Sloan@eideticom.com>
Fixes: 7e55c60a ("md/raid5: Pivot raid5_make_request()")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
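
To see why an unaligned request can need one bit more than an aligned one of similar size, here is a minimal userspace sketch, not the kernel code. It assumes 512-byte sectors and 8 sectors per stripe page (4 KiB); the function name and the constant are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration only: a 4 KiB stripe page is 8 sectors. */
#define SKETCH_STRIPE_SECTORS 8u

/*
 * Count the stripe pages covered by the sector range [start, start + len).
 * A partially covered first or last stripe page still counts as one.
 */
static unsigned int stripes_touched(uint64_t start, uint64_t len)
{
	uint64_t first, last;

	assert(len > 0);
	first = start / SKETCH_STRIPE_SECTORS;
	last = (start + len - 1) / SKETCH_STRIPE_SECTORS;
	return (unsigned int)(last - first + 1);
}
```

Under these assumptions an aligned 2048-sector request touches 256 stripe pages, while a slightly smaller 2042-sector request that starts at sector 7 touches 257, which is one more than a 256-bit bitmap can track.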

b9f91d80

bcache: remove EXPERIMENTAL for Kconfig option 'Asynchronous device registration' · 640c46a2

Committed by Coly Li on Jul 19, 2022

The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option has
existed for 2+ years. It is used when registration takes too much time
for a massive amount of cached data, to avoid a udev task timeout during
boot.

Many users and products (e.g. SUSE Linux) have enabled this Kconfig
option for quite a long time; it works as expected and no issues have
been reported.

It is time to remove the "EXPERIMENTAL" tag from this Kconfig item.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220719042724.8498-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

640c46a2

md: Fix spelling mistake in comments · 9e26728b

Committed by Zhang Jiaming on Jul 02, 2022

There are 2 spelling mistakes in comments. Fix them.

Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e26728b

md/raid5: Increase restriction on max segments per request · 9ad1a74f

Committed by Logan Gunthorpe on Jun 16, 2022

The block layer defaults the maximum segments to 128, which means
requests tend to get split at around 512KB, depending on how many
pages can be merged. There's no such restriction in the raid5 code,
so increase the limit to USHRT_MAX so that larger requests can be
sent as one.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9ad1a74f

md/raid5: Improve debug prints · df1b620a

Committed by Logan Gunthorpe on Jun 16, 2022

Add a debug print for raid5_make_request() so that each request is
printed and add the logical sector number to the debug print in
__add_stripe_bio().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

df1b620a

md/raid5: Pivot raid5_make_request() · 7e55c60a

Committed by Logan Gunthorpe on Jun 16, 2022

raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page in the
disk.

This causes a great deal of contention on the hash_lock and extra
work, since each stripe must be found once for every data disk.

The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.

To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.

Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.

One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages, which works out
to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
restrict the maximum request further, since the default block queue
settings are used, which restrict the number of segments to 128 (which
results in request sizes that are approximately 512KB).

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
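
The bitmap bookkeeping described in this message can be sketched in userspace as follows. This is an illustration only, not the kernel's bitmap API: the type and function names are hypothetical, and the 256-bit size is taken from the limit mentioned above. As with the kernel's find_first_bit(), the lookup returns the bitmap size when no bit is set.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_MAX_STRIPES 256
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS ((SKETCH_MAX_STRIPES + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

struct todo_bitmap {
	unsigned long words[SKETCH_WORDS];
};

/* Mark the first nbits stripe pages as still to be done. */
static void todo_init(struct todo_bitmap *b, size_t nbits)
{
	size_t i;

	assert(nbits <= SKETCH_MAX_STRIPES);
	for (i = 0; i < SKETCH_WORDS; i++)
		b->words[i] = 0;
	for (i = 0; i < nbits; i++)
		b->words[i / SKETCH_WORD_BITS] |= 1UL << (i % SKETCH_WORD_BITS);
}

/* Clear one bit once its bio has been added to a stripe. */
static void todo_done(struct todo_bitmap *b, size_t bit)
{
	b->words[bit / SKETCH_WORD_BITS] &= ~(1UL << (bit % SKETCH_WORD_BITS));
}

/* Return the lowest set bit below nbits, or nbits when nothing is left. */
static size_t todo_first(const struct todo_bitmap *b, size_t nbits)
{
	size_t i;

	for (i = 0; i < nbits; i++)
		if (b->words[i / SKETCH_WORD_BITS] & (1UL << (i % SKETCH_WORD_BITS)))
			return i;
	return nbits;
}
```

A request loop built on this would call todo_first() repeatedly, handle that stripe page, call todo_done() on success, and stop once todo_first() returns nbits.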

7e55c60a

md/raid5: Check all disks in a stripe_head for reshape progress · 486f6055

Committed by Logan Gunthorpe on Jun 16, 2022

When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subsequent
patch.

To do this more cleanly, refactor the check into a helper function called
stripe_ahead_of_reshape().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

486f6055

md/raid5: Refactor add_stripe_bio() · 4ad1d984

Committed by Logan Gunthorpe on Jun 16, 2022

Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.

This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4ad1d984

md/raid5: Keep a reference to last stripe_head for batch · 3312e6c8

Committed by Logan Gunthorpe on Jun 16, 2022

When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.

Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.

The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3312e6c8

md/raid5: Refactor for loop in raid5_make_request() into while loop · 0a2d1694

Committed by Logan Gunthorpe on Jun 16, 2022

The for loop with retry label can be more cleanly expressed as a while
loop by moving the logical_sector increment into the success path.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0a2d1694

md/raid5: Move read_seqcount_begin() into make_stripe_request() · 4f354560

Committed by Logan Gunthorpe on Jun 16, 2022

Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
into make_stripe_request().

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f354560

md/raid5: Drop the do_prepare flag in raid5_make_request() · 1cdb5b41

Committed by Logan Gunthorpe on Jun 16, 2022

prepare_to_wait() can be reasonably called after schedule instead of
setting a flag and preparing in the next loop iteration.

This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1cdb5b41

md/raid5: Factor out helper from raid5_make_request() loop · f4aec6a0

Committed by Logan Gunthorpe on Jun 16, 2022

Factor out the inner loop of raid5_make_request() into its own helper
called make_stripe_request().

The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.

A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.

No functional changes intended. This will be cleaned up further in
subsequent patches to untangle the gen_lock and do_prepare logic
further.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
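
The status-driven loop this message describes might look roughly like the userspace sketch below. Only the four status names come from the message; the STRIPE_ prefix, the context fields, the stand-in helper and the one-sector step are assumptions made for illustration, and the real code sleeps where this sketch merely counts.

```c
#include <assert.h>

/* Only the four statuses come from the commit message; the rest is hypothetical. */
enum stripe_result {
	STRIPE_SUCCESS,
	STRIPE_RETRY,
	STRIPE_SCHEDULE_AND_RETRY,
	STRIPE_FAIL,
};

struct request_ctx {
	unsigned int busy_attempts;	/* attempts that still find the stripe busy */
	unsigned long bad_sector;	/* a sector that cannot be served */
};

/* Stand-in for the real helper: decides the outcome for one sector. */
static enum stripe_result make_stripe_request_sketch(struct request_ctx *ctx,
						     unsigned long sector)
{
	if (sector == ctx->bad_sector)
		return STRIPE_FAIL;
	if (ctx->busy_attempts > 0) {
		ctx->busy_attempts--;
		return STRIPE_SCHEDULE_AND_RETRY;
	}
	return STRIPE_SUCCESS;
}

/* Walk [first, end); returns 0 on success, -1 on failure. */
static int submit_range(struct request_ctx *ctx, unsigned long first,
			unsigned long end, unsigned int *waits)
{
	unsigned long sector = first;

	assert(first <= end);
	while (sector < end) {
		switch (make_stripe_request_sketch(ctx, sector)) {
		case STRIPE_SUCCESS:
			sector++;	/* only the success path advances */
			break;
		case STRIPE_RETRY:
			break;		/* try the same sector again */
		case STRIPE_SCHEDULE_AND_RETRY:
			(*waits)++;	/* the kernel would sleep here */
			break;
		case STRIPE_FAIL:
			return -1;
		}
	}
	return 0;
}
```

Keeping the scheduling decision in one switch arm is what lets the SCHEDULE_AND_RETRY path be shared by every caller-visible reason for waiting.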

f4aec6a0

md/raid5: Move common stripe get code into new find_get_stripe() helper · 1baa1126

Committed by Logan Gunthorpe on Jun 16, 2022

Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1baa1126

md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() · 8757fef6

Committed by Logan Gunthorpe on Jun 16, 2022

stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.

The call to add_stripe_bio() in retry_aligned_read() is for a read,
and batching only applies to writes, so it's impossible for batching
to happen at that call site.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8757fef6

md/raid5: Refactor raid5_make_request loop · 27fb7010

Committed by Logan Gunthorpe on Jun 16, 2022

Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().

This makes the code easier to read and reduces the indentation level.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

27fb7010

md/raid5: Factor out ahead_of_reshape() function · a8bb304c

Committed by Logan Gunthorpe on Jun 16, 2022

There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is ahead of a reshape sector.

Factor this out into a simple helper called ahead_of_reshape().

No functional changes intended.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
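
A helper of this kind could be sketched as below. The ternary mirrors the form quoted in the 6e3f50d3 message on this page; the parameter list is an assumption (the kernel helper presumably takes the mddev rather than a bare flag), and the self-check function is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A sector is "ahead" of the reshape if the reshape has not reached it yet:
 * below the reshape position when reshaping backwards, at or above it
 * when reshaping forwards.
 */
static bool ahead_of_reshape(bool reshape_backwards, uint64_t sector,
			     uint64_t reshape_sector)
{
	return reshape_backwards ? sector < reshape_sector
				 : sector >= reshape_sector;
}

/* Small self-check showing both directions. */
static void ahead_of_reshape_selfcheck(void)
{
	assert(ahead_of_reshape(false, 100, 50));	/* forwards, not reached yet */
	assert(!ahead_of_reshape(false, 10, 50));	/* forwards, already passed */
	assert(ahead_of_reshape(true, 10, 50));		/* backwards, not reached yet */
	assert(!ahead_of_reshape(true, 50, 50));	/* backwards, boundary excluded */
}
```

With such a helper, a condition like "not ahead of reshape_progress but ahead of reshape_safe" reads as two calls instead of two nested ternaries.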

a8bb304c

md/raid5: Make logic blocking check consistent with logic that blocks · 6e3f50d3

Committed by Logan Gunthorpe on Jun 16, 2022

The check in raid5_make_request differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.

The condition which causes the schedule is:

 !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
   logical_sector >= conf->reshape_progress) &&
  (mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
   logical_sector >= conf->reshape_safe)

The condition that causes the early bailout is made to match this.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6e3f50d3

md: unlock mddev before reap sync_thread in action_store · 9dfbdafd

Committed by Guoqing Jiang on Jun 21, 2022

Since the bug which commit 8b48ec23 ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related to the action_store path, other
callers which reap sync_thread didn't need to be changed.

Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
previous bug as follows:

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlocking, then restore it to ensure the
   position is not changed accidentally by others.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9dfbdafd

md: Explicitly create command-line configured devices · 05ce7fb9

Committed by Chris Webb on Jun 01, 2022

Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.

Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.

Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05ce7fb9

md: Notify sysfs sync_completed in md_reap_sync_thread() · 9973f0fa

Committed by Logan Gunthorpe on Jun 08, 2022

The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)

When the background mdadm process hangs, it is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple of times when
the reshape finishes, but it is woken up before MD_RECOVERY_RUNNING
is cleared, so sync_completed_show() reports 0 instead of "none".

To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9973f0fa

md: Ensure resync is reported after it starts · b368856a

Committed by Logan Gunthorpe on Jun 08, 2022

The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:

  mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
      File exists

This is because the background mdadm process, which is responsible for
cleaning up this backup file, gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file to see whether a
reshape is running and, if it is not, it waits for an event on the file
or times out in 5 seconds. On faster machines, the reshape may complete
before the 5 seconds elapse, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.

mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.

The reason mdstat does not report that the reshape has started is an
issue in status_resync(). recovery_active is subtracted from curr_resync,
which will result in a value of zero for the first chunk of reshaped
data, and the resulting read will report no reshape in progress.

To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
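
The clamping described in the fix can be illustrated with a small userspace sketch. It is not the kernel's status_resync(): the function name is hypothetical, and the constant stands in for MD_RESYNC_ACTIVE, assumed here to be 3 based on the eac58d08 message on this page, which says that 3 or greater means an active resync.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MD_RESYNC_ACTIVE; values below it are treated as special. */
enum { SKETCH_RESYNC_ACTIVE = 3 };

/*
 * Compute the progress value to report. Special values pass through
 * unchanged; once a resync is active, the subtraction must never land
 * on a special value again, so it is forced up to the "active" marker.
 */
static uint64_t reported_resync(uint64_t curr_resync, uint64_t recovery_active)
{
	uint64_t resync;

	if (curr_resync < SKETCH_RESYNC_ACTIVE)
		return curr_resync;

	assert(recovery_active <= curr_resync);
	resync = curr_resync - recovery_active;
	if (resync < SKETCH_RESYNC_ACTIVE)
		resync = SKETCH_RESYNC_ACTIVE;
	return resync;
}
```

In the first-chunk case from the message, where the in-flight amount equals curr_resync, the unclamped result would be 0 and look like "no resync"; with the clamp it is reported as active.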

b368856a

md: Use enum for overloaded magic numbers used by mddev->curr_resync · eac58d08

Committed by Logan Gunthorpe on Jun 08, 2022

Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.

The only functional change is that a couple of places used the wrong
comparison operator, which implied that 3 is another special value. They
are all fixed to imply that 3 or greater is an active resync.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eac58d08

md/raid5-cache: Annotate pslot with __rcu notation · 6f28c5c3

Committed by Logan Gunthorpe on Jun 08, 2022

radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:

  drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
			assignment (different address spaces)
  drivers/md/raid5-cache.c:2939:23:    expected void **pslot
  drivers/md/raid5-cache.c:2939:23:    got void [noderef] __rcu **

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6f28c5c3

md/raid5-cache: Clear conf->log after finishing work · b13015af

Committed by Logan Gunthorpe on Jun 08, 2022

A NULL pointer dereference on conf->log is seen randomly with
the mdadm test 21raid5cache. KASAN reports:

BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086

Call Trace:
  dump_stack_lvl+0x5a/0x74
  kasan_report.cold+0x5f/0x1a9
  __asan_load8+0x69/0x90
  r5l_reclaimable_space+0xf5/0x140
  r5l_do_reclaim+0xf4/0x5e0
  r5l_reclaim_thread+0x69/0x3b0
  md_thread+0x1a2/0x2c0
  kthread+0x177/0x1b0
  ret_from_fork+0x22/0x30

This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.

To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b13015af

md/raid5-cache: Drop RCU usage of conf->log · 7769085c

Committed by Logan Gunthorpe on Jun 08, 2022

The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().

It is known that the IO will be suspended before the log is freed and
r5l_exit_log() is called with the mddev_lock() held.

This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7769085c

md/raid5-cache: Take mddev_lock in r5c_journal_mode_show() · 78ede6a0

Committed by Logan Gunthorpe on Jun 08, 2022

The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.

To fix this, take the mddev_lock() instead of the mddev->lock spinlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

78ede6a0

md/raid5: suspend the array for calls to log_exit() · c629f345

Committed by Logan Gunthorpe on Jun 08, 2022

The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.

The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.

There is one remaining site in raid5_remove_disk() where log_exit() is
called without suspending the array. Unfortunately, as the comment there
states, we cannot call mddev_suspend() from raid5d.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c629f345

md/raid5-ppl: Drop unused argument from ppl_handle_flush_request() · e0fccdaf

Committed by Logan Gunthorpe on Jun 08, 2022

ppl_handle_flush_request() takes a struct r5log argument but doesn't
use it. It has no business taking this argument as it is only used
by raid5-cache and has no way to dereference it anyway. Remove
the argument.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e0fccdaf

md/raid5-log: Drop extern decorators for function prototypes · ed0c6a5f

Committed by Logan Gunthorpe on Jun 08, 2022

extern is unnecessary, and recommended against, when declaring function
prototypes in headers. checkpatch.pl complains about these, so remove
them.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed0c6a5f

29 Jul, 2022 8 commits

dm: fix dm-raid crash if md_handle_request() splits bio · 9dd1cd32

Committed by Mike Snitzer on Jul 20, 2022

Commit ca522482 ("dm: pass NULL bdev to bio_alloc_clone")
introduced the optimization to _not_ perform bio_associate_blkg()'s
relatively costly work when DM core clones its bio. But in doing so it
exposed the possibility for DM's cloned bio to alter DM target
behavior (e.g. crash) if a target were to issue IO without first
calling bio_set_dev().

The DM raid target can trigger an MD crash due to its need to split
the DM bio that is passed to md_handle_request(). The split will
recurse to submit_bio_noacct() using a bio with an uninitialized
->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to
dereference a NULL blkg_to_tg(bio->bi_blkg).

Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that
will make alloc_tio() call bio_set_dev() on behalf of the target.
dm-raid is the only target that requires this flag. bio_set_dev()
initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg,
before passing the bio to md_handle_request().

The long-term fix would be to audit and refactor MD code to rely on DM to
split its bio, using dm_accept_partial_bio(), but there are MD raid
personalities (e.g. raid1 and raid10) whose implementations are tightly
coupled to handling the bio splitting inline.

Fixes: ca522482 ("dm: pass NULL bdev to bio_alloc_clone")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

9dd1cd32

dm raid: fix address sanitizer warning in raid_resume · 7dad24db

Committed by Mikulas Patocka on Jul 24, 2022

There is a KASAN warning in raid_resume when running the lvm test
lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
is greater than rs->raid_disks, so the loop touches one entry beyond
the allocated length.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

7dad24db

dm raid: fix address sanitizer warning in raid_status · 1fbeea21

Committed by Mikulas Patocka on Jul 24, 2022

There is this warning when using a kernel with the address sanitizer
and running this testsuite:
https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid

==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x9c
 print_address_description.constprop.0+0x1f/0x1e0
 print_report.cold+0x55/0x244
 kasan_report+0xc9/0x100
 raid_status+0x1747/0x2820 [dm_raid]
 dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
 table_load+0x35c/0x630 [dm_mod]
 ctl_ioctl+0x411/0x630 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 __x64_sys_ioctl+0x12a/0x1a0
 do_syscall_64+0x5b/0x80

The warning is caused by reading conf->max_nr_stripes in raid_status. The
code in raid_status reads mddev->private, casts it to struct r5conf and
reads the entry max_nr_stripes.

However, if we have a different raid type than 4/5/6, mddev->private
doesn't point to struct r5conf; it may point to struct r0conf, struct
r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
of these structs to struct r5conf, we will be reading invalid memory
and KASAN warns about it.

Fix this bug by reading struct r5conf only if the raid type is 4, 5 or 6.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
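
The guard described above can be sketched in userspace as follows. All type and function names here are hypothetical stand-ins rather than the dm-raid code; the point is only that the private pointer is interpreted as the raid4/5/6 configuration when, and only when, the level says so.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-personality private data. */
struct r5conf_sketch {
	int max_nr_stripes;
};

struct raid_set_sketch {
	int level;	/* 0, 1, 4, 5, 6, 10, ... */
	void *private;	/* layout depends on level */
};

static int is_raid456(const struct raid_set_sketch *rs)
{
	return rs->level >= 4 && rs->level <= 6;
}

/* Report the stripe cache size, or 0 for levels that have none. */
static int stripe_cache_size(const struct raid_set_sketch *rs)
{
	const struct r5conf_sketch *conf;

	assert(rs != NULL);
	/* Only treat private as r5conf when the level guarantees that layout. */
	conf = is_raid456(rs) ? rs->private : NULL;
	return conf ? conf->max_nr_stripes : 0;
}
```

Falling back to 0 for other levels is a choice made for this sketch; the commit message does not say what value the fixed code reports in that case.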

1fbeea21

dm: Start pr_preempt from the same starting path · c6adada5

Committed by Mike Christie on Jul 17, 2022

pr_preempt has a similar issue to reserve: for all the reservation
types except the All Registrants ones, the preempt can create a
reservation, and a follow-up reservation or release needs to go down
the same path the preempt did. This makes pr_preempt work like reserve
and release, where we always start from the first path in the first
group.

This commit has been tested with Windows failover clustering's
validation test and libiscsi's PGR tests to check for regressions.
Neither has a test that verifies this case, so I tested it
manually.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

c6adada5

dm: Fix PR release handling for non All Registrants · 08a3c338

Committed by Mike Christie on Jul 17, 2022

This commit fixes a bug where we are leaving the reservation in place
even though pr_release has run and returned success.

If we have a Write Exclusive, Exclusive Access, or Write/Exclusive
Registrants only reservation, the release must be sent down the path
that is the reservation holder. The problem is that
multipath_prepare_ioctl most likely selected path N for the reservation;
later, when we do the release, multipath_prepare_ioctl will select a
completely different path. The device will then return success because
the nvme and scsi specs say to return success if there is no
reservation or if the release is sent down from a path that is not the
holder. We then think we have released the reservation.

This commit has us loop over each path and send a release so we can
make sure the release is executed on the correct path. It has been
tested with Windows failover clustering's validation test, which checks
this case, and it has been tested manually (the libiscsi PGR tests
don't have a test case for this yet, but I will be adding one).

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
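
The failure mode and the fix can be modelled with a toy userspace sketch. Nothing here is the dm or multipath API: the device model, the path count and the function names are hypothetical, and the model encodes only the behaviour stated in the message, namely that a release on a non-holder path returns success without releasing anything.

```c
#include <assert.h>

#define SKETCH_NR_PATHS 4

/* Toy device: remembers which path holds the reservation, -1 if none. */
struct device_sketch {
	int holder_path;
};

/*
 * Release sent down one path. Per the behaviour described above, the
 * device reports success (0) even when this path is not the holder or
 * when no reservation exists; only the holder path actually releases.
 */
static int pr_release_on_path(struct device_sketch *dev, int path)
{
	assert(path >= 0 && path < SKETCH_NR_PATHS);
	if (dev->holder_path == path)
		dev->holder_path = -1;
	return 0;
}

/* Send the release down every path so the holder is certain to see it. */
static int pr_release_all_paths(struct device_sketch *dev)
{
	int path, ret;

	for (path = 0; path < SKETCH_NR_PATHS; path++) {
		ret = pr_release_on_path(dev, path);
		if (ret)
			return ret;
	}
	return 0;
}
```

In this model a single release on the wrong path returns 0 yet leaves holder_path untouched, which is the silent failure the commit describes; looping over all paths removes the dependence on guessing the holder.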

08a3c338

dm: Start pr_reserve from the same starting path · 70151087

Committed by Mike Christie on Jul 17, 2022

When an app does a pr_reserve it will go to whatever path we happen to
be using at the time. This can result in errors when the app does a
second pr_reserve call and expects success but gets a failure because
the reserve is not done on the holder's path. This commit has us
always start trying to do reserves from the first path in the first
group.

Windows failover clustering will produce the type of pattern above.
With this commit, we will now pass its validation test for this case.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

70151087

dm: Allow dm_call_pr to be used for path searches · 8dd87f3c

Committed by Mike Christie on Jul 17, 2022

The specs state that if you send a reserve down a path that is already
the holder success must be returned and if it goes down a path that
is not the holder reservation conflict must be returned. Windows
failover clustering will send a second reservation and expects that a
device returns success. The problem for multipathing is that for an
All Registrants reservation, we can send the reserve down any path but
for all other reservation types there is one path that is the holder.

To handle this we could add PR state to dm but that can get nasty.
Look at target_core_pr.c for an example of the type of things we'd
have to track. It will also get more complicated because other
initiators can change the state so we will have to add in async
event/sense handling.

This commit, and the 3 commits that follow, try to keep dm simple
and keep just doing passthrough. This commit modifies dm_call_pr to be
able to find the first usable path that can execute our pr_op and then
return. When dm_pr_reserve is converted to dm_call_pr in the next
commit, for the normal case we will use the same path for every
reserve.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

8dd87f3c

dm: return early from dm_pr_call() if DM device is suspended · e120a5f1

Committed by Mike Snitzer on Jul 22, 2022

Otherwise PR ops may be issued while the broader DM device is being
reconfigured, etc.

Fixes: 9c72bad1 ("dm: call PR reserve/unreserve on each underlying device")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

e120a5f1

openeuler / Kernel synced successfully more than 1 year ago
