Commits · f81f7302e86f5c0a21b59c94164f2510812b7764 · openeuler / raspberrypi-kernel

02 Nov, 2017 14 commits

raid1: remove obsolete code in raid1_write_request · f81f7302

Authored by Guoqing Jiang on Oct 24, 2017

Some lines can be removed following recent raid1
changes such as commit 3956df15d634 ("md:
move suspend_hi/lo handling into core md code").

Also, some comments appear to be in the wrong place;
move them before wait_barrier.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md-cluster: Use a small window for raid10 resync · 8db87912

Authored by Guoqing Jiang on Oct 24, 2017

Suspending the entire device for resync could take
too long. Resync in small chunks.

The cluster's resync window is maintained in r10conf as
cluster_sync_low and cluster_sync_high, and processed
in raid10's sync_request(). If the current resync is
outside the cluster resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Set cluster_sync_high to cluster_sync_low + stripe
   size.
3. Send a message to all nodes so they may add it in
   their suspension list.

Note:
We only support "near" raid10 so far; resyncing a far or
offset raid10 array could cause trouble. So raid10_run
checks the layout of a clustered raid10 and refuses
to run if the layout is not supported.

With the "near" layout we process one stripe at a time
progressing monotonically through the address space.
So we can have a sliding window of whole-stripes which
moves through the array suspending IO on other nodes,
and both resync which uses array addresses and recovery
which uses device addresses can stay within this window.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
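
The three window-update steps in the commit message above can be modelled in a few lines. This is only a user-space Python sketch, not the kernel code: the `ResyncWindow` class, the `advance_window` function, the `notify` callback and the exact "outside the window" test are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResyncWindow:
    """Hypothetical stand-in for cluster_sync_low/cluster_sync_high."""
    low: int = 0
    high: int = 0


def advance_window(win, sector_nr, curr_resync_completed, stripe_sectors, notify):
    """Slide the window when sector_nr is outside [low, high).

    Returns True when the window moved and peers were notified.
    The half-open containment test is an assumption of this sketch.
    """
    if win.low <= sector_nr < win.high:
        return False                      # still inside the window
    win.low = curr_resync_completed       # step 1
    win.high = win.low + stripe_sectors   # step 2
    notify(win.low, win.high)             # step 3: tell the other nodes
    return True


sent = []
record = lambda lo, hi: sent.append((lo, hi))
win = ResyncWindow()
moved_first = advance_window(win, 0, 0, 1024, record)
moved_inside = advance_window(win, 512, 256, 1024, record)
moved_next = advance_window(win, 1024, 1024, 1024, record)
```

In this model a message is sent only when the window actually moves, so the other nodes suspend I/O on one small range at a time rather than on the whole device.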

md-cluster: Suspend writes in RAID10 if within range · cb8a7a7e

Authored by Guoqing Jiang on Oct 24, 2017

If there is a resync going on, all nodes must suspend
writes to the range. This is recorded in suspend_info
and suspend_list.

If an I/O falls within the range of any suspend_info
entry, area_resyncing will return 1.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
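
As a rough illustration of the range test described above, the following Python sketch returns 1 when an I/O overlaps any suspended range. The tuple representation of a suspend_info entry and the half-open interval convention are assumptions of this sketch; only the name area_resyncing and its return value of 1 come from the commit message.

```python
def area_resyncing(suspend_list, lo, hi):
    """Return 1 if the I/O range [lo, hi) overlaps any suspended range.

    suspend_list is modelled as a list of (lo, hi) tuples, one per
    suspend_info entry; this layout is hypothetical.
    """
    for s_lo, s_hi in suspend_list:
        if hi > s_lo and lo < s_hi:
            return 1
    return 0


suspended = [(1024, 2048)]
before_range = area_resyncing(suspended, 0, 1024)
straddling = area_resyncing(suspended, 1000, 1100)
after_range = area_resyncing(suspended, 2048, 3000)
```

An I/O that merely touches the boundary of a suspended range does not count as overlapping here.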

md-cluster/raid10: set "do_balance = 0" if area is resyncing · d4098c72

Authored by Guoqing Jiang on Oct 24, 2017

Just like clustered raid1, clustered raid10 cannot
choose the best device for read balance while an area of the
array is resyncing, because we cannot trust the data to be the
same on all devices at that time. So we just use the first
device, by setting do_balance to 0.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: use lockdep_assert_held · efa4b77b

Authored by Shaohua Li on Oct 18, 2017

lockdep_assert_held is a better way to assert that a lock is held, and it
also works on UP (uniprocessor) builds.

Signed-off-by: Shaohua Li <shli@fb.com>

raid1: prevent freeze_array/wait_all_barriers deadlock · f6eca2d4

Authored by Nate Dailey on Oct 17, 2017

If freeze_array is attempted in the middle of close_sync/
wait_all_barriers, deadlock can occur.

freeze_array will wait for nr_pending and nr_queued to line up.
wait_all_barriers increments nr_pending for each barrier bucket, one
at a time, but doesn't actually issue IO that could be counted in
nr_queued. So freeze_array is blocked until wait_all_barriers
completes and allow_all_barriers runs. At the same time, when
_wait_barrier sees array_frozen == 1, it stops and waits for
freeze_array to complete.

Prevent the deadlock by making close_sync call _wait_barrier and
_allow_barrier for one bucket at a time, instead of deferring the
_allow_barrier calls until after all _wait_barriers are complete.

Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Fix: fd76863e(RAID1: a new I/O barrier implementation to remove resync window)
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v4.11)
Signed-off-by: Shaohua Li <shli@fb.com>
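
The change in ordering can be illustrated with a single-threaded Python model. It does not reproduce the deadlock itself; it only counts how many bucket references are held at once under each ordering. The bucket count and function names are hypothetical.

```python
BUCKETS = 4  # hypothetical; chosen small for illustration


def close_sync_deferred(nr_pending):
    """Old ordering: wait on every bucket, then allow every bucket."""
    peak = 0
    for idx in range(BUCKETS):
        nr_pending[idx] += 1              # models _wait_barrier(idx)
        peak = max(peak, sum(nr_pending))
    for idx in range(BUCKETS):
        nr_pending[idx] -= 1              # models _allow_barrier(idx)
    return peak


def close_sync_per_bucket(nr_pending):
    """New ordering: wait and allow one bucket at a time."""
    peak = 0
    for idx in range(BUCKETS):
        nr_pending[idx] += 1              # models _wait_barrier(idx)
        peak = max(peak, sum(nr_pending))
        nr_pending[idx] -= 1              # models _allow_barrier(idx)
    return peak


pending_old = [0] * BUCKETS
pending_new = [0] * BUCKETS
peak_deferred = close_sync_deferred(pending_old)
peak_per_bucket = close_sync_per_bucket(pending_new)
```

With the deferred ordering the references pile up until the loop finishes, and that is the stretch in which a concurrent freeze_array can get stuck. With the per-bucket ordering at most one reference is outstanding at any moment.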

md: use TASK_IDLE instead of blocking signals · ae89fd3d

Authored by Mikulas Patocka on Oct 18, 2017

Hi - I submit this patch for the next merge window:

Some time ago, I made a patch f9c79bc0 that blocks signals around the
schedule() calls in MD. The MD subsystem needs to do an uninterruptible
sleep that is not accounted in load average - so we block signals and use
interruptible sleep.

The kernel has a special TASK_IDLE state for this purpose, so we can use
it instead of blocking signals. This patch doesn't fix any bug, it just
makes the code simpler.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: remove special meaning of ->quiesce(.., 2) · b03e0ccb

Authored by NeilBrown on Oct 19, 2017

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
There are still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: allow metadata update while suspending. · 35bfc521

Authored by NeilBrown on Oct 17, 2017

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write requests block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held.  So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says that unlocked updates are
allowed, and one which says an update is happening.  Then allow it
while the quiesce completes, and then wait for it to finish.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: use mddev_suspend/resume instead of ->quiesce() · 9e1cc0a5

Authored by NeilBrown on Oct 17, 2017

mddev_suspend() is a more general interface than
calling ->quiesce() and so is more extensible.  A
future patch will make use of this.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: move suspend_hi/lo handling into core md code · b3143b9a

Authored by NeilBrown on Oct 17, 2017

Responding to ->suspend_lo and ->suspend_hi is similar
to responding to ->suspended.  It is best to wait in
the common core code without incrementing ->active_io.
This allows mddev_suspend()/mddev_resume() to work while
requests are waiting for suspend_lo/hi to change.
This will be important after a subsequent patch
which uses mddev_suspend() to synchronize updates to
suspend_lo/hi.

So move the code for testing suspend_lo/hi out of raid1.c
and raid5.c, and place it in md.c.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
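
The point about not incrementing ->active_io can be sketched with a small Python model. The `Mddev` class, `submit_write`, `can_suspend` and the half-open overlap test are all assumptions for illustration; only the field names suspend_lo, suspend_hi and active_io come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Mddev:
    """Hypothetical model holding only the fields discussed above."""
    suspend_lo: int = 0
    suspend_hi: int = 0
    active_io: int = 0


def submit_write(mddev, start, end):
    """Return 'wait' for a write overlapping the suspended range.

    A waiting write does not touch active_io; only a write that is
    actually passed on to the personality is counted.
    """
    overlaps = (mddev.suspend_lo < mddev.suspend_hi
                and start < mddev.suspend_hi
                and end > mddev.suspend_lo)
    if overlaps:
        return "wait"
    mddev.active_io += 1
    return "submitted"


def can_suspend(mddev):
    # Suspension is modelled as possible once no I/O is counted as active.
    return mddev.active_io == 0


md = Mddev(suspend_lo=100, suspend_hi=200)
blocked = submit_write(md, 150, 160)
suspendable_while_blocked = can_suspend(md)
passed = submit_write(md, 300, 310)
```

Because the blocked write never raised active_io, a suspend can still proceed while that write is waiting, which is the property the commit message asks for.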

md: don't call bitmap_create() while array is quiesced. · 52a0d49d

Authored by NeilBrown on Oct 17, 2017

bitmap_create() allocates memory with GFP_KERNEL and
so can wait for IO.
If called while the array is quiesced, it could wait indefinitely
for write out to the array, which is a deadlock.
So call bitmap_create() before quiescing the array.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: always hold reconfig_mutex when calling mddev_suspend() · 4d5324f7

Authored by NeilBrown on Oct 19, 2017

Most often mddev_suspend() is called with
reconfig_mutex held.  Make this a requirement in
preparation for a subsequent patch.  Also require
reconfig_mutex to be held for mddev_resume(),
partly for symmetry and partly to guarantee
no races with incr/decr of mddev->suspend.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.
We wait in mddev->sb_wait and ensure this is woken
when any of the conditions change.  This requires
waking mddev->sb_wait in mddev_unlock().  This is only
like to trigger extra wake_ups of threads that needn't
be woken when metadata is being written, and that
doesn't happen often enough that the cost would be
noticeable.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4d5324f7

md: forbid a RAID5 from having both a bitmap and a journal. · 230b55fa

Authored by NeilBrown on Oct 17, 2017

Having both a bitmap and a journal is pointless.
Attempting to do so can corrupt the bitmap if the journal
replay happens before the bitmap is initialized.
Rather than try to avoid this corruption, simply
refuse to allow arrays with both a bitmap and a journal.
So:
 - if raid5_run sees both are present, fail.
 - if adding a bitmap finds a journal is present, fail
 - if adding a journal finds a bitmap is present, fail.

Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: NeilBrown <neilb@suse.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Acked-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>
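
The three refusal rules listed above amount to one invariant: an array never has a bitmap and a journal at the same time. A minimal Python model of that invariant follows. The class and function names are hypothetical, and returning True/False stands in for whatever error the kernel actually reports.

```python
from dataclasses import dataclass


@dataclass
class Raid5Array:
    """Hypothetical model tracking only bitmap/journal presence."""
    has_bitmap: bool = False
    has_journal: bool = False


def can_run(array):
    # Start-up check: refuse to run if both are present.
    return not (array.has_bitmap and array.has_journal)


def add_bitmap(array):
    # Refuse to add a bitmap when a journal is present.
    if array.has_journal:
        return False
    array.has_bitmap = True
    return True


def add_journal(array):
    # Refuse to add a journal when a bitmap is present.
    if array.has_bitmap:
        return False
    array.has_journal = True
    return True


journaled = Raid5Array(has_journal=True)
bitmapped = Raid5Array(has_bitmap=True)
bitmap_on_journaled = add_bitmap(journaled)
journal_on_bitmapped = add_journal(bitmapped)
```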

19 Oct, 2017 1 commit

raid5: Set R5_Expanded on parity devices as well as data. · 235b6003

Authored by NeilBrown on Oct 17, 2017

When reshaping a fully degraded raid5/raid6 to a larger
number of devices, the new device(s) are not in-sync
and so that can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, but not for parity devices.

Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.

So when constructing parity on an EXPAND_READY stripe,
set R5_Expanded.

Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

17 Oct, 2017 8 commits

md: raid10: remove a couple of redundant variables and initializations · a0e764c5

Authored by Colin Ian King on Oct 11, 2017

Variables dev and bio_last_sector are assigned values that are never
read and hence these are redundant variables and can be removed.
Also remove the duplicated initialization of sectors, the latter
assignment is identical to the first and can be removed.

Cleans up 3 clang build warnings:
Value stored to 'dev' is never read
Value stored to 'bio_last_sector' is never read
Value stored to 'sectors' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: rename some drivers/md/ files to have an "md-" prefix · 935fe098

Authored by Mike Snitzer on Oct 10, 2017

Motivated by the desire to eliminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list, which is borne out of the fact that DM files also
reside in drivers/md/.

Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.

Shaohua: don't change module name

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: raid10: remove VLAIS · 584ed9fa

Authored by Matthias Kaehlcke on Oct 05, 2017

The raid10 driver can't be built with clang since it uses a variable
length array in a structure (VLAIS):

drivers/md/raid10.c:4583:17: error: fields must have a constant size:
  'variable length array in structure' extension will never be supported

Allocate the r10bio struct with kmalloc instead of using the VLAIS
construct.

Shaohua: set the MD_RECOVERY_INTR bit
Neil Brown: use GFP_NOIO

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Shaohua Li <shli@fb.com>

md-cluster: make function cluster_check_sync_size static · 7a57157a

Authored by Colin Ian King on Oct 03, 2017

The function cluster_check_sync_size is local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'cluster_check_sync_size' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>

raid5-ppl: check recovery_offset when performing ppl recovery · 07719ff7

Authored by Artur Paszkiewicz on Sep 29, 2017

If starting an array that is undergoing rebuild, make ppl recovery honor
the recovery_offset of a member disk and don't read data that is not yet
in-sync.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

raid5-ppl: don't resync after rebuild · 611426e2

Authored by Artur Paszkiewicz on Sep 29, 2017

The check for degraded array is unnecessary and causes a resync to be
performed after ppl recovery and rebuild when restarting an array during
rebuilding after unclean shutdown.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md-cluster: fix wrong condition check in raid1_write_request · 385f4d7f

Authored by Guoqing Jiang on Sep 29, 2017

The check used here is meant to avoid conflicts between write and
resync; however, we used the wrong logic. It should be the
inverse of the check inside the "if".

Fixes: 589a1c49 ("Suspend writes in RAID1 if within range")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md/bitmap: revert a patch · 938b533d

Authored by Shaohua Li on Oct 16, 2017

This reverts commit 8031c3dd. That patch doesn't work well if PAGE_SIZE >
4k. We will fix the original problem with a different approach.

Fix: 8031c3dd(md/bitmap: copy correct data for bitmap super)
Reported-by: Joshua Kinard <kumba@gentoo.org>
Cc: stable@vger.kernel.org (4.10+)
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

09 Oct, 2017 1 commit

md: always set THREAD_WAKEUP and wake up wqueue if thread existed · d1d90147

Authored by Guoqing Jiang on Oct 09, 2017

Since commit 4ad23a97 ("MD: use per-cpu counter for writes_pending"),
the wait queue is only woken up if THREAD_WAKEUP was not already set.

With the above change, I can see that process_metadata_update could hang
forever on the wait queue, because mddev->thread could stay in 'D' state and
the THREAD_WAKEUP flag is not cleared, since there are lots of places that
wake up mddev->thread. Then a deadlock happens as follows:

linux175:~ # ps aux|grep md|grep D
root    20117   0.0 0.0         0   0 ? D   03:45   0:00 [md0_raid1]
root    20125   0.0 0.0         0   0 ? D   03:45   0:00 [md0_cluster_rec]
linux175:~ # cat /proc/20117/stack
[<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
[<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
[<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
[<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
[<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
[<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
[<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
linux175:~ # cat /proc/20125/stack
[<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140

So let's revert that part of the commit to resolve the problem, since
we don't get much benefit from the previous change.

Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
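
The difference between the two wake-up policies can be shown with a small Python model. It is a sketch under assumptions: the `MdThread` class, the `wakeups` counter and both function names are hypothetical, and the model reflects only what the commit message says about when the wait queue is woken.

```python
class MdThread:
    """Hypothetical model of a thread with a wake-up flag and a wait queue."""

    def __init__(self):
        self.wakeup_flag = False   # models THREAD_WAKEUP
        self.wakeups = 0           # counts wake-ups of the wait queue


def wakeup_conditional(thread):
    """Policy being reverted: wake the queue only if the flag was clear."""
    if not thread.wakeup_flag:
        thread.wakeup_flag = True
        thread.wakeups += 1


def wakeup_always(thread):
    """Restored policy: always set the flag and wake the queue."""
    thread.wakeup_flag = True
    thread.wakeups += 1


stuck = MdThread()
stuck.wakeup_flag = True           # flag left set because the thread is blocked
wakeup_conditional(stuck)
wakeups_conditional = stuck.wakeups
wakeup_always(stuck)
wakeups_always = stuck.wakeups
```

When the flag is already set and the thread cannot clear it, the conditional policy never wakes anyone else sleeping on the same wait queue, which matches the hang described above.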

06 Oct, 2017 1 commit

md: fix deadlock error in recent patch. · d47c8ad2

Authored by NeilBrown on Oct 05, 2017

A recent patch aimed to cause md_write_start() to fail (rather than
block) when the mddev was suspending, so as to avoid deadlocks.
Unfortunately the test in wait_event() was wrong, and it didn't change
behaviour at all.

The wait_event() must wait until the metadata is written OR the array is
suspending.

Fixes: cc27b0c7 ("md: fix deadlock between mddev_suspend() and md_write_start()")
Cc: stable@vger.kernel.org
Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

28 Sep, 2017 4 commits

md/raid5: cap worker count · 7d5d7b50

Authored by Shaohua Li on Sep 21, 2017

A static checker reports a potential integer overflow. Cap the worker count to
avoid the overflow.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>

dm-raid: fix a race condition in request handling · c4d6a1b8

Authored by Shaohua Li on Sep 21, 2017

raid_map calls pers->make_request, which missed the suspend check. Fix it with
the new md_handle_request API.

Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: fix a race condition for flush request handling · 79bf31a3

Authored by Shaohua Li on Sep 21, 2017

md_submit_flush_data calls pers->make_request, which missed the suspend check.
Fix it with the new md_handle_request API.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: separate request handling · 393debc2

Authored by Shaohua Li on Sep 21, 2017

With commit cc27b0c7, pers->make_request could bail out without handling
the bio. If that happens, we should retry.  The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

11 Sep, 2017 6 commits

dax: remove the pmem_dax_ops->flush abstraction · c3ca015f

Authored by Mikulas Patocka on Aug 31, 2017

Commit abebfbe2 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.

It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().

It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.

Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.

Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK · b5e8ad92

Authored by Arnd Bergmann on Aug 15, 2017

The new lockdep support for completions caused the stack usage
in dm-integrity to explode, in the case of write_journal from 504 bytes
to 1120 (using arm gcc-7.1.1):

drivers/md/dm-integrity.c: In function 'write_journal':
drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

The problem is that not only the size of 'struct completion' grows
significantly, but we end up having multiple copies of it on the stack
when we assign it from a local variable after the initial declaration.

COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we
want to declare and initialize a completion on the stack. However,
this driver doesn't do that and instead initializes the completion
just before it is used.

In this case, init_completion() does the same thing more efficiently,
and drops the stack usage for the function above down to 496 bytes.
While the other functions in this file are not bad enough to cause
a warning, they benefit equally from the change, so I do the change
across the entire file. In the one place where we reuse a completion,
I picked the cheaper reinit_completion() over init_completion().

Fixes: cd8084f9 ("locking/lockdep: Apply crossrelease to completions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: make blk_integrity_profile structure const · 7c373d66

Authored by Bhumika Goyal on Aug 06, 2017

Make this structure const as it is only stored in the profile field of a
blk_integrity structure. This field is of type const, so make the structure
const as well.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: do not check integrity for failed read operations · b7e326f7

Authored by Hyunchul Lee on Jul 31, 2017

Even though read operations fail, dm_integrity_map_continue() calls
integrity_metadata() to check integrity.  In this case, just complete
these.

This also makes it so read I/O errors do not generate integrity warnings
in the kernel log.

Cc: stable@vger.kernel.org
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm log writes: fix >512b sectorsize support · 228bb5b2

Authored by Josef Bacik on Jul 28, 2017

512b sectors vs device's physical sectorsize was not maintained
consistently and as such the support for >512b sector devices has bugs.
The log metadata expects native sectorsize but 512b sectors were being
stored.  Also, device's sectorsize was assumed when assigning the
bi_sector for blocks that were being logged.

Fix this up by adding two helpers to convert between bio and dev
sectors, and use these in the appropriate places to fix the problem and
make it clear which units go where.  Doing so allows dm-log-writes use
with 4k devices.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
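
The kind of unit conversion such helpers perform can be sketched as follows. This is an illustrative Python model rather than the driver code: the function names, the `sectorshift` parameter and the shift-based arithmetic are assumptions, with 512-byte bio sectors corresponding to a shift of 9.

```python
SECTOR_SHIFT = 9  # bio sectors are 512 bytes (2**9)


def bio_to_dev_sectors(sectorshift, sectors):
    """Convert a count of 512-byte sectors into device-sized sectors."""
    return sectors >> (sectorshift - SECTOR_SHIFT)


def dev_to_bio_sectors(sectorshift, sectors):
    """Convert a count of device-sized sectors into 512-byte sectors."""
    return sectors << (sectorshift - SECTOR_SHIFT)


FOUR_K_SHIFT = 12  # a device with 4096-byte sectors
one_dev_sector = bio_to_dev_sectors(FOUR_K_SHIFT, 8)
three_dev_in_bio = dev_to_bio_sectors(FOUR_K_SHIFT, 3)
```

On a 512-byte device both conversions are the identity, which is why mixing the two units went unnoticed there and only showed up on 4k devices.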

dm log writes: don't use all the cpu while waiting to log blocks · 0c79c620

Authored by Josef Bacik on Jul 28, 2017

The check to see if the logging kthread needs to go to sleep is wrong:
it checks lc->pending_blocks, which will be non-0 if there are any
blocks that are pending, whether they are ready to be logged or not.
What we really want is to go to sleep until it's time to log blocks, so
change this check so we do actually go to sleep in between flushes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

08 Sep, 2017 1 commit

bcache: initialize dirty stripes in flash_dev_run() · 175206cf

Authored by Tang Junhui on Sep 07, 2017

bcache uses a Proportional-Derivative (PD) controller algorithm to control
the writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of a thin flash device should not be counted, because flash-only
volumes never write back dirty data.

Currently the dirty stripe counter for a thin flash device is not initialized
when the thin flash device starts. This means the subsequent calculation
in the PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set as the thin flash
device may have an inaccurate writeback rate.

This patch calls bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize the dirty stripe counter when the thin flash device
starts to run. This patch also makes the following parameter data type change,
 -void bch_sectors_dirty_init(struct cached_dev *dc);
 +void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().

(Commit log is composed by Coly Li)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

06 Sep, 2017 4 commits

bcache: fix bch_hprint crash and improve output · 9276717b

Authored by Michael Lyle on Sep 06, 2017

Most importantly, solve a crash where %llu was used to format signed
numbers.  This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.

Always use the units mechanism rather than having different output
paths for simplicity.

Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100.  (Remainders of >= 1000 would print as .10).

Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown.  Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
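
The formatting rules described above can be approximated with the following Python sketch. It is a model written from the commit message, not the kernel function: the `hprint` name, the unit table and the loop structure are assumptions. The fractional digit is computed as remainder * 10 // 1024, and the loop keeps dividing while the integer part is 1000 or more.

```python
UNITS = "?kMGTPEZY"  # hypothetical unit table; index 1 is 'k'


def hprint(v):
    """Format v with one decimal digit and a binary unit suffix."""
    q = -v if v < 0 else v            # handle the sign separately
    u = 0
    while True:
        u += 1
        t = q & 1023                  # remainder of the division by 1024
        q >>= 10
        if q < 1000:                  # at most 3 digits before the point
            break
    sign = "-" if v < 0 else ""
    return f"{sign}{q}.{t * 10 // 1024}{UNITS[u]}"


one_and_a_half_k = hprint(1536)
negative = hprint(-2048)
just_under_one_m = hprint(1000 * 1024)
```

A remainder can be as large as 1023, so dividing it by 100 could yield 10 and print ".10". Scaling by 10 and dividing by 1024 keeps the digit in the range 0 to 9.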

bcache: Update continue_at() documentation · 7b6a8570

Authored by Dan Carpenter on Sep 06, 2017

continue_at() doesn't have a return statement anymore.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: silence static checker warning · da22f0ee

Authored by Dan Carpenter on Sep 06, 2017

In olden times, closure_return() used to have a hidden return built in.
We removed the hidden return but forgot to add a new return here.  If
"c" were NULL we would oops on the next line, but fortunately "c" is
never NULL.  Let's just remove the if statement.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: fix for gc and write-back race · 9baf3097

Authored by Tang Junhui on Sep 06, 2017

gc and write-back race with each other (see the email "bcache get stucked"
I sent before):
gc thread                               write-back thread
|                                       |bch_writeback_thread()
|bch_gc_thread()                        |
|                                       |==>read_dirty()
|==>bch_btree_gc()                      |
|==>btree_root() //get btree root       |
|                //node write lock      |
|==>bch_btree_gc_root()                 |
|                                       |==>read_dirty_submit()
|                                       |==>write_dirty()
|                                       |==>continue_at(cl,
|                                       |               write_dirty_finish,
|                                       |               system_wq);
|                                       |==>write_dirty_finish()//execute
|                                       |               //in system_wq
|                                       |==>bch_btree_insert()
|                                       |==>bch_btree_map_leaf_nodes()
|                                       |==>__bch_btree_map_nodes()
|                                       |==>btree_root //try to get btree
|                                       |              //root node read
|                                       |              //lock
|                                       |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
|                            //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to execute
|                               //journal_write in system_wq
|                               //but work queue is executing
|                               //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
|                   //over and wake up gc,
|-------------stuck here
|==>release root node write lock

This patch allocates a separate workqueue for the write-back thread to avoid
this race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>