提交 · 978e51ba38e00e9da09b3ef9ed8c94af7b55a1eb · openanolis / cloud-kernel

20 12月, 2017 2 次提交

dm: optimize bio-based NVMe IO submission · 978e51ba

由 Mike Snitzer 提交于 12月 09, 2017

Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request().  In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time.  There is no splitting needed.  Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

978e51ba

dm: introduce DM_TYPE_NVME_BIO_BASED · 22c11858

由 Mike Snitzer 提交于 12月 04, 2017

If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions.  Also,
the table has a single immutable target that doesn't require DM core to
split bios.

This will enable adding NVMe optimizations to bio-based DM.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

22c11858

18 12月, 2017 1 次提交

dm: simplify start of block stats accounting for bio-based · f3986374

由 Mike Snitzer 提交于 12月 17, 2017

No apparent need to generic_start_io_acct() until before the IO is ready
for submission.  start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.

Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

f3986374

17 12月, 2017 5 次提交

dm: remove redundant mapped_device member from clone_info structure · bc02cdbe

由 Mike Snitzer 提交于 12月 14, 2017

'struct dm_io' already has the same pointer.  So update all accesses
from ci->md to ci->io->md.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

bc02cdbe

M
dm: remove now unused bio-based io_pool and _io_cache · dde1e1ec
由 Mike Snitzer 提交于 12月 11, 2017
```
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
dde1e1ec

dm: improve performance by moving dm_io structure to per-bio-data · 64f52b0e

由 Mike Snitzer 提交于 12月 11, 2017

Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from.  As such, it saves an extra mempool allocation for each
original bio that DM core is issued.

This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data.  But
in the end this provides a decent performance improvement for all
bio-based DM devices.

On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

64f52b0e

M
dm: rename 'bio' member of dm_io structure to 'orig_bio' · 745dc570
由 Mike Snitzer 提交于 12月 11, 2017
```
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
745dc570

dm: remove stale comment blocks · 2abf1fc9

由 Mike Snitzer 提交于 12月 09, 2017

These CRUD comments have worn out their welcome.  The code is what it
is, over time it'll hopefully get better.  But these comments serve no
purpose whatsoever.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

2abf1fc9

14 12月, 2017 15 次提交

dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() · ad3793fc

由 Mike Snitzer 提交于 12月 04, 2017

Rather than having DAX support be unique by setting it based on table
type in dm_setup_md_queue().
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

ad3793fc

dm: fix __send_changing_extent_only() to send first bio and chain remainder · 3d7f4562

由 Mike Snitzer 提交于 12月 08, 2017

__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk".  That is: submit first bio up to split boundary and then
split the remainder to further submissions.
Suggested-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

3d7f4562

dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs · 0776aa0e

由 Mike Snitzer 提交于 12月 08, 2017

alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.
Suggested-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

0776aa0e

dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9

由 Mike Snitzer 提交于 11月 22, 2017

Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
Suggested-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

4a3f54d9

dm: safely allocate multiple bioset bios · 318716dd

由 Mike Snitzer 提交于 11月 22, 2017

DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.

Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().

Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.
Suggested-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

318716dd

dm: remove unused 'num_write_bios' target interface · f31c21e4

由 NeilBrown 提交于 11月 22, 2017

No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.

Having the possibility of num_write_bios > 1 complicates bio
allocation.  So remove the interface and assume there is only one bio
needed.

If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

f31c21e4

dm: ensure bio submission follows a depth-first tree walk · 18a25da8

由 NeilBrown 提交于 9月 06, 2017

A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload_{start,end}.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.  Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset.  This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.

Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

To provide the clone bio when splitting, we use q->bio_split.  This
was previously being freed by bio-based dm to avoid having excess
rescuer threads.  As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

18a25da8

dm io: remove BIOSET_NEED_RESCUER flag from bios bioset · c110a4b6

由 NeilBrown 提交于 8月 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm_io() is called from make_request_fn() context.  The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent.  But
there the code uses a separate thread to avoid problems.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

c110a4b6

dm crypt: remove BIOSET_NEED_RESCUER flag · 80cd1757

由 NeilBrown 提交于 8月 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

80cd1757

dm: fix comment above dm_accept_partial_bio · c06b3e58

由 NeilBrown 提交于 11月 21, 2017

Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

c06b3e58

dm raid: use rs_is_raid*() · 552aa679

由 Heinz Mauelshagen 提交于 12月 13, 2017

Cleanup, no functional change.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

552aa679

dm raid: simplify rs_get_progress() · 7c29744e

由 Heinz Mauelshagen 提交于 12月 13, 2017

No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7c29744e

dm raid: ensure 'a' chars during reshape · dc15b943

由 Heinz Mauelshagen 提交于 12月 13, 2017

During reshape, 'A' chars were reported in status rather than 'a'.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

dc15b943

dm raid: stop keeping raid set frozen altogether · 11e47232

由 Heinz Mauelshagen 提交于 12月 13, 2017

In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010 ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

11e47232

dm raid: validate current raid sets redundancy · 53bf5384

由 Heinz Mauelshagen 提交于 12月 13, 2017

Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).

Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.

Also check for bogus data copies.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

53bf5384

08 12月, 2017 13 次提交

M
dm raid: bump target version to reflect numerous fixes · b84cf269
由 Mike Snitzer 提交于 12月 04, 2017
```
Also update Documentation accordingly.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
b84cf269

dm raid: small cleanup and remove unsed "struct raid_set" member · 78a75d10

由 Heinz Mauelshagen 提交于 12月 02, 2017

Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().

Also, remove unused 'bitmap_loaded' member from "struct raid_set".

No functional changes.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

78a75d10

dm raid: fix rs_get_progress() synchronization state/ratio · 4102d9de

由 Heinz Mauelshagen 提交于 12月 02, 2017

Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.

Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
  "0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -"
instead of:
  "0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"

Sync action could be non-idle, when the MD thread was done with io.

Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

4102d9de

dm raid: avoid passing array_in_sync variable to raid_status() callees · 242ea5ad

由 Heinz Mauelshagen 提交于 12月 02, 2017

The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array.  Replace it with a
runtime flag.  This will avoid a pattern of having to pass discrete
variables to various functions.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

242ea5ad

dm raid: display a consistent copy of the MD status via raid_status() · 67143510

由 Heinz Mauelshagen 提交于 12月 02, 2017

The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status().  To close that race window, take a single copy of the
flags and pass it into its callees.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

67143510

dm raid: fix raid_resume() to keep raid set frozen as needed · d39f0010

由 Heinz Mauelshagen 提交于 12月 02, 2017

During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

d39f0010

dm raid: add component device size checks to avoid runtime failure · 188a212d

由 Heinz Mauelshagen 提交于 12月 02, 2017

Check all component data device sizes versus calculated size.
Reject if device(s) are too small.  Otherwise, MD will fail the
operation by accessing beyond the end of the data device.

An example use-case is that growing bitmap won't fit any more and the MD
runtime will report an error when DM raid should catch this earlier.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

188a212d

dm raid: fix raid set size revalidation · 61e06e2c

由 Heinz Mauelshagen 提交于 12月 02, 2017

The raid set size is being revalidated unconditionally before a
reshaping conversion is started.  MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.

Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

61e06e2c

dm raid: correct resizing state relative to reshape space in ctr · 7501537e

由 Heinz Mauelshagen 提交于 12月 02, 2017

Pay attention to existing reshape space to define if a raid set needs
resizing.  Otherwise we can hit "Can't resize a reshaping raid set"
when a reshape is being requested.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7501537e

dm raid: consume sizes after md_finish_reshape() completes changing them · 052b2b1e

由 Heinz Mauelshagen 提交于 12月 02, 2017

The md raid personalities call md_finish_reshape() at the end of a
reshape conversion which adjusts rdev->sectors.

Correct/check rdev->sectors before initiating a reshape and raise the
recovery pointer accordingly.

Otherwise, the DM raid coordinated reshape will fail.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

052b2b1e

dm raid: fix deadlock caused by premature md_stop_writes() · 1af2048a

由 Heinz Mauelshagen 提交于 12月 02, 2017

md_stop_writes() is called in raid_presuspend() causing deadlocks on
bios submitted afterwards -- which happens on loaded raid sets with
conversion requests.

Fix by moving md_stop_writes() to raid_postsuspend().  NOTE: when the
recovery's frozen (MD_RECOVERY_FROZEN), writes haven't been started (or
are already stopped) so don't stop them again.

Also remove superfluous readonly setting.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

1af2048a

dm bufio: fix shrinker scans when (nr_to_scan < retain_target) · fbc7c07e

由 Suren Baghdasaryan 提交于 12月 06, 2017

When system is under memory pressure it is observed that dm bufio
shrinker often reclaims only one buffer per scan. This change fixes
the following two issues in dm bufio shrinker that cause this behavior:

1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().

2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.

Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans
Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

fbc7c07e

dm mpath: fix bio-based multipath queue_if_no_path handling · c1fd0abe

由 Mike Snitzer 提交于 12月 07, 2017

Commit ca5beb76 ("dm mpath: micro-optimize the hot path relative to
MPATHF_QUEUE_IF_NO_PATH") caused bio-based DM-multipath to fail mptest's
"test_02_sdev_delete".

Restoring the logic that existed prior to commit ca5beb76 fixes this
bio-based DM-multipath regression.  Also verified all mptest tests pass
with request-based DM-multipath.

This commit effectively reverts commit ca5beb76 -- but it does so
without reintroducing the need to take the m->lock spinlock in
must_push_back_{rq,bio}.

Fixes: ca5beb76 ("dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

c1fd0abe

04 12月, 2017 2 次提交

dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2

由 monty_pavel@sina.com 提交于 11月 25, 2017

A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:

 PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
  #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
  0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
  0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
  0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
  0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
  0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
  0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
  0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
  0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
     [exception RIP: kmem_cache_alloc+108]
     RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
     RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
     RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
     RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
     R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
     R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
     ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
  0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
 0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
 0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
 0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
 0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
 0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
 0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]

The race results in a NULL pointer because:

Process A (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target not registered;
 	c. modprobe dm_thin_pool and wait until end.

Process B (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target registered;
 	c. table_load->dm_table_add_target->pool_ctr;
 	d. _new_mapping_cache is NULL and panic.
Note:
 	1. process A and process B are two concurrent processes.
 	2. pool_target can be detected by process B but
 	_new_mapping_cache initialization has not ended.

To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.

Cc: stable@vger.kernel.org
Signed-off-by: Nmonty <monty_pavel@sina.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7e6358d2

dm table: fix regression from improper dm_dev_internal.count refcount_t conversion · afc567a4

由 Mike Snitzer 提交于 11月 25, 2017

Multiple refcounts are needed if the device was already added. The
micro-optimization of setting the refcount to 1 on first added (rather
than fall thru to a common refcount_inc) lost sight of the fact that the
refcount_inc is also needed for the case when the device already exists
and the mode need not be upgraded.

Fixes: 2a0b4682 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
Reported-by: NZdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

afc567a4

25 11月, 2017 2 次提交

bcache: check return value of register_shrinker · 6c4ca1e3

由 Michael Lyle 提交于 11月 24, 2017

register_shrinker is now __must_check, so check it to kill a warning.
Caller of bch_btree_cache_alloc in super.c appropriately checks return
value so this is fully plumbed through.

This V2 fixes checkpatch warnings and improves the commit description,
as I was too hasty getting the previous version out.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NVojtech Pavlik <vojtech@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6c4ca1e3

bcache: recover data from backing when data is clean · e393aa24

由 Rui Hua 提交于 11月 24, 2017

When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)

It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in  /sys/fs/bcache/XXX/internal/cache_read_races.

Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.

In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.

[edited by mlyle to fix up whitespace, commit log title, comment
spelling]

Fixes: d59b2379 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: NHua Rui <huarui.dev@gmail.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e393aa24

openanolis / cloud-kernel 10 个月 前同步成功

openanolis / cloud-kernel
10 个月前同步成功