提交 · 4452226ea276e74fc3e252c88d9bb7e8f8e44bf0 · gsplhtlxg / clone-Linux

02 6月, 2015 1 次提交

writeback: move backing_dev_info->state into bdi_writeback · 4452226e

由 Tejun Heo 提交于 5月 22, 2015

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeoring of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4452226e

22 5月, 2015 2 次提交

block, dm: don't copy bios for request clones · 5f1b670d

由 Christoph Hellwig 提交于 5月 22, 2015

Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5f1b670d

block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb

由 Mike Snitzer 提交于 5月 22, 2015

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

326e1dbb

06 5月, 2015 2 次提交

bio: skip atomic inc/dec of ->bi_cnt for most use cases · dac56212

由 Jens Axboe 提交于 4月 17, 2015

Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.
Tested-by: NRobert Elliott <elliott@hp.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

dac56212

bio: skip atomic inc/dec of ->bi_remaining for non-chains · c4cf5261

由 Jens Axboe 提交于 4月 17, 2015

Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.
Tested-by: NRobert Elliott <elliott@hp.com>
Acked-by: NKent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

c4cf5261

30 4月, 2015 2 次提交

dm: fix free_rq_clone() NULL pointer when requeueing unmapped request · aa6df8dd

由 Mike Snitzer 提交于 4月 29, 2015

Commit 02233342 ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
before testing clone->q->mq_ops. It was an oversight to discontinue
that check for 1 of the 2 use-cases for free_rq_clone():
1) free_rq_clone() called when an unmapped original request is requeued
2) free_rq_clone() called in the request-based IO completion path

The clone->q check made sense for case #1 but not for #2. However, we
cannot just reinstate the check as it'd mask a serious bug in the IO
completion case #2 -- no in-flight request should have an uninitialized
request_queue (basic block layer refcounting _should_ ensure this).

The NULL pointer seen for case #1 is detailed here:
https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html

Fix this free_rq_clone() NULL pointer by simply checking if the
mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
blk-mq) rather than checking clone->q->mq_ops. This avoids the need to
dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
uninitialized clone request is being completed.
Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

aa6df8dd

dm: only initialize the request_queue once · 3e6180f0

由 Christoph Hellwig 提交于 4月 30, 2015

Commit bfebd1cd ("dm: add full blk-mq support to request-based DM")
didn't properly account for the need to short-circuit re-initializing
DM's blk-mq request_queue if it was already initialized.

Otherwise, reloading a blk-mq request-based DM table (either manually
or via multipathd) resulted in errors, see:
 https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html

Fix is to only initialize the request_queue on the initial table load
(when the mapped_device type is assigned).

This is better than having dm_init_request_based_blk_mq_queue() return
early if the queue was already initialized because it elevates the
constraint to a more meaningful location in DM core.  As such the
pre-existing early return in dm_init_request_based_queue() can now be
removed.

Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

3e6180f0

22 4月, 2015 23 次提交

md/raid5: don't do chunk aligned read on degraded array. · 9ffc8f7c

由 Eric Mei 提交于 3月 18, 2015

When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: NEric Mei <eric.mei@seagate.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

9ffc8f7c

md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab

由 NeilBrown 提交于 2月 26, 2015

The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.
Signed-off-by: NNeilBrown <neilb@suse.de>

edbe83ab

N
md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
由 NeilBrown 提交于 2月 26, 2015
```
This allows us to easily add more (atomic) flags.
Signed-off-by: NNeilBrown <neilb@suse.de>
```
5423399a

md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644

由 NeilBrown 提交于 2月 25, 2015

Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".
Signed-off-by: NNeilBrown <neilb@suse.de>

486f0644

md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79

由 NeilBrown 提交于 2月 25, 2015

This is needed for future improvement to stripe cache management.
Signed-off-by: NNeilBrown <neilb@suse.de>

a9683a79

md/raid5: introduce configuration option rmw_level · d06f191f

由 Markus Stockhausen 提交于 12月 15, 2014

Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!
Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NNeilBrown <neilb@suse.de>

d06f191f

md/raid5: activate raid6 rmw feature · 584acdd4

由 Markus Stockhausen 提交于 12月 15, 2014

Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s
Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NNeilBrown <neilb@suse.de>

584acdd4

raid5: handle expansion/resync case with stripe batching · dabc4ec6

由 shli@kernel.org 提交于 12月 15, 2014

expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

dabc4ec6

raid5: handle io error of batch list · 72ac7330

由 shli@kernel.org 提交于 12月 15, 2014

If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

72ac7330

RAID5: batch adjacent full stripe write · 59fc630b

由 shli@kernel.org 提交于 12月 15, 2014

stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

59fc630b

raid5: track overwrite disk count · 7a87f434

由 shli@kernel.org 提交于 12月 15, 2014

Track overwrite disk count, so we can know if a stripe is a full stripe write.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7a87f434

raid5: add a new flag to track if a stripe can be batched · da41ba65

由 shli@kernel.org 提交于 12月 15, 2014

A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

da41ba65

raid5: use flex_array for scribble data · 46d5b785

由 shli@kernel.org 提交于 12月 15, 2014

Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

46d5b785

md raid0: access mddev->queue (request queue member) conditionally because it... · 753f2856

由 Heinz Mauelshagen 提交于 2月 13, 2015

md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid

The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Tested-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

753f2856

md: allow resync to go faster when there is competing IO. · ac8fa419

由 NeilBrown 提交于 2月 19, 2015

When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster.  Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
 - for some loads, the resync speed is unchanged.  For those loads
   increasing the minimum doesn't change the speed either.
   So this is a good result.  To increase resync speed under such
   loads we would probably need to increase the resync window
   size.

 - for other loads, resync speed does increase to a reasonable
   fraction (e.g. 20%) of maximum possible, and throughput of
   the load only drops a little bit (e.g. 10%)

 - for other loads, throughput of the non-sync load drops quite a bit
   more.  These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.
Signed-off-by: NNeilBrown <neilb@suse.de>

ac8fa419

md: remove 'go_faster' option from ->sync_request() · 09314799

由 NeilBrown 提交于 2月 19, 2015

This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.
Signed-off-by: NNeilBrown <neilb@suse.de>

09314799

md: don't require sync_min to be a multiple of chunk_size. · 50c37b13

由 NeilBrown 提交于 3月 23, 2015

There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.
Signed-off-by: NNeilBrown <neilb@suse.de>

50c37b13

md-cluster: re-add capabilities · 97f6cd39

由 Goldwyn Rodrigues 提交于 4月 14, 2015

When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

97f6cd39

md: re-add a failed disk · a6da4ef8

由 Goldwyn Rodrigues 提交于 4月 14, 2015

This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

a6da4ef8

md-cluster: remove capabilities · 88bcfef7

由 Goldwyn Rodrigues 提交于 4月 14, 2015

This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

88bcfef7

md: Export and rename find_rdev_nr_rcu · 57d051dc

由 Goldwyn Rodrigues 提交于 4月 14, 2015

This is required by the clustering module (patches to follow) to
find the device to remove or re-add.
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

57d051dc

md: Export and rename kick_rdev_from_array · fb56dfef

由 Goldwyn Rodrigues 提交于 4月 14, 2015

This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

fb56dfef

md-cluster: correct the num for comparison · 8c58f02e

由 Guoqing Jiang 提交于 4月 21, 2015


Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

8c58f02e

17 4月, 2015 1 次提交

dm crypt: fix missing error code return from crypt_ctr error path · 44c144f9

由 Wei Yongjun 提交于 4月 16, 2015

Fix to return a negative error code from crypt_ctr()'s optional
parameter processing error path.
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

44c144f9

16 4月, 2015 9 次提交

dm crypt: fix deadlock when async crypto algorithm returns -EBUSY · 0618764c

由 Ben Collins 提交于 4月 03, 2015

I suspect this doesn't show up for most anyone because software
algorithms typically don't have a sense of being too busy.  However,
when working with the Freescale CAAM driver it will return -EBUSY on
occasion under heavy -- which resulted in dm-crypt deadlock.

After checking the logic in some other drivers, the scheme for
crypt_convert() and it's callback, kcryptd_async_done(), were not
correctly laid out to properly handle -EBUSY or -EINPROGRESS.

Fix this by using the completion for both -EBUSY and -EINPROGRESS.  Now
crypt_convert()'s use of completion is comparable to
af_alg_wait_for_completion().  Similarly, kcryptd_async_done() follows
the pattern used in af_alg_complete().

Before this fix dm-crypt would lockup within 1-2 minutes running with
the CAAM driver.  Fix was regression tested against software algorithms
on PPC32 and x86_64, and things seem perfectly happy there as well.
Signed-off-by: NBen Collins <ben.c@servergy.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

0618764c

dm crypt: leverage immutable biovecs when decrypting on read · 59779079

由 Mike Snitzer 提交于 4月 09, 2015

Commit 003b5c57 ("block: Convert drivers to immutable biovecs")
stopped short of changing dm-crypt to leverage the fact that the biovec
array of a bio will no longer be modified.

Switch to using bio_clone_fast() when cloning bios for decryption after
read.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

59779079

dm crypt: update URLs to new cryptsetup project page · e44f23b3

由 Milan Broz 提交于 4月 05, 2015

Cryptsetup home page moved to GitLab.
Also remove link to abandonded Truecrypt page.
Signed-off-by: NMilan Broz <gmazyland@gmail.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

e44f23b3

dm: add log writes target · 0e9cebe7

由 Josef Bacik 提交于 3月 20, 2015

Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system.  We capture
all write requests and associated data and log them to a separate device
for later replay.  There is a userspace utility to do this replay.  The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NZach Brown <zab@zabbo.net>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

0e9cebe7

dm table: use bool function return values of true/false not 1/0 · 7f61f5a0

由 Joe Perches 提交于 3月 30, 2015

Use the normal return values for bool functions.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7f61f5a0

dm verity: add error handling modes for corrupted blocks · 65ff5b7d

由 Sami Tolvanen 提交于 3月 18, 2015

Add device specific modes to dm-verity to specify how corrupted
blocks should be handled.  The following modes are defined:

  - DM_VERITY_MODE_EIO is the default behavior, where reading a
    corrupted block results in -EIO.

  - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
    not block the read.

  - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
    block is discovered.

In addition, each mode sends a uevent to notify userspace of
corruption and to allow further recovery actions.

The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
and other modes can be enabled with an additional parameter to
the verity table.
Signed-off-by: NSami Tolvanen <samitolvanen@google.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

65ff5b7d

dm delay: use msecs_to_jiffies for time conversion · aca607ba

由 Nicholas Mc Guire 提交于 3月 17, 2015

Converting milliseconds to jiffies by "val * HZ / 1000" is technically
OK but msecs_to_jiffies(val) is the cleaner solution and handles all
corner cases correctly.
Signed-off-by: NNicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

aca607ba

dm log userspace base: fix compile warning · 18cc980a

由 Nicholas Mc Guire 提交于 3月 18, 2015

This fixes up a compile warning [-Wunused-but-set-variable] - given the
comment in userspace_set_region_sync() the non-reporting of errors is
intentional so the return value can be dropped to make gcc happy.

Also, fix typo in comment.
Signed-off-by: NNicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

18cc980a

dm log userspace transfer: match wait_for_completion_timeout return type · c32a512f

由 Nicholas Mc Guire 提交于 3月 15, 2015

Return type of wait_for_completion_timeout() is unsigned long not int.
An appropriately named unsigned long is added and the assignment fixed.
Signed-off-by: NNicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

c32a512f