Commits · 5f9d1fde7d54a5d5fd8cccbee9c9c31474fcdcf2 · openanolis / cloud-kernel

25 Aug, 2016 4 commits

raid5: fix memory leak of bio integrity data · 5f9d1fde

Authored by Shaohua Li on Aug 22, 2016

Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear the pointer to the integrity data, so bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>

5f9d1fde

raid10: record correct address of bad block · 27028626

Authored by Tomasz Majchrzak on Aug 23, 2016

For failed write request record block address on a device, not block
address in an array.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

27028626

md-cluster: fix error return code in join() · 0f6187db

Authored by Wei Yongjun on Aug 21, 2016

Fix to return error code -ENOMEM from the lockres_init() error
handling case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>

0f6187db
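
The control-flow pattern behind this fix can be sketched in plain C. This is a minimal user-space illustration, not the md-cluster code: `join_sketch()`, `lock_res_new()` and `struct lock_res` are hypothetical stand-ins, and only the `-ENOMEM` convention is taken from the commit message.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resource type, standing in for a lock resource. */
struct lock_res {
    int id;
};

/* Hypothetical allocator; returns NULL when `fail` is set or malloc fails. */
static struct lock_res *lock_res_new(int id, int fail)
{
    struct lock_res *res;

    if (fail)
        return NULL;
    res = malloc(sizeof(*res));
    if (res)
        res->id = id;
    return res;
}

/* Returns 0 on success or a negative errno value on failure. */
static int join_sketch(int fail_second)
{
    struct lock_res *a = NULL, *b = NULL;
    int ret = -ENOMEM;

    a = lock_res_new(1, 0);
    if (!a)
        goto err;

    ret = 0; /* an earlier step succeeded and overwrote ret */

    b = lock_res_new(2, fail_second);
    if (!b) {
        ret = -ENOMEM; /* without this line the stale 0 would be returned */
        goto err;
    }

    /* A real driver would keep both resources; the sketch frees them. */
    free(b);
    free(a);
    return 0;

err:
    free(b); /* free(NULL) is a no-op */
    free(a);
    return ret;
}
```

Calling `join_sketch(1)` takes the failure path and returns `-ENOMEM`; `join_sketch(0)` returns 0.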

r5cache: set MD_JOURNAL_CLEAN correctly · 486b0f7b

Authored by Song Liu on Aug 19, 2016

Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, MD_JOURNAL_CLEAN is only set when the journal
device is present.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>

486b0f7b

18 Aug, 2016 2 commits

md: don't print the same repeated messages about delayed sync operation · c622ca54

Authored by Artur Paszkiewicz on Aug 16, 2016

This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"

It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
   partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
   writing to their md/sync_action attribute files. One operation should
   start and one should be delayed and a message like the above will be
   printed.
3. Issue a write to the third array. Each write will cause 2 copies of
   the message to be printed.

This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.
Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

c622ca54
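
One way to suppress a repeated message for the same pair of devices is to remember which array last delayed this one. The sketch below is a user-space illustration under that assumption; `struct array_sketch`, its `last_delayed_by` field and `report_delay()` are hypothetical names, and the actual check added to md_do_sync() may be implemented differently.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for an md array. */
struct array_sketch {
    const char *name;
    const struct array_sketch *last_delayed_by; /* assumed bookkeeping field */
};

/* Prints the delay message only the first time `self` is delayed by `other`.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
static int report_delay(struct array_sketch *self,
                        const struct array_sketch *other)
{
    if (self->last_delayed_by == other)
        return 0; /* same pair as last time: stay quiet */

    self->last_delayed_by = other;
    printf("md: delaying data-check of %s until %s has finished\n",
           self->name, other->name);
    return 1;
}
```

With this bookkeeping, every later wake-up that finds the same blocking array goes back to sleep without logging again.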

md: remove obsolete ret in md_start_sync · 207efcd2

Authored by Guoqing Jiang on Aug 12, 2016

The ret is not needed anymore since we have already
moved resync_start into md_do_sync in commit 41a9a0dc.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

207efcd2

17 Aug, 2016 1 commit

md: do not count journal as spare in GET_ARRAY_INFO · b347af81

Authored by Song Liu on Aug 11, 2016

GET_ARRAY_INFO counts journal as spare (spare_disks), which is not
accurate. This patch fixes this.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>

b347af81

06 Aug, 2016 2 commits

md: Prevent IO hold during accessing to faulty raid5 array · 11367799

Authored by Alexey Obitotskiy on Aug 03, 2016

After an array enters the faulty state (e.g. the number of failed drives
becomes more than accepted for the raid5 level) it sets error flags
(one of these flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING is cleared in md_update_sb, but not for
external metadata arrays. While MD_CHANGE_PENDING is set, all new or
unfinished IOs to the array cannot complete and are held in a
pending state. In some cases this can lead to a deadlock.

For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.

Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

11367799

MD: hold mddev lock to change bitmap location · d9dd26b2

Authored by Shaohua Li on Jul 30, 2016

Changing the location changes a lot of things. Holding the lock to avoid race.
This makes the .quiesce called with mddev lock hold too.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

d9dd26b2

02 Aug, 2016 1 commit

raid5: fix incorrectly counter of conf->empty_inactive_list_nr · ff00d3b4

Authored by ZhengYuan Liu on Jul 28, 2016

The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>

ff00d3b4

31 Jul, 2016 1 commit

raid10: increment write counter after bio is split · 9b622e2b

Authored by Tomasz Majchrzak on Jul 28, 2016

md pending write counter must be incremented after bio is split,
otherwise it gets decremented too many times in end bio callback and
becomes negative.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

9b622e2b
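
The accounting imbalance described above can be modelled with a plain integer. This is a simplified sketch, not the raid10 code: `pending_after_completion()` is a hypothetical helper, and it assumes the end-bio callback decrements the counter once per fragment produced by the split.

```c
/* Returns the pending-write counter after a request that was split into
 * `fragments` bios has fully completed.
 * count_after_split == 0 models the old order (increment before the split),
 * count_after_split != 0 models the fixed order (increment per fragment). */
static int pending_after_completion(int fragments, int count_after_split)
{
    int pending = 0;
    int i;

    if (!count_after_split)
        pending++; /* old order: counted once, before the split */

    for (i = 0; i < fragments; i++) {
        if (count_after_split)
            pending++; /* fixed order: counted once per fragment */
    }

    for (i = 0; i < fragments; i++)
        pending--; /* end-bio callback runs once per fragment */

    return pending;
}
```

For a request split into three fragments, the old order ends at -2 while the fixed order ends at 0.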

29 Jul, 2016 1 commit

MD: fix null pointer deference · 5d881783

Authored by Shaohua Li on Jul 28, 2016

The md device might not have personality (for example, ddf raid array). The
issue is introduced by 8430e7e0(md: disconnect device from personality
before trying to remove it)
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

5d881783

21 Jul, 2016 10 commits

dm: allow bio-based table to be upgraded to bio-based with DAX support · b5ab4a9b

Authored by Toshi Kani on Jun 28, 2016

Allow table type DM_TYPE_BIO_BASED to extend with DM_TYPE_DAX_BIO_BASED
since DM_TYPE_DAX_BIO_BASED supports bio-based requests.

This is needed to allow a snapshot of an LV with DAX support to be
removed.  One of the intermediate table reloads that lvm2 does switches
from DM_TYPE_BIO_BASED to DM_TYPE_DAX_BIO_BASED.  No known reason to
disallow this so...
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b5ab4a9b

dm snap: add fake origin_direct_access · f6e629bd

Authored by Toshi Kani on Jun 28, 2016

dax-capable mapped-device is marked as DM_TYPE_DAX_BIO_BASED,
which supports both dax and bio-based operations.  dm-snap
needs to work with dax-capable device when bio-based operation
is used.

Add fake origin_direct_access() to origin device so that its
origin device is also marked as DM_TYPE_DAX_BIO_BASED for
dax-capable device.  This allows to extend target's DM table.
dm-snap works normally when bio-based operation is used.

dm-snap does not support dax operation, and mount with dax
option to a target device or snapshot device fails.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f6e629bd

dm stripe: add DAX support · beec25b4

Authored by Toshi Kani on Jun 24, 2016

Change dm-stripe to implement direct_access function,
stripe_direct_access(), which maps bdev and sector and
calls direct_access function of its physical target device.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

beec25b4

dm error: add DAX support · f8df1fdf

Authored by Mike Snitzer on Jun 24, 2016

Allow the error target to replace an existing DAX-enabled target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f8df1fdf

dm linear: add DAX support · 84b22f83

Authored by Toshi Kani on Jun 22, 2016

Change dm-linear to implement direct_access function,
linear_direct_access(), which maps sector and calls direct_access
function of its physical target device.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

84b22f83

dm: add infrastructure for DAX support · 545ed20e

Authored by Toshi Kani on Jun 22, 2016

Change mapped device to implement direct_access function,
dm_blk_direct_access(), which calls a target direct_access function.
'struct target_type' is extended to have target direct_access interface.
This function limits direct accessible size to the dm_target's limit
with max_io_len().

Add dm_table_supports_dax() to iterate all targets and associated block
devices to check for DAX support.  To add DAX support to a DM target the
target must only implement the direct_access function.

Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped
device supports DAX and is bio based.  This new type is used to assure
that all target devices have DAX support and remain that way after
QUEUE_FLAG_DAX is set in mapped device.

At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting
DM_TYPE_DAX_BIO_BASED to the type.  Any subsequent table load to the
mapped device must have the same type, or else it fails per the check in
table_load().
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

545ed20e

block: simplify and cleanup bvec pool handling · ed996a52

Authored by Christoph Hellwig on Jul 19, 2016

Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array.  Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

ed996a52

block: get rid of bio_rw and READA · 70246286

Authored by Christoph Hellwig on Jul 19, 2016

These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

70246286

dm thin: fix a race condition between discarding and provisioning a block · 2a0fbffb

Authored by Joe Thornber on Jul 01, 2016

The discard passdown was being issued after the block was unmapped,
which meant the block could be reprovisioned whilst the passdown discard
was still in flight.

We can only identify unshared blocks (safe to pass a discard down
to) once they're unmapped and their ref count hits zero. Block ref
counts are now used to guard against concurrent allocation of these
blocks that are being discarded. So now we unmap the block, issue
passdown discards, and then immediately increment ref counts for regions
that have been discarded via passdown (this is safe because
allocation occurs within the same thread). We then decrement ref counts
once the passdown discard IO is complete -- signaling these blocks may
now be allocated.

This fixes the potential for corruption that was reported here:
https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html
Reported-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2a0fbffb

dm btree: fix a bug in dm_btree_find_next_single() · e7e0f730

Authored by Joe Thornber on Jul 01, 2016

dm_btree_find_next_single() can short-circuit the search for a block
with a return of -ENODATA if all entries are higher than the search key
passed to lower_bound().

This hasn't been a problem because of the way the btree has been used by
DM thinp.  But it must be fixed now in preparation for fixing the race
in DM thinp's handling of simultaneous block discard vs allocation.
Otherwise, once that fix is in place, some of the blocks in a discard
would not be unmapped as expected.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

e7e0f730

20 Jul, 2016 4 commits

raid10: improve random reads performance · 0e5313e2

Authored by Tomasz Majchrzak on Jun 24, 2016

RAID10 random read performance is lower than expected due to excessive spinlock
utilisation which is required mostly for rebuild/resync. Simplify allow_barrier
as it's in IO path and encounters a lot of unnecessary congestion.

As lower_barrier just takes a lock in order to decrement a counter, convert
counter (nr_pending) into atomic variable and remove the spin lock. There is
also a congestion for wake_up (it uses lock internally) so call it only when
it's really needed. As wake_up is not called constantly anymore, ensure process
waiting to raise a barrier is notified when there are no more waiting IOs.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

0e5313e2
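
The idea of replacing a lock-protected counter with an atomic one, and waking a waiter only when needed, can be sketched with C11 atomics. This is a user-space approximation: `struct conf_sketch`, `barrier_waiters`, `wakeups` and `io_done_sketch()` are hypothetical names, and the exact wake-up condition in the kernel patch is an assumption here.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-array configuration. */
struct conf_sketch {
    atomic_int nr_pending;      /* in-flight IOs, updated without a lock */
    atomic_int barrier_waiters; /* processes waiting to raise a barrier */
    int wakeups;                /* counts calls to the stand-in wake_up() */
};

/* Called when one IO finishes. The decrement needs no spinlock, and the
 * (comparatively expensive) wake-up is issued only when the last pending
 * IO drains while somebody is actually waiting. */
static void io_done_sketch(struct conf_sketch *c)
{
    /* atomic_fetch_sub returns the value held before the subtraction */
    int prev = atomic_fetch_sub(&c->nr_pending, 1);

    if (prev == 1 && atomic_load(&c->barrier_waiters) > 0)
        c->wakeups++; /* stand-in for wake_up() */
}
```

With two IOs in flight and one waiter, only the second completion triggers the wake-up.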

md: add missing sysfs_notify on array_state update · 573275b5

Authored by Tomasz Majchrzak on Jun 30, 2016

Changeset 6791875e has added early return from a function so there is no
sysfs notification for 'active' and 'clean' state change.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

573275b5

Fix kernel module refcount handling · 4cb9da7d

Authored by Alexey Obitotskiy on Jun 23, 2016

md loads raidX modules and increments module refcount each time level
has changed but does not decrement it. You are unable to unload raid0
module after reshape because raid0 reshape changes level to raid4
and back to raid0.
Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

4cb9da7d

md: use seconds granularity for error logging · 0e3ef49e

Authored by Arnd Bergmann on Jun 17, 2016

The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.

There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>

0e3ef49e

19 Jul, 2016 14 commits

dm raid: fix random optimal_io_size for raid0 · 89d3d9a1

Authored by Heinz Mauelshagen on Jul 19, 2016

raid_io_hints() was retrieving the number of data stripes used for the
calculation of io_opt from struct r5conf, which is not defined for raid0
mappings.

Base the calculation on the in-core raid_set structure instead.

Also, adjust to use to_bytes() for the sector -> bytes conversion
throughout.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

89d3d9a1
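
The sector-to-bytes conversion and the resulting optimal I/O size can be illustrated as follows. This is a stand-alone sketch: `to_bytes_sketch()` mirrors what a 512-byte-sector conversion does, and `io_opt_sketch()` assumes the common convention that the optimal I/O size is the chunk size in bytes multiplied by the number of data stripes; neither is copied from dm-raid.

```c
#include <stdint.h>

#define SECTOR_SHIFT_SKETCH 9 /* 512-byte sectors */

/* Converts a sector count to bytes. */
static uint64_t to_bytes_sketch(uint64_t sectors)
{
    return sectors << SECTOR_SHIFT_SKETCH;
}

/* Assumed convention: one full stripe is the optimal I/O size. */
static uint64_t io_opt_sketch(uint64_t chunk_sectors, unsigned int data_stripes)
{
    return to_bytes_sketch(chunk_sectors) * data_stripes;
}
```

For example, a 128-sector (64 KiB) chunk across 3 data stripes gives an optimal I/O size of 196608 bytes.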

dm raid: address checkpatch.pl complaints · 094f394d

Authored by Heinz Mauelshagen on Jul 19, 2016

Use 'unsigned int' where appropriate.
Return negative errors.
Correct an indentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

094f394d

dm: call PR reserve/unreserve on each underlying device · 9c72bad1

Authored by Christoph Hellwig on Jul 08, 2016

So far we tried to rely on the SCSI 'all target ports' bit to register
all path, but for many setups this didn't work properly as the different
paths are seen as separate initiators to the target instead of multiple
ports of the same initiator.  Because of that we'll stop setting the
'all target ports' bit in SCSI, and let device mapper handle iterating
over the device for each path and register them manually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

9c72bad1

dm: fix second blk_delay_queue() parameter to be in msec units not jiffies · bd9f55ea

Authored by Tahsin Erdogan on Jul 15, 2016

Commit d548b34b ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceacc ("block:
remove per-queue plugging") did.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: d548b34b ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms")
Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c

bd9f55ea
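
The unit mismatch is easy to see with a little arithmetic. The sketch below assumes the old call site expressed "10 ms" as `HZ / 100` jiffies and that the callee reads its parameter as milliseconds; the helper names are hypothetical.

```c
/* Delay in milliseconds that results when a jiffies expression for
 * "10 ms" (hz / 100) is handed to a parameter that is read as msecs. */
static unsigned long delay_ms_from_jiffies_value(unsigned long hz)
{
    unsigned long ten_ms_in_jiffies = hz / 100;

    return ten_ms_in_jiffies; /* the callee interprets this as msecs */
}

/* After the fix the intended value is passed directly in msecs. */
static unsigned long delay_ms_fixed(void)
{
    return 10;
}
```

Only a kernel built with HZ=1000 gets the intended 10 ms from the old expression; HZ=250 yields 2 ms and HZ=100 yields 1 ms.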

dm raid: change logical functions to actually return bool · d7ccc2e2

Authored by Heinz Mauelshagen on Jul 06, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d7ccc2e2

dm raid: use rdev_for_each in status · 32682409

Authored by Heinz Mauelshagen on Jun 30, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

32682409

dm raid: use rs->raid_disks to avoid memory leaks on free · ffeeac75

Authored by Heinz Mauelshagen on Jun 30, 2016

Also makes code more consistent throughout.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ffeeac75

dm raid: support delta_disks for raid1, fix table output · 7a7c330f

Authored by Heinz Mauelshagen on Jun 30, 2016

Add "delta_disks" constructor argument support to raid1 to allow for
consistent userspace disk addition/removal handling.

Fix raid_status() to report all raid disks with status and table output
on disk adding reshapes, not just the ones listed on the mddev; optimize
its rebuild and writemostly output.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7a7c330f

dm raid: enhance reshape check and factor out reshape setup · 469b304b

Authored by Heinz Mauelshagen on Jun 29, 2016

Enhance rs_reshape_requested() check function to be more transparent and
fix its raid10 check.

Streamline the constructor by factoring out reshaping preparation into
function rs_prepare_reshape().
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

469b304b

dm raid: allow resize during recovery · 2a5556c2

Authored by Heinz Mauelshagen on Jun 27, 2016

Resizing a RAID set during recovery can be allowed, because the MD
resynchronization thread will either stop any ongoing recovery in case
of shrinking below the current recovery position or carry on recovery
to the new size if the set is growing.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2a5556c2

dm raid: fix rs_is_recovering() to allow for lvextend · 345a6cdc

Authored by Heinz Mauelshagen on Jun 25, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

345a6cdc

dm raid: fix rebuild and catch bogus sync/resync flags · 37f10be1

Authored by Heinz Mauelshagen on Jun 24, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

37f10be1

dm raid: fix ctr memory leaks on error paths · b1956dc4

Authored by Heinz Mauelshagen on Jun 24, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b1956dc4

dm raid: fix typo in write_mostly flag · 65359ee6

Authored by Heinz Mauelshagen on Jun 24, 2016

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

65359ee6

openanolis / cloud-kernel: synced successfully more than 1 year ago