- 27 June 2013, 10 commits
-
-
Committed by Kent Overstreet
To make background writeback aware of raid5/6 stripes, we first need to track the amount of dirty data within each stripe - we do this by breaking up the existing sectors_dirty into per-stripe atomic_ts. Signed-off-by: Kent Overstreet <koverstreet@google.com>
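A minimal sketch of that bookkeeping, with hypothetical struct and helper names (the real bcache fields differ; real kernel code would also use do_div() for the 64-bit division on 32-bit builds):

    /* One dirty-sector counter per raid5/6 stripe instead of a single
     * per-device total. */
    struct dirty_tracking {
        unsigned  stripe_size;            /* sectors per stripe */
        unsigned  nr_stripes;
        atomic_t  *stripe_sectors_dirty;  /* one counter per stripe */
    };

    static void mark_sectors_dirty(struct dirty_tracking *t,
                                   unsigned long long offset,
                                   unsigned nr_sectors)
    {
        while (nr_sectors) {
            unsigned stripe = offset / t->stripe_size;
            unsigned n = min_t(unsigned, nr_sectors,
                               t->stripe_size - offset % t->stripe_size);

            /* split the update across stripe boundaries */
            atomic_add(n, &t->stripe_sectors_dirty[stripe]);
            offset += n;
            nr_sectors -= n;
        }
    }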
-
Committed by Kent Overstreet
Previously, dirty_data wouldn't get initialized until the first garbage collection... which was a bit of a problem for background writeback (as the PD controller keys off of it) and also confusing for users. This is also prep work for making background writeback aware of raid5/6 stripes. Signed-off-by: Kent Overstreet <koverstreet@google.com>
-
Committed by Kent Overstreet
The old lazy sorting code was kind of hacky - rewrite it in a way that mathematically makes more sense; the idea is that the sizes of the sets of keys in a btree node should increase by a more or less fixed ratio from smallest to biggest. Signed-off-by: Kent Overstreet <koverstreet@google.com>
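One way to state that invariant in code - an illustrative sketch only, not bcache's actual heuristic:

    /* Sets are kept ordered from smallest to largest; each set should be
     * at least 'ratio' times the combined size of all smaller sets.
     * When the invariant breaks, the smaller sets get merged and resorted. */
    static bool sets_need_sort(const unsigned *set_sizes, unsigned nr_sets,
                               unsigned ratio)
    {
        unsigned i, smaller = 0;

        for (i = 0; i < nr_sets; i++) {
            if (set_sizes[i] < smaller * ratio)
                return true;    /* ratio invariant violated */
            smaller += set_sizes[i];
        }

        return false;
    }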
-
Committed by Kent Overstreet
Old gcc doesn't like the struct hack, and it is kind of ugly. So finish off the work to convert pr_debug() statements to tracepoints, and delete pkey()/pbtree(). Signed-off-by: Kent Overstreet <koverstreet@google.com>
-
Committed by Kent Overstreet
The tracepoints were reworked to be more sensible, and a null pointer deref in one of the tracepoints was fixed. Converted some of the pr_debug()s to tracepoints - this is partly a performance optimization; it used to be that without DEBUG or CONFIG_DYNAMIC_DEBUG, pr_debug() was an empty macro, but at some point it was changed to an empty inline function. Some of the pr_debug() statements had rather expensive function calls as part of the arguments, so this code was getting run unnecessarily even on non-debug kernels - in some fast paths, too. Signed-off-by: Kent Overstreet <koverstreet@google.com>
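The distinction the message turns on, in schematic form (the guard macro here is made up purely for illustration):

    #ifdef OLD_EMPTY_MACRO_BEHAVIOUR
    /* Old behaviour: on non-debug builds the arguments were never
     * evaluated, so expensive calls in them were compiled away. */
    #define pr_debug(fmt, ...)  do { } while (0)
    #else
    /* Newer behaviour: an empty inline function still evaluates its
     * arguments, so pr_debug("%s", expensive_call()) runs the call
     * even when nothing is printed. */
    static inline int pr_debug(const char *fmt, ...) { return 0; }
    #endif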
-
Committed by Kent Overstreet
The most significant change is that btree reads are now done synchronously, instead of asynchronously and doing the post-read stuff from a workqueue. This was originally done because we can't block on IO under generic_make_request(). But we already have a mechanism to punt cache lookups to a workqueue if needed, so if we just use that we don't have to deal with the complexity of doing things asynchronously. The main benefit is this makes the locking situation saner; we can hold our write lock on the btree node until we're finished reading it, and we don't need that btree_node_read_done() flag anymore.

Also, for writes, btree_write() was broken out into btree_node_write() and btree_leaf_dirty() - the old code with the boolean argument was dumb and confusing.

The prio_blocked mechanism was improved a bit too: now the only counter is in struct btree_write, and we don't mess with transferring a count from struct btree anymore. This required changing garbage collection to block prios at the start and unblock when it finishes, which is cleaner than what it was doing anyway (the old code had mostly the same effect, but was doing it in a convoluted way).

The btree iterator that btree_node_read_done() uses was also converted to a real mempool. Signed-off-by: Kent Overstreet <koverstreet@google.com>
-
Committed by Kent Overstreet
Using a workqueue when we just want a single thread is a bit silly. Signed-off-by: Kent Overstreet <koverstreet@google.com>
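The usual replacement pattern - a generic kthread skeleton, not bcache's actual thread (function and thread names are illustrative):

    static int worker_fn(void *data)
    {
        while (!kthread_should_stop()) {
            /* do one unit of work here, then sleep until woken;
             * re-check the stop flag before sleeping so kthread_stop()
             * is never missed */
            set_current_state(TASK_INTERRUPTIBLE);
            if (!kthread_should_stop())
                schedule();
            __set_current_state(TASK_RUNNING);
        }
        return 0;
    }

    /* at setup time, instead of alloc_workqueue(): */
    struct task_struct *t = kthread_run(worker_fn, NULL, "example_thread");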
-
Committed by Gabriel de Perthuis
Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>
-
Committed by Kent Overstreet
An old version of gcc was complaining about using a const int as the size of a stack-allocated array. Which should be fine - but using ARRAY_SIZE() is better, anyways. Also, refactor the code to use scnprintf(). Signed-off-by: Kent Overstreet <koverstreet@google.com>
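The pattern in question, with hypothetical counters: scnprintf() returns the number of characters actually written (never more than the size passed), so calls can be chained safely.

    static ssize_t show_stats(char *page, unsigned reads, unsigned writes)
    {
        char buf[80];
        int len = 0;

        /* ARRAY_SIZE() sizes the buffer without repeating the constant */
        len += scnprintf(buf + len, ARRAY_SIZE(buf) - len,
                         "reads %u\n", reads);
        len += scnprintf(buf + len, ARRAY_SIZE(buf) - len,
                         "writes %u\n", writes);

        memcpy(page, buf, len);
        return len;
    }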
-
Committed by Kumar Amit Mehta
bio_alloc_bioset() returns NULL on failure. This fix adds the missing check to avoid a potential NULL pointer dereference. Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>
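The shape of such a fix, as a self-contained sketch (function and context are illustrative, not the bcache call site):

    static struct bio *clone_for_io(struct bio_set *bs, unsigned nr_vecs)
    {
        /* bio_alloc_bioset() returns NULL on failure, so check the
         * result before dereferencing it. */
        struct bio *bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, bs);

        if (!bio)
            return NULL;

        bio->bi_rw = WRITE;
        return bio;
    }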
-
- 13 June 2013, 4 commits
-
-
Committed by H. Peter Anvin
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drives behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession, resulting in data loss.

However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7, when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
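Disabling the command amounts to advertising a zero WRITE SAME limit on the array's request queue - a sketch of the idea (mddev here stands for the md array context; the md code of that era sets the queue limit directly):

    /* Tell the block layer this queue cannot take WRITE SAME at all,
     * so the command is never issued to the array's member devices. */
    if (mddev->queue)
        blk_queue_max_write_same_sectors(mddev->queue, 0);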
-
Committed by NeilBrown
Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The latter has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock; using freeze_array should not.

As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore.

The deadlock was made particularly noticeable by commits 050b6615 (raid10) and 6b740b8d (raid1), which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by Alex Lyakas
md/raid1: consider a WRITE successful only if at least one non-Faulty and non-rebuilding drive completed it. Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly added and is rebuilding
- Drive A fails
- A WRITE request arrives at the array. It is failed by drive A, so the r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate.
- The r1_bio arrives at handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():

    if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
        clear_bit(BIO_UPTODATE, &bio->bi_flags);

  this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown
__md_stop_writes() will currently sometimes freeze recovery. So any caller must be ready for that to happen, and indeed they are. However if __md_stop_writes() doesn't freeze recovery, then a recovery could start before mddev_suspend() is called, which could be awkward. This can particularly cause problems for dm-raid.

So change __md_stop_writes() to always freeze recovery. This is safe and more predictable. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
- 30 May 2013, 1 commit
-
-
Committed by Kent Overstreet
The patch that converted raid5 to use bio_reset() forgot to initialize bi_vcnt. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: NeilBrown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Tested-by: Ilia Mirkin <imirkin@alum.mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
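The nature of the bug, sketched as a fragment (the bio variable and restored value are illustrative, not the exact raid5 code):

    /* bio_reset() preserves only the fields after bi_max_vecs and
     * zeroes the rest - including bi_vcnt - so a path that reuses
     * the bio must restore it before resubmission. */
    bio_reset(bi);
    bi->bi_vcnt = 1;    /* the reused bio carries one embedded biovec */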
-
- 20 May 2013, 1 commit
-
-
Committed by Alasdair G Kergon
Fix detection of the need to resize the dm thin metadata device. The code incorrectly tried to extend the metadata device when it didn't need to, due to a merging error with patch 24347e95 ("dm thin: detect metadata device resizing").

    device-mapper: transaction manager: couldn't open metadata space map
    device-mapper: thin metadata: tm_open_with_sm failed
    device-mapper: thin: aborting transaction failed
    device-mapper: thin: switching pool to failure mode

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
- 15 May 2013, 3 commits
-
-
Committed by Kent Overstreet
This code appears to have rotted... fix various bugs and do some refactoring. Signed-off-by: Kent Overstreet <koverstreet@google.com>
-
Committed by Paul Bolle
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig symbol CLOSURES. That symbol was used in development versions of bcache, but was removed when the closures code was no longer provided as a kernel library. It can safely be dropped. Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
-
Committed by Emil Goode
The function pointer release in struct block_device_operations should point to functions declared as void. Sparse warnings:

    drivers/md/bcache/super.c:656:27: warning: incorrect type in initializer (different base types)
    drivers/md/bcache/super.c:656:27:    expected void ( *release )( ... )
    drivers/md/bcache/super.c:656:27:    got int ( static [toplevel] *<noident> )( ... )
    drivers/md/bcache/super.c:656:2: warning: initialization from incompatible pointer type [enabled by default]
    drivers/md/bcache/super.c:656:2: warning: (near initialization for ‘bcache_ops.release’) [enabled by default]

Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>
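What sparse is enforcing, sketched against the 3.10-era signature (the handler body and names are illustrative; the void return type itself comes from the Al Viro change listed at the end of this section):

    static void example_release(struct gendisk *disk, fmode_t mode)
    {
        /* last close of the device; nothing meaningful to return */
    }

    static const struct block_device_operations example_ops = {
        .owner   = THIS_MODULE,
        .release = example_release,    /* must be void, not int */
    };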
-
- 10 May 2013, 20 commits
-
-
Committed by Joe Thornber
Share configuration option processing code between the dm cache ctr and message functions. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Alasdair G Kergon
Move process_config_option() in dm-cache-target.c to make the next patch more readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Generate a dm event when the amount of remaining thin pool metadata space falls below a certain level. The threshold is taken to be a quarter of the size of the metadata device, with a minimum threshold of 4MB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Add a threshold callback to dm persistent data space maps. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Add a threshold callback function to the persistent data space map interface for a subsequent patch to use. dm-thin and dm-cache are interested in knowing when they're getting low on metadata or data blocks. This patch introduces a new method for registering a callback against a threshold. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
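A sketch of what such an interface can look like - the names below only approximate the dm persistent-data additions and should be treated as illustrative:

    /* Called from the space map's context once the count of free
     * blocks drops below the registered threshold. */
    typedef void (*sm_threshold_fn)(void *context);

    int sm_register_threshold_callback(struct dm_space_map *sm,
                                       dm_block_t threshold,
                                       sm_threshold_fn fn, void *context);

    /* A dm-thin-style client might raise a dm event so userspace
     * (e.g. an LVM daemon) can react and extend the device: */
    static void low_metadata_space(void *context)
    {
        struct dm_target *ti = context;

        dm_table_event(ti->table);
    }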
-
Committed by Joe Thornber
Allow the dm thin pool metadata device to be extended. Whenever a pool is resumed, detect whether the size of the metadata device has increased, and if so, extend the metadata to use the new space. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Support extending a dm persistent data metadata space map. The extend itself is implemented by switching back to the bootstrap allocator and pointing it at the new space. The extra bitmap indexes are then allocated from the new space, and finally we switch back to the proper space map ops and tweak the reference counts. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
If a thin pool is created in read-only-metadata mode then only open the metadata device read-only. Previously it was always opened with FMODE_READ | FMODE_WRITE. (Note that dm_get_device() still allows read-only dm devices to be used read-write at the moment: if I create a read-only linear device for the metadata, via dmsetup load --readonly, then I can still create a rw pool out of it.) Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
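The gist of the change, sketched with illustrative variable names:

    /* Request write access to the metadata device only when the pool
     * is not in read-only-metadata mode. */
    fmode_t metadata_mode = FMODE_READ |
        (read_only_metadata ? 0 : FMODE_WRITE);

    r = dm_get_device(ti, metadata_path, metadata_mode, &metadata_dev);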
-
Committed by Joe Thornber
Refactor device size functions in preparation for similar metadata device resizing functions. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Use struct assignment rather than memcpy in dm cache. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
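The general substitution, shown with a hypothetical struct: plain assignment compiles to the same copy but is type-checked, so a mismatched source or size becomes a compile error instead of silent corruption.

    struct stats {
        unsigned long hits;
        unsigned long misses;
    };

    static void copy_stats(struct stats *dst, const struct stats *src)
    {
        /* before: memcpy(dst, src, sizeof(*dst)); */
        *dst = *src;    /* struct assignment: compiler-checked copy */
    }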
-
Committed by Joe Thornber
Fix up some typos in dm-cache comments. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Alasdair G Kergon
Correct the documented requirement on the return code from dm cache policy lookup functions stated in the policy module header file. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Fix some typos in dm-space-map-metadata.c error messages. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Joe Thornber
Tune the dm cache migration throttling: i) issue a tick every second, just in case there's no I/O going through; ii) drop the migration threshold right down to something suitable for background work. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mike Snitzer
Enable WRITE SAME support in dm multipath. As far as multipath is concerned it is just another write request. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Bharata B Rao <bharata.rao@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mike Snitzer
If device_not_write_same_capable() returns true then the iterate_devices loop in dm_table_supports_write_same() should return false. Reported-by: Bharata B Rao <bharata.rao@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.8+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mikulas Patocka
This patch uses memalloc_noio_save to avoid a possible deadlock in dm-bufio (it could happen only with a large block size, at most PAGE_SIZE << MAX_ORDER, typically 8 MiB).

__vmalloc doesn't fully respect gfp flags. The specified gfp flags are used for allocation of the requested pages, the structures vmap_area, vmap_block and vm_struct, and the radix tree nodes. However, the kernel page tables are always allocated with GFP_KERNEL. Thus the allocation of page tables can recurse back into the I/O layer and cause a deadlock.

This patch uses the function memalloc_noio_save to set the per-process PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore it. When this flag is set, all allocations in the process are done with an implied GFP_NOIO flag, so the deadlock can't happen.

This should be backported to stable kernels, but they don't have the PF_MEMALLOC_NOIO flag and the memalloc_noio_save/memalloc_noio_restore functions. So PF_MEMALLOC should be set and restored instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
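The usage pattern of the API the message names, close to what dm-bufio does (the wrapper function name is illustrative):

    static void *alloc_buffer_data(size_t size)
    {
        unsigned int noio_flag;
        void *ptr;

        /* Everything between save and restore behaves as if it were
         * allocated with GFP_NOIO - including the page-table
         * allocations __vmalloc makes internally with GFP_KERNEL. */
        noio_flag = memalloc_noio_save();
        ptr = __vmalloc(size, GFP_NOIO | __GFP_HIGHMEM, PAGE_KERNEL);
        memalloc_noio_restore(noio_flag);

        return ptr;
    }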
-
Committed by Wei Yongjun
Return -ENOMEM instead of success if unable to allocate the pending exception mempool in snapshot_ctr. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
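The shape of this class of fix, as a fragment (variable names, the mempool size constant, and the label are illustrative):

    s->pending_pool = mempool_create_slab_pool(MIN_IOS, pending_cache);
    if (!s->pending_pool) {
        ti->error = "Could not allocate mempool for pending exceptions";
        r = -ENOMEM;    /* previously fell through with r still 0 */
        goto bad_pending_pool;
    }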
-
Committed by Wei Yongjun
Return -ENOMEM if memory allocation fails in cache_create instead of 0 (to avoid a NULL pointer dereference). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mike Snitzer
Fix a regression in the calculation of the stripe_width in the dm stripe target which led to incorrect processing of device limits. The stripe_width is the stripe device length divided by the number of stripes. The group of commits in the range f14fa693 ("dm stripe: fix size test") to eb850de6 ("dm stripe: support for non power of 2 chunksize") interfered with each other (a merging error) and led to the stripe_width being set incorrectly to the stripe device length divided by chunk_size * stripe_count.

For example, a stripe device's table with:

    0 33553920 striped 3 512 ...

should result in a stripe_width of 11184640 (33553920 / 3), but due to the bug it was getting set to 21845 (33553920 / (512 * 3)).

The impact of this bug is that device topologies that previously worked fine with the stripe target are no longer considered valid. In particular, there is a higher risk of seeing this issue if one of the stripe devices has a 4K logical block size, resulting in an error message like:

    "device-mapper: table: 253:4: len=21845 not aligned to h/w logical block size 4096 of dm-1"

The fix is to swap the order of the divisions and to use a temporary variable for the second one, so that width retains the intended value. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.6+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
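The arithmetic, sketched with sector_div() - which divides a sector_t in place and returns the remainder - using the numbers from the message (variable names are illustrative):

    sector_t width = 33553920;  /* stripe device length in sectors */
    uint32_t stripes = 3;

    sector_div(width, stripes); /* width = 11184640, as intended */

    /* The broken merge effectively computed
     *   33553920 / (512 * 3) = 21845
     * by dividing by chunk_size as well, producing a bogus width. */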
-
- 07 May 2013, 1 commit
-
-
Committed by Al Viro
The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) *and* it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-