提交 · 6d9d21e35fbfa2934339e96934f862d118abac23 · openeuler / raspberrypi-kernel

25 9月, 2013 1 次提交

bcache: Fix a dumb journal discard bug · 6d9d21e3

由 Kent Overstreet 提交于 9月 23, 2013

That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...
Signed-off-by: NKent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6d9d21e3

11 9月, 2013 1 次提交

drivers: convert shrinkers to new count/scan API · 7dc19d5a

由 Dave Chinner 提交于 8月 28, 2013

Convert the driver shrinkers to the new API.  Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.

FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out.  The amount of broken code I just encountered is
mind boggling.  I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers.  They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth.  And that doesn't even take into account the horrible,
broken code...

[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7dc19d5a

06 9月, 2013 9 次提交

dm stripe: silence a couple sparse warnings · 7fff5e8f

由 Mike Snitzer 提交于 9月 06, 2013

Eliminate the following sparse warnings:
drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

7fff5e8f

dm: add statistics support · fd2ed4d2

由 Mikulas Patocka 提交于 8月 16, 2013

Support the collection of I/O statistics on user-defined regions of
a DM device.  If no regions are defined no statistics are collected so
there isn't any performance impact.  Only bio-based DM devices are
currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds.  All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.

The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space.  At most, 1/4 of the overall system
memory may be allocated by DM statistics.  The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes

See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

fd2ed4d2

dm thin: always return -ENOSPC if no_free_space is set · 94563bad

由 Mike Snitzer 提交于 8月 22, 2013

If pool has 'no_free_space' set it means a previous allocation already
determined the pool has no free space (and failed that allocation with
-ENOSPC).  By always returning -ENOSPC if 'no_free_space' is set, we do
not allow the pool to oscillate between allocating blocks and then not.

But a side-effect of this determinism is that if a user wants to be able
to allocate new blocks they'll need to reload the pool's table (to clear
the 'no_free_space' flag).  This reload will happen automatically if the
pool's data volume is resized.  But if the user takes action to free a
lot of space by deleting snapshot volumes, etc the pool will no longer
allow data allocations to continue without an intervening table reload.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

94563bad

dm ioctl: cleanup error handling in table_load · f11c1c56

由 Mike Snitzer 提交于 8月 27, 2013

Make use of common cleanup code.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

f11c1c56

dm ioctl: increase granularity of type_lock when loading table · 00c4fc3b

由 Mike Snitzer 提交于 8月 27, 2013

Hold the mapped device's type_lock before calling populate_table() since
it is where the table's type is determined based on the specified
targets.  There is no need to allow concurrent table loads to race to
establish the table's targets or type.

This eliminates the need to grab the lock in dm_table_set_type().

Also verify that the type_lock is held in both dm_set_md_type() and
dm_get_md_type().
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

00c4fc3b

dm ioctl: prevent rename to empty name or uuid · c2b04824

由 Alasdair Kergon 提交于 8月 29, 2013

A device-mapper device must always have a name consisting of a non-empty
string.  If the device also has a uuid, this similarly must not be an
empty string.

The DM_DEV_CREATE ioctl enforces these rules when the device is created,
but this patch is needed to enforce them when DM_DEV_RENAME is used to
change the name or uuid.
Reported-by: NZdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NMikulas Patocka <mpatocka@redhat.com>

c2b04824

dm thin: set pool read-only if breaking_sharing fails block allocation · d6fc2042

由 Mike Snitzer 提交于 8月 21, 2013

break_sharing() now handles an arbitrary alloc_data_block() error
the same way as provision_block(): marks pool read-only and errors the
cell.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

d6fc2042

dm thin: prefix pool error messages with pool device name · 4fa5971a

由 Mike Snitzer 提交于 8月 21, 2013

Useful to know which pool is experiencing the error.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

4fa5971a

dm: allow error target to replace bio-based and request-based targets · 169e2cc2

由 Mike Snitzer 提交于 8月 22, 2013

It may be useful to switch a request-based table to the "error" target.
Enhance the DM core to allow a hybrid target_type which is capable of
handling either bios (via .map) or requests (via .map_rq).

Add a request-based map function (.map_rq) to the "error" target_type;
making it DM's first hybrid target.  Train dm_table_set_type() to prefer
the mapped device's established type (request-based or bio-based).  If
the mapped device doesn't have an established type default to making the
table with the hybrid target(s) bio-based.

Tested 'dmsetup wipe_table' to work on both bio-based and request-based
devices.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJoe Jin <joe.jin@oracle.com>
Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

169e2cc2

02 9月, 2013 1 次提交

raid5: only wakeup necessary threads · bfc90cb0

由 Shaohua Li 提交于 8月 29, 2013

If there are not enough stripes to handle, we'd better not always
queue all available work_structs. If one worker can only handle small
or even none stripes, it will impact request merge and create lock
contention.

With this patch, the number of work_struct running will depend on
pending stripes number. Note: some statistics info used in the patch
are accessed without locking protection. This should doesn't matter,
we just try best to avoid queue unnecessary work_struct.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

bfc90cb0

28 8月, 2013 6 次提交

md/raid5: flush out all pending requests before proceeding with reshape. · 4d77e3ba

由 NeilBrown 提交于 8月 27, 2013

Some requests - particularly 'discard' and 'read' are handled
differently depending on whether a reshape is active or not.

It is harmless to assume reshape is active if it isn't but wrong
to act as though reshape is not active when it is.

So when we start reshape - after making clear to all requests that
reshape has started - use mddev_suspend/mddev_resume to flush out all
requests.  This will ensure that no requests will be assuming the
absence of reshape once it really starts.
Signed-off-by: NNeilBrown <neilb@suse.de>

4d77e3ba

md/raid5: use seqcount to protect access to shape in make_request. · c46501b2

由 NeilBrown 提交于 8月 27, 2013

make_request() access various shape parameters (raid_disks, chunk_size
etc) which might be changed by raid5_start_reshape().

If the later is called at and awkward time during the form, the wrong
stripe_head might be used.

So introduce a 'seqcount' and after finding a stripe_head make sure
there is no reason to expect that we got the wrong one.
Signed-off-by: NNeilBrown <neilb@suse.de>

c46501b2

raid5: sysfs entry to control worker thread number · b721420e

由 Shaohua Li 提交于 8月 27, 2013

Add a sysfs entry to control running workqueue thread number. If
group_thread_cnt is set to 0, we will disable workqueue offload handling of
stripes.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

b721420e

raid5: offload stripe handle to workqueue · 851c30c9

由 Shaohua Li 提交于 8月 28, 2013

This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.

raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.

To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.

My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.

Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.

In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.

The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.

This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.

Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

851c30c9

raid5: fix stripe release order · d265d9dc

由 Shaohua Li 提交于 8月 28, 2013

patch "make release_stripe lockless" changes the order stripes are released.
Originally I thought block layer can take care of request merge, but it appears
there are still some requests not merged. It's easy to fix the order.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

d265d9dc

raid5: make release_stripe lockless · 773ca82f

由 Shaohua Li 提交于 8月 27, 2013

release_stripe still has big lock contention. We just add the stripe to a llist
without taking device_lock. We let the raid5d thread to do the real stripe
release, which must hold device_lock anyway. In this way, release_stripe
doesn't hold any locks.

The side effect is the released stripes order is changed. But sounds not a big
deal, stripes are never handled in order. And I thought block layer can already
do nice request merge, which means order isn't that important.

I kept the unplug release batch, which is unnecessary with this patch from lock
contention avoid point of view, and actually if we delete it, the stripe_head
release_list and lru can share storage. But the unplug release batch is also
helpful for request merge. We probably can delay wakeup raid5d till unplug, but
I'm still afraid of the case which raid5d is running.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

773ca82f

27 8月, 2013 5 次提交

md: avoid deadlock when dirty buffers during md_stop. · 260fa034

由 NeilBrown 提交于 8月 27, 2013

When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.

However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.

So do_md_stop() calls sync_blockdev().  However at this point it is
holding ->reconfig_mutex.  So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex.  So we deadlock.

We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.

So introduce new flag MD_STILL_CLOSED.  Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.

It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl.  Just don't do that.
Signed-off-by: NNeilBrown <neilb@suse.de>

260fa034

md: Don't test all of mddev->flags at once. · 7a0a5355

由 NeilBrown 提交于 8月 27, 2013

mddev->flags is mostly used to record if an update of the
metadata is needed.  Sometimes the whole field is tested
instead of just the important bits.  This makes it difficult
to introduce more state bits.

So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.
Signed-off-by: NNeilBrown <neilb@suse.de>

7a0a5355

md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f

由 Dave Jones 提交于 8月 19, 2013

Setting a variable to itself probably wasn't the intention here.
Signed-off-by: NDave Jones <davej@fedoraproject.org>
Signed-off-by: NNeilBrown <neilb@suse.de>

c9ad020f

md: fix safe_mode buglet. · 275c51c4

由 NeilBrown 提交于 8月 08, 2013

Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.
Signed-off-by: NNeilBrown <neilb@suse.de>

275c51c4

md: don't call md_allow_write in get_bitmap_file. · 60559da4

由 NeilBrown 提交于 7月 16, 2013

There is no really need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.

Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.
Signed-off-by: NNeilBrown <neilb@suse.de>

60559da4

24 8月, 2013 1 次提交

[SCSI] Return ENODATA on medium error · 7e782af5

由 Hannes Reinecke 提交于 7月 01, 2013

When a medium error is detected the SCSI stack should return
ENODATA to the upper layers.

[jejb: fix whitespace error]
Signed-off-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>

7e782af5

23 8月, 2013 8 次提交

dm space map: optimise sm_ll_dec and sm_ll_inc · f722063e

由 Joe Thornber 提交于 8月 09, 2013

Prior to this patch these methods did a lookup followed by an insert.
Instead they now call a common mutate function that adjusts the value
according to a callback function.  This avoids traversing the data
structures twice and hence improves performance.

Also factor out sm_ll_lookup_big_ref_count() for use by both
sm_ll_lookup() and sm_ll_mutate().
Signed-off-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

f722063e

dm btree: prefetch child nodes when walking tree for a dm_btree_del · 04f17c80

由 Joe Thornber 提交于 8月 09, 2013

dm-btree now takes advantage of dm-bufio's ability to prefetch data via
dm_bm_prefetch().  Prior to this change many btree node visits were
causing a synchronous read.
Signed-off-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

04f17c80

dm btree: use pop_frame in dm_btree_del to cleanup code · cd5acf0b

由 Joe Thornber 提交于 8月 09, 2013

Remove a visited leaf straight away from the stack, rather than
marking all it's children as visited and letting it get removed on the
next iteration.  May also offer a micro optimisation in dm_btree_del.
Signed-off-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

cd5acf0b

dm cache: eliminate holes in cache structure · c9ec5d7c

由 Mike Snitzer 提交于 8月 16, 2013

Reorder members in the cache structure to eliminate 6 out of 7 holes
(reclaiming 24 bytes).  Also, the 'worker' and 'waker' members no longer
straddle cachelines.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

c9ec5d7c

dm cache: fix stacking of geometry limits · f6109372

由 Mike Snitzer 提交于 8月 20, 2013

Do not blindly override the queue limits (specifically io_min and
io_opt).  Allow traditional stacking of these limits if io_opt is a
factor of the cache's data block size.

Without this patch mkfs.xfs does not recognize the cache device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored.  This was due to setting io_min to a useless value.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

f6109372

dm thin: fix stacking of geometry limits · 0cc67cd9

由 Mike Snitzer 提交于 8月 20, 2013

Do not blindly override the queue limits (specifically io_min and
io_opt).  Allow traditional stacking of these limits if io_opt is a
factor of the thin-pool's data block size.

Without this patch mkfs.xfs does not recognize the thin device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored.  This was due to setting io_min to a useless value.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

0cc67cd9

dm cache: add data block size limits to code and Documentation · 05473044

由 Mike Snitzer 提交于 8月 16, 2013

Place upper bound on the cache's data block size (1GB).

Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 32KB and 1GB.  Also, it should be a
multiple of 32KB.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

05473044

dm: stop using WQ_NON_REENTRANT · 670368a8

由 Tejun Heo 提交于 7月 30, 2013

dbf2576e ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

670368a8

17 8月, 2013 1 次提交

dm cache: avoid conflicting remove_mapping() in mq policy · b936bf8b

由 Geert Uytterhoeven 提交于 7月 26, 2013

On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

As mq_remove_mapping() already exists, and the local remove_mapping() is
used only once, inline it manually to avoid the conflict.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair Kergon <agk@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

b936bf8b

25 7月, 2013 2 次提交

md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66

由 NeilBrown 提交于 7月 22, 2013

If a device in a RAID4/5/6 is being replaced while another is being
recovered, then the writes to the replacement device currently don't
happen, resulting in corruption when the replacement completes and the
new drive takes over.

This is because the replacement writes are only triggered when
's.replacing' is set and not when the similar 's.sync' is set (which
is the case during resync and recovery - it means all devices need to
be read).

So schedule those writes when s.replacing is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the
replacement has happened as that is needed for recording that any
parity calculation is complete.  So introduce STRIPE_REPLACED to
record if the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set.
This has a similar effect to the "s.locked == 0" test.  The latter
ensure that now IO has been flagged but not started.  The former
checks if any parity calculation has been flagged by not started.
We must wait for both of these to complete before triggering the
'replace'.

Add a similar test to the subsequent check for "are we finished yet".
This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
but it makes it more obvious that the REPLACE will happen before we
think we are finished.

Finally if a NeedReplace device is not UPTODATE then that is an
error.  We really must trigger a warning.

This bug was introduced in commit 9a3e1101
(md/raid5:  detect and handle replacements during recovery.)
which introduced replacement for raid5.
That was in 3.3-rc3, so any stable kernel since then would benefit
from this fix.

Cc: stable@vger.kernel.org (3.3+)
Reported-by: Nqindehua <13691222965@163.com>
Tested-by: Nqindehua <qindehua@163.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

f94c0b66

md/raid10: remove use-after-free bug. · 0eb25bb0

由 NeilBrown 提交于 7月 24, 2013

We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.

Here is one place I wasn't careful enough.  If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.

This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
oops, so fix is suitable for any -stable since then.

Cc: stable@vger.kernel.org (3.3+)
Signed-off-by: NNeilBrown <neilb@suse.de>

0eb25bb0

18 7月, 2013 3 次提交

md/raid1: fix bio handling problems in process_checks() · 30bc9b53

由 NeilBrown 提交于 7月 17, 2013

Recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.

The underlying may have changed the bio in various ways using
bio_advance and these need to be undone not just for the 'sbio' which
is being copied to, but also the 'pbio' (primary) which is being
copied from.

So perform the reset on all bios that were read from and do it early.

This also ensure that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.

This fixes a crash during a 'check' of a RAID1 array.  The crash was
introduced in 3.10 so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

30bc9b53

md: Remove recent change which allows devices to skip recovery. · 5024c298

由 NeilBrown 提交于 7月 17, 2013

commit 7ceb17e8
    md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that.  It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
 If the array has a bitmap, then the device will still need a bitmap
 based resync ('saved_raid_disk' is set under different conditions
 is a bitmap is present).
 If the array doesn't have a bitmap, then this is correct as long as
 nothing has been written to the array since the metadata was checked
 by ->validate_super.  However there is no locking to ensure that there
 was no write.

Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

5024c298

md/raid10: fix two problems with RAID10 resync. · 7bb23c49

由 NeilBrown 提交于 7月 16, 2013

1/ When an different between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request write.
   Commit 8be185f2 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTDATE on all appropriate bios.
   So we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.

Cc: stable@vger.kernel.org (3.10)
Reported-by: NBrassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7bb23c49

12 7月, 2013 2 次提交

bcache: Allocation kthread fixes · 79826c35

由 Kent Overstreet 提交于 7月 10, 2013

The alloc kthread should've been using try_to_freeze() - and also there
was the potential for the alloc kthread to get woken up after it had
shut down, which would have been bad.
Signed-off-by: NKent Overstreet <kmo@daterainc.com>

79826c35

bcache: Fix GC_SECTORS_USED() calculation · 29ebf465

由 Kent Overstreet 提交于 7月 11, 2013

Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: NKent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

29ebf465