提交 · d59b23795933678c9638fd20c942d2b4f3cd6185 · openeuler / raspberrypi-kernel

31 10月, 2017 1 次提交

bcache: only permit to recovery read error when cache device is clean · d59b2379

由 Coly Li 提交于 10月 30, 2017

When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on cache device is failed, bcache will try to recovery
the request by reading from cached device. If the data on cached device is
not synced with cache device, then requester will get a stale data.

For critical storage system like database, providing stale data from
recovery may result an application level data corruption, which is
unacceptible.

With this patch, for a failed read request in writeback or writethrough
mode, recovery a recoverable read request only happens when cache device
is clean. That is to say, all data on cached device is up to update.

For other cache modes in bcache, read request will never hit
cached_dev_read_error(), they don't need this patch.

Please note, because cache mode can be switched arbitrarily in run time, a
writethrough mode might be switched from a writeback mode. Therefore
checking dc->has_data in writethrough mode still makes sense.

Changelog:
V4: Fix parens error pointed by Michael Lyle.
v3: By response from Kent Oversteet, he thinks recovering stale data is a
    bug to fix, and option to permit it is unnecessary. So this version
    the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure  to
    allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.

[small change to patch comment spelling by mlyle]
Signed-off-by: NColy Li <colyli@suse.de>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reported-by: NArne Wolf <awolf@lenovo.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d59b2379

17 10月, 2017 1 次提交

bcache: writeback rate clamping: make 32 bit safe · 9ce762e8

由 Michael Lyle 提交于 10月 16, 2017

Sorry this got through to linux-block, was detected by the kbuilds test
robot.  NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a
signed long constant.

Fixes: e41166c5 ("bcache: writeback rate shouldn't artifically clamp")
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9ce762e8

16 10月, 2017 13 次提交

bcache: safeguard a dangerous addressing in closure_queue · 6446c684

由 Liang Chen 提交于 10月 13, 2017

The use of the union reduces the size of closure struct by taking advantage
of the current size of its members. The offset of func in work_struct
equals the size of the first three members, so that work.work_func will
just reference the forth member - fn.

This is smart but dangerous. It can be broken if work_struct or the other
structs get changed, and can be a bit difficult to debug.
Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6446c684

bcache: rearrange writeback main thread ratelimit · a8500fc8

由 Michael Lyle 提交于 10月 13, 2017

The time spent searching for things to write back "counts" for the
actual rate achieved, so don't flush the accumulated rate with each
chunk.

This will maintain better fidelity to user-commanded rates, but it
may slightly increase the burstiness of writeback.  The writeback
lock needs improvement to help mitigate this.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a8500fc8

bcache: writeback rate shouldn't artifically clamp · e41166c5

由 Michael Lyle 提交于 10月 13, 2017

The previous code artificially limited writeback rate to 1000000
blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
hardware.  The rate limiting code works fine (though with decreased
precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.

Additionally, ensure that uint32_t is used as a type for rate throughout
the rate management so that type checking/clamp_t can work properly.

bch_next_delay should be rewritten for increased precision and better
handling of high rates and long sleep periods, but this is adequate for
now.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reported-by: NColy Li <colyli@suse.de>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e41166c5

bcache: smooth writeback rate control · ae82ddbf

由 Michael Lyle 提交于 10月 13, 2017

This works in conjunction with the new PI controller.  Currently, in
real-world workloads, the rate controller attempts to write back 1
sector per second.  In practice, these minimum-rate writebacks are
between 4k and 60k in test scenarios, since bcache aggregates and
attempts to do contiguous writes and because filesystems on top of
bcachefs typically write 4k or more.

Previously, bcache used to guarantee to write at least once per second.
This means that the actual writeback rate would exceed the configured
amount by a factor of 8-120 or more.

This patch adjusts to be willing to sleep up to 2.5 seconds, and to
target writing 4k/second.  On the smallest writes, it will sleep 1
second like before, but many times it will sleep longer and load the
backing device less.  This keeps the loading on the cache and backing
device related to writeback more consistent when writing back at low
rates.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ae82ddbf

bcache: implement PI controller for writeback rate · 1d316e65

由 Michael Lyle 提交于 10月 13, 2017

bcache uses a control system to attempt to keep the amount of dirty data
in cache at a user-configured level, while not responding excessively to
transients and variations in write rate.  Previously, the system was a
PD controller; but the output from it was integrated, turning the
Proportional term into an Integral term, and turning the Derivative term
into a crude Proportional term.  Performance of the controller has been
uneven in production, and it has tended to respond slowly, oscillate,
and overshoot.

This patch set replaces the current control system with an explicit PI
controller and tuning that should be correct for most hardware.  By
default, it attempts to write at a rate that would retire 1/40th of the
current excess blocks per second.  An integral term in turn works to
remove steady state errors.

IMO, this yields benefits in simplicity (removing weighted average
filtering, etc) and system performance.

Another small change is a tunable parameter is introduced to allow the
user to specify a minimum rate at which dirty blocks are retired.

There is a slight difference from earlier versions of the patch in
integral handling to prevent excessive negative integral windup.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1d316e65

bcache: don't write back data if reading it failed · 5fa89fb9

由 Michael Lyle 提交于 10月 13, 2017

If an IO operation fails, and we didn't successfully read data from the
cache, don't writeback invalid/partial data to the backing disk.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5fa89fb9

bcache: remove unused parameter · 23850102

由 Yijing Wang 提交于 10月 13, 2017

Parameter bio is no longer used, clean it.
Signed-off-by: NYijing Wang <wangyijing@huawei.com>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

23850102

bcache: update bio->bi_opf bypass/writeback REQ_ flag hints · b41c9b02

由 Eric Wheeler 提交于 10月 13, 2017

Flag for bypass if the IO is for read-ahead or background, unless the
read-ahead request is for metadata (eg, from gfs2).
        Bypass if:
                bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
			!(bio->bi_opf & REQ_META))

        Writeback if:
                op_is_sync(bio->bi_opf) ||
			bio->bi_opf & (REQ_META|REQ_PRIO)
Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b41c9b02

bcache: Remove redundant set_capacity · e89d6759

由 Yijing Wang 提交于 10月 13, 2017

set_capacity() has been called in bcache_device_init(),
remove the redundant one.
Signed-off-by: NYijing Wang <wangyijing@huawei.com>
Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
Acked-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e89d6759

bcache: rewrite multiple partitions support · 1dbe32ad

由 Coly Li 提交于 10月 13, 2017

Current partition support of bcache is confusing and buggy. It tries to
trace non-continuous device minor numbers by an ida bit string, and
mistakenly mixed bcache device index with minor numbers. This design
generates several negative results,
- Index of bcache device name is not consecutive under /dev/. If there are
  3 bcache devices, they name will be,
  /dev/bcache0, /dev/bcache16, /dev/bcache32
  Only bcache code indexes bcache device name is such an interesting way.
- First minor number of each bcache device is traced by ida bit string.
  One bcache device will occupy 16 bits, this is not a good idea. Indeed
  only one bit is enough.
- Because minor number and bcache device index are mixed, a device index
  is allocated by ida_simple_get(), but an first minor number is sent into
  ida_simple_remove() to release the device. It confused original author
  too.

Root cause of the above errors is, bcache code should not handle device
minor numbers at all! A standard process to support multiple partitions in
Linux kernel is,
- Device driver provides major device number, and indexes multiple device
  instances.
- Device driver does not allocat nor trace device minor number, only
  provides a first minor number of a given device instance, and sets how
  many minor numbers (paritions) the device instance may have.
All rested stuffs are handled by block layer code, most of the details can
be found from block/{genhd, partition-generic}.c files.

This patch re-writes multiple partitions support for bcache. It makes
whole things to be more clear, and uses ida bit string in a more efficeint
way.
- Ida bit string only traces bcache device index, not minor number. For a
  bcache device with 128 partitions, only one bit in ida bit string is
  enough.
- Device minor number and device index are separated in concept. Device
  index is used for /dev node naming, and ida bit string trace. Minor
  number is calculated from device index and only used to initialize
  first_minor of a bcache device.
- It does not follow any standard for 16 partitions on a bcache device.
  This patch sets 128 partitions on single bcache device at max, this is
  the limitation from GPT (GUID Partition Table) and supported by fdisk.

Considering a typical device minor number is 20 bits width, each bcache
device may have 128 partitions (7 bits), there can be 8192 bcache devices
existing on system. For most common deployment for a single server in
now days, it should be enough.

[minor spelling fixes in commit message by Michael Lyle]
Signed-off-by: NColy Li <colyli@suse.de>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1dbe32ad

bcache: fix a comments typo in bch_alloc_sectors() · b1e8139e

由 Coly Li 提交于 10月 13, 2017

Code comments in alloc.c:bch_alloc_sectors() mentions a function
name find_data_bucket(), the correct function name should be
pick_data_bucket() indeed. bch_alloc_sectors() is a quite important
function in bcache allocation code, fixing the typo may help
other people to have less confusion.
Signed-off-by: NColy Li <colyli@suse.de>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b1e8139e

bcache: check ca->alloc_thread initialized before wake up it · 91af8300

由 Coly Li 提交于 10月 13, 2017

In bcache code, sysfs entries are created before all resources get
allocated, e.g. allocation thread of a cache set.

There is posibility for NULL pointer deference if a resource is accessed
but which is not initialized yet. Indeed Jorg Bornschein catches one on
cache set allocation thread and gets a kernel oops.

The reason for this bug is, when bch_bucket_alloc() is called during
cache set registration and attaching, ca->alloc_thread is not properly
allocated and initialized yet, call wake_up_process() on ca->alloc_thread
triggers NULL pointer deference failure. A simple and fast fix is, before
waking up ca->alloc_thread, checking whether it is allocated, and only
wake up ca->alloc_thread when it is not NULL.
Signed-off-by: NColy Li <colyli@suse.de>
Reported-by: NJorg Bornschein <jb@capsec.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

91af8300

bcache: Avoid nested function definition · 58f913dc

由 Peter Foley 提交于 10月 13, 2017

Fixes below error with clang:
../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
                {       return *((uint16_t *) r) - *((uint16_t *) l); }
                ^
../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
                sort(p, n, sizeof(uint16_t), cmp, NULL);
                                             ^
2 errors generated.

v2:
rename function to __bch_cache_cmp
Signed-off-by: NPeter Foley <pefoley2@pefoley.com>
Reviewed-by: NColy Li <colyli@suse.de>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

58f913dc

06 10月, 2017 1 次提交

block: remove QUEUE_FLAG_STACKABLE · 5fdee212

由 Christoph Hellwig 提交于 10月 05, 2017

We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5fdee212

03 10月, 2017 1 次提交

buffer: have alloc_page_buffers() use __GFP_NOFAIL · 640ab98f

由 Jens Axboe 提交于 9月 27, 2017

Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

640ab98f

11 9月, 2017 6 次提交

dax: remove the pmem_dax_ops->flush abstraction · c3ca015f

由 Mikulas Patocka 提交于 8月 31, 2017

Commit abebfbe2 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.

It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().

It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.

Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.

Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

c3ca015f

dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK · b5e8ad92

由 Arnd Bergmann 提交于 8月 15, 2017

The new lockdep support for completions causeed the stack usage
in dm-integrity to explode, in case of write_journal from 504 bytes
to 1120 (using arm gcc-7.1.1):

drivers/md/dm-integrity.c: In function 'write_journal':
drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

The problem is that not only the size of 'struct completion' grows
significantly, but we end up having multiple copies of it on the stack
when we assign it from a local variable after the initial declaration.

COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we
want to declare and initialize a completion on the stack. However,
this driver doesn't do that and instead initializes the completion
just before it is used.

In this case, init_completion() does the same thing more efficiently,
and drops the stack usage for the function above down to 496 bytes.
While the other functions in this file are not bad enough to cause
a warning, they benefit equally from the change, so I do the change
across the entire file. In the one place where we reuse a completion,
I picked the cheaper reinit_completion() over init_completion().

Fixes: cd8084f9 ("locking/lockdep: Apply crossrelease to completions")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NMikulas Patocka <mpatocka@redhat.com>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

b5e8ad92

dm integrity: make blk_integrity_profile structure const · 7c373d66

由 Bhumika Goyal 提交于 8月 06, 2017

Make this structure const as it is only stored in the profile field of a
blk_integrity structure. This field is of type const, so make structure
as const.
Signed-off-by: NBhumika Goyal <bhumirks@gmail.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7c373d66

dm integrity: do not check integrity for failed read operations · b7e326f7

由 Hyunchul Lee 提交于 7月 31, 2017

Even though read operations fail, dm_integrity_map_continue() calls
integrity_metadata() to check integrity.  In this case, just complete
these.

This also makes it so read I/O errors do not generate integrity warnings
in the kernel log.

Cc: stable@vger.kernel.org
Signed-off-by: NHyunchul Lee <cheol.lee@lge.com>
Acked-by: NMilan Broz <gmazyland@gmail.com>
Acked-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

b7e326f7

dm log writes: fix >512b sectorsize support · 228bb5b2

由 Josef Bacik 提交于 7月 28, 2017

512b sectors vs device's physical sectorsize was not maintained
consistently and as such the support for >512b sector devices has bugs.
The log metadata expects native sectorsize but 512b sectors were being
stored.  Also, device's sectorsize was assumed when assigning the
bi_sector for blocks that were being logged.

Fix this up by adding two helpers to convert between bio and dev
sectors, and use these in the appropriate places to fix the problem and
make it clear which units go where.  Doing so allows dm-log-writes use
with 4k devices.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

228bb5b2

dm log writes: don't use all the cpu while waiting to log blocks · 0c79c620

由 Josef Bacik 提交于 7月 28, 2017

The check to see if the logging kthread needs to go to sleep is wrong,
it checks lc->pending_blocks, which will be non-0 if there are any
blocks that are pending, whether they are ready to be logged or not.
What we really want is to go to sleep until it's time to log blocks, so
change this check so we do actually go to sleep in between flushes.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

0c79c620

08 9月, 2017 1 次提交

bcache: initialize dirty stripes in flash_dev_run() · 175206cf

由 Tang Junhui 提交于 9月 07, 2017

bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.

Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.

This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
 -void bch_sectors_dirty_init(struct cached_dev *dc);
 +void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().

(Commit log is composed by Coly Li)
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

175206cf

06 9月, 2017 14 次提交

bcache: fix bch_hprint crash and improve output · 9276717b

由 Michael Lyle 提交于 9月 06, 2017

Most importantly, solve a crash where %llu was used to format signed
numbers.  This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.

Always use the units mechanism rather than having different output
paths for simplicity.

Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100.  (Remainders of >= 1000 would print as .10).

Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown.  Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reported-by: NDmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: NKent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9276717b

bcache: Update continue_at() documentation · 7b6a8570

由 Dan Carpenter 提交于 9月 06, 2017

continue_at() doesn't have a return statement anymore.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Acked-by: NColy Li <colyli@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7b6a8570

bcache: silence static checker warning · da22f0ee

由 Dan Carpenter 提交于 9月 06, 2017

In olden times, closure_return() used to have a hidden return built in.
We removed the hidden return but forgot to add a new return here.  If
"c" were NULL we would oops on the next line, but fortunately "c" is
never NULL.  Let's just remove the if statement.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

da22f0ee

bcache: fix for gc and write-back race · 9baf3097

由 Tang Junhui 提交于 9月 06, 2017

gc and write-back get raced (see the email "bcache get stucked" I sended
before):
gc thread                               write-back thread
|                                       |bch_writeback_thread()
|bch_gc_thread()                        |
|                                       |==>read_dirty()
|==>bch_btree_gc()                      |
|==>btree_root() //get btree root       |
|                //node write locker    |
|==>bch_btree_gc_root()                 |
|                                       |==>read_dirty_submit()
|                                       |==>write_dirty()
|                                       |==>continue_at(cl,
|                                       |               write_dirty_finish,
|                                       |               system_wq);
|                                       |==>write_dirty_finish()//excute
|                                       |               //in system_wq
|                                       |==>bch_btree_insert()
|                                       |==>bch_btree_map_leaf_nodes()
|                                       |==>__bch_btree_map_nodes()
|                                       |==>btree_root //try to get btree
|                                       |              //root node read
|                                       |              //lock
|                                       |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
|                            //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to excute
|                               //journal_write in system_wq
|                               //but work queue is excuting
|                               //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
|                   //over and wake up gc,
|-------------stuck here
|==>release root node write locker

This patch alloc a separate work-queue for write-back thread to avoid such
race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Acked-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9baf3097

bcache: increase the number of open buckets · 89b1fc54

由 Tang Junhui 提交于 9月 06, 2017

In currently, we only alloc 6 open buckets for each cache set,
but in usually, we always attach about 10 or so backend devices for
each cache set, and the each bcache device are always accessed by
about 10 or so threads in top application layer. So 6 open buckets
are too few, It has led to that each of the same thread write data
to different buckets, which would cause low efficiency write-back,
and also cause buckets inefficient, and would be Very easy to run
out of.

I add debug message in bch_open_buckets_alloc() to print alloc bucket
info, and test with ten bcache devices with a cache set, and each
bcache device is accessed by ten threads.

From the debug message, we can see that, after the modification, One
bucket is more likely to assign to the same thread, and the data from
the same thread are more likely to write the same bucket. Usually the
same thread always write/read the same backend device, so it is good
for write-back and also promote the usage efficiency of buckets.
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

89b1fc54

bcache: Correct return value for sysfs attach errors · 77fa100f

由 Tony Asleson 提交于 9月 06, 2017

If you encounter any errors in bch_cached_dev_attach it will return
a negative error code.  The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written which can cause incorrect user space behavior.  Utilize 1
signed variable to use throughout the function to preserve error return
capability.
Signed-off-by: NTony Asleson <tasleson@redhat.com>
Acked-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

77fa100f

bcache: correct cache_dirty_target in __update_writeback_rate() · a8394090

由 Tang Junhui 提交于 9月 06, 2017

__update_write_rate() uses a Proportion-Differentiation Controller
algorithm to control writeback rate. A dirty target number is used in
this PD controller to control writeback rate. A larger target number
will make the writeback rate smaller, on the versus, a smaller target
number will make the writeback rate larger.

bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

The calculation at step 1) for cache_sectors is incorrect, which does
not consider dirty blocks occupied by flash only volume.

A flash only volume can be took as a bcache device without cached
device. All data sectors allocated for it are persistent on cache device
and marked dirty, they are not touched by bcache writeback and garbage
collection code. So data blocks of flash only volume should be ignore
when calculating cache_sectors of cache set.

Current code does not subtract dirty sectors of flash only volume, which
results a larger target number from the above 3 steps. And in sequence
the cache device's writeback rate is smaller then a correct value,
writeback speed is slower on all cached devices.

This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().

(Commit log composed by Coly Li to pass checkpatch.pl checking)
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a8394090

bcache: gc does not work when triggering by manual command · 0b43f49d

由 Tang Junhui 提交于 9月 06, 2017

I try to execute the following command to trigger gc thread:
[root@localhost internal]# echo 1 > trigger_gc
But it does not work, I debug the code in gc_should_run(), It works only
if in invalidating or sectors_to_gc < 0. So set sectors_to_gc to -1 to
meet the condition when we trigger gc by manual command.

(Code comments aded by Coly Li)
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0b43f49d

bcache: Don't reinvent the wheel but use existing llist API · 09b3efec

由 Byungchul Park 提交于 9月 06, 2017

Although llist provides proper APIs, they are not used. Make them used.
Signed-off-by: NByungchul Park <byungchul.park@lge.com>
Acked-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09b3efec

bcache: do not subtract sectors_to_gc for bypassed IO · 69daf03a

由 Tang Junhui 提交于 9月 06, 2017

Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
trigger gc thread.
Signed-off-by: Ntang.junhui <tang.junhui@zte.com.cn>
Acked-by: NColy Li <colyli@suse.de>
Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

69daf03a

bcache: fix sequential large write IO bypass · c81ffa32

由 Tang Junhui 提交于 9月 06, 2017

Sequential write IOs were tested with bs=1M by FIO in writeback cache
mode, these IOs were expected to be bypassed, but actually they did not.
We debug the code, and find in check_should_bypass():
    if (!congested &&
        mode == CACHE_MODE_WRITEBACK &&
        op_is_write(bio_op(bio)) &&
        (bio-＞bi_opf & REQ_SYNC))
        goto rescale
that means, If in writeback mode, a write IO with REQ_SYNC flag will not
be bypassed though it is a sequential large IO, It's not a correct thing
to do actually, so this patch remove these codes.
Signed-off-by: Ntang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: NKent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c81ffa32

bcache: Fix leak of bdev reference · 4b758df2

由 Jan Kara 提交于 9月 06, 2017

If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.
Signed-off-by: NJan Kara <jack@suse.cz>
Acked-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4b758df2

md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list · 184a09eb

由 Dennis Yang 提交于 9月 06, 2017

In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
set, it indicates that this stripe_head is already in the raid5_plug_cb
list and release_stripe() would be called instead to drop a reference
count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
stripe_head and it will get queued into the raid5_plug_cb list.

Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
A stripe could be re-added to plug list while it is still on that list
in the following situation. If stripe_head A is added to another
stripe_head B's batch list, in this case A will have its
batch_head != NULL and be added into the plug list. After that,
stripe_head B gets handled and called break_stripe_batch_list() to
reset all the batched stripe_head(including A which is still on
the plug list)'s state and reset their batch_head to NULL.
Before the plug list gets processed, if there is another write request
comes in and get stripe_head A, A will have its batch_head == NULL
(cleared by calling break_stripe_batch_list() on B) and be added to
plug list once again.
Signed-off-by: NDennis Yang <dennisyang@qnap.com>
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NShaohua Li <shli@fb.com>

184a09eb

md/raid5: fix a race condition in stripe batch · 3664847d

由 Shaohua Li 提交于 8月 25, 2017

We have a race condition in below scenario, say have 3 continuous stripes, sh1,
sh2 and sh3, sh1 is the stripe_head of sh2 and sh3:

CPU1				CPU2				CPU3
handle_stripe(sh3)
				stripe_add_to_batch_list(sh3)
				-> lock(sh2, sh3)
				-> lock batch_lock(sh1)
				-> add sh3 to batch_list of sh1
				-> unlock batch_lock(sh1)
								clear_batch_ready(sh1)
								-> lock(sh1) and batch_lock(sh1)
								-> clear STRIPE_BATCH_READY for all stripes in batch_list
								-> unlock(sh1) and batch_lock(sh1)
->clear_batch_ready(sh3)
-->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
--->return 0 as sh->batch == NULL
				-> sh3->batch_head = sh1
				-> unlock (sh2, sh3)

In CPU1, handle_stripe will continue handle sh3 even it's in batch stripe list
of sh1. By moving sh3->batch_head assignment in to batch_lock, we make it
impossible to clear STRIPE_BATCH_READY before batch_head is set.

Thanks Stephane for helping debug this tricky issue.
Reported-and-tested-by: NStephane Thiell <sthiell@stanford.edu>
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NShaohua Li <shli@fb.com>

3664847d

01 9月, 2017 1 次提交

md/bitmap: disable bitmap_resize for file-backed bitmaps. · e8a27f83

由 NeilBrown 提交于 8月 31, 2017

bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.

The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger.  Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.

So this patch disables the resizing of arrays which have
file-backed bitmaps.  This is better than crashing.
Reported-by: NZhilong Liu <zlliu@suse.com>
Fixes: d60b479d ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Cc: stable@vger.kernel.org (v3.5+).
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

e8a27f83

28 8月, 2017 1 次提交

dm ioctl: constify ioctl lookup table · cf0dec66

由 Eric Biggers 提交于 6月 22, 2017

Constify the lookup table for device-mapper ioctls so that it is placed
in .rodata.
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

cf0dec66