提交 · e1f425e770d21a34f51d7284e55f3fa984f8e275 · openeuler / raspberrypi-kernel

11 1月, 2018 7 次提交

nvme/multipath: Use blk_path_error · e1f425e7

由 Keith Busch 提交于 1月 09, 2018

Uses common code for determining if an error should be retried on
alternate path.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e1f425e7

block: Provide blk_status_t decoding for path errors · 9111e568

由 Keith Busch 提交于 1月 09, 2018

This patch provides a common decoder for block status path related errors
that may be retried so various entities wishing to consult this do not
have to duplicate this decision.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9111e568

nvme/multipath: Consult blk_status_t for failover · 908e4564

由 Keith Busch 提交于 1月 09, 2018

This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was decoded earlier. This
abstraction from the raw NVMe status means all status decoding exists
in one place.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

908e4564

nvme: Add more command status translation · e96fef2c

由 Keith Busch 提交于 1月 09, 2018

This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e96fef2c

blk-mq: Explain when 'active_queues' is decremented · fcd36c36

由 Bart Van Assche 提交于 1月 10, 2018

It is nontrivial to derive from the blk-mq source code when
blk_mq_tags.active_queues is decremented. Hence add a comment that
explains this.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fcd36c36

partitions/msdos: Unable to mount UFS 44bsd partitions · 5f15684b

由 Richard Narron 提交于 1月 10, 2018

UFS partitions from newer versions of FreeBSD 10 and 11 use relative
addressing for their subpartitions. But older versions of FreeBSD still
use absolute addressing just like OpenBSD and NetBSD.

Instead of simply testing for a FreeBSD partition, the code needs to
also test if the starting offset of the C subpartition is zero.

https://bugzilla.kernel.org/show_bug.cgi?id=197733Signed-off-by: NRichard Narron <comet.berkeley@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5f15684b

null_blk: add option for managing IO timeouts · 93b57046

由 Jens Axboe 提交于 1月 10, 2018

Use the fault injection framework to provide a way for null_blk
to configure timeouts. This only works for queue_mode 1 and 2,
since the bio mode doesn't have code for tracking timeouts.

Let's say you want to have a 10% chance of timing out every
100,000 requests, and for 5 total timeouts, you could do:

modprobe null_blk timeout="100000,10,0,5"

This is useful for adding blktests to test that IO timeouts
are handled appropriately.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

93b57046

10 1月, 2018 16 次提交

block, bfq: fix occurrences of request finish method's old name · 8993d445

由 Chiara Bruschi 提交于 12月 18, 2017

Commit '7b9e9361' ("blk-mq-sched: unify request finished methods")
changed the old name of current bfq_finish_request method, but left it
unchanged elsewhere in the code (related comments, part of function
name bfq_put_rq_priv_body).

This commit fixes all occurrences of the old name of this method by
changing them into the current name.

Fixes: 7b9e9361 ("blk-mq-sched: unify request finished methods")
Reviewed-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NFederico Motta <federico@willer.it>
Signed-off-by: NChiara Bruschi <bruschi.chiara@outlook.it>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8993d445

Revert "block: blk-merge: try to make front segments in full size" · b4b6cb61

由 Ming Lei 提交于 1月 10, 2018

This reverts commit a2d37968.

If max segment size isn't 512-aligned, this patch won't work well.

Also once multipage bvec is enabled, adjacent bvecs won't be physically
contiguous if page is added via bio_add_page(), so we don't need this
kind of complicated logic.
Reported-by: NDmitry Osipenko <digetx@gmail.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b4b6cb61

null_blk: wire up timeouts · 5448aca4

由 Jens Axboe 提交于 1月 09, 2018

This is needed to ensure that we actually handle timeouts.
Without it, the queue_mode=1 path will never call blk_add_timer(),
and the queue_mode=2 path will continually just return
EH_RESET_TIMER and we never actually complete the offending request.

This was used to test the new timeout code, and the changes around
killing off REQ_ATOM_COMPLETE.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5448aca4

bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED · 8abef10b

由 Jens Axboe 提交于 1月 09, 2018

It's not available if we don't have group io scheduling set, and
there's no need to call it.

Fixes: 0d52af59 ("block, bfq: release oom-queue ref to root group on exit")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8abef10b

bcache: closures: move control bits one bit right · 3609c471

由 Michael Lyle 提交于 1月 09, 2018

Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3609c471

block: Fix kernel-doc warnings reported when building with W=1 · aa98192d

由 Bart Van Assche 提交于 1月 09, 2018

Commit 3a025e1d ("Add optional check for bad kernel-doc comments")
causes W=1 the kernel-doc script to be run and thereby causes several
new warnings to appear when building the kernel with W=1. Fix the
block layer kernel-doc headers such that the block layer again builds
cleanly with W=1.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

aa98192d

blk-mq: Fix spelling in a source code comment · ee3e4de5

由 Bart Van Assche 提交于 1月 09, 2018

Change "nedeing" into "needing" and "caes" into "cases".

Fixes: commit f906a6a0 ("blk-mq: improve tag waiting setup for non-shared tags")
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ee3e4de5

blk-mq: silence false positive warnings in hctx_unlock() · 08b5a6e2

由 Jens Axboe 提交于 1月 09, 2018

In some stupider versions of gcc, it complains:

block/blk-mq.c: In function ‘blk_mq_complete_request’:
./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  __srcu_read_unlock(sp, idx);
  ^
block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here
  int srcu_idx;
      ^

which is completely bogus, since we only use srcu_idx when
hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where
hctx_lock() has initialized it.

Just set it to '0' in the normal path in hctx_lock() to silence
this annoying warning.

Fixes: 04ced159 ("blk-mq: move hctx lock/unlock into a helper")
Fixes: 5197c05e ("blk-mq: protect completion path with RCU")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

08b5a6e2

blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64

由 Tejun Heo 提交于 1月 09, 2018

The RCU protection has been expanded to cover both queueing and
completion paths making ->queue_rq_srcu a misnomer.  Rename it to
->srcu as suggested by Bart.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05707b64

blk-mq: remove REQ_ATOM_STARTED · 5a61c363

由 Tejun Heo 提交于 1月 09, 2018

After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a61c363

blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46

由 Tejun Heo 提交于 1月 09, 2018

After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times.  This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

634f9e46

blk-mq: make blk_abort_request() trigger timeout path · 358f70da

由 Tejun Heo 提交于 1月 09, 2018

With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates
abortion but the actual termination will happen after a short while,
even when the caller owns the request.  AFAICS, SCSI and ATA should be
fine with that and I think mtip32xx and dasd should be safe but not
completely sure.  It'd be great if people who know the drivers take a
look.

v2: - Add comment explaining the lack of synchronization around
      ->deadline update as requested by Bart.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Asai Thambi SP <asamymuthupa@micron.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

358f70da

blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE · 67818d25

由 Tejun Heo 提交于 1月 09, 2018

blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results.  Replace the tests.  This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

67818d25

blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516

由 Tejun Heo 提交于 1月 09, 2018

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path.  The race has always been there and this
patch doesn't change it.  It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
      spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
      of free_request.  Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
      commit message.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1d9bd516

blk-mq: protect completion path with RCU · 5197c05e

由 Tejun Heo 提交于 1月 09, 2018

Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5197c05e

blk-mq: move hctx lock/unlock into a helper · 04ced159

由 Jens Axboe 提交于 1月 09, 2018

Move the RCU vs SRCU logic into lock/unlock helpers, which makes
the actual functional bits within the locked region much easier
to read.

tj: Reordered in front of timeout revamp patches and added the missing
    blk_mq_run_hw_queue() conversion.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

04ced159

09 1月, 2018 17 次提交

block, bfq: release oom-queue ref to root group on exit · 0d52af59

由 Paolo Valente 提交于 1月 09, 2018

On scheduler init, a reference to the root group, and a reference to
its corresponding blkg are taken for the oom queue. Yet these
references are not released on scheduler exit, which prevents these
objects from be freed. This commit adds the missing reference
releases.
Reported-by: NDavide Ferrari <davideferrari8@gmail.com>
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0d52af59

block, bfq: put async queues for root bfq groups too · 52257ffb

由 Paolo Valente 提交于 1月 09, 2018

For each pair [device for which bfq is selected as I/O scheduler,
group in blkio/io], bfq maintains a corresponding bfq group. Each such
bfq group contains a set of async queues, with each async queue
created on demand, i.e., when some I/O request arrives for it.  On
creation, an async queue gets an extra reference, to make sure that
the queue is not freed as long as its bfq group exists.  Accordingly,
to allow the queue to be freed after the group exited, this extra
reference must released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group
corresponding to the root blkio/io root for a given device. Yet, by
mistake, the references to the existing async queues of a root group
are not released when the latter exits. This causes a memory leak when
the instance of bfq for a given device exits. In a similar vein,
bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above
missing operations for a root group too.
Reported-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Reported-by: NGuoqing Jiang <gqjiang@suse.com>
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NDavide Ferrari <davideferrari8@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

52257ffb

blk-mq: fix kernel oops in blk_mq_tag_idle() · 8ab0b7dc

由 Ming Lei 提交于 1月 09, 2018

HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
then we need to check it before calling blk_mq_tag_idle(), otherwise
the following kernel oops can be triggered, so fix it by checking if
the hw queue is unmapped since it doesn't make sense to idle the tags
any more after hw queues are unmapped.

[  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
[  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
[  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
[  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
[  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
[  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
[  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
[  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
[  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
[  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
[  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
[  440.876169] Call Trace:
[  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
[  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
[  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
[  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
[  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
[  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
[  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
[  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
[  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
[  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
[  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
[  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
[  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
[  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
[  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
[  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
[  441.019301]  RSP <ffff893bf9bcbd10>
[  441.024052] CR2: 0000000000000008
Reported-by: NZhang Yi <yizhan@redhat.com>
Tested-by: NZhang Yi <yizhan@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8ab0b7dc

bcache: fix writeback target calc on large devices · 616486ab

由 Michael Lyle 提交于 1月 08, 2018

Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.

The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size.  However, there still needs to be change
to make this "fair" over many volumes where some are idle.
Reported-by: NJack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

616486ab

bcache: fix misleading error message in bch_count_io_errors() · 5138ac67

由 Coly Li 提交于 1月 08, 2018

Bcache only does recoverable I/O for read operations by calling
cached_dev_read_error(). For write opertions there is no I/O recovery for
failed requests.

But in bch_count_io_errors() no matter read or write I/Os, before errors
counter reaches io error limit, pr_err() always prints "IO error on %,
recoverying". For write requests this information is misleading, because
there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" by pr_err() when the bio direction is READ.
Signed-off-by: NColy Li <colyli@suse.de>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5138ac67

bcache: reduce cache_set devices iteration by devices_max_used · 2831231d

由 Coly Li 提交于 1月 08, 2018

Member devices of struct cache_set is used to reference all attached
bcache devices to this cache set. If it is treated as array of pointers,
size of devices[] is indicated by member nr_uuids of struct cache_set.

nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(),
	bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by user space tool "make-bcache", by default it
is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
nr_uuids value is 4096 from the above calculation.

Every time when bcache code iterates bcache devices of a cache set, all
the 4096 pointers are checked even only 1 bcache device is attached to the
cache set, that's a wast of time and unncessary.

This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
all valid bcache devices of a cache set, use c->devices_max_used in
for-loop may reduce a lot of useless checking.

Personally, my motivation of this patch is not for performance, I use it
in bcache debugging, which helps me to narrow down the scape to check
valid bcached devices of a cache set.
Signed-off-by: NColy Li <colyli@suse.de>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2831231d

bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct() · b40503ea

由 Zhai Zhaoxuan 提交于 1月 08, 2018

The function cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start a
closure. Then the function bio_complete() calls generic_end_io_acct() with
(struct search)->orig_bio->bi_disk when the closure has done.
Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is
called with a wrong device queue.

It causes the "inflight" (in struct hd_struct) counter keep increasing
without decreasing.

This patch fix the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.
Signed-off-by: NZhai Zhaoxuan <kxuanobj@gmail.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NColy Li <colyli@suse.de>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b40503ea

bcache: mark closure_sync() __sched · ce439bf7

由 Kent Overstreet 提交于 1月 08, 2018

[edit by mlyle: include sched/debug.h to get __sched]
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ce439bf7

bcache: Fix, improve efficiency of closure_sync() · e4bf7919

由 Kent Overstreet 提交于 1月 08, 2018

Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state.  Changed __set_current_state to
set_current_state per Jens review comment]
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e4bf7919

bcache: allow quick writeback when backing idle · b1092c9a

由 Michael Lyle 提交于 1月 08, 2018

If the control system would wait for at least half a second, and there's
been no reqs hitting the backing disk for awhile: use an alternate mode
where we have at most one contiguous set of writebacks in flight at a
time. (But don't otherwise delay).  If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight.  But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Acked-by: NColy Li <colyli@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b1092c9a

bcache: writeback: properly order backing device IO · 6e6ccc67

由 Michael Lyle 提交于 1月 08, 2018

Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
Tested-by: NTang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6e6ccc67

bcache: fix wrong return value in bch_debug_init() · 539d39eb

由 Tang Junhui 提交于 1月 08, 2018

in bch_debug_init(), ret is always 0, and the return value is useless,
change it to return 0 if be success after calling debugfs_create_dir(),
else return a non-zero value.
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

539d39eb

bcache: segregate flash only volume write streams · 4eca1cb2

由 Tang Junhui 提交于 1月 08, 2018

In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4eca1cb2

bcache: Use PTR_ERR_OR_ZERO() · 9d134117

由 Vasyl Gomonovych 提交于 1月 08, 2018

Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: NVasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9d134117

bcache: stop writeback thread after detaching · 8d29c442

由 Tang Junhui 提交于 1月 08, 2018

Currently, when a cached device detaching from cache, writeback thread is
not stopped, and writeback_rate_update work is not canceled. For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops writeback thread and cancels writeback_rate_update work
when cached device detaching from cache.

Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]
Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8d29c442

bcache: ret IOERR when read meets metadata error · b221fc13

由 Rui Hua 提交于 1月 08, 2018

The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0.  In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:

1. Because the btree is not totally cached in memery, we maybe get error
   when read btree node from cache device (see bch_btree_node_get()), the
   likely errno is -EIO, -ENOMEM

2. When read miss happens, bch_btree_insert_check_key() will be called to
   insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
   doing preparatory work before insert the missed data to cache device),
   a failure can also happen in this situation, the likely errno is
   -ENOMEM

bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean.  And
after that happened, s->iop.status is still its initially value(0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]
Signed-off-by: NHua Rui <huarui.dev@gmail.com>
Reviewed-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NMichael Lyle <mlyle@lyle.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b221fc13

Merge branch 'nvme-4.16' of git://git.infradead.org/nvme into for-4.16/block · 550203e6

由 Jens Axboe 提交于 1月 08, 2018

Pull NVMe fixes from Christoph:

"Below are the pending nvme updates for Linux 4.16. Just fixes and
 cleanups from various contributors this time around."

550203e6