提交 · 3deff1a70d5901342f460f8cc36e5d0c5d51c319 · openanolis / cloud-kernel

09 4月, 2017 5 次提交

md: support REQ_OP_WRITE_ZEROES · 3deff1a7

由 Christoph Hellwig 提交于 4月 05, 2017

Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

3deff1a7

sd: implement REQ_OP_WRITE_ZEROES · 02d26103

由 Christoph Hellwig 提交于 4月 05, 2017

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

02d26103

block: implement splitting of REQ_OP_WRITE_ZEROES bios · 885fa13f

由 Christoph Hellwig 提交于 4月 05, 2017

Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

885fa13f

block: renumber REQ_OP_WRITE_ZEROES · 1d62ac13

由 Christoph Hellwig 提交于 4月 05, 2017

Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

1d62ac13

sd: split sd_setup_discard_cmnd · 81d926e8

由 Christoph Hellwig 提交于 4月 05, 2017

Split sd_setup_discard_cmnd into one function per provisioning type.  While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

81d926e8

08 4月, 2017 11 次提交

block: sed-opal: Tone down all the pr_* to debugs · 591c59d1

由 Scott Bauer 提交于 4月 07, 2017

Lets not flood the kernel log with messages unless
the user requests so.
Signed-off-by: NScott Bauer <scott.bauer@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

591c59d1

blk-mq: Clarify comments in blk_mq_dispatch_rq_list() · 710c785f

由 Bart Van Assche 提交于 4月 07, 2017

The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

710c785f

blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list · 705cda97

由 Bart Van Assche 提交于 4月 07, 2017

Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

705cda97

O
blk-mq: use true instead of 1 for blk_mq_queue_data.last · d945a365
由 Omar Sandoval 提交于 4月 05, 2017
```
Trivial cleanup.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>
```
d945a365

blk-mq: make driver tag failure path easier to follow · 807b1041

由 Omar Sandoval 提交于 4月 05, 2017

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

807b1041

blk-mq-sched: provide hooks for initializing hardware queue data · ee056f98

由 Omar Sandoval 提交于 4月 05, 2017

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ee056f98

Merge branch 'for-linus' into for-4.12/block · 65f619d2

由 Jens Axboe 提交于 4月 07, 2017

We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: NJens Axboe <axboe@fb.com>

65f619d2

blk-mq: Restart a single queue if tag sets are shared · 6d8c6c0f

由 Bart Van Assche 提交于 4月 07, 2017

To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6d8c6c0f

dm rq: Avoid that request processing stalls sporadically · 6077c2d7

由 Bart Van Assche 提交于 4月 07, 2017

While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: NJens Axboe <axboe@fb.com>

6077c2d7

scsi: Avoid that SCSI queues get stuck · 36e3cf27

由 Bart Van Assche 提交于 4月 07, 2017

If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued again successfully.

commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
that are intended to rerun a busy queue such that these examine all
hardware queues instead of only stopped queues.

Since no other functions than scsi_internal_device_block() and
scsi_internal_device_unblock() should ever stop or restart a SCSI
queue, change the blk_mq_delay_queue() call into a
blk_mq_delay_run_hw_queue() call.

Fixes: commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues")
Fixes: commit 7e79dadc ("blk-mq: stop hardware queue in blk_mq_delay_queue()")
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

36e3cf27

blk-mq: Introduce blk_mq_delay_run_hw_queue() · 7587a5ae

由 Bart Van Assche 提交于 4月 07, 2017

Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

7587a5ae

07 4月, 2017 7 次提交

block: trace completion of all bios. · fbbaf700

由 NeilBrown 提交于 4月 07, 2017

Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

fbbaf700

block: simple improvements for bio->flags · dbde775c

由 NeilBrown 提交于 4月 07, 2017

The comment for the 'flags' field of 'bio' mentions
"command" which is no longer stored there, and doesn't
mention the bvec pool number, which is.

BIO_RESET_BITS is set in such a way that it would need to be
updated if new bits were added, which is easy to miss.

BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
ranges from 0 to 6, so 3 bits are sufficient.

This patch make improvements in each of these areas.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

dbde775c

blk-mq: remap queues when adding/removing hardware queues · ebe8bddb

由 Omar Sandoval 提交于 4月 07, 2017

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011 changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ebe8bddb

blk-mq-sched: fix crash in switch error path · 54d5329d

由 Omar Sandoval 提交于 4月 07, 2017

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

54d5329d

blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632

由 Omar Sandoval 提交于 4月 05, 2017

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

93252632

blk-mq-sched: refactor scheduler initialization · 6917ff0b

由 Omar Sandoval 提交于 4月 05, 2017

Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6917ff0b

blk-mq: use the right hctx when getting a driver tag fails · 81380ca1

由 Omar Sandoval 提交于 4月 07, 2017

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

81380ca1

06 4月, 2017 6 次提交

block: move timeout field in struct request to pack better · 1dd5198b

由 Jens Axboe 提交于 4月 05, 2017

After commit 64c7f1d1, we went from 1 to 2 holes in my
test setup. If we move the timeout field a bit, we remove
both of those holes and shrink struct request by 8 bytes.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

1dd5198b

block, scsi: move the retries field to struct scsi_request · 64c7f1d1

由 Christoph Hellwig 提交于 4月 05, 2017

Instead of bloating the generic struct request with it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

64c7f1d1

nvme: move the retries count to struct nvme_request · 44e44b29

由 Christoph Hellwig 提交于 4月 05, 2017

The way NVMe uses this field is entirely different from the older
SCSI/BLOCK_PC usage, so move it into struct nvme_request.

Also reduce the size of the file to a unsigned char so that we leave
space for additional smaller fields that will appear soon.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

44e44b29

nvme: mark nvme_max_retries static · 83f3aeb3

由 Christoph Hellwig 提交于 4月 05, 2017

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

83f3aeb3

nvme: cleanup nvme_req_needs_retry · f6324b1b

由 Christoph Hellwig 提交于 4月 05, 2017

Don't pass the status explicitly but derive it from the requeust,
and unwind the complex condition to be more readable.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

f6324b1b

nvme: move ->retries setup to nvme_setup_cmd · 987f699a

由 Christoph Hellwig 提交于 4月 05, 2017

->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command.  We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

987f699a

05 4月, 2017 4 次提交

remove the obsolete hd driver · 8e14be53

由 Christoph Hellwig 提交于 4月 05, 2017

This driver is for pre-IDE hardisk that are only found in PC from the
stoneage of personal computing, and which we don't support elsewhere
in the kernel these days.

It's also been marked broken forever.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

8e14be53

blk-mq: Remove blk_mq_queue_data.list · f2fbc9dd

由 Bart Van Assche 提交于 4月 05, 2017

The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

f2fbc9dd

cfq: Disable writeback throttling by default · 142bbdfc

由 Jan Kara 提交于 4月 04, 2017

Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
postgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:

  8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
  8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
  8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
  8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
  8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
  8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
  8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
  8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
  8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
  8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
  8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
...
  8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
  8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
  8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
  8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
  8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
  8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
...
  8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
  8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
  8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
  8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
  8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
  8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
  8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
  8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
  8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
...
  8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
  8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
  8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
  8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
  8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
  8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
  8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1

So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.

Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
a read throughput which roughly matches readers/writers ratio and
overall the system stalls are much shorter.

Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

142bbdfc

block: fix inheriting request priority from bio · 85003a44

由 Adam Manzanares 提交于 4月 04, 2017

In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.

I have tested this patch with the null block device driver with the following
parameters.

null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1

I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.

I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAdam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

85003a44

04 4月, 2017 7 次提交

nvme: factor request completion code into a common helper · 77f02a7a

由 Christoph Hellwig 提交于 3月 30, 2017

This avoids duplicating the logic four times, and it also allows to keep
some helpers static in core.c or just opencode them.

Note that this loses printing the aborted status on completions in the
PCI driver as that uses a data structure not available any more.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

77f02a7a

nvme-fc: drop ctrl for all command completions · 4bca70d0

由 Christoph Hellwig 提交于 3月 30, 2017

A requeue means we go through nvme_fc_start_fcp_op again and get
another controller reference.  To make sure the refcount doesn't
leak we also need to drop it for every completion that came from
the LLDD.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

4bca70d0

nvme-fc: increment request retries counter before requeuing · f2cd54d3

由 Sagi Grimberg 提交于 3月 29, 2017

This way our max retry limit holds as well.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

f2cd54d3

nvme-loop: increment request retries counter before requeuing · 7d9a5e71

由 Sagi Grimberg 提交于 3月 29, 2017

This way our max retry limit holds as well.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

7d9a5e71

nvme-rdma: increment request retries counter before requeuing · e806666e

由 Sagi Grimberg 提交于 3月 29, 2017

This way our max retry limit holds as well.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

e806666e

nvme_fc: Clean up host fcpio done status handling · 62eeacb0

由 James Smart 提交于 3月 23, 2017

As Dan Carpenter pointed out: mixing 16-bit nvme status with 32-bit
error status from driver. Corrected comment on fcp request struct
status field, and converted done routine to explicitly set nvme status
codes for nvme status.
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

62eeacb0

nvmet_fc: Clear SG list to avoid double frees · c820ad4c

由 James Smart 提交于 3月 23, 2017

Clear SG list to avoid double frees of payload page list
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

c820ad4c

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功