- 12 11月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
A previous commit changed this to pass in the hardware queue, but it was using the wrong hardware queue. Hence a request that was allocated on one hardware queue ended up being issued on another one, and that caused IO timeouts and oopses on some drivers. Since the request holds hardware queue private resources, like a tag, we can't just issue it on a different hardware queue. Fixes: 2253efc8 ("blk-mq: Move more code into blk_mq_direct_issue_request()") Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 11 11月, 2016 2 次提交
-
-
由 Jens Axboe 提交于
Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 07 11月, 2016 1 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
Commit 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU") attempts to avoid triggering the WARN_ON in __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the last batch execution before round robin, blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu to the next alive CPU in the mask, which will trigger the WARN_ON despite the previous workaround. The following patch fixes this scenario by always scheduling the value in hctx->next_cpu. This changes the moment when we round-robin the CPU running the hctx, but it really doesn't matter, since it still executes BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU. Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: NGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 04 11月, 2016 1 次提交
-
-
由 Shaohua Li 提交于
This is corresponding part for blk-mq. Disk with multiple hardware queues doesn't need this as we only hold 1 request at most. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 11月, 2016 8 次提交
-
-
由 Bart Van Assche 提交于
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations have finished. This function does *not* wait until all outstanding requests have finished (this means invocation of request.end_io()). The algorithm used by blk_mq_quiesce_queue() is as follows: * Hold either an RCU read lock or an SRCU read lock around .queue_rq() calls. The former is used if .queue_rq() does not block and the latter if .queue_rq() may block. * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues() followed by synchronize_srcu() or synchronize_rcu(). The latter call waits for .queue_rq() invocations that started before blk_mq_quiesce_queue() was called. * The blk_mq_hctx_stopped() calls that control whether or not .queue_rq() will be called are called with the (S)RCU read lock held. This is necessary to avoid race conditions against blk_mq_quiesce_queue(). Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Since blk_mq_requeue_work() starts stopped queues and since execution of this function can be scheduled after a queue has been stopped it is not possible to stop queues without using an additional state variable to track whether or not the queue has been stopped. Hence modify blk_mq_requeue_work() such that it does not start stopped queues. My conclusion after a review of the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list() callers is as follows: * In the dm driver starting and stopping queues should only happen if __dm_suspend() or __dm_resume() is called and not if the requeue list is processed. * In the SCSI core queue stopping and starting should only be performed by the scsi_internal_device_block() and scsi_internal_device_unblock() functions but not by any other function. Although the blk_mq_stop_hw_queue() call in scsi_queue_rq() may help to reduce CPU load if a LLD queue is full, figuring out whether or not a queue should be restarted when requeueing a command would require to introduce additional locking in scsi_mq_requeue_cmd() to avoid a race with scsi_internal_device_block(). Avoid this complexity by removing the blk_mq_stop_hw_queue() call from scsi_queue_rq(). * In the NVMe core only the functions that call blk_mq_start_stopped_hw_queues() explicitly should start stopped queues. * A blk_mq_start_stopped_hwqueues() call must be added in the xen-blkfront driver in its blkif_recover() function. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Move the "hctx stopped" test and the insert request calls into blk_mq_direct_issue_request(). Rename that function into blk_mq_try_issue_directly() to reflect its new semantics. Pass the hctx pointer to that function instead of looking it up a second time. These changes avoid that code has to be duplicated in the next patch. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
The function blk_queue_stopped() allows to test whether or not a traditional request queue has been stopped. Introduce a helper function that allows block drivers to query easily whether or not one or more hardware contexts of a blk-mq queue have been stopped. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Multiple functions test the BLK_MQ_S_STOPPED bit so introduce a helper function that performs this test. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
The meaning of the BLK_MQ_S_STOPPED flag is "do not call .queue_rq()". Hence modify blk_mq_make_request() such that requests are queued instead of issued if a queue has been stopped. Reported-by: NMing Lei <tom.leiming@gmail.com> Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 28 10月, 2016 4 次提交
-
-
由 Christoph Hellwig 提交于
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NShaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We can just use struct blk_mq_alloc_data - it has a few more members, but we allocate it further down the stack anyway. So this cleans up the code, and reduces the stack overhead a bit. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
If we end up sleeping due to running out of requests, we should update the hardware and software queues in the map ctx structure. Otherwise we could end up having rq->mq_ctx point to the pre-sleep context, and risk corrupting ctx->rq_list since we'll be grabbing the wrong lock when inserting the request. Reported-by: NDave Jones <davej@codemonkey.org.uk> Reported-by: NChris Mason <clm@fb.com> Tested-by: NChris Mason <clm@fb.com> Fixes: 63581af3 ("blk-mq: remove non-blocking pass in blk_mq_map_request") Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 24 9月, 2016 1 次提交
-
-
由 Christoph Hellwig 提交于
This provides the caller a feedback that a given hctx is not mapped and thus no command can be sent on it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Tested-by: NSteve Wise <swise@opengridcomputing.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 9月, 2016 3 次提交
-
-
The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing a copy. Since we might never had an online callback we could end up with a ZERO mask which in turn leads to crash as test robot demonstarted. Fixes: 65d5291e ("blk-mq: Convert to new hotplug state machine") Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its ->queue_rq() handler. For that case, blk-mq ensures that we always calls it from a safe context. Signed-off-by: NJens Axboe <axboe@fb.com> Tested-by: NJosef Bacik <jbacik@fb.com>
-
由 Christoph Hellwig 提交于
bt_get already does a non-blocking pass as well as running the queue when scheduling internally, no need to duplicate it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 9月, 2016 3 次提交
-
-
由 Jens Axboe 提交于
Two cases: 1) blk_mq_alloc_request() needlessly re-runs the queue, after calling into the tag allocation without NOWAIT set. We don't need to do that. 2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with the async flag set to false. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
Install the callbacks via the state machine so we can phase out the cpu hotplug notifiers mess. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Thomas Gleixner 提交于
Replace the block-mq notifier list management with the multi instance facility in the cpu hotplug state machine. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 17 9月, 2016 3 次提交
-
-
由 Omar Sandoval 提交于
Allocating your own per-cpu allocation hint separately makes for an awkward API. Instead, allocate the per-cpu hint as part of the struct sbitmap_queue. There's no point for a struct sbitmap_queue without the cache, but you can still use a bare struct sbitmap. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
This is a generally useful data structure, so make it available to anyone else who might want to use it. It's also a nice cleanup separating the allocation logic from the rest of the tag handling logic. The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only selected by CONFIG_BLOCK for now. This should be a complete noop functionality-wise. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We currently account a '0' dispatch, and anything above that still falls below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more, we don't account it. Change the last bucket to be inclusive of anything above the range we track, and have the sysfs file reflect that by including a '+' in the output: $ cat /sys/block/nvme0n1/mq/0/dispatched 0 1006 1 20229 2 1 4 0 8 0 16 0 32+ 0 Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NOmar Sandoval <osandov@fb.com>
-
- 15 9月, 2016 7 次提交
-
-
由 Jens Axboe 提交于
Fixes 1b157939 ("blk-mq: get rid of the cpumask in struct blk_mq_tags") Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Unused now that NVMe sets up irq affinity before calling into blk-mq. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
This allows drivers specify their own queue mapping by overriding the setup-time function that builds the mq_map. This can be used for example to build the map based on the MSI-X vector mapping provided by the core interrupt layer for PCI devices. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
The mapping is identical for all queues in a tag_set, so stop wasting memory for building multiple. Note that for now I've kept the mq_map pointer in the request_queue, but we'll need to investigate if we can remove it without suffering too much from the additional pointer chasing. The same would apply to the mq_ops pointer as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Currently blk-mq will totally remap hardware context when a CPU hotplug even happened, which causes major havoc for drivers, as they are never told about this remapping. E.g. any carefully sorted out CPU affinity will just be completely messed up. The rebuild also doesn't really help for the common case of cpu hotplug, which is soft onlining / offlining of cpus - in this case we should just leave the queue and irq mapping as is. If it actually worked it would have helped in the case of physical cpu hotplug, although for that we'd need a way to actually notify the driver. Note that drivers may already be able to accommodate such a topology change on their own, e.g. using the reset_controller sysfs file in NVMe will cause the driver to get things right for this case. With the rebuild removed we will simplify retain the queue mapping for a soft offlined CPU that will work when it comes back online, and will map any newly onlined CPU to queue 0 until the driver initiates a rebuild of the queue map. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Mike Snitzer 提交于
blk_mq_delay_kick_requeue_list() provides the ability to kick the q->requeue_list after a specified time. To do this the request_queue's 'requeue_work' member was changed to a delayed_work. blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued requests while it doesn't make sense to immediately requeue them (e.g. when all paths in a DM multipath have failed). Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 29 8月, 2016 2 次提交
-
-
由 Jens Axboe 提交于
When drivers or the core calls this function, they usually dereference the request shortly there after. Prefetch the first cache line. Profiling IO workloads shows that this is the most common cache miss on the block side of things. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We don't need the larger delayed work struct, since we always run it immediately. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 8月, 2016 2 次提交
-
-
由 Jens Axboe 提交于
__blk_mq_run_hw_queue() currently warns if we are running the queue on a CPU that isn't set in its mask. However, this can happen if a CPU is being offlined, and the workqueue handling will place the work on CPU0 instead. Improve the warning so that it only triggers if the batch cpu in the hardware queue is currently online. If it triggers for that case, then it's indicative of a flow problem in blk-mq, so we want to retain it for that case. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We do this in a few places, if the CPU is offline. This isn't allowed, though, since on multi queue hardware, we can't just move a request from one software queue to another, if they map to different hardware queues. The request and tag isn't valid on another hardware queue. This can happen if plugging races with CPU offlining. But it does no harm, since it can only happen in the window where we are currently busy freezing the queue and flushing IO, in preparation for redoing the software <-> hardware queue mappings. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 8月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 05 8月, 2016 1 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
In case a submitted request gets stuck for some reason, the block layer can prevent the request starvation by starting the scheduled timeout work. If this stuck request occurs at the same time another thread has started a queue freeze, the blk_mq_timeout_work will not be able to acquire the queue reference and will return silently, thus not issuing the timeout. But since the request is already holding a q_usage_counter reference and is unable to complete, it will never release its reference, preventing the queue from completing the freeze started by first thread. This puts the request_queue in a hung state, forever waiting for the freeze completion. This was observed while running IO to a NVMe device at the same time we toggled the CPU hotplug code. Eventually, once a request got stuck requiring a timeout during a queue freeze, we saw the CPU Hotplug notification code get stuck inside blk_mq_freeze_queue_wait, as shown in the trace below. [c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable) [c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350 [c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990 [c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0 [c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110 [c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0 [c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100 [c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0 [c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400 [c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0 [c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50 [c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140 [c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0 [c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0 [c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0 [c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200 [c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0 [c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230 [c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110 [c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4 The fix is to allow the timeout work to execute in the window between dropping the initial refcount reference and the release of the last reference, which actually marks the freeze completion. This can be achieved with percpu_refcount_tryget, which does not require the counter to be alive. This way the timeout work can do it's job and terminate a stuck request even during a freeze, returning its reference and avoiding the deadlock. Allowing the timeout to run is just a part of the fix, since for some devices, we might get stuck again inside the device driver's timeout handler, should it attempt to allocate a new request in that path - which is a quite common action for Abort commands, which need to be sent after a timeout. In NVMe, for instance, we call blk_mq_alloc_request from inside the timeout handler, which will fail during a freeze, since it also tries to acquire a queue reference. I considered a similar change to blk_mq_alloc_request as a generic solution for further device driver hangs, but we can't do that, since it would allow new requests to disturb the freeze process. I thought about creating a new function in the block layer to support unfreezable requests for these occasions, but after working on it for a while, I feel like this should be handled in a per-driver basis. I'm now experimenting with changes to the NVMe timeout path, but I'm open to suggestions of ways to make this generic. Signed-off-by: NGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Keith Busch <keith.busch@intel.com> Cc: linux-nvme@lists.infradead.org Cc: linux-block@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-