- 29 5月, 2018 1 次提交
-
-
由 Keith Busch 提交于
This patch simplifies the timeout handling by relying on the request reference counting to ensure the iterator is operating on an inflight and truly timed out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on it: a driver completing a request is allowed to proceed to the next state without additional syncronization with the block layer. This also removes any need for generation sequence numbers since the request lifetime is prevented from being reallocated as a new sequence while timeout handling is operating on it. To enables this a refcount is added to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until both the timeout handler and the completion are done with it. Signed-off-by: NKeith Busch <keith.busch@intel.com> [hch: slight cleanups, added back submission side hctx lock, use cmpxchg for completions] Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 03 4月, 2018 1 次提交
-
-
由 Tejun Heo 提交于
Request abortion is performed by overriding deadline to now and scheduling timeout handling immediately. For the latter part, the code was using mod_timer(timeout, 0) which can't guarantee that the timer runs afterwards. Let's schedule the underlying work item directly instead. This fixes the hangs during probing reported by Sitsofe but it isn't yet clear to me how the failure can happen reliably if it's just the above described race condition. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NSitsofe Wheeler <sitsofe@gmail.com> Reported-by: NMeelis Roos <mroos@linux.ee> Fixes: 358f70da ("blk-mq: make blk_abort_request() trigger timeout path") Cc: stable@vger.kernel.org # v4.16 Link: http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8NyOF-EMA@mail.gmail.com Link: http://lkml.kernel.org/r/alpine.LRH.2.21.1802261049140.4893@math.ut.eeSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 09 3月, 2018 1 次提交
-
-
由 Bart Van Assche 提交于
Introduce functions that modify the queue flags and that protect these modifications with the request queue lock. Except for moving one wake_up_all() call from inside to outside a critical section, this patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 11 1月, 2018 1 次提交
-
-
由 Jens Axboe 提交于
We reduce the resolution of request expiry, but since we're already using jiffies for this where resolution depends on the kernel configuration and since the timeout resolution is coarse anyway, that should be fine. Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 10 1月, 2018 3 次提交
-
-
由 Tejun Heo 提交于
After the recent updates to use generation number and state based synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except to avoid firing the same timeout multiple times. Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple times. This removes atomic bitops from hot paths too. v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out(). v3: Added RQF_MQ_TIMEOUT_EXPIRED flag. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion. There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly. This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq. Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure. It'd be great if people who know the drivers take a look. v2: - Add comment explaining the lack of synchronization around ->deadline update as requested by Bart. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Asai Thambi SP <asamymuthupa@micron.com> Cc: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently, blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes. There's a complex dancing around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from N-2'th iteration, deadline from N-1'th, blk_mark_rq_complete() against Nth instance. In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce 5 sec delay before time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either. This patch replaces the broken synchronization mechanism with a RCU and generation number based one. 1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently. 2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does RCU wait. 3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion. 4. The timeout path, after RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race or the completion will yield to it and can safely timeout the request. While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance. While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths. Note that this patch adds a comment explaining a race condition in BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it. It's just documenting the existing race. v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao. - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter. - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter. v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter. - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed. v4: - Rebased on top of hctx_lock() refactoring patch. - Added comment explaining the use of hctx_lock() in completion path. v5: - Added comments requested by Bart. - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 10月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
Make sure that if the timeout timer fires after a queue has been marked "dying" that the affected requests are finished. Reported-by: Nchenxiang (M) <chenxiang66@hisilicon.com> Fixes: commit 287922eb ("block: defer timeouts to a workqueue") Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com> Tested-by: Nchenxiang (M) <chenxiang66@hisilicon.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 05 10月, 2017 1 次提交
-
-
由 Peter Zijlstra 提交于
Attempt to untangle the ordering in blk-mq. The patch introducing the single smp_mb__before_atomic() is obviously broken in that it doesn't clearly specify a pairing barrier and an obtained guarantee. The comment is further misleading in that it hints that the deadline store and the COMPLETE store also need to be ordered, but AFAICT there is no such dependency. However what does appear to be important is the clear happening _after_ the store, and that worked by pure accident. This clarifies blk_mq_start_request() -- we should not get there with STARTING set -- this simplifies the code and makes the barrier usage sane (the old code could be read to allow not having _any_ atomic after the barrier, in which case the barrier hasn't got anything to order). We then also introduce the missing pairing barrier for it. Also down-grade the barrier to smp_wmb(), this is cheaper for PowerPC/ARM and doesn't cost anything extra on x86. And it documents the STARTING vs COMPLETE ordering. Although I've not been entirely successful in reverse engineering the blk-mq state machine so there might still be more funnies around timeout vs requeue. If I got anything wrong, feel free to educate me by adding comments to clarify things ;-) Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Fixes: 538b7534 ("blk-mq: request deadline must be visible before marking rq as started") Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 6月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
Instead of documenting the locking assumptions of most block layer functions as a comment, use lockdep_assert_held() to verify locking assumptions at runtime. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 4月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Acked-by: NRoger Pau Monné <roger.pau@citrix.com> Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 12月, 2015 2 次提交
-
-
由 Christoph Hellwig 提交于
This was added for the 'magic' AEN requests in the NVMe driver that never return. We now handle them purely inside the driver and don't need this core hack any more. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Timer context is not very useful for drivers to perform any meaningful abort action from. So instead of calling the driver from this useless context defer it to a workqueue as soon as possible. Note that while a delayed_work item would seem the right thing here I didn't dare to use it due to the magic in blk_add_timer that pokes deep into timer internals. But maybe this encourages Tejun to add a sensible API for that to the workqueue API and we'll all be fine in the end :) Contains a major update from Keith Bush: "This patch removes synchronizing the timeout work so that the timer can start a freeze on its own queue. The timer enters the queue, so timer context can only start a freeze, but not wait for frozen." Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 11月, 2015 2 次提交
-
-
由 Jens Axboe 提交于
Just a comment update on not needing queue_lock, and that we aren't really adding the request to a timeout list for !mq. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
We only added the request to the request list for the !blk-mq case, so we should only delete it in that case as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 1月, 2015 1 次提交
-
-
由 Keith Busch 提交于
Some types of requests may be started that are not gauranteed to ever complete. This adds a request flag that a driver can use so mark the request as such. Signed-off-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 9月, 2014 3 次提交
-
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Commit 8cb34819cdd5d(blk-mq: unshared timeout handler) introduces blk-mq's own timeout handler, and removes following line: blk_queue_rq_timed_out(q, blk_mq_rq_timed_out); which then causes blk_add_timer() to bypass adding the timer, since blk-mq no longer has q->rq_timed_out_fn defined. This patch fixes the problem by bypassing the check for blk-mq, so that both request deadlines are still set and the rolling timer updated. Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Duplicate the (small) timeout handler in blk-mq so that we can pass arguments more easily to the driver timeout handler. This enables the next patch. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 31 5月, 2014 1 次提交
-
-
由 Jens Axboe 提交于
Commit f793aa53 relaxed the timer addition a little too much. If the timer isn't pending, we always need to add it. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 14 5月, 2014 1 次提交
-
-
由 Jens Axboe 提交于
This adds support for active queue tracking, meaning that the blk-mq tagging maintains a count of active users of a tag set. This allows us to maintain a notion of fairness between users, so that we can distribute the tag depth evenly without starving some users while allowing others to try unfair deep queues. If sharing of a tag set is detected, each hardware queue will track the depth of its own queue. And if this exceeds the total depth divided by the number of active queues, the user is actively throttled down. The active queue count is done lazily to avoid bouncing that data between submitter and completer. Each hardware queue gets marked active when it allocates its first tag, and gets marked inactive when 1) the last tag is cleared, and 2) the queue timeout grace period has passed. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 4月, 2014 1 次提交
-
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 24 4月, 2014 1 次提交
-
-
由 Jens Axboe 提交于
If a requeue event races with a timeout, we can get into the situation where we attempt to complete a request from the timeout handler when it's not start anymore. This causes a crash. So have the timeout handler check that REQ_ATOM_STARTED is still set on the request - if not, we ignore the event. If this happens, the request has now been marked as complete. As a consequence, we need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(), as to maintain proper request state. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 17 4月, 2014 1 次提交
-
-
由 Jens Axboe 提交于
Since we are now, by default, applying timer slack to expiry times, the logic for when to modify a timer in the block code is suboptimal. The block layer keeps a forward rolling timer per queue for all requests, and modifies this timer if a request has a shorter timeout than what the current expiry time is. However, this breaks down when our rounded timer values get applied slack. Then each new request ends up modifying the timer, since we're still a little in front of the timer + slack. Fix this by allowing a tolerance of HZ / 2, the timeout handling doesn't need to be very precise. This drastically cuts down the number of timer modifications we have to make. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 11 2月, 2014 1 次提交
-
-
由 Christoph Hellwig 提交于
Rework I/O completions to work more like the old code path. blk_mq_end_io now stays out of the business of deferring completions to others CPUs and calling blk_mark_rq_complete. The latter is very important to allow completing requests that have timed out and thus are already marked completed, the former allows using the IPI callout even for driver specific completions instead of having to reimplement them. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 09 11月, 2013 1 次提交
-
-
由 Duan Jiong 提交于
This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 11月, 2013 1 次提交
-
-
由 Jeff Moyer 提交于
crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461 RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP: 0018:ffff881057eefd60 EFLAGS: 00010012 RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8 RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780 RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338 R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000 FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540) Stack: 0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000 <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393 <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90 Call Trace: [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150 [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850 [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20 [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160 [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660 [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660 [<ffffffff810908c6>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff81090830>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20 Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP <ffff881057eefd60> The RIP is this line: BUG_ON(blk_queued_rq(rq)); After digging through the code, I think there may be a race between the request completion and the timer handler running. A timer is started for each request put on the device's queue (see blk_start_request->blk_add_timer). If the request does not complete before the timer expires, the timer handler (blk_rq_timed_out_timer) will mark the request complete atomically: static inline int blk_mark_rq_complete(struct request *rq) { return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags); } and then call blk_rq_timed_out. The latter function will call scsi_times_out, which will return one of BLK_EH_HANDLED, BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is returned, blk_clear_rq_complete is called, and blk_add_timer is again called to simply wait longer for the request to complete. Now, if the request happens to complete while this is going on, what happens? Given that we know the completion handler will bail if it finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion handler running after that bit is cleared. So, from the above paragraph, after the call to blk_clear_rq_complete. If the completion sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there (I haven't seen this in the cores). Next, if we get the completion before the call to list_add_tail, then the timer will eventually fire for an old req, which may either be freed or reallocated (there is evidence that this might be the case). Finally, if the completion comes in *after* the addition to the timeout list, I think it's harmless. The request will be removed from the timeout list, req_atom_complete will be set, and all will be well. This will only actually explain the coredumps *IF* the request structure was freed, reallocated *and* queued before the error handler thread had a chance to process it. That is possible, but it may make sense to keep digging for another race. I think that if this is what was happening, we would see other instances of this problem showing up as null pointer or garbage pointer dereferences, for example when the request structure was not re-used. It looks like we actually do run into that situation in other reports. This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); from blk_add_timer to the only caller that could trip over it (blk_start_request). It then inverts the calls to blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address the race. I've boot tested this patch, but nothing more. Signed-off-by: NJeff Moyer <jmoyer@redhat.com> Acked-by: NHannes Reinecke <hare@suse.de> Cc: stable@kernel.org Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 10月, 2013 1 次提交
-
-
由 Jens Axboe 提交于
Linux currently has two models for block devices: - The classic request_fn based approach, where drivers use struct request units for IO. The block layer provides various helper functionalities to let drivers share code, things like tag management, timeout handling, queueing, etc. - The "stacked" approach, where a driver squeezes in between the block layer and IO submitter. Since this bypasses the IO stack, driver generally have to manage everything themselves. With drivers being written for new high IOPS devices, the classic request_fn based driver doesn't work well enough. The design dates back to when both SMP and high IOPS was rare. It has problems with scaling to bigger machines, and runs into scaling issues even on smaller machines when you have IOPS in the hundreds of thousands per device. The stacked approach is then most often selected as the model for the driver. But this means that everybody has to re-invent everything, and along with that we get all the problems again that the shared approach solved. This commit introduces blk-mq, block multi queue support. The design is centered around per-cpu queues for queueing IO, which then funnel down into x number of hardware submission queues. We might have a 1:1 mapping between the two, or it might be an N:M mapping. That all depends on what the hardware supports. blk-mq provides various helper functions, which include: - Scalable support for request tagging. Most devices need to be able to uniquely identify a request both in the driver and to the hardware. The tagging uses per-cpu caches for freed tags, to enable cache hot reuse. - Timeout handling without tracking request on a per-device basis. Basically the driver should be able to get a notification, if a request happens to fail. - Optional support for non 1:1 mappings between issue and submission queues. blk-mq can redirect IO completions to the desired location. - Support for per-request payloads. Drivers almost always need to associate a request structure with some driver private command structure. Drivers can tell blk-mq this at init time, and then any request handed to the driver will have the required size of memory associated with it. - Support for merging of IO, and plugging. The stacked model gets neither of these. Even for high IOPS devices, merging sequential IO reduces per-command overhead and thus increases bandwidth. For now, this is provided as a potential 3rd queueing model, with the hope being that, as it matures, it can replace both the classic and stacked model. That would get us back to having just 1 real model for block devices, leaving the stacked approach to dm/md devices (as it was originally intended). Contributions in this patch from the following people: Shaohua Li <shli@fusionio.com> Alexander Gordeev <agordeev@redhat.com> Christoph Hellwig <hch@infradead.org> Mike Christie <michaelc@cs.wisc.edu> Matias Bjorling <m@bjorling.me> Jeff Moyer <jmoyer@redhat.com> Acked-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 7月, 2013 1 次提交
-
-
由 Hannes Reinecke 提交于
rq_timed_out_fn might have been unset while the request was in flight, so we need to check for it in blk_rq_timed_out(). Acked-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NStefan Weinhuber <wein@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 15 6月, 2012 1 次提交
-
-
由 Asias He 提交于
This function was only used by btrfs code in btrfs_abort_devices() (seems in a wrong way). It was removed in commit d07eb911, So, Let's remove the dead code to avoid any confusion. Changes in v2: update commit log, btrfs_abort_devices() was removed already. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-kernel@vger.kernel.org Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Cc: David Sterba <dave@jikos.cz> Signed-off-by: NAsias He <asias@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 04 8月, 2011 1 次提交
-
-
由 Akinobu Mita 提交于
init_fault_attr_dentries() is used to export fault_attr via debugfs. But it can only export it in debugfs root directory. Per Forlin is working on mmc_fail_request which adds support to inject data errors after a completed host transfer in MMC subsystem. The fault_attr for mmc_fail_request should be defined per mmc host and export it in debugfs directory per mmc host like /sys/kernel/debug/mmc0/mmc_fail_request. init_fault_attr_dentries() doesn't help for mmc_fail_request. So this introduces fault_create_debugfs_attr() which is able to create a directory in the arbitrary directory and replace init_fault_attr_dentries(). [akpm@linux-foundation.org: extraneous semicolon, per Randy] Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Tested-by: NPer Forlin <per.forlin@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 4月, 2010 1 次提交
-
-
由 Richard Kennedy 提交于
blk_rq_timed_out_timer() relied on blk_add_timer() never returning a timer value of zero, but commit 7838c15b removed the code that bumped this value when it was zero. Therefore when jiffies is near wrap we could get unlucky & not set the timeout value correctly. This patch uses a flag to indicate that the timeout value was set and so handles jiffies wrap correctly, and it keeps all the logic in one function so should be easier to maintain in the future. Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk> Cc: stable@kernel.org Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 15 4月, 2010 1 次提交
-
-
由 Richard Kennedy 提交于
blk_rq_timed_out_timer() relied on blk_add_timer() never returning a timer value of zero, but commit 7838c15b removed the code that bumped this value when it was zero. Therefore when jiffies is near wrap we could get unlucky & not set the timeout value correctly. This patch uses a flag to indicate that the timeout value was set and so handles jiffies wrap correctly, and it keeps all the logic in one function so should be easier to maintain in the future. Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 28 4月, 2009 1 次提交
-
-
由 Tejun Heo 提交于
* In blk_rq_timed_out_timer(), else { if } to else if * In blk_add_timer(), simplify if/else block [ Impact: cleanup ] Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 24 4月, 2009 1 次提交
-
-
由 Hannes Reinecke 提交于
Very rarely under stress testing of dm, oopses are occuring as something tampers with an old stack frame. This has been traced back to blk_abort_queue() leaving a timeout_list pointing to the stack. The reason is that sometimes blk_abort_request() won't delete the timer (if the request is marked as complete but before the timer has been removed, a small race window). Fix this by splicing back from the ususally empty list to the q->timeout_list. Signed-off-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 22 4月, 2009 1 次提交
-
-
由 Jens Axboe 提交于
There's nothing to do for those devices, since the timeout handling is based on requests. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 18 2月, 2009 1 次提交
-
-
由 Hannes Reinecke 提交于
blk_abort_queue() iterates the timeout list and aborts each request on the list, but if the driver error handling readds a request to the timeout list during this processing, we could be looping forever. Fix this by splicing current entries to a local list and run over that list instead. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 29 12月, 2008 3 次提交
-
-
由 Jens Axboe 提交于
For sync IO, we'll often do them serialized. This means we'll be touching the queue timer for every IO, as opposed to only occasionally like we do for queued IO. Instead of deleting the timer when the last request is removed, just let continue running. If a new request comes up soon we then don't have to readd the timer again. If no new requests arrive, the timer will expire without side effect later. This improves high iops sync IO by ~1%. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 malahal@us.ibm.com 提交于
Now the rq->deadline can't be zero if the request is in the timeout_list, so there is no need to have next_set. There is no need to access a request's deadline field if blk_rq_timed_out is called on it. Signed-off-by: NMalahal Naineni <malahal@us.ibm.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-