- 28 3月, 2017 12 次提交
-
-
由 Shaohua Li 提交于
When cgroups all reach low limit, cgroups can dispatch more IO. This could make some cgroups dispatch more IO but others not, and even some cgroups could dispatch less IO than their low limit. For example, cg1 low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is 120M/s for the workload. Their bps could something like this: cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80 At T1, all cgroups reach low limit, so they can dispatch more IO later. Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2 will have bandwidth below its low limit at most time. The big problem here is we don't know the maximum bandwidth of the workload, so we can't make smart decision to avoid the situation. This patch makes cgroup bandwidth change smooth. After disk upgrades from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto their max limit immediately. Their bandwidth limit will be increased gradually to avoid above situation. So above example will became something like: cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80 -> 45/75 -> 22/98 In this way cgroups bandwidth will be above their limit in majority time, this still doesn't fully utilize disk bandwidth, but that's something we pay for sharing. Scale up is linear. The limit scales up 1/2 .low limit every throtl_slice after upgrade. The scale up will stop if the adjusted limit hits .max limit. Scale down is exponential. We cut the scale value half if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then fully downgrade the queue to LIMIT_LOW state. Note this doesn't completely avoid cgroup running under its low limit. The best way to guarantee cgroup doesn't run under its limit is to set max limit. For example, if we set cg1 max limit to 40, cg2 will never run under its low limit. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the cgroup is idle. When this happens, the cgroup doesn't hit its limit, so we can't move the state machine to higher level and all cgroups will be throttled to their lower limit, so we waste bandwidth. Detecting idle cgroup is hard. This patch handles a simple case, a cgroup doesn't dispatch any IO. We ignore such cgroup's limit, so other cgroups can use the bandwidth. Please note this will be replaced with a more sophisticated algorithm later, but this demonstrates the idea how we handle idle cgroups, so I leave it here. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
The throtl_slice is 100ms by default. This is a long time for SSD, a lot of IO can run. To make cgroups have smoother throughput, we choose a small value (20ms) for SSD. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
throtl_slice is important for blk-throttling. It's called slice internally but it really is a time window blk-throttling samples data. blk-throttling will make decision based on the samplings. An example is bandwidth measurement. A cgroup's bandwidth is measured in the time interval of throtl_slice. A small throtl_slice meanse cgroups have smoother throughput but burn more CPUs. It has 100ms default value, which is not appropriate for all disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes it tunable. Since throtl_slice isn't a time slice, the sysfs name 'throttle_sample_time' reflects its character better. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
cgroup could be throttled to a limit but when all cgroups cross high limit, queue enters a higher state and so the group should be throttled to a higher limit. It's possible the cgroup is sleeping because of throttle and other cgroups don't dispatch IO any more. In this case, nobody can trigger current downgrade/upgrade logic. To fix this issue, we could either set up a timer to wakeup the cgroup if other cgroups are idle or make sure this cgroup doesn't sleep too long. Setting up a timer means we must change the timer very frequently. This patch chooses the latter. Making cgroup sleep time not too big wouldn't change cgroup bps/iops, but could make it wakeup more frequently, which isn't a big issue because throtl_slice * 8 is already quite big. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
When queue state machine is in LIMIT_MAX state, but a cgroup is below its low limit for some time, the queue should be downgraded to lower state as one cgroup's low limit isn't met. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
When queue is in LIMIT_LOW state and all cgroups with low limit cross the bps/iops limitation, we will upgrade queue's state to LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit. If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed. For a cgroup hierarchy, there are two cases. Children has lower low limit than parent. Parent's low limit is meaningless. If children's bps/iops cross low limit, we can upgrade queue state. The other case is children has higher low limit than parent. Children's low limit is meaningless. As long as parent's bps/iops (which is a sum of childrens bps/iops) cross low limit, we can upgrade queue state. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
each queue will have a state machine. Initially queue is in LIMIT_LOW state, which means all cgroups will be throttled according to their low limit. After all cgroups with low limit cross the limit, the queue state gets upgraded to LIMIT_MAX state. For max limit, cgroup will use the limit configured by user. For low limit, cgroup will use the minimal value between low limit and max limit configured by user. If the minimal value is 0, which means the cgroup doesn't configure low limit, we will use max limit to throttle the cgroup and the cgroup is ready to upgrade to LIMIT_MAX Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
Add low limit for cgroup and corresponding cgroup interface. To be consistent with memcg, we allow users configure .low limit higher than .max limit. But the internal logic always assumes .low limit is lower than .max limit. So we add extra bps/iops_conf fields in throtl_grp for userspace configuration. Old bps/iops fields in throtl_grp will be the actual limit we use for throttling. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
As discussed in LSF, add configure option for the interface and mark it as experimental, so people can try/test. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
We are going to support low/max limit, each cgroup will have 2 limits after that. This patch prepares for the multiple limits change. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
clean up the code to avoid using -1 Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 3月, 2017 3 次提交
-
-
由 Eric Biggers 提交于
blk_integrity_profile's are never modified, so mark them 'const' so that they are placed in .rodata and benefit from memory protection. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Eric Biggers 提交于
blkdev_issue_flush() is now always synchronous, and it no longer has a flags argument. So remove the part of the comment about the WAIT flag. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Eric Biggers 提交于
BLKDEV_IFL_* flags no longer exist; blkdev_issue_discard() now actually takes BLKDEV_DISCARD_* flags. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 3月, 2017 7 次提交
-
-
由 Dan Carpenter 提交于
There isn't a bug here, but Smatch is not smart enough to know that "nr_iovecs" can't be negative so it complains about underflows. Really, it's slightly cleaner to make this parameter unsigned. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Turn the different ways of merging or issuing I/O into a series of if/else statements instead of the current maze of gotos. Note that this means we pin the CPU a little longer for some cases as the CTX put is moved to common code at the end of the function. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Now that we have a nice direct issue heper this helps simplifying the code a bit, and also gets rid of the old_rq variable. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a new wrapper that takes care of RCU / SRCU locking to avoid having boileplate code in the caller which would get duplicated with new callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
They are mostly the same code anyway - this just one small conditional for the plug case that is different for both variants. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
This flag was never used since it was introduced. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jan Kara 提交于
When device open races with device shutdown, we can get the following oops in scsi_disk_get(): [11863.044351] general protection fault: 0000 [#1] SMP [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop] [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35 [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000 [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70 [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202 [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000 [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880 [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001 [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d [11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000 [11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0 [11863.059217] Call Trace: [11863.059217] ? disk_get_part+0x22/0x1f0 [11863.059217] sd_open+0x39/0x130 [11863.059217] __blkdev_get+0x69/0x430 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? bd_acquire+0x96/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_get+0x126/0x350 [11863.059217] ? _raw_spin_unlock+0x2b/0x40 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_open+0x65/0x80 ... As you can see RAX value is already poisoned showing that gendisk we got is already freed. The problem is that get_gendisk() looks up device number in ext_devt_idr and then does get_disk() which does kobject_get() on the disks kobject. However the disk gets removed from ext_devt_idr only in disk_release() (through blk_free_devt()) at which moment it has already 0 refcount and is already on its way to be freed. Indeed we've got a warning from kobject_get() about 0 refcount shortly before the oops. We fix the problem by using kobject_get_unless_zero() in get_disk() so that get_disk() cannot get reference on a disk that is already being freed. Tested-by: NLekshmi Pillai <lekshmicpillai@in.ibm.com> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 3月, 2017 6 次提交
-
-
由 Jens Axboe 提交于
If a driver allocates a queue for stacked usage, then it does not currently get stats allocated. This causes the later init of, eg, writeback throttling to blow up. Move the init to the queue allocation instead. Additionally, allow a NULL callback unregistration. This avoids having the caller check for that, fixing another oops on removal of a block device that doesn't have poll stats allocated. Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting") Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for both in-tree users: 1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O. 2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window. This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics. wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use. Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
This is an implementation detail that no-one outside of blk-stat.c uses. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
The stats buckets will become generic soon, so make the existing users use the common READ and WRITE definitions instead of one internal to blk-stat. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
We always call wbt_exit() from blk_release_queue(), so these are unnecessary. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
We need to flush the batch _before_ we check the number of samples, otherwise we'll miss all of the batched samples. Fixes: cf43e6be ("block: add scalable completion tracking of requests") Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 15 3月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
If we have scheduling enabled, we jump directly to insert-and-run. That's fine, but we run the queue async and we don't pass in information on whether we can block from this context or not. Fixup both these cases. Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 13 3月, 2017 1 次提交
-
-
由 Sagi Grimberg 提交于
In case cpu was unplugged, we need to make sure not to assume that the tags for that cpu are still allocated. so check for null tags when reinitializing a tagset. Reported-by: NYi Zhang <yizhan@redhat.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 12 3月, 2017 1 次提交
-
-
由 NeilBrown 提交于
Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()") changed current->bio_list so that it did not contain *all* of the queued bios, but only those submitted by the currently running make_request_fn. There are two places which walk the list and requeue selected bios, and others that check if the list is empty. These are no longer correct. So redefine current->bio_list to point to an array of two lists, which contain all queued bios, and adjust various code to test or walk both lists. Signed-off-by: NNeilBrown <neilb@suse.com> Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()") Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 09 3月, 2017 8 次提交
-
-
由 NeilBrown 提交于
To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.htmlTested-by: NJinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jan Kara 提交于
This reverts commit 0dba1314. It causes leaking of device numbers for SCSI when SCSI registers multiple gendisks for one request_queue in succession. It can be easily reproduced using Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE. Furthermore the protection provided by this commit is not needed anymore as the problem it was fixing got also fixed by commit 165a5e22 "block: Move bdi_unregister() to del_gendisk()". [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2Signed-off-by: NJan Kara <jack@suse.cz> Acked-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jan Kara 提交于
Commit 165a5e22 "block: Move bdi_unregister() to del_gendisk()" added disk->queue dereference to del_gendisk(). Although del_gendisk() is not supposed to be called without disk->queue valid and blk_unregister_queue() warns in that case, this change will make it oops instead. Return to the old more robust behavior of just warning when del_gendisk() gets called for gendisk with disk->queue being NULL. Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NJan Kara <jack@suse.cz> Tested-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jon Derrick 提交于
Fixes check that the opal user is within the range, and cleans up unused method variables. Signed-off-by: NJon Derrick <jonathan.derrick@intel.com> Reviewed-by: NScott Bauer <scott.bauer@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
It is obviously that hctx->cpumask is per hctx, and both share same lifetime, so this patch moves freeing of hctx->cpumask into release handler of hctx's kobject. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(), and trys to keep lifetime consistent between hctx and hctx's kobject. Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become totally symmetrical, and kobject's refcounter drops to zero just when the hctx is freed. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Currently from kobject view, both q->mq_kobj and ctx->kobj can be released during one cycle of blk_mq_register_dev() and blk_mq_unregister_dev(). Actually, sw queue's lifetime is same with its request queue's, which is covered by request_queue->kobj. So we don't need to call kobject_put() for the two kinds of kobject in __blk_mq_unregister_dev(), instead we do that in release handler of request queue. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Both q->mq_kobj and sw queues' kobjects should have been initialized once, instead of doing that each add_disk context. Also this patch removes clearing of ctx in blk_mq_init_cpu_queues() because percpu allocator fills zero to allocated variable. This patch fixes one issue[1] reported from Omar. [1] kernel wearning when doing unbind/bind on one scsi-mq device [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong. [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34 [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014 [ 19.350920] Workqueue: events_unbound async_run_entry_fn [ 19.350920] Call Trace: [ 19.350920] dump_stack+0x63/0x83 [ 19.350920] kobject_init+0x77/0x90 [ 19.350920] blk_mq_register_dev+0x40/0x130 [ 19.350920] blk_register_queue+0xb6/0x190 [ 19.350920] device_add_disk+0x1ec/0x4b0 [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod] [ 19.350920] async_run_entry_fn+0x48/0x150 [ 19.350920] process_one_work+0x1d0/0x480 [ 19.350920] worker_thread+0x48/0x4e0 [ 19.350920] kthread+0x101/0x140 [ 19.350920] ? process_one_work+0x480/0x480 [ 19.350920] ? kthread_create_on_node+0x60/0x60 [ 19.350920] ret_from_fork+0x2c/0x40 Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 3月, 2017 1 次提交
-
-
由 Jan Kara 提交于
Commit 6cd18e71 "block: destroy bdi before blockdev is unregistered." moved bdi unregistration (at that time through bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because it needs to happen before blk_unregister_region() call in del_gendisk() for MD. SCSI though will free up the device number from sd_remove() called through a maze of callbacks from device_del() in __scsi_remove_device() before blk_cleanup_queue() and thus similar races as described in 6cd18e71 can happen for SCSI as well as reported by Omar [1]. Moving bdi_unregister() to del_gendisk() works for MD and fixes the problem for SCSI since del_gendisk() gets called from sd_remove() before freeing the device number. This also makes device_add_disk() (calling bdi_register_owner()) more symmetric with del_gendisk(). [1] http://marc.info/?l=linux-block&m=148554717109098&w=2Tested-by: NLekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJan Kara <jack@suse.cz> Tested-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-