- 20 6月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 6月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
Avoid that the following complaint is reported: BUG: sleeping function called from invalid context at kernel/workqueue.c:2790 in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3 1 lock held by rcuop/3/41: #0: (rcu_callback){......}, at: [<ffffffff8111f9a2>] rcu_nocb_kthread+0x282/0x500 Call Trace: dump_stack+0x86/0xcf ___might_sleep+0x174/0x260 __might_sleep+0x4a/0x80 flush_work+0x7e/0x2e0 __cancel_work_timer+0x143/0x1c0 cancel_work_sync+0x10/0x20 blk_throtl_exit+0x25/0x60 blkcg_exit_queue+0x35/0x40 blk_release_queue+0x42/0x130 kobject_put+0xa9/0x190 This happens since we invoke callbacks that need to block from the queue release handler. Fix this by pushing the final release to a workqueue. Reported-by: NRoss Zwisler <zwisler@gmail.com> Fixes: commit b425e504 ("block: Avoid that blk_exit_rl() triggers a use-after-free") Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Updated changelog Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 6月, 2017 1 次提交
-
-
由 Paolo Valente 提交于
In blk-cgroup, operations on blkg objects are protected with the request_queue lock. This is no more the lock that protects I/O-scheduler operations in blk-mq. In fact, the latter are now protected with a finer-grained per-scheduler-instance lock. As a consequence, although blkg lookups are also rcu-protected, blk-mq I/O schedulers may see inconsistent data when they access blkg and blkg-related objects. BFQ does access these objects, and does incur this problem, in the following case. The blkg_lookup performed in bfq_get_queue, being protected (only) through rcu, may happen to return the address of a copy of the original blkg. If this is the case, then the blkg_get performed in bfq_get_queue, to pin down the blkg, is useless: it does not prevent blk-cgroup code from destroying both the original blkg and all objects directly or indirectly referred by the copy of the blkg. BFQ accesses these objects, which typically causes a crash for NULL-pointer dereference of memory-protection violation. Some additional protection mechanism should be added to blk-cgroup to address this issue. In the meantime, this commit provides a quick temporary fix for BFQ: cache (when safe) blkg data that might disappear right after a blkg_lookup. In particular, this commit exploits the following facts to achieve its goal without introducing further locks. Destroy operations on a blkg invoke, as a first step, hooks of the scheduler associated with the blkg. And these hooks are executed with bfqd->lock held for BFQ. As a consequence, for any blkg associated with the request queue an instance of BFQ is attached to, we are guaranteed that such a blkg is not destroyed, and that all the pointers it contains are consistent, while that instance is holding its bfqd->lock. A blkg_lookup performed with bfqd->lock held then returns a fully consistent blkg, which remains consistent until this lock is held. In more detail, this holds even if the returned blkg is a copy of the original one. Finally, also the object describing a group inside BFQ needs to be protected from destruction on the blkg_free of the original blkg (which invokes bfq_pd_free). This commit adds private refcounting for this object, to let it disappear only after no bfq_queue refers to it any longer. This commit also removes or updates some stale comments on locking issues related to blk-cgroup operations. Reported-by: NTomas Konir <tomas.konir@gmail.com> Reported-by: NLee Tibbert <lee.tibbert@gmail.com> Reported-by: NMarco Piazza <mpiazza@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Tested-by: NTomas Konir <tomas.konir@gmail.com> Tested-by: NLee Tibbert <lee.tibbert@gmail.com> Tested-by: NMarco Piazza <mpiazza@gmail.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 07 6月, 2017 4 次提交
-
-
由 Shaohua Li 提交于
hard disk IO latency varies a lot depending on spindle move. The latency range could be from several microseconds to several milliseconds. It's pretty hard to get the baseline latency used by io.low. We will use a different stragety here. The idea is only using IO with spindle move to determine if cgroup IO is in good state. For HD, if io latency is small (< 1ms), we ignore the IO. Such IO is likely from sequential IO, and is helpless to help determine if a cgroup's IO is impacted by other cgroups. With this, we only account IO with big latency. Then we can choose a hardcoded baseline latency for HD (4ms, which is typical IO latency with seek). With all these settings, the io.low latency works for both HD and SSD. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Joseph Qi 提交于
I have encountered a NULL pointer dereference in throtl_schedule_pending_timer: [ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 [ 413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210 [ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0 [ 413.735713] Oops: 0000 [#1] SMP ...... This is caused by the following case: blk_throtl_bio throtl_schedule_next_dispatch <= sq is top level one without parent throtl_schedule_pending_timer sq_to_tg(sq)->td->throtl_slice <= sq_to_tg(sq) returns NULL Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always return a valid td. Fixes: 297e3d85 ("blk-throttle: make throtl_slice tunable") Signed-off-by: NJoseph Qi <qijiang.qj@alibaba-inc.com> Reviewed-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
If queue is stopped, we shouldn't dispatch request into driver and hardware, unfortunately the check is removed in bd166ef1(blk-mq-sched: add framework for MQ capable IO schedulers). This patch fixes the issue by moving the check back into __blk_mq_try_issue_directly(). This patch fixes request use-after-free[1][2] during canceling requets of NVMe in nvme_dev_disable(), which can be triggered easily during NVMe reset & remove test. [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on [ 103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 103.412980] IP: bio_integrity_advance+0x48/0xf0 [ 103.412981] PGD 275a88067 [ 103.412981] P4D 275a88067 [ 103.412982] PUD 276c43067 [ 103.412983] PMD 0 [ 103.412984] [ 103.412986] Oops: 0000 [#1] SMP [ 103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod [ 103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1 [ 103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme] [ 103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000 [ 103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0 [ 103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202 [ 103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240 [ 103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00 [ 103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5 [ 103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18 [ 103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb [ 103.413053] FS: 0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000 [ 103.413054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0 [ 103.413056] Call Trace: [ 103.413063] bio_advance+0x2a/0xe0 [ 103.413067] blk_update_request+0x76/0x330 [ 103.413072] blk_mq_end_request+0x1a/0x70 [ 103.413074] blk_mq_dispatch_rq_list+0x370/0x410 [ 103.413076] ? blk_mq_flush_busy_ctxs+0x94/0xe0 [ 103.413080] blk_mq_sched_dispatch_requests+0x173/0x1a0 [ 103.413083] __blk_mq_run_hw_queue+0x8e/0xa0 [ 103.413085] __blk_mq_delay_run_hw_queue+0x9d/0xa0 [ 103.413088] blk_mq_start_hw_queue+0x17/0x20 [ 103.413090] blk_mq_start_hw_queues+0x32/0x50 [ 103.413095] nvme_kill_queues+0x54/0x80 [nvme_core] [ 103.413097] nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme] [ 103.413103] process_one_work+0x149/0x360 [ 103.413105] worker_thread+0x4d/0x3c0 [ 103.413109] kthread+0x109/0x140 [ 103.413111] ? rescuer_thread+0x380/0x380 [ 103.413113] ? kthread_park+0x60/0x60 [ 103.413120] ret_from_fork+0x2c/0x40 [ 103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8 [ 103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10 [ 103.413146] CR2: 000000000000000a [ 103.413157] ---[ end trace cd6875d16eb5a11e ]--- [ 103.455368] Kernel panic - not syncing: Fatal exception [ 103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 103.850916] ---[ end Kernel panic - not syncing: Fatal exception [ 103.857637] sched: Unexpected reschedule of offline CPU#1! [ 103.863762] ------------[ cut here ]------------ [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off [ 247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds. [ 247.137311] Not tainted 4.12.0-rc2.upstream+ #4 [ 247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.151704] Call Trace: [ 247.154445] __schedule+0x28a/0x880 [ 247.158341] schedule+0x36/0x80 [ 247.161850] blk_mq_freeze_queue_wait+0x4b/0xb0 [ 247.166913] ? remove_wait_queue+0x60/0x60 [ 247.171485] blk_freeze_queue+0x1a/0x20 [ 247.175770] blk_cleanup_queue+0x7f/0x140 [ 247.180252] nvme_ns_remove+0xa3/0xb0 [nvme_core] [ 247.185503] nvme_remove_namespaces+0x32/0x50 [nvme_core] [ 247.191532] nvme_uninit_ctrl+0x2d/0xa0 [nvme_core] [ 247.196977] nvme_remove+0x70/0x110 [nvme] [ 247.201545] pci_device_remove+0x39/0xc0 [ 247.205927] device_release_driver_internal+0x141/0x200 [ 247.211761] device_release_driver+0x12/0x20 [ 247.216531] pci_stop_bus_device+0x8c/0xa0 [ 247.221104] pci_stop_and_remove_bus_device_locked+0x1a/0x30 [ 247.227420] remove_store+0x7c/0x90 [ 247.231320] dev_attr_store+0x18/0x30 [ 247.235409] sysfs_kf_write+0x3a/0x50 [ 247.239497] kernfs_fop_write+0xff/0x180 [ 247.243867] __vfs_write+0x37/0x160 [ 247.247757] ? selinux_file_permission+0xe5/0x120 [ 247.253011] ? security_file_permission+0x3b/0xc0 [ 247.258260] vfs_write+0xb2/0x1b0 [ 247.261964] ? syscall_trace_enter+0x1d0/0x2b0 [ 247.266924] SyS_write+0x55/0xc0 [ 247.270540] do_syscall_64+0x67/0x150 [ 247.274636] entry_SYSCALL64_slow_path+0x25/0x25 [ 247.279794] RIP: 0033:0x7f5c96740840 [ 247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840 [ 247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001 [ 247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740 [ 247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400 [ 247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000 [ 370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds. Fixes: 12d70958(blk-mq: don't fail allocating driver tag for stopped hw queue) Cc: stable@vger.kernel.org Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
When direct issue is done on request picked up from plug list, the hctx need to be updated with the actual hw queue, otherwise wrong hctx is used and may hurt performance, especially when wrong SRCU readlock is acquired/released Reported-by: NBart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 6月, 2017 1 次提交
-
-
由 Dmitry Monakhov 提交于
If bio has no data, such as ones from blkdev_issue_flush(), then we have nothing to protect. This patch prevent bugon like follows: kfree_debugcheck: out of range ptr ac1fa1d106742a5ah kernel BUG at mm/slab.c:2773! invalid opcode: 0000 [#1] SMP Modules linked in: bcache CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43 Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014 task: ffff880137786440 task.stack: ffffc90000ba8000 RIP: 0010:kfree_debugcheck+0x25/0x2a RSP: 0018:ffffc90000babde0 EFLAGS: 00010082 RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40 RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282 R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001 FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0 Call Trace: kfree+0xc8/0x1b3 bio_integrity_free+0xc3/0x16b bio_free+0x25/0x66 bio_put+0x14/0x26 blkdev_issue_flush+0x7a/0x85 blkdev_fsync+0x35/0x42 vfs_fsync_range+0x8e/0x9f vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 entry_SYSCALL_64_fastpath+0x1f/0xc2 Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 02 6月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is essential that the memory allocated for struct request_queue stays around until all blk_exit_rl() calls have finished. Hence make blk_init_rl() take a reference on struct request_queue. This patch fixes the following crash: general protection fault: 0000 [#2] SMP CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 task: ffff88013a108040 task.stack: ffffc9000071c000 RIP: 0010:free_request_size+0x1a/0x30 RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003 RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418 RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009 R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8 R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0 Call Trace: mempool_destroy.part.10+0x21/0x40 mempool_destroy+0xe/0x10 blk_exit_rl+0x12/0x20 blkg_free+0x4d/0xa0 __blkg_release_rcu+0x59/0x170 rcu_process_callbacks+0x260/0x4e0 __do_softirq+0x116/0x250 smpboot_thread_fn+0x123/0x1e0 kthread+0x109/0x140 ret_from_fork+0x31/0x40 Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request") Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Acked-by: NTejun Heo <tj@kernel.org> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 31 5月, 2017 2 次提交
-
-
由 Hou Tao 提交于
When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY as the delay of cfq_group's vdisktime if there have been other cfq_groups already. When cfq is under iops mode, commit 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds") could result in a large iops delay and lead to an abnormal io schedule delay for the added cfq_group. To fix it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5 when iops mode is enabled. Despite having the same value, the delay of a cfq_queue in idle class and the delay of cfq_group are different things, so I define two new macros for the delay of a cfq_group under time-slice mode and iops mode. Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds") Cc: <stable@vger.kernel.org> # 4.8+ Signed-off-by: NHou Tao <houtao1@huawei.com> Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Keith Busch 提交于
The tagset lock needs to be held when iterating the tag_list, so a lockdep assert was added when updating number of hardware queues. The drivers calling this API, however, were unaware of the new requirement, so are failing the assertion. This patch takes the lock within the blk-mq function so the drivers do not have to be modified in order to be safe. Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") Reported-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk> Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 26 5月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
The code in blk-mq-debugfs.c assumes that it is working on a blk-mq queue and is not intended to work on a blk-sq queue. Hence only register blk-mq debugfs attributes for blk-mq queues. Fixes: commit 9c1051aa ("blk-mq: untangle debugfs and sysfs") Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 5月, 2017 7 次提交
-
-
由 Richard 提交于
The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD and NetBSD partitions and does a reasonable job picking out OpenBSD and NetBSD UFS subpartitions. But for FreeBSD the subpartitions are always "bad". Kernel: <bsd:bad subpartition - ignored Though all 3 of these BSD systems use UFS as a file system, only FreeBSD uses relative start addresses in the subpartition declarations. The following patch fixes this for FreeBSD partitions and leaves the code for OpenBSD and NetBSD intact: Signed-off-by: NRichard Narron <comet.berkeley@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Dan Carpenter 提交于
We don't set an error code on this path. It means that we return NULL instead of an error pointer and the caller does a NULL dereference. Fixes: 6d1d8050 ("block, partition: add partition_meta_info to hd_struct") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
Default value of io.low limit is 0. If user doesn't configure the limit, last patch makes cgroup be throttled to very tiny bps/iops, which could stall the system. A cgroup with default settings of io.low limit really means nothing, so we force user to configure all settings, otherwise io.low limit doesn't take effect. With this stragety, default setting of latency/idle isn't important, so just set them to very conservative and safe value. Signed-off-by: NShaohua Li <shli@fb.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
If a cgroup with low limit 0 for both bps/iops, the cgroup's low limit is ignored and we throttle the cgroup with its max limit. In this way, other cgroups with a low limit will not get protected. To fix this, we don't do the exception any more. cgroup will be throttled to a limit 0 if it uese default setting. To avoid completed stall, we give such cgroup tiny IO resources. Signed-off-by: NShaohua Li <shli@fb.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
These info are important to understand what's happening and help debug. Signed-off-by: NShaohua Li <shli@fb.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
For idle time, children's setting should not be bigger than parent's. For latency target, children's setting should not be smaller than parent's. The leaf nodes will adjust their settings according to the hierarchy and compare their IO with the settings and do upgrade/downgrade. parents nodes don't need to track their IO latency/idle time. Signed-off-by: NShaohua Li <shli@fb.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
No one uses it any more, so remove it. Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 11 5月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
SCSI devices can return short writes on Write Same just like for normal writes, so we need to handle this case for our special payload requests as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reported-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com> Tested-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 10 5月, 2017 5 次提交
-
-
由 Wen Xiong 提交于
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns "Input/output error". Looks block layer split the bio after calling bio_integrity_prep(bio). This patch fixes the issue. Below is how we debug this issue: (1)format nvme to 4K block # size with type 2 DIF (2)dd with block size bigger than 1024k. oflag=direct dd: error writing '/dev/nvme0n1': Input/output error We added some debug code in nvme device driver. It showed us the first op and the second op have the same bi and pi address. This is not correct. 1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires.. Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828 With the fix, It showed us both of the first op and the second op have correct bi and pi address. 1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual 0x00003028 Signed-off-by: NWen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough for us to use this_cpu_ptr() in that section. Use the safer get/put_cpu_ptr() variants instead. Reported-by: NMike Galbraith <efault@gmx.de> Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting") Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We warn twice for switching to a scheduler, if that switch fails. As we also report the failure in the return value to the sysfs write, remove the dmesg induced failures. Keep the failure print for warning to switch to the kconfig selected IO scheduler, as we can't report errors for that in any other way. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Paolo Valente 提交于
The introduction of the BFQ and Kyber I/O schedulers has triggered a new wave of I/O benchmarks. Unfortunately, comments and discussions on these benchmarks confirm that there is still little awareness that it is very hard to achieve, at the same time, a low latency and a high throughput. In particular, virtually all benchmarks measure throughput, or throughput-related figures of merit, but, for BFQ, they use the scheduler in its default configuration. This configuration is geared, instead, toward a low latency. This is evidently a sign that BFQ documentation is still too unclear on this important aspect. This commit addresses this issue by stressing how BFQ configuration must be (easily) changed if the only goal is maximum throughput. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Paolo Valente 提交于
In the function __bfq_deactivate_entity, the pointer entity->sched_data could happen to be used before being properly initialized. This led to a NULL pointer dereference. This commit fixes this bug by just using this pointer only where it is safe to do so. Reported-by: NTom Harrison <l12436.tw@gmail.com> Tested-by: NTom Harrison <l12436.tw@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 09 5月, 2017 1 次提交
-
-
由 Dan Williams 提交于
For configurations that do not enable DAX filesystems or drivers, do not require the DAX core to be built. Given that the 'direct_access' method has been removed from 'block_device_operations', we can also go ahead and remove the block-related dax helper functions from fs/block_dev.c to drivers/dax/super.c. This keeps dax details out of the block layer and lets the DAX core be built as a module in the FS_DAX=n case. Filesystems need to include dax.h to call bdev_dax_supported(). Cc: linux-xfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.com> Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 08 5月, 2017 2 次提交
-
-
由 Colin Ian King 提交于
Making __blk_mq_stop_hw_queues static fixes sparse warning: block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not declared. Should it be static? Fixes: 2719aa21 ("blk-mq: don't use sync workqueue flushing from drivers") Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Wanpeng Li 提交于
This can be triggered by hot-unplug one cpu. ====================================================== [ INFO: possible circular locking dependency detected ] 4.11.0+ #17 Not tainted ------------------------------------------------------- step_after_susp/2640 is trying to acquire lock: (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110 but task is already holding lock: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug.lock){+.+.+.}: lock_acquire+0x11c/0x230 __mutex_lock+0x92/0x990 mutex_lock_nested+0x1b/0x20 get_online_cpus+0x64/0x80 blk_mq_init_allocated_queue+0x3a0/0x4e0 blk_mq_init_queue+0x3a/0x60 loop_add+0xe5/0x280 loop_init+0x124/0x177 do_one_initcall+0x53/0x1c0 kernel_init_freeable+0x1e3/0x27f kernel_init+0xe/0x100 ret_from_fork+0x31/0x40 -> #0 (all_q_mutex){+.+...}: __lock_acquire+0x189a/0x18a0 lock_acquire+0x11c/0x230 __mutex_lock+0x92/0x990 mutex_lock_nested+0x1b/0x20 blk_mq_queue_reinit_work+0x18/0x110 blk_mq_queue_reinit_dead+0x1c/0x20 cpuhp_invoke_callback+0x1f2/0x810 cpuhp_down_callbacks+0x42/0x80 _cpu_down+0xb2/0xe0 freeze_secondary_cpus+0xb6/0x390 suspend_devices_and_enter+0x3b3/0xa40 pm_suspend+0x129/0x490 state_store+0x82/0xf0 kobj_attr_store+0xf/0x20 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x135/0x1c0 __vfs_write+0x37/0x160 vfs_write+0xcd/0x1d0 SyS_write+0x58/0xc0 do_syscall_64+0x8f/0x710 return_from_SYSCALL_64+0x0/0x7a other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(all_q_mutex); lock(cpu_hotplug.lock); lock(all_q_mutex); *** DEADLOCK *** 8 locks held by step_after_susp/2640: #0: (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0 #1: (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0 #2: (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0 #3: (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490 #4: (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20 #5: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390 #6: (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0 #7: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0 stack backtrace: CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17 Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016 Call Trace: dump_stack+0x99/0xce print_circular_bug+0x1fa/0x270 __lock_acquire+0x189a/0x18a0 lock_acquire+0x11c/0x230 ? lock_acquire+0x11c/0x230 ? blk_mq_queue_reinit_work+0x18/0x110 ? blk_mq_queue_reinit_work+0x18/0x110 __mutex_lock+0x92/0x990 ? blk_mq_queue_reinit_work+0x18/0x110 ? kmem_cache_free+0x2cb/0x330 ? anon_transport_class_unregister+0x20/0x20 ? blk_mq_queue_reinit_work+0x110/0x110 mutex_lock_nested+0x1b/0x20 ? mutex_lock_nested+0x1b/0x20 blk_mq_queue_reinit_work+0x18/0x110 blk_mq_queue_reinit_dead+0x1c/0x20 cpuhp_invoke_callback+0x1f2/0x810 ? __flow_cache_shrink+0x160/0x160 cpuhp_down_callbacks+0x42/0x80 _cpu_down+0xb2/0xe0 freeze_secondary_cpus+0xb6/0x390 suspend_devices_and_enter+0x3b3/0xa40 ? rcu_read_lock_sched_held+0x79/0x80 pm_suspend+0x129/0x490 state_store+0x82/0xf0 kobj_attr_store+0xf/0x20 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x135/0x1c0 __vfs_write+0x37/0x160 ? rcu_read_lock_sched_held+0x79/0x80 ? rcu_sync_lockdep_assert+0x2f/0x60 ? __sb_start_write+0xd9/0x1c0 ? vfs_write+0x1ad/0x1d0 vfs_write+0xcd/0x1d0 SyS_write+0x58/0xc0 ? rcu_read_lock_sched_held+0x79/0x80 do_syscall_64+0x8f/0x710 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will contend these two locks in the inversion order. This is due to commit eabe0659 (blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion issue because of hotplug rework, however the hotplug rework is still work-in-progress and lives in a -tip branch and mainline cannot yet trigger that splat. The commit breaks the linus's tree in the merge window, so this patch reverts the lock order and avoids to splat linus's tree. Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 04 5月, 2017 12 次提交
-
-
由 Omar Sandoval 提交于
Expose the fifo lists, cached next requests, batching state, and dispatch list. It'd also be possible to add the sorted lists, but there aren't already seq_file helpers for rbtrees. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Expose the domain token pools, asynchronous sbitmap depth, domain request lists, and batching state. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
This provides the infrastructure for schedulers to expose their internal state through debugfs. We add a list of queue attributes and a list of hctx attributes to struct elevator_type and wire them up when switching schedulers. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Add missing seq_file.h header in blk-mq-debugfs.h Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Originally, I tied debugfs registration/unregistration together with sysfs. There's no reason to do this, and it's getting in the way of letting schedulers define their own debugfs attributes. Instead, tie the debugfs registration to the lifetime of the structures themselves. The saner lifetimes mean we can also get rid of the extra mq directory and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now just nvme0n1/hctx0/tags. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Preparation for adding more declarations. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
In commit e869b546 ("blk-mq: Unregister debugfs attributes earlier"), we shuffled the debugfs cleanup around so that the "state" attribute was removed before we freed the blk-mq data structures. However, later changes are going to undo that, so we need to explicitly disallow running a dead queue. [Omar: rebased and updated commit message] Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
A large part of blk-mq-debugfs.c is file_operations and seq_file boilerplate. This sucks as is but will suck even more when schedulers can define their own debugfs entries. Factor it all out into a single blk_mq_debugfs_fops which multiplexes as needed. We store the request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory dentry, which is kind of hacky, but it works. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
It's not clear what these numbered directories represent unless you consult the code. We're about to get rid of the intermediate "mq" directory, so these would be even more confusing without that context. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Slightly more readable, plus we also strip leading spaces. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
blk_queue_flags_store() currently truncates and returns a short write if the operation being written is too long. This can give us weird results, like here: $ echo "run bar" echo: write error: invalid argument $ dmesg [ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start' Instead, return an error if the user does this. While we're here, make the argument names consistent with everywhere else in this file. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Make sure the spelled out flag names match the definition. This also adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing cmd_flag, __REQ_NOUNMAP. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
This reads more naturally than spaces. Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-