- 17 January 2020: 7 commits
-
-
Submitted by Christoph Hellwig
commit 529262d56dbebe6a26df5d2fd24cc0e1bc8579e5 upstream. This was intended to support users like nvme multipath, but is just getting in the way and adding another indirect call. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Jens Axboe
commit 0a1b8b87d064a47fad9ec475316002da28559207 upstream. blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not. Existing callers are converted to pass in 'spin == true', to retain the old behavior. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
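A minimal sketch of how a caller might drive the new 'spin' argument; the wrapper name and fallback logic below are illustrative, not taken from the patch:

    #include <linux/blkdev.h>

    /* Illustrative wrapper, not from the patch: poll once without spinning,
     * then fall back to the spinning mode only if we must make progress. */
    static int demo_poll(struct request_queue *q, blk_qc_t cookie, bool must_progress)
    {
            /* spin == false: single pass, return even if nothing completed */
            int found = blk_poll(q, cookie, false);

            if (found || !must_progress)
                    return found;

            /* spin == true: the old behavior all existing callers keep */
            return blk_poll(q, cookie, true);
    }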
-
Submitted by Jens Axboe
commit 1052b8ac5282daf35df331edcbdb645839d17e6a upstream. If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
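A hedged sketch of a driver ->poll() under the new contract; the demo_* queue type and helpers are hypothetical stand-ins for a driver's completion-queue primitives:

    #include <linux/blk-mq.h>

    struct demo_queue;                              /* hypothetical driver queue */
    bool demo_cqe_pending(struct demo_queue *dq);   /* hypothetical helpers */
    void demo_complete_one(struct demo_queue *dq);

    /* Illustrative ->poll(): ignore the specific tag (callers now pass -1)
     * and report how many completions were reaped in this pass. */
    static int demo_mq_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
    {
            struct demo_queue *dq = hctx->driver_data;
            int found = 0;

            while (demo_cqe_pending(dq)) {
                    demo_complete_one(dq);
                    found++;
            }

            return found;
    }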
-
Submitted by Damien Le Moal
commit 64845a1ddd655574886eb48e9a5eaeeb9b05bf0d upstream. Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
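The helper is tiny; it looks roughly like the following (simplified, see include/linux/ioprio.h for the real definition):

    static inline int get_current_ioprio(void)
    {
            struct io_context *ioc = current->io_context;

            if (ioc)
                    return ioc->ioprio;

            return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
    }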
-
Submitted by Jens Axboe
commit 85f4d4b65fdd67f1d6dc9eeb1d91923cef07eb6a upstream. We currently only really support sync poll, i.e. poll with 1 IO in flight. This prepares us for supporting async poll. Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
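The "task is now runnable" assumption described above boils down to a check like this inside the polling wait loop; a sketch under the commit's description, not the exact hunks:

    /* If an IRQ completion raced with us and the task is already runnable,
     * report at least one found entry; callers tolerate the count being
     * approximate. */
    if (signal_pending_state(state, current))
            __set_current_state(TASK_RUNNING);
    if (current->state == TASK_RUNNING)
            return 1;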
-
Submitted by David Howells
commit 00e23707442a75b404392cef1405ab4fd498de6b upstream. Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
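The accessors are thin wrappers over the existing type field; roughly (simplified from include/linux/uio.h after this patch):

    static inline enum iter_type iov_iter_type(const struct iov_iter *i)
    {
            return i->type & ~(READ | WRITE);
    }

    static inline unsigned char iov_iter_rw(const struct iov_iter *i)
    {
            return i->type & (READ | WRITE);
    }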
-
Submitted by Joseph Qi
This fixes the following format build warning:

    block/blk-iocost.c: In function 'ioc_stat_prfill':
    block/blk-iocost.c:2506:17: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 9 has type 'long int' [-Wformat=]

Reported-by: kbuild test robot <lkp@intel.com> Fixes: 0670363c ("alinux: iocost: add ioc_gq stat") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
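Such -Wformat mismatches are usually resolved either by changing the specifier to match the argument's type or by casting the argument; an illustrative fragment, with 'wait_sum' standing in for whichever stat field triggered the warning:

    /* either adjust the specifier to the argument's actual type ... */
    snprintf(buf, sizeof(buf), "wait_sum=%lu", (unsigned long)stat.wait_sum);
    /* ... or cast the argument to the type the specifier expects */
    snprintf(buf, sizeof(buf), "wait_sum=%llu", (unsigned long long)stat.wait_sum);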
-
- 15 January 2020: 9 commits
-
-
Submitted by Jiufei Xue
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Jiufei Xue
Add a stat file to monitor the ioc_gq stat. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Xiaoguang Wang
Currently in blk_throtl_bio(), if one bio exceeds its throtl_grp's bps or iops limit, the bio is queued on the throtl_grp's throtl_service_queue. The mm subsystem will then obviously submit more pages, even if the underlying device cannot handle these io requests, and this also makes a large amount of pages enter writeback prematurely; if some process later writes some of these pages, it will wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file, with this process's blkcg max bps limit set to 10MB/s. I observe this:

#cat /proc/meminfo | grep -i back
Writeback: 900024 kB
WritebackTmp: 0 kB

I think this Writeback value is just too big. Indeed many bios have been queued in the throtl_grp's throtl_service_queue; if one process tries to write the last bio's page in this queue, it will call wait_on_page_writeback(page), which must wait for the previous bios to finish and will take a long time. We have also seen 120s hung task warnings on our servers.

INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
Tainted: G E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u128:0 D 0 30072 2 0x00000000
Workqueue: writeback wb_workfn (flush-8:16)
ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
Call Trace:
[<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
[<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
[<ffffffff81733726>] schedule+0x36/0x80
[<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
[<ffffffff81036c69>] ? sched_clock+0x9/0x10
[<ffffffff81363073>] ? get_request+0x403/0x810
[<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
[<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
[<ffffffff81733f90>] ? bit_wait+0x60/0x60
[<ffffffff81733fab>] bit_wait_io+0x1b/0x60
[<ffffffff81733b28>] __wait_on_bit+0x58/0x90
[<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
[<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
[<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
[<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
[<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
[<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
[<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
[<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
[<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
[<ffffffff811c139e>] do_writepages+0x1e/0x30
[<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
[<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
[<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
[<ffffffff8127d884>] wb_workfn+0xb4/0x380
[<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
[<ffffffff810a5759>] process_one_work+0x189/0x420
[<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
[<ffffffff810a59f0>] ? process_one_work+0x420/0x420
[<ffffffff810ac026>] kthread+0xe6/0x100
[<ffffffff810abf40>] ? kthread_park+0x60/0x60
[<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit the throtl_service_queue's maximum number of queued bios. Currently we limit it according to the throtl_grp's bps or iops limit; if it is still exceeded, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
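A hedged sketch of the approach, in the context of blk_throtl_bio(); tg_max_queued_bios() and the sleep interval are illustrative, not the exact patch:

    /* Before queueing the bio on the service queue, make the submitter
     * wait while the per-group queue is already at its cap, so dirty
     * pages stop piling up far ahead of what the device can absorb. */
    while (sq->nr_queued[rw] >= tg_max_queued_bios(tg, rw)) {
            spin_unlock_irq(q->queue_lock);
            /* back off briefly, then re-check under the lock */
            schedule_timeout_interruptible(HZ / 100);
            spin_lock_irq(q->queue_lock);
    }
    throtl_add_bio_tg(bio, qn, tg);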
-
Submitted by Jiufei Xue
Now we have counters for wait_time and service_time, but no counter for completed ios, so the average latency cannot be measured. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Jiufei Xue
This patch is a code cleanup: the seq_show handlers for the tg counters are the same. No functional changes. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Joseph Qi
Add 2 interfaces to stat io throttle information:

    blkio.throttle.total_io_queued
    blkio.throttle.total_bytes_queued

These interfaces are used for monitoring throttled io/bytes and analyzing whether delays are related to io throttling. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com> Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Joseph Qi
The io throttle stats code calls blkg_get at the beginning of throttling and then blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be freed if end_io is called twice, as dm-thin does: it saves the original end_io first, calls its own end_io, and then the saved end_io. After that, accessing the blkg is invalid and finally BUGs:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407] [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760] [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079] [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387] [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697] [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019] [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326] [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637] [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951] [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240] [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503] [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778] [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062] [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344] [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626] [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909] [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195] [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484] [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744] [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014] [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268] [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524] [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753] [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950] [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255] [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag BIO_THROTL_STATED to make sure blkg_get/put only gets called once for the same bio. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
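A simplified sketch of the guard the new flag enables, assuming the charge and bi_tg_end_io paths described above (fragment, not the exact hunks):

    /* charge path: take the blkg reference only once per bio */
    if (!bio_flagged(bio, BIO_THROTL_STATED)) {
            bio_set_flag(bio, BIO_THROTL_STATED);
            blkg_get(tg_to_blkg(tg));
    }

    /* bi_tg_end_io path: drop it only if this bio actually took it */
    if (bio_flagged(bio, BIO_THROTL_STATED)) {
            bio_clear_flag(bio, BIO_THROTL_STATED);
            blkg_put(tg_to_blkg(tg));
    }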
-
Submitted by Joseph Qi
Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to get per-cgroup io delay statistics. io_service_time represents the time spent from io throttling to io completion, while io_wait_time represents the time spent on the throttle queue. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Xiaoguang Wang
Indeed iostat's await is not good enough: it is somewhat rough and cannot show a request's latency on the device driver's side. Here we add a new counter to track an io request's d2c time; with this patch we can also easily extend iostat to show this value. Note: I have checked how iostat is implemented; it just reads the fields it needs, so iostat won't be affected by this change, and neither will tsar. Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 02 January 2020: 1 commit
-
-
Submitted by Ming Lei
commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream. 'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1 bytes. Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could include only a very limited number of pages, usually at most 256, so an fs bio size would not be bigger than 1M bytes most of the time. Since we support multi-page bvecs, in theory one fs bio really can have more than 1M pages added, especially in the case of hugepages, or big writeback with too many dirty pages. Then there is a chance that .bi_size overflows. Fix this issue by using bio_full() to check whether the added segment may overflow .bi_size. Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com> Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3ec276 ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
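After this change bio_full() accounts for both the bvec slots and a potential bi_size overflow; the helper looks roughly like this (simplified):

    static inline bool bio_full(struct bio *bio, unsigned len)
    {
            /* no bvec slots left */
            if (bio->bi_vcnt >= bio->bi_max_vecs)
                    return true;

            /* adding 'len' bytes would overflow the 32-bit bi_size */
            if (bio->bi_iter.bi_size > UINT_MAX - len)
                    return true;

            return false;
    }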
-
- 27 December 2019: 23 commits
-
-
Submitted by Jiufei Xue
commit 8b37bc277fb459fa100808880a9d4e0641fff444 upstream. There is a bug where the same active_list is checked over and over again in iocg_activate(). The intention of the code was to check whether all the ancestors and self have already been activated, so fix it. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Dan Carpenter
commit 41591a51f00d2dc7bb9dc6e9bedf56c5cf6f2392 upstream. This code causes a static analysis warning:

    block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

We disable IRQs in blkg_conf_prep() and re-enable them in blkg_conf_finish(). IRQ disable/enable should not be nested, because that means the IRQs will be enabled at the first unlock instead of the second one. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
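Schematically (not the actual blk-iocost code paths), the problem and the usual fix look like this:

    static DEFINE_SPINLOCK(outer);
    static DEFINE_SPINLOCK(inner);

    static void broken_nesting(void)
    {
            spin_lock_irq(&outer);          /* IRQs disabled */
            spin_lock_irq(&inner);          /* nested irq-disable */
            spin_unlock_irq(&inner);        /* IRQs re-enabled here: too early */
            /* this region now unexpectedly runs with IRQs enabled */
            spin_unlock_irq(&outer);
    }

    static void fixed_nesting(void)
    {
            spin_lock_irq(&outer);
            spin_lock(&inner);              /* plain lock: leave IRQ state alone */
            spin_unlock(&inner);
            spin_unlock_irq(&outer);
    }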
-
Submitted by Jiufei Xue
Function ioc_rqos_throttle() may be called inside queue_lock. We should unlock the queue_lock before entering sleep. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 9d179b865449b351ad5cb76dbea480c9170d4a27 upstream. blkcg_activate_policy() has the following bugs.

* cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however, blkcg_activate_policy() ends up using pd's allocated for the root blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs can be passed in pd's which are allocated for the root blkcg. For blk-iocost, this means that ->pd_init_fn() can write beyond the end of the allocated object as it determines the length of the flex array at the end based on the blkcg's nesting level.

* Each pd is initialized as they get allocated. If alloc fails, the policy will get freed with pd's initialized on it.

* After the above partial failure, the partial pds are not freed.

This patch fixes all the above issues by

* Restructuring blkcg_activate_policy() so that alloc and init passes are separate. Init takes place only after all allocs succeeded and on failure all allocated pds are freed.

* Unifying and fixing the cleanup of the remaining pd_prealloc.

Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 71c814077de60b2e7415dac6f5c4e98f59d521fd upstream. When blkcg_activate_policy() is creating blkg_policy_data for existing blkgs, it did so in the wrong order - descendants first. Fix it. None of the existing controllers seem affected by this. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
When the QoS targets are met and nothing is being throttled, there's no way to tell how saturated the underlying device is - it could be almost entirely idle, at the cusp of saturation or anywhere in between. Given that there's no information, it's best to keep vrate as-is in this state. Before 7cd806a9a953 ("iocost: improve nr_lagging handling"), this was the case - if the device isn't missing QoS targets and nothing is being throttled, busy_level was reset to zero.

While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve nr_lagging handling") broke this. Now, while the device is hitting QoS targets and nothing is being throttled, vrate keeps getting adjusted according to the existing busy_level. This led to vrate keeping climbing till it hits max when there's an IO issuer with limited request concurrency if the vrate started low. vrate starts getting adjusted upwards until the issuer can issue IOs w/o being throttled. From then on, QoS targets keep getting met and nothing on the system needs throttling and vrate keeps getting increased due to the existing busy_level.

This patch makes the following changes to the busy_level logic.

* Reset busy_level if nr_shortages is zero to avoid the above scenario.

* Make non-zero nr_lagging block lowering busy_level but still clear positive busy_level if there's a clear non-saturation signal - QoS targets are met and nr_shortages is zero. nr_lagging's role is preventing adjusting vrate upwards while there are long-running commands and it shouldn't keep busy_level positive while there's a clear non-saturation signal.

* Restructure code for clarity and add comments.

Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Newell <newella@fb.com> Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling") Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 7afcccafa59fb63b58f863a6c5e603a34625955b upstream. The default hard disk param sets latency targets at 50ms. As the default target percentiles are zero, these don't directly regulate vrate; however, they're still used to calculate the period length - 100ms in this case. This is excessively low. A SATA drive with QD32 saturated with random IOs can easily reach an avg completion latency of several hundred msecs. A period duration which is substantially lower than the avg completion latency can lead to wildly fluctuating vrate. Let's bump up the default latency targets to 250ms so that the period duration is sufficiently long. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 7cd806a9a953f234b9865c30028f47fd738ce375 upstream. Some IOs may span multiple periods. As latencies are collected on completion, the in-between periods won't register them and may incorrectly decide to increase vrate. nr_lagging tracks these IOs to avoid those situations. Currently, whenever there are IOs which are spanning from the previous period, busy_level is reset to 0 if negative, thus suppressing vrate increase. This has the following two problems.

* When latency target percentiles aren't set, vrate adjustment should only be governed by queue depth depletion; however, the current code keeps nr_lagging active, which pulls in latency results and can keep down vrate unexpectedly.

* When the lagging condition is detected, it resets the entire negative busy_level. This turned out to be way too aggressive on some devices which sometimes experience extended latencies on a small subset of commands. In addition, a lagging IO will be accounted as a latency target miss on completion anyway, and resetting busy_level amplifies its impact unnecessarily.

This patch fixes the above two problems by disabling nr_lagging counting when latency target percentiles aren't set and blocking vrate increases when there are lagging IOs while leaving busy_level as-is.

Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 25d41e4aadb0788b4fae8a8fca90f437b9ebd727 upstream. The vrate_adj tracepoint traces vrate changes; however, it does so only when busy_level is non-zero. busy_level turning to zero can sometimes be as interesting an event. This patch also enables the vrate_adj tracepoint on other vrate-related events - busy_level changes and non-zero nr_lagging. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Jiufei Xue
Bios are not associated with a blkg before entering the iocost controller. Do it in ioc_rqos_throttle() as well as ioc_rqos_merge(). Considering that there are so many chances to create the blkg before ioc_rqos_merge(), we just look up the blkg here, and if the blkg does not exist, just return rather than creating it. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Jiufei Xue
To support cgroup v1. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 7c1ee704a1d6450f92372d57f5b76a458b51c1d4 upstream. Report debt and rename the del_ms row to delay for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit e1518f63f246831af222758ead022cd40e79fab8 upstream. Merges have the same problem that forced bios had, which is fixed by the previous patch. The cost of a merge is calculated at the time of issue and force-advances vtime into the future. Until the global vtime catches up, how the cgroup's hweight changes in the meantime doesn't matter, and it often leads to situations where the cost is calculated at one hweight and paid at a very different one. See the previous patch for more details. Fix it by never advancing vtime into the future for merges. If budget is available, vtime is advanced. Otherwise, the cost is charged as debt. This brings merge cost handling in line with issue cost handling in ioc_rqos_throttle(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 36a524814ff3e5d5385f42d30152fe8c5e1fd2c1 upstream. Currently, when a bio needs to be force-charged and there isn't enough budget, vtime is simply pushed into the future. This means that the cost of the whole bio is scaled using the current hweight and then charged immediately. Until the global vtime advances beyond this future vtime, the cgroup won't be allowed to issue normal IOs. This is incorrect and can lead to, for example, an exploding vrate or extended stalls if the vrate range is constrained. Consider the following scenario.

1. A cgroup with a very low hweight runs out of budget.

2. A storm of swap-outs happens on it. All of them are scaled according to the current low hweight and charged to vtime, pushing it to a far future.

3. All other cgroups go idle and now the above cgroup has access to the whole device. However, because vtime is already wound using the past low hweight, what its current hweight is doesn't matter until the global vtime catches up to the local vtime.

4. As a result, either vrate gets ramped up extremely or the IOs stall while the underlying device is idle. This is because the hweight the overage is calculated at is different from the hweight that it's being paid at.

Fix it by remembering the overage in absolute vtime and continuously paying with the actual budget according to the current hweight at each period. Note that non-forced bios which wait already remember the cost in absolute vtime. This brings forced-bio accounting in line.

Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
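Schematically, the change replaces "scale the whole cost at today's hweight and push vtime into the future" with "remember the overage as absolute debt and repay it period by period"; a hedged fragment, where iocg_has_budget() is a hypothetical helper and abs_vdebt is the absolute-cost debt field used by mainline iocost:

    if (iocg_has_budget(iocg, cost)) {
            atomic64_add(cost, &iocg->vtime);       /* normal charge */
    } else {
            /* pay later, at each period's then-current hweight */
            iocg->abs_vdebt += abs_cost;
    }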
-
Submitted by Tejun Heo
commit e036c4cabaa8d24375262ced3a191819a8077b74 upstream. ioc_pd_free() first cancels the hrtimers and then deactivates the iocg. However, the iocg timer can run in between and reschedule the hrtimers, which will then end up running after the iocg is freed, leading to crashes like the following.

general protection fault: 0000 [#1] SMP
...
RIP: 0010:iocg_kick_delay+0xbe/0x1b0
RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
FS: 0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
iocg_delay_timer_fn+0x3d/0x60
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>

Fix it by canceling the hrtimers after deactivating the iocg. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit e916ad29d96485e5aa3d3237bfeab1522c713d5e upstream. ioc_cpd_alloc() forgot to check for a NULL return from kzalloc(). Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
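The fix is the standard allocation check; roughly (simplified from block/blk-iocost.c):

    static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
    {
            struct ioc_cgrp *iocc;

            iocc = kzalloc(sizeof(*iocc), gfp);
            if (!iocc)
                    return NULL;    /* bail out cleanly instead of dereferencing NULL */

            iocc->dfl_weight = CGROUP_WEIGHT_DFL;
            return &iocc->cpd;
    }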
-
Submitted by Tejun Heo
commit 3532e7227243beb0b782266dc05c40b6184ad051 upstream. blk_iocost_init() forgot to free its percpu stat on the error path. Fix it. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Reported-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 8504dea783b044cab620acbaef87b86ee84646fe upstream. Add a script which can be used to generate device-specific iocost linear model coefficients. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 6954ff185ee0811cdd2e0f388ff4dd7df17f11af upstream. Instead of mucking with debugfs and ->pd_stat(), add a drgn-based monitoring script. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream. This patchset implements an IO cost model based work-conserving proportional controller.

While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the lowest latency target cgroup which is suffering is protected at the cost of all others. In many use cases, including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity.

One challenge of controlling IO resources is the lack of a trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern. However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device.

This controller distributes IO capacity based on the costs estimated by such a model. The more accurate the cost model the better, but the controller adapts based on IO completion latency, and as long as the relative costs across different IO patterns are consistent and sensible, it'll adapt to the actual performance of the device.

Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models.

Please see the top comment in blk-iocost.c and the documentation for more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> [Joseph: fix conflicts with ioc_rqos_throttle()] Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 6f816b4b746c2241540e537682d30d8e9997d674 upstream. There are currently two start time timestamps - start_time_ns and io_start_time_ns. The former marks the request allocation and the latter the issue-to-device time. The planned io.weight controller needs to measure the total time bios take to execute after leaving rq_qos, including the time spent waiting for a request to become available, which can easily dominate on saturated devices. This patch adds request->alloc_time_ns, which records when the request allocation attempt started. As it isn't used for the usual stats, make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are no users, and so that it is active only on queues which need it even when compiled in. v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME gating as suggested by Jens. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
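A simplified excerpt of how the new timestamp sits next to the existing ones, gated by CONFIG_BLK_RQ_ALLOC_TIME as described (field placement illustrative):

    struct request {
            /* ... existing fields ... */
    #ifdef CONFIG_BLK_RQ_ALLOC_TIME
            u64 alloc_time_ns;      /* when the allocation attempt started */
    #endif
            u64 start_time_ns;      /* when the request was allocated */
            u64 io_start_time_ns;   /* when the request was issued to the device */
            /* ... */
    };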
-
Submitted by Tejun Heo
commit beab17fc2a507e85dd18b3cef83820c5770c5f34 upstream. io.weight is gonna be another rq_qos cgroup mechanism. Let's rename RQ_QOS_CGROUP, which is being used by io.latency, to RQ_QOS_LATENCY in preparation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> [Jiufei: remove unused function rq_qos_id_to_name()] Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Tejun Heo
commit 9677a3e01f838622d2efc9a3ccb97090a2c3156a upstream. wbt already gets queue depth change notifications through wbt_set_queue_depth(). Generalize it into rq_qos_ops->queue_depth_changed() so that other rq_qos policies can easily hook into the events too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
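The generalized hook follows the usual rq_qos pattern, a new ops callback plus a notifier that walks the queue's rq_qos chain; roughly:

    /* Walk the queue's rq_qos chain and let every policy that implements
     * the new callback react to the depth change. */
    void rq_qos_queue_depth_changed(struct request_queue *q)
    {
            struct rq_qos *rqos;

            for (rqos = q->rq_qos; rqos; rqos = rqos->next)
                    if (rqos->ops->queue_depth_changed)
                            rqos->ops->queue_depth_changed(rqos);
    }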
-