- 09 June 2020, 1 commit
-
-
By zhongjiang-ali
task #28327019

Commit bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference") added a self-defined bio flag to fix a use-after-free issue. But the flag field is limited to 13 entries and they are all used up, so syncing related patches will fail. This patch replaces a reserved field with extended bio_flags, allowing more bio flags to be defined.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
-
- 07 May 2020, 1 commit
-
-
By Ming Lei
fix #27417914

commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream

Spread queues among present CPUs first, then build the mapping on the remaining non-present CPUs. This minimizes the number of dead queues that are mapped only by non-present CPUs, and so avoids the bad IO performance caused by an unbalanced mapping between present CPUs and queues. A similar policy is already applied for managed IRQ affinity.

Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[jeffle: remove code supporting multiple queue maps, which is merged since v5.0]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 25 March 2020, 1 commit
-
-
By Xiaoguang Wang
fix #25369772

On a blk-mq device, we observed an issue where iostat shows very high svctm and util values even though iops is low, which is counter-intuitive. The root cause is that blk_account_io_start() calls part_round_stats() before the "rq->part = part" statement, so part_round_stats() counts the inflight request against the whole device rather than the specific partition, and then updates the whole device's io_ticks and time_in_queue with a stale part->stamp. To fix this issue, if a request's part is NULL, simply don't count it as an inflight request for the whole device.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 18 March 2020, 17 commits
-
-
By Xiaoguang Wang
When CONFIG_BLK_DEV_THROTTLING is enabled, every bio first enters blk_throtl_bio() even if no blk-throttle bps or iops limits are set for the block cgroup, and this bug causes the corresponding blkcg_gq's refcnt to increase by 1 for every bio. atomic_t is an 'int' type, so if a user continually issues batches of bios, this refcnt will overflow, which triggers the WARNING in blkg_get() or blkg_put().

Fixes: bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
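A minimal userspace illustration (not the kernel code itself) of why a plain 32-bit refcount misbehaves once a stream of unthrottled bios pushes it past INT_MAX:

```c
#include <limits.h>
#include <stdio.h>

/* Toy model of a 32-bit atomic_t-style reference count.  After roughly
 * 2^31 unbalanced blkg_get() calls the counter wraps negative, which is
 * the state the refcount WARNINGs in blkg_get()/blkg_put() complain about. */
int main(void)
{
	unsigned int refcnt = INT_MAX;   /* counter already saturated by leaked gets */

	refcnt += 1;                     /* one more leaked blkg_get() */
	/* On common ABIs this prints a negative value. */
	printf("refcnt as seen by the kernel: %d\n", (int)refcnt);
	return 0;
}
```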
-
By Christoph Hellwig
Cherry-picked from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.

If we pass pages through an iov_iter, the caller always already holds a reference. Thus remove ITER_BVEC_FLAG_NO_REF and do not take references to pages by default for bvec-backed iov_iters.

[Joseph] Resolve conflicts since we don't have:
81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Joseph Qi
Some drivers, such as virtio-blk, do not implement a poll function yet. Before commit 529262d5 ("block: remove ->poll_fn"), q->poll_fn was NULL in that case and blk_poll() would not actually poll. So add a check for this to avoid a NULL pointer dereference when calling q->mq_ops->poll.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
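A small self-contained sketch of the guard being described; the struct and function names below are invented for illustration, not the kernel's:

```c
#include <stdio.h>

/* Hypothetical ops table: drivers that support polling fill in ->poll,
 * others (like virtio-blk at the time) leave it NULL. */
struct demo_mq_ops {
	int (*poll)(int queue_num);
};

static int demo_blk_poll(const struct demo_mq_ops *ops, int queue_num)
{
	/* The fix: bail out instead of dereferencing a NULL ->poll. */
	if (!ops || !ops->poll)
		return 0;
	return ops->poll(queue_num);
}

int main(void)
{
	struct demo_mq_ops no_poll = { .poll = NULL };

	printf("found %d completions\n", demo_blk_poll(&no_poll, 0));
	return 0;
}
```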
-
By Ming Lei
commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream

Just like aio/io_uring, we need to grab 2 refcounts for queuing one request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount grabbed in generic_make_request() serves for submission. In theory, this refcount should have been released after the submission (async run queue) is done. blk_freeze_queue() works together with blk_sync_queue() to avoid the race between queue cleanup and IO submission; given that async run queue activities are canceled because hctx->run_work is scheduled with the refcount held, it is fine not to hold the refcount when running the run queue work function to dispatch IO.

However, if a request is staged into the plug list and finally queued from the plug code path, the refcount on the submission side is actually missed. We may then start to run the queue after the queue has been removed, because the queue's kobject refcount isn't guaranteed to be grabbed in the flush-plug-list context, and a kernel oops is triggered. See the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of the concurrent run queue, all requests inserted above may be completed before the above blk_mq_run_hw_queue is called, so the queue can be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe because the queue is definitely alive before inserting the request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: use the passing 'q' directly]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Konstantin Khlebnikov
commit 42b1bd33dcdef4ffd98f695e188bab82f9fa46d8 upstream

Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED. Code under these ifdefs never worked, something might be broken.

Fixes: 0471559c ("block, bfq: add/remove entity weights correctly")
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jens Axboe
commit 5e27891e88555fecd8262e110e1a29feca4b0166 upstream

We just allocated the queue and haven't even set it up yet, hence we know that checking if ->mq_ops is NULL is always going to be true. In fact we do need to assign a lock to ->queue_lock always, as we need it for the queue flags modifications.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Ming Lei
commit 1a67356e9a4829da2935dd338630a550c59c8489 upstream

It is wrong to use bio->bi_vcnt to figure out how many segments there are in the bio, even when the CLONED flag isn't set on this bio, because the bio may have been split or advanced. So always use bio_segments() in blk_recount_segments(), and it shouldn't cause any performance loss now because the physical segment number is figured out in blk_queue_split() and BIO_SEG_VALID is set meantime since bdced438 ("block: setup bi_phys_segments after splitting").

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 76d8137a ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By zhengbin
commit 4d7c1d3fd7c7eda7dea351f071945e843a46c145 upstream

If __device_add_disk-->bdi_register_owner-->bdi_register-->bdi_register_va-->device_create_vargs fails, bdi->dev is still NULL, yet __device_add_disk-->register_disk will access bdi->dev->kobj. This patch fixes that.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Dan Carpenter
commit 4e6db0f21c99c25980c8d183f95cdb6ad64cebd2 upstream

I recently found some code which called blk_mq_free_map_and_requests() with a NULL set->tags pointer. I fixed the caller, but it seems like a good idea to add a NULL check here as well. Now we can call:

blk_mq_free_tag_set(set);
blk_mq_free_tag_set(set);

twice in a row and it's harmless.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
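A tiny userspace analogue of the pattern (hypothetical names, not the block layer's): checking and clearing the pointer makes the teardown helper safe to call twice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tag-set analogue: the teardown helper tolerates being
 * called on an already-torn-down set because it checks for NULL first
 * and clears the pointer after freeing. */
struct demo_tag_set {
	int *tags;
};

static void demo_free_tag_set(struct demo_tag_set *set)
{
	if (!set->tags)          /* the added NULL check */
		return;
	free(set->tags);
	set->tags = NULL;
}

int main(void)
{
	struct demo_tag_set set = { .tags = malloc(16 * sizeof(int)) };

	demo_free_tag_set(&set);
	demo_free_tag_set(&set); /* second call is now harmless */
	printf("double teardown survived\n");
	return 0;
}
```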
-
By Xiaoguang Wang
commit d6f1dda27251909a27b8d8aacb498628a1047978 upstream

trace_block_getrq() is meant to indicate that a request struct has been allocated for the queue, so put it in the right place.

Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Greg Kroah-Hartman
commit 36991ca68db9dd43bac7f3519f080ee3939263ef upstream

If debugfs were to return a non-NULL error for a debugfs call, using that pointer later in debugfs_create_files() would crash. Fix that by properly checking the pointer before referencing it.

Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Ming Lei
commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream

Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime from the block layer's view, they actually don't, because userspace may grab a kobject at any time via sysfs.

This patch fixes the issue with the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in the release handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed

This patch fixes a kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit e01ad46d53b59720c6ae69963ee1756506954c85 upstream

When we try to increase nr_hw_queues, we may fail due to a shortage of memory or for some other reason; blk_mq_realloc_hw_ctxs then stops and some entries in q->queue_hw_ctx are left NULL. However, because the queue map has already been updated with the new nr_hw_queues, some cpus are mapped to a hw queue that just encountered the allocation failure, so blk_mq_map_queue could return NULL. This will cause a panic in the following blk_mq_map_swqueue.

To fix it, when increasing nr_hw_queues fails, fall back to the previous nr_hw_queues and post a warning. At the same time, a driver's .map_queues usually uses completion irq affinity to map hw queues and cpus; with the fallback nr_hw_queues some cpus may be left without a mapping to a hw queue, so use the default blk_mq_map_queues to do that.

Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit 34d11ffac1f56c3895dad32153abd6814452dc77 upstream

When the hw queues and mq_map are updated, a hctx could be mapped to a different numa node. At that point, we need to realloc the hctx. If that fails, keep using the previous hctx.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit 477e19dedc9d3e1f4443a1d4ae00572a988120ea upstream

blk-mq debugfs and sysfs entries need to be removed before updating the queue map, otherwise we get wrong results there. This patch fixes that and removes the redundant debugfs and sysfs register/unregister operations during __blk_mq_update_nr_hw_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Xiaoguang Wang
If a process context is stuck in wait_on_buffer(), lock_buffer(), lock_page(), wait_on_page_writeback() or wait_on_bit_io(), it is hard to tell the true reason, for example whether the page is under io or has simply been locked too long by another process context. Normally an io request has multiple bios, and every bio contains multiple pages which hold the data to be read from or written to the device, so here we record page info or bio info in task_struct while the process calls lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer() or wait_on_bit_io(). We add a new proc interface:

[lege@localhost linux]$ cat /proc/4516/wait_res
1 ffffd0969f95d3c0 4295369599 4295381596

The above means that thread 4516 is waiting on a page whose address is ffffd0969f95d3c0 and has waited for 11997ms. The first field denotes the page address the process is waiting on, the second field denotes the moment the wait started, and the third denotes the current moment.

In practice, if we find a process waiting on one page for too long a time, we can get the page's address by reading /proc/$pid/wait_page and search this page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang; if the search hits, we get the request and know why this io request has hung for so long.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Xiaoguang Wang
Background:
We do not have a dependable block layer interface to determine whether a block device has io requests which have not been completed for a somewhat long time. Currently we have the 'in_flight' interface: it counts the number of I/O requests that have been issued to the device driver but have not yet completed, and it does not include I/O requests that are in the queue but not yet issued to the device driver, which means it will not count io requests that have been stuck in the block layer. Also, if there are steady io requests issued to the device driver, 'in_flight' may always be non-zero, but you still could not determine whether there is an io request that has not been completed for too long.

Solution:
To find io requests which have not been completed for too long, add 3 new interfaces:

/sys/block/vdb/queue/hang_threshold
If an io request's running time has been greater than this value, count this io as hung.

/sys/block/vdb/hang
Show read/write io requests' hang counters.

/sys/kernel/debug/block/vdb/rq_hang
Show all hung io requests' detailed info, like below:

ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169, .start_time_ns=140634088407, .io_start_time_ns=140634102958, .current_time=146497371953, .bio = ffff97db91e8e000, .bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00, .bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600, .bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300, .bio_pages = { ffffd096a0600a80 }}

With the above info, we can easily see this request's latency distribution; see the next patch for bio_pages' usage.

Note, /sys/kernel/debug/block/vdb/rq_hang only exists for blk-mq device drivers and needs CONFIG_BLK_DEBUG_FS enabled.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
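A rough userspace sketch of the accounting rule the new interfaces describe (all names here are illustrative, not the actual implementation): a request counts as hung once the time since it was issued exceeds the configured threshold.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model: an IO is "hung" when the time since it was issued
 * to the driver exceeds the configured hang_threshold (in milliseconds). */
struct demo_request {
	uint64_t io_start_time_ns;
};

static bool demo_rq_hung(const struct demo_request *rq, uint64_t now_ns,
			 uint64_t hang_threshold_ms)
{
	return now_ns - rq->io_start_time_ns > hang_threshold_ms * 1000000ull;
}

int main(void)
{
	/* Timestamps taken from the rq_hang example output above. */
	struct demo_request rq = { .io_start_time_ns = 140634102958ull };
	uint64_t now_ns = 146497371953ull;

	printf("hung (5000 ms threshold): %d\n", demo_rq_hung(&rq, now_ns, 5000));
	return 0;
}
```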
-
- 17 January 2020, 12 commits
-
-
By Dave Chinner
commit 4800bf7bc8c725e955fcbc6191cc872f43f506d3 upstream

A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to fall into an endless loop in the discard code. The test is creating a device that is exactly 2^32 sectors in size to test mkfs boundary conditions around the 32 bit sector overflow region.

mkfs issues a discard for the entire device size by default, and hence this throws a sector count of 2^32 into blkdev_issue_discard(). It takes the number of sectors to discard as a sector_t - a 64 bit value.

The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") takes this sector count and casts it to a 32 bit value before comparing it against the maximum allowed discard size the device has. This truncates away the upper 32 bits, and so if the lower 32 bits of the sector count are zero, it starts issuing discards of length 0. This causes the code to fall into an endless loop, issuing a zero length discard over and over again on the same sector.

Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Killed pointless WARN_ON().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
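A standalone demonstration of the truncation being described, using plain C types rather than the kernel's helpers (the device limit below is an assumed value): once a 64-bit sector count whose low 32 bits are zero is squeezed through a 32-bit comparison, the computed chunk size becomes 0 and the loop can never make progress.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;              /* 64-bit, as in the kernel */

int main(void)
{
	sector_t nr_sects = 1ULL << 32;     /* exactly 2^32 sectors, as in xfs/259 */
	uint32_t max_discard = 8388608;     /* assumed device limit, in sectors */

	/* Buggy form: the 64-bit count is truncated to 32 bits first. */
	uint32_t truncated = (uint32_t)nr_sects;
	uint64_t bad_chunk = truncated < max_discard ? truncated : max_discard;

	/* Fixed form: compare in 64 bits, then clamp to the device limit. */
	uint64_t good_chunk = nr_sects < max_discard ? nr_sects : max_discard;

	printf("buggy chunk size: %" PRIu64 " sectors (zero-length discard loop)\n",
	       bad_chunk);
	printf("fixed chunk size: %" PRIu64 " sectors\n", good_chunk);
	return 0;
}
```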
-
By Ming Lei
commit ba5d73851e71847ba7f7f4c27a1a6e1f5ab91c79 upstream

Clean up __blkdev_issue_discard() a bit:

- remove the local variable 'end_sect'
- remove the 'fail' code block

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jens Axboe
commit 399254aaf4892113c806816f7e64cf40c804d46d upstream

If bio_iov_iter_get_pages() is called on an iov_iter that is flagged with NO_REF, then we don't need to add a page reference for the pages that we add. Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows not to drop a reference to these pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 6d0c48aede85e38316d0251564cab39cbc2422f6 upstream

For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. For now, we grab a reference to those pages, and release them normally on IO completion. This isn't really needed for the normal case of O_DIRECT from/to a file, but some of the more esoteric use cases (like splice(2)) will unconditionally put the pipe buffer pages when the buffers are released. Until we can manage that case properly, ITER_BVEC pages are treated like normal pages in terms of reference counting.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Christoph Hellwig
commit d04c406f29d9f4dbcb5eb5aa79ce0445c7e9d652 upstream

This prevents a HIPRI bio from being submitted through a stacking driver that does not support polling and thus won't poll for I/O completion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Christoph Hellwig
commit 529262d56dbebe6a26df5d2fd24cc0e1bc8579e5 upstream

This was intended to support users like nvme multipath, but is just getting in the way and adding another indirect call.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 0a1b8b87d064a47fad9ec475316002da28559207 upstream

blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain the old behavior.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 1052b8ac5282daf35df331edcbdb645839d17e6a upstream

If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Damien Le Moal
commit 64845a1ddd655574886eb48e9a5eaeeb9b05bf0d upstream

Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
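A self-contained userspace model of what such a helper does (the struct and constants below are simplified stand-ins, not the kernel's definitions): return the task's io_context priority when one exists, otherwise a "no priority set" default.

```c
#include <stdio.h>

/* Simplified stand-ins for the kernel's io_context / ioprio machinery. */
#define DEMO_IOPRIO_CLASS_NONE  0
#define DEMO_IOPRIO_SHIFT       13
#define DEMO_IOPRIO_VALUE(class, data)  (((class) << DEMO_IOPRIO_SHIFT) | (data))

struct demo_io_context {
	int ioprio;
};

struct demo_task {
	struct demo_io_context *io_context;   /* NULL if no priority was ever set */
};

/* Model of a get_current_ioprio()-style helper: use the task's io context
 * priority if present, otherwise fall back to "no class". */
static int demo_get_ioprio(const struct demo_task *task)
{
	if (task->io_context)
		return task->io_context->ioprio;
	return DEMO_IOPRIO_VALUE(DEMO_IOPRIO_CLASS_NONE, 0);
}

int main(void)
{
	struct demo_io_context ioc = { .ioprio = DEMO_IOPRIO_VALUE(2, 4) };
	struct demo_task with = { .io_context = &ioc };
	struct demo_task without = { .io_context = NULL };

	printf("with io_context:    %d\n", demo_get_ioprio(&with));
	printf("without io_context: %d\n", demo_get_ioprio(&without));
	return 0;
}
```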
-
By Jens Axboe
commit 85f4d4b65fdd67f1d6dc9eeb1d91923cef07eb6a upstream

We currently only really support sync poll, i.e. poll with 1 IO in flight. This prepares us for supporting async poll.

Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By David Howells
commit 00e23707442a75b404392cef1405ab4fd498de6b upstream

Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
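A minimal model of the idea (names and flag values here are illustrative, not the kernel's): the direction bit and the iterator kind are packed into one 'type' word, and callers go through small accessors instead of open-coded bitwise-AND chains.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEMO_WRITE      0x1          /* direction bit */
#define DEMO_ITER_IOVEC 0x2
#define DEMO_ITER_BVEC  0x4

struct demo_iov_iter {
	unsigned int type;
};

/* Accessor: the iterator kind, with the direction bit stripped. */
static unsigned int demo_iter_type(const struct demo_iov_iter *i)
{
	return i->type & ~DEMO_WRITE;
}

static bool demo_iter_is_bvec(const struct demo_iov_iter *i)
{
	return demo_iter_type(i) == DEMO_ITER_BVEC;
}

/* Accessor: the data direction only. */
static unsigned int demo_iter_rw(const struct demo_iov_iter *i)
{
	return i->type & DEMO_WRITE;
}

int main(void)
{
	struct demo_iov_iter it = { .type = DEMO_ITER_BVEC | DEMO_WRITE };

	printf("is_bvec=%d rw=%s\n", demo_iter_is_bvec(&it),
	       demo_iter_rw(&it) ? "WRITE" : "READ");
	return 0;
}
```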
-
By Joseph Qi
This fixes the following format build warning:

block/blk-iocost.c: In function 'ioc_stat_prfill':
block/blk-iocost.c:2506:17: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 9 has type 'long int' [-Wformat=]

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 0670363c ("alinux: iocost: add ioc_gq stat")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 15 January 2020, 8 commits
-
-
By Jiufei Xue
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Jiufei Xue
Add a stat file to monitor the ioc_gq stat.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Xiaoguang Wang
Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or iops limit, the bio is queued in the throtl_grp's throtl_service_queue. The mm subsystem will then obviously submit more pages even though the underlying device cannot handle these io requests, and this also makes a large amount of pages enter writeback prematurely; if some process later writes some of these pages, it will wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file, and this process's blkcg max bps limit is set to 10MB/s. I observe this:

#cat /proc/meminfo | grep -i back
Writeback: 900024 kB
WritebackTmp: 0 kB

I think this Writeback value is just too big: indeed many bios have been queued in the throtl_grp's throtl_service_queue. If one process tries to write the last bio's page in this queue, it will call wait_on_page_writeback(page), which must wait for the previous bios to finish and will take a long time. We have also seen 120s hung task warnings on our servers.

INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
Tainted: G E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u128:0 D 0 30072 2 0x00000000
Workqueue: writeback wb_workfn (flush-8:16)
ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
Call Trace:
[<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
[<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
[<ffffffff81733726>] schedule+0x36/0x80
[<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
[<ffffffff81036c69>] ? sched_clock+0x9/0x10
[<ffffffff81363073>] ? get_request+0x403/0x810
[<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
[<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
[<ffffffff81733f90>] ? bit_wait+0x60/0x60
[<ffffffff81733fab>] bit_wait_io+0x1b/0x60
[<ffffffff81733b28>] __wait_on_bit+0x58/0x90
[<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
[<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
[<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
[<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
[<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
[<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
[<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
[<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
[<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
[<ffffffff811c139e>] do_writepages+0x1e/0x30
[<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
[<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
[<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
[<ffffffff8127d884>] wb_workfn+0xb4/0x380
[<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
[<ffffffff810a5759>] process_one_work+0x189/0x420
[<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
[<ffffffff810a59f0>] ? process_one_work+0x420/0x420
[<ffffffff810ac026>] kthread+0xe6/0x100
[<ffffffff810abf40>] ? kthread_park+0x60/0x60
[<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit the throtl_service_queue's maximum number of queued bios. Currently we limit it to the throtl_grp's bps or iops limit; if it still exceeds that, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
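A rough userspace sketch of the policy described above (names, numbers and the dispatch hook are invented for illustration): once the number of queued bios reaches the group's configured limit, the submitter waits instead of queueing more.

```c
#include <stdio.h>
#include <unistd.h>

/* Illustrative model: cap the number of bios sitting in a throttle group's
 * service queue; once the cap is reached, the submitter sleeps until
 * dispatch drains the queue, instead of piling up an unbounded backlog. */
struct demo_throtl_grp {
	unsigned long queue_cap;      /* e.g. derived from the bps/iops limit */
	unsigned long nr_queued;
};

static void demo_dispatch_one(struct demo_throtl_grp *tg)
{
	if (tg->nr_queued)
		tg->nr_queued--;
}

static void demo_queue_bio(struct demo_throtl_grp *tg)
{
	while (tg->nr_queued >= tg->queue_cap) {
		usleep(1000);             /* submitter waits for a while ... */
		demo_dispatch_one(tg);    /* ... while dispatch makes room */
	}
	tg->nr_queued++;
}

int main(void)
{
	struct demo_throtl_grp tg = { .queue_cap = 4, .nr_queued = 4 };

	demo_queue_bio(&tg);              /* has to wait before queueing */
	printf("queued=%lu cap=%lu\n", tg.nr_queued, tg.queue_cap);
	return 0;
}
```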
-
By Jiufei Xue
Now we have counters for wait_time and service_time, but no counter for completed ios, so the average latency cannot be measured.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jiufei Xue
This patch cleans up the code since the seq_show handlers for the tg counters are the same. No functional changes.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Joseph Qi
Add 2 interfaces to stat io throttle information:

blkio.throttle.total_io_queued
blkio.throttle.total_bytes_queued

These interfaces are used for monitoring throttled io/bytes and for analyzing whether delays are related to io throttling.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Joseph Qi
The io throttle stats code does blkg_get at the beginning of throttling and then blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be freed if end_io is called twice, as dm-thin does: it saves the original end_io first, calls its overriding end_io, and then the saved end_io. After that, accessing the blkg is invalid and finally BUGs:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407] [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760] [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079] [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387] [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697] [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019] [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326] [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637] [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951] [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240] [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503] [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778] [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062] [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344] [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626] [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909] [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195] [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484] [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744] [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014] [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268] [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524] [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753] [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950] [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255] [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag BIO_THROTL_STATED to make sure blkg_get/put is only called once for the same bio.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
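A compact userspace sketch of the guard such a flag provides (hypothetical names, not the actual bio fields): the put side executes at most once per bio even if the end_io path runs twice.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of a once-only flag on a bio: the reference taken
 * when throttle stats start is dropped exactly once, even if the end_io
 * callback fires twice (as in the stacked dm-thin case). */
struct demo_bio {
	bool throtl_stated;   /* set once the stats reference was taken */
	int  blkg_ref;
};

static void demo_throtl_stat_start(struct demo_bio *bio)
{
	if (!bio->throtl_stated) {
		bio->blkg_ref++;              /* blkg_get() */
		bio->throtl_stated = true;
	}
}

static void demo_tg_end_io(struct demo_bio *bio)
{
	if (bio->throtl_stated) {
		bio->blkg_ref--;              /* blkg_put(), at most once */
		bio->throtl_stated = false;
	}
}

int main(void)
{
	struct demo_bio bio = { 0 };

	demo_throtl_stat_start(&bio);
	demo_tg_end_io(&bio);
	demo_tg_end_io(&bio);                 /* duplicate end_io is now a no-op */
	printf("blkg_ref=%d (balanced)\n", bio.blkg_ref);
	return 0;
}
```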
-
By Joseph Qi
Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to get per-cgroup io delay statistics. io_service_time represents the time spent after io throttle to io completion, while io_wait_time represents the time spent on the throttle queue.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-