- 09 June 2020, 1 commit
-
-
By zhongjiang-ali
task #28327019

Commit bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference") added a self-defined bio flag to fix a use-after-free issue. But the flag field is limited to 13 entries and they are all used up, so syncing related patches will fail. This patch replaces a reserved field with extended bio_flags, allowing more bio flags to be defined.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
-
- 07 May 2020, 1 commit
-
-
By Ming Lei
fix #27417914

commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream

Spread queues among present CPUs first, then build the mapping on the remaining non-present CPUs. This minimizes the number of dead queues that are mapped only by non-present CPUs, and so avoids the bad IO performance caused by an unbalanced mapping between present CPUs and queues. A similar policy is already applied for managed IRQ affinity.

Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[jeffle: remove code supporting multiple queue maps, which is merged since v5.0]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 25 March 2020, 1 commit
-
-
By Xiaoguang Wang
fix #25369772

On a blk-mq device, we observed an issue where iostat shows very high svctm and util values even though iops is low, which is counter-intuitive. The root cause is that blk_account_io_start() calls part_round_stats() before the "rq->part = part" statement, so part_round_stats() counts the inflight request against the whole device rather than the specific partition, and then updates the whole device's io_ticks and time_in_queue with a stale part->stamp. To fix this issue, if a request's part is NULL, simply don't count it as an inflight request for the whole device.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 18 March 2020, 17 commits
-
-
By Xiaoguang Wang
When CONFIG_BLK_DEV_THROTTLING is enabled, every bio first enters blk_throtl_bio() even if no blk-throttle bps or iops limits are set for the block cgroup, and this bug causes the corresponding blkcg_gq's refcnt to increase by 1 for every bio. atomic_t is an 'int' type, so if a user continually issues batches of bios, this refcnt will overflow, which triggers the WARNING in blkg_get() or blkg_put().

Fixes: bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
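A minimal userspace illustration (not the kernel code itself) of why a plain 32-bit refcount misbehaves once a stream of unthrottled bios pushes it past INT_MAX:

```c
#include <limits.h>
#include <stdio.h>

/* Toy model of a 32-bit atomic_t-style reference count.  After roughly
 * 2^31 unbalanced blkg_get() calls the counter wraps negative, which is
 * the state the refcount WARNINGs in blkg_get()/blkg_put() complain about. */
int main(void)
{
	unsigned int refcnt = INT_MAX;   /* counter already saturated by leaked gets */

	refcnt += 1;                     /* one more leaked blkg_get() */
	/* On common ABIs this prints a negative value. */
	printf("refcnt as seen by the kernel: %d\n", (int)refcnt);
	return 0;
}
```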
-
By Christoph Hellwig
Cherry-picked from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.

If we pass pages through an iov_iter, the caller always already holds a reference. Thus remove ITER_BVEC_FLAG_NO_REF and do not take references to pages by default for bvec-backed iov_iters.

[Joseph] Resolve conflicts since we don't have:
81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Joseph Qi
Some drivers, such as virtio-blk, do not implement a poll function yet. Before commit 529262d5 ("block: remove ->poll_fn"), q->poll_fn was NULL in that case and blk_poll() would not actually poll. So add a check for this to avoid a NULL pointer dereference when calling q->mq_ops->poll.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
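A small self-contained sketch of the guard being described; the struct and function names below are invented for illustration, not the kernel's:

```c
#include <stdio.h>

/* Hypothetical ops table: drivers that support polling fill in ->poll,
 * others (like virtio-blk at the time) leave it NULL. */
struct demo_mq_ops {
	int (*poll)(int queue_num);
};

static int demo_blk_poll(const struct demo_mq_ops *ops, int queue_num)
{
	/* The fix: bail out instead of dereferencing a NULL ->poll. */
	if (!ops || !ops->poll)
		return 0;
	return ops->poll(queue_num);
}

int main(void)
{
	struct demo_mq_ops no_poll = { .poll = NULL };

	printf("found %d completions\n", demo_blk_poll(&no_poll, 0));
	return 0;
}
```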
-
By Ming Lei
commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream

Just like aio/io_uring, we need to grab 2 refcounts for queuing one request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount grabbed in generic_make_request() serves for submission. In theory, this refcount should have been released after the submission (async run queue) is done. blk_freeze_queue() works together with blk_sync_queue() to avoid the race between queue cleanup and IO submission; given that async run queue activities are canceled because hctx->run_work is scheduled with the refcount held, it is fine not to hold the refcount when running the run queue work function to dispatch IO.

However, if a request is staged into the plug list and finally queued from the plug code path, the refcount on the submission side is actually missed. We may then start to run the queue after the queue has been removed, because the queue's kobject refcount isn't guaranteed to be grabbed in the flush-plug-list context, and a kernel oops is triggered. See the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of the concurrent run queue, all requests inserted above may be completed before the above blk_mq_run_hw_queue is called, so the queue can be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe because the queue is definitely alive before inserting the request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: use the passing 'q' directly]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Konstantin Khlebnikov
commit 42b1bd33dcdef4ffd98f695e188bab82f9fa46d8 upstream

Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED. Code under these ifdefs never worked, something might be broken.

Fixes: 0471559c ("block, bfq: add/remove entity weights correctly")
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jens Axboe
commit 5e27891e88555fecd8262e110e1a29feca4b0166 upstream

We just allocated the queue and haven't even set it up yet, hence we know that checking if ->mq_ops is NULL is always going to be true. In fact we do need to assign a lock to ->queue_lock always, as we need it for the queue flags modifications.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Ming Lei
commit 1a67356e9a4829da2935dd338630a550c59c8489 upstream

It is wrong to use bio->bi_vcnt to figure out how many segments there are in the bio, even when the CLONED flag isn't set on this bio, because the bio may have been split or advanced. So always use bio_segments() in blk_recount_segments(), and it shouldn't cause any performance loss now because the physical segment number is figured out in blk_queue_split() and BIO_SEG_VALID is set meantime since bdced438 ("block: setup bi_phys_segments after splitting").

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 76d8137a ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By zhengbin
commit 4d7c1d3fd7c7eda7dea351f071945e843a46c145 upstream

If __device_add_disk-->bdi_register_owner-->bdi_register-->bdi_register_va-->device_create_vargs fails, bdi->dev is still NULL, yet __device_add_disk-->register_disk will access bdi->dev->kobj. This patch fixes that.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Dan Carpenter
commit 4e6db0f21c99c25980c8d183f95cdb6ad64cebd2 upstream

I recently found some code which called blk_mq_free_map_and_requests() with a NULL set->tags pointer. I fixed the caller, but it seems like a good idea to add a NULL check here as well. Now we can call:

blk_mq_free_tag_set(set);
blk_mq_free_tag_set(set);

twice in a row and it's harmless.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
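A tiny userspace analogue of the pattern (hypothetical names, not the block layer's): checking and clearing the pointer makes the teardown helper safe to call twice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tag-set analogue: the teardown helper tolerates being
 * called on an already-torn-down set because it checks for NULL first
 * and clears the pointer after freeing. */
struct demo_tag_set {
	int *tags;
};

static void demo_free_tag_set(struct demo_tag_set *set)
{
	if (!set->tags)          /* the added NULL check */
		return;
	free(set->tags);
	set->tags = NULL;
}

int main(void)
{
	struct demo_tag_set set = { .tags = malloc(16 * sizeof(int)) };

	demo_free_tag_set(&set);
	demo_free_tag_set(&set); /* second call is now harmless */
	printf("double teardown survived\n");
	return 0;
}
```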
-
By Xiaoguang Wang
commit d6f1dda27251909a27b8d8aacb498628a1047978 upstream

trace_block_getrq() is meant to indicate that a request struct has been allocated for the queue, so put it in the right place.

Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Greg Kroah-Hartman
commit 36991ca68db9dd43bac7f3519f080ee3939263ef upstream

If debugfs were to return a non-NULL error for a debugfs call, using that pointer later in debugfs_create_files() would crash. Fix that by properly checking the pointer before referencing it.

Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Ming Lei
commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream

Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime from the block layer's view, they actually don't, because userspace may grab a kobject at any time via sysfs.

This patch fixes the issue with the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in the release handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed

This patch fixes a kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit e01ad46d53b59720c6ae69963ee1756506954c85 upstream

When we try to increase nr_hw_queues, we may fail due to a shortage of memory or for some other reason; blk_mq_realloc_hw_ctxs then stops and some entries in q->queue_hw_ctx are left NULL. However, because the queue map has already been updated with the new nr_hw_queues, some cpus are mapped to a hw queue that just encountered the allocation failure, so blk_mq_map_queue could return NULL. This will cause a panic in the following blk_mq_map_swqueue.

To fix it, when increasing nr_hw_queues fails, fall back to the previous nr_hw_queues and post a warning. At the same time, a driver's .map_queues usually uses completion irq affinity to map hw queues and cpus; with the fallback nr_hw_queues some cpus may be left without a mapping to a hw queue, so use the default blk_mq_map_queues to do that.

Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit 34d11ffac1f56c3895dad32153abd6814452dc77 upstream

When the hw queues and mq_map are updated, a hctx could be mapped to a different numa node. At that point, we need to realloc the hctx. If that fails, keep using the previous hctx.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jianchao Wang
commit 477e19dedc9d3e1f4443a1d4ae00572a988120ea upstream

blk-mq debugfs and sysfs entries need to be removed before updating the queue map, otherwise we get wrong results there. This patch fixes that and removes the redundant debugfs and sysfs register/unregister operations during __blk_mq_update_nr_hw_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Xiaoguang Wang
If a process context is stuck in wait_on_buffer(), lock_buffer(), lock_page(), wait_on_page_writeback() or wait_on_bit_io(), it is hard to tell the true reason, for example whether the page is under io or has simply been locked too long by another process context. Normally an io request has multiple bios, and every bio contains multiple pages which hold the data to be read from or written to the device, so here we record page info or bio info in task_struct while the process calls lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer() or wait_on_bit_io(). We add a new proc interface:

[lege@localhost linux]$ cat /proc/4516/wait_res
1 ffffd0969f95d3c0 4295369599 4295381596

The above means that thread 4516 is waiting on a page whose address is ffffd0969f95d3c0 and has waited for 11997ms. The first field denotes the page address the process is waiting on, the second field denotes the moment the wait started, and the third denotes the current moment.

In practice, if we find a process waiting on one page for too long a time, we can get the page's address by reading /proc/$pid/wait_page and search this page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang; if the search hits, we get the request and know why this io request has hung for so long.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Xiaoguang Wang
Background:
We do not have a dependable block layer interface to determine whether a block device has io requests which have not been completed for a somewhat long time. Currently we have the 'in_flight' interface: it counts the number of I/O requests that have been issued to the device driver but have not yet completed, and it does not include I/O requests that are in the queue but not yet issued to the device driver, which means it will not count io requests that have been stuck in the block layer. Also, if there are steady io requests issued to the device driver, 'in_flight' may always be non-zero, but you still could not determine whether there is an io request that has not been completed for too long.

Solution:
To find io requests which have not been completed for too long, add 3 new interfaces:

/sys/block/vdb/queue/hang_threshold
If an io request's running time has been greater than this value, count this io as hung.

/sys/block/vdb/hang
Show read/write io requests' hang counters.

/sys/kernel/debug/block/vdb/rq_hang
Show all hung io requests' detailed info, like below:

ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169, .start_time_ns=140634088407, .io_start_time_ns=140634102958, .current_time=146497371953, .bio = ffff97db91e8e000, .bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00, .bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600, .bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300, .bio_pages = { ffffd096a0600a80 }}

With the above info, we can easily see this request's latency distribution; see the next patch for bio_pages' usage.

Note, /sys/kernel/debug/block/vdb/rq_hang only exists for blk-mq device drivers and needs CONFIG_BLK_DEBUG_FS enabled.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
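A rough userspace sketch of the accounting rule the new interfaces describe (all names here are illustrative, not the actual implementation): a request counts as hung once the time since it was issued exceeds the configured threshold.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model: an IO is "hung" when the time since it was issued
 * to the driver exceeds the configured hang_threshold (in milliseconds). */
struct demo_request {
	uint64_t io_start_time_ns;
};

static bool demo_rq_hung(const struct demo_request *rq, uint64_t now_ns,
			 uint64_t hang_threshold_ms)
{
	return now_ns - rq->io_start_time_ns > hang_threshold_ms * 1000000ull;
}

int main(void)
{
	/* Timestamps taken from the rq_hang example output above. */
	struct demo_request rq = { .io_start_time_ns = 140634102958ull };
	uint64_t now_ns = 146497371953ull;

	printf("hung (5000 ms threshold): %d\n", demo_rq_hung(&rq, now_ns, 5000));
	return 0;
}
```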
-
- 17 January 2020, 12 commits
-
-
By Dave Chinner
commit 4800bf7bc8c725e955fcbc6191cc872f43f506d3 upstream

A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to fall into an endless loop in the discard code. The test is creating a device that is exactly 2^32 sectors in size to test mkfs boundary conditions around the 32 bit sector overflow region.

mkfs issues a discard for the entire device size by default, and hence this throws a sector count of 2^32 into blkdev_issue_discard(). It takes the number of sectors to discard as a sector_t - a 64 bit value.

The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") takes this sector count and casts it to a 32 bit value before comparing it against the maximum allowed discard size the device has. This truncates away the upper 32 bits, and so if the lower 32 bits of the sector count are zero, it starts issuing discards of length 0. This causes the code to fall into an endless loop, issuing a zero length discard over and over again on the same sector.

Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Killed pointless WARN_ON().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
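A standalone demonstration of the truncation being described, using plain C types rather than the kernel's helpers (the device limit below is an assumed value): once a 64-bit sector count whose low 32 bits are zero is squeezed through a 32-bit comparison, the computed chunk size becomes 0 and the loop can never make progress.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;              /* 64-bit, as in the kernel */

int main(void)
{
	sector_t nr_sects = 1ULL << 32;     /* exactly 2^32 sectors, as in xfs/259 */
	uint32_t max_discard = 8388608;     /* assumed device limit, in sectors */

	/* Buggy form: the 64-bit count is truncated to 32 bits first. */
	uint32_t truncated = (uint32_t)nr_sects;
	uint64_t bad_chunk = truncated < max_discard ? truncated : max_discard;

	/* Fixed form: compare in 64 bits, then clamp to the device limit. */
	uint64_t good_chunk = nr_sects < max_discard ? nr_sects : max_discard;

	printf("buggy chunk size: %" PRIu64 " sectors (zero-length discard loop)\n",
	       bad_chunk);
	printf("fixed chunk size: %" PRIu64 " sectors\n", good_chunk);
	return 0;
}
```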
-
By Ming Lei
commit ba5d73851e71847ba7f7f4c27a1a6e1f5ab91c79 upstream

Clean up __blkdev_issue_discard() a bit:

- remove the local variable 'end_sect'
- remove the 'fail' code block

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
By Jens Axboe
commit 399254aaf4892113c806816f7e64cf40c804d46d upstream

If bio_iov_iter_get_pages() is called on an iov_iter that is flagged with NO_REF, then we don't need to add a page reference for the pages that we add. Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows not to drop a reference to these pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 6d0c48aede85e38316d0251564cab39cbc2422f6 upstream

For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. For now, we grab a reference to those pages, and release them normally on IO completion. This isn't really needed for the normal case of O_DIRECT from/to a file, but some of the more esoteric use cases (like splice(2)) will unconditionally put the pipe buffer pages when the buffers are released. Until we can manage that case properly, ITER_BVEC pages are treated like normal pages in terms of reference counting.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Christoph Hellwig
commit d04c406f29d9f4dbcb5eb5aa79ce0445c7e9d652 upstream

This prevents a HIPRI bio from being submitted through a stacking driver that does not support polling and thus won't poll for I/O completion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Christoph Hellwig
commit 529262d56dbebe6a26df5d2fd24cc0e1bc8579e5 upstream

This was intended to support users like nvme multipath, but is just getting in the way and adding another indirect call.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 0a1b8b87d064a47fad9ec475316002da28559207 upstream

blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain the old behavior.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jens Axboe
commit 1052b8ac5282daf35df331edcbdb645839d17e6a upstream

If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Damien Le Moal
commit 64845a1ddd655574886eb48e9a5eaeeb9b05bf0d upstream

Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
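A self-contained userspace model of what such a helper does (the struct and constants below are simplified stand-ins, not the kernel's definitions): return the task's io_context priority when one exists, otherwise a "no priority set" default.

```c
#include <stdio.h>

/* Simplified stand-ins for the kernel's io_context / ioprio machinery. */
#define DEMO_IOPRIO_CLASS_NONE  0
#define DEMO_IOPRIO_SHIFT       13
#define DEMO_IOPRIO_VALUE(class, data)  (((class) << DEMO_IOPRIO_SHIFT) | (data))

struct demo_io_context {
	int ioprio;
};

struct demo_task {
	struct demo_io_context *io_context;   /* NULL if no priority was ever set */
};

/* Model of a get_current_ioprio()-style helper: use the task's io context
 * priority if present, otherwise fall back to "no class". */
static int demo_get_ioprio(const struct demo_task *task)
{
	if (task->io_context)
		return task->io_context->ioprio;
	return DEMO_IOPRIO_VALUE(DEMO_IOPRIO_CLASS_NONE, 0);
}

int main(void)
{
	struct demo_io_context ioc = { .ioprio = DEMO_IOPRIO_VALUE(2, 4) };
	struct demo_task with = { .io_context = &ioc };
	struct demo_task without = { .io_context = NULL };

	printf("with io_context:    %d\n", demo_get_ioprio(&with));
	printf("without io_context: %d\n", demo_get_ioprio(&without));
	return 0;
}
```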
-
By Jens Axboe
commit 85f4d4b65fdd67f1d6dc9eeb1d91923cef07eb6a upstream

We currently only really support sync poll, i.e. poll with 1 IO in flight. This prepares us for supporting async poll.

Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By David Howells
commit 00e23707442a75b404392cef1405ab4fd498de6b upstream

Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
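A minimal model of the idea (names and flag values here are illustrative, not the kernel's): the direction bit and the iterator kind are packed into one 'type' word, and callers go through small accessors instead of open-coded bitwise-AND chains.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEMO_WRITE      0x1          /* direction bit */
#define DEMO_ITER_IOVEC 0x2
#define DEMO_ITER_BVEC  0x4

struct demo_iov_iter {
	unsigned int type;
};

/* Accessor: the iterator kind, with the direction bit stripped. */
static unsigned int demo_iter_type(const struct demo_iov_iter *i)
{
	return i->type & ~DEMO_WRITE;
}

static bool demo_iter_is_bvec(const struct demo_iov_iter *i)
{
	return demo_iter_type(i) == DEMO_ITER_BVEC;
}

/* Accessor: the data direction only. */
static unsigned int demo_iter_rw(const struct demo_iov_iter *i)
{
	return i->type & DEMO_WRITE;
}

int main(void)
{
	struct demo_iov_iter it = { .type = DEMO_ITER_BVEC | DEMO_WRITE };

	printf("is_bvec=%d rw=%s\n", demo_iter_is_bvec(&it),
	       demo_iter_rw(&it) ? "WRITE" : "READ");
	return 0;
}
```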
-
By Joseph Qi
This fixes the following format build warning:

block/blk-iocost.c: In function 'ioc_stat_prfill':
block/blk-iocost.c:2506:17: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 9 has type 'long int' [-Wformat=]

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 0670363c ("alinux: iocost: add ioc_gq stat")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 15 January 2020, 8 commits
-
-
By Jiufei Xue
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Jiufei Xue
Add a stat file to monitor the ioc_gq stat.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
By Xiaoguang Wang
Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or iops limit, the bio is queued in the throtl_grp's throtl_service_queue. The mm subsystem will then obviously submit more pages even though the underlying device cannot handle these io requests, and this also makes a large amount of pages enter writeback prematurely; if some process later writes some of these pages, it will wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file, and this process's blkcg max bps limit is set to 10MB/s. I observe this:

#cat /proc/meminfo | grep -i back
Writeback: 900024 kB
WritebackTmp: 0 kB

I think this Writeback value is just too big: indeed many bios have been queued in the throtl_grp's throtl_service_queue. If one process tries to write the last bio's page in this queue, it will call wait_on_page_writeback(page), which must wait for the previous bios to finish and will take a long time. We have also seen 120s hung task warnings on our servers.

INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
Tainted: G E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u128:0 D 0 30072 2 0x00000000
Workqueue: writeback wb_workfn (flush-8:16)
ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
Call Trace:
[<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
[<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
[<ffffffff81733726>] schedule+0x36/0x80
[<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
[<ffffffff81036c69>] ? sched_clock+0x9/0x10
[<ffffffff81363073>] ? get_request+0x403/0x810
[<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
[<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
[<ffffffff81733f90>] ? bit_wait+0x60/0x60
[<ffffffff81733fab>] bit_wait_io+0x1b/0x60
[<ffffffff81733b28>] __wait_on_bit+0x58/0x90
[<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
[<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
[<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
[<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
[<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
[<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
[<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
[<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
[<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
[<ffffffff811c139e>] do_writepages+0x1e/0x30
[<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
[<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
[<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
[<ffffffff8127d884>] wb_workfn+0xb4/0x380
[<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
[<ffffffff810a5759>] process_one_work+0x189/0x420
[<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
[<ffffffff810a59f0>] ? process_one_work+0x420/0x420
[<ffffffff810ac026>] kthread+0xe6/0x100
[<ffffffff810abf40>] ? kthread_park+0x60/0x60
[<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit the throtl_service_queue's maximum number of queued bios. Currently we limit it to the throtl_grp's bps or iops limit; if it still exceeds that, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
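A rough userspace sketch of the policy described above (names, numbers and the dispatch hook are invented for illustration): once the number of queued bios reaches the group's configured limit, the submitter waits instead of queueing more.

```c
#include <stdio.h>
#include <unistd.h>

/* Illustrative model: cap the number of bios sitting in a throttle group's
 * service queue; once the cap is reached, the submitter sleeps until
 * dispatch drains the queue, instead of piling up an unbounded backlog. */
struct demo_throtl_grp {
	unsigned long queue_cap;      /* e.g. derived from the bps/iops limit */
	unsigned long nr_queued;
};

static void demo_dispatch_one(struct demo_throtl_grp *tg)
{
	if (tg->nr_queued)
		tg->nr_queued--;
}

static void demo_queue_bio(struct demo_throtl_grp *tg)
{
	while (tg->nr_queued >= tg->queue_cap) {
		usleep(1000);             /* submitter waits for a while ... */
		demo_dispatch_one(tg);    /* ... while dispatch makes room */
	}
	tg->nr_queued++;
}

int main(void)
{
	struct demo_throtl_grp tg = { .queue_cap = 4, .nr_queued = 4 };

	demo_queue_bio(&tg);              /* has to wait before queueing */
	printf("queued=%lu cap=%lu\n", tg.nr_queued, tg.queue_cap);
	return 0;
}
```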
-
By Jiufei Xue
Now we have counters for wait_time and service_time, but no counter for completed ios, so the average latency cannot be measured.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Jiufei Xue
This patch cleans up the code since the seq_show handlers for the tg counters are the same. No functional changes.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Joseph Qi
Add 2 interfaces to stat io throttle information:

blkio.throttle.total_io_queued
blkio.throttle.total_bytes_queued

These interfaces are used for monitoring throttled io/bytes and for analyzing whether delays are related to io throttling.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
By Joseph Qi
The io throttle stats code does blkg_get at the beginning of throttling and then blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be freed if end_io is called twice, as dm-thin does: it saves the original end_io first, calls its overriding end_io, and then the saved end_io. After that, accessing the blkg is invalid and finally BUGs:

[ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
[ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
[ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
[ 4417.239232] Oops: 0000 [#1] SMP
......
[ 4417.274070] Call Trace:
[ 4417.275407] [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
[ 4417.276760] [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
[ 4417.278079] [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
[ 4417.279387] [<ffffffff81095772>] ? insert_work+0x62/0xa0
[ 4417.280697] [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
[ 4417.282019] [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
[ 4417.283326] [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
[ 4417.284637] [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
[ 4417.285951] [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
[ 4417.287240] [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
[ 4417.288503] [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
[ 4417.289778] [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
[ 4417.291062] [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
[ 4417.292344] [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
[ 4417.293626] [<ffffffff812c9e61>] submit_bio+0x71/0x150
[ 4417.294909] [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
[ 4417.296195] [<ffffffff81215acb>] _submit_bh+0x14b/0x220
[ 4417.297484] [<ffffffff81215bb0>] submit_bh+0x10/0x20
[ 4417.298744] [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
[ 4417.300014] [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
[ 4417.301268] [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
[ 4417.302524] [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
[ 4417.303753] [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
[ 4417.304950] [<ffffffff8109ffef>] kthread+0xcf/0xe0
[ 4417.306107] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
[ 4417.307255] [<ffffffff81647f18>] ret_from_fork+0x58/0x90
[ 4417.308349] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
......

Now we introduce a new bio flag BIO_THROTL_STATED to make sure blkg_get/put is only called once for the same bio.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
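A compact userspace sketch of the guard such a flag provides (hypothetical names, not the actual bio fields): the put side executes at most once per bio even if the end_io path runs twice.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of a once-only flag on a bio: the reference taken
 * when throttle stats start is dropped exactly once, even if the end_io
 * callback fires twice (as in the stacked dm-thin case). */
struct demo_bio {
	bool throtl_stated;   /* set once the stats reference was taken */
	int  blkg_ref;
};

static void demo_throtl_stat_start(struct demo_bio *bio)
{
	if (!bio->throtl_stated) {
		bio->blkg_ref++;              /* blkg_get() */
		bio->throtl_stated = true;
	}
}

static void demo_tg_end_io(struct demo_bio *bio)
{
	if (bio->throtl_stated) {
		bio->blkg_ref--;              /* blkg_put(), at most once */
		bio->throtl_stated = false;
	}
}

int main(void)
{
	struct demo_bio bio = { 0 };

	demo_throtl_stat_start(&bio);
	demo_tg_end_io(&bio);
	demo_tg_end_io(&bio);                 /* duplicate end_io is now a no-op */
	printf("blkg_ref=%d (balanced)\n", bio.blkg_ref);
	return 0;
}
```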
-
By Joseph Qi
Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to get per-cgroup io delay statistics. io_service_time represents the time spent after io throttle to io completion, while io_wait_time represents the time spent on the throttle queue.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-