Commits · 92339373cb6a8fd38d87cfedeb7019b075f3dc79 · openanolis / cloud-kernel

29 Jun, 2020 4 commits

blk-mq: use plug for devices that implement ->commits_rqs() · 92339373

Authored by Jens Axboe on Nov 29, 2018

fix #28871358

commit b2c5d16b72df1116f05c9be16a630ac939d34101 upstream

If we have that hook, we know the driver handles bd->last == true in
a smart fashion. If it does, even for multiple hardware queues, it's
a good idea to flush batches of requests to the device, if we have
batches of requests from the submitter.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blk-mq: use bd->last == true for list inserts · 9e166ffa

Authored by Jens Axboe on Nov 24, 2018

fix #28871358

commit be94f058f2bde6f0b0ee9059a35daa8e15be308f upstream

If we are issuing a list of requests, we know if we're at the last one.
If we fail issuing, ensure that we call ->commit_rqs() to flush any
potential previous requests.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blk-mq: add mq_ops->commit_rqs() · 0111cff3

Authored by Jens Axboe on Nov 27, 2018

fix #28871358

commit d666ba98f849ad44c4405ecc2180390ebe80f4f9 upstream

blk-mq passes information to the hardware about any given request being
the last that we will issue in this sequence. The point is that hardware
can defer costly doorbell type writes to the last request. But if we run
into errors issuing a sequence of requests, we may never send the request
with bd->last == true set. For that case, we need a hook that tells the
hardware that nothing else is coming right now.

For failures returned by the driver's ->queue_rq() hook, the driver is
responsible for flushing pending requests, if it uses bd->last to
optimize that part. This works like before, no changes there.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
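
The interplay between bd->last and ->commit_rqs() can be pictured with a small user-space model. This is only a sketch, not the kernel code: the toy_* names, the tag counter, and the doorbell counter are all hypothetical stand-ins. The core stops issuing when it runs out of tags, so the request with last set is never sent, and the commit hook flushes what was already queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy device, not a kernel structure. */
struct toy_dev {
	int tags;      /* tags the core can still hand out; 0 means issue fails */
	int pending;   /* requests queued since the last doorbell write */
	int doorbells; /* number of (costly) doorbell writes performed */
};

/* Models ->queue_rq(): queue the request, ring the doorbell only on last. */
static void toy_queue_rq(struct toy_dev *d, bool last)
{
	d->pending++;
	if (last) {
		d->doorbells++;
		d->pending = 0;
	}
}

/* Models ->commit_rqs(): nothing else is coming, flush queued requests. */
static void toy_commit_rqs(struct toy_dev *d)
{
	if (d->pending) {
		d->doorbells++;
		d->pending = 0;
	}
}

/* Models the core issuing a list of n requests; returns how many were queued. */
static int toy_issue_list(struct toy_dev *d, int n)
{
	int queued = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		if (d->tags == 0)
			break; /* core-side failure: the last request is never sent */
		d->tags--;
		toy_queue_rq(d, i == n - 1);
		queued++;
	}
	if (queued < n)
		toy_commit_rqs(d); /* tell the driver the batch ended early */
	return queued;
}
```

With two tags and a list of four requests, the model queues two requests and still performs exactly one doorbell write, which is the behaviour the hook is meant to guarantee.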

block: improve logic around when to sort a plug list · 26702d43

Authored by Jens Axboe on Nov 27, 2018

fix #28871358

Only do it if we have requests for multiple queues in the same
plug.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
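
A minimal user-space sketch of the idea, assuming a fixed-size plug and a flag that is set as soon as a request for a different queue is added. The toy_* types and the comparison against the previously added request are illustrative assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_PLUG_MAX 16

struct toy_rq {
	int queue_id;
	int sector;
};

struct toy_plug {
	struct toy_rq rqs[TOY_PLUG_MAX];
	int count;
	bool multiple_queues; /* set once two different queues are seen */
};

static void toy_plug_add(struct toy_plug *p, struct toy_rq rq)
{
	assert(p->count < TOY_PLUG_MAX);
	/* Compare against the previously added request only. */
	if (p->count > 0 && p->rqs[p->count - 1].queue_id != rq.queue_id)
		p->multiple_queues = true;
	p->rqs[p->count++] = rq;
}

/* Order by queue first, then by sector within a queue. */
static int toy_rq_cmp(const void *a, const void *b)
{
	const struct toy_rq *x = a, *y = b;

	if (x->queue_id != y->queue_id)
		return (x->queue_id > y->queue_id) - (x->queue_id < y->queue_id);
	return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Sort only when the plug holds requests for more than one queue. */
static bool toy_plug_prepare_flush(struct toy_plug *p)
{
	if (!p->multiple_queues)
		return false;
	qsort(p->rqs, (size_t)p->count, sizeof(p->rqs[0]), toy_rq_cmp);
	return true;
}
```

A plug that only ever saw one queue skips the sort entirely, so the common single-device case pays nothing for it.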

15 Jun, 2020 6 commits

block, bfq: fix use-after-free in bfq_idle_slice_timer_body · baecb6b1

Authored by Zhiqiang Liu on Mar 19, 2020

task #28557799

[ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

In bfq_idle_slice_timer(), bfqq = bfqd->in_service_queue is read
outside the bfqd->lock critical section. The bfqq, which is not
NULL in bfq_idle_slice_timer(), may be freed after it is passed
to bfq_idle_slice_timer_body(), so we end up accessing freed memory.

In addition, because the bfqq may be subject to this race, we should
first check whether the bfqq is still in service before operating
on it in bfq_idle_slice_timer_body(). If the racing bfqq is
no longer in service, it has already been expired through
__bfq_bfqq_expire(), and its wait_request flag has been cleared in
__bfq_bfqd_reset_in_service(). So there is no need to clear the
wait_request flag again for a bfqq that is not in service.

KASAN log is given as follows:
[13058.354613] ==================================================================
[13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
[13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
[13058.354646]
[13058.354655] CPU: 96 PID: 19767 Comm: fork13
[13058.354661] Call trace:
[13058.354667]  dump_backtrace+0x0/0x310
[13058.354672]  show_stack+0x28/0x38
[13058.354681]  dump_stack+0xd8/0x108
[13058.354687]  print_address_description+0x68/0x2d0
[13058.354690]  kasan_report+0x124/0x2e0
[13058.354697]  __asan_load8+0x88/0xb0
[13058.354702]  bfq_idle_slice_timer+0xac/0x290
[13058.354707]  __hrtimer_run_queues+0x298/0x8b8
[13058.354710]  hrtimer_interrupt+0x1b8/0x678
[13058.354716]  arch_timer_handler_phys+0x4c/0x78
[13058.354722]  handle_percpu_devid_irq+0xf0/0x558
[13058.354731]  generic_handle_irq+0x50/0x70
[13058.354735]  __handle_domain_irq+0x94/0x110
[13058.354739]  gic_handle_irq+0x8c/0x1b0
[13058.354742]  el1_irq+0xb8/0x140
[13058.354748]  do_wp_page+0x260/0xe28
[13058.354752]  __handle_mm_fault+0x8ec/0x9b0
[13058.354756]  handle_mm_fault+0x280/0x460
[13058.354762]  do_page_fault+0x3ec/0x890
[13058.354765]  do_mem_abort+0xc0/0x1b0
[13058.354768]  el0_da+0x24/0x28
[13058.354770]
[13058.354773] Allocated by task 19731:
[13058.354780]  kasan_kmalloc+0xe0/0x190
[13058.354784]  kasan_slab_alloc+0x14/0x20
[13058.354788]  kmem_cache_alloc_node+0x130/0x440
[13058.354793]  bfq_get_queue+0x138/0x858
[13058.354797]  bfq_get_bfqq_handle_split+0xd4/0x328
[13058.354801]  bfq_init_rq+0x1f4/0x1180
[13058.354806]  bfq_insert_requests+0x264/0x1c98
[13058.354811]  blk_mq_sched_insert_requests+0x1c4/0x488
[13058.354818]  blk_mq_flush_plug_list+0x2d4/0x6e0
[13058.354826]  blk_flush_plug_list+0x230/0x548
[13058.354830]  blk_finish_plug+0x60/0x80
[13058.354838]  read_pages+0xec/0x2c0
[13058.354842]  __do_page_cache_readahead+0x374/0x438
[13058.354846]  ondemand_readahead+0x24c/0x6b0
[13058.354851]  page_cache_sync_readahead+0x17c/0x2f8
[13058.354858]  generic_file_buffered_read+0x588/0xc58
[13058.354862]  generic_file_read_iter+0x1b4/0x278
[13058.354965]  ext4_file_read_iter+0xa8/0x1d8 [ext4]
[13058.354972]  __vfs_read+0x238/0x320
[13058.354976]  vfs_read+0xbc/0x1c0
[13058.354980]  ksys_read+0xdc/0x1b8
[13058.354984]  __arm64_sys_read+0x50/0x60
[13058.354990]  el0_svc_common+0xb4/0x1d8
[13058.354994]  el0_svc_handler+0x50/0xa8
[13058.354998]  el0_svc+0x8/0xc
[13058.354999]
[13058.355001] Freed by task 19731:
[13058.355007]  __kasan_slab_free+0x120/0x228
[13058.355010]  kasan_slab_free+0x10/0x18
[13058.355014]  kmem_cache_free+0x288/0x3f0
[13058.355018]  bfq_put_queue+0x134/0x208
[13058.355022]  bfq_exit_icq_bfqq+0x164/0x348
[13058.355026]  bfq_exit_icq+0x28/0x40
[13058.355030]  ioc_exit_icq+0xa0/0x150
[13058.355035]  put_io_context_active+0x250/0x438
[13058.355038]  exit_io_context+0xd0/0x138
[13058.355045]  do_exit+0x734/0xc58
[13058.355050]  do_group_exit+0x78/0x220
[13058.355054]  __wake_up_parent+0x0/0x50
[13058.355058]  el0_svc_common+0xb4/0x1d8
[13058.355062]  el0_svc_handler+0x50/0xa8
[13058.355066]  el0_svc+0x8/0xc
[13058.355067]
[13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70 which belongs to the cache bfq_queue of size 464
[13058.355075] The buggy address is located 264 bytes inside of 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
[13058.355077] The buggy address belongs to the page:
[13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
[13058.366175] flags: 0x2ffffe0000008100(slab|head)
[13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
[13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
[13058.370789] page dumped because: kasan: bad access detected
[13058.370791]
[13058.370792] Memory state around the buggy address:
[13058.370797]  ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
[13058.370801]  ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370808]                                                                 ^
[13058.370811]  ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370815]  ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[13058.370817] ==================================================================
[13058.370820] Disabling lock debugging due to kernel taint

Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
--
V2->V3: rewrite the comment as suggested by Paolo Valente
V1->V2: add one comment, and add Fixes and Reported-by tag.

Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reported-by: Wang Wang <wangwang2@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
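
The shape of the fix can be sketched in user space. This is a simplified model under stated assumptions: the toy_* types are hypothetical, the lock is a plain boolean that only checks pairing, and the real function does more work than clearing one flag. The point it illustrates is that the timer body receives bfqd and samples in_service_queue inside the critical section instead of trusting a pointer read earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_bfqq {
	bool wait_request;
};

struct toy_bfqd {
	bool lock_held;                    /* stand-in for bfqd->lock */
	struct toy_bfqq *in_service_queue; /* may change while the lock is not held */
};

static void toy_lock(struct toy_bfqd *d)
{
	assert(!d->lock_held);
	d->lock_held = true;
}

static void toy_unlock(struct toy_bfqd *d)
{
	assert(d->lock_held);
	d->lock_held = false;
}

/*
 * Takes the scheduler data, not a queue pointer sampled outside the lock.
 * Returns true if an in-service queue was found and handled.
 */
static bool toy_idle_slice_timer_body(struct toy_bfqd *d)
{
	bool handled = false;

	toy_lock(d);
	struct toy_bfqq *q = d->in_service_queue; /* read inside the lock */
	if (q) {
		q->wait_request = false;
		handled = true;
	}
	toy_unlock(d);
	return handled;
}
```

If the queue was expired before the timer ran, the model simply finds no in-service queue and touches nothing.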

block: Fix use-after-free issue accessing struct io_cq · fba123ba

Authored by Sahitya Tummala on Mar 11, 2020

task #28557799

[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

There is a potential race between ioc_release_fn() and
ioc_clear_queue(), shown below, which causes the kernel
crash below. It can also result in a use-after-free.

context#1:				context#2:
ioc_release_fn()			__ioc_clear_queue() gets the same icq
->spin_lock(&ioc->lock);		->spin_lock(&ioc->lock);
->ioc_destroy_icq(icq);
  ->list_del_init(&icq->q_node);
  ->call_rcu(&icq->__rcu_head,
  	icq_free_icq_rcu);
->spin_unlock(&ioc->lock);
					->ioc_destroy_icq(icq);
					  ->hlist_del_init(&icq->ioc_node);
					  This results into below crash as this memory
					  is now used by icq->__rcu_head in context#1.
					  There is a chance that icq could be free'd
					  as well.

22150.386550:   <6> Unable to handle kernel write to read-only memory
at virtual address ffffffaa8d31ca50
...
Call trace:
22150.607350:   <2>  ioc_destroy_icq+0x44/0x110
22150.611202:   <2>  ioc_clear_queue+0xac/0x148
22150.615056:   <2>  blk_cleanup_queue+0x11c/0x1a0
22150.619174:   <2>  __scsi_remove_device+0xdc/0x128
22150.623465:   <2>  scsi_forget_host+0x2c/0x78
22150.627315:   <2>  scsi_remove_host+0x7c/0x2a0
22150.631257:   <2>  usb_stor_disconnect+0x74/0xc8
22150.635371:   <2>  usb_unbind_interface+0xc8/0x278
22150.639665:   <2>  device_release_driver_internal+0x198/0x250
22150.644897:   <2>  device_release_driver+0x24/0x30
22150.649176:   <2>  bus_remove_device+0xec/0x140
22150.653204:   <2>  device_del+0x270/0x460
22150.656712:   <2>  usb_disable_device+0x120/0x390
22150.660918:   <2>  usb_disconnect+0xf4/0x2e0
22150.664684:   <2>  hub_event+0xd70/0x17e8
22150.668197:   <2>  process_one_work+0x210/0x480
22150.672222:   <2>  worker_thread+0x32c/0x4c8

Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
indicate that this icq has already been marked as destroyed. Also, ensure
__ioc_clear_queue() accesses the icq within rcu_read_lock/unlock so
that the icq isn't freed while it is still in use.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
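
The destroyed-flag part of the fix amounts to making teardown idempotent. A minimal sketch, assuming a hypothetical toy_icq type and a counter standing in for the real unlinking and call_rcu() work; the RCU read-side protection mentioned above is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ICQ_DESTROYED (1u << 0)

struct toy_icq {
	unsigned int flags;
	int teardown_count; /* how many times the teardown really ran */
};

/* Returns true if this call performed the teardown, false if it was a repeat. */
static bool toy_destroy_icq(struct toy_icq *icq)
{
	assert(icq);
	if (icq->flags & TOY_ICQ_DESTROYED)
		return false; /* another context already destroyed it */
	icq->flags |= TOY_ICQ_DESTROYED;
	icq->teardown_count++; /* stands in for list removal and deferred free */
	return true;
}
```

The second context that reaches the same icq sees the flag and backs off instead of unlinking it again.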

block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices · dc938b41

Authored by Konstantin Khlebnikov on Feb 28, 2020

task #28557799

[ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]

Field bdi->io_pages added in commit 9491ae4a ("mm: don't cap request
size based on read-ahead setting") removes unneeded split of read requests.

Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
limits of their devices by blk_set_stacking_limits() + disk_stack_limits().
Field bdi->io_pages stays zero until the user sets max_sectors_kb via sysfs.

This patch updates io_pages after merging limits in disk_stack_limits().

Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
the same problem for device-mapper devices, this one fixes MD RAIDs.

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
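
The arithmetic behind keeping io_pages in sync is a sector-to-page conversion after the limits are merged. The sketch below assumes 4 KiB pages and 512-byte sectors, and reduces limit stacking to taking the smaller non-zero max_sectors; the toy_* names are hypothetical.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT   12 /* assumption: 4 KiB pages */
#define TOY_SECTOR_SHIFT 9  /* 512-byte sectors */

struct toy_limits {
	unsigned int max_sectors; /* 0 means "not set yet" */
};

struct toy_bdi {
	unsigned long io_pages;
};

/* Merge a component device's limit into the stacked device, then resync io_pages. */
static void toy_stack_limits(struct toy_limits *top, struct toy_bdi *bdi,
			     const struct toy_limits *bottom)
{
	assert(bottom->max_sectors > 0);
	if (top->max_sectors == 0 || bottom->max_sectors < top->max_sectors)
		top->max_sectors = bottom->max_sectors;
	/* 8 sectors per page with the shifts assumed above */
	bdi->io_pages = top->max_sectors >> (TOY_PAGE_SHIFT - TOY_SECTOR_SHIFT);
}
```

Stacking a 2560-sector and a 1024-sector component leaves the stacked device at 1024 sectors, i.e. 128 pages, without anyone touching sysfs.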

block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() · 8cf52b47

Authored by Carlo Nonato on Mar 06, 2020

task #28557799

[ Upstream commit 14afc59361976c0ba39e3a9589c3eaa43ebc7e1d ]

The bfq_find_set_group() function takes as input a blkcg (which represents
a cgroup) and retrieves the corresponding bfq_group, then it updates the
bfq internal group hierarchy (see comments inside the function for why
this is needed) and finally it returns the bfq_group.
In the hierarchy update cycle, the pointer holding the correct bfq_group
that has to be returned is mistakenly used to traverse the hierarchy
bottom to top, meaning that in each iteration it gets overwritten with the
parent of the current group. Since the update cycle stops at root's
children (depth = 2), the overwrite becomes a problem only if the blkcg
describes a cgroup at a hierarchy level deeper than that (depth > 2). In
this case the root's child that happens to be also an ancestor of the
correct bfq_group is returned. The main consequence is that processes
contained in a cgroup at depth greater than 2 are wrongly placed in the
group described above by BFQ.

This commit fixes this problem by using a different bfq_group pointer in
the update cycle in order to avoid the overwrite of the variable holding
the original group reference.
Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com>
Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
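
The pointer-overwrite bug and its fix can be reduced to a tree walk with a separate cursor. This is a sketch only: toy_group and its linked field are hypothetical, and the real hierarchy update does more than set a flag.

```c
#include <assert.h>
#include <stddef.h>

struct toy_group {
	struct toy_group *parent; /* NULL for the root group */
	int linked;               /* set when the update loop visited this group */
};

/*
 * Walk from the group up to the root's child using a separate cursor,
 * so the group to be returned is never overwritten.
 */
static struct toy_group *toy_find_set_group(struct toy_group *bfqg)
{
	struct toy_group *cursor = bfqg;

	assert(bfqg);
	while (cursor->parent) { /* the root itself is not updated */
		cursor->linked = 1;
		cursor = cursor->parent;
	}
	return bfqg; /* with the bug, the root's child would be returned here */
}
```

For a group three levels below the root, the walk still updates every ancestor up to the root's child, but the caller gets back the group it asked for.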

block: fix an integer overflow in logical block size · 8b05616d

Authored by Mikulas Patocka on Jan 15, 2020

task #28557799

commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.

Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.

For example (run this on an architecture with 64k pages):

Mount will fail with this error because it tries to read the superblock using 2-sector
access:
  device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
  EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
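
The overflow itself is easy to reproduce in user space. The sketch uses uint16_t as a stand-in for a 16-bit unsigned short and uint32_t for the widened field; the toy_* helpers are hypothetical and only show what value survives the store.

```c
#include <assert.h>
#include <stdint.h>

static int toy_is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Old field width: a 64k block size wraps around to 0. */
static uint16_t toy_store_short(uint32_t block_size)
{
	assert(toy_is_pow2(block_size));
	return (uint16_t)block_size;
}

/* Widened field: the value is kept intact. */
static uint32_t toy_store_int(uint32_t block_size)
{
	assert(toy_is_pow2(block_size));
	return block_size;
}
```

A 32768-byte block size is the largest power of two that still fits in 16 bits; 65536 is stored as 0 in the narrow field.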

block: fix memleak when __blk_rq_map_user_iov() is failed · a15ce925

Authored by Yang Yingliang on Dec 18, 2019

task #28557799

[ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]

While running a fuzz test, I got this memleak report:

BUG: memory leak
unreferenced object 0xffff88837af80000 (size 4096):
  comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00   ...............
  backtrace:
    [<000000001c894df8>] bio_alloc_bioset+0x393/0x590
    [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0
    [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0
    [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160
    [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870
    [<00000000064bb208>] sg_write.part.25+0x5d9/0x950
    [<000000004fc670f6>] sg_write+0x5f/0x8c
    [<00000000b0d05c7b>] __vfs_write+0x7c/0x100
    [<000000008e177714>] vfs_write+0x1c3/0x500
    [<0000000087d23f34>] ksys_write+0xf9/0x200
    [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0
    [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

If __blk_rq_map_user_iov() fails in blk_rq_map_user_iov(),
the bio(s) allocated before the failure will leak. The
refcount of each bio is initialized to 1 and increased to 2 by
bio_get(), but __blk_rq_unmap_user() only decreases it to 1, so
the bio cannot be freed. Fix it by calling blk_rq_unmap_user().
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
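
The reference counting described above can be traced with a toy model. Everything here is hypothetical (toy_bio and the two unmap helpers); it assumes, as the message implies, that the inner unmap drops one reference and the full unmap drops the remaining one as well.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_bio {
	int refcount;
	bool freed;
};

static void toy_bio_init(struct toy_bio *b)
{
	b->refcount = 1; /* reference owned by the allocation */
	b->freed = false;
}

static void toy_bio_get(struct toy_bio *b)
{
	assert(!b->freed);
	b->refcount++;
}

static void toy_bio_put(struct toy_bio *b)
{
	assert(b->refcount > 0);
	if (--b->refcount == 0)
		b->freed = true;
}

/* Error path before the fix: only one reference is dropped. */
static void toy_unmap_inner(struct toy_bio *b)
{
	toy_bio_put(b);
}

/* Error path after the fix: both references are dropped. */
static void toy_unmap_full(struct toy_bio *b)
{
	toy_unmap_inner(b);
	toy_bio_put(b);
}
```

After init and one get the count is 2; the inner unmap leaves it at 1 and the bio is never freed, while the full unmap brings it to 0.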

11 Jun, 2020 1 commit

alinux: blk-mq: remove QUEUE_FLAG_POLL from default MQ flags · 294d5fb2

Authored by Joseph Qi on Jun 10, 2020

fix #28528017

For a virtio-blk device, /sys/block/<device>/queue/io_poll
shows 1 and the user can't disable it. virtio-blk doesn't actually
support polling yet, so this confuses end users. The root cause is that mq
initialization sets the QUEUE_FLAG_POLL bit by default.

This fix takes ideas from the following upstream commits:
6544d229bf43 ("block: enable polling by default if a poll map is initalized")
6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags")
Since we don't want to involve the HCTX_TYPE_POLL related logic,
just check mq_ops->poll and then set QUEUE_FLAG_POLL.
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
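
A minimal sketch of the resulting initialization logic, with hypothetical toy_* types and flag values: the poll flag is no longer part of the defaults and is set only when the driver provides a poll callback.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_FLAG_POLL      (1u << 0)
#define TOY_QUEUE_FLAG_SAME_COMP (1u << 1)
/* Default flags no longer include the poll bit. */
#define TOY_QUEUE_FLAG_MQ_DEFAULT TOY_QUEUE_FLAG_SAME_COMP

struct toy_mq_ops {
	int (*poll)(void); /* NULL when the driver cannot poll */
};

struct toy_queue {
	unsigned int flags;
	const struct toy_mq_ops *mq_ops;
};

/* Placeholder poll callback for a driver that supports polling. */
static int toy_poll_stub(void)
{
	return 0;
}

static void toy_init_queue(struct toy_queue *q, const struct toy_mq_ops *ops)
{
	assert(ops != NULL);
	q->mq_ops = ops;
	q->flags = TOY_QUEUE_FLAG_MQ_DEFAULT;
	if (ops->poll)
		q->flags |= TOY_QUEUE_FLAG_POLL; /* advertise polling only if implemented */
}
```

A driver without a poll callback, such as the virtio-blk case described above, ends up with the flag cleared, so io_poll would read 0.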

09 Jun, 2020 2 commits

block: annotate refault stalls from IO submission · 4da72359

Authored by Johannes Weiner on Aug 08, 2019

task #28327019

commit b8e24a9300b0836a9d39f6b20746766b3b81f1bd upstream

psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: block: replace reserved field with extended bio_flags · 422652e5

Authored by zhongjiang-ali on Jun 09, 2020

task #28327019

Commit bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer
dereference") added a self-defined bio flag to fix a
use-after-free. But bio flags are limited to 13 entries, which are now
all used, so syncing related patches will fail.

This patch replaces a reserved field with extended bio_flags to allow
us to define more bio flags.
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

07 May, 2020 1 commit

blk-mq: balance mapping between present CPUs and queues · 5dc0acc2

Authored by Ming Lei on Jul 25, 2019

fix #27417914

commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream

Spread queues among present CPUs first, then build the mapping for the
remaining non-present CPUs.

This minimizes the number of dead queues, i.e. queues that are mapped only
by non-present CPUs, and avoids the bad IO performance caused by an
unbalanced mapping between present CPUs and queues.

A similar policy has been applied to managed IRQ affinity.

Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[jeffle: remove code supporting multiple queue maps, which is merged since v5.0]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
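
The present-first policy can be sketched as two passes over the CPUs. This is a simplified round-robin model with a fixed CPU count, not the kernel's mapping code (which also considers sibling CPUs); the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Pass 1 spreads queues over present CPUs, pass 2 maps the remaining CPUs. */
static void toy_map_queues(const bool present[TOY_NR_CPUS], int nr_queues,
			   int map[TOY_NR_CPUS])
{
	int q = 0;

	assert(nr_queues > 0);
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (present[cpu])
			map[cpu] = q++ % nr_queues;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (!present[cpu])
			map[cpu] = q++ % nr_queues;
}

/* Counts queues that no present CPU is mapped to. */
static int toy_dead_queues(const bool present[TOY_NR_CPUS],
			   const int map[TOY_NR_CPUS], int nr_queues)
{
	int dead = 0;

	for (int q = 0; q < nr_queues; q++) {
		bool served = false;

		for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			if (present[cpu] && map[cpu] == q)
				served = true;
		if (!served)
			dead++;
	}
	return dead;
}
```

With CPUs 0 to 3 present out of 8 and 4 queues, every queue gets one present CPU, whereas a mapping that simply groups neighbouring CPU numbers would leave two queues served only by non-present CPUs.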

25 Mar, 2020 1 commit

alinux: blk-mq: fix broken io_ticks & time_in_queue update · a9ee8ebe

Authored by Xiaoguang Wang on Mar 17, 2020

fix #25369772

On blk-mq devices, we observed an issue where iostat shows very high
svctm and util values even though IOPS is low, which is counter-intuitive.

The root cause is that blk_account_io_start() calls part_round_stats()
before the "rq->part = part" statement, so part_round_stats() counts
an inflight request against the whole device but not against the specific
partition, and then updates the whole device's io_ticks and time_in_queue
with a stale part->stamp.

To fix this issue, if a request's part is NULL, we just don't count
it as an inflight request to the whole device.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

18 Mar, 2020 17 commits

alinux: blk-throttle: fix logic error about BIO_THROTL_STATED in throtl_bio_end_io() · 8daa9640

Authored by Xiaoguang Wang on Feb 18, 2020

When CONFIG_BLK_DEV_THROTTLING is enabled, every bio still
enters blk_throtl_bio() first, even if no blk-throttle bps or iops
limits are set for the block cgroup. This bug then causes the
corresponding blkcg_gq's refcnt to increase by 1 for every bio.
atomic_t is an 'int' type, so if the user continually issues batches
of bios, this refcnt overflows, which triggers a WARNING in
blkg_get() or blkg_put().

Fixes: bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>

block: never take page references for ITER_BVEC · 709d159e

Authored by Christoph Hellwig on Jun 26, 2019

Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.

If we pass pages through an iov_iter we always already have a reference
in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
reference to pages by default for bvec backed iov_iters.

[Joseph] Resolve conflicts since we don't have:
81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: fix NULL pointer deference in case no poll implementation · 1e41c505

Authored by Joseph Qi on May 23, 2019

For some drivers, such as virtio-blk, the poll function is not implemented
yet. Before commit 529262d5 ("block: remove ->poll_fn"), q->poll_fn
was NULL in that case and blk_poll() did not actually poll.
So add a check for this to avoid a NULL pointer dereference when calling
q->mq_ops->poll.
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: grab .q_usage_counter when queuing request from plug code path · 4a55f77f

Authored by Ming Lei on Apr 30, 2019

commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream.

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theory, this
refcount should have been released after the submission (async run queue)
is done. blk_freeze_queue() works together with blk_sync_queue()
to avoid a race between queue cleanup and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.

However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
safe because the queue is definitely alive before the requests are inserted.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: use the passing 'q' directly]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y · 2a7f90c1

Authored by Konstantin Khlebnikov on Mar 29, 2019

commit 42b1bd33dcdef4ffd98f695e188bab82f9fa46d8 upstream.

Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
Code under these ifdefs never worked, something might be broken.

Fixes: 0471559c ("block, bfq: add/remove entity weights correctly")
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

block: remove bogus check for queue_lock assignment · 7a086776

Authored by Jens Axboe on Oct 12, 2018

commit 5e27891e88555fecd8262e110e1a29feca4b0166 upstream.

We just allocated the queue and haven't even set it up yet,
hence we know that checking if ->mq_ops is NULL is always
going to be true.

In fact we do need to assign a lock to ->queue_lock always,
as we need it for the queue flags modifications.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

block: don't use bio->bi_vcnt to figure out segment number · 7eb1d529

Authored by Ming Lei on Feb 15, 2019

commit 1a67356e9a4829da2935dd338630a550c59c8489 upstream.

It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even though CLONED flag isn't set on this bio,
because this bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(), and it shouldn't
cause any performance loss now because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set meantime since
bdced438 ("block: setup bi_phys_segments after splitting").
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 76d8137a ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

block: fix NULL pointer dereference in register_disk · 9d70cdf3

Authored by zhengbin on Feb 20, 2019

commit 4d7c1d3fd7c7eda7dea351f071945e843a46c145 upstream.

If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: Add a NULL check in blk_mq_free_map_and_requests() · 5f0efd1f

Authored by Dan Carpenter on Nov 29, 2018

commit 4e6db0f21c99c25980c8d183f95cdb6ad64cebd2 upstream.

I recently found some code which called blk_mq_free_map_and_requests()
with a NULL set->tags pointer.  I fixed the caller, but it seems like a
good idea to add a NULL check here as well.  Now we can call:

	blk_mq_free_tag_set(set);
	blk_mq_free_tag_set(set);

twice in a row and it's harmless.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
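
Why the NULL check makes a double free harmless can be seen in a small user-space model. The toy_* types are hypothetical and the per-queue "tags" are just small heap allocations; the sketch assumes the free routine resets the array pointer to NULL after releasing it.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_tag_set {
	int nr_hw_queues;
	int **tags; /* one entry per hardware queue, NULL after free */
};

static int toy_alloc_tag_set(struct toy_tag_set *set, int nr)
{
	assert(nr > 0);
	set->tags = calloc((size_t)nr, sizeof(*set->tags));
	if (!set->tags)
		return -1;
	set->nr_hw_queues = nr;
	for (int i = 0; i < nr; i++)
		set->tags[i] = malloc(sizeof(int)); /* may stay NULL on failure */
	return 0;
}

static void toy_free_map_and_requests(struct toy_tag_set *set, int i)
{
	/* The NULL check: tolerate a missing array or a missing entry. */
	if (set->tags && set->tags[i]) {
		free(set->tags[i]);
		set->tags[i] = NULL;
	}
}

static void toy_free_tag_set(struct toy_tag_set *set)
{
	for (int i = 0; i < set->nr_hw_queues; i++)
		toy_free_map_and_requests(set, i);
	free(set->tags); /* free(NULL) is a no-op on the second call */
	set->tags = NULL;
}
```

Calling toy_free_tag_set() twice on the same set walks the queues again on the second call, finds no array, and returns without touching freed memory.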

blk-mq: place trace_block_getrq() in correct place · d23f53bf

Authored by Xiaoguang Wang on Oct 23, 2018

commit d6f1dda27251909a27b8d8aacb498628a1047978 upstream.

trace_block_getrq() indicates that a request struct has been allocated
for the queue, so put it in the right place.
Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: protect debugfs_create_files() from failures · 9ec13b19

Authored by Greg Kroah-Hartman on Jan 23, 2019

commit 36991ca68db9dd43bac7f3519f080ee3939263ef upstream.

If debugfs were to return a non-NULL error for a debugfs call, using
that pointer later in debugfs_create_files() would crash.

Fix that by properly checking the pointer before referencing it.
Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 9ff28240

Authored by Ming Lei on Nov 20, 2018

commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.

Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
from block layer's view, actually they don't because userspace may
grab one kobject anytime via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.

This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
is enabled.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

9ff28240

blk-mq: fallback to previous nr_hw_queues when updating fails · 4f3484ac

Committed by Jianchao Wang on Oct 12, 2018

commit e01ad46d53b59720c6ae69963ee1756506954c85 upstream.

When we try to increase nr_hw_queues, we may fail due to a
shortage of memory or some other reason; blk_mq_realloc_hw_ctxs then stops
and some entries in q->queue_hw_ctx are left NULL. However,
because the queue map has already been updated with the new nr_hw_queues,
some cpus are mapped to a hw queue whose allocation just failed,
so blk_mq_map_queue could return NULL. This causes a panic in the
subsequent blk_mq_map_swqueue.

To fix it, when increasing nr_hw_queues fails, fall back to the previous
nr_hw_queues and post a warning. At the same time, a driver's .map_queues
usually uses completion irq affinity to map hw queues to cpus, so falling
back to the previous nr_hw_queues would leave some cpus without a mapping
to a hw queue; use the default blk_mq_map_queues to do the mapping instead.
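A minimal userspace sketch of the fallback idea follows. It is not the kernel implementation: `queue_sketch`, `alloc_hctx()`, `map_default()` and the `budget` parameter (used to force an allocation failure) are all hypothetical, only the grow path is modelled, and the round-robin mapping is applied unconditionally for simplicity.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_HW  16
#define NR_CPUS 8

struct queue_sketch {
    void *hctx[MAX_HW];     /* stand-in for q->queue_hw_ctx */
    int nr_hw_queues;       /* must be >= 1 */
    int cpu_map[NR_CPUS];   /* cpu -> hw queue index */
};

/* Hypothetical allocator: succeeds only while *budget is positive. */
static void *alloc_hctx(int *budget)
{
    if (*budget <= 0)
        return NULL;
    (*budget)--;
    return malloc(1);
}

/* Default mapping: spread cpus round-robin over nr hw queues. */
static void map_default(struct queue_sketch *q, int nr)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        q->cpu_map[cpu] = cpu % nr;
}

/* Try to grow to new_nr (<= MAX_HW); returns the count actually in effect. */
static int update_nr_hw_queues(struct queue_sketch *q, int new_nr, int *budget)
{
    int prev = q->nr_hw_queues;
    int i;

    for (i = prev; i < new_nr; i++) {
        q->hctx[i] = alloc_hctx(budget);
        if (!q->hctx[i])
            break;
    }
    if (i < new_nr) {
        /* Growth failed: undo the partial allocations and fall back. */
        fprintf(stderr, "warning: falling back to %d hw queues\n", prev);
        while (--i >= prev) {
            free(q->hctx[i]);
            q->hctx[i] = NULL;
        }
        new_nr = prev;
    }
    q->nr_hw_queues = new_nr;
    map_default(q, new_nr);  /* no cpu may point at a missing hw queue */
    return new_nr;
}
```

After a failed grow, every cpu in this sketch still maps to one of the previously valid hw queues, which is the property the fix is after.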

Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4f3484ac

blk-mq: realloc hctx when hw queue is mapped to another node · 0f44f194

Committed by Jianchao Wang on Oct 12, 2018

commit 34d11ffac1f56c3895dad32153abd6814452dc77 upstream.

When the hw queues and mq_map are updated, a hctx could be mapped
to a different numa node. At that point, we need to realloc the
hctx. If that fails, keep using the previous hctx.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

0f44f194

blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues · 1d5199ea

Committed by Jianchao Wang on Oct 12, 2018

commit 477e19dedc9d3e1f4443a1d4ae00572a988120ea upstream.

blk-mq debugfs and sysfs entries need to be removed before updating the
queue map; otherwise, we get wrong results there. This patch fixes
that and removes the redundant debugfs and sysfs register/unregister
operations during __blk_mq_update_nr_hw_queues.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

1d5199ea

alinux: fs: record page or bio info while process is waiting on it · 2864376f

Committed by Xiaoguang Wang on Nov 07, 2019

If a process context is stuck in wait_on_buffer(), lock_buffer(),
lock_page(), wait_on_page_writeback() or wait_on_bit_io(), it is
hard to tell the true reason, for example, whether the page is under io
or is just locked for too long by another process context.

Normally an io request has multiple bios, and every bio contains multiple
pages which hold the data to be read from or written to the device. So here
we record page info or bio info in task_struct when a process calls
lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
or wait_on_bit_io(), and we add a new proc interface:
[lege@localhost linux]$ cat /proc/4516/wait_res
1 ffffd0969f95d3c0 4295369599 4295381596

The above info means that thread 4516 is waiting on a page whose address is
ffffd0969f95d3c0, and that it has waited for 11997ms.

First field denotes the address of the page the process is waiting on.
Second field denotes the wait moment and the third denotes the current moment.
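As a rough illustration, the sample line can be parsed in userspace as below. The struct and function names are hypothetical, and the sketch assumes that the leading `1` is a separate tag preceding the address and the two timestamps, which the message does not spell out. The difference of the two timestamps equals milliseconds only if the kernel reports them in that unit, as the 11997ms in the example suggests.

```c
#include <stdio.h>

/* Hypothetical view of one line read from /proc/<pid>/wait_res. */
struct wait_res_sketch {
    int tag;                         /* leading field; meaning assumed */
    unsigned long long addr;         /* address being waited on (hex) */
    unsigned long long wait_moment;  /* when the wait started */
    unsigned long long now;          /* moment the file was read */
};

/* Returns 0 on success, -1 if the line does not have four fields. */
static int parse_wait_res(const char *line, struct wait_res_sketch *w)
{
    int n = sscanf(line, "%d %llx %llu %llu",
                   &w->tag, &w->addr, &w->wait_moment, &w->now);
    return n == 4 ? 0 : -1;
}

/* Elapsed wait, in the same unit as the two timestamps. */
static unsigned long long waited(const struct wait_res_sketch *w)
{
    return w->now - w->wait_moment;
}
```

For the sample line, 4295381596 - 4295369599 gives the 11997 quoted above.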

In practice, if we find a process that has been waiting on one page for too long,
we can get the page's address by reading /proc/$pid/wait_res and search for this
page address in every block device's /sys/kernel/debug/block/${devname}/rq_hang.
If the search hits, we can get the request and learn why this io
request has been hanging for so long.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

2864376f

alinux: blk: add iohang check function · 80d6ee24

Committed by Xiaoguang Wang on Oct 11, 2019

Background:
  We do not have a dependable block layer interface to determine whether
a block device has io requests which have not been completed for a
long time. Currently we have the 'in_flight' interface, which counts the
number of I/O requests that have been issued to the device driver but have
not yet completed. It does not include I/O requests that are in the
queue but not yet issued to the device driver, which means it will not
count io requests that are stuck in the block layer.
  Also, if io requests are steadily issued to the device driver,
'in_flight' may always be non-zero, and you cannot determine whether
there is one io request which has not been completed for too long.

Solution:
  To find io requests which have not been completed for too long,
add 3 new interfaces:
  /sys/block/vdb/queue/hang_threshold
If an io request's running time is greater than this value, count
this io as hung.

  /sys/block/vdb/hang
Show the hang counters for read/write io requests.

  /sys/kernel/debug/block/vdb/rq_hang
Show detailed info for all hung io requests, like below:
  ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|
ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169,
.start_time_ns=140634088407, .io_start_time_ns=140634102958,
.current_time=146497371953, .bio = ffff97db91e8e000,
.bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00,
.bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600,
.bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300,
.bio_pages = { ffffd096a0600a80 }}

With the above info, we can easily see this request's latency distribution;
see the next patch for the usage of bio_pages.
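The latency breakdown for the sample request can be worked out as in the sketch below. The struct and helpers are hypothetical; treating hang_threshold as milliseconds and measuring the running time from start_time_ns are both assumptions, since the message states neither.

```c
#include <stdbool.h>
#include <stdint.h>

/* Timestamps as printed in one rq_hang entry, in nanoseconds. */
struct rq_times_sketch {
    uint64_t start_time_ns;     /* request allocated */
    uint64_t io_start_time_ns;  /* request issued to the driver */
    uint64_t current_time_ns;   /* moment rq_hang was read */
};

/* Time spent in the block layer before reaching the driver. */
static uint64_t queued_ns(const struct rq_times_sketch *r)
{
    return r->io_start_time_ns - r->start_time_ns;
}

/* Time spent in the driver/device so far. */
static uint64_t in_flight_ns(const struct rq_times_sketch *r)
{
    return r->current_time_ns - r->io_start_time_ns;
}

/* Assumed rule: hung if the total running time exceeds threshold_ms. */
static bool rq_is_hung(const struct rq_times_sketch *r, uint64_t threshold_ms)
{
    uint64_t total_ms = (r->current_time_ns - r->start_time_ns) / 1000000;
    return total_ms > threshold_ms;
}
```

For the entry shown above this gives 14551 ns queued and 5863268995 ns (about 5.86 s) in flight, so nearly all of the delay is below the block layer.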

Note, /sys/kernel/debug/block/vdb/rq_hang only exists for blk-mq device drivers
and needs CONFIG_BLK_DEBUG_FS enabled.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

80d6ee24

17 Jan, 2020 8 commits

block: fix 32 bit overflow in __blkdev_issue_discard() · e9ca62bc

Committed by Dave Chinner on Nov 14, 2018

commit 4800bf7bc8c725e955fcbc6191cc872f43f506d3 upstream.

A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
fall into an endless loop in the discard code. The test is creating
a device that is exactly 2^32 sectors in size to test mkfs boundary
conditions around the 32 bit sector overflow region.

mkfs issues a discard for the entire device size by default, and
hence this throws a sector count of 2^32 into
blkdev_issue_discard(). It takes the number of sectors to discard as
a sector_t - a 64 bit value.

The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
takes this sector count and casts it to a 32 bit value before
comparing it against the maximum allowed discard size the device
has. This truncates away the upper 32 bits, and so if the lower 32
bits of the sector count are zero, it starts issuing discards of
length 0. This causes the code to fall into an endless loop, issuing
zero length discards over and over again on the same sector.
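The truncation can be reproduced in userspace. The sketch below is not the kernel code: the helper names, the per-discard limit and the `cap` iteration guard (added so the broken variant terminates) are all hypothetical.

```c
#include <stdint.h>

typedef uint64_t sector_t;   /* 64 bit sector count, as in the message */

/* Buggy variant: the count is narrowed to 32 bits before the comparison. */
static sector_t req_sects_truncating(sector_t nr_sects, uint32_t max_sects)
{
    uint32_t n = (uint32_t)nr_sects;   /* upper 32 bits are lost here */
    return n < max_sects ? n : max_sects;
}

/* Fixed variant: compare at full 64 bit width. */
static sector_t req_sects_wide(sector_t nr_sects, uint32_t max_sects)
{
    return nr_sects < max_sects ? nr_sects : max_sects;
}

/*
 * Issue "discards" until nr_sects is used up. Returns the number of
 * discards issued, or -1 if the loop had to be stopped after cap rounds.
 */
static int count_discards(sector_t nr_sects, uint32_t max_sects,
                          sector_t (*pick)(sector_t, uint32_t), int cap)
{
    int issued = 0;

    while (nr_sects && issued < cap) {
        nr_sects -= pick(nr_sects, max_sects);
        issued++;
    }
    return nr_sects ? -1 : issued;
}
```

With a device of exactly 2^32 sectors, the truncating variant picks a length of 0 every round and never makes progress, while the wide variant finishes after 2^32 / max_sects discards.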

Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>

Killed pointless WARN_ON().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e9ca62bc

block: cleanup __blkdev_issue_discard() · 462ae85a

Committed by Ming Lei on Oct 29, 2018

commit ba5d73851e71847ba7f7f4c27a1a6e1f5ab91c79 upstream.

Cleanup __blkdev_issue_discard() a bit:

- remove local variable of 'end_sect'
- remove code block of 'fail'

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

462ae85a

block: add BIO_NO_PAGE_REF flag · c0d2a0b9

Committed by Jens Axboe on Feb 27, 2019

commit 399254aaf4892113c806816f7e64cf40c804d46d upstream.

If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
with NO_REF, then we don't need to add a page reference for the pages
that we add.

Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
not to drop a reference to these pages.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

c0d2a0b9

block: implement bio helper to add iter bvec pages to bio · b1d06bf8

Committed by Jens Axboe on Nov 30, 2018

commit 6d0c48aede85e38316d0251564cab39cbc2422f6 upstream.

For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. For now, we grab a reference to those pages,
and release them normally on IO completion. This isn't really needed
for the normal case of O_DIRECT from/to a file, but some of the more
esoteric use cases (like splice(2)) will unconditionally put the
pipe buffer pages when the buffers are released. Until we can manage
that case properly, ITER_BVEC pages are treated like normal pages
in terms of reference counting.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

b1d06bf8

block: clear REQ_HIPRI if polling is not supported · 719a321b

Committed by Christoph Hellwig on Dec 14, 2018

commit d04c406f29d9f4dbcb5eb5aa79ce0445c7e9d652 upstream.

This prevents a HIPRI bio from being submitted through a stacking
driver that does not support polling and thus won't poll for I/O
completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

719a321b

block: remove ->poll_fn · dd1253a5

Committed by Christoph Hellwig on Dec 02, 2018

commit 529262d56dbebe6a26df5d2fd24cc0e1bc8579e5 upstream.

This was intended to support users like nvme multipath, but is just
getting in the way and adding another indirect call.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

dd1253a5

block: make blk_poll() take a parameter on whether to spin or not · 7244851b

Committed by Jens Axboe on Nov 26, 2018

commit 0a1b8b87d064a47fad9ec475316002da28559207 upstream.

blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain
the old behavior.
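The effect of the new parameter can be pictured with a small userspace sketch. This is not blk_poll() itself: `poll_sketch()`, the `poll_once_fn` callback, the `max_spins` bound and the `countdown` test source are hypothetical and exist only to make the loop self-contained.

```c
#include <stdbool.h>

/* Hypothetical completion source: returns how many completions it found. */
typedef int (*poll_once_fn)(void *data);

/*
 * Check for completions. With spin == true, keep checking until something
 * completes (bounded by max_spins in this sketch). With spin == false,
 * check once and report whatever was found.
 */
static int poll_sketch(poll_once_fn poll_once, void *data, bool spin,
                       int max_spins)
{
    int spins = 0;

    do {
        int found = poll_once(data);
        if (found > 0)
            return found;
        spins++;
    } while (spin && spins < max_spins);

    return 0;
}

/* Test source: reports one completion once calls_left reaches zero. */
struct countdown {
    int calls_left;
};

static int countdown_poll(void *data)
{
    struct countdown *c = data;

    if (c->calls_left > 0) {
        c->calls_left--;
        return 0;
    }
    return 1;
}
```

A caller that waits for one specific request would pass spin == true, while a caller that only wants to reap whatever is already available would pass false.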
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

7244851b

blk-mq: when polling for IO, look for any completion · d25f577c

Committed by Jens Axboe on Nov 26, 2018

commit 1052b8ac5282daf35df331edcbdb645839d17e6a upstream.

If we want to support async IO polling, then we have to allow finding
completions that aren't just for the one we are looking for. Always pass
in -1 to the mq_ops->poll() helper, and have that return how many events
were found in this poll loop.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

d25f577c

openanolis / cloud-kernel: synced successfully more than 1 year ago