提交 · 22c508950ed068bb851e2144d8a814ddfa021a83 · openanolis / cloud-kernel

02 9月, 2020 29 次提交

block: move queues types to the block layer · cd87b6e5

由 Christoph Hellwig 提交于 12月 02, 2018

to #28991349

commit e20ba6e1da029136ded295f33076483d65ddf50a upstream

Having another indirect all in the fast path doesn't really help
in our post-spectre world.  Also having too many queue type is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

cd87b6e5

blk-mq: fix allocation for queue mapping table · 0c1a14f6

由 Ming Lei 提交于 12月 17, 2018

to #28991349

commit 07b35eb5a364fa59f88f65e6c786192f2c9163be upstream

Type of each element in queue mapping table is 'unsigned int,
intead of 'struct blk_mq_queue_map)', so fix it.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0c1a14f6

blk-mq: balance mapping between present CPUs and queues · 1944434f

由 Ming Lei 提交于 7月 25, 2019

to #28991349

commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream

Spread queues among present CPUs first, then building mapping on other
non-present CPUs.

So we can minimize count of dead queues which are mapped by un-present
CPUs only. Then bad IO performance can be avoided by unbalanced mapping
between present CPUs and queues.

The similar policy has been applied on Managed IRQ affinity.

Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

1944434f

blk-mq: initial support for multiple queue maps · c454176d

由 Jens Axboe 提交于 10月 24, 2018

to #28991349

commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream

Add a queue offset to the tag map. This enables users to map
iteratively, for each queue map type they support.

Bump maximum number of supported maps to 2, we're now fully
able to support more than 1 map.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c454176d

blk-mq: improve plug list sorting · ba2986d9

由 Jens Axboe 提交于 10月 30, 2018

to #28991349

commit 3110fc79606fb6003949246c6fb325dd43445273 upstream

Currently we only look at the software queue, but with support
for multiple maps, we should also look at the hardware queue.
This is important since we'll flush out the request list if
either the software queue or hardware queue don't match.

This sorts by software queue first, then hardware queue if
that differs. Finally we sort by request location like before.
This minimizes the flush points per plug list.
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

ba2986d9

blk-mq: cleanup and improve list insertion · 239c0d8b

由 Jens Axboe 提交于 10月 30, 2018

to #28991349

commit 67cae4c948a5311121905a2a8740c50daf7f6478 upstream

It's somewhat strange to have a list insertion function that
relies on the fact that the caller has mapped things correctly.
Pass in the hardware queue directly for insertion, which makes
for a much cleaner interface and implementation.
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

239c0d8b

blk-mq: cache request hardware queue mapping · 107174b9

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream

We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function. cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

107174b9

blk-mq: separate number of hardware queues from nr_cpu_ids · 3eea3213

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit 392546aed22009060911f76b6ea24520e2f8b50f upstream

With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given devices. The initializer of
the tag_set can have set ->nr_hw_queues larger than the available
number of CPUs, since we can exceed that with multiple queue maps.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

3eea3213

blk-mq: support multiple hctx maps · 2ca20c32

由 Jens Axboe 提交于 10月 30, 2018

to #28991349

commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->rq_flags_to_type(). The function takes request/bio
flags and CPU, and returns a queue map index for that. We then
use the type information in blk_mq_map_queue() to index the map
set.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

2ca20c32

blk-mq: add 'type' attribute to the sysfs hctx directory · 37567ba0

由 Jens Axboe 提交于 10月 25, 2018

to #28991349

commit a783b81820fe3532809c98371ec904dfdb0ea9e5 upstream

It can be useful for a user to verify what type a given hardware
queue is, expose this information in sysfs.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

37567ba0

blk-mq: allow software queue to map to multiple hardware queues · d78f5292

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream

The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d78f5292

block: don't lose track of REQ_INTEGRITY flag · c7abd0f1

由 Ming Lei 提交于 1月 16, 2019

to #28991349

commit 7809167da5c86fd6bf309b33dee7a797e263342f upstream

We need to pass bio->bi_opf after bio intergrity preparing, otherwise
the flag of REQ_INTEGRITY may not be set on the allocated request, then
breaks block integrity.

Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping")
Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c7abd0f1

blk-mq: pass in request/bio flags to queue mapping · b8617081

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit f9afca4d367b8c915f28d29fcaba7460640403ff upstream

Prep patch for being able to place request based not just on
CPU location, but also on the type of request.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b8617081

blk-mq: provide dummy blk_mq_map_queue_type() helper · 67c25121

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit ff2c56609d9b1f0739ae3a3bfdb78191d01e4192 upstream

Doesn't do anything right now, but it's needed as a prep patch
to get the interfaces right.

While in there, correct the blk_mq_map_queue() CPU type to an unsigned
int.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

67c25121

blk-mq: abstract out queue map · f63859ea

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream

This is in preparation for allowing multiple sets of maps per
queue, if so desired.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f63859ea

blk-mq: kill q->mq_map · 7c058cfc

由 Jens Axboe 提交于 10月 16, 2018

to #28991349

commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream

It's just a pointer to set->mq_map, use that instead. Move the
assignment a bit earlier, so we always know it's valid.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

7c058cfc

Revert "blk-mq: balance mapping between present CPUs and queues" · 783899aa

由 Xiaoguang Wang 提交于 6月 25, 2020

to #28991349

This reverts commit a3d72a0c79fac0e113bbeb85e1e19b3b3568e2f5.

Previously we just backported this patch partly, now we revert
it temporarily and will backport it in later patches formally.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

783899aa

iocost: protect iocg->abs_vdebt with iocg->waitq.lock · 1ec0deaf

由 Tejun Heo 提交于 5月 04, 2020

to #29361128

commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.

However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.

1. The iocg is in the process of being deactivated by the periodic timer.

2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.

3. The iocg is deactivated.

4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.

5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.

This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.

The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.

This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NVlad Dmitriev <vvd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

1ec0deaf

blk-iocost: account for IO size when testing latencies · 8c8fb141

由 Tejun Heo 提交于 4月 13, 2020

to #29361128

commit cd006509b0a93cb7ee9d9fd50ae274098997a460 upstream.

On each IO completion, iocost decides whether the IO met or missed its latency
target. Currently, the targets are fixed numbers per IO type. While this can be
good enough for loose latency targets way higher than typical completion
latencies, the effect of IO size makes it difficult to tighten the latency
target - a target adequate for 4k IOs might be too tight for 512k IOs and
vice-versa.

iocost already has all the necessary information to account for different IO
sizes when testing whether the latency target is met as iocost can calculate the
size vtime cost of a given IO. This patch updates the completion path to
calculate the size vtime cost of the IO, deduct the nsec equivalent from the
observed latency and use the adjusted value to decide whether the target is met.

This makes latency targets independent from IO size and enables determining
adequate latency targets with fixed size fio runs.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8c8fb141

block: make rq sector size accessible for block stats · 9bdcaff2

由 Hou Tao 提交于 5月 21, 2019

to #29361128

commit 3d24430694077313c75c6b89f618db09943621e4 upstream.

Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.

Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9bdcaff2

blk-iocost: switch to fixed non-auto-decaying use_delay · fc94dc72

由 Tejun Heo 提交于 4月 13, 2020

to #29361128

commit 54c52e10dc9b939084a7e6e3d32ce8fd8dee7898 upstream.

The use_delay mechanism was introduced by blk-iolatency to hold memory
allocators accountable for the reclaim and other shared IOs they cause. The
duration of the delay is dynamically balanced between iolatency increasing the
value on each target miss and it auto-decaying as time passes and threads get
delayed on it.

While this works well for iolatency, iocost's control model isn't compatible
with it. There is no repeated "violation" events which can be balanced against
auto-decaying. iocost instead knows how much a given cgroup is over budget and
wants to prevent that cgroup from issuing IOs while over budget. Until now,
iocost has been adding the cost of force-issued IOs. However, this doesn't
reflect the amount which is already over budget and is simply not enough to
counter the auto-decaying allowing anon-memory leaking low priority cgroup to
go over its alloted share of IOs.

As auto-decaying doesn't make much sense for iocost, this patch introduces a
different mode of operation for use_delay - when blkcg_set_delay() are used
insted of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
delay duration synchronized to the budget overage amount.

With this change, iocost can effectively police cgroups which generate
significant amount of force-issued IOs.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

fc94dc72

blk-iocost: Fix error on iocost_ioc_vrate_adj · d14b5329

由 Waiman Long 提交于 4月 21, 2020

to #29361128

commmit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:

  /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
  error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
   , u32[]* __tracepoint_arg_missed_ppm

That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error was gone.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d14b5329

blk-iocost: remove duplicated lines in comments · 4b3109d5

由 Weiping Zhang 提交于 2月 27, 2020

to #29361128

commit fa800d73c8d0d36b1f5929198371f421b69e610e upstream.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

4b3109d5

blk-iocost: fix incorrect vtime comparison in iocg_is_idle() · e7c5f028

由 Tejun Heo 提交于 3月 10, 2020

to 29361128

commit dcd6589b11d3b1e71f516a87a7b9646ed356b4c0 upstream.

vtimes may wrap and time_before/after64() should be used to determine
whether a given vtime is before or after another. iocg_is_idle() was
incorrectly using plain "<" comparison do determine whether done_vtime
is before vtime. Here, the only thing we're interested in is whether
done_vtime matches vtime which indicates that there's nothing in
flight. Let's test for inequality instead.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e7c5f028

iocost: over-budget forced IOs should schedule async delay · 9ab225fe

由 Tejun Heo 提交于 12月 16, 2019

to #29361128

commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.

When over-budget IOs are force-issued through root cgroup,
iocg_kick_delay() adjusts the async delay accordingly but doesn't
actually schedule async throttle for the issuing task.  This bug is
pretty well masked because sooner or later the offending threads are
gonna get directly throttled on regular IOs or have async delay
scheduled by mem_cgroup_throttle_swaprate().

However, it can affect control quality on filesystem metadata heavy
operations.  Let's fix it by invoking blkcg_schedule_throttle() when
iocg_kick_delay() says async delay is needed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org
Reported-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9ab225fe

alinux: block: initialize io hang counter · 5d951856

由 Xiaoguang Wang 提交于 7月 21, 2020

fix #29420707

Otherwise we'll get stale io hang counter.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

5d951856

alinux: blk-iocost: bypass IOs earlier if disabled · e166c200

由 Joseph Qi 提交于 7月 16, 2020

to #29357063

The blkg lookup or create logic may bring much overhead even iocost is
disabled. So bypass it earlier in such case.

Fixes: 9da41925 ("alinux: iocost: fix NULL pointer dereference in ioc_rqos_throttle")
Reported-by: NHongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>

e166c200

alinux: block-throttle: only do io statistics if needed · b8a94ed8

由 Xiaoguang Wang 提交于 7月 04, 2020

task #29063222

Current blk throttle codes always do io statistics even though users
don't specify valid throttle rules, which will introduce significant
overheads for applications that don't use blk throttle function and
is wrose in arm, see below perf data captured in arm:

sudo taskset -c 66 fio -ioengine=io_uring -sqthread_poll=1 -hipri=1
-sqthread_poll_cpu=65 -registerfiles=1 -fixedbufs=1 -direct=1
-filename=/dev/nvme0n1 -bs=4k -iodepth=8 -rw=randwrite  -time_based
-ramp_time=30 -runtime=60  -name="test"

Samples: 25K of event 'cycles', Event count (approx.): 16586974662
Overhead  Command      Shared Object      Symbol
   3.54%  io_uring-sq  [kernel.kallsyms]  [k]
throtl_stats_update_completion
   0.89%  io_uring-sq  [kernel.kallsyms]  [k] throtl_bio_end_io
   0.66%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio
   0.05%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_stat_add
   0.05%  io_uring-sq  [kernel.kallsyms]  [k] throtl_track_latency
   0.01%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio_endio

Samples: 25K of event 'cycles', Event count (approx.): 16586974662
Overhead  Command      Shared Object      Symbol
   1.62%  io_uring-sq  [kernel.kallsyms]  [k] io_submit_sqes
   1.06%  io_uring-sq  [kernel.kallsyms]  [k] io_issue_sqe
   0.32%  io_uring-sq  [kernel.kallsyms]  [k] __io_queue_sqe
   0.06%  io_uring-sq  [kernel.kallsyms]  [k] io_queue_sqe

Above test doesn't set valid blk throttle rules, but the overhead
introduced by blk throttle is even bigger than many io_uring framework
functions, which is not acceptable.

To improve this issue, only do do io statistics if users specify valid
blk throttle rules, and this will also improve performance.

Before this patch:
clat (usec): min=5, max=6871, avg=18.70, stdev=17.89
 lat (usec): min=9, max=6871, avg=18.84, stdev=17.89
WRITE: bw=1618MiB/s (1697MB/s), 1618MiB/s-1618MiB/s (1697MB/s-1697MB/s),
io=94.8GiB (102GB), run=60001-60001msec

With this patch:
clat (usec): min=5, max=7554, avg=17.49, stdev=18.24
lat (usec): min=9, max=7554, avg=17.62, stdev=18.24
 WRITE: bw=1727MiB/s (1810MB/s), 1727MiB/s-1727MiB/s
(1810MB/s-1810MB/s), io=101GiB (109GB), run=60001-60001msec

About 6.6% bps improvement and 6.4% latency reduction.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b8a94ed8

blk-iolatency: only call ktime_get() if needed · d02cca18

由 Hongnan Li 提交于 7月 06, 2020

to #29139300

commit 6e2fa4dd683a22a7697e7ff51dad499406094d28 upstream

ktime_to_ns(ktime_get()), which is expensive, does not need to be called
if blk_iolatency_enabled() return false in blkcg_iolatency_done_bio().
Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when
blk_iolatency is disabled.
Signed-off-by: NHongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d02cca18

29 6月, 2020 4 次提交

blk-mq: use plug for devices that implement ->commits_rqs() · 92339373

由 Jens Axboe 提交于 11月 29, 2018

fix #28871358

commit b2c5d16b72df1116f05c9be16a630ac939d34101 upstream

If we have that hook, we know the driver handles bd->last == true in
a smart fashion. If it does, even for multiple hardware queues, it's
a good idea to flush batches of requests to the device, if we have
batches of requests from the submitter.
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

92339373

blk-mq: use bd->last == true for list inserts · 9e166ffa

由 Jens Axboe 提交于 11月 24, 2018

fix #28871358

commit be94f058f2bde6f0b0ee9059a35daa8e15be308f upstream

If we are issuing a list of requests, we know if we're at the last one.
If we fail issuing, ensure that we call ->commits_rqs() to flush any
potential previous requests.
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9e166ffa

blk-mq: add mq_ops->commit_rqs() · 0111cff3

由 Jens Axboe 提交于 11月 27, 2018

fix #28871358

commit d666ba98f849ad44c4405ecc2180390ebe80f4f9 upstream

blk-mq passes information to the hardware about any given request being
the last that we will issue in this sequence. The point is that hardware
can defer costly doorbell type writes to the last request. But if we run
into errors issuing a sequence of requests, we may never send the request
with bd->last == true set. For that case, we need a hook that tells the
hardware that nothing else is coming right now.

For failures returned by the drivers ->queue_rq() hook, the driver is
responsible for flushing pending requests, if it uses bd->last to
optimize that part. This works like before, no changes there.
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0111cff3

block: improve logic around when to sort a plug list · 26702d43

由 Jens Axboe 提交于 11月 27, 2018

fix #28871358

Only do it if we have requests for multiple queues in the same
plug.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

26702d43

15 6月, 2020 6 次提交

block, bfq: fix use-after-free in bfq_idle_slice_timer_body · baecb6b1

由 Zhiqiang Liu 提交于 3月 19, 2020

task #28557799

[ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is
not in bfqd-lock critical section. The bfqq, which is not
equal to NULL in bfq_idle_slice_timer, may be freed after passing
to bfq_idle_slice_timer_body. So we will access the freed memory.

In addition, considering the bfqq may be in race, we should
firstly check whether bfqq is in service before doing something
on it in bfq_idle_slice_timer_body func. If the bfqq in race is
not in service, it means the bfqq has been expired through
__bfq_bfqq_expire func, and wait_request flags has been cleared in
__bfq_bfqd_reset_in_service func. So we do not need to re-clear the
wait_request of bfqq which is not in service.

KASAN log is given as follows:
[13058.354613] ==================================================================
[13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
[13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
[13058.354646]
[13058.354655] CPU: 96 PID: 19767 Comm: fork13
[13058.354661] Call trace:
[13058.354667]  dump_backtrace+0x0/0x310
[13058.354672]  show_stack+0x28/0x38
[13058.354681]  dump_stack+0xd8/0x108
[13058.354687]  print_address_description+0x68/0x2d0
[13058.354690]  kasan_report+0x124/0x2e0
[13058.354697]  __asan_load8+0x88/0xb0
[13058.354702]  bfq_idle_slice_timer+0xac/0x290
[13058.354707]  __hrtimer_run_queues+0x298/0x8b8
[13058.354710]  hrtimer_interrupt+0x1b8/0x678
[13058.354716]  arch_timer_handler_phys+0x4c/0x78
[13058.354722]  handle_percpu_devid_irq+0xf0/0x558
[13058.354731]  generic_handle_irq+0x50/0x70
[13058.354735]  __handle_domain_irq+0x94/0x110
[13058.354739]  gic_handle_irq+0x8c/0x1b0
[13058.354742]  el1_irq+0xb8/0x140
[13058.354748]  do_wp_page+0x260/0xe28
[13058.354752]  __handle_mm_fault+0x8ec/0x9b0
[13058.354756]  handle_mm_fault+0x280/0x460
[13058.354762]  do_page_fault+0x3ec/0x890
[13058.354765]  do_mem_abort+0xc0/0x1b0
[13058.354768]  el0_da+0x24/0x28
[13058.354770]
[13058.354773] Allocated by task 19731:
[13058.354780]  kasan_kmalloc+0xe0/0x190
[13058.354784]  kasan_slab_alloc+0x14/0x20
[13058.354788]  kmem_cache_alloc_node+0x130/0x440
[13058.354793]  bfq_get_queue+0x138/0x858
[13058.354797]  bfq_get_bfqq_handle_split+0xd4/0x328
[13058.354801]  bfq_init_rq+0x1f4/0x1180
[13058.354806]  bfq_insert_requests+0x264/0x1c98
[13058.354811]  blk_mq_sched_insert_requests+0x1c4/0x488
[13058.354818]  blk_mq_flush_plug_list+0x2d4/0x6e0
[13058.354826]  blk_flush_plug_list+0x230/0x548
[13058.354830]  blk_finish_plug+0x60/0x80
[13058.354838]  read_pages+0xec/0x2c0
[13058.354842]  __do_page_cache_readahead+0x374/0x438
[13058.354846]  ondemand_readahead+0x24c/0x6b0
[13058.354851]  page_cache_sync_readahead+0x17c/0x2f8
[13058.354858]  generic_file_buffered_read+0x588/0xc58
[13058.354862]  generic_file_read_iter+0x1b4/0x278
[13058.354965]  ext4_file_read_iter+0xa8/0x1d8 [ext4]
[13058.354972]  __vfs_read+0x238/0x320
[13058.354976]  vfs_read+0xbc/0x1c0
[13058.354980]  ksys_read+0xdc/0x1b8
[13058.354984]  __arm64_sys_read+0x50/0x60
[13058.354990]  el0_svc_common+0xb4/0x1d8
[13058.354994]  el0_svc_handler+0x50/0xa8
[13058.354998]  el0_svc+0x8/0xc
[13058.354999]
[13058.355001] Freed by task 19731:
[13058.355007]  __kasan_slab_free+0x120/0x228
[13058.355010]  kasan_slab_free+0x10/0x18
[13058.355014]  kmem_cache_free+0x288/0x3f0
[13058.355018]  bfq_put_queue+0x134/0x208
[13058.355022]  bfq_exit_icq_bfqq+0x164/0x348
[13058.355026]  bfq_exit_icq+0x28/0x40
[13058.355030]  ioc_exit_icq+0xa0/0x150
[13058.355035]  put_io_context_active+0x250/0x438
[13058.355038]  exit_io_context+0xd0/0x138
[13058.355045]  do_exit+0x734/0xc58
[13058.355050]  do_group_exit+0x78/0x220
[13058.355054]  __wake_up_parent+0x0/0x50
[13058.355058]  el0_svc_common+0xb4/0x1d8
[13058.355062]  el0_svc_handler+0x50/0xa8
[13058.355066]  el0_svc+0x8/0xc
[13058.355067]
[13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464
[13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
[13058.355077] The buggy address belongs to the page:
[13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
[13058.366175] flags: 0x2ffffe0000008100(slab|head)
[13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
[13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
[13058.370789] page dumped because: kasan: bad access detected
[13058.370791]
[13058.370792] Memory state around the buggy address:
[13058.370797]  ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
[13058.370801]  ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370808]                                                                 ^
[13058.370811]  ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370815]  ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[13058.370817] ==================================================================
[13058.370820] Disabling lock debugging due to kernel taint

Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
--
V2->V3: rewrite the comment as suggested by Paolo Valente
V1->V2: add one comment, and add Fixes and Reported-by tag.

Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Reported-by: NWang Wang <wangwang2@huawei.com>
Signed-off-by: NZhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: NFeilong Lin <linfeilong@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

baecb6b1

block: Fix use-after-free issue accessing struct io_cq · fba123ba

由 Sahitya Tummala 提交于 3月 11, 2020

task #28557799

[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

There is a potential race between ioc_release_fn() and
ioc_clear_queue() as shown below, due to which below kernel
crash is observed. It also can result into use-after-free
issue.

context#1:				context#2:
ioc_release_fn()			__ioc_clear_queue() gets the same icq
->spin_lock(&ioc->lock);		->spin_lock(&ioc->lock);
->ioc_destroy_icq(icq);
  ->list_del_init(&icq->q_node);
  ->call_rcu(&icq->__rcu_head,
  	icq_free_icq_rcu);
->spin_unlock(&ioc->lock);
					->ioc_destroy_icq(icq);
					  ->hlist_del_init(&icq->ioc_node);
					  This results into below crash as this memory
					  is now used by icq->__rcu_head in context#1.
					  There is a chance that icq could be free'd
					  as well.

22150.386550:   <6> Unable to handle kernel write to read-only memory
at virtual address ffffffaa8d31ca50
...
Call trace:
22150.607350:   <2>  ioc_destroy_icq+0x44/0x110
22150.611202:   <2>  ioc_clear_queue+0xac/0x148
22150.615056:   <2>  blk_cleanup_queue+0x11c/0x1a0
22150.619174:   <2>  __scsi_remove_device+0xdc/0x128
22150.623465:   <2>  scsi_forget_host+0x2c/0x78
22150.627315:   <2>  scsi_remove_host+0x7c/0x2a0
22150.631257:   <2>  usb_stor_disconnect+0x74/0xc8
22150.635371:   <2>  usb_unbind_interface+0xc8/0x278
22150.639665:   <2>  device_release_driver_internal+0x198/0x250
22150.644897:   <2>  device_release_driver+0x24/0x30
22150.649176:   <2>  bus_remove_device+0xec/0x140
22150.653204:   <2>  device_del+0x270/0x460
22150.656712:   <2>  usb_disable_device+0x120/0x390
22150.660918:   <2>  usb_disconnect+0xf4/0x2e0
22150.664684:   <2>  hub_event+0xd70/0x17e8
22150.668197:   <2>  process_one_work+0x210/0x480
22150.672222:   <2>  worker_thread+0x32c/0x4c8

Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
indicate this icq is once marked as destroyed. Also, ensure
__ioc_clear_queue() is accessing icq within rcu_read_lock/unlock so
that icq doesn't get free'd up while it is still using it.
Signed-off-by: NSahitya Tummala <stummala@codeaurora.org>
Co-developed-by: NPradeep P V K <ppvk@codeaurora.org>
Signed-off-by: NPradeep P V K <ppvk@codeaurora.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

fba123ba

block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices · dc938b41

由 Konstantin Khlebnikov 提交于 2月 28, 2020

task #28557799

[ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]

Field bdi->io_pages added in commit 9491ae4a ("mm: don't cap request
size based on read-ahead setting") removes unneeded split of read requests.

Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
limits of their devices by blk_set_stacking_limits() + disk_stack_limits().
Field bio->io_pages stays zero until user set max_sectors_kb via sysfs.

This patch updates io_pages after merging limits in disk_stack_limits().

Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
the same problem for device-mapper devices, this one fixes MD RAIDs.

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Reviewed-by: NPaul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

dc938b41

block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() · 8cf52b47

由 Carlo Nonato 提交于 3月 06, 2020

task #28557799

[ Upstream commit 14afc59361976c0ba39e3a9589c3eaa43ebc7e1d ]

The bfq_find_set_group() function takes as input a blkcg (which represents
a cgroup) and retrieves the corresponding bfq_group, then it updates the
bfq internal group hierarchy (see comments inside the function for why
this is needed) and finally it returns the bfq_group.
In the hierarchy update cycle, the pointer holding the correct bfq_group
that has to be returned is mistakenly used to traverse the hierarchy
bottom to top, meaning that in each iteration it gets overwritten with the
parent of the current group. Since the update cycle stops at root's
children (depth = 2), the overwrite becomes a problem only if the blkcg
describes a cgroup at a hierarchy level deeper than that (depth > 2). In
this case the root's child that happens to be also an ancestor of the
correct bfq_group is returned. The main consequence is that processes
contained in a cgroup at depth greater than 2 are wrongly placed in the
group described above by BFQ.

This commits fixes this problem by using a different bfq_group pointer in
the update cycle in order to avoid the overwrite of the variable holding
the original group reference.
Reported-by: NKwon Je Oh <kwonje.oh2@gmail.com>
Signed-off-by: NCarlo Nonato <carlo.nonato95@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8cf52b47

block: fix an integer overflow in logical block size · 8b05616d

由 Mikulas Patocka 提交于 1月 15, 2020

task #28557799

commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.

Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.

For exmaple (run this on an architecture with 64k pages):

Mount will fail with this error because it tries to read the superblock using 2-sector
access:
  device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
  EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8b05616d

block: fix memleak when __blk_rq_map_user_iov() is failed · a15ce925

由 Yang Yingliang 提交于 12月 18, 2019

task #28557799

[ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]

When I doing fuzzy test, get the memleak report:

BUG: memory leak
unreferenced object 0xffff88837af80000 (size 4096):
  comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00   ...............
  backtrace:
    [<000000001c894df8>] bio_alloc_bioset+0x393/0x590
    [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0
    [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0
    [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160
    [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870
    [<00000000064bb208>] sg_write.part.25+0x5d9/0x950
    [<000000004fc670f6>] sg_write+0x5f/0x8c
    [<00000000b0d05c7b>] __vfs_write+0x7c/0x100
    [<000000008e177714>] vfs_write+0x1c3/0x500
    [<0000000087d23f34>] ksys_write+0xf9/0x200
    [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0
    [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

If __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(),
the bio(s) which is allocated before this failing will leak. The
refcount of the bio(s) is init to 1 and increased to 2 by calling
bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so
the bio cannot be freed. Fix it by calling blk_rq_unmap_user().
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

a15ce925

11 6月, 2020 1 次提交

alinux: blk-mq: remove QUEUE_FLAG_POLL from default MQ flags · 294d5fb2

由 Joseph Qi 提交于 6月 10, 2020

fix #28528017

In case of virtio-blk device, checking /sys/block/<device>/queue/io_poll
will show 1 and user can't disable it. Actually virtio-blk doesn't
support poll yet, so it will confuse end user. The root cause is mq
initialization will default set bit QUEUE_FLAG_POLL.

This fix takes ideas from the following upstream commits:
6544d229bf43 ("block: enable polling by default if a poll map is initalized")
6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags")
Since we don't want to get HCTX_TYPE_POLL related logic involved, so
just check mq_ops->poll and then set QUEUE_FLAG_POLL.
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

294d5fb2

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功