1. 08 Dec 2018 (1 commit)
    • blk-mq: fix corruption with direct issue · 724ff9cb
      Authored by Jens Axboe
      commit ffe81d45 upstream.
      
      If we attempt a direct issue to a SCSI device, and it returns BUSY, then
      we queue the request up normally. However, the SCSI layer may have
      already set up SG tables etc. for this particular command. If we later
      merge with this request, then the old tables are no longer valid. Once
      we issue the IO, we only read/write the original part of the request,
      not the new state of it.
      
      This causes data corruption, and is most often noticed as the file
      system complaining that the data it just read is invalid:
      
      [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
      
      because most of it is garbage...
      
      This doesn't happen from the normal issue path, as we will simply defer
      the request to the hardware queue dispatch list if we fail. Once it's on
      the dispatch list, we never merge with it.
      
      Fix this from the direct issue path by flagging the request as
      REQ_NOMERGE so we don't change the size of it before issue.
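
      The gist of the fix, sketched against the upstream blk-mq
      direct-issue error path (context simplified; stable backports may
      place it slightly differently):

          switch (ret) {
          case BLK_STS_RESOURCE:
          case BLK_STS_DEV_RESOURCE:
                  /*
                   * Direct dispatch failed. The driver (e.g. SCSI) may
                   * already have set up SG tables for this request, so
                   * forbid any later merge that would change its size,
                   * then requeue it.
                   */
                  rq->cmd_flags |= REQ_NOMERGE;
                  blk_mq_update_dispatch_busy(hctx, true);
                  __blk_mq_requeue_request(rq);
                  break;
          default:
                  break;
          }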
      
      See also:
        https://bugzilla.kernel.org/show_bug.cgi?id=201685

      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 6ce3dd6e ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 01 Dec 2018 (1 commit)
  3. 27 Nov 2018 (1 commit)
  4. 21 Nov 2018 (1 commit)
    • SCSI: fix queue cleanup race before queue initialization is done · 410306a0
      Authored by Ming Lei
      commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
      
      c2856ae2 ("blk-mq: quiesce queue before freeing queue") already
      fixed this race. However, the implied synchronize_rcu() in
      blk_mq_quiesce_queue() can slow down LUN probing significantly,
      causing a performance regression.
      
      Then 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      tried to avoid the unnecessary synchronize_rcu() by quiescing the
      queue only when queue initialization is done, because it is common
      to see lots of nonexistent LUNs that need to be probed.
      
      However, it turns out that quiescing the queue only when queue
      initialization is done isn't safe: when a SCSI command completes,
      the submitter of the command can be woken up immediately, and the
      SCSI device may then be removed while the run queue in
      scsi_end_request() is still in progress, which can cause a kernel
      panic.
      
      The Red Hat QE lab has several reports of this kind of kernel
      panic being triggered during boot.
      
      This patch addresses the issue by grabbing a queue usage counter
      reference across freeing the request and the subsequent queue run,
      as sketched below.
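
      The shape of the fix in scsi_end_request(), per the upstream
      commit (surrounding logic elided):

          struct request_queue *q = sdev->request_queue;

          /*
           * Hold a queue usage reference so blk_cleanup_queue() cannot
           * finish while we free the request and re-run the queue.
           */
          percpu_ref_get(&q->q_usage_counter);

          __blk_mq_end_request(req, error);
          blk_mq_run_hw_queues(q, true);

          percpu_ref_put(&q->q_usage_counter);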
      
      Fixes: 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Cc: stable <stable@vger.kernel.org>
      Cc: jianchao.wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 14 Nov 2018 (4 commits)
  6. 18 Oct 2018 (1 commit)
  7. 12 Oct 2018 (1 commit)
  8. 28 Sep 2018 (1 commit)
  9. 27 Sep 2018 (1 commit)
    • block: fix deadline elevator drain for zoned block devices · 854f31cc
      Authored by Damien Le Moal
      When the deadline scheduler is used with a zoned block device, writes
      to a zone will be dispatched one at a time. This causes the warning
      message:
      
      deadline: forced dispatching is broken (nr_sorted=X), please report this
      
      to be displayed when switching to another elevator with the legacy I/O
      path while write requests to a zone are being retained in the scheduler
      queue.
      
      Prevent this message from being displayed when executing
      elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
      loop until all writes are dispatched and completed, resulting in the
      desired elevator queue drain without extensive modifications to the
      deadline code itself to handle forced-dispatch calls.
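
      A sketch of the check in elv_drain_elevator() on the legacy path
      (simplified):

          static void elv_drain_elevator(struct request_queue *q)
          {
                  static int printed;

                  while (q->elevator->type->ops.sq.elevator_dispatch_fn(q, 1))
                          ;
                  /*
                   * A zoned device legitimately holds back writes to a
                   * zone, so leftover sorted requests are not a
                   * scheduler bug there; don't warn in that case.
                   */
                  if (q->nr_sorted && !blk_queue_is_zoned(q) && printed++ < 10)
                          printk(KERN_ERR "%s: forced dispatching is broken "
                                 "(nr_sorted=%u), please report this\n",
                                 q->elevator->type->elevator_name, q->nr_sorted);
          }
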
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Fixes: 8dc8146f ("deadline-iosched: Introduce zone locking support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 26 Sep 2018 (1 commit)
  11. 22 Sep 2018 (1 commit)
    • block: use nanosecond resolution for iostat · b57e99b4
      Authored by Omar Sandoval
      Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
      updating properly on 4.18. This is because we started using ktime to
      track elapsed time, and we convert nanoseconds to jiffies when we update
      the partition counter. However, this gets rounded down, so any I/Os that
      take less than a jiffy are not accounted for. Previously in this case,
      the value of jiffies would sometimes increment while we were doing I/O,
      so at least some I/Os were accounted for.
      
      Let's convert the stats to use nanoseconds internally. We still report
      milliseconds as before, now more accurately than ever. The value is
      still truncated to 32 bits for backwards compatibility.
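
      A sketch of the idea (the stat field and variable names here are
      illustrative, not necessarily the patch's): keep nanoseconds
      internally and divide only when reporting, so sub-jiffy I/Os are
      no longer rounded down to zero:

          /* On I/O completion: accumulate elapsed time in nanoseconds. */
          part_stat_add(cpu, part, nsecs[rw], now_ns - rq->start_time_ns);

          /*
           * Only at reporting time (/proc/diskstats, sysfs): convert to
           * milliseconds, truncated to 32 bits for compatibility.
           */
          msecs = (unsigned int)div_u64(part_stat_read(part, nsecs[rw]),
                                        NSEC_PER_MSEC);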
      
      Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
      Cc: stable@vger.kernel.org
      Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 12 Sep 2018 (1 commit)
  13. 07 Sep 2018 (1 commit)
  14. 06 Sep 2018 (1 commit)
  15. 01 Sep 2018 (3 commits)
    • blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Authored by Dennis Zhou (Facebook)
      There is a very small chance that a bio gets caught up in a really
      unfortunate race between a task migration, a cgroup exiting, and the
      bio itself trying to associate with a blkg. This is due to css
      offlining being performed after the css->refcnt is killed, which
      triggers removal of blkgs that reach a blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fall
      back to using the root_blkg.
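
      Roughly, the association path becomes (a sketch using the
      blkg_try_get() helper of that era; error handling at the caller is
      elided):

          int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
          {
                  if (unlikely(bio->bi_blkg))
                          return -EBUSY;
                  /* Refuse a blkg whose refcount already hit zero. */
                  if (!blkg_try_get(blkg))
                          return -ENODEV;
                  bio->bi_blkg = blkg;
                  return 0;
          }

      with callers falling back to the queue's root_blkg when the lookup
      or the tryget fails.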
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Authored by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkg.
      
      It seems that in the past blk-throttle did something not entirely
      transparent: it took data from a blkg while associating with
      current. The simplification and unification of what blk-throttle
      does is what exposed this race.
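
      The cgwb counting reduces to a pair of helpers (a sketch matching
      the sequence above; blkcg_css_offline() holds the base reference
      and puts it once writeback is offlined):

          static inline void blkcg_cgwb_get(struct blkcg *blkcg)
          {
                  refcount_inc(&blkcg->cgwb_refcnt);
          }

          static inline void blkcg_cgwb_put(struct blkcg *blkcg)
          {
                  /* Last cgwb gone: now it is safe to destroy blkgs. */
                  if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
                          blkcg_destroy_blkgs(blkcg);
          }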
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Authored by Dennis Zhou (Facebook)
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkgs pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called, which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 28 Aug 2018 (5 commits)
  17. 23 Aug 2018 (4 commits)
  18. 21 Aug 2018 (2 commits)
    • blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter · f5bbbbe4
      Authored by Jianchao Wang
      For blk-mq, part_in_flight/rw invokes blk_mq_in_flight/rw to
      account for in-flight requests, accessing queue_hw_ctx and
      nr_hw_queues without any protection. When an update of nr_hw_queues
      and blk_mq_in_flight/rw run concurrently, a panic results.
      
      Before nr_hw_queues is updated, the queue is frozen, so we can use
      q_usage_counter to avoid the race. percpu_ref_is_zero is used here
      so that we will not miss any in-flight request. The accesses to
      nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
      under an RCU critical section, and __blk_mq_update_nr_hw_queues can
      use synchronize_rcu to ensure the zeroed q_usage_counter is
      globally visible.
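
      Sketched against blk_mq_queue_tag_busy_iter() (simplified):

          void blk_mq_queue_tag_busy_iter(struct request_queue *q,
                                          busy_iter_fn *fn, void *priv)
          {
                  struct blk_mq_hw_ctx *hctx;
                  int i;

                  /*
                   * Pairs with the synchronize_rcu() in
                   * __blk_mq_update_nr_hw_queues(): while the queue is
                   * frozen for the update, q_usage_counter reads zero
                   * and queue_hw_ctx/nr_hw_queues must not be touched.
                   */
                  rcu_read_lock();
                  if (percpu_ref_is_zero(&q->q_usage_counter)) {
                          rcu_read_unlock();
                          return;
                  }
                  queue_for_each_hw_ctx(q, hctx, i) {
                          /* ... walk the busy tags of this hctx ... */
                  }
                  rcu_read_unlock();
          }
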
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: init hctx sched after update ctx and hctx mapping · d48ece20
      Authored by Jianchao Wang
      Currently, when nr_hw_queues is updated, the I/O scheduler's
      init_hctx is invoked before the mapping between ctx and hctx has
      been adapted by blk_mq_map_swqueue. An I/O scheduler's init_hctx
      (e.g. kyber's) may depend on this mapping, compute a wrong result,
      and finally panic. A simple way to fix this is to switch the I/O
      scheduler to 'none' before updating nr_hw_queues, and then switch
      it back afterwards, as sketched below. blk_mq_sched_init_/exit_hctx
      are removed because nobody uses them any more.
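
      The shape of the fix in __blk_mq_update_nr_hw_queues() (a sketch;
      error handling elided):

          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_freeze_queue(q);

          /*
           * Detach elevators first so that no scheduler init_hctx() can
           * run against a stale ctx <-> hctx mapping.
           */
          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_elv_switch_none(&head, q);

          blk_mq_update_queue_map(set);
          list_for_each_entry(q, &set->tag_list, tag_set_list) {
                  blk_mq_realloc_hw_ctxs(set, q);
                  blk_mq_map_swqueue(q);
          }

          /* Re-attach the saved elevators against the new mapping. */
          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_elv_switch_back(&head, q);

          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_unfreeze_queue(q);
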
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 18 Aug 2018 (1 commit)
  20. 17 Aug 2018 (6 commits)
  21. 15 Aug 2018 (2 commits)