提交 · 75448f40b9f6c0fd6d6afdf9101fbb2697fb5608 · openanolis / cloud-kernel

16 9月, 2019 1 次提交

blk-mq: free hw queue's resource in hctx's release handler · e238e6dc

由 Ming Lei 提交于 4月 30, 2019

[ Upstream commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 ]

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d9,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: NJames Smart <james.smart@broadcom.com>
Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>

e238e6dc

04 8月, 2019 1 次提交

block, scsi: Change the preempt-only flag into a counter · c58a6507

由 Bart Van Assche 提交于 9月 26, 2018

commit cd84a62e0078dce09f4ed349bec84f86c9d54b30 upstream.

The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

c58a6507

31 7月, 2019 1 次提交

block: init flush rq ref count to 1 · 8a1a3d38

由 Josef Bacik 提交于 3月 07, 2019

[ Upstream commit b554db147feea39617b533ab6bca247c91c6198a ]

We discovered a problem in newer kernels where a disconnect of a NBD
device while the flush request was pending would result in a hang.  This
is because the blk mq timeout handler does

        if (!refcount_inc_not_zero(&rq->ref))
                return true;

to determine if it's ok to run the timeout handler for the request.
Flush_rq's don't have a ref count set, so we'd skip running the timeout
handler for this request and it would just sit there in limbo forever.

Fix this by always setting the refcount of any request going through
blk_init_rq() to 1.  I tested this with a nbd-server that dropped flush
requests to verify that it hung, and then tested with this patch to
verify I got the timeout as expected and the error handling kicked in.
Thanks,
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>

8a1a3d38

10 7月, 2019 1 次提交

block: Fix a NULL pointer dereference in generic_make_request() · c9d8d3e9

由 Guilherme G. Piccoli 提交于 6月 28, 2019

```--------------------------------------------------------------
This patch is not on mainline and is meant to 4.19 stable *only*.
After the patch description there's a reasoning about that.
```

--------------------------------------------------------------

Commit 37f9579f ("blk-mq: Avoid that submitting a bio concurrently
with device removal triggers a crash") introduced a NULL pointer
dereference in generic_make_request(). The patch sets q to NULL and
enter_succeeded to false; right after, there's an 'if (enter_succeeded)'
which is not taken, and then the 'else' will dereference q in
blk_queue_dying(q).

This patch just moves the 'q = NULL' to a point in which it won't trigger
the oops, although the semantics of this NULLification remains untouched.

A simple test case/reproducer is as follows:
a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n.

b) Create a raid0 md array with 2 NVMe devices as members, and mount
it with an ext4 filesystem.

c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
(dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
echo 1 > /sys/block/nvme1n1/device/device/remove
(whereas nvme1n1 is the 2nd array member)

This will trigger the following oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
RIP: 0010:generic_make_request+0x32b/0x400
Call Trace:
 submit_bio+0x73/0x140
 ext4_io_submit+0x4d/0x60
 ext4_writepages+0x626/0xe90
 do_writepages+0x4b/0xe0
[...]

This patch has no functional changes and preserves the md/raid0 behavior
when a member is removed before kernel v4.17.

----------------------------
Why this is not on mainline?
----------------------------

The patch was originally submitted upstream in linux-raid and
linux-block mailing-lists - it was initially accepted by Song Liu,
but Christoph Hellwig[0] observed that there was a clean-up series
ready to be accepted from Ming Lei[1] that fixed the same issue.

The accepted patches from Ming's series in upstream are: commit
47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and
commit fe2008640ae3 ("block: don't protect generic_make_request_checks
with blk_queue_enter"). Those patches basically do a clean-up in the
block layer involving:

1) Putting back blk_exit_queue() logic into __blk_release_queue(); that
path was changed in the past and the logic from blk_exit_queue() was
added to blk_cleanup_queue().

2) Removing the guard/protection in generic_make_request_checks() with
blk_queue_enter().

The problem with Ming's series for -stable is that it relies in the
legacy request IO path removal. So it's "backport-able" to v5.0+,
but doing that for early versions (like 4.19) would incur in complex
code changes. Hence, it was suggested by Christoph and Song Liu that
this patch was submitted to stable only; otherwise merging it upstream
would add code to fix a path removed in a subsequent commit.

[0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org
[1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Tested-by: NEric Ren <renzhengeek@gmail.com>
Fixes: 37f9579f ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

c9d8d3e9

15 6月, 2019 1 次提交

blk-mq: move cancel of requeue_work into blk_mq_release · 525b5265

由 Ming Lei 提交于 4月 30, 2019

[ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ]

With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.

So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSasha Levin <sashal@kernel.org>

525b5265

08 5月, 2019 1 次提交

block: pass no-op callback to INIT_WORK(). · 96e4471d

由 Tetsuo Handa 提交于 1月 30, 2019

[ Upstream commit 2e3c18d0ada16f145087b2687afcad1748c0827c ]

syzbot is hitting flush_work() warning caused by commit 4d43d395fed12463
("workqueue: Try to catch flush_work() without INIT_WORK().") [1].
Although that commit did not expect INIT_WORK(NULL) case, calling
flush_work() without setting a valid callback should be avoided anyway.
Fix this problem by setting a no-op callback instead of NULL.

[1] https://syzkaller.appspot.com/bug?id=e390366bc48bc82a7c668326e0663be3b91cbd29Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Nsyzbot <syzbot+ba2a929dcf8e704c180e@syzkaller.appspotmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
[sl: rename blk_timeout_work]
Signed-off-by: NSasha Levin <sashal@kernel.org>

96e4471d

21 11月, 2018 1 次提交

SCSI: fix queue cleanup race before queue initialization is done · 410306a0

由 Ming Lei 提交于 11月 14, 2018

commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

c2856ae2 ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
performance regression.

Then 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce queue for avoiding unnecessary synchronize_rcu()
only when queue initialization is done, because it is usual to see
lots of inexistent LUNs which need to be probed.

However, turns out it isn't safe to quiesce queue only when queue
initialization is done. Because when one SCSI command is completed,
the user of sending command can be waken up immediately, then the
scsi device may be removed, meantime the run queue in scsi_end_request()
is still in-progress, so kernel panic can be caused.

In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.

This patch tries to address the issue by grabing one queue usage
counter during freeing one request and the following run queue.

Fixes: 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones <drjones@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: stable <stable@vger.kernel.org>
Cc: jianchao.wang <jianchao.w.wang@oracle.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

410306a0

22 9月, 2018 1 次提交

block: use nanosecond resolution for iostat · b57e99b4

由 Omar Sandoval 提交于 9月 21, 2018

Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: NKlaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b57e99b4

06 9月, 2018 1 次提交

block: don't warn when doing fsync on read-only devices · 8b2ded1c

由 Mikulas Patocka 提交于 9月 05, 2018

It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.

The patch b089cfd9 ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc7 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org	# 4.18
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8b2ded1c

18 8月, 2018 1 次提交

block: remove duplicate initialization · fcedba42

由 Chaitanya Kulkarni 提交于 8月 16, 2018

This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fcedba42

15 8月, 2018 1 次提交

block: don't warn for flush on read-only device · b089cfd9

由 Jens Axboe 提交于 8月 14, 2018

Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.
Reported-by: NStefan Agner <stefan@agner.ch>
Fixes: 721c7fc7 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b089cfd9

09 8月, 2018 1 次提交

block: Introduce blk_exit_queue() · 4cf6324b

由 Bart Van Assche 提交于 8月 09, 2018

This patch does not change any functionality.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4cf6324b

05 8月, 2018 1 次提交

Partially revert "block: fail op_is_write() requests to read-only partitions" · a32e236e

由 Linus Torvalds 提交于 8月 03, 2018

It turns out that commit 721c7fc7 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.

The reason is that the lvm snapshotting will continue to write to the
snapshow COW volume, even after the volume has been marked read-only.
End result: snapshot failure.

This has actually been fixed in newer version of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:

  https://bugzilla.kernel.org/show_bug.cgi?id=200439
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

The lvm2 fix is here

  https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check.  It now allows the
write to happen, but causes a warning, something like this:

  generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
  Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
  CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
  Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
  Workqueue: ksnaphd do_metadata
  RIP: 0010:generic_make_request_checks+0x4ac/0x600
  ...
  Call Trace:
   generic_make_request+0x64/0x400
   submit_bio+0x6c/0x140
   dispatch_io+0x287/0x430
   sync_io+0xc3/0x120
   dm_io+0x1f8/0x220
   do_metadata+0x1d/0x30
   process_one_work+0x1b9/0x3e0
   worker_thread+0x2b/0x3c0
   kthread+0x113/0x130
   ret_from_fork+0x35/0x40

Note that this is a "revert" in behavior only.  I'm leaving alone the
actual code cleanups in commit 721c7fc7, but letting the previously
uncaught request go through with a warning instead of stopping it.

Fixes: 721c7fc7 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: NWGH <wgh@torlan.ru>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a32e236e

03 8月, 2018 1 次提交

block: really disable runtime-pm for blk-mq · b233f127

由 Ming Lei 提交于 7月 30, 2018

Runtime PM isn't ready for blk-mq yet, and commit 765e40b6 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space still can switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.

This patch disables runtime-pm for blk-mq really by pm_runtime_disable()
and fixes all kinds of PM related kernel crash.

Cc: Tomas Janousek <tomi@nomi.cz>
Cc: Przemek Socha <soprwa@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <stable@vger.kernel.org>
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NPatrick Steinhardt <ps@pks.im>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b233f127

30 7月, 2018 1 次提交

block: blk_init_allocated_queue() set q->fq as NULL in the fail case · 54648cf1

由 xiao jin 提交于 7月 30, 2018

We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.

Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.

Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.

The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().

Fixes: commit 7c94e1c1 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Nxiao jin <jin.xiao@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

54648cf1

18 7月, 2018 1 次提交

block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3

由 Michael Callahan 提交于 7月 18, 2018

Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function.  It's now
indexed by op_is_write().

tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ddcf35d3

09 7月, 2018 5 次提交

block: Add default switch case to blk_pm_allow_request() to kill warning · e9a83853

由 Geert Uytterhoeven 提交于 7月 06, 2018

With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
      switch (rq->q->rpm_status) {
      ^

Convert the return statement below the switch() block into a default
case to fix this.

Fixes: e4f36b24 ("block: fix peeking requests during PM")
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e9a83853

block: remove external dependency on wbt_flags · c1c80384

由 Josef Bacik 提交于 7月 03, 2018

We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c1c80384

blk-rq-qos: refactor out common elements of blk-wbt · a7905043

由 Josef Bacik 提交于 7月 03, 2018

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a7905043

block: Document how blk_update_request() handles RQF_SPECIAL_PAYLOAD requests · 1954e9a9

由 Bart Van Assche 提交于 6月 27, 2018

The payload of struct request is stored in the request.bio chain if
the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
blk_rq_bytes() which means that the value returned by that function
is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
clear to me whether this is an oversight or whether this happened on
purpose. Anyway, document that it is known that both functions ignore
RQF_SPECIAL_PAYLOAD. See also commit f9d03f96 ("block: improve
handling of the magic discard payload").
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1954e9a9

blk-mq: avoid to synchronize rcu inside blk_cleanup_queue() · 1311326c

由 Ming Lei 提交于 6月 25, 2018

SCSI probing may synchronously create and destroy a lot of request_queues
for non-existent devices. Any synchronize_rcu() in queue creation or
destroy path may introduce long latency during booting, see detailed
description in comment of blk_register_queue().

This patch removes one synchronize_rcu() inside blk_cleanup_queue()
for this case, commit c2856ae2(blk-mq: quiesce queue before freeing queue)
needs synchronize_rcu() for implementing blk_mq_quiesce_queue(), but
when queue isn't initialized, it isn't necessary to do that since
only pass-through requests are involved, no original issue in
scsi_execute() at all.

Without this patch and previous one, it may take more 20+ seconds for
virtio-scsi to complete disk probe. With the two patches, the time becomes
less than 100ms.

Fixes: c2856ae2 ("blk-mq: quiesce queue before freeing queue")
Reported-by: NAndrew Jones <drjones@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: NAndrew Jones <drjones@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1311326c

28 6月, 2018 1 次提交

block: Fix cloning of requests with a special payload · 297ba57d

由 Bart Van Assche 提交于 6月 27, 2018

This patch avoids that removing a path controlled by the dm-mpath driver
while mkfs is running triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:
     <IRQ>
     dm_softirq_done+0x326/0x3d0 [dm_mod]
     blk_done_softirq+0x19b/0x1e0
     __do_softirq+0x128/0x60d
     irq_exit+0x100/0x110
     smp_call_function_single_interrupt+0x90/0x330
     call_function_single_interrupt+0xf/0x20
     </IRQ>

Fixes: f9d03f96 ("block: improve handling of the magic discard payload")
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

297ba57d

20 6月, 2018 1 次提交

Revert "block: Add warning for bi_next not NULL in bio_endio()" · 9c24c10a

由 Bart Van Assche 提交于 6月 19, 2018

Commit 0ba99ca4 ("block: Add warning for bi_next not NULL in
bio_endio()") breaks the dm driver. end_clone_bio() detects whether
or not a bio is the last bio associated with a request by checking
the .bi_next field. Commit 0ba99ca4 clears that field before
end_clone_bio() has had a chance to inspect that field. Hence revert
commit 0ba99ca4.

This patch avoids that KASAN reports the following complaint when
running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

==================================================================
BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0xa4/0xf5
 print_address_description+0x6f/0x270
 kasan_report+0x241/0x360
 __asan_load4+0x78/0x80
 bio_advance+0x11b/0x1d0
 blk_update_request+0xa7/0x5b0
 scsi_end_request+0x56/0x320 [scsi_mod]
 scsi_io_completion+0x7d6/0xb20 [scsi_mod]
 scsi_finish_command+0x1c0/0x280 [scsi_mod]
 scsi_softirq_done+0x19a/0x230 [scsi_mod]
 blk_mq_complete_request+0x160/0x240
 scsi_mq_done+0x50/0x1a0 [scsi_mod]
 srp_recv_done+0x515/0x1330 [ib_srp]
 __ib_process_cq+0xa0/0xf0 [ib_core]
 ib_poll_handler+0x38/0xa0 [ib_core]
 irq_poll_softirq+0xe8/0x1f0
 __do_softirq+0x128/0x60d
 run_ksoftirqd+0x3f/0x60
 smpboot_thread_fn+0x352/0x460
 kthread+0x1c1/0x1e0
 ret_from_fork+0x24/0x30

Allocated by task 1918:
 save_stack+0x43/0xd0
 kasan_kmalloc+0xad/0xe0
 kasan_slab_alloc+0x11/0x20
 kmem_cache_alloc+0xfe/0x350
 mempool_alloc_slab+0x15/0x20
 mempool_alloc+0xfb/0x270
 bio_alloc_bioset+0x244/0x350
 submit_bh_wbc+0x9c/0x2f0
 __block_write_full_page+0x299/0x5a0
 block_write_full_page+0x16b/0x180
 blkdev_writepage+0x18/0x20
 __writepage+0x42/0x80
 write_cache_pages+0x376/0x8a0
 generic_writepages+0xbe/0x110
 blkdev_writepages+0xe/0x10
 do_writepages+0x9b/0x180
 __filemap_fdatawrite_range+0x178/0x1c0
 file_write_and_wait_range+0x59/0xc0
 blkdev_fsync+0x46/0x80
 vfs_fsync_range+0x66/0x100
 do_fsync+0x3d/0x70
 __x64_sys_fsync+0x21/0x30
 do_syscall_64+0x77/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
 save_stack+0x43/0xd0
 __kasan_slab_free+0x137/0x190
 kasan_slab_free+0xe/0x10
 kmem_cache_free+0xd3/0x380
 mempool_free_slab+0x17/0x20
 mempool_free+0x63/0x160
 bio_free+0x81/0xa0
 bio_put+0x59/0x60
 end_bio_bh_io_sync+0x5d/0x70
 bio_endio+0x1a7/0x360
 blk_update_request+0xd0/0x5b0
 end_clone_bio+0xa3/0xd0 [dm_mod]
 bio_endio+0x1a7/0x360
 blk_update_request+0xd0/0x5b0
 scsi_end_request+0x56/0x320 [scsi_mod]
 scsi_io_completion+0x7d6/0xb20 [scsi_mod]
 scsi_finish_command+0x1c0/0x280 [scsi_mod]
 scsi_softirq_done+0x19a/0x230 [scsi_mod]
 blk_mq_complete_request+0x160/0x240
 scsi_mq_done+0x50/0x1a0 [scsi_mod]
 srp_recv_done+0x515/0x1330 [ib_srp]
 __ib_process_cq+0xa0/0xf0 [ib_core]
 ib_poll_handler+0x38/0xa0 [ib_core]
 irq_poll_softirq+0xe8/0x1f0
 __do_softirq+0x128/0x60d

The buggy address belongs to the object at ffff8801300e0640
 which belongs to the cache bio-0 of size 200
The buggy address is located 144 bytes inside of
 200-byte region [ffff8801300e0640, ffff8801300e0708)
The buggy address belongs to the page:
page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
flags: 0x8000000000008100(slab|head)
raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
 ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
 ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 0ba99ca4 ("block: Add warning for bi_next not NULL in bio_endio()")
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9c24c10a

07 6月, 2018 1 次提交

block: always set partition number to '0' in blk_partition_remap() · c04fa44b

由 Hannes Reinecke 提交于 6月 07, 2018

blk_partition_remap() will only clear bi_partno if an actual remapping
has happened. But flush request et al don't have an actual size, so
the remapping doesn't happen and bi_partno is never cleared.
So for stacked devices blk_partition_remap() will be called on each level.
If (as is the case for native nvme multipathing) one of the lower-level
devices do _not_support partitioning a spurious I/O error is generated.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c04fa44b

03 6月, 2018 1 次提交

block: don't use blocking queue entered for recursive bio submits · cd4a4ae4

由 Jens Axboe 提交于 6月 02, 2018

If we end up splitting a bio and the queue goes away between
the initial submission and the later split submission, then we
can block forever in blk_queue_enter() waiting for the reference
to drop to zero. This will never happen, since we already hold
a reference.

Mark a split bio as already having entered the queue, so we can
just use the live non-blocking queue enter variant.

Thanks to Tetsuo Handa for the analysis.

Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cd4a4ae4

01 6月, 2018 3 次提交

block: move sysfs_lock into elevator_init · acddf3b3

由 Christoph Hellwig 提交于 5月 31, 2018

Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Tested-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

acddf3b3

block: remove the always unused name argument to elevator_init · ddb72532

由 Christoph Hellwig 提交于 5月 31, 2018

Reported-by: NDamien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Tested-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ddb72532

block: move initialization of elevator-related fields to blk_alloc_queue_node · cbf62af3

由 Christoph Hellwig 提交于 5月 31, 2018

No point in doing this in elevator_init.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reported-by: NDamien Le Moal <Damien.LeMoal@wdc.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Tested-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cbf62af3

31 5月, 2018 1 次提交

block: convert bounce, q->bio_split to bioset_init()/mempool_init() · 338aa96d

由 Kent Overstreet 提交于 5月 20, 2018

Convert the core block functionality to embedded bio sets.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

338aa96d

29 5月, 2018 1 次提交

blk-mq: Remove generation seqeunce · 12f5b931

由 Keith Busch 提交于 5月 29, 2018

This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional syncronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

To enables this a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it.  The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
 for completions]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

12f5b931

15 5月, 2018 2 次提交

block: Add warning for bi_next not NULL in bio_endio() · 0ba99ca4

由 Kent Overstreet 提交于 5月 08, 2018

Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.

Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0ba99ca4

block: Use bioset_init() for fs_bio_set · f4f8154a

由 Kent Overstreet 提交于 5月 08, 2018

Minor optimization - remove a pointer indirection when using fs_bio_set.
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f4f8154a

14 5月, 2018 4 次提交

block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM · c3036021

由 Christoph Hellwig 提交于 5月 09, 2018

We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c3036021

block: pass an explicit gfp_t to get_request · 4accf5fc

由 Christoph Hellwig 提交于 5月 09, 2018

blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4accf5fc

block: sanitize blk_get_request calling conventions · ff005a06

由 Christoph Hellwig 提交于 5月 09, 2018

Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ff005a06

C
block: fix __get_request documentation · a9a14d36
由 Christoph Hellwig 提交于 5月 09, 2018
```
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
```
a9a14d36

09 5月, 2018 3 次提交

block: consolidate struct request timestamp fields · 522a7775

由 Omar Sandoval 提交于 5月 09, 2018

Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

522a7775

block: get rid of struct blk_issue_stat · 544ccc8d

由 Omar Sandoval 提交于 5月 09, 2018

struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

544ccc8d

block: pass struct request instead of struct blk_issue_stat to wbt · a8a45941

由 Omar Sandoval 提交于 5月 09, 2018

issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a8a45941

08 5月, 2018 1 次提交

block: Shorten interrupt disabled regions · 50864670

由 Thomas Gleixner 提交于 5月 04, 2018

Commit 9c40cef2 ("sched: Move blk_schedule_flush_plug() out of
__schedule()") moved the blk_schedule_flush_plug() call out of the
interrupt/preempt disabled region in the scheduler. This allows to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug_list().

But it makes more sense to disable interrupts explicitly when the request
queue is locked end reenable them when the request to is unlocked. This
shortens the interrupt disabled section which is important when the plug
list contains requests for more than one queue. The comment which claims
that disabling interrupts around the loop is misleading as the called
functions can reenable interrupts unconditionally anyway and obfuscates the
scope badly:

 local_irq_save(flags);
   spin_lock(q->queue_lock);
   ...
   queue_unplugged(q...);
     scsi_request_fn();
       spin_unlock_irq(q->queue_lock);

-------------------^^^ ????

       spin_lock_irq(q->queue_lock);
     spin_unlock(q->queue_lock);
 local_irq_restore(flags);

Aside of that the detached interrupt disabling is a constant pain for
PREEMPT_RT as it requires patching and special casing when RT is enabled
while with the spin_*_irq() variants this happens automatically.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.deSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

50864670

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功