提交 · 6333ff6e5a43703a8f7ea16b0f2890a38f5467f0 · openeuler / Kernel

17 10月, 2019 2 次提交

Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-linus · 6333ff6e

由 Jens Axboe 提交于 10月 16, 2019

Pull MD fix from Song.

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid0: fix warning message for parameter default_layout

6333ff6e

md/raid0: fix warning message for parameter default_layout · 3874d73e

由 Song Liu 提交于 10月 14, 2019

The message should match the parameter, i.e. raid0.default_layout.

Fixes: c84a1372 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: NeilBrown <neilb@suse.de>
Reported-by: NIvan Topolsky <doktor.yak@gmail.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3874d73e

16 10月, 2019 3 次提交

libata/ahci: Fix PCS quirk application · 09d6ac8d

由 Dan Williams 提交于 10月 15, 2019

Commit c312ef17 "libata/ahci: Drop PCS quirk for Denverton and
beyond" got the polarity wrong on the check for which board-ids should
have the quirk applied. The board type board_ahci_pcs7 is defined at the
end of the list such that "pcs7" boards can be special cased in the
future if they need the quirk. All prior Intel board ids "<
board_ahci_pcs7" should proceed with applying the quirk.
Reported-by: NAndreas Friedrich <afrie@gmx.net>
Reported-by: NStephen Douthit <stephend@silicom-usa.com>
Fixes: c312ef17 ("libata/ahci: Drop PCS quirk for Denverton and beyond")
Cc: <stable@vger.kernel.org>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09d6ac8d

blk-rq-qos: fix first node deletion of rq_qos_del() · 307f4065

由 Tejun Heo 提交于 10月 15, 2019

rq_qos_del() incorrectly assigns the node being deleted to the head if
it was the first on the list in the !prev path.  Fix it by iterating
with ** instead.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Fixes: a7905043 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: NJens Axboe <axboe@kernel.dk>

307f4065

blkcg: Fix multiple bugs in blkcg_activate_policy() · 9d179b86

由 Tejun Heo 提交于 10月 15, 2019

blkcg_activate_policy() has the following bugs.

* cf09a8ee ("blkcg: pass @q and @blkcg into
  blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
  blkcg_activate_policy() ends up using pd's allocated for the root
  blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
  can be passed in pd's which are allocated for the root blkcg.

  For blk-iocost, this means that ->pd_init_fn() can write beyond the
  end of the allocated object as it determines the length of the flex
  array at the end based on the blkcg's nesting level.

* Each pd is initialized as they get allocated.  If alloc fails, the
  policy will get freed with pd's initialized on it.

* After the above partial failure, the partial pds are not freed.

This patch fixes all the above issues by

* Restructuring blkcg_activate_policy() so that alloc and init passes
  are separate.  Init takes place only after all allocs succeeded and
  on failure all allocated pds are freed.

* Unifying and fixing the cleanup of the remaining pd_prealloc.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: cf09a8ee ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9d179b86

15 10月, 2019 2 次提交

io_uring: consider the overflow of sequence for timeout req · 5da0fb1a

由 yangerkun 提交于 10月 15, 2019

Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:

1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.

2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.

This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5da0fb1a

block: Fix elv_support_iosched() · 7a7c5e71

由 Damien Le Moal 提交于 10月 09, 2019

A BIO based request queue does not have a tag_set, which prevent testing
for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
require an elevator. This leads to an incorrect initialization of a
default elevator in some cases such as BIO based null_blk
(queue_mode == BIO) with zoned mode enabled as the default elevator in
this case is mq-deadline instead of "none".

Fix this by testing for a NULL queue mq_ops field which indicates that
the queue is BIO based and should not have an elevator.
Reported-by: NShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7a7c5e71

11 10月, 2019 1 次提交

io_uring: fix sequence logic for timeout requests · 7adf4eaf

由 Jens Axboe 提交于 10月 10, 2019

We have two ways a request can be deferred:

1) It's a regular request that depends on another one
2) It's a timeout that tracks completions

We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7adf4eaf

10 10月, 2019 3 次提交

nbd: fix possible sysfs duplicate warning · 86248810

由 Xiubo Li 提交于 9月 19, 2019

1. nbd_put takes the mutex and drops nbd->ref to 0. It then does
idr_remove and drops the mutex.

2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails
to find an existing device, so it does nbd_dev_add.

3. just before the nbd_put could call nbd_dev_remove or not finished
totally, but if nbd_dev_add try to add_disk, we can hit:

debugfs: Directory 'nbd1' with parent 'block' already present!

This patch will make sure all the disk add/remove stuff are done
by holding the nbd_index_mutex lock.
Reported-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NXiubo Li <xiubli@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

86248810

null_blk: Fix zoned command return code · 79a85e21

由 Keith Busch 提交于 10月 10, 2019

The return code from null_handle_zoned() sets the cmd->error value.
Returning OK status when an error occured overwrites the intended
cmd->error. Return the appropriate error code instead of setting the
error in the cmd.

Fixes: fceb5d1b ("null_blk: create a helper for zoned devices")
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

79a85e21

io_uring: only flush workqueues on fileset removal · 8a997340

由 Jens Axboe 提交于 10月 09, 2019

We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c ("io_uring: add file set registration")
Reported-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8a997340

08 10月, 2019 1 次提交

io_uring: remove wait loop spurious wakeups · 6805b32e

由 Pavel Begunkov 提交于 10月 08, 2019

Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().

Just use percpu_ref_put() instead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6805b32e

06 10月, 2019 3 次提交

blk-wbt: fix performance regression in wbt scale_up/scale_down · b84477d3

由 Harshad Shirwadkar 提交于 10月 05, 2019

scale_up wakes up waiters after scaling up. But after scaling max, it
should not wake up more waiters as waiters will not have anything to
do. This patch fixes this by making scale_up (and also scale_down)
return when threshold is reached.

This bug causes increased fdatasync latency when fdatasync and dd
conv=sync are performed in parallel on 4.19 compared to 4.14. This
bug was introduced during refactoring of blk-wbt code.

Fixes: a7905043 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: NHarshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b84477d3

Revert "libata, freezer: avoid block device removal while system is frozen" · 0e48f51c

由 Mika Westerberg 提交于 10月 04, 2019

This reverts commit 85fbd722.

The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0e48f51c

bdi: Do not use freezable workqueue · a2b90f11

由 Mika Westerberg 提交于 10月 04, 2019

A removable block device, such as NVMe or SSD connected over Thunderbolt
can be hot-removed any time including when the system is suspended. When
device is hot-removed during suspend and the system gets resumed, kernel
first resumes devices and then thaws the userspace including freezable
workqueues. What happens in that case is that the NVMe driver notices
that the device is unplugged and removes it from the system. This ends
up calling bdi_unregister() for the gendisk which then schedules
wb_workfn() to be run one more time.

However, since the bdi_wq is still frozen flush_delayed_work() call in
wb_shutdown() blocks forever halting system resume process. User sees
this as hang as nothing is happening anymore.

Triggering sysrq-w reveals this:

Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
Call Trace:
? __schedule+0x2c5/0x630
? wait_for_completion+0xa4/0x120
schedule+0x3e/0xc0
schedule_timeout+0x1c9/0x320
? resched_curr+0x1f/0xd0
? wait_for_completion+0xa4/0x120
wait_for_completion+0xc3/0x120
? wake_up_q+0x60/0x60
__flush_work+0x131/0x1e0
? flush_workqueue_prep_pwqs+0x130/0x130
bdi_unregister+0xb9/0x130
del_gendisk+0x2d2/0x2e0
nvme_ns_remove+0xed/0x110 [nvme_core]
nvme_remove_namespaces+0x96/0xd0 [nvme_core]
nvme_remove+0x5b/0x160 [nvme]
pci_device_remove+0x36/0x90
device_release_driver_internal+0xdf/0x1c0
nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
process_one_work+0x1c2/0x3f0
worker_thread+0x48/0x3e0
kthread+0x100/0x140
? current_work+0x30/0x30
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40

This is not limited to NVMes so exactly same issue can be reproduced by
hot-removing SSD (over Thunderbolt) while the system is suspended.

Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.
Reported-by: NAceLan Kao <acelan.kao@canonical.com>
Link: https://marc.info/?l=linux-kernel&m=138695698516487
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#tAcked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a2b90f11

04 10月, 2019 3 次提交

io_uring: fix reversed nonblock flag for link submission · bf7ec93c

由 Pavel Begunkov 提交于 10月 04, 2019

io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit()
passes something opposite.

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bf7ec93c

block: sed-opal: fix sparse warning: convert __be64 data · a9eb49c9

由 Randy Dunlap 提交于 10月 02, 2019

sparse warns about incorrect type when using __be64 data.
It is not being converted to CPU-endian but it should be.

Fixes these sparse warnings:

../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:375:20:    expected unsigned long long [usertype] align
../block/sed-opal.c:375:20:    got restricted __be64 const [usertype] alignment_granularity
../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:376:25:    expected unsigned long long [usertype] lowest_lba
../block/sed-opal.c:376:25:    got restricted __be64 const [usertype] lowest_aligned_lba

Fixes: 455a7b23 ("block: Add Sed-opal library")
Cc: Scott Bauer <scott.bauer@intel.com>
Cc: Rafael Antognolli <rafael.antognolli@intel.com>
Cc: linux-block@vger.kernel.org
Reviewed-by: NJon Derrick <jonathan.derrick@intel.com>
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a9eb49c9

block: sed-opal: fix sparse warning: obsolete array init. · dc301025

由 Randy Dunlap 提交于 10月 02, 2019

Fix sparse warning: (missing '=')
../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntax

Fixes: ff91064e ("block: sed-opal: check size of shadow mbr")
Cc: linux-block@vger.kernel.org
Cc: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Cc: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: NScott Bauer <sbauer@plzdonthack.me>
Reviewed-by: NRevanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dc301025

03 10月, 2019 1 次提交

block: pg: add header include guard · 3a4b46c3

由 Masahiro Yamada 提交于 7月 20, 2019

Add a header include guard just in case.
Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3a4b46c3

01 10月, 2019 4 次提交

Revert "s390/dasd: Add discard support for ESE volumes" · 964ce509

由 Stefan Haberland 提交于 10月 01, 2019

This reverts commit 7e64db15.

The thin provisioning feature introduces an IOCTL and the discard support
to allow userspace tools and filesystems to release unused and previously
allocated space respectively.

During some internal performance improvements and further tests, the
release of allocated space revealed some issues that may lead to data
corruption in some configurations when filesystems are mounted with
discard support enabled.

While we're working on a fix and trying to clarify the situation,
this commit reverts the discard support for ESE volumes to prevent
potential data corruption.

Cc: <stable@vger.kernel.org> # 5.3
Signed-off-by: NStefan Haberland <sth@linux.ibm.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

964ce509

s390/dasd: Fix error handling during online processing · dd454839

由 Jan Höppner 提交于 10月 01, 2019

It is possible that the CCW commands for reading volume and extent pool
information are not supported, either by the storage server (for
dedicated DASDs) or by z/VM (for virtual devices, such as MDISKs).

As a command reject will occur in such a case, the current error
handling leads to a failing online processing and thus the DASD can't be
used at all.

Since the data being read is not essential for an fully operational
DASD, the error handling can be removed. Information about the failing
command is sent to the s390dbf debug feature.

Fixes: c729696b ("s390/dasd: Recognise data for ESE volumes")
Cc: <stable@vger.kernel.org> # 5.3
Reported-by: NFrank Heimes <frank.heimes@canonical.com>
Signed-off-by: NJan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: NStefan Haberland <sth@linux.ibm.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dd454839

io_uring: use __kernel_timespec in timeout ABI · bdf20073

由 Arnd Bergmann 提交于 10月 01, 2019

All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bdf20073

loop: change queue block size to match when using DIO · 85560117

由 Martijn Coenen 提交于 9月 04, 2019

The loop driver assumes that if the passed in fd is opened with
O_DIRECT, the caller wants to use direct I/O on the loop device.
However, if the underlying block device has a different block size than
the loop block queue, direct I/O can't be enabled. Instead of requiring
userspace to manually change the blocksize and re-enable direct I/O,
just change the queue block sizes to match, as well as the io_min size.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMartijn Coenen <maco@android.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

85560117

28 9月, 2019 4 次提交

Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linus · 2d5ba0c7

由 Jens Axboe 提交于 9月 27, 2019

Pull NVMe changes from Sagi:

"This set consists of various fixes and cleanups:
 - controller removal race fix from Balbir
 - quirk additions from Gabriel and Jian-Hong
 - nvme-pci power state save fix from Mario
 - Add 64bit user commands (for 64bit registers) from Marta
 - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
 - Minor cleanups and nits from James, Dan and John"

* 'nvme-5.4' of git://git.infradead.org/nvme:
  nvme-rdma: fix possible use-after-free in connect timeout
  nvme: Move ctrl sqsize to generic space
  nvme: Add ctrl attributes for queue_count and sqsize
  nvme: allow 64-bit results in passthru commands
  nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
  nvmet-tcp: remove superflous check on request sgl
  Added QUIRKs for ADATA XPG SX8200 Pro 512GB
  nvme-rdma: Fix max_hw_sectors calculation
  nvme: fix an error code in nvme_init_subsystem()
  nvme-pci: Save PCI state before putting drive into deepest state
  nvme-tcp: fix wrong stop condition in io_work
  nvme-pci: Fix a race in controller removal
  nvmet: change ppl to lpp

2d5ba0c7

blk-mq: apply normal plugging for HDD · 3154df26

由 Ming Lei 提交于 9月 27, 2019

Some HDD drive may expose multiple hardware queues, such as MegraRaid.
Let's apply the normal plugging for such devices because sequential IO
may benefit a lot from plug merging.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3154df26

blk-mq: honor IO scheduler for multiqueue devices · a12de1d4

由 Ming Lei 提交于 9月 27, 2019

If a device is using multiple queues, the IO scheduler may be bypassed.
This may hurt performance for some slow MQ devices, and it also breaks
zoned devices which depend on mq-deadline for respecting the write order
in one zone.

Don't bypass io scheduler if we have one setup.

This patch can double sequential write performance basically on MQ
scsi_debug when mq-deadline is applied.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: NJavier González <javier@javigon.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a12de1d4

nvme-rdma: fix possible use-after-free in connect timeout · 67b483dd

由 Sagi Grimberg 提交于 9月 24, 2019

If the connect times out, we may have already destroyed the
queue in the timeout handler, so test if the queue is still
allocated in the connect error handler.
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

67b483dd

27 9月, 2019 3 次提交

block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663

由 Yufen Yu 提交于 9月 27, 2019

We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:

[  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  108.827059] PGD 0 P4D 0
[  108.827313] Oops: 0000 [#1] SMP PTI
[  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[  108.829503] Workqueue: kblockd blk_mq_timeout_work
[  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[  108.838191] Call Trace:
[  108.838406]  bt_iter+0x74/0x80
[  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
[  108.839074]  ? __switch_to_asm+0x34/0x70
[  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
[  108.840732]  blk_mq_timeout_work+0x74/0x200
[  108.841151]  process_one_work+0x297/0x680
[  108.841550]  worker_thread+0x29c/0x6f0
[  108.841926]  ? rescuer_thread+0x580/0x580
[  108.842344]  kthread+0x16a/0x1a0
[  108.842666]  ? kthread_flush_work+0x170/0x170
[  108.843100]  ret_from_fork+0x35/0x40

The bug is caused by the race between timeout handle and completion for
flush request.

When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.

After commit 12f5b931 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.

However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.

We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NYufen Yu <yuyufen@huawei.com>

-------
v2:
 - move rq_status from struct request to struct blk_flush_queue
v3:
 - remove unnecessary '{}' pair.
v4:
 - let spinlock to protect 'fq->rq_status'
v5:
 - move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8d699663

rq-qos: get rid of redundant wbt_update_limits() · 2af2783f

由 Yufen Yu 提交于 9月 17, 2019

We have updated limits after calling wbt_set_min_lat(). No need to
update again.
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NYufen Yu <yuyufen@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2af2783f

nvme: Move ctrl sqsize to generic space · f968688f

由 Keith Busch 提交于 9月 26, 2019

This isn't specific to fabrics.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

f968688f

26 9月, 2019 10 次提交

iocost: bump up default latency targets for hard disks · 7afcccaf

由 Tejun Heo 提交于 9月 25, 2019

The default hard disk param sets latency targets at 50ms.  As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.

This is excessively low.  A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.

Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7afcccaf

iocost: improve nr_lagging handling · 7cd806a9

由 Tejun Heo 提交于 9月 25, 2019

Some IOs may span multiple periods.  As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
avoid those situations.  Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.

This has the following two problems.

* When latency target percentiles aren't set, vrate adjustment should
  only be governed by queue depth depletion; however, the current code
  keeps nr_lagging active which pulls in latency results and can keep
  down vrate unexpectedly.

* When lagging condition is detected, it resets the entire negative
  busy_level.  This turned out to be way too aggressive on some
  devices which sometimes experience extended latencies on a small
  subset of commands.  In addition, a lagging IO will be accounted as
  latency target miss on completion anyway and resetting busy_level
  amplifies its impact unnecessarily.

This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7cd806a9

iocost: better trace vrate changes · 25d41e4a

由 Tejun Heo 提交于 9月 25, 2019

vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero.  busy_level turning to zero can sometimes
be as interesting an event.  This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

25d41e4a

block: don't release queue's sysfs lock during switching elevator · b89f625e

由 Ming Lei 提交于 9月 23, 2019

cecf5d87 ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.

Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.

The biggest issue is that commit cecf5d87 assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.

Fixes the issue by not releasing queue's sysfs lock during switching
elevator.

Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b89f625e

blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be

由 Ming Lei 提交于 9月 26, 2019

Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
lockdep_assert_held() called in blk_mq_sched_free_requests() which is
run in failure path of elevator_init_mq().

blk_mq_sched_free_requests() is called in the following 3 functions:

	elevator_init_mq()
	elevator_exit()
	blk_cleanup_queue()

In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
by 'mutex_lock(&q->sysfs_lock)'.

So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
into elevator_exit() for fixing the report by syzbot.

Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
Fixed: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

284b94be

nvme: Add ctrl attributes for queue_count and sqsize · 2b1ff255

由 James Smart 提交于 9月 24, 2019

Current controller interrogation requires a lot of guesswork
on how many io queues were created and what the io sq size is.
The numbers are dependent upon core/fabric defaults, connect
arguments, and target responses.

Add sysfs attributes for queue_count and sqsize.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

2b1ff255

nvme: allow 64-bit results in passthru commands · 65e68edc

由 Marta Rybczynska 提交于 9月 24, 2019

It is not possible to get 64-bit results from the passthru commands,
what prevents from getting for the Capabilities (CAP) property value.

As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).

This issue has been already discussed [2], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.htmlSigned-off-by: NMarta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

65e68edc

nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T · 19ea025e

由 Jian-Hong Pan 提交于 9月 24, 2019

Kingston NVME SSD with firmware version E8FK11.T has no interrupt after
resume with actions related to suspend to idle. This patch applied
NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue.

Fixes: d916b1be ("nvme-pci: use host managed power state for suspend")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887Signed-off-by: NJian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

19ea025e

nvmet-tcp: remove superflous check on request sgl · 30f27d57

由 Sagi Grimberg 提交于 9月 13, 2019

Now that sgl_free is null safe, drop the superflous check.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

30f27d57

Added QUIRKs for ADATA XPG SX8200 Pro 512GB · f03e42c6

由 Gabriel Craciunescu 提交于 9月 23, 2019

Booting with default_ps_max_latency_us >6000 makes the device fail.
Also SUBNQN is NULL and gives a warning on each boot/resume.
 $ nvme id-ctrl /dev/nvme0 | grep ^subnqn
   subnqn    : (null)

I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop.
To be sure is not a Laptop issue only, I tested the device on
my server board  with the same results.
( with 2x,4x link on the board and 4x link on a PCI-E card ).
Signed-off-by: NGabriel Craciunescu <nix.or.die@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

f03e42c6

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功