提交 · 3d24430694077313c75c6b89f618db09943621e4 · openeuler / Kernel

16 9月, 2019 1 次提交

block: make rq sector size accessible for block stats · 3d244306

由 Hou Tao 提交于 5月 21, 2019

Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.

Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3d244306

15 9月, 2019 1 次提交

bfq: Fix bfq linkage error · 89f3b6d6

由 Pavel Begunkov 提交于 9月 14, 2019

Since commit 795fe54c ("bfq: Add per-device weight"), bfq uses
blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it
causes linkage error if bfq compiled as a module.

Fixes: 795fe54c ("bfq: Add per-device weight")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

89f3b6d6

14 9月, 2019 7 次提交

Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.4/block · 99e5381d

由 Jens Axboe 提交于 9月 13, 2019

Pull MD fixes from Song.

* 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return

99e5381d

raid5: use bio_end_sector in r5_next_bio · 067df25c

由 Guoqing Jiang 提交于 9月 12, 2019

Actually, we calculate bio's end sector here, so use the common
way for the purpose.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

067df25c

raid5: remove STRIPE_OPS_REQ_PENDING · feb9bf98

由 Guoqing Jiang 提交于 9月 12, 2019

This stripe state is not used anymore after commit 51acbcec
("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted
state.

gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

feb9bf98

md: add feature flag MD_FEATURE_RAID0_LAYOUT · 33f2c35a

由 NeilBrown 提交于 9月 09, 2019

Due to a bug introduced in Linux 3.14 we cannot determine the
correctly layout for a multi-zone RAID0 array - there are two
possibilities.

It is possible to tell the kernel which to chose using a module
parameter, but this can be clumsy to use.  It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.
Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

33f2c35a

md/raid0: avoid RAID0 data corruption due to layout confusion. · c84a1372

由 NeilBrown 提交于 9月 09, 2019

If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.

A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones.  All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.

It is not possible to determine what layout to use - it depends which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refused to assemble an affected array if that parameter is
not set.

Fixes: 20d0189b ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

c84a1372

raid5: don't set STRIPE_HANDLE to stripe which is in batch list · 6ce220dd

由 Guoqing Jiang 提交于 9月 11, 2019

If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe
could be set with STRIPE_ACTIVE by the handle_stripe function. And if
error happens to the batch_head at the same time, break_stripe_batch_list
is called, then below warning could happen (the same report in [1]), it
means a member of batch list was set with STRIPE_ACTIVE.

[7028915.431770] stripe state: 2001
[7028915.431815] ------------[ cut here ]------------
[7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
[...]
[7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
[7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
[7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
[7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
[7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
[7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
[7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
[7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
[7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
[7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
[7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
[7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
[7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7028915.431910] Call Trace:
[7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
[7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
[7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
[7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]

Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
said "If a stripe is added to batch list, then only the first stripe
of the list should be put to handle_list and run handle_stripe."

So don't set STRIPE_HANDLE to stripe which is already in batch list,
otherwise the stripe could be put to handle_list and run handle_stripe,
then the above warning could be triggered.

[1]. https://www.spinics.net/lists/raid/msg62552.htmlSigned-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

6ce220dd

raid5: don't increment read_errors on EILSEQ return · b76b4715

由 Nigel Croxon 提交于 9月 06, 2019

While MD continues to count read errors returned by the lower layer.
If those errors are -EILSEQ, instead of -EIO, it should NOT increase
the read_errors count.

When RAID6 is set up on dm-integrity target that detects massive
corruption, the leg will be ejected from the array.  Even if the
issue is correctable with a sector re-write and the array has
necessary redundancy to correct it.

The leg is ejected because it runs up the rdev->read_errors beyond
conf->max_nr_stripes.  The return status in dm-drypt when there is
a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
Signed-off-by: NNigel Croxon <ncroxon@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

b76b4715

13 9月, 2019 1 次提交

Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-5.4/block · 21fa1004

由 Jens Axboe 提交于 9月 12, 2019

Pull NVMe updates from Sagi:

"Highlights includes:
 - controller reset and namespace scan races fixes
 - nvme discovery log change uevent support
 - naming improvements from Keith
 - multiple discovery controllers reject fix from James
 - some regular cleanups from various people"

* 'nvme-5.4' of git://git.infradead.org/nvme:
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  nvme: tcp: remove redundant assignment to variable ret
  nvme: include admin_q sync with nvme_sync_queues
  nvme: Treat discovery subsystems as unique subsystems
  nvme: fix ns removal hang when failing to revalidate due to a transient error
  nvme: make nvme_report_ns_ids propagate error back
  nvme: make nvme_identify_ns propagate errors back
  nvme: pass status to nvme_error_status
  nvme-fc: Fail transport errors with NVME_SC_HOST_PATH
  nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failed
  nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERROR

21fa1004

12 9月, 2019 24 次提交

nvmet: fix a wrong error status returned in error log page · 5f8badbc

由 Amit 提交于 9月 12, 2019

When the command data_len cannot hold all the controller errors,
we should simply return as much errors as we can fit
instead of failing the command.
Signed-off-by: NAmit Engel <amit.engel@dell.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

5f8badbc

nvme: send discovery log page change events to userspace · 85f8a435

由 Sagi Grimberg 提交于 7月 12, 2019

If the controller supports discovery log page change events,
we want to enable it. When we see a discovery log change event
we will send it up to userspace and expect it to handle it.
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

85f8a435

nvme: add uevent variables for controller devices · a42f42e5

由 Sagi Grimberg 提交于 9月 04, 2019

When we send uevents to userspace, add controller specific
environment variables to uniquly identify the controller beyond
its device name.

This will be useful to address discovery log change events by
actually verifying that the discovery controller is indeed the
same as the device that generated the event.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

a42f42e5

nvme: enable aen regardless of the presence of I/O queues · 93da4023

由 Sagi Grimberg 提交于 8月 22, 2019

AENs in general are not related to the presence of I/O queues,
so enable them regardless. Note that the only exception is that
discovery controller will not support any of the requested AENs
and nvme_enable_aen will respect that and return, so it is still
safe to enable regardless.

Note it is safe to enable AENs even before the initial namespace
scanning as we have the scan operation in a workqueue context.
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

93da4023

nvme-fabrics: allow discovery subsystems accept a kato · 2d352df5

由 Sagi Grimberg 提交于 7月 12, 2019

This modifies the behavior of discovery subsystems to accept
a kato as a preparation to support discovery log change
events. This also means that now every discovery controller
will have a default kato value, and for non-persistent connections
the host needs to pass in a zero kato value (keep_alive_tmo=0).
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

2d352df5

nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() · 1179d337

由 Markus Elfring 提交于 9月 06, 2019

Simplify this function implementation by using a known function.

Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

1179d337

nvme: Remove redundant assignment of cq vector · 97b3807e

由 Israel Rukshin 提交于 9月 05, 2019

The cq vector is already assigned with the correct value.
Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

97b3807e

nvme: Assign subsys instance from first ctrl · 733e4b69

由 Keith Busch 提交于 9月 05, 2019

The namespace disk names must be unique for the lifetime of the
subsystem. This was accomplished by using their parent subsystems'
instances which were allocated independently from the controllers
connected to that subsystem. This allowed name prefixes assigned to
namespaces to match a controller from an unrelated subsystem, and has
created confusion among users examining device nodes.

Ensure a namespace's subsystem instance never clashes with a controller
instance of another subsystem by transferring the instance ownership
to the parent subsystem from the first controller discovered in that
subsystem.
Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMinwoo Im <minwoo.im@samsung.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

733e4b69

nvme: tcp: remove redundant assignment to variable ret · 312910f4

由 Colin Ian King 提交于 9月 05, 2019

The variable ret is being initialized with a value that is never read
and is being re-assigned immediately afterwards. The assignment is
redundant and hence can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

312910f4

nvme: include admin_q sync with nvme_sync_queues · 03894b7a

由 Edmund Nadolski 提交于 9月 03, 2019

nvme_sync_queues currently syncs all namespace queues, but should
also sync the admin queue, if present.
Signed-off-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

03894b7a

nvme: Treat discovery subsystems as unique subsystems · c26aa572

由 James Smart 提交于 9月 03, 2019

Current code matches subnqn and collapses all controllers to the
same subnqn to a single subsystem structure. This is good for
recognizing multiple controllers for the same subsystem. But with
the well-known discovery subnqn, the subsystems aren't truly the
same subsystem. As such, subsystem specific rules, such as no
overlap of controller id, do not apply. With today's behavior, the
check for overlap of controller id can fail, preventing the new
discovery controller from being created.

When searching for like subsystem nqn, exclude the discovery nqn
from matching. This will result in each discovery controller being
attached to a unique subsystem structure.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

c26aa572

nvme: fix ns removal hang when failing to revalidate due to a transient error · 205da243

由 Sagi Grimberg 提交于 8月 30, 2019

If a controller reset is racing with a namespace revalidation, the
revalidation (admin) I/O will surely fail, but we should not remove the
namespace as we will execute the I/O when the controller is back up.
Same for spurious allocation errors (return -ENOMEM).

Fix this by checking the specific error code in nvme_revalidate_disk and
if it is a transient error (for example non DNR nvme statuses or
a negative ENOMEM as allocation failure), do not remove the namespace as
it will either recover when the controller is back up and schedule
a subsequent scan, or the controller is going away and the namespaces
will be removed anyways.

This fixes a hang namespace scanning racing with a controller reset and
also sporious I/O errors in path failover coditions where the
controller reset is racing with the namespace scan work with multipath
enabled.
Reported-by: NHannes Reinecke  <hare@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

205da243

nvme: make nvme_report_ns_ids propagate error back · 538af88e

由 Sagi Grimberg 提交于 8月 02, 2019

Make the callers check the return status and propagate
back accordingly (casting to errno from a positive nvme status).
Also print the return status in nvme_report_ns_ids.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

538af88e

nvme: make nvme_identify_ns propagate errors back · 331813f6

由 Sagi Grimberg 提交于 8月 02, 2019

right now callers of nvme_identify_ns only know that it failed,
but don't know why. Make nvme_identify_ns propagate the error back.
Because nvme_submit_sync_cmd may return a positive status code, we
make nvme_identify_ns receive the id by reference and return that
status up the call chain, but make sure not to leak positive nvme
status codes to the upper layers.
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

331813f6

nvme: pass status to nvme_error_status · 2f9c1736

由 Sagi Grimberg 提交于 8月 29, 2019

No need for the full blown request structure.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

2f9c1736

nvme-fc: Fail transport errors with NVME_SC_HOST_PATH · 74bd8cbe

由 James Smart 提交于 8月 06, 2019

NVME_SC_INTERNAL should indicate an internal controller errors
and not host transport errors. These errors will propagate to
upper layers (essentially nvme core) and be interpereted as
transport errors which should not be taken into account for
namespace state or condition.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

74bd8cbe

nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failed · 16686010

由 Sagi Grimberg 提交于 8月 02, 2019

This is a more appropriate error status for a transport error
detected by us (the host).
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

16686010

nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERROR · 1c0d12c0

由 Sagi Grimberg 提交于 8月 02, 2019

NVME_SC_ABORT_REQ means that the request was aborted due to
an abort command received. In our case, this is a transport
cancellation, so host pathing error is much more appropriate.

Also, convert NVME_SC_HOST_PATH_ERROR to BLK_STS_TRANSPORT for
such that callers can understand that the status is a transport
related error. This will be used by the ns scanning code to
understand if it got an error from the controller or that the
controller happens to be unreachable by the transport.
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

1c0d12c0

block: fix race between switching elevator and removing queues · 0a67b5a9

由 Ming Lei 提交于 9月 12, 2019

cecf5d87 ("block: split .sysfs_lock into two locks") starts to
release & actuire sysfs_lock again during switching elevator. So it
isn't enough to prevent switching elevator from happening by simply
clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because
in-progress switch still can move on after re-acquiring the lock,
meantime the flag of QUEUE_FLAG_REGISTERED won't get checked.

Fixes this issue by checking 'q->elevator' directly & locklessly after
q->kobj is removed in blk_unregister_queue(), this way is safe because
q->elevator can't be changed at that time.

Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0a67b5a9

scsi: core: remove dummy q->dev check · b804049d

由 Stanley Chu 提交于 9月 12, 2019

Currently blk_set_runtime_active() is checking if q->dev is null by
itself, thus remove the same checking in its user: scsi_dev_type_resume().
Signed-off-by: NStanley Chu <stanley.chu@mediatek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b804049d

block: bypass blk_set_runtime_active for uninitialized q->dev · 8a15b4d7

由 Stanley Chu 提交于 9月 12, 2019

Some devices may skip blk_pm_runtime_init() and have null pointer
in its request_queue->dev. For example, SCSI devices of UFS Well-Known
LUNs.

Currently the null pointer is checked by the user of
blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to
check it by blk_set_runtime_active() itself instead of by its users.
Signed-off-by: NStanley Chu <stanley.chu@mediatek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8a15b4d7

null_blk: validate the number of devices · f7c4ce89

由 André Almeida 提交于 9月 11, 2019

A negative number of devices is nonsensical, so change the type to
unsigned. If the number of devices is 0, it is impossible for userspace
to interact with the module, so refuse loading the driver for that case.
Signed-off-by: NAndré Almeida <andrealmeid@collabora.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f7c4ce89

null_blk: fix module name at log message · 4e47ee8f

由 André Almeida 提交于 9月 11, 2019

The name of the module is "null_blk", not "null". Make `pr_info()` follow
the pattern of `pr_err()` log messages.
Signed-off-by: NAndré Almeida <andrealmeid@collabora.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4e47ee8f

docs: block: null_blk: enhance document style · 04c56957

由 André Almeida 提交于 9月 11, 2019

Use proper ReST syntax for chapters. Add more information to enhance
standardization in the file and to make the rendering more homogeneous.
Add a SPDX identifier. Mark single-queue mode as deprecated.
Signed-off-by: NAndré Almeida <andrealmeid@collabora.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

04c56957

11 9月, 2019 6 次提交

iocost_monitor: Report debt · 7c1ee704

由 Tejun Heo 提交于 9月 04, 2019

Report debt and rename del_ms row to delay for consistency.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7c1ee704

iocost_monitor: Report more info with higher accuracy · b06f2d35

由 Tejun Heo 提交于 9月 04, 2019

When outputting json:

* Don't truncate numbers.

* Report address of iocg to ease drilling down further.

When outputting table:

* Use math.ceil() for delay_ms so that small delays don't read as 0.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b06f2d35

iocost_monitor: Always use strings for json values · e742bd5c

由 Tejun Heo 提交于 9月 04, 2019

Json has limited accuracy for numbers and can silently truncate 64bit
values, which can be extremely confusing.  Let's consistently use
string encapsulated values for json output.

While at it, convert an unnecesary f-string to str().
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e742bd5c

blk-iocost: Don't let merges push vtime into the future · e1518f63

由 Tejun Heo 提交于 9月 04, 2019

Merges have the same problem that forced-bios had which is fixed by
the previous patch.  The cost of a merge is calculated at the time of
issue and force-advances vtime into the future.  Until global vtime
catches up, how the cgroup's hweight changes in the meantime doesn't
matter and it often leads to situations where the cost is calculated
at one hweight and paid at a very different one.  See the previous
patch for more details.

Fix it by never advancing vtime into the future for merges.  If budget
is available, vtime is advanced.  Otherwise, the cost is charged as
debt.

This brings merge cost handling in line with issue cost handling in
ioc_rqos_throttle().
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e1518f63

blk-iocost: Account force-charged overage in absolute vtime · 36a52481

由 Tejun Heo 提交于 9月 04, 2019

Currently, when a bio needs to be force-charged and there isn't enough
budget, vtime is simply pushed into the future.  This means that the
cost of the whole bio is scaled using the current hweight and then
charged immediately.  Until the global vtime advances beyond this
future vtime, the cgroup won't be allowed to issue normal IOs.

This is incorrect and can lead to, for example, exploding vrate or
extended stalls if vrate range is constrained.  Consider the following
scenario.

1. A cgroup with a very low hweight runs out of budget.

2. A storm of swap-out happens on it.  All of them are scaled
   according to the current low hweight and charged to vtime pushing
   it to a far future.

3. All other cgroups go idle and now the above cgroup has access to
   the whole device.  However, because vtime is already wound using
   the past low hweight, what its current hweight is doesn't matter
   until global vtime catches up to the local vtime.

4. As a result, either vrate gets ramped up extremely or the IOs stall
   while the underlying device is idle.

This is because the hweight the overage is calculated at is different
from the hweight that it's being paid at.

Fix it by remembering the overage in absoulte vtime and continuously
paying with the actual budget according to the current hweight at each
period.

Note that non-forced bios which wait already remembers the cost in
absolute vtime.  This brings forced-bio accounting in line.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

36a52481

blk-iocost: Fix incorrect operation order during iocg free · e036c4ca

由 Tejun Heo 提交于 9月 10, 2019

ioc_pd_free() first cancels the hrtimers and then deactivates the
iocg.  However, the iocg timer can run inbetween and reschedule the
hrtimers which will end up running after the iocg is freed leading to
crashes like the following.

  general protection fault: 0000 [#1] SMP
  ...
  RIP: 0010:iocg_kick_delay+0xbe/0x1b0
  RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
  RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
  RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
  R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
  FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   iocg_delay_timer_fn+0x3d/0x60
   __hrtimer_run_queues+0xfe/0x270
   hrtimer_interrupt+0xf4/0x210
   smp_apic_timer_interrupt+0x5e/0x120
   apic_timer_interrupt+0xf/0x20
   </IRQ>

Fix it by canceling hrtimers after deactivating the iocg.

Fixes: 7caa4715 ("blkcg: implement blk-iocost")
Reported-by: NDave Jones <davej@codemonkey.org.uk>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e036c4ca

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功