Commits · 1aa50d020c7148f5f0bde15ca80fe6f91a8c5a4e · openeuler / Kernel

02 Sep, 2020 38 commits

blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us · 1aa50d02

Committed by Tejun Heo on Sep 01, 2020

Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.

Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1aa50d02
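
The calculation described in the commit above can be sketched roughly as follows. This is an illustration, not the kernel code: it assumes WEIGHT_ONE is `1 << 16`, and the helper name `usage_from_usage_us` and its parameters are hypothetical. It converts the device time consumed during one period into a WEIGHT_ONE based fraction and clamps it to [1, WEIGHT_ONE].

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: fixed-point representation of 1.0 (the whole device). */
#define WEIGHT_ONE (1u << 16)

/*
 * Hypothetical helper: turn the absolute usage accumulated over one period
 * (both in microseconds) into a WEIGHT_ONE based fraction. The result is
 * clamped to [1, WEIGHT_ONE]; WEIGHT_ONE doubles as the saturation signal.
 */
static uint32_t usage_from_usage_us(uint64_t usage_us_delta, uint64_t period_us)
{
	uint64_t usage;

	assert(period_us > 0);

	usage = usage_us_delta * WEIGHT_ONE / period_us;
	if (usage < 1)
		usage = 1;
	if (usage > WEIGHT_ONE)
		usage = WEIGHT_ONE;
	return (uint32_t)usage;
}
```

Because the input is absolute usage, the result does not depend on which hierarchical inuse weight happened to be in effect during the period.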

blk-iocost: add absolute usage stat · 97eb1975

Committed by Tejun Heo on Sep 01, 2020

Currently, iocost doesn't collect or expose any statistics, punting off all
monitoring duties to the drgn based iocost_monitor.py. While it works for some
scenarios, there are some usability and data availability challenges. For
example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the numbers available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.

This patch implements per-cgroup absolute usage stat counter and exposes it
through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocg's and the
only hot path overhead is preemption toggling and adding to a percpu
counter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

97eb1975

blk-iocost: grab ioc->lock for debt handling · da437b95

Committed by Tejun Heo on Sep 01, 2020

Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.

* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
  ioc->lock needs to be made before entering the critical sections.

* Add and use iocg_[un]lock() which handles the conditional double locking.

* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
  the caller grabbed both locks.

This patch is preparatory and the comments contain references to future
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da437b95
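
The conditional double locking described in the commit above might look like the following userspace sketch. It uses pthread mutexes in place of kernel spinlocks, and every name ending in `_sketch`, as well as the `debt` and `budget` fields, is a hypothetical stand-in rather than the kernel's definition. The point is the ordering: the outer lock is taken first, so the decision to take it has to be made before the inner critical section starts.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel structures; only the locks matter. */
struct ioc_sketch {
	pthread_mutex_t lock;		/* outer lock, stands in for ioc->lock */
};

struct iocg_sketch {
	struct ioc_sketch *ioc;
	pthread_mutex_t waitq_lock;	/* inner lock, stands in for iocg->waitq.lock */
	uint64_t debt;			/* hypothetical outstanding debt */
};

/* Take the outer lock first, and only if the caller decided it is needed. */
static void iocg_lock_sketch(struct iocg_sketch *iocg, bool lock_ioc)
{
	assert(iocg && iocg->ioc);
	if (lock_ioc)
		pthread_mutex_lock(&iocg->ioc->lock);
	pthread_mutex_lock(&iocg->waitq_lock);
}

/* Release in reverse order; the flag must match the one passed to lock. */
static void iocg_unlock_sketch(struct iocg_sketch *iocg, bool unlock_ioc)
{
	pthread_mutex_unlock(&iocg->waitq_lock);
	if (unlock_ioc)
		pthread_mutex_unlock(&iocg->ioc->lock);
}

/* Debt is paid only when the caller holds both locks (pay_debt is true). */
static void kick_waitq_sketch(struct iocg_sketch *iocg, bool pay_debt,
			      uint64_t budget)
{
	if (pay_debt && iocg->debt) {
		uint64_t pay = budget < iocg->debt ? budget : iocg->debt;

		iocg->debt -= pay;
	}
}
```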

blk-iocost: streamline vtime margin and timer slack handling · 7ca5b2e6

Committed by Tejun Heo on Sep 01, 2020

The margin handling was pretty inconsistent.

* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
  thresholds. However, the two are in different units with the former
  requiring conversion to vtime on use.

* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
  period_us as the timer slack (~1.2%), while iocg_kick_delay() was using a
  quarter of ioc->margin_us (~12.5%). There aren't strong reasons to use
  different values for the two.

This patch cleans up margin and timer slack handling:

* vtime margins are now recorded in ioc->margins.{min, max} on period
  duration changes and used consistently.

* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
  used consistently for iocg_kick_waitq() and iocg_kick_delay().

The only functional change is shortening of timer slack. No meaningful
visible change is expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7ca5b2e6
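
As a small worked example of the new timer slack rule from the commit above (1% of period_us, stored in nanoseconds), consider the sketch below. The constant name `TIMER_SLACK_PCT` and the helper name are assumptions made for illustration; only the 1% figure comes from the commit text.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL
#define TIMER_SLACK_PCT	1ULL	/* from the commit text: 1% of period_us */

/* Hypothetical helper: timer slack in nanoseconds for a given period. */
static uint64_t timer_slack_ns_sketch(uint64_t period_us)
{
	assert(period_us > 0);
	return period_us * NSEC_PER_USEC * TIMER_SLACK_PCT / 100;
}
```

For a 10 ms period this gives 100 us of slack, used for both iocg_kick_waitq() and iocg_kick_delay().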

blk-iocost: make ioc_now->now and ioc->period_at 64bit · ce95570a

Committed by Tejun Heo on Sep 01, 2020

They are in microseconds and wrap in around 1.2 hours with u32. While
unlikely, confusions from wraparounds are still possible. We aren't saving
anything meaningful by keeping these u32. Let's make them u64.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce95570a
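
The "around 1.2 hours" figure in the commit above follows from 2^32 microseconds being about 4295 seconds. The sketch below, with hypothetical helper names, shows the arithmetic and how a 32-bit timestamp difference goes wrong once more than that has elapsed, while the 64-bit version stays correct.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Whole seconds until a 32-bit microsecond counter wraps (about 1.19 hours). */
static uint64_t u32_usec_wrap_seconds(void)
{
	return ((uint64_t)UINT32_MAX + 1) / USEC_PER_SEC;
}

/* Elapsed time computed from timestamps truncated to 32 bits. */
static uint32_t elapsed_us_u32(uint64_t then_us, uint64_t now_us)
{
	assert(now_us >= then_us);
	return (uint32_t)now_us - (uint32_t)then_us;	/* wraps modulo 2^32 */
}

/* Elapsed time computed from full 64-bit timestamps. */
static uint64_t elapsed_us_u64(uint64_t then_us, uint64_t now_us)
{
	assert(now_us >= then_us);
	return now_us - then_us;
}
```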

blk-iocost: use WEIGHT_ONE based fixed point number for weights · bd0adb91

Committed by Tejun Heo on Sep 01, 2020

To improve weight donations, we want to be able to scale inuse with greater
accuracy and down below 1. Let's make non-hierarchical weights use
WEIGHT_ONE based fixed point numbers too, like hierarchical ones.

This doesn't cause any behavior changes yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd0adb91
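
To illustrate why a fixed point representation allows scaling "down below 1", here is a minimal sketch. It assumes WEIGHT_ONE is `1 << 16`; the helper names `weight_to_fixed` and `scale_fixed` are hypothetical and do not appear in the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: fixed-point representation of a weight of 1. */
#define WEIGHT_ONE (1u << 16)

/* Convert a plain integer weight into WEIGHT_ONE based fixed point. */
static uint32_t weight_to_fixed(uint32_t weight)
{
	assert(weight <= UINT32_MAX / WEIGHT_ONE);
	return weight * WEIGHT_ONE;
}

/* Scale a value by num/den using a 64-bit intermediate to avoid overflow. */
static uint32_t scale_fixed(uint32_t value, uint32_t num, uint32_t den)
{
	assert(den != 0);
	return (uint32_t)((uint64_t)value * num / den);
}
```

Halving a plain integer weight of 1 truncates to 0, whereas halving its fixed point form yields WEIGHT_ONE / 2, which still carries information.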

blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g · fe20cdb5

Committed by Tejun Heo on Sep 01, 2020

We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.

Pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe20cdb5

blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt · 7b84b49e

Committed by Tejun Heo on Sep 01, 2020

iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_waitq() has updated debt, unnecessarily large delays
can be applied temporarily.

Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b84b49e

blk-iocost: move iocg_kick_delay() above iocg_kick_waitq() · 6ef20f78

Committed by Tejun Heo on Sep 01, 2020

We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ef20f78

blk-iocost: clamp inuse and skip noops in __propagate_weights() · db84a72a

Committed by Tejun Heo on Sep 01, 2020

__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped. In preparation, let's make __propagate_weights() clamp
inuse on entry.

Also, make it avoid weight updates altogether if neither active nor inuse is
changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db84a72a
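
The two behaviors described in the commit above, clamping inuse on entry and skipping no-op updates, can be sketched as follows. The structure, its `nr_updates` counter, and the function name are hypothetical; the real function also walks the cgroup hierarchy, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened stand-in for the per-cgroup weight state. */
struct iocg_weights_sketch {
	uint32_t active;
	uint32_t inuse;
	unsigned int nr_updates;	/* counts real updates, for illustration */
};

/* Returns true if the weights were actually changed. */
static bool propagate_weights_sketch(struct iocg_weights_sketch *w,
				     uint32_t active, uint32_t inuse)
{
	assert(active >= 1);

	/* Clamp inuse into [1, active] here instead of trusting callers. */
	if (inuse < 1)
		inuse = 1;
	if (inuse > active)
		inuse = active;

	/* Nothing changed: skip the update altogether. */
	if (active == w->active && inuse == w->inuse)
		return false;

	w->active = active;
	w->inuse = inuse;
	w->nr_updates++;
	return true;
}
```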

blk-iocost: rename propagate_active_weights() to propagate_weights() · 00410f1b

Committed by Tejun Heo on Sep 01, 2020

It already propagates two weights - active and inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

00410f1b

blk-iocost: use local[64]_t for percpu stat · 5e124f74

Committed by Tejun Heo on Sep 01, 2020

blk-iocost has been reading percpu stat counters from remote cpus, which on
some archs can lead to torn reads on really rare occasions. Use local[64]_t
for those counters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5e124f74

block: remove the unused q argument to part_in_flight and part_in_flight_rw · 1f06959b

Committed by Christoph Hellwig on Aug 31, 2020

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1f06959b

block: remove the disk argument to delete_partition · 8328eb28

Committed by Christoph Hellwig on Aug 31, 2020

We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8328eb28

block: cleanup __alloc_disk_node · f93af2a4

Committed by Christoph Hellwig on Aug 31, 2020

Use early returns and goto-based unwinding to simplify the flow a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f93af2a4

block: move the devcgroup_inode_permission call to blkdev_get · e5c7fb40

Committed by Christoph Hellwig on Aug 31, 2020

devcgroup_inode_permission is never called for the recursive case, so
move it out into blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e5c7fb40

block: remove an outdated comment on the bd_dev field · 46d40cfa

Committed by Christoph Hellwig on Aug 31, 2020

kdev_t is long gone, so we don't need a comment saying the field isn't one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

46d40cfa

block: remove the discard_alignment field from struct hd_struct · 7cf34d97

Committed by Christoph Hellwig on Aug 31, 2020

The discard alignment is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7cf34d97

block: remove the alignment_offset field from struct hd_struct · 7b8917f5

Committed by Christoph Hellwig on Aug 31, 2020

The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b8917f5

blk-mq: use BLK_MQ_NO_TAG for no tag · e44a6a23

Committed by Xianting Tian on Aug 27, 2020

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e44a6a23
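
The idea of replacing a magic -1 with a named constant can be sketched as below. The definition of `BLK_MQ_NO_TAG` as `-1U` is an assumption here, and the bitmap allocator `get_tag_sketch` is a hypothetical example that is unrelated to the kernel's actual tag allocation code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: named constant standing in for the former magic -1. */
#define BLK_MQ_NO_TAG (-1U)

/*
 * Hypothetical allocator: claim the lowest free bit in a 32-bit bitmap and
 * return its index, or BLK_MQ_NO_TAG when all nr_tags tags are in use.
 */
static unsigned int get_tag_sketch(uint32_t *bitmap, unsigned int nr_tags)
{
	assert(bitmap && nr_tags <= 32);

	for (unsigned int tag = 0; tag < nr_tags; tag++) {
		if (!(*bitmap & (1u << tag))) {
			*bitmap |= 1u << tag;
			return tag;
		}
	}
	return BLK_MQ_NO_TAG;
}
```

Callers then compare against `BLK_MQ_NO_TAG` rather than a bare -1, which makes the intent visible at each call site.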

block: Remove blk_mq_attempt_merge() function · cdfcef9e

Committed by Baolin Wang on Aug 28, 2020

The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), just open code it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cdfcef9e

block: Add a new helper to attempt to merge a bio · 7d7ca7c5

Committed by Baolin Wang on Aug 28, 2020

There is a lot of duplicated code when trying to merge a bio from
the plug list and the sw queue. We can introduce a new helper to attempt
to merge a bio, which simplifies blk_bio_list_merge()
and blk_attempt_plug_merge().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7d7ca7c5

block: Move blk_mq_bio_list_merge() into blk-merge.c · bdc6a287

Committed by Baolin Wang on Aug 28, 2020

Move blk_mq_bio_list_merge() into blk-merge.c and
give it a generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bdc6a287

block: Move bio merge related functions into blk-merge.c · 8e756373

Committed by Baolin Wang on Aug 28, 2020

It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8e756373

blk-wbt: Remove obsolete multiqueue I/O scheduling comment · 339b5a25

Committed by Danny Lin on Aug 29, 2020

This comment was added before the multiqueue I/O scheduler framework
was introduced; multiqueue has support for I/O scheduling now, so this
obsolete comment can be removed.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

339b5a25

virtio-blk: Use kobj_to_dev() instead of container_of() · 4ce79063

Committed by Tian Tao on Aug 21, 2020

Use kobj_to_dev() instead of container_of()

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4ce79063
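
For readers unfamiliar with the helper named in the commit above, the following userspace sketch shows what such a wrapper does. The `struct kobject` and `struct device` definitions here are simplified mock-ups, not the kernel's, and `container_of` is reimplemented with `offsetof` so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified mock-ups of the kernel types, for illustration only. */
struct kobject {
	const char *name;
};

struct device {
	int id;
	struct kobject kobj;	/* embedded, as in the kernel's struct device */
};

/* Recover a pointer to the enclosing structure from a pointer to a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Typed wrapper: callers no longer spell out the type and member name. */
static inline struct device *kobj_to_dev(struct kobject *kobj)
{
	assert(kobj != NULL);
	return container_of(kobj, struct device, kobj);
}
```

Using the wrapper at call sites is shorter and gives the compiler a chance to check the argument type.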

raw: deprecate the raw driver · c4823983

Committed by Christoph Hellwig on Aug 19, 2020

The raw driver was replaced by O_DIRECT support on the block device
in 2002.  Deprecate it to prepare for removal in a few kernel releases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c4823983

block: remove the BIO_USER_MAPPED flag · 3310eeba

Committed by Christoph Hellwig on Aug 27, 2020

Just check if there is private data, in which case the bio must have
originated from bio_copy_user_iov.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3310eeba

block: remove __blk_rq_map_user_iov · 7589ad67

Committed by Christoph Hellwig on Aug 27, 2020

Just duplicate a small amount of code in the low-level map into the bio
and copy to the bio routines, leading to much easier to follow and
maintain code, and better shared error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7589ad67

block: remove __blk_rq_unmap_user · 7b63c052

Committed by Christoph Hellwig on Aug 27, 2020

Open code __blk_rq_unmap_user in the two callers.  Both never pass a NULL
bio, and one of them can use an existing local variable instead of the bio
flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b63c052

block: remove the BIO_NULL_MAPPED flag · f3256075

Committed by Christoph Hellwig on Aug 27, 2020

We can simply use a boolean flag in the bio_map_data data structure
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3256075

nvme: don't call revalidate_disk from nvme_set_queue_dying · c13f0fbc

Committed by Christoph Hellwig on Aug 23, 2020

In nvme_set_queue_dying we really just want to ensure the disk and bdev
sizes are set to zero.  Going through revalidate_disk leads to a somewhat
arcane and complex callchain relying on special behavior in a few
places.  Instead just lift the set_capacity directly to
nvme_set_queue_dying, and rename and move the nvme_mpath_update_disk_size
helper so that we can use it in nvme_set_queue_dying to propagate the
size to the bdev without detours.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c13f0fbc

block: fix locking for struct block_device size updates · c2b4bb8c

Committed by Christoph Hellwig on Aug 23, 2020

Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers.  In addition,
one of the locks, bd_mutex, is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.

Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.

This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.

Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2b4bb8c

block: replace bd_set_size with bd_set_nr_sectors · 611bee52

Committed by Christoph Hellwig on Aug 23, 2020

Replace bd_set_size with a version that takes the number of sectors
instead, as that fits most of the current and future callers much better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

611bee52
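
The relationship between a sector count and a byte size that motivates the commit above can be sketched like this. The 512-byte sector (`SECTOR_SHIFT` of 9) is the usual kernel convention; the structure and function names are hypothetical and only illustrate a setter that takes sectors instead of bytes.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Hypothetical stand-in for the part of a block device that holds its size. */
struct bdev_sketch {
	uint64_t size_bytes;
};

/* Callers pass a sector count; the conversion to bytes happens in one place. */
static void set_nr_sectors_sketch(struct bdev_sketch *bdev, uint64_t sectors)
{
	assert(bdev && sectors <= (UINT64_MAX >> SECTOR_SHIFT));
	bdev->size_bytes = sectors << SECTOR_SHIFT;
}
```

Callers that already track capacity in sectors no longer need to shift the value themselves before updating the size.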

block: Make request_queue.rpm_status an enum · db04e18d

Committed by Geert Uytterhoeven on Aug 19, 2020

request_queue.rpm_status is assigned values of the rpm_status enum only,
so reflect that in its type.

Note that including <linux/pm.h> is (currently) a no-op, as it is
already included through <linux/genhd.h> and <linux/device.h>, but it is
better to play it safe.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db04e18d

Merge branch 'block-5.9' into for-5.10/block · a98278ec

Committed by Jens Axboe on Sep 01, 2020

* block-5.9:
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu

a98278ec

blk-stat: make q->stats->lock irqsafe · e11d80a8

Committed by Tejun Heo on Sep 01, 2020

blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock
which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's
make it irqsafe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e11d80a8

blk-iocost: ioc_pd_free() shouldn't assume irq disabled · 5aeac7c4

Committed by Tejun Heo on Sep 01, 2020

ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa4715 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5aeac7c4

01 Sep, 2020 2 commits

block: fix locking in bdev_del_partition · 08fc1ab6

Committed by Christoph Hellwig on Sep 01, 2020

We need to hold the whole device bd_mutex to protect against
another thread concurrently deleting our partition before we get
to it, and thus causing a use after free.

Fixes: cddae808 ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08fc1ab6

block: release disk reference in hd_struct_free_work · cafe01ef

Committed by Ming Lei on Sep 01, 2020

Commit e8c7d14a ("block: revert back to synchronous request_queue removal")
stopped releasing the request queue from wq context because that commit
assumed that blk_put_queue() is always called in a context which is allowed
to sleep. However, this assumption isn't true because we release the disk's
reference in the partition's percpu_ref ->release(), which isn't allowed
to sleep because ->release() is run via call_rcu().

Fix this issue by moving the disk reference put into hd_struct_free_work().

Fixes: e8c7d14a ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cafe01ef

openeuler / Kernel synced successfully more than 1 year ago