Commits · 40d47c155e8ae9bcb3f2d0d01cf14d903c664726 · openeuler / Kernel

21 Nov, 2019 · 1 commit

block,bfq: Skip tracing hooks if possible · 40d47c15

Authored by Dmitry Monakhov on Nov 01, 2019

In most cases blk_tracing is not active, but the bfq_log_bfqq macro
generates pid_str unconditionally, which results in significant overhead.
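
The fix amounts to testing a cheap "is tracing on?" flag before doing any formatting. A minimal user-space sketch of that pattern, assuming a global flag in place of the kernel's blktrace check; `trace_enabled`, `build_pid_str()`, `LOG_QUEUE()` and the `bfq%d` name format are all hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool trace_enabled;        /* hypothetical stand-in for the blktrace "enabled" check */
static unsigned int names_built;  /* counts how often the costly formatting step runs */
static char last_msg[64];         /* where this sketch "traces" its message */

/* Hypothetical helper playing the role of building pid_str. */
static void build_pid_str(int pid, char *buf, size_t len)
{
	names_built++;
	snprintf(buf, len, "bfq%d", pid);
}

/*
 * Test the cheap flag first: `break` leaves the do/while before
 * pid_str is formatted whenever tracing is off.
 */
#define LOG_QUEUE(pid, msg) do {                                        \
		char pid_str[16];                                       \
		if (!trace_enabled)                                     \
			break;                                          \
		build_pid_str((pid), pid_str, sizeof(pid_str));         \
		snprintf(last_msg, sizeof(last_msg), "%s %s", pid_str, (msg)); \
	} while (0)
```

With tracing off, `LOG_QUEUE(42, "dispatch")` costs one branch; with it on, the name is built and the message formatted.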

## Test
modprobe null_blk
echo bfq > /sys/block/nullb0/queue/scheduler
fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
   --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

## Results
|        | baseline | w/ patch | gain |
|--------|----------|----------|------|
| iops   | 113.19K  | 126.42K  | +11% |

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19 Nov, 2019 · 1 commit

block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64' · c6da429e

Authored by Revanth Rajashekar on Nov 08, 2019

In the function 'activate_lsp', rather than hard-coding the short atom
header (0x83), we need to let the function 'add_short_atom_header' append
the header based on the parameter being appended.
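
For reference, a TCG short atom header is the byte `0b10 B S LLLL` (B = byte-string flag, S = signed flag, L = payload length), so 0x83 encodes an unsigned integer with a 3-byte payload. A simplified sketch of deriving the header from the value, as the commit describes. These are hypothetical re-implementations of the named helpers over an invented `struct token_buf`, not the driver's code, and the value 0x060000 used for SUM_SET_LIST in the usage below is an assumption to check against the driver's protocol header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Token encoding constants, following the TCG Core token layout. */
#define TINY_ATOM_DATA_MASK   0x3Fu  /* unsigned tiny atom: values 0..63, no header */
#define SHORT_ATOM_ID         0x80u  /* 0b10xxxxxx */
#define SHORT_ATOM_BYTESTRING 0x20u  /* B bit */
#define SHORT_ATOM_SIGNED     0x10u  /* S bit */
#define SHORT_ATOM_LEN_MASK   0x0Fu  /* payload length, 0..15 bytes */

/* Hypothetical command buffer. */
struct token_buf {
	uint8_t data[32];
	size_t pos;
};

static void add_token_u8(struct token_buf *b, uint8_t v)
{
	assert(b->pos < sizeof(b->data));
	b->data[b->pos++] = v;
}

static void add_short_atom_header(struct token_buf *b, int bytestring,
				  int has_sign, size_t len)
{
	uint8_t atom = SHORT_ATOM_ID;

	if (bytestring)
		atom |= SHORT_ATOM_BYTESTRING;
	if (has_sign)
		atom |= SHORT_ATOM_SIGNED;
	atom |= (uint8_t)(len & SHORT_ATOM_LEN_MASK);
	add_token_u8(b, atom);
}

/* Append an unsigned integer, choosing the header from the value itself. */
static void add_token_u64(struct token_buf *b, uint64_t number)
{
	size_t msb = 0, len;

	if (!(number & ~(uint64_t)TINY_ATOM_DATA_MASK)) {
		add_token_u8(b, (uint8_t)number);   /* fits in a tiny atom */
		return;
	}
	for (uint64_t n = number; n; n >>= 1)
		msb++;                              /* number of significant bits */
	len = (msb + 7) / 8;                        /* bytes needed for the payload */
	add_short_atom_header(b, 0, 0, len);
	while (len--)
		add_token_u8(b, (uint8_t)(number >> (len * 8)));  /* big-endian payload */
}
```

Encoding 0x060000 needs 19 significant bits, hence a 3-byte payload and the same 0x83 header that used to be hard-coded; values below 64 become a single tiny-atom byte.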

The parameter is defined in Section 3.1.2.1 of
https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_Single_User_Mode_v1-00_r1-00-Final.pdf

Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

18 Nov, 2019 · 2 commits

blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1 · 496074f9

Authored by Tejun Heo on Nov 14, 2019

Currently, cgroup rstat is supported only on cgroup2 hierarchy and
rstat functions shouldn't be called on cgroup1 cgroups.  While
converting blk-cgroup core statistics to rstat, f7331648
("blk-cgroup: reimplement basic IO stats using cgroup rstat")
accidentally ended up calling cgroup_rstat_updated() on cgroup1
cgroups causing crashes.

Longer term, we probably should add cgroup1 support to rstat but for
now let's mask the call directly.

Fixes: f7331648 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Tested-by: Faiz Abbas <faiz_abbas@ti.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Don't disable interrupts in trigger_softirq() · de678bc6

Authored by Sebastian Andrzej Siewior on Nov 18, 2019

trigger_softirq() is always invoked as an SMP function call, which always
runs with interrupts disabled.

Don't disable interrupts in trigger_softirq(), because they are already
disabled.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

14 Nov, 2019 · 2 commits

sbitmap: Delete sbitmap_any_bit_clear() · 708edafa

Authored by John Garry on Nov 14, 2019

Since the only caller of this function has been deleted, delete this
one as well.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue() · cb711b91

Authored by John Garry on Nov 14, 2019

These functions are not referenced, so delete them.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08 Nov, 2019 · 9 commits

block: split bio if the only bvec's length is > SZ_4K · 6952a7f8

Authored by Ming Lei on Nov 08, 2019

A 64K PAGE_SIZE is common on ARM64 and other architectures, and 64K is
probably big enough to break some devices, so change the logic to split
the bio if its only bvec's length is > SZ_4K instead of PAGE_SIZE.

Fixes: fa532287 (block: avoid blk_bio_segment_split for small I/O operations)
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: still try to split bio if the bvec crosses pages · 59db8ba2

Authored by Ming Lei on Nov 08, 2019

Some devices may set the segment boundary to PAGE_SIZE - 1. If the bvec
crosses pages and its length is <= PAGE_SIZE at the same time, we still
need to split the bvec into 2 segments.

Fix this issue by still splitting the bio if the single bvec crosses
pages.
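
Combined with the SZ_4K cap from 6952a7f8, the fast path can skip splitting only for a single bvec that is at most 4K long and does not cross a page. A hypothetical predicate over a simplified bio (`struct bio_view` and `bio_may_skip_split()` are illustrative, not kernel types), assuming `offset` is the bvec's offset within its first page:

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_4K 4096u

/* Hypothetical, simplified view of a bio: just what the fast path looks at. */
struct bvec_view {
	unsigned int len;     /* bytes in the bvec */
	unsigned int offset;  /* offset within its first page */
};

struct bio_view {
	unsigned short vcnt;     /* number of bvecs */
	struct bvec_view first;  /* the first (here: only) bvec */
};

/* Returns true when the splitting code can be bypassed. */
static bool bio_may_skip_split(const struct bio_view *bio, unsigned int page_size)
{
	assert(bio->first.offset < page_size);
	if (bio->vcnt != 1)
		return false;           /* several bvecs: always run the splitter */
	if (bio->first.len > SZ_4K)
		return false;           /* 6952a7f8: a 4K cap rather than PAGE_SIZE */
	/* 59db8ba2: a bvec crossing a page may straddle a PAGE_SIZE - 1 boundary mask. */
	return bio->first.offset + bio->first.len <= page_size;
}
```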

Reported-by: kernel test robot <lkp@intel.com>
Fixes: fa532287 (block: avoid blk_bio_segment_split for small I/O operations)
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT · 1d156646

Authored by Tejun Heo on Nov 07, 2019

blkg_rwstat is now only used by bfq-iosched and blk-throtl when on
cgroup1.  Let's move it into its own files and gate it behind a config
option.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-cgroup: reimplement basic IO stats using cgroup rstat · f7331648

Authored by Tejun Heo on Nov 07, 2019

blk-cgroup has been using blkg_rwstat to track basic IO stats.
Unfortunately, reading recursive stats scales badly as it involves
walking all descendants.  On systems with a huge number of cgroups
(dead or alive), this can lead to substantial CPU cost when reading IO
stats.

This patch reimplements basic IO stats using cgroup rstat which uses
more memory but makes recursive stat reading O(# descendants which
have been active since last reading) instead of O(# descendants).
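
A toy single-CPU model of the idea, assuming one byte counter per cgroup: the hot path bumps a local counter and records the node on an "updated" list once; a read flushes only the recorded nodes, pushing each node's delta (`cur - last`) up its ancestors. All names are hypothetical, and the real rstat code keeps per-CPU updated trees and propagates parent by parent rather than using a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_UPDATED 16

struct node {
	struct node *parent;
	unsigned long long cur;    /* local counter (per-CPU in the kernel) */
	unsigned long long last;   /* value of cur at the previous flush */
	unsigned long long total;  /* recursive total, valid after a flush */
	bool updated;              /* already on the updated list? */
};

struct updated_list {
	struct node *nodes[MAX_UPDATED];
	size_t n;
};

/* Hot path: O(1), no walk over descendants. */
static void stat_add(struct updated_list *ul, struct node *nd, unsigned long long bytes)
{
	nd->cur += bytes;
	if (!nd->updated) {
		assert(ul->n < MAX_UPDATED);
		nd->updated = true;
		ul->nodes[ul->n++] = nd;
	}
}

/* Read path: visits only nodes updated since the last flush. */
static void stat_flush(struct updated_list *ul)
{
	for (size_t i = 0; i < ul->n; i++) {
		struct node *nd = ul->nodes[i];
		unsigned long long delta = nd->cur - nd->last;

		nd->last = nd->cur;
		nd->updated = false;
		for (struct node *p = nd; p; p = p->parent)
			p->total += delta;
	}
	ul->n = 0;
}
```

Idle cgroups never enter the updated list, so a read costs nothing for them, which is the scaling property the commit is after.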

* blk-cgroup core no longer uses sync/async stats.  Introduce new stat
  enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}.

* Add blkg_iostat[_set] which encapsulates byte and io stats, last
  values for propagation delta calculation and u64_stats_sync for
  correctness on 32bit archs.

* Update the new percpu stat counters directly and implement
  blkcg_rstat_flush() to implement propagation.

* blkg_print_stat() can now bring the stats up to date by calling
  cgroup_rstat_flush() and print them instead of directly summing up
  all descendants.

* It now allocates 96 bytes per cpu.  It used to be 40 bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Cc: Daniel Xu <dlxu@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive() · 8a80d5d6

Authored by Tejun Heo on Nov 07, 2019

These don't have users anymore.  Remove them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-throtl: stop using blkg->stat_bytes and ->stat_ios · 7ca46438

Authored by Tejun Heo on Nov 07, 2019

When used on cgroup1, blk-throtl uses the blkg->stat_bytes and
->stat_ios from blk-cgroup core to populate four stat knobs.
blk-cgroup core is moving away from blkg_rwstat to improve scalability
and won't be able to support this usage.

It isn't like the sharing gains all that much.  Let's break them out
to dedicated rwstat counters which are updated when on cgroup1.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq-iosched: stop using blkg->stat_bytes and ->stat_ios · fd41e603

Authored by Tejun Heo on Nov 07, 2019

When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
from blk-cgroup core to populate six stat knobs.  blk-cgroup core is
moving away from blkg_rwstat to improve scalability and won't be able
to support this usage.

It isn't like the sharing gains all that much.  Let's break it out to
dedicated rwstat counters which are updated when on cgroup1.  This
makes use of bfqg_*rwstat*() helpers outside of
CONFIG_BFQ_CGROUP_DEBUG.  Move them out.

v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq-iosched: relocate bfqg_*rwstat*() helpers · a557f1c7

Authored by Tejun Heo on Nov 07, 2019

Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG.  The next
patch will use them from !DEBUG path and this makes it easy to move
them out of the ifdef block.

This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-linus' into for-5.5/block · 912c0a85

Authored by Jens Axboe on Nov 07, 2019

Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.

* for-linus: (942 commits)
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
  Linux 5.4-rc5
  riscv: cleanup do_trap_break
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
  ...

07 Nov, 2019 · 5 commits

block: add zone open, close and finish ioctl support · e876df1f

Authored by Ajay Joshi on Oct 27, 2019

Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and
BLKFINISHZONE to allow applications to control the condition of zones
on a zoned block device through the execution of the REQ_OP_ZONE_OPEN,
REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations.

Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: add zone open, close and finish operations · 6c1b1da5

Authored by Ajay Joshi on Oct 27, 2019

Zoned block devices (ZBC and ZAC devices) allow explicit control
over the condition (state) of zones. The operations allowed are:
* Open a zone: Transition to open condition to indicate that a zone will
  actively be written
* Close a zone: Transition to closed condition to release the drive
  resources used for writing to a zone
* Finish a zone: Transition an open or closed zone to the full
  condition to prevent write operations

To enable this control for in-kernel zoned block device users, define
the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
and REQ_OP_ZONE_FINISH as well as the generic function
blkdev_zone_mgmt() for submitting these operations on a range of zones.
This results in the removal of blkdev_reset_zones() and its replacement
with this new zone management function. Users of blkdev_reset_zones()
(f2fs and dm-zoned) are updated accordingly.

Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Simplify REQ_OP_ZONE_RESET_ALL handling · c7a1d926

Authored by Damien Le Moal on Oct 27, 2019

There is no need for the function __blkdev_reset_all_zones() as
REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones()
bio loop with an early break from the loop. This patch removes this
function and modifies blkdev_reset_zones(), simplifying the code.
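
A model of the resulting issue loop that just counts bios: one per zone in the range, except that a reset covering the whole device collapses into a single REQ_OP_ZONE_RESET_ALL bio and leaves the loop early. The function, its parameters, the `reset_all_supported` flag and the alignment checks are assumptions for illustration, not the kernel's code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum zone_op { ZONE_OP_RESET, ZONE_OP_OPEN, ZONE_OP_CLOSE, ZONE_OP_FINISH };

/* Hypothetical model: number of bios one call would issue, or -EINVAL. */
static long zone_mgmt_bio_count(enum zone_op op, uint64_t sector, uint64_t nr_sectors,
				uint64_t zone_sectors, uint64_t capacity,
				bool reset_all_supported)
{
	uint64_t end_sector = sector + nr_sectors;
	long bios = 0;

	assert(zone_sectors && !(zone_sectors & (zone_sectors - 1)));  /* power of two */
	if (!nr_sectors || end_sector > capacity)
		return -EINVAL;
	if (sector & (zone_sectors - 1))
		return -EINVAL;  /* must start on a zone boundary */
	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
		return -EINVAL;  /* only a smaller last zone may end unaligned */

	while (sector < end_sector) {
		bios++;
		/* Whole-device reset: one RESET_ALL bio, then stop. */
		if (op == ZONE_OP_RESET && reset_all_supported &&
		    sector == 0 && nr_sectors == capacity)
			break;
		sector += zone_sectors;  /* one bio per zone otherwise */
	}
	return bios;
}
```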

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Remove REQ_OP_ZONE_RESET plugging · a84324d2

Authored by Damien Le Moal on Oct 27, 2019

REQ_OP_ZONE_RESET operations cannot be merged as these bios and requests
do not have a size and are never sequential due to the zone start sector
position required for their execution. As a result, there is no point in
using a plug around blkdev_reset_zones() bio issuing loop. This patch
removes this unnecessary plugging.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blkcg: make blkcg_print_stat() print stats only for online blkgs · b0814361

Authored by Tejun Heo on Nov 05, 2019

blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
the blkg is online.  This can call into pd_stat_fn() on a pd which is
still being initialized, leading to an oops.

The heaviest operation - recursively summing up rwstat counters - is
already done while holding the queue_lock.  Expand queue_lock to cover
the other operations and skip the blkg if it isn't online yet.  The
online state is protected by both blkcg and queue locks, so this
guarantees that only online blkgs are processed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Roman Gushchin <guro@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Fixes: 903d23f0 ("blk-cgroup: allow controllers to output their own stats")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

06 Nov, 2019 · 3 commits

block: Warn if elevator= parameter is used · f8db3835

Authored by Jan Kara on Nov 06, 2019

With the transition to blk-mq, the elevator= kernel argument was removed,
as it makes less and less sense with the current variety of devices.
Since this may surprise some users, and there is advice on the Internet
that still suggests using it, let's at least warn if the parameter is used.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'nvme-5.4-rc7' of git://git.infradead.org/nvme into for-linus · 0473976c

Authored by Jens Axboe on Nov 05, 2019

Pull NVMe fixes from Keith:

"We have a few late nvme fixes for a couple device removal kernel
 crashes, and a compat fix for a new ioctl introduced during this merge
 window."

* 'nvme-5.4-rc7' of git://git.infradead.org/nvme:
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload

nvme: change nvme_passthru_cmd64 to explicitly mark rsvd · 0d6eeb1f

Authored by Charles Machalow on Nov 04, 2019

Change nvme_passthru_cmd64 to add a field, rsvd2. This field is an explicit
marker for the padding space added on certain platforms as a result of the
enlargement of the result field from 32 bits to 64 bits in size, and it
fixes differences in struct size when using the compat ioctl for 32-bit
binaries on 64-bit architectures.
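
The size difference comes from alignment: `timeout_ms` ends at byte 68, so a 64-bit ABI inserts 4 bytes of padding before the 64-bit `result`, while the i386 ABI aligns 64-bit struct members to 4 bytes and inserts none. A sketch with the padding made explicit; the field order is assumed to mirror include/uapi/linux/nvme_ioctl.h and the struct name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of nvme_passthru_cmd64 with the explicit pad. */
struct passthru_cmd64_sketch {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t rsvd1;
	uint32_t nsid;
	uint32_t cdw2;
	uint32_t cdw3;
	uint64_t metadata;
	uint64_t addr;
	uint32_t metadata_len;
	uint32_t data_len;
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
	uint32_t timeout_ms;
	uint32_t rsvd2;      /* explicit: the 4 bytes a 64-bit ABI would pad implicitly */
	uint64_t result;
};

/* With rsvd2 the layout no longer depends on how the ABI aligns uint64_t. */
_Static_assert(offsetof(struct passthru_cmd64_sketch, result) == 72, "result offset");
_Static_assert(sizeof(struct passthru_cmd64_sketch) == 80, "struct size");
```

With `rsvd2` in place, `result` sits at offset 72 and the struct is 80 bytes on both ABIs, so a 32-bit binary and a 64-bit kernel agree on the layout.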

Fixes: 65e68edc ("nvme: allow 64-bit results in passthru commands")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Charles Machalow <csm10495@gmail.com>
[changelog]
Signed-off-by: Keith Busch <kbusch@kernel.org>

05 Nov, 2019 · 3 commits

nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8

Authored by Anton Eidelman on Nov 01, 2019

nvme_mpath_clear_ctrl_paths() iterates through
the ctrl->namespaces list while holding ctrl->scan_lock.
This does not seem to be the correct way of protecting
from concurrent list modification.

Specifically, nvme_scan_work() sorts ctrl->namespaces
AFTER unlocking scan_lock.

This may result in the following (rare) crash in ctrl disconnect
during scan_work:

    BUG: kernel NULL pointer dereference, address: 0000000000000050
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
    RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
    ...
    Call Trace:
     nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
     nvme_remove_namespaces+0x35/0xe0 [nvme_core]
     nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
     nvme_sysfs_delete+0x49/0x60 [nvme_core]
     dev_attr_store+0x17/0x30
     sysfs_kf_write+0x3e/0x50
     kernfs_fop_write+0x11e/0x1a0
     __vfs_write+0x1b/0x40
     vfs_write+0xb9/0x1a0
     ksys_write+0x67/0xe0
     __x64_sys_write+0x1a/0x20
     do_syscall_64+0x5a/0x130
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f8d02bfb154

Fix:
After taking scan_lock in nvme_mpath_clear_ctrl_paths()
down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe.
This will not cause deadlocks because taking scan_lock never happens
while holding the namespaces_rwsem.
Moreover, scan work downs namespaces_rwsem in the same order.
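
A single-threaded sketch of the lock order described above, using a pthread mutex and rwlock as stand-ins for the kernel's mutex and rw_semaphore. It assumes, as the commit implies, that the scanner reorders the list under the write lock but outside scan_lock, while the path-clearing reader takes scan_lock first and then the read lock. `struct ctrl`, `struct ns` and both functions are hypothetical simplifications:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct ns {
	struct ns *next;
	int has_current_path;   /* stand-in for the multipath current path */
};

struct ctrl {
	pthread_mutex_t scan_lock;          /* models ctrl->scan_lock */
	pthread_rwlock_t namespaces_rwsem;  /* models ctrl->namespaces_rwsem */
	struct ns *namespaces;
};

/* Writer side: reorders the list under the rwsem only (stand-in for the sort). */
static void reorder_namespaces(struct ctrl *c)
{
	struct ns *rev = NULL;

	pthread_rwlock_wrlock(&c->namespaces_rwsem);
	while (c->namespaces) {
		struct ns *n = c->namespaces;

		c->namespaces = n->next;
		n->next = rev;
		rev = n;
	}
	c->namespaces = rev;
	pthread_rwlock_unlock(&c->namespaces_rwsem);
}

/* Reader side after the fix: scan_lock first, then the rwsem for reading. */
static int clear_ctrl_paths(struct ctrl *c)
{
	int cleared = 0;

	assert(c != NULL);
	pthread_mutex_lock(&c->scan_lock);
	pthread_rwlock_rdlock(&c->namespaces_rwsem);
	for (struct ns *n = c->namespaces; n; n = n->next) {
		if (n->has_current_path) {
			n->has_current_path = 0;
			cleared++;
		}
	}
	pthread_rwlock_unlock(&c->namespaces_rwsem);
	pthread_mutex_unlock(&c->scan_lock);
	return cleared;
}
```

Because the writer never takes scan_lock while holding the rwsem, the reader's scan_lock-then-rwsem order cannot form a cycle with it.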

Alternative: sort ctrl->namespaces in nvme_scan_work()
while still holding the scan_lock.
This would leave nvme_mpath_clear_ctrl_paths() without correct protection
against ctrl->namespaces modification by anyone other than scan_work.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvme-rdma: fix a segmentation fault during module unload · 9ad9e8d6

Authored by Max Gurtovoy on Oct 29, 2019

In case there are controllers that are not associated with any RDMA
device (e.g. during an unsuccessful reconnection) and the user unloads
the module, these controllers will not be freed and will access already
freed memory. The same logic appears in other fabric drivers as well.

Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

block: avoid blk_bio_segment_split for small I/O operations · fa532287

Authored by Christoph Hellwig on Nov 04, 2019

__blk_queue_split() adds significant overhead for small I/O operations.
Add a shortcut to avoid it for cases where we know we never need to
split.

Based on a patch from Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04 Nov, 2019 · 4 commits

blk-mq: make sure that line break can be printed · d2c9be89

Authored by Ming Lei on Nov 04, 2019

8962842c ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
avoids the sysfs buffer overflow and reserves one character for the line
break. However, the last snprintf() isn't passed the correct 'size'
parameter; fix that.

Fixes: 8962842c ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: sed-opal: Introduce Opal Datastore UID · 62c441c6

Authored by Revanth Rajashekar on Oct 31, 2019

This patch introduces Opal Datastore UID.
The generic read/write table ioctl can use this UID
to access the Opal Datastore.

Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: sed-opal: Add support to read/write opal tables generically · 51f421c8

Authored by Revanth Rajashekar on Oct 31, 2019

This feature gives the user RW access to any opal table with admin1
authority. The flags described in the new structure determine whether the
user wants to read or write the data. Flags are checked for valid values
in order to allow future features to be added to the ioctl.

The user can provide the desired table's UID. Also, the ioctl provides a
size and offset field and internally will loop data accesses to return
the full data block. Read overrun is prevented by the initiator's
sec_send_recv() backend. The ioctl provides a private field with the
intention to accommodate any future expansions to the ioctl.
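
The internal looping can be pictured as splitting one user request into bounded accesses. A hypothetical planner, taking the per-access limit `max_chunk` as a plain parameter; how the driver derives its limit is not shown here, and all names are invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct table_chunk {
	uint64_t offset;  /* byte offset into the table for this access */
	uint64_t len;     /* bytes moved by this access */
};

/*
 * Split one (offset, size) request into accesses of at most max_chunk
 * bytes. Returns the number of chunks, or (size_t)-1 if `out` is too small.
 */
static size_t plan_table_access(uint64_t offset, uint64_t size, uint64_t max_chunk,
				struct table_chunk *out, size_t max_out)
{
	size_t n = 0;

	assert(max_chunk > 0);
	while (size > 0) {
		uint64_t len = size < max_chunk ? size : max_chunk;

		if (n == max_out)
			return (size_t)-1;
		out[n].offset = offset;
		out[n].len = len;
		n++;
		offset += len;
		size -= len;
	}
	return n;
}
```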

Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: sed-opal: Generalizing write data to any opal table · 3495ea1b

Authored by Revanth Rajashekar on Oct 31, 2019

This patch refactors the existing "write_shadowmbr" func and
creates a new generalized function "generic_table_write_data",
to write data to any opal table. Also, a few cleanups are included
in this patch.

Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03 Nov, 2019 · 2 commits

bdev: Refresh bdev size for disks without partitioning · cba22d86

Authored by Jan Kara on Oct 21, 2019

Currently, the block device size is not updated on the second and further
opens of block devices where partition scan is disabled. This is
particularly annoying for DVD drives, for example, as it means the block
device size does not get updated once the media is inserted into a drive
if the device is already open when inserting the media. This is actually
always the case when pktcdvd is in use, for example.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bdev: Factor out bdev revalidation into a common helper · 731dc486

Authored by Jan Kara on Oct 21, 2019

Factor out code handling revalidation of bdev on disk change into a
common helper.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02 Nov, 2019 · 1 commit

blk-mq: avoid sysfs buffer overflow with too many CPU cores · 8962842c

Authored by Ming Lei on Nov 02, 2019

It is reported that a sysfs buffer overflow can be triggered if the system
has too many CPU cores (>841 with a 4K PAGE_SIZE) when showing the CPUs of
a hctx via /sys/block/$DEV/mq/$N/cpu_list.

Use snprintf to avoid the potential buffer overflow.

This version doesn't change the attribute format, and simply stops
showing CPU numbers if the buffer is going to overflow.
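
A user-space sketch of the pattern, including the follow-up fix from d2c9be89: reserve one byte for the '\n', stop listing CPUs as soon as an entry would be truncated, and give the final snprintf() the reserved byte back. `format_cpu_list()` is hypothetical; the kernel version writes into a PAGE_SIZE sysfs buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * List CPUs into a fixed-size page: stop before the buffer would
 * overflow and always leave room for the final '\n'.
 */
static size_t format_cpu_list(char *page, size_t page_size,
			      const unsigned int *cpus, size_t ncpus)
{
	const size_t size = page_size - 1;  /* keep one byte for the '\n' */
	size_t pos = 0;
	int ret;

	assert(page_size >= 2);
	for (size_t i = 0; i < ncpus; i++) {
		if (i == 0)
			ret = snprintf(page + pos, size - pos, "%u", cpus[i]);
		else
			ret = snprintf(page + pos, size - pos, ", %u", cpus[i]);
		if (ret < 0 || (size_t)ret >= size - pos)
			break;                      /* entry would be truncated */
		pos += (size_t)ret;
	}
	/* Passing size + 1 - pos (not size - pos) lets the reserved byte hold '\n'. */
	ret = snprintf(page + pos, size + 1 - pos, "\n");
	return pos + (size_t)ret;
}
```

With an 8-byte buffer and CPUs 10, 11, 12 the output is `10, 11\n`: the listing stops at byte 6, and passing `size - pos` (1) instead of `size + 1 - pos` (2) to the last call would leave no room for the newline.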

Cc: stable@vger.kernel.org
Fixes: 676141e4 ("blk-mq: don't dump CPU -> hw queue map on driver load")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

01 Nov, 2019 · 2 commits

blk-mq: Make blk_mq_run_hw_queue() return void · 626fb735

Authored by John Garry on Oct 30, 2019

Since commit 97889f9a ("blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue()
is never checked, so make it return void, which very marginally simplifies
the code.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

iocost: don't nest spin_lock_irq in ioc_weight_write() · 41591a51

Authored by Dan Carpenter on Oct 31, 2019

This code causes a static analysis warning:

    block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

We disable IRQs in blkg_conf_prep() and re-enable them in
blkg_conf_finish().  IRQ disable/enable should not be nested because
that means the IRQs will be enabled at the first unlock instead of the
second one.
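
A user-space model of the problem, assuming the usual semantics that `spin_unlock_irq()` re-enables interrupts unconditionally: with a nested `_irq` pair, IRQs come back on at the inner unlock while the outer section is still open. Only the IRQ side effects are modelled; every function here is hypothetical, and the alternative shown takes the inner lock with the plain (non-`_irq`) variant:

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_on = true;  /* models the local CPU's interrupt-enable flag */

/* Only the IRQ side effects of the lock calls are modelled; nothing is locked. */
static void model_spin_lock_irq(void)   { irqs_on = false; }
static void model_spin_unlock_irq(void) { irqs_on = true; }  /* unconditional enable */
static void model_spin_lock(void)       { }                  /* leaves IRQ state alone */
static void model_spin_unlock(void)     { }

/* Returns whether IRQs were on while still inside the outer section. */
static bool outer_section_irqs_on(bool nested_irq_variant)
{
	bool early;

	model_spin_lock_irq();              /* outer: blkg_conf_prep() */
	if (nested_irq_variant) {
		model_spin_lock_irq();
		model_spin_unlock_irq();    /* re-enables IRQs too early */
	} else {
		model_spin_lock();
		model_spin_unlock();
	}
	early = irqs_on;                    /* still before blkg_conf_finish() */
	model_spin_unlock_irq();            /* outer: blkg_conf_finish() */
	return early;
}
```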

Fixes: 7caa4715 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

31 Oct, 2019 · 1 commit

io_uring: ensure we clear io_kiocb->result before each issue · 6873e0bd

Authored by Jens Axboe on Oct 30, 2019

We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.

Reset ->result whenever we re-prep a request for polled submission.

Cc: stable@vger.kernel.org
Fixes: 9e645e11 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

30 Oct, 2019 · 1 commit

um-ubd: Entrust re-queue to the upper layers · d848074b

Authored by Anton Ivanov on Oct 29, 2019

Fix crashes caused by the ubd requeue logic conflicting with the blk-mq
logic. The crash is reproducible in 5.0 - 5.3.

Fixes: 53766def ("um: Clean-up command processing in UML UBD driver")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29 Oct, 2019 · 2 commits

nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf

Authored by Anton Eidelman on Oct 18, 2019

groups_only mode in nvme_read_ana_log() is no longer used: remove it.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042

Authored by Anton Eidelman on Oct 18, 2019

The following scenario results in an IO hang:
1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
   NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
   This fails because ctrl disconnects.
   Therefore nvme_update_ns_ana_state() is not called
   and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
   nvme_read_ana_log(ctrl, groups_only=true).
   However, nvme_update_ana_state() does not update namespaces
   because nr_nsids = 0 (due to groups_only mode).
4) scan_work calls nvme_validate_ns(), which finds the ns and re-validates it OK.

Result:
The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
Consequently ctrl will never be considered a viable path by __nvme_find_path().
IO will hang if ctrl is the only or the last path to the namespace.

More generally, while ctrl is reconnecting, its ANA state may change.
And because nvme_mpath_init() requests ANA log in groups_only mode,
these changes are not propagated to the existing ctrl namespaces.
This may result in a malfunction or an IO hang.

Solution:
nvme_mpath_init() will call nvme_read_ana_log() with groups_only set to false.
This will not harm the new ctrl case (no namespaces present),
and will make sure the ANA state of namespaces gets updated after reconnect.

Note: Another option would be for nvme_mpath_init() to invoke
nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28 Oct, 2019 · 1 commit

io_uring: don't touch ctx in setup after ring fd install · 044c1ab3

Authored by Jens Axboe on Oct 28, 2019

syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free when reading the ctx ->sq_array and ->rings
values right after the ring fd has been installed in the process file
table.

Defer ring fd installation until after we're done reading those
values.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
