Commits · 2362acb6785611eda795bfc12e1ea6b202ecf62c · openeuler / Kernel

29 Aug, 2020 9 commits

nvme-rdma: fix reset hang if controller died in the middle of a reset · 2362acb6

Committed by Sagi Grimberg on Jul 30, 2020

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.

So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling decide how to
proceed (either schedule a reconnect or remove the controller).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

2362acb6

nvme-rdma: fix timeout handler · 0475a8dc

Committed by Sagi Grimberg on Jul 29, 2020

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.

However tearing down the entire set of I/O and admin queues causes a
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
overkill for what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

0475a8dc

nvme-rdma: serialize controller teardown sequences · 5110f402

Committed by Sagi Grimberg on Aug 05, 2020

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

5110f402

nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f

Committed by Sagi Grimberg on Jul 30, 2020

If the controller becomes unresponsive in the middle of a reset, we will
hang because we are waiting for the freeze to complete, but that cannot
happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.

So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling decide how to proceed
(either schedule a reconnect or remove the controller).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

e5c01f4f

nvme-tcp: fix timeout handler · 236187c4

Committed by Sagi Grimberg on Jul 28, 2020

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.

However tearing down the entire set of I/O and admin queues causes a
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
overkill for what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

236187c4

nvme-tcp: serialize controller teardown sequences · d4d61470

Committed by Sagi Grimberg on Aug 05, 2020

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

d4d61470

nvme: have nvme_wait_freeze_timeout return if it timed out · 7cf0d7c0

Committed by Sagi Grimberg on Jul 30, 2020

Users can detect if the wait has completed or not and take appropriate
actions based on this information (e.g. whether to continue
initialization or rather fail and schedule another initialization
attempt).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

7cf0d7c0

nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance · d7144f5c

Committed by Sagi Grimberg on Aug 14, 2020

NVME_CTRL_NEW should never see any I/O, because in order to start
initialization it has to transition to NVME_CTRL_CONNECTING and from
there it will never return to this state.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

d7144f5c

nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu · a6ce7d7b

Committed by Ziye Yang on Aug 22, 2020

When handling commands without in-capsule data, we assign the ttag
assuming we already have the queue commands array allocated (based
on the queue size information in the connect data payload). However
if the connect itself did not send the connect data in-capsule we
have yet to allocate the queue commands, and we will assign a bogus
ttag and suffer a NULL dereference when we receive the corresponding
h2cdata pdu.

Fix this by checking if we already allocated commands before
dereferencing them when handling h2cdata; if we didn't, it is for sure a
connect and we should use the preallocated connect command.
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

a6ce7d7b

28 Aug, 2020 2 commits

Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9 · a433d721

Committed by Jens Axboe on Aug 28, 2020
```
Pull MD fix from Song.

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5: make sure stripe_size as power of two
```
a433d721

md/raid5: make sure stripe_size as power of two · 6af10a33

Committed by Yufen Yu on Aug 20, 2020

Commit 3b5408b9 ("md/raid5: support config stripe_size by sysfs
entry") made stripe_size a configurable value. It only requires
stripe_size to be a multiple of 4KB.

In fact, we should make sure stripe_size is a power of two. Otherwise,
stripe_shift, which is the result of ilog2, cannot represent the real
stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
get unexpected values.

Fixes: 3b5408b9 ("md/raid5: support config stripe_size by sysfs entry")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

6af10a33
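
To illustrate why "multiple of 4KB" is not a strong enough requirement, here is a small user-space sketch. The helper names (`ilog2_u32`, `is_power_of_two`, `shift_represents_size`) are made up for this example and are not the kernel's; they only mimic the arithmetic the message refers to.

```c
#include <stdbool.h>

/* User-space stand-in for ilog2(): floor(log2(n)) for n > 0. */
static unsigned int ilog2_u32(unsigned int n)
{
	unsigned int shift = 0;

	while (n >>= 1)
		shift++;
	return shift;
}

/* A value is a power of two when exactly one bit is set. */
static bool is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* True when a shift derived from ilog2() reproduces the original size. */
static bool shift_represents_size(unsigned int size)
{
	return (1u << ilog2_u32(size)) == size;
}
```

With a size of 12288 (3 x 4KB), the derived shift is 13, which stands for 8192 rather than 12288, so anything computed from the shift would silently use the wrong size. A size of 8192 round-trips correctly.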

26 Aug, 2020 2 commits

loop: Set correct device size when using LOOP_CONFIGURE · 79e5dc59

Committed by Martijn Coenen on Aug 25, 2020

The device size calculation was done before processing the loop
configuration, which meant that we set the size on the underlying
block device incorrectly in case lo_offset/lo_sizelimit were set in the
configuration. Delay computing the size until we've set up the device
parameters correctly.

Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Tested-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

79e5dc59
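
The dependence of the device size on lo_offset and lo_sizelimit can be sketched in user space as follows. This is an illustration under assumptions, not the driver's code: `loop_sectors` and its parameters are hypothetical names, and sizes are taken to be in bytes with 512-byte sectors.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * Hypothetical helper: number of 512-byte sectors a loop device would
 * expose for a backing file of file_size bytes, given an offset into
 * the file and an optional size limit (0 means "no limit").
 */
static uint64_t loop_sectors(int64_t file_size, int64_t offset, int64_t sizelimit)
{
	int64_t size = file_size;

	if (offset > 0)
		size -= offset;
	if (size < 0)
		return 0;
	if (sizelimit > 0 && sizelimit < size)
		size = sizelimit;
	return (uint64_t)size >> SECTOR_SHIFT;
}
```

Computing the size before the offset and limit are known is equivalent to calling this helper with both set to 0, which is why the result was wrong whenever the configuration supplied either of them.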

nbd: restore default timeout when setting it to zero · acb19e17

Committed by Hou Pu on Aug 10, 2020

Suppose we configure the io timeout of nbd0 to 100s. Later, after we
finish using it, we configure nbd0 again and set the io
timeout to 0. We expect it to time out after 30 seconds
and keep retrying. But in fact the timeout cannot be changed
by setting it to 0; it is still the original 100s.

So change the timeout to the default 30s when we set it to zero.
This also behaves the same as commit 2da22da5 ("nbd: fix zero
cmd timeout handling v2").

This becomes more important when we reconfigure an nbd device
and the io timeout is set to zero, because it then takes 30s
to detect the new socket, and thus io can be completed more
quickly compared to 100s.
Signed-off-by: Hou Pu <houpu@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

acb19e17
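
The intended behaviour can be modelled with a toy setter. Everything here is hypothetical (`struct fake_nbd`, `fake_nbd_set_timeout`, the `HZ` value); only the 30-second default comes from the commit message.

```c
#define HZ 100                /* assumed tick rate for this sketch */
#define DEFAULT_TIMEOUT_S 30  /* default mentioned in the commit message */

struct fake_nbd {
	unsigned long rq_timeout; /* request timeout in ticks */
};

/* Zero means "use the default" instead of "keep whatever was set before". */
static void fake_nbd_set_timeout(struct fake_nbd *nbd, unsigned long seconds)
{
	if (seconds)
		nbd->rq_timeout = seconds * HZ;
	else
		nbd->rq_timeout = DEFAULT_TIMEOUT_S * HZ;
}
```

Setting 100 and then 0 on the same device leaves the timeout at 30 seconds' worth of ticks, where the old behaviour described above would have left it at 100 seconds.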

22 Aug, 2020 23 commits

null_blk: fix passing of REQ_FUA flag in null_handle_rq · 2d62e6b0

Committed by Hou Pu on Aug 21, 2020

REQ_FUA should be checked using rq->cmd_flags instead of req_op().

Fixes: deb78b41 ("nullb: emulate cache")
Signed-off-by: Hou Pu <houpu@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d62e6b0
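
The reason the check has to look at cmd_flags is that req_op() keeps only the operation bits and drops the flag bits. A user-space sketch of that distinction follows; the constants and function names are illustrative stand-ins, and the bit position chosen for the FUA flag is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout: low 8 bits hold the operation, higher bits hold flags. */
#define OP_BITS   8
#define OP_MASK   ((1u << OP_BITS) - 1)
#define OP_WRITE  1u
#define FLAG_FUA  (1u << 17)  /* bit position assumed for this sketch */

/* Like req_op(): strip the flags, keep only the operation. */
static uint32_t op_of(uint32_t cmd_flags)
{
	return cmd_flags & OP_MASK;
}

/* Buggy check: the flag bit was masked away before it is tested. */
static bool fua_via_op(uint32_t cmd_flags)
{
	return (op_of(cmd_flags) & FLAG_FUA) != 0;
}

/* Correct check: test the flag on the full cmd_flags word. */
static bool fua_via_flags(uint32_t cmd_flags)
{
	return (cmd_flags & FLAG_FUA) != 0;
}
```

For a write with the FUA flag set, `fua_via_op()` can never return true, while `fua_via_flags()` does.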

nvmet: Disable keep-alive timer when kato is cleared to 0h · 0d3b6a8d

Committed by Amit Engel on Aug 19, 2020

According to the NVMe spec, when the keep alive timeout is set to zero,
the keep-alive timer should be disabled.
Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0d3b6a8d

nvme: redirect commands on dying queue · 5eac5f33

Committed by Chao Leng on Aug 18, 2020

If a command sent through nvme-multipath fails on a dying queue, resend it
on another path.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of the completion refactoring]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5eac5f33

nvme: just check the status code type in nvme_is_path_error · 1e41f3bd

Committed by Christoph Hellwig on Aug 18, 2020

Check the SCT sub-field for a path related status instead of enumerating
individual status codes.  As of NVMe 1.4 this adds "Internal Path Error"
and "Controller Pathing Error" to the list, but it also future proofs for
additional status codes added to the category.
Suggested-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1e41f3bd
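
A minimal sketch of a type-based check, assuming a status word in which the status code occupies bits 7:0 and the status code type (SCT) bits 10:8, with type 3h meaning "path related". The macro and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCT_MASK          0x700u  /* status code type, bits 10:8 (assumed layout) */
#define SCT_PATH_RELATED  0x300u  /* type 3h: path related status */

/* Any status whose type is "path related" counts, whatever the code is. */
static bool is_path_error(uint16_t status)
{
	return (status & SCT_MASK) == SCT_PATH_RELATED;
}
```

Because only the type bits are compared, a code added to the path-related category later is covered without touching the check, which is the future-proofing the message mentions.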

nvme: refactor command completion · 5ddaabe8

Committed by Christoph Hellwig on Aug 18, 2020

Lift all the code to decide the disposition of a completed command
from nvme_complete_rq and nvme_failover_req into a new helper, which
returns an enum of the potential actions.  nvme_complete_rq then
just switches on those and calls the proper helper for the action.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ddaabe8

nvme: rename and document nvme_end_request · 2eb81a33

Committed by Christoph Hellwig on Aug 18, 2020

nvme_end_request is a bit misnamed, as it wraps around the
blk_mq_complete_* API.  Its semantics are also non-trivial, so give it
a more descriptive name and add a comment explaining the semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2eb81a33

nvme: skip noiob for zoned devices · c41ad98b

Committed by Keith Busch on Aug 07, 2020

Zoned block devices reuse the chunk_sectors queue limit to define zone
boundaries. If such a device happens to also report an optimal
boundary, do not use that to define the chunk_sectors as that may
intermittently interfere with io splitting and zone size queries.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c41ad98b

nvme-pci: fix PRP pool size · c61b82c7

Committed by Christoph Hellwig on Aug 18, 2020

All operations are based on the controller, not the host page size.
Switch the dma pool to use the controller page size as well to avoid
massive overallocations on large page size systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c61b82c7

nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth · 7442ddce

Committed by John Garry on Aug 14, 2020

Recently nvme_dev.q_depth was changed from an int to u16 type.

This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
NVME_CAP_MQES() is also a 16-bit number. That happens for me, and this is the
result:

root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0000000000000010
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
[0000000000000010] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: nvme nvme_core
CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 #27
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
pc : __sg_alloc_table_from_pages+0xec/0x238
lr : __sg_alloc_table_from_pages+0xc8/0x238
sp : ffff800013ccbad0
x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
x27: 0000000000000000 x26: 0000000000002dc2
x25: 0000000000000dc0 x24: 0000000000000000
x23: 0000000000000000 x22: ffff800013ccbbe8
x21: 0000000000000010 x20: 0000000000000000
x19: 00000000fffff000 x18: ffffffffffffffff
x17: 00000000000000c0 x16: fffffe289eaf6380
x15: ffff800011b59948 x14: ffff002bc8fe98f8
x13: ff00000000000000 x12: ffff8000114ca000
x11: 0000000000000000 x10: ffffffffffffffff
x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0a27b5f9b680 x4 : 0000000000000000
x3 : ffff0a27b5f9b680 x2 : 0000000000000000
 x1 : 0000000000000001 x0 : 0000000000000000
 Call trace:
__sg_alloc_table_from_pages+0xec/0x238
sg_alloc_table_from_pages+0x18/0x28
iommu_dma_alloc+0x474/0x678
dma_alloc_attrs+0xd8/0xf0
nvme_alloc_queue+0x114/0x160 [nvme]
nvme_reset_work+0xb34/0x14b4 [nvme]
process_one_work+0x1e8/0x360
worker_thread+0x44/0x478
kthread+0x150/0x158
ret_from_fork+0x10/0x34
 Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
---[ end trace 89bb2b72d59bf925 ]---

Fix by making it a u32.

Also use u32 for nvme_queue.q_depth, as we assign this value from
nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
avoids the same crash as above.

Fixes: 61f3b896 ("nvme-pci: use unsigned for io queue depth")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7442ddce
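
The overflow is easy to reproduce outside the kernel. The two helpers below are hypothetical and only model the "MQES + 1" arithmetic in a 16-bit and a 32-bit type.

```c
#include <stdint.h>

/* Storing MQES + 1 in 16 bits: wraps to 0 when mqes == 0xffff. */
static uint16_t depth_as_u16(uint16_t mqes)
{
	return (uint16_t)(mqes + 1);
}

/* Storing MQES + 1 in 32 bits: 65536 fits. */
static uint32_t depth_as_u32(uint16_t mqes)
{
	return (uint32_t)mqes + 1;
}
```

With `mqes` equal to 0xffff the 16-bit variant yields a queue depth of 0, while the 32-bit variant yields 65536. Smaller values such as 1023 behave the same in both.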

nvme: Use spin_lock_irq() when taking the ctrl->lock · ecbcdf0c

Committed by Logan Gunthorpe on Aug 12, 2020

When locking the ctrl->lock spinlock, IRQs need to be disabled to avoid a
deadlock. The recently added spin_lock() calls produce the
following lockdep warning when running the blktest nvme/003:

    ================================
    WARNING: inconsistent lock state
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
    ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
    {SOFTIRQ-ON-W} state was registered at:
      lock_acquire+0x164/0x500
      _raw_spin_lock+0x28/0x40
      nvme_get_effects_log+0x37/0x1c0
      nvme_init_identify+0x9e4/0x14f0
      nvme_reset_work+0xadd/0x2360
      process_one_work+0x66b/0xb70
      worker_thread+0x6e/0x6c0
      kthread+0x1e7/0x210
      ret_from_fork+0x22/0x30
    irq event stamp: 1449221
    hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
    hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
    softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
    softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&ctrl->lock);
      <Interrupt>
        lock(&ctrl->lock);

     *** DEADLOCK ***

    no locks held by ksoftirqd/2/22.

    stack backtrace:
    CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c #1450
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
    Call Trace:
     dump_stack+0xc8/0x11a
     print_usage_bug.cold.63+0x235/0x23e
     mark_lock+0xa9c/0xcf0
     __lock_acquire+0xd9a/0x2b50
     lock_acquire+0x164/0x500
     _raw_spin_lock_irqsave+0x40/0x60
     nvme_keep_alive_end_io+0x50/0xc0
     blk_mq_end_request+0x158/0x210
     nvme_complete_rq+0x146/0x500
     nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
     blk_done_softirq+0x187/0x1e0
     __do_softirq+0x118/0x595
     run_ksoftirqd+0x35/0x50
     smpboot_thread_fn+0x1d3/0x310
     kthread+0x1e7/0x210
     ret_from_fork+0x22/0x30

Fixes: be93e87e ("nvme: support for multiple Command Sets Supported and Effects log pages")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecbcdf0c

nvmet: call blk_mq_free_request() directly · 7ee51cf6

Committed by Chaitanya Kulkarni on Aug 06, 2020

Instead of calling blk_put_request() which calls blk_mq_free_request(),
call blk_mq_free_request() directly for NVMeOF passthru. This is
mainly to avoid an extra function call in the completion path
nvmet_passthru_req_done().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7ee51cf6

nvmet: fix oops in pt cmd execution · a2138fd4

Committed by Chaitanya Kulkarni on Aug 06, 2020

In the existing NVMeOF Passthru core command handling, on failure of
nvme_alloc_request() it errors out with the rq value set to NULL. In the
error handling path it calls blk_put_request() without checking whether
rq is NULL, which produces the following Oops:

[ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1457.347838] #PF: supervisor read access in kernel mode
[ 1457.348464] #PF: error_code(0x0000) - not-present page
[ 1457.349085] PGD 0 P4D 0
[ 1457.349402] Oops: 0000 [#1] SMP NOPTI
[ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G           OE     5.8.0-rc4nvme-5.9+ #35
[ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
[ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.361322] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.362337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.363973] Call Trace:
[ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
[ 1457.364990]  process_one_work+0x24e/0x5a0
[ 1457.365493]  ? __schedule+0x353/0x840
[ 1457.365957]  worker_thread+0x3c/0x380
[ 1457.366426]  ? process_one_work+0x5a0/0x5a0
[ 1457.366948]  kthread+0x135/0x150
[ 1457.367362]  ? kthread_create_on_node+0x60/0x60
[ 1457.367934]  ret_from_fork+0x22/0x30
[ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
[ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
[ 1457.380849] CR2: 0000000000000000
[ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
[ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.389798] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.390796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.392525] Kernel panic - not syncing: Fatal exception
[ 1457.394138] Kernel Offset: disabled
[ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

We fix this Oops by adding a new goto label out_put_req and reordering
the blk_put_request call to avoid calling blk_put_request() with rq
set to NULL. Here we also update the rest of the code
accordingly.

Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a2138fd4

nvmet: add ns tear down label for pt-cmd handling · 4db69a3d

Committed by Chaitanya Kulkarni on Aug 06, 2020

In the current implementation, before submitting the passthru cmd we
may come across errors, e.g. when getting the ns from the passthru controller,
allocating a request from the passthru controller, etc. All of these failure
cases use a single goto label, fail_out.

In the target code, we follow the pattern of having a separate label for
each error-out case when setting up multiple things before the actual
action.

This patch follows the same pattern: it renames the generic fail_out label
to out_put_ns and updates the error-out cases in
nvmet_passthru_execute_cmd() where needed.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4db69a3d
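
The "one label per acquired resource" pattern can be sketched in user space as below. Only the label names out_put_ns and out_put_req come from the commit messages; the types, the function, and the failure switches are hypothetical, and the sketch is synchronous, so the success path releases everything as well.

```c
#include <errno.h>
#include <stdlib.h>

struct fake_ns  { int refs; };
struct fake_req { int tag; };

/* fail_alloc / fail_map are test knobs that force an error at each step. */
static int run_passthru(struct fake_ns *ns, int fail_alloc, int fail_map)
{
	struct fake_req *rq;
	int ret = 0;

	ns->refs++;                   /* step 1: take a namespace reference */

	rq = fail_alloc ? NULL : malloc(sizeof(*rq));
	if (!rq) {
		ret = -ENOMEM;
		goto out_put_ns;          /* no request exists yet: skip releasing it */
	}

	if (fail_map) {
		ret = -EINVAL;
		goto out_put_req;         /* undo step 2, then step 1 */
	}

	/* ... the command would be submitted and completed here ... */

out_put_req:
	free(rq);
out_put_ns:
	ns->refs--;
	return ret;
}
```

Each error site jumps to the label that releases exactly what has been acquired so far, so a failed allocation never reaches the code that releases the request.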

nvme: multipath: round-robin: eliminate "fallback" variable · e398863b

Committed by Martin Wilck on Aug 06, 2020

If we find an optimized path, we quit the loop immediately. Thus we can use
just one variable for the next path, slightly simplifying the code.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e398863b

nvme: multipath: round-robin: fix single non-optimized path case · 93eb0381

Committed by Martin Wilck on Aug 06, 2020

If there's only one usable, non-optimized path, nvme_round_robin_path()
returns NULL, which is wrong. Fix it by falling back to "old", like in
the single optimized path case. Also, if the active path isn't changed,
there's no need to re-assign the pointer.

Fixes: 3f6e3246 ("nvme-multipath: fix logic for non-optimized paths")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93eb0381

nvme-fc: Fix wrong return value in __nvme_fc_init_request() · f34448cd

Committed by Tianjia Zhang on Aug 02, 2020

On an error exit path, a negative error code should be returned
instead of a positive return value.

Fixes: e399441d ("nvme-fabrics: Add host support for FC transport")
Cc: James Smart <jsmart2021@gmail.com>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f34448cd

nvmet-passthru: Reject commands with non-sgl flags set · 0ceeab96

Committed by Logan Gunthorpe on Jul 29, 2020

Any command with a non-SGL flag set (like fuse flags) should be
rejected.

Fixes: c1fef73f ("nvmet: add passthru code to process commands")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0ceeab96

nvmet: fix a memory leak · 382fee1a

Committed by Sagi Grimberg on Jul 30, 2020

We forgot to free new_model_number

Fixes: 013b7ebe ("nvmet: make ctrl model configurable")
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

382fee1a

blkcg: fix memleak for iolatency · 27029b4b

Committed by Yufen Yu on Aug 10, 2020

Normally, blkcg_iolatency_exit() will free the related memory in iolatency
when cleaning up the queue. But if blk_throtl_init() returns an error and queue
init fails, blkcg_iolatency_exit() will not do that for us, which causes a
memory leak.

Fixes: d7067512 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

27029b4b

MAINTAINERS: Add missing header files to BLOCK LAYER section · 0c8b9c35

Committed by Geert Uytterhoeven on Aug 19, 2020

The various <linux/blk*.h> header files are part of the Block Layer.
Add them to the corresponding section in the MAINTAINERS file, so
scripts/get_maintainer.pl will pick them up.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0c8b9c35

block: fix get_max_io_size() · e4b469c6

Committed by Keith Busch on Aug 06, 2020

A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.

Fixes: 9cc5169c ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e4b469c6
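
The failing case can be modelled with a small piece of alignment arithmetic. This is a sketch under assumptions, not the block layer's function: `max_io_sectors` and its parameters are illustrative names, all quantities are in 512-byte sectors, and both block sizes are assumed to be powers of two.

```c
/*
 * limit: maximum number of sectors allowed for this I/O
 * start: starting sector of the I/O
 * pbs:   physical block size in sectors (power of two)
 * lbs:   logical block size in sectors (power of two)
 */
static unsigned int max_io_sectors(unsigned int limit, unsigned int start,
				   unsigned int pbs, unsigned int lbs)
{
	unsigned int start_offset = start & (pbs - 1);
	unsigned int end = (limit + start_offset) & ~(pbs - 1);

	/* Prefer ending the split on a physical block boundary. */
	if (end > start_offset)
		return end - start_offset;

	/* Otherwise fall back to logical block alignment. */
	return limit & ~(lbs - 1);
}
```

With `limit = 2`, `start = 5`, `pbs = 8` and `lbs = 1`, no physical boundary is reachable, so the fallback returns 2. A fallback that masked with `(lbs - 1)` instead of `~(lbs - 1)` would return 0 here, which matches the zero-length split described in the message.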

blk-mq: insert request not through ->queue_rq into sw/scheduler queue · db03f88f

Committed by Ming Lei on Aug 18, 2020

c616cbee ("blk-mq: punt failed direct issue to dispatch list") was supposed
to add requests that have been through ->queue_rq() to the hw queue dispatch
list; however, it also adds requests that ran out of budget or driver tags to
the hw queue. This basically bypasses request merging, causes too many requests
to be dispatched to the LLD, and increases system% unnecessarily.

Fix this issue by adding requests that have not been through ->queue_rq to the
sw/scheduler queue; this is safe because ->queue_rq has not been called on
these requests yet.

High %system can be observed on Azure storvsc devices, and even soft lockups
are observed. This patch reduces %system during heavy sequential IO and
also decreases the soft lockup risk.

Fixes: c616cbee ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db03f88f

block/rnbd: Ensure err is always initialized in process_rdma · 17bc1030

Committed by Nathan Chancellor on Aug 17, 2020

Clang warns:

drivers/block/rnbd/rnbd-srv.c:150:6: warning: variable 'err' is used
uninitialized whenever 'if' condition is true
[-Wsometimes-uninitialized]
        if (IS_ERR(bio)) {
            ^~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:177:9: note: uninitialized use occurs here
        return err;
               ^~~
drivers/block/rnbd/rnbd-srv.c:150:2: note: remove the 'if' if its
condition is always false
        if (IS_ERR(bio)) {
        ^~~~~~~~~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:126:9: note: initialize the variable 'err'
to silence this warning
        int err;
               ^
                = 0
1 warning generated.

err is indeed uninitialized when this statement is taken. Ensure that it
is assigned the error value of bio before jumping to the error handling
label.

Fixes: 735d77d4 ("rnbd: remove rnbd_dev_submit_io")
Reported-by: Brooke Basile <brookebasile@gmail.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1134
Signed-off-by: Jens Axboe <axboe@kernel.dk>

17bc1030

18 Aug, 2020 2 commits

bfq: fix blkio cgroup leakage v4 · 2de791ab

Committed by Dmitry Monakhov on Aug 11, 2020

Changes from v1:
    - update commit description with proper ref-accounting justification

commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
introduced a leak for bfq_group and blkcg_gq objects because of a get/put
imbalance.
In fact the whole idea of the original commit is wrong, because a bfq_group
entity cannot disappear under us: it is referenced by the child bfq_queue's
entities from here:
 -> bfq_init_entity()
    ->bfqg_and_blkg_get(bfqg);
    ->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
    FINAL_PUT
    ->bfqg_and_blkg_put(bfqq_group(bfqq))
    ->kmem_cache_free(bfq_pool, bfqq);

So the parent entity cannot disappear while a child entity is in the tree,
and child entities already have proper protection.
This patch reverts commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")

bfq_group leak trace caused by bad commit:
-> blkg_alloc
   -> bfq_pq_alloc
     -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
    -> __bfq_activate_entity
       ->bfq_get_entity
         ->bfqg_and_blkg_get (+1)  <==== : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
    ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
      -> bfq_forget_entity(is_in_service = true)
	 entity->on_st_or_in_serv = false   <=== :Note2
	 if (is_in_service)
	     return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
    -> __bfq_deactivate_entity
         if (!entity->on_st_or_in_serv) /* true, because (Note2)
		return false;
 -> bfq_pd_free
    -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq will leak forever, see test-case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2de791ab

block: Fix page_is_mergeable() for compound pages · d8166519

Committed by Matthew Wilcox (Oracle) on Aug 17, 2020

If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used. Use a slightly more
obvious test that the two pages are compatible with each other.

Fixes: 52d52d1c ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d8166519

17 Aug, 2020 2 commits

block: virtio_blk: fix handling single range discard request · af822aa6

Committed by Ming Lei on Aug 17, 2020

1f23816b ("virtio_blk: add discard and write zeroes support") starts
to support multi-range discard for virtio-blk. However, the virtio-blk
disk may report max discard segment as 1, at least that is exactly what
qemu is doing.

So far, block layer switches to normal request merge if max discard segment
limit is 1, and multiple bios can be merged to single segment. This way may
cause memory corruption in virtblk_setup_discard_write_zeroes().

Fix the issue by handling the single max discard segment case in a
straightforward way.

Fixes: 1f23816b ("virtio_blk: add discard and write zeroes support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Changpeng Liu <changpeng.liu@intel.com>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

af822aa6

block: respect queue limit of max discard segment · 943b40c8

Committed by Ming Lei on Aug 17, 2020

When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
return false for discard request, then normal request merge is applied.
However, only queue_max_segments() is checked, so max discard segment
limit isn't respected.

Check max discard segment limit in the request merge code for fixing
the issue.

Discard request failure of virtio_blk is fixed.

Fixes: 69840466 ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

943b40c8

openeuler / Kernel synced successfully almost 2 years ago