1. 02 Feb, 2021 (2 commits)
  2. 19 Jan, 2021 (1 commit)
    • nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout · 9ebbfe49
      Committed by Chao Leng
      Each namespace has its own request queue. If requests take a long time to
      complete, multiple request queues may have timed-out requests at the same
      time, so nvme_tcp_timeout can run concurrently. Requests from different
      request queues may be queued on the same TCP queue, so several
      nvme_tcp_timeout instances may call nvme_tcp_stop_queue at the same time.
      The first nvme_tcp_stop_queue clears NVME_TCP_Q_LIVE and proceeds to stop
      the TCP queue (cancelling io_work), but the others see NVME_TCP_Q_LIVE
      already cleared and directly complete the requests. Completing a request
      before io_work is fully cancelled can lead to a use-after-free.
      Add a mutex to serialize nvme_tcp_stop_queue.
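      A minimal sketch of that serialization, assuming a queue_lock mutex added
      to struct nvme_tcp_queue (names follow the upstream driver, details
      abbreviated):

          static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
          {
              struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
              struct nvme_tcp_queue *queue = &ctrl->queues[qid];

              /* only the first caller stops the queue; later callers wait
               * here until io_work cancellation has fully finished
               */
              mutex_lock(&queue->queue_lock);
              if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
                  __nvme_tcp_stop_queue(queue);
              mutex_unlock(&queue->queue_lock);
          }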
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 15 Jan, 2021 (2 commits)
  4. 06 Jan, 2021 (1 commit)
    • nvme-tcp: Fix possible race of io_work and direct send · 5c11f7d9
      Committed by Sagi Grimberg
      We may send a request (with or without its data) from two paths:
      
        1. From our I/O context nvme_tcp_io_work which is triggered from:
          - queue_rq
          - r2t reception
          - socket data_ready and write_space callbacks
        2. Directly from queue_rq if the send_list is empty (because we want to
           save the context switch associated with scheduling our io_work).
      
      However, given that we now have the send_mutex, we may run into a race
      condition where none of these contexts sends the pending payload to the
      controller. Both the io_work send path and the queue_rq send path
      opportunistically try to acquire the send_mutex; however, queue_rq only
      attempts to send a single request, and if the io_work context fails to
      acquire the send_mutex it completes without rescheduling itself.
      
      The race can trigger with the following sequence:
      
        1. queue_rq sends the request (no in-capsule data) and blocks
        2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
           to the send_list and schedules io_work
        3. io_work triggers and cannot acquire the send_mutex - because of (1),
           it ends without rescheduling itself
        4. queue_rq finishes the send and returns
      
      ==> no context will send the h2cdata - timeout.
      
      Fix this by having queue_rq send as much as it can from the send_list,
      so that if anything is left over, it is because the socket buffer is full
      and the socket write_space callback will trigger, guaranteeing that some
      context will be scheduled to send the h2cdata PDU.
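      A simplified sketch of the fix, modeled on the driver's direct-send path
      (helper names such as nvme_tcp_send_all follow the upstream code, but
      details like the more_requests accounting are omitted):

          /* drain the send_list as far as the socket allows */
          static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
          {
              int ret;

              do {
                  ret = nvme_tcp_try_send(queue);
              } while (ret > 0);
          }

          static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                  bool sync, bool last)
          {
              struct nvme_tcp_queue *queue = req->queue;
              bool empty;

              empty = llist_add(&req->lentry, &queue->req_list) &&
                  list_empty(&queue->send_list) && !queue->request;

              /*
               * Take the direct-send path only if we are first in line and can
               * grab the send_mutex; anything we cannot push out now is picked
               * up later by io_work or the write_space callback.
               */
              if (queue->io_cpu == smp_processor_id() &&
                  sync && empty && mutex_trylock(&queue->send_mutex)) {
                  nvme_tcp_send_all(queue);
                  mutex_unlock(&queue->send_mutex);
              } else if (last) {
                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
              }
          }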
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Reported-by: Samuel Jones <sjones@kalrayinc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  5. 02 Dec, 2020 (1 commit)
  6. 03 Nov, 2020 (2 commits)
  7. 03 Oct, 2020 (1 commit)
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Committed by Coly Li
      Currently nvme_tcp_try_send_data() avoids kernel_sendpage() for slab
      pages. But pages allocated by __get_free_pages() without __GFP_COMP,
      which can also have a refcount of 0, are still sent to the remote end
      with kernel_sendpage(), and this is problematic.

      The newly introduced helper sendpage_ok() checks both the PageSlab flag
      and the page_count counter, and returns true only if the page may be
      sent by kernel_sendpage().

      This patch fixes the page check in nvme_tcp_try_send_data() using
      sendpage_ok(): if sendpage_ok() returns true, the page is sent with
      kernel_sendpage(); otherwise sock_no_sendpage() handles it.
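      For reference, sendpage_ok() reduces to a slab-flag plus refcount check,
      and the send path then picks the transport accordingly (a sketch; the
      surrounding nvme_tcp_try_send_data() logic is omitted):

          /* include/linux/net.h */
          static inline bool sendpage_ok(struct page *page)
          {
              return !PageSlab(page) && page_count(page) >= 1;
          }

          /* simplified excerpt of the send decision in nvme_tcp_try_send_data() */
          if (sendpage_ok(page))
              ret = kernel_sendpage(queue->sock, page, offset, len, flags);
          else
              ret = sock_no_sendpage(queue->sock, page, offset, len, flags);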
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 09 Sep, 2020 (1 commit)
  9. 29 Aug, 2020 (3 commits)
    • nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Committed by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset, we will
      hang waiting for the freeze to complete, but the freeze can never
      complete: inflight commands are still holding the q_usage_counter, and
      we cannot blindly fail requests that time out.

      So wait for the queue freeze with a timeout, and if it does not finish
      in time, fail the reset and let the error handling decide how to proceed
      (either schedule a reconnect or remove the controller).
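      A sketch of the idea in the I/O queue (re)configuration path;
      nvme_wait_freeze_timeout() is the nvme core helper that returns a
      non-positive value when the freeze did not finish in time (the
      error-label name below is illustrative):

          if (!new) {
              nvme_start_queues(ctrl);
              if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
                  /*
                   * If we timed out waiting for freeze we are likely to be
                   * stuck, so fail the reset and let the error handler
                   * decide whether to reconnect or remove the controller.
                   */
                  ret = -ENODEV;
                  goto out_wait_freeze_timed_out;
              }
              blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
              nvme_unfreeze(ctrl);
          }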
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix timeout handler · 236187c4
      Committed by Sagi Grimberg
      When a request times out in the LIVE state, we simply trigger error
      recovery and let it handle the request cancellation. However, when a
      request times out in a non-LIVE state, we make sure to complete it
      immediately, as it might otherwise block controller setup or teardown
      and prevent forward progress.

      Tearing down the entire set of I/O and admin queues for that, though,
      causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
      overkill for what we actually need, which is to fence any controller
      teardown that may be running, stop the queue, and cancel the request if
      it has not already completed.

      Now that we have the controller teardown_lock, we can safely serialize
      the request cancellation. This addresses a hang caused by an extra queue
      freeze on the controller namespaces, which prevented unfreeze from
      completing correctly.
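      A simplified sketch of the resulting timeout handling; names follow the
      upstream driver and it assumes the per-controller teardown_lock
      introduced by the "serialize controller teardown sequences" patch below:

          static void nvme_tcp_complete_timed_out(struct request *rq)
          {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              /* fence a teardown that may be running concurrently */
              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
              if (!blk_mq_request_completed(rq)) {
                  nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                  blk_mq_complete_request(rq);
              }
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
          }

          static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq,
                  bool reserved)
          {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              if (ctrl->state != NVME_CTRL_LIVE) {
                  /* may block setup/teardown, so complete it right here */
                  nvme_tcp_complete_timed_out(rq);
                  return BLK_EH_DONE;
              }

              /* LIVE: let normal error recovery cancel the request */
              nvme_tcp_error_recovery(ctrl);
              return BLK_EH_RESET_TIMER;
          }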
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: serialize controller teardown sequences · d4d61470
      Committed by Sagi Grimberg
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      This opens a window for a double completion: a hard IRQ or a different
      competing context may have triggered error recovery and may be running
      inflight request cancellation concurrently with the timeout handler.

      Protect against this with a per-controller teardown_lock that serializes
      the contexts which may complete a cancelled request due to error
      recovery or a reset.
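      A minimal sketch of that serialization, assuming a teardown_lock mutex
      in the controller structure that is taken around the teardown paths as
      well as in the timeout handler (abbreviated):

          struct nvme_tcp_ctrl {
              /* ... existing fields ... */
              struct mutex    teardown_lock;  /* timeout vs. error recovery/reset */
          };

          static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, bool remove)
          {
              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              if (ctrl->queue_count <= 1)
                  goto out;
              nvme_stop_queues(ctrl);
              nvme_tcp_stop_io_queues(ctrl);
              blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
              if (remove)
                  nvme_start_queues(ctrl);
              nvme_tcp_destroy_io_queues(ctrl, remove);
          out:
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
          }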
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  10. 24 Aug, 2020 (1 commit)
  11. 22 Aug, 2020 (1 commit)
  12. 29 Jul, 2020 (2 commits)
    • nvme-tcp: fix controller reset hang during traffic · 2875b0ae
      Committed by Sagi Grimberg
      commit fe35ec58 ("block: update hctx map when use multiple maps") exposed
      an issue where we may hang trying to wait for queue freeze during I/O.
      We call blk_mq_update_nr_hw_queues, which with multiple queue maps
      (which we now have for default/read/poll) attempts to freeze the queue.
      However, we never started a queue freeze when starting the reset, which
      means there are inflight requests that entered the queue and that we
      will not complete once the queue is quiesced.

      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we have successfully connected the I/O queues (and make sure to
      call blk_mq_update_nr_hw_queues only after we are sure that the queue
      was already frozen).

      This follows how the PCI driver handles resets.
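      A sketch of the resulting ordering, abbreviated from the teardown and
      reconnect paths:

          /* teardown: freeze first so requests that already entered the
           * queue are accounted for by the freeze counter
           */
          nvme_start_freeze(ctrl);
          nvme_stop_queues(ctrl);
          /* ... stop the TCP queues and cancel inflight requests ... */

          /* reconnect of an existing controller: only resize the hw queues
           * once the queues are known to be frozen
           */
          if (!new) {
              nvme_wait_freeze(ctrl);
              blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
              nvme_unfreeze(ctrl);
          }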
      
      Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Committed by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it, path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work completes, an nvme0 disconnect is initiated;
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
      
      3) scan_work(1) attempts to submit I/O,
          but nvme_path_is_optimized() observes that nvme0 is not LIVE.
          Since nvme1 is a possible path, the I/O is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O.
      When we then trigger a disconnect, that I/O blocks because our
      accessible (optimized) path is disconnecting while the alternate path is
      inaccessible. The disconnect then tries to flush the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state, NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O is no longer
      allowed to access the namespace. NVME_CTRL_DELETING still allows mpath
      I/O to be issued to the bottom device, and only after we flush the
      ana_work and scan_work (after nvme_stop_ctrl and
      nvme_prep_remove_namespaces) do we change the state to
      NVME_CTRL_DELETING_NOIO. We also prevent ana_work from re-firing by
      aborting early if we are not LIVE, so we should be safe here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
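      A sketch of the new state and where the transition happens (the enum
      values match the upstream patch; the surrounding deletion code is
      abbreviated):

          enum nvme_ctrl_state {
              NVME_CTRL_NEW,
              NVME_CTRL_LIVE,
              NVME_CTRL_RESETTING,
              NVME_CTRL_CONNECTING,
              NVME_CTRL_DELETING,       /* mpath may still issue I/O to this ctrl */
              NVME_CTRL_DELETING_NOIO,  /* scan/ana work flushed, no more I/O allowed */
              NVME_CTRL_DEAD,
          };

          /* in nvme_remove_namespaces(), once scan_work has been flushed */
          nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);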
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  13. 26 Jul, 2020 (1 commit)
  14. 08 Jul, 2020 (3 commits)
  15. 25 Jun, 2020 (1 commit)
  16. 24 Jun, 2020 (1 commit)
  17. 11 Jun, 2020 (1 commit)
  18. 29 May, 2020 (5 commits)
  19. 27 May, 2020 (1 commit)
  20. 10 May, 2020 (3 commits)
  21. 01 Apr, 2020 (1 commit)
    • nvme-tcp: fix possible crash in recv error flow · 39d06079
      Committed by Sagi Grimberg
      If the target misbehaves and sends us an unexpected payload, we need to
      fail the controller and stop processing the input stream. We clear the
      rd_enabled flag and stop io_work, but we may still requeue it if we have
      pending sends, and then in the next invocation we will process the input
      stream again, as the check is only in the .data_ready upcall.

      To fix this, make sure not to self-requeue io_work upon a recv flow
      error.
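      A simplified sketch of the io_work loop with the fix applied; the key
      point is the early return (no self-requeue) when nvme_tcp_try_recv()
      reports an error (the send side is abbreviated):

          static void nvme_tcp_io_work(struct work_struct *w)
          {
              struct nvme_tcp_queue *queue =
                  container_of(w, struct nvme_tcp_queue, io_work);
              unsigned long deadline = jiffies + msecs_to_jiffies(1);

              do {
                  bool pending = false;
                  int result;

                  result = nvme_tcp_try_send(queue);
                  if (result > 0)
                      pending = true;

                  result = nvme_tcp_try_recv(queue);
                  if (result > 0)
                      pending = true;
                  else if (unlikely(result < 0))
                      return;  /* recv error: controller is being failed, don't requeue */

                  if (!pending)
                      return;
              } while (!time_after(jiffies, deadline));

              /* still making progress: requeue ourselves */
              queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
          }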
      
      This fixes the crash:
       nvme nvme2: receive failed:  -22
       BUG: unable to handle page fault for address: ffffbeb5816c3b48
       nvme_ns_head_make_request: 29 callbacks suppressed
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
       Oops: 0000 [#1] SMP PTI
       CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
       Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
       Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
       RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
       RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
       RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
       RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
       RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
       R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
       R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
       FS:  0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        tcp_read_sock+0x8c/0x290
        ? __release_sock+0x9d/0xe0
        ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
        nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
        ? finish_task_switch+0x163/0x270
        process_one_work+0x1fd/0x3f0
        worker_thread+0x34/0x410
        kthread+0x121/0x140
        ? process_one_work+0x3f0/0x3f0
        ? kthread_park+0xb0/0xb0
        ret_from_fork+0x35/0x40
      Reported-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  22. 31 Mar, 2020 (2 commits)
  23. 26 Mar, 2020 (3 commits)