提交 · 20d44e86321299f61bb782a39aaa30f579823f58 · openeuler / Kernel

13 12月, 2018 4 次提交

datagram: introduce skb_copy_and_hash_datagram_iter helper · 65d69e25

由 Sagi Grimberg 提交于 12月 03, 2018

Introduce a helper to copy datagram into an iovec iterator
but also update a predefined hash. This is useful for
consumers of skb_copy_datagram_iter to also support inflight
data digest without having to finish to copy and only then
traverse the iovec and calculate the digest hash.
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NSagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

65d69e25

iov_iter: introduce hash_and_copy_to_iter helper · d05f4435

由 Sagi Grimberg 提交于 12月 03, 2018

Allow consumers that want to use iov iterator helpers and also update
a predefined hash calculation online when copying data. This is useful
when copying incoming network buffers to a local iterator and calculate
a digest on the incoming stream. nvme-tcp host driver that will be
introduced in following patches is the first consumer via
skb_copy_and_hash_datagram_iter.
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NSagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d05f4435

iov_iter: pass void csum pointer to csum_and_copy_to_iter · cb002d07

由 Sagi Grimberg 提交于 12月 03, 2018

The single caller to csum_and_copy_to_iter is skb_copy_and_csum_datagram
and we are trying to unite its logic with skb_copy_datagram_iter by passing
a callback to the copy function that we want to apply. Thus, we need
to make the checksum pointer private to the function.
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NSagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

cb002d07

blkcg: handle dying request_queue when associating a blkg · 0273ac34

由 Dennis Zhou 提交于 12月 11, 2018

Between v3 [1] and v4 [2] of the blkg association series, the
association point moved from generic_make_request_checks(), which is
called after the request enters the queue, to bio_set_dev(), which is when
the bio is formed before submit_bio(). When the request_queue goes away,
the blkgs supporting the request_queue are destroyed and then the
q->root_blkg is set to %NULL.

This patch adds a %NULL check to blkg_tryget_closest() to prevent the
NPE caused by the above. It also adds a guard to see if the
request_queue is dying when creating a blkg to prevent creating a blkg
for a dead request_queue.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/

Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
Reported-and-tested-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0273ac34

12 12月, 2018 2 次提交

lightnvm: disable interleaved metadata · a16816b9

由 Igor Konopko 提交于 12月 11, 2018

Currently pblk only check the size of I/O metadata and does not take
into account if this metadata is in a separate buffer or interleaved
in a single metadata buffer.

In reality only the first scenario is supported, where second mode will
break pblk functionality during any IO operation.

This patch prevents pblk to be instantiated in case device only
supports interleaved metadata.
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NIgor Konopko <igor.j.konopko@intel.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a16816b9

lightnvm: dynamic DMA pool entry size · 24828d05

由 Igor Konopko 提交于 12月 11, 2018

Currently lightnvm and pblk uses single DMA pool, for which the entry
size always is equal to PAGE_SIZE. The contents of each entry allocated
from the DMA pool consists of a PPA list (8bytes * 64), leaving
56bytes * 64 space for metadata. Since the metadata field can be bigger,
such as 128 bytes, the static size does not cover this use-case.

This patch adds support for I/O metadata above 56 bytes by changing DMA
pool size based on device meta size and allows pblk to use OOB metadata
>=16B.
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NIgor Konopko <igor.j.konopko@intel.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

24828d05

10 12月, 2018 4 次提交

block: return just one value from part_in_flight · e016b782

由 Mikulas Patocka 提交于 12月 06, 2018

The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e016b782

block: switch to per-cpu in-flight counters · 1226b8dd

由 Mikulas Patocka 提交于 12月 06, 2018

Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1226b8dd

block: delete part_round_stats and switch to less precise counting · 5b18b5a7

由 Mikulas Patocka 提交于 12月 06, 2018

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5b18b5a7

block: stop passing 'cpu' to all percpu stats methods · 112f158f

由 Mike Snitzer 提交于 12月 06, 2018

All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them.  Just call
smp_processor_id() as needed.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

112f158f

09 12月, 2018 1 次提交

Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" · 356ff8a9

由 David Rientjes 提交于 12月 07, 2018

This reverts commit 89c83fb5.

This should have been done as part of 2f0799a0 ("mm, thp: restore
node-local hugepage allocations").  The movement of the thp allocation
policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
intended to only set __GFP_THISNODE for mempolicies that are not
MPOL_BIND whereas the revert could set this regardless of mempolicy.

While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
and alloc_pages_vma() was racy, that has since been removed since the
revert.  What is left is the possibility to use __GFP_THISNODE in
policy_node() when it is unexpected because the special handling for
hugepages in alloc_pages_vma()  was removed as part of the consolidation.

Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat
different policy for hugepage allocations, which were allocated through
alloc_hugepage_vma().  For hugepage allocations, if the allocating
process's node is in the set of allowed nodes, allocate with
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
allow fallback to other nodes in 89c83fb5 as it did for new_page() in
mm/mempolicy.c which is functionally different behavior and removes the
requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb5 instead of the partial
revert that was done in 2f0799a0.  The result is the same thp
allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

356ff8a9

08 12月, 2018 24 次提交

nvme: implement Enhanced Command Retry · 49cd84b6

由 Keith Busch 提交于 11月 27, 2018

A controller may have an internal state that is not able to successfully
process commands for a short duration. In such states, an immediate
command requeue is expected to fail. The driver may exceed its max
retry count, which permanently ends the command in failure when the same
command would succeed after waiting for the controller to be ready.

NVMe ratified TP 4033 provides a delay hint in the completion status
code for failed commands. Implement the retry delay based on the command
completion status and the controller's requested delay.

Note that requeued commands are handled per request_queue, not per
individual request. If multiple commands fail, the controller should
consistently report the desired delay time for retryable commands in
all CQEs, otherwise the requeue list may be kicked too soon.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

49cd84b6

nvmet: expose support for fabrics SQ flow control disable in treq · 9b95d2fb

由 Sagi Grimberg 提交于 11月 20, 2018

Technical Proposal introduces an indication for SQ flow control
disable support. Expose it since we are able to operate in this mode.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9b95d2fb

nvmet: don't override treq upon modification. · 0445e1b5

由 Sagi Grimberg 提交于 11月 19, 2018

Only override the allowed parts of it.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
[hch: slight tweak to the NVME_TREQ_SECURE_CHANNEL_MASK definition]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0445e1b5

nvmet: support fabrics sq flow control · e6a622fd

由 Sagi Grimberg 提交于 11月 19, 2018

Technical proposal 8005 "fabrics SQ flow control" introduces a mode
where a host and controller agree to omit sq_head pointer updates
when sending nvme completions.

In case the host indicated desire to operate in this mode (connect attribute)
the controller will return back a connect completion with sq_head value
of 0xffff as indication that it will omit sq_head pointer updates.

This mode saves us an atomic update in the I/O path.
Reviewed-by: NHannes Reinecke <hare@suse.com>
[hch: suggested better implementation]
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e6a622fd

nvmet-fc: remove the IN_ISR deferred scheduling options · 6e2e312e

由 James Smart 提交于 11月 14, 2018

All target lldd's call the cmd receive and op completions in non-isr
thread contexts. As such the IN_ISR options are not necessary.
Remove the functionality and flags, which also removes cpu assignments
to queues.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6e2e312e

nvmet: add defines for discovery change async events · f301c2b1

由 Jay Sternberg 提交于 11月 12, 2018

Add AEN/AER values as defined by the specification
Signed-off-by: NJay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f301c2b1

nvmet: change aen mask functions to use bit numbers · 7114ddeb

由 Jay Sternberg 提交于 11月 12, 2018

Functions nvmet_aen_disabled and nvmet_clear_aen were using
values not bit numbers ie 1 << 9 not 9 for bit function clear_bit
and test_and_set_bit.
Signed-off-by: NJay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: NPhil Cayton <phil.cayton@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7114ddeb

nvme: support traffic based keep-alive · 6e3ca03e

由 Sagi Grimberg 提交于 11月 02, 2018

If the controller supports traffic based keep alive, we restart the keep
alive timer if any admin or io commands was completed during the kato
period.  This prevents a possible starvation of keep alive commands in
the presence of heavy traffic as in such case, we already have a health
indication from the host perspective.

Only set a comp_seen indicator in case the controller supports keep
alive to minimize the overhead for pci controllers.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6e3ca03e

nvme: introduce ctrl attributes enumeration · 12b21171

由 Sagi Grimberg 提交于 11月 02, 2018

We are growing more controller attributes, so use a proper enumeration
for it.  For now just add the 128-bit hostid which we support.
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

12b21171

blkcg: put back rcu lock in blkcg_bio_issue_check() · 4705de73

由 Dennis Zhou 提交于 12月 06, 2018

I was a little overzealous in removing the rcu_read_lock() call from
blkcg_bio_issue_check() and it broke blk-throttle. Put it back.

Fixes: e35403a034bf ("blkcg: associate blkg when associating a device")
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4705de73

blkcg: rename blkg_try_get() to blkg_tryget() · 7754f669