- 28 June 2017, 13 commits
-
-
By Sagi Grimberg
Also, maintain a consumed counter to rely on for the doorbell and cqe_seen updates, instead of relying directly on the cq head and phase. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
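A hedged sketch of the consumed-counter idea: count completions while walking the CQ and ring the doorbell once from that count, rather than deriving everything from cq_head and phase directly. The helper names (nvme_cqe_pending, nvme_handle_cqe, nvme_update_cq_head, nvme_ring_cq_doorbell) follow the driver's conventions, but this is an illustration of the pattern, not the exact patch:

static int nvme_process_cq(struct nvme_queue *nvmeq)
{
	int consumed = 0;

	while (nvme_cqe_pending(nvmeq)) {	/* phase bit marks a valid CQE */
		nvme_handle_cqe(nvmeq, nvmeq->cq_head);
		nvme_update_cq_head(nvmeq);	/* advance head, flip phase on wrap */
		consumed++;
	}

	if (consumed)				/* one doorbell write per batch */
		nvme_ring_cq_doorbell(nvmeq);

	return consumed;			/* drives the cqe_seen update */
}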
-
By Sagi Grimberg
Makes the code slightly more readable. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Sagi Grimberg
A nice abstraction of the actual mechanics of how to do it. Note the change: it is now called after nvmeq->cq_head is assigned, to avoid having to pass it in. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
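A hedged sketch of what such a helper might look like; reading nvmeq->cq_head inside the helper is what makes passing the head unnecessary (illustrative, not the verbatim patch):

static inline void nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
{
	/* head is read here, after the caller has already updated it */
	if (likely(nvmeq->cq_vector >= 0))
		writel(nvmeq->cq_head, nvmeq->q_db + nvmeq->dev->db_stride);
}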
-
By Julia Lawall
Drop static on a local variable when the variable is initialized before any use, on every possible execution path through the function. The static has no benefit, and dropping it reduces the code size. The semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr/):

// <smpl>
@bad exists@
position p;
identifier x;
type T;
@@
static T x@p;
...
x = <+...x...+>

@@
identifier x;
expression e;
type T;
position p != bad.p;
@@
-static T x@p;
 ... when != x
     when strict
?x = e;
// </smpl>

The change in code size is indicated by the following output from the size command.

before:
   text    data     bss     dec     hex filename
  67299    2291    1056   70646   113f6 drivers/block/drbd/drbd_nl.o

after:
   text    data     bss     dec     hex filename
  67283    2291    1056   70630   113e6 drivers/block/drbd/drbd_nl.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
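A minimal illustration (not taken from the drbd patch) of the pattern the semantic patch targets: a function-local static that is unconditionally assigned before use gains nothing from being static and only costs .bss space:

#include <string.h>

/* before: the static persists across calls for no reason */
size_t name_len_before(const char *name)
{
	static size_t len;
	len = strlen(name);
	return len;
}

/* after: a plain automatic variable, smaller code and no .bss slot */
size_t name_len_after(const char *name)
{
	size_t len = strlen(name);
	return len;
}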
-
By Christoph Hellwig
BLK_BOUNCE_ANY is the default now, so the call is superfluous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
Now that all queue allocators come without a bounce limit by default, dm doesn't have to override this anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
Instead, move it to the callers. Callers that don't use bio_data() or page_address(), or that are specific to architectures without highmem support, are skipped. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
And just move it into scsi_transport_sas, which needs it because low-level drivers directly dereference bio_data, and into blk_init_queue_node, which will need a further push into the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
For historical reasons we default to bouncing highmem pages for all block queues. But the blk-mq drivers are easy to audit to ensure that we don't need this - scsi and mtip32xx set explicit limits, and everyone else doesn't have any particular ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
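A hedged sketch of what an explicit opt-in looks like; blk_queue_bounce_limit() and BLK_BOUNCE_HIGH are the real block-layer API of this era, while mydrv_setup_queue() is a hypothetical driver function:

static void mydrv_setup_queue(struct request_queue *q)
{
	/* this driver cannot address highmem, so keep bouncing explicitly */
	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
}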
-
By Christoph Hellwig
We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request-based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
This moves the knowledge about bouncing out of the callers and into the block core (just like we do for the normal I/O path), and allows us to unexport blk_queue_bounce. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
pktcdvd is a make_request-based stacking driver and thus doesn't have any addressing limits of its own. It also doesn't use bio_data() or page_address(), so it doesn't need a lowmem bounce either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams directive. Support for Directives is a new feature in NVMe 1.3. It allows a user to pass in information about where to store the data, so that the device can place it most efficiently. If an application is managing and writing data with different lifetimes, mixing data with different retention characteristics onto the same flash locations can cause write amplification to grow. This, in turn, reduces the performance and lifetime of the device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
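A hedged userspace sketch of how an application can communicate data lifetime: the per-file write-hint fcntl from the same development cycle, which a Streams-capable stack can map onto directives. The fallback defines are assumptions matching the mainline values; a kernel without support simply fails the fcntl:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(1024 + 12)	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;	/* short-lived data */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");	/* kernel may lack support */
	close(fd);
	return 0;
}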
-
- 27 June 2017, 21 commits
-
-
By Rakesh Pandit
While creating a new device with NVM_DEV_CREATE, if LUNs are already allocated the ioctl would return -ENOMEM, which is wrong. This patch propagates -EBUSY from nvm_reserve_luns, which is the correct response. Fixes: ade69e24 ("lightnvm: merge gennvm with core") Reviewed-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Because user writes are decoupled from media writes by an intermediate write buffer, irrecoverable media write errors lead to pblk stalling: user writes fill up the buffer and end up in an infinite retry loop. In order to let user writes fail gracefully, pblk needs to keep track of its own internal state and prevent further writes from being placed into the write buffer. This patch implements a state machine to keep track of internal errors and, in case of failure, fail further user writes in a standard way. Depending on the type of error, pblk will do its best to persist buffered writes (which are already acknowledged) and close down in a graceful manner. This way, data might be recovered by re-instantiating pblk. Such a state machine also paves the way for a state-based FTL log. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
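A hedged sketch of the idea: a driver-internal state that the write path consults before admitting user data into the write buffer. The state names and pblk_write_entry() are illustrative, not the exact symbols from the patch:

enum pblk_state {
	PBLK_STATE_RUNNING,	/* normal operation */
	PBLK_STATE_RECOVERING,	/* persisting already-acknowledged writes */
	PBLK_STATE_STOPPING,	/* draining; no new user writes admitted */
	PBLK_STATE_STOPPED,	/* closed down; re-instantiation may recover */
};

static int pblk_write_entry(struct pblk *pblk, struct bio *bio)
{
	if (READ_ONCE(pblk->state) != PBLK_STATE_RUNNING)
		return -EIO;	/* fail gracefully instead of retrying forever */
	/* ... copy bio data into the write buffer ... */
	return 0;
}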
-
By Javier González
Add constants to define the sizes of the internal mempools and workqueues. In the process, adjust the values to be more meaningful given the internal constraints of the FTL. To do this for workqueues, separate the current auxiliary workqueue into two dedicated workqueues, one managing lines being closed and one managing bad blocks. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
At the moment, in order to get enough read parallelism, we recycle several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus failing to maintain a sustainable free/recycled line ratio. The new design relies on a two-level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows us to (i) move data from only one line at a time, thus maintaining a sane free/recycled ratio, and (ii) keep the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Add lockdep assertions on helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Clean up unnecessary headers and code lines. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Set a DMA area for all I/Os in order to read/write the metadata stored in the per-sector out-of-band area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
At the moment, we separate closed lines into three different lists based on their number of valid sectors. GC recycles lines from each list based on capacity. Lines from each list are taken in a FIFO fashion. Since the number of lines is limited (it corresponds to the number of blocks in a LUN, which is somewhere between 1000 and 2000), we can afford to scan the lists to choose the optimal line to be recycled. This helps especially with lines that have a high number of valid sectors. If the number of blocks per LUN increases, we will consider a more efficient policy. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
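A hedged sketch of cost-based victim selection: scan a bounded list and pick the line with the fewest valid sectors instead of taking lines in FIFO order. struct pblk_line and its fields are simplified for illustration:

static struct pblk_line *pblk_gc_pick_line(struct list_head *group)
{
	struct pblk_line *line, *victim = NULL;

	list_for_each_entry(line, group, list)
		if (!victim || line->vsc < victim->vsc)	/* vsc: valid sector count */
			victim = line;

	return victim;	/* cheapest line to recycle, or NULL if list empty */
}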
-
By Javier González
Decouple bad block discovery from the line allocation logic. This makes it possible to return meaningful error codes when bad block discovery fails. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
The smeta size will always be suitable for a kmalloc allocation. Simplify the code and leave the vmalloc fallback only for emeta, where the pblk configuration has an impact. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
If a read request is sequential and its size aligns with a multi-plane page size, use the multi-plane hint to process the I/O in parallel in the controller. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
After refactoring the metadata path, the backpointer controlling synced I/Os in a line becomes unnecessary; metadata is scheduled on the write thread, so we know when the end of the line is reached and can act on it directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Remove a legacy variable that helped verify the consistency of the run-time metadata for the free line list. With the new metadata layout, this check is no longer necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
At the moment, line metadata is persisted on a separate workqueue, which is kicked each time a line is closed. The assumption when designing this was that freeing the write thread from creating a new write request was better than the potential impact of writes colliding on the media (user I/O and metadata I/O). Experimentation has proven that this assumption is wrong; collisions can cost up to 25% of bandwidth and introduce long tail latencies on the write thread, which potentially causes user write threads to spend more time spinning to get a free entry on the write buffer. This patch moves the metadata logic to the write thread. When a line is closed, the remaining metadata is written in memory and placed on a metadata queue. The write thread then takes the metadata corresponding to the previous line, creates the write request, and schedules it to minimize collisions on the media. Using this approach, we see that we can saturate the media's bandwidth, which helps reduce both write latencies and the spinning time for user writer threads. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Read requests allocate some extra memory to store their per-I/O context. Instead of requiring yet another memory pool for other types of requests, generalize this context allocation (and change the naming accordingly). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Erase I/Os are scheduled with the following goals in mind: (i) minimize LUN collisions with write I/Os, and (ii) even out the price of erasing on every write, instead of putting all the burden on garbage collection runs. This works well in the current design, but is specific to the default mapping algorithm. This patch generalizes the erase path so that other mapping algorithms can select an arbitrary line to be erased instead. It also gets rid of the erase semaphore, since it creates jitter for user writes. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Allow configuring the maximum number of sectors per write command through sysfs. This makes it easier to tune write command sizes for different controller configurations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Add a new debug counter to measure cache hits on the read path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
Spare a double calculation on the fast write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Javier González
If nvme_alloc_request fails, propagate the right error, instead of assuming -ENOMEM. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
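A hedged sketch of the pattern being fixed: nvme_alloc_request() returns an ERR_PTR-encoded request, so the caller should propagate that error instead of flattening every failure to -ENOMEM. The surrounding function is illustrative:

static int nvm_submit_cmd(struct request_queue *q, struct nvme_command *cmd)
{
	struct request *rq;

	rq = nvme_alloc_request(q, cmd, 0, NVME_QID_ANY);
	if (IS_ERR(rq))
		return PTR_ERR(rq);	/* propagate, don't assume -ENOMEM */
	/* ... set up and submit the request ... */
	return 0;
}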
-
By Javier González
In case of a failure when submitting a request, convert the ppa_list addresses to the target format so that the target can interpret the PPAs for recovery. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 June 2017, 1 commit
-
-
By Jens Axboe
This fixes up two commits that have touched this driver. The command status field is now a blk_status_t, so we can't check for < 0, and we definitely can't assume it's holding -Exxxx error values. All we care about here is whether ->status is zero or not. Check for that, and remove the various attempts at smart error reporting. Just log to dmesg which command failed, and the blk_status_t value. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 2a842aca ("block: introduce new block status code type") Fixes: 3f5e6a35 ("mtip32xx: convert internal command issue to block IO path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
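A hedged sketch of the corrected check, with types and field names simplified from the mtip32xx driver: treat the blk_status_t as opaque and only distinguish zero from non-zero:

static int mtip_check_status(struct mtip_int_cmd *icmd, struct device *dev)
{
	if (icmd->status) {	/* blk_status_t: zero means success */
		dev_err(dev, "internal command failed, status = 0x%x\n",
			icmd->status);
		return -EIO;
	}
	return 0;
}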
-
- 21 June 2017, 3 commits
-
-
By Bart Van Assche
Since scsi_req_init() works on a struct scsi_request, change the argument type to struct scsi_request *. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Bart Van Assche
Instead of explicitly calling scsi_req_init() after blk_get_request(), call that function from inside blk_get_request(). Add an .initialize_rq_fn() callback function to the block drivers that need it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn() because it is too small to keep as a separate function. Keep the scsi_req_init() call in ide_prep_sense() because it follows a blk_rq_init() call. References: commit 82ed4db4 ("block: split scsi_request out of struct request") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
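A hedged sketch of the wiring, simplified from the SCSI case: the queue gets an .initialize_rq_fn callback that blk_get_request() now invokes, so callers no longer need an explicit scsi_req_init():

static void scsi_initialize_rq(struct request *rq)
{
	struct scsi_request *sreq = scsi_req(rq);

	scsi_req_init(sreq);	/* now takes a struct scsi_request * */
}

/* during queue setup: */
q->initialize_rq_fn = scsi_initialize_rq;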
-
By Jens Axboe
Some storage drivers need to share tag sets between devices. It's useful to be able to model that with null_blk, to find hangs or performance issues. Add a 'shared_tags' bool module parameter. If it is set to true and nr_devices is bigger than 1, all allocated devices will share the same tag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
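A hedged sketch of the knob itself; the parameter name matches the commit, while the surrounding declarations follow the usual module-parameter pattern rather than the exact null_blk source. Loading the module with nr_devices=4 shared_tags=1 would then exercise the shared-tag-set path:

static bool g_shared_tags;
module_param_named(shared_tags, g_shared_tags, bool, 0444);
MODULE_PARM_DESC(shared_tags,
		 "Share tag set between devices for blk-mq. Default: false");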
-
- 19 June 2017, 2 commits
-
-
By Ming Lei
When nvme_kill_queues() is run, queues may be in a quiesced state, so we forcibly unquiesce them to avoid blocking dispatch, and an I/O hang can be avoided in the remove path. Previously we used blk_mq_start_stopped_hw_queues() as the counterpart of blk_mq_quiesce_queue(); now that blk_mq_unquiesce_queue() has been introduced, use it explicitly. Cc: linux-nvme@lists.infradead.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei
blk_mq_unquiesce_queue() is used for unquiescing the queue explicitly, so replace blk_mq_start_stopped_hw_queues() with it. For the SCSI part, this patch takes Bart's suggestion to switch to the block quiesce/unquiesce API completely. Cc: linux-nvme@lists.infradead.org Cc: linux-scsi@vger.kernel.org Cc: dm-devel@redhat.com Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
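A hedged sketch of the paired API; blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are the real blk-mq interfaces, while mydrv_update_state() stands in for driver-specific work:

static void mydrv_update_state(struct request_queue *q)
{
	blk_mq_quiesce_queue(q);	/* block new dispatch, drain in-flight */
	/* ... change dispatch-visible state ... */
	blk_mq_unquiesce_queue(q);	/* counterpart; replaces
					 * blk_mq_start_stopped_hw_queues() */
}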
-