提交 · a4bd217b432685d6a177c28a2af187f041c473b7 · openeuler / raspberrypi-kernel

17 4月, 2017 16 次提交

lightnvm: physical block device (pblk) target · a4bd217b

由 Javier González 提交于 4月 15, 2017

This patch introduces pblk, a host-side translation layer for
Open-Channel SSDs to expose them like block devices. The translation
layer allows data placement decisions, and I/O scheduling to be
managed by the host, enabling users to optimize the SSD for their
specific workloads.

An open-channel SSD has a set of LUNs (parallel units) and a
collection of blocks. Each block can be read in any order, but
writes must be sequential. Writes may also fail, and if a block
requires it, must also be reset before new writes can be
applied.

To manage the constraints, pblk maintains a logical to
physical address (L2P) table,  write cache, garbage
collection logic, recovery scheme, and logic to rate-limit
user I/Os versus garbage collection I/Os.

The L2P table is fully-associative and manages sectors at a
4KB granularity. Pblk stores the L2P table in two places, in
the out-of-band area of the media and on the last page of a
line. In the cause of a power failure, pblk will perform a
scan to recover the L2P table.

The user data is organized into lines. A line is data
striped across blocks and LUNs. The lines enable the host to
reduce the amount of metadata to maintain besides the user
data and makes it easier to implement RAID or erasure coding
in the future.

pblk implements multi-tenant support and can be instantiated
multiple times on the same drive. Each instance owns a
portion of the SSD - both regarding I/O bandwidth and
capacity - providing I/O isolation for each case.

Finally, pblk also exposes a sysfs interface that allows
user-space to peek into the internals of pblk. The interface
is available at /dev/block/*/pblk/ where * is the block
device name exposed.

This work also contains contributions from:
  Matias Bjørling <matias@cnexlabs.com>
  Simon A. F. Lund <slund@cnexlabs.com>
  Young Tack Jin <youngtack.jin@gmail.com>
  Huaicheng Li <huaicheng@cs.uchicago.edu>
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

a4bd217b

lightnvm: convert sprintf into strlcpy · 6eb08245

由 Javier González 提交于 4月 15, 2017

Convert sprintf calls to strlcpy in order to make possible buffer
overflow more obvious.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6eb08245

lightnvm: fix type checks on rrpc · d788c59b

由 Javier González 提交于 4月 15, 2017

sector_t is always unsigned, therefore avoid < 0 checks on it.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

d788c59b

lightnvm: clean unused variable · 5d30f3bd

由 Javier González 提交于 4月 15, 2017

Clean unused variable on lightnvm core.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5d30f3bd

lightnvm: make nvm_free static · 46b160ce

由 Javier González 提交于 4月 15, 2017

Prefix the nvm_free static function with a missing static keyword.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

46b160ce

lightnvm: allow to init targets on factory mode · 4af3f75d

由 Javier González 提交于 4月 15, 2017

Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch enables to create a
factory partition (e.g., do not trigger recovery on the given target).
This is useful for target development and for being able to restore the
device state at any moment in time without requiring a full-device
erase.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4af3f75d

lightnvm: bad type conversion for nvme control bits · e85292fe

由 Javier González 提交于 4月 15, 2017

The NVMe I/O command control bits are 16 bytes, but is interpreted as
32 bytes in the lightnvm user I/O data path.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e85292fe

lightnvm: fix cleanup order of disk on init error · 7d1ef2f4

由 Javier González 提交于 4月 15, 2017

Reorder disk allocation such that the disk structure can be put
safely.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

7d1ef2f4

lightnvm: double-clear of dev->lun_map on target init error · edee1bdd

由 Javier González 提交于 4月 15, 2017

The dev->lun_map bits are cleared twice if an target init error occurs.
First in the target clean routine, and then next in the nvm_tgt_create
error function. Make sure that it is only cleared once by extending
nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can
ignored when cleaning up a successful initialized target.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Fix style.
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

edee1bdd

lightnvm: don't check for failure from mempool_alloc() · b0e0306c

由 NeilBrown 提交于 4月 15, 2017

mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_KERNEL and GFP_NOIO allows for sleeping.

So rrpc_move_valid_pages() and rrpc_make_rq() don't need to
test the return value.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

b0e0306c

lightnvm: enable nvme size compile asserts · 48d663a3

由 Matias Bjørling 提交于 4月 15, 2017

The asserts in _nvme_nvm_check_size are not compiled due to the function
not begin called. Make sure that it is called, and also fix the wrong
sizes of asserts for nvme_nvm_addr_format, and nvme_nvm_bb_tbl, which
checked for number of bits instead of bytes.
Reported-by: NScott Bauer <scott.bauer@intel.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

48d663a3

lightnvm: free reverse device map · 7a3de2b3

由 Javier González 提交于 4月 15, 2017

Free the reverse mapping table correctly on target tear down
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

7a3de2b3

lightnvm: rename scrambler controller hint · a7737f39

由 Javier González 提交于 4月 15, 2017

According to the OCSSD 1.2 specification, the 0x200 hint enables the
media scrambler for the read/write opcode, providing that the controller
has been correctly configured by the firmware. Rename the macro to
represent this meaning.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

a7737f39

lightnvm: submit erases using the I/O path · 17912c49

由 Javier González 提交于 4月 15, 2017

Until now erases have been submitted as synchronous commands through a
dedicated erase function. In order to enable targets implementing
asynchronous erases, refactor the erase path so that it uses the normal
async I/O submission functions. If a target requires sync I/O, it can
implement it internally. Also, adapt rrpc to use the new erase path.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Fixed spelling error.
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

17912c49

nvme/lightnvm: Prevent small buffer overflow in nvme_nvm_identify · 2849a7be

由 Scott Bauer 提交于 4月 15, 2017

There are two closely named structs in lightnvm:
struct nvme_nvm_addr_format and
struct nvme_addr_format.

The first struct has 4 reserved bytes at the end, the second does not.
(gdb) p sizeof(struct nvme_nvm_addr_format)
$1 = 16
(gdb) p sizeof(struct nvm_addr_format)
$2 = 12

In the nvme_nvm_identify function we memcpy from the larger struct to the
smaller struct. We incorrectly pass the length of the larger struct
and overflow by 4 bytes, lets not do that.
Signed-off-by: NScott Bauer <scott.bauer@intel.com>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2849a7be

lightnvm: Fix error handling · 654a01b7

由 Christophe JAILLET 提交于 4月 15, 2017

According to error handling in this function, it is likely that going to
'out' was expected here.
Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

654a01b7

15 4月, 2017 7 次提交

net: off by one in inet6_pton() · a88086e0

由 Dan Carpenter 提交于 4月 13, 2017

If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
one space beyond the end of the buffer.

Fixes: b1a951fe ("net/utils: generic inet_pton_with_scope helper")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

a88086e0

blk-mq: introduce Kyber multiqueue I/O scheduler · 00e04393

由 Omar Sandoval 提交于 4月 14, 2017

The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.

The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.

A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.

Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.

These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.

Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

00e04393

blk-mq-sched: make completed_request() callback more useful · c05f8525

由 Omar Sandoval 提交于 4月 14, 2017

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c05f8525

blk-mq: export helpers · 5b727272

由 Omar Sandoval 提交于 4月 14, 2017

blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5b727272

blk-mq: add shallow depth option for blk_mq_get_tag() · 229a9287

由 Omar Sandoval 提交于 4月 14, 2017

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

229a9287

sbitmap: add sbitmap_get_shallow() operation · c05e6673

由 Omar Sandoval 提交于 4月 14, 2017

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c05e6673

remove the mg_disk driver · 84253394

由 Christoph Hellwig 提交于 4月 06, 2017

This drivers was added in 2008, but as far as a I can tell we never had a
single platform that actually registered resources for the platform driver.

It's also been unmaintained for a long time and apparently has a ATA mode
that can be driven using the IDE/libata subsystem.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

84253394

11 4月, 2017 3 次提交

block: Fix list corruption of blk stats callback list · 3f19cd23

由 Jan Kara 提交于 4月 11, 2017

When CFQ calls wbt_disable_default(), it will call
blk_stat_remove_callback() to stop gathering IO statistics for the
purposes of writeback throttling. Later, when request_queue is
unregistered, wbt_exit() will call blk_stat_remove_callback() again
which will try to delete callback from the list again and possibly cause
list corruption.

Fix the problem by making wbt_disable_default() called wbt_exit() which
is properly guarded against being called multiple times.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

3f19cd23

blk-mq: Show symbolic names for hctx state and flags · f5c0b091

由 Bart Van Assche 提交于 3月 30, 2017

Instead of showing the hctx state and flags as numbers, show the
names of the flags.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

f5c0b091

blk-mq: Export queue state through /sys/kernel/debug/block/*/state · 91d68905

由 Bart Van Assche 提交于 4月 10, 2017

Make it possible to check whether or not a block layer queue has
been stopped. Make it possible to start and to run a blk-mq queue
from user space.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

91d68905

09 4月, 2017 14 次提交

scsi: sd: Remove LBPRZ dependency for discards · bcd069bb

由 Martin K. Petersen 提交于 4月 05, 2017

Separating discards and zeroout operations allows us to remove the LBPRZ
block zeroing constraints from discards and honor the device preferences
for UNMAP commands.

If supported by the device, we'll also choose UNMAP over one of the
WRITE SAME variants for discards.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

bcd069bb

scsi: sd: Separate zeroout and discard command choices · e6bd9312

由 Martin K. Petersen 提交于 4月 05, 2017

Now that zeroout and discards are distinct operations we need to
separate the policy of choosing the appropriate command. Create a
zeroing_mode which can be one of:

write:			Zeroout assist not present, use regular WRITE
writesame:		Allow WRITE SAME(10/16) with a zeroed payload
writesame_16_unmap:	Allow WRITE SAME(16) with UNMAP
writesame_10_unmap:	Allow WRITE SAME(10) with UNMAP

The last two are conditional on the device being thin provisioned with
LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.

Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
if none of the _unmap variants are supported, regular WRITE SAME will be
used if the device supports it.

The zeroout_mode is exported in sysfs and the detected mode for a given
device can be overridden using the string constants above.

With this change in place we can now issue WRITE SAME(16) with UNMAP set
for block zeroing applications that require hard guarantees and
logical_block_size granularity. And at the same time use the UNMAP
command with the device's preferred granulary and alignment for discard
operations.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e6bd9312

block: remove the discard_zeroes_data flag · 48920ff2

由 Christoph Hellwig 提交于 4月 05, 2017

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

48920ff2

drbd: implement REQ_OP_WRITE_ZEROES · 45c21793

由 Christoph Hellwig 提交于 4月 05, 2017

It seems like DRBD assumes its on the wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

45c21793

drbd: make intelligent use of blkdev_issue_zeroout · 0dbed96a

由 Christoph Hellwig 提交于 4月 05, 2017

drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0dbed96a

block: stop using discards for zeroing · 71027e97

由 Christoph Hellwig 提交于 4月 05, 2017

Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing, we can remove the call to blkdev_issue_discard.
This means we only have two ways of zeroing left and can simplify the
code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

71027e97

mmc: remove the discard_zeroes_data flag · 5d1429fe

由 Christoph Hellwig 提交于 4月 05, 2017

mmc only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5d1429fe

rsxx: remove the discard_zeroes_data flag · 274243f5

由 Christoph Hellwig 提交于 4月 05, 2017

rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

274243f5

rbd: remove the discard_zeroes_data flag · 93c1defe

由 Christoph Hellwig 提交于 4月 05, 2017

rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

93c1defe

brd: remove discard support · f09a06a1

由 Christoph Hellwig 提交于 4月 05, 2017

It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

f09a06a1

loop: implement REQ_OP_WRITE_ZEROES · 19372e27

由 Christoph Hellwig 提交于 4月 05, 2017

It's identical to discard as hole punches will always leave us with
zeroes on reads.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

19372e27

zram: implement REQ_OP_WRITE_ZEROES · 31edeacd

由 Christoph Hellwig 提交于 4月 05, 2017

Just the same as discard if the block size equals the system page size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

31edeacd

nvme: implement REQ_OP_WRITE_ZEROES · e850fd16

由 Christoph Hellwig 提交于 4月 05, 2017

But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e850fd16

sd: implement unmapping Write Zeroes · e4b87837

由 Christoph Hellwig 提交于 4月 05, 2017

Try to use a write same with unmap bit variant if the device supports it
and the caller allows for it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e4b87837