- 06 Sep 2017, 10 commits
-
-
Submitted by Dan Carpenter
In olden times, closure_return() used to have a hidden return built in. We removed the hidden return but forgot to add a new return here. If "c" were NULL we would oops on the next line, but fortunately "c" is never NULL. Let's just remove the if statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
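A minimal sketch of the pattern being removed (hypothetical function and type layout, not the actual bcache code):

```c
static void example_read_done(struct closure *cl)
{
	struct cache_set *c = container_of(cl, struct cache_set, cl);

	/*
	 * Leftover from when closure_return() implied a hidden "return":
	 * without that return, a NULL "c" would still be dereferenced
	 * just below, so the check only hides the problem from static
	 * checkers and can simply be dropped.
	 */
	if (!c)
		closure_return(cl);

	pr_debug("read done for cache set %p", c);
	closure_return(cl);
}
```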
-
Submitted by Tang Junhui
gc and the write-back thread race with each other (see the earlier email "bcache get stucked" I sent):

gc thread                           write-back thread
|                                   |bch_writeback_thread()
|bch_gc_thread()                    |
|                                   |==>read_dirty()
|==>bch_btree_gc()                  |
|==>btree_root()  //get btree root  |
|                 //node write lock |
|==>bch_btree_gc_root()             |
|                                   |==>read_dirty_submit()
|                                   |==>write_dirty()
|                                   |==>continue_at(cl,
|                                   |       write_dirty_finish,
|                                   |       system_wq);
|                                   |==>write_dirty_finish() //executes
|                                   |                        //in system_wq
|                                   |==>bch_btree_insert()
|                                   |==>bch_btree_map_leaf_nodes()
|                                   |==>__bch_btree_map_nodes()
|                                   |==>btree_root  //try to get btree root
|                                   |               //node read lock
|                                   |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked()  //journal_full(&c->journal)
|                             //condition satisfied
|==>continue_at(cl, journal_write, system_wq);  //try to execute
|                             //journal_write in system_wq,
|                             //but the work queue is executing
|                             //write_dirty_finish()
|==>closure_sync();  //wait for journal_write to finish and wake up gc
|-------------stuck here
|==>release root node write lock

This patch allocates a separate work queue for the write-back thread to avoid this race. (Commit log re-organized by Coly Li to pass checkpatch.pl checking) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
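A minimal sketch of the fix, with illustrative names; the idea is simply to stop sharing system_wq between write_dirty_finish() and journal_write():

```c
	/* Set up once per cached device (error handling trimmed). */
	dc->writeback_write_wq = alloc_workqueue("bcache_writeback_wq",
						 WQ_MEM_RECLAIM, 0);

	/* write_dirty() queues its completion on the private queue ... */
	continue_at(cl, write_dirty_finish, dc->writeback_write_wq);

	/*
	 * ... so journal_write can still run on system_wq and the gc
	 * thread's closure_sync() is eventually woken up.
	 */
```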
-
Submitted by Tang Junhui
Currently we only allocate 6 open buckets for each cache set, but we usually attach about 10 backend devices to each cache set, and each bcache device is accessed by roughly 10 threads in the top application layer. So 6 open buckets are too few: threads end up writing data to different buckets from one another, which causes inefficient write-back and inefficient bucket usage, and makes it very easy to run out of buckets. I added a debug message in bch_open_buckets_alloc() to print bucket allocation info, and tested with ten bcache devices on one cache set, each bcache device accessed by ten threads. From the debug messages we can see that, after this modification, a bucket is more likely to be assigned to the same thread, and data from the same thread is more likely to be written to the same bucket. Usually the same thread always writes/reads the same backend device, so this is good for write-back and also improves bucket usage efficiency. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Tony Asleson
If you encounter any error in bch_cached_dev_attach(), it returns a negative error code. The variable 'v' which stores the result is unsigned, so user space sees a very large value returned for bytes written, which can cause incorrect user-space behavior. Use one signed variable throughout the function to preserve the ability to return errors. Signed-off-by: Tony Asleson <tasleson@redhat.com> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
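A minimal sketch of the failure mode with a simplified sysfs store handler (names are illustrative, not the actual bcache sysfs code):

```c
/* Buggy shape: a negative error code stored in an unsigned variable. */
static ssize_t store_attach_buggy(const char *buf, size_t size)
{
	unsigned v = -ENOENT;	/* wraps to a huge positive number */

	return v;		/* user space sees ~4 billion "bytes written" */
}

/* Fixed shape: one signed variable carries both errors and the size. */
static ssize_t store_attach_fixed(const char *buf, size_t size)
{
	ssize_t v = -ENOENT;	/* stays negative, reported as an error */

	return v;
}
```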
-
Submitted by Tang Junhui
__update_write_rate() uses a Proportional-Differential (PD) controller algorithm to control the writeback rate. A dirty target number is used in this PD controller to control the writeback rate: a larger target number makes the writeback rate smaller, and conversely a smaller target number makes the writeback rate larger. bcache uses the following steps to calculate the target number:
1) cache_sectors = all-buckets-of-cache-set * bucket-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target * (sectors-of-cached-device / sectors-of-all-cached-devices-of-this-cache-set)
The calculation of cache_sectors in step 1) is incorrect: it does not account for dirty blocks occupied by flash-only volumes. A flash-only volume can be thought of as a bcache device without a cached device; all data sectors allocated for it are persistent on the cache device and marked dirty, and they are not touched by the bcache writeback and garbage collection code. So data blocks of flash-only volumes should be ignored when calculating cache_sectors of a cache set. The current code does not subtract the dirty sectors of flash-only volumes, which results in a larger target number from the above 3 steps; consequently the cache device's writeback rate is smaller than the correct value and writeback is slower on all cached devices. This patch fixes the incorrectly slow writeback rate by subtracting the dirty sectors of flash-only volumes in __update_writeback_rate(). (Commit log composed by Coly Li to pass checkpatch.pl checking) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
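A worked approximation of the corrected target calculation with made-up numbers (illustrative, not the actual __update_writeback_rate() code):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t cache_sectors      = 1ULL << 31;  /* 1 TiB of cache in 512-byte sectors */
	uint64_t flash_only_dirty   = 1ULL << 29;  /* 256 GiB pinned by flash-only volumes */
	uint64_t this_dev_sectors   = 1ULL << 30;  /* this cached device */
	uint64_t all_cached_sectors = 1ULL << 31;  /* all cached devices of the set */
	unsigned writeback_percent  = 10;

	cache_sectors -= flash_only_dirty;         /* the fix: ignore flash-only sectors */

	uint64_t cache_dirty_target = cache_sectors * writeback_percent / 100;
	uint64_t target = cache_dirty_target * this_dev_sectors / all_cached_sectors;

	/* Without the subtraction the target is ~33% larger here, so the
	 * PD controller drives a correspondingly slower writeback rate. */
	printf("dirty target: %llu sectors\n", (unsigned long long)target);
	return 0;
}
```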
-
Submitted by Tang Junhui
I tried to execute the following command to trigger the gc thread:
[root@localhost internal]# echo 1 > trigger_gc
But it did not work. Debugging gc_should_run() shows that gc only runs while invalidating buckets or when sectors_to_gc < 0, so set sectors_to_gc to -1 to meet that condition when gc is triggered manually. (Code comments added by Coly Li) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
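A minimal sketch of the change as a simplified sysfs store handler (illustrative, not a copy of the actual bcache code):

```c
/* "echo 1 > trigger_gc": force gc_should_run() to see its wake-up condition. */
static ssize_t store_trigger_gc(struct cache_set *c, const char *buf, size_t size)
{
	/* gc_should_run() fires when sectors_to_gc drops below zero. */
	atomic_set(&c->sectors_to_gc, -1);
	wake_up_gc(c);
	return size;
}
```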
-
Submitted by Byungchul Park
Although llist provides proper APIs, they are not used. Make them used. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
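A minimal sketch of what using the llist API looks like (illustrative types; the actual patch converts bcache's closure wait-list handling):

```c
#include <linux/llist.h>
#include <linux/sched.h>

struct waiter {
	struct llist_node node;
	struct task_struct *task;
};

static void wake_all(struct llist_head *wait_list)
{
	struct llist_node *head = llist_del_all(wait_list);
	struct waiter *w, *tmp;

	/* llist_for_each_entry_safe() replaces open-coded pointer chasing. */
	llist_for_each_entry_safe(w, tmp, head, node)
		wake_up_process(w->task);
}
```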
-
Submitted by Tang Junhui
Since bypassed IOs use no bucket, do not subtract sectors_to_gc to trigger the gc thread for them. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Tang Junhui
Sequential write IOs were tested with bs=1M by fio in writeback cache mode; these IOs were expected to be bypassed, but actually they were not. Debugging the code, we find this in check_should_bypass():
    if (!congested &&
        mode == CACHE_MODE_WRITEBACK &&
        op_is_write(bio_op(bio)) &&
        (bio->bi_opf & REQ_SYNC))
        goto rescale
That means that in writeback mode a write IO with the REQ_SYNC flag is not bypassed even though it is a large sequential IO. That is not the correct thing to do, so this patch removes that code. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jan Kara
If blkdev_get_by_path() in register_bcache() fails, we try to look up the block device using lookup_bdev() to detect which situation we are in, so we can report the error properly. However we never drop the reference returned to us from lookup_bdev(). Fix that. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
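A minimal sketch of the resulting error path, assuming register_bcache() looks roughly like this (illustrative, not the exact code):

```c
	bdev = blkdev_get_by_path(path, mode, holder);
	if (IS_ERR(bdev)) {
		/* Distinguish "no such block device" from "device busy". */
		struct block_device *probe = lookup_bdev(path);

		ret = IS_ERR(probe) ? -ENODEV : -EBUSY;
		if (!IS_ERR(probe))
			bdput(probe);	/* the previously missing reference drop */
		goto err;
	}
```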
-
- 02 Sep 2017, 2 commits
-
-
Submitted by Shaohua Li
Nobody uses the list. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Shaohua Li
In lo_rw_aio() -> call_read_iter():
  1. aops->direct_IO()
  2. iov_iter_revert()
lo_rw_aio_complete() can run between 1 and 2; the bio and bvec could then be freed before 2, which accesses the bvec. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 Sep 2017, 7 commits
-
-
Submitted by Shaohua Li
Currently loop disables merging. While that makes sense for buffered I/O mode, direct I/O mode can benefit from request merging. Without merging, loop could send small IOs to the underlying disk and harm performance. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Shaohua Li
Loop can handle requests of any size. Limiting it to 255 sectors just burns CPU on bio splitting and request merging for the underlying disk, and also causes bad filesystem block allocation in direct I/O mode. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Daniel Verkamp
The default host NQN, which is generated based on the host's UUID, does not follow the UUID-based NQN format laid out in the NVMe 1.3 specification. Remove the "NVMf:" portion of the NQN to match the spec. Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de>
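For illustration (the UUID below is made up), the default NQN changes from the old form
    nqn.2014-08.org.nvmexpress:NVMf:uuid:0c468c5d-a4fe-41c8-9b44-f2fb1c2e9a7f
to the UUID-based format the specification describes:
    nqn.2014-08.org.nvmexpress:uuid:0c468c5d-a4fe-41c8-9b44-f2fb1c2e9a7f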
-
Submitted by Omar Sandoval
The comments here are really outdated, and blk-mq made flushing much simpler, so just fold the two cases into the callers. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Omar Sandoval
This is a different approach from the first attempt in f2c6df7d ("loop: support 4k physical blocksize"). Rather than extending LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block size. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
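A sketch of how user space would call such an ioctl, assuming the LOOP_SET_BLOCK_SIZE name from <linux/loop.h> (error handling kept minimal):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(void)
{
	int fd = open("/dev/loop0", O_RDWR);

	if (fd < 0)
		return 1;
	/* Ask the loop driver to expose a 4096-byte logical block size. */
	if (ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096UL) < 0)
		perror("LOOP_SET_BLOCK_SIZE");
	close(fd);
	return 0;
}
```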
-
Submitted by Omar Sandoval
The physical block size is "the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations" (from the comment on blk_queue_physical_block_size()). Since loop does buffered I/O on the backing file by default, the RMW unit is a page. This isn't the case for direct I/O mode, but let's keep it simple. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Omar Sandoval
This is only used for setting the soft block size on the struct block_device once, and then never used again. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 Aug 2017, 6 commits
-
-
Submitted by Omri Mann
And fix the Get/Set Features implementation to take all 8 bits of the feature identifier into account. Signed-off-by: Omri Mann <omri@excelero.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> [hch: used the UUID API, updated changelog]
-
Submitted by Keith Busch
The ioctls' struct allows the user to provide a metadata address and length for a passthrough command. This patch uses these previously ignored values and deletes the now-unused wrapper function. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Keith Busch
These functions are used only locally in the nvme core. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Keith Busch
Only read and write commands need DIF remapping. Everything else uses a passthrough integrity payload. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Christoph Hellwig
Keep the metadata code in a separate helper instead of making the main function more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Roland Dreier
The mutex protects against the list of transports changing while a controller is being created, but using a plain old mutex means that it also serializes controller creation. This unnecessarily slows down creating multiple controllers - for example, for the RDMA transport, creating a controller involves establishing one connection for every I/O queue, which involves even more network/software round trips, so the delay can become significant. The simplest way to fix this is to change the mutex to an rwsem and only hold it for writing when the list is being mutated. Since we can take the rwsem for reading while creating a controller, we can create multiple controllers in parallel. Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
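A minimal sketch of the locking change with generic names (not the exact nvme-fabrics structures):

```c
#include <linux/list.h>
#include <linux/rwsem.h>
#include <linux/string.h>

struct transport {
	struct list_head entry;
	const char *name;
	int (*create_ctrl)(void);
};

static DECLARE_RWSEM(transports_rwsem);
static LIST_HEAD(transports);

/* Writer: only when the transport list itself is mutated. */
void register_transport(struct transport *t)
{
	down_write(&transports_rwsem);
	list_add_tail(&t->entry, &transports);
	up_write(&transports_rwsem);
}

/* Reader: controller creation holds the lock for reading only, so
 * several controllers can be created in parallel. */
int create_ctrl(const char *name)
{
	struct transport *t;
	int ret = -ENOENT;

	down_read(&transports_rwsem);
	list_for_each_entry(t, &transports, entry) {
		if (!strcmp(t->name, name)) {
			ret = t->create_ctrl();	/* may take many round trips */
			break;
		}
	}
	up_read(&transports_rwsem);
	return ret;
}
```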
-
- 29 Aug 2017, 15 commits
-
-
Submitted by Christoph Hellwig
Instead, validate that these identifiers do not change, as that is prohibited by the specification. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
-
Submitted by Christoph Hellwig
The function is used in two places, and the shared code for those will diverge later in this series. Instead factor out a new helper to get the ids for a namespace, simplify the calling conventions for nvme_identify_ns, and just open-code the sequence. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Submitted by Christoph Hellwig
And move the flag definitions for the flags field near that field while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Submitted by Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Submitted by Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Submitted by Martin K. Petersen
If an NVMe controller reports an RTD3 Entry Latency larger than shutdown_timeout, use that value, up to a maximum of 60 seconds, to set the shutdown timer. Otherwise fall back to the module parameter, which defaults to 5 seconds. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: removed do_div, made transition time local scope] Signed-off-by: Christoph Hellwig <hch@lst.de>
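A minimal sketch of the timeout selection described above (RTD3E is reported in microseconds; names follow the description rather than the exact code):

```c
	u32 rtd3e_us = le32_to_cpu(id->rtd3e);		/* RTD3 Entry Latency */

	if (rtd3e_us) {
		u32 transition_time = rtd3e_us / 1000000;	/* us -> s */

		/* Never below the module parameter, never above 60 seconds. */
		ctrl->shutdown_timeout = clamp_t(unsigned int, transition_time,
						 shutdown_timeout, 60);
	} else {
		ctrl->shutdown_timeout = shutdown_timeout;	/* default: 5 seconds */
	}
```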
-
Submitted by Jan H. Schönherr
The value of iod->first_dma ends up as prp2 in NVMe commands. In case there is not enough data to cross a page boundary, iod->first_dma is never initialized and contains random data. Comply with the NVMe specification and fill in 0 in that case. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
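A minimal sketch of the fix inside the PRP setup path (simplified; names are illustrative):

```c
	/* The transfer fits entirely in the first page: there is no
	 * second PRP entry, so prp2 must read as 0 rather than as
	 * whatever happened to be on the stack. */
	length -= (page_size - offset);
	if (length <= 0) {
		iod->first_dma = 0;
		goto done;
	}
```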
-
Submitted by Max Gurtovoy
This patch slightly improves performance (mainly for small block sizes). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Martin Wilck
This changes the earlier patch "nvmet: don't report 0-bytes in serial number" to use the memcpy_and_pad() helper introduced in a previous patch. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
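A minimal sketch of what the helper does for the 20-byte serial-number field, assuming the memcpy_and_pad() signature from <linux/string.h>:

```c
	char sn[20];				/* Identify Controller SN field */
	const char *serial = "S3X9NF0K";	/* illustrative serial string */

	/* Copy the serial and pad the remainder with ASCII spaces instead
	 * of leaving 0-bytes in the field. */
	memcpy_and_pad(sn, sizeof(sn), serial, strlen(serial), ' ');
```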
-
Submitted by James Smart
The existing nvmet_fc sg list handling has two faults: a) the request between the LLDD and the transport has too large an sg list (256 elements), whereas a normal transfer is at most 256k (64 elements); b) the sg list handling doesn't take advantage of the fact that each element is a page. This patch removes the static sg list in the request and uses the dynamic list already present in the nvmet_fc transport. It also simplifies the handling of the sg list across multiple sequences to take advantage of the per-page divisions. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by James Smart
If the LLDD resets or detaches from an FC port, the LLDD will deregister all remoteports seen by the FC port and deregister the localport associated with the FC port. The teardown of the localport structure is held off by reference counting until all the remoteports are removed (and those are held off until all controllers/associations are terminated). Currently, if the FC port is reinitialized/reattached and registered again as a localport, it is treated as an entity independent from the prior localport, and none of the prior remoteports and controllers can be revived; they are created as new and separate entities. This patch changes the localport registration to look at the known localports that are waiting to be torn down. If the port is the same based on WWNs, the localport is transitioned out of the teardown state. This allows the remoteports and controller connections to be re-established and resumed as long as the localport can also be re-registered within the timeout windows. The patch adds a new routine nvme_fc_attach_to_unreg_lport() with this functionality and moves the lport get/put routines to avoid forward references. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Max Gurtovoy
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Max Gurtovoy
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Sagi Grimberg
Use ctrl->device and lose the func name. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Guan Junxiong
This helps users quickly spot the reason why a connection fails if the hostid is not compliant with the UUID format. Signed-off-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-