- 19 6月, 2017 30 次提交
-
-
由 Ming Lei 提交于
Queue can be started by other blk-mq APIs and can be used in different cases, this limits uses of blk_mq_quiesce_queue() if it is based on stopping queue, and make its usage very difficult, especially users have to use the stop queue APIs carefully for avoiding to break blk_mq_quiesce_queue(). We have applied the QUIESCED flag for draining and blocking dispatch, so it isn't necessary to stop queue any more. After stopping queue is removed, blk_mq_quiesce_queue() can be used safely and easily, then users won't worry about queue restarting during quiescing at all. Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
Actually what we want to get from blk_mq_quiesce_queue() isn't only to wait for completion of all ongoing .queue_rq(). In the typical context of canceling requests, we need to make sure that the following is done in the dispatch path before starting to cancel requests: - failed dispatched request is finished - busy dispatched request is requeued, and the STARTED flag is cleared So update comment to keep code, doc and our expection consistent. Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
It is required that no dispatch can happen any more once blk_mq_quiesce_queue() returns, and we don't have such requirement on APIs of stopping queue. But blk_mq_quiesce_queue() still may not block/drain dispatch in the the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical sections for fixing this issue. Also blk_mq_quiesce_queue() is implemented via stopping queue, which limits its uses, and easy to cause race, because any queue restart in other paths may break blk_mq_quiesce_queue(). With the introduced flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue for quiescing any more. Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
blk_mq_unquiesce_queue() is used for unquiescing the queue explicitly, so replace blk_mq_start_stopped_hw_queues() with it. For the scsi part, this patch takes Bart's suggestion to switch to block quiesce/unquiesce API completely. Cc: linux-nvme@lists.infradead.org Cc: linux-scsi@vger.kernel.org Cc: dm-devel@redhat.com Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
blk_mq_start_stopped_hw_queues() is used implictly as counterpart of blk_mq_quiesce_queue() for unquiescing queue, so we introduce blk_mq_unquiesce_queue() and make it as counterpart of blk_mq_quiesce_queue() explicitly. This function is for improving the current quiescing mechanism in the following patches. Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
This patch introduces blk_mq_quiesce_queue_nowait() so that we can workaround mpt3sas for quiescing its queue. Once mpt3sas is fixed, we can remove this helper. Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Ming Lei 提交于
We usually put blk_mq_*() into include/linux/blk-mq.h, so move this API into there. Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
blk_bio_segment_split() makes sure bios have no more than BIO_MAX_PAGES entries in the bi_io_vec. This was done because bio_clone_bioset() (when given a mempool bioset) could not handle larger io_vecs. No driver uses bio_clone_bioset() any more, they all use bio_clone_fast() if anything, and bio_clone_fast() doesn't clone the bi_io_vec. The main user of of bio_clone_bioset() at this level is bounce.c, and bouncing now happens before blk_bio_segment_split(), so that is not of concern. So remove the big helpful comment and the code. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
This function allocates a bio, then a collection of pages. It copes with failure. It currently uses a mempool() to allocate the bio, but alloc_page() to allocate the pages. These fail in different ways, so the usage is inconsistent. Change the bio_clone() to bio_clone_kmalloc() so that no pool is used either for the bio or the pages. Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKent Overstreet <kent.overstreet@gmail.com> Reviewed-by : Ming Lei <ming.lei@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
bios that are re-submitted will pass through blk_queue_split() when blk_queue_bio() is called, and this will split the bio if necessary. There is no longer any need to do this splitting in xen-blkfront. Acked-by: NRoger Pau Monné <roger.pau@citrix.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
pblk_submit_read() uses bio_clone_bioset() but doesn't change the io_vec, so bio_clone_fast() is a better choice. It also uses fs_bio_set which is intended for filesystems. Using it in a device driver can deadlock. So allocate a new bioset, and and use bio_clone_fast(). Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJavier González <javier@cnexlabs.com> Tested-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
pktcdvd doesn't change the bi_io_vec of the clone bio, so it is more efficient to use bio_clone_fast(), and not clone the bi_io_vec. This requires providing a bio_set, and it is safest to provide a dedicated bio_set rather than sharing fs_bio_set, which filesytems use. This new bio_set, pkt_bio_set, can also be use for the bio_split() call as the two allocations (bio_clone_fast, and bio_split) are independent, neither can block a bio allocated by the other. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
drbd does not modify the bi_io_vec of the cloned bio, so there is no need to clone that part. So bio_clone_fast() is the better choice. For bio_clone_fast() we need to specify a bio_set. We could use fs_bio_set, which bio_clone() uses, or drbd_md_io_bio_set, which drbd uses for metadata, but it is generally best to avoid sharing bio_sets unless you can be certain that there are no interdependencies. So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast(). Also remove a "XXX cannot fail ???" comment because it definitely cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for sleeping. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that, so there is no need for a copy. bio_clone_fast() can be used instead, which avoids making the copy. This requires that we provide a bio_set. bio_clone() uses fs_bio_set, but it isn't, in general, safe to use the same bio_set at different levels of the stack, as that can lead to deadlocks. As filesystems use fs_bio_set, block devices shouldn't. As rbd never stacks, it is safe to have a single global bio_set for all rbd devices to use. So allocate that when the module is initialised, and use it with bio_clone_fast(). Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
Since commit 23688bf4 ("block: ensure to split after potentially bouncing a bio") blk_queue_bounce() is called *before* blk_queue_split(). This means that: 1/ the comments blk_queue_split() about bounce buffers are irrelevant, and 2/ a very large bio (more than BIO_MAX_PAGES) will no longer be split before it arrives at blk_queue_bounce(), leading to the possibility that bio_clone_bioset() will fail and a NULL will be dereferenced. Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio being copied could be from the same set, and this could lead to a deadlock. So: - allocate 2 private biosets for blk_queue_bounce, one for splitting enormous bios and one for cloning bios. - add code to split a bio that exceeds BIO_MAX_PAGES. - Fix up the comments in blk_queue_split() Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using single bio_for_each_segment loop) Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
A rescuing bioset is only useful if there might be bios from that same bioset on the bio_list_on_stack queue at a time when bio_alloc_bioset() is called. This never applies to q->bio_split. Allocations from q->bio_split are only ever made from blk_queue_split() which is only ever called early in each of various make_request_fn()s. The original bio (call this A) is then passed to generic_make_request() and is placed on the bio_list_on_stack queue, and the bio that was allocated from q->bio_split (B) is processed. The processing of this may cause other bios to be passed to generic_make_request() or may even cause the bio B itself to be passed, possible after some prefix has been split off (using some other bioset). generic_make_request() now guarantees that all of these bios (B and dependants) will be fully processed before the tail of the original bio A gets handled. None of these early bios can possible trigger an allocation from the original q->bio_split as they are either too small to require splitting or (more likely) are destined for a different queue. The next time that the original q->bio_split might be used by this thread is when A is processed again, as it might still be too big to handle directly. By this time there cannot be any other bios allocated from q->bio_split in the generic_make_request() queue. So no rescuing will ever be needed. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
This patch converts bioset_create() to not create a workqueue by default, so alloctions will never trigger punt_bios_to_rescuer(). It also introduces a new flag BIOSET_NEED_RESCUER which tells bioset_create() to preserve the old behavior. All callers of bioset_create() that are inside block device drivers, are given the BIOSET_NEED_RESCUER flag. biosets used by filesystems or other top-level users do not need rescuing as the bio can never be queued behind other bios. This includes fs_bio_set, blkdev_dio_pool, btrfs_bioset, xfs_ioend_bioset, and one allocated by target_core_iblock.c. biosets used by md/raid do not need rescuing as their usage was recently audited and revised to never risk deadlock. It is hoped that most, if not all, of the remaining biosets can end up being the non-rescued version. Reviewed-by: NChristoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
"flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: NChristoph Hellwig <hch@infradead.org> Reviewed-by: NChristoph Hellwig <hch@infradead.org> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 NeilBrown 提交于
blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split() Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMing Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: NJavier González <javier@cnexlabs.com> Tested-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Move most code into blk_mq_rq_ctx_init, and the rest into blk_mq_get_request. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
This patch makes sure we always allocate requests in the core blk-mq code and use a common prepare_request method to initialize them for both mq I/O schedulers. For Kyber and additional limit_depth method is added that is called before allocating the request. Also because none of the intializations can really fail the new method does not return an error - instead the bfq finish method is hardened to deal with the no-IOC case. Last but not least this removes the abuse of RQF_QUEUE by the blk-mq scheduling code as RQF_ELFPRIV is all that is needed now. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
blk_mq_sched_assign_ioc now only handles the assigned of the ioc if the schedule needs it (bfq only at the moment). The caller to the per-request initializer is moved out so that it can be merged with a similar call for the kyber I/O scheduler. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
icq_to_bic is a container_of operation, so we need to check for NULL before it. Also move the check outside the spinlock while we're at it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Merge three functions only tail-called by blk_mq_free_request into blk_mq_free_request. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
No need to have two different callouts of bfq vs kyber. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Having these as separate helpers in a header really does not help readability, or my chances to refactor this code sanely. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Having them out of line in blk-mq-sched.c just makes the code flow unnecessarily complicated. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 6月, 2017 1 次提交
-
-
由 NeilBrown 提交于
When a filesystem is mounted from a loop device, writes are throttled by balance_dirty_pages() twice: once when writing to the filesystem and once when the loop_handle_cmd() writes to the backing file. This double-throttling can trigger positive feedback loops that create significant delays. The throttling at the lower level is seen by the upper level as a slow device, so it throttles extra hard. The PF_LESS_THROTTLE flag was created to handle exactly this circumstance, though with an NFS filesystem mounted from a local NFS server. It reduces the throttling on the lower layer so that it can proceed largely unthrottled. To demonstrate this, create a filesystem on a loop device and write (e.g. with dd) several large files which combine to consume significantly more than the limit set by /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total time taken. When I do this directly on a device (no loop device) the total time for several runs (mkfs, mount, write 200 files, umount) is fairly stable: 28-35 seconds. When I do this over a loop device the times are much worse and less stable. 52-460 seconds. Half below 100seconds, half above. When I apply this patch, the times become stable again, though not as fast as the no-loop-back case: 53-72 seconds. There may be room for further improvement as the total overhead still seems too high, but this is a big improvement. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 17 6月, 2017 1 次提交
-
-
git://git.infradead.org/nvme由 Jens Axboe 提交于
Pull NVMe changes for 4.13 from Christoph: Highlights: - UUID identifier support from Johannes - Lots of cleanups from Sagi - Host Memory Buffer support from me And lots of cleanups and smaller fixes of course. Note that the UUID identifier changes are based on top of the uuid tree. I am the maintainer of that tree and will send it to Linus as soon as 4.12 is released as various other trees depend on it as well (and the diffstat includes those changes unfortunately)
-
- 16 6月, 2017 3 次提交
-
-
由 Arvind Yadav 提交于
of_device_ids are not supposed to change at runtime. All functions working with of_device_ids provided by <linux/of.h> work with const of_device_ids. So mark the non-const structs as const. File size before: text data bss dec hex filename 8908 1096 624 10628 2984 drivers/block/swim3.o File size after constify swim3_match: text data bss dec hex filename 9708 296 624 10628 2984 drivers/block/swim3.o Signed-off-by: NArvind Yadav <arvind.yadav.cs@gmail.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
This patch fixes two sparse warnings introduced by the "dedicated error codes for the block layer V3" patch series. These changes have not been tested. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Scott Bauer 提交于
The NVMe 1.3 spec introduces Namespace Optimal IO Boundaries (NOIOB), which standardizes the stripe mechanism we currently have quirks for. This patch implements the necessary logic to handle this new feature. Signed-off-by: NScott Bauer <scott.bauer@intel.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 15 6月, 2017 5 次提交
-
-
由 Sagi Grimberg 提交于
Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Christoph Hellwig 提交于
We don't need to wait for the reset from the delayed work item that is kicked off when we don't get a keepalive. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reported-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Christoph Hellwig 提交于
This moves the nvme_reset function from the PCIe driver to common code, renaming it to nvme_reset_ctrl in the process. Additionally a new helper nvme_reset_ctrl_sync is added for the case where we want to wait for the reset. To facilitate that the reset_work work structure is move to the common nvme_ctrl structure and the ->reset_ctrl method is removed. For now the drivers initialize the reset_work with their own callback, but longer term we should move to callouts for specific parts of the reset process and move even more code to the core. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Christoph Hellwig 提交于
Now that we get the tagset passed we can have a single implementation for the I/O and admin queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Christoph Hellwig 提交于
Now that we get the tagset passed we can have a single implementation for the I/O and admin queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-