- 30 11月, 2016 13 次提交
-
-
由 Javier González 提交于
In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. Since targets own the LUNs the are instantiated on top of and manage the free block list internally, there is no need for a LUN abstraction in the media manager. LUNs are intrinsically managed as in the physical layout (ch:0,lun:0, ..., ch:0,lun:n, ch:1,lun:0, ch:1,lun:n, ..., ch:m,lun:0, ch:m,lun:n) and given to the targets based on the target creation ioctl. This simplifies LUN management and clears the path for a partition manager to sit directly underneath LightNVM targets. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. A part of this transformation is that targets manage their blocks internally. This patch eliminates the nvm_block abstraction and moves block management to the target logic. The rrpc target is transformed. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Since LUNs are managed internally on targets, the media manager has no access to the free LUN lists. Thus, debug functions that show LUN information on the device should not be implemented on the media manager, but rather on the target in itself. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Since LUNs are managed internally on the target, there is no need for the media manager to implement a get_lun operation. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. This patch moves the block provisioning inside of the target and removes the get/put block interface from the media manager. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
LUNs are exclusively owned by targets implementing a block device FTL. Doing this reservation requires at the moment a 2-way callback gennvm <-> target. The reason behind this is that LUNs were not assumed to always be exclusively owned by targets. However, this design decision goes against I/O determinism QoS (two targets would mix I/O on the same parallel unit in the device). This patch makes LUN reservation as part of the target creation on the media manager. This makes that LUNs are always exclusively owned by the target instantiated on top of them. LUN stripping and/or sharing should be implemented on the target itself or the layers on top. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
The gen_lun abstraction in the generic media manager was conceived on the assumption that a single target would instantiated on top of it. This has complicated target design to implement multi-instances. Remove this abstraction and move its logic to nvm_lun, which manages physical lun geometry and operations. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Targets are assumed to used the same generic ppa format, where the address is partitioned on ch:lun:block:pg:pl:sec. Thus, make the function in charge of transforming the ppa address from a linear format to the generic one available to all targets. This function will be needed by the media manager in order to do target mapping translations when targets are divided on different physical partitions. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Cleanup definition leftovers from old gennvm interface Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Add ECC error codes to enable the appropriate handling in the target. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Bad blocks should be managed by block owners. This would be either targets for data blocks or sysblk for system blocks. In order to support this, export two functions: One to mark a block as an specific type (e.g., bad block) and another to update the bad block table on the device. Move bad block management to rrpc. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Javier González 提交于
Erases might be subject to host hints. An example is multi-plane programming to erase blocks in parallel. Enable targets to specify this hint. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Matias Bjørling 提交于
Previously, LBA read and write were not supported in the lightnvm specification. Now that it supports it, lets use the traditional NVMe gendisk, and attach the lightnvm sysfs geometry export. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 11月, 2016 2 次提交
-
-
由 Jens Axboe 提交于
Bit #7 is already used, move to bit #8 which is the first unused one. Fixes: 9561a7ad ("nbd: add multi-connection support") Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Josef Bacik 提交于
NBD can become contended on its single connection. We have to serialize all writes and we can only process one read response at a time. Fix this by allowing userspace to provide multiple connections to a single nbd device. This coupled with block-mq drastically increases performance in multi-process cases. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 11月, 2016 2 次提交
-
-
由 Ming Lei 提交于
Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Since commit 87374179 ("block: add a proper block layer data direction encoding") we only or the new op and flags into bi_opf in bio_set_op_attrs instead of clearing the old value. I've not seen any breakage with the new behavior, but it seems dangerous. Also convert it to an inline function to make the argument passing safer. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 18 11月, 2016 3 次提交
-
-
由 Tobias Klauser 提交于
With compilers which follow the C99 standard (like modern versions of gcc and clang), "extern inline" does the opposite thing from older versions of gcc (emits code for an externally linkable version of the inline function). "static inline" does the intended behavior in all cases instead. Description taken from commit 6d91857d ("staging, rtl8192e, LLVMLinux: Change extern inline to static inline"). This also fixes the following GCC warning when building with CONFIG_PM disabled: ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes] Fixes: d07ab6d1 ("block: Add blk_set_runtime_active()") Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: NTobias Klauser <tklauser@distanz.ch> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently: -1 Never enter hybrid sleep mode 0 Use half of the completion mean for the sleep delay >0 Use this specific value as the sleep delay Signed-off-by: NJens Axboe <axboe@fb.com> Tested-By: NStephen Bates <sbates@raithlin.com> Reviewed-By: NStephen Bates <sbates@raithlin.com>
-
由 Jens Axboe 提交于
This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: NJens Axboe <axboe@fb.com> Tested-By: NStephen Bates <sbates@raithlin.com> Reviewed-By: NStephen Bates <sbates@raithlin.com>
-
- 12 11月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 11 11月, 2016 4 次提交
-
-
由 Jens Axboe 提交于
Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
We can hook this up to the block layer, to help throttle buffered writes. wbt registers a few trace points that can be used to track what is happening in the system: wbt_lat: 259:0: latency 2446318 wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1, wmean=518866, wmin=15522, wmax=5330353, wsamples=57 wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32 This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat dumps the current read/write stats for that window, and wbt_step shows a step down event where we now scale back writes. Each trace includes the device, 259:0 in this case. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
This adds a shared per-request structure for all NVMe I/O. This structure is embedded as the first member in all NVMe transport drivers request private data and allows to implement common functionality between the drivers. The first use is to replace the current abuse of the SCSI command passthrough fields in struct request for the NVMe command passthrough, but it will grow a field more fields to allow implementing things like common abort handlers in the future. The passthrough commands are handled by having a pointer to the SQE (struct nvme_command) in struct nvme_request, and the union of the possible result fields, which had to be turned from an anonymous into a named union for that purpose. This avoids having to pass a reference to a full CQE around and thus makes checking the result a lot more lightweight. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 11月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
Note in the bdi_writeback structure whenever a task ends up sleeping waiting for progress. We can use that information in the lower layers to increase the priority of writes. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 06 11月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
For blk-mq, ->nr_requests does track queue depth, at least at init time. But for the older queue paths, it's simply a soft setting. On top of that, it's generally larger than the hardware setting on purpose, to allow backup of requests for merging. Fill a hole in struct request with a 'queue_depth' member, that drivers can call to more closely inform the block layer of the real queue depth. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 04 11月, 2016 1 次提交
-
-
由 Shaohua Li 提交于
Currently block plug holds up to 16 non-mergeable requests. This makes sense if the request size is small, eg, reduce lock contention. But if request size is big enough, we don't need to worry about lock contention. Holding such request makes no sense and it lows the disk utilization. In practice, this improves 10% throughput for my raid5 sequential write workload. The size (128k) is arbitrary right now, but it makes sure lock contention is small. This probably could be more intelligent, eg, check average request size holded. Since this is mainly for sequential IO, probably not worthy. V2: check the last request instead of the first request, so as long as there is one big size request we flush the plug. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 11月, 2016 8 次提交
-
-
由 Bart Van Assche 提交于
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations have finished. This function does *not* wait until all outstanding requests have finished (this means invocation of request.end_io()). The algorithm used by blk_mq_quiesce_queue() is as follows: * Hold either an RCU read lock or an SRCU read lock around .queue_rq() calls. The former is used if .queue_rq() does not block and the latter if .queue_rq() may block. * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues() followed by synchronize_srcu() or synchronize_rcu(). The latter call waits for .queue_rq() invocations that started before blk_mq_quiesce_queue() was called. * The blk_mq_hctx_stopped() calls that control whether or not .queue_rq() will be called are called with the (S)RCU read lock held. This is necessary to avoid race conditions against blk_mq_quiesce_queue(). Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
The function blk_queue_stopped() allows to test whether or not a traditional request queue has been stopped. Introduce a helper function that allows block drivers to query easily whether or not one or more hardware contexts of a blk-mq queue have been stopped. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Kent Overstreet 提交于
This is a helper that pins down a range from an iov_iter and adds it to a bio without requiring a separate memory allocation for the page array. It will be used for upcoming direct I/O implementations for block devices and iomap based file systems. Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com> [hch: ported to the iov_iter interface, renamed and added comments. All blame should be directed to me and all fame should go to Kent after this!] Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
If we're doing background type writes, then use the appropriate background write flags for that. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
由 Jens Axboe 提交于
Add wbc_to_write_flags(), which returns the write modifier flags to use, based on a struct writeback_control. No functional changes in this patch, but it prepares us for factoring other wbc fields for write type. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
由 Jens Axboe 提交于
This adds a new request flag, REQ_BACKGROUND, that callers can use to tell the block layer that this is background (non-urgent) IO. Signed-off-by: NJens Axboe <axboe@fb.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 01 11月, 2016 4 次提交
-
-
由 Christoph Hellwig 提交于
Now that we have a separate header for struct bio_vec there is absolutely no excuse for including this header from non-block I/O code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
The file only needs the struct bvec_iter delcaration, which is available from bvec.h. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Nothing in fs.h should require blk_types.h to be included. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-