- 25 6月, 2015 4 次提交
-
-
由 Ilya Dryomov 提交于
The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128) unnecessarily limits bio sizes to 512k (assuming 4k pages). rbd, being a virtual block device, doesn't have any restrictions on the number of physical segments, so bump max_segments to max_hw_sectors, in theory allowing a sector per segment (although the only case this matters that I can think of is some readv/writev style thing). In practice this is going to give us 1M bios - the number of segments in a bio is limited in bio_get_nr_vecs() by BIO_MAX_PAGES = 256. Note that this doesn't result in any improvement on a typical direct sequential test. This is because on a box with a not too badly fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice rbd object size sized requests. The only difference is the size of bios being merged - 512k vs 1M for something like $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
As part of unmap sequence, kernel client has to talk to the OSDs to teardown watch on the header object. If none of the OSDs are available it would hang forever, until interrupted by a signal - when that happens we follow through with the rest of unmap procedure (i.e. unregister the device and put all the data structures) and the unmap is still considired successful (rbd cli tool exits with 0). The watch on the userspace side should eventually timeout so that's fine. This isn't very nice, because various userspace tools (pacemaker rbd resource agent, for example) then have to worry about setting up their own timeouts. Timeout it with mount_timeout (60 seconds by default). Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NAlex Elder <elder@linaro.org> Reviewed-by: NSage Weil <sage@redhat.com>
-
由 Ilya Dryomov 提交于
There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Yan, Zheng 提交于
Signed-off-by: NYan, Zheng <zyan@redhat.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
- 02 5月, 2015 1 次提交
-
-
由 Ilya Dryomov 提交于
When we end I/O struct request with error, we need to pass obj_request->length as @nr_bytes so that the entire obj_request worth of bytes is completed. Otherwise block layer ends up confused and we trip on rbd_assert(more ^ (which == img_request->obj_request_count)); in rbd_img_obj_callback() due to more being true no matter what. We already do it in most cases but we are missing some, in particular those where we don't even get a chance to submit any obj_requests, due to an early -ENOMEM for example. A number of obj_request->xferred assignments seem to be redundant but I haven't touched any of obj_request->xferred stuff to keep this small and isolated. Cc: Alex Elder <elder@linaro.org> Cc: stable@vger.kernel.org # 3.10+ Reported-by: NShawn Edwards <lesser.evil@gmail.com> Reviewed-by: NSage Weil <sage@redhat.com> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 22 4月, 2015 1 次提交
-
-
由 Ilya Dryomov 提交于
After the switch to blk-mq rbd_wq processes requests, not devices. Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 20 4月, 2015 2 次提交
-
-
由 Ilya Dryomov 提交于
Set QUEUE_FLAG_NONROT. Following commit b277da0a ("block: disable entropy contributions for nonrot devices") we should also clear QUEUE_FLAG_ADD_RANDOM, but it's off by default for blk-mq drivers, so just note it in the comment. Also remove physical block size assignment - no sense in repeating defaults that are not going to change. Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 Ilya Dryomov 提交于
pr_info what exactly was the culprit: missing pool, image or snap. Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 19 2月, 2015 4 次提交
-
-
由 Christoph Hellwig 提交于
This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 120000 iops, although the only comparism available was an old 3.10 kernel which gave 80000iops. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAlex Elder <elder@linaro.org> [idryomov@gmail.com: context, blk_mq_init_queue() EH] Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 Ilya Dryomov 提交于
If the clone is resized down to 0, it becomes standalone. If such resize is carried over while an image is mapped we would detect this and call rbd_dev_parent_put() which means "let go of all parent state, including the spec(s) of parent images(s)". This leads to a mismatch between "rbd info" and sysfs parent fields, so a fix is in order. # rbd create --image-format 2 --size 1 foo # rbd snap create foo@snap # rbd snap protect foo@snap # rbd clone foo@snap bar # DEV=$(rbd map bar) # rbd resize --allow-shrink --size 0 bar # rbd resize --size 1 bar # rbd info bar | grep parent parent: rbd/foo@snap Before: # cat /sys/bus/rbd/devices/0/parent (no parent image) After: # cat /sys/bus/rbd/devices/0/parent pool_id 0 pool_name rbd image_id 10056b8b4567 image_name foo snap_id 2 snap_name snap overlap 0 Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Reviewed-by: NJosh Durgin <jdurgin@redhat.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
header_rwsem should be released on errors. Also remove useless rbd_dev->mapping.size != rbd_dev->header.image_size test. Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
-
由 Rickard Strandqvist 提交于
It's been largely superseded by dup_token() and unused for over 2 years, identified by cppcheck. Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se> [idryomov@redhat.com: changelog] Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
-
- 28 1月, 2015 2 次提交
-
-
由 Ilya Dryomov 提交于
This effectively reverts the last hunk of 392a9dad ("rbd: detect when clone image is flattened"). The problem with parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the "clone image now standalone" logic, opens up another window of opportunity to hit this, but even without it # cat parent-ref.sh #!/bin/bash rbd create --image-format 2 --size 1 foo rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd resize --allow-shrink --size 0 bar rbd resize --size 1 bar DEV=$(rbd map bar) rbd unmap $DEV leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around. My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning unconditional rbd_dev_parent_put() into a no-op. Fixes: http://tracker.ceph.com/issues/10352 Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Reviewed-by: NJosh Durgin <jdurgin@redhat.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
The comment for rbd_dev_parent_get() said * We must get the reference before checking for the overlap to * coordinate properly with zeroing the parent overlap in * rbd_dev_v2_parent_info() when an image gets flattened. We * drop it again if there is no overlap. but the "drop it again if there is no overlap" part was missing from the implementation. This lead to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented. Fix this by leveraging the fact that refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless. Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Reviewed-by: NJosh Durgin <jdurgin@redhat.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
- 18 12月, 2014 2 次提交
-
-
由 Ilya Dryomov 提交于
CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This sneaked in with discard patches - it's one of the three osd ops (the other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard is implemented with. Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 SF Markus Elfring 提交于
The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
-
- 30 10月, 2014 2 次提交
-
-
由 Jan Kara 提交于
When we fail to allocate page vector in rbd_obj_read_sync() we just basically ignore the problem and continue which will result in an oops later. Fix the problem by returning proper error. CC: Yehuda Sadeh <yehuda@inktank.com> CC: Sage Weil <sage@inktank.com> CC: ceph-devel@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226882 Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
-
由 Ilya Dryomov 提交于
Using one queue per device doesn't make much sense given that our workfn processes "devices" and not "requests". Switch to a single workqueue for all devices. Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Reviewed-by: NSage Weil <sage@redhat.com>
-
- 15 10月, 2014 14 次提交
-
-
由 Ilya Dryomov 提交于
Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups under memory pressure - we sit on the memory reclaim path. Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16 Signed-off-by: NIlya Dryomov <idryomov@redhat.com> Tested-by: NMicha Krause <micha@krausam.de> Reviewed-by: NSage Weil <sage@redhat.com>
-
由 Josh Durgin 提交于
max_discard_sectors must be set for the queue to support discard. Operations implementing discard for rbd zero data, so report that. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
Only allocate two osd ops for discard requests, since the preallocation hint is only added for regular writes. Use rbd_img_obj_request_fill() to recreate the original write or discard osd operations, isolating that logic to one place, and change the assert in rbd_osd_req_create_copyup() to accept discard requests as well. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
rbd_img_request_fill() creates a ceph_osd_request and has logic for adding the appropriate osd ops to it based on the request type and image properties. For layered images, the original rbd_obj_request is resent with a copyup operation in front, using a new ceph_osd_request. The logic for adding the original operations should be the same as when first sending them, so move it to a helper function. op_type only needs to be checked once, so create a helper for that as well and call it outside the loop in rbd_img_request_fill(). Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
Discard requests are a form of write, so they should go through the same process as plain write requests and trigger copy-on-write for layered images. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
Discard may try to delete an object from a non-layered image that does not exist. If this occurs, the image already has no data in that range, so change the result to success. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
Discards take a reference to the snapshot context of an image when they are created. This reference needs to be cleaned up when the request is done just as it is for regular writes. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Josh Durgin 提交于
In rbd_img_request_fill() the image size is only checked to determine whether we can truncate an object instead of zeroing it for discard requests. Take rbd_dev->header_rwsem while reading the image size, and move this read into the discard check, so that non-discard ops don't need to take the semaphore in this function. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Guangliang Zhao 提交于
This patch add the discard support for rbd driver. There are three types operation in the driver: 1. The objects would be removed if they completely contained within the discard range. 2. The objects would be truncated if they partly contained within the discard range, and align with their boundary. 3. Others would be zeroed. A discard request from blkdev_issue_discard() is defined which REQ_WRITE and REQ_DISCARD both marked and no data, so we must check the REQ_DISCARD first when getting the request type. This resolve: http://tracker.ceph.com/issues/190 [ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up commits by Josh Durgin for refinements and fixes which weren't folded in to preserve authorship. ] Signed-off-by: NGuangliang Zhao <lucienchao@gmail.com> Reviewed-by: NJosh Durgin <josh.durgin@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Guangliang Zhao 提交于
It could only handle the read and write operations now, extend it for the coming discard support. Signed-off-by: NGuangliang Zhao <lucienchao@gmail.com> Reviewed-by: NJosh Durgin <josh.durgin@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Guangliang Zhao 提交于
It need to copyup the parent's content when layered writing, but an entire object write would overwrite it, so skip it. Signed-off-by: NGuangliang Zhao <lucienchao@gmail.com> Reviewed-by: NJosh Durgin <josh.durgin@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
To clarify the conditions and make it easier to add new ones. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com>
-
由 Josh Durgin 提交于
These fields may both change while the image is mapped if a snapshot is created or deleted or the image is resized. They are guarded by rbd_dev->header_rwsem, so hold that while reading them, and store a local copy to refer to outside of the critical section. The local copy will stay consistent since the snapshot context is reference counted, and the mapping size is just a u64. This prevents torn loads from giving us inconsistent values. Move reading header.snapc into the caller of rbd_img_request_create() so that we only need to take the semaphore once. The read-only caller, rbd_parent_request_create() can just pass NULL for snapc, since the snapshot context is only relevant for writes. Signed-off-by: NJosh Durgin <josh.durgin@inktank.com>
-
由 Ilya Dryomov 提交于
Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL sink, thus propagating rbd::get_id cls method errors. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
- 10 9月, 2014 2 次提交
-
-
由 Wei Yongjun 提交于
Fix to return -ENOMEM from the workqueue alloc error handling case instead of 0, as done elsewhere in this function. Reviewed-by: NAlex Elder <elder@linaro.org> Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
-
由 Ilya Dryomov 提交于
drivers/block/rbd.c: In function ‘rbd_dev_device_setup’: drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security] Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com>
-
- 07 8月, 2014 3 次提交
-
-
由 Ilya Dryomov 提交于
rbd_warn() string should be a single line - rbd_warn() appends \n. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com>
-
由 Ilya Dryomov 提交于
Now that rbd_img_request_create() is called from work functions, no need to use GFP_ATOMIC. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
While it was never a good idea to sleep in request_fn(), commit 34c6bc2c ("locking/mutexes: Add extra reschedule point") made it a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting task on the mutex wait queue, which for tasks in !TASK_RUNNING state means block forever. request_fn() may be called with !TASK_RUNNING on the way to schedule() in io_schedule(). Offload request handling to a workqueue, one per rbd device, to avoid calling blocking primitives from rbd_request_fn(). Fixes: http://tracker.ceph.com/issues/8818 Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15 Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Tested-by: NEric Eastman <eric0e@aol.com> Tested-by: NGreg Wilson <greg.wilson@keepertech.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
- 25 7月, 2014 3 次提交
-
-
由 Ilya Dryomov 提交于
If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null echo "Base image" dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo "Snapshot" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd rbd resize --allow-shrink --size 4 bar echo "Snapshot after base image resize" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd # ./overlap-snap.sh Base image 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Snapshot 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........ Even though bar@snap is taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
Currently rbd_dev_v2_header_info() reads in parent info before the snap context is read in. This is wrong, because we may need to look at the the parent_overlap value of the snapshot instead of that of the base image, for example when mapping a snapshot - see next commit. (When mapping a snapshot, all we got is its name and we need the snap context to translate that name into an id to know which parent info to look for.) The approach taken here is to make sure rbd_dev_v2_parent_info() is called after the snap context has been read in. The other approach would be to add a parent_overlap field to struct rbd_mapping and maintain it the same way rbd_mapping::size is maintained. The reason I chose the first approach is that the value of keeping around both base image values and the actual mapping values is unclear to me. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-
由 Ilya Dryomov 提交于
There is no sense in trying to update the mapping size before it's even been set. Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: NAlex Elder <elder@linaro.org>
-