Commits · 57e95460f0b360c4d29c0320922b46b55dd2b79f · openeuler / Kernel

19 Feb, 2015 4 commits

Committed by Christoph Hellwig on Jan 13, 2015

This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparison available was an old
3.10 kernel which gave 80000 iops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: context, blk_mq_init_queue() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

7ad18afa

rbd: do not treat standalone as flatten · cf32bd9c

Committed by Ilya Dryomov on Jan 19, 2015

If the clone is resized down to 0, it becomes standalone.  If such
resize is carried over while an image is mapped we would detect this
and call rbd_dev_parent_put() which means "let go of all parent state,
including the spec(s) of parent image(s)".  This leads to a mismatch
between "rbd info" and sysfs parent fields, so a fix is in order.

    # rbd create --image-format 2 --size 1 foo
    # rbd snap create foo@snap
    # rbd snap protect foo@snap
    # rbd clone foo@snap bar
    # DEV=$(rbd map bar)
    # rbd resize --allow-shrink --size 0 bar
    # rbd resize --size 1 bar
    # rbd info bar | grep parent
            parent: rbd/foo@snap

Before:

    # cat /sys/bus/rbd/devices/0/parent
    (no parent image)

After:

    # cat /sys/bus/rbd/devices/0/parent
    pool_id 0
    pool_name rbd
    image_id 10056b8b4567
    image_name foo
    snap_id 2
    snap_name snap
    overlap 0
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>

cf32bd9c

rbd: fix error paths in rbd_dev_refresh() · 73e39e4d

Committed by Ilya Dryomov on Jan 08, 2015

header_rwsem should be released on errors.  Also remove useless
rbd_dev->mapping.size != rbd_dev->header.image_size test.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>

73e39e4d

rbd: nuke copy_token() · 3a25cf43

Committed by Rickard Strandqvist on Jan 01, 2015

It's been largely superseded by dup_token() and unused for over
2 years, identified by cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[idryomov@redhat.com: changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>

3a25cf43

28 Jan, 2015 2 commits

rbd: drop parent_ref in rbd_dev_unprobe() unconditionally · e69b8d41

Committed by Ilya Dryomov on Jan 19, 2015

This effectively reverts the last hunk of 392a9dad ("rbd: detect
when clone image is flattened").

The problem with parent_overlap != 0 condition is that it's possible
and completely valid to have an image with parent_overlap == 0 whose
parent state needs to be cleaned up on unmap.  The next commit, which
drops the "clone image now standalone" logic, opens up another window
of opportunity to hit this, but even without it

    # cat parent-ref.sh
    #!/bin/bash
    rbd create --image-format 2 --size 1 foo
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd resize --allow-shrink --size 0 bar
    rbd resize --size 1 bar
    DEV=$(rbd map bar)
    rbd unmap $DEV

leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
hanging around.

My thinking behind calling rbd_dev_parent_put() unconditionally is that
there shouldn't be any requests in flight at that point in time as we
are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
by flatten is delayed by in-flight requests, it will have finished by
the time we reach rbd_dev_unprobe() caused by unmap, thus turning
unconditional rbd_dev_parent_put() into a no-op.

Fixes: http://tracker.ceph.com/issues/10352

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>

e69b8d41

rbd: fix rbd_dev_parent_get() when parent_overlap == 0 · ae43e9d0

Committed by Ilya Dryomov on Jan 19, 2015

The comment for rbd_dev_parent_get() said

    * We must get the reference before checking for the overlap to
    * coordinate properly with zeroing the parent overlap in
    * rbd_dev_v2_parent_info() when an image gets flattened.  We
    * drop it again if there is no overlap.

but the "drop it again if there is no overlap" part was missing from
the implementation.  This led to absurd parent_ref values for images
with parent_overlap == 0, as parent_ref was incremented for each
img_request and virtually never decremented.

Fix this by leveraging the fact that refresh path calls
rbd_dev_v2_parent_info() under header_rwsem and use it for read in
rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
they'd pair with now and I suspect we are in a pretty miserable
situation as far as proper locking goes regardless.

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>

ae43e9d0

18 Dec, 2014 2 commits

rbd: don't treat CEPH_OSD_OP_DELETE as extent op · 7e868b6e

Committed by Ilya Dryomov on Nov 21, 2014

CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such.  This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>

7e868b6e

ceph, rbd: delete unnecessary checks before two function calls · e96a650a

Committed by SF Markus Elfring on Nov 02, 2014

The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.
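
As a generic illustration of the pattern (not the ceph or VFS code itself), here is a small sketch with a hypothetical reference-counted type whose put helper handles NULL on its own, so the caller needs no guard:

```c
#include <stddef.h>

/* Hypothetical reference-counted object, for illustration only. */
struct ctx_sketch {
    int refs;
};

/* Like the functions named above, the put helper tolerates NULL itself. */
static void ctx_put_sketch(struct ctx_sketch *c)
{
    if (c == NULL)
        return;
    c->refs--;
}

/* Caller: no "if (c)" test is needed around the call. */
static void teardown_sketch(struct ctx_sketch *c)
{
    ctx_put_sketch(c);
}
```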

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>

e96a650a

30 Oct, 2014 2 commits

rbd: Fix error recovery in rbd_obj_read_sync() · a8d42056

Committed by Jan Kara on Oct 22, 2014

When we fail to allocate page vector in rbd_obj_read_sync() we just
basically ignore the problem and continue which will result in an oops
later. Fix the problem by returning proper error.

CC: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@inktank.com>
CC: ceph-devel@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226882
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>

a8d42056

rbd: use a single workqueue for all devices · f5ee37bd

Committed by Ilya Dryomov on Oct 09, 2014

Using one queue per device doesn't make much sense given that our
workfn processes "devices" and not "requests".  Switch to a single
workqueue for all devices.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

f5ee37bd

15 Oct, 2014 14 commits

rbd: rbd workqueues need a rescue worker · 792c3a91

Committed by Ilya Dryomov on Oct 10, 2014

Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.

Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>

792c3a91

rbd: set the remaining discard properties to enable support · b76f8239

Committed by Josh Durgin on Apr 07, 2014

max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

b76f8239

rbd: use helpers to handle discard for layered images correctly · d3246fb0

Committed by Josh Durgin on Apr 07, 2014

Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes.  Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

d3246fb0

rbd: extract a method for adding object operations · 3b434a2a

Committed by Josh Durgin on Apr 04, 2014

rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.

For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.

op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

3b434a2a

rbd: make discard trigger copy-on-write · 1c220881

Committed by Josh Durgin on Apr 04, 2014

Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

1c220881

rbd: tolerate -ENOENT for discard operations · d0265de7

Committed by Josh Durgin on Apr 07, 2014

Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

d0265de7

rbd: fix snapshot context reference count for discards · bef95455

Committed by Josh Durgin on Apr 04, 2014

Discards take a reference to the snapshot context of an image when
they are created.  This reference needs to be cleaned up when the
request is done just as it is for regular writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

bef95455

rbd: read image size for discard check safely · 3c5df893

Committed by Josh Durgin on Apr 04, 2014

In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

3c5df893

rbd: initial discard bits from Guangliang Zhao · 90e98c52

Committed by Guangliang Zhao on Apr 01, 2014

This patch adds discard support to the rbd driver.

There are three types of operation in the driver:
1. Objects are removed if they are completely contained within the
   discard range.
2. Objects are truncated if they are partly contained within the
   discard range and the range is aligned with their boundary.
3. All other objects are zeroed.
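
The three cases above can be sketched as a small classifier. This is an illustration only: the enum and function names are hypothetical, and "aligned with their boundary" is assumed here to mean that the discarded range runs to the end of the object.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; these are not the driver's identifiers. */
enum discard_action {
    DISCARD_DELETE,    /* case 1: remove the object */
    DISCARD_TRUNCATE,  /* case 2: cut the object's tail */
    DISCARD_ZERO       /* case 3: zero the range */
};

/*
 * Classify the part of a discard that falls inside one object.
 * off and len describe the discarded bytes relative to the object start.
 */
static enum discard_action classify_discard(uint64_t obj_size,
                                            uint64_t off, uint64_t len)
{
    /* The range must be non-empty and lie within the object. */
    assert(len > 0 && off < obj_size && len <= obj_size - off);

    if (off == 0 && len == obj_size)
        return DISCARD_DELETE;
    if (off + len == obj_size)
        return DISCARD_TRUNCATE;
    return DISCARD_ZERO;
}
```

With a 4 MiB object, discarding the whole object maps to a delete, discarding its last 3 MiB maps to a truncate, and discarding its first 1 MiB maps to zeroing.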

A discard request from blkdev_issue_discard() has both REQ_WRITE and
REQ_DISCARD set and carries no data, so we must check REQ_DISCARD
first when determining the request type.

This resolves:
	http://tracker.ceph.com/issues/190

[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
  commits by Josh Durgin for refinements and fixes which weren't
  folded in to preserve authorship. ]
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

90e98c52

rbd: extend the operation type · 6d2940c8

Committed by Guangliang Zhao on Mar 13, 2014

Only read and write operations can be handled at the moment;
extend the operation type for the upcoming discard support.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

6d2940c8

rbd: skip the copyup when writing an entire object · c622d226

Committed by Guangliang Zhao on Apr 01, 2014

A layered write needs to copy up the parent's content first, but a
write covering an entire object would overwrite it anyway, so skip
the copyup in that case.
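
A minimal sketch of the decision described above, using a hypothetical helper name. It only models the full-object shortcut; other conditions a real driver would likely check (such as the parent overlap or whether the target object already exists) are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a write to one object of a layered image has to copy
 * the parent's data up first. off and len are relative to the object.
 */
static bool write_needs_copyup(bool layered, uint64_t obj_size,
                               uint64_t off, uint64_t len)
{
    /* The write must be non-empty and lie within the object. */
    assert(len > 0 && off < obj_size && len <= obj_size - off);

    if (!layered)
        return false;              /* no parent, nothing to copy up */
    if (off == 0 && len == obj_size)
        return false;              /* whole object is overwritten anyway */
    return true;
}
```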
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

c622d226

rbd: add img_obj_request_simple() helper · 70d045f6

Committed by Ilya Dryomov on Sep 12, 2014

To clarify the conditions and make it easier to add new ones.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

70d045f6

rbd: access snapshot context and mapping size safely · 4e752f0a

Committed by Josh Durgin on Apr 08, 2014

These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.

Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

4e752f0a

rbd: do not return -ERANGE on auth failures · 7dd440c9

Committed by Ilya Dryomov on Sep 11, 2014

Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors.  (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).
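
The following is a generic illustration of how an unsigned vs signed mix-up can turn a real error into -ERANGE. It is not the rbd or libceph code; the function names and the shape of the check are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

/*
 * ret is what a callee returned: a length on success or a negative
 * errno on failure. The buggy variant converts it to an unsigned type
 * before checking, so -EPERM looks like an enormous length.
 */
static int check_reply_buggy(int ret, size_t max_len)
{
    size_t len = (size_t)ret;      /* BUG: negative errno wraps around */

    if (len > max_len)
        return -ERANGE;            /* the real error is masked */
    return 0;
}

/* The fixed variant propagates the callee's error before any range check. */
static int check_reply_fixed(int ret, size_t max_len)
{
    if (ret < 0)
        return ret;
    if ((size_t)ret > max_len)
        return -ERANGE;
    return 0;
}
```

With a permission failure from the callee, the buggy variant reports -ERANGE while the fixed one reports -EPERM.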
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

7dd440c9

10 Sep, 2014 2 commits

rbd: fix error return code in rbd_dev_device_setup() · 255939e7

Committed by Wei Yongjun on Aug 13, 2014

Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

255939e7

rbd: avoid format-security warning inside alloc_workqueue() · 58d1362b

Committed by Ilya Dryomov on Aug 12, 2014

drivers/block/rbd.c: In function ‘rbd_dev_device_setup’:
drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security]
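
This warning fires whenever a non-literal string is passed as the format argument of a printf-style function. A minimal sketch of the usual remedy follows, with snprintf() standing in for any such function; make_label_sketch is a hypothetical helper and is not taken from the commit.

```c
#include <stdio.h>

/*
 * Copy a caller-supplied name into buf.
 * Passing name directly as the format, as in snprintf(buf, size, name),
 * would trigger -Wformat-security and would interpret any '%' in it.
 * A constant "%s" format treats the name as plain data instead.
 */
static int make_label_sketch(char *buf, size_t size, const char *name)
{
    return snprintf(buf, size, "%s", name);
}
```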
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

58d1362b

07 Aug, 2014 3 commits

rbd: remove extra newlines from rbd_warn() messages · 9584d508

Committed by Ilya Dryomov on Jul 11, 2014

rbd_warn() string should be a single line - rbd_warn() appends \n.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

9584d508

rbd: allocate img_request with GFP_NOIO instead of GFP_ATOMIC · 7a716aac

Committed by Ilya Dryomov on Aug 05, 2014

Now that rbd_img_request_create() is called from work functions, no
need to use GFP_ATOMIC.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

7a716aac

rbd: rework rbd_request_fn() · bc1ecc65

Committed by Ilya Dryomov on Aug 04, 2014

While it was never a good idea to sleep in request_fn(), commit
34c6bc2c ("locking/mutexes: Add extra reschedule point") made it
a *bad* idea.  mutex_lock() since 3.15 may reschedule *before* putting
task on the mutex wait queue, which for tasks in !TASK_RUNNING state
means block forever.  request_fn() may be called with !TASK_RUNNING on
the way to schedule() in io_schedule().

Offload request handling to a workqueue, one per rbd device, to avoid
calling blocking primitives from rbd_request_fn().

Fixes: http://tracker.ceph.com/issues/8818

Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Tested-by: Eric Eastman <eric0e@aol.com>
Tested-by: Greg Wilson <greg.wilson@keepertech.com>
Reviewed-by: Alex Elder <elder@linaro.org>

bc1ecc65

25 Jul, 2014 8 commits

rbd: take snap_id into account when reading in parent info · 4d9b67cd

Committed by Ilya Dryomov on Jul 24, 2014

If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image.  Not doing so may
in particular result in us returning zeros instead of user data:

    # cat overlap-snap.sh
    #!/bin/bash
    rbd create --size 10 --image-format 2 foo
    FOO_DEV=$(rbd map foo)
    dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
    echo "Base image"
    dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd snap create bar@snap
    BAR_DEV=$(rbd map bar@snap)
    echo "Snapshot"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd resize --allow-shrink --size 4 bar
    echo "Snapshot after base image resize"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd

    # ./overlap-snap.sh
    Base image
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
    Snapshot
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
    Resizing image: 100% complete...done.
    Snapshot after base image resize
    0000000: e781 e33b d34b 2225 0000 0000 0000 0000  ...;.K"%........

Even though bar@snap is taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes.  Fix it.
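
The effect of the overlap value on the read in the script can be sketched as follows. This is a simplified model with a hypothetical function name: it assumes the clone holds no data of its own for the range, so bytes below the overlap come from the parent and bytes at or beyond it read as zeros.

```c
#include <stdint.h>

/*
 * For a read of len bytes at offset off, return how many leading bytes
 * are served from the parent image given the parent overlap in bytes.
 * The remaining bytes of the read come back as zeros in this model.
 */
static uint64_t parent_backed_bytes(uint64_t off, uint64_t len,
                                    uint64_t overlap)
{
    uint64_t avail;

    if (off >= overlap)
        return 0;
    avail = overlap - off;
    return len < avail ? len : avail;
}
```

For the 16-byte read starting 8 bytes before the 4M mark, an overlap of 8M leaves all 16 bytes backed by the parent, while an overlap of 4M leaves only the first 8, which matches the zeroed second half in the last xxd line.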
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

4d9b67cd

rbd: do not read in parent info before snap context · e8f59b59

Committed by Ilya Dryomov on Jul 24, 2014

Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in. This is wrong, because we may need to look at the
parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit. (When
mapping a snapshot, all we have is its name and we need the snap context
to translate that name into an id to know which parent info to look
for.)

The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in. The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained. The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

e8f59b59

rbd: update mapping size only on refresh · 5ff1108c