提交 · d3767f0faeda5abdf205f947ae912d48dc70fa06 · openanolis / cloud-kernel

28 4月, 2016 2 次提交

rbd: report unsupported features to syslog · d3767f0f

由 Ilya Dryomov 提交于 4月 13, 2016

... instead of just returning an error.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>

d3767f0f

rbd: fix rbd map vs notify races · 811c6688

由 Ilya Dryomov 提交于 4月 15, 2016

A while ago, commit 9875201e ("rbd: fix use-after free of
rbd_dev->disk") fixed rbd unmap vs notify race by introducing
an exported wrapper for flushing notifies and sticking it into
do_rbd_remove().

A similar problem exists on the rbd map path, though: the watch is
registered in rbd_dev_image_probe(), while the disk is set up quite
a few steps later, in rbd_dev_device_setup().  Nothing prevents
a notify from coming in and crashing on a NULL rbd_dev->disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    Call Trace:
     [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd]
     [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph]
     [<ffffffff8109d5db>] process_one_work+0x17b/0x470
     [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
     [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
     [<ffffffff810a5acf>] kthread+0xcf/0xe0
     [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
     [<ffffffff81645dd8>] ret_from_fork+0x58/0x90
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
    RIP  [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd]

If an error occurs during rbd map, we have to error out, potentially
tearing down a watch.  Just like on rbd unmap, notifies have to be
flushed, otherwise rbd_watch_cb() may end up trying to read in the
image header after rbd_dev_image_release() has run:

    Assertion failure in rbd_dev_header_info() at line 4722:

     rbd_assert(rbd_image_format_valid(rbd_dev->image_format));

    Call Trace:
     [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150
     [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390
     [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290
     [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0
     [<ffffffff81107799>] process_one_work+0x689/0x1a80
     [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80
     [<ffffffff81132065>] ? finish_task_switch+0x225/0x640
     [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
     [<ffffffff81108c69>] worker_thread+0xd9/0x1320
     [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80
     [<ffffffff8111b02d>] kthread+0x21d/0x2e0
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
     [<ffffffff82022802>] ret_from_fork+0x22/0x40
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
    RIP  [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30

To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
revalidate_disk(), b) move ceph_osdc_flush_notifies() call into
rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
header read-in into a critical section.  The latter also happens to
take care of rbd map foo@bar vs rbd snap rm foo@bar race.

Fixes: http://tracker.ceph.com/issues/15490Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>

811c6688

06 4月, 2016 1 次提交

rbd: use GFP_NOIO consistently for request allocations · 2224d879

由 David Disseldorp 提交于 4月 05, 2016

As of 5a60e876, RBD object request
allocations are made via rbd_obj_request_create() with GFP_NOIO.
However, subsequent OSD request allocations in rbd_osd_req_create*()
use GFP_ATOMIC.

With heavy page cache usage (e.g. OSDs running on same host as krbd
client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been
observed to fail, where direct reclaim would have allowed GFP_NOIO
allocations to succeed.

Cc: stable@vger.kernel.org # 3.18+
Suggested-by: NVlastimil Babka <vbabka@suse.cz>
Suggested-by: NNeil Brown <neilb@suse.com>
Signed-off-by: NDavid Disseldorp <ddiss@suse.de>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

2224d879

26 3月, 2016 3 次提交

rbd: use KMEM_CACHE macro · 03d94406

由 Geliang Tang 提交于 3月 13, 2016

Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: NGeliang Tang <geliangtang@163.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

03d94406

libceph: enable large, variable-sized OSD requests · 3f1af42a

由 Ilya Dryomov 提交于 2月 09, 2016

Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

3f1af42a

libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op · 7665d85b

由 Yan, Zheng 提交于 1月 07, 2016

This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.
Signed-off-by: NYan, Zheng <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

7665d85b

22 1月, 2016 1 次提交

rbd: delete an unnecessary check before rbd_dev_destroy() · 1761b229

由 Markus Elfring 提交于 11月 23, 2015

The rbd_dev_destroy() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

1761b229

04 12月, 2015 1 次提交

rbd: don't put snap_context twice in rbd_queue_workfn() · 70b16db8

由 Ilya Dryomov 提交于 11月 27, 2015

Commit 4e752f0a ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn().  However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put.  Fix it.

Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>

70b16db8

03 11月, 2015 5 次提交

rbd: remove duplicate calls to rbd_dev_mapping_clear() · 4afb04c0

由 Ilya Dryomov 提交于 10月 22, 2015

Commit d1cf5788 ("rbd: set mapping info earlier") defined
rbd_dev_mapping_clear(), but, just a few days after, commit
f35a4dee ("rbd: set the mapping size and features later") moved
rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear()
call instead of moving the old one. Around the same time, another
duplicate was introduced in rbd_dev_device_release() - kill both.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

4afb04c0

rbd: set device_type::release instead of device::release · 6cac4695

由 Ilya Dryomov 提交于 10月 16, 2015

No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

6cac4695

rbd: don't free rbd_dev outside of the release callback · dd5ac32d

由 Ilya Dryomov 提交于 10月 16, 2015

struct rbd_device has struct device embedded in it, which means it's
part of kobject universe and has an unpredictable life cycle.  Freeing
its memory outside of the release callback is flawed, yet commits
200a6a8b ("rbd: don't destroy rbd_dev in device release function")
and 8ad42cd0 ("rbd: don't have device release destroy rbd_dev")
moved rbd_dev_destroy() out to rbd_dev_image_release().

This commit reverts most of that, the key points are:

- rbd_dev->dev is initialized in rbd_dev_create(), making it possible
  to use rbd_dev_destroy() - which is just a put_device() - both before
  we register with device core and after.

- rbd_dev_release() (the release callback) is the only place we
  kfree(rbd_dev).  It's also where we do module_put(), keeping the
  module unload race window as small as possible.

- We pin the module in rbd_dev_create(), but only for mapping
  rbd_dev-s.  Moving image related stuff out of struct rbd_device into
  another struct which isn't tied with sysfs and device core is long
  overdue, but until that happens, this will keep rbd module refcount
  (which users can observe with lsmod) sane.

Fixes: http://tracker.ceph.com/issues/12697

Cc: Alex Elder <elder@linaro.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

dd5ac32d

rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails · b51c83c2

由 Ilya Dryomov 提交于 10月 15, 2015

Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write.  Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

b51c83c2

rbd: drop null test before destroy functions · 13bf2834

由 Julia Lawall 提交于 9月 13, 2015

Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL) {
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
  x = NULL;
-}
// </smpl>
Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

13bf2834

31 10月, 2015 1 次提交

rbd: require stable pages if message data CRCs are enabled · bae818ee

由 Ronny Hegewald 提交于 10月 15, 2015

rbd requires stable pages, as it performs a crc of the page data before
they are send to the OSDs.

But since kernel 3.9 (patch 1d1d1a76
"mm: only enforce stable page writes if the backing device requires
it") it is not assumed anymore that block devices require stable pages.

This patch sets the necessary flag to get stable pages back for rbd.

In a ceph installation that provides multiple ext4 formatted rbd
devices "bad crc" messages appeared regularly (ca 1 message every 1-2
minutes on every OSD that provided the data for the rbd) in the
OSD-logs before this patch. After this patch this messages are pretty
much gone (only ca 1-2 / month / OSD).

Cc: stable@vger.kernel.org # 3.9+, needs backporting
Signed-off-by: NRonny Hegewald <Ronny.Hegewald@online.de>
[idryomov@gmail.com: require stable pages only in crc case, changelog]
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

bae818ee

24 10月, 2015 2 次提交

rbd: prevent kernel stack blow up on rbd map · 6d69bb53

由 Ilya Dryomov 提交于 10月 11, 2015

Mapping an image with a long parent chain (e.g. image foo, whose parent
is bar, whose parent is baz, etc) currently leads to a kernel stack
overflow, due to the following recursion in the reply path:

  rbd_osd_req_callback()
    rbd_obj_request_complete()
      rbd_img_obj_callback()
        rbd_img_parent_read_callback()
          rbd_obj_request_complete()
            ...

Limit the parent chain to 16 images, which is ~5K worth of stack.  When
the above recursion is eliminated, this limit can be lifted.

Fixes: http://tracker.ceph.com/issues/12538

Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>

6d69bb53

rbd: don't leak parent_spec in rbd_dev_probe_parent() · 1f2c6651

由 Ilya Dryomov 提交于 10月 11, 2015

Currently we leak parent_spec and trigger a "parent reference
underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
The problem is we take the !parent out_err branch and that only drops
refcounts; parent_spec that would've been freed had we called
rbd_dev_unparent() remains and triggers rbd_warn() in
rbd_dev_parent_put() - at that point we have parent_spec != NULL and
parent_ref == 0, so counter ends up being -1 after the decrement.

Redo rbd_dev_probe_parent() to fix this.

Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

1f2c6651

16 10月, 2015 2 次提交

rbd: use writefull op for object size writes · e30b7577

由 Ilya Dryomov 提交于 10月 07, 2015

This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.

Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs.  All other sites were updated.

Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

e30b7577

rbd: set max_sectors explicitly · 0d9fde4f

由 Ilya Dryomov 提交于 10月 07, 2015

Commit 30e2bc08 ("Revert "block: remove artifical max_hw_sectors
cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
of 1024, but it's not good enough: we set max_hw_sectors to rbd object
size because we don't want object sized I/Os to be split, and the
default object size is 4M.

So, set max_sectors to max_hw_sectors in rbd at queue init time.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

0d9fde4f

09 9月, 2015 2 次提交

rbd: plug rbd_dev->header.object_prefix memory leak · d194cd1d

由 Ilya Dryomov 提交于 8月 31, 2015

Need to free object_prefix when rbd_dev_v2_snap_context() fails, but
only if this is the first time we are reading in the header.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

d194cd1d

rbd: fix double free on rbd_dev->header_name · 3ebe138a

由 Ilya Dryomov 提交于 8月 31, 2015

If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
is freed twice: once in rbd_dev_probe_parent() and then in its caller
rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
handle parent images).

rbd_dev_probe_parent() is responsible for probing the parent, so it
shouldn't muck with clone's fields.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

3ebe138a

14 8月, 2015 1 次提交

block: kill merge_bvec_fn() completely · 8ae12666

由 Kent Overstreet 提交于 4月 27, 2015

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: NDongsu Park <dpark@posteo.net>
Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

8ae12666

31 7月, 2015 1 次提交

rbd: fix copyup completion race · 2761713d

由 Ilya Dryomov 提交于 7月 16, 2015

For write/discard obj_requests that involved a copyup method call, the
opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
rbd_img_obj_copyup_callback().  The latter frees copyup pages, sets
->xferred and delegates to rbd_img_obj_callback(), the "normal" image
object callback, for reporting to block layer and putting refs.

rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
which means obj_request is marked done in rbd_osd_trivial_callback(),
*before* ->callback is invoked and rbd_img_obj_copyup_callback() has
a chance to run.  Marking obj_request done essentially means giving
rbd_img_obj_callback() a license to end it at any moment, so if another
obj_request from the same img_request is being completed concurrently,
rbd_img_obj_end_request() may very well be called on such prematurally
marked done request:

<obj_request-1/2 reply>
handle_reply()
  rbd_osd_req_callback()
    rbd_osd_trivial_callback()
    rbd_obj_request_complete()
    rbd_img_obj_copyup_callback()
    rbd_img_obj_callback()
                                    <obj_request-2/2 reply>
                                    handle_reply()
                                      rbd_osd_req_callback()
                                        rbd_osd_trivial_callback()
      for_each_obj_request(obj_request->img_request) {
        rbd_img_obj_end_request(obj_request-1/2)
        rbd_img_obj_end_request(obj_request-2/2) <--
      }

Calling rbd_img_obj_end_request() on such a request leads to trouble,
in particular because its ->xfferred is 0.  We report 0 to the block
layer with blk_update_request(), get back 1 for "this request has more
data in flight" and then trip on

    rbd_assert(more ^ (which == img_request->obj_request_count));

with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
been called for both requests and lhs (more) being 1 because we haven't
got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet.

To fix this, leverage that rbd wants to call class methods in only two
cases: one is a generic method call wrapper (obj_request is standalone)
and the other is a copyup (obj_request is part of an img_request).  So
make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
rbd_img_obj_copyup_callback() from it if obj_request is part of an
img_request, similar to how CEPH_OSD_OP_READ handler invokes
rbd_img_obj_request_read_callback().

Since rbd_img_obj_copyup_callback() is now being called from the OSD
request callback (only), it is renamed to rbd_osd_copyup_callback().

Cc: Alex Elder <elder@linaro.org>
Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

2761713d

17 7月, 2015 1 次提交

block: have drivers use blk_queue_max_discard_sectors() · 2bb4cd5c

由 Jens Axboe 提交于 7月 14, 2015

Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2bb4cd5c

01 7月, 2015 1 次提交

rbd: use GFP_NOIO in rbd_obj_request_create() · 5a60e876

由 Ilya Dryomov 提交于 6月 24, 2015

rbd_obj_request_create() is called on the main I/O path, so we need to
use GFP_NOIO to make sure allocation doesn't blow back on us.  Not all
callers need this, but I'm still hardcoding the flag inside rather than
making it a parameter because a) this is going to stable, and b) those
callers shouldn't really use rbd_obj_request_create() and will be fixed
in the future.

More memory allocation fixes will follow.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

5a60e876

25 6月, 2015 7 次提交

rbd: queue_depth map option · b5584180

由 Ilya Dryomov 提交于 6月 23, 2015

nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much
irrelevant in blk-mq case because each driver sets its own max depth
that it can handle and that's the number of tags that gets preallocated
on setup.  Users can't increase queue depth beyond that value via
writing to nr_requests.

For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most
cases but we want to give users the opportunity to increase it.
Introduce a new per-device queue_depth option to do just that:

    $ sudo rbd map -o queue_depth=1024 ...
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

b5584180

rbd: store rbd_options in rbd_device · d147543d

由 Ilya Dryomov 提交于 6月 22, 2015

Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

d147543d

rbd: terminate rbd_opts_tokens with Opt_err · 210c104c

由 Ilya Dryomov 提交于 6月 22, 2015

Also nuke useless Opt_last_bool and don't break lines unnecessarily.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

210c104c

rbd: bump queue_max_segments · d3834fef

由 Ilya Dryomov 提交于 6月 12, 2015

The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128)
unnecessarily limits bio sizes to 512k (assuming 4k pages).  rbd, being
a virtual block device, doesn't have any restrictions on the number of
physical segments, so bump max_segments to max_hw_sectors, in theory
allowing a sector per segment (although the only case this matters that
I can think of is some readv/writev style thing).  In practice this is
going to give us 1M bios - the number of segments in a bio is limited
in bio_get_nr_vecs() by BIO_MAX_PAGES = 256.

Note that this doesn't result in any improvement on a typical direct
sequential test.  This is because on a box with a not too badly
fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice
rbd object size sized requests.  The only difference is the size of
bios being merged - 512k vs 1M for something like

    $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE
    $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

d3834fef

rbd: timeout watch teardown on unmap with mount_timeout · 2894e1d7

由 Ilya Dryomov 提交于 5月 12, 2015

As part of unmap sequence, kernel client has to talk to the OSDs to
teardown watch on the header object. If none of the OSDs are available
it would hang forever, until interrupted by a signal - when that
happens we follow through with the rest of unmap procedure (i.e.
unregister the device and put all the data structures) and the unmap is
still considired successful (rbd cli tool exits with 0). The watch on
the userspace side should eventually timeout so that's fine.

This isn't very nice, because various userspace tools (pacemaker rbd
resource agent, for example) then have to worry about setting up their
own timeouts. Timeout it with mount_timeout (60 seconds by default).
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>
Reviewed-by: NSage Weil <sage@redhat.com>

2894e1d7

libceph: store timeouts in jiffies, verify user input · a319bf56

由 Ilya Dryomov 提交于 5月 15, 2015

There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.

There is also a bug in the way mount_timeout=0 is handled.  It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.

Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

a319bf56

Y
libceph: allow setting osd_req_op's flags · 144cba14
由 Yan, Zheng 提交于 4月 27, 2015
```
Signed-off-by: NYan, Zheng <zyan@redhat.com>
Reviewed-by: NAlex Elder <elder@linaro.org>
```
144cba14

02 5月, 2015 1 次提交

rbd: end I/O the entire obj_request on error · 082a75da

由 Ilya Dryomov 提交于 4月 25, 2015

When we end I/O struct request with error, we need to pass
obj_request->length as @nr_bytes so that the entire obj_request worth
of bytes is completed.  Otherwise block layer ends up confused and we
trip on

    rbd_assert(more ^ (which == img_request->obj_request_count));

in rbd_img_obj_callback() due to more being true no matter what.  We
already do it in most cases but we are missing some, in particular
those where we don't even get a chance to submit any obj_requests, due
to an early -ENOMEM for example.

A number of obj_request->xferred assignments seem to be redundant but
I haven't touched any of obj_request->xferred stuff to keep this small
and isolated.

Cc: Alex Elder <elder@linaro.org>
Cc: stable@vger.kernel.org # 3.10+
Reported-by: NShawn Edwards <lesser.evil@gmail.com>
Reviewed-by: NSage Weil <sage@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

082a75da

22 4月, 2015 1 次提交

rbd: rbd_wq comment is obsolete · f77303bd

由 Ilya Dryomov 提交于 4月 22, 2015

After the switch to blk-mq rbd_wq processes requests, not devices.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

f77303bd

20 4月, 2015 2 次提交

rbd: mark block queue as non-rotational · d8a2c89c

由 Ilya Dryomov 提交于 3月 24, 2015

Set QUEUE_FLAG_NONROT.  Following commit b277da0a ("block: disable
entropy contributions for nonrot devices") we should also clear
QUEUE_FLAG_ADD_RANDOM, but it's off by default for blk-mq drivers, so
just note it in the comment.

Also remove physical block size assignment - no sense in repeating
defaults that are not going to change.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

d8a2c89c

rbd: be more informative on -ENOENT failures · 1fe48023

由 Ilya Dryomov 提交于 3月 05, 2015

pr_info what exactly was the culprit: missing pool, image or snap.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

1fe48023

19 2月, 2015 4 次提交

rbd: convert to blk-mq · 7ad18afa

由 Christoph Hellwig 提交于 1月 13, 2015

This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparism available was an old
3.10 kernel which gave 80000iops.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NAlex Elder <elder@linaro.org>
[idryomov@gmail.com: context, blk_mq_init_queue() EH]
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

7ad18afa

rbd: do not treat standalone as flatten · cf32bd9c

由 Ilya Dryomov 提交于 1月 19, 2015

If the clone is resized down to 0, it becomes standalone.  If such
resize is carried over while an image is mapped we would detect this
and call rbd_dev_parent_put() which means "let go of all parent state,
including the spec(s) of parent images(s)".  This leads to a mismatch
between "rbd info" and sysfs parent fields, so a fix is in order.

    # rbd create --image-format 2 --size 1 foo
    # rbd snap create foo@snap
    # rbd snap protect foo@snap
    # rbd clone foo@snap bar
    # DEV=$(rbd map bar)
    # rbd resize --allow-shrink --size 0 bar
    # rbd resize --size 1 bar
    # rbd info bar | grep parent
            parent: rbd/foo@snap

Before:

    # cat /sys/bus/rbd/devices/0/parent
    (no parent image)

After:

    # cat /sys/bus/rbd/devices/0/parent
    pool_id 0
    pool_name rbd
    image_id 10056b8b4567
    image_name foo
    snap_id 2
    snap_name snap
    overlap 0
Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

cf32bd9c

rbd: fix error paths in rbd_dev_refresh() · 73e39e4d

由 Ilya Dryomov 提交于 1月 08, 2015

header_rwsem should be released on errors.  Also remove useless
rbd_dev->mapping.size != rbd_dev->header.image_size test.
Signed-off-by: NIlya Dryomov <idryomov@redhat.com>

73e39e4d

rbd: nuke copy_token() · 3a25cf43

由 Rickard Strandqvist 提交于 1月 01, 2015

It's been largely superseded by dup_token() and unused for over
2 years, identified by cppcheck.
Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[idryomov@redhat.com: changelog]
Signed-off-by: NIlya Dryomov <idryomov@redhat.com>

3a25cf43

28 1月, 2015 1 次提交

rbd: drop parent_ref in rbd_dev_unprobe() unconditionally · e69b8d41

由 Ilya Dryomov 提交于 1月 19, 2015

This effectively reverts the last hunk of 392a9dad ("rbd: detect
when clone image is flattened").

The problem with parent_overlap != 0 condition is that it's possible
and completely valid to have an image with parent_overlap == 0 whose
parent state needs to be cleaned up on unmap.  The next commit, which
drops the "clone image now standalone" logic, opens up another window
of opportunity to hit this, but even without it

    # cat parent-ref.sh
    #!/bin/bash
    rbd create --image-format 2 --size 1 foo
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd resize --allow-shrink --size 0 bar
    rbd resize --size 1 bar
    DEV=$(rbd map bar)
    rbd unmap $DEV

leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
hanging around.

My thinking behind calling rbd_dev_parent_put() unconditionally is that
there shouldn't be any requests in flight at that point in time as we
are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
by flatten is delayed by in-flight requests, it will have finished by
the time we reach rbd_dev_unprobe() caused by unmap, thus turning
unconditional rbd_dev_parent_put() into a no-op.

Fixes: http://tracker.ceph.com/issues/10352

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
Reviewed-by: NAlex Elder <elder@linaro.org>

e69b8d41

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功