1. 03 11月, 2015 2 次提交
  2. 31 10月, 2015 1 次提交
    • R
      rbd: require stable pages if message data CRCs are enabled · bae818ee
      Ronny Hegewald 提交于
      rbd requires stable pages, as it performs a crc of the page data before
      they are send to the OSDs.
      
      But since kernel 3.9 (patch 1d1d1a76
      "mm: only enforce stable page writes if the backing device requires
      it") it is not assumed anymore that block devices require stable pages.
      
      This patch sets the necessary flag to get stable pages back for rbd.
      
      In a ceph installation that provides multiple ext4 formatted rbd
      devices "bad crc" messages appeared regularly (ca 1 message every 1-2
      minutes on every OSD that provided the data for the rbd) in the
      OSD-logs before this patch. After this patch this messages are pretty
      much gone (only ca 1-2 / month / OSD).
      
      Cc: stable@vger.kernel.org # 3.9+, needs backporting
      Signed-off-by: NRonny Hegewald <Ronny.Hegewald@online.de>
      [idryomov@gmail.com: require stable pages only in crc case, changelog]
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      bae818ee
  3. 24 10月, 2015 2 次提交
    • I
      rbd: prevent kernel stack blow up on rbd map · 6d69bb53
      Ilya Dryomov 提交于
      Mapping an image with a long parent chain (e.g. image foo, whose parent
      is bar, whose parent is baz, etc) currently leads to a kernel stack
      overflow, due to the following recursion in the reply path:
      
        rbd_osd_req_callback()
          rbd_obj_request_complete()
            rbd_img_obj_callback()
              rbd_img_parent_read_callback()
                rbd_obj_request_complete()
                  ...
      
      Limit the parent chain to 16 images, which is ~5K worth of stack.  When
      the above recursion is eliminated, this limit can be lifted.
      
      Fixes: http://tracker.ceph.com/issues/12538
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
      6d69bb53
    • I
      rbd: don't leak parent_spec in rbd_dev_probe_parent() · 1f2c6651
      Ilya Dryomov 提交于
      Currently we leak parent_spec and trigger a "parent reference
      underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
      The problem is we take the !parent out_err branch and that only drops
      refcounts; parent_spec that would've been freed had we called
      rbd_dev_unparent() remains and triggers rbd_warn() in
      rbd_dev_parent_put() - at that point we have parent_spec != NULL and
      parent_ref == 0, so counter ends up being -1 after the decrement.
      
      Redo rbd_dev_probe_parent() to fix this.
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      1f2c6651
  4. 16 10月, 2015 2 次提交
    • I
      rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov 提交于
      This covers only the simplest case - an object size sized write, but
      it's still useful in tiering setups when EC is used for the base tier
      as writefull op can be proxied, saving an object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull should
      just be a matter of fixing an assert, I didn't do it because its only
      user is cephfs.  All other sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      e30b7577
    • I
      rbd: set max_sectors explicitly · 0d9fde4f
      Ilya Dryomov 提交于
      Commit 30e2bc08 ("Revert "block: remove artifical max_hw_sectors
      cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
      of 1024, but it's not good enough: we set max_hw_sectors to rbd object
      size because we don't want object sized I/Os to be split, and the
      default object size is 4M.
      
      So, set max_sectors to max_hw_sectors in rbd at queue init time.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      0d9fde4f
  5. 09 9月, 2015 2 次提交
  6. 14 8月, 2015 1 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
  7. 31 7月, 2015 1 次提交
    • I
      rbd: fix copyup completion race · 2761713d
      Ilya Dryomov 提交于
      For write/discard obj_requests that involved a copyup method call, the
      opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
      rbd_img_obj_copyup_callback().  The latter frees copyup pages, sets
      ->xferred and delegates to rbd_img_obj_callback(), the "normal" image
      object callback, for reporting to block layer and putting refs.
      
      rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
      which means obj_request is marked done in rbd_osd_trivial_callback(),
      *before* ->callback is invoked and rbd_img_obj_copyup_callback() has
      a chance to run.  Marking obj_request done essentially means giving
      rbd_img_obj_callback() a license to end it at any moment, so if another
      obj_request from the same img_request is being completed concurrently,
      rbd_img_obj_end_request() may very well be called on such prematurally
      marked done request:
      
      <obj_request-1/2 reply>
      handle_reply()
        rbd_osd_req_callback()
          rbd_osd_trivial_callback()
          rbd_obj_request_complete()
          rbd_img_obj_copyup_callback()
          rbd_img_obj_callback()
                                          <obj_request-2/2 reply>
                                          handle_reply()
                                            rbd_osd_req_callback()
                                              rbd_osd_trivial_callback()
            for_each_obj_request(obj_request->img_request) {
              rbd_img_obj_end_request(obj_request-1/2)
              rbd_img_obj_end_request(obj_request-2/2) <--
            }
      
      Calling rbd_img_obj_end_request() on such a request leads to trouble,
      in particular because its ->xfferred is 0.  We report 0 to the block
      layer with blk_update_request(), get back 1 for "this request has more
      data in flight" and then trip on
      
          rbd_assert(more ^ (which == img_request->obj_request_count));
      
      with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
      been called for both requests and lhs (more) being 1 because we haven't
      got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet.
      
      To fix this, leverage that rbd wants to call class methods in only two
      cases: one is a generic method call wrapper (obj_request is standalone)
      and the other is a copyup (obj_request is part of an img_request).  So
      make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
      rbd_img_obj_copyup_callback() from it if obj_request is part of an
      img_request, similar to how CEPH_OSD_OP_READ handler invokes
      rbd_img_obj_request_read_callback().
      
      Since rbd_img_obj_copyup_callback() is now being called from the OSD
      request callback (only), it is renamed to rbd_osd_copyup_callback().
      
      Cc: Alex Elder <elder@linaro.org>
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      2761713d
  8. 17 7月, 2015 1 次提交
  9. 01 7月, 2015 1 次提交
    • I
      rbd: use GFP_NOIO in rbd_obj_request_create() · 5a60e876
      Ilya Dryomov 提交于
      rbd_obj_request_create() is called on the main I/O path, so we need to
      use GFP_NOIO to make sure allocation doesn't blow back on us.  Not all
      callers need this, but I'm still hardcoding the flag inside rather than
      making it a parameter because a) this is going to stable, and b) those
      callers shouldn't really use rbd_obj_request_create() and will be fixed
      in the future.
      
      More memory allocation fixes will follow.
      
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      5a60e876
  10. 25 6月, 2015 7 次提交
    • I
      rbd: queue_depth map option · b5584180
      Ilya Dryomov 提交于
      nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much
      irrelevant in blk-mq case because each driver sets its own max depth
      that it can handle and that's the number of tags that gets preallocated
      on setup.  Users can't increase queue depth beyond that value via
      writing to nr_requests.
      
      For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most
      cases but we want to give users the opportunity to increase it.
      Introduce a new per-device queue_depth option to do just that:
      
          $ sudo rbd map -o queue_depth=1024 ...
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      b5584180
    • I
      rbd: store rbd_options in rbd_device · d147543d
      Ilya Dryomov 提交于
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      d147543d
    • I
      rbd: terminate rbd_opts_tokens with Opt_err · 210c104c
      Ilya Dryomov 提交于
      Also nuke useless Opt_last_bool and don't break lines unnecessarily.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      210c104c
    • I
      rbd: bump queue_max_segments · d3834fef
      Ilya Dryomov 提交于
      The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128)
      unnecessarily limits bio sizes to 512k (assuming 4k pages).  rbd, being
      a virtual block device, doesn't have any restrictions on the number of
      physical segments, so bump max_segments to max_hw_sectors, in theory
      allowing a sector per segment (although the only case this matters that
      I can think of is some readv/writev style thing).  In practice this is
      going to give us 1M bios - the number of segments in a bio is limited
      in bio_get_nr_vecs() by BIO_MAX_PAGES = 256.
      
      Note that this doesn't result in any improvement on a typical direct
      sequential test.  This is because on a box with a not too badly
      fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice
      rbd object size sized requests.  The only difference is the size of
      bios being merged - 512k vs 1M for something like
      
          $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE
          $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      d3834fef
    • I
      rbd: timeout watch teardown on unmap with mount_timeout · 2894e1d7
      Ilya Dryomov 提交于
      As part of unmap sequence, kernel client has to talk to the OSDs to
      teardown watch on the header object.  If none of the OSDs are available
      it would hang forever, until interrupted by a signal - when that
      happens we follow through with the rest of unmap procedure (i.e.
      unregister the device and put all the data structures) and the unmap is
      still considired successful (rbd cli tool exits with 0).  The watch on
      the userspace side should eventually timeout so that's fine.
      
      This isn't very nice, because various userspace tools (pacemaker rbd
      resource agent, for example) then have to worry about setting up their
      own timeouts.  Timeout it with mount_timeout (60 seconds by default).
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NSage Weil <sage@redhat.com>
      2894e1d7
    • I
      libceph: store timeouts in jiffies, verify user input · a319bf56
      Ilya Dryomov 提交于
      There are currently three libceph-level timeouts that the user can
      specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
      these are in seconds and no checking is done on user input: negative
      values are accepted, we multiply them all by HZ which may or may not
      overflow, arbitrarily large jiffies then get added together, etc.
      
      There is also a bug in the way mount_timeout=0 is handled.  It's
      supposed to mean "infinite timeout", but that's not how wait.h APIs
      treat it and so __ceph_open_session() for example will busy loop
      without much chance of being interrupted if none of ceph-mons are
      there.
      
      Fix all this by verifying user input, storing timeouts capped by
      msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
      helper for all user-specified waits to handle infinite timeouts
      correctly.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      a319bf56
    • Y
      libceph: allow setting osd_req_op's flags · 144cba14
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      144cba14
  11. 02 5月, 2015 1 次提交
    • I
      rbd: end I/O the entire obj_request on error · 082a75da
      Ilya Dryomov 提交于
      When we end I/O struct request with error, we need to pass
      obj_request->length as @nr_bytes so that the entire obj_request worth
      of bytes is completed.  Otherwise block layer ends up confused and we
      trip on
      
          rbd_assert(more ^ (which == img_request->obj_request_count));
      
      in rbd_img_obj_callback() due to more being true no matter what.  We
      already do it in most cases but we are missing some, in particular
      those where we don't even get a chance to submit any obj_requests, due
      to an early -ENOMEM for example.
      
      A number of obj_request->xferred assignments seem to be redundant but
      I haven't touched any of obj_request->xferred stuff to keep this small
      and isolated.
      
      Cc: Alex Elder <elder@linaro.org>
      Cc: stable@vger.kernel.org # 3.10+
      Reported-by: NShawn Edwards <lesser.evil@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      082a75da
  12. 22 4月, 2015 1 次提交
  13. 20 4月, 2015 2 次提交
  14. 19 2月, 2015 4 次提交
  15. 28 1月, 2015 2 次提交
    • I
      rbd: drop parent_ref in rbd_dev_unprobe() unconditionally · e69b8d41
      Ilya Dryomov 提交于
      This effectively reverts the last hunk of 392a9dad ("rbd: detect
      when clone image is flattened").
      
      The problem with parent_overlap != 0 condition is that it's possible
      and completely valid to have an image with parent_overlap == 0 whose
      parent state needs to be cleaned up on unmap.  The next commit, which
      drops the "clone image now standalone" logic, opens up another window
      of opportunity to hit this, but even without it
      
          # cat parent-ref.sh
          #!/bin/bash
          rbd create --image-format 2 --size 1 foo
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          rbd resize --allow-shrink --size 0 bar
          rbd resize --size 1 bar
          DEV=$(rbd map bar)
          rbd unmap $DEV
      
      leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
      hanging around.
      
      My thinking behind calling rbd_dev_parent_put() unconditionally is that
      there shouldn't be any requests in flight at that point in time as we
      are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
      by flatten is delayed by in-flight requests, it will have finished by
      the time we reach rbd_dev_unprobe() caused by unmap, thus turning
      unconditional rbd_dev_parent_put() into a no-op.
      
      Fixes: http://tracker.ceph.com/issues/10352
      
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
      Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      e69b8d41
    • I
      rbd: fix rbd_dev_parent_get() when parent_overlap == 0 · ae43e9d0
      Ilya Dryomov 提交于
      The comment for rbd_dev_parent_get() said
      
          * We must get the reference before checking for the overlap to
          * coordinate properly with zeroing the parent overlap in
          * rbd_dev_v2_parent_info() when an image gets flattened.  We
          * drop it again if there is no overlap.
      
      but the "drop it again if there is no overlap" part was missing from
      the implementation.  This lead to absurd parent_ref values for images
      with parent_overlap == 0, as parent_ref was incremented for each
      img_request and virtually never decremented.
      
      Fix this by leveraging the fact that refresh path calls
      rbd_dev_v2_parent_info() under header_rwsem and use it for read in
      rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
      of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
      they'd pair with now and I suspect we are in a pretty miserable
      situation as far as proper locking goes regardless.
      
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: NIlya Dryomov <idryomov@redhat.com>
      Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      ae43e9d0
  16. 18 12月, 2014 2 次提交
  17. 30 10月, 2014 2 次提交
  18. 15 10月, 2014 6 次提交