1. 25 Jun, 2015 4 commits
    • rbd: bump queue_max_segments · d3834fef
      Committed by Ilya Dryomov
      The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128)
      unnecessarily limits bio sizes to 512k (assuming 4k pages).  rbd, being
      a virtual block device, doesn't have any restrictions on the number of
      physical segments, so bump max_segments to max_hw_sectors, in theory
      allowing a sector per segment (although the only case this matters that
      I can think of is some readv/writev style thing).  In practice this is
      going to give us 1M bios - the number of segments in a bio is limited
      in bio_get_nr_vecs() by BIO_MAX_PAGES = 256.
      
      Note that this doesn't result in any improvement on a typical direct
      sequential test.  This is because on a box with a not too badly
      fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice
      rbd object size sized requests.  The only difference is the size of
      bios being merged - 512k vs 1M for something like
      
          $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE
          $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • rbd: timeout watch teardown on unmap with mount_timeout · 2894e1d7
      Committed by Ilya Dryomov
      As part of the unmap sequence, the kernel client has to talk to the OSDs
      to tear down the watch on the header object.  If none of the OSDs are
      available it would hang forever, until interrupted by a signal - when
      that happens we follow through with the rest of the unmap procedure
      (i.e. unregister the device and put all the data structures) and the
      unmap is still considered successful (rbd cli tool exits with 0).  The
      watch on the userspace side should eventually time out so that's fine.
      
      This isn't very nice, because various userspace tools (pacemaker rbd
      resource agent, for example) then have to worry about setting up their
      own timeouts.  Time out the watch teardown with mount_timeout (60
      seconds by default).

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Sage Weil <sage@redhat.com>
    • libceph: store timeouts in jiffies, verify user input · a319bf56
      Committed by Ilya Dryomov
      There are currently three libceph-level timeouts that the user can
      specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
      these are in seconds and no checking is done on user input: negative
      values are accepted, we multiply them all by HZ which may or may not
      overflow, arbitrarily large jiffies then get added together, etc.
      
      There is also a bug in the way mount_timeout=0 is handled.  It's
      supposed to mean "infinite timeout", but that's not how wait.h APIs
      treat it and so __ceph_open_session() for example will busy loop
      without much chance of being interrupted if none of ceph-mons are
      there.
      
      Fix all this by verifying user input, storing timeouts capped by
      msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
      helper for all user-specified waits to handle infinite timeouts
      correctly.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • libceph: allow setting osd_req_op's flags · 144cba14
      Committed by Yan, Zheng
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
  2. 02 May, 2015 1 commit
    • rbd: end I/O the entire obj_request on error · 082a75da
      Committed by Ilya Dryomov
      When we end I/O on a struct request with an error, we need to pass
      obj_request->length as @nr_bytes so that the entire obj_request worth
      of bytes is completed.  Otherwise block layer ends up confused and we
      trip on
      
          rbd_assert(more ^ (which == img_request->obj_request_count));
      
      in rbd_img_obj_callback() due to more being true no matter what.  We
      already do it in most cases but we are missing some, in particular
      those where we don't even get a chance to submit any obj_requests, due
      to an early -ENOMEM for example.
      
      A number of obj_request->xferred assignments seem to be redundant but
      I haven't touched any of obj_request->xferred stuff to keep this small
      and isolated.
      
      Cc: Alex Elder <elder@linaro.org>
      Cc: stable@vger.kernel.org # 3.10+
      Reported-by: Shawn Edwards <lesser.evil@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  3. 22 Apr, 2015 1 commit
  4. 20 Apr, 2015 2 commits
  5. 19 Feb, 2015 4 commits
  6. 28 Jan, 2015 2 commits
    • rbd: drop parent_ref in rbd_dev_unprobe() unconditionally · e69b8d41
      Committed by Ilya Dryomov
      This effectively reverts the last hunk of 392a9dad ("rbd: detect
      when clone image is flattened").
      
      The problem with parent_overlap != 0 condition is that it's possible
      and completely valid to have an image with parent_overlap == 0 whose
      parent state needs to be cleaned up on unmap.  The next commit, which
      drops the "clone image now standalone" logic, opens up another window
      of opportunity to hit this, but even without it
      
          # cat parent-ref.sh
          #!/bin/bash
          rbd create --image-format 2 --size 1 foo
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          rbd resize --allow-shrink --size 0 bar
          rbd resize --size 1 bar
          DEV=$(rbd map bar)
          rbd unmap $DEV
      
      leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
      hanging around.
      
      My thinking behind calling rbd_dev_parent_put() unconditionally is that
      there shouldn't be any requests in flight at that point in time as we
      are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
      by flatten is delayed by in-flight requests, it will have finished by
      the time we reach rbd_dev_unprobe() caused by unmap, thus turning
      unconditional rbd_dev_parent_put() into a no-op.
      
      Fixes: http://tracker.ceph.com/issues/10352
      
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
      Reviewed-by: Josh Durgin <jdurgin@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • rbd: fix rbd_dev_parent_get() when parent_overlap == 0 · ae43e9d0
      Committed by Ilya Dryomov
      The comment for rbd_dev_parent_get() said
      
          * We must get the reference before checking for the overlap to
          * coordinate properly with zeroing the parent overlap in
          * rbd_dev_v2_parent_info() when an image gets flattened.  We
          * drop it again if there is no overlap.
      
      but the "drop it again if there is no overlap" part was missing from
      the implementation.  This led to absurd parent_ref values for images
      with parent_overlap == 0, as parent_ref was incremented for each
      img_request and virtually never decremented.
      
      Fix this by leveraging the fact that refresh path calls
      rbd_dev_v2_parent_info() under header_rwsem and use it for read in
      rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
      of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
      they'd pair with now and I suspect we are in a pretty miserable
      situation as far as proper locking goes regardless.
      
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
      Reviewed-by: Josh Durgin <jdurgin@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
  7. 18 Dec, 2014 2 commits
  8. 30 Oct, 2014 2 commits
  9. 15 Oct, 2014 14 commits
  10. 10 Sep, 2014 2 commits
  11. 07 Aug, 2014 3 commits
  12. 25 Jul, 2014 3 commits
    • rbd: take snap_id into account when reading in parent info · 4d9b67cd
      Committed by Ilya Dryomov
      If we are mapping a snapshot, we must read in the parent_overlap value
      of that snapshot instead of that of the base image.  Not doing so may
      in particular result in us returning zeros instead of user data:
      
          # cat overlap-snap.sh
          #!/bin/bash
          rbd create --size 10 --image-format 2 foo
          FOO_DEV=$(rbd map foo)
          dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
          echo "Base image"
          dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          rbd snap create bar@snap
          BAR_DEV=$(rbd map bar@snap)
          echo "Snapshot"
          dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
          rbd resize --allow-shrink --size 4 bar
          echo "Snapshot after base image resize"
          dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
      
          # ./overlap-snap.sh
          Base image
          0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
          Snapshot
          0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
          Resizing image: 100% complete...done.
          Snapshot after base image resize
          0000000: e781 e33b d34b 2225 0000 0000 0000 0000  ...;.K"%........
      
      Even though bar@snap is taken with the old bar parent_overlap (8M),
      reads from bar@snap beyond the new bar parent_overlap (4M) return
      zeroes.  Fix it.

      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • rbd: do not read in parent info before snap context · e8f59b59
      Committed by Ilya Dryomov
      Currently rbd_dev_v2_header_info() reads in parent info before the snap
      context is read in.  This is wrong, because we may need to look at the
      parent_overlap value of the snapshot instead of that of the base
      image, for example when mapping a snapshot - see next commit.  (When
      mapping a snapshot, all we have is its name and we need the snap context
      to translate that name into an id to know which parent info to look
      for.)
      
      The approach taken here is to make sure rbd_dev_v2_parent_info() is
      called after the snap context has been read in.  The other approach
      would be to add a parent_overlap field to struct rbd_mapping and
      maintain it the same way rbd_mapping::size is maintained.  The reason
      I chose the first approach is that the value of keeping around both
      base image values and the actual mapping values is unclear to me.

      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • rbd: update mapping size only on refresh · 5ff1108c
      Committed by Ilya Dryomov
      There is no sense in trying to update the mapping size before it's even
      been set.

      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Alex Elder <elder@linaro.org>