1. 18 Dec, 2012 1 commit
  2. 08 Dec, 2012 1 commit
  3. 27 Nov, 2012 2 commits
  4. 04 Nov, 2012 1 commit
    • R
      xen/blkback: persistent-grants fixes · cb5bd4d1
      Roger Pau Monne committed
      This patch contains fixes for persistent grants implementation v2:
      
       * handle == 0 is a valid handle, so initialize grants in blkback
         setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
         by Konrad Rzeszutek Wilk.
      
       * new_map is a boolean, use "true" or "false" instead of 1 and 0.
         Reported by Konrad Rzeszutek Wilk.
      
       * blkfront announces the persistent-grants feature as
         feature-persistent-grants, use feature-persistent instead which is
         consistent with blkback and the public Xen headers.
      
       * Add a consistency check in blkfront to make sure we don't try to
         access segments that have not been set.
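
The first fix above can be illustrated with a minimal user-space sketch. It assumes an all-ones sentinel for BLKBACK_INVALID_HANDLE; the struct and helper names are hypothetical stand-ins, not the driver's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an all-ones sentinel, chosen because 0 is a valid handle. */
#define BLKBACK_INVALID_HANDLE (~0u)

typedef uint32_t grant_handle_t;      /* stand-in for the Xen type */

struct persistent_gnt_sketch {        /* hypothetical, not the driver's struct */
    uint32_t gnt;
    grant_handle_t handle;
};

/* Initialize a grant record so that it is clearly "not mapped yet". */
static void gnt_init(struct persistent_gnt_sketch *g, uint32_t gref)
{
    assert(g != NULL);
    g->gnt = gref;
    g->handle = BLKBACK_INVALID_HANDLE;   /* not 0: 0 could be a live mapping */
}

/* A record counts as mapped once its handle differs from the sentinel. */
static bool gnt_is_mapped(const struct persistent_gnt_sketch *g)
{
    return g->handle != BLKBACK_INVALID_HANDLE;
}
```

With a zero-initialized handle, a record that was never mapped would be indistinguishable from one holding handle 0, which is the situation the fix avoids.
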
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
      [v1: The new_map int->bool had already been changed]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      cb5bd4d1
  5. 30 Oct, 2012 1 commit
    • R
      xen/blkback: Persistent grant maps for xen blk drivers · 0a8704a5
      Roger Pau Monne committed
      This patch implements persistent grants for the xen-blk{front,back}
      mechanism. The effect of this change is to reduce the number of unmap
      operations performed, since they cause a (costly) TLB shootdown. This
      allows the I/O performance to scale better when a large number of VMs
      are performing I/O.
      
      Previously, the blkfront driver was supplied a bvec[] from the request
      queue. This was granted to dom0; dom0 performed the I/O and wrote
      directly into the grant-mapped memory and unmapped it; blkfront then
      removed foreign access for that grant. The cost of unmapping scales
      badly with the number of CPUs in Dom0. An experiment showed that when
      Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
      ramdisk, the IPIs from performing unmaps become a bottleneck at 5 guests
      (at which point 650,000 IOPS are being performed in total). If more
      than 5 guests are used, the performance declines. By 10 guests, only
      400,000 IOPS are being performed.
      
      This patch improves performance by only unmapping when the connection
      between blkfront and blkback is broken.
      
      On startup blkfront notifies blkback that it is using persistent
      grants, and blkback will do the same. If blkback is not capable of
      persistent mapping, blkfront will still use the same grants, since this
      is compatible with the previous protocol and simplifies the code
      in blkfront.
      
      To perform a read, in persistent mode, blkfront uses a separate pool
      of pages that it maps to dom0. When a request comes in, blkfront
      transmutes the request so that blkback will write into one of these
      free pages. Blkback keeps note of which grefs it has already
      mapped. When a new ring request comes to blkback, it looks to see if
      it has already mapped that page. If so, it will not map it again. If
      the page hasn't been previously mapped, it is mapped now, and a record
      is kept of this mapping. Blkback proceeds as usual. When blkfront is
      notified that blkback has completed a request, it memcpys from the
      shared memory into the bvec supplied. A record is kept that the
      {gref, page} tuple is mapped and not in flight.
      
      Writes are similar, except that the memcpy is performed from the
      supplied bvecs into the shared pages before the request is put onto
      the ring.
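
The frontend copy steps described above might look roughly like this in user-space C. It is a sketch only: the shared_page struct, the 4096-byte page size, and the function names are assumptions for illustration, and no real granting or ring handling is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096         /* assumed page size */

struct shared_page {                  /* hypothetical {gref, page} record */
    unsigned int gref;
    int in_flight;
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Write: copy the caller's data into the shared page before queuing. */
static void prepare_write(struct shared_page *pg, const void *bvec, size_t len)
{
    assert(len <= SKETCH_PAGE_SIZE && !pg->in_flight);
    memcpy(pg->data, bvec, len);
    pg->in_flight = 1;
}

/* Read: hand a free shared page to the backend, which fills pg->data. */
static void prepare_read(struct shared_page *pg)
{
    assert(!pg->in_flight);
    pg->in_flight = 1;
}

/* Read completion: copy out to the caller; the page stays granted. */
static void complete_read(struct shared_page *pg, void *bvec, size_t len)
{
    assert(len <= SKETCH_PAGE_SIZE && pg->in_flight);
    memcpy(bvec, pg->data, len);
    pg->in_flight = 0;                /* mapped, but no longer in flight */
}
```

The point of the design is that the page is never unmapped between requests; only the in-flight flag changes.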
      
      Blkback stores a mapping of grefs=>{page mapped to by gref} in
      a red-black tree. As the grefs are not known a priori, and provide no
      guarantees on their ordering, we have to perform a search
      through this tree to find the page, for every gref we receive. This
      operation takes O(log n) time in the worst case. In blkfront, grants
      are stored using a singly linked list.
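
As a user-space analogy for the gref lookup described above, the sketch below uses POSIX tsearch()/tfind() rather than the kernel's rbtree API. POSIX does not mandate a particular tree type for these calls, so treat the balanced-tree behaviour as an assumption about the C library in use. The gref_map struct and helper names are hypothetical.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <stdlib.h>

struct gref_map {                     /* hypothetical record: gref -> page */
    unsigned int gref;
    void *page;
};

/* Order records by gref so the tree can be searched by key. */
static int gref_cmp(const void *a, const void *b)
{
    const struct gref_map *x = a, *y = b;
    return (x->gref > y->gref) - (x->gref < y->gref);
}

/* Return the record for gref, or NULL if it was never mapped. */
static struct gref_map *gref_lookup(void *const *root, unsigned int gref)
{
    struct gref_map key = { gref, NULL };
    void *node;

    assert(root != NULL);
    node = tfind(&key, root, gref_cmp);
    return node ? *(struct gref_map **)node : NULL;
}

/* Record a new mapping; if gref is already present, keep the old record. */
static struct gref_map *gref_insert(void **root, unsigned int gref, void *page)
{
    struct gref_map *rec = malloc(sizeof(*rec));
    void *node;

    if (!rec)
        return NULL;
    rec->gref = gref;
    rec->page = page;
    node = tsearch(rec, root, gref_cmp);
    if (!node || *(struct gref_map **)node != rec) {
        free(rec);                    /* out of memory, or already mapped */
        return node ? *(struct gref_map **)node : NULL;
    }
    return rec;
}
```

A backend following the text would call the lookup for every gref in a request and only map, then insert, on a miss.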
      
      The maximum number of grants that blkback will persistently map is
      currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
      prevent a malicious guest from attempting a DoS by supplying fresh
      grefs, causing the Dom0 kernel to map excessively. If a guest
      is using persistent grants and exceeds the maximum number of grants to
      map persistently, the newly passed grefs will be mapped and unmapped.
      Using this approach, we can have requests that mix persistent and
      non-persistent grants, and we need to handle them correctly.
      This allows us to set the maximum number of persistent grants to a
      lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
      doing so will lead to unpredictable performance.
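
The cap described above can be sketched as a simple counter check. The values 32 and 11 are assumptions for illustration (a single 4 KiB ring page and the usual segment limit); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only. */
#define SKETCH_RING_SIZE 32
#define SKETCH_MAX_SEGMENTS_PER_REQUEST 11
#define SKETCH_MAX_PERSISTENT_GNTS \
    (SKETCH_RING_SIZE * SKETCH_MAX_SEGMENTS_PER_REQUEST)

struct backend_state {                /* hypothetical per-device state */
    unsigned int persistent_gnt_count;
};

/*
 * Decide what to do with a gref that has not been seen before.
 * true:  keep it mapped persistently (and count it).
 * false: the cap is reached, so map it for this request and unmap after.
 */
static bool may_persist(struct backend_state *st)
{
    assert(st->persistent_gnt_count <= SKETCH_MAX_PERSISTENT_GNTS);
    if (st->persistent_gnt_count == SKETCH_MAX_PERSISTENT_GNTS)
        return false;
    st->persistent_gnt_count++;
    return true;
}
```

Under these assumed values the cap works out to 32 * 11 = 352 persistently mapped grants per device, which bounds what a guest can force the backend to keep mapped.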
      
      In writing this patch, the question arises as to whether the additional
      cost of performing memcpys in the guest (to/from the pool of granted
      pages) outweighs the gains of not performing TLB shootdowns. The answer
      to that question is `no'. There appears to be very little, if any,
      additional cost to the guest of using persistent grants. There is
      perhaps a small saving, from the reduced number of hypercalls
      performed in granting, and ending foreign access.
      Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
      Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [v1: Fixed up the misuse of bool as int]
      0a8704a5
  6. 06 Oct, 2012 21 commits
  7. 02 Oct, 2012 13 commits
    • S
      rbd: BUG on invalid layout · 6cae3717
      Sage Weil committed
      This shouldn't actually be possible because the layout struct is
      constructed from the RBD header and validated then.
      
      [elder@inktank.com: converted BUG() call to equivalent rbd_assert()]
      Signed-off-by: Sage Weil <sage@inktank.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      6cae3717
    • A
      rbd: update remaining header fields for v2 · 6e14b1a6
      Alex Elder committed
      There are three fields that are not yet updated for format 2 rbd
      image headers:  the version of the header object; the encryption
      type; and the compression type.  There is no interface defined for
      fetching the latter two, so just initialize them explicitly to 0 for
      now.
      
      Change rbd_dev_v2_snap_context() so the caller can be supplied the
      version for the header object.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      6e14b1a6
    • A
      rbd: get snapshot name for a v2 image · b8b1e2db
      Alex Elder committed
      Define rbd_dev_v2_snap_name() to fetch the name for a particular
      snapshot in a format 2 rbd image.
      
      Define rbd_dev_v2_snap_info() to be a wrapper for getting the
      name, size, and features for a particular snapshot, using an
      interface that matches the equivalent function for version 1 images.
      
      Define rbd_dev_snap_info() wrapper function and use it to call the
      appropriate function for getting the snapshot name, size, and
      features, dependent on the rbd image format.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      b8b1e2db
    • A
      rbd: get the snapshot context for a v2 image · 35d489f9
      Alex Elder committed
      Fetch the snapshot context for an rbd format 2 image by calling
      the "get_snapcontext" method on its header object.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      35d489f9
    • A
      rbd: get image features for a v2 image · b1b5402a
      Alex Elder committed
      The features values for an rbd format 2 image are fetched from the
      server using a "get_features" method.  The same method is used for
      getting the features for a snapshot, so structure this addition with
      a generic helper routine that can get this information for either.
      
      The server will provide two 64-bit feature masks, one representing
      the features potentially in use for this image (or its snapshot),
      and one representing features that must be supported by the client
      in order to work with the image.
      
      For the time being, neither of these is really used so we keep
      things simple and just record the first feature vector.  Once we
      start using these feature masks, what we record and what we expose
      to the user will most likely change.
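
To make the two masks described above concrete, here is a sketch of how a client could eventually test the second one. This is not what the commit does (it only records the first mask); the struct, field names, and function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct features_reply {               /* hypothetical decoded reply */
    uint64_t features;                /* features potentially in use */
    uint64_t incompat;                /* features the client must support */
};

/* The image is usable only if every required bit is one we support. */
static bool client_can_use(const struct features_reply *r, uint64_t supported)
{
    assert(r != NULL);
    return (r->incompat & ~supported) == 0;
}
```

Any bit set in the required mask but absent from the client's supported mask would make the image unusable under this check.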
      Signed-off-by: Alex Elder <elder@inktank.com>
      b1b5402a
    • A
      rbd: get the object prefix for a v2 rbd image · 1e130199
      Alex Elder committed
      The object prefix of an rbd format 2 image is fetched from the
      server using a "get_object_prefix" method.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      1e130199
    • A
      rbd: add code to get the size of a v2 rbd image · 9d475de5
      Alex Elder committed
      The size of an rbd format 2 image is fetched from the server using a
      "get_size" method.  The same method is used for getting the size of
      a snapshot, so structure this addition with a generic helper routine
      that can get this information for either.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      9d475de5
    • A
      rbd: lay out header probe infrastructure · a30b71b9
      Alex Elder committed
      This defines a new function rbd_dev_probe() as a top-level
      function for populating detailed information about an rbd device.
      
      It first checks for the existence of a format 2 rbd image id object.
      If it exists, the image is assumed to be a format 2 rbd image, and
      another function rbd_dev_v2() is called to finish populating
      header data for that image.  If it does not exist, the image is
      assumed to be an old (format 1) rbd image, and a similar function
      rbd_dev_v1() is called to populate its header information.
      
      A new field, rbd_dev->format, is defined to record which version
      of the rbd image format the device represents.  For a valid mapped
      rbd device it will have one of two values, 1 or 2.
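
The probe flow described above reduces to a two-way dispatch. The sketch below is a user-space stand-in: the struct, the populate helpers, and the boolean parameter (standing in for the image id object check) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rbd_dev_sketch {               /* hypothetical, not struct rbd_device */
    int format;                       /* 1 or 2 once probed */
    bool header_populated;
};

/* Stand-ins for the format-specific header population steps. */
static void populate_v1(struct rbd_dev_sketch *dev)
{
    dev->format = 1;
    dev->header_populated = true;
}

static void populate_v2(struct rbd_dev_sketch *dev)
{
    dev->format = 2;
    dev->header_populated = true;
}

/* If the image id object was found, treat the image as format 2. */
static int probe_sketch(struct rbd_dev_sketch *dev, bool id_object_found)
{
    assert(dev != NULL);
    if (id_object_found)
        populate_v2(dev);
    else
        populate_v1(dev);
    return dev->format;
}
```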
      
      So far, the format 2 images are not really supported; this is
      laying out the infrastructure for fleshing out that support.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      a30b71b9
    • A
      rbd: encapsulate code that gets snapshot info · cd892126
      Alex Elder committed
      Create a function that encapsulates looking up the name, size and
      features related to a given snapshot, which is indicated by its
      index in an rbd device's snapshot context array of snapshot ids.
      
      This interface will be used to hide differences between the format 1
      and format 2 images.
      
      At the moment this (looking up the name anyway) is slightly less
      efficient than what's done currently, but we may be able to optimize
      this a bit later on by caching the last lookup if it proves to be a
      problem.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      cd892126
    • A
      rbd: add an rbd features field · 34b13184
      Alex Elder committed
      Record the features values for each rbd image and each of its
      snapshots.  This is really something that only becomes meaningful
      for version 2 images, so this is just putting in place code
      that will form common infrastructure.
      
      It may be useful to expand the sysfs entries--and therefore the
      information we maintain--for the image and for each snapshot.
      But I'm going to hold off doing that until we start making
      active use of the feature bits.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      34b13184
    • A
      rbd: don't use index in __rbd_add_snap_dev() · c8d18425
      Alex Elder committed
      Pass the snapshot id and snapshot size rather than an index
      to __rbd_add_snap_dev() to specify values for a new snapshot.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      c8d18425
    • A
      rbd: kill create_snap sysfs entry · 02cdb02c
      Alex Elder committed
      Josh proposed the following change, and I don't think I could
      explain it any better than he did:
      
          From: Josh Durgin <josh.durgin@inktank.com>
          Date: Tue, 24 Jul 2012 14:22:11 -0700
          To: ceph-devel <ceph-devel@vger.kernel.org>
          Message-ID: <500F1203.9050605@inktank.com>
      
          Right now the kernel still has one piece of rbd management
          duplicated from the rbd command line tool: snapshot creation.
          There's nothing special about snapshot creation that makes it
          advantageous to do from the kernel, so I'd like to remove the
          create_snap sysfs interface.  That is,
      	/sys/bus/rbd/devices/<id>/create_snap
          would be removed.
      
          Does anyone rely on the sysfs interface for creating rbd
          snapshots?  If so, how hard would it be to replace with:
      
      	rbd snap create pool/image@snap
      
          Is there any benefit to the sysfs interface that I'm missing?
      
          Josh
      
      This patch implements this proposal, removing the code that
      implements the "create_snap" sysfs interface for rbd images.
      As a result, quite a lot of other supporting code goes away.
      Suggested-by: Josh Durgin <josh.durgin@inktank.com>
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      02cdb02c
    • A
      rbd: define rbd_dev_image_id() · 589d30e0
      Alex Elder committed
      New format 2 rbd images are permanently identified by a unique image
      id.  Each rbd image also has a name, but the name can be changed.
      A format 2 rbd image will have an object--whose name is based on the
      image name--which maps an image's name to its image id.
      
      Create a new function rbd_dev_image_id() that checks for the
      existence of the image id object, and if it's found, records the
      image id in the rbd_device structure.
      
      Create a new rbd device attribute (/sys/bus/rbd/<num>/image_id) that
      makes this information available.
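
The name-to-id mapping described above depends on deriving an object name from the image name. A minimal sketch follows, assuming an "rbd_id." prefix for that object; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the id object is named by prefixing the image name. */
#define SKETCH_ID_PREFIX "rbd_id."

/* Build "<prefix><image_name>" in buf; return 0 on success, -1 if too small. */
static int id_object_name(char *buf, size_t len, const char *image_name)
{
    size_t plen = strlen(SKETCH_ID_PREFIX);
    size_t nlen;

    assert(buf != NULL && image_name != NULL);
    nlen = strlen(image_name);
    if (plen + nlen + 1 > len)
        return -1;
    memcpy(buf, SKETCH_ID_PREFIX, plen);
    memcpy(buf + plen, image_name, nlen + 1);   /* includes the NUL */
    return 0;
}
```

Because the object name follows the image name, a rename changes which object is consulted while the image id it stores stays the same.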
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      589d30e0