1. 02 5月, 2013 18 次提交
    • A
      libceph: change how "safe" callback is used · 26be8808
      Alex Elder 提交于
      An osd request currently has two callbacks.  They inform the
      initiator of the request when we've received confirmation for the
      target osd that a request was received, and when the osd indicates
      all changes described by the request are durable.
      
      The only time the second callback is used is in the ceph file system
      for a synchronous write.  There's a race that makes some handling of
      this case unsafe.  This patch addresses this problem.  The error
      handling for this callback is also kind of gross, and this patch
      changes that as well.
      
      In ceph_sync_write(), if a safe callback is requested we want to add
      the request on the ceph inode's unsafe items list.  Because items on
      this list must have their tid set (by ceph_osd_start_request()), the
      request added *after* the call to that function returns.  The
      problem with this is that there's a race between starting the
      request and adding it to the unsafe items list; the request may
      already be complete before ceph_sync_write() even begins to put it
      on the list.
      
      To address this, we change the way the "safe" callback is used.
      Rather than just calling it when the request is "safe", we use it to
      notify the initiator the bounds (start and end) of the period during
      which the request is *unsafe*.  So the initiator gets notified just
      before the request gets sent to the osd (when it is "unsafe"), and
      again when it's known the results are durable (it's no longer
      unsafe).  The first call will get made in __send_request(), just
      before the request message gets sent to the messenger for the first
      time.  That function is only called by __send_queued(), which is
      always called with the osd client's request mutex held.
      
      We then have this callback function insert the request on the ceph
      inode's unsafe list when we're told the request is unsafe.  This
      will avoid the race because this call will be made under protection
      of the osd client's request mutex.  It also nicely groups the setup
      and cleanup of the state associated with managing unsafe requests.
      
      The name of the "safe" callback field is changed to "unsafe" to
      better reflect its new purpose.  It has a Boolean "unsafe" parameter
      to indicate whether the request is becoming unsafe or is now safe.
      Because the "msg" parameter wasn't used, we drop that.
      
      This resolves the original problem reportedin:
          http://tracker.ceph.com/issues/4706Reported-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      26be8808
    • A
      ceph: let osd client clean up for interrupted request · 7d7d51ce
      Alex Elder 提交于
      In ceph_sync_write(), if a safe callback is supplied with a request,
      and an error is returned by ceph_osdc_wait_request(), a block of
      code is executed to remove the request from the unsafe writes list
      and drop references to capabilities acquired just prior to a call to
      ceph_osdc_wait_request().
      
      The only function used for this callback is sync_write_commit(),
      and it does *exactly* what that block of error handling code does.
      
      Now in ceph_osdc_wait_request(), if an error occurs (due to an
      interupt during a wait_for_completion_interruptible() call),
      complete_request() gets called, and that calls the request's
      safe_callback method if it's defined.
      
      So this means that this cleanup activity gets called twice in this
      case, which is erroneous (and in fact leads to a crash).
      
      Fix this by just letting the osd client handle the cleanup in
      the event of an interrupt.
      
      This resolves one problem mentioned in:
          http://tracker.ceph.com/issues/4706Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NYan, Zheng <zheng.z.yan@intel.com>
      7d7d51ce
    • A
      libceph: combine initializing and setting osd data · a4ce40a9
      Alex Elder 提交于
      This ends up being a rather large patch but what it's doing is
      somewhat straightforward.
      
      Basically, this is replacing two calls with one.  The first of the
      two calls is initializing a struct ceph_osd_data with data (either a
      page array, a page list, or a bio list); the second is setting an
      osd request op so it associates that data with one of the op's
      parameters.  In place of those two will be a single function that
      initializes the op directly.
      
      That means we sort of fan out a set of the needed functions:
          - extent ops with pages data
          - extent ops with pagelist data
          - extent ops with bio list data
      and
          - class ops with page data for receiving a response
      
      We also have define another one, but it's only used internally:
          - class ops with pagelist data for request parameters
      
      Note that we *still* haven't gotten rid of the osd request's
      r_data_in and r_data_out fields.  All the osd ops refer to them for
      their data.  For now, these data fields are pointers assigned to the
      appropriate r_data_* field when these new functions are called.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      a4ce40a9
    • A
      libceph: add data pointers in osd op structures · 8c042b0d
      Alex Elder 提交于
      An extent type osd operation currently implies that there will
      be corresponding data supplied in the data portion of the request
      (for write) or response (for read) message.  Similarly, an osd class
      method operation implies a data item will be supplied to receive
      the response data from the operation.
      
      Add a ceph_osd_data pointer to each of those structures, and assign
      it to point to eithre the incoming or the outgoing data structure in
      the osd message.  The data is not always available when an op is
      initially set up, so add two new functions to allow setting them
      after the op has been initialized.
      
      Begin to make use of the data item pointer available in the osd
      operation rather than the request data in or out structure in
      places where it's convenient.  Add some assertions to verify
      pointers are always set the way they're expected to be.
      
      This is a sort of stepping stone toward really moving the data
      into the osd request ops, to allow for some validation before
      making that jump.
      
      This is the first in a series of patches that resolve:
          http://tracker.ceph.com/issues/4657Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      8c042b0d
    • A
      libceph: keep source rather than message osd op array · 79528734
      Alex Elder 提交于
      An osd request keeps a pointer to the osd operations (ops) array
      that it builds in its request message.
      
      In order to allow each op in the array to have its own distinct
      data, we will need to keep track of each op's data, and that
      information does not go over the wire.
      
      As long as we're tracking the data we might as well just track the
      entire (source) op definition for each of the ops.  And if we're
      doing that, we'll have no more need to keep a pointer to the
      wire-encoded version.
      
      This patch makes the array of source ops be kept with the osd
      request structure, and uses that instead of the version encoded in
      the message in places where that was previously used.  The array
      will be embedded in the request structure, and the maximum number of
      ops we ever actually use is currently 2.  So reduce CEPH_OSD_MAX_OP
      to 2 to reduce the size of the structure.
      
      The result of doing this sort of ripples back up, and as a result
      various function parameters and local variables become unnecessary.
      
      Make r_num_ops be unsigned, and move the definition of struct
      ceph_osd_req_op earlier to ensure it's defined where needed.
      
      It does not yet add per-op data, that's coming soon.
      
      This resolves:
          http://tracker.ceph.com/issues/4656Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      79528734
    • A
      libceph: define osd data initialization helpers · 43bfe5de
      Alex Elder 提交于
      Define and use functions that encapsulate the initializion of a
      ceph_osd_data structure.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      43bfe5de
    • A
      libceph: hold off building osd request · 02ee07d3
      Alex Elder 提交于
      Defer building the osd request until just before submitting it in
      all callers except ceph_writepages_start().  (That caller will be
      handed in the next patch.)
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      02ee07d3
    • A
      libceph: don't build request in ceph_osdc_new_request() · acead002
      Alex Elder 提交于
      This patch moves the call to ceph_osdc_build_request() out of
      ceph_osdc_new_request() and into its caller.
      
      This is in order to defer formatting osd operation information into
      the request message until just before request is started.
      
      The only unusual (ab)user of ceph_osdc_build_request() is
      ceph_writepages_start(), where the final length of write request may
      change (downward) based on the current inode size or the oldest
      snapshot context with dirty data for the inode.
      
      The remaining callers don't change anything in the request after has
      been built.
      
      This means the ops array is now supplied by the caller.  It also
      means there is no need to pass the mtime to ceph_osdc_new_request()
      (it gets provided to ceph_osdc_build_request()).  And rather than
      passing a do_sync flag, have the number of ops in the ops array
      supplied imply adding a second STARTSYNC operation after the READ or
      WRITE requested.
      
      This and some of the patches that follow are related to having the
      messenger (only) be responsible for filling the content of the
      message header, as described here:
          http://tracker.ceph.com/issues/4589Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      acead002
    • H
      ceph: fix buffer pointer advance in ceph_sync_write · 022f3e2e
      Henry C Chang 提交于
      We should advance the user data pointer by _len_ instead of _written_.
      _len_ is the data length written in each iteration while _written_ is the
      accumulated data length we have writtent out.
      Signed-off-by: NHenry C Chang <henry.cy.chang@gmail.com>
      Reviewed-by: NGreg Farnum <greg@inktank.com>
      Tested-by: NSage Weil <sage@inktank.com>
      022f3e2e
    • A
      libceph: record byte count not page count · e0c59487
      Alex Elder 提交于
      Record the byte count for an osd request rather than the page count.
      The number of pages can always be derived from the byte count (and
      alignment/offset) but the reverse is not true.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      e0c59487
    • A
      libceph: separate read and write data · 0fff87ec
      Alex Elder 提交于
      An osd request defines information about where data to be read
      should be placed as well as where data to write comes from.
      Currently these are represented by common fields.
      
      Keep information about data for writing separate from data to be
      read by splitting these into data_in and data_out fields.
      
      This is the key patch in this whole series, in that it actually
      identifies which osd requests generate outgoing data and which
      generate incoming data.  It's less obvious (currently) that an osd
      CALL op generates both outgoing and incoming data; that's the focus
      of some upcoming work.
      
      This resolves:
          http://tracker.ceph.com/issues/4127Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      0fff87ec
    • A
      libceph: distinguish page and bio requests · 2ac2b7a6
      Alex Elder 提交于
      An osd request uses either pages or a bio list for its data.  Use a
      union to record information about the two, and add a data type
      tag to select between them.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      2ac2b7a6
    • A
      libceph: separate osd request data info · 2794a82a
      Alex Elder 提交于
      Pull the fields in an osd request structure that define the data for
      the request out into a separate structure.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      2794a82a
    • A
      libceph: don't assign page info in ceph_osdc_new_request() · 153e5167
      Alex Elder 提交于
      Currently ceph_osdc_new_request() assigns an osd request's
      r_num_pages and r_alignment fields.  The only thing it does
      after that is call ceph_osdc_build_request(), and that doesn't
      need those fields to be assigned.
      
      Move the assignment of those fields out of ceph_osdc_new_request()
      and into its caller.  As a result, the page_align parameter is no
      longer used, so get rid of it.
      
      Note that in ceph_sync_write(), the value for req->r_num_pages had
      already been calculated earlier (as num_pages, and fortunately
      it was computed the same way).  So don't bother recomputing it,
      but because it's not needed earlier, move that calculation after the
      call to ceph_osdc_new_request().  Hold off making the assignment to
      r_alignment, doing it instead r_pages and r_num_pages are
      getting set.
      
      Similarly, in start_read(), nr_pages already holds the number of
      pages in the array (and is calculated the same way), so there's no
      need to recompute it.  Move the assignment of the page alignment
      down with the others there as well.
      
      This and the next few patches are preparation work for:
          http://tracker.ceph.com/issues/4127Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      153e5167
    • A
      ceph: simplify ceph_sync_write() page_align calculation · 3a42b6c4
      Alex Elder 提交于
      (This is being reposted.  The first one had a problem because it
      erroneously added a similar change elsewhere; that change has been
      dropped.)
      
      The next patch in this series points out that the calculation for
      the number of pages in an osd request is getting done twice.  It
      is not obvious, but the result of both calculations is identical.
      This patch simplifies one of them--as a separate step--to make
      it clear that the transformation in the next patch is valid.
      
      In ceph_sync_write() there is some magic that computes page_align
      for an osd request.  But a little analysis shows it can be
      simplified.
      
      First, we have:
       	io_align = pos & ~PAGE_MASK;
      which is used here:
      	page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
      
      Note (pos - io_align) simply rounds "pos" down to the nearest multiple
      of the page size.
      
      We also have:
       	buf_align = (unsigned long)data & ~PAGE_MASK;
      
      Adding buf_align to that rounded-down "pos" value will stay within
      the same page; the result will just be offset by the page offset for
      the "data" pointer.  The final mask therefore leaves just the value
      of "buf_align".
      
      One more simplification.  Note that the result of calc_pages_for()
      is invariant of which page the offset starts in--the only thing that
      matters is the offset within the starting page.  We will have
      put the proper page offset to use into "page_align", so just use
      that in calculating num_pages.
      
      This resolves:
          http://tracker.ceph.com/issues/4166Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      3a42b6c4
    • Y
      ceph: acquire i_mutex in __ceph_do_pending_vmtruncate · 3f99969f
      Yan, Zheng 提交于
      make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
      does not hold the i_mutex, so ceph_aio_read() can call safely.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NGreg Farnum <greg@inktank.com>
      3f99969f
    • Y
      ceph: don't early drop Fw cap · 6070e0c1
      Yan, Zheng 提交于
      ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
      cap dirty before data is copied to page cache and inode size is
      updated. The optimization avoids slow cap revocation caused by
      balance_dirty_pages(), but introduces inode size update race. If
      ceph_check_caps() flushes the dirty cap before the inode size is
      updated, MDS can miss the new inode size. So just remove the
      optimization.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NGreg Farnum <greg@inktank.com>
      6070e0c1
    • S
      ceph: revert commit 22cddde1 · 7971bd92
      Sage Weil 提交于
      commit 22cddde1 breaks the atomicity of write operation, it also
      introduces a deadlock between write and truncate.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NGreg Farnum <greg@inktank.com>
      
      Conflicts:
      	fs/ceph/addr.c
      7971bd92
  2. 23 2月, 2013 1 次提交
  3. 19 2月, 2013 1 次提交
  4. 18 1月, 2013 2 次提交
    • S
      ceph: Check for created flag in response from mds · 6e8575fa
      Sam Lang 提交于
      The mds now sends back a created inode if the create request
      performed the create.  If the file already existed, no inode is
      returned in the reply.  This allows ceph to set the created flag
      in atomic_open so that permissions are properly checked in the case
      that the file wasn't created by the create call to the mds.
      
      To ensure compability with previous kernels, a feature for sending
      back the inode in the create reply was added, so that the mds will
      only send back the inode if the client indicates it supports the
      feature.
      Signed-off-by: NSam Lang <sam.lang@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      6e8575fa
    • S
      ceph: Check for err on mds request in atomic_open · 79aec984
      Sam Lang 提交于
      The error returned by ceph_mdsc_do_request includes errors sending the
      request, errors on timeout, or any errors coming from the mds.  If
      ceph_mdsc_do_request returns an error, the reply struct will most likely
      be bogus.  We need to bail out and propogate the error instead of
      overwriting it.
      Signed-off-by: NSam Lang <sam.lang@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      79aec984
  5. 18 12月, 2012 1 次提交
  6. 06 11月, 2012 1 次提交
    • S
      ceph: Fix i_size update race · 22cddde1
      Sage Weil 提交于
      ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR
      dirty before data is copied to page cache and inode size is updated.
      If ceph_check_caps() flushes the dirty cap before the inode size is
      updated, MDS can miss the new inode size. The fix is move
      ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call
      __ceph_mark_dirty_caps() after inode size is updated.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NSage Weil <sage@inktank.com>
      22cddde1
  7. 02 10月, 2012 1 次提交
  8. 03 8月, 2012 1 次提交
    • S
      ceph: simplify+fix atomic_open · 5ef50c3b
      Sage Weil 提交于
      The initial ->atomic_open op was carried over from the old intent code,
      which was incomplete and didn't really work.  Replace it with a fresh
      method.  In particular:
      
       * always attempt to do an atomic open+lookup, both for the create case
         and for lookups of existing files.
       * fix symlink handling by returning 1 to the VFS so that we can follow
         the link to its destination. This fixes a longstanding ceph bug (#2392).
      Signed-off-by: NSage Weil <sage@inktank.com>
      5ef50c3b
  9. 14 7月, 2012 5 次提交
  10. 08 5月, 2012 1 次提交
  11. 14 12月, 2011 1 次提交
  12. 08 12月, 2011 1 次提交
    • S
      ceph: use i_ceph_lock instead of i_lock · be655596
      Sage Weil 提交于
      We have been using i_lock to protect all kinds of data structures in the
      ceph_inode_info struct, including lists of inodes that we need to iterate
      over while avoiding races with inode destruction.  That requires grabbing
      a reference to the inode with the list lock protected, but igrab() now
      takes i_lock to check the inode flags.
      
      Changing the list lock ordering would be a painful process.
      
      However, using a ceph-specific i_ceph_lock in the ceph inode instead of
      i_lock is a simple mechanical change and avoids the ordering constraints
      imposed by igrab().
      Reported-by: NAmon Ott <a.ott@m-privacy.de>
      Signed-off-by: NSage Weil <sage@newdream.net>
      be655596
  13. 27 7月, 2011 6 次提交