1. 18 1月, 2013 12 次提交
    • A
      libceph: don't set flags in ceph_osdc_alloc_request() · d178a9e7
      Alex Elder 提交于
      The only thing ceph_osdc_alloc_request() really does with the
      flags value it is passed is assign it to the newly-created
      osd request structure.  Do that in the caller instead.
      
      Both callers subsequently call ceph_osdc_build_request(), so have
      that function (instead of ceph_osdc_alloc_request()) issue a warning
      if a request comes through with neither the read nor write flags set.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      d178a9e7
    • A
      libceph: drop osdc from ceph_calc_raw_layout() · e75b45cf
      Alex Elder 提交于
      The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
      of it.  Consequently, the corresponding parameter in calc_layout()
      becomes unused, so get rid of that as well.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      e75b45cf
    • A
      libceph: drop snapid in ceph_calc_raw_layout() · 4d6b250b
      Alex Elder 提交于
      A snapshot id must be provided to ceph_calc_raw_layout() even though
      it is not needed at all for calculating the layout.
      
      Where the snapshot id *is* needed is when building the request
      message for an osd operation.
      
      Drop the snapid parameter from ceph_calc_raw_layout() and pass
      that value instead in ceph_osdc_build_request().
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      4d6b250b
    • A
      libceph: pass length to ceph_calc_file_object_mapping() · e8afad65
      Alex Elder 提交于
      ceph_calc_file_object_mapping() takes (among other things) a "file"
      offset and length, and based on the layout, determines the object
      number ("bno") backing the affected portion of the file's data and
      the offset into that object where the desired range begins.  It also
      computes the size that should be used for the request--either the
      amount requested or something less if that would exceed the end of
      the object.
      
      This patch changes the input length parameter in this function so it
      is used only for input.  That is, the argument will be passed by
      value rather than by address, so the value provided won't get
      updated by the function.
      
      The value would only get updated if the length would surpass the
      current object, and in that case the value it got updated to would
      be exactly that returned in *oxlen.
      
      Only one of the two callers is affected by this change.  Update
      ceph_calc_raw_layout() so it records any updated value.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      e8afad65
    • A
      libceph: pass length to ceph_osdc_build_request() · 0120be3c
      Alex Elder 提交于
      The len argument to ceph_osdc_build_request() is set up to be
      passed by address, but that function never updates its value
      so there's no need to do this.  Tighten up the interface by
      passing the length directly.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      0120be3c
    • A
      libceph: kill op_needs_trail() · 5b9d1b1c
      Alex Elder 提交于
      Since every osd message is now prepared to include trailing data,
      there's no need to check ahead of time whether any operations will
      make use of the trail portion of the message.
      
      We can drop the second argument to get_num_ops(), and as a result we
      can also get rid of op_needs_trail() which is no longer used.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      5b9d1b1c
    • A
      libceph: always allow trail in osd request · c885837f
      Alex Elder 提交于
      An osd request structure contains an optional trail portion, which
      if present will contain data to be passed in the payload portion of
      the message containing the request.  The trail field is a
      ceph_pagelist pointer, and if null it indicates there is no trail.
      
      A ceph_pagelist structure contains a length field, and it can
      legitimately hold value 0.  Make use of this to change the
      interpretation of the "trail" of an osd request so that every osd
      request has trailing data, it just might have length 0.
      
      This means we change the r_trail field in a ceph_osd_request
      structure from a pointer to a structure that is always initialized.
      
      Note that in ceph_osdc_start_request(), the trail pointer (or now
      address of that structure) is assigned to a ceph message's trail
      field.  Here's why that's still OK (looking at net/ceph/messenger.c):
          - What would have resulted in a null pointer previously will now
            refer to a 0-length page list.  That message trail pointer
            is used in two functions, write_partial_msg_pages() and
            out_msg_pos_next().
          - In write_partial_msg_pages(), a null page list pointer is
            handled the same as a message with 0-length trail, and both
            result in a "in_trail" variable set to false.  The trail
            pointer is only used if in_trail is true.
          - The only other place the message trail pointer is used is
            out_msg_pos_next().  That function is only called by
            write_partial_msg_pages() and only touches the trail pointer
            if the in_trail value it is passed is true.
      Therefore a null ceph_msg->trail pointer is equivalent to a non-null
      pointer referring to a 0-length page list structure.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      c885837f
    • A
      rbd: drop oid parameters from ceph_osdc_build_request() · af77f26c
      Alex Elder 提交于
      The last two parameters to ceph_osd_build_request() describe the
      object id, but the values passed always come from the osd request
      structure whose address is also provided.  Get rid of those last
      two parameters.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      af77f26c
    • A
      libceph: reformat __reset_osd() · c3acb181
      Alex Elder 提交于
      Reformat __reset_osd() into three distinct blocks of code
      handling the three return cases.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      c3acb181
    • S
      crush: avoid recursion if we have already collided · 7d7c1f61
      Sage Weil 提交于
      This saves us some cycles, but does not affect the placement result at
      all.
      
      This corresponds to ceph.git commit 4abb53d4f.
      Signed-off-by: NSage Weil <sage@inktank.com>
      7d7c1f61
    • J
      libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed · 1604f488
      Jim Schutt 提交于
      Add libceph support for a new CRUSH tunable recently added to Ceph servers.
      
      Consider the CRUSH rule
        step chooseleaf firstn 0 type <node_type>
      
      This rule means that <n> replicas will be chosen in a manner such that
      each chosen leaf's branch will contain a unique instance of <node_type>.
      
      When an object is re-replicated after a leaf failure, if the CRUSH map uses
      a chooseleaf rule the remapped replica ends up under the <node_type> bucket
      that held the failed leaf.  This causes uneven data distribution across the
      storage cluster, to the point that when all the leaves but one fail under a
      particular <node_type> bucket, that remaining leaf holds all the data from
      its failed peers.
      
      This behavior also limits the number of peers that can participate in the
      re-replication of the data held by the failed leaf, which increases the
      time required to re-replicate after a failure.
      
      For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
      inner and outer descents.
      
      If the tree descent down to <node_type> is the outer descent, and the descent
      from <node_type> down to a leaf is the inner descent, the issue is that a
      down leaf is detected on the inner descent, so only the inner descent is
      retried.
      
      In order to disperse re-replicated data as widely as possible across a
      storage cluster after a failure, we want to retry the outer descent. So,
      fix up crush_choose() to allow the inner descent to return immediately on
      choosing a failed leaf.  Wire this up as a new CRUSH tunable.
      
      Note that after this change, for a chooseleaf rule, if the primary OSD
      in a placement group has failed, choosing a replacement may result in
      one of the other OSDs in the PG colliding with the new primary.  This
      requires that OSD's data for that PG to need moving as well.  This
      seems unavoidable but should be relatively rare.
      
      This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
      Signed-off-by: NJim Schutt <jaschut@sandia.gov>
      Reviewed-by: NSage Weil <sage@inktank.com>
      1604f488
    • Y
      ceph: re-calculate truncate_size for strip object · a41bad1a
      Yan, Zheng 提交于
      Otherwise osd may truncate the object to larger size.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      a41bad1a
  2. 28 12月, 2012 4 次提交
    • S
      libceph: fix protocol feature mismatch failure path · 0fa6ebc6
      Sage Weil 提交于
      We should not set con->state to CLOSED here; that happens in
      ceph_fault() in the caller, where it first asserts that the state
      is not yet CLOSED.  Avoids a BUG when the features don't match.
      
      Since the fail_protocol() has become a trivial wrapper, replace
      calls to it with direct calls to reset_connection().
      Signed-off-by: NSage Weil <sage@inktank.com>
      Reviewed-by: NAlex Elder <elder@inktank.com>
      0fa6ebc6
    • A
      libceph: WARN, don't BUG on unexpected connection states · 122070a2
      Alex Elder 提交于
      A number of assertions in the ceph messenger are implemented with
      BUG_ON(), killing the system if connection's state doesn't match
      what's expected.  At this point our state model is (evidently) not
      well understood enough for these assertions to trigger a BUG().
      Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
      so we learn about these issues without killing the machine.
      
      We now recognize that a connection fault can occur due to a socket
      closure at any time, regardless of the state of the connection.  So
      there is really nothing we can assert about the state of the
      connection at that point so eliminate that assertion.
      Reported-by: NUgis <ugis22@gmail.com>
      Tested-by: NUgis <ugis22@gmail.com>
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      122070a2
    • A
      libceph: always reset osds when kicking · e6d50f67
      Alex Elder 提交于
      When ceph_osdc_handle_map() is called to process a new osd map,
      kick_requests() is called to ensure all affected requests are
      updated if necessary to reflect changes in the osd map.  This
      happens in two cases:  whenever an incremental map update is
      processed; and when a full map update (or the last one if there is
      more than one) gets processed.
      
      In the former case, the kick_requests() call is followed immediately
      by a call to reset_changed_osds() to ensure any connections to osds
      affected by the map change are reset.  But for full map updates
      this isn't done.
      
      Both cases should be doing this osd reset.
      
      Rather than duplicating the reset_changed_osds() call, move it into
      the end of kick_requests().
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      e6d50f67
    • A
      libceph: move linger requests sooner in kick_requests() · ab60b16d
      Alex Elder 提交于
      The kick_requests() function is called by ceph_osdc_handle_map()
      when an osd map change has been indicated.  Its purpose is to
      re-queue any request whose target osd is different from what it
      was when it was originally sent.
      
      It is structured as two loops, one for incomplete but registered
      requests, and a second for handling completed linger requests.
      As a special case, in the first loop if a request marked to linger
      has not yet completed, it is moved from the request list to the
      linger list.  This is as a quick and dirty way to have the second
      loop handle sending the request along with all the other linger
      requests.
      
      Because of the way it's done now, however, this quick and dirty
      solution can result in these incomplete linger requests never
      getting re-sent as desired.  The problem lies in the fact that
      the second loop only arranges for a linger request to be sent
      if it appears its target osd has changed.  This is the proper
      handling for *completed* linger requests (it avoids issuing
      the same linger request twice to the same osd).
      
      But although the linger requests added to the list in the first loop
      may have been sent, they have not yet completed, so they need to be
      re-sent regardless of whether their target osd has changed.
      
      The first required fix is we need to avoid calling __map_request()
      on any incomplete linger request.  Otherwise the subsequent
      __map_request() call in the second loop will find the target osd
      has not changed and will therefore not re-send the request.
      
      Second, we need to be sure that a sent but incomplete linger request
      gets re-sent.  If the target osd is the same with the new osd map as
      it was when the request was originally sent, this won't happen.
      This can be fixed through careful handling when we move these
      requests from the request list to the linger list, by unregistering
      the request *before* it is registered as a linger request.  This
      works because a side-effect of unregistering the request is to make
      the request's r_osd pointer be NULL, and *that* will ensure the
      second loop actually re-sends the linger request.
      
      Processing of such a request is done at that point, so continue with
      the next one once it's been moved.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      ab60b16d
  3. 21 12月, 2012 5 次提交
  4. 18 12月, 2012 2 次提交
    • A
      libceph: socket can close in any connection state · 7bb21d68
      Alex Elder 提交于
      A connection's socket can close for any reason, independent of the
      state of the connection (and without irrespective of the connection
      mutex).  As a result, the connectino can be in pretty much any state
      at the time its socket is closed.
      
      Handle those other cases at the top of con_work().  Pull this whole
      block of code into a separate function to reduce the clutter.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      7bb21d68
    • A
      rbd: remove linger unconditionally · 61c74035
      Alex Elder 提交于
      In __unregister_linger_request(), the request is being removed
      from the osd client's req_linger list only when the request
      has a non-null osd pointer.  It should be done whether or not
      the request currently has an osd.
      
      This is most likely a non-issue because I believe the request
      will always have an osd when this function is called.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      61c74035
  5. 17 12月, 2012 2 次提交
    • A
      libceph: avoid using freed osd in __kick_osd_requests() · 685a7555
      Alex Elder 提交于
      If an osd has no requests and no linger requests, __reset_osd()
      will just remove it with a call to __remove_osd().  That drops
      a reference to the osd, and therefore the osd may have been free
      by the time __reset_osd() returns.  That function offers no
      indication this may have occurred, and as a result the osd will
      continue to be used even when it's no longer valid.
      
      Change__reset_osd() so it returns an error (ENODEV) when it
      deletes the osd being reset.  And change __kick_osd_requests() so it
      returns immediately (before referencing osd again) if __reset_osd()
      returns *any* error.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      685a7555
    • A
      ceph: don't reference req after put · 7d5f2481
      Alex Elder 提交于
      In __unregister_request(), there is a call to list_del_init()
      referencing a request that was the subject of a call to
      ceph_osdc_put_request() on the previous line.  This is not
      safe, because the request structure could have been freed
      by the time we reach the list_del_init().
      
      Fix this by reversing the order of these lines.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-off-by: NSage Weil <sage@inktank.com>
      7d5f2481
  6. 13 12月, 2012 1 次提交
    • S
      libceph: remove 'osdtimeout' option · 83aff95e
      Sage Weil 提交于
      This would reset a connection with any OSD that had an outstanding
      request that was taking more than N seconds.  The idea was that if the
      OSD was buggy, the client could compensate by resending the request.
      
      In reality, this only served to hide server bugs, and we haven't
      actually seen such a bug in quite a while.  Moreover, the userspace
      client code never did this.
      
      More importantly, often the request is taking a long time because the
      OSD is trying to recover, or overloaded, and killing the connection
      and retrying would only make the situation worse by giving the OSD
      more work to do.
      Signed-off-by: NSage Weil <sage@inktank.com>
      Reviewed-by: NAlex Elder <elder@inktank.com>
      83aff95e
  7. 01 11月, 2012 1 次提交
  8. 30 10月, 2012 1 次提交
  9. 27 10月, 2012 1 次提交
    • S
      libceph: avoid NULL kref_put from NULL alloc_msg return · 7246240c
      Sage Weil 提交于
      The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
      may return NULL.  It also drops con->mutex while it allocates a message,
      which means that the connection state may change (e.g., get closed).  If
      that happens, we clean up and bail out.  Avoid calling ceph_msg_put() on
      a NULL return value and triggering a crash.
      
      This was observed when an ->alloc_msg() call races with a timeout that
      resends a zillion messages and resets the connection, and ->alloc_msg()
      returns NULL (because the request was resent to another target).
      
      Fixes http://tracker.newdream.net/issues/3342Signed-off-by: NSage Weil <sage@inktank.com>
      Reviewed-by: NAlex Elder <elder@inktank.com>
      7246240c
  10. 10 10月, 2012 3 次提交
    • A
      rbd: define common queue_con_delay() · 802c6d96
      Alex Elder 提交于
      This patch defines a single function, queue_con_delay() to call
      queue_delayed_work() for a connection.  It basically generalizes
      what was previously queue_con() by adding the delay argument.
      queue_con() is now a simple helper that passes 0 for its delay.
      queue_con_delay() returns 0 if it queued work or an errno if it
      did not for some reason.
      
      If con_work() finds the BACKOFF flag set for a connection, it now
      calls queue_con_delay() to handle arranging to start again after a
      delay.
      
      Note about connection reference counts:  con_work() only ever gets
      called as a work item function.  At the time that work is scheduled,
      a reference to the connection is acquired, and the corresponding
      con_work() call is then responsible for dropping that reference
      before it returns.
      
      Previously, the backoff handling inside con_work() silently handed
      off its reference to delayed work it scheduled.  Now that
      queue_con_delay() is used, a new reference is acquired for the
      newly-scheduled work, and the original reference is dropped by the
      con->ops->put() call at the end of the function.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      802c6d96
    • A
      rbd: let con_work() handle backoff · 8618e30b
      Alex Elder 提交于
      Both ceph_fault() and con_work() include handling for imposing a
      delay before doing further processing on a faulted connection.
      The latter is used only if ceph_fault() is unable to.
      
      Instead, just let con_work() always be responsible for implementing
      the delay.  After setting up the delay value, set the BACKOFF flag
      on the connection unconditionally and call queue_con() to ensure
      con_work() will get called to handle it.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      8618e30b
    • A
      rbd: reset BACKOFF if unable to re-queue · 588377d6
      Alex Elder 提交于
      If ceph_fault() is unable to queue work after a delay, it sets the
      BACKOFF connection flag so con_work() will attempt to do so.
      
      In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
      result in newly-queued work, it simply ignores this condition and
      proceeds as if no backoff delay were desired.  There are two
      problems with this--one of which is a bug.
      
      The first problem is simply that the intended behavior is to back
      off, and if we aren't able queue the work item to run after a delay
      we're not doing that.
      
      The only reason queue_delayed_work() won't queue work is if the
      provided work item is already queued.  In the messenger, this
      means that con_work() is already scheduled to be run again.  So
      if we simply set the BACKOFF flag again when this occurs, we know
      the next con_work() call will again attempt to hold off activity
      on the connection until after the delay.
      
      The second problem--the bug--is a leak of a reference count.  If
      queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
      the connection reference held on entry to con_work().  However,
      processing is (was) allowed to continue, and at the end of the
      function a second con->ops->put() is called.
      
      This patch fixes both problems.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      588377d6
  11. 02 10月, 2012 5 次提交
  12. 22 9月, 2012 1 次提交
    • A
      libceph: only kunmap kmapped pages · 5ce765a5
      Alex Elder 提交于
      In write_partial_msg_pages(), pages need to be kmapped in order to
      perform a CRC-32c calculation on them.  As an artifact of the way
      this code used to be structured, the kunmap() call was separated
      from the kmap() call and both were done conditionally.  But the
      conditions under which the kmap() and kunmap() calls were made
      differed, so there was a chance a kunmap() call would be done on a
      page that had not been mapped.
      
      The symptom of this was tripping a BUG() in kunmap_high() when
      pkmap_count[nr] became 0.
      Reported-by: NBryan K. Wright <bryan@virginia.edu>
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      5ce765a5
  13. 22 8月, 2012 1 次提交
    • J
      libceph: avoid truncation due to racing banners · 6d4221b5
      Jim Schutt 提交于
      Because the Ceph client messenger uses a non-blocking connect, it is
      possible for the sending of the client banner to race with the
      arrival of the banner sent by the peer.
      
      When ceph_sock_state_change() notices the connect has completed, it
      schedules work to process the socket via con_work().  During this
      time the peer is writing its banner, and arrival of the peer banner
      races with con_work().
      
      If con_work() calls try_read() before the peer banner arrives, there
      is nothing for it to do, after which con_work() calls try_write() to
      send the client's banner.  In this case Ceph's protocol negotiation
      can complete succesfully.
      
      The server-side messenger immediately sends its banner and addresses
      after accepting a connect request, *before* actually attempting to
      read or verify the banner from the client.  As a result, it is
      possible for the banner from the server to arrive before con_work()
      calls try_read().  If that happens, try_read() will read the banner
      and prepare protocol negotiation info via prepare_write_connect().
      prepare_write_connect() calls con_out_kvec_reset(), which discards
      the as-yet-unsent client banner.  Next, con_work() calls
      try_write(), which sends the protocol negotiation info rather than
      the banner that the peer is expecting.
      
      The result is that the peer sees an invalid banner, and the client
      reports "negotiation failed".
      
      Fix this by moving con_out_kvec_reset() out of
      prepare_write_connect() to its callers at all locations except the
      one where the banner might still need to be sent.
      
      [elder@inktak.com: added note about server-side behavior]
      Signed-off-by: NJim Schutt <jaschut@sandia.gov>
      Reviewed-by: NAlex Elder <elder@inktank.com>
      6d4221b5
  14. 21 8月, 2012 1 次提交