1. 02 May, 2013 28 commits
    • A
      libceph: implement multiple data items in a message · ca8b3a69
      Alex Elder committed
      This patch adds support to the messenger for more than one data item
      in its data list.
      
      A message data cursor has two more fields to support this:
          - a count of the number of bytes left to be consumed across
            all data items in the list, "total_resid"
          - a pointer to the head of the list (for validation only)
      
      The cursor initialization routine has been split into two parts: the
      outer one, which initializes the cursor for traversing the entire
      list of data items; and the inner one, which initializes the cursor
      to start processing a single data item.
      
      When a message cursor is first initialized, the outer initialization
      routine sets total_resid to the length provided.  The data pointer
      is initialized to the first data item on the list.  From there, the
      inner initialization routine finishes by setting up to process the
      data item the cursor points to.
      
      Advancing the cursor consumes bytes in total_resid.  If the resid
      field reaches zero, it means the current data item is fully
      consumed.  If total_resid indicates there is more data, the cursor
      is advanced to point to the next data item, and then the inner
      initialization routine prepares for using that.  (A check is made at
      this point to make sure we don't wrap around the front of the list.)
      
      The type-specific init routines are modified so they can be given a
      length that's larger than what the data item can support.  The resid
      field is initialized to the smaller of the provided length and the
      length of the entire data item.
      
      When total_resid reaches zero, we're done.
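
      The outer/inner initialization and the advance step described above can
      be sketched roughly as follows.  This is a simplified illustration, not
      the kernel code: the struct and function names are hypothetical, and a
      NULL-terminated singly linked list stands in for the kernel's list head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct data_item {
    size_t length;              /* full length of this data item */
    struct data_item *next;     /* next item on the list, or NULL */
};

struct cursor {
    size_t total_resid;         /* bytes left across all data items */
    size_t resid;               /* bytes left in the current data item */
    struct data_item *data;     /* data item currently being processed */
};

/* Inner init: set up to process the item the cursor points to. */
static void cursor_init_item(struct cursor *c)
{
    size_t len = c->data->length;

    /* resid is the smaller of the remaining total and the item length */
    c->resid = c->total_resid < len ? c->total_resid : len;
}

/* Outer init: prepare to traverse the entire list of data items. */
static void cursor_init(struct cursor *c, struct data_item *head, size_t length)
{
    c->total_resid = length;
    c->resid = 0;
    c->data = head;
    if (head && length)
        cursor_init_item(c);
}

/* Consume bytes; returns 1 if the cursor moved on to a new data item. */
static int cursor_advance(struct cursor *c, size_t bytes)
{
    assert(bytes <= c->resid);
    c->resid -= bytes;
    c->total_resid -= bytes;

    if (c->resid == 0 && c->total_resid != 0) {
        assert(c->data->next != NULL);  /* must not run off the list */
        c->data = c->data->next;
        cursor_init_item(c);
        return 1;
    }
    return 0;   /* still in the same item, or everything is consumed */
}
```

      With items of 4 and 3 bytes and a total length of 6, the cursor starts
      with resid 4, moves to the second item after 4 bytes, and is done when
      total_resid reaches zero after 2 more.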
      
      This resolves:
          http://tracker.ceph.com/issues/3761

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      ca8b3a69
    • A
      libceph: replace message data pointer with list · 5240d9f9
      Alex Elder committed
      In place of the message data pointer, use a list head which links
      through message data items.  For now we only support a single entry
      on that list.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      5240d9f9
    • A
      libceph: have cursor point to data · 8ae4f4f5
      Alex Elder committed
      Rather than having a ceph message data item point to the cursor it's
      associated with, have the cursor point to a data item.  This will
      allow a message cursor to be used for more than one data item.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      8ae4f4f5
    • A
      libceph: move cursor into message · 36153ec9
      Alex Elder committed
      A message will only be processing a single data item at a time, so
      there's no need for each data item to have its own cursor.
      
      Move the cursor embedded in the message data structure into the
      message itself.  To minimize the impact, keep the data->cursor
      field, but make it be a pointer to the cursor in the message.
      
      Move the definition of ceph_msg_data above ceph_msg_data_cursor so
      the cursor can point to the data without a forward definition rather
      than vice-versa.
      
      This and the upcoming patches are part of:
          http://tracker.ceph.com/issues/3761

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      36153ec9
    • A
      libceph: record bio length · c851c495
      Alex Elder committed
      The bio is the only data item type that doesn't record its full
      length.  Fix that.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      c851c495
    • A
      libceph: fix possible CONFIG_BLOCK build problem · ea96571f
      Alex Elder committed
      This patch:
          15a0d7b libceph: record message data length
      did not enclose some bio-specific code inside CONFIG_BLOCK as
      it should have.  Fix that.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      ea96571f
    • A
      libceph: record message data length · a1930804
      Alex Elder committed
      Keep track of the length of the data portion for a message in a
      separate field in the ceph_msg structure.  This information has
      been maintained in wire byte order in the message header, but
      that's going to change soon.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      a1930804
    • A
      libceph: make message data be a pointer · 6644ed7b
      Alex Elder committed
      Begin the transition from a single message data item to a list of
      them by replacing the "data" structure in a message with a pointer
      to a ceph_msg_data structure.
      
      A null pointer will indicate the message has no data; replace the
      use of ceph_msg_has_data() with a simple check for a null pointer.
      
      Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
      to dynamically allocate and free a data item structure of a given type.
      
      When a message has its data item "set," allocate one of these to
      hold the data description, and free it when the last reference to
      the message is dropped.
      
      This partially resolves:
          http://tracker.ceph.com/issues/4429

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      6644ed7b
    • A
      libceph: kill last of ceph_msg_pos · f5db90bc
      Alex Elder committed
      The only remaining field in the ceph_msg_pos structure is
      did_page_crc.  In the new cursor model of things that flag (or
      something like it) belongs in the cursor.
      
      Define a new field "need_crc" in the cursor (which applies to all
      types of data) and initialize it to true whenever a cursor is
      initialized.
      
      In write_partial_message_data(), the data CRC still will be computed
      as before, but it will check the cursor->need_crc field to determine
      whether it's needed.  Any time the cursor is advanced to a new piece
      of a data item, need_crc will be set, and this will cause the crc
      for that entire piece to be accumulated into the data crc.
      
      In write_partial_message_data() the intermediate crc value is now
      held in a local variable so it doesn't have to be byte-swapped so
      many times.  In read_partial_msg_data() we do something similar
      (but mainly for consistency there).
      
      With that, the ceph_msg_pos structure can go away, and it no longer
      needs to be passed as an argument to prepare_message_data().
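
      A rough sketch of the need_crc idea: the CRC for a whole piece is
      accumulated once, when the cursor first reaches that piece, even if the
      piece is then sent in several partial writes.  Everything here is an
      assumption made for illustration (the names, the tiny PIECE_SZ standing
      in for a page, and the bitwise CRC-32C update routine); it is not the
      messenger code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIECE_SZ 4u   /* tiny stand-in for a page, to keep the example small */

struct crc_cursor {
    const unsigned char *data;
    size_t length;    /* total bytes of data */
    size_t offset;    /* bytes consumed so far */
    int need_crc;     /* set whenever the cursor reaches a new piece */
};

/* Bitwise CRC-32C update with no initial or final inversion. */
static uint32_t crc_update(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--) {
        int k;

        crc ^= *p++;
        for (k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc;
}

static void crc_cursor_init(struct crc_cursor *c,
                            const unsigned char *data, size_t length)
{
    c->data = data;
    c->length = length;
    c->offset = 0;
    c->need_crc = 1;  /* always true for a freshly initialized cursor */
}

/* "Send" up to max bytes of the current piece; returns the updated CRC. */
static uint32_t send_partial(struct crc_cursor *c, uint32_t crc, size_t max)
{
    size_t piece_start, piece_end, n;

    if (c->offset >= c->length)
        return crc;                       /* nothing left to send */

    piece_start = c->offset - c->offset % PIECE_SZ;
    piece_end = piece_start + PIECE_SZ;
    if (piece_end > c->length)
        piece_end = c->length;

    if (c->need_crc) {
        /* accumulate the entire piece once, before any of it is sent */
        crc = crc_update(crc, c->data + piece_start, piece_end - piece_start);
        c->need_crc = 0;
    }

    n = piece_end - c->offset;
    if (n > max)
        n = max;                          /* short write */
    c->offset += n;
    if (c->offset == piece_end)
        c->need_crc = 1;                  /* next call starts a new piece */
    return crc;
}
```

      Keeping the running CRC in a local variable and passing it through, as
      here, mirrors the commit's point about avoiding repeated byte swaps of
      the header field.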
      
      This cleanup is related to:
          http://tracker.ceph.com/issues/4428

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      f5db90bc
    • A
      libceph: kill most of ceph_msg_pos · 859a35d5
      Alex Elder committed
      All but one of the fields in the ceph_msg_pos structure are now
      never used (only assigned), so get rid of them.  This allows
      several small blocks of code to go away.
      
      This is cleanup of old code related to:
          http://tracker.ceph.com/issues/4428

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      859a35d5
    • A
      libceph: collapse all data items into one · 4c59b4a2
      Alex Elder committed
      It turns out that only one of the data item types is ever used at
      any one time in a single message (currently).
          - A page array is used by the osd client (on behalf of the file
            system) and by rbd.  Only one osd op (and therefore at most
            one data item) is ever used at a time by rbd.  And the only
            time the file system sends two, the second op contains no
            data.
          - A bio is only used by the rbd client (and again, only one
            data item per message)
          - A page list is used by the file system and by rbd for outgoing
            data, but only one op (and one data item) at a time.
      
      We can therefore collapse all three of our data item fields into a
      single field "data", and depend on the messenger code to properly
      handle it based on its type.
      
      This allows us to eliminate quite a bit of duplicated code.
      
      This is related to:
          http://tracker.ceph.com/issues/4429

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      4c59b4a2
    • A
      libceph: kill ceph message bio_iter, bio_seg · 6518be47
      Alex Elder committed
      The bio_iter and bio_seg fields in a message are no longer used; we
      use the cursor instead.  So get rid of them and the functions that
      operate on them.
      
      This is related to:
          http://tracker.ceph.com/issues/4428

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      6518be47
    • A
      libceph: record residual bytes for all message data types · 25aff7c5
      Alex Elder committed
      All of the data types can use this, not just the page array.  Until
      now, only the bio type has lacked it, and only the initiator of the
      request (the rbd client) is able to supply the length of the full
      request without re-scanning the bio list.  Change the cursor init
      routines so the length is supplied based on the message header
      "data_len" field, and use that length to initialize the "resid"
      field of the cursor.
      
      In addition, change the way "last_piece" is defined so it is based
      on the residual number of bytes in the original request.  This is
      necessary (at least for bio messages) because it is possible for
      a read request to succeed without consuming all of the space
      available in the data buffer.
      
      This resolves:
          http://tracker.ceph.com/issues/4427

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      25aff7c5
    • A
      libceph: kill message trail · 9d2a06c2
      Alex Elder committed
      The wart that is the ceph message trail can now be removed, because
      its only user was the osd client, and the previous patch made that
      no longer the case.
      
      The result allows write_partial_msg_pages() to be simplified
      considerably.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      9d2a06c2
    • A
      libceph: implement pages array cursor · e766d7b5
      Alex Elder committed
      Implement and use cursor routines for page array message data items
      for outbound message data.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      e766d7b5
    • A
      libceph: implement bio message data item cursor · 6aaa4511
      Alex Elder committed
      Implement and use cursor routines for bio message data items for
      outbound message data.
      
      (See the previous commit for reasoning in support of the changes
      in out_msg_pos_next().)
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      6aaa4511
    • A
      libceph: prepare for other message data item types · dd236fcb
      Alex Elder committed
      This just inserts some infrastructure in preparation for handling
      other types of ceph message data items.  No functional changes,
      just trying to simplify review by separating out some noise.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      dd236fcb
    • A
      libceph: start defining message data cursor · fe38a2b6
      Alex Elder committed
      This patch lays out the foundation for using generic routines to
      manage processing items of message data.
      
      For simplicity, we'll start with just the trail portion of a
      message, because it stands alone and is only present for outgoing
      data.
      
      First some basic concepts.  We'll use the term "data item" to
      represent one of the ceph_msg_data structures associated with a
      message.  There are currently four of those, with single-letter
      field names p, l, b, and t.  A data item is further broken into
      "pieces" which always lie in a single page.  A data item will
      include a "cursor" that will track state as the memory defined by
      the item is consumed by sending data from or receiving data into it.
      
      We define three routines to manipulate a data item's cursor: the
      "init" routine; the "next" routine; and the "advance" routine.  The
      "init" routine initializes the cursor so it points at the beginning
      of the first piece in the item.  The "next" routine returns the
      page, page offset, and length (limited by both the page and item
      size) of the next unconsumed piece in the item.  It also indicates
      to the caller whether the piece being returned is the last one in
      the data item.
      
      The "advance" routine consumes the requested number of bytes in the
      item (advancing the cursor).  This is used to record the number of
      bytes from the current piece that were actually sent or received by
      the network code.  It returns an indication of whether the result
      means the current piece has been fully consumed.  This is used by
      the message send code to determine whether it should calculate the
      CRC for the next piece processed.
      
      The trail of a message is implemented as a ceph pagelist.  The
      routines defined for it will be usable for non-trail pagelist data
      as well.
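
      The three routines described above might look roughly like this for a
      data item made of consecutive pages.  This is only a sketch under stated
      assumptions: the names, the PAGE_SZ constant, and the use of a page
      index instead of a struct page pointer are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u   /* assumed page size for this sketch */

struct piece {
    size_t page_index;  /* which page of the item the piece lies in */
    size_t page_offset; /* offset of the piece within that page */
    size_t length;      /* limited by both the page and the item size */
    int last_piece;     /* nonzero if this is the item's final piece */
};

struct item_cursor {
    size_t resid;       /* bytes of the item not yet consumed */
    size_t page_index;
    size_t page_offset;
};

/* "init": point at the beginning of the first piece in the item. */
static void item_cursor_init(struct item_cursor *c, size_t alignment,
                             size_t length)
{
    c->resid = length;
    c->page_index = 0;
    c->page_offset = alignment % PAGE_SZ;
}

/* "next": describe the next unconsumed piece without consuming it. */
static struct piece item_cursor_next(const struct item_cursor *c)
{
    struct piece p;
    size_t room = PAGE_SZ - c->page_offset;

    p.page_index = c->page_index;
    p.page_offset = c->page_offset;
    p.length = c->resid < room ? c->resid : room;
    p.last_piece = (p.length == c->resid);
    return p;
}

/* "advance": consume bytes; returns 1 once the current piece is used up. */
static int item_cursor_advance(struct item_cursor *c, size_t bytes)
{
    assert(bytes <= c->resid);
    assert(c->page_offset + bytes <= PAGE_SZ);

    c->resid -= bytes;
    c->page_offset += bytes;
    if (c->page_offset == PAGE_SZ) {      /* crossed into the next page */
        c->page_offset = 0;
        c->page_index++;
        return 1;
    }
    return c->resid == 0;                 /* item ended mid-page */
}
```

      A caller would loop on "next", hand the piece to the network code, and
      call "advance" with however many bytes were actually transferred; a
      return of 1 is the signal to compute the CRC for the next piece.
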
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      fe38a2b6
    • A
      libceph: abstract message data · 43794509
      Alex Elder committed
      Group the types of message data into an abstract structure with a
      type indicator and a union containing fields appropriate to the
      type of data it represents.  Use this to represent the pages,
      pagelist, bio, and trail in a ceph message.
      
      Verify message data is of type NONE in ceph_msg_data_set_*()
      routines.  Since information about message data of type NONE really
      should not be interpreted, get rid of the other assertions in those
      functions.
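
      A type indicator plus a union, with a setter that insists the data is
      still of type NONE, could be sketched like this.  The enum values, field
      names, and helper names below are hypothetical and only illustrate the
      shape of the abstraction; assert() stands in for the kernel's BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

enum msg_data_type {
    MSG_DATA_NONE,      /* no data set; union contents must not be read */
    MSG_DATA_PAGES,     /* array of pages */
    MSG_DATA_PAGELIST,  /* list of pages */
    MSG_DATA_BIO        /* list of bios */
};

struct msg_data {
    enum msg_data_type type;
    union {
        struct {
            void **pages;
            size_t length;
            unsigned int alignment;
        } pages;
        struct {
            void *pagelist;
            size_t length;
        } pagelist;
        struct {
            void *bio;
            size_t length;
        } bio;
    } u;
};

/* Set page array data; the item must not have been set already. */
static void msg_data_set_pages(struct msg_data *d, void **pages,
                               size_t length, unsigned int alignment)
{
    assert(d->type == MSG_DATA_NONE);
    d->type = MSG_DATA_PAGES;
    d->u.pages.pages = pages;
    d->u.pages.length = length;
    d->u.pages.alignment = alignment;
}

/* Dispatch on the type indicator to find the data length. */
static size_t msg_data_length(const struct msg_data *d)
{
    switch (d->type) {
    case MSG_DATA_PAGES:
        return d->u.pages.length;
    case MSG_DATA_PAGELIST:
        return d->u.pagelist.length;
    case MSG_DATA_BIO:
        return d->u.bio.length;
    case MSG_DATA_NONE:
    default:
        return 0;
    }
}
```
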
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      43794509
    • A
      libceph: be explicit about message data representation · f9e15777
      Alex Elder committed
      A ceph message has a data payload portion.  The memory for that data
      (either the source of data to send or the location to place data
      that is received) is specified in several ways.  The ceph_msg
      structure includes fields for all of those ways, but this
      misrepresents the fact that not all of them are used at a time.
      
      Specifically, the data in a message can be in:
          - an array of pages
          - a list of pages
          - a list of Linux bios
          - a second list of pages (the "trail")
      (The two page lists are currently only ever used for outgoing data.)
      
      Impose more structure on the ceph message, making the grouping of
      some of these fields explicit.  Shorten the name of the
      "page_alignment" field.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      f9e15777
    • A
      libceph: define ceph_msg_has_*() data macros · 97fb1c7f
      Alex Elder committed
      Define and use macros ceph_msg_has_*() to determine whether to
      operate on the pages, pagelist, bio, and trail fields of a message.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      97fb1c7f
    • A
      libceph: record message data byte length · 4a73ef27
      Alex Elder committed
      Record the number of bytes of data in a page array rather than the
      number of pages in the array.  It can be assumed that the page array
      is of sufficient size to hold the number of bytes indicated (and
      offset by the indicated alignment).
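
      When only a byte length and an alignment are recorded, the page count
      can still be derived when it is needed.  A minimal sketch, assuming a
      4096-byte page; the helper name and the zero-length guard are choices
      made for this example rather than the kernel's own helper.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u   /* assumed page size for this sketch */

/* Number of pages spanned by `length` bytes starting at byte `offset`. */
static size_t pages_for(size_t offset, size_t length)
{
    if (length == 0)
        return 0;
    return (offset + length + PAGE_SZ - 1) / PAGE_SZ - offset / PAGE_SZ;
}
```

      For example, 4096 bytes at offset 0 fit in one page, but the same
      4096 bytes at offset 1 span two.
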
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      4a73ef27
    • A
      libceph: isolate other message data fields · 27fa8385
      Alex Elder committed
      Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
      ceph_msg_data_set_trail() to clearly abstract the assignment of the
      remaining data-related fields in a ceph message structure.  Use the
      new functions in the osd client and mds client.
      
      This partially resolves:
          http://tracker.ceph.com/issues/4263

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      27fa8385
    • A
      libceph: set page info with byte length · f1baeb2b
      Alex Elder committed
      When setting page array information for message data, provide the
      byte length rather than the page count to ceph_msg_data_set_pages().
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      f1baeb2b
    • A
      libceph: isolate message page field manipulation · 02afca6c
      Alex Elder committed
      Define a function ceph_msg_data_set_pages(), which more clearly
      abstracts the assignment of page-related fields for data in a ceph
      message structure.  Use this new function in the osd client and mds
      client.
      
      Ideally, these fields would never be set more than once (with
      BUG_ON() calls to guarantee that).  At the moment though the osd
      client sets these every time it receives a message, and in the event
      of a communication problem this can happen more than once.  (This
      will be resolved shortly, but setting up these helpers first makes
      it all a bit easier to work with.)
      
      Rearrange the field order in a ceph_msg structure to group those
      that are used to define the possible data payloads.
      
      This partially resolves:
          http://tracker.ceph.com/issues/4263

      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      02afca6c
    • A
      libceph: kill ceph_msg->pagelist_count · ec02a2f2
      Alex Elder committed
      The pagelist_count field is never actually used, so get rid of it.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      ec02a2f2
    • A
      libceph: distinguish page array and pagelist count · d4b515fa
      Alex Elder committed
      Use distinct fields for tracking the number of pages in a message's
      page array and in a message's page list.  Currently only one or the
      other is used at a time, but that will be changing soon.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      d4b515fa
    • A
      libceph: make ceph_msg->bio_seg be unsigned · 07c09b72
      Alex Elder committed
      The bio_seg field is used by the ceph messenger in iterating through
      a bio.  It should never have a negative value, so make it an
      unsigned.  (I contemplated making it unsigned short to match the
      struct bio definition, but it offered no benefit.)
      
      Change variables used to hold bio_seg values to all be unsigned as
      well.  Change two variable names in init_bio_iter() to match the
      convention used everywhere else.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      07c09b72
  2. 14 Feb, 2013 1 commit
  3. 03 Oct, 2012 1 commit
  4. 31 Jul, 2012 4 commits
    • S
      libceph: clean up con flags · 4a861692
      Sage Weil committed
      Rename flags with CON_FLAG prefix, move the definitions into the c file,
      and (better) document their meaning.
      Signed-off-by: Sage Weil <sage@inktank.com>
      4a861692
    • S
      libceph: replace connection state bits with states · 8dacc7da
      Sage Weil committed
      Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
      and use those instead of the state bits.  All of the con->state checks are
      now under the protection of the con mutex, so this is safe.  It also
      simplifies many of the state checks because we can check for anything other
      than the expected state instead of various bits for races we can think of.
      
      This appears to hold up well to stress testing both with and without socket
      failure injection on the server side.
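
      The benefit of a single enumerated state over independent state bits
      can be sketched as below.  The six state names are illustrative
      assumptions (the commit message does not list them), and the mutex that
      protects con->state in the real code is omitted to keep the example
      self-contained.

```c
#include <assert.h>

/* Hypothetical names for the six enumerated connection states. */
enum con_state {
    CON_CLOSED,
    CON_PREOPEN,
    CON_CONNECTING,
    CON_NEGOTIATING,
    CON_OPEN,
    CON_STANDBY
};

struct conn {
    enum con_state state;   /* in the real code, guarded by the con mutex */
};

/*
 * Complete negotiation.  Instead of testing several bits for every race
 * we can think of, reject anything other than the one expected state.
 * Returns 0 on success, -1 if the connection was in any other state.
 */
static int con_finish_negotiation(struct conn *c)
{
    if (c->state != CON_NEGOTIATING)
        return -1;
    c->state = CON_OPEN;
    return 0;
}
```
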
      Signed-off-by: Sage Weil <sage@inktank.com>
      8dacc7da
    • G
      libceph: prevent the race of incoming work during teardown · a2a32584
      Guanjun He committed
      Add an atomic variable 'stopping' as a flag in struct ceph_messenger,
      set this flag to 1 in ceph_destroy_client(), and test the flag in
      ceph_data_ready(): if it is set (1), just return.
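
      The pattern can be sketched in userspace C with C11 atomics.  The
      struct and function names are hypothetical stand-ins, and the
      work_queued counter merely represents "incoming work was queued"; the
      kernel code uses its own atomic_t primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

struct messenger {
    atomic_int stopping;    /* set to 1 once teardown has begun */
    int work_queued;        /* stand-in for queued incoming work */
};

static void messenger_init(struct messenger *m)
{
    atomic_init(&m->stopping, 0);
    m->work_queued = 0;
}

/* Called at the start of teardown, before resources are released. */
static void messenger_stop(struct messenger *m)
{
    atomic_store(&m->stopping, 1);
}

/* Socket callback: ignore new data once teardown is in progress. */
static void data_ready(struct messenger *m)
{
    if (atomic_load(&m->stopping))
        return;
    m->work_queued++;
}
```
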
      Signed-off-by: Guanjun He <gjhe@suse.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
      a2a32584
    • S
      libceph: fix messenger retry · a16cb1f7
      Sage Weil committed
      In ancient times, the messenger could both initiate and accept connections.
      An artifact of that was data structures to store/process an incoming
      ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
      Sadly, the negotiation code was referencing those structures and ignoring
      important information (like the peer's connect_seq) from the correct ones.
      
      Among other things, this fixes tight reconnect loops where the server sends
      RETRY_SESSION and we (the client) retry with the same connect_seq as last
      time.  This bug is pretty easily triggered by injecting socket failures on the
      MDS and running some fs workload like workunits/direct_io/test_sync_io.
      Signed-off-by: Sage Weil <sage@inktank.com>
      a16cb1f7
  5. 18 Jul, 2012 1 commit
    • S
      libceph: fix messenger retry · 5bdca4e0
      Sage Weil committed
      In ancient times, the messenger could both initiate and accept connections.
      An artifact of that was data structures to store/process an incoming
      ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
      Sadly, the negotiation code was referencing those structures and ignoring
      important information (like the peer's connect_seq) from the correct ones.
      
      Among other things, this fixes tight reconnect loops where the server sends
      RETRY_SESSION and we (the client) retry with the same connect_seq as last
      time.  This bug is pretty easily triggered by injecting socket failures on the
      MDS and running some fs workload like workunits/direct_io/test_sync_io.
      Signed-off-by: Sage Weil <sage@inktank.com>
      5bdca4e0
  6. 06 Jul, 2012 3 commits
  7. 22 Jun, 2012 1 commit
  8. 06 Jun, 2012 1 commit
    • A
      libceph: make ceph_con_revoke_message() a msg op · 8921d114
      Alex Elder committed
      ceph_con_revoke_message() is passed both a message and a ceph
      connection.  A ceph_msg allocated for incoming messages on a
      connection always has a pointer to that connection, so there's no
      need to provide the connection when revoking such a message.
      
      Note that the existing logic does not preclude the message supplied
      being a null/bogus message pointer.  The only user of this interface
      is the OSD client, and the only value an osd client passes is a
      request's r_reply field.  That is always non-null (except briefly in
      an error path in ceph_osdc_alloc_request(), and that drops the
      only reference so the request won't ever have a reply to revoke).
      So we can safely assume the passed-in message is non-null, but add a
      BUG_ON() to make it very obvious we are imposing this restriction.
      
      Rename the function ceph_msg_revoke_incoming() to reflect that it is
      really an operation on an incoming message.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
      8921d114