1. 22 3月, 2012 40 次提交
    • A
      rbd: don't drop the rbd_id too early · 32eec68d
      Alex Elder 提交于
      Currently an rbd device's id is released when it is removed, but it
      is done before the code is run to clean up sysfs-related files (such
      as /sys/bus/rbd/devices/1).
      
      It's possible that an rbd is still in use after the rbd_remove()
      call has been made.  It's essentially the same as an active inode
      that stays around after it has been removed--until its final close
      operation.  This means that the id shows up as free for reuse at a
      time it should not be.
      
      The effect of this was seen by Jens Rehpoehler, who:
          - had a filesystem mounted on an rbd device
          - unmapped that filesystem (without unmounting)
          - found that the mount still worked properly
          - but hit a panic when he attempted to re-map a new rbd device
      
      This re-map attempt found the previously-unmapped id available.
      The subsequent attempt to reuse it was met with a panic while
      attempting to (re-)install the sysfs entry for the new mapped
      device.
      
      Fix this by holding off "putting" the rbd id, until the rbd_device
      release function is called--when the last reference is finally
      dropped.
      
      Note: This fixes: http://tracker.newdream.net/issues/1907Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      32eec68d
    • A
      rbd: small changes · 593a9e7b
      Alex Elder 提交于
      Here is another set of small code tidy-ups:
          - Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
            names throughout.  Tell the blk_queue system our physical
            block size, in the (unlikely) event we want to use something
            other than the default.
          - Delete the definition of struct rbd_info, which is never used.
          - Move the definition of dev_to_rbd() down in its source file,
            just above where it gets first used, and change its name to
            dev_to_rbd_dev().
          - Replace an open-coded operation in rbd_dev_release() to use
            dev_to_rbd_dev() instead.
          - Calculate the segment size for a given rbd_device just once in
            rbd_init_disk().
          - Use the '%zd' conversion specifier in rbd_snap_size_show(),
            since the value formatted is a size_t.
          - Switch to the '%llu' conversion specifier in rbd_snap_id_show().
            since the value formatted is unsigned.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      593a9e7b
    • A
      rbd: do some refactoring · 00f1f36f
      Alex Elder 提交于
      A few blocks of code are rearranged a bit here:
          - In rbd_header_from_disk():
      	- Don't bother computing snap_count until we're sure the
      	  on-disk header starts with a good signature.
      	- Move a few independent lines of code so they are *after* a
      	  check for a failed memory allocation.
      	- Get rid of unnecessary local variable "ret".
          - Make a few other changes in rbd_read_header(), similar to the
            above--just moving things around a bit while preserving the
            functionality.
          - In rbd_rq_fn(), just assign rq in the while loop's controlling
            expression rather than duplicating it before and at the end of
            the loop body.  This allows the use of "continue" rather than
            "goto next" in a number of spots.
          - Rearrange the logic in snap_by_name().  End result is the same.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      00f1f36f
    • A
      rbd: fix module sysfs setup/teardown code · fed4c143
      Alex Elder 提交于
      Once rbd_bus_type is registered, it allows an "add" operation via
      the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
      way establishes a connection between the device and rbd_root_dev.
      But rbd_root_dev is not registered until after the rbd_bus_type
      registration is complete.  This could (in principle anyway) result
      in an invalid state.
      
      Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
      two initializations and never be faced with this scenario.
      
      In addition, unregister the device in the event the bus registration
      fails at module init time.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      fed4c143
    • A
      rbd: don't allocate mon_addrs buffer in rbd_add() · 7ef3214a
      Alex Elder 提交于
      The mon_addrs buffer in rbd_add is used to hold a copy of the
      monitor IP addresses supplied via /sys/bus/rbd/add.  That is
      passed to rbd_get_client(), which never modifies it (nor do
      any of the functions it gets passed to thereafter)--the mon_addr
      parameter to rbd_get_client() is a pointer to constant data, so it
      can't be modifed.  Furthermore, rbd_get_client() has the length of
      the mon_addrs buffer and that is used to ensure nothing goes beyond
      its end.
      
      Based on all this, there is no reason that a buffer needs to
      be used to hold a copy of the mon_addrs provided via
      /sys/bus/rbd/add.   Instead, the location within that passed-in
      buffer can be provided, along with the length of the "token"
      therein which represents the monitor IP's.
      
      A small change to rbd_add_parse_args() allows the address within the
      buffer to be passed back, and the length is already returned.  This
      now means that, at least from the perspective of this interface,
      there is no such thing as a list of monitor addresses that is too
      long.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      7ef3214a
    • A
      rbd: have rbd_parse_args() report found mon_addrs size · 5214ecc4
      Alex Elder 提交于
      The argument parsing routine already computes the size of the
      mon_addrs buffer it extracts from the "command."  Pass it to the
      caller so it can use it to provide the length to rbd_get_client().
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      5214ecc4
    • A
      rbd: do a few checks at build time · 81a89793
      Alex Elder 提交于
      This is a bit gratuitous, but there are a few things that can be
      verified at build time rather than run time, so do that.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      81a89793
    • A
      rbd: don't use sscanf() in rbd_add_parse_args() · e28fff26
      Alex Elder 提交于
      Make use of a few simple helper routines to parse the arguments
      rather than sscanf().  This will treat both missing and too-long
      arguments as invalid input (rather than silently truncating the
      input in the too-long case).  In time this can also be used by
      rbd_add() to use the passed-in buffer in place, rather than copying
      its contents into new buffers.
      
      It appears to me that the sscanf() previously used would not
      correctly handle a supplied snapshot--the two final "%s" conversion
      specifications were not separated by a space, and I'm not sure
      how sscanf() handles that situation.  It may not be well-defined.
      So that may be a bug this change fixes (but I didn't verify that).
      
      The sizes of the mon_addrs and options buffers are now passed to
      rbd_add_parse_args(), so they can be supplied to copy_token().
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      e28fff26
    • A
      rbd: encapsulate argument parsing for rbd_add() · a725f65e
      Alex Elder 提交于
      Move the code that parses the arguments provided to rbd_add() (which
      are supplied via /sys/bus/rbd/add) into a separate function.
      
      Also rename the "mon_dev_name" variable in rbd_add() to be
      "mon_addrs".   The variable represents a list of one or more
      comma-separated monitor IP addresses, each with an optional port
      number.  I think "mon_addrs" captures that notion a little better.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      a725f65e
    • A
      rbd: simplify error handling in rbd_add() · 27cc2594
      Alex Elder 提交于
      If a couple pointers are initialized to NULL then a single
      "out_nomem" label can be used for all of the memory allocation
      failure cases in rbd_add().
      
      Also, get rid of the "irc" local variable there.  There is no
      real need for "rc" to be type ssize_t, and it can be used in
      the spot "irc" was.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      27cc2594
    • A
      rbd: reduce memory used for rbd_dev fields · 60571c7d
      Alex Elder 提交于
      The length of the string containing the monitor address
      specification(s) will never exceed the length of the string passed
      in to rbd_add().  The same holds true for the ceph + rbd options
      string.  So reduce the amount of memory allocated for these to
      that length rather than the maximum (1024 bytes).
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      60571c7d
    • A
      rbd: have rbd_get_client() return a rbd_client · d720bcb0
      Alex Elder 提交于
      Since rbd_get_client() currently returns an error code.  It assigns
      the rbd_client field of the rbd_device structure it is passed if
      successful.  Instead, have it return the created rbd_client
      structure and return a pointer-coded error if there is an error.
      This makes the assignment of the client pointer more obvious at the
      call site.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      d720bcb0
    • A
      rbd: a few simple changes · f0f8cef5
      Alex Elder 提交于
      Here are a few very simple cleanups:
          - Add a "RBD_" prefix to the two driver name string definitions.
          - Move the definition of struct rbd_request below struct rbd_req_coll
            to avoid the need for an empty declaration of the latter.
          - Move and group the definitions of rbd_root_dev_release() and
            rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
            close to the top of the file.  Arrange the latter so
            rbd_bus_type.bus_attrs can be initialized statically.
          - Get rid of an unnecessary local variable in rbd_open().
          - Rework some hokey logic in rbd_bus_add_dev(), so the value of
            "ret" at the end is either 0 or -ENOENT to avoid the need for
            the code duplication that was there.
          - Rename a goto target in rbd_add().
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      f0f8cef5
    • A
      rbd: rename "node_lock" · 432b8587
      Alex Elder 提交于
      The spinlock used to protect rbd_client_list is named "node_lock".
      Rename it to "rbd_client_list_lock" to make it more obvious what
      it's for.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      432b8587
    • A
      rbd: move ctl_mutex lock inside rbd_client_create() · bc534d86
      Alex Elder 提交于
      Since rbd_client_create() is only called in one place, move the
      acquisition of the mutex around that call inside that function.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      bc534d86
    • A
      rbd: move ctl_mutex lock inside rbd_get_client() · d97081b0
      Alex Elder 提交于
      Since rbd_get_client() is only called in one place, move the
      acquisition of the mutex around that call inside that function.
      
      Furthermore, within rbd_get_client(), it appears the mutex only
      needs to be held while calling rbd_client_create().  (Moving
      the lock inside that function will wait for the next patch.)
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      d97081b0
    • A
      rbd: release client list lock sooner · e6994d3d
      Alex Elder 提交于
      In rbd_get_client(), if a client is reused, a number of things
      get done while still holding the list lock unnecessarily.
      
      This just moves a few things that need no lock protection outside
      the lock.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      e6994d3d
    • A
      rbd: restore previous rbd id sequence behavior · d184f6bf
      Alex Elder 提交于
      It used to be that selecting a new unique identifier for an added
      rbd device required searching all existing ones to find the highest
      id is used.  A recent change made that unnecessary, but made it
      so that id's used were monotonically non-decreasing.  It's a bit
      more pleasant to have smaller rbd id's though, and this change
      makes ids get allocated as they were before--each new id is one more
      than the maximum currently in use.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      d184f6bf
    • A
      rbd: tie rbd_dev_list changes to rbd_id operations · 499afd5b
      Alex Elder 提交于
      The only time entries are added to or removed from the global
      rbd_dev_list is exactly when a "put" or "get" operation is being
      performed on a rbd_dev's id.  So just move the list management code
      into get/put routines.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      499afd5b
    • A
      rbd: protect the rbd_dev_list with a spinlock · e124a82f
      Alex Elder 提交于
      The rbd_dev_list is just a simple list of all the current
      rbd_devices.  Using the ctl_mutex as a concurrency guard is
      overkill.  Instead, use a spinlock for that specific purpose.
      
      This also reduces the window that the ctl_mutex needs to be held in
      rbd_add().
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      e124a82f
    • A
      rbd: rework calculation of new rbd id's · 1ddbe94e
      Alex Elder 提交于
      In order to select a new unique identifier for an added rbd device,
      the list of all existing ones is searched and a value one greater
      than the highest id is used.
      
      The list search can be avoided by using an atomic variable that
      keeps track of the current highest id.  Using a get/put model for
      id's we can limit the boundless growth of id numbers a bit by
      arranging to reuse the current highest id once it gets released.
      Add these calls to "put" the id when an rbd is getting removed.
      
      Note that this changes the pattern of device id's used--new values
      will never be below the highest one seen so far (even if there
      exists an unused lower one).  I assert this is OK because the key
      property of an rbd id is its uniqueness, not its magnitude.
      
      Regardless, a follow-on patch will restore the old way of doing
      things, I just think this commit just makes the incremental change
      to atomics a little easier to understand.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      1ddbe94e
    • A
      rbd: encapsulate new rbd id selection · b7f23c36
      Alex Elder 提交于
      Move the loop that finds a new unique rbd id to use into
      its own helper function.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      b7f23c36
    • J
      rbd: use a single value of snap_name to mean no snap · cc9d734c
      Josh Durgin 提交于
      There's already a constant for this anyway.
      
      Since rbd_header_set_snap() is only used to set the rbd device
      snap_name field, just do that within that function rather than
      having it take the snap_name as an argument.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      
      v2: Changed interface rbd_header_set_snap() so it explicitly updates
          the snap_name in the rbd_device.  Also added a BUILD_BUG_ON()
          to verify the size of the snap_name field is sufficient for
          SNAP_HEAD_NAME.
      cc9d734c
    • A
      rbd: do not duplicate ceph_client pointer in rbd_device · 1dbb4399
      Alex Elder 提交于
      The rbd_device structure maintains a duplicate copy of the
      ceph_client pointer maintained in its rbd_client structure.  There
      appears to be no good reason for this, and its presence presents a
      risk of them getting out of synch or otherwise misused.  So kill it
      off, and use the rbd_client copy only.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      1dbb4399
    • A
      rbd: make ceph_parse_options() return a pointer · ee57741c
      Alex Elder 提交于
      ceph_parse_options() takes the address of a pointer as an argument
      and uses it to return the address of an allocated structure if
      successful.  With this interface is not evident at call sites that
      the pointer is always initialized.  Change the interface to return
      the address instead (or a pointer-coded error code) to make the
      validity of the returned pointer obvious.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      ee57741c
    • A
      rbd: a few small cleanups · 21079786
      Alex Elder 提交于
      Some minor cleanups in "drivers/block/rbd.c:
          - Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
            in the definition of RBD_MAX_MD_NAME_LEN.
          - Use DEFINE_SPINLOCK() to define and initialize node_lock.
          - Drop a needless (char *) cast in parse_rbd_opts_token().
          - Make a few minor formatting changes.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      21079786
    • A
      ceph: make ceph_setxattr() and ceph_removexattr() more alike · 18fa8b3f
      Alex Elder 提交于
      This patch just rearranges a few bits of code to make more
      portions of ceph_setxattr() and ceph_removexattr() identical.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      18fa8b3f
    • A
      ceph: avoid repeatedly computing the size of constant vxattr names · 3ce6cd12
      Alex Elder 提交于
      All names defined in the directory and file virtual extended
      attribute tables are constant, and the size of each is known at
      compile time.  So there's no need to compute their length every
      time any file's attribute is listed.
      
      Record the length of each string and use it when needed to determine
      the space need to represent them.  In addition, compute the
      aggregate size of strings in each table just once at initialization
      time.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      3ce6cd12
    • A
      ceph: encode type in vxattr callback routines · aa4066ed
      Alex Elder 提交于
      The names of the callback functions used for virtual extended
      attributes are based only on the last component of the attribute
      name.  Because of the way these are defined, this precludes allowing
      a single (lowest) attribute name for different callbacks, dependent
      on the type of file being operated on.  (For example, it might be
      nice to support both "ceph.dir.layout" and "ceph.file.layout".)
      
      Just change the callback names to avoid this problem.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      aa4066ed
    • A
      ceph: drop "_cb" from name of struct ceph_vxattr_cb · 881a5fa2
      Alex Elder 提交于
      A struct ceph_vxattr_cb does not represent a callback at all, but
      rather a virtual extended attribute itself.  Drop the "_cb" suffix
      from its name to reflect that.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      881a5fa2
    • A
      ceph: use macros to normalize vxattr table definitions · eb788084
      Alex Elder 提交于
      Entries in the ceph virtual extended attribute tables all follow a
      distinct pattern in their definition.  Enforce this pattern through
      the use of a macro.
      
      Also, a null name field signals the end of the table, so make that
      be the first field in the ceph_vxattr_cb structure.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      eb788084
    • A
      ceph: use a symbolic name for "ceph." extended attribute namespace · 22891907
      Alex Elder 提交于
      Use symbolic constants to define the top-level prefix for "ceph."
      extended attribute names.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      22891907
    • A
      ceph: pass inode rather than table to ceph_match_vxattr() · 06476a69
      Alex Elder 提交于
      All callers of ceph_match_vxattr() determine what to pass as the
      first argument by calling ceph_inode_vxattrs(inode).  Just do that
      inside ceph_match_vxattr() itself, changing it to take an inode
      rather than the vxattr pointer as its first argument.
      
      Also ensure the function works correctly for an empty table (i.e.,
      containing only a terminating null entry).
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      06476a69
    • A
      ceph: don't null-terminate xattr values · b829c195
      Alex Elder 提交于
      For some reason, ceph_setxattr() allocates an extra byte in which a
      '\0' is stored past the end of an extended attribute value.  This is
      not needed, and is potentially misleading, so get rid of it.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      b829c195
    • A
      ceph: eliminate some abusive casts · 99f0f3b2
      Alex Elder 提交于
      This fixes some spots where a type cast to (void *) was used as
      as a universal type hiding mechanism.  Instead, properly cast the
      type to the intended target type.
      Signed-off-by: NAlex Elder <elder@newdream.net>
      Signed-off-by: NSage Weil <sage@newdream.net>
      99f0f3b2
    • A
      ceph: eliminate some needless casts · bd406145
      Alex Elder 提交于
      This eliminates type casts in some places where they are not
      required.
      Signed-off-by: NAlex Elder <elder@newdream.net>
      Signed-off-by: NSage Weil <sage@newdream.net>
      bd406145
    • A
      ceph: kill addr_str_lock spinlock; use atomic instead · f64a9317
      Alex Elder 提交于
      A spinlock is used to protect a value used for selecting an array
      index for a string used for formatting a socket address for human
      consumption.  The index is reset to 0 if it ever reaches the maximum
      index value.
      
      Instead, use an ever-increasing atomic variable as a sequence
      number, and compute the array index by masking off all but the
      sequence number's lowest bits.  Make the number of entries in the
      array a power of two to allow the use of such a mask (to avoid jumps
      in the index value when the sequence number wraps).
      
      The length of these strings is somewhat arbitrarily set at 60 bytes.
      The worst-case length of a string produced is 54 bytes, for an IPv6
      address that can't be shortened, e.g.:
          [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
      Change it so we arbitrarily use 64 bytes instead; if nothing else
      it will make the array of these line up better in hex dumps.
      
      Rename a few things to reinforce the distinction between the number
      of strings in the array and the length of individual strings.
      Signed-off-by: NAlex Elder <elder@newdream.net>
      Signed-off-by: NSage Weil <sage@newdream.net>
      f64a9317
    • A
      ceph: make use of "else" where appropriate · a5bc3129
      Alex Elder 提交于
      Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
      re-testing a value with consecutive "if" statements.  Don't record a
      connection's socket pointer unless the connect operation is
      successful.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      a5bc3129
    • A
      ceph: use a shared zero page rather than one per messenger · 57666519
      Alex Elder 提交于
      Each messenger allocates a page to be used when writing zeroes
      out in the event of error or other abnormal condition.  Instead,
      use the kernel ZERO_PAGE() for that purpose.
      Signed-off-by: NAlex Elder <elder@dreamhost.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      57666519
    • X
      ceph: fix overflow check in build_snap_context() · 80834312
      Xi Wang 提交于
      The overflow check for a + n * b should be (n > (ULONG_MAX - a) / b),
      rather than (n > ULONG_MAX / b - a).
      Signed-off-by: NXi Wang <xi.wang@gmail.com>
      Signed-off-by: NSage Weil <sage@newdream.net>
      80834312