1. 06 Jul, 2012 8 commits
    • A
      libceph: don't change socket state on sock event · 188048bc
      Alex Elder committed
      Currently the socket state change event handler records an error
      message on a connection to distinguish a close while connecting from
      a close while a connection was already established.
      
      Changing connection information during handling of a socket event is
      not very clean, so instead move this assignment inside con_work(),
      where it can be done during normal connection-level processing (and
      under protection of the connection mutex as well).
      
      Move the handling of a socket closed event up to the top of the
      processing loop in con_work(); there's no point in handling backoff
      etc. if we have a newly-closed socket to take care of.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: SOCK_CLOSED is a flag, not a state · a8d00e3c
      Alex Elder committed
      The following commit changed it so SOCK_CLOSED bit was stored in
      a connection's new "flags" field rather than its "state" field.
      
          libceph: start separating connection flags from state
          commit 928443cd
      
      That bit is used in con_close_socket() to protect against setting an
      error message more than once in the socket event handler function.
      
      Unfortunately, the field being operated on in that function was not
      updated to be "flags" as it should have been.  This fixes that
      error.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
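The flag-versus-state distinction in the commit above can be pictured with a minimal userspace sketch. This is not the kernel code: the struct, the bit number, and the helper names (`conn_sketch`, `sketch_test_and_set`, `sketch_note_closed`) are all hypothetical, and the helper is a non-atomic stand-in for the kernel's atomic bit operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number; the real value is defined by libceph. */
enum { SKETCH_SOCK_CLOSED = 3 };

struct conn_sketch {
    unsigned long state;    /* what the connection is doing */
    unsigned long flags;    /* event bits such as "socket closed" */
    const char *error_msg;
};

/* Non-atomic stand-in for an atomic test-and-set bit operation. */
bool sketch_test_and_set(unsigned long *word, int bit)
{
    bool was_set = (*word >> bit) & 1UL;

    *word |= 1UL << bit;
    return was_set;
}

/* Record the error only the first time the close is seen.  The bit is
 * tested in "flags", the field where it is actually stored. */
void sketch_note_closed(struct conn_sketch *con, const char *msg)
{
    assert(con != NULL);
    if (!sketch_test_and_set(&con->flags, SKETCH_SOCK_CLOSED))
        con->error_msg = msg;
}
```

Calling `sketch_note_closed()` twice leaves the first message in place and never touches `state`, which is the behavior the fix restores.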
    • A
      libceph: don't use bio_iter as a flag · abdaa6a8
      Alex Elder committed
      Recently a bug was fixed in which the bio_iter field in a ceph
      message was not being properly re-initialized when a message got
      re-transmitted:
          commit 43643528
          Author: Yan, Zheng <zheng.z.yan@intel.com>
          rbd: Clear ceph_msg->bio_iter for retransmitted message
      
      We are now only initializing the bio_iter field when we are about to
      start to write message data (in prepare_write_message_data()),
      rather than every time we are attempting to write any portion of the
      message data (in write_partial_msg_pages()).  This means we no
      longer need to use the msg->bio_iter field as a flag.
      
      So just don't do that any more.  Trust prepare_write_message_data()
      to ensure msg->bio_iter is properly initialized, every time we are
      about to begin writing (or re-writing) a message's bio data.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
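The idea of initializing the iterator once per send, rather than treating it as an "already set up" flag, might look like the following userspace sketch. The types and function names are hypothetical; a plain linked list stands in for the message's bio data.

```c
#include <assert.h>
#include <stddef.h>

struct chunk_sketch {
    struct chunk_sketch *next;
    size_t len;
};

struct iter_msg_sketch {
    struct chunk_sketch *chunks;  /* head of the data (stands in for the bio) */
    struct chunk_sketch *iter;    /* current position (stands in for bio_iter) */
};

/* One-time setup before sending or re-sending: always reset the iterator. */
void sketch_prepare_write_data(struct iter_msg_sketch *msg)
{
    assert(msg != NULL);
    msg->iter = msg->chunks;
}

/* Send path: consume one chunk and return its length, or 0 when done.
 * It never asks "has the iterator been initialized yet?". */
size_t sketch_write_partial(struct iter_msg_sketch *msg)
{
    size_t len;

    if (msg->iter == NULL)
        return 0;
    len = msg->iter->len;
    msg->iter = msg->iter->next;
    return len;
}
```

Because the reset lives in the prepare step, a retransmitted message starts again from its first chunk without any special-case check in the write path.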
    • A
      libceph: move init of bio_iter · 572c588e
      Alex Elder committed
      If a message has a non-null bio pointer, its bio_iter field is
      initialized in write_partial_msg_pages() if this has not been done
      already.  This is really a one-time setup operation for sending a
      message's (bio) data, so move that initialization code into
      prepare_write_message_data() which serves that purpose.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: move init_bio_*() functions up · df6ad1f9
      Alex Elder committed
      Move init_bio_iter() and iter_bio_next() up in their source file so
      they'll be defined before they're needed.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: don't mark footer complete before it is · fd154f3c
      Alex Elder committed
      This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
      flag before the CRC for the data portion (recorded in the footer)
      has been completely computed.  Hold off setting the complete flag
      until we've decided it's ready to send.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
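The ordering this commit asks for can be sketched as follows. Everything here is hypothetical: the struct, the flag value, and the byte-sum checksum, which merely stands in for the real CRC so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FOOTER_COMPLETE 0x1u   /* hypothetical flag value */

struct footer_sketch {
    uint32_t data_crc;
    uint8_t flags;
};

/* Stand-in checksum (a simple byte sum), not a real CRC. */
uint32_t sketch_checksum(uint32_t seed, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        seed += buf[i];
    return seed;
}

/* Fold one chunk of data into the footer's running checksum. */
void sketch_footer_add(struct footer_sketch *f, const uint8_t *buf, size_t len)
{
    /* The footer must not be marked complete while data is still arriving. */
    assert(!(f->flags & SKETCH_FOOTER_COMPLETE));
    f->data_crc = sketch_checksum(f->data_crc, buf, len);
}

/* Mark the footer complete only after every chunk has been folded in. */
void sketch_footer_finish(struct footer_sketch *f)
{
    f->flags |= SKETCH_FOOTER_COMPLETE;
}
```

Setting the flag in a separate final step makes it impossible to observe a "complete" footer whose checksum is still being accumulated.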
    • A
      libceph: encapsulate advancing msg page · 84ca8fc8
      Alex Elder committed
      In write_partial_msg_pages(), once all the data from a page has been
      sent we advance to the next one.  Put the code that takes care of
      this into its own function.
      
      While modifying write_partial_msg_pages(), make its local variable
      "in_trail" be Boolean, and use the local variable "msg" (which is
      just the connection's current out_msg pointer) consistently.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
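A helper of the kind described above might look like this simplified sketch. The struct and the function name are hypothetical, and the real helper in the kernel tracks more state than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct out_pos_sketch {
    size_t page;          /* index of the page currently being sent */
    size_t page_pos;      /* offset within that page */
    size_t data_pos;      /* offset within the whole data payload */
    bool did_page_crc;    /* checksum already computed for this page? */
};

/* Account for "sent" bytes out of the "len" bytes that remained in the
 * current page; move to the next page once the current one is done. */
void sketch_out_pos_next(struct out_pos_sketch *pos, size_t len, size_t sent)
{
    assert(pos != NULL && sent <= len);

    pos->data_pos += sent;
    pos->page_pos += sent;
    if (sent < len)
        return;                     /* page only partially sent */

    pos->page_pos = 0;
    pos->page++;
    pos->did_page_crc = false;      /* the next page needs its own checksum */
}
```

Keeping this bookkeeping in one function leaves the write loop with a single call per send attempt.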
    • A
      libceph: encapsulate out message data setup · 739c905b
      Alex Elder committed
      Move the code that prepares to write the data portion of a message
      into its own function.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  2. 22 Jun, 2012 2 commits
  3. 19 Jun, 2012 1 commit
  4. 16 Jun, 2012 3 commits
  5. 07 Jun, 2012 4 commits
  6. 06 Jun, 2012 13 commits
    • A
      libceph: make ceph_con_revoke_message() a msg op · 8921d114
      Alex Elder committed
      ceph_con_revoke_message() is passed both a message and a ceph
      connection.  A ceph_msg allocated for incoming messages on a
      connection always has a pointer to that connection, so there's no
      need to provide the connection when revoking such a message.
      
      Note that the existing logic does not preclude the message supplied
      being a null/bogus message pointer.  The only user of this interface
      is the OSD client, and the only value an osd client passes is a
      request's r_reply field.  That is always non-null (except briefly in
      an error path in ceph_osdc_alloc_request(), and that drops the
      only reference so the request won't ever have a reply to revoke).
      So we can safely assume the passed-in message is non-null, but add a
      BUG_ON() to make it very obvious we are imposing this restriction.
      
      Rename the function ceph_msg_revoke_incoming() to reflect that it is
      really an operation on an incoming message.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: make ceph_con_revoke() a msg operation · 6740a845
      Alex Elder committed
      ceph_con_revoke() is passed both a message and a ceph connection.
      Now that any message associated with a connection holds a pointer
      to that connection, there's no need to provide the connection when
      revoking a message.
      
      This has the added benefit of precluding the possibility of
      providing the wrong connection pointer.  If the message's connection
      pointer is null, it is not being tracked by any connection, so
      revoking it is a no-op.  This is supported as a convenience for
      upper layers, so they can revoke a message that is not actually
      "in flight."
      
      Rename the function ceph_msg_revoke() to reflect that it is really
      an operation on a message, not a connection.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
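The shape of a revoke operation that needs only the message can be sketched like this. The types, the `queued` counter, and the boolean return value are hypothetical simplifications made so the no-op case is observable; they are not the kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rv_con_sketch {
    int queued;                   /* messages currently tracked */
};

struct rv_msg_sketch {
    struct rv_con_sketch *con;    /* NULL when no connection tracks the message */
};

/* Revoke using only the message; the connection is found through msg->con.
 * Returns true if something was revoked, false for the no-op case. */
bool sketch_msg_revoke(struct rv_msg_sketch *msg)
{
    assert(msg != NULL);

    if (msg->con == NULL)
        return false;             /* not in flight: nothing to do */

    assert(msg->con->queued > 0);
    msg->con->queued--;
    msg->con = NULL;
    return true;
}
```

Since the caller no longer passes a connection, there is no way to pass the wrong one, and revoking a message twice is harmless.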
    • A
      libceph: have messages take a connection reference · 92ce034b
      Alex Elder committed
      There are essentially two types of ceph messages: incoming and
      outgoing.  Outgoing messages are always allocated via ceph_msg_new(),
      and at the time of their allocation they are not associated with any
      particular connection.  Incoming messages are always allocated via
      ceph_con_in_msg_alloc(), and they are initially associated with the
      connection from which incoming data will be placed into the message.
      
      When an outgoing message gets sent, it becomes associated with a
      connection and remains that way until the message is successfully
      sent.  The association of an incoming message goes away at the point
      it is sent to an upper layer via a con->ops->dispatch method.
      
      This patch implements reference counting for all ceph messages, such
      that every message holds a reference (and a pointer) to a connection
      if and only if it is associated with that connection (as described
      above).
      
      
      For background, here is an explanation of the ceph message
      lifecycle, emphasizing when an association exists between a message
      and a connection.
      
      Outgoing Messages
      An outgoing message is "owned" by its allocator, from the time it is
      allocated in ceph_msg_new() up to the point it gets queued for
      sending in ceph_con_send().  Prior to that point the message's
      msg->con pointer is null; at the point it is queued for sending that
      pointer is assigned to refer to the connection.  At that
      time the message is inserted into a connection's out_queue list.
      
      When a message on the out_queue list has been sent to the socket
      layer to be put on the wire, it is transferred out of that list and
      into the connection's out_sent list.  At that point it is still owned
      by the connection, and will remain so until an acknowledgement is
      received from the recipient that indicates the message was
      successfully transferred.  When such an acknowledgement is received
      (in process_ack()), the message is removed from its list (in
      ceph_msg_remove()), at which point it is no longer associated with
      the connection.
      
      So basically, any time a message is on one of a connection's lists,
      it is associated with that connection.  Reference counting outgoing
      messages can thus be done at the points a message is added to the
      out_queue (in ceph_con_send()) and the point it is removed from
      either of its two lists (in ceph_msg_remove()), at which point its
      connection pointer becomes null.
      
      Incoming Messages
      When an incoming message on a connection is getting read (in
      read_partial_message()) and there is no message in con->in_msg,
      a new one is allocated using ceph_con_in_msg_alloc().  At that
      point the message is associated with the connection.  Once that
      message has been completely and successfully read, it is passed to
      upper layer code using the connection's con->ops->dispatch method.
      At that point the association between the message and the connection
      no longer exists.
      
      Reference counting of connections for incoming messages can be done
      by taking a reference to the connection when the message gets
      allocated, and releasing that reference when it gets handed off
      using the dispatch method.
      
      We should never fail to get a connection reference for a
      message, since the caller should already hold one.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
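The invariant described in this commit (a message holds a connection reference if and only if it is associated with that connection) can be modeled with a small userspace sketch. All names are hypothetical, and a plain integer replaces the kernel's reference counting and locking.

```c
#include <assert.h>
#include <stddef.h>

struct ref_con_sketch {
    int nref;
};

struct ref_msg_sketch {
    struct ref_con_sketch *con;   /* non-NULL exactly while associated */
};

struct ref_con_sketch *sketch_con_get(struct ref_con_sketch *con)
{
    /* The caller is expected to hold a reference already. */
    assert(con != NULL && con->nref > 0);
    con->nref++;
    return con;
}

void sketch_con_put(struct ref_con_sketch *con)
{
    assert(con != NULL && con->nref > 0);
    con->nref--;
}

/* Start of an association: queuing an outgoing message for sending,
 * or allocating an incoming message for a connection. */
void sketch_msg_attach(struct ref_msg_sketch *msg, struct ref_con_sketch *con)
{
    assert(msg->con == NULL);
    msg->con = sketch_con_get(con);
}

/* End of an association: removal from the connection's lists after an
 * acknowledgement, or hand-off of an incoming message to dispatch. */
void sketch_msg_detach(struct ref_msg_sketch *msg)
{
    assert(msg->con != NULL);
    sketch_con_put(msg->con);
    msg->con = NULL;
}
```

Pairing every pointer assignment with a get and every clearing with a put keeps the pointer and the reference count from drifting apart.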
    • A
      libceph: have messages point to their connection · 38941f80
      Alex Elder committed
      When a ceph message is queued for sending it is placed on a list of
      pending messages (ceph_connection->out_queue).  When they are
      actually sent over the wire, they are moved from that list to
      another (ceph_connection->out_sent).  When acknowledgement for the
      message is received, it is removed from the sent messages list.
      
      During that entire time the message is "in the possession" of a
      single ceph connection.  Keep track of that connection in the
      message.  This will be used in the next patch (and is a helpful
      bit of information for debugging anyway).
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: tweak ceph_alloc_msg() · 1c20f2d2
      Alex Elder committed
      The function ceph_alloc_msg() is only used to allocate a message
      that will be assigned to a connection's in_msg pointer.  Rename the
      function so this implied usage is more clear.
      
      In addition, make that assignment inside the function (again, since
      that's precisely what it's intended to be used for).  This allows us
      to return what is now provided via the passed-in address of a "skip"
      variable.  The return type is now Boolean to be explicit that there
      are only two possible outcomes.
      
      Make sure the result of an ->alloc_msg method call always sets the
      value of *skip properly.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: fully initialize connection in con_init() · 1bfd89f4
      Alex Elder committed
      Move the initialization of a ceph connection's private pointer,
      operations vector pointer, and peer name information into
      ceph_con_init().  Rearrange the arguments so the connection pointer
      is first.  Hide the byte-swapping of the peer entity number inside
      ceph_con_init().
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • A
      libceph: init monitor connection when opening · 20581c1f
      Alex Elder committed
      Hold off initializing a monitor client's connection until just
      before it gets opened for use.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • S
      libceph: drop connection refcounting for mon_client · ec87ef43
      Sage Weil committed
      All references to the embedded ceph_connection come from the msgr
      workqueue, which is drained prior to mon_client destruction.  That
      means we can ignore con refcounting entirely.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Reviewed-by: Alex Elder <elder@inktank.com>
    • A
      libceph: embed ceph connection structure in mon_client · 67130934
      Alex Elder committed
      A monitor client has a pointer to a ceph connection structure in it.
      This is the only one of the three ceph client types that does it this
      way; the OSD and MDS clients embed the connection into their main
      structures.  There is always exactly one ceph connection for a
      monitor client, so there is no need to allocate it separate from the
      monitor client structure.
      
      So switch the ceph_mon_client structure to embed its
      ceph_connection structure.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
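The difference between pointing to a connection and embedding one can be shown with a short sketch. The struct and function names are hypothetical; the pointer arithmetic mirrors what the kernel's container_of() macro does, which is an assumption about how an embedded member is typically used rather than something this commit states.

```c
#include <assert.h>
#include <stddef.h>

struct emb_con_sketch {
    int fd;
};

/* Before: the connection is a separate allocation reached via a pointer. */
struct mon_client_ptr_sketch {
    int id;
    struct emb_con_sketch *con;
};

/* After: the single connection lives inside the client structure. */
struct mon_client_emb_sketch {
    int id;
    struct emb_con_sketch con;
};

/* Recover the owning client from its embedded connection. */
struct mon_client_emb_sketch *sketch_monc_from_con(struct emb_con_sketch *con)
{
    assert(con != NULL);
    return (struct mon_client_emb_sketch *)
        ((char *)con - offsetof(struct mon_client_emb_sketch, con));
}
```

With embedding there is one allocation and one lifetime to manage, and the owner is always reachable from the member.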
    • S
      libceph: use con get/put ops from osd_client · 0d47766f
      Sage Weil committed
      There were a few direct calls to ceph_con_{get,put}() instead of the con
      ops from osd_client.c.  This is a bug since those ops aren't defined to
      be ceph_con_get/put.
      
      This breaks refcounting on the ceph_osd structs that contain the
      ceph_connections, and could lead to all manner of strangeness.
      
      The purpose of the ->get and ->put methods in a ceph connection is
      to allow the connection to indicate it has a reference to something
      external to the messaging system, *not* to indicate something
      external has a reference to the connection.
      
      [elder@inktank.com: added that last sentence]
      Signed-off-by: Sage Weil <sage@newdream.net>
      Reviewed-by: Alex Elder <elder@inktank.com>
    • A
      libceph: osd_client: don't drop reply reference too early · ab8cb34a
      Alex Elder committed
      In ceph_osdc_release_request(), a reference to the r_reply message
      is dropped.  But just after that, that same message is revoked if it
      was in use to receive an incoming reply.  Reorder these so we are
      sure we hold a reference until we're actually done with the message.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • D
      rbd: endian bug in rbd_req_cb() · 895cfcc8
      Dan Carpenter committed
      Sparse complains about this because:
      drivers/block/rbd.c:996:20: warning: cast to restricted __le32
      drivers/block/rbd.c:996:20: warning: cast from restricted __le16
      
      These are set in osd_req_encode_op() and they are le16.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
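Why the conversion width matters can be seen by simulating a big-endian host, where converting from little-endian is a byte swap. The sketch below uses hypothetical helper names and explicit swaps in place of the kernel's conversion macros.

```c
#include <assert.h>
#include <stdint.h>

/* Byte swaps: on a big-endian host, converting a little-endian value to
 * CPU order amounts to one of these, chosen by the value's width. */
uint16_t sketch_swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

uint32_t sketch_swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* What a big-endian CPU sees when it loads a 16-bit little-endian field. */
uint16_t sketch_raw_on_big_endian(uint16_t value)
{
    return sketch_swap16(value);
}
```

For a field holding 0x1201, the matching 16-bit swap recovers 0x1201, while a 32-bit swap of the same raw field yields 0x12010000. On a little-endian host both conversions are no-ops, which is how such a mismatch can go unnoticed until sparse flags it.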
    • Y
      rbd: Fix ceph_snap_context size calculation · f9f9a190
      Yan, Zheng committed
      ceph_snap_context->snaps is a u64 array
      Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
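A size calculation for a structure that ends in an array of 64-bit entries can be sketched as below. The struct is a hypothetical stand-in, not the real ceph_snap_context layout; the point is only that each trailing entry contributes sizeof(uint64_t) bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct snapc_sketch {
    uint64_t seq;
    uint32_t num_snaps;
    uint64_t snaps[];     /* flexible array of 64-bit snapshot ids */
};

/* Bytes needed for a context holding snap_count snapshot ids. */
size_t sketch_snapc_size(uint32_t snap_count)
{
    /* Guard against overflow of the size computation. */
    assert(snap_count <=
           (SIZE_MAX - sizeof(struct snapc_sketch)) / sizeof(uint64_t));

    return sizeof(struct snapc_sketch) +
           (size_t)snap_count * sizeof(uint64_t);
}
```

Using the element type of the array itself in the multiplication keeps the calculation correct if the entry width ever changes.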
  7. 03 Jun, 2012 9 commits
    • L
      Linux 3.5-rc1 · f8f5701b
      Linus Torvalds committed
    • L
      Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm · 912afc36
      Linus Torvalds committed
      Pull device-mapper updates from Alasdair G Kergon:
       "Improve multipath's retrying mechanism in some defined circumstances
        and provide a simple reserve/release mechanism for userspace tools to
        access thin provisioning metadata while the pool is in use."
      
      * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
        dm thin: provide userspace access to pool metadata
        dm thin: use slab mempools
        dm mpath: allow ioctls to trigger pg init
        dm mpath: delay retry of bypassed pg
        dm mpath: reduce size of struct multipath
    • J
      dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber committed
      This patch implements two new messages that can be sent to the thin
      pool target allowing it to take a snapshot of the _metadata_.  This
      read-only snapshot can be accessed by userland, concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asynchronous replication.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • M
      dm thin: use slab mempools · a24c2569
      Mike Snitzer committed
      Use dedicated caches prefixed with a "dm_" name rather than relying on
      kmalloc mempools backed by generic slab caches so the memory usage of
      thin provisioning (and any leaks) can be accounted for independently.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • M
      dm mpath: allow ioctls to trigger pg init · 35991652
      Mikulas Patocka committed
      After the failure of a group of paths, any alternative paths that
      need initialising do not become available until further I/O is sent to
      the device.  Until this has happened, ioctls return -EAGAIN.
      
      With this patch, new paths are made available in response to an ioctl
      too.  The processing of the ioctl gets delayed until this has happened.
      
      Instead of returning an error, we submit a work item to kmultipathd
      (that will potentially activate the new path) and retry in ten
      milliseconds.
      
      Note that the patch doesn't retry an ioctl if the ioctl itself fails due
      to a path failure.  Such retries should be handled intelligently by the
      code that generated the ioctl in the first place, noting that some SCSI
      commands should not be retried because they are not idempotent (XOR write
      commands).  For commands that could be retried, there is a danger that
      if the device rejected the SCSI command, the path could be erroneously
      marked as failed, and the request would be retried on another path which
      might fail too.  It can be determined if the failure happens on the
      device or on the SCSI controller, but there is no guarantee that all
      SCSI drivers set these flags correctly.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • M
      dm mpath: delay retry of bypassed pg · f220fd4e
      Mike Christie committed
      If I/O needs retrying and only bypassed priority groups are available,
      set the pg_init_delay_retry flag to wait before retrying.
      
      If, for example, the reason for the bypass is that the controller is
      getting reset or there is a firmware upgrade happening, retrying right
      away would cause a flood of log messages and retries for what could be a
      few seconds or even several minutes.
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • M
      dm mpath: reduce size of struct multipath · 1fbdd2b3
      Mike Snitzer committed
      Move multipath structure's 'lock' and 'queue_size' members to eliminate
      two 4-byte holes.  Also use a bit within a single unsigned int for each
      existing flag (saves 8-bytes).  This allows future flags to be added
      without each consuming an unsigned int.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
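The two space savings described above (closing alignment holes and packing flags into single bits) can be demonstrated with a sketch. Only 'lock' and 'queue_size' are names taken from the commit text; every other member is made up, and a plain unsigned int stands in for the lock.

```c
#include <assert.h>

/* Before: 4-byte members sit between pointers, and each flag occupies a
 * full unsigned int. */
struct mp_before_sketch {
    void *ptr_a;
    unsigned lock;          /* stand-in for the real lock type */
    void *ptr_b;
    unsigned queue_size;
    void *ptr_c;
    unsigned flag_a;
    unsigned flag_b;
    unsigned flag_c;
};

/* After: the 4-byte members are grouped together and each flag is one
 * bit of a shared unsigned int. */
struct mp_after_sketch {
    void *ptr_a;
    void *ptr_b;
    void *ptr_c;
    unsigned lock;
    unsigned queue_size;
    unsigned flag_a:1;
    unsigned flag_b:1;
    unsigned flag_c:1;
};

/* Compile-time check that the rearranged layout is smaller. */
static_assert(sizeof(struct mp_after_sketch) < sizeof(struct mp_before_sketch),
              "rearranged layout should be smaller");
```

On a typical 64-bit ABI the first layout pads after each 4-byte member that precedes a pointer; the exact byte counts depend on the platform, which is why the sketch checks only the relative sizes.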
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 4fc3acf2
      Linus Torvalds committed
      Pull networking updates from David Miller:
      
       1) Make syn floods consume significantly less resources by
      
          a) Not pre-COW'ing routing metrics for SYN/ACKs
          b) Mirroring the device queue mapping of the SYN for the SYN/ACK
             reply.
      
          Both from Eric Dumazet.
      
       2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.
      
       3) Validate the length requested when building a paged SKB for a
          socket, so we don't overrun the page vector accidentally.  From Jason
          Wang.
      
       4) When netlabel is disabled, we abort all IP option processing when we
          see a CIPSO option.  This isn't the right thing to do, we should
          simply skip over it and continue processing the remaining options
          (if any).  Fix from Paul Moore.
      
       5) SRIOV fixes for the mellanox driver from Jack Morgenstein and Marcel
          Apfelbaum.
      
       6) 8139cp enables the receiver before the ring address is properly
          programmed, which potentially lets the device crap over random
          memory.  Fix from Jason Wang.
      
       7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
          address reference in jumbo RX frame processing from Bruce Allan and
          Sebastian Andrzej Siewior, respectively.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        fec_mpc52xx: fix timestamp filtering
        mcs7830: Implement link state detection
        e1000e: fix Rapid Start Technology support for i217
        e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
        r8169: call netif_napi_del at errpaths and at driver unload
        tcp: reflect SYN queue_mapping into SYNACK packets
        tcp: do not create inetpeer on SYNACK message
        8139cp/8139too: terminate the eeprom access with the right opmode
        8139cp: set ring address before enabling receiver
        cipso: handle CIPSO options correctly when NetLabel is disabled
        net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
        bql: Avoid possible inconsistent calculation.
        bql: Avoid unneeded limit decrement.
        bql: Fix POSDIFF() to integer overflow aware.
        net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
        net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
        net/mlx4_core: Fixes for VF / Guest startup flow
        net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
        net/mlx4_core: Fix number of EQs used in ICM initialisation
        net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 63004afa
      Linus Torvalds committed
      Pull straggler x86 fixes from Peter Anvin:
       "Three groups of patches:
      
        - EFI boot stub documentation and the ability to print error messages;
        - Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
          should never have been ported, and the port is broken and
          potentially dangerous.)
        - ftrace stack corruption fixes.  I'm not super-happy about the
          technical implementation, but it is probably the least invasive in
          the short term.  In the future I would like a single method for
          nesting the debug stack, however."
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
        x86, efi: Add EFI boot stub documentation
        x86, efi; Add EFI boot stub console support
        x86, efi: Only close open files in error path
        ftrace/x86: Do not change stacks in DEBUG when calling lockdep
        x86: Allow nesting of the debug stack IDT setting
        x86: Reset the debug_stack update counter
        ftrace: Use breakpoint method to update ftrace caller
        ftrace: Synchronize variable setting with breakpoints