1. 03 Aug, 2009 7 commits
    • md: Use revalidate_disk to effect changes in size of device. · 449aad3e
      Committed by NeilBrown
      As revalidate_disk calls check_disk_size_change, it will cause
      any capacity change of a gendisk to be propagated to the blockdev
      inode.  So use that instead of mucking about with locks and
      i_size_write.
      
      Also add a call to revalidate_disk in do_md_run and a few other places
      where the gendisk capacity is changed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow raid5_quiesce to work properly when reshape is happening. · 64bd660b
      Committed by NeilBrown
      The ->quiesce method is not supposed to stop resync/recovery/reshape,
      just normal IO.
      But in raid5 we don't have a way to know which stripes are being
      used for normal IO and which for resync etc, so we need to wait for
      all stripes to be idle to be sure that all writes have completed.
      
      However reshape keeps at least some stripe busy for an extended period
      of time, so a call to raid5_quiesce can block for several seconds
      needlessly.
      So arrange for reshape etc to pause briefly while raid5_quiesce is
      trying to quiesce the array so that the active_stripes count can
      drop to zero.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: set reshape_position correctly when reshape starts. · e516402c
      Committed by NeilBrown
      As the internal reshape_progress counter is the main driver
      for reshape, the fact that reshape_position sometimes starts with the
      wrong value has minimal effect.  It is visible in sysfs and that
      is all.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Handle growth of v1.x metadata correctly. · 70471daf
      Committed by NeilBrown
      The v1.x metadata does not have a fixed size and can grow
      when devices are added.
      If it grows enough to require an extra sector of storage,
      we need to update the 'sb_size' to match.
      
      Without this, md can write out an incomplete superblock with a
      bad checksum, which will be rejected when trying to re-assemble
      the array.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
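To illustrate the sb_size update described in 70471daf, the sketch below recomputes how many bytes of superblock must be written as the device table grows. The 256-byte fixed header and the 2 bytes per device slot are assumptions about the v1.x layout made for this example, and `sb_bytes` is a hypothetical helper, not the kernel's function.

```c
#include <stddef.h>

/* Assumed v1.x layout: a fixed header followed by one 2-byte role entry
 * per device slot. Both constants are assumptions for this sketch. */
#define SB_HEADER_BYTES 256u
#define SB_ROLE_BYTES   2u

/* Hypothetical helper: number of bytes to write for a superblock with
 * max_dev device slots, rounded up to a whole number of blocks.
 * block_size must be a power of two (for example 512). */
static unsigned int sb_bytes(unsigned int max_dev, unsigned int block_size)
{
    unsigned int raw = SB_HEADER_BYTES + max_dev * SB_ROLE_BYTES;
    unsigned int mask = block_size - 1;

    return (raw + mask) & ~mask;
}
```

Under these assumptions, 128 slots still fit in one 512-byte sector, while a 129th slot pushes the superblock into a second sector. That is the point at which a stale sb_size would truncate the write and leave a bad checksum.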
    • md: avoid array overflow with bad v1.x metadata · 3673f305
      Committed by NeilBrown
      We trust the 'desc_nr' field in v1.x metadata enough to use it
      as an index in an array.  This isn't really safe.
      So range-check the value first.
      Signed-off-by: NeilBrown <neilb@suse.de>
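The range check described in 3673f305 follows a common pattern: validate an index read from untrusted on-disk metadata before using it. A minimal sketch, in which the table size `ROLE_SLOTS`, the `sample_roles` table and the `lookup_role` helper are all hypothetical names chosen for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define ROLE_SLOTS 8  /* hypothetical table size for this sketch */

/* Hypothetical role table, standing in for data read from metadata. */
static const uint16_t sample_roles[ROLE_SLOTS] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* Return the role stored for desc_nr, or -1 when desc_nr is out of range.
 * desc_nr comes from disk, so it is checked before being used as an index. */
static int lookup_role(const uint16_t roles[ROLE_SLOTS], uint32_t desc_nr)
{
    if (desc_nr >= ROLE_SLOTS)
        return -1;
    return roles[desc_nr];
}
```

Using an unsigned type for the index means a single upper-bound comparison also rejects values that would be negative if read as signed.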
    • md: when a level change reduces the number of devices, remove the excess. · 3a981b03
      Committed by NeilBrown
      When an array is changed from RAID6 to RAID5, fewer drives are
      needed.  So any device that is made superfluous by the level
      conversion must be marked as not-active.
      For the RAID6->RAID5 conversion, this will be a drive which only
      has 'Q' blocks on it.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Push down data integrity code to personalities. · ac5e7113
      Committed by Andre Noll
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
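The check that ac5e7113 attributes to md_integrity_register() (all active devices are integrity capable and share one profile) can be sketched in userspace as follows. The `member_dev` struct, the profile strings and `common_profile` are hypothetical stand-ins, not kernel types or functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a component device. */
struct member_dev {
    bool active;          /* only active devices take part in the check */
    const char *profile;  /* NULL means the device has no integrity profile */
};

/* Return the profile shared by all active devices, or NULL if any active
 * device lacks a profile or the profiles differ. */
static const char *common_profile(const struct member_dev *devs, size_t n)
{
    const char *common = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i].active)
            continue;
        if (devs[i].profile == NULL)
            return NULL;
        if (common == NULL)
            common = devs[i].profile;
        else if (strcmp(common, devs[i].profile) != 0)
            return NULL;
    }
    return common;
}

/* Sample arrays with made-up profile names. */
static const struct member_dev devs_matching[] = {
    { true, "profile-a" }, { false, NULL }, { true, "profile-a" },
};
static const struct member_dev devs_mixed[] = {
    { true, "profile-a" }, { true, "profile-b" },
};
static const struct member_dev devs_missing[] = {
    { true, "profile-a" }, { true, NULL },
};
```

A non-NULL result corresponds to the case where the common profile would be registered for the array. A NULL result corresponds to leaving data integrity disabled.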
  2. 31 Jul, 2009 1 commit
  3. 24 Jul, 2009 3 commits
  4. 11 Jul, 2009 1 commit
  5. 09 Jul, 2009 1 commit
  6. 01 Jul, 2009 7 commits
  7. 30 Jun, 2009 2 commits
  8. 22 Jun, 2009 18 commits
    • dm mpath: change to be request based · f40c67f0
      Committed by Kiyoshi Ueda
      This patch converts dm-multipath target to request-based from bio-based.
      
      Basically, the patch just converts the I/O unit from struct bio
      to struct request.
      In the course of the conversion, it also changes the I/O queueing
      mechanism.  The change in the I/O queueing is described in detail
      below.
      
      I/O queueing mechanism change
      -----------------------------
      In I/O submission, map_io(), there is no mechanism change from
      bio-based, since the clone request is ready for retry as it is.
      However, in I/O completion, do_end_io(), there is a mechanism change
      from bio-based, since the clone request is not ready for retry.
      
      In do_end_io() of bio-based, the clone bio has all needed memory
      for resubmission.  So the target driver can queue it and resubmit
      it later without memory allocations.
      The mechanism has almost no overhead.
      
      On the other hand, in do_end_io() of request-based, the clone request
      doesn't have clone bios, so the target driver can't resubmit it
      as it is.  To resubmit the clone request, memory allocation for
      clone bios is needed, which adds overhead.
      To avoid that overhead just for queueing, the target driver doesn't
      queue the clone request inside itself.
      Instead, the target driver asks dm core to queue and remap
      the original request of the clone request, since the only overhead
      of queueing is freeing the memory for the clone request.
      
      As a result, the target driver doesn't need to record/restore
      the information of the original request for resubmitting
      the clone request.  So dm_bio_details in dm_mpath_io is removed.
      
      multipath_busy()
      ---------------------
      The target driver returns "busy" only when both of the following hold:
        o The target driver would map I/Os if the map() function were called
        and
        o The mapped I/Os would wait on the underlying device's queue due to
          congestion if the map() function were called now.
      
      In other cases, the target driver doesn't return "busy".
      Otherwise, dm core will keep the I/Os and the target driver can't
      do what it wants.
      (e.g. the target driver can't map I/Os now, so wants to kill I/Os.)
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
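The two conditions listed under multipath_busy() in f40c67f0 can be modelled roughly as below. This is a userspace sketch only: the `mp_path` struct and `mpath_would_be_busy` are hypothetical, and treating "the mapped I/Os would wait" as "every usable path is congested" is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path state. */
struct mp_path {
    bool usable;     /* the path could be chosen by map() */
    bool congested;  /* the underlying queue would make I/O wait */
};

/* Report "busy" only if the target would map I/O right now AND that I/O
 * would only sit waiting on a congested underlying queue. */
static bool mpath_would_be_busy(bool can_map_now,
                                const struct mp_path *paths, size_t n)
{
    bool any_usable = false;

    if (!can_map_now)
        return false;  /* let dm core hand the I/O over, e.g. to fail it */

    for (size_t i = 0; i < n; i++) {
        if (!paths[i].usable)
            continue;
        any_usable = true;
        if (!paths[i].congested)
            return false;  /* at least one path can take I/O now */
    }
    return any_usable;
}

/* Sample path sets for the sketch. */
static const struct mp_path paths_all_congested[] = {
    { true, true }, { true, true },
};
static const struct mp_path paths_one_free[] = {
    { true, true }, { true, false },
};
static const struct mp_path paths_none_usable[] = {
    { false, true }, { false, false },
};
```

Returning "not busy" when nothing can be mapped mirrors the commit's example: the target still needs to receive the I/O so that it can kill it.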
    • dm: disable interrupt when taking map_lock · 523d9297
      Committed by Kiyoshi Ueda
      This patch disables interrupts when taking map_lock to avoid
      lockdep warnings in request-based dm.
      
      request-based dm takes map_lock after taking queue_lock with
      interrupts disabled:
        spin_lock_irqsave(queue_lock)
        q->request_fn() == dm_request_fn()
          => dm_get_table()
               => read_lock(map_lock)
      while queue_lock could be (but isn't) taken in interrupt context.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: do not set QUEUE_ORDERED_DRAIN if request based · 5d67aa23
      Committed by Kiyoshi Ueda
      Request-based dm doesn't have barrier support yet.
      So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
      Since the device type is decided at the first table loading time,
      the flag set is deferred until then.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: enable request based option · e6ee8c0b
      Committed by Kiyoshi Ueda
      This patch enables request-based dm.
      
      o Request-based dm and bio-based dm coexist, since there are
        some target drivers which are better suited to bio-based dm.
        Also, there are other bio-based devices in the kernel
        (e.g. md, loop).
        Since a bio-based device can't receive struct request,
        there are some limitations on device stacking between
        bio-based and request-based.
      
                           type of underlying device
                         bio-based      request-based
         ----------------------------------------------
          bio-based         OK                OK
          request-based     --                OK
      
        The device type is recognized by the queue flag in the kernel,
        so dm follows that.
      
      o The type of a dm device is decided at the first table binding time.
        Once the type of a dm device is decided, the type can't be changed.
      
      o Mempool allocations are deferred until table loading time, since
        mempools for request-based dm are different from those for bio-based
        dm and the needed mempool type is fixed by the type of table.
      
      o Currently, request-based dm supports only tables that have a single
        target.  To support multiple targets, we need to support request
        splitting or prevent bio/request from spanning multiple targets.
        The former needs lots of changes in the block layer, and the latter
        requires all target drivers to support the merge() function.
        Both will take time.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
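The stacking table in e6ee8c0b reduces to one rule: only a request-based device on top of a bio-based device is disallowed. A small sketch of that rule, where the `dm_kind` enum and `can_stack` are hypothetical names used for illustration:

```c
#include <stdbool.h>

/* Hypothetical device kinds for the sketch. */
enum dm_kind { BIO_BASED, REQUEST_BASED };

/* Mirror the stacking table: a bio-based upper device may sit on either
 * kind, while a request-based upper device needs a request-based lower
 * device, because a bio-based device cannot receive struct request. */
static bool can_stack(enum dm_kind upper, enum dm_kind lower)
{
    return upper == BIO_BASED || lower == REQUEST_BASED;
}
```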
    • dm: prepare for request based option · cec47e3d
      Committed by Kiyoshi Ueda
      This patch adds core functions for request-based dm.
      
      When struct mapped device (md) is initialized, md->queue has
      an I/O scheduler and the following functions are used for
      request-based dm as the queue functions:
          make_request_fn: dm_make_request()
          prep_rq_fn:      dm_prep_fn()
          request_fn:      dm_request_fn()
          softirq_done_fn: dm_softirq_done()
          lld_busy_fn:     dm_lld_busy()
      Actual initializations are done in another patch (PATCH 2).
      
      Below is a brief summary of how request-based dm behaves, including:
        - making request from bio
        - cloning, mapping and dispatching request
        - completing request and bio
        - suspending md
        - resuming md
      
        bio to request
        ==============
        md->queue->make_request_fn() (dm_make_request()) calls __make_request()
        for a bio submitted to the md.
        Then, the bio is kept in the queue as a new request or merged into
        another request in the queue if possible.
      
        Cloning and Mapping
        ===================
        Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
        when requests are dispatched after they are sorted by the I/O scheduler.
      
        dm_request_fn() checks busy state of underlying devices using
        target's busy() function and stops dispatching requests to keep them
        on the dm device's queue if busy.
        It helps better I/O merging, since no merge is done for a request
        once it is dispatched to underlying devices.
      
        Actual cloning and mapping are done in dm_prep_fn() and map_request()
        called from dm_request_fn().
        dm_prep_fn() clones not only request but also bios of the request
        so that dm can hold bio completion in error cases and prevent
        the bio submitter from noticing the error.
        (See the "Completion" section below for details.)
      
        After the cloning, the clone is mapped by target's map_rq() function
        and inserted into the underlying device's queue using
        blk_insert_cloned_request().
      
        Completion
        ==========
        Request completion can be hooked by rq->end_io(), but by then all bios
        in the request will have been completed, even in error cases, and the
        bio submitter will have noticed the error.
        To prevent the bio completion in error cases, request-based dm clones
        both bio and request and hooks both bio->bi_end_io() and rq->end_io():
            bio->bi_end_io(): end_clone_bio()
            rq->end_io():     end_clone_request()
      
        Summary of the request completion flow is below:
        blk_end_request() for a clone request
          => blk_update_request()
             => bio->bi_end_io() == end_clone_bio() for each clone bio
                => Free the clone bio
                => Success: Complete the original bio (blk_update_request())
                   Error:   Don't complete the original bio
          => blk_finish_request()
             => rq->end_io() == end_clone_request()
                => blk_complete_request()
                   => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request
      
        end_clone_bio() completes the original request by the size of
        the original bio in successful cases.
        Even if all bios in the original request are completed by that
        completion, the original request must not be completed yet to keep
        the ordering of request completion for the stacking.
        So end_clone_bio() uses blk_update_request() instead of
        blk_end_request().
        In error cases, end_clone_bio() doesn't complete the original bio.
        It just frees the cloned bio and hands the error handling over to
        end_clone_request().
      
        end_clone_request(), which is called with queue lock held, completes
        the clone request and the original request in a softirq context
        (dm_softirq_done()), which has no queue lock, to avoid a deadlock
        issue on submission of another request during the completion:
            - The submitted request may be mapped to the same device
            - Request submission requires the queue lock, but the completing
              context already holds it and the submission path doesn't know that
      
        The clone request has no clone bio when dm_softirq_done() is called.
        So target drivers can't resubmit it, even in error cases.
        Instead, they can ask dm core to requeue and remap
        the original request in that case.
      
        suspend
        =======
        Request-based dm suspends the md by stopping md->queue.
        For noflush suspend, it just stops md->queue.
      
        For flush suspend, it inserts a marker request at the tail of md->queue
        and dispatches all requests in md->queue until the marker comes to
        the front of md->queue.  Then it stops dispatching requests and waits
        for all the dispatched requests to complete.
        After that, it completes the marker request, stops md->queue and
        wakes up the waiter on the suspend queue, md->wait.
      
        resume
        ======
        Starts md->queue.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: add userspace log · f5db4af4
      Committed by Jonathan Brassow
      This patch contains a device-mapper mirror log module that forwards
      requests to userspace for processing.
      
      The structures used for communication between kernel and userspace are
      located in include/linux/dm-log-userspace.h.  Due to the frequency,
      diversity, and 2-way communication nature of the exchanges between
      kernel and userspace, 'connector' was chosen as the interface for
      communication.
      
      The first log implementations written in userspace - "clustered-disk"
      and "clustered-core" - support clustered shared storage.   A userspace
      daemon (in the LVM2 source code repository) uses openAIS/corosync to
      process requests in an ordered fashion with the rest of the nodes in the
      cluster so as to prevent log state corruption.  Other implementations
      with no association to LVM or openAIS/corosync are certainly possible.
      
      (Imagine if two machines are writing to the same region of a mirror.
      They would both mark the region dirty, but you need a cluster-aware
      entity that can handle properly marking the region clean when they are
      done.  Otherwise, you might clear the region when the first machine is
      done, not the second.)
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: calculate queue limits during resume not load · 754c5fc7
      Committed by Mike Snitzer
      Currently, device-mapper maintains a separate instance of 'struct
      queue_limits' for each table of each device.  When the configuration of
      a device is to be changed, first its table is loaded and this structure
      is populated, then the device is 'resumed' and the calculated
      queue_limits are applied.
      
      This places restrictions on how userspace may process related devices,
      where it is often advantageous to 'load' tables for several devices
      at once before 'resuming' them together.  As the new queue_limits
      only take effect after the 'resume', if they are changing and one
      device uses another, the latter must be 'resumed' before the former
      may be 'loaded'.
      
      This patch moves the calculation of these queue_limits out of
      the 'load' operation into 'resume'.  Since we are no longer
      pre-calculating this struct, we no longer need to maintain copies
      within our dm structs.
      
      dm_set_device_limits() now passes the 'start' of the device's
      data area (aka pe_start) as the 'offset' to blk_stack_limits().
      
      init_valid_queue_limits() is replaced by blk_set_default_limits().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: martin.petersen@oracle.com
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm log: fix create_log_context to use logical_block_size of log device · 18d8594d
      Committed by Mike Snitzer
      create_log_context() must use the logical_block_size from the log disk,
      where the I/O happens, not the target's logical_block_size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm target:s introduce iterate devices fn · af4874e0
      Committed by Mike Snitzer
      Add .iterate_devices to 'struct target_type' to allow a function to be
      called for all devices in a DM target.  Implemented it for all targets
      except those in dm-snap.c (origin and snapshot).
      
      (The raid1 version number jumps to 1.12 because we originally reserved
      1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
      instead.)
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: martin.petersen@oracle.com
    • dm table: establish queue limits by copying table limits · 1197764e
      Committed by Mike Snitzer
      Copy the table's queue_limits to the DM device's request_queue.  This
      properly initializes the queue's topology limits and also avoids having
      to track the evolution of 'struct queue_limits' in
      dm_table_set_restrictions().
      
      Also fixes a bug that was introduced in dm_table_set_restrictions() via
      commit ae03bf63.  In addition to
      establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit()
      also performs an allocation to setup the ISA DMA pool.  This allocation
      resulted in "sleeping function called from invalid context" when called
      from dm_table_set_restrictions().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: replace struct io_restrictions with struct queue_limits · 5ab97588
      Committed by Mike Snitzer
      Use blk_stack_limits() to stack block limits (including topology) rather
      than duplicate the equivalent within Device Mapper.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: validate device logical_block_size · be6d4305
      Committed by Mike Snitzer
      Impose necessary and sufficient conditions on a device's table such
      that any incoming bio which respects its logical_block_size can be
      processed successfully.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: ensure targets are aligned to logical_block_size · 02acc3a4
      Committed by Mike Snitzer
      Ensure I/O is aligned to the logical block size of target devices.
      
      Rename check_device_area() to device_area_is_valid() for clarity and
      establish the device limits including the logical block size prior to
      calling it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
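The alignment requirement in 02acc3a4 can be illustrated with a simple check on a target's start and length. This sketch assumes both values are expressed in 512-byte sectors and that the logical block size is a multiple of 512. `target_is_aligned` is a hypothetical helper, not the kernel's device_area_is_valid().

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true when a target's start and length (in 512-byte sectors)
 * are both whole multiples of the device's logical block size. */
static bool target_is_aligned(uint64_t start_sector, uint64_t len_sectors,
                              unsigned int logical_block_size)
{
    uint64_t sectors_per_block = logical_block_size >> 9;  /* bytes / 512 */

    if (sectors_per_block == 0)
        return false;  /* block size smaller than a sector: reject */

    return start_sector % sectors_per_block == 0 &&
           len_sectors % sectors_per_block == 0;
}
```

With a 4096-byte logical block size, for example, a target must start and end on 8-sector boundaries. With 512-byte blocks every sector offset is acceptable.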
    • dm ioctl: support cookies for udev · 60935eb2
      Committed by Milan Broz
      Add support for passing a 32 bit "cookie" into the kernel with the
      DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
      value of this cookie is returned to userspace alongside the uevents
      issued by these ioctls in the variable DM_COOKIE.
      
      This means the userspace process issuing these ioctls can be notified
      by udev after udev has completed any actions triggered.
      
      To minimise the interface extension, we pass the cookie into the
      kernel in the event_nr field which is otherwise unused when calling
      these ioctls.  Incrementing the version number allows userspace to
      determine in advance whether or not the kernel supports the cookie.
      If the kernel does support this but userspace does not, there should
      be no impact as the new variable will just get ignored.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: sysfs add suspended attribute · 486d220f
      Committed by Peter Rajnoha
      Add a file named 'suspended' to each device-mapper device directory in
      sysfs.  It holds the value 1 while the device is suspended.  Otherwise
      it holds 0.
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: improve warning message when devices not freed before destruction · 1b6da754
      Committed by Jonathan Brassow
      Report any devices that were not freed before a table is destroyed.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: add service time load balancer · f392ba88
      Committed by Kiyoshi Ueda
      This patch adds a service time oriented dynamic load balancer,
      dm-service-time, which selects the path with the shortest estimated
      service time for the incoming I/O.
      The service time is estimated by dividing the in-flight I/O size
      by a performance value of each path.
      
      The performance value can be given as a table argument at the table
      loading time.  If no performance value is given, all paths are
      considered equal.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
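The selection rule described in f392ba88 (estimate service time as in-flight I/O size divided by a per-path performance value, then pick the smallest) can be sketched as below. The `st_path` struct and `pick_service_time` are hypothetical, and using 0 to mean "no performance value given" is an assumption of this example rather than the module's actual convention.

```c
#include <stddef.h>

/* Hypothetical per-path state. */
struct st_path {
    unsigned long long in_flight_bytes;  /* size of I/O currently in flight */
    unsigned int perf;                   /* relative performance; 0 = not given */
};

/* Return the index of the path with the shortest estimated service time,
 * or -1 if there are no paths. Ties go to the earliest path. */
static int pick_service_time(const struct st_path *paths, size_t n)
{
    int best = -1;
    double best_time = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Paths without a performance value are all treated as equal. */
        unsigned int perf = paths[i].perf ? paths[i].perf : 1;
        double t = (double)paths[i].in_flight_bytes / perf;

        if (best < 0 || t < best_time) {
            best = (int)i;
            best_time = t;
        }
    }
    return best;
}

/* Sample paths: estimated times are 4096, 2048 and 1024 respectively. */
static const struct st_path sample_st_paths[] = {
    { 4096, 1 }, { 8192, 4 }, { 1024, 0 },
};
```

The second sample path carries twice as much in-flight I/O as the first but is rated four times faster, so it is estimated to finish sooner.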
    • dm mpath: add queue length load balancer · fd5e0339
      Committed by Kiyoshi Ueda
      This patch adds a dynamic load balancer, dm-queue-length, which
      balances the number of in-flight I/Os across the paths.
      
      The code is based on the patch posted by Stefan Bader:
      https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
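The balancing idea in fd5e0339 amounts to sending each new I/O to the path with the fewest I/Os in flight. A minimal sketch, with `pick_queue_length` and the sample counts being hypothetical names and data for illustration:

```c
#include <stddef.h>

/* Return the index of the path with the fewest in-flight I/Os, or -1 if
 * there are no paths. Ties go to the earliest path. */
static int pick_queue_length(const unsigned int *in_flight, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 || in_flight[i] < in_flight[best])
            best = (int)i;
    }
    return best;
}

/* Sample in-flight counts for three paths. */
static const unsigned int sample_in_flight[] = { 3, 1, 2 };
```

Unlike the service-time sketch, this rule ignores I/O size and path speed, so it suits paths of roughly equal performance.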