1. 16 Nov, 2010 1 commit
  2. 05 Oct, 2010 1 commit
    • block: autoconvert trivial BKL users to private mutex · 2a48fc0a
      Arnd Bergmann authored
      The block device drivers have all gained new lock_kernel
      calls from a recent pushdown, and some of the drivers
      were already using the BKL before.
      
      This turns the BKL into a set of per-driver mutexes.
      Still need to check whether this is safe to do.
      
      file=$1
      name=$2
      if grep -q lock_kernel ${file} ; then
          if grep -q 'include.*linux.mutex.h' ${file} ; then
                  sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
          else
                  sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
          fi
          sed -i ${file} \
              -e "/^#include.*linux.mutex.h/,$ {
                      1,/^\(static\|int\|long\)/ {
                           /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
      
      } }"  \
          -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
          -e '/[      ]*cycle_kernel_lock();/d'
      else
          sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                      -e '/cycle_kernel_lock()/d'
      fi
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
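The call rewrite at the core of the script above can be illustrated outside sed. This is a minimal Python sketch, not the script the commit used: `convert_bkl_calls` is a hypothetical helper name, and the `(un)?` group only approximates the sed expression's `\(un\)*`.

```python
import re

# Approximates: s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g
_BKL_CALL = re.compile(r"(un)?lock_kernel\b[ ]*\(\)")


def convert_bkl_calls(source: str, name: str) -> str:
    """Rewrite lock_kernel()/unlock_kernel() calls to use a per-driver mutex.

    `name` plays the role of the script's ${name} argument.
    """
    def replace(match: re.Match) -> str:
        prefix = match.group(1) or ""  # "un" for unlock_kernel(), else ""
        return f"mutex_{prefix}lock(&{name}_mutex)"

    return _BKL_CALL.sub(replace, source)
```

For a driver named `nbd`, `lock_kernel();` would become `mutex_lock(&nbd_mutex);`. The sketch covers only the call rewrite; the header swap and the `DEFINE_MUTEX` insertion done by the script are left out.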
  3. 10 Sep, 2010 6 commits
    • dm: convey that all flushes are processed as empty · b372d360
      Mike Snitzer authored
      Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
      clarity.
      
      Simplify logic associated with REQ_FLUSH conditionals.
      
      Introduce a BUG_ON() and add a few more helpful comments to the code
      so that it is clear that all flushes are empty.
      
      Cleanup __split_and_process_bio() so that an empty flush isn't processed
      by a 'sector_count' focused while loop.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: fix locking context in queue_io() · 05447420
      Kiyoshi Ueda authored
      Now queue_io() is called from dec_pending(), which may be called with
      interrupts disabled, so queue_io() must not enable interrupts
      unconditionally and must save/restore the current interrupt state.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: relax ordering of bio-based flush implementation · 6a8736d1
      Tejun Heo authored
      Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
      against other bio's.  This patch relaxes ordering around flushes.
      
      * A flush bio is no longer deferred to workqueue directly.  It's
        processed like other bio's but __split_and_process_bio() uses
        md->flush_bio as the clone source.  md->flush_bio is initialized to
        empty flush during md initialization and shared for all flushes.
      
      * As a flush bio now travels through the same execution path as other
        bio's, there's no need for dedicated error handling path either.  It
        can use the same error handling path in dec_pending().  Dedicated
        error handling removed along with md->flush_error.
      
      * When dec_pending() detects that a flush has completed, it checks
        whether the original bio has data.  If so, the bio is queued to the
        deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
      
      * As flush sequencing is handled in the usual issue/completion path,
        dm_wq_work() no longer needs to handle flushes differently.  Now its
        only responsibility is re-issuing deferred bio's the same way as
        _dm_request() would.  REQ_FLUSH handling logic including
        process_flush() is dropped.
      
      * There's no reason for queue_io() and dm_wq_work() to write lock
        dm->io_lock.  queue_io() now only uses md->deferred_lock and
        dm_wq_work() read locks dm->io_lock.
      
      * bio's no longer need to be queued on the deferred list while a flush
        is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.
      
      This avoids stalling the device during flushes and simplifies the
      implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
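The completion rule described for dec_pending() above can be modelled in a few lines. This is a hedged userspace sketch, not the kernel code: the `Bio` class, the `REQ_FLUSH` bit value, and `on_flush_completed` are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass

REQ_FLUSH = 1 << 0  # hypothetical bit value, for illustration only


@dataclass
class Bio:
    flags: int
    size: int  # payload size; 0 models an empty flush


def on_flush_completed(bio: Bio, deferred: deque) -> str:
    """Model of what happens once the flush part of a bio has completed."""
    if bio.size > 0:
        # The bio still carries data: clear REQ_FLUSH and queue it on the
        # deferred list so the data part is issued like any other bio.
        bio.flags &= ~REQ_FLUSH
        deferred.append(bio)
        return "deferred"
    # An empty flush has nothing left to do and is simply completed.
    return "completed"
```

An empty flush finishes immediately, while a flush with data goes round the normal issue path once more with REQ_FLUSH cleared.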
    • dm: implement REQ_FLUSH/FUA support for request-based dm · 29e4013d
      Tejun Heo authored
      This patch converts request-based dm to support the new REQ_FLUSH/FUA.
      
      The original request-based flush implementation depended on
      request_queue blocking other requests while a barrier sequence is in
      progress, which is no longer true for the new REQ_FLUSH/FUA.
      
      In general, request-based dm doesn't have infrastructure for cloning
      one source request to multiple targets, but the original flush
      implementation had a special mostly independent path which can issue
      flushes to multiple targets and sequence them.  However, the
      capability isn't currently in use and adds a lot of complexity.
      Moreover, it's unlikely to be useful in its current form as it
      doesn't make sense to be able to send out flushes to multiple targets
      when write requests can't be.
      
      This patch rips out the special flush code path and handles
      REQ_FLUSH/FUA requests the same way as other requests.  The only
      special treatment is that REQ_FLUSH requests use the block address 0
      when finding target, which is enough for now.
      
      * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
        suggested by Mike Snitzer
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: implement REQ_FLUSH/FUA support for bio-based dm · d87f4c14
      Tejun Heo authored
      This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
      now deprecated REQ_HARDBARRIER.
      
      * -EOPNOTSUPP handling logic dropped.
      
      * Preflush is handled as before but postflush is dropped and replaced
        with passing down REQ_FUA to member request_queues.  This replaces
        one array wide cache flush w/ member specific FUA writes.
      
      * __split_and_process_bio() now calls __clone_and_map_flush() directly
        for flushes and guarantees all FLUSH bio's going to targets are zero
        length.
      
      * It's now guaranteed that all FLUSH bio's which are passed onto dm
        targets are zero length.  bio_empty_barrier() tests are replaced
        with REQ_FLUSH tests.
      
      * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.
      
      * Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
        enough to be marked with unlikely().
      
      * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
        doesn't support cache flushing.  Advertise REQ_FLUSH | REQ_FUA
        capability.
      
      * Request based dm isn't converted yet.  dm_init_request_based_queue()
        resets flush support to 0 for now.  To avoid disturbing request
        based dm code, dm->flush_error is added for bio based dm while
        request based dm continues to use dm->barrier_error.
      
      Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
      proceed with caution as I'm not familiar with the code base.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: dm-devel@redhat.com
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() · 4913efe4
      Tejun Heo authored
      Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
      requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
      -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
      blk_queue_flush().
      
      blk_queue_flush() takes combinations of REQ_FLUSH and FUA.  If a
      device has write cache and can flush it, it should set REQ_FLUSH.  If
      the device can handle FUA writes, it should also set REQ_FUA.
      
      All blk_queue_ordered() users are converted.
      
      * ORDERED_DRAIN is mapped to 0 which is the default value.
      * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
      * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Pierre Ossman <drzeus@drzeus.cx>
      Cc: Stefan Weinhuber <wein@de.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
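The conversion table listed in the commit above can be written down as a lookup. This is an illustrative Python sketch only: the bit values of `REQ_FLUSH` and `REQ_FUA` and the helper `flush_flags_for` are assumptions made for the example, not the kernel's definitions.

```python
REQ_FLUSH = 1 << 0  # hypothetical bit values, for illustration only
REQ_FUA = 1 << 1

# Old blk_queue_ordered() mode -> new blk_queue_flush() argument,
# following the three mappings listed in the commit message.
ORDERED_TO_FLUSH = {
    "ORDERED_DRAIN": 0,
    "ORDERED_DRAIN_FLUSH": REQ_FLUSH,
    "ORDERED_DRAIN_FLUSH_FUA": REQ_FLUSH | REQ_FUA,
}


def flush_flags_for(ordered_mode: str) -> int:
    """Return the flush capability flags that replace an old ordered mode."""
    return ORDERED_TO_FLUSH[ordered_mode]
```

A device with a write cache it can flush ends up with REQ_FLUSH set, and one that also handles FUA writes adds REQ_FUA, matching the description in the commit message.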
  4. 12 Aug, 2010 10 commits
    • dm: split discard requests on target boundaries · a79245b3
      Mike Snitzer authored
      Update __clone_and_map_discard to loop across all targets in a DM
      device's table when it processes a discard bio.  If a discard crosses a
      target boundary it must be split accordingly.
      
      Update __issue_target_requests and __issue_target_request to allow a
      cloned discard bio to have a custom start sector and size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
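Splitting a discard on target boundaries, as described above, amounts to intersecting the discard range with each target's range. This is a rough Python model under stated assumptions: targets are represented as `(begin, length)` sector tuples, and `split_discard` is a hypothetical helper, not a function from the dm code.

```python
from typing import List, Tuple


def split_discard(
    targets: List[Tuple[int, int]], start: int, count: int
) -> List[Tuple[int, int, int]]:
    """Split a discard of `count` sectors at `start` across targets.

    `targets` is a list of (begin, length) tuples in sectors.
    Returns (target_index, piece_start, piece_count) for each piece.
    """
    pieces = []
    end = start + count
    for index, (begin, length) in enumerate(targets):
        lo = max(start, begin)          # first sector inside this target
        hi = min(end, begin + length)   # one past the last sector inside it
        if lo < hi:
            pieces.append((index, lo, hi - lo))
    return pieces
```

With two targets covering sectors 0-99 and 100-149, a discard of 30 sectors starting at sector 90 yields one piece of 10 sectors for the first target and one of 20 sectors for the second.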
    • dm: factor out max_io_len_target_boundary · 56a67df7
      Mike Snitzer authored
      Split max_io_len_target_boundary out of max_io_len so that the discard
      support can make use of it without duplicating max_io_len code.
      
      Avoiding max_io_len's split_io logic enables DM's discard support to
      submit the entire discard request to a target.  But discards must still
      be split on target boundaries.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: use common __issue_target_request for flush and discard support · 06a426ce
      Mike Snitzer authored
      Rename __flush_target to __issue_target_request now that it is used to
      issue both flush and discard requests.
      
      Introduce __issue_target_requests as a convenient wrapper to
      __issue_target_request 'num_flush_requests' or 'num_discard_requests'
      times per target.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: linear support discard · 5ae89a87
      Mike Snitzer authored
      Allow discards to be passed through to linear mappings if at least one
      underlying device supports it.  Discards will be forwarded only to
      devices that support them.
      
      A target that supports discards should set num_discard_requests to
      indicate how many times each discard request must be submitted to it.
      
      Verify table's underlying devices support discards prior to setting the
      associated DM device as capable of discards (via QUEUE_FLAG_DISCARD).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Joe Thornber <thornber@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: rename map_info flush_request to target_request_nr · 57cba5d3
      Mike Snitzer authored
      'target_request_nr' is a more generic name that reflects the fact that
      it will be used for both flush and discard support.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: do not initialise full request queue when bio based · 4a0b4ddf
      Mike Snitzer authored
      Bio-based mapped devices no longer have a fully initialized
      request_queue (request_fn, elevator, etc).  This means bio-based DM
      devices no longer register elevator sysfs attributes ('iosched/' tree
      or 'scheduler' other than "none").
      
      In contrast, a request-based DM device will continue to have a full
      request_queue and will register elevator sysfs attributes.  Therefore
      a user can determine a DM device's type by checking if elevator sysfs
      attributes exist.
      
      First allocate a minimalist request_queue structure for a DM device
      (needed for both bio and request-based DM).
      
      Initialization of a full request_queue is deferred until it is known
      that the DM device is request-based, at the end of the table load
      sequence.
      
      Factor DM device's request_queue initialization:
      - common to both request-based and bio-based into dm_init_md_queue().
      - specific to request-based into dm_init_request_based_queue().
      
      The md->type_lock mutex is used to protect md->queue, in addition to
      md->type, during table_load().
      
      A DM device's first table_load will establish the immutable md->type.
      But md->queue initialization, based on md->type, may fail at that time
      (because blk_init_allocated_queue cannot allocate memory).  Therefore
      any subsequent table_load must (re)try dm_setup_md_queue independently of
      establishing md->type.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: make bio or request based device type immutable · a5664dad
      Mike Snitzer authored
      Determine whether a mapped device is bio-based or request-based when
      loading its first (inactive) table and don't allow that to be changed
      later.
      
      This patch performs different device initialisation in each of the two
      cases.  (We don't think it's necessary to add code to support changing
      between the two types.)
      
      Allowed md->type transitions:
        DM_TYPE_NONE to DM_TYPE_BIO_BASED
        DM_TYPE_NONE to DM_TYPE_REQUEST_BASED
      
      We now prevent table_load from replacing the inactive table with a
      conflicting type of table even after an explicit table_clear.
      
      Introduce 'type_lock' into the struct mapped_device to protect md->type
      and to prepare for the next patch that will change the queue
      initialization and allocate memory while md->type_lock is held.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      
       drivers/md/dm-ioctl.c    |   15 +++++++++++++++
       drivers/md/dm.c          |   37 ++++++++++++++++++++++++++++++-------
       drivers/md/dm.h          |    5 +++++
       include/linux/dm-ioctl.h |    4 ++--
       4 files changed, 52 insertions(+), 9 deletions(-)
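The allowed md->type transitions listed above can be modelled as a small state check. This is a simplified Python sketch: the `MappedDevice` class and its boolean return value are assumptions for illustration (the real table_load reports errors differently), while the type names are taken from the commit message.

```python
import threading

DM_TYPE_NONE = "DM_TYPE_NONE"
DM_TYPE_BIO_BASED = "DM_TYPE_BIO_BASED"
DM_TYPE_REQUEST_BASED = "DM_TYPE_REQUEST_BASED"


class MappedDevice:
    """Toy model of a mapped device whose type is set once and never changed."""

    def __init__(self) -> None:
        self.type = DM_TYPE_NONE
        self.type_lock = threading.Lock()  # stands in for md->type_lock

    def table_load(self, table_type: str) -> bool:
        """Accept the first table's type; reject later conflicting types."""
        with self.type_lock:
            if self.type == DM_TYPE_NONE:
                self.type = table_type  # NONE -> BIO_BASED or REQUEST_BASED
                return True
            return self.type == table_type
```

Once the first table has been loaded, loading a table of the other type is refused, which models the rule that the type stays fixed even after a table_clear.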
    • dm: skip second flush on bio unsupported error · 708e9295
      Mikulas Patocka authored
      When processing barriers, skip the second flush if processing the bio
      failed with -EOPNOTSUPP.  This can happen with discard+barrier requests.
      If the device doesn't support discard, there would be two useless
      SYNCHRONIZE CACHE commands.  The first dm_flush cannot be so easily
      optimized out, so we leave it there.
      
      Previously, -EOPNOTSUPP could be received in dec_pending only with empty
      barriers and we ignored that error, assuming that a device which does not support
      cache flushes always has a consistent cache.  With the addition of discard
      barriers, this -EOPNOTSUPP can also be generated by discards and we
      must record it in md->barrier_error for process_barrier.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: separate device deletion from dm_put · 3f77316d
      Kiyoshi Ueda authored
      This patch separates the device deletion code from dm_put()
      to make sure the deletion happens in the process context.
      
      With this patch, device deletion always occurs in an ioctl (process)
      context and dm_put() can be called in interrupt context.
      As a result, the request-based dm's bad dm_put() usage pointed out
      by Mikulas below disappears.
          http://marc.info/?l=dm-devel&m=126699981019735&w=2
      
      Without this patch, I confirmed that there is a case that crashes the system:
          dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
      
      Some more background and details:
      In request-based dm, a device opener can remove a mapped_device
      while the last request is still completing, because bios in the last
      request complete first and then the device opener can close and remove
      the mapped_device before the last request completes:
        CPU0                                          CPU1
        =================================================================
        <<INTERRUPT>>
        blk_end_request_all(clone_rq)
          blk_update_request(clone_rq)
            bio_endio(clone_bio) == end_clone_bio
              blk_update_request(orig_rq)
                bio_endio(orig_bio)
                                                      <<I/O completed>>
                                                      dm_blk_close()
                                                      dev_remove()
                                                        dm_put(md)
                                                          <<Free md>>
         blk_finish_request(clone_rq)
           ....
           dm_end_request(clone_rq)
             free_rq_clone(clone_rq)
             blk_end_request_all(orig_rq)
             rq_completed(md)
      
      So request-based dm used dm_get()/dm_put() to hold md for each I/O
      until its request completion handling is fully done.
      However, the final dm_put() can call the device deletion code which
      must not be run in interrupt context and may cause kernel panic.
      
      To solve the problem, this patch moves the device deletion code,
      dm_destroy(), to predetermined places that are actually deleting
      the mapped_device in ioctl (process) context, and changes dm_put()
      just to decrement the reference count of the mapped_device.
      By this change, dm_put() can be used in any context and the symmetric
      model below is introduced:
          dm_create():  create a mapped_device
          dm_destroy(): destroy a mapped_device
          dm_get():     increment the reference count of a mapped_device
          dm_put():     decrement the reference count of a mapped_device
      
      dm_destroy() waits for all references of the mapped_device to disappear,
      then deletes the mapped_device.
      
      dm_destroy() uses active waiting with msleep(1), since deleting
      the mapped_device isn't a performance-critical task.
      And since at this point, nobody opens the mapped_device and no new
      reference will be taken, the pending counts are just for racing
      completing activity and will eventually decrease to zero.
      
      For the unlikely case of the forced module unload, dm_destroy_immediate(),
      which doesn't wait and forcibly deletes the mapped_device, is also
      introduced and used in dm_hash_remove_all().  Otherwise, "rmmod -f"
      may be stuck and never return.
      And now, because the mapped_device is deleted at this point, subsequent
      accesses to the mapped_device may cause NULL pointer references.
      
      Cc: stable@kernel.org
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
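The symmetric get/put/destroy model described above can be sketched in userspace. This is a toy Python model under loud assumptions: the `MappedDevice` class, its fields, and the polling interval are invented for illustration, and `time.sleep` merely stands in for the kernel's msleep(1).

```python
import threading
import time


class MappedDevice:
    """Toy stand-in for a mapped_device with a reference count."""

    def __init__(self) -> None:
        self.holders = 0
        self.deleted = False
        self._lock = threading.Lock()


def dm_get(md: MappedDevice) -> None:
    with md._lock:
        md.holders += 1


def dm_put(md: MappedDevice) -> None:
    # Only decrements the count; it never frees anything itself.
    with md._lock:
        md.holders -= 1


def dm_destroy(md: MappedDevice, poll_seconds: float = 0.001) -> None:
    """Wait until every reference is dropped, then delete the device."""
    while True:
        with md._lock:
            if md.holders == 0:
                md.deleted = True
                return
        time.sleep(poll_seconds)  # polling wait, like msleep(1) in the patch
```

Because dm_put() in this model does nothing but decrement, the deletion work always runs in the caller of dm_destroy(), which mirrors the patch's goal of keeping deletion in process context.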
    • dm: prevent access to md being deleted · abdc568b
      Kiyoshi Ueda authored
      This patch prevents access to mapped_device which is being deleted.
      
      Currently, even after a mapped_device has been removed from the hash,
      it could be accessed through idr_find() using minor number.
      That could cause the race and NULL pointer dereference shown below:
        CPU0                          CPU1
        ------------------------------------------------------------------
        dev_remove(param)
          down_write(_hash_lock)
          dm_lock_for_deletion(md)
            spin_lock(_minor_lock)
            set_bit(DMF_DELETING)
            spin_unlock(_minor_lock)
          __hash_remove(hc)
          up_write(_hash_lock)
                                      dev_status(param)
                                        md = find_device(param)
                                               down_read(_hash_lock)
                                               __find_device_hash_cell(param)
                                                 dm_get_md(param->dev)
                                                   md = dm_find_md(dev)
                                                          spin_lock(_minor_lock)
                                                          md = idr_find(MINOR(dev))
                                                          spin_unlock(_minor_lock)
          dm_put(md)
            free_dev(md)
                                                   dm_get(md)
                                               up_read(_hash_lock)
                                        __dev_status(md, param)
                                        dm_put(md)
      
      This patch fixes such problems.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  5. 08 Aug, 2010 5 commits
  6. 06 Mar, 2010 3 commits
    • dm ioctl: introduce flag indicating uevent was generated · 3abf85b5
      Peter Rajnoha authored
      Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to
      indicate that a uevent was actually generated.  This tells the userspace
      caller that it may need to wait for the event to be processed.
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: free dm_io before bio_endio not after · a97f925a
      Mikulas Patocka authored
      Free the dm_io structure before calling bio_endio() instead of after it,
      to ensure that the io_pool containing it is not referenced after it is
      freed.
      
      This partially fixes a problem described here
        https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html
      
      thread 1:
      bio_endio(bio, io_error);
      /* scheduling happens */
      					thread 2:
      					close the device
      					remove the device
      thread 1:
      free_io(md, io);
      
      Thread 2, when removing the device, sees non-empty md->io_pool (because the
      io hasn't been freed by thread 1 yet) and may crash with BUG in mempool_free.
      Thread 1 may also crash, when freeing into a nonexisting mempool.
      
      To fix this we must make sure that bio_endio() is the last call and
      the md structure is not accessed afterwards.
      
      There is another bio_endio in process_barrier, but it is called from the thread
      and the thread is destroyed prior to freeing the mempools, so this call is
      not affected by the bug.
      
      A similar bug exists with module unloads - the module may be unloaded
      immediately after bio_endio - but that is more difficult to fix.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: remove dm_get from dm_table_get_md · ecdb2e25
      Kiyoshi Ueda authored
      Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
      be called from presuspend/postsuspend, which are called while
      mapped_device is in DMF_FREEING state, where dm_get() is not allowed.
      
      Justification for that is the lifetime of both objects: As far as the
      current dm design/implementation, mapped_device is never freed while
      targets are doing something, because dm core waits for targets to become
      quiet in dm_put() using presuspend/postsuspend.  So targets should be
      able to touch mapped_device without holding reference count of the
      mapped_device, and we should allow targets to touch mapped_device even
      if it is in DMF_FREEING state.
      
      Background:
      I'm trying to remove the multipath internal queue, since dm core now has
      a generic queue for request-based dm.  In the patch-set, the multipath
      target wants to request dm core to start/stop queue.  One of such
      start/stop requests can happen during postsuspend() while the target
      waits for pg-init to complete, because the target stops queue when
      starting pg-init and tries to restart it when completing pg-init.  Since
      queue belongs to mapped_device, it involves calling dm_table_get_md()
      and dm_put().  On the other hand, postsuspend() is called in dm_put()
      for mapped_device which is in DMF_FREEING state, and that triggers
      BUG_ON(DMF_FREEING) in the 2nd dm_put().
      
      I had tried to solve this problem by changing only multipath not to
      touch mapped_device which is in DMF_FREEING state, but I couldn't and I
      came up with a question why we need dm_get() in dm_table_get_md().
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  7. 17 Feb, 2010 1 commit
    • dm mpath: fix stall when requeueing io · 9eef87da
      Kiyoshi Ueda authored
      This patch fixes a problem where the system may stall if the target's ->map_rq
      returns DM_MAPIO_REQUEUE in map_request().
      E.g. a stall happens on a 1 CPU box when a dm-mpath device with queue_if_no_path
           bounces between all-paths-down and paths-up under I/O load.
      
      When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues
      the request and returns to dm_request_fn().  Then, dm_request_fn()
      doesn't exit the I/O dispatching loop and continues processing
      the requeued request again.
      This map and requeue loop can run with interrupts disabled,
      so a 1 CPU system can stall if this situation happens.
      
      For example, commands below can stall my 1 CPU box within 1 minute or so:
        # dmsetup table mp
        mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
        # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
        # while true; do \
        > dmsetup message mp 0 "fail_path 8:144"; \
        > dmsetup suspend --noflush mp; \
        > dmsetup resume mp; \
        > dmsetup message mp 0 "reinstate_path 8:144"; \
        > done
      
      To fix the problem above, this patch changes dm_request_fn() to exit
      the I/O dispatching loop as soon as a request is requeued in map_request().
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
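The fix described above, leaving the dispatch loop when a request is requeued, can be illustrated with a small model. This is a hedged Python sketch: `dispatch` and the `map_rq` callback signature are hypothetical, the constant values are placeholders, and only the names DM_MAPIO_REQUEUE and DM_MAPIO_REMAPPED echo dm's terminology.

```python
from collections import deque
from typing import Callable, List

DM_MAPIO_REMAPPED = "remapped"  # placeholder values, not the kernel's
DM_MAPIO_REQUEUE = "requeue"


def dispatch(queue: deque, map_rq: Callable[[object], str]) -> List[object]:
    """Dispatch requests until the queue is empty or one is requeued."""
    dispatched = []
    while queue:
        rq = queue.popleft()
        if map_rq(rq) == DM_MAPIO_REQUEUE:
            # Put the request back and leave the loop instead of retrying
            # it immediately, so the caller can return and other work runs.
            queue.appendleft(rq)
            break
        dispatched.append(rq)
    return dispatched
```

Without the `break`, a target that keeps answering with a requeue would make the loop spin on the same request, which is the stall the commit describes.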
  8. 11 Dec, 2009 13 commits
    • dm: export suspended state to targets · 64dbce58
      Kiyoshi Ueda authored
      This patch adds the exported dm_suspended() function so that targets
      can check whether or not they are suspended.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: rename dm_suspended to dm_suspended_md · 4f186f8b
      Kiyoshi Ueda authored
      This patch renames dm_suspended() to dm_suspended_md() and
      keeps it internal to dm.
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: swap target postsuspend call and setting suspended flag · 4d4471cb
      Kiyoshi Ueda authored
      This patch moves DMF_SUSPENDED flag set before postsuspend.
      No one should care about the ordering, because the flag set and
      the postsuspend are protected by a single lock, md->suspend_lock,
      and all strict flag-checkers take the lock.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: trace request based remapping · 6db4ccd6
      Jun'ichi Nomura authored
      This patch adds a remapping trace to request-based dm.
      BIO-based dm already has the equivalent tracepoint.
      
      For example, under this dm stack (linear LV on multipath):
        # dmsetup ls --tree -o ascii
        vg-lv0 (253:1)
         `-mpath0 (253:0)
            |- (8:160)
            |- (66:80)
            |- (65:176)
            `- (65:160)
      
      Trace of 'dd of=/dev/vg/lv0 bs=128k count=1 oflag=direct' looks like this:
      
      without the patch:
        dd-6674  [000]   539.727384: block_bio_queue: 253,1 WS 0 + 256 [dd]
        dd-6674  [000]   539.727392: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
        dd-6674  [000]   539.727394: block_bio_queue: 253,0 WS 384 + 256 [dd]
        dd-6674  [000]   539.727405: block_getrq: 253,0 WS 384 + 256 [dd]
        dd-6674  [000]   539.727409: block_plug: [dd]
        dd-6674  [000]   539.727410: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727416: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727426: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727427: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
        ...
      
      and with the patch: (the line with '**' is the trace added by this patch)
        dd-6617  [002]   162.914301: block_bio_queue: 253,1 WS 0 + 256 [dd]
        dd-6617  [002]   162.914314: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
        dd-6617  [002]   162.914316: block_bio_queue: 253,0 WS 384 + 256 [dd]
        dd-6617  [002]   162.914331: block_getrq: 253,0 WS 384 + 256 [dd]
        dd-6617  [002]   162.914335: block_plug: [dd]
        dd-6617  [002]   162.914337: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
        dd-6617  [002]   162.914347: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
      **dd-6617  [002]   162.914356: block_rq_remap: 65,176 W 384 + 256 <- (253,0) 384
        dd-6617  [002]   162.914358: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
        dd-6617  [002]   162.914359: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
        ...
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      6db4ccd6
    • A
      dm: keep old table until after resume succeeded · 042d2a9b
      Committed by Alasdair G Kergon
      When swapping a new table into place, retain the old table until
      its replacement is in place.
      
      An old check for an empty table is removed because this is enforced
      in populate_table().
      
      __unbind() becomes redundant when followed by __bind().
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • A
      dm: bind new table before destroying old · a7940155
      Committed by Alasdair G Kergon
      When replacing a mapped device's table during a 'resume', delay the
      destruction of the old table until the new one is successfully in place.
      
      This will make it easier for a later patch to transfer internal state
      information from the old table to the new one (something we do not currently
      support) while giving us more options for reversion if a later part
      of the operation fails.
      
      Devices are always in the suspended state during dm_swap_table().
      This patch reinforces the requirement that all I/O must have been
      flushed from the table targets while in this state (including any in
      workqueues).  In the case of 'noflush' suspending, unprocessed
      I/O should have been 'pushed back' to the dm core prior to this point,
      for resubmission after the new table is in place.
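
      The ordering described above (bind the new table first, destroy the
      old one only afterwards) can be sketched in user space as follows.
      All types and function names here are hypothetical simplifications,
      not the dm core's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for dm structures. */
struct table { int id; int destroyed; };
struct mapped_dev { struct table *live; int suspended; };

/*
 * Bind new_table and hand the previous table back through old_out.
 * On failure the old table stays live and nothing is destroyed.
 */
static int swap_table(struct mapped_dev *md, struct table *new_table,
		      struct table **old_out)
{
	assert(md != NULL && old_out != NULL);
	*old_out = NULL;
	if (!md->suspended || new_table == NULL)
		return -1;	/* device must be suspended during the swap */
	*old_out = md->live;
	md->live = new_table;
	return 0;
}

static void destroy_table(struct table *t)
{
	if (t != NULL)
		t->destroyed = 1;
}

/* Resume path: the old table is destroyed only after the swap succeeded. */
static int resume_with(struct mapped_dev *md, struct table *new_table)
{
	struct table *old;
	int r = swap_table(md, new_table, &old);

	if (r != 0)
		return r;	/* old table untouched, caller can revert */
	destroy_table(old);
	md->suspended = 0;
	return 0;
}
```

      Because the caller still holds the old table when the swap returns,
      a later failure can fall back to it instead of leaving the device
      without any table.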
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • M
      dm: add dm_deleting_md function · 432a212c
      Committed by Mike Anderson
      Add dm_deleting_md to check whether or not a given mapped
      device is currently being deleted.
      Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • A
      dm: rename dm_get_table to dm_get_live_table · 7c666411
      Committed by Alasdair G Kergon
      Rename dm_get_table to dm_get_live_table.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • K
      dm: add request based barrier support · d0bcb878
      Committed by Kiyoshi Ueda
      This patch adds barrier support for request-based dm.
      
      CORE DESIGN
      
      The design is basically the same as that of bio-based dm, which emulates
      a barrier by mapping empty barrier bios before and after a barrier I/O.
      But request-based dm already uses struct request_queue for I/O
      queueing, so the block-layer's barrier mechanism can be used.
      
      o Summary of the block-layer's behavior (on which dm-core depends)
        Request-based dm uses the QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
        I/O barriers.  This means that when an I/O requiring a barrier is found
        in the request_queue, the block-layer creates a pre-flush request and
        a post-flush request just before and just after the I/O respectively.
      
        After the ordered sequence starts, the block-layer waits for all
        in-flight I/Os to complete, then gives drivers the pre-flush request,
        the barrier I/O and the post-flush request one by one.
        It means that the request_queue is stopped automatically by
        the block-layer until drivers complete each sequence.
      
      o dm-core
        The barrier I/O itself is treated as a normal I/O, so no additional
        code is needed.
      
        For the pre/post-flush request, caches are flushed as follows:
          1. Create as many empty barrier requests as the target's
             num_flush_requests requires, and map them (dm_rq_barrier()).
          2. Wait for the mapped barriers to complete (dm_rq_barrier()).
             If an error has occurred, save the error value in md->barrier_error
             (dm_end_request()).
             (*) Basically, the first reported error is kept.
                 But -EOPNOTSUPP supersedes any other error, and
                 DM_ENDIO_REQUEUE supersedes any error except -EOPNOTSUPP.
          3. Requeue the pre/post-flush request if the error value is
             DM_ENDIO_REQUEUE.  Otherwise, complete it with the error value
             (dm_rq_barrier_work()).
        The pre/post-flush work above is done in the kernel thread (kdmflush)
        context, since memory allocation which might sleep is needed in
        dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is
        an irq-disabled context.
        Also, clones of the pre/post-flush request share an original, so
        such clones can't be completed using the softirq context.
        Instead, complete them in the context of underlying device drivers.
        It should be safe since there is no I/O dispatching during
        the completion of such clones.
      
        For suspend, the workqueue of kdmflush needs to be flushed after
        the request_queue has been stopped.  Otherwise, the next flush work
        can be kicked even after the suspend completes.
      
      TARGET INTERFACE
      
      No new interface is added.
      Targets just use the existing num_flush_requests in struct target_type,
      in the same way as bio-based dm.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • K
      dm: move dm_end_request · 980691e5
      Committed by Kiyoshi Ueda
      This patch moves dm_end_request() to make the next patch more readable.
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • K
      dm: refactor request based completion functions · 11a68244
      Committed by Kiyoshi Ueda
      This patch factors out the clone completion code, dm_done(),
      from dm_softirq_done() in preparation for a subsequent patch.
      No functional change.
      
      dm_done() will be used in barrier completion, which can't use and
      doesn't need softirq.  The softirq_done callback needs to get a clone
      from an original request, but it can't do so in the barrier case, where
      an original request is shared by multiple clones.  On the other hand,
      the completion of barrier clones doesn't involve re-submitting requests,
      which was the primary reason softirq was needed.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • K
      dm: use md pending for in flight IO counting · b4324fee
      Committed by Kiyoshi Ueda
      This patch changes the counter for the number of in_flight I/Os
      to md->pending from q->in_flight in preparation for a later patch.
      No functional change.
      
      Request-based dm used q->in_flight to count the number of in-flight
      clones, assuming that the counter is always incremented for an in-flight
      original request and that original:clone is a 1:1 relationship.
      However, this is no longer true for barrier requests.
      So use md->pending to count the number of in-flight clones.
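
      Why a per-clone counter is needed can be shown with a minimal
      user-space sketch using C11 atomics.  The struct and helper names are
      hypothetical; the real md->pending accounting lives in the dm core
      and is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-device in-flight clone counter. */
struct mapped_dev { atomic_int pending; };

/* Called once for every clone that is dispatched. */
static void clone_start(struct mapped_dev *md)
{
	atomic_fetch_add(&md->pending, 1);
}

/* Called once per completed clone; returns 1 when the device is idle. */
static int clone_end(struct mapped_dev *md)
{
	int left = atomic_fetch_sub(&md->pending, 1) - 1;

	assert(left >= 0);	/* more completions than dispatches is a bug */
	return left == 0;
}

static int in_flight(struct mapped_dev *md)
{
	return atomic_load(&md->pending);
}
```

      A single barrier original that fans out into three clones counts as
      three in-flight items here, whereas a per-original counter would
      report only one.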
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • K
      dm: simplify request based suspend · 9f518b27
      Committed by Kiyoshi Ueda
      The semantics of bio-based dm were changed recently in the case of
      suspend with "--nolockfs" but without "--noflush".
      Before 2.6.30, I/Os submitted before the suspend invocation were always
      flushed.  From 2.6.30 onwards, I/Os submitted before the suspend
      invocation might not be flushed.  (For details, see
      http://marc.info/?t=123994433400003&r=1&w=2)
      
      This patch brings the behaviour of request-based dm into line with
      bio-based dm, simplifying the code and preparing for a subsequent patch
      that will wait for all in_flight I/Os to complete without stopping
      the request_queue, using dm_wait_for_completion().
      
      This change in semantics simplifies the suspend code as follows:
        o Suspend is implemented as stopping request_queue
          in request-based dm, and all I/Os are queued in the request_queue
          even after suspend is invoked.
        o In the old semantics, we had to track whether I/Os were
          queued before or after the suspend invocation, so a special
          barrier-like request called 'suspend marker' was introduced.
        o With the new semantics, we don't need to flush any I/O
          so we can remove the marker and the code related to the marker
          handling and I/O flushing.
      
      After removing this code, the suspend sequence is now:
        1. Flush all I/Os by lock_fs() if needed.
        2. Stop dispatching any I/O by stopping the request_queue.
        3. Wait for all in-flight I/Os to be completed or requeued.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>