1. 08 Aug 2010, 4 commits
  2. 06 Mar 2010, 3 commits
    • P
      dm ioctl: introduce flag indicating uevent was generated · 3abf85b5
      Authored by Peter Rajnoha
      Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to
      indicate that a uevent was actually generated.  This tells the userspace
      caller that it may need to wait for the event to be processed.
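
      For illustration, a userspace caller might test the new flag roughly
      like this (a hedged sketch; DM_UEVENT_GENERATED_FLAG and the flags
      field of struct dm_ioctl come from <linux/dm-ioctl.h>, while the
      helper itself is hypothetical):

        #include <linux/dm-ioctl.h>

        /* after the ioctl returns, decide whether a uevent will follow
         * and therefore whether the caller should wait for it */
        static int uevent_generated(const struct dm_ioctl *dmi)
        {
                return !!(dmi->flags & DM_UEVENT_GENERATED_FLAG);
        }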
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      3abf85b5
    • M
      dm: free dm_io before bio_endio not after · a97f925a
      Authored by Mikulas Patocka
      Free the dm_io structure before calling bio_endio() instead of after it,
      to ensure that the io_pool containing it is not referenced after it is
      freed.
      
      This partially fixes a problem described here
        https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html
      
      thread 1:
      bio_endio(bio, io_error);
      /* scheduling happens */
      					thread 2:
      					close the device
      					remove the device
      thread 1:
      free_io(md, io);
      
      Thread 2, when removing the device, sees a non-empty md->io_pool (because the
      io hasn't been freed by thread 1 yet) and may crash with a BUG in mempool_free.
      Thread 1 may also crash when freeing into a nonexistent mempool.
      
      To fix this we must make sure that bio_endio() is the last call and
      the md structure is not accessed afterwards.
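
      The fixed ordering, as a sketch (modeled on dec_pending() in dm.c;
      details elided):

        /* save what we need from the dm_io before freeing it */
        io_error = io->error;
        bio = io->bio;

        free_io(md, io);                /* return the dm_io to md->io_pool first... */
        bio_endio(bio, io_error);       /* ...because md may be gone after this */
        /* neither io nor md may be touched beyond this point */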
      
      There is another bio_endio in process_barrier, but it is called from the thread
      and the thread is destroyed prior to freeing the mempools, so this call is
      not affected by the bug.
      
      A similar bug exists with module unloads - the module may be unloaded
      immediately after bio_endio - but that is more difficult to fix.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a97f925a
    • K
      dm table: remove dm_get from dm_table_get_md · ecdb2e25
      Authored by Kiyoshi Ueda
      Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
      be called from presuspend/postsuspend, which are called while
      mapped_device is in DMF_FREEING state, where dm_get() is not allowed.
      
      The justification is the lifetime of the two objects: in the current
      dm design/implementation, a mapped_device is never freed while
      targets are doing something, because dm core waits in dm_put(),
      using presuspend/postsuspend, for targets to become quiet.  So
      targets should be able to touch a mapped_device without holding a
      reference count on it, and we should allow targets to touch a
      mapped_device even if it is in the DMF_FREEING state.
      
      Background:
      I'm trying to remove the multipath internal queue, since dm core now has
      a generic queue for request-based dm.  In the patch-set, the multipath
      target wants dm core to start/stop the queue.  One such start/stop
      request can happen during postsuspend() while the target waits for
      pg-init to complete, because the target stops the queue when starting
      pg-init and tries to restart it when completing pg-init.  Since the
      queue belongs to the mapped_device, this involves calling
      dm_table_get_md() and dm_put().  On the other hand, postsuspend() is
      called from dm_put() for a mapped_device in the DMF_FREEING state, and
      that triggers the BUG_ON(DMF_FREEING) in the second dm_put().
      
      I had tried to solve this problem by changing only multipath so that it
      would not touch a mapped_device in the DMF_FREEING state, but I
      couldn't, which raised the question of why we need dm_get() in
      dm_table_get_md() at all.
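
      With the dm_get() gone, the function reduces to a plain accessor, as
      sketched below (field names as in dm-table.c):

        struct mapped_device *dm_table_get_md(struct dm_table *t)
        {
                /* no dm_get(): callers no longer acquire a reference here,
                 * so targets may call this even in DMF_FREEING state */
                return t->md;
        }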
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      ecdb2e25
  3. 17 Feb 2010, 1 commit
    • K
      dm mpath: fix stall when requeueing io · 9eef87da
      Authored by Kiyoshi Ueda
      This patch fixes a problem where the system may stall if a target's
      ->map_rq returns DM_MAPIO_REQUEUE in map_request().
      E.g. the stall happens on a 1-CPU box when a dm-mpath device with
           queue_if_no_path bounces between all-paths-down and paths-up under
           I/O load.
      
      When the target's ->map_rq returns DM_MAPIO_REQUEUE, map_request()
      requeues the request and returns to dm_request_fn().  Then,
      dm_request_fn() doesn't exit the I/O dispatching loop and continues
      processing the requeued request again.
      This map-and-requeue loop can run with interrupts disabled,
      so a 1-CPU system can stall if this situation happens.
      
      For example, the commands below can stall my 1-CPU box within a minute or so:
        # dmsetup table mp
        mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
        # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
        # while true; do \
        >   dmsetup message mp 0 "fail_path 8:144"; \
        >   dmsetup suspend --noflush mp; \
        >   dmsetup resume mp; \
        >   dmsetup message mp 0 "reinstate_path 8:144"; \
        > done
      
      To fix the problem above, this patch changes dm_request_fn() to exit
      the I/O dispatching loop once if a request is requeued in map_request().
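
      As a sketch of the control-flow change (the patch makes map_request()
      report the requeue to its caller; the label name here is illustrative):

        /* inside the dispatching loop of dm_request_fn() */
        requeued = map_request(ti, clone, md);
        if (requeued)
                goto delay_and_out;     /* leave the loop once; the queue
                                         * is run again later instead of
                                         * spinning with irqs disabled */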
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      9eef87da
  4. 11 Dec 2009, 18 commits
    • K
      dm: export suspended state to targets · 64dbce58
      Authored by Kiyoshi Ueda
      This patch adds the exported dm_suspended() function so that targets
      can check whether or not they are suspended.
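
      A sketch of what the exported helper looks like (modeled on the dm
      core code of this era, where dm_table_get_md() still takes a
      reference that must be dropped):

        int dm_suspended(struct dm_target *ti)
        {
                struct mapped_device *md = dm_table_get_md(ti->table);
                int r = dm_suspended_md(md);

                dm_put(md);
                return r;
        }
        EXPORT_SYMBOL_GPL(dm_suspended);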
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      64dbce58
    • K
      dm: rename dm_suspended to dm_suspended_md · 4f186f8b
      Authored by Kiyoshi Ueda
      This patch renames dm_suspended() to dm_suspended_md() and
      keeps it internal to dm.
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4f186f8b
    • K
      dm: swap target postsuspend call and setting suspended flag · 4d4471cb
      Authored by Kiyoshi Ueda
      This patch moves the setting of the DMF_SUSPENDED flag to before the
      postsuspend call.  Nothing should care about the ordering, because
      setting the flag and calling postsuspend are protected by a single
      lock, md->suspend_lock, and all strict flag-checkers take that lock.
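
      The reordering inside dm_suspend(), sketched (both steps run under
      md->suspend_lock):

        set_bit(DMF_SUSPENDED, &md->flags);     /* flag is now set before... */
        dm_table_postsuspend_targets(map);      /* ...the postsuspend calls */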
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4d4471cb
    • J
      dm: trace request based remapping · 6db4ccd6
      Authored by Jun'ichi Nomura
      This patch adds a remapping trace to request-based dm.
      BIO-based dm already has the equivalent tracepoint.
      
      For example, under this dm stack (linear LV on multipath):
        # dmsetup ls --tree -o ascii
        vg-lv0 (253:1)
         `-mpath0 (253:0)
            |- (8:160)
            |- (66:80)
            |- (65:176)
            `- (65:160)
      
      Trace of 'dd of=/dev/vg/lv0 bs=128k count=1 oflag=direct' looks like this:
      
      without the patch:
        dd-6674  [000]   539.727384: block_bio_queue: 253,1 WS 0 + 256 [dd]
        dd-6674  [000]   539.727392: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
        dd-6674  [000]   539.727394: block_bio_queue: 253,0 WS 384 + 256 [dd]
        dd-6674  [000]   539.727405: block_getrq: 253,0 WS 384 + 256 [dd]
        dd-6674  [000]   539.727409: block_plug: [dd]
        dd-6674  [000]   539.727410: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727416: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727426: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
        dd-6674  [000]   539.727427: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
        ...
      
      and with the patch: (the line with '**' is the trace added by this patch)
        dd-6617  [002]   162.914301: block_bio_queue: 253,1 WS 0 + 256 [dd]
        dd-6617  [002]   162.914314: block_remap: 253,0 WS 384 + 256 <- (253,1) 0
        dd-6617  [002]   162.914316: block_bio_queue: 253,0 WS 384 + 256 [dd]
        dd-6617  [002]   162.914331: block_getrq: 253,0 WS 384 + 256 [dd]
        dd-6617  [002]   162.914335: block_plug: [dd]
        dd-6617  [002]   162.914337: block_rq_insert: 253,0 W 0 () 384 + 256 [dd]
        dd-6617  [002]   162.914347: block_rq_issue: 253,0 W 0 () 384 + 256 [dd]
      **dd-6617  [002]   162.914356: block_rq_remap: 65,176 W 384 + 256 <- (253,0) 384
        dd-6617  [002]   162.914358: block_rq_insert: 65,176 W 0 () 384 + 256 [dd]
        dd-6617  [002]   162.914359: block_rq_issue: 65,176 W 0 () 384 + 256 [dd]
        ...
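
      The new tracepoint is emitted when the clone is dispatched to the
      underlying device, along these lines (a sketch of the call in
      map_request()):

        trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
                             blk_rq_pos(rq));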
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      6db4ccd6
    • A
      dm: keep old table until after resume succeeded · 042d2a9b
      Authored by Alasdair G Kergon
      When swapping a new table into place, retain the old table until
      its replacement is in place.
      
      An old check for an empty table is removed because this is enforced
      in populate_table().
      
      __unbind() becomes redundant when followed by __bind().
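
      A sketch of the resulting flow in the caller (modeled on do_resume();
      helper names follow dm core):

        old_map = dm_swap_table(md, new_map);   /* __bind() hands back the
                                                 * table it displaced */
        if (IS_ERR(old_map))
                goto out;

        r = dm_resume(md);

        if (old_map)
                dm_table_destroy(old_map);      /* destroyed only after the
                                                 * resume step, not before */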
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      042d2a9b
    • A
      dm: bind new table before destroying old · a7940155
      Authored by Alasdair G Kergon
      When replacing a mapped device's table during a 'resume', delay the
      destruction of the old table until the new one is successfully in place.
      
      This will make it easier for a later patch to transfer internal state
      information from the old table to the new one (something we do not currently
      support) while giving us more options for reversion if a later part
      of the operation fails.
      
      Devices are always in the suspended state during dm_swap_table().
      This patch reinforces the requirement that all I/O must have been
      flushed from the table targets while in this state (including any in
      workqueues).  In the case of 'noflush' suspending, unprocessed
      I/O should have been 'pushed back' to the dm core prior to this point,
      for resubmission after the new table is in place.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a7940155
    • M
      dm: add dm_deleting_md function · 432a212c
      Authored by Mike Anderson
      Add dm_deleting_md to check whether or not a given mapped
      device is currently being deleted.
      Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      432a212c
    • A
      dm: rename dm_get_table to dm_get_live_table · 7c666411
      Authored by Alasdair G Kergon
      Rename dm_get_table to dm_get_live_table.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      7c666411
    • K
      dm: add request based barrier support · d0bcb878
      Authored by Kiyoshi Ueda
      This patch adds barrier support for request-based dm.
      
      CORE DESIGN
      
      The design is basically the same as bio-based dm, which emulates a
      barrier by mapping empty barrier bios before/after the barrier I/O.
      But request-based dm already uses struct request_queue for I/O
      queueing, so the block layer's barrier mechanism can be used directly.
      
      o Summary of the block layer's behaviour (on which dm core depends)
        Request-based dm uses the QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
        I/O barriers.  This means that when an I/O requiring a barrier is
        found in the request_queue, the block layer issues a pre-flush
        request just before and a post-flush request just after that I/O.

        After the ordered sequence starts, the block layer waits for all
        in-flight I/Os to complete, then hands drivers the pre-flush request,
        the barrier I/O and the post-flush request one by one.
        In other words, the request_queue is stopped automatically by
        the block layer until drivers complete each step of the sequence.
      
      o dm core
        The barrier I/O itself is treated as a normal I/O, so no additional
        code is needed for it.

        For the pre/post-flush requests, caches are flushed as follows:
          1. Create the number of empty barrier requests required by the
             target's num_flush_requests, and map them (dm_rq_barrier()).
          2. Wait for the mapped barriers to complete (dm_rq_barrier()).
             If an error occurs, save the error value to md->barrier_error
             (dm_end_request()).
             (*) Basically, the first reported error is taken, but
                 -EOPNOTSUPP supersedes any other error, followed by
                 DM_ENDIO_REQUEUE (see the sketch after this section).
          3. Requeue the pre/post-flush request if the error value is
             DM_ENDIO_REQUEUE.  Otherwise, complete it with the error value
             (dm_rq_barrier_work()).
        The pre/post-flush work above is done in kernel thread (kdmflush)
        context, since memory allocation that might sleep is needed in
        dm_rq_barrier(), but sleeping is not allowed in dm_request_fn(),
        which is an irq-disabled context.
        Also, clones of a pre/post-flush request share one original, so
        such clones can't be completed using the softirq context.
        Instead, they are completed in the context of the underlying device
        drivers.  This should be safe since there is no I/O dispatching
        during the completion of such clones.
      
        For suspend, the workqueue of kdmflush needs to be flushed after
        the request_queue has been stopped.  Otherwise, the next flush work
        can be kicked even after the suspend completes.
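
      The error-precedence rule from step 2 above, sketched (modeled on the
      store_barrier_error() helper this patch introduces):

        static void store_barrier_error(struct mapped_device *md, int error)
        {
                unsigned long flags;

                spin_lock_irqsave(&md->barrier_error_lock, flags);
                /*
                 * The first error is taken, but:
                 *   -EOPNOTSUPP supersedes any I/O error.
                 *   DM_ENDIO_REQUEUE supersedes any I/O error except
                 *   -EOPNOTSUPP.
                 */
                if (!md->barrier_error || error == -EOPNOTSUPP ||
                    (md->barrier_error != -EOPNOTSUPP &&
                     error == DM_ENDIO_REQUEUE))
                        md->barrier_error = error;
                spin_unlock_irqrestore(&md->barrier_error_lock, flags);
        }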
      
      TARGET INTERFACE
      
      No new interface is added.
      Just use the existing num_flush_requests in struct target_type,
      the same as bio-based dm.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      d0bcb878
    • K
      dm: move dm_end_request · 980691e5
      Authored by Kiyoshi Ueda
      This patch moves dm_end_request() to make the next patch more readable.
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      980691e5
    • K
      dm: refactor request based completion functions · 11a68244
      Authored by Kiyoshi Ueda
      This patch factors out the clone completion code, dm_done(),
      from dm_softirq_done() in preparation for a subsequent patch.
      No functional change.
      
      dm_done() will be used in barrier completion, which can't use and
      doesn't need softirq.  The softirq_done callback needs to get a clone
      from an original request, but it can't in the barrier case, where
      an original request is shared by multiple clones.  On the other hand,
      the completion of barrier clones doesn't involve re-submitting requests,
      which was the primary reason softirq was needed.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      11a68244
    • K
      dm: use md pending for in flight IO counting · b4324fee
      Authored by Kiyoshi Ueda
      This patch changes the counter for the number of in-flight I/Os
      from q->in_flight to md->pending, in preparation for a later patch.
      No functional change.
      
      Request-based dm used q->in_flight to count the number of in-flight
      clones, assuming the counter is always incremented for an in-flight
      original request and that original:clone is a 1:1 relationship.
      However, this is no longer true for barrier requests.
      So use md->pending to count the number of in-flight clones.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b4324fee
    • K
      dm: simplify request based suspend · 9f518b27
      Authored by Kiyoshi Ueda
      The semantics of bio-based dm were changed recently in the case of
      suspend with "--nolockfs" but without "--noflush".
      Before 2.6.30, I/Os submitted before the suspend invocation were always
      flushed.  From 2.6.30 onwards, I/Os submitted before the suspend
      invocation might not be flushed.  (For details, see
      http://marc.info/?t=123994433400003&r=1&w=2)
      
      This patch brings the behaviour of request-based dm into line with
      bio-based dm, simplifying the code and preparing for a subsequent patch
      that will wait for all in-flight I/Os to complete without stopping the
      request_queue, using dm_wait_for_completion() for it.
      
      This change in semantics simplifies the suspend code as follows:
        o Suspend is implemented as stopping the request_queue
          in request-based dm, and all I/Os are queued in the request_queue
          even after suspend is invoked.
        o In the old semantics, we had to track whether I/Os were
          queued before or after the suspend invocation, so a special
          barrier-like request called a 'suspend marker' was introduced.
        o With the new semantics, we don't need to flush any I/O,
          so we can remove the marker and the code related to marker
          handling and I/O flushing.
      
      After removing this code, the suspend sequence (sketched below) is now:
        1. Flush all I/Os by lock_fs() if needed.
        2. Stop dispatching any I/O by stopping the request_queue.
        3. Wait for all in-flight I/Os to be completed or requeued.
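
      A sketch of the resulting request-based suspend path (helper names
      follow dm.c; error handling elided):

        r = lock_fs(md);                        /* 1. flush I/Os if needed  */
        stop_queue(md->queue);                  /* 2. stop dispatching      */
        r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE);
                                                /* 3. drain in-flight I/Os  */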
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      9f518b27
    • K
      dm: abstract clone_rq · 6facdaff
      Authored by Kiyoshi Ueda
      This patch factors out the request cloning code in dm_prep_fn()
      as clone_rq().  No functional change.
      
      This patch is a preparation for a later patch in this series which needs to
      make clones from an original barrier request.
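
      A sketch of the factored-out helper (modeled on the resulting dm.c
      code; initialization details elided):

        static struct request *clone_rq(struct request *rq,
                                        struct mapped_device *md,
                                        gfp_t gfp_mask)
        {
                struct request *clone;
                struct dm_rq_target_io *tio;

                tio = alloc_rq_tio(md, gfp_mask);
                if (!tio)
                        return NULL;

                tio->md = md;
                tio->orig = rq;

                clone = &tio->clone;
                if (setup_clone(clone, rq, tio)) {      /* clone rq + bios */
                        free_rq_tio(tio);
                        return NULL;
                }

                return clone;
        }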
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      6facdaff
    • K
      dm: pass gfp_mask to alloc_rq_tio · 08885643
      Authored by Kiyoshi Ueda
      This patch adds the gfp_mask argument to alloc_rq_tio().
      No functional change.
      
      This patch is a preparation for a later patch in this series which needs
      to allocate a tio (for barrier I/O) with a different allocation flag
      (GFP_NOIO) from the one used in the normal I/O code path.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      08885643
    • K
      dm: use clone in map_request function · 598de409
      Authored by Kiyoshi Ueda
      This patch changes the argument of map_request() from the original
      request to the clone request.  No functional change.
      
      This patch is a preparation for PATCH 9, which needs to use
      map_request() for clones sharing an original barrier request.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      598de409
    • K
      dm: abstract dm_in_flight function · 90abb8c4
      Authored by Kiyoshi Ueda
      This patch adds md_in_flight() to get the number of in_flight I/Os.
      No functional change.
      
      This patch is a preparation for a later patch in this series, which
      changes the I/O counter from q->in_flight to md->pending in
      request-based dm.
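
      A sketch of the helper (modeled on the resulting dm.c code):

        static int md_in_flight(struct mapped_device *md)
        {
                return atomic_read(&md->pending[READ]) +
                       atomic_read(&md->pending[WRITE]);
        }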
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      90abb8c4
    • M
      dm io: use slab for struct io · 952b3557
      Authored by Mikulas Patocka
      Allocate "struct io" from a slab.
      
      This patch changes dm-io so that "struct io" is allocated from a slab
      cache rather than with kmalloc.  Allocating from a slab will be needed
      for the next patch, which requires a special alignment of "struct io"
      that kmalloc cannot guarantee.
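
      A sketch of the change (modeled on dm-io.c; the alignment argument is
      still 0 here and is tightened by the next patch):

        static struct kmem_cache *_dm_io_cache;

        _dm_io_cache = KMEM_CACHE(io, 0);       /* was: plain kmalloc */
        if (!_dm_io_cache)
                return -ENOMEM;

        /* the io mempool now draws from the dedicated cache */
        client->pool = mempool_create_slab_pool(MIN_IOS, _dm_io_cache);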
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      952b3557
  5. 17 Oct 2009, 2 commits
  6. 07 Oct 2009, 1 commit
    • N
      block: Seperate read and write statistics of in_flight requests v2 · 316d315b
      Authored by Nikanth Karthikesan
      Commit a9327cac added separate read
      and write statistics of in_flight requests, and exported the numbers
      of read and write requests in progress separately through sysfs.
      
      But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
      output from "iostat -kx 2".  Global values for service time and
      utilization were garbage.  For interval values, utilization was always
      100%, and service time was higher than normal.
      
      So this was reverted by commit 0f78ab98.
      
      The problem was in part_round_stats_single(); I had missed the following:
              if (now == part->stamp)
                      return;
      
      -       if (part->in_flight) {
      +       if (part_in_flight(part)) {
                      __part_stat_add(cpu, part, time_in_queue,
                                      part_in_flight(part) * (now - part->stamp));
                      __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
      
      With this chunk included, the reported regression is fixed.
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      
      --
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      316d315b
  7. 05 Oct 2009, 1 commit
    • J
      Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98
      Authored by Jens Axboe
      This reverts commit a9327cac.
      
      Corrado Zoccolo <czoccolo@gmail.com> reports:
      
      "with 2.6.32-rc1 I started getting the following strange output from
      "iostat -kx 2":
      Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10,70    0,00    3,16   15,75    0,00   70,38
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda              18,22     0,00    0,67    0,01    14,77     0,02    43,94     0,01   10,53 39043915,03 2629219,87
      sdb              60,89     9,68   50,79    3,04  1724,43    50,52    65,95     0,70   13,06  488437,47 2629219,87
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 2,72    0,00    0,74    0,00    0,00   96,53
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6,68    0,00    0,99    0,00    0,00   92,33
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 4,40    0,00    0,73    1,47    0,00   93,40
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     4,00    0,00    3,00     0,00    28,00    18,67     0,06   19,50 333,33 100,00
      
      Global values for service time and utilization are garbage. For
      interval values, utilization is always 100%, and service time is
      higher than normal.
      
      I bisected it down to:
      [a9327cac] Seperate read and write
      statistics of in_flight requests
      and verified that reverting just that commit indeed solves the issue
      on 2.6.32-rc1."
      
      So until this is debugged, revert the bad commit.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      0f78ab98
  8. 22 Sep 2009, 1 commit
  9. 14 Sep 2009, 1 commit
  10. 11 Sep 2009, 1 commit
  11. 05 Sep 2009, 1 commit
    • K
      dm multipath: fix oops when request based io fails when no paths · a77e28c7
      Authored by Kiyoshi Ueda
      The patch posted at http://marc.info/?l=dm-devel&m=124539787228784&w=2,
      which was merged as cec47e3d ("dm:
      prepare for request based option"), introduced a regression in
      request-based dm.
      
      If map_request() calls dm_kill_unmapped_request() to complete a clone
      without dispatching it, clone->bio is still set when
      dm_end_request() is called, so the BUG_ON(clone->bio) there fires
      incorrectly.
      
      The patch fixes this bug by freeing the bios in dm_end_request() if the
      clone still has them.  I've redone my tests to cover all I/O paths and
      confirmed there's no other regression.
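
      A sketch of the fix (modeled on the free_rq_clone() helper this patch
      introduces; blk_rq_unprep_clone() releases the clone's bios instead of
      tripping a BUG_ON):

        static void free_rq_clone(struct request *clone)
        {
                struct dm_rq_target_io *tio = clone->end_io_data;

                blk_rq_unprep_clone(clone);     /* frees any clone bios */
                free_rq_tio(tio);
        }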
      
      Here is the oops I hit in request-based dm when I do I/O to a multipath
      device which has no active paths and no queue_if_no_path setting:
      
      ------------[ cut here ]------------
      kernel BUG at /root/2.6.31-rc4.rqdm/drivers/md/dm.c:828!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
      CPU 1
      Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac sg sr_mod e1000e button cdrom serio_raw rtc_cmos rtc_core rtc_lib piix lpfc scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
      Pid: 7, comm: ksoftirqd/1 Not tainted 2.6.31-rc4.rqdm #1 Express5800/120Lj [N8100-1417]
      RIP: 0010:[<ffffffffa023629d>]  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
      RSP: 0018:ffff8800280a1f08  EFLAGS: 00010282
      RAX: ffffffffa02544e0 RBX: ffff8802aa1111d0 RCX: ffff8802aa1111e0
      RDX: ffff8802ab913e70 RSI: 0000000000000000 RDI: ffff8802ab913e70
      RBP: ffff8800280a1f28 R08: ffffc90005457040 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffb
      R13: ffff8802ab913e88 R14: ffff8802ab9c1438 R15: 0000000000000100
      FS:  0000000000000000(0000) GS:ffff88002809e000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000003d54a98640 CR3: 000000029f0a1000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ksoftirqd/1 (pid: 7, threadinfo ffff8802ae50e000, task ffff8802ae4f8040)
      Stack:
       ffff8800280a1f38 0000000000000020 ffffffff814f30a0 0000000000000004
      <0> ffff8800280a1f58 ffffffff8116b245 ffff8800280a1f38 ffff8800280a1f38
      <0> ffff8800280a1f58 0000000000000001 ffff8800280a1fa8 ffffffff810477bc
      Call Trace:
       <IRQ>
       [<ffffffff8116b245>] blk_done_softirq+0x75/0x90
       [<ffffffff810477bc>] __do_softirq+0xcc/0x210
       [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
       [<ffffffff8100ce7c>] call_softirq+0x1c/0x50
       <EOI>
       [<ffffffff8100e785>] do_softirq+0x65/0xa0
       [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
       [<ffffffff810471e0>] ksoftirqd+0x70/0x110
       [<ffffffff81059559>] kthread+0x99/0xb0
       [<ffffffff8100cd7a>] child_rip+0xa/0x20
       [<ffffffff8100c73c>] ? restore_args+0x0/0x30
       [<ffffffff810594c0>] ? kthread+0x0/0xb0
       [<ffffffff8100cd70>] ? child_rip+0x0/0x20
      Code: 44 89 e6 48 89 df e8 23 fb f2 e0 be 01 00 00 00 4c 89 f7 e8 f6 fd ff ff 5b 41 5c 41 5d 41 5e c9 c3 4c 89 ef e8 85 fe ff ff eb ed <0f> 0b eb fe 41 8b 85 dc 00 00 00 48 83 bb 10 01 00 00 00 89 83
      RIP  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
       RSP <ffff8800280a1f08>
      ---[ end trace 16af0a1d8542da55 ]---
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a77e28c7
  12. 24 Jul 2009, 1 commit
  13. 01 Jul 2009, 1 commit
  14. 22 Jun 2009, 4 commits
    • K
      dm: disable interrupt when taking map_lock · 523d9297
      Authored by Kiyoshi Ueda
      This patch disables interrupts when taking map_lock, to avoid
      lockdep warnings in request-based dm.
      
      Request-based dm takes map_lock after taking queue_lock with
      interrupts disabled:
        spin_lock_irqsave(queue_lock)
        q->request_fn() == dm_request_fn()
          => dm_get_table()
               => read_lock(map_lock)
      while queue_lock could be (but isn't) taken in interrupt context.
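
      The change, sketched for the read side (dm_get_table(); the write side
      in __bind()/__unbind() is analogous):

        struct dm_table *dm_get_table(struct mapped_device *md)
        {
                struct dm_table *t;
                unsigned long flags;

                read_lock_irqsave(&md->map_lock, flags);        /* was read_lock() */
                t = md->map;
                if (t)
                        dm_table_get(t);
                read_unlock_irqrestore(&md->map_lock, flags);

                return t;
        }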
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      523d9297
    • K
      dm: do not set QUEUE_ORDERED_DRAIN if request based · 5d67aa23
      Authored by Kiyoshi Ueda
      Request-based dm doesn't have barrier support yet,
      so we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
      Since the device type is decided at the first table loading time,
      setting the flag is deferred until then (see the sketch below).
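
      A minimal sketch of the deferred flag setting, assuming the
      dm_table_request_based() helper from this patch series:

        /* run at first table load, once the device type is known */
        if (!dm_table_request_based(t))
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);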
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      5d67aa23
    • K
      dm: enable request based option · e6ee8c0b
      Authored by Kiyoshi Ueda
      This patch enables request-based dm.
      
      o Request-based dm and bio-based dm coexist, since some target
        drivers are better suited to bio-based dm.
        Also, there are other bio-based devices in the kernel
        (e.g. md, loop).
        Since a bio-based device can't receive a struct request,
        there are some limitations on device stacking between
        bio-based and request-based:
      
                           type of underlying device
                         bio-based      request-based
         ----------------------------------------------
          bio-based         OK                OK
          request-based     --                OK
      
        The device type is recognized by the queue flag in the kernel,
        so dm follows that.
      
      o The type of a dm device is decided at the first table binding time.
        Once the type of a dm device is decided, the type can't be changed.
      
      o Mempool allocations are deferred until table loading time, since
        mempools for request-based dm are different from those for bio-based
        dm and the required mempool type is fixed by the type of the table.
      
      o Currently, request-based dm supports only tables that have a single
        target.  To support multiple targets, we would need to support
        request splitting or prevent a bio/request from spanning multiple
        targets.  The former needs lots of changes in the block layer, and
        the latter needs every target driver to support a merge() function.
        Both will take time.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      e6ee8c0b
    • K
      dm: prepare for request based option · cec47e3d
      Authored by Kiyoshi Ueda
      This patch adds core functions for request-based dm.
      
      When a struct mapped_device (md) is initialized, md->queue has
      an I/O scheduler and the following functions are used for
      request-based dm as the queue functions:
          make_request_fn: dm_make_request()
          prep_rq_fn:      dm_prep_fn()
          request_fn:      dm_request_fn()
          softirq_done_fn: dm_softirq_done()
          lld_busy_fn:     dm_lld_busy()
      Actual initializations are done in another patch (PATCH 2).
      
      Below is a brief summary of how request-based dm behaves, including:
        - making request from bio
        - cloning, mapping and dispatching request
        - completing request and bio
        - suspending md
        - resuming md
      
        bio to request
        ==============
        md->queue->make_request_fn() (dm_make_request()) calls __make_request()
        for a bio submitted to the md.
        Then, the bio is kept in the queue as a new request or merged into
        another request in the queue if possible.
      
        Cloning and Mapping
        ===================
        Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
        when requests are dispatched after they are sorted by the I/O scheduler.
      
        dm_request_fn() checks the busy state of underlying devices using
        the target's busy() function and, if busy, stops dispatching
        requests to keep them on the dm device's queue.
        This helps I/O merging, since no merging is done for a request
        once it has been dispatched to underlying devices.
      
        Actual cloning and mapping are done in dm_prep_fn() and map_request()
        called from dm_request_fn().
        dm_prep_fn() clones not only the request but also the bios of the
        request, so that dm can hold back bio completion in error cases and
        prevent the bio submitter from noticing the error.
        (See the "Completion" section below for details.)
      
        After the cloning, the clone is mapped by the target's map_rq()
        function and inserted into the underlying device's queue using
        blk_insert_cloned_request().
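
        A sketch of the dispatch loop described above (condensed from the
        dm.c code this patch adds; locking details elided and the
        plug/unplug handling simplified to plain breaks):

          static void dm_request_fn(struct request_queue *q)
          {
                  struct mapped_device *md = q->queuedata;
                  struct dm_table *map = dm_get_table(md);
                  struct dm_target *ti;
                  struct request *rq;

                  while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
                          rq = blk_peek_request(q);
                          if (!rq)
                                  break;

                          ti = dm_table_find_target(map, blk_rq_pos(rq));
                          if (ti->type->busy && ti->type->busy(ti))
                                  break;          /* keep rq queued here for
                                                   * better merging */

                          blk_start_request(rq);  /* dequeue from q */
                          map_request(ti, rq, md);/* clone, map, dispatch */
                  }

                  dm_table_put(map);
          }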
      
        Completion
        ==========
        Request completion can be hooked by rq->end_io(), but then all bios
        in the request will have been completed even in error cases, and the
        bio submitter will have noticed the error.
        To prevent the bio completion in error cases, request-based dm clones
        both bio and request and hooks both bio->bi_end_io() and rq->end_io():
            bio->bi_end_io(): end_clone_bio()
            rq->end_io():     end_clone_request()
      
        Summary of the request completion flow is below:
        blk_end_request() for a clone request
          => blk_update_request()
             => bio->bi_end_io() == end_clone_bio() for each clone bio
                => Free the clone bio
                => Success: Complete the original bio (blk_update_request())
                   Error:   Don't complete the original bio
          => blk_finish_request()
             => rq->end_io() == end_clone_request()
                => blk_complete_request()
                   => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request
      
        end_clone_bio() completes the original request by the size of
        the original bio in successful cases.
        Even if all bios in the original request are completed by that
        completion, the original request must not be completed yet, to keep
        the ordering of request completion for the stacking.
        So end_clone_bio() uses blk_update_request() instead of
        blk_end_request().
        In error cases, end_clone_bio() doesn't complete the original bio.
        It just frees the clone bio and hands the error handling over to
        end_clone_request().
      
        end_clone_request(), which is called with the queue lock held,
        completes the clone request and the original request in a softirq
        context (dm_softirq_done()), which holds no queue lock, to avoid a
        deadlock on submission of another request during the completion:
            - The submitted request may be mapped to the same device
            - Request submission requires the queue lock, but it is already
              held by the completion path itself, and the submitter has no
              way of knowing that
      
        The clone request has no clone bios left when dm_softirq_done() is
        called, so target drivers can't resubmit it, even in error cases.
        Instead, they can ask dm core to requeue and remap the original
        request in such cases.
      
        suspend
        =======
        Request-based dm implements suspend of the md by stopping md->queue.
        For a noflush suspend, it just stops md->queue.
      
        For a flush suspend, it inserts a marker request at the tail of
        md->queue, and dispatches all requests in md->queue until the marker
        comes to the front of md->queue.  Then it stops dispatching requests
        and waits for all the dispatched requests to complete.
        After that, it completes the marker request, stops md->queue and
        wakes up the waiter on the suspend queue, md->wait.
      
        resume
        ======
        Starts md->queue.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      cec47e3d