1. 22 6月, 2009 15 次提交
    • K
      dm: prepare for request based option · cec47e3d
      Kiyoshi Ueda 提交于
      This patch adds core functions for request-based dm.
      
      When struct mapped device (md) is initialized, md->queue has
      an I/O scheduler and the following functions are used for
      request-based dm as the queue functions:
          make_request_fn: dm_make_request()
          pref_fn:         dm_prep_fn()
          request_fn:      dm_request_fn()
          softirq_done_fn: dm_softirq_done()
          lld_busy_fn:     dm_lld_busy()
      Actual initializations are done in another patch (PATCH 2).
      
      Below is a brief summary of how request-based dm behaves, including:
        - making request from bio
        - cloning, mapping and dispatching request
        - completing request and bio
        - suspending md
        - resuming md
      
        bio to request
        ==============
        md->queue->make_request_fn() (dm_make_request()) calls __make_request()
        for a bio submitted to the md.
        Then, the bio is kept in the queue as a new request or merged into
        another request in the queue if possible.
      
        Cloning and Mapping
        ===================
        Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
        when requests are dispatched after they are sorted by the I/O scheduler.
      
        dm_request_fn() checks busy state of underlying devices using
        target's busy() function and stops dispatching requests to keep them
        on the dm device's queue if busy.
        It helps better I/O merging, since no merge is done for a request
        once it is dispatched to underlying devices.
      
        Actual cloning and mapping are done in dm_prep_fn() and map_request()
        called from dm_request_fn().
        dm_prep_fn() clones not only request but also bios of the request
        so that dm can hold bio completion in error cases and prevent
        the bio submitter from noticing the error.
        (See the "Completion" section below for details.)
      
        After the cloning, the clone is mapped by target's map_rq() function
          and inserted to underlying device's queue using
          blk_insert_cloned_request().
      
        Completion
        ==========
        Request completion can be hooked by rq->end_io(), but then, all bios
        in the request will have been completed even error cases, and the bio
        submitter will have noticed the error.
        To prevent the bio completion in error cases, request-based dm clones
        both bio and request and hooks both bio->bi_end_io() and rq->end_io():
            bio->bi_end_io(): end_clone_bio()
            rq->end_io():     end_clone_request()
      
        Summary of the request completion flow is below:
        blk_end_request() for a clone request
          => blk_update_request()
             => bio->bi_end_io() == end_clone_bio() for each clone bio
                => Free the clone bio
                => Success: Complete the original bio (blk_update_request())
                   Error:   Don't complete the original bio
          => blk_finish_request()
             => rq->end_io() == end_clone_request()
                => blk_complete_request()
                   => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request
      
        end_clone_bio() completes the original request on the size of
        the original bio in successful cases.
        Even if all bios in the original request are completed by that
        completion, the original request must not be completed yet to keep
        the ordering of request completion for the stacking.
        So end_clone_bio() uses blk_update_request() instead of
        blk_end_request().
        In error cases, end_clone_bio() doesn't complete the original bio.
        It just frees the cloned bio and gives over the error handling to
        end_clone_request().
      
        end_clone_request(), which is called with queue lock held, completes
        the clone request and the original request in a softirq context
        (dm_softirq_done()), which has no queue lock, to avoid a deadlock
        issue on submission of another request during the completion:
            - The submitted request may be mapped to the same device
            - Request submission requires queue lock, but the queue lock
              has been held by itself and it doesn't know that
      
        The clone request has no clone bio when dm_softirq_done() is called.
        So target drivers can't resubmit it again even error cases.
        Instead, they can ask dm core for requeueing and remapping
        the original request in that cases.
      
        suspend
        =======
        Request-based dm uses stopping md->queue as suspend of the md.
        For noflush suspend, just stops md->queue.
      
        For flush suspend, inserts a marker request to the tail of md->queue.
        And dispatches all requests in md->queue until the marker comes to
        the front of md->queue.  Then, stops dispatching request and waits
        for the all dispatched requests to complete.
        After that, completes the marker request, stops md->queue and
        wake up the waiter on the suspend queue, md->wait.
      
        resume
        ======
        Starts md->queue.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      cec47e3d
    • M
      dm: calculate queue limits during resume not load · 754c5fc7
      Mike Snitzer 提交于
      Currently, device-mapper maintains a separate instance of 'struct
      queue_limits' for each table of each device.  When the configuration of
      a device is to be changed, first its table is loaded and this structure
      is populated, then the device is 'resumed' and the calculated
      queue_limits are applied.
      
      This places restrictions on how userspace may process related devices,
      where it is often advantageous to 'load' tables for several devices
      at once before 'resuming' them together.  As the new queue_limits
      only take effect after the 'resume', if they are changing and one
      device uses another, the latter must be 'resumed' before the former
      may be 'loaded'.
      
      This patch moves the calculation of these queue_limits out of
      the 'load' operation into 'resume'.  Since we are no longer
      pre-calculating this struct, we no longer need to maintain copies
      within our dm structs.
      
      dm_set_device_limits() now passes the 'start' of the device's
      data area (aka pe_start) as the 'offset' to blk_stack_limits().
      
      init_valid_queue_limits() is replaced by blk_set_default_limits().
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: martin.petersen@oracle.com
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      754c5fc7
    • M
      dm ioctl: support cookies for udev · 60935eb2
      Milan Broz 提交于
      Add support for passing a 32 bit "cookie" into the kernel with the
      DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
      value of this cookie is returned to userspace alongside the uevents
      issued by these ioctls in the variable DM_COOKIE.
      
      This means the userspace process issuing these ioctls can be notified
      by udev after udev has completed any actions triggered.
      
      To minimise the interface extension, we pass the cookie into the
      kernel in the event_nr field which is otherwise unused when calling
      these ioctls.  Incrementing the version number allows userspace to
      determine in advance whether or not the kernel supports the cookie.
      If the kernel does support this but userspace does not, there should
      be no impact as the new variable will just get ignored.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      60935eb2
    • M
      dm: send empty barriers to targets in dm_flush · 52b1fd5a
      Mikulas Patocka 提交于
      Pass empty barrier flushes to the targets in dm_flush().
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      52b1fd5a
    • A
      dm: initialise tio in alloc_tio · 9015df24
      Alasdair G Kergon 提交于
      Move repeated dm_target_io initialisation inside alloc_tio().
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      9015df24
    • M
      dm: introduce num_flush_requests · f9ab94ce
      Mikulas Patocka 提交于
      Introduce num_flush_requests for a target to set to say how many flush
      instructions (empty barriers) it wants to receive.  These are sent by
      __clone_and_map_empty_barrier with map_info->flush_request going from 0
      to (num_flush_requests - 1).
      
      Old targets without flush support won't receive any flush requests.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f9ab94ce
    • M
      dm: remove check that prevents mapping empty bios · 27eaa149
      Mikulas Patocka 提交于
      Remove the check that the size of the cloned bio is not zero because a
      subsequent patch needs to send zero-sized barriers down this path.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      27eaa149
    • M
      dm: remove EOPNOTSUPP for barriers · fdb9572b
      Mikulas Patocka 提交于
      If the underlying device doesn't support barriers and dm receives a
      barrier, it waits until all requests on that device drain so it no
      longer needs to report -EOPNOTSUPP to the caller.
      
      This patch deals with the confusing situation when moving a volume from
      one physical device to another triggers an EOPNOTSUPP on a volume that
      didn't report it before.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      fdb9572b
    • M
      dm: store only first barrier error · 5aa2781d
      Mikulas Patocka 提交于
      With the following patches, more than one error can occur during
      processing.  Change md->barrier_error so that only the first one is
      recorded and returned to the caller.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      5aa2781d
    • M
      dm: process requeue in dm_wq_work · 2761e95f
      Mikulas Patocka 提交于
      If barrier request was returned with DM_ENDIO_REQUEUE,
      requeue it in dm_wq_work instead of dec_pending.
      
      This allows us to correctly handle a situation when some targets
      are asking for a requeue and other targets signal an error.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      2761e95f
    • M
      dm: make dm_flush return void · 531fe963
      Mikulas Patocka 提交于
      Make dm_flush return void.
      
      The first error during flush is stored in md->barrier_error instead.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      531fe963
    • M
      dm: always hold bdev reference · 32a926da
      Mikulas Patocka 提交于
      Fix a potential deadlock when creating multiple snapshots by holding a
      reference to struct block_device for the whole lifecycle of every dm
      device instead of obtaining it independently at each point it is needed.
      
      bdget_disk() was called while the device was being suspended, in
      dm_suspend().  However there could be other devices already suspended,
      for example when creating additional snapshots of a device. bdget_disk()
      can wait for IO and allocate memory resulting in waiting for the
      already-suspended device - deadlock.
      
      This patch changes the code so that it gets the reference to struct
      block_device when struct mapped_device is allocated and initialized in
      alloc_dev() where it is always OK to allocate memory or wait for I/O.
      It drops the reference when it is destroyed in free_dev().  Thus there
      is no call to bdget_disk() while any device is suspended.
      
      Previously unlock_fs() was called only if bdev was held.  Now it is
      called unconditionally, but the superfluous calls are harmless because
      it returns immediately if the filesystem was not previously frozen.
      
      This patch also now allows the device size to be changed in a
      noflush suspend because the bdev is held.  This has no adverse effect.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      32a926da
    • M
      dm: rename suspended_bdev to bdev · db8fef4f
      Mikulas Patocka 提交于
      Rename suspended_bdev to bdev.
      
      This patch doesn't change any functionality, just renames the variable.
      In the next patch, the variable will be used even for non-suspended device.
      
      (Pre-requisite for the per-target barrier support patches.)
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      db8fef4f
    • M
      dm: avoid unsupported spanning of md stripe boundaries · 8cbeb67a
      Mikulas Patocka 提交于
      A bio that has two or more vector entries, size less than or equal to
      page size, that crosses a stripe boundary of an underlying md device is
      accepted by device mapper (it conforms to all its limits) but not by the
      underlying device.
      
      The fix is: If device mapper selects the one-page maximum request size,
      it also needs to set its own q->merge_bvec_fn to reject any bios with
      multiple vector entries that span more pages.
      
      The problem was discovered in the following scenario:
        * MD - RAID-0
        * LV on the top of it (raid1, snapshot or striped with chunk
      size/stripe larger than RAID-0 stripe)
        * one of the logical volumes is exported to xen domU
        * inside xen domU it is partitioned, the key point is that the partition
      must be unaligned on page boundary (fdisk normally aligns the partition to
      63 sectors which will trigger it)
        * install the system on the partitioned disk in domU
      This causes I/O failures in dom0.
      Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      8cbeb67a
    • M
      dm: sysfs skip output when device is being destroyed · 4d89b7b4
      Milan Broz 提交于
      Do not process sysfs attributes when device is being destroyed.
      
      Otherwise code can cause
        BUG_ON(test_bit(DMF_FREEING, &md->flags));
      in dm_put() call.
      
      Cc: stable@kernel.org
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      4d89b7b4
  2. 16 6月, 2009 1 次提交
  3. 10 6月, 2009 1 次提交
    • L
      tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan 提交于
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the deivce from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print.
          While blktrace do the convertion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and no regression when using
      those trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      55782138
  4. 06 5月, 2009 1 次提交
  5. 15 4月, 2009 1 次提交
  6. 09 4月, 2009 8 次提交
  7. 03 4月, 2009 10 次提交
  8. 17 3月, 2009 1 次提交
    • M
      dm crypt: wait for endio to complete before destruction · b35f8caa
      Milan Broz 提交于
      The following oops has been reported when dm-crypt runs over a loop device.
      
      ...
      [   70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
      ...
      [   70.381058] Call Trace:
      [   70.381058]  [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
      [   70.381058]  [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt]
      [   70.381058]  [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt]
      [   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
      [   70.381058]  [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod]
      [   70.381058]  [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod]
      [   70.381058]  [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod]
      [   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
      [   70.381058]  [<c02bad86>] ? loop_thread+0x380/0x3b7
      [   70.381058]  [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165
      [   70.381058]  [<c013754f>] ? autoremove_wake_function+0x0/0x33
      [   70.381058]  [<c02baa06>] ? loop_thread+0x0/0x3b7
      
      When a table is being replaced, it waits for I/O to complete
      before destroying the mempool, but the endio function doesn't
      call mempool_free() until after completing the bio.
      
      Fix it by swapping the order of those two operations.
      
      The same problem occurs in dm.c with md referenced after dec_pending.
      Again, we swap the order.
      
      Cc: stable@kernel.org
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      b35f8caa
  9. 06 1月, 2009 2 次提交
    • M
      dm: add name and uuid to sysfs · 784aae73
      Milan Broz 提交于
      Implement simple read-only sysfs entry for device-mapper block device.
      
      This patch adds a simple sysfs directory named "dm" under block device
      properties and implements
      	- name attribute (string containing mapped device name)
      	- uuid attribute (string containing UUID, or empty string if not set)
      
      The kobject is embedded in mapped_device struct, so no additional
      memory allocation is needed for initializing sysfs entry.
      
      During the processing of sysfs attribute we need to lock mapped device
      which is done by a new function dm_get_from_kobj, which returns the md
      associated with kobject and increases the usage count.
      
      Each 'show attribute' function is responsible for its own locking.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      784aae73
    • M
      dm table: rework reference counting · d5816876
      Mikulas Patocka 提交于
      Rework table reference counting.
      
      The existing code uses a reference counter. When the last reference is
      dropped and the counter reaches zero, the table destructor is called.
      Table reference counters are acquired/released from upcalls from other
      kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
      If the reference counter reaches zero in one of the upcalls, the table
      destructor is called from almost random kernel code.
      
      This leads to various problems:
      * dm_any_congested being called under a spinlock, which calls the
        destructor, which calls some sleeping function.
      * the destructor attempting to take a lock that is already taken by the
        same process.
      * stale reference from some other kernel code keeps the table
        constructed, which keeps some devices open, even after successful
        return from "dmsetup remove". This can confuse lvm and prevent closing
        of underlying devices or reusing device minor numbers.
      
      The patch changes reference counting so that the table destructor can be
      called only at predetermined places.
      
      The table has always exactly one reference from either mapped_device->map
      or hash_cell->new_map. After this patch, this reference is not counted
      in table->holders.  A pair of dm_create_table/dm_destroy_table functions
      is used for table creation/destruction.
      
      Temporary references from the other code increase table->holders. A pair
      of dm_table_get/dm_table_put functions is used to manipulate it.
      
      When the table is about to be destroyed, we wait for table->holders to
      reach 0. Then, we call the table destructor.  We use active waiting with
      msleep(1), because the situation happens rarely (to one user in 5 years)
      and removing the device isn't performance-critical task: the user doesn't
      care if it takes one tick more or not.
      
      This way, the destructor is called only at specific points
      (dm_table_destroy function) and the above problems associated with lazy
      destruction can't happen.
      
      Finally remove the temporary protection added to dm_any_congested().
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      d5816876