1. 01 11月, 2011 1 次提交
  2. 26 9月, 2011 2 次提交
  3. 02 8月, 2011 6 次提交
    • M
      dm table: set flush capability based on underlying devices · ed8b752b
      Mike Snitzer 提交于
      DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
      regardless of whether or not a given DM device's underlying devices
      also advertised a need for them.
      
      Block's flush-merge changes from 2.6.39 have proven to be more costly
      for DM devices.  Performance regressions have been reported even when
      DM's underlying devices do not advertise that they have a write cache.
      
      Fix the performance regressions by configuring a DM device's flushing
      capabilities based on those of the underlying devices' capabilities.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      ed8b752b
    • M
      dm table: share target argument parsing functions · 498f0103
      Mike Snitzer 提交于
      Move multipath target argument parsing code into dm-table so other
      targets can share it.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      498f0103
    • M
      dm: ignore merge_bvec for snapshots when safe · d5b9dd04
      Mikulas Patocka 提交于
      Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
      whether the device can accept bios larger than the size its merge
      function returns.  When set, use this to send large bios to snapshots
      which can split them if necessary.  Snapshot I/O may be significantly
      fragmented and this approach seems to improve peformance.
      
      Before the patch, dm_set_device_limits restricted bio size to page size
      if the underlying device had a merge function and the target didn't
      provide a merge function.  After the patch, dm_set_device_limits
      restricts bio size to page size if the underlying device has a merge
      function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
      provide a merge function.
      
      The snapshot target can't provide a merge function because when the merge
      function is called, it is impossible to determine where the bio will be
      remapped.  Previously this led us to impose a 4k limit, which we can
      now remove if the snapshot store is located on a device without a merge
      function.  Together with another patch for optimizing full chunk writes,
      it improves performance from 29MB/s to 40MB/s when writing to the
      filesystem on snapshot store.
      
      If the snapshot store is placed on a non-dm device with a merge function
      (such as md-raid), device mapper still limits all bios to page size.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      d5b9dd04
    • M
      dm table: clean dm_get_device and move exports · 08649012
      Mike Snitzer 提交于
      There is no need for __table_get_device to be factored out.
      Also move the exports to the end of their respective functions.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      08649012
    • J
      dm: use vzalloc · e29e65aa
      Joe Perches 提交于
      Use vzalloc() instead of vmalloc()+memset().
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      e29e65aa
    • M
      dm table: fix discard support · 936688d7
      Mike Snitzer 提交于
      Remove 'discards_supported' from the dm_table structure.  The same
      information can be easily discovered from the table's target(s) in
      dm_table_supports_discards().
      
      Before this fix dm_table_supports_discards() would skip checking the
      individual targets' 'discards_supported' flag if any one target in the
      table didn't set num_discard_requests > 0.  Now the per-target
      'discards_supported' flag is effective at insuring the final DM device
      advertises discard support.  But, to be clear, targets that don't
      support discards (!num_discard_requests) will not receive discard
      requests.
      
      Also DMWARN if a target sets 'discards_supported' override but forgets
      to set 'num_discard_requests'.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      936688d7
  4. 27 7月, 2011 1 次提交
  5. 29 5月, 2011 2 次提交
    • M
      dm table: reject devices without request fns · f4808ca9
      Milan Broz 提交于
      This patch adds a check that a block device has a request function
      defined before it is used.  Otherwise, misconfiguration can cause an oops.
      
      Because we are allowing devices with zero size e.g. an offline multipath
      device as in commit 2cd54d9b
      ("dm: allow offline devices") there needs to be an additional check
      to ensure devices are initialised.  Some block devices, like a loop
      device without a backing file, exist but have no request function.
      
      Reproducer is trivial: dm-mirror on unbound loop device
      (no backing file on loop devices)
      
      dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"
      
      and mirror resync will immediatelly cause OOps.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
       ? generic_make_request+0x2bd/0x590
       ? kmem_cache_alloc+0xad/0x190
       submit_bio+0x53/0xe0
       ? bio_add_page+0x3b/0x50
       dispatch_io+0x1ca/0x210 [dm_mod]
       ? read_callback+0x0/0xd0 [dm_mirror]
       dm_io+0xbb/0x290 [dm_mod]
       do_mirror+0x1e0/0x748 [dm_mirror]
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Reported-by: NZdenek Kabelac <zkabelac@redhat.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f4808ca9
    • M
      dm table: allow targets to support discards internally · 4c259327
      Mike Snitzer 提交于
      Permit a target to support discards regardless of whether or not all its
      underlying devices do.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      4c259327
  6. 06 4月, 2011 1 次提交
    • M
      dm: improve block integrity support · a63a5cf8
      Mike Snitzer 提交于
      The current block integrity (DIF/DIX) support in DM is verifying that
      all devices' integrity profiles match during DM device resume (which
      is past the point of no return).  To some degree that is unavoidable
      (stacked DM devices force this late checking).  But for most DM
      devices (which aren't stacking on other DM devices) the ideal time to
      verify all integrity profiles match is during table load.
      
      Introduce the notion of an "initialized" integrity profile: a profile
      that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
      template.  Add blk_integrity_is_initialized() to allow checking if a
      profile was initialized.
      
      Update DM integrity support to:
      - check all devices with _initialized_ integrity profiles match
        during table load; uninitialized profiles (e.g. for underlying DM
        device(s) of a stacked DM device) are ignored.
      - disallow a table load that would result in an integrity profile that
        conflicts with a DM device's existing (in-use) integrity profile
      - avoid clearing an existing integrity profile
      - validate all integrity profiles match during resume; but if they
        don't all we can do is report the mismatch (during resume we're past
        the point of no return)
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      a63a5cf8
  7. 17 3月, 2011 1 次提交
  8. 10 3月, 2011 1 次提交
  9. 15 1月, 2011 1 次提交
    • T
      block: restore multiple bd_link_disk_holder() support · 49731baa
      Tejun Heo 提交于
      Commit e09b457b (block: simplify holder symlink handling) incorrectly
      assumed that there is only one link at maximum.  dm may use multiple
      links and expects block layer to track reference count for each link,
      which is different from and unrelated to the exclusive device holder
      identified by @holder when the device is opened.
      
      Remove the single holder assumption and automatic removal of the link
      and revive the per-link reference count tracking.  The code
      essentially behaves the same as before commit e09b457b sans the
      unnecessary kobject reference count dancing.
      
      While at it, note that this facility should not be used by anyone else
      than the current ones.  Sysfs symlinks shouldn't be abused like this
      and the whole thing doesn't belong in the block layer at all.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NMilan Broz <mbroz@redhat.com>
      Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      49731baa
  10. 14 1月, 2011 2 次提交
  11. 17 12月, 2010 2 次提交
    • M
      block: max hardware sectors limit wrapper · 72d4cd9f
      Mike Snitzer 提交于
      Implement blk_limits_max_hw_sectors() and make
      blk_queue_max_hw_sectors() a wrapper around it.
      
      DM needs this to avoid setting queue_limits' max_hw_sectors and
      max_sectors directly.  dm_set_device_limits() now leverages
      blk_limits_max_hw_sectors() logic to establish the appropriate
      max_hw_sectors minimum (PAGE_SIZE).  Fixes issue where DM was
      incorrectly setting max_sectors rather than max_hw_sectors (which
      caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      72d4cd9f
    • M
      block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Martin K. Petersen 提交于
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: NEd Lin <ed.lin@promise.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e692cb66
  12. 13 11月, 2010 3 次提交
    • T
      block: clean up blkdev_get() wrappers and their users · d4d77629
      Tejun Heo 提交于
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
        errors.
      
      The new blkdev_get_*() functions are with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      d4d77629
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
    • T
      block: simplify holder symlink handling · e09b457b
      Tejun Heo 提交于
      Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
      complex with multiple holder considerations, redundant extra
      references to all involved kobjects, unused generic kobject holder
      support and unnecessary mixup with bd_claim/release functionalities.
      
      Strip it down to what's necessary (single gendisk holder) and make it
      use a separate interface.  This is a step for cleaning up
      bd_claim/release.  This patch makes dm-table slightly more complex but
      it will be simplified again with further changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      e09b457b
  13. 11 9月, 2010 1 次提交
  14. 12 8月, 2010 2 次提交
  15. 06 3月, 2010 2 次提交
    • N
      dm table: remove unused dm_get_device range parameters · 8215d6ec
      Nikanth Karthikesan 提交于
      Remove unused parameters(start and len) of dm_get_device()
      and fix the callers.
      Signed-off-by: NNikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      8215d6ec
    • K
      dm table: remove dm_get from dm_table_get_md · ecdb2e25
      Kiyoshi Ueda 提交于
      Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
      be called from presuspend/postsuspend, which are called while
      mapped_device is in DMF_FREEING state, where dm_get() is not allowed.
      
      Justification for that is the lifetime of both objects: As far as the
      current dm design/implementation, mapped_device is never freed while
      targets are doing something, because dm core waits for targets to become
      quiet in dm_put() using presuspend/postsuspend.  So targets should be
      able to touch mapped_device without holding reference count of the
      mapped_device, and we should allow targets to touch mapped_device even
      if it is in DMF_FREEING state.
      
      Backgrounds:
      I'm trying to remove the multipath internal queue, since dm core now has
      a generic queue for request-based dm.  In the patch-set, the multipath
      target wants to request dm core to start/stop queue.  One of such
      start/stop requests can happen during postsuspend() while the target
      waits for pg-init to complete, because the target stops queue when
      starting pg-init and tries to restart it when completing pg-init.  Since
      queue belongs to mapped_device, it involves calling dm_table_get_md()
      and dm_put().  On the other hand, postsuspend() is called in dm_put()
      for mapped_device which is in DMF_FREEING state, and that triggers
      BUG_ON(DMF_FREEING) in the 2nd dm_put().
      
      I had tried to solve this problem by changing only multipath not to
      touch mapped_device which is in DMF_FREEING state, but I couldn't and I
      came up with a question why we need dm_get() in dm_table_get_md().
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      ecdb2e25
  16. 11 1月, 2010 1 次提交
  17. 16 12月, 2009 1 次提交
    • A
      tree-wide: convert open calls to remove spaces to skip_spaces() lib function · e7d2860b
      André Goddard Rosa 提交于
      Makes use of skip_spaces() defined in lib/string.c for removing leading
      spaces from strings all over the tree.
      
      It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
         text    data     bss     dec     hex filename
        64688     584     592   65864   10148 (TOTALS-BEFORE)
        64641     584     592   65817   10119 (TOTALS-AFTER)
      
      Also, while at it, if we see (*str && isspace(*str)), we can be sure to
      remove the first condition (*str) as the second one (isspace(*str)) also
      evaluates to 0 whenever *str == 0, making it redundant. In other words,
      "a char equals zero is never a space".
      
      Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
      and found occurrences of this pattern on 3 more files:
          drivers/leds/led-class.c
          drivers/leds/ledtrig-timer.c
          drivers/video/output.c
      
      @@
      expression str;
      @@
      
      ( // ignore skip_spaces cases
      while (*str &&  isspace(*str)) { \(str++;\|++str;\) }
      |
      - *str &&
      isspace(*str)
      )
      Signed-off-by: NAndré Goddard Rosa <andre.goddard@gmail.com>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Samuel Ortiz <samuel@sortiz.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7d2860b
  18. 11 12月, 2009 1 次提交
    • A
      dm: bind new table before destroying old · a7940155
      Alasdair G Kergon 提交于
      When replacing a mapped device's table during a 'resume', delay the
      destruction of the old table until the new one is successfully in place.
      
      This will make it easier for a later patch to transfer internal state
      information from the old table to the new one (something we do not currently
      support) while giving us more options for reversion if a later part
      of the operation fails.
      
      Devices are always in the suspended state during dm_swap_table().
      This patch reinforces the requirement that all I/O must have been
      flushed from the table targets while in this state (including any in
      workqueues).  In the case of 'noflush' suspending, unprocessed
      I/O should have been 'pushed back' to the dm core prior to this point,
      for resubmission after the new table is in place.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a7940155
  19. 05 9月, 2009 3 次提交
  20. 24 7月, 2009 2 次提交
  21. 30 6月, 2009 1 次提交
  22. 22 6月, 2009 3 次提交
    • K
      dm: do not set QUEUE_ORDERED_DRAIN if request based · 5d67aa23
      Kiyoshi Ueda 提交于
      Request-based dm doesn't have barrier support yet.
      So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
      Since the device type is decided at the first table loading time,
      the flag set is deferred until then.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      5d67aa23
    • K
      dm: enable request based option · e6ee8c0b
      Kiyoshi Ueda 提交于
      This patch enables request-based dm.
      
      o Request-based dm and bio-based dm coexist, since there are
        some target drivers which are more fitting to bio-based dm.
        Also, there are other bio-based devices in the kernel
        (e.g. md, loop).
        Since bio-based device can't receive struct request,
        there are some limitations on device stacking between
        bio-based and request-based.
      
                           type of underlying device
                         bio-based      request-based
         ----------------------------------------------
          bio-based         OK                OK
          request-based     --                OK
      
        The device type is recognized by the queue flag in the kernel,
        so dm follows that.
      
      o The type of a dm device is decided at the first table binding time.
        Once the type of a dm device is decided, the type can't be changed.
      
      o Mempool allocations are deferred to at the table loading time, since
        mempools for request-based dm are different from those for bio-based
        dm and needed mempool type is fixed by the type of table.
      
      o Currently, request-based dm supports only tables that have a single
        target.  To support multiple targets, we need to support request
        splitting or prevent bio/request from spanning multiple targets.
        The former needs lots of changes in the block layer, and the latter
        needs that all target drivers support merge() function.
        Both will take a time.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      e6ee8c0b
    • K
      dm: prepare for request based option · cec47e3d
      Kiyoshi Ueda 提交于
      This patch adds core functions for request-based dm.
      
      When struct mapped device (md) is initialized, md->queue has
      an I/O scheduler and the following functions are used for
      request-based dm as the queue functions:
          make_request_fn: dm_make_request()
          pref_fn:         dm_prep_fn()
          request_fn:      dm_request_fn()
          softirq_done_fn: dm_softirq_done()
          lld_busy_fn:     dm_lld_busy()
      Actual initializations are done in another patch (PATCH 2).
      
      Below is a brief summary of how request-based dm behaves, including:
        - making request from bio
        - cloning, mapping and dispatching request
        - completing request and bio
        - suspending md
        - resuming md
      
        bio to request
        ==============
        md->queue->make_request_fn() (dm_make_request()) calls __make_request()
        for a bio submitted to the md.
        Then, the bio is kept in the queue as a new request or merged into
        another request in the queue if possible.
      
        Cloning and Mapping
        ===================
        Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
        when requests are dispatched after they are sorted by the I/O scheduler.
      
        dm_request_fn() checks busy state of underlying devices using
        target's busy() function and stops dispatching requests to keep them
        on the dm device's queue if busy.
        It helps better I/O merging, since no merge is done for a request
        once it is dispatched to underlying devices.
      
        Actual cloning and mapping are done in dm_prep_fn() and map_request()
        called from dm_request_fn().
        dm_prep_fn() clones not only request but also bios of the request
        so that dm can hold bio completion in error cases and prevent
        the bio submitter from noticing the error.
        (See the "Completion" section below for details.)
      
        After the cloning, the clone is mapped by target's map_rq() function
          and inserted to underlying device's queue using
          blk_insert_cloned_request().
      
        Completion
        ==========
        Request completion can be hooked by rq->end_io(), but then, all bios
        in the request will have been completed even error cases, and the bio
        submitter will have noticed the error.
        To prevent the bio completion in error cases, request-based dm clones
        both bio and request and hooks both bio->bi_end_io() and rq->end_io():
            bio->bi_end_io(): end_clone_bio()
            rq->end_io():     end_clone_request()
      
        Summary of the request completion flow is below:
        blk_end_request() for a clone request
          => blk_update_request()
             => bio->bi_end_io() == end_clone_bio() for each clone bio
                => Free the clone bio
                => Success: Complete the original bio (blk_update_request())
                   Error:   Don't complete the original bio
          => blk_finish_request()
             => rq->end_io() == end_clone_request()
                => blk_complete_request()
                   => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request
      
        end_clone_bio() completes the original request on the size of
        the original bio in successful cases.
        Even if all bios in the original request are completed by that
        completion, the original request must not be completed yet to keep
        the ordering of request completion for the stacking.
        So end_clone_bio() uses blk_update_request() instead of
        blk_end_request().
        In error cases, end_clone_bio() doesn't complete the original bio.
        It just frees the cloned bio and gives over the error handling to
        end_clone_request().
      
        end_clone_request(), which is called with queue lock held, completes
        the clone request and the original request in a softirq context
        (dm_softirq_done()), which has no queue lock, to avoid a deadlock
        issue on submission of another request during the completion:
            - The submitted request may be mapped to the same device
            - Request submission requires queue lock, but the queue lock
              has been held by itself and it doesn't know that
      
        The clone request has no clone bio when dm_softirq_done() is called.
        So target drivers can't resubmit it again even error cases.
        Instead, they can ask dm core for requeueing and remapping
        the original request in that cases.
      
        suspend
        =======
        Request-based dm uses stopping md->queue as suspend of the md.
        For noflush suspend, just stops md->queue.
      
        For flush suspend, inserts a marker request to the tail of md->queue.
        And dispatches all requests in md->queue until the marker comes to
        the front of md->queue.  Then, stops dispatching request and waits
        for the all dispatched requests to complete.
        After that, completes the marker request, stops md->queue and
        wake up the waiter on the suspend queue, md->wait.
      
        resume
        ======
        Starts md->queue.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      cec47e3d