1. 10 September 2010, 7 commits
    • dm: convey that all flushes are processed as empty · b372d360
      Committed by Mike Snitzer
      Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
      clarity.
      
      Simplify logic associated with REQ_FLUSH conditionals.
      
      Introduce a BUG_ON() and add a few more helpful comments to the code
      so that it is clear that all flushes are empty.
      
      Clean up __split_and_process_bio() so that an empty flush isn't processed
      by a 'sector_count'-focused while loop.
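      
      A minimal sketch of the resulting dispatch in __split_and_process_bio(),
      assuming the names used in this series (struct clone_info ci,
      md->flush_bio, __clone_and_map()); not the verbatim patch:
      
          if (bio->bi_rw & REQ_FLUSH) {
                  ci.bio = &ci.md->flush_bio;   /* shared, always empty */
                  ci.sector_count = 0;
                  error = __clone_and_map_empty_flush(&ci);
                  /* dec_pending() submits any data associated with flush */
          } else {
                  ci.bio = bio;
                  ci.sector_count = bio_sectors(bio);
                  while (ci.sector_count && !error)   /* data path only */
                          error = __clone_and_map(&ci);
          }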
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: fix locking context in queue_io() · 05447420
      Committed by Kiyoshi Ueda
      Now queue_io() is called from dec_pending(), which may be called with
      interrupts disabled, so queue_io() must not enable interrupts
      unconditionally and must save/restore the current interrupt state.
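      
      A minimal sketch of the fixed helper, assuming the md->deferred_lock
      protection introduced earlier in this series; not the verbatim patch:
      
          static void queue_io(struct mapped_device *md, struct bio *bio)
          {
                  unsigned long flags;
      
                  /* callable with interrupts on or off: preserve the state */
                  spin_lock_irqsave(&md->deferred_lock, flags);
                  bio_list_add(&md->deferred, bio);
                  spin_unlock_irqrestore(&md->deferred_lock, flags);
                  queue_work(md->wq, &md->work);
          }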
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: relax ordering of bio-based flush implementation · 6a8736d1
      Committed by Tejun Heo
      Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
      against other bio's.  This patch relaxes ordering around flushes.
      
      * A flush bio is no longer deferred to the workqueue directly.  It's
        processed like other bio's but __split_and_process_bio() uses
        md->flush_bio as the clone source.  md->flush_bio is initialized to
        an empty flush during md initialization and shared for all flushes.
      
      * As a flush bio now travels through the same execution path as other
        bio's, there's no need for a dedicated error handling path either.
        It can use the same error handling path in dec_pending().  The
        dedicated error handling is removed along with md->flush_error.
      
      * When dec_pending() detects that a flush has completed, it checks
        whether the original bio has data.  If so, the bio is queued to the
        deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
      
      * As flush sequencing is handled in the usual issue/completion path,
        dm_wq_work() no longer needs to handle flushes differently.  Now its
        only responsibility is re-issuing deferred bio's the same way as
        _dm_request() would.  REQ_FLUSH handling logic including
        process_flush() is dropped.
      
      * There's no reason for queue_io() and dm_wq_work() to write-lock
        md->io_lock.  queue_io() now only uses md->deferred_lock and
        dm_wq_work() read-locks md->io_lock.
      
      * bio's no longer need to be queued on the deferred list while a flush
        is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.
      
      This avoids stalling the device during flushes and simplifies the
      implementation.
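      
      A sketch of the completion-side logic described above, as it might
      look in dec_pending(); not the verbatim patch:
      
          if ((bio->bi_rw & REQ_FLUSH) && bio->bi_size) {
                  /* preflush done for a flush with data: reissue the
                   * original bio through the deferred list, without
                   * REQ_FLUSH this time */
                  bio->bi_rw &= ~REQ_FLUSH;
                  queue_io(md, bio);
          } else {
                  /* normal I/O or empty flush: complete it */
                  bio_endio(bio, io_error);
          }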
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: implement REQ_FLUSH/FUA support for request-based dm · 29e4013d
      Committed by Tejun Heo
      This patch converts request-based dm to support the new REQ_FLUSH/FUA.
      
      The original request-based flush implementation depended on
      request_queue blocking other requests while a barrier sequence is in
      progress, which is no longer true for the new REQ_FLUSH/FUA.
      
      In general, request-based dm doesn't have infrastructure for cloning
      one source request to multiple targets, but the original flush
      implementation had a special, mostly independent path that could issue
      flushes to multiple targets and sequence them.  However, the
      capability isn't currently in use and adds a lot of complexity.
      Moreover, it's unlikely to be useful in its current form as it
      doesn't make sense to be able to send out flushes to multiple targets
      when write requests can't be.
      
      This patch rips out the special flush code path and handles
      REQ_FLUSH/FUA requests the same way as other requests.  The only
      special treatment is that REQ_FLUSH requests use block address 0
      when looking up the target, which is enough for now.
      
      * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
        suggested by Mike Snitzer
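      
      A sketch of that special treatment inside dm_request_fn(); flush
      requests carry no sectors, so the lookup simply uses block address 0:
      
          pos = 0;
          if (!(rq->cmd_flags & REQ_FLUSH))
                  pos = blk_rq_pos(rq);
      
          ti = dm_table_find_target(map, pos);
          BUG_ON(!dm_target_is_valid(ti));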
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • dm: implement REQ_FLUSH/FUA support for bio-based dm · d87f4c14
      Committed by Tejun Heo
      This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
      now deprecated REQ_HARDBARRIER.
      
      * -EOPNOTSUPP handling logic dropped.
      
      * Preflush is handled as before but postflush is dropped and replaced
        with passing down REQ_FUA to member request_queues.  This replaces
        one array-wide cache flush with member-specific FUA writes.
      
      * __split_and_process_bio() now calls __clone_and_map_flush() directly
        for flushes and guarantees all FLUSH bio's going to targets are zero
        length.
      
      * It's now guaranteed that all FLUSH bio's which are passed onto dm
        targets are zero length.  bio_empty_barrier() tests are replaced
        with REQ_FLUSH tests.
      
      * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.
      
      * Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
        enough to be marked with unlikely().
      
      * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
        doesn't support cache flushing.  Advertise REQ_FLUSH | REQ_FUA
        capability.
      
      * Request-based dm isn't converted yet.  dm_init_request_based_queue()
        resets flush support to 0 for now.  To avoid disturbing request-based
        dm code, dm->flush_error is added for bio-based dm while
        request-based dm continues to use dm->barrier_error.
      
      Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
      proceed with caution as I'm not familiar with the code base.
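      
      As an example, the bio_empty_barrier() replacement mentioned above is
      mechanical (a sketch):
      
          /* before: flushes arrived as empty barriers */
          if (bio_empty_barrier(bio))
                  ...
      
          /* after: every FLUSH bio reaching a target is empty,
           * so testing the flag is sufficient */
          if (bio->bi_rw & REQ_FLUSH)
                  ...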
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: dm-devel@redhat.com
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • md: implement REQ_FLUSH/FUA support · e9c7469b
      Committed by Tejun Heo
      This patch converts md to support REQ_FLUSH/FUA instead of now
      deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
      changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and their users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  The FUA bit is propagated
        through biodrain and stripe reconstruction such that all the updated
        parts of the stripe are written out with FUA writes if any of the
        dirtying writes was FUA.  preread_active_stripes handling in
        make_request() is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
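      
      A sketch of the resulting pattern in a personality's make_request
      function, assuming the renamed helper; not the verbatim patch:
      
          if (unlikely(bio->bi_rw & REQ_FLUSH)) {
                  /* issue the preflush to all member devices, then
                   * resubmit any payload without REQ_FLUSH */
                  md_flush_request(mddev, bio);
                  return 0;
          }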
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() · 4913efe4
      Committed by Tejun Heo
      Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
      requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
      -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
      blk_queue_flush().
      
      blk_queue_flush() takes combinations of REQ_FLUSH and REQ_FUA.  If a
      device has a write cache and can flush it, it should set REQ_FLUSH.
      If the device can handle FUA writes, it should also set REQ_FUA.
      
      All blk_queue_ordered() users are converted.
      
      * ORDERED_DRAIN is mapped to 0 which is the default value.
      * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
      * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
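      
      A sketch of a typical driver conversion under these mappings:
      
          /* before: blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH); */
          blk_queue_flush(q, REQ_FLUSH);
      
          /* a device with a volatile write cache and FUA support */
          blk_queue_flush(q, REQ_FLUSH | REQ_FUA);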
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Pierre Ossman <drzeus@drzeus.cx>
      Cc: Stefan Weinhuber <wein@de.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  2. 18 August 2010, 4 commits
  3. 12 August 2010, 29 commits
    • dm mpath: support discard · 959eb4e5
      Committed by Mike Snitzer
      Enable discard support in the DM multipath target.
      
      This discard support depends on a few discard-specific fixes to the
      block layer's request stacking driver methods.
      
      Discard requests are optional, so don't allow a failed discard to
      trigger path failures.  If there is a real problem with a given path,
      the barriers associated with the discard (either before or after the
      discard) will cause path failure.  That said, unconditionally passing
      discard failures up the stack is not ideal.  This must be fixed once DM
      has more information about the nature of the underlying storage failure.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
    • dm stripe: support discards · 7b76ec11
      Committed by Mikulas Patocka
      The DM core will submit a discard bio to the stripe target for each
      stripe in a striped DM device.  The stripe target will determine the
      stripe-specific portions of the supplied bio to be remapped into
      individual extents (at most 'num_discard_requests' of them).  If the
      discard doesn't touch a particular stripe, the bio for that stripe
      will be dropped.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: split discard requests on target boundaries · a79245b3
      Committed by Mike Snitzer
      Update __clone_and_map_discard to loop across all targets in a DM
      device's table when it processes a discard bio.  If a discard crosses a
      target boundary it must be split accordingly.
      
      Update __issue_target_requests and __issue_target_request to allow a
      cloned discard bio to have a custom start sector and size.
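      
      A condensed sketch of the resulting loop, assuming the helpers named
      in this series; not the verbatim patch:
      
          static int __clone_and_map_discard(struct clone_info *ci)
          {
                  struct dm_target *ti;
                  sector_t len;
      
                  do {
                          ti = dm_table_find_target(ci->map, ci->sector);
                          if (!dm_target_is_valid(ti))
                                  return -EIO;
                          if (!ti->num_discard_requests)
                                  return -EOPNOTSUPP;
      
                          /* never let one clone cross a target boundary */
                          len = min(ci->sector_count,
                                    max_io_len_target_boundary(ci->sector, ti));
      
                          __issue_target_requests(ci, ti,
                                                  ti->num_discard_requests, len);
      
                          ci->sector += len;
                  } while (ci->sector_count -= len);
      
                  return 0;
          }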
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm stripe: optimize sector division · c96053b7
      Committed by Mikulas Patocka
      Optimize sector division: If the number of stripes is a power of two,
      we can do shift and mask instead of division.
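      
      A sketch of the fast path, assuming 'stripes_mask' and 'stripes_shift'
      are precomputed in the constructor, with stripes_shift < 0 meaning
      "not a power of two":
      
          if (sc->stripes_shift < 0)
                  /* generic case: sector_div() leaves the quotient in
                   * 'chunk' and returns the remainder (the stripe) */
                  *stripe = sector_div(chunk, sc->stripes);
          else {
                  *stripe = chunk & sc->stripes_mask;   /* mask ... */
                  chunk >>= sc->stripes_shift;          /* ... and shift */
          }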
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm stripe: move sector translation to a function · 65988525
      Committed by Mikulas Patocka
      Move sector-to-stripe translation into a function.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm error: return error for discards · 38e1b257
      Committed by Mike Snitzer
      Have the error target respond to a discard request with a hard -EIO
      rather than fail the request with -EOPNOTSUPP.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm delay: support discard · 3fd5d480
      Committed by Mike Snitzer
      Enable discard support for the delay target.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm zero: silently drop discards · f8facb61
      Committed by Mike Snitzer
      Have the zero target silently drop a discard rather than fail the
      request with -EOPNOTSUPP.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: use dm_target_offset macro · b441a262
      Committed by Alasdair G Kergon
      Use the new dm_target_offset() macro to avoid most references to
      ti->begin in dm targets.
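      
      The macro simply rebases a device-relative sector onto the target,
      roughly:
      
          /* include/linux/device-mapper.h */
          #define dm_target_offset(ti, sector) ((sector) - (ti)->begin)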
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: factor out max_io_len_target_boundary · 56a67df7
      Committed by Mike Snitzer
      Split max_io_len_target_boundary out of max_io_len so that the discard
      support can make use of it without duplicating max_io_len code.
      
      Avoiding max_io_len's split_io logic enables DM's discard support to
      submit the entire discard request to a target.  But discards must still
      be split on target boundaries.
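      
      A minimal sketch of the factored-out helper:
      
          /* sectors left before I/O at 'sector' would cross out of 'ti' */
          static sector_t max_io_len_target_boundary(sector_t sector,
                                                     struct dm_target *ti)
          {
                  sector_t target_offset = dm_target_offset(ti, sector);
      
                  return ti->len - target_offset;
          }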
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: use common __issue_target_request for flush and discard support · 06a426ce
      Committed by Mike Snitzer
      Rename __flush_target to __issue_target_request now that it is used to
      issue both flush and discard requests.
      
      Introduce __issue_target_requests as a convenience wrapper that calls
      __issue_target_request 'num_flush_requests' or 'num_discard_requests'
      times per target.
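      
      A sketch of the wrapper (a per-request length argument is added by a
      later patch in this series):
      
          static void __issue_target_requests(struct clone_info *ci,
                                              struct dm_target *ti,
                                              unsigned num_requests)
          {
                  unsigned request_nr;
      
                  for (request_nr = 0; request_nr < num_requests; request_nr++)
                          __issue_target_request(ci, ti, request_nr);
          }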
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm linear: support discard · 5ae89a87
      Committed by Mike Snitzer
      Allow discards to be passed through to linear mappings if at least one
      underlying device supports it.  Discards will be forwarded only to
      devices that support them.
      
      A target that supports discards should set num_discard_requests to
      indicate how many times each discard request must be submitted to it.
      
      Verify that the table's underlying devices support discards prior to
      setting the associated DM device as capable of discards (via
      QUEUE_FLAG_DISCARD).
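      
      For the linear target, opting in amounts to declaring in its
      constructor how many times each discard must be submitted (a sketch):
      
          /* in linear_ctr() */
          ti->num_flush_requests = 1;
          ti->num_discard_requests = 1;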
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Joe Thornber <thornber@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: simplify crypt_ctr · 5ebaee6d
      Committed by Milan Broz
      Allocate cipher strings independently of struct crypt_config and move
      cipher parsing and allocation into a separate function to prepare for
      supporting the cryptoapi format, e.g. "xts(aes)".
      
      No functional change in this patch.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: simplify crypt_config destruction logic · 28513fcc
      Committed by Milan Broz
      Use just one label and reuse common destructor for crypt target.
      
      Parse the remaining argv arguments in logical order.
      
      Also do not ignore error values from IV init and set key functions.
      
      No functional change in this patch except for return codes changed
      per the above.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: allow autoloading of dm mod · 7e507eb6
      Committed by Peter Rajnoha
      Add the 'devname:mapper/control' and MAPPER_CTRL_MINOR module aliases
      to support dm-mod module autoloading.
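      
      The aliases let udev create /dev/mapper/control and the kernel
      autoload dm-mod on first access, roughly:
      
          MODULE_ALIAS_MISCDEV(MAPPER_CTRL_MINOR);
          MODULE_ALIAS("devname:mapper/control");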
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: rename map_info flush_request to target_request_nr · 57cba5d3
      Committed by Mike Snitzer
      'target_request_nr' is a more generic name that reflects the fact that
      it will be used for both flush and discard support.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: refactor dm_table_complete · 26803b9f
      Committed by Will Drewry
      This change unifies the various checks and finalization that occur on a
      table prior to use.  By doing so, it allows table construction without
      traversing the dm-ioctl interface.
      Signed-off-by: Will Drewry <wad@chromium.org>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: implement merge · b1d55528
      Committed by Mikulas Patocka
      Implement merge method for the snapshot origin to improve read
      performance.
      
      Without a merge method, dm asks the upper layers to submit the smallest
      possible bios --- one page.  Submitting such small bios impacts
      performance negatively when reading or writing the origin device.
      
      Without this patch, CPU consumption when reading the origin on lvm on
      md-raid0 was 6 to 12%; with this patch, it drops to 1 to 4%.
      
      Note: in my testing, it actually degraded performance in some settings;
      I traced that to Maxtor disks having problems with requests larger than
      512 sectors.  Reducing /sys/block/sd*/queue/max_sectors_kb to 256 fixed
      the read performance.  I think we don't have to care about weird disks
      that degrade performance because of large requests being sent to them.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: do not initialise full request queue when bio based · 4a0b4ddf
      Committed by Mike Snitzer
      Bio-based mapped devices no longer have a fully initialized
      request_queue (request_fn, elevator, etc).  This means bio-based DM
      devices no longer register elevator sysfs attributes (the 'iosched/'
      tree or a 'scheduler' value other than "none").
      
      In contrast, a request-based DM device will continue to have a full
      request_queue and will register elevator sysfs attributes.  Therefore
      a user can determine a DM device's type by checking if elevator sysfs
      attributes exist.
      
      First allocate a minimalist request_queue structure for a DM device
      (needed for both bio and request-based DM).
      
      Initialization of a full request_queue is deferred until it is known
      that the DM device is request-based, at the end of the table load
      sequence.
      
      Factor DM device's request_queue initialization:
      - common to both request-based and bio-based into dm_init_md_queue().
      - specific to request-based into dm_init_request_based_queue().
      
      The md->type_lock mutex is used to protect md->queue, in addition to
      md->type, during table_load().
      
      A DM device's first table_load will establish the immutable md->type.
      But md->queue initialization, based on md->type, may fail at that time
      (because blk_init_allocated_queue cannot allocate memory).  Therefore
      any subsequent table_load must (re)try dm_setup_md_queue independently of
      establishing md->type.
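      
      A heavily trimmed sketch of the deferred initialization, assuming
      blk_init_allocated_queue() and the usual request-based dm hooks;
      not the verbatim patch:
      
          static int dm_init_request_based_queue(struct mapped_device *md)
          {
                  struct request_queue *q;
      
                  /* promote the minimalist queue allocated at create time */
                  q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
                  if (!q)
                          return 0;       /* caller retries on a later load */
      
                  md->queue = q;
                  blk_queue_softirq_done(md->queue, dm_softirq_done);
                  blk_queue_prep_rq(md->queue, dm_prep_fn);
                  elv_register_queue(md->queue);
      
                  return 1;
          }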
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: make bio or request based device type immutable · a5664dad
      Committed by Mike Snitzer
      Determine whether a mapped device is bio-based or request-based when
      loading its first (inactive) table and don't allow that to be changed
      later.
      
      This patch performs different device initialisation in each of the two
      cases.  (We don't think it's necessary to add code to support changing
      between the two types.)
      
      Allowed md->type transitions:
        DM_TYPE_NONE to DM_TYPE_BIO_BASED
        DM_TYPE_NONE to DM_TYPE_REQUEST_BASED
      
      We now prevent table_load from replacing the inactive table with a
      conflicting type of table even after an explicit table_clear.
      
      Introduce 'type_lock' into the struct mapped_device to protect md->type
      and to prepare for the next patch that will change the queue
      initialization and allocate memory while md->type_lock is held.
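      
      A sketch of the check in table_load(), assuming direct use of
      md->type_lock and a hypothetical 't' for the table being loaded (the
      real code may go through accessors):
      
          mutex_lock(&md->type_lock);
          if (md->type == DM_TYPE_NONE)
                  /* first table load establishes the immutable type */
                  md->type = dm_table_get_type(t);
          else if (md->type != dm_table_get_type(t)) {
                  DMWARN("can't change device type after initial table load");
                  r = -EINVAL;
          }
          mutex_unlock(&md->type_lock);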
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      
       drivers/md/dm-ioctl.c    |   15 +++++++++++++++
       drivers/md/dm.c          |   37 ++++++++++++++++++++++++++++++-------
       drivers/md/dm.h          |    5 +++++
       include/linux/dm-ioctl.h |    4 ++--
       4 files changed, 52 insertions(+), 9 deletions(-)
    • dm: skip second flush on bio unsupported error · 708e9295
      Committed by Mikulas Patocka
      When processing barriers, skip the second flush if processing the bio
      failed with -EOPNOTSUPP.  This can happen with discard+barrier requests.
      If the device doesn't support discard, there would be two useless
      SYNCHRONIZE CACHE commands.  The first dm_flush cannot be so easily
      optimized out, so we leave it there.
      
      Previously, -EOPNOTSUPP could be received in dec_pending only with empty
      barriers, and we ignored that error, assuming a device that doesn't
      support cache flushes has an always-consistent cache.  With the addition
      of discard barriers, this -EOPNOTSUPP can also be generated by discards
      and we must record it in md->barrier_error for process_barrier.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: use define for persistent disk header chunk size · 87c961cb
      Committed by Tomohiro Kusumi
      This patch replaces the hard-coded value 1, used for the size of the
      chunk that includes the disk header of a persistent snapshot, with the
      existing macro NUM_SNAPSHOT_HDR_CHUNKS.
      Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: use kstrdup · a9c88f2e
      Committed by Julia Lawall
      Use kstrdup when the goal of an allocation is to copy a string into the
      allocated region.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression from,to;
      expression flag,E1,E2;
      statement S;
      @@
      
      -  to = kmalloc(strlen(from) + 1,flag);
      +  to = kstrdup(from, flag);
         ... when != \(from = E1 \| to = E1 \)
         if (to==NULL || ...) S
         ... when != \(from = E2 \| to = E2 \)
      -  strcpy(to, from);
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: use nonseekable_open · 402ab352
      Committed by Arnd Bergmann
      The dm control device does not implement read/write, so it has no use for
      seeking.  Using no_llseek prevents falling back to default_llseek, which
      requires the BKL.
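      
      A sketch of the resulting file_operations in dm-ioctl.c; fields other
      than .open are assumed to match the existing code:
      
          static const struct file_operations _ctl_fops = {
                  /* no read/write methods, so mark the fd non-seekable */
                  .open = nonseekable_open,
                  .unlocked_ioctl = dm_ctl_ioctl,
                  .compat_ioctl = dm_ctl_compat_ioctl,
                  .owner = THIS_MODULE,
          };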
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: separate device deletion from dm_put · 3f77316d
      Committed by Kiyoshi Ueda
      This patch separates the device deletion code from dm_put()
      to make sure the deletion happens in process context.
      
      With this patch, device deletion always occurs in ioctl (process)
      context and dm_put() can be called in interrupt context.
      As a result, the request-based dm's bad dm_put() usage pointed out
      by Mikulas below disappears.
          http://marc.info/?l=dm-devel&m=126699981019735&w=2
      
      Without this patch, I confirmed a case that crashes the system:
          dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
      
      Some more background and details:
      In request-based dm, a device opener can remove a mapped_device
      while the last request is still completing, because bios in the last
      request complete first and then the device opener can close and remove
      the mapped_device before the last request completes:
        CPU0                                          CPU1
        =================================================================
        <<INTERRUPT>>
        blk_end_request_all(clone_rq)
          blk_update_request(clone_rq)
            bio_endio(clone_bio) == end_clone_bio
              blk_update_request(orig_rq)
                bio_endio(orig_bio)
                                                      <<I/O completed>>
                                                      dm_blk_close()
                                                      dev_remove()
                                                        dm_put(md)
                                                          <<Free md>>
         blk_finish_request(clone_rq)
           ....
           dm_end_request(clone_rq)
             free_rq_clone(clone_rq)
             blk_end_request_all(orig_rq)
             rq_completed(md)
      
      So request-based dm used dm_get()/dm_put() to hold md for each I/O
      until its request completion handling is fully done.
      However, the final dm_put() can call the device deletion code, which
      must not be run in interrupt context; doing so may cause a kernel panic.
      
      To solve the problem, this patch moves the device deletion code,
      dm_destroy(), to predetermined places that actually delete the
      mapped_device in ioctl (process) context, and changes dm_put() to
      just decrement the reference count of the mapped_device.
      By this change, dm_put() can be used in any context and the symmetric
      model below is introduced:
          dm_create():  create a mapped_device
          dm_destroy(): destroy a mapped_device
          dm_get():     increment the reference count of a mapped_device
          dm_put():     decrement the reference count of a mapped_device
      
      dm_destroy() waits for all references of the mapped_device to disappear,
      then deletes the mapped_device.
      
      dm_destroy() uses active waiting with msleep(1), since deleting
      the mapped_device isn't a performance-critical task.
      And since at this point, nobody opens the mapped_device and no new
      reference will be taken, the pending counts are just for racing
      completing activity and will eventually decrease to zero.
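      
      A minimal sketch of the waiting destructor described above; the real
      function does more teardown:
      
          void dm_destroy(struct mapped_device *md)
          {
                  /* no new openers or references are possible here; wait
                   * out racing completion activity, then free the device */
                  while (atomic_read(&md->holders))
                          msleep(1);
      
                  free_dev(md);
          }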
      
      For the unlikely case of forced module unload, dm_destroy_immediate(),
      which doesn't wait and forcibly deletes the mapped_device, is also
      introduced and used in dm_hash_remove_all(); otherwise, "rmmod -f"
      might get stuck and never return.  And because the mapped_device is
      deleted at that point, subsequent accesses to it may cause NULL
      pointer dereferences.
      
      Cc: stable@kernel.org
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: release _hash_lock between devices in remove_all · 98f33285
      Committed by Kiyoshi Ueda
      This patch changes dm_hash_remove_all() to release _hash_lock when
      removing a device.  After removing the device, dm_hash_remove_all()
      takes _hash_lock and searches the hash from scratch again.
      
      This patch is a preparation for the next patch, which changes the
      device deletion code to wait for the md reference count to drop to 0.
      Without this patch, the wait in the next patch may cause an AB-BA
      deadlock:
        CPU0                                CPU1
        -----------------------------------------------------------------------
        dm_hash_remove_all()
          down_write(_hash_lock)
                                            table_status()
                                              md = find_device()
                                                     dm_get(md)
                                                       <increment md->holders>
                                              dm_get_live_or_inactive_table()
                                                dm_get_inactive_table()
                                                  down_write(_hash_lock)
          <in the md deletion code>
            <wait for md->holders to be 0>
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: prevent access to md being deleted · abdc568b
      Committed by Kiyoshi Ueda
      This patch prevents access to mapped_device which is being deleted.
      
      Currently, even after a mapped_device has been removed from the hash,
      it could still be accessed through idr_find() using its minor number.
      That could cause a race and the NULL pointer dereference shown below:
        CPU0                          CPU1
        ------------------------------------------------------------------
        dev_remove(param)
          down_write(_hash_lock)
          dm_lock_for_deletion(md)
            spin_lock(_minor_lock)
            set_bit(DMF_DELETING)
            spin_unlock(_minor_lock)
          __hash_remove(hc)
          up_write(_hash_lock)
                                      dev_status(param)
                                        md = find_device(param)
                                               down_read(_hash_lock)
                                               __find_device_hash_cell(param)
                                                 dm_get_md(param->dev)
                                                   md = dm_find_md(dev)
                                                          spin_lock(_minor_lock)
                                                          md = idr_find(MINOR(dev))
                                                          spin_unlock(_minor_lock)
          dm_put(md)
            free_dev(md)
                                                   dm_get(md)
                                               up_read(_hash_lock)
                                        __dev_status(md, param)
                                        dm_put(md)
      
      This patch fixes such problems.
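      
      A sketch of the fixed lookup: the liveness checks move under
      _minor_lock so a dying md is never handed out (not the verbatim
      patch):
      
          spin_lock(&_minor_lock);
      
          md = idr_find(&_minor_idr, minor);
          if (md && (md == MINOR_ALLOCED ||
                     test_bit(DMF_FREEING, &md->flags) ||
                     test_bit(DMF_DELETING, &md->flags)))
                  md = NULL;      /* being freed or deleted: not found */
      
          spin_unlock(&_minor_lock);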
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: return uevent flag after rename · 856a6f1d
      Committed by Peter Rajnoha
      All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED
      flag so that userspace knows whether or not to wait for a uevent to be
      processed before continuing.
      
      The dm rename ioctl sets this flag but was not structured to return it
      to userspace.  This patch restructures the rename ioctl processing to
      behave like the other ioctls that return data, and so fixes this.
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: make __dev_status void · 094ea9a0
      Committed by Alasdair G Kergon
      __dev_status() cannot fail so make it void and simplify callers.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>