1. 06 Mar, 2014 4 commits
    • dm thin: fix noflush suspend IO queueing · 738211f7
      Joe Thornber committed
      i) by the time DM core calls the postsuspend hook the dm_noflush flag
      has been cleared.  So the old thin_postsuspend did nothing.  We need to
      use the presuspend hook instead.
      
      ii) There was a race between bios leaving DM core and arriving in the
      deferred queue.
      
      thin_presuspend now sets a 'requeue' flag causing all bios destined for
      that thin to be requeued back to DM core.  Then it requeues all held IO,
      and all IO on the deferred queue (destined for that thin).  Finally
      postsuspend clears the 'requeue' flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix deadlock in __requeue_bio_list · 18adc577
      Joe Thornber committed
      The spin lock in requeue_io() was held for too long, allowing deadlock.
      Don't worry, due to other issues addressed in the following "dm thin:
      fix noflush suspend IO queueing" commit, this code was never called.
      
      Fix this by taking the spin lock for a much shorter period of time.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix out of data space handling · 3e1a0699
      Joe Thornber committed
      Ideally a thin pool would never run out of data space; the low water
      mark would trigger userland to extend the pool before we completely run
      out of space.  However, many small random IOs to unprovisioned space can
      consume data space at an alarming rate.  Adjust your low water mark if
      you're frequently seeing "out-of-data-space" mode.
      
      Before this fix, if data space ran out the pool would be put in
      PM_READ_ONLY mode which also aborted the pool's current metadata
      transaction (data loss for any changes in the transaction).  This had a
      side-effect of needlessly compromising data consistency.  And retry of
      queued unserviceable bios, once the data pool was resized, could
      initiate changes to potentially inconsistent pool metadata.
      
      Now when the pool's data space is exhausted transition to a new pool
      mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
      may not be allocated.  This allows users to remove thin volumes or
      discard data to recover data space.
      
      The pool is no longer put in PM_READ_ONLY mode in response to the pool
      running out of data space.  And PM_READ_ONLY mode no longer aborts the
      pool's current metadata transaction.  Also, set_pool_mode() will now
      notify userspace when the pool mode is changed.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
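The mode split described in the out-of-data-space commit can be sketched as a pair of predicates. This is only an illustration: PM_OUT_OF_DATA_SPACE and PM_READ_ONLY are named in the message, while PM_WRITE, PM_FAIL, and both helper functions are assumed names, and the behaviour shown for the read-only and failed modes is an assumption rather than something the message spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* PM_OUT_OF_DATA_SPACE and PM_READ_ONLY come from the commit message;
 * PM_WRITE and PM_FAIL are assumed names for the normal and failed states. */
enum pool_mode { PM_WRITE, PM_OUT_OF_DATA_SPACE, PM_READ_ONLY, PM_FAIL };

/* Hypothetical helper: may the pool's metadata be modified in this mode?
 * Out-of-data-space still permits it, so thin volumes can be removed or
 * discarded to recover space. */
static bool may_change_metadata(enum pool_mode m)
{
    assert((int)m >= 0 && (int)m <= PM_FAIL);
    return m == PM_WRITE || m == PM_OUT_OF_DATA_SPACE;
}

/* Hypothetical helper: may new data blocks be allocated in this mode? */
static bool may_allocate_data(enum pool_mode m)
{
    assert((int)m >= 0 && (int)m <= PM_FAIL);
    return m == PM_WRITE;
}
```

The point of the new mode is the asymmetry: metadata stays writable while data allocation is refused.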
    • dm thin: ensure user takes action to validate data and metadata consistency · 07f2b6e0
      Mike Snitzer committed
      If a thin metadata operation fails the current transaction will abort,
      thereby potentially causing IO layers up the stack (e.g. filesystems)
      to lose data.  As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
      thin metadata's superblock which:
      1) requires the user verify the thin metadata is consistent (e.g. use
         thin_check, etc)
      2) suggests the user verify the thin data is consistent (e.g. use fsck)
      
      The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
      to run thin_repair.
      
      On metadata operation failure: abort current metadata transaction, set
      pool in read-only mode, and now set the needs_check flag.
      
      As part of this change, constraints are introduced or relaxed:
      * don't allow a pool to transition to write mode if needs_check is set
      * don't allow data or metadata space to be resized if needs_check is set
      * if a thin pool's metadata space is exhausted: the kernel will now
        force the user to take the pool offline for repair before the kernel
        will allow the metadata space to be extended.
      
      Also, update Documentation to include information about when the thin
      provisioning target commits metadata, how it handles metadata failures
      and running out of space.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
  2. 05 Mar, 2014 1 commit
    • dm thin: synchronize the pool mode during suspend · cdc2b415
      Mike Snitzer committed
      Commit b5330655 ("dm thin: handle metadata failures more consistently")
      increased potential for the pool's mode to be changed in response to
      metadata operation failures.
      
      When the pool mode is changed it isn't synchronized with the mode in
      pool_features stored in the target's context (ti->private) that is used
      as the basis for (re)establishing the pool mode during resume via
      bind_control_target.
      
      It is important that we synchronize the pool mode when it is changed
      otherwise the pool may experience an unexpected mode transition on the
      next resume (especially if there was no new table load).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  3. 04 Mar, 2014 2 commits
    • dm snapshot: fix metadata corruption · 2c945820
      Mikulas Patocka committed
      Commit 55494bf2 ("dm snapshot: use dm-bufio") broke snapshots.
      Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
      involved reading exception areas one by one into ps->area and inserting
      those exceptions into the hash table.  Commit 55494bf2 changed
      it so that dm-bufio with prefetch is used to load exceptions in batches.
      Exceptions are loaded correctly, but ps->area is left uninitialized.
      When a new exception is allocated, it is stored in this uninitialized
      ps->area which will be written to the disk.  This causes metadata
      corruption.
      
      Fix this corruption by copying the last area that was read via dm-bufio
      into ps->area.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix Kconfig indentation · c64d240d
      Mike Snitzer committed
      Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
      move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.
      
      Doing so fixes indentation for other DM config options.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 01 Mar, 2014 1 commit
    • dm cache mq: fix memory allocation failure for large cache devices · 14f398ca
      Heinz Mauelshagen committed
      The memory allocated for the multiqueue policy's hash table doesn't need
      to be physically contiguous.  Use vzalloc() instead of kzalloc().
      Fedora has been carrying this fix since 10/10/2013.
      
      Failure seen during creation of a 10TB cached device with a 2048 sector
      block size and 411GB cache size:
      
       dmsetup: page allocation failure: order:9, mode:0x10c0d0
       CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3
       Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a       12/30/2011
        000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928
        ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928
        ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0
       Call Trace:
        [<ffffffff81387ab4>] dump_stack+0x19/0x1b
        [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124
        [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e
        [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e
        [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f
        [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88
        [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b
        [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq]
        [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq]
        [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache]
        [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache]
        [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache]
        [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod]
        [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod]
        [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod]
        [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod]
        [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod]
        [<ffffffff81101181>] vfs_ioctl+0x21/0x34
        [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4
        [<ffffffff810f4d2e>] ? ____fput+0x9/0xb
        [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92
        [<ffffffff81101a68>] SyS_ioctl+0x52/0x82
        [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  5. 28 Feb, 2014 2 commits
    • dm cache: fix truncation bug when mapping I/O to >2TB fast device · e0d849fa
      Heinz Mauelshagen committed
      When remapping a block to the cache's fast device that is larger than
      2TB we must not truncate the destination sector to 32bits.  The 32bit
      temporary result of from_cblock() was being overflowed in
      remap_to_cache() due to the logical left shift.
      
      Use an intermediate 64bit type to store the 32bit from_cblock() result
      to fix the overflow.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
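The overflow described for remap_to_cache() can be reproduced in user space with a minimal sketch. The typedefs and both function names below are hypothetical stand-ins, not the kernel's code; the sketch only assumes a 32-bit block index that is shifted left to obtain a 64-bit sector number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; names are illustrative only. */
typedef uint32_t cblock_t;   /* 32-bit cache block index */
typedef uint64_t sector_t;   /* 64-bit sector number */

/* Buggy form: the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened to sector_t. */
static sector_t cache_sector_truncated(cblock_t cblock, unsigned shift)
{
    assert(shift < 32);
    return (sector_t)(uint32_t)(cblock << shift);
}

/* Fixed form: widen the 32-bit value to 64 bits first, then shift. */
static sector_t cache_sector_fixed(cblock_t cblock, unsigned shift)
{
    assert(shift < 64);
    sector_t block = cblock;
    return block << shift;
}
```

With a 2048-sector block size (shift of 11), block index 1 << 21 corresponds to sector 2^32, which is exactly the 2TB boundary with 512-byte sectors: the truncated form wraps to sector 0, while the widened form keeps the full value.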
    • dm thin: allow metadata space larger than supported to go unused · 7d48935e
      Mike Snitzer committed
      It was always intended that a user could provide a thin metadata device
      that is larger than the max supported by the on-disk format.  The extra
      space would just go unused.
      
      Unfortunately that never worked.  If the user attempted to use a larger
      metadata device on creation they would get an error like the following:
      
       device-mapper: space map common: space map too large
       device-mapper: transaction manager: couldn't create metadata space map
       device-mapper: thin metadata: tm_create_with_sm failed
       device-mapper: table: 252:17: thin-pool: Error creating metadata object
       device-mapper: ioctl: error adding target to table
      
      Fix this by allowing the initial metadata space map creation to cap its
      size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
      get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
      THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
      THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).
      
      Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
      the sizeof the disk_bitmap_header.  So the supported maximum metadata
      size is a bit smaller (reduced from 33423360 to 33292800 sectors).
      
      Lastly, remove the "excess space will not be used" warning message from
      get_metadata_dev_size(); it resulted in printing the warning multiple
      times.  Factor out warn_if_metadata_device_too_big(), call it from
      pool_ctr() and maybe_resize_metadata_dev().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
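The two sector counts quoted in the message above (33423360 and 33292800) can be reproduced with a short calculation. The constants in this sketch (4096-byte metadata blocks, 255 bitmap blocks, 1 << 14 entries per bitmap block, and 64 entries displaced by the disk_bitmap_header) are assumptions chosen because they yield exactly those two figures; they are not stated in the commit message, and the macro and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE          512u
#define METADATA_BLOCK_SIZE  4096u   /* bytes per metadata block (assumed) */
#define SECTORS_PER_BLOCK    (METADATA_BLOCK_SIZE / SECTOR_SIZE)

/* Assumed: 255 bitmap blocks, each nominally indexing 1 << 14 blocks. */
#define BITMAP_BLOCKS        255u
#define ENTRIES_PER_BITMAP   (1u << 14)
/* Assumed: entries displaced by the disk_bitmap_header in each bitmap block. */
#define HEADER_ENTRIES       64u

/* Maximum metadata size in sectors for a given number of usable
 * entries per bitmap block. */
static uint64_t max_sectors(uint32_t entries_per_bitmap)
{
    assert(entries_per_bitmap <= ENTRIES_PER_BITMAP);
    return (uint64_t)BITMAP_BLOCKS * entries_per_bitmap * SECTORS_PER_BLOCK;
}
```

Under these assumptions, max_sectors(ENTRIES_PER_BITMAP) gives the old limit and max_sectors(ENTRIES_PER_BITMAP - HEADER_ENTRIES) gives the reduced one.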
  6. 26 Feb, 2014 1 commit
  7. 25 Feb, 2014 1 commit
    • dm thin: fix the error path for the thin device constructor · 1acacc07
      Mike Snitzer committed
      dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
      fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
      pool will still have an open thin device:
      
       device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
       device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
      
      Also, must establish error code if failing thin_ctr() because the pool
      is in fail_io mode.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
  8. 18 Feb, 2014 5 commits
    • dm raid1: fix immutable biovec related BUG when retrying read bio · f3a44fe0
      Mikulas Patocka committed
      When restoring bi_end_io, increase bi_remaining before retrying the bio
      to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm io: fix I/O to multiple destinations · d73f9907
      Mikulas Patocka committed
      Commit 003b5c57 ("block: Convert drivers
      to immutable biovecs") broke dm-mirror due to dm-io breakage.
      
      dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
      DM_IO_VMA) that iterate over pages where the I/O should be performed.
      
      The switch to immutable biovecs changed the DM_IO_BVEC iterator to
      DM_IO_BIO.  Before this change the iterator stored the pointer to a bio
      vector in the dpages structure.  The iterator incremented the pointer in
      the dpages structure as it advanced over the pages.  After the immutable
      biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
      the dpages structure and uses bio_advance to change the bio as it
      advances.
      
      The problem is that the function dispatch_io stores the content of the
      dpages structure into the variable old_pages and restores it before
      issuing I/O to each of the devices.  Before the change, the statement
      "*dp = old_pages;" restored the iterator to its starting position.
      After the change, struct dpages holds a pointer to the bio, thus the
      statement "*dp = old_pages;" doesn't restore the iterator.
      
      Consequently, in the context of dm-mirror, only the first mirror leg is
      written correctly; the kernel locks up when trying to write the other
      mirror legs because the number of sectors to write in the where->count
      variable doesn't match the number of sectors returned by the iterator.
      
      This patch fixes the bug by partially reverting the original patch - it
      changes the code so that struct dpages holds a pointer to the bio vector,
      so that the statement "*dp = old_pages;" restores the iterator correctly.
      
      The field "context_u" holds the offset from the beginning of the current
      bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
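The save-and-restore problem described for "*dp = old_pages;" can be shown in miniature. All types and functions in this sketch are hypothetical and far simpler than struct dpages or a real bio; it only demonstrates that copying a struct snapshots a position stored inside it, but not a position stored behind a pointer it holds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a "bio": the position lives inside it. */
struct mini_bio { size_t pos; };

/* Iterator that keeps its own position: a struct copy is a full snapshot. */
struct iter_by_value { const int *vec; size_t idx; };

/* Iterator that only points at shared state: a struct copy saves the
 * pointer, not the position it points to. */
struct iter_by_pointer { struct mini_bio *bio; };

static void advance_value(struct iter_by_value *it) { it->idx++; }

static void advance_pointer(struct iter_by_pointer *it)
{
    assert(it->bio != NULL);
    it->bio->pos++;
}

/* Save the iterator, advance twice, restore, and report the position. */
static size_t replay_by_value(void)
{
    static const int pages[4] = { 0 };
    struct iter_by_value it = { pages, 0 };
    struct iter_by_value saved = it;

    advance_value(&it);
    advance_value(&it);
    it = saved;              /* restores idx to its starting value */
    return it.idx;
}

static size_t replay_by_pointer(void)
{
    struct mini_bio bio = { 0 };
    struct iter_by_pointer it = { &bio };
    struct iter_by_pointer saved = it;

    advance_pointer(&it);
    advance_pointer(&it);
    it = saved;              /* restores only the pointer; bio.pos is unchanged */
    return it.bio->pos;
}
```

Only the by-value iterator returns to its starting position after the restore, which is why the fix goes back to keeping the position in the structure that gets copied.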
    • dm thin: avoid metadata commit if a pool's thin devices haven't changed · 4d1662a3
      Mike Snitzer committed
      Commit 905e51b3 ("dm thin: commit outstanding data every second")
      introduced a periodic commit.  This commit occurs regardless of whether
      any thin devices have made changes.
      
      Fix the periodic commit to check if any of a pool's thin devices have
      changed using dm_pool_changed_this_transaction().
      Reported-by: Alexander Larsson <alexl@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
    • dm cache: do not add migration to completed list before unhooking bio · 80ae49aa
      Mike Snitzer committed
      When completing an overwrite bio, in overwrite_endio(), the associated
      migration should not be added to the 'completed_migrations' until the
      bio's fields are restored with dm_unhook_bio().
      
      Otherwise, do_worker() can race to process 'completed_migrations' before
      dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
      unlikely to cause any problems given the current code but should be
      fixed on the basis of correctness.
      
      Also, the cache's spinlock only needs to be held when manipulating the
      'completed_migrations' list -- other changes don't need protection.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm cache: move hook_info into common portion of per_bio_data structure · c6eda5e8
      Mike Snitzer committed
      Commit c9d28d5d ("dm cache: promotion optimisation for writes")
      incorrectly placed the 'hook_info' member in the writethrough-only
      portion of the per_bio_data structure.
      
      Given that the overwrite optimization may be used for writeback the
      'hook_info' member must be placed above the 'cache' member of the
      per_bio_data structure.  Any members above 'cache' are available from
      both writeback and writethrough modes' per_bio_data structure.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
  9. 16 Feb, 2014 1 commit
    • of: search the best compatible match first in __of_match_node() · 06b29e76
      Kevin Hao committed
      Currently, of_match_node compares each given match against all node's
      compatible strings with of_device_is_compatible.
      
      To achieve multiple compatible strings per node with ordering from
      specific to generic, this requires given matches to be ordered from
      specific to generic. For most of the drivers this is not true and also
      an alphabetical ordering is more sane there.
      
      Therefore, this patch introduces a function to match each of the node's
      compatible strings against all given compatible matches without type and
      name first, before checking the next compatible string. This implies
      that node's compatibles are ordered from specific to generic while
      given matches can be in any order. If we fail to find such a match
      entry, then fall-back to the old method in order to keep compatibility.
      
      Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Tested-by: Stephen Chivers <schivers@csc.com>
      Signed-off-by: Rob Herring <robh@kernel.org>
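The matching order described in the message above can be sketched as two nested loops. This is a simplified illustration, not the kernel's __of_match_node(): struct match_entry and best_match() are hypothetical, entries carry only a compatible string (no type or name), comparison uses plain strcmp(), and the fall-back to the old method is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified match entry: a compatible string and a tag. */
struct match_entry { const char *compatible; int id; };

/*
 * For each of the node's compatible strings (ordered specific to generic),
 * scan the whole match table. The first hit wins, so the order of the
 * table itself does not matter. Returns NULL if nothing matches.
 */
static const struct match_entry *
best_match(const char *const *node_compat, size_t n_compat,
           const struct match_entry *table, size_t n_table)
{
    assert(node_compat != NULL && table != NULL);
    for (size_t i = 0; i < n_compat; i++)
        for (size_t j = 0; j < n_table; j++)
            if (strcmp(node_compat[i], table[j].compatible) == 0)
                return &table[j];
    return NULL;
}
```

Because the outer loop walks the node's strings, a table that lists a generic entry before a specific one still resolves to the specific entry whenever the node lists the specific string first.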
  10. 15 Feb, 2014 8 commits
  11. 14 Feb, 2014 14 commits