1. 04 Feb 2015, 8 commits
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      Committed by NeilBrown
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
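The central-dispatch idea can be sketched in userspace C. C11 atomics stand in for the kernel's rcu_read_lock()/rcu_dereference(), and all names and signatures here are illustrative, not the real md code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the md personality ops and mddev state. */
struct personality {
    int (*mergeable_bvec)(int sectors);   /* returns max bytes to merge */
};

static int raid0_merge(int sectors) { return sectors * 512; }

struct mddev {
    _Atomic(struct personality *) pers;   /* kernel protects this with RCU */
    atomic_bool suspended;
};

/* Central dispatch point: if the array is suspended (a personality change
 * may be in flight), reject the merge outright; otherwise it is safe to
 * call through the current personality's hook. */
int mddev_mergeable_bvec(struct mddev *m, int sectors)
{
    if (atomic_load(&m->suspended))
        return 0;                          /* safest answer: merge nothing */
    struct personality *p = atomic_load(&m->pers);
    if (!p || !p->mergeable_bvec)
        return sectors * 512;              /* no hook: no restriction */
    return p->mergeable_bvec(sectors);
}
```

The key property is that the suspended check happens before the indirect call, so a personality swap done while suspended can never race with a merge decision.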
    • md: make ->congested robust against personality changes. · 5c675f83
      Committed by NeilBrown
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock().  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: rename mddev->write_lock to mddev->lock · 85572d7c
      Committed by NeilBrown
      This lock is used for (slightly) more than helping with writing
      superblocks, and it will soon be extended further.  So the
      name is inappropriate.
      
      Also, the _irq variant hasn't been needed since 2.6.37 as it is
      never taken from interrupt or bh context.
      
      So:
        -rename write_lock to lock
        -document what it protects
        -remove _irq ... except in md_flush_request() as there
           is no wait_event_lock() (with no _irq).  This can be
           cleaned up after appropriate changes to wait.h.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      Committed by NeilBrown
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>
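A hypothetical, heavily simplified version of the two tests described above. The real need_this_block() works on struct stripe_head state; these flags are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented, flattened view of the stripe state relevant to the decision. */
struct stripe_state {
    bool block_uptodate;      /* this block is already in the stripe cache */
    bool dev_failed;          /* the device holding this block has failed */
    bool writing_partial;     /* a partial (sub-stripe) write is pending */
    bool rmw_possible;        /* read-modify-write is an option (parity trusted) */
    bool need_read_of_failed; /* reconstruction would need a failed device's data */
};

static bool need_this_block(const struct stripe_state *s)
{
    if (s->block_uptodate || s->dev_failed)
        return false;             /* nothing to read, or nowhere to read from */
    /* Case 1: a partial write aimed at a missing device.  Either RMW or RCW
     * works, but all surviving blocks must be read before the missing data
     * can be computed and the parity updated. */
    if (s->writing_partial && s->need_read_of_failed)
        return true;
    /* Case 2: RMW is not an option (RAID6, or parity untrusted after a
     * crash) and reconstruction needs a missing device, so read every
     * surviving block to compute the missing data. */
    if (!s->rmw_possible && s->need_read_of_failed)
        return true;
    return false;
}
```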
    • md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950
      Committed by NeilBrown
      Both the last two cases are only relevant if something has failed and
      something needs to be written (but not over-written), and if it is OK
      to pre-read blocks at this point.  So factor out those tests and
      explain them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate out the easy conditions in need_this_block. · a79cfe12
      Committed by NeilBrown
      Some of the conditions in need_this_block have very straight
      forward motivation.  Separate those out and document them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate large if clause out of fetch_block(). · 2c58f06e
      Committed by NeilBrown
      fetch_block() has a very large and hard to read 'if' condition.
      
      Separate it into its own function so that it can be
      made more readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6
      Committed by Jes Sorensen
      Commit 67f45548 introduced a call to
      md_wakeup_thread() when adding to the delayed_list.  However, the md
      thread is woken up unconditionally just below.
      
      Remove the unnecessary wakeup call.
      Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 02 Feb 2015, 2 commits
  3. 25 Jan 2015, 1 commit
    • dm: fix handling of multiple internal suspends · 96b26c8c
      Committed by Mikulas Patocka
      Commit ffcc3936 ("dm: enhance internal suspend and resume interface")
      attempted to handle multiple internal suspends on the same device, but
      it did that incorrectly.  When these functions are called in this order
      on the same device the device is no longer suspended, but it should be:
      	dm_internal_suspend_noflush
      	dm_internal_suspend_noflush
      	dm_internal_resume
      
      Fix this bug by maintaining an 'internal_suspend_count' and resuming
      the device when this count drops to zero.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
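The counting fix can be sketched as follows; the struct and function names are stand-ins, not the actual dm code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device with a nested internal-suspend count. */
struct dm_dev {
    int internal_suspend_count;
    bool suspended;
};

static void internal_suspend_noflush(struct dm_dev *d)
{
    if (d->internal_suspend_count++ == 0)
        d->suspended = true;       /* only the first suspend actually suspends */
}

static void internal_resume(struct dm_dev *d)
{
    if (d->internal_suspend_count > 0 && --d->internal_suspend_count == 0)
        d->suspended = false;      /* only the last resume really resumes */
}
```

With a plain flag instead of a count, the suspend/suspend/resume sequence from the message above would incorrectly leave the device running.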
  4. 24 Jan 2015, 1 commit
    • dm cache: fix problematic dual use of a single migration count variable · a59db676
      Committed by Joe Thornber
      Introduce a new variable to count the number of allocated migration
      structures.  The existing variable cache->nr_migrations became
      overloaded.  It was used to:
      
       i) track the number of migrations in flight for the purposes of
          quiescing during suspend.
      
       ii) estimate the amount of background IO occurring.
      
      Recent discard changes meant that REQ_DISCARD bios are processed with
      a migration.  Discards are not background IO so nr_migrations was not
      incremented.  However this could cause quiescing to complete early.
      
      (i) is now handled with a new variable cache->nr_allocated_migrations.
      cache->nr_migrations has been renamed cache->nr_io_migrations.
      cleanup_migration() is now called free_io_migration(), since it
      decrements that variable.
      
      Also, remove the unused cache->next_migration variable that got replaced
      with prealloc_structs a while ago.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
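A minimal sketch of the split counters, with invented names (the real code uses atomic_t and waits on a waitqueue):

```c
#include <assert.h>
#include <stdbool.h>

struct cache_stats {
    int nr_allocated_migrations;  /* everything allocated: quiescing waits on this */
    int nr_io_migrations;         /* background IO estimate only */
};

static void migration_alloc(struct cache_stats *c, bool is_discard)
{
    c->nr_allocated_migrations++;
    if (!is_discard)
        c->nr_io_migrations++;    /* discards are not background IO */
}

static void migration_free(struct cache_stats *c, bool is_discard)
{
    c->nr_allocated_migrations--;
    if (!is_discard)
        c->nr_io_migrations--;
}

static bool quiesced(const struct cache_stats *c)
{
    return c->nr_allocated_migrations == 0;  /* suspend waits on the full count */
}
```

With a single overloaded counter, an in-flight discard migration would be invisible to quiescing, which is exactly the early-completion bug described above.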
  5. 23 Jan 2015, 1 commit
    • dm cache: share cache-metadata object across inactive and active DM tables · 9b1cc9f2
      Committed by Joe Thornber
      If a DM table is reloaded with an inactive table when the device is not
      suspended (normal procedure for LVM2), then there will be two dm-bufio
      objects that can diverge.  This can lead to a situation where the
      inactive table uses bufio to read metadata at the same time the active
      table writes metadata -- resulting in the inactive table having stale
      metadata buffers once it is promoted to the active table slot.
      
      Fix this by using reference counting and a global list of cache metadata
      objects to ensure there is only one metadata object per metadata device.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
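The reference-counted lookup can be sketched in plain C. The key for the global list and the locking are simplified away; the kernel code keys on the metadata block device and holds a mutex around the list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted metadata object, one per metadata device. */
struct cache_metadata {
    int dev_id;                   /* stand-in for the bdev key */
    int ref_count;
    struct cache_metadata *next;
};

static struct cache_metadata *metadata_list;  /* kernel guards this with a mutex */

/* Open-or-share: a second table for the same device gets the same object,
 * so inactive and active tables can never hold divergent buffers. */
static struct cache_metadata *metadata_open(int dev_id)
{
    for (struct cache_metadata *m = metadata_list; m; m = m->next)
        if (m->dev_id == dev_id) {
            m->ref_count++;
            return m;
        }
    struct cache_metadata *m = calloc(1, sizeof(*m));
    m->dev_id = dev_id;
    m->ref_count = 1;
    m->next = metadata_list;
    metadata_list = m;
    return m;
}

static void metadata_close(struct cache_metadata *m)
{
    if (--m->ref_count)
        return;
    for (struct cache_metadata **p = &metadata_list; *p; p = &(*p)->next)
        if (*p == m) { *p = m->next; break; }  /* unlink on last reference */
    free(m);
}
```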
  6. 18 Dec 2014, 4 commits
    • dm: fix missed error code if .end_io isn't implemented by target_type · 5164bece
      Committed by zhendong chen
      In bio-based DM's clone_endio(), when target_type doesn't implement
      .end_io (e.g. linear), r will always be initialized to 0.  So if a
      WRITE SAME bio fails, WRITE SAME will not be disabled as intended.
      
      Fix this by initializing r to error, rather than 0, in clone_endio().
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 7eee4ae2 ("dm: disable WRITE SAME if it fails")
      Cc: stable@vger.kernel.org
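The one-line nature of the fix can be illustrated with a toy version of clone_endio(); the signatures are invented, since real DM passes a struct bio and per-target hooks:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*end_io_fn)(int error);

/* An illustrative target .end_io that consumes the error itself. */
static int swallow_error(int error) { (void)error; return 0; }

/* Default the result to the bio's error so a target without an .end_io
 * hook still propagates failure to the caller. */
static int clone_endio(int error, end_io_fn target_end_io)
{
    int r = error;               /* the fix: not "int r = 0;" */
    if (target_end_io)
        r = target_end_io(error);
    return r;                    /* callers disable WRITE SAME on error */
}
```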
    • dm thin: fix crash by initializing thin device's refcount and completion earlier · 2b94e896
      Committed by Marc Dionne
      Commit 80e96c54 ("dm thin: do not allow thin device activation
      while pool is suspended") delayed the initialization of a new thin
      device's refcount and completion until after this new thin was added
      to the pool's active_thins list and the pool lock is released.  This
      opens a race with a worker thread that walks the list and calls
      thin_get/put, noticing that the refcount goes to 0 and calling
      complete, freezing up the system and giving the oops below:
      
       kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
       kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90
      
       kernel: Call Trace:
       kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
       kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
       kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
       kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
       kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
       kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
       kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
       kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
       kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
       kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
      
      Set the thin device's initial refcount and initialize the completion
      before adding it to the pool's active_thins list in thin_ctr().
      Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
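The ordering rule, fully initialize before publishing to a shared list, can be sketched as below. Names are illustrative, and the pool lock that the kernel takes around the list update is elided:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct thin_c {
    atomic_int ref_count;
    int completion_ready;        /* stand-in for a struct completion */
    struct thin_c *next;
};

static struct thin_c *active_thins;  /* walked concurrently by a worker thread */

/* Correct constructor ordering: make the object fully live before any
 * other thread can find it on the shared list.  Publishing first, as the
 * buggy code did, lets the worker see ref_count == 0 and complete() an
 * uninitialized completion. */
static void thin_ctr(struct thin_c *tc)
{
    atomic_init(&tc->ref_count, 1);  /* initialize the refcount... */
    tc->completion_ready = 1;        /* ...and the completion... */
    tc->next = active_thins;         /* ...before linking into the list */
    active_thins = tc;               /* (done under the pool lock in the kernel) */
}
```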
    • dm thin: fix missing out-of-data-space to write mode transition if blocks are released · 2c43fd26
      Committed by Joe Thornber
      Discard bios and thin device deletion have the potential to release data
      blocks.  If the thin-pool is in out-of-data-space mode, and blocks were
      released, transition the thin-pool back to full write mode.
      
      The correct time to do this is just after the thin-pool metadata commit.
      It cannot be done before the commit because the space maps will not
      allow immediate reuse of the data blocks in case there's a rollback
      following power failure.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: fix inability to discard blocks when in out-of-data-space mode · 45ec9bd0
      Committed by Joe Thornber
      When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
      function pointer was incorrectly being set to
      process_prepared_discard_passdown rather than process_prepared_discard.
      
      This incorrect function pointer meant the discard was being passed down,
      but not affecting the mapping.  As such any discard that was issued, in
      an attempt to reclaim blocks, would not successfully free data space.
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  7. 11 Dec 2014, 1 commit
    • md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. · f851b60d
      Committed by NeilBrown
      A recent change to md started the ->sync_thread asynchronously
      from a work_queue rather than synchronously.  This means that there
      can be a small window between the time when MD_RECOVERY_RUNNING is set
      and when ->sync_thread is set.
      
      So code that checks ->sync_thread might now conclude that the thread
      has not been started and (because a lock is held) will not be started.
      That is no longer the case.
      
      Most of those places are best fixed by testing MD_RECOVERY_RUNNING
      as well.  To make this completely reliable, we wake_up(&resync_wait)
      after clearing that flag as well as after clearing ->sync_thread.
      
      Other places are better served by flushing the relevant workqueue
      to ensure that if the sync thread was starting, it has now
      started.  This is particularly important if we are about to stop the
      sync thread.
      
      Fixes: ac05f256
      Signed-off-by: NeilBrown <neilb@suse.de>
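The "check both" rule can be sketched as follows; the flag value and struct are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MD_RECOVERY_RUNNING (1ul << 0)

struct mddev_s {
    unsigned long flags;
    void *sync_thread;   /* set a little after the flag, by a workqueue */
};

/* Checking only ->sync_thread misses the window in which the flag is
 * already set but the thread pointer is not yet; test both. */
static bool resync_active(const struct mddev_s *m)
{
    return (m->flags & MD_RECOVERY_RUNNING) || m->sync_thread != NULL;
}
```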
  8. 03 Dec 2014, 2 commits
    • md: fix semicolon.cocci warnings · 7d7e64f2
      Committed by kbuild test robot
      drivers/md/md.c:7175:43-44: Unneeded semicolon
      
       Removes unneeded semicolon.
      
      Generated by: scripts/coccinelle/misc/semicolon.cocci
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a
      Committed by NeilBrown
      It is critical that fetch_block() and handle_stripe_dirtying()
      are consistent in their analysis of what needs to be loaded.
      Otherwise raid5 can wait forever for a block that won't be loaded.
      
      Currently when writing to a RAID5 that is resyncing, to a location
      beyond the resync offset, handle_stripe_dirtying chooses a
      reconstruct-write cycle, but fetch_block() assumes a
      read-modify-write, and a lockup can happen.
      
      So treat that case just like RAID6, just as we do in
      handle_stripe_dirtying.  RAID6 always does reconstruct-write.
      
      This bug was introduced when the behaviour of handle_stripe_dirtying
      was changed in 3.7, so the patch is suitable for any kernel since,
      though it will need careful merging for some versions.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Fixes: a7854487
      Reported-by: Henry Cai <henryplusplus@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 02 Dec 2014, 12 commits
  10. 24 Nov 2014, 3 commits
  11. 22 Nov 2014, 1 commit
    • dm thin: fix pool_io_hints to avoid looking at max_hw_sectors · d200c30e
      Committed by Mike Snitzer
      Simplify the pool_io_hints code that works to establish a max_sectors
      value that is a power-of-2 factor of the thin-pool's blocksize.  The
      biggest associated improvement is that the DM thin-pool is no longer
      concerning itself with the data device's max_hw_sectors when adjusting
      max_sectors.
      
      This fixes the relative fragility of the original "dm thin: adjust
      max_sectors_kb based on thinp blocksize" commit that only became
      apparent when testing was performed using a DM thin-pool on top of a
      virtio_blk device.  One proposed upstream patch detailed the problems
      inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
      
      So even though virtio_blk incorrectly set its max_hw_sectors it actually
      helped make it clear that we need DM thinp to be tolerant of any future
      Linux driver that incorrectly sets max_hw_sectors.
      
      We only need to be concerned with modifying the thin-pool device's
      max_sectors limit if it is smaller than the thin-pool's blocksize.  In
      this case the value of max_sectors does become a limiting factor when
      upper layers (e.g. filesystems) construct their bios.  But if the
      hardware can support IOs larger than the thin-pool's blocksize the user
      is encouraged to adjust the thin-pool's data device's max_sectors
      accordingly -- doing so will enable the thin-pool to inherit the
      established user-defined max_sectors.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
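A hypothetical helper showing the kind of adjustment being described: clamp max_sectors to the largest power of two that divides the pool blocksize, shrinking but never growing the limit (the real pool_io_hints() logic differs in detail):

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two dividing n: isolate the lowest set bit. */
static uint32_t largest_pow2_factor(uint32_t n)
{
    return n & -n;
}

/* If max_sectors already covers a whole pool block, the blocksize itself
 * is the ideal IO size.  Otherwise pick the biggest power-of-2 factor of
 * the blocksize that still fits under max_sectors, so bios built by upper
 * layers always tile pool blocks evenly. */
static uint32_t adjust_max_sectors(uint32_t max_sectors, uint32_t pool_blocksize)
{
    if (max_sectors >= pool_blocksize)
        return pool_blocksize;
    uint32_t p = largest_pow2_factor(pool_blocksize);
    while (p > max_sectors)
        p >>= 1;          /* every smaller power of two also divides blocksize */
    return p;
}
```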
  12. 20 Nov 2014, 4 commits
    • dm thin: suspend/resume active thin devices when reloading thin-pool · 583024d2
      Committed by Mike Snitzer
      Before this change it was expected that userspace would first suspend
      all active thin devices, reload/resize the thin-pool target, then resume
      all active thin devices.  Now the thin-pool suspend/resume will trigger
      the suspend/resume of all active thins via appropriate calls to
      dm_internal_suspend and dm_internal_resume.
      
      Store the mapped_device for each thin device in struct thin_c to make
      these calls possible.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm: enhance internal suspend and resume interface · ffcc3936
      Committed by Mike Snitzer
      Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
      -- dm-stats will continue using these methods to avoid all the extra
      suspend/resume logic that is not needed in order to quickly flush IO.
      
      Introduce dm_internal_suspend_noflush() variant that actually calls the
      mapped_device's target callbacks -- otherwise target-specific hooks are
      avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend).  Common
      code between dm_internal_{suspend_noflush,resume} and
      dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
      
      Update dm_internal_{suspend_noflush,resume} to always take and release
      the mapped_device's suspend_lock.  Also update dm_{suspend,resume} to be
      aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
      accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
      be cleared.  Add lockdep annotation to dm_suspend() and dm_resume().
      
      The existing DM_SUSPEND_FLAG remains unchanged.
      DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
      cleared by dm_internal_resume().
      
      Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
      was already suspended when dm_internal_suspend_noflush() was called --
      this can be thought of as a "nested suspend".  A "nested suspend" can
      occur with legacy userspace dm-thin code that might suspend all active
      thin volumes before suspending the pool for resize.
      
      But otherwise, in the normal dm-thin-pool suspend case moving forward:
      the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
      that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
      
      Also add DM_INTERNAL_SUSPEND_FLAG to status report.  This new
      DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
      debugging (e.g. 'dmsetup info' will report an internally suspended
      device accordingly).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm thin: do not allow thin device activation while pool is suspended · 80e96c54
      Committed by Mike Snitzer
      Otherwise IO could be issued to the pool while it is suspended.
      
      Care was taken to properly interlock between the thin and thin-pool
      targets when accessing the pool's 'suspended' flag.  The thin_ctr will
      not add a new thin device to the pool's active_thins list if the pool is
      suspended.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm: add presuspend_undo hook to target_type · d67ee213
      Committed by Mike Snitzer
      The DM thin-pool target now must undo the changes performed during
      pool_presuspend() so introduce presuspend_undo hook in target_type.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>