1. 31 7月, 2012 1 次提交
    • N
      md: remove plug_cnt feature of plugging. · 0021b7bc
      NeilBrown 提交于
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally and
      testing to try to exercise the case it is most likely to help did not
      show any performance difference by removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the other for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0021b7bc
  2. 20 7月, 2012 3 次提交
    • M
      dm raid1: set discard_zeroes_data_unsupported · 7c8d3a42
      Mikulas Patocka 提交于
      We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
      the underlying disks support zero on discard.  So this patch sets
      ti->discard_zeroes_data_unsupported.
      
      For example, if the mirror is in the process of resynchronizing, it may
      happen that kcopyd reads a piece of data, then discard is sent on the
      same area and then kcopyd writes the piece of data to another leg.
      Consequently, the data is not zeroed.
      
      The flag was made available by commit 983c7db3
      (dm crypt: always disable discard_zeroes_data).
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      7c8d3a42
    • M
      dm thin: do not send discards to shared blocks · 650d2a06
      Mikulas Patocka 提交于
      When process_discard receives a partial discard that doesn't cover a
      full block, it sends this discard down to that block. Unfortunately, the
      block can be shared and the discard would corrupt the other snapshots
      sharing this block.
      
      This patch detects block sharing and ends the discard with success when
      sending it to the shared block.
      
      The above change means that if the device supports discard it can't be
      guaranteed that a discard request zeroes data. Therefore, we set
      ti->discard_zeroes_data_unsupported.
      
      Thin target discard support with this bug arrived in commit
      104655fd (dm thin: support discards).
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      650d2a06
    • M
      dm raid1: fix crash with mirror recovery and discard · 751f188d
      Mikulas Patocka 提交于
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests is always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts.  So it violates the
      rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
      corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
      
      Reference: https://bugzilla.redhat.com/837607Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      751f188d
  3. 19 7月, 2012 3 次提交
    • N
      md/raid1: close some possible races on write errors during resync · 58e94ae1
      NeilBrown 提交于
      commit 4367af55
         md/raid1: clear bad-block record when write succeeds.
      
      Added a 'reschedule_retry' call possibility at the end of
      end_sync_write, but didn't add matching code at the end of
      sync_request_write.  So if the writes complete very quickly, or
      scheduling makes it seem that way, then we can miss rescheduling
      the request and the resync could hang.
      
      Also commit 73d5c38a
          md: avoid races when stopping resync.
      
      Fix a race condition in this same code in end_sync_write but didn't
      make the change in sync_request_write.
      
      This patch updates sync_request_write to fix both of those.
      Patch is suitable for 3.1 and later kernels.
      Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Original-version-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      58e94ae1
    • N
      md: avoid crash when stopping md array races with closing other open fds. · a05b7ea0
      NeilBrown 提交于
      md will refuse to stop an array if any other fd (or mounted fs) is
      using it.
      When any fs is unmounted of when the last open fd is closed all
      pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
      so there will be no pending IO to worry about when the array is
      stopped.
      
      However in order to send the STOP_ARRAY ioctl to stop the array one
      must first get and open fd on the block device.
      If some fd is being used to write to the block device and it is closed
      after mdadm open the block device, but before mdadm issues the
      STOP_ARRAY ioctl, then there will be no last-close on the md device so
      __blkdev_put will not call sync_blockdev.
      
      If this happens, then IO can still be in-flight while md tears down
      the array and bad things can happen (use-after-free and subsequent
      havoc).
      
      So in the case where do_md_stop is being called from an open file
      descriptor, call sync_block after taking the mutex to ensure there
      will be no new openers.
      
      This is needed when setting a read-write device to read-only too.
      
      Cc: stable@vger.kernel.org
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a05b7ea0
    • N
      md: fix bug in handling of new_data_offset · 25f7fd47
      NeilBrown 提交于
      commit c6563a8c
          md: add possibility to change data-offset for devices.
      
      introduced a 'new_data_offset' attribute which should normally
      be the same as 'data_offset', but can be explicitly set to a different
      value to allow a reshape operation to move the data.
      
      Unfortunately when the 'data_offset' is explicitly set through
      sysfs, the new_data_offset is not also set, so the two would become
      out-of-sync incorrectly.
      
      One result of this is that trying to set the 'size' after the
      'data_offset' would fail because it is not permitted to set the size
      when the 'data_offset' and 'new_data_offset' are different - as that
      can be confusing.
      Consequently when mdadm tried to do this while assembling an IMSM
      array it would fail.
      
      This bug was introduced in 3.5-rc1.
      Reported-by: NBrian Downing <bdowning@lavos.net>
      Bisected-by: NBrian Downing <bdowning@lavos.net>
      Tested-by: NBrian Downing <bdowning@lavos.net>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      25f7fd47
  4. 09 7月, 2012 1 次提交
    • N
      md/raid1: fix use-after-free bug in RAID1 data-check code. · 2d4f4f33
      NeilBrown 提交于
      This bug has been present ever since data-check was introduce
      in 2.6.16.  However it would only fire if a data-check were
      done on a degraded array, which was only possible if the array
      has 3 or more devices.  This is certainly possible, but is quite
      uncommon.
      
      Since hot-replace was added in 3.3 it can happen more often as
      the same condition can arise if not all possible replacements are
      present.
      
      The problem is that as soon as we submit the last read request, the
      'r1_bio' structure could be freed at any time, so we really should
      stop looking at it.  If the last device is being read from we will
      stop looking at it.  However if the last device is not due to be read
      from, we will still check the bio pointer in the r1_bio, but the
      r1_bio might already be free.
      
      So use the read_targets counter to make sure we stop looking for bios
      to submit as soon as we have submitted them all.
      
      This fix is suitable for any -stable kernel since 2.6.16.
      
      Cc: stable@vger.kernel.org
      Reported-by: NArnold Schulz <arnysch@gmx.net>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2d4f4f33
  5. 04 7月, 2012 1 次提交
  6. 03 7月, 2012 17 次提交
    • M
      dm persistent data: fix allocation failure in space map checker init · b0239faa
      Mike Snitzer 提交于
      If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and memory is fragmented and a
      sufficiently-large metadata device is used in a thin pool then the space
      map checker will fail to allocate the memory it requires.
      
      Switch from kmalloc to vmalloc to allow larger virtually contiguous
      allocations for the space map checker's internal count arrays.
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      b0239faa
    • M
      dm persistent data: handle space map checker creation failure · 62662303
      Mike Snitzer 提交于
      If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create()
      fails, dm_tm_create_internal() would still return success even though it
      cleaned up all resources it was supposed to have created.  This will
      lead to a kernel crash:
      
      general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
      ...
      RIP: 0010:[<ffffffff81593659>]  [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20
      Call Trace:
        [<ffffffff81599bae>] dm_bm_block_size+0xe/0x10
        [<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0
        [<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0
        [<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160
        [<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0
        [<ffffffff815aa010>] pool_ctr+0x3f0/0x900
        [<ffffffff8158d565>] dm_table_add_target+0x195/0x450
        [<ffffffff815904c4>] table_load+0xe4/0x330
        [<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0
        [<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20
        [<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560
        [<ffffffff8116aa51>] sys_ioctl+0x91/0xa0
        [<ffffffff81869f52>] system_call_fastpath+0x16/0x1b
      
      Fix the space map checker code to return an appropriate ERR_PTR and have
      dm_sm_disk_create() and dm_tm_create_internal() check for it with
      IS_ERR.
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      62662303
    • M
      dm persistent data: fix shadow_info_leak on dm_tm_destroy · 25d7cd6f
      Mike Snitzer 提交于
      Cleanup the shadow table before destroying the transaction manager.
      
      Reference: leak was identified with kmemleak when running
      test_discard_random_sectors in the thinp-test-suite.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      25d7cd6f
    • J
      dm thin: commit metadata before creating metadata snapshot · 0d200aef
      Joe Thornber 提交于
      Userland sometimes sees a corrupt metadata block if metadata is changing
      rapidly when a metadata snapshot is reserved for userland,  To make the
      problem go away, commit before we take the metadata snapshot (which is a
      sensible thing to do anyway).
      
      The checksums mean userland spots this corruption immediately so there's
      no risk of acting on incorrect data.  No corruption exists from the
      kernel's point of view, and thin_check passes after pool shutdown.
      
      I believe this is to do with shared blocks at the first level of the
      {device, mapping} btree.  Prior to the metadata-snap support no sharing
      at this level was possible, so this patch is only required after commit
      cc8394d8 ("dm thin: provide userspace
      access to pool metadata").
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      0d200aef
    • N
      md: fix up plugging (again). · b357f04a
      NeilBrown 提交于
      The value returned by "mddev_check_plug" is only valid until the
      next 'schedule' as that will unplug things.  This could happen at any
      call to mempool_alloc.
      So just calling mddev_check_plug at the start doesn't really make
      sense.
      
      So call it just before, or just after, queuing things for the thread.
      As the action that happens at unplug is to wake the thread, this makes
      lots of sense.
      If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
      wake thread immediately.
      
      RAID5 is a bit different.  Requests are queued for the thread and the
      thread is woken by release_stripe.  So we don't need to wake the
      thread on failure.
      However the thread doesn't perform certain actions when there is any
      active plug, so it is important to install a plug before waking the
      thread.  So for RAID5 we install the plug *before* queuing the request
      and waking the thread.
      
      Without this patch it is possible for raid1 or raid10 to queue a
      request without then waking the thread, resulting in the array locking
      up.
      
      Also change raid10 to only flush_pending_write when there are not
      active plugs, just like raid1.
      
      This patch is suitable for 3.0 or later.  I plan to submit it to
      -stable, but I'll like to let it spend a few weeks in mainline
      first to be sure it is completely safe.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b357f04a
    • N
      md: support re-add of recovering devices. · f4563091
      NeilBrown 提交于
      We currently only allow a device to be re-added if it appear to be
      in-sync.  This is overly restrictive as it may be desirable to re-add
      a device that is in the middle of recovery.
      
      So remove the test for "InSync" - the test on rdev->raid_disk is
      sufficient to ensure that the re-add will succeed.
      Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Tested-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f4563091
    • N
      md/raid1: fix bug in read_balance introduced by hot-replace · 32644afd
      NeilBrown 提交于
      When we added hot_replace we doubled the number of devices
      that could be in a RAID1 array.  So we doubled how far read_balance
      would search.  Unfortunately we didn't double the point at which
      it looped back to the beginning - so it effectively loops over
      all non-replacement disks twice.
      This doesn't cause bad behaviour, but it pointless and means we
      never read from replacement devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      32644afd
    • S
      raid5: delayed stripe fix · fab363b5
      Shaohua Li 提交于
      There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
      the two bits have relationship. A delayed stripe can be moved to hold list only
      when preread active stripe count is below IO_THRESHOLD. If a stripe has both
      the bits set, such stripe will be in delayed list and preread count not 0,
      which will make such stripe never leave delayed list.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fab363b5
    • M
      md/raid456: When read error cannot be recovered, record bad block · 2e8ac303
      majianpeng 提交于
      We may not be able to fix a bad block if:
       - the array is degraded
       - the over-write fails.
      
      In these cases we currently eject the device, but we should
      record a bad block if possible.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2e8ac303
    • N
      md: make 'name' arg to md_register_thread non-optional. · 0232605d
      NeilBrown 提交于
      Having the 'name' arg optional and defaulting to the current
      personality name is no necessary and leads to errors, as when
      changing the level of an array we can end up using the
      name of the old level instead of the new one.
      
      So make it non-optional and always explicitly pass the name
      of the level that the array will be.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0232605d
    • N
      md/raid10: fix failure when trying to repair a read error. · 055d3747
      NeilBrown 提交于
      commit 58c54fcc
           md/raid10: handle further errors during fix_read_error better.
      
      in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
      But we were passing the IO size in bytes!!!
      This resulting in bio_add_page failing, and empty request being sent
      down, and a consequent BUG_ON in scsi_lib.
      
      [fix missing space in error message at same time]
      
      This fix is suitable for 3.1.y and later.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Balzer <chibi@gol.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      055d3747
    • N
      md/raid5: fix refcount problem when blocked_rdev is set. · 5f066c63
      NeilBrown 提交于
      commit 43220aa0
          md/raid5: fix a hang on device failure.
      
      fixed a hang, but introduced a refcounting in-balance so
      that if the presence of bad-blocks ever caused an rdev to
      be 'blocked' we would increment the refcount on the rdev and
      never decrement it.
      
      So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
      is not called.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5f066c63
    • M
      md:Add blk_plug in sync_thread. · 7c2c57c9
      majianpeng 提交于
      Add blk_plug in sync_thread will increase the performance of sync.
      Because sync_thread did not blk_plug,so when raid sync, the bio merge
      not well.
      
      Testing environment:
      SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
      Controller.
      OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
      x86_64 x86_64 x86_64 GNU/Linux.
      RAID5: four ST31000524NS disk.
      
      Without blk_plug:recovery speed about 63M/Sec;
      Add blk_plug:recovery speed about 120M/Sec.
      
      Using blktrace:
      blktrace -d /dev/sdb -w 60  -o -|blkparse -i -
      
      without blk_plug:
      Total (8,16):
       Reads Queued:      309811,     1239MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:   283583,     1189MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:   273351,     1149MiB	 Writes Completed:        0,        0KiB
       Read Merges:        23533,    94132KiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:           0
      
      add blk_plug:
      Total (8,16):
       Reads Queued:      428697,     1714MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:     3954,     1714MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:     3956,     1715MiB	 Writes Completed:        0,        0KiB
       Read Merges:       424743,     1698MiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:        3384
      
      The ratio of merge will be markedly increased.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7c2c57c9
    • M
      md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev · 1850753d
      majianpeng 提交于
      In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
      nr_pending so we lose the reference we hold on the rdev.
      So atomic_inc it first to maintain the reference.
      
      This bug was introduced by commit  73e92e51
          md/raid5.  Don't write to known bad block on doubtful devices.
      
      which appeared in 3.0, so patch is suitable for stable kernels since
      then.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1850753d
    • M
      md/raid5: Do not add data_offset before call to is_badblock · 6c0544e2
      majianpeng 提交于
      In chunk_aligned_read() we are adding data_offset before calling
      is_badblock.  But is_badblock also adds data_offset, so that is bad.
      
      So move the addition of data_offset to after the call to
      is_badblock.
      
      This bug was introduced by commit 31c176ec
           md/raid5: avoid reading from known bad blocks.
      which first appeared in 3.0.  So that patch is suitable for any
      -stable kernel from 3.0.y onwards.  However it will need minor
      revision for most of those (as the comment didn't appear until
      recently).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6c0544e2
    • N
      md/raid5: prefer replacing failed devices over want-replacement devices. · 5cfb22a1
      NeilBrown 提交于
      If a RAID5 has both a failed device and a device marked as
      'WantReplacement', then we should preferentially replace the failed
      device.
      However the current code replaces whichever is found first.
      So split into 2 loops, check fail failed/missing first, and only check
      for WantReplacement if nothing is failed or missing.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5cfb22a1
    • N
      md/raid10: Don't try to recovery unmatched (and unused) chunks. · fc448a18
      NeilBrown 提交于
      If a RAID10 has an odd number of chunks - as might happen when there
      are an odd number of devices - the last chunk has no pair and so is
      not mirrored.  We don't store data there, but when recovering the last
      device in an array we retry to recover that last chunk from a
      non-existent location.  This results in an error, and the recovery
      aborts.
      
      When we get to that last chunk we should just stop - there is nothing
      more to do anyway.
      
      This bug has been present since the introduction of RAID10, so the
      patch is appropriate for any -stable kernel.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Balzer <chibi@gol.com>
      Tested-by: NChristian Balzer <chibi@gol.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fc448a18
  7. 03 6月, 2012 5 次提交
    • J
      dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber 提交于
      This patch implements two new messages that can be sent to the thin
      pool target allowing it to take a snapshot of the _metadata_.  This,
      read-only snapshot can be accessed by userland, concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asyncronous replication.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      cc8394d8
    • M
      dm thin: use slab mempools · a24c2569
      Mike Snitzer 提交于
      Use dedicated caches prefixed with a "dm_" name rather than relying on
      kmalloc mempools backed by generic slab caches so the memory usage of
      thin provisioning (and any leaks) can be accounted for independently.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a24c2569
    • M
      dm mpath: allow ioctls to trigger pg init · 35991652
      Mikulas Patocka 提交于
      After the failure of a group of paths, any alternative paths that
      need initialising do not become available until further I/O is sent to
      the device.  Until this has happened, ioctls return -EAGAIN.
      
      With this patch, new paths are made available in response to an ioctl
      too.  The processing of the ioctl gets delayed until this has happened.
      
      Instead of returning an error, we submit a work item to kmultipathd
      (that will potentially activate the new path) and retry in ten
      milliseconds.
      
      Note that the patch doesn't retry an ioctl if the ioctl itself fails due
      to a path failure.  Such retries should be handled intelligently by the
      code that generated the ioctl in the first place, noting that some SCSI
      commands should not be retried because they are not idempotent (XOR write
      commands).  For commands that could be retried, there is a danger that
      if the device rejected the SCSI command, the path could be errorneously
      marked as failed, and the request would be retried on another path which
      might fail too.  It can be determined if the failure happens on the
      device or on the SCSI controller, but there is no guarantee that all
      SCSI drivers set these flags correctly.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      35991652
    • M
      dm mpath: delay retry of bypassed pg · f220fd4e
      Mike Christie 提交于
      If I/O needs retrying and only bypassed priority groups are available,
      set the pg_init_delay_retry flag to wait before retrying.
      
      If, for example, the reason for the bypass is that the controller is
      getting reset or there is a firmware upgrade happening, retrying right
      away would cause a flood of log messages and retries for what could be a
      few seconds or even several minutes.
      Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f220fd4e
    • M
      dm mpath: reduce size of struct multipath · 1fbdd2b3
      Mike Snitzer 提交于
      Move multipath structure's 'lock' and 'queue_size' members to eliminate
      two 4-byte holes.  Also use a bit within a single unsigned int for each
      existing flag (saves 8-bytes).  This allows future flags to be added
      without each consuming an unsigned int.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      1fbdd2b3
  8. 31 5月, 2012 1 次提交
    • N
      md: raid1/raid10: fix problem with merge_bvec_fn · aba336bd
      NeilBrown 提交于
      The new merge_bvec_fn which calls the corresponding function
      in subsidiary devices requires that mddev->merge_check_needed
      be set if any child has a merge_bvec_fn.
      
      However were were only setting that when a device was hot-added,
      not when a device was present from the start.
      
      This bug was introduced in 3.4 so patch is suitable for 3.4.y
      kernels.  However that are conflicts in raid10.c so a separate
      patch will be needed for 3.4.y.
      
      Cc: stable@vger.kernel.org
      Reported-by: NSebastian Riemer <sebastian.riemer@profitbricks.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      aba336bd
  9. 22 5月, 2012 8 次提交
    • N
      md/bitmap: record the space available for the bitmap in the superblock. · 1dff2b87
      NeilBrown 提交于
      Now that bitmaps can grow and shrink it is best if we record
      how much space is available.  This means that when
      we reduce the size of the bitmap we won't "lose" the space
      for late when we might want to increase the size of the bitmap
      again.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1dff2b87
    • N
      md/raid10: Remove extras after reshape to smaller number of devices. · 63aced61
      NeilBrown 提交于
      When a reshape which reduced the number of devices finishes
      we must remove the extra devices.
      
      So ensure  that raid10_remove_disk won't try to keep them, and
      have raid10_finish_reshape clear the 'in_sync' flag.  Then
      remove_and_add_spares will be able to remove them.
      Reported-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      63aced61
    • N
      md/raid5: improve removal of extra devices after reshape. · da7613b8
      NeilBrown 提交于
      After a reshape which reduced the number of devices we need
      to disconnect the extra devices.
      The code for this doesn't currently handle 'replacement' devices.
      It is very unlikely that such devices will be present, but it is
      safest to handle them anyway.
      
      So simplify the handling.  Just clear In_sync and leave it
      to remove_and_add_spaces (which will be called soon) to do
      the real works.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      da7613b8
    • Y
      md: check the return of mddev_find() · 0c098220
      Yuanhan Liu 提交于
      Check the return of mddev_find(), since it may fail due to out of
      memeory or out of usable minor number.
      
      The reason I chose -ENODEV instead of -ENOMEM or something else is
      md_alloc() function chose that ;)
      Signed-off-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0c098220
    • J
      MD RAID1: Further conditionalize 'fullsync' · 4f0a5e01
      Jonathan Brassow 提交于
      A RAID1 device does not necessarily need a fullsync if the bitmap can be used instead.
      
      Similar to commit d6b212f4 in raid5.c, if a raid1
      device can be brought back (i.e. from a transient failure) it shouldn't need a
      complete resync.  Provided the bitmap is not to old, it will have recorded the areas
      of the disk that need recovery.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4f0a5e01
    • J
      DM RAID: Use md_error() in place of simply setting Faulty bit · c32fb9e7
      Jonathan Brassow 提交于
      When encountering an error while reading the superblock, call md_error.
      
      We are currently setting the 'Faulty' bit on one of the array devices when an
      error is encountered while reading the superblock of a dm-raid array.  We should
      be calling md_error(), as it handles the error more completely.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c32fb9e7
    • J
      DM RAID: Record and handle missing devices · 81f382f9
      Jonathan Brassow 提交于
      Missing dm-raid devices should be recorded in the superblock
      
      When specifying the devices that compose a DM RAID array, it is possible to denote
      failed or missing devices with '-'s.  When this occurs, we must record this in the
      superblock.  We do this by checking if the array position's data device is missing
      and then forcing MD to record the superblock by setting 'MD_CHANGE_DEVS' in
      'raid_resume'.  If we do not cause the superblock to be rewritten by the resume
      function, it is possible for a stale superblock to be written by an out-going
      in-active table (during 'raid_dtr').
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      81f382f9
    • J
      DM RAID: Set recovery flags on resume · 47525e59
      Jonathan Brassow 提交于
      Properly initialize MD recovery flags when resuming device-mapper devices.
      
      When a device-mapper device is suspended, all I/O must stop.  This is done by
      calling 'md_stop_writes' and 'mddev_suspend'.  These calls in-turn manipulate
      the recovery flags - including setting 'MD_RECOVERY_FROZEN'.  The DM device
      may have been suspended while recovery was not yet complete, so the process
      needs to pick-up where it left off.  Since 'mddev_resume' does not unset
      'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
      'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
      must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
      (e.g.  It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
      setting 'MD_RECOVERY_FROZEN'.  Clearing it in 'mddev_resume' would override the
      desired behavior.)
      
      Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
      there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.
      
      Also clean up where  level_store calls mddev_resume() - it current
      duplicates some of the funcitons of that call. - NB
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      47525e59