1. 30 Oct 2012 (1 commit)
  2. 11 Oct 2012 (7 commits)
  3. 19 Sep 2012 (1 commit)
    • md: make sure metadata is updated when spares are activated or removed. · 6dafab6b
      Authored by NeilBrown
      It isn't always necessary to update the metadata when spares are
      removed as the presence-or-not of a spare isn't really important to
      the integrity of an array.
      Also activating a spare doesn't always require updating the metadata
      as the update on 'recovery-completed' is usually sufficient.
      
      However the introduction of 'replacement' devices has made these
      transitions sometimes more important.  For example the 'Replacement'
      flag isn't cleared until the original device is removed, so we need
      to ensure a metadata update after that 'spare' is removed.
      
      So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
      complement the current situation where it is set when a spare is added
      or a device is failed (or a number of other less common situations).
      
      This is suitable for -stable as out-of-date metadata could lead
      to data corruption.
      This is only relevant for 3.3 and later, when 'replacement' was
      introduced.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
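      The essence of the fix is a one-line flag set in the spare activation
      and removal paths; a minimal sketch (the surrounding function is an
      assumption, not quoted from the patch):

        /* after a spare is activated or removed, mark the metadata
         * dirty so the superblocks get rewritten */
        set_bit(MD_CHANGE_DEVS, &mddev->flags);
        md_wakeup_thread(mddev->thread);   /* let the md thread write it out */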
  4. 09 Sep 2012 (3 commits)
  5. 16 Aug 2012 (1 commit)
    • md: Don't truncate size at 4TB for RAID0 and Linear · 667a5313
      Authored by NeilBrown
      commit 27a7b260
         md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
      
      changed 0.90 metadata handling to truncate the size to 4TB, as that is
      all that 0.90 can record.
      However for RAID0 and Linear, 0.90 doesn't need to record the size, so
      this truncation is not needed and causes working arrays to become too small.
      
      So avoid the truncation for RAID0 and Linear.
      
      This bug was introduced in 3.1 and is suitable for any stable kernels
      from then onwards.
      As the offending commit was tagged for 'stable', any stable kernel
      that it was applied to should also get this patch.  That includes
      at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
      providing that list).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
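      A sketch of the guarded clamp in the 0.90 superblock handling; the
      exact context is assumed, but the idea is to skip the 4TB truncation
      for levels where 0.90 metadata does not need to record the size:

        /* only clamp to 4TB where 0.90 must record the device size;
         * RAID0 (level 0) and Linear (negative level) are exempt */
        if (rdev->sectors >= (2ULL << 32) && sb->level >= 1)
                rdev->sectors = (2ULL << 32) - 2;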
  6. 31 Jul 2012 (4 commits)
    • blk: pass from_schedule to non-request unplug functions. · 74018dc3
      Authored by NeilBrown
      This will allow md/raid to know why the unplug was called,
      and will be able to act accordingly - if !from_schedule it
      is safe to perform tasks which could themselves schedule.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
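      A hedged sketch of what a callback can now do with the extra flag
      (the md_unplug name and body here are illustrative, not the patch):

        static void md_unplug(struct blk_plug_cb *cb, bool from_schedule)
        {
                /* !from_schedule means ordinary process context, so work
                 * that could itself schedule is safe to run directly */
                if (from_schedule)
                        return;   /* defer anything that might sleep */
                /* ... perform tasks that may sleep ... */
        }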
    • blk: centralize non-request unplug handling. · 9cbb1750
      Authored by NeilBrown
      Both md and umem have similar code for getting notified on a
      blk_finish_plug event.
      Centralize this code in block/ and allow each driver to
      provide its distinctive difference.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
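      A usage sketch of the helper this commit introduces; the my_unplug
      callback and my_data are assumed driver-side names:

        struct blk_plug_cb *cb;

        /* attach (or find) our callback on the task's current plug;
         * returns NULL if the task is not currently plugged */
        cb = blk_check_plugged(my_unplug, my_data, sizeof(*cb));
        if (!cb) {
                /* no plug active - do the work immediately */
        }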
    • md: remove plug_cnt feature of plugging. · 0021b7bc
      Authored by NeilBrown
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally and
      testing to try to exercise the case it is most likely to help did not
      show any performance difference by removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the way for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: remove duplicated test on ->openers when calling do_md_stop() · 90cf195d
      Authored by NeilBrown
      do_md_stop tests mddev->openers while holding ->open_mutex,
      and fails if this count is too high.
      So callers do not need to check mddev->openers, and doing so isn't
      very meaningful as they don't hold ->open_mutex, so the count could
      change.
      
      So remove the unnecessary tests on mddev->openers.
      These are not called often enough for there to be any gain in
      an early test on ->open_mutex to avoid the need for a slightly more
      costly mutex_lock call.
      Signed-off-by: NeilBrown <neilb@suse.de>
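      A simplified sketch of the authoritative test inside do_md_stop,
      which makes the caller-side checks redundant:

        mutex_lock(&mddev->open_mutex);
        if (atomic_read(&mddev->openers) > !!bdev) {
                /* someone else still holds the device open - refuse */
                mutex_unlock(&mddev->open_mutex);
                return -EBUSY;
        }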
  7. 19 Jul 2012 (2 commits)
    • md: avoid crash when stopping md array races with closing other open fds. · a05b7ea0
      Authored by NeilBrown
      md will refuse to stop an array if any other fd (or mounted fs) is
      using it.
      When any fs is unmounted or when the last open fd is closed, all
      pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
      so there will be no pending IO to worry about when the array is
      stopped.
      
      However in order to send the STOP_ARRAY ioctl to stop the array one
      must first get an open fd on the block device.
      If some fd is being used to write to the block device and it is closed
      after mdadm opens the block device, but before mdadm issues the
      STOP_ARRAY ioctl, then there will be no last-close on the md device so
      __blkdev_put will not call sync_blockdev.
      
      If this happens, then IO can still be in-flight while md tears down
      the array and bad things can happen (use-after-free and subsequent
      havoc).
      
      So in the case where do_md_stop is being called from an open file
      descriptor, call sync_blockdev after taking the mutex to ensure there
      will be no new openers.
      
      This is needed when setting a read-write device to read-only too.
      
      Cc: stable@vger.kernel.org
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
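      The fix in sketch form: flush the block device once the mutex
      guarantees no new openers can race in (simplified from the
      do_md_stop/md_set_readonly paths):

        mutex_lock(&mddev->open_mutex);
        if (bdev)
                /* flush IO from fds closed after our open - their
                 * last-close did not trigger sync_blockdev */
                sync_blockdev(bdev);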
    • md: fix bug in handling of new_data_offset · 25f7fd47
      Authored by NeilBrown
      commit c6563a8c
          md: add possibility to change data-offset for devices.
      
      introduced a 'new_data_offset' attribute which should normally
      be the same as 'data_offset', but can be explicitly set to a different
      value to allow a reshape operation to move the data.
      
      Unfortunately when the 'data_offset' is explicitly set through
      sysfs, the new_data_offset is not also set, so the two could
      incorrectly fall out of sync.
      
      One result of this is that trying to set the 'size' after the
      'data_offset' would fail because it is not permitted to set the size
      when the 'data_offset' and 'new_data_offset' are different - as that
      can be confusing.
      Consequently when mdadm tried to do this while assembling an IMSM
      array it would fail.
      
      This bug was introduced in 3.5-rc1.
      Reported-by: Brian Downing <bdowning@lavos.net>
      Bisected-by: Brian Downing <bdowning@lavos.net>
      Tested-by: Brian Downing <bdowning@lavos.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
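      The essence of the fix, sketched (the sysfs store handler for
      'offset' is assumed context):

        rdev->data_offset = offset;
        rdev->new_data_offset = offset;   /* keep in step on explicit writes */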
  8. 03 Jul 2012 (3 commits)
    • md: support re-add of recovering devices. · f4563091
      Authored by NeilBrown
      We currently only allow a device to be re-added if it appears to be
      in-sync.  This is overly restrictive as it may be desirable to re-add
      a device that is in the middle of recovery.
      
      So remove the test for "InSync" - the test on rdev->raid_disk is
      sufficient to ensure that the re-add will succeed.
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make 'name' arg to md_register_thread non-optional. · 0232605d
      Authored by NeilBrown
      Having the 'name' arg optional and defaulting to the current
      personality name is not necessary and leads to errors, as when
      changing the level of an array we can end up using the
      name of the old level instead of the new one.
      
      So make it non-optional and always explicitly pass the name
      of the level that the array will be.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Add blk_plug in sync_thread. · 7c2c57c9
      Authored by majianpeng
      Adding blk_plug in sync_thread increases the performance of sync.
      Because sync_thread did not use blk_plug, bios were not merged well
      during RAID sync.
      
      Testing environment:
      SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
      Controller.
      OS: Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
      x86_64 x86_64 x86_64 GNU/Linux.
      RAID5: four ST31000524NS disks.
      
      Without blk_plug: recovery speed about 63 MB/sec;
      with blk_plug: recovery speed about 120 MB/sec.
      
      Using blktrace:
      blktrace -d /dev/sdb -w 60  -o -|blkparse -i -
      
      without blk_plug:
      Total (8,16):
       Reads Queued:      309811,     1239MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:   283583,     1189MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:   273351,     1149MiB	 Writes Completed:        0,        0KiB
       Read Merges:        23533,    94132KiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:           0
      
      add blk_plug:
      Total (8,16):
       Reads Queued:      428697,     1714MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:     3954,     1714MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:     3956,     1715MiB	 Writes Completed:        0,        0KiB
       Read Merges:       424743,     1698MiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:        3384
      
      The merge ratio is markedly increased.
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
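      The change itself is small; a sketch of the plugging added around
      the main loop in md_do_sync:

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... the existing resync/recovery loop issues its bios here;
         * they accumulate on the plug and merge before dispatch ... */
        blk_finish_plug(&plug);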
  9. 22 May 2012 (8 commits)
    • md: check the return of mddev_find() · 0c098220
      Authored by Yuanhan Liu
      Check the return of mddev_find(), since it may fail due to out of
      memory or out of usable minor number.
      
      The reason I chose -ENODEV instead of -ENOMEM or something else is
      that the md_alloc() function chose that ;)
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
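      A sketch of the added check (md_open is the likely call-site; treat
      the surrounding code as assumed):

        mddev = mddev_find(bdev->bd_dev);
        if (!mddev)
                return -ENODEV;   /* out of memory or out of usable minors */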
    • DM RAID: Set recovery flags on resume · 47525e59
      Authored by Jonathan Brassow
      Properly initialize MD recovery flags when resuming device-mapper devices.
      
      When a device-mapper device is suspended, all I/O must stop.  This is done by
      calling 'md_stop_writes' and 'mddev_suspend'.  These calls in-turn manipulate
      the recovery flags - including setting 'MD_RECOVERY_FROZEN'.  The DM device
      may have been suspended while recovery was not yet complete, so the process
      needs to pick-up where it left off.  Since 'mddev_resume' does not unset
      'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
      'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
      must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
      (e.g.  It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
      setting 'MD_RECOVERY_FROZEN'.  Clearing it in 'mddev_resume' would override the
      desired behavior.)
      
      Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
      there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.
      
      Also clean up where level_store calls mddev_resume() - it currently
      duplicates some of the functions of that call. - NB
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
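      In sketch form, the resume path now re-arms recovery explicitly
      (simplified; the raid_set container is assumed):

        static void raid_resume(struct dm_target *ti)
        {
                struct raid_set *rs = ti->private;

                /* FROZEN must be cleared here, not in mddev_resume(),
                 * so a user-requested freeze of a reshape survives */
                clear_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);
                mddev_resume(&rs->md);   /* sets MD_RECOVERY_NEEDED, wakes thread */
        }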
    • md: allow array to be resized while bitmap is present. · a4a6125a
      Authored by NeilBrown
      Now that bitmaps can be resized, we can allow an array to be resized
      while the bitmap is present.
      
      This only covers resizing that involves changing the effective size
      of member devices, not resizing that changes the number of devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct. · 1ec885cd
      Authored by NeilBrown
      This new 'struct bitmap_storage' reflects the external storage of the
      bitmap.
      Having this clearly defined will make it easier to change the storage
      used while the array is active.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: allow a bitmap with no backing storage. · ef99bf48
      Authored by NeilBrown
      An md bitmap comprises two parts
       - internal counting of active writes per 'chunk'.
       - external storage of whether there are any active writes on
         each chunk
      
      The second requires the first, but the first doesn't require the
      second.
      
      Not having backing storage means that the bitmap cannot expedite
      resync after a crash, but it still allows us to expedite the recovery
      of a recently-removed device.
      
      So: allow a bitmap to exist even if there is no backing device.
      In that case we default to 128M chunks.
      
      A particular value of this is that we can remove and re-add a bitmap
      (possibly of a different granularity) on a degraded array, and not
      lose the information needed to fast-recover the missing device.
      
      We don't actually activate these bitmaps yet - that will come
      in a later patch.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: add new 'space' attribute for bitmaps. · 6409bb05
      Authored by NeilBrown
      If we are to allow bitmaps to be resized when the array is resized,
      we need to know how much space there is.
      
      So create an attribute to store this information and set appropriate
      defaults.
      
      It can be set more precisely via sysfs, or future metadata extensions
      may allow it to be recorded.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move freeing of badblocks.page into md_rdev_clear · 4fa2f327
      Authored by NeilBrown
      This ensures that it is always freed - there were cases where
      we failed to free the page.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: dm-raid should call helper function to clear rdev. · 545c8795
      Authored by NeilBrown
      dm-raid currently open-codes the freeing of some members of
      an rdev.  It is more maintainable to have it call common code
      from md.c which does this for all call-sites.
      
      So rename free_disk_sb to md_rdev_clear, export it, and use it in
      dm-raid.c
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 21 May 2012 (5 commits)
    • md: use resync_max_sectors for reshape as well as resync. · c804cdec
      Authored by NeilBrown
      Some resync type operations need to act on the address space of the
      device, others on the address space of the array.
      
      This only affects RAID10, so it sets resync_max_sectors to the array
      size (it defaults to the device size), and that is currently used for
      resync only.  However reshape of a RAID10 must be done against the
      array size, not device size, so change code to use resync_max_sectors
      for both the resync and the reshape cases.
      This does not affect RAID5 or RAID1, just RAID10.
      Signed-off-by: NeilBrown <neilb@suse.de>
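      A sketch of the selection in md_do_sync after this change:

        if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
            test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
                /* act on the array's address space (matters for RAID10) */
                max_sectors = mddev->resync_max_sectors;
        else
                /* recovery acts on the device's address space */
                max_sectors = mddev->dev_sectors;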
    • md: teach sync_page_io about new_data_offset. · 1fdd6fc9
      Authored by NeilBrown
      Some code in raid1 and raid10 uses sync_page_io to
      read/write pages when responding to read errors.
      As we will shortly support changing data_offset for
      raid10, this function must understand new_data_offset.
      
      So add that understanding.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add possibility to change data-offset for devices. · c6563a8c
      Authored by NeilBrown
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previously check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       We (belatedly) check that all remaining 'pad' fields are
       zero to avoid a repeat of this.)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow a reshape operation to be reversed. · 2c810cdd
      Authored by NeilBrown
      Currently a reshape operation always progresses from the start
      of the array to the end unless the number of devices is being
      reduced, in which case it progresses in the opposite direction.
      
      To reverse a partial reshape which changes the number of devices
      you can stop the array and re-assemble with the raid-disks numbers
      reversed and it will undo.
      
      However for a reshape that does not change the number of devices
      it is not possible to reverse the reshape in the middle - you have to
      wait until it completes.
      
      So add a 'reshape_direction' attribute which is either 'forwards' or
      'backwards' and can be explicitly set when delta_disks is zero.
      
      This will become more important when we allow the data_offset to
      change in a reshape.  Then the explicit statement of what direction is
      being used will be more useful.
      
      This can be enabled in raid5 trivially as it already supports
      reverse reshape and just needs to use a different trigger to request it.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: using GFP_NOIO to allocate bio for flush request · b5e1b8ce
      Authored by Shaohua Li
      A flush request is usually issued in transaction commit code path, so
      using GFP_KERNEL to allocate memory for flush request bio falls into
      the classic deadlock issue.
      
      This is suitable for any -stable kernel to which it applies as it
      avoids a possible deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
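      A sketch of the allocation change; GFP_NOIO keeps memory reclaim
      from issuing IO that could wait on this very flush:

        /* GFP_KERNEL here could recurse into reclaim, which may issue
         * IO that waits on the flush we are building - deadlock */
        bio = bio_alloc_mddev(GFP_NOIO, 0, mddev);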
  11. 17 May 2012 (1 commit)
    • MD: Add del_timer_sync to mddev_suspend (fix nasty panic) · 0d9f4f13
      Authored by Jonathan Brassow
      Use del_timer_sync to remove the timer before mddev_suspend finishes.
      
      We don't want a timer going off after an mddev_suspend is called.  This is
      especially true with device-mapper, since it can call the destructor function
      immediately following a suspend.  This results in the removal (kfree) of the
      structures upon which the timer depends - resulting in a very ugly panic.
      Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
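      The addition in sketch form (the safemode timer is the one that
      mddev_suspend must quiesce):

        void mddev_suspend(struct mddev *mddev)
        {
                /* ... existing quiesce steps ... */
                del_timer_sync(&mddev->safemode_timer);
                /* no timer can fire after we return, so dm may free
                 * the surrounding structures immediately */
        }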
  12. 24 Apr 2012 (2 commits)
    • md: fix possible corruption of array metadata on shutdown. · 30b8aa91
      Authored by NeilBrown
      commit c744a65c
        md: don't set md arrays to readonly on shutdown.
      
      removed the possibility of a 'BUG' when data is written to an array
      that has just been switched to read-only, but also introduced the
      possibility that the array metadata could be corrupted.
      
      If, when md_notify_reboot gets the mddev lock, the array is
      in a state where it is assembled but hasn't been started (as can
      happen if the personality module is not available, or in other unusual
      situations), then incorrect metadata will be written out making it
      impossible to re-assemble the array.
      
      So only call __md_stop_writes() if the array has actually been
      activated.
      
      This patch is needed for any stable kernel which has had the above
      commit applied.
      
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
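      A sketch of the guarded call in md_notify_reboot:

        if (mddev_trylock(mddev)) {
                if (mddev->pers)   /* only if the array was actually started */
                        __md_stop_writes(mddev);
                mddev->safemode = 2;
                mddev_unlock(mddev);
        }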
    • md: don't call ->add_disk unless there is good reason. · ed209584
      Authored by NeilBrown
      Commit 7bfec5f3
      
         md/raid5: If there is a spare and a want_replacement device, start replacement.
      
      caused md_check_recovery to call ->add_disk much more often.
      Instead of only when the array is degraded, it is now called whenever
      md_check_recovery finds anything useful to do, which includes
      updating the metadata for clean<->dirty transition.
      This causes unnecessary work, and causes info messages from ->add_disk
      to be reported much too often.
      
      So refine md_check_recovery to only do any actual recovery checking
      (including ->add_disk) if MD_RECOVERY_NEEDED is set.
      
      This fix is suitable for 3.3.y:
      
      Cc: stable@vger.kernel.org
      Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
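      A simplified sketch of the intent; the real hunk in
      md_check_recovery is more involved:

        if (!test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) {
                /* nothing to recover: flush dirty metadata, but skip
                 * ->add_disk and the other recovery checks */
                if (mddev->flags)
                        md_update_sb(mddev, 0);
                goto unlock;
        }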
  13. 19 Mar 2012 (2 commits)