1. 03 7月, 2012 10 次提交
    • S
      raid5: delayed stripe fix · fab363b5
      Shaohua Li 提交于
      There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
      the two bits have relationship. A delayed stripe can be moved to hold list only
      when preread active stripe count is below IO_THRESHOLD. If a stripe has both
      the bits set, such stripe will be in delayed list and preread count not 0,
      which will make such stripe never leave delayed list.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fab363b5
    • M
      md/raid456: When read error cannot be recovered, record bad block · 2e8ac303
      majianpeng 提交于
      We may not be able to fix a bad block if:
       - the array is degraded
       - the over-write fails.
      
      In these cases we currently eject the device, but we should
      record a bad block if possible.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2e8ac303
    • N
      md: make 'name' arg to md_register_thread non-optional. · 0232605d
      NeilBrown 提交于
      Having the 'name' arg optional and defaulting to the current
      personality name is no necessary and leads to errors, as when
      changing the level of an array we can end up using the
      name of the old level instead of the new one.
      
      So make it non-optional and always explicitly pass the name
      of the level that the array will be.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0232605d
    • N
      md/raid10: fix failure when trying to repair a read error. · 055d3747
      NeilBrown 提交于
      commit 58c54fcc
           md/raid10: handle further errors during fix_read_error better.
      
      in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
      But we were passing the IO size in bytes!!!
      This resulting in bio_add_page failing, and empty request being sent
      down, and a consequent BUG_ON in scsi_lib.
      
      [fix missing space in error message at same time]
      
      This fix is suitable for 3.1.y and later.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Balzer <chibi@gol.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      055d3747
    • N
      md/raid5: fix refcount problem when blocked_rdev is set. · 5f066c63
      NeilBrown 提交于
      commit 43220aa0
          md/raid5: fix a hang on device failure.
      
      fixed a hang, but introduced a refcounting in-balance so
      that if the presence of bad-blocks ever caused an rdev to
      be 'blocked' we would increment the refcount on the rdev and
      never decrement it.
      
      So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
      is not called.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5f066c63
    • M
      md:Add blk_plug in sync_thread. · 7c2c57c9
      majianpeng 提交于
      Add blk_plug in sync_thread will increase the performance of sync.
      Because sync_thread did not blk_plug,so when raid sync, the bio merge
      not well.
      
      Testing environment:
      SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
      Controller.
      OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
      x86_64 x86_64 x86_64 GNU/Linux.
      RAID5: four ST31000524NS disk.
      
      Without blk_plug:recovery speed about 63M/Sec;
      Add blk_plug:recovery speed about 120M/Sec.
      
      Using blktrace:
      blktrace -d /dev/sdb -w 60  -o -|blkparse -i -
      
      without blk_plug:
      Total (8,16):
       Reads Queued:      309811,     1239MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:   283583,     1189MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:   273351,     1149MiB	 Writes Completed:        0,        0KiB
       Read Merges:        23533,    94132KiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:           0
      
      add blk_plug:
      Total (8,16):
       Reads Queued:      428697,     1714MiB	 Writes Queued:           0,        0KiB
       Read Dispatches:     3954,     1714MiB	 Write Dispatches:        0,        0KiB
       Reads Requeued:         0		 Writes Requeued:         0
       Reads Completed:     3956,     1715MiB	 Writes Completed:        0,        0KiB
       Read Merges:       424743,     1698MiB	 Write Merges:            0,        0KiB
       IO unplugs:             0        	 Timer unplugs:        3384
      
      The ratio of merge will be markedly increased.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7c2c57c9
    • M
      md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev · 1850753d
      majianpeng 提交于
      In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
      nr_pending so we lose the reference we hold on the rdev.
      So atomic_inc it first to maintain the reference.
      
      This bug was introduced by commit  73e92e51
          md/raid5.  Don't write to known bad block on doubtful devices.
      
      which appeared in 3.0, so patch is suitable for stable kernels since
      then.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1850753d
    • M
      md/raid5: Do not add data_offset before call to is_badblock · 6c0544e2
      majianpeng 提交于
      In chunk_aligned_read() we are adding data_offset before calling
      is_badblock.  But is_badblock also adds data_offset, so that is bad.
      
      So move the addition of data_offset to after the call to
      is_badblock.
      
      This bug was introduced by commit 31c176ec
           md/raid5: avoid reading from known bad blocks.
      which first appeared in 3.0.  So that patch is suitable for any
      -stable kernel from 3.0.y onwards.  However it will need minor
      revision for most of those (as the comment didn't appear until
      recently).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6c0544e2
    • N
      md/raid5: prefer replacing failed devices over want-replacement devices. · 5cfb22a1
      NeilBrown 提交于
      If a RAID5 has both a failed device and a device marked as
      'WantReplacement', then we should preferentially replace the failed
      device.
      However the current code replaces whichever is found first.
      So split into 2 loops, check fail failed/missing first, and only check
      for WantReplacement if nothing is failed or missing.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5cfb22a1
    • N
      md/raid10: Don't try to recovery unmatched (and unused) chunks. · fc448a18
      NeilBrown 提交于
      If a RAID10 has an odd number of chunks - as might happen when there
      are an odd number of devices - the last chunk has no pair and so is
      not mirrored.  We don't store data there, but when recovering the last
      device in an array we retry to recover that last chunk from a
      non-existent location.  This results in an error, and the recovery
      aborts.
      
      When we get to that last chunk we should just stop - there is nothing
      more to do anyway.
      
      This bug has been present since the introduction of RAID10, so the
      patch is appropriate for any -stable kernel.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Balzer <chibi@gol.com>
      Tested-by: NChristian Balzer <chibi@gol.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fc448a18
  2. 03 6月, 2012 5 次提交
    • J
      dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber 提交于
      This patch implements two new messages that can be sent to the thin
      pool target allowing it to take a snapshot of the _metadata_.  This,
      read-only snapshot can be accessed by userland, concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asyncronous replication.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      cc8394d8
    • M
      dm thin: use slab mempools · a24c2569
      Mike Snitzer 提交于
      Use dedicated caches prefixed with a "dm_" name rather than relying on
      kmalloc mempools backed by generic slab caches so the memory usage of
      thin provisioning (and any leaks) can be accounted for independently.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a24c2569
    • M
      dm mpath: allow ioctls to trigger pg init · 35991652
      Mikulas Patocka 提交于
      After the failure of a group of paths, any alternative paths that
      need initialising do not become available until further I/O is sent to
      the device.  Until this has happened, ioctls return -EAGAIN.
      
      With this patch, new paths are made available in response to an ioctl
      too.  The processing of the ioctl gets delayed until this has happened.
      
      Instead of returning an error, we submit a work item to kmultipathd
      (that will potentially activate the new path) and retry in ten
      milliseconds.
      
      Note that the patch doesn't retry an ioctl if the ioctl itself fails due
      to a path failure.  Such retries should be handled intelligently by the
      code that generated the ioctl in the first place, noting that some SCSI
      commands should not be retried because they are not idempotent (XOR write
      commands).  For commands that could be retried, there is a danger that
      if the device rejected the SCSI command, the path could be errorneously
      marked as failed, and the request would be retried on another path which
      might fail too.  It can be determined if the failure happens on the
      device or on the SCSI controller, but there is no guarantee that all
      SCSI drivers set these flags correctly.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      35991652
    • M
      dm mpath: delay retry of bypassed pg · f220fd4e
      Mike Christie 提交于
      If I/O needs retrying and only bypassed priority groups are available,
      set the pg_init_delay_retry flag to wait before retrying.
      
      If, for example, the reason for the bypass is that the controller is
      getting reset or there is a firmware upgrade happening, retrying right
      away would cause a flood of log messages and retries for what could be a
      few seconds or even several minutes.
      Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f220fd4e
    • M
      dm mpath: reduce size of struct multipath · 1fbdd2b3
      Mike Snitzer 提交于
      Move multipath structure's 'lock' and 'queue_size' members to eliminate
      two 4-byte holes.  Also use a bit within a single unsigned int for each
      existing flag (saves 8-bytes).  This allows future flags to be added
      without each consuming an unsigned int.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      1fbdd2b3
  3. 31 5月, 2012 1 次提交
    • N
      md: raid1/raid10: fix problem with merge_bvec_fn · aba336bd
      NeilBrown 提交于
      The new merge_bvec_fn which calls the corresponding function
      in subsidiary devices requires that mddev->merge_check_needed
      be set if any child has a merge_bvec_fn.
      
      However were were only setting that when a device was hot-added,
      not when a device was present from the start.
      
      This bug was introduced in 3.4 so patch is suitable for 3.4.y
      kernels.  However that are conflicts in raid10.c so a separate
      patch will be needed for 3.4.y.
      
      Cc: stable@vger.kernel.org
      Reported-by: NSebastian Riemer <sebastian.riemer@profitbricks.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      aba336bd
  4. 22 5月, 2012 24 次提交