1. 24 10月, 2015 2 次提交
    • G
      md-cluster: Call update_raid_disks() if another node --grow's raid_disks · 28c1b9fd
      Goldwyn Rodrigues 提交于
      To incorporate --grow feature executed on one node, other nodes need to
      acknowledge the change in number of disks. Call update_raid_disks()
      to update internal data structures.
      
      This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
      this results in a deadlock. This is done so it can safely allocate memory
      (which might trigger writeback which might write to raid1). This is
      not required for md with a bitmap.
      
      In the clustered case, we don't perform md_update_sb() in md_allow_write(),
      but in do_md_run(). Also we disable safemode for clustered mode.
      
      mddev->recovery_cp need not be set in check_sb_changes() because this
      is required only when a node reads another node's bitmap. mddev->recovery_cp
      (which is read from sb->resync_offset), is set only if mddev is in_sync.
      Since we disabled safemode, in_sync is set to zero.
      In a clustered environment, the MD may not be in sync because another
      node could be writing to it. So make sure that in_sync is not set in
      case of clustered node in __md_stop_writes().
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      28c1b9fd
    • N
      md/raid1: don't clear bitmap bit when bad-block-list write fails. · bd8688a1
      NeilBrown 提交于
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R1BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-and-tested-by: NNate Dailey <nate.dailey@stratus.com>
      Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
      Fixes: cd5ff9a1 ("md/raid1:  Handle write errors by updating badblock log.")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      bd8688a1
  2. 22 10月, 2015 1 次提交
  3. 21 10月, 2015 1 次提交
  4. 12 10月, 2015 3 次提交
    • G
      md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Goldwyn Rodrigues 提交于
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.
      
      If a node is unable to get the resync_lockres, because recovery is
      being performed by another node, it set MD_RECOVER_NEEDED so as
      to schedule recovery in the future.
      
      Remove the debug message in resync_info_update()
      used during development.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      c186b128
    • G
      md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Goldwyn Rodrigues 提交于
      md_reload_sb is too simplistic and it explicitly needs to determine
      the changes made by the writing node. However, there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails return
      - if it succeeds, call check_sb_changes
        1. iterates over list of active devices and checks the matching
         dev_roles[] value.
         	If that is 'faulty', the device must be  marked as faulty
      	 - call md_error to mark the device as faulty. Make sure
      	   not to set CHANGE_DEVS and wakeup mddev->thread or else
      	   it would initiate a resync process, which is the responsibility
      	   of the "primary" node.
      	 - clear the Blocked bit
      	 - Call remove_and_add_spares() to hot remove the device.
      	If the device is 'spare':
      	 - call remove_and_add_spares() to get the number of spares
      	   added in this operation.
      	 - Reduce mddev->degraded to mark the array as not degraded.
        2. reset recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero so as to communicate the end of sync for a rdev.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      70bcecdb
    • G
      md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues 提交于
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow to force a sync inorder
      to get the curr_resync_completed uptodate with the sector passed.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c40f341f
  5. 09 10月, 2015 1 次提交
    • M
      crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Mikulas Patocka 提交于
      The commit 55ce74d4 (md/raid1: ensure
      device failure recorded before write request returns) is causing crash in
      the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
      reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves the
      list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
      then incorrectly pops the bio from conf->bio_end_io_list (which is empty
      because the list was alrady moved).
      
      Raid-10 has a similar bug.
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      a452744b
  6. 02 10月, 2015 2 次提交
  7. 01 9月, 2015 2 次提交
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      NeilBrown 提交于
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive wont be trusted even if that drive seems
      to be working again  (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a racy to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      55ce74d4
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      NeilBrown 提交于
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc doesn't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVER_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      985ca973
  8. 14 8月, 2015 1 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
  9. 03 8月, 2015 1 次提交
    • N
      md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies · 423f04d6
      NeilBrown 提交于
      raid1_end_read_request() assumes that the In_sync bits are consistent
      with the ->degaded count.
      raid1_spare_active updates the In_sync bit before the ->degraded count
      and so exposes an inconsistency, as does error()
      So extend the spinlock in raid1_spare_active() and error() to hide those
      inconsistencies.
      
      This should probably be part of
        Commit: 34cab6f4 ("md/raid1: fix test for 'was read error from
        last working device'.")
      as it addresses the same issue.  It fixes the same bug and should go
      to -stable for same reasons.
      
      Fixes: 76073054 ("md/raid1: clean up read_balance.")
      Cc: stable@vger.kernel.org (v3.0+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      423f04d6
  10. 29 7月, 2015 2 次提交
    • J
      block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe 提交于
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b7c44ed9
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  11. 24 7月, 2015 2 次提交
    • G
      Fix read-balancing during node failure · 90382ed9
      Goldwyn Rodrigues 提交于
      During a node failure, We need to suspend read balancing so that the
      reads are directed to the first device and stale data is not read.
      Suspending writes is not required because these would be recorded and
      synced eventually.
      
      A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
      area_resyncing() will respond true for the entire devices if this
      flag is set and the request type is READ. The flag is cleared
      in recover_done().
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reported-By: NDavid Teigland <teigland@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      90382ed9
    • N
      md/raid1: fix test for 'was read error from last working device'. · 34cab6f4
      NeilBrown 提交于
      When we get a read error from the last working device, we don't
      try to repair it, and don't fail the device.  We simple report a
      read error to the caller.
      
      However the current test for 'is this the last working device' is
      wrong.
      When there is only one fully working device, it assumes that a
      non-faulty device is that device.  However a spare which is rebuilding
      would be non-faulty but so not the only working device.
      
      So change the test from "!Faulty" to "In_sync".  If ->degraded says
      there is only one fully working device and this device is in_sync,
      this must be the one.
      
      This bug has existed since we allowed read_balance to read from
      a recovering spare in v3.0
      Reported-and-tested-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Fixes: 76073054 ("md/raid1: clean up read_balance.")
      Cc: stable@vger.kernel.org (v3.0+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      34cab6f4
  12. 02 6月, 2015 1 次提交
    • T
      writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Tejun Heo 提交于
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeoring of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4452226e
  13. 22 4月, 2015 1 次提交
    • N
      md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown 提交于
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      09314799
  14. 25 2月, 2015 1 次提交
  15. 23 2月, 2015 3 次提交
    • G
      Add new disk to clustered array · 1aee41f6
      Goldwyn Rodrigues 提交于
      Algorithm:
      1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
         ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
      2. Node 1 sends NEWDISK with uuid and slot number
      3. Other nodes issue kobject_uevent_env with uuid and slot number
      (Steps 4,5 could be a udev rule)
      4. In userspace, the node searches for the disk, perhaps
         using blkid -t SUB_UUID=""
      5. Other nodes issue either of the following depending on whether the disk
         was found:
         ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
      	 disc.number set to slot number)
         ioctl(CLUSTERED_DISK_NACK)
      6. Other nodes drop lock on no-new-devs (CR) if device is found
      7. Node 1 attempts EX lock on no-new-devs
      8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
         as SpareLocal
      9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
      10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      1aee41f6
    • G
      Read from the first device when an area is resyncing · 7d49ffcf
      Goldwyn Rodrigues 提交于
      set choose_first true for cluster read in read balance when the area
      is resyncing.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      7d49ffcf
    • G
      Suspend writes in RAID1 if within range · 589a1c49
      Goldwyn Rodrigues 提交于
      If there is a resync going on, all nodes must suspend writes to the
      range. This is recorded in the suspend_info/suspend_list.
      
      If there is an I/O within the ranges of any of the suspend_info,
      should_suspend will return 1.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      589a1c49
  16. 16 2月, 2015 1 次提交
  17. 04 2月, 2015 4 次提交
    • N
      md: rename ->stop to ->free · afa0f557
      NeilBrown 提交于
      Now that the ->stop function only frees the private data,
      rename is accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it seeing that isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago before  ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      afa0f557
    • N
      md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown 提交于
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful as in the case where we change the personality which
      manages an array (in level_store()), it would be helpful to do
      step 1 early, and step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5aa61f42
    • N
      md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown 提交于
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      64590f45
    • N
      md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown 提交于
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by included the whole call inside an rcu_read_lock()
      region.
      This require that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5c675f83
  18. 14 10月, 2014 2 次提交
  19. 09 10月, 2014 2 次提交
  20. 22 9月, 2014 7 次提交
    • N
      md/raid1: fix_read_error should act on all non-faulty devices. · b8cb6b4c
      NeilBrown 提交于
      If a devices is being recovered it is not InSync and is not Faulty.
      
      If a read error is experienced on that device, fix_read_error()
      will be called, but it ignores non-InSync devices.  So it will
      neither fix the error nor fail the device.
      
      It is incorrect that fix_read_error() ignores non-InSync devices.
      It should only ignore Faulty devices.  So fix it.
      
      This became a bug when we allowed reading from a device that was being
      recovered.  It is suitable for any subsequent -stable kernel.
      
      Fixes: da8840a7
      Cc: stable@vger.kernel.org (v3.5+)
      Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Tested-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b8cb6b4c
    • N
      md/raid1: count resync requests in nr_pending. · 34e97f17
      NeilBrown 提交于
      Both normal IO and resync IO can be retried with reschedule_retry()
      and so be counted into ->nr_queued, but only normal IO gets counted in
      ->nr_pending.
      
      Before the recent improvement to RAID1 resync there could only
      possibly have been one or the other on the queue.  When handling a
      read failure it could only be normal IO.  So when handle_read_error()
      called freeze_array() the fact that freeze_array only compares
      ->nr_queued against ->nr_pending was safe.
      
      But now that these two types can interleave, we can have both normal
      and resync IO requests queued, so we need to count them both in
      nr_pending.
      
      This error can lead to freeze_array() hanging if there is a read
      error, so it is suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Reported-by: NBrassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      34e97f17
    • N
      md/raid1: update next_resync under resync_lock. · c2fd4c94
      NeilBrown 提交于
      raise_barrier() uses next_resync as part of its calculations, so it
      really should be updated first, instead of afterwards.
      
      next_resync is always used under resync_lock so update it under
      resync lock to, just before it is used.  That is safest.
      
      This could cause normal IO and resync IO to interact badly so
      it suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c2fd4c94
    • N
      md/raid1: Don't use next_resync to determine how far resync has progressed · 23554960
      NeilBrown 提交于
      next_resync is (approximately) the location for the next resync request.
      However it does *not* reliably determine the earliest location
      at which resync might be happening.
      This is because resync requests can complete out of order, and
      we only limit the number of current requests, not the distance
      from the earliest pending request to the latest.
      
      mddev->curr_resync_completed is a reliable indicator of the earliest
      position at which resync could be happening.   It is updated less
      frequently, but is actually reliable which is more important.
      
      So use it to determine if a write request is before the region
      being resynced and so safe from conflict.
      
      This error can allow resync IO to interfere with normal IO which
      could lead to data corruption. Hence: stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      23554960
    • N
      md/raid1: make sure resync waits for conflicting writes to complete. · 2f73d3c5
      NeilBrown 提交于
      The resync/recovery process for raid1 was recently changed
      so that writes could happen in parallel with resync providing
      they were in different regions of the device.
      
      There is a problem though:  While a write request will always
      wait for conflicting resync to complete, a resync request
      will *not* always wait for conflicting writes to complete.
      
      Two changes are needed to fix this:
      
      1/ raise_barrier (which waits until it is safe to do resync)
         must wait until current_window_requests is zero
      2/ wait_battier (which waits at the start of a new write request)
         must update current_window_requests if the request could
         possible conflict with a concurrent resync.
      
      As concurrent writes and resync can lead to data loss,
      this patch is suitable for -stable.
      
      Fixes: 79ef3a8a
      Cc: stable@vger.kernel.org (v3.13+)
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2f73d3c5
    • N
      md/raid1: clean up request counts properly in close_sync() · 669cc7ba
      NeilBrown 提交于
      If there are outstanding writes when close_sync is called,
      the change to ->start_next_window might cause them to
      decrement the wrong counter when they complete.  Fix this
      by merging the two counters into the one that will be decremented.
      
      Having an incorrect value in a counter can cause raise_barrier()
      to hangs, so this is suitable for -stable.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      669cc7ba
    • N
      md/raid1: be more cautious where we read-balance during resync. · c6d119cf
      NeilBrown 提交于
      commit 79ef3a8a made
      it possible for reads to happen concurrently with resync.
      This means that we need to be more careful where read_balancing
      is allowed during resync - we can no longer be sure that any
      resync that has already started will definitely finish.
      
      So keep read_balancing to before recovery_cp, which is conservative
      but safe.
      
      This bug makes it possible to read from a device that doesn't
      have up-to-date data, so it can cause data corruption.
      So it is suitable for any kernel since 3.11.
      
      Fixes: 79ef3a8a
      cc: stable@vger.kernel.org (v3.13+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c6d119cf