1. 09 December 2016 (2 commits)
  2. 23 November 2016 (3 commits)
    • md/raid1: add failfast handling for writes. · 212e7eb7
      Committed by NeilBrown
      When writing to a FailFast device, we use MD_FAILFAST unless
      it is the only device being written to.
      
      For resync/recovery, assume there was a working device to
      read from, so always use REQ_FAILFAST_DEV.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal failfast write fails, but the device cannot be
      failed (must be only one left), we queue for write error
      handling.  This will call narrow_write_error() to retry the
      write synchronously and without any FAILFAST flags.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
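
      The decision comes down to one test: FAILFAST is only safe when another
      writable copy exists.  A minimal userspace sketch of that rule, assuming
      hypothetical flag values and a stand-in rdev struct (not the kernel's
      definitions):

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical stand-ins for the kernel flags and rdev state. */
      #define REQ_FAILFAST_DEV        (1u << 0)
      #define REQ_FAILFAST_TRANSPORT  (1u << 1)
      #define MD_FAILFAST  (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)

      struct fake_rdev { bool failfast; };

      /* Use MD_FAILFAST only when the target is marked FailFast and other
       * writable copies remain, so a transient path error cannot take out
       * the last copy. */
      static unsigned int write_flags(const struct fake_rdev *rdev, int writable_copies)
      {
          if (rdev->failfast && writable_copies > 1)
              return MD_FAILFAST;
          return 0;
      }

      int main(void)
      {
          struct fake_rdev r = { .failfast = true };
          printf("two copies: %#x\n", write_flags(&r, 2)); /* MD_FAILFAST */
          printf("last copy:  %#x\n", write_flags(&r, 1)); /* 0: retried synchronously on error */
          return 0;
      }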
    • md/raid1: add failfast handling for reads. · 2e52d449
      Committed by NeilBrown
      If a device is marked FailFast and it is not the only device
      we can read from, we mark the bio with REQ_FAILFAST_* flags.
      
      If this does fail, we don't try read repair but just allow
      failure.  If it was the last device, it is not failed of
      course, so the retry happens on the same device - this time
      without FAILFAST.  A subsequent failure will not retry but
      will just pass up the error.
      
      During resync we may use FAILFAST requests and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case where there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      
      The new R1BIO_FailFast flag is set on read requests to suggest
      that a FAILFAST request might be acceptable.  The rdev needs
      to have FailFast set as well for the read to actually use
      REQ_FAILFAST_*.
      
      We need to know there are at least two working devices
      before we can set R1BIO_FailFast, so we mustn't stop looking
      at the first device we find.  So change the "min_pending == 0"
      handling to not exit early, but to always choose the
      best_pending_disk if min_pending == 0.
      
      The spinlocked region in raid1_error() is enlarged to ensure
      that if two bios, reading from two different devices, fail
      at the same time, then there is no risk that both devices
      will be marked faulty, leaving zero "In_sync" devices.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
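
      A simplified sketch of the read-selection change: keep scanning past the
      first idle disk so we learn whether a second working device exists, and
      only then suggest failfast.  The struct and helper below are hypothetical
      simplifications, not the kernel's read_balance():

      #include <stdbool.h>
      #include <stdio.h>

      struct disk { bool in_sync; int nr_pending; };

      static int choose_read_disk(const struct disk *d, int ndisks, bool *failfast_ok)
      {
          int best = -1, working = 0, min_pending = -1;

          for (int i = 0; i < ndisks; i++) {
              if (!d[i].in_sync)
                  continue;
              working++;
              /* Keep scanning even after finding an idle disk (min_pending == 0),
               * otherwise we never learn whether a second working device exists. */
              if (min_pending < 0 || d[i].nr_pending < min_pending) {
                  min_pending = d[i].nr_pending;
                  best = i;
              }
          }
          /* FAILFAST is only worth suggesting when another copy could serve a retry. */
          *failfast_ok = working > 1;
          return best;
      }

      int main(void)
      {
          const struct disk d[] = { { true, 0 }, { true, 3 } };
          bool ff;
          int best = choose_read_disk(d, 2, &ff);
          printf("read from disk %d, failfast %s\n", best, ff ? "allowed" : "not allowed");
          return 0;
      }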
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      Committed by NeilBrown
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
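
      A toy sketch of that fallback, with an invented do_write() standing in
      for the real superblock write path: try failfast first, and once it
      fails (or the device is already known to be the last one) retry plainly
      and remember LastDev so later updates skip the failfast attempt:

      #include <stdbool.h>
      #include <stdio.h>

      struct meta_dev { bool last_dev; };

      /* Pretend transport: a failfast attempt may fail where a plain write succeeds. */
      static bool do_write(struct meta_dev *dev, bool failfast)
      {
          (void)dev;
          return !failfast;
      }

      static bool write_metadata(struct meta_dev *dev)
      {
          if (!dev->last_dev && do_write(dev, true))
              return true;                    /* failfast attempt succeeded */
          /* Failfast failed but the device cannot be failed (it is the last one):
           * retry without FAILFAST and avoid wasting failfast attempts later. */
          dev->last_dev = true;
          return do_write(dev, false);
      }

      int main(void)
      {
          struct meta_dev d = { false };
          printf("metadata write %s, last_dev=%d\n",
                 write_metadata(&d) ? "ok" : "failed", d.last_dev);
          return 0;
      }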
  3. 19 November 2016 (2 commits)
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      Committed by NeilBrown
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening while
      looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add block tracing for bio_remapping · 109e3765
      Committed by NeilBrown
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 10 November 2016 (1 commit)
    • md/raid1: fix: IO can block resync indefinitely · f2c771a6
      Committed by NeilBrown
      While performing a resync/recovery, raid1 divides the
      array space into three regions:
       - before the resync
       - at or shortly after the resync point
       - much further ahead of the resync point.
      
      Write requests to the first or third do not need to wait.  Write
      requests to the middle region do need to wait if resync requests are
      pending.
      
      If there are any active write requests in the middle region, resync
      will wait for them.
      
      Due to an accounting error, there is a small range of addresses,
      between conf->next_resync and conf->start_next_window, where write
      requests will *not* be blocked, but *will* be counted in the middle
      region.  This can effectively block resync indefinitely if filesystem
      writes happen repeatedly to this region.
      
      As ->next_window_requests is incremented when the sector is after
        conf->start_next_window + NEXT_NORMALIO_DISTANCE
      the same boundary should be used for determining when write requests
      should wait.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
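
      A toy illustration of the boundary change, using made-up sector values
      and a placeholder NEXT_NORMALIO_DISTANCE: the exemption from waiting has
      to use the same start_next_window-based boundary as the accounting,
      otherwise a sector in the gap is counted but never blocked:

      #include <stdbool.h>
      #include <stdio.h>

      typedef unsigned long long sector_t;
      #define NEXT_NORMALIO_DISTANCE 3072ULL   /* placeholder, not the kernel value */

      /* Buggy exemption: measured from next_resync. */
      static bool skip_wait_old(sector_t s, sector_t next_resync)
      {
          return s >= next_resync + NEXT_NORMALIO_DISTANCE;
      }

      /* Fixed exemption: measured from start_next_window, the same boundary
       * used when counting the request toward the next window. */
      static bool skip_wait_new(sector_t s, sector_t start_next_window)
      {
          return s >= start_next_window + NEXT_NORMALIO_DISTANCE;
      }

      int main(void)
      {
          sector_t next_resync = 1000, start_next_window = 5000;
          sector_t s = 4500;   /* lies in the problematic gap */
          printf("old: skip wait = %d (counted but never blocked, resync starves)\n",
                 skip_wait_old(s, next_resync));
          printf("new: skip wait = %d (must wait, so resync can make progress)\n",
                 skip_wait_new(s, start_next_window));
          return 0;
      }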
  5. 08 November 2016 (2 commits)
    • md/raid1: abort delayed writes when device fails. · 5e2c7a36
      Committed by NeilBrown
      When writing to an array with a bitmap enabled, the writes are grouped
      in batches which are preceded by an update to the bitmap.
      
      It is quite likely that, if a drive develops a problem which is not
      media related, the bitmap write will be the first to report an
      error and cause the device to be marked faulty (as the bitmap write is
      at the start of a batch).
      
      In this case, there is no point submitting the subsequent writes to the
      failed device - that just wastes time.
      
      So re-check the Faulty state of a device before submitting a
      delayed write.
      
      This requires that we keep the 'rdev', rather than the 'bdev' in the
      bio, then swap in the bdev just before final submission.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
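
      A hypothetical userspace miniature of the flush path: when the delayed
      batch is finally submitted, re-check Faulty per bio and abort writes
      aimed at a device that the leading bitmap write has already failed.
      The structs and helpers are stand-ins:

      #include <stdbool.h>
      #include <stdio.h>

      struct rdev { bool faulty; };
      struct pending_bio { struct rdev *rdev; int id; };

      static void submit(struct pending_bio *b)         { printf("submit bio %d\n", b->id); }
      static void end_with_error(struct pending_bio *b) { printf("abort bio %d\n", b->id); }

      static void flush_pending(struct pending_bio *bios, int n)
      {
          for (int i = 0; i < n; i++) {
              if (bios[i].rdev->faulty)
                  end_with_error(&bios[i]);   /* don't waste time on a dead device */
              else
                  submit(&bios[i]);
          }
      }

      int main(void)
      {
          struct rdev good = { false }, bad = { true };
          struct pending_bio bios[] = { { &good, 0 }, { &bad, 1 }, { &good, 2 } };
          flush_pending(bios, 3);
          return 0;
      }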
    • md/raid1: change printk() to pr_*() · 1d41c216
      Committed by NeilBrown
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
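
      The conversion pattern, with userspace stand-in macros so the before and
      after forms compile outside the kernel (in the kernel they come from
      <linux/printk.h>):

      #include <stdio.h>

      /* Stand-ins: KERN_ERR is the kernel's classic "<3>" level prefix. */
      #define KERN_ERR "<3>"
      #define printk(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
      #define pr_err(fmt, ...) fprintf(stderr, "<3>" fmt, ##__VA_ARGS__)

      int main(void)
      {
          int disk = 1;
          /* before: explicit level prefix concatenated into the format string */
          printk(KERN_ERR "md/raid1: Disk failure on disk %d\n", disk);
          /* after: the level is implied by the helper's name */
          pr_err("md/raid1: Disk failure on disk %d\n", disk);
          return 0;
      }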
  6. 29 October 2016 (1 commit)
    • raid1: handle read error also in readonly mode · 7449f699
      Committed by Tomasz Majchrzak
      If a write is the first operation on a disk and it happens not to be
      aligned to the page size, the block layer sends a read request first. If
      that read fails, the disk is marked as failed, because no attempt to fix
      the error is made while the array is in auto-readonly mode. Similarly,
      the disk is marked as failed for a read-only array.
      
      Take the same approach as in raid10. Don't fail the disk if array is in
      readonly or auto-readonly mode. Try to redirect the request first and if
      unsuccessful, return a read error.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
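
      A tiny sketch of the resulting policy; the strings merely summarise the
      two paths and none of this is kernel code:

      #include <stdbool.h>
      #include <stdio.h>

      static const char *on_read_error(bool array_readonly)
      {
          if (array_readonly)
              return "keep the disk; redirect to another mirror, or return -EIO";
          return "attempt read-error repair; fail the disk only if repair fails";
      }

      int main(void)
      {
          printf("read-write array: %s\n", on_read_error(false));
          printf("read-only array:  %s\n", on_read_error(true));
          return 0;
      }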
  7. 25 October 2016 (1 commit)
    • RAID1: ignore discard error · e3f948cd
      Committed by Shaohua Li
      If a write error occurs, raid1 will try to rewrite the bio in small
      chunk size. If the rewrite fails, raid1 will record the error in bad
      block. narrow_write_error will always use WRITE for the bio, but it
      could actually be a discard. Since a discard bio has no payload, writing
      it out causes other problems. A discard error isn't fatal, though, so we
      can safely ignore it. This is what this patch does.
      
      This issue has existed since discard support was added, but it is only
      exposed by the recent arbitrary bio size feature.
      Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
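
      A minimal sketch of the decision the error path ends up making; the enum
      is a stand-in for the real bio op check:

      #include <stdbool.h>
      #include <stdio.h>

      enum op { OP_WRITE, OP_DISCARD };

      /* A failed discard loses no payload, so it is safe to complete it as if
       * it had succeeded instead of recording bad blocks or retrying. */
      static bool failure_can_be_ignored(enum op op)
      {
          return op == OP_DISCARD;
      }

      int main(void)
      {
          printf("discard failure ignored: %d\n", failure_can_be_ignored(OP_DISCARD));
          printf("write   failure ignored: %d\n", failure_can_be_ignored(OP_WRITE));
          return 0;
      }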
  8. 22 September 2016 (1 commit)
  9. 08 August 2016 (1 commit)
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
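
      A userspace illustration of the packed field, op in the high bits and
      flags in the low bits.  The shift and values here are invented, not the
      kernel's encoding; the point is simply that code still touching the old
      bi_rw name now fails to compile:

      #include <stdio.h>

      #define OP_SHIFT   29u
      #define OP_READ    0u
      #define OP_WRITE   1u
      #define FLAG_SYNC  (1u << 3)
      #define FLAG_FUA   (1u << 5)

      static unsigned int make_opf(unsigned int op, unsigned int flags)
      {
          return (op << OP_SHIFT) | flags;
      }
      static unsigned int opf_op(unsigned int opf)    { return opf >> OP_SHIFT; }
      static unsigned int opf_flags(unsigned int opf) { return opf & ((1u << OP_SHIFT) - 1); }

      int main(void)
      {
          unsigned int opf = make_opf(OP_WRITE, FLAG_SYNC | FLAG_FUA);
          printf("op=%u flags=%#x\n", opf_op(opf), opf_flags(opf));
          return 0;
      }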
  10. 21 July 2016 (1 commit)
  11. 14 June 2016 (6 commits)
  12. 09 June 2016 (1 commit)
  13. 08 June 2016 (3 commits)
  14. 10 May 2016 (1 commit)
    • md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Committed by Guoqing Jiang
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING do so atomically, and
      introduce bit_clear_unless (suggested by Neil) for that purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
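
      A userspace sketch of the bit_clear_unless() idea using C11 atomics (the
      kernel helper is a cmpxchg-based macro; the flag values here are
      placeholders): clear PENDING only if no new change request has been
      flagged in the meantime.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      #define MD_CHANGE_DEVS     (1u << 0)
      #define MD_CHANGE_CLEAN    (1u << 1)
      #define MD_CHANGE_PENDING  (1u << 2)

      /* Atomically clear the 'clear' bits, but only if none of the 'test'
       * bits are set; returns true if the bits were cleared. */
      static bool bit_clear_unless(atomic_uint *word, unsigned int clear, unsigned int test)
      {
          unsigned int old = atomic_load(word), new;

          do {
              if (old & test)
                  return false;        /* someone re-flagged an update; keep waiting */
              new = old & ~clear;
          } while (!atomic_compare_exchange_weak(word, &old, new));

          return true;
      }

      int main(void)
      {
          atomic_uint flags = MD_CHANGE_PENDING;
          printf("cleared: %d\n",
                 bit_clear_unless(&flags, MD_CHANGE_PENDING,
                                  MD_CHANGE_DEVS | MD_CHANGE_CLEAN));
          return 0;
      }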
  15. 01 April 2016 (1 commit)
  16. 18 March 2016 (1 commit)
    • raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang · ccfc7bf1
      Committed by Nate Dailey
      If raid1d is handling a mix of read and write errors, handle_read_error's
      call to freeze_array can get stuck.
      
      This can happen because, though the bio_end_io_list is initially drained,
      writes can be added to it via handle_write_finished as the retry_list
      is processed. These writes contribute to nr_pending but are not included
      in nr_queued.
      
      If a later entry on the retry_list triggers a call to handle_read_error,
      freeze_array hangs waiting for nr_pending == nr_queued+extra. The writes
      on the bio_end_io_list aren't included in nr_queued so the condition will
      never be satisfied.
      
      To prevent the hang, include bio_end_io_list writes in nr_queued.
      
      There's probably a better way to handle decrementing nr_queued, but this
      seemed like the safest way to avoid breaking surrounding code.
      
      I'm happy to supply the script I used to repro this hang.
      
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Cc: stable@vger.kernel.org (v4.3+)
      Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
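
      A toy model of the freeze_array() accounting: only once the writes
      parked on bio_end_io_list are counted in nr_queued can the wait
      condition be satisfied.

      #include <stdbool.h>
      #include <stdio.h>

      struct counts { int nr_pending; int nr_queued; };

      /* freeze_array() waits until every pending request is also queued for
       * the management thread (plus 'extra' for requests it holds itself). */
      static bool frozen(const struct counts *c, int extra)
      {
          return c->nr_pending == c->nr_queued + extra;
      }

      int main(void)
      {
          /* Three requests in flight; one write sits on bio_end_io_list. */
          struct counts without_fix = { .nr_pending = 3, .nr_queued = 2 };
          struct counts with_fix    = { .nr_pending = 3, .nr_queued = 3 };
          printf("without fix: frozen=%d (hangs)\n", frozen(&without_fix, 0));
          printf("with fix:    frozen=%d\n", frozen(&with_fix, 0));
          return 0;
      }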
  17. 15 March 2016 (1 commit)
  18. 21 January 2016 (1 commit)
  19. 14 January 2016 (1 commit)
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Committed by Dan Williams
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise online
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister" introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  20. 24 October 2015 (2 commits)
    • md-cluster: Call update_raid_disks() if another node --grow's raid_disks · 28c1b9fd
      Committed by Goldwyn Rodrigues
      To incorporate --grow feature executed on one node, other nodes need to
      acknowledge the change in number of disks. Call update_raid_disks()
      to update internal data structures.
      
      Calling check_reshape() leads to md_allow_write() -> md_update_sb(),
      which results in a deadlock. The superblock update is done so that
      memory can be allocated safely (allocation might trigger writeback,
      which might write to the raid1 array); it is not required for md with
      a bitmap.
      
      In the clustered case, we don't perform md_update_sb() in md_allow_write(),
      but in do_md_run(). Also we disable safemode for clustered mode.
      
      mddev->recovery_cp need not be set in check_sb_changes() because this
      is required only when a node reads another node's bitmap. mddev->recovery_cp
      (which is read from sb->resync_offset), is set only if mddev is in_sync.
      Since we disabled safemode, in_sync is set to zero.
      In a clustered environment, the MD may not be in sync because another
      node could be writing to it. So make sure that in_sync is not set in
      case of clustered node in __md_stop_writes().
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid1: don't clear bitmap bit when bad-block-list write fails. · bd8688a1
      Committed by NeilBrown
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R1BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
      Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
      Fixes: cd5ff9a1 ("md/raid1:  Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
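
      A toy model of the ordering rule: the bitmap bit (meaning "this block
      needs no further sync") may only be cleared once either the data write
      or the bad-block record is known to be safe, which is why the clearing
      now waits for the bad-block-list write to complete.

      #include <stdbool.h>
      #include <stdio.h>

      struct write_outcome { bool data_written; bool badblock_recorded; };

      static bool may_clear_bitmap_bit(const struct write_outcome *o)
      {
          return o->data_written || o->badblock_recorded;
      }

      int main(void)
      {
          struct write_outcome bbl_in_flight = { false, false };
          struct write_outcome bbl_safe      = { false, true  };
          printf("bad-block-list write in flight: clear=%d\n",
                 may_clear_bitmap_bit(&bbl_in_flight));
          printf("bad-block-list write safe:      clear=%d\n",
                 may_clear_bitmap_bit(&bbl_safe));
          return 0;
      }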
  21. 22 October 2015 (1 commit)
  22. 21 October 2015 (1 commit)
  23. 12 October 2015 (3 commits)
    • md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Committed by Goldwyn Rodrigues
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.
      
      If a node is unable to get the resync_lockres, because recovery is
      being performed by another node, it sets MD_RECOVERY_NEEDED so as
      to schedule recovery in the future.
      
      Remove the debug message in resync_info_update()
      used during development.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Committed by Goldwyn Rodrigues
      md_reload_sb is too simplistic and it explicitly needs to determine
      the changes made by the writing node. However, there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails return
      - if it succeeds, call check_sb_changes
        1. iterates over list of active devices and checks the matching
         dev_roles[] value.
         	If that is 'faulty', the device must be  marked as faulty
      	 - call md_error to mark the device as faulty. Make sure
      	   not to set CHANGE_DEVS and wakeup mddev->thread or else
      	   it would initiate a resync process, which is the responsibility
      	   of the "primary" node.
      	 - clear the Blocked bit
      	 - Call remove_and_add_spares() to hot remove the device.
      	If the device is 'spare':
      	 - call remove_and_add_spares() to get the number of spares
      	   added in this operation.
      	 - Reduce mddev->degraded to mark the array as not degraded.
        2. reset recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero, so as to communicate the end of sync for an rdev.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
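
      A hypothetical userspace sketch of the swap-and-restore pattern at the
      heart of this flow; every type and helper below is a stand-in, only the
      control flow (swap in a fresh page, try to load, restore on failure) is
      the point:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct page { char data[64]; };
      struct fake_rdev { struct page *sb_page; };

      static int load_super(struct fake_rdev *rdev)        /* pretend disk read */
      {
          strcpy(rdev->sb_page->data, "fresh superblock");
          return 0;
      }
      static void check_sb_changes(struct fake_rdev *rdev) /* apply faulty/spare/recovery_cp */
      {
          printf("applying: %s\n", rdev->sb_page->data);
      }

      static int reload_sb(struct fake_rdev *rdev)
      {
          struct page *saved = rdev->sb_page;
          struct page *fresh = calloc(1, sizeof(*fresh));

          rdev->sb_page = fresh;           /* temporarily swap in a new page */
          if (load_super(rdev)) {
              rdev->sb_page = saved;       /* reload failed: restore and give up */
              free(fresh);
              return -1;
          }
          check_sb_changes(rdev);
          rdev->sb_page = saved;           /* keep the original page installed */
          free(fresh);
          return 0;
      }

      int main(void)
      {
          struct page orig = { "old superblock" };
          struct fake_rdev r = { &orig };
          return reload_sb(&r) ? 1 : 0;
      }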
    • md-cluster: Use a small window for resync · c40f341f
      Committed by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync, in order
      to bring curr_resync_completed up to date with the sector passed in.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
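
      A hypothetical sketch of the window-advance steps listed above;
      wait_barrier() and the cluster message are represented by printf
      placeholders, and the window size is a stand-in for the 32M value:

      #include <stdio.h>

      typedef unsigned long long sector_t;
      #define RESYNC_WINDOW_SECTORS (32ULL * 1024 * 1024 / 512)   /* 32M in sectors */

      struct cluster_window { sector_t low, high; };

      static void advance_window(struct cluster_window *w, sector_t sector_nr,
                                 sector_t curr_resync_completed)
      {
          if (sector_nr >= w->low && sector_nr < w->high)
              return;                                  /* still inside the window */

          w->low = curr_resync_completed;              /* step 1 */
          if (sector_nr >= w->low + RESYNC_WINDOW_SECTORS) {
              printf("wait_barrier()\n");              /* step 2 */
              w->low = sector_nr;
          }
          w->high = w->low + RESYNC_WINDOW_SECTORS;    /* step 3 */
          printf("tell other nodes to suspend [%llu, %llu)\n", w->low, w->high); /* step 4 */
      }

      int main(void)
      {
          struct cluster_window w = { 0, 0 };
          advance_window(&w, 0, 0);
          advance_window(&w, w.high + 100, w.high);    /* resync moved past the window */
          return 0;
      }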
  24. 09 October 2015 (1 commit)
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Committed by Mikulas Patocka
      The commit 55ce74d4 (md/raid1: ensure
      device failure recorded before write request returns) is causing a crash in
      the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
      reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves the
      list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
      then incorrectly pops the bio from conf->bio_end_io_list (which is empty
      because the list was already moved).
      
      Raid-10 has a similar bug.
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
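
      A userspace sketch of the corrected splice-then-drain pattern, with
      minimal stand-ins for the kernel's list helpers: after moving everything
      onto tmp, entries must be taken from tmp, not from the now-empty
      bio_end_io_list.

      #include <stddef.h>
      #include <stdio.h>

      struct list_head { struct list_head *next, *prev; };

      static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
      static int  list_empty(const struct list_head *h) { return h->next == h; }

      static void list_add(struct list_head *n, struct list_head *h)
      {
          n->next = h->next; n->prev = h;
          h->next->prev = n; h->next = n;
      }

      static void list_splice_init(struct list_head *from, struct list_head *to)
      {
          if (list_empty(from))
              return;
          from->next->prev = to;
          from->prev->next = to->next;
          to->next->prev = from->prev;
          to->next = from->next;
          INIT_LIST_HEAD(from);
      }

      struct r1bio { int id; struct list_head retry_list; };

      int main(void)
      {
          struct list_head bio_end_io_list, tmp;
          struct r1bio a = { .id = 1 }, b = { .id = 2 };

          INIT_LIST_HEAD(&bio_end_io_list);
          INIT_LIST_HEAD(&tmp);
          list_add(&a.retry_list, &bio_end_io_list);
          list_add(&b.retry_list, &bio_end_io_list);

          /* Move the whole list (done under the lock in the kernel)... */
          list_splice_init(&bio_end_io_list, &tmp);

          /* ...then drain 'tmp'.  The crash came from popping entries off the
           * (now empty) bio_end_io_list instead of tmp. */
          while (!list_empty(&tmp)) {
              struct list_head *node = tmp.next;
              struct r1bio *r1_bio =
                  (struct r1bio *)((char *)node - offsetof(struct r1bio, retry_list));
              node->prev->next = node->next;
              node->next->prev = node->prev;
              printf("ending bio %d\n", r1_bio->id);
          }
          return 0;
      }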
  25. 02 October 2015 (1 commit)