1. 18 Mar, 2016 3 commits
    • md/raid5: Cleanup cpu hotplug notifier · 1d034e68
      Anna-Maria Gleixner committed
      The raid456_cpu_notify() hotplug callback lacks handling of the
      CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
      buffer is leaked.
      
      Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
      to free the scratch buffer.
      
      CC: Shaohua Li <shli@kernel.org>
      CC: linux-raid@vger.kernel.org
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      1d034e68
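      The leak described above can be illustrated with a small user-space model. This is only a sketch: the `Action` names, `ScratchPool`, and the `handle_canceled` switch are hypothetical stand-ins and not the kernel's notifier API. It assumes the bring-up is aborted after this callback has already allocated its buffer.

```python
from enum import Enum, auto


class Action(Enum):
    # Hypothetical stand-ins for the hotplug transitions named in the commit.
    UP_PREPARE = auto()
    UP_CANCELED = auto()
    DEAD = auto()


class ScratchPool:
    """Toy per-CPU scratch buffers (not the kernel's data structure)."""

    def __init__(self, handle_canceled):
        self.handle_canceled = handle_canceled
        self.buffers = {}

    def notify(self, action, cpu):
        if action is Action.UP_PREPARE:
            # Allocate the scratch buffer before the CPU comes online.
            self.buffers[cpu] = bytearray(16)
        elif action is Action.DEAD:
            self.buffers.pop(cpu, None)
        elif action is Action.UP_CANCELED and self.handle_canceled:
            # The added case: free the buffer when bring-up is aborted.
            self.buffers.pop(cpu, None)


def leaked_after_failed_bring_up(handle_canceled):
    """Return how many buffers remain after an aborted bring-up of CPU 1."""
    pool = ScratchPool(handle_canceled)
    pool.notify(Action.UP_PREPARE, cpu=1)
    pool.notify(Action.UP_CANCELED, cpu=1)
    return len(pool.buffers)
```

      Without the cancel case the model keeps one buffer allocated for a CPU that never came up; with it, nothing remains.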
    • raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang · 23ddba80
      Shaohua Li committed
      This is the raid10 counterpart of the bug fixed by Nate
      (raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang)
      
      Fixes: 95af587e(md/raid10: ensure device failure recorded before write request returns)
      Cc: stable@vger.kernel.org (V4.3+)
      Cc: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      23ddba80
    • raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang · ccfc7bf1
      Nate Dailey committed
      If raid1d is handling a mix of read and write errors, handle_read_error's
      call to freeze_array can get stuck.
      
      This can happen because, though the bio_end_io_list is initially drained,
      writes can be added to it via handle_write_finished as the retry_list
      is processed. These writes contribute to nr_pending but are not included
      in nr_queued.
      
      If a later entry on the retry_list triggers a call to handle_read_error,
      freeze_array hangs waiting for nr_pending == nr_queued+extra. The writes
      on the bio_end_io_list aren't included in nr_queued so the condition will
      never be satisfied.
      
      To prevent the hang, include bio_end_io_list writes in nr_queued.
      
      There's probably a better way to handle decrementing nr_queued, but this
      seemed like the safest way to avoid breaking surrounding code.
      
      I'm happy to supply the script I used to repro this hang.
      
      Fixes: 55ce74d4(md/raid1: ensure device failure recorded before write request returns.)
      Cc: stable@vger.kernel.org (v4.3+)
      Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      ccfc7bf1
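      The counting problem can be sketched with two integers. This is a toy model, not the raid1 code: `Counters`, `freeze_condition`, and the scenario function are hypothetical names, and `extra=1` is an assumption standing for the one failed read that the caller itself is handling.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    nr_pending: int = 0  # requests submitted and not yet completed
    nr_queued: int = 0   # requests parked for the daemon thread


def freeze_condition(c, extra):
    """The condition the commit message says freeze_array waits for."""
    return c.nr_pending == c.nr_queued + extra


def read_error_with_parked_writes(count_end_io_list, parked_writes=2):
    """Return True if the freeze condition can be met in this scenario."""
    c = Counters()
    # The failed read currently being handled (covered by extra=1).
    c.nr_pending += 1
    # Writes that finished with errors and were parked on bio_end_io_list.
    for _ in range(parked_writes):
        c.nr_pending += 1
        if count_end_io_list:
            # The fix: parked writes also count as queued.
            c.nr_queued += 1
    return freeze_condition(c, extra=1)
```

      In this model the condition can never hold while the parked writes are counted only in nr_pending, which corresponds to the hang described above.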
  2. 15 Mar, 2016 4 commits
  3. 10 Mar, 2016 2 commits
    • md/raid5: output stripe state for debug · fb3229d5
      Shaohua Li committed
      Neil recently fixed an obscure race in break_stripe_batch_list. Debugging would
      be much more convenient if we knew the stripe state, so this patch outputs it.
      Signed-off-by: Shaohua Li <shli@fb.com>
      fb3229d5
    • md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list · 550da24f
      NeilBrown committed
      break_stripe_batch_list breaks up a batch and copies some flags from
      the batch head to the members, preserving others.
      
      It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
      normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
      stripe_head is added to a batch, and is not set on stripe_heads
      already in a batch.
      
      However there is no locking to ensure one thread doesn't set the flag
      after it has just been cleared in another.  This does occasionally happen.
      
      md/raid5 maintains a count of the number of stripe_heads with
      STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
      break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
      this count becomes incorrect and will never again return to zero.
      
      md/raid5 delays the handling of some stripe_heads until
      preread_active_stripes becomes zero.  So when the above-mentioned race
      happens, those stripe_heads become blocked and never progress,
      resulting in writes to the array hanging.
      
      So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
      in the members of a batch.
      
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
      URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
      Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
      Tested-by: Tom Weber <linux@junkyard.4t2.com>
      Fixes: 1b956f7a ("md/raid5: be more selective about distributing flags across batch.")
      Cc: stable@vger.kernel.org (v4.1 and later)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      550da24f
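      The interaction between the flag and the counter can be sketched as follows. This is a simplified model with plain integers; the bit values, the `HANDLE` bit, and the helper names are hypothetical and only the flag name STRIPE_PREREAD_ACTIVE and the counter preread_active_stripes come from the commit message.

```python
PREREAD_ACTIVE = 1 << 0  # stands in for STRIPE_PREREAD_ACTIVE
HANDLE = 1 << 1          # hypothetical flag copied from the batch head


class Conf:
    def __init__(self):
        self.preread_active_stripes = 0


class Stripe:
    def __init__(self):
        self.state = 0


def set_preread(conf, sh):
    # Count a stripe only on the 0 -> 1 transition of the flag.
    if not sh.state & PREREAD_ACTIVE:
        sh.state |= PREREAD_ACTIVE
        conf.preread_active_stripes += 1


def clear_preread(conf, sh):
    # Uncount only if the flag is still set.
    if sh.state & PREREAD_ACTIVE:
        sh.state &= ~PREREAD_ACTIVE
        conf.preread_active_stripes -= 1


def break_batch(sh, head_state, copied, preserved):
    # Keep the 'preserved' bits of the member, copy 'copied' bits from the head.
    sh.state = (sh.state & preserved) | (head_state & copied)


def counter_after_race(preserve_preread):
    conf, sh = Conf(), Stripe()
    set_preread(conf, sh)  # the racing thread sets the flag on a batch member
    preserved = PREREAD_ACTIVE if preserve_preread else 0
    break_batch(sh, head_state=HANDLE, copied=HANDLE, preserved=preserved)
    clear_preread(conf, sh)  # normal completion path
    return conf.preread_active_stripes
```

      When the flag is dropped while the batch is broken up, the later clear is a no-op and the counter stays at one; preserving the flag lets it return to zero.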
  4. 08 Mar, 2016 1 commit
  5. 27 Feb, 2016 4 commits
    • MD: warn for potential deadlock · 70d9798b
      Shaohua Li committed
      The personality thread shouldn't call mddev_suspend(), because
      mddev_suspend() waits for all IO to finish, but IO is handled in the
      personality thread, so this could cause a deadlock. To catch this early,
      add a warning if mddev_suspend() is called from the personality thread.
      Suggested-by: NeilBrown <neilb@suse.com>
      Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      70d9798b
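      The idea of the warning can be sketched in user space with Python threads. This is an illustration only: `ToyArray` and its fields are hypothetical, and the real check lives in kernel code that this sketch does not reproduce.

```python
import threading


class ToyArray:
    """Toy array with one worker thread that is supposed to handle all IO."""

    def __init__(self):
        self.thread = None   # the "personality" thread, once assigned
        self.warnings = []

    def suspend(self):
        # Warn if the caller is the very thread that would have to finish
        # the IO that suspend is about to wait for.
        if self.thread is not None and threading.current_thread() is self.thread:
            self.warnings.append("suspend called from personality thread")
        # A real suspend would now wait for in-flight IO; omitted here.


def warnings_when_called_from(personality):
    arr = ToyArray()
    worker = threading.Thread(target=arr.suspend)
    if personality:
        arr.thread = worker  # the calling thread is the personality thread
    else:
        arr.thread = threading.Thread(target=lambda: None)  # some other thread
    worker.start()
    worker.join()
    return arr.warnings
```

      The sketch only records a warning; it does not model the deadlock itself.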
    • md: Drop sending a change uevent when stopping · 399146b8
      Sebastian Parschauer committed
      After an MD device is stopped, its device node /dev/mdX may still
      exist, or it may be recreated by udev. The next open() call
      can lead to creation of an inoperable MD device. The reason for
      this is that a change event (KOBJ_CHANGE) is sent to udev which
      races against the remove event (KOBJ_REMOVE) from md_free().
      So drop sending the change event.
      
      A change is likely also required in mdadm as many versions send the
      change event to udev as well.
      
      Neil mentioned that the change event is a workaround for old kernels; see
      commit 934d9c23 ("md: destroy partitions and notify udev when md array is stopped.").
      New mdadm can handle device removal now, so this isn't required any more.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      399146b8
    • RAID5: revert e9e4c377 to fix a livelock · 6ab2a4b8
      Shaohua Li committed
      Revert commit
      e9e4c377(md/raid5: per hash value and exclusive wait_for_stripe)
      
      The problem is that raid5_get_active_stripe waits on
      conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes
      in this order:
      - release all stripes with hash 0
      - raid5_get_active_stripe still sleeps since active_stripes >
        max_nr_stripes * 3 / 4
      - release all stripes with hash other than 0. active_stripes becomes 0
      - raid5_get_active_stripe still sleeps, since nobody wakes up
        wait_for_stripe[0]
      The system livelocks. The problem is that active_stripes isn't a per-hash
      count. Reverting the patch makes the livelock go away.
      
      Cc: stable@vger.kernel.org (v4.2+)
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      6ab2a4b8
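      The release order listed above can be replayed in a small model. This is a sketch under stated assumptions: one stripe per hash value, eight hash values, and a waiter that proceeds only when it is woken and active_stripes is below three quarters of max_nr_stripes. The function and parameter names are hypothetical.

```python
def waiter_wakes(per_hash_wakeup, nr_hash=8):
    """Return True if a waiter on hash 0 ever gets to proceed."""
    max_nr_stripes = nr_hash          # assumption: one stripe per hash value
    active_stripes = max_nr_stripes   # global count, not per hash
    limit = max_nr_stripes * 3 // 4
    waiter_hash = 0

    # Release hash 0 first, then every other hash, as in the commit message.
    for released_hash in range(nr_hash):
        active_stripes -= 1
        if per_hash_wakeup:
            # Only waiters on the released stripe's own queue are woken.
            woken = released_hash == waiter_hash
        else:
            # A single shared queue: every release wakes the waiter.
            woken = True
        if woken and active_stripes < limit:
            return True
    return False
```

      With per-hash wakeups the only wakeup arrives while the global count is still above the limit, so the waiter sleeps forever even after the count reaches zero; with a shared queue a later release wakes it.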
    • RAID5: check_reshape() shouldn't call mddev_suspend · 27a353c0
      Shaohua Li committed
      check_reshape() is called from the raid5d thread. The raid5d thread
      shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO
      to finish but IO is handled in the raid5d thread, so we could easily
      deadlock here.
      
      This issue is introduced by
      738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      
      Cc: stable@vger.kernel.org (v4.1+)
      Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      27a353c0
  6. 26 Feb, 2016 1 commit
  7. 22 Feb, 2016 1 commit
    • dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths · 4328daa2
      Mike Snitzer committed
      Using request-based DM mpath configured with the following stacking
      (.request_fn DM mpath on top of scsi-mq paths):
      
      echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
      echo N > /sys/module/dm_mod/parameters/use_blk_mq
      
      'struct dm_rq_target_io' would leak if a request is requeued before a
      blk-mq clone is allocated (or fails to allocate).  free_rq_tio()
      wasn't being called.
      
      kmemleak reported:
      
      unreferenced object 0xffff8800b90b98c0 (size 112):
        comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
        hex dump (first 32 bytes):
          00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff  ..\,....@.......
          e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00  ...4............
        backtrace:
          [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0
          [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20
          [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170
          [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod]
          [<ffffffff812fbd78>] blk_peek_request+0x168/0x290
          [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod]
          [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40
          [<ffffffff812f9585>] blk_delay_work+0x25/0x40
          [<ffffffff81096fff>] process_one_work+0x14f/0x3d0
          [<ffffffff81097715>] worker_thread+0x125/0x4b0
          [<ffffffff8109ce88>] kthread+0xd8/0xf0
          [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      crash> struct -o dm_rq_target_io
      struct dm_rq_target_io {
          ...
      }
      SIZE: 112
      
      Fixes: e5863d9a ("dm: allocate requests in target when stacking on blk-mq devices")
      Cc: stable@vger.kernel.org # 4.0+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4328daa2
  8. 25 Jan, 2016 2 commits
  9. 21 Jan, 2016 1 commit
  10. 14 Jan, 2016 4 commits
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams committed
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise online
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister", introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      1501efad
    • raid5-cache: handle journal hotadd in quiesce · 16a43f6a
      Shaohua Li committed
      Handle journal hotadd in quiesce to avoid creating duplicated threads.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      16a43f6a
    • MD: add journal with array suspended · 87d4d916
      Shaohua Li committed
      Hot-adding a journal disk in recovery thread context brings a lot of
      trouble, as IO could be running. Unlike spare disk hot-add, adding a journal
      disk with the array suspended makes more sense, and the implementation is
      much easier.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      87d4d916
    • md: set MD_HAS_JOURNAL in correct places · a62ab49e
      Shaohua Li committed
      Set MD_HAS_JOURNAL when an array is loaded or the journal is initialized.
      This avoids setting the flag too early during journal disk hotadd.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      a62ab49e
  11. 10 Jan, 2016 2 commits
  12. 09 Jan, 2016 1 commit
    • dm snapshot: fix hung bios when copy error occurs · 385277bf
      Mikulas Patocka committed
      When there is an error copying a chunk dm-snapshot can incorrectly hold
      associated bios indefinitely, resulting in hung IO.
      
      The function copy_callback sets pe->error if there was an error copying the
      chunk, and then calls complete_exception.  complete_exception calls
      pending_complete on error, otherwise it calls commit_exception with
      commit_callback (and commit_callback calls complete_exception).
      
      The persistent exception store (dm-snap-persistent.c) assumes that calls
      to prepare_exception and commit_exception are paired.
      persistent_prepare_exception increases ps->pending_count and
      persistent_commit_exception decreases it.
      
      If there is a copy error, persistent_prepare_exception is called but
      persistent_commit_exception is not.  This results in the variable
      ps->pending_count never returning to zero and that causes some pending
      exceptions (and their associated bios) to be held forever.
      
      Fix this by unconditionally calling commit_exception regardless of
      whether the copy was successful.  A new "valid" parameter is added to
      commit_exception -- when the copy fails this parameter is set to zero so
      that the chunk that failed to copy (and all following chunks) is not
      recorded in the snapshot store.  Also, remove commit_callback now that
      it is merely a wrapper around pending_complete.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      385277bf
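      The prepare/commit pairing can be sketched with a counter. This is a toy model, not dm-snap-persistent.c: `ToyStore` and the two `copy_done_*` helpers are hypothetical, and the handling of `valid` follows only what the commit message states (a failed chunk and all following chunks are not recorded).

```python
class ToyStore:
    """Toy exception store that expects prepare and commit calls to pair up."""

    def __init__(self):
        self.pending_count = 0
        self.valid = True
        self.recorded = []

    def prepare_exception(self, chunk):
        self.pending_count += 1

    def commit_exception(self, chunk, valid):
        self.pending_count -= 1
        if not valid:
            # After a failed copy nothing further is recorded.
            self.valid = False
        if self.valid:
            self.recorded.append(chunk)


def copy_done_old(store, chunk, error):
    # Old behaviour: on error the commit is skipped, so the pairing breaks.
    if not error:
        store.commit_exception(chunk, valid=True)


def copy_done_new(store, chunk, error):
    # New behaviour: always commit, passing valid=False on a copy error.
    store.commit_exception(chunk, valid=not error)


def run(copy_done, errors):
    """Process one chunk per entry in errors; return (pending_count, recorded)."""
    store = ToyStore()
    for chunk, error in enumerate(errors):
        store.prepare_exception(chunk)
        copy_done(store, chunk, error)
    return store.pending_count, store.recorded
```

      In the old path a single copy error leaves pending_count stuck above zero; in the new path the count returns to zero and the failed chunk and later chunks are left out of the record.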
  13. 07 Jan, 2016 3 commits
  14. 06 Jan, 2016 11 commits