1. 31 Oct 2015 (2 commits)
    • Revert "md: allow a partially recovered device to be hot-added to an array." · d01552a7
      NeilBrown committed
      This reverts commit 7eb41885.
      
      This commit is poorly justified: I can find no discussion of it in
      email, and it clearly causes a problem.
      
      If a device which is being recovered fails and is subsequently
      re-added to an array, there could easily have been changes to the
      array *before* the point where the recovery was up to.  So the
      recovery must start again from the beginning.
      
      If a spare is being recovered and fails, then when it is re-added we
      really should do a bitmap-based recovery up to the recovery-offset,
      and then a full recovery from there.  Before this reversion, we only
      did the "full recovery from there" which is not corect.  After this
      reversion with will do a full recovery from the start, which is safer
      but not ideal.
      
      It will be left to a future patch to arrange the two different styles
      of recovery.
      Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: stable@vger.kernel.org (3.14+)
      Fixes: 7eb41885 ("md: allow a partially recovered device to be hot-added to an array.")
    • md/raid5: fix locking in handle_stripe_clean_event() · b8a9d66d
      Roman Gushchin committed
      After commit 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      __find_stripe() is called under conf->hash_locks + hash.
      But handle_stripe_clean_event() calls remove_hash() under
      conf->device_lock.
      
      Under some circumstances the hash chain can become circular, and we
      get an infinite loop with interrupts disabled and the hash lock
      held in __find_stripe().  This leads to a hard lockup on multiple
      CPUs and a subsequent system crash.
      
      I was able to reproduce this behavior on raid6 over 6 SSD disks.
      The devices_handle_discard_safely option should be set to enable
      trim support.  The following script was used:
      
      # Generate heavy parallel write load: 32 dd writers, ~1GB each.
      for i in `seq 1 32`; do
          dd if=/dev/zero of=large$i bs=10M count=100 &
      done
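
      The shape of the eventual fix is to take the same per-hash lock in
      handle_stripe_clean_event() that __find_stripe() runs under.  A
      minimal sketch, with the hash_lock_index field assumed from the
      locking scheme of commit 566c09c5 rather than quoted from the patch:

      /* Sketch only: unhash the stripe under the per-hash lock that
       * __find_stripe() holds, instead of conf->device_lock, so the two
       * paths can no longer race and corrupt the hash chain. */
      int hash = sh->hash_lock_index;

      spin_lock_irq(conf->hash_locks + hash);
      remove_hash(sh);
      spin_unlock_irq(conf->hash_locks + hash);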
      
      neilb: original was against a 3.x kernel.  I forward-ported
        to 4.3-rc.  This version is suitable for any kernel since
        Commit: 59fc630b ("RAID5: batch adjacent full stripe write")
        (v4.1+).  I'll post a version for earlier kernels to stable.
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Fixes: 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <stable@vger.kernel.org> # 3.13 - 4.2
  2. 24 Oct 2015 (6 commits)
    • md/raid10: fix the 'new' raid10 layout to work correctly. · 8bce6d35
      NeilBrown committed
      In Linux 3.9 we introduced a new 'far' layout for RAID10 which was
      supposed to rotate the replicas differently and so provide better
      resilience.  In particular it could survive more combinations of 2
      drive failures.
      
      Unfortunately, due to a coding error, this sometimes did what was
      wanted, sometimes improved resilience less than we hoped, and
      sometimes - in very unlikely circumstances - put multiple replicas
      on the same device, so the redundancy was harmed.
      
      No public user-space tool has created arrays using this layout so it
      is very unlikely that zero-redundancy arrays actually exist.  Probably
      no arrays using any form of the new layout exist.  But we cannot be
      certain.
      
      So use another bit in the 'layout' number and introduce a bug-fixed
      version of the layout.
      Also when assembling an array, if it has a zero-redundancy layout,
      give a warning.
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: don't clear bitmap bit when bad-block-list write fails. · c340702c
      NeilBrown committed
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
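
      As a rough sketch of that reordering (the raid1 fix below is
      analogous; the exact context is assumed, not quoted from the patch):

      /* Sketch: defer close_write(), which clears the bitmap bit, until
       * just before bio_endio(), i.e. until the write or its bad-block
       * record is known to be safe. */
      if (!test_bit(R10BIO_WriteError, &r10_bio->state))
              close_write(r10_bio);   /* normal completion path */

      /* ... the bad-block-list write is issued and completes ... */

      if (test_bit(R10BIO_WriteError, &r10_bio->state))
              close_write(r10_bio);   /* bad-block record now safe */
      bio_endio(bio);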
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R10BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-by: Nate Dailey <nate.dailey@stratus.com>
      Fixes: bd870a16 ("md/raid10:  Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid1: don't clear bitmap bit when bad-block-list write fails. · bd8688a1
      NeilBrown committed
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R1BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
      Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
      Fixes: cd5ff9a1 ("md/raid1:  Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
    • dm cache: the CLEAN_SHUTDOWN flag was not being set · 3201ac45
      Joe Thornber committed
      If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache
      blocks are marked as dirty and a full writeback occurs.
      
      __commit_transaction() is responsible for setting/clearing
      CLEAN_SHUTDOWN (based on the flags_mutator that is passed in).
      
      Fix this issue, of the cache's on-disk flags being wrong, by making sure
      __commit_transaction() does not reset the flags after the mutator has
      altered the flags in preparation for them being serialized to disk.
      
      before:

      sb_flags = mutator(le32_to_cpu(disk_super->flags)); /* mutator sets/clears CLEAN_SHUTDOWN */
      disk_super->flags = cpu_to_le32(sb_flags);
      disk_super->flags = cpu_to_le32(cmd->flags);        /* bug: clobbers the mutated flags */

      after:

      disk_super->flags = cpu_to_le32(cmd->flags);        /* install the current flags first */
      sb_flags = mutator(le32_to_cpu(disk_super->flags)); /* then apply the mutator last */
      disk_super->flags = cpu_to_le32(sb_flags);
      Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm btree: fix leak of bufio-backed block in btree_split_beneath error path · 4dcb8b57
      Mike Snitzer committed
      btree_split_beneath()'s error path had an outstanding FIXME that speaks
      directly to the potential for _not_ cleaning up a previously allocated
      bufio-backed block.
      
      Fix this by releasing the previously allocated bufio block using
      unlock_block().
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <thornber@redhat.com>
      Cc: stable@vger.kernel.org
    • dm btree remove: fix a bug when rebalancing nodes after removal · 2871c69e
      Joe Thornber committed
      Commit 4c7e3093 ("dm btree remove: fix bug in redistribute3") wasn't
      a complete fix for redistribute3().
      
      The redistribute3 function takes 3 btree nodes and shares out the entries
      evenly between them.  If the three nodes in total contained
      (MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
      rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
      the center.
      
      Fix this issue by being more careful about calculating the target number
      of entries for the left and right nodes.
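
      For example, with MAX_ENTRIES = 128 and 383 entries in total, the
      split should be 128/128/127 rather than 127/129/127.  A minimal
      sketch of the corrected target calculation (variable names are
      illustrative assumptions, not quoted from the patch):

      /* Share 'total' entries across three nodes, steering any remainder
       * to the outer nodes so the center is never overfilled. */
      unsigned total = nr_left + nr_center + nr_right;       /* 383 */
      unsigned target_right = total / 3;                     /* 127 */
      unsigned remainder = (target_right * 3) != total;      /* 1 */
      unsigned target_left = target_right + remainder;       /* 128 */
      /* the center keeps total - target_left - target_right = 128 */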
      
      Unit tested in userspace using this program:
      https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  3. 22 Oct 2015 (3 commits)
  4. 21 Oct 2015 (2 commits)
  5. 14 Oct 2015 (2 commits)
  6. 10 Oct 2015 (1 commit)
  7. 09 Oct 2015 (2 commits)
    • dm cache: fix NULL pointer when switching from cleaner policy · 2bffa150
      Joe Thornber committed
      The cleaner policy doesn't make use of the per cache block hint space in
      the metadata (unlike the other policies).  When switching from the
      cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block)
      was observed.  The crash was caused by bugs in dm-cache-metadata.c
      when trying to skip creation of the hint btree.
      
      The minimal fix is to change the cleaner policy's hint size to 4
      bytes (the only hint size supported).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Mikulas Patocka committed
      Commit 55ce74d4 ("md/raid1: ensure device failure recorded before
      write request returns") causes a crash in the LVM2 testsuite test
      shell/lvchange-raid.sh.  For me the crash is 100% reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves the
      list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
      then incorrectly pops the bio from conf->bio_end_io_list (which is empty
      because the list was already moved).
      
      Raid-10 has a similar bug.
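
      A sketch of the corrected loop (raid1 shown; the raid10 path is
      analogous), popping from the local list rather than the
      already-emptied original - surrounding context is assumed:

      /* Splice the pending list onto a local head, then consume the
       * LOCAL list; the original is empty after list_splice_init(). */
      LIST_HEAD(tmp);

      spin_lock_irqsave(&conf->device_lock, flags);
      list_splice_init(&conf->bio_end_io_list, &tmp);
      spin_unlock_irqrestore(&conf->device_lock, flags);

      while (!list_empty(&tmp)) {
              struct r1bio *r1_bio = list_first_entry(&tmp,
                              struct r1bio, retry_list);
              list_del(&r1_bio->retry_list);
              raid_end_bio_io(r1_bio);
      }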
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  8. 06 Oct 2015 (1 commit)
  9. 03 Oct 2015 (1 commit)
    • dm raid: fix round up of default region size · 042745ee
      Mikulas Patocka committed
      Commit 3a0f9aae ("dm raid: round region_size to power of two")
      intended to make sure that the default region size is a power of two.
      However, the logic in that commit is incorrect and sets the variable
      region_size to 0 or 1, depending on whether min_region_size is a power
      of two.
      
      Fix this logic, using roundup_pow_of_two(), so that region_size is
      properly rounded up to the next power of two.
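
      For instance, a min_region_size of 5000 sectors should yield a
      region_size of 8192, not 0 or 1.  A one-line sketch of the corrected
      logic (the derivation of min_region_size is assumed from context):

      /* roundup_pow_of_two() is the standard kernel helper for rounding
       * up to the next power of two. */
      region_size = roundup_pow_of_two(min_region_size);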
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 3a0f9aae ("dm raid: round region_size to power of two")
      Cc: stable@vger.kernel.org # v3.8+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 02 Oct 2015 (8 commits)
  11. 01 Oct 2015 (1 commit)
    • dm: fix AB-BA deadlock in __dm_destroy() · 2a708cff
      Junichi Nomura committed
      __dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and
      suspend_lock in reverse order.  Doing so can cause AB-BA deadlock:
      
        __dm_destroy                    dm_swap_table
        ---------------------------------------------------
                                        mutex_lock(suspend_lock)
        dm_get_live_table()
          srcu_read_lock(io_barrier)
                                        dm_sync_table()
                                          synchronize_srcu(io_barrier)
                                            .. waiting for dm_put_live_table()
        mutex_lock(suspend_lock)
          .. waiting for suspend_lock
      
      Fix this by taking the locks in proper order.
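
      A minimal sketch of the corrected ordering in __dm_destroy(),
      mirroring dm_swap_table()'s lock order (context assumed, not the
      verbatim patch):

      /* Take suspend_lock BEFORE entering the SRCU read side, so
       * __dm_destroy() and dm_swap_table() agree on lock order. */
      mutex_lock(&md->suspend_lock);
      map = dm_get_live_table(md, &srcu_idx);
      if (!dm_suspended_md(md)) {
              dm_table_presuspend_targets(map);
              dm_table_postsuspend_targets(map);
      }
      /* drop the SRCU read lock before releasing suspend_lock */
      dm_put_live_table(md, srcu_idx);
      mutex_unlock(&md->suspend_lock);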
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Fixes: ab7c7bb6 ("dm: hold suspend_lock while suspending device during device deletion")
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  12. 15 Sep 2015 (1 commit)
    • dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE · 586b286b
      Mike Snitzer committed
      Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an
      unfortunate constraint that is required to avoid the potential for
      exceeding dm-crypt's underlying device's max_segments limits -- due to
      crypt_alloc_buffer() possibly allocating pages for the encryption bio
      that are not as physically contiguous as the original bio.
      
      It is interesting to note that this problem was already fixed back in
      2007 via commit 91e10625 ("dm crypt: use bio_add_page").  But Linux 4.0
      commit cf2f1abf ("dm crypt: don't allocate pages for a partial
      request") regressed dm-crypt back to _not_ using bio_add_page().  But
      given dm-crypt's cpu parallelization changes all depend on commit
      cf2f1abf's abandoning of the more complex io fragments processing that
      dm-crypt previously had we cannot easily go back to using
      bio_add_page().
      
      So, all said, the cleanest way to resolve this issue is to fix dm-crypt to
      properly constrain the original bios entering dm-crypt so the encryption
      bios that dm-crypt generates from the original bios are always
      compatible with the underlying device's max_segments queue limits.
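
      A sketch of what that constraint looks like through the target's
      io_hints hook (the hook and the queue_limits field are standard
      device-mapper/block interfaces; the exact placement in dm-crypt is
      assumed):

      /* Cap segment size so bios entering dm-crypt never carry a segment
       * larger than the pages crypt_alloc_buffer() will allocate. */
      static void crypt_io_hints(struct dm_target *ti,
                                 struct queue_limits *limits)
      {
              limits->max_segment_size = PAGE_SIZE;
      }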
      
      It should be noted that technically Linux 4.3 does _not_ need this fix
      because of the block core's new late bio-splitting capability.  But, it
      is reasoned, there is little to be gained by having the block core split
      the encrypted bio that is composed of PAGE_SIZE segments.  That said, in
      the future we may revert this change.
      
      Fixes: cf2f1abf ("dm crypt: don't allocate pages for a partial request")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421
      Suggested-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.0+
  13. 14 Sep 2015 (1 commit)
  14. 12 Sep 2015 (1 commit)
  15. 01 Sep 2015 (8 commits)
    • dm cache: fix use after freeing migrations · cc7da0ba
      Joe Thornber committed
      Both free_io_migration() and issue_discard() dereference a migration
      that was just freed.  Fix those by saving off the migration's cache
      object before freeing the migration.  Also cleanup needless mg->cache
      dereferences now that the cache object is available directly.
      
      Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: small cleanups related to deferred prison cell cleanup · dc9cee5d
      Mike Snitzer committed
      Eliminate __cell_release() since it only had one caller that always
      released the cell holder.
      
      Switch cell_error_with_code() to using free_prison_cell() for the sake
      of consistency.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: fix leaking of deferred bio prison cells · 9153df74
      Joe Thornber committed
      There were two cases where dm_cell_visit_release() was being called,
      which removes the cell from the prison's rbtree, but the callers didn't
      also return the cell to the mempool.  Fix this by having them call
      free_prison_cell().
      
      This leak manifested as the 'kmalloc-96' slab growing until OOM.
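
      A sketch of the pairing described above (function names follow the
      commit text; the exact arguments are assumed):

      /* dm_cell_visit_release() unlinks the cell from the prison's
       * rbtree but does not free it - return it to the mempool too. */
      dm_cell_visit_release(cache->prison, visit_fn, context, cell);
      free_prison_cell(cache, cell);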
      
      Fixes: 651f5fa2 ("dm cache: defer whole cells")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
    • md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      NeilBrown committed
      When a write to one of the devices of a RAID5/6 fails, the failure is
      recorded in the metadata of the other devices so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that completed when MD_CHANGE_PENDING is set to
         only be processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes
         (see the sketch below).
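
      A rough sketch of that queue-then-release pattern; the return_bi
      and return_io() names are assumptions for illustration, not quoted
      from the patch:

      /* Completion path: park the bio while a metadata update for a
       * failed device is still pending. */
      if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
              spin_lock_irq(&conf->device_lock);
              bio_list_add(&conf->return_bi, bi);
              spin_unlock_irq(&conf->device_lock);
              md_wakeup_thread(conf->mddev->thread);
      } else {
              bio_endio(bi);
      }

      /* Daemon: once the superblock write has completed, release
       * everything that was parked. */
      if (!test_bit(MD_CHANGE_PENDING, &mddev->flags) &&
          !bio_list_empty(&conf->return_bi)) {
              struct bio_list tmp = BIO_EMPTY_LIST;

              spin_lock_irq(&conf->device_lock);
              bio_list_merge(&tmp, &conf->return_bi);
              bio_list_init(&conf->return_bi);
              spin_unlock_irq(&conf->device_lock);
              return_io(&tmp);
      }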
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      NeilBrown committed
      This will make it easier to splice two lists together, which will
      be needed in a future patch.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: ensure device failure recorded before write request returns. · 95af587e
      NeilBrown committed
      When a write to one of the legs of a RAID10 fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      NeilBrown committed
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive
      seems to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: remove inappropriate try_module_get from join() · 18b9f679
      NeilBrown committed
      md_setup_cluster already calls try_module_get(), so this
      try_module_get isn't needed.
      Also, there is no matching module_put (except in the error path),
      so this leaves an unbalanced module count.
      Signed-off-by: NeilBrown <neilb@suse.com>