1. 13 Oct 2015: 8 commits
  2. 12 Oct 2015: 10 commits
    • md-cluster: Fix adding of new disk with new reload code · dbb64f86
      Authored by Goldwyn Rodrigues
      Adding a disk worked incorrectly with the new reload code. Fix it:

       - No operation should be performed on an rdev marked as Candidate.
       - After a metadata update operation, kick the disk if its role is
         0xfffe; otherwise clear the Candidate bit and continue with the
         regular change check.
       - Save the mode of the lock resource to check whether the token lock
         is already held, because locking can be requested twice while
         adding a disk; unlock_comm(), however, must be called only once.
       - add_new_disk() is called by the node initiating the --add
         operation. If the operation needs to be cancelled, call
         add_new_disk_cancel(). Otherwise it is completed by
         md_update_sb(), which writes the metadata and unlocks the
         communication.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Authored by Goldwyn Rodrigues
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres, provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.

      If a node is unable to get resync_lockres because recovery is
      being performed by another node, it sets MD_RECOVERY_NEEDED so that
      recovery is scheduled in the future.

      Also remove a debug message in resync_info_update() that was only
      used during development.
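      A rough sketch of the locking (dlm_lock_sync() is md-cluster's
      synchronous lock helper; treat the exact shape as an assumption):

          /* Only one node may resync at a time: take the DLM lock
           * exclusively, or defer and let md retry recovery later. */
          static int resync_start(struct mddev *mddev)
          {
                  struct md_cluster_info *cinfo = mddev->cluster_info;

                  return dlm_lock_sync(cinfo->resync_lockres, DLM_LOCK_EX);
          }

          /* at the call site, e.g. in md_do_sync(): */
          if (mddev_is_clustered(mddev) && resync_start(mddev) < 0) {
                  set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                  return;
          }
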
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Perform a lazy update · 2aa82191
      Authored by Goldwyn Rodrigues
      In a clustered environment, a change such as marking a device faulty
      can be recorded by any of the nodes. The change is communicated to
      all the nodes, so re-recording it is unnecessary, and quite often
      pretty disruptive.

      With this patch, just before the update we check for the changes;
      if they are already recorded in the superblock, we abort the update
      after clearing all the flags.
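      The idea in miniature (a sketch; does_sb_need_changing() stands for
      the helper that compares in-memory state against the superblock, and
      the flag names follow md.h of that era):

          /* In md_update_sb(): if another node already recorded this
           * change, clear the change flags and skip the write entirely. */
          if (mddev_is_clustered(mddev) && !does_sb_need_changing(mddev)) {
                  clear_bit(MD_CHANGE_DEVS, &mddev->flags);
                  clear_bit(MD_CHANGE_CLEAN, &mddev->flags);
                  return;       /* lazy update: nothing new to record */
          }
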
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Authored by Goldwyn Rodrigues
      md_reload_sb is too simplistic: it needs to explicitly determine the
      changes made by the writing node, and there are multiple areas where
      a simple reload could fail.

      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:

      - Read the superblock into a newly allocated page by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - If that fails, return.
      - If it succeeds, call check_sb_changes, which:
        1. Iterates over the list of active devices and checks the
           matching dev_roles[] value.
           If the role is 'faulty', the device must be marked as faulty:
            - Call md_error to mark the device as faulty. Make sure not
              to set CHANGE_DEVS or wake up mddev->thread, or else it
              would initiate a resync process, which is the responsibility
              of the "primary" node.
            - Clear the Blocked bit.
            - Call remove_and_add_spares() to hot remove the device.
           If the device is 'spare':
            - Call remove_and_add_spares() to get the number of spares
              added in this operation.
            - Reduce mddev->degraded to mark the array as not degraded.
        2. Resets recovery_cp.
      - Read the rest of the rdevs to update recovery_offset. If
        recovery_offset equals MaxSector, call spare_active() to mark the
        device In_sync.

      This requires that recovery_offset be initialized to MaxSector
      rather than zero, so that the end of sync can be communicated for
      an rdev.
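      A condensed, illustrative sketch of that flow (real code has more
      checks; role constants 0xfffe/0xffff mark 'faulty'/'spare' in
      dev_roles[]):

          static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
          {
                  struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
                  struct md_rdev *rdev2;

                  rdev_for_each(rdev2, mddev) {
                          int role = le16_to_cpu(sb->dev_roles[rdev2->desc_nr]);

                          if (test_bit(Faulty, &rdev2->flags))
                                  continue;
                          if (role == 0xfffe) {
                                  /* marked faulty by the writing node; do
                                   * not set CHANGE_DEVS or wake the thread */
                                  md_error(mddev, rdev2);
                                  clear_bit(Blocked, &rdev2->flags);
                                  remove_and_add_spares(mddev, rdev2);
                          } else if (rdev2->raid_disk == -1 && role != 0xffff) {
                                  /* a spare was activated by the writing
                                   * node: pick it up, array less degraded */
                                  mddev->degraded -=
                                          remove_and_add_spares(mddev, rdev2);
                          }
                  }
                  mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
          }
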
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md: remove_and_add_spares() to activate specific rdev · 2910ff17
      Authored by Goldwyn Rodrigues
      remove_and_add_spares() checks all devices when activating spares.
      Change it to activate a specific device if a non-NULL rdev argument
      is passed.

      remove_and_add_spares() can then be used to activate spares in
      slot_store() as well.

      For hot_remove_disk(), check whether rdev->raid_disk == -1 before
      calling remove_and_add_spares().
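      Sketch of the new calling convention:

          /* @this narrows the scan to one rdev; NULL keeps the old
           * scan-everything behaviour. Returns the spares added. */
          static int remove_and_add_spares(struct mddev *mddev,
                                           struct md_rdev *this)
          {
                  struct md_rdev *rdev;
                  int spares = 0;

                  rdev_for_each(rdev, mddev) {
                          if (this && this != rdev)
                                  continue;  /* only the requested device */
                          /* ... existing removal/activation logic ... */
                  }
                  return spares;
          }
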
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Wake up suspended process · b8ca846e
      Authored by Goldwyn Rodrigues
      When the suspended_area is deleted, the suspended processes
      must be woken up in order to complete their I/O.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster · 09995411
      Authored by Guoqing Jiang
      Previously, the BITMAP_NEEDS_SYNC message was sent when the resync
      aborted, but a resync can abort for different reasons, and not all
      of them require another node to take over resync ownership.

      It is better to send the BITMAP_NEEDS_SYNC message only when a node
      is leaving the cluster with a dirty bitmap. We also need to ensure
      the DLM connection is OK.
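      Sketch of the reworked leave path (field and helper names follow
      md-cluster.c but should be read as assumptions):

          static int leave(struct mddev *mddev)
          {
                  struct md_cluster_info *cinfo = mddev->cluster_info;

                  /* only a node departing with a dirty bitmap asks the
                   * others to take over its resync, and only while the
                   * DLM connection is still usable */
                  if (cinfo->slot_number > 0 &&
                      mddev->recovery_cp != MaxSector) {
                          struct cluster_msg cmsg = {
                                  .type = cpu_to_le32(BITMAP_NEEDS_SYNC),
                                  .slot = cpu_to_le32(cinfo->slot_number - 1),
                          };
                          sendmsg(cinfo, &cmsg);
                  }
                  /* ... release lock resources, leave the lockspace ... */
                  return 0;
          }
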
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: Use a small window for resync · c40f341f
      Authored by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long, so
      resync in small chunks instead.

      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in raid1's
      sync_request(). If the current resync is outside the cluster resync
      window:

      1. Set cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window; if not, issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it to their
         suspension list.

      bitmap_cond_end_sync is modified to allow forcing a sync, in order
      to bring curr_resync_completed up to date with the sector passed in.
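      Paraphrased as code (a sketch of the window logic in raid1's
      sync_request(); constants and fields follow the patch, but the exact
      structure here is illustrative):

          if (mddev_is_clustered(mddev) &&
              conf->cluster_sync_high < sector_nr + RESYNC_SECTORS) {
                  /* steps 1-2: slide the window (wait_barrier() elided) */
                  conf->cluster_sync_low = mddev->curr_resync_completed;
                  if (conf->cluster_sync_low + CLUSTER_RESYNC_WINDOW_SECTORS
                      < sector_nr + RESYNC_SECTORS)
                          conf->cluster_sync_low = sector_nr;
                  /* step 3 */
                  conf->cluster_sync_high = conf->cluster_sync_low +
                                            CLUSTER_RESYNC_WINDOW_SECTORS;
                  /* step 4: other nodes suspend I/O to this range */
                  md_cluster_ops->resync_info_update(mddev,
                                  conf->cluster_sync_low,
                                  conf->cluster_sync_high);
          }
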
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Increment version for clustered bitmaps · 3c462c88
      Authored by Goldwyn Rodrigues
      Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
      from assembling a clustered device.

      To maximize compatibility, the major version is set to
      BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.

      Also add MD_FEATURE_CLUSTERED in order to return an error on older
      kernels, which would otherwise assemble the MD device even if the
      bitmap is corrupted.
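      Sketch of the compatibility gate when the bitmap superblock is read
      (illustrative; the warning text is made up):

          #define BITMAP_MAJOR_CLUSTERED 5

          /* a clustered bitmap must only be assembled by a kernel that
           * understands md-cluster; everyone else gets an error */
          if (le32_to_cpu(sb->version) == BITMAP_MAJOR_CLUSTERED &&
              !mddev_is_clustered(bitmap->mddev)) {
                  pr_warn("md: bitmap is clustered; refusing to assemble\n");
                  return -EINVAL;
          }
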
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: complete all write requests before adding suspend_info · 9ed38ff5
      Authored by Goldwyn Rodrigues
      process_suspend_info, which handles the RESYNCING request, must not
      reply until all writes that were initiated before the request
      arrived have completed.
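      One way to picture the requirement (the waitqueue and counter names
      here are illustrative assumptions, not the patch's actual mechanism):

          /* before acknowledging RESYNCING, drain writes that were
           * already in flight when the message arrived */
          wait_event(mddev->sb_wait,
                     atomic_read(&mddev->pending_writes) == 0);
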
      
      As a by-product, all process_* functions now take mddev as their
      first argument, making them uniform.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  3. 10 Oct 2015: 1 commit
  4. 09 Oct 2015: 2 commits
    • dm cache: fix NULL pointer when switching from cleaner policy · 2bffa150
      Authored by Joe Thornber
      The cleaner policy doesn't make use of the per-cache-block hint
      space in the metadata (unlike the other policies). When switching
      from the cleaner policy to mq or smq, a NULL pointer crash (in
      dm_tm_new_block) was observed. The crash was caused by bugs in
      dm-cache-metadata.c when trying to skip creation of the hint btree.

      The minimal fix is to change the hint size for the cleaner policy
      to 4 bytes (the only supported hint size).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Authored by Mikulas Patocka
      Commit 55ce74d4 ("md/raid1: ensure device failure recorded before
      write request returns") causes a crash in the LVM2 testsuite test
      shell/lvchange-raid.sh. For me the crash is 100% reproducible.

      The reason for the crash is that the newly added code in raid1d
      moves the list from conf->bio_end_io_list to tmp, tests whether tmp
      is non-empty, and then incorrectly pops the bio from
      conf->bio_end_io_list (which is empty because the list was already
      moved).

      RAID-10 has a similar bug.
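      The bug and fix in miniature (this sketch uses list_splice_init()
      for clarity; the actual patch pops entries through a temporary list
      head):

          LIST_HEAD(tmp);

          spin_lock_irqsave(&conf->device_lock, flags);
          list_splice_init(&conf->bio_end_io_list, &tmp);
          spin_unlock_irqrestore(&conf->device_lock, flags);

          /* pop from tmp -- the original head is empty after the move */
          while (!list_empty(&tmp)) {
                  r1_bio = list_first_entry(&tmp, struct r1bio, retry_list);
                  list_del(&r1_bio->retry_list);
                  raid_end_bio_io(r1_bio);
          }
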
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  5. 06 Oct 2015: 1 commit
  6. 03 Oct 2015: 1 commit
    • dm raid: fix round up of default region size · 042745ee
      Authored by Mikulas Patocka
      Commit 3a0f9aae ("dm raid: round region_size to power of two")
      intended to make sure that the default region size is a power of two.
      However, the logic in that commit is incorrect and sets the variable
      region_size to 0 or 1, depending on whether min_region_size is a power
      of two.
      
      Fix this logic, using roundup_pow_of_two(), so that region_size is
      properly rounded up to the next power of two.
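      The fix, in essence (roundup_pow_of_two() leaves a power of two
      unchanged, so no special case is needed):

          /* replaces the broken shift logic that yielded 0 or 1 */
          region_size = roundup_pow_of_two(min_region_size);
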
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 3a0f9aae ("dm raid: round region_size to power of two")
      Cc: stable@vger.kernel.org # v3.8+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  7. 02 Oct 2015: 8 commits
  8. 01 Oct 2015: 1 commit
    • dm: fix AB-BA deadlock in __dm_destroy() · 2a708cff
      Authored by Junichi Nomura
      __dm_destroy() takes the io_barrier SRCU lock (via
      dm_get_live_table) and suspend_lock in reverse order. Doing so can
      cause an AB-BA deadlock:
      
        __dm_destroy                    dm_swap_table
        ---------------------------------------------------
                                        mutex_lock(suspend_lock)
        dm_get_live_table()
          srcu_read_lock(io_barrier)
                                        dm_sync_table()
                                          synchronize_srcu(io_barrier)
                                            .. waiting for dm_put_live_table()
        mutex_lock(suspend_lock)
          .. waiting for suspend_lock
      
      Fix this by taking the locks in proper order.
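      Sketch of the corrected ordering (illustrative; error handling and
      the actual teardown work are elided):

          /* take suspend_lock first, then the SRCU read lock -- the same
           * order dm_swap_table() uses, so the AB-BA cycle cannot form */
          mutex_lock(&md->suspend_lock);
          map = dm_get_live_table(md, &srcu_idx);
          if (!dm_suspended_md(md) && map) {
                  /* ... presuspend/postsuspend the targets ... */
          }
          dm_put_live_table(md, srcu_idx);
          mutex_unlock(&md->suspend_lock);
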
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Fixes: ab7c7bb6 ("dm: hold suspend_lock while suspending device during device deletion")
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  9. 15 Sep 2015: 1 commit
    • dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE · 586b286b
      Authored by Mike Snitzer
      Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an
      unfortunate constraint that is required to avoid the potential for
      exceeding dm-crypt's underlying device's max_segments limits -- due to
      crypt_alloc_buffer() possibly allocating pages for the encryption bio
      that are not as physically contiguous as the original bio.
      
      It is interesting to note that this problem was already fixed back
      in 2007 via commit 91e10625 ("dm crypt: use bio_add_page"). But
      Linux 4.0 commit cf2f1abf ("dm crypt: don't allocate pages for a
      partial request") regressed dm-crypt back to _not_ using
      bio_add_page(). And given that dm-crypt's cpu parallelization
      changes all depend on commit cf2f1abf's abandoning of the more
      complex io fragments processing that dm-crypt previously had, we
      cannot easily go back to using bio_add_page().

      All said, the cleanest way to resolve this issue is to fix dm-crypt
      to properly constrain the original bios entering dm-crypt, so that
      the encryption bios dm-crypt generates from them are always
      compatible with the underlying device's max_segments queue limits.
      
      It should be noted that technically Linux 4.3 does _not_ need this fix
      because of the block core's new late bio-splitting capability.  But, it
      is reasoned, there is little to be gained by having the block core split
      the encrypted bio that is composed of PAGE_SIZE segments.  That said, in
      the future we may revert this change.
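      The constraint itself is a one-field io_hints hook (a sketch of the
      shape of the change):

          /* cap segment size so the clone bios dm-crypt builds can never
           * exceed the underlying device's max_segments limit */
          static void crypt_io_hints(struct dm_target *ti,
                                     struct queue_limits *limits)
          {
                  limits->max_segment_size = PAGE_SIZE;
          }
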
      
      Fixes: cf2f1abf ("dm crypt: don't allocate pages for a partial request")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421
      Suggested-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.0+
  10. 14 Sep 2015: 1 commit
  11. 12 Sep 2015: 1 commit
  12. 01 Sep 2015: 5 commits
    • dm cache: fix use after freeing migrations · cc7da0ba
      Authored by Joe Thornber
      Both free_io_migration() and issue_discard() dereference a migration
      that was just freed. Fix those by saving off the migration's cache
      object before freeing the migration. Also clean up needless
      mg->cache dereferences now that the cache object is available
      directly.
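      The pattern of the fix (matching the description above; a sketch):

          static void free_io_migration(struct dm_cache_migration *mg)
          {
                  struct cache *cache = mg->cache;  /* save before freeing */

                  dec_io_migrations(cache);
                  free_migration(mg);               /* mg is dead after this */
                  wake_worker(cache);               /* touch only the copy */
          }
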
      
      Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: small cleanups related to deferred prison cell cleanup · dc9cee5d
      Authored by Mike Snitzer
      Eliminate __cell_release() since it only had one caller that always
      released the cell holder.
      
      Switch cell_error_with_code() to using free_prison_cell() for the sake
      of consistency.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: fix leaking of deferred bio prison cells · 9153df74
      Authored by Joe Thornber
      There were two cases where dm_cell_visit_release() was being called,
      which removes the cell from the prison's rbtree, but the callers didn't
      also return the cell to the mempool.  Fix this by having them call
      free_prison_cell().
      
      This leak manifested as the 'kmalloc-96' slab growing until OOM.
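      The invariant the fix restores, in two lines (fn/context stand for
      the caller's visitor):

          /* dm_cell_visit_release() only unlinks the cell from the
           * prison's rbtree; the caller still owns the memory */
          dm_cell_visit_release(cache->prison, fn, context, cell);
          free_prison_cell(cache, cell);  /* else kmalloc-96 grows to OOM */
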
      
      Fixes: 651f5fa2 ("dm cache: defer whole cells")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
    • md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      Authored by NeilBrown
      When a write to one of the devices of a RAID5/6 array fails, the
      failure is recorded in the metadata of the other devices so that
      after a restart the data on the failed drive won't be trusted even
      if that drive seems to be working again (maybe a cable was
      unplugged).

      Similarly, when we record a bad block in response to a write
      failure, we must not let the write complete until the bad-block
      update is safe.

      Currently there is no interlock between the write request completing
      and the metadata update. So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.

      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.

      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that completed while MD_CHANGE_PENDING is set, so
         they are only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
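      Sketch of the interlock (close to the raid5 change in spirit; the
      return_bi bio_list comes from the preparatory patch below):

          if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
                  /* park completed writes until the metadata is safe */
                  spin_lock_irq(&conf->device_lock);
                  bio_list_merge(&conf->return_bi, &s.return_bi);
                  spin_unlock_irq(&conf->device_lock);
                  md_wakeup_thread(conf->mddev->thread);
          } else
                  return_io(&s.return_bi);
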
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      Authored by NeilBrown
      This will make it easier to splice two lists together, which will
      be needed in a future patch.
      Signed-off-by: NeilBrown <neilb@suse.com>