1. 08 Nov 2016, 1 commit
    • md: add bad block support for external metadata · 35b785f7
      Authored by Tomasz Majchrzak
      Add a new rdev flag which an external metadata handler can use to
      switch bad block support on or off. If a new bad block is
      encountered, notify it via the rdev 'unacknowledged_bad_blocks'
      sysfs file. If a bad block has been cleared, notify the update via
      the rdev 'bad_blocks' sysfs file.
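
      A kernel-style sketch of the notification described above (the flag
      name follows this commit; the exact call site is an assumption):

            /* on a new bad block, poke the sysfs file so the external
             * metadata handler can acknowledge it */
            if (test_bit(ExternalBbl, &rdev->flags))
                    sysfs_notify(&rdev->kobj, NULL, "unacknowledged_bad_blocks");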
      
      When bad block support is being removed, just clear the rdev flag;
      it is not necessary to reset the badblocks->shift field. If bad
      blocks are cleared or added at the same time, it is fine for those
      changes to be applied to the structure: the array is in the blocked
      state, and a drive which can no longer handle bad blocks will be
      removed from the array before it is unblocked.
      
      Simplify the state_show function by adding a separator at the end of
      each string and overwriting the last separator with a newline.
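
      A standalone illustration of that pattern (not the md.c code):
      append "flag," for each set flag, then overwrite the final separator
      with the newline.

            #include <stdio.h>

            int main(void)
            {
                    char buf[64];
                    int len = 0;

                    len += sprintf(buf + len, "faulty,");
                    len += sprintf(buf + len, "in_sync,");
                    len += sprintf(buf + len, "external_bbl,");
                    buf[len - 1] = '\n';  /* last ',' becomes the newline */
                    fputs(buf, stdout);   /* faulty,in_sync,external_bbl */
                    return 0;
            }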
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 29 Oct 2016, 3 commits
    • md: be careful not to leak internal curr_resync value into metadata. -- (all) · 1217e1d1
      Authored by NeilBrown
      mddev->curr_resync usually records where the current resync is up to,
      but during the starting phase it has some "magic" values.
      
       1 - means that the array is trying to start a resync, but has yielded
           to another array which shares physical devices, and also needs to
           start a resync
       2 - means the array is trying to start resync, but has found another
           array which shares physical devices and has already started resync.
      
       3 - means that resync has commenced, but it is possible that nothing
           has actually been resynced yet.
      
      It is important that this value not be visible to user-space, and
      particularly that it not get written to the metadata as the resync
      or recovery checkpoint.  In part this is because it may be slightly
      higher than the correct value, though that is very rare.  In part it
      is because it is not a multiple of 4K, and some devices only support
      4K-aligned accesses.
      
      There are two places where this value is propagated into either
      ->curr_resync_completed or ->recovery_cp or ->recovery_offset.
      These currently avoid propagating the values 1 and 2, but still
      allow 3 to leak through.
      
      Change them to only propagate the value if it is > 3.
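
      A standalone model of the fixed rule (illustrative names, not the
      md.c hunks):

            #include <stdint.h>
            #include <stdio.h>

            typedef uint64_t sector_t;

            /* Value safe to record as a checkpoint: curr_resync only
             * counts once it is past the magic startup values 1..3. */
            static sector_t checkpoint_to_record(sector_t curr_resync)
            {
                    return curr_resync > 3 ? curr_resync : 0;
            }

            int main(void)
            {
                    /* a magic value must not leak into metadata */
                    printf("%llu\n", (unsigned long long)checkpoint_to_record(3));    /* 0 */
                    /* a real, 4K-aligned progress value is fine */
                    printf("%llu\n", (unsigned long long)checkpoint_to_record(4096)); /* 4096 */
                    return 0;
            }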
      
      As this can cause an array to fail, the patch is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: Viswesh <viswesh.vichu@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid1: handle read error also in readonly mode · 7449f699
      Authored by Tomasz Majchrzak
      If a write is the first operation on a disk and it happens not to be
      aligned to the page size, the block layer sends a read request
      first. If that read fails, the disk is set as failed, because no
      attempt to fix the error is made while the array is in auto-readonly
      mode. Similarly, the disk is set as failed for a read-only array.
      
      Take the same approach as in raid10: don't fail the disk if the
      array is in readonly or auto-readonly mode. Try to redirect the
      request first and, if unsuccessful, return a read error.
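
      A standalone decision model of the new behaviour (names are
      illustrative, not raid1.c code):

            enum read_error_action {
                    TRY_FIX_THEN_FAIL, /* read-write: rewrite to repair; may fail disk */
                    REDIRECT_OR_ERROR  /* (auto-)readonly: try another mirror; report a
                                          read error only if no mirror can serve it */
            };

            static enum read_error_action on_read_error(int array_readonly)
            {
                    return array_readonly ? REDIRECT_OR_ERROR : TRY_FIX_THEN_FAIL;
            }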
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-cache: correct condition for empty metadata write · 9a8b27fa
      Authored by Shaohua Li
      As long as we recover at least one metadata block, we should write
      the empty metadata block as well. The original code could corrupt
      recovery if only one meta block is valid.
      Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
  3. 25 Oct 2016, 5 commits
    • md: report 'write_pending' state when array in sync · 16f88949
      Authored by Tomasz Majchrzak
      If there is a bad block on a disk and a recovery is performed from
      this disk, the same bad block is reported for the new disk. That
      involves setting the MD_CHANGE_PENDING flag in rdev_set_badblocks.
      For external metadata this flag is never cleared, because the array
      state is reported as 'clean'. A read request to a bad block in a
      RAID5 array then gets stuck waiting for the flag to be cleared - as
      per commit c3cce6cd ("md/raid5: ensure device failure recorded
      before write request returns.").
      
      The meaning of the MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags was
      clarified in commit 070dc6dd ("md: resolve confusion of
      MD_CHANGE_CLEAN"); however, the MD_CHANGE_PENDING flag has since
      been used in personality error handlers in a way that does not fully
      comply with its initial purpose. It was supposed to signal that a
      write request is about to start, but now it is also used to request
      a metadata update. Initially (in md_allow_write and md_write_start)
      the MD_CHANGE_PENDING flag was set and in_sync was set to 0 at the
      same time. Error handlers just set the flag without modifying
      in_sync. Since the sysfs array state is a single value, it now
      reports 'clean' when the MD_CHANGE_PENDING flag is set and in_sync
      is 1, and userspace has no idea it is expected to take any action.
      
      Swap the order in which the array state is checked so that
      'write_pending' is reported ahead of 'clean' ('write_pending' is a
      misleading name, but it is too late to rename it now).
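
      A standalone model of the reordered check (flag names mirror the
      description; this is illustrative, not the md.c hunk):

            #include <stdio.h>

            /* sysfs array state is a single value: after the fix,
             * write_pending is checked before clean */
            static const char *array_state(int in_sync, int change_pending)
            {
                    if (change_pending)
                            return "write_pending";
                    if (in_sync)
                            return "clean";
                    return "active";
            }

            int main(void)
            {
                    puts(array_state(1, 1)); /* write_pending: userspace must act */
                    puts(array_state(1, 0)); /* clean */
                    return 0;
            }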
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: write an empty meta-block when creating log super-block · 56056c2e
      Authored by Zhengyuan Liu
      If the superblock points to an invalid meta block, r5l_load_log will
      set create_super to true and create a new superblock. This path is
      taken every time if no write I/O has been issued to the array since
      it was created. Writing an empty meta block when the log superblock
      is first created avoids this unnecessary work.
      
      Another reason is the correctness of log recovery. Currently we have
      the code below to guarantee that log recovery is correct.
      
              if (ctx.seq > log->last_cp_seq + 1) {
                      int ret;

                      /* at least one meta block was replayed: seal the log
                       * with an empty meta block at a bumped sequence so
                       * stale blocks beyond it can never look valid */
                      ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                      if (ret)
                              return ret;
                      log->seq = ctx.seq + 11;
                      log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                      r5l_write_super(log, ctx.pos);
              } else {
                      /* nothing replayed beyond the checkpoint: resume in place */
                      log->log_start = ctx.pos;
                      log->seq = ctx.seq;
              }
      
      Suppose we have just created an array with a journal device, so
      log->log_start and log->last_checkpoint are both 0. We then write
      three meta blocks, all valid except the middle one, and a crash
      happens. After recovery, ctx.seq would equal log->last_cp_seq + 1
      and log->log_start would be set to the position of the invalid
      middle meta block, which leads to problems that are avoided with
      this patch.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: initialize next_checkpoint field before use · 28cd88e2
      Authored by Zhengyuan Liu
      This field was never initialized when we load or recover the log; it
      was only assigned once I/O to the raid disks had finished. So
      r5l_quiesce may use a wrong next_checkpoint to reclaim log space,
      which confuses the reclaimable-space calculation.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID10: ignore discard error · 579ed34f
      Authored by Shaohua Li
      This is the counterpart of the raid1 fix. If a write error occurs,
      raid10 will try to rewrite the bio in small chunks. If the rewrite
      fails, raid10 will record the error in a bad block.
      narrow_write_error always uses WRITE for the bio, but it could
      actually be a discard; since a discard bio has no payload, writing
      it out causes various issues. A discard error is not fatal, though,
      so we can safely ignore it. That is what this patch does.
      
      This issue has existed since discard support was added, but it is
      only exposed by the recent arbitrary bio size feature.
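
      A kernel-style sketch of the idea in the write-completion path (the
      condition shape is assumed from the description, not quoted from the
      patch):

            /* a failed discard is not fatal: don't record a bad block or
             * schedule narrow_write_error (which would rewrite the range
             * with payload-less WRITEs); just ignore the error */
            bool discard_error = bio->bi_error &&
                                 bio_op(bio) == REQ_OP_DISCARD;

            if (!uptodate && !discard_error)
                    set_bit(WriteErrorSeen, &rdev->flags);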
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID1: ignore discard error · e3f948cd
      Authored by Shaohua Li
      If a write error occurs, raid1 will try to rewrite the bio in small
      chunks. If the rewrite fails, raid1 will record the error in a bad
      block. narrow_write_error always uses WRITE for the bio, but it
      could actually be a discard; since a discard bio has no payload,
      writing it out causes various issues. A discard error is not fatal,
      though, so we can safely ignore it. That is what this patch does.
      
      This issue has existed since discard support was added, but it is
      only exposed by the recent arbitrary bio size feature.
      Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 24 Oct 2016, 1 commit
  5. 19 Oct 2016, 2 commits
  6. 18 Oct 2016, 1 commit
  7. 14 Oct 2016, 2 commits
  8. 12 Oct 2016, 2 commits
    • kthread: kthread worker API cleanup · 3989144f
      Authored by Petr Mladek
      It is good practice to prefix function names with the name of the
      subsystem.
      
      The kthread worker API is a mix of classic kthreads and workqueues.
      Each worker has a dedicated kthread that runs a generic function
      which processes queued work items. It is implemented as part of the
      kthread subsystem.
      
      This patch renames the existing kthread worker API to use the
      corresponding names from the workqueues API prefixed by kthread_ (a
      usage sketch follows the notes below):
      
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
      
      Note that the names of DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common that the "DEFINE_" prefix has
      precedence over the subsystem names.
      
      Note that the INIT() macros and init() functions use different
      naming schemes. There is no perfect solution; there are several
      reasons for this choice:
      
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
      
        + INIT() macros are used only in DEFINE() macros
      
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
      
        + There will also be kthread_destroy_worker() that will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again, it looks better if all
          functions use the same naming scheme.
      
        + There are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type(), regmap_init_mmio_clk().
      
        + It is not an argument but it was inconsistent even before.
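
      For orientation, a minimal module-style sketch of the renamed API
      (illustrative only; error handling trimmed):

            #include <linux/kthread.h>
            #include <linux/module.h>

            static struct kthread_worker worker;
            static struct kthread_work work;
            static struct task_struct *task;

            static void do_work(struct kthread_work *w)
            {
                    pr_info("work item ran\n");
            }

            static int __init demo_init(void)
            {
                    kthread_init_worker(&worker);       /* was init_kthread_worker() */
                    task = kthread_run(kthread_worker_fn, &worker, "demo_worker");
                    if (IS_ERR(task))
                            return PTR_ERR(task);
                    kthread_init_work(&work, do_work);  /* was init_kthread_work() */
                    kthread_queue_work(&worker, &work); /* was queue_kthread_work() */
                    kthread_flush_work(&work);          /* was flush_kthread_work() */
                    return 0;
            }

            static void __exit demo_exit(void)
            {
                    kthread_flush_worker(&worker);      /* was flush_kthread_worker() */
                    kthread_stop(task);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");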
      
      [arnd@arndb.de: fix linux-next merge conflict]
       Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm raid: fix compat_features validation · 5c33677c
      Authored by Andy Whitcroft
      In ecbfb9f1 ("dm raid: add raid level takeover support") a new
      compatible feature flag was added.  Validation for these
      compat_features was added, but it only passes for new raid mappings
      carrying this feature flag, which causes previously created raid
      mappings to fail on import.
      
      Check compat_features for the only valid combination.
      
      Fixes: ecbfb9f1 ("dm raid: add raid level takeover support")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  9. 04 Oct 2016, 1 commit
  10. 29 Sep 2016, 1 commit
    • dm mpath: always return reservation conflict without failing over · 8ff232c1
      Authored by Hannes Reinecke
      If dm-mpath encounters a reservation conflict, it should not fail
      the path (communication with the target is not affected) but should
      rather retry on another path.  However, in doing so we might induce
      a ping-pong between paths with no guarantee of any forward progress.
      And arguably a reservation conflict is an unexpected error, so we
      should pass it upwards to allow the application to take appropriate
      steps.
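
      A sketch of the resulting completion-path policy (one assumption on
      the errno: the block layer reports a SCSI reservation conflict as
      -EBADE):

            if (error == -EBADE)
                    return error;      /* path is healthy: pass the conflict up */
            if (error && !noretry_error(error))
                    fail_path(pgpath); /* genuine path error: fail over */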
      
      This change resolves a show-stopper problem seen with the pNFS SCSI
      layout, because without it it is trivial to hit
      reservation-conflict-based failover loops.
      
      Doubts were raised about the implications of this change relative to
      products like IBM's SVC.  But there is little point withholding a fix
      for Linux because a proprietary product may or may not have some issues
      in its implementation of how it interfaces with Linux.  In the future,
      if there is glaring evidence that this change is certainly problematic
      we can revisit it.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
  11. 22 Sep 2016, 21 commits
    • dm bufio: remove dm_bufio_cond_resched() · 7cd32674
      Authored by Peter Zijlstra
      Use cond_resched() like everybody else.
      
      Mikulas explained why dm_bufio_cond_resched() was introduced to begin
      with (hopefully cond_resched can be improved accordingly) here:
      https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
    • dm crypt: fix crash on exit · f659b100
      Authored by Rabin Vincent
      As the documentation for kthread_stop() says, "if threadfn() may call
      do_exit() itself, the caller must ensure task_struct can't go away".
      dm-crypt does not ensure this and therefore crashes when crypt_dtr()
      calls kthread_stop().  The crash is trivially reproducible by adding a
      delay before the call to kthread_stop() and just opening and closing a
      dm-crypt device.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
       task: ffff88003bd0df40 task.stack: ffff8800375b4000
       RIP: 0010: kthread_stop+0x52/0x300
       Call Trace:
        crypt_dtr+0x77/0x120
        dm_table_destroy+0x6f/0x120
        __dm_destroy+0x130/0x250
        dm_destroy+0x13/0x20
        dev_remove+0xe6/0x120
        ? dev_suspend+0x250/0x250
        ctl_ioctl+0x1fc/0x530
        ? __lock_acquire+0x24f/0x1b10
        dm_ctl_ioctl+0x13/0x20
        do_vfs_ioctl+0x91/0x6a0
        ? ____fput+0xe/0x10
        ? entry_SYSCALL_64_fastpath+0x5/0xbd
        ? trace_hardirqs_on_caller+0x151/0x1e0
        SyS_ioctl+0x41/0x70
        entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      This problem was introduced by bcbd94ff ("dm crypt: fix a possible
      hang due to race condition on exit").
      
      Looking at the description of that patch (excerpted below), it seems
      like the problem it addresses can be solved by just using
      set_current_state instead of __set_current_state, since we obviously
      need the memory barrier.
      
      | dm crypt: fix a possible hang due to race condition on exit
      |
      | A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
      | __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
      | It is possible that the processor reorders memory accesses so that
      | kthread_should_stop() is executed before __set_current_state().  If
      | such reordering happens, there is a possible race on thread
      | termination: [...]
      
      So this patch just reverts the aforementioned patch and changes the
      __set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...).  This
      fixes the crash and should also fix the potential hang.
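
      A kernel-style sketch of the resulting wait loop (the variable names
      are assumptions; only the set_current_state() call is what the patch
      mandates):

            /* set_current_state(), unlike __set_current_state(), implies
             * a memory barrier, so the TASK_INTERRUPTIBLE store cannot be
             * reordered after the kthread_should_stop() test below */
            set_current_state(TASK_INTERRUPTIBLE);
            __add_wait_queue(&write_thread_wait, &wait);
            spin_unlock_irq(&write_thread_lock);

            if (kthread_should_stop()) {
                    set_current_state(TASK_RUNNING);
                    remove_wait_queue(&write_thread_wait, &wait);
                    break;
            }

            schedule();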
      
      Fixes: bcbd94ff ("dm crypt: fix a possible hang due to race condition on exit")
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Rabin Vincent <rabinv@axis.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache metadata: switch to using the new cursor api for loading metadata · f177940a
      Authored by Joe Thornber
      This change offers a pretty significant performance improvement.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm array: introduce cursor api · fdd1315a
      Authored by Joe Thornber
      A more efficient way to iterate an array, thanks to prefetching
      (makes use of the new dm_btree_cursor_* api).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm btree: introduce cursor api · 7d111c81
      Authored by Joe Thornber
      This uses prefetching to speed up iteration through a btree.
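
      A sketch of the intended usage (the dm_btree_cursor_* names come
      from this series; the exact signatures and loop shape here are
      assumptions):

            struct dm_btree_cursor c;
            uint64_t key;
            __le64 value;
            int r;

            r = dm_btree_cursor_begin(info, root, true, &c); /* prefetch leaves */
            while (!r) {
                    r = dm_btree_cursor_get_value(&c, &key, &value);
                    if (r)
                            break;
                    /* ... consume key/value ... */
                    r = dm_btree_cursor_next(&c); /* -ENODATA ends the walk */
            }
            dm_btree_cursor_end(&c);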
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache policy smq: distribute entries to random levels when switching to smq · 9d1b404c
      Authored by Joe Thornber
      For smq the 32-bit 'hint' stores the multiqueue level that the entry
      should be stored in.  If a different policy was used previously and
      we then switch to smq, the hints will be invalid.  In that case we
      used to put all entries in the bottom level of the multiqueue and
      then redistribute them.  Redistribution is faster if we initially
      place entries with invalid hints in random levels.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm cache: speed up writing of the hint array · 4e781b49
      Authored by Joe Thornber
      It's far quicker to always delete the hint array and recreate with
      dm_array_new() because we avoid the copying caused by mutation.
      
      This also simplifies the policy interface, replacing walk_hints()
      with the simpler get_hint().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm array: add dm_array_new() · dd6a77d9
      Authored by Joe Thornber
      dm_array_new() creates a new, populated array more efficiently than
      starting with an empty one and resizing.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • block: export bio_free_pages to other modules · 491221f8
      Authored by Guoqing Jiang
      bio_free_pages was introduced in commit 1dfa0f68
      ("block: add a helper to free bio bounce buffer pages");
      export it so that other modules can reuse the function.
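
      For reference, a sketch of the helper itself (shape of that era's
      bio API; not quoted verbatim from the tree):

            void bio_free_pages(struct bio *bio)
            {
                    struct bio_vec *bvec;
                    int i;

                    bio_for_each_segment_all(bvec, bio, i)
                            __free_page(bvec->bv_page);
            }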
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • raid5: handle register_shrinker failure · 30c89465
      Authored by Shaohua Li
      register_shrinker() can now fail. When that happens,
      shrinker.nr_deferred is NULL, so we use it to determine whether
      unregister_shrinker() is required.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: fix to detect failure of register_shrinker · 6a0f53ff
      Authored by Chao Yu
      register_shrinker can fail after commit 1d3d4437 ("vmscan: per-node
      deferred work"); we should detect that failure, otherwise we may
      silently fail to register the shrinker even though the raid5
      configuration was set up successfully.
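
      A sketch of the two pieces together (the nr_deferred test follows
      the previous commit; exact call sites are assumptions):

            /* setup_conf(): detect registration failure */
            if (register_shrinker(&conf->shrinker))
                    goto abort;

            /* free_conf(): nr_deferred is only non-NULL after a
             * successful registration */
            if (conf->shrinker.nr_deferred)
                    unregister_shrinker(&conf->shrinker);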
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: fix a potential deadlock · 90bcf133
      Authored by Shaohua Li
      lockdep reports a potential deadlock. Fix this by dropping the mutex
      before calling md_import_device (see the sketch after the lockdep
      trace below).
      
      [ 1137.126601] ======================================================
      [ 1137.127013] [ INFO: possible circular locking dependency detected ]
      [ 1137.127013] 4.8.0-rc4+ #538 Not tainted
      [ 1137.127013] -------------------------------------------------------
      [ 1137.127013] mdadm/16675 is trying to acquire lock:
      [ 1137.127013]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]
      but task is already holding lock:
      [ 1137.127013]  (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
      [ 1137.127013]
      which lock already depends on the new lock.
      
      [ 1137.127013]
      the existing dependency chain (in reverse order) is:
      [ 1137.127013]
      -> #1 (detected_devices_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
      [ 1137.127013]        [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
      [ 1137.127013]        [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
      [ 1137.127013]        [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
      [ 1137.127013]        [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      -> #0 (&bdev->bd_mutex){+.+.+.}:
      [ 1137.127013]        [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
      [ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
      [ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
      [ 1137.127013]        [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
      [ 1137.127013]        [<ffffffff81244307>] blkdev_get+0x227/0x350
      [ 1137.127013]        [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
      [ 1137.127013]        [<ffffffff81a46d65>] lock_rdev+0x35/0x80
      [ 1137.127013]        [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
      [ 1137.127013]        [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
      [ 1137.127013]        [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
      [ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
      [ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
      [ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
      [ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 1137.127013]
      other info that might help us debug this:
      
      [ 1137.127013]  Possible unsafe locking scenario:
      
      [ 1137.127013]        CPU0                    CPU1
      [ 1137.127013]        ----                    ----
      [ 1137.127013]   lock(detected_devices_mutex);
      [ 1137.127013]                                lock(&bdev->bd_mutex);
      [ 1137.127013]                                lock(detected_devices_mutex);
      [ 1137.127013]   lock(&bdev->bd_mutex);
      [ 1137.127013]
       *** DEADLOCK ***
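
      The shape of the fix (a sketch consistent with the trace above, not
      the verbatim diff): drop detected_devices_mutex around the call that
      ends up taking bd_mutex.

            mutex_unlock(&detected_devices_mutex);  /* avoid bd_mutex inversion */
            rdev = md_import_device(dev, 0, 90);
            mutex_lock(&detected_devices_mutex);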
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/bitmap: fix wrong cleanup · f71f1cf9
      Authored by Shaohua Li
      If bitmap_create fails, the bitmap has already been cleaned up and
      the return value is an error pointer, so we must not do the cleanup
      again.
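
      A sketch of the calling pattern (illustrative, following the IS_ERR
      convention described above):

            bitmap = bitmap_create(mddev, -1);
            if (IS_ERR(bitmap)) {
                    rv = PTR_ERR(bitmap);
                    /* bitmap_create cleaned up after itself:
                     * do NOT free the bitmap again here */
            } else {
                    mddev->bitmap = bitmap;
            }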
      Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: allow arbitrary max_hw_sectors · 1dffdddd
      Authored by Shaohua Li
      raid5 splits bios to the proper size internally, so there is no
      point in inheriting the underlying disks' max_hw_sectors. In my qemu
      system, without this change the raid5 array only receives 128k bios,
      which reduces the chance of bio merging when requests are sent down
      to the underlying disks.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: make the resync lock interruptible as well · d6385db9
      Authored by Guoqing Jiang
      When one node is performing resync or recovery, other nodes can't
      get the resync lock and may block for a while before acquiring it,
      so the array can't be stopped immediately in this scenario.
      
      To allow the array to be stopped quickly, check MD_CLOSING in
      dlm_lock_sync_interruptible so that the lock request can be
      interrupted.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: introduce dlm_lock_sync_interruptible to fix task hangs · 7bcda714
      Authored by Guoqing Jiang
      When a node leaves the cluster, its bitmap needs to be synced by
      another node, so an "md*_recover" thread is triggered for that
      purpose. However, with the steps below we can produce a task hang in
      either B or C.
      
      1. Node A creates a resyncing cluster raid1 and it is assembled on
         the other two nodes (B and C).
      2. Stop the array on B and C.
      3. Stop the array on A.
      
      linux44:~ # ps aux|grep md|grep D
      root	5938	0.0  0.1  19852  1964 pts/0    D+   14:52   0:00 mdadm -S md0
      root	5939	0.0  0.0      0     0 ?        D    14:52   0:00 [md0_recover]
      
      linux44:~ # cat /proc/5939/stack
      [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster]
      [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster]
      [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod]
      [<ffffffff8107ad94>] kthread+0xb4/0xc0
      [<ffffffff8152a518>] ret_from_fork+0x58/0x90
      
      linux44:~ # cat /proc/5938/stack
      [<ffffffff8107afde>] kthread_stop+0x6e/0x120
      [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod]
      [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster]
      [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod]
      [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod]
      [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod]
      [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod]
      [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0
      [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40
      [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0
      [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0
      [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
      
      The problem is that recover_bitmaps can't reliably abort when its
      thread is unregistered, so dlm_lock_sync_interruptible is introduced
      to detect the thread's situation and fix the problem.
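
      A sketch of the idea (the wait condition is an assumption pieced
      together from this and the neighbouring patches):

            /* request the lock asynchronously; sync_ast wakes the waiter */
            ret = dlm_lock_async(res, mode); /* helper name is illustrative */
            if (ret)
                    return ret;
            /* wake up when sync_ast fires or when the thread must die */
            wait_event(res->sync_locking,
                       res->sync_locking_done ||
                       kthread_should_stop() ||
                       test_bit(MD_CLOSING, &mddev->flags));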
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: convert the completion to wait queue · fccb60a4
      Authored by Guoqing Jiang
      Previously we used a completion to synchronize between requesting
      the dlm lock and sync_ast. However, we would have to expose
      completion.wait and completion.done in dlm_lock_sync_interruptible
      (introduced later), which is not a normal use of a completion, so
      convert the related code to a wait queue.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: protect md_find_rdev_nr_rcu with rcu lock · 5f0aa21d
      Authored by Guoqing Jiang
      We need to use rcu_read_lock/unlock to avoid a potential race.
      Reported-by: Shaohua Li <shli@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: clean up cluster-related info · c20c33f0
      Authored by Guoqing Jiang
      cluster_info and bitmap_info.nodes also need to be cleared when the
      array is stopped.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: changes for MD_STILL_CLOSED flag · af8d8e6f
      Authored by Guoqing Jiang
      When stopping a clustered raid while it is pending on resync, the
      MD_STILL_CLOSED flag can get cleared because a udev rule is
      triggered and opens the mddev, so the array can't be stopped
      promptly and do_md_stop returns EBUSY.
      
      	mdadm -Ss                    md-raid-arrays.rules
      	set MD_STILL_CLOSED          md_open()
      	... ...                      clear MD_STILL_CLOSED
      	do_md_stop
      
      We make the changes below to resolve this issue (see the sketch
      after this list):
      
      1. Rename MD_STILL_CLOSED to MD_CLOSING, since it is set when
         stopping an array and means we are in the middle of stopping it.
      2. Let md_open return early if MD_CLOSING is set, so no other
         thread can open the array while one thread is trying to close it.
      3. There is no need to clear the CLOSING bit in md_open because
         step 1 ensures the bit is cleared, and then we also don't need to
         test the CLOSING bit in do_md_stop.
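
      A sketch of the md_open() change from point 2 (illustrative shape,
      not the verbatim diff):

            static int md_open(struct block_device *bdev, fmode_t mode)
            {
                    ...
                    if (test_bit(MD_CLOSING, &mddev->flags))
                            return -EBUSY; /* array is being stopped */
                    ...
            }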
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: remove some unnecessary dlm_unlock_sync calls · e3f924d3
      Authored by Guoqing Jiang
      Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to
      call dlm_unlock_sync before freeing a lock resource.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>