1. 19 October 2021: 6 commits
  2. 18 October 2021: 3 commits
  3. 22 September 2021: 1 commit
    • md: fix a lock order reversal in md_alloc · 7df835a3
      Authored by Christoph Hellwig
      Commit b0140891 ("md: Fix race when creating a new md device.")
      not only moved assigning mddev->gendisk before calling add_disk, which
      fixes the races described in the commit log, but also added a
      mddev->open_mutex critical section over add_disk and creation of the
      md kobj.  Adding a kobject after add_disk is racy versus deleting the
      gendisk right after adding it, but md already guards against that by
      holding a mddev->active reference.
      
      On the other hand, taking this lock added a lock order reversal with
      what is now disk->open_mutex (used to be bdev->bd_mutex when the commit
      was added) for partition devices, which need that lock for the internal
      open for the partition scan, and a recent commit also takes it for
      non-partitioned devices, leading to further lockdep splatter.
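
      Since mddev->active already protects against the gendisk being deleted
      right after it is added, the shape of the fix is to stop holding
      mddev->open_mutex across add_disk() and the md kobject registration. A
      rough sketch of the resulting ordering in md_alloc (illustrative only,
      not the literal upstream diff):

      ```
      mddev->gendisk = disk;
      /* add_disk() and the md kobject registration are no longer wrapped in
       * a mddev->open_mutex critical section, which removes the lock order
       * reversal against disk->open_mutex taken by the partition-scan open. */
      add_disk(disk);

      error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md");
      if (error)
              pr_debug("md: cannot register %s/md - name in use\n",
                       disk->disk_name);
      ```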
      
      Fixes: b0140891 ("md: Fix race when creating a new md device.")
      Fixes: d6263387 ("block: support delayed holder registration")
      Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  4. 15 June 2021: 5 commits
  5. 01 June 2021: 1 commit
  6. 24 April 2021: 1 commit
    • md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9
      Authored by Heming Zhao
      md_kick_rdev_from_array will remove the rdev, so we should use
      rdev_for_each_safe to iterate over the list.
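
      As context, a minimal sketch of the safe iteration in md_check_recovery
      (illustrative only, not the exact upstream hunk):

      ```
      struct md_rdev *rdev, *tmp;

      /* rdev_for_each_safe() caches the next entry in "tmp", so it stays
       * valid even when md_kick_rdev_from_array() unlinks rdev from
       * mddev->disks; the plain rdev_for_each() would walk into freed
       * memory here. */
      rdev_for_each_safe(rdev, tmp, mddev) {
              if (test_and_clear_bit(ClusterRemove, &rdev->flags) &&
                  rdev->raid_disk < 0)
                      md_kick_rdev_from_array(rdev);
      }
      ```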
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
  7. 16 April 2021: 3 commits
  8. 08 April 2021: 3 commits
    • md: split mddev_find · 65aa97c4
      Authored by Christoph Hellwig
      Split mddev_find into a simple mddev_find that just looks up an existing
      mddev by the unit number, and a more involved mddev_find_or_alloc that
      handles finding or allocating an mddev.
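
      A rough sketch of the lookup-only half (illustrative; mddev_find_locked
      and mddev_get are existing md.c helpers, and the find-or-allocate path
      keeps the old behaviour under a separate name):

      ```
      /* Lookup only: return the existing mddev for "unit" with a reference
       * taken, or NULL. It never allocates, so md_open can no longer
       * resurrect a device that is in the middle of being deleted. */
      static struct mddev *mddev_find(dev_t unit)
      {
              struct mddev *mddev;

              if (MAJOR(unit) != MD_MAJOR)
                      unit &= ~((1 << MdpMinorShift) - 1);

              spin_lock(&all_mddevs_lock);
              mddev = mddev_find_locked(unit);
              if (mddev)
                      mddev_get(mddev);
              spin_unlock(&all_mddevs_lock);

              return mddev;
      }
      ```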
      
      This turns out to fix this bug reported by Zhao Heming.
      
      ----------------------------- snip ------------------------------
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creation & removal:
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain that mddev. With the current code logic it is very easy to
      trigger a soft lockup in a non-preempt environment.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      The issue triggers about once in 10 runs of the script below:
      
      ```
      1  node1="15sp3-mdcluster1"
      2  node2="15sp3-mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..100}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      I use an mdcluster environment to trigger the soft lockup, but it isn't
      an mdcluster-specific bug. Stopping an md array in a cluster environment
      does more work than stopping a non-clustered array, which leaves a large
      enough window for the kernel to run md_open.
      
      *** stack ***
      
      ```
      ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
       #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
       #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
       #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
       #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
       #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
       #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
       #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
       #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
       #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
       #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
      
      ```

      or:

      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```

      *** root cause ***

      "mdadm -A" (or other array-assemble commands) starts an "mdadm
      --monitor" daemon by default. When "mdadm -Ss" runs, the stop action
      wakes up "mdadm --monitor", which immediately reads /proc/mdstat. At
      this point the mddev still exists in the kernel, so /proc/mdstat still
      shows the md device, which makes "mdadm --monitor" open /dev/md0.
      
      The preceding "mdadm -Ss" is a removing action, while the "mdadm
      --monitor" open triggers md_open, which is a creating action: the two
      race with each other.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```

      In a non-preempt kernel, <thread 2> keeps occupying the current CPU, and
      the mddev_delayed_delete work queued by <thread 1> can't be scheduled.
      
      In a preempt kernel the same race can trigger, but the kernel doesn't
      allow one thread to run on a CPU indefinitely. After <thread 2> has run
      for a while, the next "mdadm -A" (script line 13 above) calls md_alloc
      to allocate a new gendisk for the mddev. That makes the md_open
      condition "if (mddev->gendisk != bdev->bd_disk)" no longer hold, so
      md_open returns 0 to the caller and the soft lockup is broken.
      ------------------------------ snip ------------------------------
      
      Cc: stable@vger.kernel.org
      Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
      Reported-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: factor out a mddev_find_locked helper from mddev_find · 8b57251f
      Authored by Christoph Hellwig
      Factor out a self-contained helper that just looks up an mddev by the
      dev_t "unit".
      
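      A minimal sketch of such a helper (the caller is assumed to already hold
      all_mddevs_lock; illustrative, not necessarily the exact upstream code):

      ```
      /* Look up an mddev by unit number. Must be called with
       * all_mddevs_lock held; returns NULL if no such mddev exists. */
      static struct mddev *mddev_find_locked(dev_t unit)
      {
              struct mddev *mddev;

              list_for_each_entry(mddev, &all_mddevs, all_mddevs)
                      if (mddev->unit == unit)
                              return mddev;

              return NULL;
      }
      ```
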
      Cc: stable@vger.kernel.org
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: md_open returns -EBUSY when entering racing area · 6a4db2a6
      Authored by Zhao Heming
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creation & removal:
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain that mddev. With the current code logic it is very easy to
      trigger a soft lockup in a non-preempt environment.
      
      This patch changes the md_open return value from -ERESTARTSYS to -EBUSY,
      which breaks the infinite retry when md_open enters the racing area.
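
      A sketch of the affected check in md_open (illustrative; the surrounding
      code differs between kernel versions):

      ```
      if (mddev->gendisk != bdev->bd_disk) {
              /* We raced with mddev_put, which is discarding this bd_disk.
               * Returning -ERESTARTSYS here made the open path retry forever
               * on a non-preempt kernel; returning -EBUSY fails the open
               * instead and lets mddev_delayed_delete finally run. */
              mddev_put(mddev);
              if (work_pending(&mddev->del_work))
                      flush_workqueue(md_misc_wq);
              return -EBUSY;
      }
      ```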
      
      This patch is a partial fix of the soft lockup issue; the full fix
      requires splitting mddev_find into two functions, mddev_find &
      mddev_find_or_alloc, and md_open should call the new mddev_find (which
      only does the lookup).
      
      For more detail, please refer to Christoph's "split mddev_find" patch
      in later commits.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      The issue triggers almost every time with the script below:
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I use an mdcluster environment to trigger the soft lockup, but it isn't
      an mdcluster-specific bug. Stopping an md array in a cluster environment
      does more work than stopping a non-clustered array, which leaves a large
      enough window for the kernel to run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** root cause ***

      "mdadm -A" (or other array-assemble commands) starts an "mdadm
      --monitor" daemon by default. When "mdadm -Ss" runs, the stop action
      wakes up "mdadm --monitor", which immediately reads /proc/mdstat. At
      this point the mddev still exists in the kernel, so /proc/mdstat still
      shows the md device, which makes "mdadm --monitor" open /dev/md0.
      
      The preceding "mdadm -Ss" is a removing action, while the "mdadm
      --monitor" open triggers md_open, which is a creating action: the two
      race with each other.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      In a non-preempt kernel, <thread 2> keeps occupying the current CPU, and
      the mddev_delayed_delete work queued by <thread 1> can't be scheduled.
      
      In a preempt kernel the same race can trigger, but the kernel doesn't
      allow one thread to run on a CPU indefinitely. After <thread 2> has run
      for a while, the next "mdadm -A" (script line 13 above) calls md_alloc
      to allocate a new gendisk for the mddev. That makes the md_open
      condition "if (mddev->gendisk != bdev->bd_disk)" no longer hold, so
      md_open returns 0 to the caller and the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
  9. 25 March 2021: 2 commits
  10. 02 February 2021: 2 commits
  11. 28 January 2021: 3 commits
  12. 25 January 2021: 2 commits
  13. 21 January 2021: 1 commit
    • md: Set prev_flush_start and flush_bio in an atomic way · dc5d17a3
      Authored by Xiao Ni
      One customer reported a crash caused by a flush request; a warning is
      triggered before the crash:
      
      ```
      /* new request after previous flush is completed */
      if (ktime_after(req_start, mddev->prev_flush_start)) {
              WARN_ON(mddev->flush_bio);
              mddev->flush_bio = bio;
              bio = NULL;
      }
      ```
      
      The WARN_ON is triggered. We use a spin lock to protect prev_flush_start
      and flush_bio in md_flush_request, but there is no lock protection in
      md_submit_flush_data. It can set flush_bio to NULL first because the
      compiler may reorder the write instructions.
      
      For example, flush bio1 sets flush_bio to NULL first in
      md_submit_flush_data. An interrupt, or a VMware-induced extended stall,
      happens between updating flush_bio and prev_flush_start. Because
      flush_bio is NULL, flush bio2 can get the lock and be submitted to the
      underlying disks. Then flush bio1 updates prev_flush_start after the
      interrupt or extended stall.
      
      Then flush bio3 enters md_flush_request. Its start time req_start is
      later than prev_flush_start, and flush_bio is not NULL (flush bio2
      hasn't finished), so the WARN_ON triggers. md_flush_request then calls
      INIT_WORK again. INIT_WORK() re-initializes the list pointers in the
      work_struct, which can result in a corrupted work list and the
      work_struct being queued a second time. With the work list corrupted,
      invalid work items can be used, causing a crash in process_one_work.
      
      We need to make sure that only one flush bio can be handled at a time,
      so add a spin lock in md_submit_flush_data to update prev_flush_start
      and flush_bio atomically.
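
      A sketch of the locking added in md_submit_flush_data (illustrative; the
      exact upstream hunk may differ slightly):

      ```
      static void md_submit_flush_data(struct work_struct *ws)
      {
              struct mddev *mddev = container_of(ws, struct mddev, flush_work);
              struct bio *bio = mddev->flush_bio;

              /* Update prev_flush_start and clear flush_bio under mddev->lock,
               * the same lock md_flush_request takes, so the two fields always
               * change together and a new flush can't observe a half-updated
               * state. */
              spin_lock_irq(&mddev->lock);
              mddev->prev_flush_start = mddev->start_flush;
              mddev->flush_bio = NULL;
              spin_unlock_irq(&mddev->lock);
              wake_up(&mddev->sb_wait);

              /* ... the rest of the function submits bio as before ... */
      }
      ```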
      Reviewed-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  14. 10 December 2020: 1 commit
  15. 05 December 2020: 1 commit
  16. 02 December 2020: 3 commits
  17. 01 December 2020: 2 commits
    • md/cluster: fix deadlock when node is doing resync job · bca5b065
      Authored by Zhao Heming
      md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages
      exclusively. While sending a message, a node can concurrently receive
      messages from another node. When a node is doing a resync job, grabbing
      token_lockres:EX may trigger a deadlock:
      ```
      nodeA                       nodeB
      --------------------     --------------------
      a.
      send METADATA_UPDATED
      held token_lockres:EX
                               b.
                               md_do_sync
                                resync_info_update
                                  send RESYNCING
                                   + set MD_CLUSTER_SEND_LOCK
                                   + wait for holding token_lockres:EX
      
                               c.
                               mdadm /dev/md0 --remove /dev/sdg
                                + held reconfig_mutex
                                + send REMOVE
                                   + wait_event(MD_CLUSTER_SEND_LOCK)
      
                               d.
                               recv_daemon //METADATA_UPDATED from A
                                process_metadata_update
                                 + (mddev_trylock(mddev) ||
                                    MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                   //this time, both return false forever
      ```
      Explanation:
      a. A sends METADATA_UPDATED.
         This blocks the other node from sending messages.

      b. B does its sync job, which sends RESYNCING at intervals.
         This is blocked waiting to hold the token_lockres:EX lock.

      c. B runs "mdadm --remove", which sends REMOVE.
         This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

      d. B receives the METADATA_UPDATED msg sent from A in step <a>.
         This is blocked by step <c>: the mddev lock is already held, so the
         wait_event can never take it. (MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
         stays ZERO in this scenario.)
      
      There is a similar deadlock fixed in commit 0ba95977
      ("md-cluster: use sync way to handle METADATA_UPDATED msg"). In that
      commit, step c is "update sb"; in this patch, step c is
      "mdadm --remove".
      
      To fix this issue we can follow the solution used in
      metadata_update_start, which performs the same grab-lock-token action:
      lock_comm can use the same steps to avoid the deadlock, by moving
      MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. This
      slightly enlarges the window in which MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set, but it is safe and breaks the deadlock.
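
      A sketch of what lock_comm looks like with the flag moved up from
      lock_token (illustrative; names follow the md-cluster code, exact
      details may differ):

      ```
      static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
      {
              int rv, set_bit = 0;
              struct mddev *mddev = cinfo->mddev;

              /* Set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before waiting for
               * MD_CLUSTER_SEND_LOCK, so the receive daemon's
               * process_metadata_update can make progress even while we are
               * stuck waiting to send; this breaks the a/b/c/d cycle above. */
              if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
                                            &cinfo->state)) {
                      rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
                                                 &cinfo->state);
                      WARN_ON_ONCE(rv);
                      md_wakeup_thread(mddev->thread);
                      set_bit = 1;
              }

              wait_event(cinfo->wait,
                         !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
              rv = lock_token(cinfo);
              if (set_bit)
                      clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
                                       &cinfo->state);
              return rv;
      }
      ```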
      
      Repro steps (I triggered it only 3 times in hundreds of test runs):
      
      two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
      ```
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
       --bitmap-chunk=1M
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mkfs.xfs /dev/md0
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      The test script hangs when executing "mdadm --remove".
      
      ```
       # dump stacks by "echo t > /proc/sysrq-trigger"
      md0_cluster_rec D    0  5329      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? _cond_resched+0x2d/0x40
       ? schedule+0x4a/0xb0
       ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
       ? wait_woken+0x80/0x80
       ? process_recvd_msg+0x113/0x1d0 [md_cluster]
       ? recv_daemon+0x9e/0x120 [md_cluster]
       ? md_thread+0x94/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      
      mdadm           D    0  5423      1 0x00004004
      Call Trace:
       __schedule+0x1f6/0x560
       ? __schedule+0x1fe/0x560
       ? schedule+0x4a/0xb0
       ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? remove_disk+0x4f/0x90 [md_cluster]
       ? hot_remove_disk+0xb1/0x1b0 [md_mod]
       ? md_ioctl+0x50c/0xba0 [md_mod]
       ? wait_woken+0x80/0x80
       ? blkdev_ioctl+0xa2/0x2a0
       ? block_ioctl+0x39/0x40
       ? ksys_ioctl+0x82/0xc0
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x5f/0x150
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      md0_resync      D    0  5425      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? schedule+0x4a/0xb0
       ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? lock_token+0x2d/0x90 [md_cluster]
       ? resync_info_update+0x95/0x100 [md_cluster]
       ? raid1_sync_request+0x7d3/0xa40 [raid1]
       ? md_do_sync.cold+0x737/0xc8f [md_mod]
       ? md_thread+0x94/0x160 [md_mod]
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      ```
      
      Finally, thanks to Xiao for the solution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Suggested-by: Xiao Ni <xni@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md/cluster: block reshape with remote resync job · a8da01f7
      Authored by Zhao Heming
      A reshape request should be blocked while a resync job is ongoing. In a
      cluster environment a node can start a resync job even if the resync
      command wasn't executed on it: e.g. the user runs "mdadm --grow" on node
      A, and sometimes node B starts the resync job. However, the current
      update_raid_disks() only checks the local recovery status, which is
      incomplete. As a result, the user can run "mdadm --grow" successfully on
      the local node while the remote node refuses to do the reshape because
      it is busy with the resync job. This inconsistent handling leaves the
      array in an unexpected state; if the user doesn't notice and keeps
      issuing mdadm commands, the array eventually stops working.
      
      Fix this issue by blocking the reshape request: when a node executes
      "--grow" and detects an ongoing resync, it stops and reports an error to
      the user.
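
      A sketch of the kind of check this adds in update_raid_disks
      (illustrative; the exact upstream condition may differ):

      ```
      /* In update_raid_disks(): refuse to change the disk count while any
       * resync is running, including one driven by the remote node
       * (MD_RESYNCING_REMOTE), not just a local sync_thread. */
      if (mddev->sync_thread ||
          test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
          test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
          mddev->reshape_position != MaxSector)
              return -EBUSY;
      ```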
      
      The following script reproduces the issue with ~100% probability.
      (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
      ```
       # on node1, node2 is the remote node.
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>