1. 25 3月, 2021 2 次提交
  2. 02 2月, 2021 2 次提交
  3. 28 1月, 2021 3 次提交
  4. 25 1月, 2021 2 次提交
  5. 21 1月, 2021 1 次提交
    • X
      md: Set prev_flush_start and flush_bio in an atomic way · dc5d17a3
      Xiao Ni 提交于
      One customer reports a crash problem which causes by flush request. It
      triggers a warning before crash.
      
              /* new request after previous flush is completed */
              if (ktime_after(req_start, mddev->prev_flush_start)) {
                      WARN_ON(mddev->flush_bio);
                      mddev->flush_bio = bio;
                      bio = NULL;
              }
      
      The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
      flush_bio in md_flush_request. But there is no lock protection in
      md_submit_flush_data. It can set flush_bio to NULL first because of
      compiler reordering write instructions.
      
      For example, flush bio1 sets flush bio to NULL first in
      md_submit_flush_data. An interrupt or vmware causing an extended stall
      happen between updating flush_bio and prev_flush_start. Because flush_bio
      is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
      flush bio1 updates prev_flush_start after the interrupt or extended stall.
      
      Then flush bio3 enters in md_flush_request. The start time req_start is
      behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
      finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
      again. INIT_WORK() will re-initialize the list pointers in the
      work_struct, which then can result in a corrupted work list and the
      work_struct queued a second time. With the work list corrupted, it can
      lead in invalid work items being used and cause a crash in
      process_one_work.
      
      We need to make sure only one flush bio can be handled at one same time.
      So add spin lock in md_submit_flush_data to protect prev_flush_start and
      flush_bio in an atomic way.
      Reviewed-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      dc5d17a3
  6. 10 12月, 2020 1 次提交
  7. 05 12月, 2020 1 次提交
  8. 02 12月, 2020 3 次提交
  9. 01 12月, 2020 6 次提交
    • Z
      md/cluster: fix deadlock when node is doing resync job · bca5b065
      Zhao Heming 提交于
      md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
      During sending msg, node can concurrently receive msg from another node.
      When node does resync job, grab token_lockres:EX may trigger a deadlock:
      ```
      nodeA                       nodeB
      --------------------     --------------------
      a.
      send METADATA_UPDATED
      held token_lockres:EX
                               b.
                               md_do_sync
                                resync_info_update
                                  send RESYNCING
                                   + set MD_CLUSTER_SEND_LOCK
                                   + wait for holding token_lockres:EX
      
                               c.
                               mdadm /dev/md0 --remove /dev/sdg
                                + held reconfig_mutex
                                + send REMOVE
                                   + wait_event(MD_CLUSTER_SEND_LOCK)
      
                               d.
                               recv_daemon //METADATA_UPDATED from A
                                process_metadata_update
                                 + (mddev_trylock(mddev) ||
                                    MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                   //this time, both return false forever
      ```
      Explaination:
      a. A send METADATA_UPDATED
         This will block another node to send msg
      
      b. B does sync jobs, which will send RESYNCING at intervals.
         This will be block for holding token_lockres:EX lock.
      
      c. B do "mdadm --remove", which will send REMOVE.
         This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
      
      d. B recv METADATA_UPDATED msg, which send from A in step <a>.
         This will be blocked by step <c>: holding mddev lock, it makes
         wait_event can't hold mddev lock. (btw,
         MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
      
      There is a similar deadlock in commit 0ba95977
      ("md-cluster: use sync way to handle METADATA_UPDATED msg")
      In that commit, step c is "update sb". This patch step c is
      "mdadm --remove".
      
      For fixing this issue, we can refer the solution of function:
      metadata_update_start. Which does the same grab lock_token action.
      lock_comm can use the same steps to avoid deadlock. By moving
      MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
      It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
      but it is safe & can break deadlock.
      
      Repro steps (I only triggered 3 times with hundreds tests):
      
      two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
      ```
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
       --bitmap-chunk=1M
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mkfs.xfs /dev/md0
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      test script will hung when executing "mdadm --remove".
      
      ```
       # dump stacks by "echo t > /proc/sysrq-trigger"
      md0_cluster_rec D    0  5329      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? _cond_resched+0x2d/0x40
       ? schedule+0x4a/0xb0
       ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
       ? wait_woken+0x80/0x80
       ? process_recvd_msg+0x113/0x1d0 [md_cluster]
       ? recv_daemon+0x9e/0x120 [md_cluster]
       ? md_thread+0x94/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      
      mdadm           D    0  5423      1 0x00004004
      Call Trace:
       __schedule+0x1f6/0x560
       ? __schedule+0x1fe/0x560
       ? schedule+0x4a/0xb0
       ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? remove_disk+0x4f/0x90 [md_cluster]
       ? hot_remove_disk+0xb1/0x1b0 [md_mod]
       ? md_ioctl+0x50c/0xba0 [md_mod]
       ? wait_woken+0x80/0x80
       ? blkdev_ioctl+0xa2/0x2a0
       ? block_ioctl+0x39/0x40
       ? ksys_ioctl+0x82/0xc0
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x5f/0x150
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      md0_resync      D    0  5425      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? schedule+0x4a/0xb0
       ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? lock_token+0x2d/0x90 [md_cluster]
       ? resync_info_update+0x95/0x100 [md_cluster]
       ? raid1_sync_request+0x7d3/0xa40 [raid1]
       ? md_do_sync.cold+0x737/0xc8f [md_mod]
       ? md_thread+0x94/0x160 [md_mod]
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      ```
      
      At last, thanks for Xiao's solution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Suggested-by: NXiao Ni <xni@redhat.com>
      Reviewed-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      bca5b065
    • Z
      md/cluster: block reshape with remote resync job · a8da01f7
      Zhao Heming 提交于
      Reshape request should be blocked with ongoing resync job. In cluster
      env, a node can start resync job even if the resync cmd isn't executed
      on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
      will start resync job. However, current update_raid_disks() only check
      local recovery status, which is incomplete. As a result, we see user will
      execute "mdadm --grow" successfully on local, while the remote node deny
      to do reshape job when it doing resync job. The inconsistent handling
      cause array enter unexpected status. If user doesn't observe this issue
      and continue executing mdadm cmd, the array doesn't work at last.
      
      Fix this issue by blocking reshape request. When node executes "--grow"
      and detects ongoing resync, it should stop and report error to user.
      
      The following script reproduces the issue with ~100% probability.
      (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
      ```
       # on node1, node2 is the remote node.
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      a8da01f7
    • P
      md: use current request time as base for ktime comparisons · a23f2aae
      Pankaj Gupta 提交于
      Request coalescing logic uses 'prev_flush_start' as base to
      compare the current request start time. 'prev_flush_start' is
      updated in other context.
      
      This patch changes this by using ktime comparison base to
      'req_start' for better readability of code.
      Signed-off-by: NPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      a23f2aae
    • P
      md: add comments in md_flush_request() · 204d1a64
      Pankaj Gupta 提交于
      Request coalescing logic is dependent on flush time update in other
      context. This patch adds comments to understand the code flow better.
      Signed-off-by: NPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      204d1a64
    • P
      md: improve variable names in md_flush_request() · 81ba3c24
      Pankaj Gupta 提交于
      This patch improves readability by using better variable names
      in flush request coalescing logic.
      Signed-off-by: NPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Reviewed-by: NPaul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      81ba3c24
    • D
      md: fix a warning caused by a race between concurrent md_ioctl()s · c731b84b
      Dae R. Jeong 提交于
      Syzkaller reports a warning as belows.
      WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
      ...
      Call Trace:
      ...
      RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
      RSP: 0018:ffff888096027950 EFLAGS: 00010293
      RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
      RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
      RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
      R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
      R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
       __blkdev_driver_ioctl block/ioctl.c:304 [inline]
       blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
       block_ioctl+0xee/0x130 fs/block_dev.c:1930
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is caused by a race between two concurrenct md_ioctl()s closing
      the array.
      CPU1 (md_ioctl())                   CPU2 (md_ioctl())
      ------                              ------
      set_bit(MD_CLOSING, &mddev->flags);
      did_set_md_closing = true;
                                          WARN_ON_ONCE(test_bit(MD_CLOSING,
                                                  &mddev->flags));
      if(did_set_md_closing)
          clear_bit(MD_CLOSING, &mddev->flags);
      
      Fix the warning by returning immediately if the MD_CLOSING bit is set
      in &mddev->flags which indicates that the array is being closed.
      
      Fixes: 065e519e ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
      Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NDae R. Jeong <dae.r.jeong@kaist.ac.kr>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      c731b84b
  10. 16 11月, 2020 3 次提交
  11. 09 10月, 2020 1 次提交
  12. 25 9月, 2020 3 次提交
  13. 12 9月, 2020 1 次提交
  14. 10 9月, 2020 1 次提交
  15. 02 9月, 2020 1 次提交
  16. 06 8月, 2020 1 次提交
  17. 03 8月, 2020 3 次提交
  18. 22 7月, 2020 2 次提交
    • Z
      md-cluster: fix rmmod issue when md_cluster convert bitmap to none · edee9dfe
      Zhao Heming 提交于
      update_array_info misses calling module_put when removing cluster bitmap.
      
      steps to reproduce:
      ```
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
      /dev/sdb
      mdadm: array /dev/md0 started.
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1 <== should be zero
      dlm                   212992  9 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  2 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  3 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      ```
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      edee9dfe
    • Z
      md-cluster: fix safemode_delay value when converting to clustered bitmap · 7c9d5c54
      Zhao Heming 提交于
      When array convert to clustered bitmap, the safe_mode_delay doesn't
      clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original
      value after changing bitmap type. In safe_delay_store(), the code forbids
      setting mddev->safemode_delay when array is clustered. So in cluster-md
      env, the expected safemode_delay value should be 0.
      
      Reproducible steps:
      ```
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm --grow /dev/md0 --bitmap=clustered
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204  <== doesn't change, should ZERO for cluster-md
      
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000
      node1 # mdadm -G /dev/md0 -b none
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000  <== doesn't change, should default value
      ```
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      7c9d5c54
  19. 16 7月, 2020 3 次提交