1. 26 April 2022 (3 commits)
    • md: fix an incorrect NULL check in md_reload_sb · 64c54d92
      Authored by Xiaomeng Tong
      The bug is here:
      	if (!rdev || rdev->desc_nr != nr) {
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each_rcu(), so it is incorrect to assume that the
      iterator value will be NULL if the list is empty or no element is
      found (in fact, it will be a bogus pointer computed from the list
      HEAD rather than a valid struct object). The check is therefore
      bypassed, leading to an invalid memory access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer
      that points to the found element.
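      
      A minimal sketch of the fix pattern (context abbreviated; the
      surrounding function body is assumed rather than quoted verbatim
      from the diff):
      
      ```
      struct md_rdev *rdev = NULL, *iter;
      
      /* 'iter' walks the list; 'rdev' is only set on a real match */
      rdev_for_each_rcu(iter, mddev) {
      	if (iter->desc_nr == nr) {
      		rdev = iter;
      		break;
      	}
      }
      
      if (!rdev)	/* NULL now reliably means "not found" */
      	return;
      ```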
      
      Cc: stable@vger.kernel.org
      Fixes: 70bcecdb ("md-cluster: Improve md_reload_sb to be less error prone")
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix an incorrect NULL check in does_sb_need_changing · fc873834
      Authored by Xiaomeng Tong
      The bug is here:
      	if (!rdev)
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each(), so it is incorrect to assume that the iterator
      value will be NULL if the list is empty or no element is found.
      The NULL check is therefore bypassed, leading to an invalid memory
      access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      point to the found element.
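      
      The underlying pitfall is easy to demonstrate outside the kernel.
      Below is a hedged, self-contained userspace demo (a simplified
      re-implementation of the kernel's list_for_each_entry() expansion,
      not kernel code) showing that the iterator ends up as a bogus
      non-NULL pointer even when the list is empty:
      
      ```
      #include <stddef.h>
      #include <stdio.h>
      
      struct list_head { struct list_head *next, *prev; };
      
      #define container_of(ptr, type, member) \
      	((type *)((char *)(ptr) - offsetof(type, member)))
      
      struct rdev { int desc_nr; struct list_head entry; };
      
      int main(void)
      {
      	struct list_head head = { &head, &head };	/* empty list */
      	struct rdev *it;
      
      	/* expansion of list_for_each_entry() over the empty list */
      	for (it = container_of(head.next, struct rdev, entry);
      	     &it->entry != &head;
      	     it = container_of(it->entry.next, struct rdev, entry))
      		;
      
      	/* 'it' points just before 'head' itself: bogus, but not NULL */
      	printf("it = %p (non-NULL), head = %p\n", (void *)it, (void *)&head);
      	return 0;
      }
      ```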
      
      Cc: stable@vger.kernel.org
      Fixes: 2aa82191 ("md-cluster: Perform a lazy update")
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Set MD_BROKEN for RAID1 and RAID10 · 9631abdb
      Authored by Mariusz Tkaczyk
      There is no direct mechanism to determine raid failure outside the
      personality. It is done by checking rdev->flags after executing
      md_error(). If the "faulty" flag is not set, then -EBUSY is
      returned to userspace. -EBUSY means that the array will fail after
      the drive is removed.
      
      Mdadm has a special routine to handle array failure, and it is
      executed if -EBUSY is returned by md.
      
      There are at least two known reasons not to consider this
      mechanism correct:
      1. a drive can be removed even if the array will be failed[1].
      2. -EBUSY seems to be the wrong status: the array is not busy, but
         the removal process cannot proceed safely.
      
      The -EBUSY expectation cannot be removed without breaking
      compatibility with userspace. In this patch the first issue is
      resolved by adding support for the MD_BROKEN flag to RAID1 and
      RAID10. Support for RAID456 is added in the next commit.
      
      The idea is to set MD_BROKEN once we are sure that the raid is in
      a failed state. This is done in each error_handler(). After
      md_error() the MD_BROKEN flag is checked; if it is set, -EBUSY is
      returned to userspace.
      
      As in the previous commit, this makes "mdadm --set-faulty" able to
      fail the array. The previously proposed workaround remains valid
      if the optional functionality[1] is disabled.
      
      [1] commit 9a567843 ("md: allow last device to be forcibly removed
          from RAID1/RAID10.")
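      
      A hedged sketch of the flow described above (the condition and
      call sites follow the commit message, not the verbatim upstream
      diff):
      
      ```
      /* in a personality's error_handler(), e.g. RAID1: losing the last
       * operational device means the array as a whole is dead */
      if (test_bit(In_sync, &rdev->flags) &&
          conf->raid_disks - mddev->degraded == 1)
      	set_bit(MD_BROKEN, &mddev->flags);
      
      /* in the userspace-facing path (e.g. "mdadm --set-faulty"): */
      md_error(mddev, rdev);
      if (test_bit(MD_BROKEN, &mddev->flags))
      	return -EBUSY;	/* the array is failed, not merely busy */
      ```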
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. 18 April 2022 (2 commits)
  3. 09 March 2022 (1 commit)
  4. 04 February 2022 (1 commit)
  5. 03 February 2022 (1 commit)
    • md: fix NULL pointer deref with nowait but no mddev->queue · 0f9650bd
      Authored by Song Liu
      Leon reported NULL pointer deref with nowait support:
      
      [   15.123761] device-mapper: raid: Loading target version 1.15.1
      [   15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1
      [   15.124192] device-mapper: raid: Choosing default region size of 4MiB
      [   15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060
      [   15.129530] #PF: supervisor write access in kernel mode
      [   15.129533] #PF: error_code(0x0002) - not-present page
      [   15.129535] PGD 0 P4D 0
      [   15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI
      [   15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e
      [   15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021
      [   15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20
      [   15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \
             00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \
             31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00
      [   15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202
      [   15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000
      [   15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d
      [   15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000
      [   15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070
      [   15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001
      [   15.129572] FS:  00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000
      [   15.129575] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0
      [   15.129580] Call Trace:
      [   15.129582]  <TASK>
      [   15.129584]  md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531]
      [   15.129597]  raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63]
      [   15.129604]  ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129615]  dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129625]  table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129635]  ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129644]  ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129655]  dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
      [   15.129663]  __x64_sys_ioctl+0x8e/0xd0
      [   15.129667]  do_syscall_64+0x5c/0x90
      [   15.129672]  ? syscall_exit_to_user_mode+0x23/0x50
      [   15.129675]  ? do_syscall_64+0x69/0x90
      [   15.129677]  ? do_syscall_64+0x69/0x90
      [   15.129679]  ? syscall_exit_to_user_mode+0x23/0x50
      [   15.129682]  ? do_syscall_64+0x69/0x90
      [   15.129684]  ? do_syscall_64+0x69/0x90
      [   15.129686]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   15.129689] RIP: 0033:0x7fa96ecd559b
      [   15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \
          c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \
          ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48
      [   15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
      [   15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b
      [   15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003
      [   15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27
      [   15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610
      [   15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530
      [   15.129709]  </TASK>
      
      This is caused by a missing mddev->queue check before setting
      QUEUE_FLAG_NOWAIT. Fix it by moving the QUEUE_FLAG_NOWAIT logic
      under the mddev->queue check.
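      
      As a hedged sketch, the shape of the fix (dm-raid creates the
      mddev without a request queue, so queue flags must only be touched
      when md owns one; the 'nowait' variable is assumed from context):
      
      ```
      if (mddev->queue) {
      	/* ... existing queue setup in md_run() ... */
      	if (nowait)	/* every member device supports REQ_NOWAIT */
      		blk_queue_flag_set(QUEUE_FLAG_NOWAIT, mddev->queue);
      }
      ```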
      
      Fixes: f51d46d0 ("md: add support for REQ_NOWAIT")
      Reported-by: Leon Möller <jkhsjdhjs@totally.rip>
      Tested-by: Leon Möller <jkhsjdhjs@totally.rip>
      Cc: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
  6. 02 February 2022 (2 commits)
  7. 07 January 2022 (4 commits)
    • md: use default_groups in kobj_type · 1745e857
      Authored by Greg Kroah-Hartman
      There are currently two ways to create a set of sysfs files for a
      kobj_type: through the default_attrs field and through the
      default_groups field.  Move the md rdev sysfs code to use the
      default_groups field, which has been the preferred way since
      commit aa30f47c ("kobject: Add support for default attribute
      groups to kobj_type"), so that we can soon get rid of the obsolete
      default_attrs field.
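      
      A hedged sketch of the conversion pattern (the attribute names are
      illustrative, not quoted from the diff):
      
      ```
      static struct attribute *rdev_default_attrs[] = {
      	&rdev_state_attr.attr,		/* illustrative entries */
      	&rdev_errors_attr.attr,
      	NULL,
      };
      ATTRIBUTE_GROUPS(rdev_default);		/* emits rdev_default_groups */
      
      static struct kobj_type rdev_ktype = {
      	.release	= rdev_free,
      	.sysfs_ops	= &rdev_sysfs_ops,
      	.default_groups	= rdev_default_groups,	/* was: .default_attrs */
      };
      ```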
      
      Cc: Song Liu <song@kernel.org>
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Move alloc/free acct bioset in to personality · 0c031fd3
      Authored by Xiao Ni
      The acct bioset is only needed for raid0 and raid5, so md_run only
      allocates it for those levels. However, this does not cover
      personality takeover, which can leave the bioset uninitialized.
      For example, the following repro steps:
      
        mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
        mdadm --wait /dev/md0
        mkfs.xfs /dev/md0
        mdadm /dev/md0 --grow -l5
        mount /dev/md0 /mnt
      
      causes panic like:
      
      [  225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  225.934903] #PF: supervisor instruction fetch in kernel mode
      [  225.935639] #PF: error_code(0x0010) - not-present page
      [  225.936361] PGD 0 P4D 0
      [  225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
      [  225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
      [  225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
      [  225.939922] RIP: 0010:0x0
      [  225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      [  225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
      [  225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
      [  225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
      [  225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
      [  225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
      [  225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
      [  225.946709] FS:  00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
      [  225.947814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
      [  225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  225.951414] Call Trace:
      [  225.951787]  <TASK>
      [  225.952120]  mempool_alloc+0xe5/0x250
      [  225.952625]  ? mempool_resize+0x370/0x370
      [  225.953187]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.953862]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.954464]  ? sched_clock_cpu+0x15/0x120
      [  225.955019]  ? find_held_lock+0xac/0xd0
      [  225.955564]  bio_alloc_bioset+0x1ed/0x2a0
      [  225.956080]  ? lock_downgrade+0x3a0/0x3a0
      [  225.956644]  ? bvec_alloc+0xc0/0xc0
      [  225.957135]  bio_clone_fast+0x19/0x80
      [  225.957651]  raid5_make_request+0x1370/0x1b70
      [  225.958286]  ? sched_clock_cpu+0x15/0x120
      [  225.958797]  ? __lock_acquire+0x8b2/0x3510
      [  225.959339]  ? raid5_get_active_stripe+0xce0/0xce0
      [  225.959986]  ? lock_is_held_type+0xd8/0x130
      [  225.960528]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.961135]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.961703]  ? sched_clock_cpu+0x15/0x120
      [  225.962232]  ? lock_release+0x27a/0x6c0
      [  225.962746]  ? do_wait_intr_irq+0x130/0x130
      [  225.963302]  ? lock_downgrade+0x3a0/0x3a0
      [  225.963815]  ? lock_release+0x6c0/0x6c0
      [  225.964348]  md_handle_request+0x342/0x530
      [  225.964888]  ? set_in_sync+0x170/0x170
      [  225.965397]  ? blk_queue_split+0x133/0x150
      [  225.965988]  ? __blk_queue_split+0x8b0/0x8b0
      [  225.966524]  ? submit_bio_checks+0x3b2/0x9d0
      [  225.967069]  md_submit_bio+0x127/0x1c0
      [...]
      
      Fix this by moving alloc/free of the acct bioset into pers->run
      and pers->free.
      
      While we are at it, properly handle md_integrity_register() errors
      in raid0_run().
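      
      A hedged sketch of the resulting shape of raid0_run() (the helper
      names follow the commit's intent; exact upstream naming may
      differ):
      
      ```
      static int raid0_run(struct mddev *mddev)
      {
      	int ret;
      
      	/* the acct bioset is now owned by the personalities needing it */
      	ret = acct_bioset_init(mddev);
      	if (ret)
      		return ret;
      
      	/* ... existing raid0 setup ... */
      
      	ret = md_integrity_register(mddev);
      	if (ret)
      		goto exit_acct_set;	/* previously ignored */
      
      	return 0;
      
      exit_acct_set:
      	acct_bioset_exit(mddev);
      	return ret;
      }
      ```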
      
      Fixes: daee2024 ("md: check level before create and exit io_acct_set")
      Cc: stable@vger.kernel.org
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix spelling of "its" · dd3dc5f4
      Authored by Randy Dunlap
      Use the possessive "its" instead of the contraction "it's"
      in printed messages.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Song Liu <song@kernel.org>
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: Song Liu <song@kernel.org>
    • md: add support for REQ_NOWAIT · f51d46d0
      Authored by Vishal Verma
      commit 021a2446 ("block: add QUEUE_FLAG_NOWAIT") added support
      for checking whether a given bdev supports handling of REQ_NOWAIT.
      Since then, commit 6abc4946 ("dm: add support for REQ_NOWAIT and
      enable it for linear target") added REQ_NOWAIT support to dm. This
      patch uses a similar approach to incorporate REQ_NOWAIT for
      md-based bios.
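      
      A hedged sketch of the approach for raid0 (nowait is enabled only
      when every member device supports it; the fragment's shape is
      assumed, not quoted from the diff):
      
      ```
      bool nowait = true;
      struct md_rdev *rdev;
      
      rdev_for_each(rdev, mddev)
      	nowait = nowait &&
      		 blk_queue_nowait(bdev_get_queue(rdev->bdev));
      
      if (nowait)
      	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, mddev->queue);
      ```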
      
      This patch was tested using the t/io_uring tool shipped with fio.
      An NVMe drive was partitioned into two partitions, and a simple
      raid0 configuration /dev/md0 was created.
      
      md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
            937423872 blocks super 1.2 512k chunks
      
      Before patch:
      
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
      
      Running top while the above runs:
      
      $ ps -eL | grep $(pidof io_uring)
      
        38396   38396 pts/2    00:00:00 io_uring
        38396   38397 pts/2    00:00:15 io_uring
        38396   38398 pts/2    00:00:13 iou-wrk-38397
      
      We can see the iou-wrk-38397 io worker thread, which io_uring
      creates when it sees that the underlying device (/dev/md0 in this
      case) doesn't support nowait.
      
      After patch:
      
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
      
      Running top while the above runs:
      
      $ ps -eL | grep $(pidof io_uring)
      
        38341   38341 pts/2    00:10:22 io_uring
        38341   38342 pts/2    00:10:37 io_uring
      
      After this patch, no io worker thread is created, which indicates
      that io_uring saw that the underlying device supports nowait. This
      is exactly the behaviour observed on a dm device, which also
      supports nowait.
      
      For the remaining raid personalities (everything except raid0),
      the code paths around their make_request functions would still
      need to be adapted before they can correctly handle REQ_NOWAIT.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
  8. 11 December 2021 (2 commits)
  9. 29 November 2021 (1 commit)
  10. 19 October 2021 (7 commits)
  11. 18 October 2021 (3 commits)
  12. 22 September 2021 (1 commit)
    • md: fix a lock order reversal in md_alloc · 7df835a3
      Authored by Christoph Hellwig
      Commit b0140891 ("md: Fix race when creating a new md device.")
      not only moved assigning mddev->gendisk before calling add_disk,
      which fixes the races described in the commit log, but also added
      a mddev->open_mutex critical section over add_disk and creation of
      the md kobj.  Adding a kobject after add_disk is racy vs deleting
      the gendisk right after adding it, but md already protects against
      that by holding a mddev->active reference.
      
      On the other hand, taking this lock added a lock order reversal
      with what is now disk->open_mutex (bdev->bd_mutex when the commit
      was added) for partition devices, which need that lock for the
      internal open during the partition scan, and a recent commit also
      takes it for non-partitioned devices, leading to further lockdep
      splatter.
      
      Fixes: b0140891 ("md: Fix race when creating a new md device.")
      Fixes: d6263387 ("block: support delayed holder registration")
      Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  13. 15 June 2021 (5 commits)
  14. 01 June 2021 (1 commit)
  15. 24 April 2021 (1 commit)
    • md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9
      Authored by Heming Zhao
      md_kick_rdev_from_array() removes the rdev, so we should use
      rdev_for_each_safe() to iterate over the list.
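      
      A minimal sketch of the safe-iteration pattern (the check inside
      the loop is abbreviated from md_check_recovery(), not quoted
      verbatim):
      
      ```
      struct md_rdev *rdev, *tmp;
      
      rdev_for_each_safe(rdev, tmp, mddev) {
      	if (test_bit(ClusterRemove, &rdev->flags) &&
      	    rdev->raid_disk < 0)
      		/* frees rdev; 'tmp' already holds the next element */
      		md_kick_rdev_from_array(rdev);
      }
      ```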
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
  16. 16 April 2021 (3 commits)
  17. 08 April 2021 (2 commits)
    • md: split mddev_find · 65aa97c4
      Authored by Christoph Hellwig
      Split mddev_find into a simple mddev_find that just finds an
      existing mddev by the unit number, and a more complicated helper
      that deals with finding or allocating an mddev.
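      
      A hedged sketch of the lookup-only variant (its shape is inferred
      from the description above, not quoted from the diff):
      
      ```
      static struct mddev *mddev_find(dev_t unit)
      {
      	struct mddev *mddev;
      
      	if (MAJOR(unit) != MD_MAJOR)
      		unit &= ~((1 << MdpMinorShift) - 1);
      
      	spin_lock(&all_mddevs_lock);
      	mddev = mddev_find_locked(unit);
      	if (mddev)
      		mddev_get(mddev);
      	spin_unlock(&all_mddevs_lock);
      
      	return mddev;	/* NULL: no such array, so md_open can bail out */
      }
      ```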
      
      This turns out to fix this bug reported by Zhao Heming.
      
      ----------------------------- snip ------------------------------
      commit d3374825 ("md: make devices disappear when they are no
      longer needed.") introduced protection between mddev creation &
      removal. md_open shouldn't create an mddev when the all_mddevs
      list doesn't contain it. With the current code logic, it is very
      easy to trigger a soft lockup in a non-preempt environment.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      triggers roughly once per 10 runs
      
      ```
       1  node1="15sp3-mdcluster1"
       2  node2="15sp3-mdcluster2"
       3
       4  mdadm -Ss
       5  ssh ${node2} "mdadm -Ss"
       6  wipefs -a /dev/sda /dev/sdb
       7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
             /dev/sdb --assume-clean
       8
       9  for i in {1..100}; do
      10      echo ==== $i ====;
      11
      12      echo "test  ...."
      13      ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14      sleep 1
      15
      16      echo "clean  ....."
      17      ssh ${node2} "mdadm -Ss"
      18  done
      ```
      I use the mdcluster env to trigger the soft lockup, but it isn't
      an mdcluster-specific bug. Stopping an md array in the mdcluster
      env does more work than stopping a non-cluster array, which leaves
      a large enough time gap for the kernel to run md_open.
      
      *** stack ***
      
      ```
      ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
       #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
       #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
       #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
       #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
       #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
       #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
       #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
       #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
       #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
       #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
      ```
      
      or:
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      *** root cause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
      --monitor" by default. When "mdadm -Ss" is running, the stop action will
      wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
      info from /proc/mdstat. This time mddev in kernel still exist, so
      /proc/mdstat still show md device, which makes "mdadm --monitor" to open
      /dev/md0.
      
      The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
      open action will trigger md_open which is creating action. Racing is
      happening.
      
      `<thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      `
      In a non-preempt kernel, <thread 2> occupies the current CPU, and
      mddev_delayed_delete, which was queued in <thread 1>, can't be
      scheduled.
      
      In a preempt kernel, the above race can also trigger, but the
      kernel doesn't allow one thread to run on a CPU all the time.
      After <thread 2> has run for some time, the later "mdadm -A" (see
      script line 13 above) calls md_alloc to allocate a new gendisk for
      the mddev. This breaks the md_open condition
      "if (mddev->gendisk != bdev->bd_disk)" and returns 0 to the
      caller, and the soft lockup is broken.
      ------------------------------ snip ------------------------------
      
      Cc: stable@vger.kernel.org
      Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
      Reported-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: factor out a mddev_find_locked helper from mddev_find · 8b57251f
      Authored by Christoph Hellwig
      Factor out a self-contained helper that just looks up an mddev by
      the dev_t "unit".
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>