1. 14 May, 2021 2 commits
    • M
      dm integrity: revert to not using discard filler when recalulating · dbae70d4
      Mikulas Patocka committed
      Revert the commit 7a5b96b4 ("dm integrity:
      use discard support when recalculating").
      
      There's a bug that when we write some data beyond the current recalculate
      boundary, the checksum will be rewritten with the discard filler later.
      And the data will no longer have integrity protection. There's no easy
      fix for this case.
      
      Also, another problematic case is if dm-integrity is used to detect
      bitrot (random device errors, bit flips, etc); dm-integrity should
      detect that even for unused sectors. With commit 7a5b96b4 it can
      happen that such change is undetected (because discard filler is not a
      valid checksum).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      dbae70d4
    • M
      dm snapshot: fix crash with transient storage and zero chunk size · c699a0db
      Mikulas Patocka committed
      The following commands will crash the kernel:
      
      modprobe brd rd_size=1048576
      dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
      dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
      
      The reason is that when we test for zero chunk size, we jump to the label
      bad_read_metadata without setting the "r" variable. The function
      snapshot_ctr destroys all the structures and then exits with "r == 0". The
      kernel then crashes because it falsely believes that snapshot_ctr
      succeeded.
      
      In order to fix the bug, we set the variable "r" to -EINVAL.
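The error path described above can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `snapshot_ctr_model`, its parameters, and the `resources_alive` flag are hypothetical names, and the early return stands in for the jump to the cleanup label.

```python
import errno


def snapshot_ctr_model(chunk_size, set_error_on_zero_chunk):
    """Toy model of a constructor with a shared cleanup path.

    Returns (r, resources_alive); r == 0 tells the caller "success".
    """
    r = 0                   # earlier steps succeeded and left r at 0
    resources_alive = True  # structures allocated so far

    if chunk_size == 0:
        if set_error_on_zero_chunk:
            r = -errno.EINVAL  # the fix: record the failure before bailing out
        # Model of jumping to the cleanup label: everything is torn down.
        resources_alive = False
        return r, resources_alive

    return r, resources_alive
```

Without the assignment, the model returns `r == 0` together with destroyed structures, which is the inconsistent state the commit message describes.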
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c699a0db
  2. 10 May, 2021 1 commit
  3. 07 May, 2021 1 commit
  4. 01 May, 2021 5 commits
    • M
      dm raid: remove unnecessary discard limits for raid0 and raid10 · ca4a4e9a
      Mike Snitzer committed
      Commit 29efc390 ("md/md0: optimize raid0 discard handling") and
      commit d30588b2 ("md/raid10: improve raid10 discard request")
      remove MD raid0's and raid10's inability to properly handle large
      discards. So eliminate associated constraints from dm-raid's support.
      
      Depends-on: 29efc390 ("md/md0: optimize raid0 discard handling")
      Depends-on: d30588b2 ("md/raid10: improve raid10 discard request")
      Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ca4a4e9a
    • B
      dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails · 8e947c8f
      Benjamin Block committed
      When loading a device-mapper table for a request-based mapped device,
      and the allocation/initialization of the blk_mq_tag_set for the device
      fails, a following device remove will cause a double free.
      
      E.g. (dmesg):
        device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
        device-mapper: ioctl: unable to set up device queue for new table.
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0305e098835de000 TEID: 0305e098835de803
        Fault in home space mode while using kernel ASCE.
        AS:000000025efe0007 R3:0000000000000024
        Oops: 0038 ilc:3 [#1] SMP
        Modules linked in: ... lots of modules ...
        Supported: Yes, External
        CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G        W      X    5.3.18-53-default #1 SLE15-SP3
        Hardware name: IBM 8561 T01 7I2 (LPAR)
        Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
        Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
                   000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
                   000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
                   00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
        Krnl Code: 000000025e368eb8: c4180041e100       lgrl    %r1,25eba50b8
                   000000025e368ebe: ecba06b93a55       risbg   %r11,%r10,6,185,58
                  #000000025e368ec4: e3b010000008       ag      %r11,0(%r1)
                  >000000025e368eca: e310b0080004       lg      %r1,8(%r11)
                   000000025e368ed0: a7110001           tmll    %r1,1
                   000000025e368ed4: a7740129           brc     7,25e369126
                   000000025e368ed8: e320b0080004       lg      %r2,8(%r11)
                   000000025e368ede: b904001b           lgr     %r1,%r11
        Call Trace:
         [<000000025e368eca>] kfree+0x42/0x330
         [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
         [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
         [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
         [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
         [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
         [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
         [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
         [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
         [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
         [<000000025e8c15ac>] system_call+0xd8/0x2c8
        Last Breaking-Event-Address:
         [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
        Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      When allocation/initialization of the blk_mq_tag_set fails in
      dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
      is not reset to NULL; so when dev_remove() later gets into
      dm_mq_cleanup_mapped_device() it sees the pointer and tries to
      uninitialize and free it again.
      
      Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
      error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().
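The double free and the fix can be sketched with a user-space model that counts how often an allocation is released. All class and function names here are hypothetical stand-ins chosen for illustration; they only mirror the roles of the kernel functions named above.

```python
class TagSet:
    """Stand-in for an allocated tag set; counts how often it is freed."""

    def __init__(self):
        self.free_count = 0


def free_tag_set(tag_set):
    # In the kernel, freeing the same allocation twice is a bug.
    tag_set.free_count += 1


class MappedDevice:
    def __init__(self):
        self.tag_set = None


def init_request_queue(md, reset_pointer):
    """Model of an init path whose tag-set setup fails after allocation."""
    md.tag_set = TagSet()
    allocated = md.tag_set
    free_tag_set(md.tag_set)  # error handling frees the allocation
    if reset_pointer:
        md.tag_set = None     # the fix: forget the stale pointer
    return allocated


def cleanup_mapped_device(md):
    """Model of the cleanup that runs later, on device removal."""
    if md.tag_set is not None:
        free_tag_set(md.tag_set)
        md.tag_set = None
```

With `reset_pointer=False` the later cleanup frees the same object a second time; with `reset_pointer=True` the cleanup sees no pointer and does nothing.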
      
      Cc: <stable@vger.kernel.org> # 4.6+
      Fixes: 1c357a1e ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
      Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8e947c8f
    • M
      dm integrity: use discard support when recalculating · 7a5b96b4
      Mikulas Patocka committed
      If we have discard support we don't have to recalculate hash - we can
      just fill the metadata with the discard pattern.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      7a5b96b4
    • M
      dm integrity: increase RECALC_SECTORS to improve recalculate speed · b1a2b933
      Mikulas Patocka committed
      Increase RECALC_SECTORS because it improves recalculate speed slightly
      (from 390kiB/s to 410kiB/s).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b1a2b933
    • M
      dm integrity: don't re-write metadata if discarding same blocks · a9c0fda4
      Mikulas Patocka committed
      If we discard already discarded blocks we do not need to write discard
      pattern to the metadata, because it is already there.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a9c0fda4
  5. 24 April, 2021 2 commits
    • P
      md/raid1: properly indicate failure when ending a failed write request · 2417b986
      Paul Clements committed
      This patch addresses a data corruption bug in raid1 arrays using bitmaps.
      Without this fix, the bitmap bits for the failed I/O end up being cleared.
      
      Since we are in the failure leg of raid1_end_write_request, the request
      either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).
      
      Fixes: eeba6809 ("md/raid1: end bio when the device faulty")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Paul Clements <paul.clements@us.sios.com>
      Signed-off-by: Song Liu <song@kernel.org>
      2417b986
    • H
      md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9
      Heming Zhao committed
      md_kick_rdev_from_array will remove rdev, so we should
      use rdev_for_each_safe to walk the list.
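Why the safe iterator matters can be shown with a toy singly linked list. This is a simplified sketch with hypothetical names: `kick` only poisons the node's `next` field to mimic freed memory and does not unlink the node, and the "unsafe" walk raises an exception where a real kernel would read freed memory.

```python
POISON = object()  # marker for memory that must no longer be read


class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


def build(names):
    """Build a singly linked list and return its head."""
    head = None
    for name in reversed(names):
        node = Node(name)
        node.next = head
        head = node
    return head


def kick(node):
    node.next = POISON  # after removal the node's contents are untrustworthy


def walk_unsafe(head, should_kick):
    visited = []
    node = head
    while node is not None:
        visited.append(node.name)
        if should_kick(node):
            kick(node)
        node = node.next  # reads the node after it may have been removed
        if node is POISON:
            raise RuntimeError("use after free")
    return visited


def walk_safe(head, should_kick):
    visited = []
    node = head
    while node is not None:
        nxt = node.next  # saved before the loop body can remove the node
        visited.append(node.name)
        if should_kick(node):
            kick(node)
        node = nxt
    return visited
```

The safe walk mirrors the idea behind the kernel's `_safe` list iterators: the successor is fetched before the body runs, so removing the current entry does not affect the traversal.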
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      f7c7a2f9
  6. 22 April, 2021 1 commit
    • H
      dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences · f99a8e43
      Heinz Mauelshagen committed
      If fast table reloads occur during an ongoing reshape of raid4/5/6
      devices, the target may race reading a superblock against the MD resync
      thread, causing an inconclusive reshape state to be read in its
      constructor.
      
      lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause
      BUG_ON() to trigger in md_run(), e.g.:
      "kernel BUG at drivers/md/raid5.c:7567!".
      
      Scenario triggering the bug:
      
      1. the MD sync thread calls end_reshape() from raid5_sync_request()
         when done reshaping. However end_reshape() _only_ updates the
         reshape position to MaxSector keeping the changed layout
         configuration though (i.e. any delta disks, chunk sector or RAID
         algorithm changes). That inconclusive configuration is stored in
         the superblock.
      
      2. dm-raid constructs a mapping, loading the inconclusive superblock
         written in step 1 before step 3 is able to finish resetting the
         reshape state completely, and calls md_run(), which leads to the
         BUG_ON() in raid5.c mentioned above.
      
      3. the MD RAID personality's finish_reshape() is called; which resets
         the reshape information on chunk sectors, delta disks, etc. This
         explains why the bug is rarely seen on multi-core machines, as MD's
         finish_reshape() superblock update races with the dm-raid
         constructor's superblock load in step 2.
      
      The fix identifies inconclusive superblock content in the dm-raid
      constructor and resets it before calling md_run(), factoring out the
      identifying checks into rs_is_layout_change() to share between the
      existing rs_reshape_requested() and the new
      rs_reset_inconclusive_reshape(). Also enhance a comment and remove an
      empty line.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      f99a8e43
  7. 21 April, 2021 1 commit
  8. 20 April, 2021 7 commits
  9. 16 April, 2021 4 commits
    • S
      md/bitmap: wait for external bitmap writes to complete during tear down · 404a8ef5
      Sudhakar Panneerselvam committed
      NULL pointer dereference was observed in super_written() when it tries
      to access the mddev structure.
      
      [The below stack trace is from an older kernel, but the problem described
      in this patch applies to the mainline kernel.]
      
      [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
      [ 1194.488000] RIP: 0010:super_written+0x29/0xe1
      [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
      [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
      [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
      [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
      [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
      [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [ 1194.588117] FS:  0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
      [ 1194.604264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
      [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1194.663316] PKRU: 55555554
      [ 1194.674090] Call Trace:
      [ 1194.683735]  <IRQ>
      [ 1194.692948]  bio_endio+0xae/0x135
      [ 1194.703580]  blk_update_request+0xad/0x2fa
      [ 1194.714990]  blk_update_bidi_request+0x20/0x72
      [ 1194.726578]  __blk_end_bidi_request+0x2c/0x4d
      [ 1194.738373]  __blk_end_request_all+0x31/0x49
      [ 1194.749344]  blk_flush_complete_seq+0x377/0x383
      [ 1194.761550]  flush_end_io+0x1dd/0x2a7
      [ 1194.772910]  blk_finish_request+0x9f/0x13c
      [ 1194.784544]  scsi_end_request+0x180/0x25c
      [ 1194.796149]  scsi_io_completion+0xc8/0x610
      [ 1194.807503]  scsi_finish_command+0xdc/0x125
      [ 1194.818897]  scsi_softirq_done+0x81/0xde
      [ 1194.830062]  blk_done_softirq+0xa4/0xcc
      [ 1194.841008]  __do_softirq+0xd9/0x29f
      [ 1194.851257]  irq_exit+0xe6/0xeb
      [ 1194.861290]  do_IRQ+0x59/0xe3
      [ 1194.871060]  common_interrupt+0x1c6/0x382
      [ 1194.881988]  </IRQ>
      [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
      [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
      [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
      [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
      [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
      [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
      [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
      [ 1194.988607]  ? cpuidle_enter_state+0xcc/0x2a5
      [ 1194.999077]  cpuidle_enter+0x17/0x19
      [ 1195.008395]  call_cpuidle+0x23/0x3a
      [ 1195.017718]  do_idle+0x172/0x1d5
      [ 1195.026358]  cpu_startup_entry+0x73/0x75
      [ 1195.035769]  start_secondary+0x1b9/0x20b
      [ 1195.044894]  secondary_startup_64+0xa5/0xa5
      [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
      [ 1195.096354] CR2: 00000000000002b8
      
      bio in the above stack is a bitmap write whose completion is invoked after
      the tear down sequence sets the mddev structure to NULL in rdev.
      
      During tear down, there is an attempt to flush the bitmap writes, but for
      external bitmaps, there is no explicit wait for all the bitmap writes to
      complete. For instance, md_bitmap_flush() is called to flush the bitmap
      writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
      could generate new bitmap writes for which there is no explicit wait to
      complete those writes. For external bitmaps, the call to
      md_bitmap_update_sb() simply returns, and the follow-up call to
      md_update_sb() is conditional and may not happen. This results in a
      kernel panic when the completion routine, super_written(), is called
      and tries to reference mddev in the rdev, which has been set to NULL
      (in unbind_rdev_from_array(), by the tear down sequence).
      
      The solution is to call md_super_wait() for external bitmaps after the
      last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
      are no pending bitmap writes before proceeding with the tear down.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
      Reviewed-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      404a8ef5
    • C
      md: do not return existing mddevs from mddev_find_or_alloc · 0d809b38
      Christoph Hellwig committed
      Instead of returning an existing mddev just for it to be discarded
      later, directly return -EEXIST.  Rename the function to mddev_alloc now
      that it doesn't find an existing mddev.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      0d809b38
    • C
      md: refactor mddev_find_or_alloc · d144fe6f
      Christoph Hellwig committed
      Allocate the new mddev first speculatively, which greatly simplifies
      the code flow.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      d144fe6f
    • C
      md: factor out a mddev_alloc_unit helper from mddev_find · 85c8c3c1
      Christoph Hellwig committed
      Split out a self contained helper to find a free minor for the md
      "unit" number.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      85c8c3c1
  10. 15 April, 2021 1 commit
    • J
      dm verity fec: fix misaligned RS roots IO · 8ca7cab8
      Jaegeuk Kim committed
      commit df7b59ba ("dm verity: fix FEC for RS roots unaligned to
      block size") introduced the possibility for misaligned roots IO
      relative to the underlying device's logical block size. E.g. Android's
      default RS roots=2 results in dm_bufio->block_size=1024, which causes
      the following EIO if the logical block size of the device is 4096,
      given v->data_dev_block_bits=12:
      
      E sd 0    : 0:0:0: [sda] tag#30 request not aligned to the logical block size
      E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
      E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5
      
      Fix this by using f->roots for the dm_bufio block size if and only if
      it is aligned to v->data_dev_block_bits.
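The alignment rule can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `fec_io_size` is a hypothetical helper, the 512-byte sector shift is assumed because it matches the "roots=2 gives 1024" figure above, and falling back to the data block size when the roots-based size is misaligned is an assumption about what the fix uses instead.

```python
SECTOR_SHIFT = 9  # assumed 512-byte sectors, consistent with roots=2 -> 1024


def fec_io_size(roots, data_dev_block_bits):
    """Pick an I/O size for parity reads that stays block-aligned."""
    block_size = 1 << data_dev_block_bits
    roots_size = roots << SECTOR_SHIFT
    if roots_size % block_size != 0:
        # Misaligned relative to the data block size: assumed fallback.
        return block_size
    return roots_size
```

For the case in the message, `fec_io_size(2, 12)` no longer yields 1024, because 1024 is not a multiple of the 4096-byte block size.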
      
      Fixes: df7b59ba ("dm verity: fix FEC for RS roots unaligned to block size")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8ca7cab8
  11. 11 April, 2021 7 commits
  12. 09 April, 2021 1 commit
  13. 08 April, 2021 3 commits
    • C
      md: split mddev_find · 65aa97c4
      Christoph Hellwig committed
      Split mddev_find into a simple mddev_find that just finds an existing
      mddev by the unit number, and a more complicated mddev_find_or_alloc
      that deals with finding or allocating an mddev.
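The difference between the two lookups can be sketched with a dictionary-backed registry. This is only an illustrative model: `MddevRegistry` and its dict-based "mddev" objects are hypothetical and ignore locking and reference counting.

```python
class MddevRegistry:
    """Toy registry of md devices keyed by unit number."""

    def __init__(self):
        self._by_unit = {}

    def find(self, unit):
        """Pure lookup: never creates anything."""
        return self._by_unit.get(unit)

    def find_or_alloc(self, unit):
        """Lookup that creates a new entry when none exists."""
        mddev = self._by_unit.get(unit)
        if mddev is None:
            mddev = {"unit": unit}
            self._by_unit[unit] = mddev
        return mddev

    def remove(self, unit):
        self._by_unit.pop(unit, None)
```

An open path that calls only `find` cannot resurrect a device that a concurrent removal has just dropped from the registry, whereas `find_or_alloc` would silently create a fresh one.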
      
      This turns out to fix this bug reported by Zhao Heming.
      
      ----------------------------- snip ------------------------------
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing. The
      md_open shouldn't create mddev when all_mddevs list doesn't contain
      mddev. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt environment.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      triggers about 1 time in 10 tests
      
      ```
      1  node1="15sp3-mdcluster1"
      2  node2="15sp3-mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..100}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      I use an mdcluster environment to trigger the soft lockup, but it isn't
      an mdcluster-specific bug. Stopping an md array in an mdcluster
      environment does more work than for a non-cluster array, which leaves
      enough of a time gap for the kernel to run md_open.
      
      *** stack ***
      
      ```
      ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
       #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
       #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
       #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
       #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
       #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
       #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
       #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
       #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
       #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
       #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
      
      or:
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
      --monitor" by default. When "mdadm -Ss" is running, the stop action
      wakes up "mdadm --monitor". The "--monitor" daemon immediately reads
      /proc/mdstat. At this time the mddev still exists in the kernel, so
      /proc/mdstat still shows the md device, which makes "mdadm --monitor"
      open /dev/md0.
      
      The earlier "mdadm -Ss" is a removing action, while the open by "mdadm
      --monitor" triggers md_open, which is a creating action. The two race.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      In a non-preempt kernel, <thread 2> keeps occupying the current CPU, and
      mddev_delayed_delete, which was queued in <thread 1>, can't be
      scheduled.
      
      A preempt kernel can also hit the race above, but the kernel doesn't
      allow one thread to run on a CPU all the time. After <thread 2> has
      run for some time, the later "mdadm -A" (see script line 13 above)
      calls md_alloc to allocate a new gendisk for the mddev. That makes the
      md_open check "if (mddev->gendisk != bdev->bd_disk)" false, md_open
      returns 0 to the caller, and the soft lockup is broken.
      ------------------------------ snip ------------------------------
      
      Cc: stable@vger.kernel.org
      Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
      Reported-by: Heming Zhao <heming.zhao@suse.com>
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      65aa97c4
    • C
      md: factor out a mddev_find_locked helper from mddev_find · 8b57251f
      Christoph Hellwig committed
      Factor out a self-contained helper to just lookup a mddev by the dev_t
      "unit".
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      8b57251f
    • Z
      md: md_open returns -EBUSY when entering racing area · 6a4db2a6
      Zhao Heming committed
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing. The
      md_open shouldn't create mddev when all_mddevs list doesn't contain
      mddev. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt environment.
      
      This patch changes the md_open return value from -ERESTARTSYS to -EBUSY,
      which breaks the infinite retry when md_open enters the racing area.
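The effect of the changed return value can be modelled with a small retry loop. This is a hedged sketch, not the block layer's actual code: `open_with_retry` is a hypothetical caller, `ERESTARTSYS = 512` is the kernel-internal value (Python's `errno` module does not expose it), and the `max_attempts` cap exists only so the model terminates.

```python
import errno

ERESTARTSYS = 512  # kernel-internal code, not exposed by Python's errno module


def open_with_retry(md_open, max_attempts):
    """Retry while the open asks to be restarted; return (result, attempts)."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        result = md_open()
        if result != -ERESTARTSYS:
            return result, attempts
    # Cap reached: in the scenario described above this loop would not end.
    return -ERESTARTSYS, attempts
```

An open that keeps answering `-ERESTARTSYS` uses every allowed attempt, while one that answers `-errno.EBUSY` ends the loop on the first try and lets the caller see the failure.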
      
      This patch only partly fixes the soft lockup issue; the full fix needs
      mddev_find to be split into two functions: mddev_find and
      mddev_find_or_alloc. md_open should then call the new mddev_find (which
      only does the lookup).
      
      For more detail, please refer to Christoph's "split mddev_find" patch
      in later commits.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      triggers almost every time with the script below
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I use an mdcluster environment to trigger the soft lockup, but it isn't
      an mdcluster-specific bug. Stopping an md array in an mdcluster
      environment does more work than for a non-cluster array, which leaves
      enough of a time gap for the kernel to run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
      --monitor" by default. When "mdadm -Ss" is running, the stop action
      wakes up "mdadm --monitor". The "--monitor" daemon immediately reads
      /proc/mdstat. At this time the mddev still exists in the kernel, so
      /proc/mdstat still shows the md device, which makes "mdadm --monitor"
      open /dev/md0.
      
      The earlier "mdadm -Ss" is a removing action, while the open by "mdadm
      --monitor" triggers md_open, which is a creating action. The two race.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      In a non-preempt kernel, <thread 2> keeps occupying the current CPU, and
      mddev_delayed_delete, which was queued in <thread 1>, can't be
      scheduled.
      
      A preempt kernel can also hit the race above, but the kernel doesn't
      allow one thread to run on a CPU all the time. After <thread 2> has
      run for some time, the later "mdadm -A" (see script line 13 above)
      calls md_alloc to allocate a new gendisk for the mddev. That makes the
      md_open check "if (mddev->gendisk != bdev->bd_disk)" false, md_open
      returns 0 to the caller, and the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      6a4db2a6
  14. 27 March, 2021 4 commits