1. 26 Apr 2022 (6 commits)
    • md: replace deprecated strlcpy & remove duplicated line · 92d9aac9
      Committed by Heming Zhao
      This commit includes two topics:
      
      1> replace deprecated strlcpy
      
      Change strlcpy() to strscpy(), since strlcpy() is marked as deprecated
      in Documentation/process/deprecated.rst.
      
      2> remove duplicated strlcpy line
      
      In md_bitmap_read_sb() in md-bitmap.c there are two duplicated
      strlcpy() calls. The history:
      
      - commit cf921cc1 ("Add node recovery callbacks") introduced the first
        usage of strlcpy().
      
      - commit b97e9257 ("Use separate bitmaps for each nodes in the cluster")
        introduced the second strlcpy(). At that point the two strlcpy()
        calls were identical, so either one could be removed safely.
      
      - commit d3b178ad ("md: Skip cluster setup for dm-raid") added special
        handling for dm-raid. The "nodes" value is the key of this patch:
        from d3b178ad on, the strlcpy() introduced by b97e9257 becomes
        necessary.
      
      - commit 3c462c88 ("md: Increment version for clustered bitmaps") used
        the clustered major version so that this code only runs in a
        clustered environment. That patch can be seen as a cleanup of the
        clustered code logic.
      
      So the strlcpy() introduced by cf921cc1 became useless after d3b178ad,
      and we can remove it safely.
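
      A minimal sketch of the replacement pattern (illustrative buffer; not
      the actual md-bitmap.c hunk):

        /* strscpy() returns the copied length, or -E2BIG on truncation,
         * unlike the deprecated strlcpy() which returns the source length */
        char name[64];

        if (strscpy(name, src, sizeof(name)) < 0)
                pr_warn("md: name truncated\n");
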
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/bitmap: don't set sb values if can't pass sanity check · e68cb83a
      Committed by Heming Zhao
      If the bitmap area contains invalid data, the kernel will crash and
      mdadm then reports "Segmentation fault".
      This is a cluster-md specific bug. In a non-clustered environment,
      mdadm handles the broken-metadata case. In a clustered array, only
      kernel space handles the bitmap slot info. But even though this bug
      only happens in a clustered environment, the current sanity check is
      wrong and the code should be changed.
      
      How to trigger (fault injection):
      
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda
      dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb
      mdadm -Ss
      echo aaa > magic.txt
       == the following modifies the slot 2 bitmap data ==
      dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic
      dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize
      mdadm -A /dev/md0 /dev/sda /dev/sdb
       == kernel crashes. mdadm outputs "Segmentation fault" ==
      
      Reason for the kernel crash:

      In md_bitmap_read_sb() (called by md_bitmap_create()), a bad bitmap
      magic did not block the chunksize assignment, and the zero value made
      DIV_ROUND_UP_SECTOR_T() trigger a "divide error".
      
      Crash log:
      
      kernel: md: md0 stopped.
      kernel: md/raid1:md0: not clean -- starting background reconstruction
      kernel: md/raid1:md0: active with 2 out of 2 mirrors
      kernel: dlm: ... ...
      kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1
      kernel: md0: invalid bitmap file superblock: bad magic
      kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2
      kernel: md-cluster: Could not gather bitmaps from slot 2
      kernel: divide error: 0000 [#1] SMP NOPTI
      kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default
      kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246
      kernel: ... ...
      kernel: Call Trace:
      kernel:  ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a]
      kernel:  load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0]
      kernel:  md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a]
      kernel:  do_md_run+0x30/0x100 [md_mod 24ea..d3a]
      kernel:  md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a]
      kernel:  ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a]
      kernel:  ? blkdev_ioctl+0xb1/0x2b0
      kernel:  block_ioctl+0x3b/0x40
      kernel:  __x64_sys_ioctl+0x7f/0xb0
      kernel:  do_syscall_64+0x59/0x80
      kernel:  ? exit_to_user_mode_prepare+0x1ab/0x230
      kernel:  ? syscall_exit_to_user_mode+0x18/0x40
      kernel:  ? do_syscall_64+0x69/0x80
      kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      kernel: RIP: 0033:0x7f4a15fa722b
      kernel: ... ...
      kernel: ---[ end trace 8afa7612f559c868 ]---
      kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix an incorrect NULL check in md_reload_sb · 64c54d92
      Committed by Xiaomeng Tong
      The bug is here:
      	if (!rdev || rdev->desc_nr != nr) {
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each_rcu(), so it is incorrect to assume that the
      iterator value will be NULL if the list is empty or no element is
      found (in fact, it will be a bogus pointer to an invalid struct
      object containing the list HEAD). Such a bogus pointer can pass the
      check and lead to an invalid memory access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      point to the found element.
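
      A hedged sketch of the pattern (error handling elided; not the exact
      md.c hunk):

        /* iterate with 'iter'; only publish a real match through 'rdev' */
        struct md_rdev *rdev = NULL, *iter;

        rdev_for_each_rcu(iter, mddev)
                if (iter->desc_nr == nr) {
                        rdev = iter;
                        break;
                }

        if (!rdev)
                return;         /* nothing found: 'rdev' is genuinely NULL */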
      
      Cc: stable@vger.kernel.org
      Fixes: 70bcecdb ("md-cluster: Improve md_reload_sb to be less error prone")
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix an incorrect NULL check in does_sb_need_changing · fc873834
      Committed by Xiaomeng Tong
      The bug is here:
      	if (!rdev)
      
      The list iterator value 'rdev' will *always* be set and non-NULL
      by rdev_for_each(), so it is incorrect to assume that the iterator
      value will be NULL if the list is empty or no element is found.
      Such a bogus pointer can pass the NULL check and lead to an invalid
      memory access.
      
      To fix the bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'rdev' as a dedicated pointer to
      point to the found element.
      
      Cc: stable@vger.kernel.org
      Fixes: 2aa82191 ("md-cluster: Perform a lazy update")
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • raid5: introduce MD_BROKEN · 57668f0a
      Committed by Mariusz Tkaczyk
      The raid456 module used to allow reaching the failed state. That was
      changed by fb73b357 ("raid5: block failing device if raid will be
      failed"), but the change introduced a bug: if raid5 now fails during
      IO, it may result in a hung task that never completes. The Faulty
      flag on the device is needed to process all requests and is checked
      many times, mainly in analyze_stripe().
      Allow setting Faulty on a drive again, and set MD_BROKEN if the raid
      is failed.
      
      As a result, this level is again allowed to reach the failed state,
      but communication with userspace (via the -EBUSY status) is
      preserved.
      
      This restores the possibility to fail an array via the
      '#mdadm --set-faulty' command; the remaining concerns will be
      addressed by additional verification on the mdadm side.
      
      Reproduction steps:
       mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
       mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
       mkfs.xfs /dev/md126 -f
       mount /dev/md126 /mnt/root/
      
       fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
      --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
      --time_based --group_reporting --name=throughput-test-job
      --eta-newline=1 &
      
       echo 1 > /sys/block/nvme2n1/device/device/remove
       echo 1 > /sys/block/nvme1n1/device/device/remove
      
       [ 1475.787779] Call Trace:
       [ 1475.793111] __schedule+0x2a6/0x700
       [ 1475.799460] schedule+0x38/0xa0
       [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
       [ 1475.813856] ? finish_wait+0x80/0x80
       [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
       [ 1475.828281] ? finish_wait+0x80/0x80
       [ 1475.834727] ? finish_wait+0x80/0x80
       [ 1475.841127] ? finish_wait+0x80/0x80
       [ 1475.847480] md_handle_request+0x119/0x190
       [ 1475.854390] md_make_request+0x8a/0x190
       [ 1475.861041] generic_make_request+0xcf/0x310
       [ 1475.868145] submit_bio+0x3c/0x160
       [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
       [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
       [ 1475.889149] iomap_apply+0xff/0x310
       [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.909974] iomap_dio_rw+0x2f2/0x490
       [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
       [ 1475.923680] ? atime_needs_update+0x77/0xe0
       [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
       [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
       [ 1475.953403] aio_read+0xd5/0x180
       [ 1475.959395] ? _cond_resched+0x15/0x30
       [ 1475.965907] io_submit_one+0x20b/0x3c0
       [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
       [ 1475.979335] ? do_io_getevents+0x7c/0xc0
       [ 1475.986009] do_syscall_64+0x5b/0x1a0
       [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
       [ 1476.000255] RIP: 0033:0x7f11fc27978d
       [ 1476.006631] Code: Bad RIP value.
       [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
      
      Cc: stable@vger.kernel.org
      Fixes: fb73b357 ("raid5: block failing device if raid will be failed")
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Set MD_BROKEN for RAID1 and RAID10 · 9631abdb
      Committed by Mariusz Tkaczyk
      There is no direct mechanism to determine raid failure outside the
      personality. It is done by checking rdev->flags after executing
      md_error(). If the "faulty" flag is not set, then -EBUSY is returned
      to userspace. -EBUSY means that the array will be failed after the
      drive removal.

      Mdadm has a special routine to handle array failure, and it is
      executed when -EBUSY is returned by md.

      There are at least two known reasons not to consider this mechanism
      correct:
      1. a drive can be removed even if the array will be failed [1].
      2. -EBUSY seems to be the wrong status: the array is not busy, but
         the removal cannot proceed safely.
      
      The -EBUSY expectation cannot be removed without breaking
      compatibility with userspace. This patch resolves the first issue by
      adding support for the MD_BROKEN flag to RAID1 and RAID10. Support
      for RAID456 is added in the next commit.
      
      The idea is to set MD_BROKEN once we are sure the raid is in a failed
      state. This is done in each error_handler(). md_error() then checks
      the MD_BROKEN flag; if it is set, -EBUSY is returned to userspace.
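
      A hedged sketch of the described flow (simplified;
      raid_is_now_failed() is an illustrative helper, not a real symbol):

        /* personality error_handler(): mark the array broken once it is
         * truly failed, then fail the member device as before */
        static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
        {
                if (raid_is_now_failed(mddev, rdev))
                        set_bit(MD_BROKEN, &mddev->flags);
                set_bit(Faulty, &rdev->flags);
                /* md_error() later tests MD_BROKEN, and the ioctl path
                 * then reports -EBUSY to userspace */
        }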
      
      As in the previous commit, this makes '#mdadm --set-faulty' able to
      fail the array. The previously proposed workaround remains valid if
      the optional functionality [1] is disabled.
      
      [1] commit 9a567843 ("md: allow last device to be forcibly removed
          from RAID1/RAID10.")
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. 18 Apr 2022 (6 commits)
  3. 16 Apr 2022 (1 commit)
    • dm: fix bio length of empty flush · 92b914e2
      Committed by Shin'ichiro Kawasaki
      Commit 92986f6b ("dm: use bio_clone_fast in alloc_io/alloc_tio")
      removed the bio_clone_fast() call from alloc_tio() when ci->io->tio
      is available. In this case, ci->bio is not copied to
      ci->io->tio.clone. This is fine, since init_clone_info() sets the
      same values for ci->bio and ci->io->tio.clone.
      
      However, when an incoming bio has the REQ_PREFLUSH flag,
      __send_empty_flush() prepares a zero-length bio on the stack and sets
      it as ci->bio. At this point ci->io->tio.clone still has a non-zero
      length. When alloc_tio() chooses this ci->io->tio.clone as the bio to
      map, it is passed to targets as a non-empty flush bio. This causes a
      bio length check failure in dm-zoned and unexpected operations such
      as a dm_accept_partial_bio() call.
      
      To avoid the non-empty flush bio, set the length of ci->io->tio.clone
      to zero in __send_empty_flush().
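
      A hedged sketch of the described change inside __send_empty_flush()
      (surrounding lines abbreviated; may not match the upstream hunk
      exactly):

        /* the flush bio built on the stack is already zero-length ... */
        ci->bio = &flush_bio;
        ci->sector_count = 0;
        /* ... but the reused per-io clone must be zeroed as well, so that
         * targets never see a stale non-zero size on an empty flush */
        ci->io->tio.clone.bi_iter.bi_size = 0;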
      
      Fixes: 92986f6b ("dm: use bio_clone_fast in alloc_io/alloc_tio")
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@kernel.org>
  4. 15 Apr 2022 (1 commit)
  5. 14 Apr 2022 (3 commits)
  6. 02 Apr 2022 (2 commits)
    • dm: fix bio polling to handle possibile BLK_STS_AGAIN · 52919840
      Committed by Ming Lei
      Expanded testing of DM's bio polling support (using more fio threads
      to dm-linear on top of null_blk) exposed the possibility for polled
      bios to hang (repeatedly polling in io_uring) when null_blk responds
      with BLK_STS_AGAIN (due to lack of resources):
      
      1) io_complete_rw_iopoll() is called from blkdev_bio_end_io_async() to
         signal that the kiocb is done; this is the completion interface
         between the block layer and io_uring

      2) io_complete_rw_iopoll() is called from io_do_iopoll()

      3) dm returns BLK_STS_AGAIN for one bio (on behalf of the underlying
         driver), then io_complete_rw_iopoll() is called, but io_do_iopoll()
         doesn't handle -EAGAIN at all (due to the logic in
         io_rw_should_reissue())

      4) the reason for dm's BLK_STS_AGAIN is that the underlying null_blk
         driver ran out of requests (easier to reproduce by setting a low
         hw_queue_depth)

      5) dm should handle BLK_STS_AGAIN for polled underlying IO and may
         retry in the dm layer
      
      This fix adds REQ_POLLED-specific BLK_STS_AGAIN handling to
      dm_io_complete(): it clears REQ_POLLED and requeues the bio to DM
      using queue_io().
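
      A hedged sketch of that handling (variable names and placement
      simplified; not the exact dm_io_complete() hunk):

        /* a polled bio that came back as BLK_STS_AGAIN is downgraded to
         * IRQ mode and resubmitted through DM's own requeue path */
        if (io_error == BLK_STS_AGAIN && (bio->bi_opf & REQ_POLLED)) {
                bio->bi_opf &= ~REQ_POLLED;     /* retry without polling */
                queue_io(md, bio);              /* requeue to DM */
                return;
        }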
      
      Fixes: b99fdcdc ("dm: support bio polling")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      [snitzer: revised header, reused dm_io_complete's REQ_POLLED case]
      Signed-off-by: Mike Snitzer <snitzer@kernel.org>
    • dm: fix dm_io and dm_target_io flags race condition on Alpha · aad5b23e
      Committed by Mikulas Patocka
      Early Alpha processors cannot write a single byte or a 16-bit word;
      they read 8 bytes, modify the value in registers, and write back 8
      bytes.

      This can cause a race condition in struct dm_io if the fields 'flags'
      and 'io_count' are modified simultaneously.

      Fix this bug by using 32-bit flags when we are on Alpha and compiling
      for a processor that doesn't have the byte-word extension.
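
      A hedged sketch of the approach (the Kconfig symbols used to detect
      the byte-word extension, and the typedef itself, are assumptions for
      illustration):

        /* Alpha without BWX (EV56 and later have it): only whole-word
         * stores are safe, so widen the flags to avoid a racy
         * read-modify-write of neighbouring struct members */
        #if defined(CONFIG_ALPHA) && !defined(CONFIG_ALPHA_EV56) && \
            !defined(CONFIG_ALPHA_EV6) && !defined(CONFIG_ALPHA_EV67)
        typedef unsigned int dm_io_flags_t;     /* 32-bit stores only */
        #else
        typedef unsigned short dm_io_flags_t;   /* sub-word stores are fine */
        #endif
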
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: bd4a6dd2 ("dm: reduce size of dm_io and dm_target_io structs")
      [snitzer: Jens allowed this change since Mikulas owns a relevant Alpha!]
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Mike Snitzer <snitzer@kernel.org>
  7. 01 Apr 2022 (2 commits)
  8. 22 Mar 2022 (4 commits)
  9. 11 Mar 2022 (6 commits)
  10. 10 Mar 2022 (1 commit)
    • dm: support bio polling · b99fdcdc
      Committed by Ming Lei
      Support bio polling (REQ_POLLED) with the following approach:
      
      1) only support IO polling for normal READ/WRITE; other, abnormal IOs
      still fall back to IRQ mode, so the target io (and DM's clone bio) is
      exactly inside the dm io.
      
      2) hold one refcnt on io->io_count after submitting this dm bio with
      REQ_POLLED
      
      3) support dm-native bio splitting: any dm io instance associated
      with the current bio will be added to one list whose head is stored
      in bio->bi_private, which is restored before ending this bio
      
      4) implement the .poll_bio() callback: call bio_poll() on the single
      target bio inside each dm io, retrieved via bio->bi_bio_drv_data, and
      call dm_io_dec_pending() after the target io is done in .poll_bio()
      (see the sketch after this list)
      
      5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
      which is based on Jeffle's previous patch.
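
      A hedged sketch of the .poll_bio() flow from 3) and 4) above (the
      saved-list helper, the link field, and the completion check are
      illustrative, not the exact dm.c code):

        static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
                               unsigned int flags)
        {
                /* head of the dm_io list saved at submission, see 3) */
                struct dm_io *io = dm_get_saved_io_list(bio);   /* illustrative */
                int polled = 0;

                for (; io; io = io->next_polled) {      /* illustrative link */
                        /* poll the single cloned target bio in this dm_io */
                        polled += bio_poll(&io->tio.clone, iob, flags);
                        if (atomic_read(&io->io_count) == 1)
                                dm_io_dec_pending(io, BLK_STS_OK);
                }
                return polled;
        }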
      
      These changes are good for a 30-35% IOPS improvement for polled IO.
      
      For detailed test results please see (Jens, thanks for testing!):
      https://listman.redhat.com/archives/dm-devel/2022-March/049868.html
      or https://marc.info/?l=linux-block&m=164684246214700&w=2
      Tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  11. 09 Mar 2022 (6 commits)
  12. 08 Mar 2022 (1 commit)
  13. 07 Mar 2022 (1 commit)