1. 20 1月, 2022 1 次提交
    • Z
      md: Fix undefined behaviour in is_mddev_idle · 406295a3
      zhangwensheng 提交于
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4QXS1?from=project-issue
      CVE: NA
      
      --------------------------------
      
      UBSAN reports this problem:
      
      [ 5984.281385] UBSAN: Undefined behaviour in drivers/md/md.c:8175:15
      [ 5984.281390] signed integer overflow:
      [ 5984.281393] -2147483291 - 2072033152 cannot be represented in type 'int'
      [ 5984.281400] CPU: 25 PID: 1854 Comm: md101_resync Kdump: loaded Not tainted 4.19.90
      [ 5984.281404] Hardware name: Huawei TaiShan 200 (Model 5280)/BC82AMDDA
      [ 5984.281406] Call trace:
      [ 5984.281415]  dump_backtrace+0x0/0x310
      [ 5984.281418]  show_stack+0x28/0x38
      [ 5984.281425]  dump_stack+0xec/0x15c
      [ 5984.281430]  ubsan_epilogue+0x18/0x84
      [ 5984.281434]  handle_overflow+0x14c/0x19c
      [ 5984.281439]  __ubsan_handle_sub_overflow+0x34/0x44
      [ 5984.281445]  is_mddev_idle+0x338/0x3d8
      [ 5984.281449]  md_do_sync+0x1bb8/0x1cf8
      [ 5984.281452]  md_thread+0x220/0x288
      [ 5984.281457]  kthread+0x1d8/0x1e0
      [ 5984.281461]  ret_from_fork+0x10/0x18
      
      When the stat aacum of the disk is greater than INT_MAX, its value
      becomes negative after casting to 'int', which may lead to overflow
      after subtracting a positive number. In the same way, when the value
      of sync_io is greater than INT_MAX,overflow may also occur. These
      situations will lead to undefined behavior.
      
      Otherwise, if the stat accum of the disk is close to INT_MAX when
      creating raid arrays, the initial value of last_events would be set
      close to INT_MAX when mddev initializes IO event counters.
      'curr_events - rdev->last_events > 64' will always false during
      synchronization. If all the disks of mddev are in this case,
      is_mddev_idle() will always return 1, which may cause non-sync IO
      is very slow.
      
      To address these problems, need to use 64bit signed integer type
      for sync_io,last_events, and curr_events.
      Signed-off-by: Nzhangwensheng <zhangwensheng5@huawei.com>
      Reviewed-by: NTao Hou <houtao1@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      406295a3
  2. 14 1月, 2022 1 次提交
  3. 06 12月, 2021 1 次提交
  4. 15 11月, 2021 1 次提交
  5. 19 10月, 2021 1 次提交
  6. 03 6月, 2021 5 次提交
    • J
      md: Fix missing unused status line of /proc/mdstat · 1e099dfb
      Jan Glauber 提交于
      stable inclusion
      from stable-5.10.37
      commit 0035a4704557ba66824c08d5759d6e743747410b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 7abfabaf upstream.
      
      Reading /proc/mdstat with a read buffer size that would not
      fit the unused status line in the first read will skip this
      line from the output.
      
      So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
      like: unused devices: <none>
      
      Don't return NULL immediately in start() for v=2 but call
      show() once to print the status line also for multiple reads.
      
      Cc: stable@vger.kernel.org
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Signed-off-by: NJan Glauber <jglauber@digitalocean.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      1e099dfb
    • Z
      md: md_open returns -EBUSY when entering racing area · 640134e4
      Zhao Heming 提交于
      stable inclusion
      from stable-5.10.37
      commit b70b7ec500892f8bc12ffc6f60a3af6fd61d3a8b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 6a4db2a6 upstream.
      
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing. The
      md_open shouldn't create mddev when all_mddevs list doesn't contain
      mddev. With currently code logic, there will be very easy to trigger
      soft lockup in non-preempt env.
      
      This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
      will break the infinitely retry when md_open enter racing area.
      
      This patch is partly fix soft lockup issue, full fix needs mddev_find
      is split into two functions: mddev_find & mddev_find_or_alloc. And
      md_open should call new mddev_find (it only does searching job).
      
      For more detail, please refer with Christoph's "split mddev_find" patch
      in later commits.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      about trigger every time with below script
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I use mdcluster env to trigger soft lockup, but it isn't mdcluster
      speical bug. To stop md array in mdcluster env will do more jobs than
      non-cluster array, which will leave enough time/gap to allow kernel to
      run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
      --monitor" by default. When "mdadm -Ss" is running, the stop action will
      wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
      info from /proc/mdstat. This time mddev in kernel still exist, so
      /proc/mdstat still show md device, which makes "mdadm --monitor" to open
      /dev/md0.
      
      The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
      open action will trigger md_open which is creating action. Racing is
      happening.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      In non-preempt kernel, <thread 2> is occupying on current CPU. and
      mddev_delayed_delete which was created in <thread 1> also can't be
      schedule.
      
      In preempt kernel, it can also trigger above racing. But kernel doesn't
      allow one thread running on a CPU all the time. after <thread 2> running
      some time, the later "mdadm -A" (refer above script line 13) will call
      md_alloc to alloc a new gendisk for mddev. it will break md_open
      statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
      the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      640134e4
    • C
      md: factor out a mddev_find_locked helper from mddev_find · f80d4b29
      Christoph Hellwig 提交于
      stable inclusion
      from stable-5.10.37
      commit cdcfa77a332a57962ee3af255f8769fd5cdf97ad
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 8b57251f upstream.
      
      Factor out a self-contained helper to just lookup a mddev by the dev_t
      "unit".
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NHeming Zhao <heming.zhao@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSong Liu <song@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      f80d4b29
    • C
      md: split mddev_find · 69eae441
      Christoph Hellwig 提交于
      stable inclusion
      from stable-5.10.37
      commit 07e73740850299e39f1737aff4811e79021f72e5
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 65aa97c4 upstream.
      
      Split mddev_find into a simple mddev_find that just finds an existing
      mddev by the unit number, and a more complicated mddev_find that deals
      with find or allocating a mddev.
      
      This turns out to fix this bug reported by Zhao Heming.
      
      ----------------------------- snip ------------------------------
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing. The
      md_open shouldn't create mddev when all_mddevs list doesn't contain
      mddev. With currently code logic, there will be very easy to trigger
      soft lockup in non-preempt env.
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      69eae441
    • H
      md-cluster: fix use-after-free issue when removing rdev · 152be1b9
      Heming Zhao 提交于
      stable inclusion
      from stable-5.10.37
      commit 61b8c6efbe87c445c3907fc36a9644ed705228f8
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit f7c7a2f9 upstream.
      
      md_kick_rdev_from_array will remove rdev, so we should
      use rdev_for_each_safe to search list.
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: NGang He <ghe@suse.com>
      Signed-off-by: NHeming Zhao <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      152be1b9
  7. 09 3月, 2021 1 次提交
    • X
      md: Set prev_flush_start and flush_bio in an atomic way · 78a62c13
      Xiao Ni 提交于
      stable inclusion
      from stable-5.10.15
      commit fe272570d0037c54c40cba31eaa42a343a5656fa
      bugzilla: 48167
      
      --------------------------------
      
      commit dc5d17a3 upstream.
      
      One customer reports a crash problem which causes by flush request. It
      triggers a warning before crash.
      
              /* new request after previous flush is completed */
              if (ktime_after(req_start, mddev->prev_flush_start)) {
                      WARN_ON(mddev->flush_bio);
                      mddev->flush_bio = bio;
                      bio = NULL;
              }
      
      The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
      flush_bio in md_flush_request. But there is no lock protection in
      md_submit_flush_data. It can set flush_bio to NULL first because of
      compiler reordering write instructions.
      
      For example, flush bio1 sets flush bio to NULL first in
      md_submit_flush_data. An interrupt or vmware causing an extended stall
      happen between updating flush_bio and prev_flush_start. Because flush_bio
      is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
      flush bio1 updates prev_flush_start after the interrupt or extended stall.
      
      Then flush bio3 enters in md_flush_request. The start time req_start is
      behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
      finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
      again. INIT_WORK() will re-initialize the list pointers in the
      work_struct, which then can result in a corrupted work list and the
      work_struct queued a second time. With the work list corrupted, it can
      lead in invalid work items being used and cause a crash in
      process_one_work.
      
      We need to make sure only one flush bio can be handled at one same time.
      So add spin lock in md_submit_flush_data to protect prev_flush_start and
      flush_bio in an atomic way.
      Reviewed-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      78a62c13
  8. 12 1月, 2021 2 次提交
    • Z
      md/cluster: fix deadlock when node is doing resync job · 4663a7ee
      Zhao Heming 提交于
      stable inclusion
      from stable-5.10.4
      commit d27d1942e173c1c719e00a596f5db60221b05b14
      bugzilla: 46903
      
      --------------------------------
      
      commit bca5b065 upstream.
      
      md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
      During sending msg, node can concurrently receive msg from another node.
      When node does resync job, grab token_lockres:EX may trigger a deadlock:
      ```
      nodeA                       nodeB
      --------------------     --------------------
      a.
      send METADATA_UPDATED
      held token_lockres:EX
                               b.
                               md_do_sync
                                resync_info_update
                                  send RESYNCING
                                   + set MD_CLUSTER_SEND_LOCK
                                   + wait for holding token_lockres:EX
      
                               c.
                               mdadm /dev/md0 --remove /dev/sdg
                                + held reconfig_mutex
                                + send REMOVE
                                   + wait_event(MD_CLUSTER_SEND_LOCK)
      
                               d.
                               recv_daemon //METADATA_UPDATED from A
                                process_metadata_update
                                 + (mddev_trylock(mddev) ||
                                    MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                   //this time, both return false forever
      ```
      Explaination:
      a. A send METADATA_UPDATED
         This will block another node to send msg
      
      b. B does sync jobs, which will send RESYNCING at intervals.
         This will be block for holding token_lockres:EX lock.
      
      c. B do "mdadm --remove", which will send REMOVE.
         This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
      
      d. B recv METADATA_UPDATED msg, which send from A in step <a>.
         This will be blocked by step <c>: holding mddev lock, it makes
         wait_event can't hold mddev lock. (btw,
         MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
      
      There is a similar deadlock in commit 0ba95977
      ("md-cluster: use sync way to handle METADATA_UPDATED msg")
      In that commit, step c is "update sb". This patch step c is
      "mdadm --remove".
      
      For fixing this issue, we can refer the solution of function:
      metadata_update_start. Which does the same grab lock_token action.
      lock_comm can use the same steps to avoid deadlock. By moving
      MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
      It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
      but it is safe & can break deadlock.
      
      Repro steps (I only triggered 3 times with hundreds tests):
      
      two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
      ```
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
       --bitmap-chunk=1M
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mkfs.xfs /dev/md0
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      test script will hung when executing "mdadm --remove".
      
      ```
       # dump stacks by "echo t > /proc/sysrq-trigger"
      md0_cluster_rec D    0  5329      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? _cond_resched+0x2d/0x40
       ? schedule+0x4a/0xb0
       ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
       ? wait_woken+0x80/0x80
       ? process_recvd_msg+0x113/0x1d0 [md_cluster]
       ? recv_daemon+0x9e/0x120 [md_cluster]
       ? md_thread+0x94/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      
      mdadm           D    0  5423      1 0x00004004
      Call Trace:
       __schedule+0x1f6/0x560
       ? __schedule+0x1fe/0x560
       ? schedule+0x4a/0xb0
       ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? remove_disk+0x4f/0x90 [md_cluster]
       ? hot_remove_disk+0xb1/0x1b0 [md_mod]
       ? md_ioctl+0x50c/0xba0 [md_mod]
       ? wait_woken+0x80/0x80
       ? blkdev_ioctl+0xa2/0x2a0
       ? block_ioctl+0x39/0x40
       ? ksys_ioctl+0x82/0xc0
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x5f/0x150
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      md0_resync      D    0  5425      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? schedule+0x4a/0xb0
       ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? lock_token+0x2d/0x90 [md_cluster]
       ? resync_info_update+0x95/0x100 [md_cluster]
       ? raid1_sync_request+0x7d3/0xa40 [raid1]
       ? md_do_sync.cold+0x737/0xc8f [md_mod]
       ? md_thread+0x94/0x160 [md_mod]
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      ```
      
      At last, thanks for Xiao's solution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Suggested-by: NXiao Ni <xni@redhat.com>
      Reviewed-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      4663a7ee
    • Z
      md/cluster: block reshape with remote resync job · e35dbd4a
      Zhao Heming 提交于
      stable inclusion
      from stable-5.10.4
      commit 3ace8d52ee4afac46e4d718612efa09591bd7506
      bugzilla: 46903
      
      --------------------------------
      
      commit a8da01f7 upstream.
      
      Reshape request should be blocked with ongoing resync job. In cluster
      env, a node can start resync job even if the resync cmd isn't executed
      on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
      will start resync job. However, current update_raid_disks() only check
      local recovery status, which is incomplete. As a result, we see user will
      execute "mdadm --grow" successfully on local, while the remote node deny
      to do reshape job when it doing resync job. The inconsistent handling
      cause array enter unexpected status. If user doesn't observe this issue
      and continue executing mdadm cmd, the array doesn't work at last.
      
      Fix this issue by blocking reshape request. When node executes "--grow"
      and detects ongoing resync, it should stop and report error to user.
      
      The following script reproduces the issue with ~100% probability.
      (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
      ```
       # on node1, node2 is the remote node.
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      e35dbd4a
  9. 07 1月, 2021 1 次提交
    • D
      md: fix a warning caused by a race between concurrent md_ioctl()s · a69924ea
      Dae R. Jeong 提交于
      stable inclusion
      from stable-5.10.3
      commit 70eb256f8c8a7b9c698af22157bf20005641883f
      bugzilla: 46871
      
      --------------------------------
      
      commit c731b84b upstream.
      
      Syzkaller reports a warning as belows.
      WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
      ...
      Call Trace:
      ...
      RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
      RSP: 0018:ffff888096027950 EFLAGS: 00010293
      RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
      RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
      RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
      R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
      R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
       __blkdev_driver_ioctl block/ioctl.c:304 [inline]
       blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
       block_ioctl+0xee/0x130 fs/block_dev.c:1930
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is caused by a race between two concurrenct md_ioctl()s closing
      the array.
      CPU1 (md_ioctl())                   CPU2 (md_ioctl())
      ------                              ------
      set_bit(MD_CLOSING, &mddev->flags);
      did_set_md_closing = true;
                                          WARN_ON_ONCE(test_bit(MD_CLOSING,
                                                  &mddev->flags));
      if(did_set_md_closing)
          clear_bit(MD_CLOSING, &mddev->flags);
      
      Fix the warning by returning immediately if the MD_CLOSING bit is set
      in &mddev->flags which indicates that the array is being closed.
      
      Fixes: 065e519e ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
      Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NDae R. Jeong <dae.r.jeong@kaist.ac.kr>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      a69924ea
  10. 10 12月, 2020 1 次提交
  11. 09 10月, 2020 1 次提交
  12. 25 9月, 2020 3 次提交
  13. 12 9月, 2020 1 次提交
  14. 10 9月, 2020 1 次提交
  15. 02 9月, 2020 1 次提交
  16. 06 8月, 2020 1 次提交
  17. 03 8月, 2020 3 次提交
  18. 22 7月, 2020 2 次提交
    • Z
      md-cluster: fix rmmod issue when md_cluster convert bitmap to none · edee9dfe
      Zhao Heming 提交于
      update_array_info misses calling module_put when removing cluster bitmap.
      
      steps to reproduce:
      ```
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
      /dev/sdb
      mdadm: array /dev/md0 started.
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1 <== should be zero
      dlm                   212992  9 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  2 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  3 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      ```
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      edee9dfe
    • Z
      md-cluster: fix safemode_delay value when converting to clustered bitmap · 7c9d5c54
      Zhao Heming 提交于
      When array convert to clustered bitmap, the safe_mode_delay doesn't
      clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original
      value after changing bitmap type. In safe_delay_store(), the code forbids
      setting mddev->safemode_delay when array is clustered. So in cluster-md
      env, the expected safemode_delay value should be 0.
      
      Reproducible steps:
      ```
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm --grow /dev/md0 --bitmap=clustered
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204  <== doesn't change, should ZERO for cluster-md
      
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000
      node1 # mdadm -G /dev/md0 -b none
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000  <== doesn't change, should default value
      ```
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NZhao Heming <heming.zhao@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      7c9d5c54
  19. 16 7月, 2020 3 次提交
  20. 15 7月, 2020 1 次提交
    • J
      md: fix deadlock causing by sysfs_notify · e1a86dbb
      Junxiao Bi 提交于
      The following deadlock was captured. The first process is holding 'kernfs_mutex'
      and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device,
      this pending bio list would be flushed by second process 'md127_raid1', but
      it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace
      sysfs_notify() can fix it. There were other sysfs_notify() invoked from io
      path, removed all of them.
      
       PID: 40430  TASK: ffff8ee9c8c65c40  CPU: 29  COMMAND: "probe_file"
        #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
        #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
        #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
        #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
        #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
        #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
        #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
        #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
        #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
        #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
       #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
       #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
       #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
       #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
       #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
       #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
       #16 [ffffb87c4df37958] new_slab at ffffffff9a251430
       #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
       #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
       #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
       #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
       #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
       #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
       #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
       #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
       #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
       #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
       #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
       #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
       #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
       #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
       #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
       #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
       #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
           RIP: 00007f617a5f2905  RSP: 00007f607334f838  RFLAGS: 00000246
           RAX: ffffffffffffffda  RBX: 00007f6064044b20  RCX: 00007f617a5f2905
           RDX: 00007f6064044b20  RSI: 00007f6064044b20  RDI: 00007f6064005890
           RBP: 00007f6064044aa0   R8: 0000000000000030   R9: 000000000000011c
           R10: 0000000000000013  R11: 0000000000000246  R12: 00007f606417e6d0
           R13: 00007f6064044aa0  R14: 00007f6064044b10  R15: 00000000ffffffff
           ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
      
       PID: 927    TASK: ffff8f15ac5dbd80  CPU: 42  COMMAND: "md127_raid1"
        #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
        #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
        #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
        #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
        #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
        #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
        #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
        #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
        #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
        #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
       #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
       #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
       #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
       #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
       #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      e1a86dbb
  21. 14 7月, 2020 2 次提交
    • A
      md: improve io stats accounting · 41d2d848
      Artur Paszkiewicz 提交于
      Use generic io accounting functions to manage io stats. There was an
      attempt to do this earlier in commit 18c0b223 ("md: use generic io
      stats accounting functions to simplify io stat accounting"), but it did
      not include a call to generic_end_io_acct() and caused issues with
      tracking in-flight IOs, so it was later removed in commit 74672d06
      ("md: fix md io stats accounting broken").
      
      This patch attempts to fix this by using both disk_start_io_acct() and
      disk_end_io_acct(). To make it possible, a struct md_io is allocated for
      every new md bio, which includes the io start_time. A new mempool is
      introduced for this purpose. We override bio->bi_end_io with our own
      callback and call disk_start_io_acct() before passing the bio to
      md_handle_request(). When it completes, we call disk_end_io_acct() and
      the original bi_end_io callback.
      
      This adds correct statistics about in-flight IOs and IO processing time,
      interpreted e.g. in iostat as await, svctm, aqu-sz and %util.
      
      It also fixes a situation where too many IOs where reported if a bio was
      re-submitted to the mddev, because io accounting is now performed only
      on newly arriving bios.
      Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      41d2d848
    • C
      md: raid0/linear: fix dereference before null check on pointer mddev · 9a5a8597
      Colin Ian King 提交于
      Pointer mddev is being dereferenced with a test_bit call before mddev
      is being null checked, this may cause a null pointer dereference. Fix
      this by moving the null pointer checks to sanity check mddev before
      it is dereferenced.
      
      Addresses-Coverity: ("Dereference before null check")
      Fixes: 62f7b198 ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      9a5a8597
  22. 09 7月, 2020 2 次提交
  23. 01 7月, 2020 3 次提交
  24. 14 5月, 2020 1 次提交