1. 01 Dec, 2020 4 commits
    • md: use current request time as base for ktime comparisons · a23f2aae
      Pankaj Gupta committed
      The request coalescing logic uses 'prev_flush_start' as the base
      against which the current request's start time is compared, but
      'prev_flush_start' is updated in another context.
      
      This patch uses 'req_start' as the base of the ktime comparisons
      instead, for better code readability.
      Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
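The comparison described in a23f2aae can be illustrated with a small userspace sketch. It is not the kernel code: `sketch_ktime_t`, `sketch_ktime_after()` and `sketch_needs_new_flush()` are hypothetical names, and the decision rule (a request needs its own flush only if it arrived after the previous flush started) is an assumption drawn from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's ktime_t: a timestamp in nanoseconds. */
typedef int64_t sketch_ktime_t;

/* True if timestamp a is strictly later than timestamp b. */
static bool sketch_ktime_after(sketch_ktime_t a, sketch_ktime_t b)
{
    return a > b;
}

/*
 * Assumed coalescing rule: with req_start as the base of the comparison,
 * a request needs a new flush only when it arrived after the previous
 * flush started. Otherwise that flush already covers it.
 */
static bool sketch_needs_new_flush(sketch_ktime_t req_start,
                                   sketch_ktime_t prev_flush_start)
{
    return sketch_ktime_after(req_start, prev_flush_start);
}
```

Reading the condition as "is req_start after prev_flush_start?" keeps the current request as the subject of the comparison, which is the readability point the commit makes.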
    • md: add comments in md_flush_request() · 204d1a64
      Pankaj Gupta committed
      The request coalescing logic depends on the flush time being updated
      in another context. This patch adds comments that make the code flow
      easier to follow.
      Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: improve variable names in md_flush_request() · 81ba3c24
      Pankaj Gupta committed
      This patch improves readability by using better variable names
      in the flush request coalescing logic.
      Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: fix a warning caused by a race between concurrent md_ioctl()s · c731b84b
      Dae R. Jeong committed
      Syzkaller reports the following warning:
      WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
      ...
      Call Trace:
      ...
      RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
      RSP: 0018:ffff888096027950 EFLAGS: 00010293
      RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
      RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
      RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
      R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
      R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
       __blkdev_driver_ioctl block/ioctl.c:304 [inline]
       blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
       block_ioctl+0xee/0x130 fs/block_dev.c:1930
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is caused by a race between two concurrent md_ioctl()s closing
      the array.
      CPU1 (md_ioctl())                   CPU2 (md_ioctl())
      ------                              ------
      set_bit(MD_CLOSING, &mddev->flags);
      did_set_md_closing = true;
                                          WARN_ON_ONCE(test_bit(MD_CLOSING,
                                                  &mddev->flags));
      if(did_set_md_closing)
          clear_bit(MD_CLOSING, &mddev->flags);
      
      Fix the warning by returning immediately if the MD_CLOSING bit is set
      in &mddev->flags which indicates that the array is being closed.
      
      Fixes: 065e519e ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
      Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
      Signed-off-by: Song Liu <songliubraving@fb.com>
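The race in c731b84b can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: `struct sketch_mddev`, `SKETCH_MD_CLOSING`, `sketch_begin_close()` and `sketch_end_close()` are hypothetical, the atomic test-and-set is one way to "return immediately if the bit is set", and the `-EBUSY` return value is assumed rather than taken from the commit.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MD_CLOSING (1UL << 0)  /* hypothetical bit position */

struct sketch_mddev {
    atomic_ulong flags;
};

/*
 * Try to claim the "closing" state. If another caller already set the
 * bit, return immediately instead of warning. Only the caller that
 * actually set the bit gets *did_set_md_closing == true.
 */
static int sketch_begin_close(struct sketch_mddev *mddev,
                              bool *did_set_md_closing)
{
    unsigned long old = atomic_fetch_or(&mddev->flags, SKETCH_MD_CLOSING);

    if (old & SKETCH_MD_CLOSING) {
        *did_set_md_closing = false;
        return -EBUSY;  /* assumed error code */
    }
    *did_set_md_closing = true;
    return 0;
}

/* Clear the bit only if this caller was the one that set it. */
static void sketch_end_close(struct sketch_mddev *mddev,
                             bool did_set_md_closing)
{
    if (did_set_md_closing)
        atomic_fetch_and(&mddev->flags, ~SKETCH_MD_CLOSING);
}
```

Because testing and setting happen in one atomic step, a second caller can never observe the bit as clear and then find it set by the time it reaches a check, which is the window shown in the CPU1/CPU2 diagram.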
  2. 16 Nov, 2020 2 commits
  3. 09 Oct, 2020 1 commit
  4. 25 Sep, 2020 3 commits
  5. 12 Sep, 2020 1 commit
  6. 10 Sep, 2020 1 commit
  7. 02 Sep, 2020 1 commit
  8. 06 Aug, 2020 1 commit
  9. 03 Aug, 2020 3 commits
  10. 22 Jul, 2020 2 commits
    • md-cluster: fix rmmod issue when md_cluster convert bitmap to none · edee9dfe
      Zhao Heming committed
      update_array_info() does not call module_put() when removing the
      clustered bitmap, so the md_cluster module reference count is never
      dropped.
      
      Steps to reproduce:
      ```
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb
      mdadm: array /dev/md0 started.
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  1 <== should be zero
      dlm                   212992  9 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  2 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm -G /dev/md0 -b clustered
      node1 # lsmod | egrep "dlm|md_|raid1"
      md_cluster             28672  3 <== increase
      dlm                   212992  14 md_cluster
      configfs               57344  2 dlm
      raid1                  53248  1
      md_mod                176128  2 raid1,md_cluster
      ```
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md-cluster: fix safemode_delay value when converting to clustered bitmap · 7c9d5c54
      Zhao Heming committed
      When an array is converted to a clustered bitmap, safe_mode_delay is
      not cleared, and when it is converted back, the default is not restored:
      /sys/block/mdX/md/safe_mode_delay keeps its original value after the
      bitmap type changes. safe_delay_store() forbids setting
      mddev->safemode_delay when the array is clustered, so in a cluster-md
      environment the expected safemode_delay value is 0.
      
      Steps to reproduce:
      ```
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204
      node1 # mdadm -G /dev/md0 -b none
      node1 # mdadm --grow /dev/md0 --bitmap=clustered
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.204  <== doesn't change, should ZERO for cluster-md
      
      node1 # mdadm --zero-superblock /dev/sd{b,c,d}
      node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000
      node1 # mdadm -G /dev/md0 -b none
      node1 # cat /sys/block/md0/md/safe_mode_delay
      0.000  <== doesn't change, should default value
      ```
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  11. 16 Jul, 2020 3 commits
  12. 15 Jul, 2020 1 commit
    • md: fix deadlock causing by sysfs_notify · e1a86dbb
      Junxiao Bi committed
      The following deadlock was captured. The first process holds
      'kernfs_mutex' and is hung waiting for I/O. That I/O is staged in
      'r1conf.pending_bio_list' of a raid1 device. The pending bio list should
      be flushed by the second process, 'md127_raid1', but that process is
      itself blocked on 'kernfs_mutex'. Replacing sysfs_notify() with
      sysfs_notify_dirent_safe() fixes the deadlock. The other sysfs_notify()
      calls made from the I/O path are removed as well.
      
       PID: 40430  TASK: ffff8ee9c8c65c40  CPU: 29  COMMAND: "probe_file"
        #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
        #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
        #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
        #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
        #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
        #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
        #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
        #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
        #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
        #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
       #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
       #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
       #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
       #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
       #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
       #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
       #16 [ffffb87c4df37958] new_slab at ffffffff9a251430
       #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
       #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
       #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
       #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
       #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
       #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
       #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
       #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
       #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
       #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
       #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
       #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
       #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
       #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
       #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
       #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
       #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
           RIP: 00007f617a5f2905  RSP: 00007f607334f838  RFLAGS: 00000246
           RAX: ffffffffffffffda  RBX: 00007f6064044b20  RCX: 00007f617a5f2905
           RDX: 00007f6064044b20  RSI: 00007f6064044b20  RDI: 00007f6064005890
           RBP: 00007f6064044aa0   R8: 0000000000000030   R9: 000000000000011c
           R10: 0000000000000013  R11: 0000000000000246  R12: 00007f606417e6d0
           R13: 00007f6064044aa0  R14: 00007f6064044b10  R15: 00000000ffffffff
           ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
      
       PID: 927    TASK: ffff8f15ac5dbd80  CPU: 42  COMMAND: "md127_raid1"
        #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
        #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
        #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
        #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
        #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
        #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
        #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
        #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
        #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
        #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
       #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
       #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
       #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
       #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
       #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  13. 14 Jul, 2020 2 commits
    • md: improve io stats accounting · 41d2d848
      Artur Paszkiewicz committed
      Use generic io accounting functions to manage io stats. There was an
      attempt to do this earlier in commit 18c0b223 ("md: use generic io
      stats accounting functions to simplify io stat accounting"), but it did
      not include a call to generic_end_io_acct() and caused issues with
      tracking in-flight IOs, so it was later removed in commit 74672d06
      ("md: fix md io stats accounting broken").
      
      This patch attempts to fix this by using both disk_start_io_acct() and
      disk_end_io_acct(). To make it possible, a struct md_io is allocated for
      every new md bio, which includes the io start_time. A new mempool is
      introduced for this purpose. We override bio->bi_end_io with our own
      callback and call disk_start_io_acct() before passing the bio to
      md_handle_request(). When it completes, we call disk_end_io_acct() and
      the original bi_end_io callback.
      
      This adds correct statistics about in-flight IOs and IO processing time,
      interpreted e.g. in iostat as await, svctm, aqu-sz and %util.
      
      It also fixes a situation where too many IOs were reported if a bio was
      re-submitted to the mddev, because io accounting is now performed only
      on newly arriving bios.
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
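The callback-wrapping scheme described in 41d2d848 can be sketched in plain C. Everything here is a hypothetical userspace model: `struct sketch_bio`, `struct sketch_md_io`, `sketch_md_submit()` and `sketch_md_end_io()` are invented for illustration, `malloc()` stands in for the mempool, and a simple in-flight counter stands in for `disk_start_io_acct()` / `disk_end_io_acct()`.

```c
#include <stdlib.h>

/* Hypothetical, reduced model of a bio: just the completion hook. */
struct sketch_bio {
    void (*bi_end_io)(struct sketch_bio *bio);
    void *bi_private;
};

/* Per-IO bookkeeping, allocated for every newly arriving bio. */
struct sketch_md_io {
    void (*orig_bi_end_io)(struct sketch_bio *bio);
    void *orig_bi_private;
    unsigned long start_time;
    int *inflight;  /* stands in for the disk's in-flight statistics */
};

/* Our completion callback: finish accounting, then chain to the original. */
static void sketch_md_end_io(struct sketch_bio *bio)
{
    struct sketch_md_io *md_io = bio->bi_private;

    (*md_io->inflight)--;  /* stands in for disk_end_io_acct() */
    bio->bi_end_io = md_io->orig_bi_end_io;
    bio->bi_private = md_io->orig_bi_private;
    free(md_io);
    if (bio->bi_end_io)
        bio->bi_end_io(bio);
}

/* Start accounting for a bio unless it is a re-submission. */
static int sketch_md_submit(struct sketch_bio *bio, int *inflight,
                            unsigned long now)
{
    struct sketch_md_io *md_io;

    /* Already wrapped: the bio was re-submitted, so do not count it twice. */
    if (bio->bi_end_io == sketch_md_end_io)
        return 0;

    md_io = malloc(sizeof(*md_io));  /* the commit uses a mempool instead */
    if (!md_io)
        return -1;
    md_io->orig_bi_end_io = bio->bi_end_io;
    md_io->orig_bi_private = bio->bi_private;
    md_io->start_time = now;
    md_io->inflight = inflight;
    bio->bi_end_io = sketch_md_end_io;
    bio->bi_private = md_io;
    (*inflight)++;  /* stands in for disk_start_io_acct() */
    return 0;
}

/* Example original completion callback: counts completions. */
static void sketch_count_done(struct sketch_bio *bio)
{
    (*(int *)bio->bi_private)++;
}
```

Checking whether the completion hook is already the wrapper is one simple way to recognise a re-submitted bio; the commit only states that accounting is performed on newly arriving bios, so the exact test is an assumption.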
    • md: raid0/linear: fix dereference before null check on pointer mddev · 9a5a8597
      Colin Ian King committed
      Pointer mddev is dereferenced by a test_bit() call before it is
      null checked, which may cause a null pointer dereference. Fix
      this by moving the null pointer check so that mddev is sanity checked
      before it is dereferenced.
      
      Addresses-Coverity: ("Dereference before null check")
      Fixes: 62f7b198 ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
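The ordering fix in 9a5a8597 follows a common pattern, shown here as a minimal sketch. `struct sketch_mddev`, `SKETCH_MD_BROKEN` and `sketch_is_broken()` are hypothetical names and do not reproduce the kernel function that was changed.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MD_BROKEN (1UL << 0)  /* hypothetical flag bit */

struct sketch_mddev {
    unsigned long flags;
};

/*
 * Sanity check the pointer first, and only then read through it.
 * Testing the flag before the NULL check would dereference a
 * possibly NULL pointer.
 */
static bool sketch_is_broken(const struct sketch_mddev *mddev)
{
    if (!mddev)
        return false;
    return (mddev->flags & SKETCH_MD_BROKEN) != 0;
}
```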
  14. 09 Jul, 2020 2 commits
  15. 01 Jul, 2020 3 commits
  16. 14 May, 2020 9 commits
    • md: add a newline when printing parameter 'start_ro' by sysfs · 3f99980c
      Xiongfeng Wang committed
      Add a missing newline when printing module parameter 'start_ro' by
      sysfs.
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: stop using ->queuedata · e4fc5a74
      Christoph Hellwig committed
      The pointer to mddev is already available in private_data.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: remove redundant memalloc scope API usage · 3024ba2d
      Coly Li committed
      In mddev_create_serial_pool(), the memalloc scope APIs
      memalloc_noio_save() and memalloc_noio_restore() are used when allocating
      memory by calling mempool_create_kmalloc_pool(). After adding the
      memalloc scope APIs in the raid array suspend context, it is unnecessary
      to call them explicitly around mempool_create_kmalloc_pool() any longer.
      
      This patch removes the redundant memalloc scope APIs in
      mddev_create_serial_pool().
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9
      Coly Li committed
      In raid5.c:resize_chunk(), scribble_alloc() is called with the GFP_NOIO
      flag, which is then passed to kvmalloc_array() inside scribble_alloc().
      
      The problem is that kvmalloc_array() eventually calls kvmalloc_node(),
      which does not accept flags that are not GFP_KERNEL compatible, such as
      GFP_NOIO, so kmalloc_node() is called instead to allocate physically
      contiguous pages. When system memory is under heavy pressure and the
      requested size is large, there is a high probability that allocating
      contiguous pages will fail.
      
      But simply calling kvmalloc_array() with the GFP_KERNEL flag is also
      problematic. In the code path where scribble_alloc() is called, the
      raid array is suspended; if kvmalloc_node() triggers memory reclaim I/Os
      and such I/Os go back to the suspended raid array, a deadlock will happen.
      
      What is desired here is to allocate non-physically (a.k.a. virtually)
      contiguous pages and to avoid memory reclaim I/Os. Michal Hocko suggests
      using the memalloc scope APIs to restrict memory reclaim I/O in the
      allocating context, specifically calling memalloc_noio_save() when
      suspending the raid array and memalloc_noio_restore() when
      resuming it.
      
      This patch adds the memalloc scope APIs to mddev_suspend() and
      mddev_resume(), to restrict memory reclaim I/Os while the raid array
      is suspended. The benefit of adding the memalloc scope API in the
      unified entry points mddev_suspend()/mddev_resume() is that, no matter
      which md raid array type (personality) is used, we are sure the deadlock
      caused by recursive memory reclaim I/O won't happen in the suspending
      context.
      
      Please notice that the memalloc scope APIs only take effect in the raid
      array suspending context. If the memory allocation comes from another,
      newly created kthread after the raid array is suspended, the recursive
      memory reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume()
      entries are used for the critical section in which the raid metadata is
      being modified; creating a kthread to allocate memory inside the critical
      section is odd and very probably buggy.
      
      Fixes: b330e6a4 ("md: convert to kvmalloc")
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
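The scope idea in 78f57ef9 can be modelled in userspace as a sketch. Nothing below is kernel code: the thread-local `sketch_task_flags` stands in for the per-task flags, `sketch_noio_save()` / `sketch_noio_restore()` imitate the save/restore pairing of the memalloc scope APIs, and the nesting counter in `struct sketch_mddev` is an assumption about how repeated suspends might be handled.

```c
#include <stdbool.h>

#define SKETCH_PF_MEMALLOC_NOIO 0x1u  /* hypothetical per-task flag bit */

/* Stand-in for the current task's flags; one copy per thread. */
static _Thread_local unsigned int sketch_task_flags;

/* Enter a no-I/O allocation scope and return the previous state of the bit. */
static unsigned int sketch_noio_save(void)
{
    unsigned int old = sketch_task_flags & SKETCH_PF_MEMALLOC_NOIO;

    sketch_task_flags |= SKETCH_PF_MEMALLOC_NOIO;
    return old;
}

/* Leave the scope by putting the bit back to its saved state. */
static void sketch_noio_restore(unsigned int old)
{
    sketch_task_flags = (sketch_task_flags & ~SKETCH_PF_MEMALLOC_NOIO) | old;
}

/* An allocator would consult this before starting reclaim I/O. */
static bool sketch_alloc_may_do_io(void)
{
    return !(sketch_task_flags & SKETCH_PF_MEMALLOC_NOIO);
}

struct sketch_mddev {
    unsigned int noio_flag;  /* saved scope state */
    int suspended;           /* assumed nesting counter */
};

static void sketch_mddev_suspend(struct sketch_mddev *mddev)
{
    if (mddev->suspended++)
        return;  /* already suspended: scope is active */
    mddev->noio_flag = sketch_noio_save();
}

static void sketch_mddev_resume(struct sketch_mddev *mddev)
{
    if (--mddev->suspended)
        return;  /* still suspended by an outer caller */
    sketch_noio_restore(mddev->noio_flag);
}
```

Because the flag is thread-local in this model, an allocation made from a different thread while the array is suspended is not restricted, which mirrors the caveat in the commit message about newly created kthreads.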
    • md: remove the extra line for ->hot_add_disk · 3f79cc22
      Guoqing Jiang committed
      It is not necessary to break these calls across lines since they do not
      exceed 80 characters, and the extra line also makes it less intuitive to
      distinguish ->hot_add_disk() from hot_add_disk().
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: flush md_rdev_misc_wq for HOT_ADD_DISK case · 78b990cf
      Guoqing Jiang committed
      Since rdev->kobj is removed asynchronously, rdev->kobj may still exist
      when the rdev is added again after it has been removed. The path
      md_ioctl (HOT_ADD_DISK) -> hot_add_disk -> bind_rdev_to_array did not
      account for this.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: don't flush workqueue unconditionally in md_open · f6766ff6
      Guoqing Jiang committed
      We need to check mddev->del_work before flushing the workqueue, since the
      purpose of the flush is to ensure that the previous md has disappeared.
      Otherwise a similar deadlock is reported when LOCKDEP is enabled, because
      md_open holds bdev->bd_mutex before flushing the workqueue.
      
      kernel: [  154.522645] ======================================================
      kernel: [  154.522647] WARNING: possible circular locking dependency detected
      kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
      kernel: [  154.522651] ------------------------------------------------------
      kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
      kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
      kernel: [  154.522673]
      kernel: [  154.522673] but task is already holding lock:
      kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522691]
      kernel: [  154.522691] which lock already depends on the new lock.
      kernel: [  154.522691]
      kernel: [  154.522694]
      kernel: [  154.522694] the existing dependency chain (in reverse order) is:
      kernel: [  154.522696]
      kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
      kernel: [  154.522704]        __mutex_lock+0x87/0x950
      kernel: [  154.522706]        __blkdev_get+0x79/0x590
      kernel: [  154.522708]        blkdev_get+0x65/0x140
      kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
      kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
      kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
      kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
      kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522735]        vfs_write+0xad/0x1a0
      kernel: [  154.522737]        ksys_write+0xa4/0xe0
      kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522749]
      kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
      kernel: [  154.522752]        __mutex_lock+0x87/0x950
      kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
      kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522763]        vfs_write+0xad/0x1a0
      kernel: [  154.522765]        ksys_write+0xa4/0xe0
      kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522770]
      kernel: [  154.522770] -> #2 (kn->count#253){++++}:
      kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
      kernel: [  154.522778]        kernfs_remove+0x1f/0x30
      kernel: [  154.522780]        kobject_del+0x28/0x60
      kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
      kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
      kernel: [  154.522788]        worker_thread+0x2d/0x3d0
      kernel: [  154.522793]        kthread+0x117/0x130
      kernel: [  154.522795]        ret_from_fork+0x3a/0x50
      kernel: [  154.522796]
      kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
      kernel: [  154.522800]        process_one_work+0x27e/0x5f0
      kernel: [  154.522802]        worker_thread+0x2d/0x3d0
      kernel: [  154.522804]        kthread+0x117/0x130
      kernel: [  154.522806]        ret_from_fork+0x3a/0x50
      kernel: [  154.522807]
      kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
      kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
      kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
      kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
      kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522823]        __blkdev_get+0xea/0x590
      kernel: [  154.522825]        blkdev_get+0x65/0x140
      kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
      kernel: [  154.522831]        path_openat+0x567/0xcc0
      kernel: [  154.522834]        do_filp_open+0x9b/0x110
      kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
      kernel: [  154.522838]        do_sys_open+0x57/0x80
      kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522844]
      kernel: [  154.522844] other info that might help us debug this:
      kernel: [  154.522844]
      kernel: [  154.522846] Chain exists of:
      kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
      kernel: [  154.522846]
      kernel: [  154.522850]  Possible unsafe locking scenario:
      kernel: [  154.522850]
      kernel: [  154.522852]        CPU0                    CPU1
      kernel: [  154.522853]        ----                    ----
      kernel: [  154.522854]   lock(&bdev->bd_mutex);
      kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
      kernel: [  154.522858]                                lock(&bdev->bd_mutex);
      kernel: [  154.522860]   lock((wq_completion)md_misc);
      kernel: [  154.522861]
      kernel: [  154.522861]  *** DEADLOCK ***
      kernel: [  154.522861]
      kernel: [  154.522864] 1 lock held by mdadm/2482:
      kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522868]
      kernel: [  154.522868] stack backtrace:
      kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
      kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      kernel: [  154.522878] Call Trace:
      kernel: [  154.522881]  dump_stack+0x8f/0xcb
      kernel: [  154.522884]  check_noncircular+0x194/0x1b0
      kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
      kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
      kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
      kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
      kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522910]  __blkdev_get+0xea/0x590
      kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522914]  blkdev_get+0x65/0x140
      kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
      kernel: [  154.522921]  path_openat+0x567/0xcc0
      kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
      kernel: [  154.522926]  do_filp_open+0x9b/0x110
      kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
      kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
      kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
      kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
      kernel: [  154.522944]  do_sys_open+0x57/0x80
      kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
      kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae
      
      md_alloc also flushes the same workqueue, but the situation is different
      there: none of the paths that call md_alloc hold bdev->bd_mutex, and the
      flush is necessary to avoid a race condition, so it is left as it is.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: add new workqueue for delete rdev · cc1ffe61
      Guoqing Jiang committed
      The purpose of calling flush_workqueue in new_dev_store is to ensure that
      md_delayed_delete() has completed, so we should check whether
      rdev->del_work is pending.
      
      To suppress the lockdep warning we would have to check mddev->del_work,
      but md_delayed_delete is attached to rdev->del_work, so that check does
      not match the purpose of the flush. A new workqueue is therefore added to
      avoid this awkward situation, together with a new function,
      flush_rdev_wq, which flushes the new workqueue after checking whether
      there is pending work.
      
      Like new_dev_store, the ADD_NEW_DISK ioctl flushes the workqueue for the
      same purpose while holding bdev->bd_mutex, so the same change is applied
      to the ioctl to avoid a similar lock issue.
      
      Finally, md_delayed_delete actually deletes an rdev, so the function is
      renamed to rdev_delayed_delete.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: add checkings before flush md_misc_wq · 21e0958e
      Guoqing Jiang committed
      Coly reported a possible circular locking dependency with LOCKDEP enabled;
      the info below is quoted from the detailed report [1].
      
      [ 1607.673903] Chain exists of:
      [ 1607.673903]   kn->count#256 --> (wq_completion)md_misc -->
      (work_completion)(&rdev->del_work)
      [ 1607.673903]
      [ 1607.827946]  Possible unsafe locking scenario:
      [ 1607.827946]
      [ 1607.898780]        CPU0                    CPU1
      [ 1607.952980]        ----                    ----
      [ 1608.007173]   lock((work_completion)(&rdev->del_work));
      [ 1608.069690]                                lock((wq_completion)md_misc);
      [ 1608.149887]                                lock((work_completion)(&rdev->del_work));
      [ 1608.242563]   lock(kn->count#256);
      [ 1608.283238]
      [ 1608.283238]  *** DEADLOCK ***
      [ 1608.283238]
      [ 1608.354078] 2 locks held by kworker/5:0/843:
      [ 1608.405152]  #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
      process_one_work+0x42b/0xb30
      [ 1608.512399]  #1: ffff888a1d3b7e10
      ((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
      [ 1608.632130]
      
      Both works (rdev->del_work and mddev->del_work) are queued on md_misc_wq,
      so the lockdep_map lock is held while either of them runs, and both then
      try to take the kernfs lock by calling kobject_del. Conversely, when
      new_dev_store or array_state_store is triggered by a write to the related
      sysfs node, the write operation takes the kernfs lock first and then needs
      the same lockdep_map lock, because both eventually call
      flush_workqueue(md_misc_wq).
      
      To suppress the lockdep warning, we should flush the workqueue only when
      the related work is pending. Several works are attached to md_misc_wq, so
      we need to determine which work to check in each case:
      
      1. For __md_stop_writes, the purpose of flushing the workqueue is to
      ensure that the sync thread is started if it was starting, so check
      whether mddev->del_work is pending, since md_start_sync is attached to
      mddev->del_work.
      
      2. __md_stop flushes md_misc_wq to ensure that event_work is done, so
      checking event_work is enough. This assumes that
      raid_{ctr,dtr} -> md_stop -> __md_stop doesn't need the kernfs lock.
      
      3. Both new_dev_store (which holds the kernfs lock) and the ADD_NEW_DISK
      ioctl (which holds bdev->bd_mutex) call flush_workqueue to ensure that
      md_delayed_delete has completed. This case will be handled in the next
      patch.
      
      4. md_open flushes the workqueue to ensure that the previous md has
      disappeared, but it holds bdev->bd_mutex when it tries to flush the
      workqueue, so it is better to check mddev->del_work as well to avoid a
      potential lock issue. This will be done in another patch.
      
      [1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2
      
      Cc: Coly Li <colyli@suse.de>
      Reported-by: Coly Li <colyli@suse.de>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
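The "flush only when the related work is pending" rule from 21e0958e can be shown with a toy model. The types and functions below (`struct sketch_work`, `struct sketch_workqueue`, `sketch_flush_if_pending()`) are hypothetical and model a queue that tracks a single work item; they are not the kernel workqueue API.

```c
#include <stdbool.h>

/* Toy work item: only records whether it is still queued. */
struct sketch_work {
    bool pending;
};

/* Toy workqueue: counts how often it has been flushed. */
struct sketch_workqueue {
    int flushes;
};

static bool sketch_work_pending(const struct sketch_work *work)
{
    return work->pending;
}

/* Model of a flush: the queued work has run once this returns. */
static void sketch_flush_workqueue(struct sketch_workqueue *wq,
                                   struct sketch_work *work)
{
    wq->flushes++;
    work->pending = false;
}

/*
 * Flush only if the specific work we care about is pending. When it is
 * not, the flush (and the lock dependency it would record) is skipped.
 */
static bool sketch_flush_if_pending(struct sketch_workqueue *wq,
                                    struct sketch_work *work)
{
    if (!sketch_work_pending(work))
        return false;
    sketch_flush_workqueue(wq, work);
    return true;
}
```

Each call site would pass the work item that matters for its purpose, for example the del_work or event_work mentioned in the numbered cases above.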
  17. 28 Mar, 2020 1 commit
    • block: simplify queue allocation · 3d745ea5
      Christoph Hellwig committed
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>