1. 16 7月, 2020 1 次提交
  2. 14 5月, 2020 9 次提交
    • X
      md: add a newline when printing parameter 'start_ro' by sysfs · 3f99980c
      Xiongfeng Wang 提交于
      Add a missing newline when printing module parameter 'start_ro' by
      sysfs.
      Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3f99980c
    • C
      md: stop using ->queuedata · e4fc5a74
      Christoph Hellwig 提交于
      Pointer to mddev is already available in private_data.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      e4fc5a74
    • C
      md: remove redundant memalloc scope API usage · 3024ba2d
      Coly Li 提交于
      In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save()
      and memalloc_noio_restore() are used when allocating memory by calling
      mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in
      raid array suspend context, it is unncessary to explicitly call them
      around mempool_create_kmalloc_pool() any longer.
      
      This patch removes the redundant memalloc scope APIs in
      mddev_create_serial_pool().
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3024ba2d
    • C
      md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9
      Coly Li 提交于
      In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
      flag, then it is sent into kvmalloc_array() inside scribble_alloc().
      
      The problem is kvmalloc_array() eventually calls kvmalloc_node() which
      does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
      kmalloc_node() is called indeed to allocate physically continuous
      pages. When system memory is under heavy pressure, and the requesting
      size is large, there is high probability that allocating continueous
      pages will fail.
      
      But simply using GFP_KERNEL flag to call kvmalloc_array() is also
      progblematic. In the code path where scribble_alloc() is called, the
      raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
      and such I/Os go back to the suspend raid array, deadlock will happen.
      
      What is desired here is to allocate non-physically (a.k.a virtually)
      continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
      to use the mmealloc sceope APIs to restrict memory reclaim I/O in
      allocating context, specifically to call memalloc_noio_save() when
      suspend the raid array and to call memalloc_noio_restore() when
      resume the raid array.
      
      This patch adds the memalloc scope APIs in mddev_suspend() and
      mddev_resume(), to restrict memory reclaim I/Os during the raid array
      is suspended. The benifit of adding the memalloc scope API in the
      unified entry point mddev_suspend()/mddev_resume() is, no matter which
      md raid array type (personality), we are sure the deadlock by recursive
      memory reclaim I/O won't happen on the suspending context.
      
      Please notice that the memalloc scope APIs only take effect on the raid
      array suspending context, if the memory allocation is from another new
      created kthread after raid array suspended, the recursive memory reclaim
      I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
      used for the critical section where the raid metadata is modifying,
      creating a kthread to allocate memory inside the critical section is
      queer and very probably being buggy.
      
      Fixes: b330e6a4 ("md: convert to kvmalloc")
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      78f57ef9
    • G
      md: remove the extra line for ->hot_add_disk · 3f79cc22
      Guoqing Jiang 提交于
      It is not not necessary to add a newline for them since they don't exceed
      80 characters, and it is not intutive to distinguish ->hot_add_disk() from
      hot_add_disk() too.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3f79cc22
    • G
      md: flush md_rdev_misc_wq for HOT_ADD_DISK case · 78b990cf
      Guoqing Jiang 提交于
      Since rdev->kobj is removed asynchronously, it is possible that the
      rdev->kobj still exists when try to add the rdev again after rdev
      is removed. But this path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
      -> bind_rdev_to_array missed it.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      78b990cf
    • G
      md: don't flush workqueue unconditionally in md_open · f6766ff6
      Guoqing Jiang 提交于
      We need to check mddev->del_work before flush workqueu since the purpose
      of flush is to ensure the previous md is disappeared. Otherwise the similar
      deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
      bdev->bd_mutex before flush workqueue.
      
      kernel: [  154.522645] ======================================================
      kernel: [  154.522647] WARNING: possible circular locking dependency detected
      kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
      kernel: [  154.522651] ------------------------------------------------------
      kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
      kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
      kernel: [  154.522673]
      kernel: [  154.522673] but task is already holding lock:
      kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522691]
      kernel: [  154.522691] which lock already depends on the new lock.
      kernel: [  154.522691]
      kernel: [  154.522694]
      kernel: [  154.522694] the existing dependency chain (in reverse order) is:
      kernel: [  154.522696]
      kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
      kernel: [  154.522704]        __mutex_lock+0x87/0x950
      kernel: [  154.522706]        __blkdev_get+0x79/0x590
      kernel: [  154.522708]        blkdev_get+0x65/0x140
      kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
      kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
      kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
      kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
      kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522735]        vfs_write+0xad/0x1a0
      kernel: [  154.522737]        ksys_write+0xa4/0xe0
      kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522749]
      kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
      kernel: [  154.522752]        __mutex_lock+0x87/0x950
      kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
      kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522763]        vfs_write+0xad/0x1a0
      kernel: [  154.522765]        ksys_write+0xa4/0xe0
      kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522770]
      kernel: [  154.522770] -> #2 (kn->count#253){++++}:
      kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
      kernel: [  154.522778]        kernfs_remove+0x1f/0x30
      kernel: [  154.522780]        kobject_del+0x28/0x60
      kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
      kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
      kernel: [  154.522788]        worker_thread+0x2d/0x3d0
      kernel: [  154.522793]        kthread+0x117/0x130
      kernel: [  154.522795]        ret_from_fork+0x3a/0x50
      kernel: [  154.522796]
      kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
      kernel: [  154.522800]        process_one_work+0x27e/0x5f0
      kernel: [  154.522802]        worker_thread+0x2d/0x3d0
      kernel: [  154.522804]        kthread+0x117/0x130
      kernel: [  154.522806]        ret_from_fork+0x3a/0x50
      kernel: [  154.522807]
      kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
      kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
      kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
      kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
      kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522823]        __blkdev_get+0xea/0x590
      kernel: [  154.522825]        blkdev_get+0x65/0x140
      kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
      kernel: [  154.522831]        path_openat+0x567/0xcc0
      kernel: [  154.522834]        do_filp_open+0x9b/0x110
      kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
      kernel: [  154.522838]        do_sys_open+0x57/0x80
      kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522844]
      kernel: [  154.522844] other info that might help us debug this:
      kernel: [  154.522844]
      kernel: [  154.522846] Chain exists of:
      kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
      kernel: [  154.522846]
      kernel: [  154.522850]  Possible unsafe locking scenario:
      kernel: [  154.522850]
      kernel: [  154.522852]        CPU0                    CPU1
      kernel: [  154.522853]        ----                    ----
      kernel: [  154.522854]   lock(&bdev->bd_mutex);
      kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
      kernel: [  154.522858]                                lock(&bdev->bd_mutex);
      kernel: [  154.522860]   lock((wq_completion)md_misc);
      kernel: [  154.522861]
      kernel: [  154.522861]  *** DEADLOCK ***
      kernel: [  154.522861]
      kernel: [  154.522864] 1 lock held by mdadm/2482:
      kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522868]
      kernel: [  154.522868] stack backtrace:
      kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
      kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      kernel: [  154.522878] Call Trace:
      kernel: [  154.522881]  dump_stack+0x8f/0xcb
      kernel: [  154.522884]  check_noncircular+0x194/0x1b0
      kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
      kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
      kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
      kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
      kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522910]  __blkdev_get+0xea/0x590
      kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522914]  blkdev_get+0x65/0x140
      kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
      kernel: [  154.522921]  path_openat+0x567/0xcc0
      kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
      kernel: [  154.522926]  do_filp_open+0x9b/0x110
      kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
      kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
      kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
      kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
      kernel: [  154.522944]  do_sys_open+0x57/0x80
      kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
      kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae
      
      And md_alloc also flushed the same workqueue, but the thing is different
      here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
      the flush is necessary to avoid race condition, so leave it as it is.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      f6766ff6
    • G
      md: add new workqueue for delete rdev · cc1ffe61
      Guoqing Jiang 提交于
      Since the purpose of call flush_workqueue in new_dev_store is to ensure
      md_delayed_delete() has completed, so we should check rdev->del_work is
      pending or not.
      
      To suppress lockdep warning, we have to check mddev->del_work while
      md_delayed_delete is attached to rdev->del_work, so it is not aligned
      to the purpose of flush workquee. So a new workqueue is needed to avoid
      the awkward situation, and introduce a new func flush_rdev_wq to flush
      the new workqueue after check if there was pending work.
      
      Also like new_dev_store, ADD_NEW_DISK ioctl has the same purpose to flush
      workqueue while it holds bdev->bd_mutex, so make the same change applies
      to the ioctl to avoid similar lock issue.
      
      And md_delayed_delete actually wants to delete rdev, so rename the function
      to rdev_delayed_delete.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      cc1ffe61
    • G
      md: add checkings before flush md_misc_wq · 21e0958e
      Guoqing Jiang 提交于
      Coly reported possible circular locking dependencyi with LOCKDEP enabled,
      quote the below info from the detailed report [1].
      
      [ 1607.673903] Chain exists of:
      [ 1607.673903]   kn->count#256 --> (wq_completion)md_misc -->
      (work_completion)(&rdev->del_work)
      [ 1607.673903]
      [ 1607.827946]  Possible unsafe locking scenario:
      [ 1607.827946]
      [ 1607.898780]        CPU0                    CPU1
      [ 1607.952980]        ----                    ----
      [ 1608.007173]   lock((work_completion)(&rdev->del_work));
      [ 1608.069690]                                lock((wq_completion)md_misc);
      [ 1608.149887]                                lock((work_completion)(&rdev->del_work));
      [ 1608.242563]   lock(kn->count#256);
      [ 1608.283238]
      [ 1608.283238]  *** DEADLOCK ***
      [ 1608.283238]
      [ 1608.354078] 2 locks held by kworker/5:0/843:
      [ 1608.405152]  #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
      process_one_work+0x42b/0xb30
      [ 1608.512399]  #1: ffff888a1d3b7e10
      ((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
      [ 1608.632130]
      
      Since works (rdev->del_work and mddev->del_work) are queued in md_misc_wq,
      then lockdep_map lock is held if either of them are running, then both of
      them try to hold kernfs lock by call kobject_del. Then if new_dev_store
      or array_state_store are triggered by write to the related sysfs node, so
      the write operation gets kernfs lock, but need the lockdep_map because all
      of them would trigger flush_workqueue(md_misc_wq) finally, then the same
      lockdep_map lock is needed.
      
      To suppress the lockdep warnning, we should flush the workqueue in case the
      related work is pending. And several works are attached to md_misc_wq, so
      we need to check which work should be checked:
      
      1. for __md_stop_writes, the purpose of call flush workqueue is ensure sync
      thread is started if it was starting, so check mddev->del_work is pending
      or not since md_start_sync is attached to mddev->del_work.
      
      2. __md_stop flushes md_misc_wq to ensure event_work is done, check the
      event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't
      need the kernfs lock.
      
      3. both new_dev_store (holds kernfs lock) and ADD_NEW_DISK ioctl (holds the
      bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
      completed, this case will be handled in next patch.
      
      4. md_open flushes workqueue to ensure the previous md is disappeared, but
      it holds bdev->bd_mutex then try to flush workqueue, so it is better to
      check mddev->del_work as well to avoid potential lock issue, this will be
      done in another patch.
      
      [1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2
      
      Cc: Coly Li <colyli@suse.de>
      Reported-by: NColy Li <colyli@suse.de>
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      21e0958e
  3. 28 3月, 2020 1 次提交
    • C
      block: simplify queue allocation · 3d745ea5
      Christoph Hellwig 提交于
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d745ea5
  4. 25 3月, 2020 1 次提交
  5. 24 3月, 2020 2 次提交
  6. 18 3月, 2020 1 次提交
    • G
      md: check arrays is suspended in mddev_detach before call quiesce operations · 6b40bec3
      Guoqing Jiang 提交于
      Don't call quiesce(1) and quiesce(0) if array is already suspended,
      otherwise in level_store, the array is writable after mddev_detach
      in below part though the intention is to make array writable after
      resume.
      
      	mddev_suspend(mddev);
      	mddev_detach(mddev);
      	...
      	mddev_resume(mddev);
      
      And it also causes calltrace as follows in [1].
      
      [48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
      [...]
      [48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G           OE     5.4.10-arch1-1 #1
      [48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
      [48005.653984] RIP: 0010:kthread_park+0x77/0x90
      [48005.654015] Call Trace:
      [48005.654039]  r5l_quiesce+0x3c/0x70 [raid456]
      [48005.654052]  raid5_quiesce+0x228/0x2e0 [raid456]
      [48005.654073]  mddev_detach+0x30/0x70 [md_mod]
      [48005.654090]  level_store+0x202/0x670 [md_mod]
      [48005.654099]  ? security_capable+0x40/0x60
      [48005.654114]  md_attr_store+0x7b/0xc0 [md_mod]
      [48005.654123]  kernfs_fop_write+0xce/0x1b0
      [48005.654132]  vfs_write+0xb6/0x1a0
      [48005.654138]  ksys_write+0x67/0xe0
      [48005.654146]  do_syscall_64+0x4e/0x140
      [48005.654155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [48005.654161] RIP: 0033:0x7fa0c8737497
      
      [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      6b40bec3
  7. 04 2月, 2020 1 次提交
  8. 14 1月, 2020 7 次提交
    • G
      md/raid1: use bucket based mechanism for IO serialization · 025471f9
      Guoqing Jiang 提交于
      Since raid1 had already used bucket based mechanism to reduce
      the conflict between write IO and resync IO, it is possible to
      speed up performance for io serialization with refer to the
      same mechanism.
      
      To align with the barrier bucket mechanism, we created arrays
      (with the same number of BARRIER_BUCKETS_NR) for spinlock, rb
      tree and waitqueue. Then we can reduce lock competition with
      multiple spinlocks, boost search performance with multiple rb
      trees and also reduce thundering herd problem with multiple
      waitqueues.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      025471f9
    • G
      md: introduce a new struct for IO serialization · 69b00b5b
      Guoqing Jiang 提交于
      Obviously, IO serialization could cause the degradation of
      performance a lot. In order to reduce the degradation, so a
      rb interval tree is added in raid1 to speed up the check of
      collision.
      
      So, a rb root is needed in md_rdev, then abstract all the
      serialize related members to a new struct (serial_in_rdev),
      embed it into md_rdev.
      
      Of course, we need to free the struct if it is not needed
      anymore, so rdev/rdevs_uninit_serial are added accordingly.
      And they should be called when destroty memory pool or can't
      alloc memory.
      
      And we need to consider to call mddev_destroy_serial_pool
      in case serialize_policy/write-behind is disabled, bitmap
      is destroyed or in __md_stop_writes.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      69b00b5b
    • G
      md: reorgnize mddev_create/destroy_serial_pool · de31ee94
      Guoqing Jiang 提交于
      So far, IO serialization is used for two scenarios:
      
      1. raid1 which enables write-behind mode, and there is rdev in the array
      which is multi-queue device and flaged with writemostly.
      2. IO serialization is enabled or disabled by change serialize_policy.
      
      So introduce rdev_need_serial to check the first scenario. And for 1, IO
      serialization is enabled automatically while 2 is controlled manually.
      
      And it is possible that both scenarios are true, so for create serial pool,
      rdev/rdevs_init_serial should be separate from check if the pool existed or
      not. Then for destroy pool, we need to check if the pool is needed by other
      rdevs due to the first scenario.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      de31ee94
    • G
      md: add serialize_policy sysfs node for raid1 · 3938f5fb
      Guoqing Jiang 提交于
      With the new sysfs node, we can use it to control if raid1 array
      wants io serialization or not. So mddev_create_serial_pool and
      mddev_destroy_serial_pool are called in serialize_policy_store
      to enable or disable the serialization.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3938f5fb
    • G
      md: prepare for enable raid1 io serialization · 11d3a9f6
      Guoqing Jiang 提交于
      1. The related resources (spin_lock, list and waitqueue) are needed for
      address raid1 reorder overlap issue too, in this case, rdev is set to
      NULL for mddev_create/destroy_serial_pool which implies all rdevs need
      to handle these resources.
      
      And also add "is_suspend" to mddev_destroy_serial_pool since it will
      be called under suspended situation, which also makes both create and
      destroy pool have same arguments.
      
      2. Introduce rdevs_init_serial which is called if raid1 io serialization
      is enabled since all rdevs need to init related stuffs.
      
      3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
      be called between suspend and resume.
      
      No need to export mddev_create_serial_pool since it is only called in
      md-mod module.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      11d3a9f6
    • G
      md: fix a typo s/creat/create · 3e173ab5
      Guoqing Jiang 提交于
      It actually means create here, so fix the typo.
      Reported-by: NSong Liu <liu.song.a23@gmail.com>
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3e173ab5
    • G
      md: rename wb stuffs · 404659cf
      Guoqing Jiang 提交于
      Previously, wb_info_pool and wb_list stuffs are introduced
      to address potential data inconsistence issue for write
      behind device.
      
      Now rename them to serial related name, since the same
      mechanism will be used to address reorder overlap write
      issue for raid1.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      404659cf
  9. 12 12月, 2019 1 次提交
  10. 12 11月, 2019 1 次提交
  11. 25 10月, 2019 2 次提交
    • Y
      md: no longer compare spare disk superblock events in super_load · 6a5cb53a
      Yufen Yu 提交于
      We have a test case as follow:
      
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
      	--assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      When we readd /dev/sda to the array, it started to do recovery.
      After offline the other two disks in md1, the recovery have
      been interrupted and superblock update info cannot be written
      to the offline disks. While the spare disk (/dev/sda) can continue
      to update superblock info.
      
      After stopping the array and assemble it, we found the array
      run fail, with the follow kernel message:
      
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      [  173.023466] md: md1 stopped.
      
      Since both sdb and sdc have the value of 'sb->events' smaller than
      that in sda, they have been kicked from the array. However, the only
      remained disk sda is in 'spare' state before stop and it cannot be
      added to conf->mirrors[] array. In the end, raid array assemble
      and run fail.
      
      In fact, we can use the older disk sdb or sdc to assemble the array.
      That means we should not choose the 'spare' disk as the fresh disk in
      analyze_sbs().
      
      To fix the problem, we do not compare superblock events when it is
      a spare disk, as same as validate_super.
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      6a5cb53a
    • D
      md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · 775d7831
      David Jeffery 提交于
      If pers->make_request fails in md_flush_request(), the bio is lost. To
      fix this, pass back a bool to indicate if the original make_request call
      should continue to handle the I/O and instead of assuming the flush logic
      will push it to completion.
      
      Convert md_flush_request to return a bool and no longer calls the raid
      driver's make_request function.  If the return is true, then the md flush
      logic has or will complete the bio and the md make_request call is done.
      If false, then the md make_request function needs to keep processing like
      it is a normal bio. Let the original call to md_handle_request handle any
      need to retry sending the bio to the raid driver's make_request function
      should it be needed.
      
      Also mark md_flush_request and the make_request function pointer as
      __must_check to issue warnings should these critical return values be
      ignored.
      
      Fixes: 2bc13b83 ("md: batch flush requests.")
      Cc: stable@vger.kernel.org # # v4.19+
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
      Reviewed-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      775d7831
  12. 14 9月, 2019 1 次提交
    • N
      md: add feature flag MD_FEATURE_RAID0_LAYOUT · 33f2c35a
      NeilBrown 提交于
      Due to a bug introduced in Linux 3.14 we cannot determine the
      correctly layout for a multi-zone RAID0 array - there are two
      possibilities.
      
      It is possible to tell the kernel which to chose using a module
      parameter, but this can be clumsy to use.  It would be best if
      the choice were recorded in the metadata.
      So add a feature flag for this purpose.
      If it is set, then the 'layout' field of the superblock is used
      to determine which layout to use.
      
      If this flag is not set, then mddev->layout gets set to -1,
      which causes the module parameter to be required.
      Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      33f2c35a
  13. 04 9月, 2019 1 次提交
    • G
      md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone · 62f7b198
      Guilherme G. Piccoli 提交于
      Currently md raid0/linear are not provided with any mechanism to validate
      if an array member got removed or failed. The driver keeps sending BIOs
      regardless of the state of array members, and kernel shows state 'clean'
      in the 'array_state' sysfs attribute. This leads to the following
      situation: if a raid0/linear array member is removed and the array is
      mounted, some user writing to this array won't realize that errors are
      happening unless they check dmesg or perform one fsync per written file.
      Despite udev signaling the member device is gone, 'mdadm' cannot issue the
      STOP_ARRAY ioctl successfully, given the array is mounted.
      
      In other words, no -EIO is returned and writes (except direct ones) appear
      normal. Meaning the user might think the wrote data is correctly stored in
      the array, but instead garbage was written given that raid0 does stripping
      (and so, it requires all its members to be working in order to not corrupt
      data). For md/linear, writes to the available members will work fine, but
      if the writes go to the missing member(s), it'll cause a file corruption
      situation, whereas the portion of the writes to the missing devices aren't
      written effectively.
      
      This patch changes this behavior: we check if the block device's gendisk
      is UP when submitting the BIO to the array member, and if it isn't, we flag
      the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
      request to the array requiring data from a valid member is still completed.
      While flagging the device as MD_BROKEN, we also show a rate-limited warning
      in the kernel log.
      
      A new array state 'broken' was added too: it mimics the state 'clean' in
      every aspect, being useful only to distinguish if the array has some member
      missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
      state. This state cannot be written in 'array_state' as it just shows
      one or more members of the array are missing but acts like 'clean', it
      wouldn't make sense to write it.
      
      With this patch, the filesystem reacts much faster to the event of missing
      array member: after some I/O errors, ext4 for instance aborts the journal
      and prevents corruption. Without this change, we're able to keep writing
      in the disk and after a machine reboot, e2fsck shows some severe fs errors
      that demand fixing. This patch was tested in ext4 and xfs filesystems, and
      requires a 'mdadm' counterpart to handle the 'broken' state.
      
      Cc: Song Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      62f7b198
  14. 28 8月, 2019 2 次提交
    • N
      md: don't report active array_state until after revalidate_disk() completes. · 9d4b45d6
      NeilBrown 提交于
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon as
      it appear to be zero, fsck can be run.  If it find the size to be
      zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensure the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY cleared twice.  This is deliberate to cover
      both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      9d4b45d6
    • N
      md: only call set_in_sync() when it is expected to succeed. · 480523fe
      NeilBrown 提交于
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can noticeable
      affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been not
      recent IO.  So we save this "safemode was nonzero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      480523fe
  15. 08 8月, 2019 4 次提交
    • G
      md: don't call spare_active in md_reap_sync_thread if all member devices can't work · 0d8ed0e9
      Guoqing Jiang 提交于
      When add one disk to array, the md_reap_sync_thread is responsible
      to activate the spare and set In_sync flag for the new member in
      spare_active().
      
      But if raid1 has one member disk A, and disk B is added to the array.
      Then we offline A before all the datas are synchronized from A to B,
      obviously B doesn't have the latest data as A, but B is still marked
      with In_sync flag.
      
      So let's not call spare_active under the condition, otherwise B is
      still showed with 'U' state which is not correct.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      0d8ed0e9
    • G
      md: don't set In_sync if array is frozen · 062f5b2a
      Guoqing Jiang 提交于
      When a disk is added to array, the following path is called in mdadm.
      
      Manage_subdevs -> sysfs_freeze_array
                     -> Manage_add
                     -> sysfs_set_str(&info, NULL, "sync_action","idle")
      
      Then from kernel side, Manage_add invokes the path (add_new_disk ->
      validate_super = super_1_validate) to set In_sync flag.
      
      Since In_sync means "device is in_sync with rest of array", and the new
      added disk need to resync thread to help the synchronization of data.
      And md_reap_sync_thread would call spare_active to set In_sync for the
      new added disk finally. So don't set In_sync if array is in frozen.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      062f5b2a
    • G
      md: allow last device to be forcibly removed from RAID1/RAID10. · 9a567843
      Guoqing Jiang 提交于
      When the 'last' device in a RAID1 or RAID10 reports an error,
      we do not mark it as failed.  This would serve little purpose
      as there is no risk of losing data beyond that which is obviously
      lost (as there is with RAID5), and there could be other sectors
      on the device which are readable, and only readable from this device.
      This in general this maximises access to data.
      
      However the current implementation also stops an admin from removing
      the last device by direct action.  This is rarely useful, but in many
      case is not harmful and can make automation easier by removing special
      cases.
      
      Also, if an attempt to write metadata fails the device must be marked
      as faulty, else an infinite loop will result, attempting to update
      the metadata on all non-faulty devices.
      
      So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
      the 'last disk' checks for RAID1 and RAID10, and control the behavior
      per array by change sysfs node.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      [add sysfs node for fail_last_dev by Guoqing]
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      9a567843
    • A
      md: Convert to use int_pow() · cf891607
      Andy Shevchenko 提交于
      Instead of linear approach to calculate power of 10, use generic int_pow()
      which does it better.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      cf891607
  16. 21 6月, 2019 3 次提交
    • G
      md: add bitmap_abort label in md_run · d494549a
      Guoqing Jiang 提交于
      Now, there are two places need to consider about
      the failure of destroy bitmap, so move the common
      part between bitmap_abort and abort label.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      d494549a
    • G
      md: introduce mddev_create/destroy_wb_pool for the change of member device · 963c555e
      Guoqing Jiang 提交于
      Previously, we called rdev_init_wb to avoid potential data
      inconsistency when array is created.
      
      Now, we need to call the function and create mempool if a
      device is added or just be flaged as "writemostly". So
      mddev_create_wb_pool is introduced and called accordingly.
      And for safety reason, we mark implicit GFP_NOIO allocation
      scope for create mempool during mddev_suspend/mddev_resume.
      
      And mempool should be removed conversely after remove a
      member device or its's "writemostly" flag, which is done
      by call mddev_destroy_wb_pool.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      963c555e
    • G
      md/raid1: fix potential data inconsistency issue with write behind device · 3e148a32
      Guoqing Jiang 提交于
      For write-behind mode, we think write IO is complete once it has
      reached all the non-writemostly devices. It works fine for single
      queue devices.
      
      But for multiqueue device, if there are lots of IOs come from upper
      layer, then the write-behind device could issue those IOs to different
      queues, depends on the each queue's delay, so there is no guarantee
      that those IOs can arrive in order.
      
      To address the issue, we need to check the collision among write
      behind IOs, we can only continue without collision, otherwise wait
      for the completion of previous collisioned IO.
      
      And WBCollision is introduced for multiqueue device which is worked
      under write-behind mode.
      
      But this patch doesn't handle below cases which could have the data
      inconsistency issue as well, these cases will be handled in later
      patches.
      
      1. modify max_write_behind by write backlog node.
      2. add or remove array's bitmap dynamically.
      3. the change of member disk.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      3e148a32
  17. 18 6月, 2019 1 次提交
    • M
      md: fix for divide error in status_resync · 9642fa73
      Mariusz Tkaczyk 提交于
      Stopping external metadata arrays during resync/recovery causes
      retries, loop of interrupting and starting reconstruction, until it
      hit at good moment to stop completely. While these retries
      curr_mark_cnt can be small- especially on HDD drives, so subtraction
      result can be smaller than 0. However it is casted to uint without
      checking. As a result of it the status bar in /proc/mdstat while stopping
      is strange (it jumps between 0% and 99%).
      
      The real problem occurs here after commit 72deb455 ("block: remove
      CONFIG_LBDAF"). Sector_div() macro has been changed, now the
      divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
      
      Check if db value can be really counted and replace these macro by
      div64_u64() inline.
      Signed-off-by: NMariusz Tkaczyk <mariusz.tkaczyk@intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      9642fa73
  18. 15 6月, 2019 1 次提交