1. 17 Apr 2019 (1 commit)
  2. 11 Apr 2019 (5 commits)
  3. 07 Apr 2019 (1 commit)
    • block: remove CONFIG_LBDAF · 72deb455
      Christoph Hellwig authored
      Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
      architectures.  These types are required to support block device and/or
      file sizes larger than 2 TiB, and have generally defaulted to on for
      a long time.  Enabling the option only increases the i386 tinyconfig
      size by 145 bytes, and many data structures already always use
      64-bit values for their in-core and on-disk data structures anyway,
      so there should not be a large change in dynamic memory usage either.
      
      Dropping this option removes a somewhat weird non-default config that
      has caused various bugs or compiler warnings when actually used.
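      
      Since a sector is 512 bytes, a 32-bit sector count tops out at 2 TiB; the
      arithmetic, as a small standalone C check (illustrative, not kernel code):
      
          #include <stdint.h>
          #include <stdio.h>
      
          int main(void)
          {
                  /* 32-bit sector numbers end at 2^32 - 1; at the block
                   * layer's 512-byte granularity that caps a device at
                   * 2^32 * 512 bytes = 2 TiB. */
                  uint64_t bytes = (UINT32_MAX + 1ULL) * 512;
                  printf("%llu bytes = %llu GiB\n",
                         (unsigned long long)bytes,
                         (unsigned long long)(bytes >> 30)); /* 2048 GiB */
                  return 0;
          }
      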
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 02 Apr 2019 (2 commits)
    • md: batch flush requests. · 2bc13b83
      NeilBrown authored
      Currently, if many flush requests are submitted to an md device in quick
      succession, they are serialized and it can take a long time to process
      them all.  We don't really need to call flush all those times: a single
      flush call can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      and allow any pending flush that was requested before the flush started
      to complete without waiting any further.
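      
      A minimal userspace sketch of the batching idea, using pthreads and a
      logical clock in place of ktime (all names here are illustrative, not
      the md API):
      
          #include <pthread.h>
          #include <stdint.h>
          #include <unistd.h>
      
          static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
          static uint64_t now;          /* logical clock, ticks per event  */
          static uint64_t flush_start;  /* start of in-flight flush, or 0  */
          static uint64_t last_flush;   /* start of last *completed* flush */
      
          static void hardware_flush(void) { usleep(1000); /* stand-in */ }
      
          void flush_request(void)
          {
                  pthread_mutex_lock(&lock);
                  uint64_t req_start = ++now;
                  /* Wait until no flush is in flight, or until a flush
                   * that started after this request has completed. */
                  while (flush_start != 0 && last_flush < req_start)
                          pthread_cond_wait(&done, &lock);
                  if (last_flush > req_start) {
                          /* A flush that started after we arrived has
                           * finished: it already covers this request. */
                          pthread_mutex_unlock(&lock);
                          return;
                  }
                  flush_start = ++now;           /* become the flusher   */
                  pthread_mutex_unlock(&lock);
      
                  hardware_flush();
      
                  pthread_mutex_lock(&lock);
                  last_flush = flush_start;      /* record completion    */
                  flush_start = 0;
                  pthread_cond_broadcast(&done); /* wake batched waiters */
                  pthread_mutex_unlock(&lock);
          }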
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409      0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 ms
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "MD: fix lock contention for flush bios" · 4bc034d3
      NeilBrown authored
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
      1/ It makes multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - the whole pool of requests
       could be sitting in various ->bio_list queues while a subsequent
       mempool_alloc blocks waiting for one of them to be released.
      
      2/ It aims to handle the case where there are many concurrent flush requests.
        It does so by submitting many requests in parallel - all of which
        are identical, so most of them do nothing useful.
        It would be more efficient to send just one lower-level request and
        allow it to satisfy multiple upper-level requests.
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 14 Jan 2019 (1 commit)
  6. 21 Dec 2018 (2 commits)
  7. 10 Dec 2018 (1 commit)
  8. 23 Oct 2018 (2 commits)
  9. 19 Oct 2018 (4 commits)
    • md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted · cb9ee154
      Guoqing Jiang authored
      We need to continue the reshape if it was interrupted on the
      original node, so the original node should call resync_bitmap
      in case the reshape is aborted.
      
      The BITMAP_NEEDS_SYNC message is then broadcast to the other nodes,
      and the node which continues the reshape should restart it from
      mddev->reshape_position instead of from the very beginning.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: don't call remove_and_add_spares during reshaping stage · ca1e98e0
      Guoqing Jiang authored
      remove_and_add_spares is not needed if a reshape is
      happening on another node, because raid10_add_disk,
      called inside raid10_start_reshape, handles the
      role changes of the disk.  Moreover, remove_and_add_spares
      can't deal with the role change caused by a reshape.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: call update_size in md_reap_sync_thread · aefb2e5f
      Guoqing Jiang authored
      We need to change the capacity on all nodes after one node
      finishes a reshape.  As before, we can't change the capacity
      directly in md_do_sync; instead, the capacity should only be
      changed in update_size or on receiving a CHANGE_CAPACITY msg.
      
      So the master node calls update_size after it completes the reshape
      in md_reap_sync_thread, but we need to skip ops->update_size if
      MD_CLOSING is set, since the reshape may not have finished.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: support add disk under grow mode · 7564beda
      Guoqing Jiang authored
      For the clustered raid10 scenario, we need to let all the nodes
      know that a new disk has been added to the array.  The reshape
      caused by adding a new member only needs to happen on one node,
      but the other nodes should know about the change.
      
      A reshape means reading data from somewhere (a region already
      used by the array) and writing it to an unused region.  Obviously,
      it would be awful if one node were reading data from an address while
      another node was writing to the same address.  Since we have
      already implemented suspended writes for the resyncing area, we can
      just broadcast the address being read to the other nodes to avoid the
      trouble.
      
      The master node calls reshape_request and then updates the sb
      during the reshape period.  To avoid the above trouble, we call
      resync_info_update to send a RESYNC message in reshape_request.
      
      From the slave node's view, it receives two types of messages:
      1. RESYNCING message
      The slave node adds the address range (where the master node is
      reading data from) to its suspend list.
      
      2. METADATA_UPDATED message
      Once a slave node knows the reshape has started on the master node,
      it is time to update the reshape position and call start_reshape to
      follow the master node's steps.  After the reshape is done, only the
      reshape position needs to be updated, so the majority of the reshape
      work happens on the master node.
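      
      A hedged sketch of how such a slave-side dispatch might look; this is
      plain illustrative C, not the md-cluster API, and the helper names
      (suspend_range, follow_master_reshape) are invented for the example:
      
          #include <stdint.h>
      
          enum msg_type { RESYNCING, METADATA_UPDATED };
      
          struct cluster_msg {
                  enum msg_type type;
                  uint64_t lo, hi;   /* range the master is reading from */
          };
      
          static void suspend_range(uint64_t lo, uint64_t hi)
          {
                  /* stub: add [lo, hi) to the local suspend list so no
                   * local write can race with the master's relocation */
                  (void)lo; (void)hi;
          }
      
          static void follow_master_reshape(void)
          {
                  /* stub: update reshape position, call start_reshape */
          }
      
          void slave_handle_msg(const struct cluster_msg *msg)
          {
                  switch (msg->type) {
                  case RESYNCING:
                          suspend_range(msg->lo, msg->hi);
                          break;
                  case METADATA_UPDATED:
                          follow_master_reshape();
                          break;
                  }
          }
      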
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  10. 15 Oct 2018 (1 commit)
  11. 04 Oct 2018 (1 commit)
    • md: allow metadata updates while suspending an array - fix · 059421e0
      NeilBrown authored
      Commit 35bfc521 ("md: allow metadata update while suspending.")
      added support for allowing md_check_recovery() to still perform
      metadata updates while the array is entering the 'suspended' state.
      This is needed to allow the process of entering the state to
      complete.
      
      Unfortunately, the patch doesn't really work.  The test for
      "mddev->suspended" at the start of md_check_recovery() means that the
      function doesn't try to do anything at all while entering suspend.
      
      This patch moves the code of updating the metadata while suspending to
      *before* the test on mddev->suspended.
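      
      A small standalone sketch of the reordering (simplified stand-ins for
      struct mddev and md_update_sb; an illustration, not the actual kernel
      code):
      
          #include <stdbool.h>
      
          struct mddev_sketch {
                  bool suspended;   /* array is entering suspend     */
                  bool sb_dirty;    /* metadata needs to be written  */
          };
      
          static void write_superblock(struct mddev_sketch *m)
          {
                  m->sb_dirty = false;
          }
      
          void check_recovery_sketch(struct mddev_sketch *m)
          {
                  /* the fix: handle the pending metadata write *before*
                   * the early return, so entering suspend can complete */
                  if (m->sb_dirty)
                          write_superblock(m);
                  if (m->suspended)
                          return;   /* previously this test came first,
                                     * so the update above never ran
                                     * while suspending */
                  /* ... normal recovery work ... */
          }
      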
      Reported-by: Jeff Mahoney <jeffm@suse.com>
      Fixes: 35bfc521 ("md: allow metadata update while suspending.")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  12. 02 Oct 2018 (1 commit)
  13. 02 Aug 2018 (1 commit)
  14. 25 Jul 2018 (1 commit)
  15. 18 Jul 2018 (2 commits)
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan authored
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function; they are
      now indexed by op_is_write().
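      
      For reference, the helper is tiny; roughly (as of this series, before
      discard stats existed, and treated here as a sketch rather than the
      exact source):
      
          static inline int op_stat_group(unsigned int op)
          {
                  return op_is_write(op);  /* STAT_READ == 0, STAT_WRITE == 1 */
          }
      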
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add part_stat_read_accum to read across field entries. · 59767fbd
      Michael Callahan authored
      Add a part_stat_read_accum macro to genhd.h to read and sum across
      field entries, for example to sum up the number of read and write
      sectors completed.  In addition to being a reasonable cleanup by
      itself, this will make it easier to add new stat fields in the future.
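      
      Roughly, the macro has this shape (a sketch; at the time of this series
      each stat field had only read and write entries):
      
          #define part_stat_read_accum(part, field)                 \
                  (part_stat_read(part, field[STAT_READ]) +         \
                   part_stat_read(part, field[STAT_WRITE]))
      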
      
      tj: Refreshed on top of v4.17.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 06 Jul 2018 (1 commit)
    • md-cluster: show array's status more accurate · 0357ba27
      Guoqing Jiang authored
      When a resync or recovery is happening on one node,
      the other nodes don't currently show the appropriate info.
      
      For example, when you create an array on the master node
      without "--assume-clean" and then assemble the array
      on slave nodes, you see "resync=PENDING" when
      reading /proc/mdstat on the slave nodes.  However, that info
      is confusing, since the "PENDING" status was introduced
      for starting an array in read-only mode.
      
      We introduce a RESYNCING_REMOTE flag to indicate that a
      resync thread is running on a remote node.  The flag is
      set when a node receives a RESYNCING msg, and we clear
      the REMOTE flag in the following cases:
      
      1. resync or recovery is finished on the master node,
         which means slaves receive a msg with both lo
         and hi set to 0.
      2. the node continues the resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
      Then we show accurate information in status_resync
      by checking the REMOTE flag along with other conditions.
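      
      A toy sketch of the resulting reporting decision (field names and the
      printed strings are illustrative, not the exact /proc/mdstat output):
      
          #include <stdbool.h>
          #include <stdio.h>
      
          struct node_state {
                  bool resyncing_remote;  /* set on RESYNCING msg     */
                  bool resync_running;    /* local sync thread exists */
          };
      
          void status_resync_sketch(const struct node_state *s)
          {
                  if (s->resyncing_remote && !s->resync_running)
                          printf("resync running on a remote node\n");
                  else if (!s->resync_running)
                          printf("resync=PENDING\n"); /* read-only start */
          }
      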
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  17. 19 Jun 2018 (1 commit)
  18. 08 Jun 2018 (1 commit)
    • md: Unify mddev destruction paths · 28dec870
      Kent Overstreet authored
      Previously, mddev_put() had a couple of different paths for freeing an
      mddev, due to the fact that the kobject wasn't initialized when the
      mddev was first allocated.  If we move the kobject_init() to when it's
      first allocated and just use kobject_add() later, we can clean all this
      up.
      
      This also removes a hack in mddev_put() to avoid freeing biosets under a
      spinlock, which involved copying biosets on the stack after the recent
      bioset_init() changes.
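      
      A hedged kernel-style sketch of the pattern (the call sites, parent and
      name shown here are for illustration, not the exact patch):
      
          /* at allocation: the object is refcounted from birth, giving
           * one destruction path via kobject_put() */
          kobject_init(&mddev->kobj, &md_ktype);
      
          /* later, when the device is registered, expose it in sysfs */
          err = kobject_add(&mddev->kobj, parent, "%s", name);
      
          /* every teardown path now ends the same way */
          kobject_put(&mddev->kobj);
      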
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 31 May 2018 (1 commit)
  20. 22 May 2018 (1 commit)
    • MD: fix lock contention for flush bios · 5a409b4f
      Xiao Ni authored
      There is lock contention when many processes send flush bios
      to an md device, e.g. creating many LVs on one raid device and running
      mkfs.xfs on each LV.
      
      Currently flush requests are handled sequentially: each one has to wait,
      under mddev->lock, for mddev->flush_bio to become NULL.
      
      This patch removes mddev->flush_bio and handles flush bios asynchronously.
      I did a test with the command dbench -s 128 -t 300.  This is the test result:
      
      =================Without the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                    11165   167.595  5879.560
       Close                   107469     1.391  2231.094
       LockX                      384     0.003     0.019
       Rename                    5944     2.141  1856.001
       ReadX                   208121     0.003     0.074
       WriteX                   98259  1925.402 15204.895
       Unlink                   25198    13.264  3457.268
       UnlockX                    384     0.001     0.009
       FIND_FIRST               47111     0.012     0.076
       SET_FILE_INFORMATION     12966     0.007     0.065
       QUERY_FILE_INFORMATION   27921     0.004     0.085
       QUERY_PATH_INFORMATION  124650     0.005     5.766
       QUERY_FS_INFORMATION     22519     0.003     0.053
       NTCreateX               141086     4.291  2502.812
      
      Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms
      
      =================With the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                     4500   174.134   406.398
       Close                    48195     0.060   467.062
       LockX                      256     0.003     0.029
       Rename                    2324     0.026     0.360
       ReadX                    78846     0.004     0.504
       WriteX                   66832   562.775  1467.037
       Unlink                    5516     3.665  1141.740
       UnlockX                    256     0.002     0.019
       FIND_FIRST               16428     0.015     0.313
       SET_FILE_INFORMATION      6400     0.009     0.520
       QUERY_FILE_INFORMATION   17865     0.003     0.089
       QUERY_PATH_INFORMATION   47060     0.078   416.299
       QUERY_FS_INFORMATION      7024     0.004     0.032
       NTCreateX                55921     0.854  1141.452
      
      Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms
      
      The test was done on a raid1 array with two rotational disks.
      
      V5: V4 is more complicated than the version with a memory pool, so revert
      to the memory pool version.
      
      V4: use the address of fbio to do a hash to choose a free flush info.
      V3:
      Shaohua suggests mempool is overkill.  In v3 it allocates memory when creating
      the raid device and uses a simple bitmap to record which resources are free.
      
      Fix a bug from v2: it should set flush_pending to 1 at first.
      
      V2:
      Neil pointed out two problems: one is a counting error, and the other is the
      return value when allocating memory fails.
      1. counting error problem
      This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
      that you previously called
                                atomic_inc(&rdev->nr_pending);
      on.  If an rdev was added to the list between the start and end of the flush,
      this will do something bad.
      
      Now it doesn't use bio_chain; it uses a dedicated callback function for each
      flush bio.
      2. Returning an IO error when kmalloc fails is wrong.
      I use the mempool suggested by Neil in V2.
      3. Fixed some places pointed out by Guoqing.
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  21. 18 May 2018 (1 commit)
    • md: fix NULL dereference of mddev->pers in remove_and_add_spares() · c42a0e26
      Yufen Yu authored
      We met a NULL pointer BUG as follows:
      
      [  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
      [  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
      [  151.762039] Oops: 0000 [#1] SMP PTI
      [  151.762406] Modules linked in:
      [  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
      [  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
      [  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
      [  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
      [  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
      [  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
      [  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
      [  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
      [  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
      [  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
      [  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
      [  151.771272] Call Trace:
      [  151.771542]  md_ioctl+0x1df2/0x1e10
      [  151.771906]  ? __switch_to+0x129/0x440
      [  151.772295]  ? __schedule+0x244/0x850
      [  151.772672]  blkdev_ioctl+0x4bd/0x970
      [  151.773048]  block_ioctl+0x39/0x40
      [  151.773402]  do_vfs_ioctl+0xa4/0x610
      [  151.773770]  ? dput.part.23+0x87/0x100
      [  151.774151]  ksys_ioctl+0x70/0x80
      [  151.774493]  __x64_sys_ioctl+0x16/0x20
      [  151.774877]  do_syscall_64+0x5b/0x180
      [  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      For raid6, when two disks of the array are offline, two spare disks can
      be added to the array.  If the system reboots before the spare disks'
      recovery completes, mdadm thinks it is ok to restart the degraded
      array via md_ioctl().  Since the disks in the raid6 are not only_parity(),
      raid5_run() will abort when there is neither the PPL feature nor the
      'start_dirty_degraded' parameter set.  Therefore, mddev->pers is NULL.
      
      However, mddev->raid_disks has been set and is not cleared when
      raid5_run() aborts.  md_ioctl() can then execute the 'HOT_REMOVE_DISK' cmd
      to remove a disk via mdadm, which finally causes the NULL pointer
      dereference in remove_and_add_spares().
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  22. 02 May 2018 (1 commit)
    • md: fix two problems with setting the "re-add" device state. · 011abdc9
      NeilBrown authored
      If "re-add" is written to the "state" file for a device
      which is faulty, this has an effect similar to removing
      and re-adding the device.  It should take up the
      same slot in the array that it previously had, and
      an accelerated (e.g. bitmap-based) rebuild should happen.
      
      The slot that "it previously had" is determined by
      rdev->saved_raid_disk.
      However this is not set when a device fails (only when a device
      is added), and it is cleared when resync completes.
      This means that "re-add" will normally work once, but may not work a
      second time.
      
      This patch includes two fixes.
      1/ when a device fails, record the ->raid_disk value in
          ->saved_raid_disk before clearing ->raid_disk
      2/ when "re-add" is written to a device for which
          ->saved_raid_disk is not set, fail.
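      
      A small standalone sketch of the two fixes (a simplified stand-in for
      struct md_rdev; the error code is an assumption for illustration):
      
          #include <errno.h>
      
          struct rdev_sketch { int raid_disk; int saved_raid_disk; };
      
          /* 1/ on failure, remember the slot before clearing it */
          void on_device_fail(struct rdev_sketch *rdev)
          {
                  rdev->saved_raid_disk = rdev->raid_disk;
                  rdev->raid_disk = -1;
          }
      
          /* 2/ "re-add" without a remembered slot must fail rather
           * than silently fall back to a slow full rebuild */
          int on_re_add(struct rdev_sketch *rdev)
          {
                  if (rdev->saved_raid_disk < 0)
                          return -EBUSY;
                  return 0;   /* proceed with the accelerated rebuild */
          }
      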
      
      I think this is suitable for stable as the bug can
      cause re-adding a device to be forced to do a full
      resync, which takes a lot longer and so puts data at
      more risk.
      
      Cc: <stable@vger.kernel.org> (v4.1)
      Fixes: 97f6cd39 ("md-cluster: re-add capabilities")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  23. 09 Apr 2018 (1 commit)
  24. 09 Mar 2018 (1 commit)
  25. 01 Mar 2018 (1 commit)
  26. 26 Feb 2018 (1 commit)
    • md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang authored
      There is a potential deadlock if mount/umount happens when
      raid5_finish_reshape() tries to grow the size of the emulated disk.
      
      How does the deadlock happen?
      1) The raid5 resync thread finished reshape (expanding array).
      2) The mount or umount thread holds the VFS sb->s_umount lock and tries to
         write critical data through to the raid5 emulated block device, so it
         waits for the raid5 kernel thread, which handles stripes, to finish
         its I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
      The raid5 kernel thread is part of an emulated block device.  It is
      responsible for handling I/Os (stripes) from upper layers.  The emulated
      block device should not request any I/Os on itself; that is, it should
      not call VFS layer functions.  (If it did, it would try to acquire VFS
      locks to guarantee the I/O sequence.)  So we have the resync thread send
      resync I/O requests and wait for the results.
      
      For solving this potential deadlock, we can make the size growth of the
      emulated block device the final step of the reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that the same deadlock issue exists in raid10.  It's
      reproducible and can be fixed by this patch.  For raid10.c, we can also
      remove the similar deadlock-avoidance code, since it has been called
      before.
      Reported-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  27. 20 Feb 2018 (1 commit)
    • md: only allow remove_and_add_spares when no sync_thread running. · 39772f0a
      NeilBrown authored
      The locking protocols in md assume that a device will
      never be removed from an array during resync/recovery/reshape.
      When that isn't happening, rcu or reconfig_mutex is needed
      to protect an rdev pointer while taking a refcount.  When
      it is happening, that protection isn't needed.
      
      Unfortunately there are cases where remove_and_add_spares() is
      called when recovery might be happening: state_store(),
      slot_store() and hot_remove_disk().
      In each case, this is just an optimization to try to expedite
      removal from the personality so the device can be removed from
      the array.  If a resync etc. is happening, we just have to wait
      for md_check_recovery() to find a suitable time to call
      remove_and_add_spares().
      
      This optimization is not essential, so it doesn't
      matter if it fails.
      So change remove_and_add_spares() to abort early if
      resync/recovery/reshape is happening, unless it is called
      from md_check_recovery() as part of a newly started recovery.
      The parameter "this" is only NULL when called from
      md_check_recovery() so when it is NULL, there is no need to abort.
      
      As this can result in a NULL dereference, the fix is suitable
      for -stable.
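      
      A standalone sketch of the guard (simplified stand-ins for the md
      structures; the types and field names are not reproduced exactly):
      
          #include <stdbool.h>
          #include <stddef.h>
      
          struct mddev_sketch { bool sync_thread_running; };
          struct rdev_sketch  { int slot; };
      
          /* this != NULL: called from state_store()/slot_store()/
           *               hot_remove_disk() - just an optimization,
           *               so bail out while a sync thread may run.
           * this == NULL: called from md_check_recovery() as part of
           *               a newly started recovery - safe to proceed. */
          int remove_and_add_spares_sketch(struct mddev_sketch *mddev,
                                           struct rdev_sketch *this)
          {
                  if (this && mddev->sync_thread_running)
                          return 0;   /* skip; md_check_recovery() will
                                       * get to it later */
                  /* ... normal spare removal/addition ... */
                  return 1;
          }
      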
      
      Cc: yuyufen <yuyufen@huawei.com>
      Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Fixes: 8430e7e0 ("md: disconnect device from personality before trying to remove it.")
      Cc: stable@vger.kernel.org (v4.8+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  28. 19 Feb 2018 (1 commit)
  29. 18 Feb 2018 (1 commit)