1. 24 5月, 2019 1 次提交
  2. 02 4月, 2019 2 次提交
    • N
      md: batch flush requests. · 2bc13b83
      NeilBrown 提交于
      Currently if many flush requests are submitted to an md device is quick
      succession, they are serialized and can take a long to process them all.
      We don't really need to call flush all those times - a single flush call
      can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      allow any pending flush that was requested before the flush started
      to complete without waiting any more.
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409b     0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 m
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2bc13b83
    • N
      Revert "MD: fix lock contention for flush bios" · 4bc034d3
      NeilBrown 提交于
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
      1/ it make multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4bc034d3
  3. 19 10月, 2018 1 次提交
    • G
      md-cluster/raid10: support add disk under grow mode · 7564beda
      Guoqing Jiang 提交于
      For clustered raid10 scenario, we need to let all the nodes
      know about that a new disk is added to the array, and the
      reshape caused by add new member just need to be happened in
      one node, but other nodes should know about the change.
      
      Since reshape means read data from somewhere (which is already
      used by array) and write data to unused region. Obviously, it
      is awful if one node is reading data from address while another
      node is writing to the same address. Considering we have
      implemented suspend writes in the resyncing area, so we can
      just broadcast the reading address to other nodes to avoid the
      trouble.
      
      For master node, it would call reshape_request then update sb
      during the reshape period. To avoid above trouble, we call
      resync_info_update to send RESYNC message in reshape_request.
      
      Then from slave node's view, it receives two type messages:
      1. RESYNCING message
      Slave node add the address (where master node reading data from)
      to suspend list.
      
      2. METADATA_UPDATED message
      Once slave nodes know the reshaping is started in master node,
      it is time to update reshape position and call start_reshape to
      follow master node's step. After reshape is done, only reshape
      position is need to be updated, so the majority task of reshaping
      is happened on the master node.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      7564beda
  4. 06 7月, 2018 1 次提交
    • G
      md-cluster: show array's status more accurate · 0357ba27
      Guoqing Jiang 提交于
      When resync or recovery is happening in one node,
      other nodes don't show the appropriate info now.
      
      For example, when create an array in master node
      without "--assume-clean", then assemble the array
      in slave nodes, you can see "resync=PENDING" when
      read /proc/mdstat in slave nodes. However, the info
      is confusing since "PENDING" status is introduced
      for start array in read-only mode.
      
      We introduce RESYNCING_REMOTE flag to indicate that
      resync thread is running in remote node. The flags
      is set when node receive RESYNCING msg. And we clear
      the REMOTE flag in following cases:
      
      1. resync or recover is finished in master node,
         which means slaves receive msg with both lo
         and hi are set to 0.
      2. node continues resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
      Then we show accurate information in status_resync
      by check REMOTE flags and with other conditions.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0357ba27
  5. 31 5月, 2018 1 次提交
  6. 22 5月, 2018 1 次提交
    • X
      MD: fix lock contention for flush bios · 5a409b4f
      Xiao Ni 提交于
      There is a lock contention when there are many processes which send flush bios
      to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.
      
      Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
      to be NULL, otherwise get mddev->lock.
      
      This patch remove mddev->flush_bio and handle flush bio asynchronously.
      I did a test with command dbench -s 128 -t 300. This is the test result:
      
      =================Without the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                    11165   167.595  5879.560
       Close                   107469     1.391  2231.094
       LockX                      384     0.003     0.019
       Rename                    5944     2.141  1856.001
       ReadX                   208121     0.003     0.074
       WriteX                   98259  1925.402 15204.895
       Unlink                   25198    13.264  3457.268
       UnlockX                    384     0.001     0.009
       FIND_FIRST               47111     0.012     0.076
       SET_FILE_INFORMATION     12966     0.007     0.065
       QUERY_FILE_INFORMATION   27921     0.004     0.085
       QUERY_PATH_INFORMATION  124650     0.005     5.766
       QUERY_FS_INFORMATION     22519     0.003     0.053
       NTCreateX               141086     4.291  2502.812
      
      Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms
      
      =================With the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                     4500   174.134   406.398
       Close                    48195     0.060   467.062
       LockX                      256     0.003     0.029
       Rename                    2324     0.026     0.360
       ReadX                    78846     0.004     0.504
       WriteX                   66832   562.775  1467.037
       Unlink                    5516     3.665  1141.740
       UnlockX                    256     0.002     0.019
       FIND_FIRST               16428     0.015     0.313
       SET_FILE_INFORMATION      6400     0.009     0.520
       QUERY_FILE_INFORMATION   17865     0.003     0.089
       QUERY_PATH_INFORMATION   47060     0.078   416.299
       QUERY_FS_INFORMATION      7024     0.004     0.032
       NTCreateX                55921     0.854  1141.452
      
      Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms
      
      The test is done on raid1 disk with two rotational disks
      
      V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
      version
      
      V4: use address of fbio to do hash to choose free flush info.
      V3:
      Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
      and uses a simple bitmap to record which resource is free.
      
      Fix a bug from v2. It should set flush_pending to 1 at first.
      
      V2:
      Neil pointed out two problems. One is counting error problem and another is return value
      when allocat memory fails.
      1. counting error problem
      This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
      that you previously called
                                atomic_inc(&rdev->nr_pending);
      If an rdev was added to the list between the start and end of the flush,
      this will do something bad.
      
      Now it doesn't use bio_chain. It uses specified call back function for each
      flush bio.
      2. Returned on IO error when kmalloc fails is wrong.
      I use mempool suggested by Neil in V2
      3. Fixed some places pointed by Guoqing
      Suggested-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a409b4f
  7. 19 2月, 2018 1 次提交
  8. 16 1月, 2018 1 次提交
    • T
      raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Tomasz Majchrzak 提交于
      In order to provide data consistency with PPL for disks with write-back
      cache enabled all data has to be flushed to disks before next PPL
      entry. The disks to be flushed are marked in the bitmap. It's modified
      under a mutex and it's only read after PPL io unit is submitted.
      
      A limitation of 64 disks in the array has been introduced to keep data
      structures and implementation simple. RAID5 arrays with so many disks are
      not likely due to high risk of multiple disks failure. Such restriction
      should not be a real life limitation.
      
      With write-back cache disabled next PPL entry is submitted when data write
      for current one completes. Data flush defers next log submission so trigger
      it when there are no stripes for handling found.
      
      As PPL assures all data is flushed to disk at request completion, just
      acknowledge flush request when PPL is enabled.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  9. 12 12月, 2017 1 次提交
    • S
      md: introduce new personality funciton start() · d5d885fd
      Song Liu 提交于
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
      design doesn't break badly right now. But it could lead to bad bug in
      the future.
      
      This patch tries to resolve this problem by splitting start up work
      into two personality functions, run() and start(). Tasks that do not
      require the md threads should go into run(), while task that require
      the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d5d885fd
  10. 02 11月, 2017 3 次提交
    • S
      md: use lockdep_assert_held · efa4b77b
      Shaohua Li 提交于
      lockdep_assert_held is a better way to assert lock held, and it works
      for UP.
      Signed-off-by: NShaohua Li <shli@fb.com>
      efa4b77b
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: allow metadata update while suspending. · 35bfc521
      NeilBrown 提交于
      There are various deadlocks that can occur
      when a thread holds reconfig_mutex and calls
      ->quiesce(mddev, 1).
      As some write request block waiting for
      metadata to be updated (e.g. to record device
      failure), and as the md thread updates the metadata
      while the reconfig mutex is held, holding the mutex
      can stop write requests completing, and this prevents
      ->quiesce(mddev, 1) from completing.
      
      ->quiesce() is now usually called from mddev_suspend(),
      and it is always called with reconfig_mutex held.  So
      at this time it is safe for the thread to update metadata
      without explicitly taking the lock.
      
      So add 2 new flags, one which says the unlocked updates is
      allowed, and one which ways it is happening.  Then allow it
      while the quiesce completes, and then wait for it to finish.
      Reported-and-tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      35bfc521
  11. 28 9月, 2017 1 次提交
  12. 28 8月, 2017 1 次提交
  13. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  14. 22 7月, 2017 2 次提交
  15. 11 7月, 2017 1 次提交
  16. 22 6月, 2017 1 次提交
    • N
      md: use a separate bio_set for synchronous IO. · 5a85071c
      NeilBrown 提交于
      md devices allocate a bio_set and use it for two
      distinct purposes.
      mddev->bio_set is used to clone bios as part of sending
      upper level requests down to lower level devices,
      and it is also use for synchronous IO such as superblock
      and bitmap updates, and for correcting read errors.
      
      This multiple usage can lead to deadlocks.  It is likely
      that cloned bios might be queued for write and to be
      waiting for a metadata update before the write can be permitted.
      If the cloning exhausted mddev->bio_set, the metadata update
      may not be able to proceed.
      
      This scenario has been seen during heavy testing, with lots of IO and
      lots of memory pressure.
      
      Address this by adding a new bio_set specifically for synchronous IO.
      All synchronous IO goes directly to the underlying device and is not
      queued at the md level, so request using entries from the new
      mddev->sync_set will complete in a timely fashion.
      Requests that use mddev->bio_set will sometimes need to wait
      for synchronous IO, but will no longer risk deadlocking that iO.
      
      Also: small simplification in mddev_put(): there is no need to
      wait until the spinlock is released before calling bioset_free().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a85071c
  17. 14 6月, 2017 1 次提交
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  18. 06 6月, 2017 1 次提交
  19. 09 5月, 2017 1 次提交
  20. 09 4月, 2017 1 次提交
  21. 25 3月, 2017 2 次提交
  22. 23 3月, 2017 2 次提交
    • N
      MD: use per-cpu counter for writes_pending · 4ad23a97
      NeilBrown 提交于
      The 'writes_pending' counter is used to determine when the
      array is stable so that it can be marked in the superblock
      as "Clean".  Consequently it needs to be updated frequently
      but only checked for zero occasionally.  Recent changes to
      raid5 cause the count to be updated even more often - once
      per 4K rather than once per bio.  This provided
      justification for making the updates more efficient.
      
      So we replace the atomic counter a percpu-refcount.
      This can be incremented and decremented cheaply most of the
      time, and can be switched to "atomic" mode when more
      precise counting is needed.  As it is possible for multiple
      threads to want a precise count, we introduce a
      "sync_checker" counter to count the number of threads
      in "set_in_sync()", and only switch the refcount back
      to percpu mode when that is zero.
      
      We need to be careful about races between set_in_sync()
      setting ->in_sync to 1, and md_write_start() setting it
      to zero.  md_write_start() holds the rcu_read_lock()
      while checking if the refcount is in percpu mode.  If
      it is, then we know a switch to 'atomic' will not happen until
      after we call rcu_read_unlock(), in which case set_in_sync()
      will see the elevated count, and not set in_sync to 1.
      If it is not in percpu mode, we take the mddev->lock to
      ensure proper synchronization.
      
      It is no longer possible to quickly check if the count is zero, which
      we previously did to update a timer or to schedule the md_thread.
      So now we do these every time we decrement that counter, but make
      sure they are fast.
      
      mod_timer() already optimizes the case where the timeout value doesn't
      actually change.  We leverage that further by always rounding off the
      jiffies to the timeout value.  This may delay the marking of 'clean'
      slightly, but ensure we only perform atomic operation here when absolutely
      needed.
      
      md_wakeup_thread() current always calls wake_up(), even if
      THREAD_WAKEUP is already set.  That too can be optimised to avoid
      calls to wake_up().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4ad23a97
    • N
      md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown 提交于
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
      has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requsts with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      49728050
  23. 17 3月, 2017 3 次提交
    • A
      raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz 提交于
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba903a3e
    • A
      md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz 提交于
      Include information about PPL location and size into mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea0213e0
    • G
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang 提交于
      Previously, when node received METADATA_UPDATED msg, it just
      need to wakeup mddev->thread, then md_reload_sb will be called
      eventually.
      
      We taken the asynchronous way to avoid a deadlock issue, the
      deadlock issue could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, and we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way for handling METADATA_UPDATED msg.
      
      But, we obviously need to avoid above deadlock with the
      sync way. To make this happen, we considered to not hold
      reconfig_mutex to call md_reload_sb, if some other thread
      has already taken reconfig_mutex and waiting for the 'token',
      then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related codes since there are not valid anymore.
      2. mddev is added into md_cluster_info then we can get mddev inside
         lock_token.
      3. add new parameter for lock_token to distinguish reconfig_mutex
         is held or not.
      
      And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
      1. set it before unregister thread, otherwise a deadlock could
         appear if stop a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
         also need to be set before unregister thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B trys to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token don't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0ba95977
  24. 10 3月, 2017 1 次提交
  25. 16 2月, 2017 1 次提交
    • M
      md: fast clone bio in bio_clone_mddev() · d7a10308
      Ming Lei 提交于
      Firstly bio_clone_mddev() is used in raid normal I/O and isn't
      in resync I/O path.
      
      Secondly all the direct access to bvec table in raid happens on
      resync I/O except for write behind of raid1, in which we still
      use bio_clone() for allocating new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7a10308
  26. 14 2月, 2017 1 次提交
  27. 06 1月, 2017 1 次提交
  28. 09 12月, 2016 1 次提交
    • S
      md: separate flags for superblock changes · 2953079c
      Shaohua Li 提交于
      The mddev->flags are used for different purposes. There are a lot of
      places we check/change the flags without masking unrelated flags, we
      could check/change unrelated flags. These usage are most for superblock
      write, so spearate superblock related flags. This should make the code
      clearer and also fix real bugs.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2953079c
  29. 23 11月, 2016 2 次提交
    • N
      md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown 提交于
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device is status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      46533ff7
    • N
      md/failfast: add failfast flag for md to be used by some personalities. · 688834e6
      NeilBrown 提交于
      This patch just adds a 'failfast' per-device flag which can be stored
      in v0.90 or v1.x metadata.
      The flag is not used yet but the intent is that it can be used for
      mirrored (raid1/raid10) arrays where low latency is more important
      than keeping all devices on-line.
      
      Setting the flag for a device effectively gives permission for that
      device to be marked as Faulty and excluded from the array on the first
      error.  The underlying driver will be directed not to retry requests
      that result in failures.  There is a proviso that the device must not
      be marked faulty if that would cause the array as a whole to fail, it
      may only be marked Faulty if the array remains functional, but is
      degraded.
      
      Failures on read requests will cause the device to be marked
      as Faulty immediately so that further reads will avoid that
      device.  No attempt will be made to correct read errors by
      over-writing with the correct data.
      
      It is expected that if transient errors, such as cable unplug, are
      possible, then something in user-space will revalidate failed
      devices and re-add them when they appear to be working again.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      688834e6
  30. 10 11月, 2016 1 次提交
  31. 08 11月, 2016 1 次提交
    • T
      md: add bad block support for external metadata · 35b785f7
      Tomasz Majchrzak 提交于
      Add new rdev flag which external metadata handler can use to switch
      on/off bad block support. If new bad block is encountered, notify it via
      rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
      cleared, notify update to rdev 'bad_blocks' sysfs file.
      
      When bad blocks support is being removed, just clear rdev flag. It is
      not necessary to reset badblocks->shift field. If there are bad blocks
      cleared or added at the same time, it is ok for those changes to be
      applied to the structure. The array is in blocked state and the drive
      which cannot handle bad blocks any more will be removed from the array
      before it is unlocked.
      
      Simplify state_show function by adding a separator at the end of each
      string and overwrite last separator with new line.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Reviewed-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      35b785f7