1. 18 12月, 2019 1 次提交
    • D
      md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · a12c768d
      David Jeffery 提交于
      commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream.
      
      If pers->make_request fails in md_flush_request(), the bio is lost. To
      fix this, pass back a bool to indicate if the original make_request call
      should continue to handle the I/O and instead of assuming the flush logic
      will push it to completion.
      
      Convert md_flush_request to return a bool and no longer calls the raid
      driver's make_request function.  If the return is true, then the md flush
      logic has or will complete the bio and the md make_request call is done.
      If false, then the md make_request function needs to keep processing like
      it is a normal bio. Let the original call to md_handle_request handle any
      need to retry sending the bio to the raid driver's make_request function
      should it be needed.
      
      Also mark md_flush_request and the make_request function pointer as
      __must_check to issue warnings should these critical return values be
      ignored.
      
      Fixes: 2bc13b83e629 ("md: batch flush requests.")
      Cc: stable@vger.kernel.org # # v4.19+
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
      Reviewed-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a12c768d
  2. 24 11月, 2019 1 次提交
    • N
      md: allow metadata updates while suspending an array - fix · b69cfc4f
      NeilBrown 提交于
      [ Upstream commit 059421e041eb461fb2b3e81c9adaec18ef03ca3c ]
      
      Commit 35bfc521 ("md: allow metadata update while suspending.")
      added support for allowing md_check_recovery() to still perform
      metadata updates while the array is entering the 'suspended' state.
      This is needed to allow the processes of entering the state to
      complete.
      
      Unfortunately, the patch doesn't really work.  The test for
      "mddev->suspended" at the start of md_check_recovery() means that the
      function doesn't try to do anything at all while entering suspend.
      
      This patch moves the code of updating the metadata while suspending to
      *before* the test on mddev->suspended.
      Reported-by: NJeff Mahoney <jeffm@suse.com>
      Fixes: 35bfc521 ("md: allow metadata update while suspending.")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b69cfc4f
  3. 05 10月, 2019 4 次提交
    • N
      md: only call set_in_sync() when it is expected to succeed. · 5dc86e95
      NeilBrown 提交于
      commit 480523feae581ab714ba6610388a3b4619a2f695 upstream.
      
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can noticeable
      affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been not
      recent IO.  So we save this "safemode was nonzero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5dc86e95
    • N
      md: don't report active array_state until after revalidate_disk() completes. · 598a2cda
      NeilBrown 提交于
      commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream.
      
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon as
      it appear to be zero, fsck can be run.  If it find the size to be
      zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensure the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY cleared twice.  This is deliberate to cover
      both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      598a2cda
    • G
      md: don't set In_sync if array is frozen · 37153845
      Guoqing Jiang 提交于
      [ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ]
      
      When a disk is added to array, the following path is called in mdadm.
      
      Manage_subdevs -> sysfs_freeze_array
                     -> Manage_add
                     -> sysfs_set_str(&info, NULL, "sync_action","idle")
      
      Then from kernel side, Manage_add invokes the path (add_new_disk ->
      validate_super = super_1_validate) to set In_sync flag.
      
      Since In_sync means "device is in_sync with rest of array", and the new
      added disk need to resync thread to help the synchronization of data.
      And md_reap_sync_thread would call spare_active to set In_sync for the
      new added disk finally. So don't set In_sync if array is in frozen.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      37153845
    • G
      md: don't call spare_active in md_reap_sync_thread if all member devices can't work · d38aff20
      Guoqing Jiang 提交于
      [ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ]
      
      When add one disk to array, the md_reap_sync_thread is responsible
      to activate the spare and set In_sync flag for the new member in
      spare_active().
      
      But if raid1 has one member disk A, and disk B is added to the array.
      Then we offline A before all the datas are synchronized from A to B,
      obviously B doesn't have the latest data as A, but B is still marked
      with In_sync flag.
      
      So let's not call spare_active under the condition, otherwise B is
      still showed with 'U' state which is not correct.
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d38aff20
  4. 14 7月, 2019 1 次提交
    • M
      md: fix for divide error in status_resync · 270ae00a
      Mariusz Tkaczyk 提交于
      [ Upstream commit 9642fa73d073527b0cbc337cc17a47d545d82cd2 ]
      
      Stopping external metadata arrays during resync/recovery causes
      retries, loop of interrupting and starting reconstruction, until it
      hit at good moment to stop completely. While these retries
      curr_mark_cnt can be small- especially on HDD drives, so subtraction
      result can be smaller than 0. However it is casted to uint without
      checking. As a result of it the status bar in /proc/mdstat while stopping
      is strange (it jumps between 0% and 99%).
      
      The real problem occurs here after commit 72deb455b5ec ("block: remove
      CONFIG_LBDAF"). Sector_div() macro has been changed, now the
      divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
      
      Check if db value can be really counted and replace these macro by
      div64_u64() inline.
      Signed-off-by: NMariusz Tkaczyk <mariusz.tkaczyk@intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      270ae00a
  5. 26 5月, 2019 3 次提交
    • Y
      md: add mddev->pers to avoid potential NULL pointer dereference · 10cb519c
      Yufen Yu 提交于
      commit ee37e62191a59d253fc916b9fc763deb777211e2 upstream.
      
      When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
      which can avoid potential NULL pointer derefence in fallowing
      add_bound_rdev().
      
      Fixes: a6da4ef8 ("md: re-add a failed disk")
      Cc: Xiao Ni <xni@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: <stable@vger.kernel.org> # 4.4+
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      10cb519c
    • N
      md: batch flush requests. · 3deaa1dc
      NeilBrown 提交于
      commit 2bc13b83e6298486371761de503faeffd15b7534 upstream.
      
      Currently if many flush requests are submitted to an md device is quick
      succession, they are serialized and can take a long to process them all.
      We don't really need to call flush all those times - a single flush call
      can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      allow any pending flush that was requested before the flush started
      to complete without waiting any more.
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409b     0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 m
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3deaa1dc
    • N
      Revert "MD: fix lock contention for flush bios" · 7f6b9285
      NeilBrown 提交于
      commit 4bc034d35377196c854236133b07730a777c4aba upstream.
      
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
      1/ it make multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6b9285
  6. 14 11月, 2018 4 次提交
  7. 02 8月, 2018 1 次提交
  8. 25 7月, 2018 1 次提交
  9. 18 7月, 2018 2 次提交
    • M
      block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan 提交于
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should et updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  It's now
      indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ddcf35d3
    • M
      block: Add part_stat_read_accum to read across field entries. · 59767fbd
      Michael Callahan 提交于
      Add a part_stat_read_accum macro to genhd.h to read and sum across
      field entries.  For example to sum up the number read and write
      sectors completed.  In addition to being ar reasonable cleanup by
      itself this will make it easier to add new stat fields in the future.
      
      tj: Refreshed on top of v4.17.
      Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      59767fbd
  10. 06 7月, 2018 1 次提交
    • G
      md-cluster: show array's status more accurate · 0357ba27
      Guoqing Jiang 提交于
      When resync or recovery is happening in one node,
      other nodes don't show the appropriate info now.
      
      For example, when create an array in master node
      without "--assume-clean", then assemble the array
      in slave nodes, you can see "resync=PENDING" when
      read /proc/mdstat in slave nodes. However, the info
      is confusing since "PENDING" status is introduced
      for start array in read-only mode.
      
      We introduce RESYNCING_REMOTE flag to indicate that
      resync thread is running in remote node. The flags
      is set when node receive RESYNCING msg. And we clear
      the REMOTE flag in following cases:
      
      1. resync or recover is finished in master node,
         which means slaves receive msg with both lo
         and hi are set to 0.
      2. node continues resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
      Then we show accurate information in status_resync
      by check REMOTE flags and with other conditions.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0357ba27
  11. 19 6月, 2018 1 次提交
  12. 08 6月, 2018 1 次提交
    • K
      md: Unify mddev destruction paths · 28dec870
      Kent Overstreet 提交于
      Previously, mddev_put() had a couple different paths for freeing a
      mddev, due to the fact that the kobject wasn't initialized when the
      mddev was first allocated. If we move the kobject_init() to when it's
      first allocated and just use kobject_add() later, we can clean all this
      up.
      
      This also removes a hack in mddev_put() to avoid freeing biosets under a
      spinlock, which involved copying biosets on the stack after the reset
      bioset_init() changes.
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      28dec870
  13. 31 5月, 2018 1 次提交
  14. 22 5月, 2018 1 次提交
    • X
      MD: fix lock contention for flush bios · 5a409b4f
      Xiao Ni 提交于
      There is a lock contention when there are many processes which send flush bios
      to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.
      
      Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
      to be NULL, otherwise get mddev->lock.
      
      This patch remove mddev->flush_bio and handle flush bio asynchronously.
      I did a test with command dbench -s 128 -t 300. This is the test result:
      
      =================Without the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                    11165   167.595  5879.560
       Close                   107469     1.391  2231.094
       LockX                      384     0.003     0.019
       Rename                    5944     2.141  1856.001
       ReadX                   208121     0.003     0.074
       WriteX                   98259  1925.402 15204.895
       Unlink                   25198    13.264  3457.268
       UnlockX                    384     0.001     0.009
       FIND_FIRST               47111     0.012     0.076
       SET_FILE_INFORMATION     12966     0.007     0.065
       QUERY_FILE_INFORMATION   27921     0.004     0.085
       QUERY_PATH_INFORMATION  124650     0.005     5.766
       QUERY_FS_INFORMATION     22519     0.003     0.053
       NTCreateX               141086     4.291  2502.812
      
      Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms
      
      =================With the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                     4500   174.134   406.398
       Close                    48195     0.060   467.062
       LockX                      256     0.003     0.029
       Rename                    2324     0.026     0.360
       ReadX                    78846     0.004     0.504
       WriteX                   66832   562.775  1467.037
       Unlink                    5516     3.665  1141.740
       UnlockX                    256     0.002     0.019
       FIND_FIRST               16428     0.015     0.313
       SET_FILE_INFORMATION      6400     0.009     0.520
       QUERY_FILE_INFORMATION   17865     0.003     0.089
       QUERY_PATH_INFORMATION   47060     0.078   416.299
       QUERY_FS_INFORMATION      7024     0.004     0.032
       NTCreateX                55921     0.854  1141.452
      
      Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms
      
      The test is done on raid1 disk with two rotational disks
      
      V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
      version
      
      V4: use address of fbio to do hash to choose free flush info.
      V3:
      Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
      and uses a simple bitmap to record which resource is free.
      
      Fix a bug from v2. It should set flush_pending to 1 at first.
      
      V2:
      Neil pointed out two problems. One is counting error problem and another is return value
      when allocat memory fails.
      1. counting error problem
      This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
      that you previously called
                                atomic_inc(&rdev->nr_pending);
      If an rdev was added to the list between the start and end of the flush,
      this will do something bad.
      
      Now it doesn't use bio_chain. It uses specified call back function for each
      flush bio.
      2. Returned on IO error when kmalloc fails is wrong.
      I use mempool suggested by Neil in V2
      3. Fixed some places pointed by Guoqing
      Suggested-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a409b4f
  15. 18 5月, 2018 1 次提交
    • Y
      md: fix NULL dereference of mddev->pers in remove_and_add_spares() · c42a0e26
      Yufen Yu 提交于
      We met NULL pointer BUG as follow:
      
      [  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
      [  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
      [  151.762039] Oops: 0000 [#1] SMP PTI
      [  151.762406] Modules linked in:
      [  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
      [  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
      [  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
      [  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
      [  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
      [  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
      [  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
      [  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
      [  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
      [  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
      [  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
      [  151.771272] Call Trace:
      [  151.771542]  md_ioctl+0x1df2/0x1e10
      [  151.771906]  ? __switch_to+0x129/0x440
      [  151.772295]  ? __schedule+0x244/0x850
      [  151.772672]  blkdev_ioctl+0x4bd/0x970
      [  151.773048]  block_ioctl+0x39/0x40
      [  151.773402]  do_vfs_ioctl+0xa4/0x610
      [  151.773770]  ? dput.part.23+0x87/0x100
      [  151.774151]  ksys_ioctl+0x70/0x80
      [  151.774493]  __x64_sys_ioctl+0x16/0x20
      [  151.774877]  do_syscall_64+0x5b/0x180
      [  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      For raid6, when two disk of the array are offline, two spare disks can
      be added into the array. Before spare disks recovery completing,
      system reboot and mdadm thinks it is ok to restart the degraded
      array by md_ioctl(). Since disks in raid6 is not only_parity(),
      raid5_run() will abort, when there is no PPL feature or not setting
      'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL.
      
      But, mddev->raid_disks has been set and it will not be cleared when
      raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to
      remove a disk by mdadm, which will cause NULL pointer dereference
      in remove_and_add_spares() finally.
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c42a0e26
  16. 02 5月, 2018 1 次提交
    • N
      md: fix two problems with setting the "re-add" device state. · 011abdc9
      NeilBrown 提交于
      If "re-add" is written to the "state" file for a device
      which is faulty, this has an effect similar to removing
      and re-adding the device.  It should take up the
      same slot in the array that it previously had, and
      an accelerated (e.g. bitmap-based) rebuild should happen.
      
      The slot that "it previously had" is determined by
      rdev->saved_raid_disk.
      However this is not set when a device fails (only when a device
      is added), and it is cleared when resync completes.
      This means that "re-add" will normally work once, but may not work a
      second time.
      
      This patch includes two fixes.
      1/ when a device fails, record the ->raid_disk value in
          ->saved_raid_disk before clearing ->raid_disk
      2/ when "re-add" is written to a device for which
          ->saved_raid_disk is not set, fail.
      
      I think this is suitable for stable as it can
      cause re-adding a device to be forced to do a full
      resync which takes a lot longer and so puts data at
      more risk.
      
      Cc: <stable@vger.kernel.org> (v4.1)
      Fixes: 97f6cd39 ("md-cluster: re-add capabilities")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Reviewed-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      011abdc9
  17. 09 4月, 2018 1 次提交
  18. 09 3月, 2018 1 次提交
  19. 01 3月, 2018 1 次提交
  20. 26 2月, 2018 1 次提交
    • B
      md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang 提交于
      There is a potential deadlock if mount/umount happens when
      raid5_finish_reshape() tries to grow the size of emulated disk.
      
      How the deadlock happens?
      1) The raid5 resync thread finished reshape (expanding array).
      2) The mount or umount thread holds VFS sb->s_umount lock and tries to
         write through critical data into raid5 emulated block device. So it
         waits for raid5 kernel thread handling stripes in order to finish it
         I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
      The raid5 kernel thread is an emulated block device. It is responible to
      handle I/Os (stripes) from upper layers. The emulated block device
      should not request any I/Os on itself. That is, it should not call VFS
      layer functions. (If it did, it will try to acquire VFS locks to
      guarantee the I/Os sequence.) So we have the resync thread to send
      resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
      reproducible and can be fixed by this patch. For raid10.c, we can remove
      the similar code to prevent deadlock as well since they has been called
      before.
      Reported-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NChung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      8876391e
  21. 20 2月, 2018 1 次提交
    • N
      md: only allow remove_and_add_spares when no sync_thread running. · 39772f0a
      NeilBrown 提交于
      The locking protocols in md assume that a device will
      never be removed from an array during resync/recovery/reshape.
      When that isn't happening, rcu or reconfig_mutex is needed
      to protect an rdev pointer while taking a refcount.  When
      it is happening, that protection isn't needed.
      
      Unfortunately there are cases were remove_and_add_spares() is
      called when recovery might be happening: is state_store(),
      slot_store() and hot_remove_disk().
      In each case, this is just an optimization, to try to expedite
      removal from the personality so the device can be removed from
      the array.  If resync etc is happening, we just have to wait
      for md_check_recover to find a suitable time to call
      remove_and_add_spares().
      
      This optimization and not essential so it doesn't
      matter if it fails.
      So change remove_and_add_spares() to abort early if
      resync/recovery/reshape is happening, unless it is called
      from md_check_recovery() as part of a newly started recovery.
      The parameter "this" is only NULL when called from
      md_check_recovery() so when it is NULL, there is no need to abort.
      
      As this can result in a NULL dereference, the fix is suitable
      for -stable.
      
      cc: yuyufen <yuyufen@huawei.com>
      Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Fixes: 8430e7e0 ("md: disconnect device from personality before trying to remove it.")
      Cc: stable@ver.kernel.org (v4.8+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      39772f0a
  22. 19 2月, 2018 1 次提交
  23. 18 2月, 2018 1 次提交
  24. 12 2月, 2018 1 次提交
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds 提交于
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  25. 16 1月, 2018 1 次提交
    • T
      raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Tomasz Majchrzak 提交于
      In order to provide data consistency with PPL for disks with write-back
      cache enabled all data has to be flushed to disks before next PPL
      entry. The disks to be flushed are marked in the bitmap. It's modified
      under a mutex and it's only read after PPL io unit is submitted.
      
      A limitation of 64 disks in the array has been introduced to keep data
      structures and implementation simple. RAID5 arrays with so many disks are
      not likely due to high risk of multiple disks failure. Such restriction
      should not be a real life limitation.
      
      With write-back cache disabled next PPL entry is submitted when data write
      for current one completes. Data flush defers next log submission so trigger
      it when there are no stripes for handling found.
      
      As PPL assures all data is flushed to disk at request completion, just
      acknowledge flush request when PPL is enabled.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  26. 12 12月, 2017 1 次提交
    • S
      md: introduce new personality funciton start() · d5d885fd
      Song Liu 提交于
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
      design doesn't break badly right now. But it could lead to bad bug in
      the future.
      
      This patch tries to resolve this problem by splitting start up work
      into two personality functions, run() and start(). Tasks that do not
      require the md threads should go into run(), while task that require
      the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d5d885fd
  27. 02 12月, 2017 1 次提交
    • N
      md: limit mdstat resync progress to max_sectors · d2e2ec82
      Nate Dailey 提交于
      There is a small window near the end of md_do_sync where mddev->curr_resync
      can be equal to MaxSector.
      
      If status_resync is called during this window, the resulting /proc/mdstat
      output contains a HUGE number of = signs due to the very large curr_resync:
      
      Personalities : [raid1]
      md123 : active raid1 sdd3[2] sdb3[0]
        204736 blocks super 1.0 [2/1] [U_]
        [=====================================================================
         ... (82 MB more) ...
         ================>]  recovery =429496729.3% (9223372036854775807/204736)
         finish=0.2min speed=12796K/sec
        bitmap: 0/1 pages [0KB], 65536KB chunk
      
      Modify status_resync to ensure the resync variable doesn't exceed
      the array's max_sectors.
      Signed-off-by: NNate Dailey <nate.dailey@stratus.com>
      Acked-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d2e2ec82
  28. 29 11月, 2017 1 次提交
  29. 15 11月, 2017 1 次提交
  30. 11 11月, 2017 1 次提交
  31. 09 11月, 2017 1 次提交
    • N
      md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      NeilBrown 提交于
      The ->recovery_offset shows how much of a non-InSync device is actually
      in sync - how much has been recoveryed.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
      ->curre_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
      So change the common code in md.c to only use ->curr_resync_complete
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      db0505d3