1. 26 May 2019, 1 commit
    • Revert "MD: fix lock contention for flush bios" · 7f6b9285
      Committed by NeilBrown
      commit 4bc034d35377196c854236133b07730a777c4aba upstream.
      
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
       1/ it makes multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
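       The batching idea in point 2/ — one low-level flush satisfying every
       request that arrived before it started — can be modelled in a few lines
       of userspace C. All names here (flush_state, flush_request, and so on)
       are invented for illustration; this is a sketch of the idea, not the md
       implementation:

       ```c
       #include <assert.h>

       /* Ticket-based flush coalescing: a caller is satisfied by any
        * low-level flush that *starts* after its request was made. */
       struct flush_state {
           unsigned long issued;    /* flushes started */
           unsigned long completed; /* flushes finished */
       };

       /* Record a flush request; returns a ticket to wait on. */
       static unsigned long flush_request(struct flush_state *s)
       {
           return s->issued;
       }

       /* Run one low-level flush; it covers all earlier tickets. */
       static void flush_run(struct flush_state *s)
       {
           s->issued++;
           s->completed++;
       }

       static int flush_done(const struct flush_state *s, unsigned long ticket)
       {
           return s->completed > ticket;
       }

       int main(void)
       {
           struct flush_state s = {0, 0};
           unsigned long t1 = flush_request(&s);
           unsigned long t2 = flush_request(&s); /* concurrent request */
           flush_run(&s);                        /* one device flush */
           assert(flush_done(&s, t1));           /* both are satisfied */
           assert(flush_done(&s, t2));
           return 0;
       }
       ```

       With this scheme many concurrent upper-level flushes cost one
       lower-level request, which is the efficiency argument made above.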
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
       Tested-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Song Liu <songliubraving@fb.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6b9285
  2. 14 November 2018, 4 commits
  3. 02 August 2018, 1 commit
  4. 25 July 2018, 1 commit
  5. 18 July 2018, 2 commits
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Committed by Michael Callahan
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
       should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  It's now
      indexed by op_is_write().
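       As a rough userspace sketch of what an op_stat_group()-style helper
       does — mapping an operation code to the read or write statistics
       bucket — consider the following. The enum values and names are
       simplified stand-ins, not the kernel's definitions:

       ```c
       #include <assert.h>

       /* Simplified operation codes and stat groups (invented values). */
       enum { OP_READ = 0, OP_WRITE = 1, OP_FLUSH = 2, OP_DISCARD = 3 };
       enum { STAT_READ = 0, STAT_WRITE = 1 };

       /* In this model, anything that is not a read modifies media. */
       static int op_is_write(int op) { return op != OP_READ; }

       /* Map an op to the stat group it should be accounted in. */
       static int op_stat_group(int op)
       {
           return op_is_write(op) ? STAT_WRITE : STAT_READ;
       }

       int main(void)
       {
           assert(op_stat_group(OP_READ) == STAT_READ);
           assert(op_stat_group(OP_WRITE) == STAT_WRITE);
           assert(op_stat_group(OP_FLUSH) == STAT_WRITE);
           return 0;
       }
       ```

       The point of the real helper is the same: callers index stat fields by
       the op itself rather than by rq_data_dir() or bio_data_dir().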
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
       Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ddcf35d3
    • block: Add part_stat_read_accum to read across field entries. · 59767fbd
      Committed by Michael Callahan
      Add a part_stat_read_accum macro to genhd.h to read and sum across
       field entries.  For example, to sum up the number of read and write
       sectors completed.  In addition to being a reasonable cleanup by
       itself, this will make it easier to add new stat fields in the future.
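       A toy model of the macro's idea — summing one statistics field across
       the stat groups — might look like this. The struct layout is invented
       for illustration, not the kernel's per-cpu disk_stats:

       ```c
       #include <assert.h>

       #define NR_STAT_GROUPS 2 /* [0] = read, [1] = write (simplified) */

       struct disk_stats {
           unsigned long sectors[NR_STAT_GROUPS];
           unsigned long ios[NR_STAT_GROUPS];
       };

       /* Sum one field across all stat groups, as the real
        * part_stat_read_accum macro does for a partition. */
       #define part_stat_read_accum(st, field) \
           ((st)->field[0] + (st)->field[1])

       int main(void)
       {
           struct disk_stats st = {
               .sectors = { 100, 40 },
               .ios     = { 9, 3 },
           };
           assert(part_stat_read_accum(&st, sectors) == 140);
           assert(part_stat_read_accum(&st, ios) == 12);
           return 0;
       }
       ```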
      
      tj: Refreshed on top of v4.17.
       Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      59767fbd
  6. 06 July 2018, 1 commit
    • md-cluster: show array's status more accurate · 0357ba27
      Committed by Guoqing Jiang
      When resync or recovery is happening in one node,
      other nodes don't show the appropriate info now.
      
       For example, when an array is created on the master node
       without "--assume-clean" and then assembled on the slave
       nodes, reading /proc/mdstat on a slave shows
       "resync=PENDING". This info is confusing, since the
       "PENDING" status was introduced for arrays started
       in read-only mode.
      
       We introduce a RESYNCING_REMOTE flag to indicate that the
       resync thread is running on a remote node. The flag is
       set when a node receives a RESYNCING msg, and the REMOTE
       flag is cleared in the following cases:
      
       1. resync or recovery is finished on the master node,
          which means the slaves receive a msg with both lo
          and hi set to 0.
      2. node continues resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
       Then we show accurate information in status_resync
       by checking the REMOTE flag along with other conditions.
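       The flag handling described above can be sketched as a small userspace
       model. The names, the flag bit, and the (lo, hi) message format are
       invented stand-ins for illustration, not the md-cluster code:

       ```c
       #include <assert.h>
       #include <string.h>

       enum { RESYNCING_REMOTE = 1 << 0 };

       /* Model of receiving a RESYNCING msg: lo == hi == 0 means the
        * master has finished, so the REMOTE flag is cleared. */
       static void recv_resyncing_msg(unsigned *flags, unsigned long lo,
                                      unsigned long hi)
       {
           if (lo == 0 && hi == 0)
               *flags &= ~RESYNCING_REMOTE;
           else
               *flags |= RESYNCING_REMOTE;
       }

       /* Status reporting can now distinguish a remote resync from a
        * local read-only PENDING state. */
       static const char *resync_status(unsigned flags, int local_pending)
       {
           if (flags & RESYNCING_REMOTE)
               return "resync=REMOTE";
           return local_pending ? "resync=PENDING" : "idle";
       }

       int main(void)
       {
           unsigned flags = 0;
           recv_resyncing_msg(&flags, 0, 1024); /* resync running remotely */
           assert(strcmp(resync_status(flags, 1), "resync=REMOTE") == 0);
           recv_resyncing_msg(&flags, 0, 0);    /* master finished */
           assert(strcmp(resync_status(flags, 1), "resync=PENDING") == 0);
           return 0;
       }
       ```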
       Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
       Reviewed-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      0357ba27
  7. 19 June 2018, 1 commit
  8. 08 June 2018, 1 commit
    • md: Unify mddev destruction paths · 28dec870
      Committed by Kent Overstreet
       Previously, mddev_put() had a couple of different paths for freeing an
       mddev, due to the fact that the kobject wasn't initialized when the
      mddev was first allocated. If we move the kobject_init() to when it's
      first allocated and just use kobject_add() later, we can clean all this
      up.
      
      This also removes a hack in mddev_put() to avoid freeing biosets under a
      spinlock, which involved copying biosets on the stack after the reset
      bioset_init() changes.
       Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28dec870
  9. 31 May 2018, 1 commit
  10. 22 May 2018, 1 commit
    • MD: fix lock contention for flush bios · 5a409b4f
      Committed by Xiao Ni
       There is lock contention when many processes send flush bios
       to an md device, e.g. when creating many lvs on one raid device and
       running mkfs.xfs on each lv.
       
       Currently flush requests are handled sequentially: each one has to wait for
       mddev->flush_bio to become NULL before it can get mddev->lock.
       
       This patch removes mddev->flush_bio and handles flush bios asynchronously.
       I did a test with the command dbench -s 128 -t 300. This is the test result:
      
      =================Without the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                    11165   167.595  5879.560
       Close                   107469     1.391  2231.094
       LockX                      384     0.003     0.019
       Rename                    5944     2.141  1856.001
       ReadX                   208121     0.003     0.074
       WriteX                   98259  1925.402 15204.895
       Unlink                   25198    13.264  3457.268
       UnlockX                    384     0.001     0.009
       FIND_FIRST               47111     0.012     0.076
       SET_FILE_INFORMATION     12966     0.007     0.065
       QUERY_FILE_INFORMATION   27921     0.004     0.085
       QUERY_PATH_INFORMATION  124650     0.005     5.766
       QUERY_FS_INFORMATION     22519     0.003     0.053
       NTCreateX               141086     4.291  2502.812
      
      Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms
      
      =================With the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                     4500   174.134   406.398
       Close                    48195     0.060   467.062
       LockX                      256     0.003     0.029
       Rename                    2324     0.026     0.360
       ReadX                    78846     0.004     0.504
       WriteX                   66832   562.775  1467.037
       Unlink                    5516     3.665  1141.740
       UnlockX                    256     0.002     0.019
       FIND_FIRST               16428     0.015     0.313
       SET_FILE_INFORMATION      6400     0.009     0.520
       QUERY_FILE_INFORMATION   17865     0.003     0.089
       QUERY_PATH_INFORMATION   47060     0.078   416.299
       QUERY_FS_INFORMATION      7024     0.004     0.032
       NTCreateX                55921     0.854  1141.452
      
      Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms
      
       The test was done on a raid1 array of two rotational disks.
      
       V5: V4 is more complicated than the version with the memory pool, so revert to the
       memory pool version.
      
      V4: use address of fbio to do hash to choose free flush info.
      V3:
       Shaohua suggests mempool is overkill. In v3 it allocates memory when creating the raid
       device and uses a simple bitmap to record which resources are free.
      
      Fix a bug from v2. It should set flush_pending to 1 at first.
      
      V2:
       Neil pointed out two problems. One is a counting error and the other is the return
       value when allocating memory fails.
      1. counting error problem
      This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
      that you previously called
                                atomic_inc(&rdev->nr_pending);
      If an rdev was added to the list between the start and end of the flush,
      this will do something bad.
      
      Now it doesn't use bio_chain. It uses specified call back function for each
      flush bio.
       2. Returning an IO error when kmalloc fails is wrong.
       I use the mempool suggested by Neil in V2.
       3. Fixed some places pointed out by Guoqing.
       Suggested-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      5a409b4f
  11. 18 May 2018, 1 commit
    • md: fix NULL dereference of mddev->pers in remove_and_add_spares() · c42a0e26
      Committed by Yufen Yu
      We met NULL pointer BUG as follow:
      
      [  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
      [  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
      [  151.762039] Oops: 0000 [#1] SMP PTI
      [  151.762406] Modules linked in:
      [  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
      [  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
      [  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
      [  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
      [  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
      [  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
      [  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
      [  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
      [  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
      [  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
      [  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
      [  151.771272] Call Trace:
      [  151.771542]  md_ioctl+0x1df2/0x1e10
      [  151.771906]  ? __switch_to+0x129/0x440
      [  151.772295]  ? __schedule+0x244/0x850
      [  151.772672]  blkdev_ioctl+0x4bd/0x970
      [  151.773048]  block_ioctl+0x39/0x40
      [  151.773402]  do_vfs_ioctl+0xa4/0x610
      [  151.773770]  ? dput.part.23+0x87/0x100
      [  151.774151]  ksys_ioctl+0x70/0x80
      [  151.774493]  __x64_sys_ioctl+0x16/0x20
      [  151.774877]  do_syscall_64+0x5b/0x180
      [  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       For raid6, when two disks of the array are offline, two spare disks can
       be added into the array. If the system reboots before the spare
       recovery completes, mdadm thinks it is ok to restart the degraded
       array via md_ioctl(). Since the disks in raid6 are not only_parity(),
       raid5_run() will abort when there is no PPL feature and the
       'start_dirty_degraded' parameter is not set. Therefore, mddev->pers is NULL.
       
       But mddev->raid_disks has been set and is not cleared when
       raid5_run() aborts. md_ioctl() can then execute the 'HOT_REMOVE_DISK' cmd
       from mdadm to remove a disk, which finally causes the NULL pointer
       dereference in remove_and_add_spares().
       Signed-off-by: Yufen Yu <yuyufen@huawei.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      c42a0e26
  12. 02 May 2018, 1 commit
    • md: fix two problems with setting the "re-add" device state. · 011abdc9
      Committed by NeilBrown
      If "re-add" is written to the "state" file for a device
      which is faulty, this has an effect similar to removing
      and re-adding the device.  It should take up the
      same slot in the array that it previously had, and
      an accelerated (e.g. bitmap-based) rebuild should happen.
      
      The slot that "it previously had" is determined by
      rdev->saved_raid_disk.
      However this is not set when a device fails (only when a device
      is added), and it is cleared when resync completes.
      This means that "re-add" will normally work once, but may not work a
      second time.
      
      This patch includes two fixes.
      1/ when a device fails, record the ->raid_disk value in
          ->saved_raid_disk before clearing ->raid_disk
      2/ when "re-add" is written to a device for which
          ->saved_raid_disk is not set, fail.
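       The two fixes can be sketched with invented types in a few lines of
       userspace C. This is a model of the logic only, not the md
       implementation (struct rdev here is a stand-in, and -1 stands for
       "no slot known"):

       ```c
       #include <assert.h>

       struct rdev {
           int raid_disk;       /* slot in the array, -1 if none */
           int saved_raid_disk; /* slot to reclaim on re-add, -1 if unset */
       };

       /* Fix 1: on failure, record the slot before clearing it. */
       static void rdev_fail(struct rdev *r)
       {
           r->saved_raid_disk = r->raid_disk;
           r->raid_disk = -1;
       }

       /* Fix 2: refuse "re-add" when no saved slot was recorded,
        * rather than silently doing a full rebuild in a new slot. */
       static int rdev_readd(struct rdev *r)
       {
           if (r->saved_raid_disk < 0)
               return -1;
           r->raid_disk = r->saved_raid_disk;
           return 0;
       }

       int main(void)
       {
           struct rdev r = { .raid_disk = 2, .saved_raid_disk = -1 };
           rdev_fail(&r);
           assert(rdev_readd(&r) == 0 && r.raid_disk == 2);

           struct rdev unset = { .raid_disk = -1, .saved_raid_disk = -1 };
           assert(rdev_readd(&unset) == -1); /* re-add must fail */
           return 0;
       }
       ```

       Because the slot is now always recorded on failure, "re-add" keeps
       working on the second and later failures, not just the first.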
      
      I think this is suitable for stable as it can
      cause re-adding a device to be forced to do a full
      resync which takes a lot longer and so puts data at
      more risk.
      
      Cc: <stable@vger.kernel.org> (v4.1)
      Fixes: 97f6cd39 ("md-cluster: re-add capabilities")
       Signed-off-by: NeilBrown <neilb@suse.com>
       Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      011abdc9
  13. 09 April 2018, 1 commit
  14. 09 March 2018, 1 commit
  15. 01 March 2018, 1 commit
  16. 26 February 2018, 1 commit
    • md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      Committed by BingJing Chang
       There is a potential deadlock if mount/umount happens when
       raid5_finish_reshape() tries to grow the size of the emulated disk.
       
       How does the deadlock happen?
      1) The raid5 resync thread finished reshape (expanding array).
       2) The mount or umount thread holds the VFS sb->s_umount lock and tries
          to write critical data through to the raid5 emulated block device,
          so it waits for the raid5 kernel thread to handle stripes in order
          to finish its I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
       The raid5 kernel thread services an emulated block device. It is
       responsible for handling I/Os (stripes) from upper layers. The emulated
       block device should not request any I/Os on itself. That is, it should
       not call VFS layer functions. (If it did, it would try to acquire VFS
       locks to guarantee the I/O sequence.) So we have the resync thread to
       send resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
       reproducible and can be fixed by this patch. For raid10.c, we can remove
       the similar code to prevent the deadlock as well, since it has been
       called before.
       Reported-by: Alex Wu <alexwu@synology.com>
       Reviewed-by: Alex Wu <alexwu@synology.com>
       Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
       Signed-off-by: BingJing Chang <bingjingc@synology.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      8876391e
  17. 20 February 2018, 1 commit
    • md: only allow remove_and_add_spares when no sync_thread running. · 39772f0a
      Committed by NeilBrown
      The locking protocols in md assume that a device will
      never be removed from an array during resync/recovery/reshape.
      When that isn't happening, rcu or reconfig_mutex is needed
      to protect an rdev pointer while taking a refcount.  When
      it is happening, that protection isn't needed.
      
       Unfortunately there are cases where remove_and_add_spares() is
       called when recovery might be happening: in state_store(),
       slot_store() and hot_remove_disk().
      In each case, this is just an optimization, to try to expedite
      removal from the personality so the device can be removed from
      the array.  If resync etc is happening, we just have to wait
       for md_check_recovery() to find a suitable time to call
      remove_and_add_spares().
      
       This optimization is not essential, so it doesn't
       matter if it fails.
      So change remove_and_add_spares() to abort early if
      resync/recovery/reshape is happening, unless it is called
      from md_check_recovery() as part of a newly started recovery.
      The parameter "this" is only NULL when called from
      md_check_recovery() so when it is NULL, there is no need to abort.
      
      As this can result in a NULL dereference, the fix is suitable
      for -stable.
      
      cc: yuyufen <yuyufen@huawei.com>
      Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Fixes: 8430e7e0 ("md: disconnect device from personality before trying to remove it.")
       Cc: stable@vger.kernel.org (v4.8+)
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      39772f0a
  18. 19 February 2018, 1 commit
  19. 18 February 2018, 1 commit
  20. 12 February 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
       values as the POLL* constants do.  But the key word here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
       Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  21. 16 January 2018, 1 commit
    • raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Committed by Tomasz Majchrzak
       In order to provide data consistency with PPL for disks with write-back
       cache enabled, all data has to be flushed to disks before the next PPL
       entry. The disks to be flushed are marked in a bitmap. It's modified
       under a mutex and it's only read after the PPL io unit is submitted.
      
       A limitation of 64 disks in the array has been introduced to keep the
       data structures and implementation simple. RAID5 arrays with so many
       disks are not likely due to the high risk of multiple disk failures.
       Such a restriction should not be a real-life limitation.
      
       With write-back cache disabled, the next PPL entry is submitted when the
       data write for the current one completes. A data flush defers the next
       log submission, so trigger it when no stripes to handle are found.
      
       As PPL assures all data is flushed to disk at request completion, just
       acknowledge flush requests when PPL is enabled.
       Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  22. 12 December 2017, 1 commit
    • md: introduce new personality function start() · d5d885fd
      Committed by Song Liu
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
       design doesn't break badly right now. But it could lead to bad bugs in
      the future.
      
       This patch tries to resolve this problem by splitting the startup work
       into two personality functions, run() and start(). Tasks that do not
       require the md threads go into run(), while tasks that require
       the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
       Signed-off-by: Song Liu <songliubraving@fb.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      d5d885fd
  23. 02 December 2017, 1 commit
    • md: limit mdstat resync progress to max_sectors · d2e2ec82
      Committed by Nate Dailey
      There is a small window near the end of md_do_sync where mddev->curr_resync
      can be equal to MaxSector.
      
      If status_resync is called during this window, the resulting /proc/mdstat
      output contains a HUGE number of = signs due to the very large curr_resync:
      
      Personalities : [raid1]
      md123 : active raid1 sdd3[2] sdb3[0]
        204736 blocks super 1.0 [2/1] [U_]
        [=====================================================================
         ... (82 MB more) ...
         ================>]  recovery =429496729.3% (9223372036854775807/204736)
         finish=0.2min speed=12796K/sec
        bitmap: 0/1 pages [0KB], 65536KB chunk
      
      Modify status_resync to ensure the resync variable doesn't exceed
      the array's max_sectors.
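       The fix amounts to clamping the resync position before computing the
       progress bar. Here is a simplified userspace sketch of that idea; the
       types and the function name are approximations for illustration, not
       the actual status_resync() code:

       ```c
       #include <assert.h>

       typedef unsigned long long sector_t;
       #define MaxSector ((sector_t)~0ULL)

       /* Clamp curr_resync so a transient MaxSector value (the window
        * near the end of md_do_sync) cannot yield an absurd percentage
        * like 429496729.3% in /proc/mdstat. */
       static sector_t resync_progress(sector_t curr_resync,
                                       sector_t max_sectors)
       {
           sector_t resync = curr_resync;

           if (resync > max_sectors)
               resync = max_sectors;
           return resync;
       }

       int main(void)
       {
           /* normal mid-resync position is untouched */
           assert(resync_progress(1000, 204736) == 1000);
           /* the transient MaxSector value is clamped */
           assert(resync_progress(MaxSector, 204736) == 204736);
           return 0;
       }
       ```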
       Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
       Acked-by: Guoqing Jiang <gqjiang@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      d2e2ec82
  24. 29 November 2017, 1 commit
  25. 15 November 2017, 1 commit
  26. 11 November 2017, 1 commit
  27. 09 November 2017, 1 commit
    • md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      Committed by NeilBrown
       The ->recovery_offset shows how much of a non-InSync device is actually
       in sync - how much has been recovered.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
       ->curr_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
       So change the common code in md.c to only use ->curr_resync_completed
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      db0505d3
  28. 02 November 2017, 9 commits
    • md: don't check MD_SB_CHANGE_CLEAN in md_allow_write · b90f6ff0
      Committed by Artur Paszkiewicz
      Only MD_SB_CHANGE_PENDING should be used to wait for transition from
       clean to dirty. Also checking MD_SB_CHANGE_CLEAN is unnecessary and can
      race with e.g. md_do_sync(). This sporadically causes a hang when
      changing consistency policy during resync:
      
      INFO: task mdadm:6183 blocked for more than 30 seconds.
            Not tainted 4.14.0-rc3+ #391
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdadm           D12752  6183   6022 0x00000000
      Call Trace:
       __schedule+0x93f/0x990
       schedule+0x6b/0x90
       md_allow_write+0x100/0x130 [md_mod]
       ? do_wait_intr_irq+0x90/0x90
       resize_stripes+0x3a/0x5b0 [raid456]
       ? kernfs_fop_write+0xbe/0x180
       raid5_change_consistency_policy+0xa6/0x200 [raid456]
       consistency_policy_store+0x2e/0x70 [md_mod]
       md_attr_store+0x90/0xc0 [md_mod]
       sysfs_kf_write+0x42/0x50
       kernfs_fop_write+0x119/0x180
       __vfs_write+0x28/0x110
       ? rcu_sync_lockdep_assert+0x12/0x60
       ? __sb_start_write+0x15a/0x1c0
       ? vfs_write+0xa3/0x1a0
       vfs_write+0xb4/0x1a0
       SyS_write+0x49/0xa0
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      Fixes: 2214c260 ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b90f6ff0
    • md: use lockdep_assert_held · efa4b77b
      Committed by Shaohua Li
       lockdep_assert_held is a better way to assert that a lock is held, and
       it also works on UP.
       Signed-off-by: Shaohua Li <shli@fb.com>
      efa4b77b
    • md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      Committed by NeilBrown
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
       There are still a couple of places where we call 'quiesce'
       with an argument of '2', but they can safely be changed to
       call ->quiesce(.., 1); ->quiesce(.., 0), which
       achieves the same result at the small cost of pausing IO
       briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
       Suggested-by: Shaohua Li <shli@kernel.org>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b03e0ccb
    • md: allow metadata update while suspending. · 35bfc521
      Committed by NeilBrown
      There are various deadlocks that can occur
      when a thread holds reconfig_mutex and calls
      ->quiesce(mddev, 1).
       As some write requests block waiting for
      metadata to be updated (e.g. to record device
      failure), and as the md thread updates the metadata
      while the reconfig mutex is held, holding the mutex
      can stop write requests completing, and this prevents
      ->quiesce(mddev, 1) from completing.
      
      ->quiesce() is now usually called from mddev_suspend(),
      and it is always called with reconfig_mutex held.  So
      at this time it is safe for the thread to update metadata
      without explicitly taking the lock.
      
       So add 2 new flags, one which says that unlocked updates are
       allowed, and one which says an update is happening.  Then allow
       updates while the quiesce completes, and then wait for them to finish.
       Reported-and-tested-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      35bfc521
    • md: use mddev_suspend/resume instead of ->quiesce() · 9e1cc0a5
      Committed by NeilBrown
      mddev_suspend() is a more general interface than
       calling ->quiesce() and so is more extensible.  A
      future patch will make use of this.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      9e1cc0a5
    • md: move suspend_hi/lo handling into core md code · b3143b9a
      Committed by NeilBrown
       Responding to ->suspend_lo and ->suspend_hi is similar
       to responding to ->suspended.  It is best to wait in
       the common core code without incrementing ->active_io.
       This allows mddev_suspend()/mddev_resume() to work while
       requests are waiting for suspend_lo/hi to change.
       This will be important after a subsequent patch
       which uses mddev_suspend() to synchronize updating of
       suspend_lo/hi.
      
      So move the code for testing suspend_lo/hi out of raid1.c
      and raid5.c, and place it in md.c
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b3143b9a
    • md: don't call bitmap_create() while array is quiesced. · 52a0d49d
      Committed by NeilBrown
      bitmap_create() allocates memory with GFP_KERNEL and
      so can wait for IO.
      If called while the array is quiesced, it could wait indefinitely
      for write out to the array - deadlock.
      So call bitmap_create() before quiescing the array.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      52a0d49d
    • md: always hold reconfig_mutex when calling mddev_suspend() · 4d5324f7
      Committed by NeilBrown
      Most often mddev_suspend() is called with
      reconfig_mutex held.  Make this a requirement in
       preparation for a subsequent patch.  Also require
      reconfig_mutex to be held for mddev_resume(),
      partly for symmetry and partly to guarantee
      no races with incr/decr of mddev->suspend.
      
      Taking the mutex in r5c_disable_writeback_async() is
      a little tricky as this is called from a work queue
      via log->disable_writeback_work, and flush_work()
      is called on that while holding ->reconfig_mutex.
      If the work item hasn't run before flush_work()
      is called, the work function will not be able to
      get the mutex.
      
      So we use mddev_trylock() inside the wait_event() call, and have that
      abort when conf->log is set to NULL, which happens before
      flush_work() is called.
      We wait in mddev->sb_wait and ensure this is woken
      when any of the conditions change.  This requires
      waking mddev->sb_wait in mddev_unlock().  This is only
       likely to trigger extra wake_ups of threads that needn't
      be woken when metadata is being written, and that
      doesn't happen often enough that the cost would be
      noticeable.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      4d5324f7
    • md: forbid a RAID5 from having both a bitmap and a journal. · 230b55fa
      Committed by NeilBrown
      Having both a bitmap and a journal is pointless.
      Attempting to do so can corrupt the bitmap if the journal
      replay happens before the bitmap is initialized.
      Rather than try to avoid this corruption, simply
      refuse to allow arrays with both a bitmap and a journal.
      So:
       - if raid5_run sees both are present, fail.
       - if adding a bitmap finds a journal is present, fail
       - if adding a journal finds a bitmap is present, fail.
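       The three refusal points above reduce to one invariant, sketched here
       as an invented predicate (md_config_valid is a name made up for this
       illustration, not a kernel function):

       ```c
       #include <assert.h>

       /* Invariant: an md array may have a bitmap or a journal, never
        * both, enforced at raid5_run and at bitmap/journal add time. */
       static int md_config_valid(int has_bitmap, int has_journal)
       {
           return !(has_bitmap && has_journal);
       }

       int main(void)
       {
           assert(md_config_valid(1, 0));  /* bitmap only: allowed */
           assert(md_config_valid(0, 1));  /* journal only: allowed */
           assert(!md_config_valid(1, 1)); /* both: refused */
           return 0;
       }
       ```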
      
      Cc: stable@vger.kernel.org (4.10+)
       Signed-off-by: NeilBrown <neilb@suse.com>
       Tested-by: Joshua Kinard <kumba@gentoo.org>
       Acked-by: Joshua Kinard <kumba@gentoo.org>
       Signed-off-by: Shaohua Li <shli@fb.com>
      230b55fa