1. 26 2月, 2018 1 次提交
    • B
      md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang 提交于
      There is a potential deadlock if mount/umount happens when
      raid5_finish_reshape() tries to grow the size of emulated disk.
      
      How the deadlock happens?
      1) The raid5 resync thread finished reshape (expanding array).
      2) The mount or umount thread holds VFS sb->s_umount lock and tries to
         write through critical data into raid5 emulated block device. So it
         waits for raid5 kernel thread handling stripes in order to finish it
         I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
      The raid5 kernel thread is an emulated block device. It is responible to
      handle I/Os (stripes) from upper layers. The emulated block device
      should not request any I/Os on itself. That is, it should not call VFS
      layer functions. (If it did, it will try to acquire VFS locks to
      guarantee the I/Os sequence.) So we have the resync thread to send
      resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
      reproducible and can be fixed by this patch. For raid10.c, we can remove
      the similar code to prevent deadlock as well since they has been called
      before.
      Reported-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NChung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      8876391e
  2. 22 2月, 2018 1 次提交
    • A
      md: raid5: avoid string overflow warning · 53b8d89d
      Arnd Bergmann 提交于
      gcc warns about a possible overflow of the kmem_cache string, when adding
      four characters to a string of the same length:
      
      drivers/md/raid5.c: In function 'setup_conf':
      drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
                                        ^~~~
      drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      If I'm counting correctly, we need 11 characters for the fixed part
      of the string and 18 characters for a 64-bit pointer (when no gendisk
      is used), so that leaves three characters for conf->level, which should
      always be sufficient.
      
      This makes the code use snprintf() with the correct length, to
      make the code more robust against changes, and to get the compiler
      to shut up.
      
      In commit f4be6b43 ("md/raid5: ensure we create a unique name for
      kmem_cache when mddev has no gendisk") from 2010, Neil said that
      the pointer could be removed "shortly" once devices without gendisk
      are disallowed. I have no idea if that happened, but if it did, that
      should probably be changed as well.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      53b8d89d
  3. 18 2月, 2018 1 次提交
  4. 16 1月, 2018 1 次提交
    • T
      raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Tomasz Majchrzak 提交于
      In order to provide data consistency with PPL for disks with write-back
      cache enabled all data has to be flushed to disks before next PPL
      entry. The disks to be flushed are marked in the bitmap. It's modified
      under a mutex and it's only read after PPL io unit is submitted.
      
      A limitation of 64 disks in the array has been introduced to keep data
      structures and implementation simple. RAID5 arrays with so many disks are
      not likely due to high risk of multiple disks failure. Such restriction
      should not be a real life limitation.
      
      With write-back cache disabled next PPL entry is submitted when data write
      for current one completes. Data flush defers next log submission so trigger
      it when there are no stripes for handling found.
      
      As PPL assures all data is flushed to disk at request completion, just
      acknowledge flush request when PPL is enabled.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  5. 12 12月, 2017 1 次提交
    • S
      md: introduce new personality funciton start() · d5d885fd
      Song Liu 提交于
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
      design doesn't break badly right now. But it could lead to bad bug in
      the future.
      
      This patch tries to resolve this problem by splitting start up work
      into two personality functions, run() and start(). Tasks that do not
      require the md threads should go into run(), while task that require
      the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d5d885fd
  6. 02 12月, 2017 1 次提交
  7. 09 11月, 2017 1 次提交
    • N
      md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      NeilBrown 提交于
      The ->recovery_offset shows how much of a non-InSync device is actually
      in sync - how much has been recoveryed.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
      ->curre_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
      So change the common code in md.c to only use ->curr_resync_complete
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      db0505d3
  8. 02 11月, 2017 4 次提交
    • M
      md: use TASK_IDLE instead of blocking signals · ae89fd3d
      Mikulas Patocka 提交于
      Hi - I submit this patch for the next merge window:
      
      Some times ago, I made a patch f9c79bc0 that blocks signals around the
      schedule() calls in MD. The MD subsystem needs to do an uninterruptible
      sleep that is not accounted in load average - so we block signals and use
      interruptible sleep.
      
      The kernel has a special TASK_IDLE state for this purpose, so we can use
      it instead of blocking signals. This patch doesn't fix any bug, it just
      makes the code simpler.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ae89fd3d
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: move suspend_hi/lo handling into core md code · b3143b9a
      NeilBrown 提交于
      responding to ->suspend_lo and ->suspend_hi is similar
      to responding to ->suspended.  It is best to wait in
      the common core code without incrementing ->active_io.
      This allows mddev_suspend()/mddev_resume() to work while
      requests are waiting for suspend_lo/hi to change.
      This is will be important after a subsequent patch
      which uses mddev_suspend() to synchronize updating for
      suspend_lo/hi.
      
      So move the code for testing suspend_lo/hi out of raid1.c
      and raid5.c, and place it in md.c
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b3143b9a
    • N
      md: forbid a RAID5 from having both a bitmap and a journal. · 230b55fa
      NeilBrown 提交于
      Having both a bitmap and a journal is pointless.
      Attempting to do so can corrupt the bitmap if the journal
      replay happens before the bitmap is initialized.
      Rather than try to avoid this corruption, simply
      refuse to allow arrays with both a bitmap and a journal.
      So:
       - if raid5_run sees both are present, fail.
       - if adding a bitmap finds a journal is present, fail
       - if adding a journal finds a bitmap is present, fail.
      
      Cc: stable@vger.kernel.org (4.10+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Tested-by: NJoshua Kinard <kumba@gentoo.org>
      Acked-by: NJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: NShaohua Li <shli@fb.com>
      230b55fa
  9. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  10. 19 10月, 2017 1 次提交
    • N
      raid5: Set R5_Expanded on parity devices as well as data. · 235b6003
      NeilBrown 提交于
      When reshaping a fully degraded raid5/raid6 to a larger
      nubmer of devices, the new device(s) are not in-sync
      and so that can make the newly grown stripe appear to be
      "failed".
      To avoid this, we set the R5_Expanded flag to say "Even though
      this device is not fully in-sync, this block is safe so
      don't treat the device as failed for this stripe".
      This flag is set for data devices, not not for parity devices.
      
      Consequently, if you have a RAID6 with two devices that are partly
      recovered and a spare, and start a reshape to include the spare,
      then when the reshape gets past the point where the recovery was
      up to, it will think the stripes are failed and will get into
      an infinite loop, failing to make progress.
      
      So when contructing parity on an EXPAND_READY stripe,
      set R5_Expanded.
      Reported-by: NCurt <lightspd@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      235b6003
  11. 17 10月, 2017 1 次提交
  12. 28 9月, 2017 1 次提交
  13. 06 9月, 2017 2 次提交
    • D
      md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list · 184a09eb
      Dennis Yang 提交于
      In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
      set, it indicates that this stripe_head is already in the raid5_plug_cb
      list and release_stripe() would be called instead to drop a reference
      count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
      stripe_head and it will get queued into the raid5_plug_cb list.
      
      Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
      A stripe could be re-added to plug list while it is still on that list
      in the following situation. If stripe_head A is added to another
      stripe_head B's batch list, in this case A will have its
      batch_head != NULL and be added into the plug list. After that,
      stripe_head B gets handled and called break_stripe_batch_list() to
      reset all the batched stripe_head(including A which is still on
      the plug list)'s state and reset their batch_head to NULL.
      Before the plug list gets processed, if there is another write request
      comes in and get stripe_head A, A will have its batch_head == NULL
      (cleared by calling break_stripe_batch_list() on B) and be added to
      plug list once again.
      Signed-off-by: NDennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: NShaohua Li <shli@fb.com>
      184a09eb
    • S
      md/raid5: fix a race condition in stripe batch · 3664847d
      Shaohua Li 提交于
      We have a race condition in below scenario, say have 3 continuous stripes, sh1,
      sh2 and sh3, sh1 is the stripe_head of sh2 and sh3:
      
      CPU1				CPU2				CPU3
      handle_stripe(sh3)
      				stripe_add_to_batch_list(sh3)
      				-> lock(sh2, sh3)
      				-> lock batch_lock(sh1)
      				-> add sh3 to batch_list of sh1
      				-> unlock batch_lock(sh1)
      								clear_batch_ready(sh1)
      								-> lock(sh1) and batch_lock(sh1)
      								-> clear STRIPE_BATCH_READY for all stripes in batch_list
      								-> unlock(sh1) and batch_lock(sh1)
      ->clear_batch_ready(sh3)
      -->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
      --->return 0 as sh->batch == NULL
      				-> sh3->batch_head = sh1
      				-> unlock (sh2, sh3)
      
      In CPU1, handle_stripe will continue handle sh3 even it's in batch stripe list
      of sh1. By moving sh3->batch_head assignment in to batch_lock, we make it
      impossible to clear STRIPE_BATCH_READY before batch_head is set.
      
      Thanks Stephane for helping debug this tricky issue.
      Reported-and-tested-by: NStephane Thiell <sthiell@stanford.edu>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: NShaohua Li <shli@fb.com>
      3664847d
  14. 28 8月, 2017 1 次提交
  15. 26 8月, 2017 1 次提交
  16. 25 8月, 2017 1 次提交
    • S
      md/raid5: release/flush io in raid5_do_work() · 9c72a18e
      Song Liu 提交于
      In raid5, there are scenarios where some ios are deferred to a later
      time, and some IO need a flush to complete. To make sure we make
      progress with these IOs, we need to call the following functions:
      
          flush_deferred_bios(conf);
          r5l_flush_stripe_to_raid(conf->log);
      
      Both of these functions are called in raid5d(), but missing in
      raid5_do_work(). As a result, these functions are not called
      when multi-threading (group_thread_cnt > 0) is enabled. This patch
      adds calls to these function to raid5_do_work().
      
      Note for stable branches:
      
        r5l_flush_stripe_to_raid(conf->log) is need for 4.4+
        flush_deferred_bios(conf) is only needed for 4.11+
      
      Cc: stable@vger.kernel.org (4.4+)
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      9c72a18e
  17. 24 8月, 2017 2 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
    • C
      raid5: remove a call to get_start_sect · 10433d04
      Christoph Hellwig 提交于
      The block layer always remaps partitions before calling into the
      ->make_request methods of drivers.  Thus the call to get_start_sect in
      in_chunk_boundary will always return 0 and can be removed.
      Reviewed-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      10433d04
  18. 24 7月, 2017 1 次提交
  19. 22 7月, 2017 1 次提交
  20. 11 7月, 2017 1 次提交
    • X
      Raid5 should update rdev->sectors after reshape · b5d27718
      Xiao Ni 提交于
      The raid5 md device is created by the disks which we don't use the total size. For example,
      the size of the device is 5G and it just uses 3G of the devices to create one raid5 device.
      Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid
      and assemble it again. It fails.
      mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
      mdadm /dev/md0 --grow --chunk=64
      wait reshape to finish
      mdadm -S /dev/md0
      mdadm -As
      The error messages:
      [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
      [197519.821686] md: md_import_device returned -22
      
      After reshape the data offset is changed. It selects backwards direction in this condition.
      In function super_1_load it compares the available space of the underlying device with
      sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL.
      rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based
      on rdev->sectors. So add md_finish_reshape in end_reshape.
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Acked-by: NGuoqing Jiang <gqjiang@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fb.com>
      b5d27718
  21. 19 6月, 2017 1 次提交
  22. 14 6月, 2017 2 次提交
    • M
      md: don't use flush_signals in userspace processes · f9c79bc0
      Mikulas Patocka 提交于
      The function flush_signals clears all pending signals for the process. It
      may be used by kernel threads when we need to prepare a kernel thread for
      responding to signals. However using this function for an userspaces
      processes is incorrect - clearing signals without the program expecting it
      can cause misbehavior.
      
      The raid1 and raid5 code uses flush_signals in its request routine because
      it wants to prepare for an interruptible wait. This patch drops
      flush_signals and uses sigprocmask instead to block all signals (including
      SIGKILL) around the schedule() call. The signals are not lost, but the
      schedule() call won't respond to them.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f9c79bc0
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  23. 09 6月, 2017 1 次提交
  24. 06 6月, 2017 1 次提交
  25. 25 5月, 2017 1 次提交
    • N
      md: report sector of stripes with check mismatches · e1539036
      Nix 提交于
      This makes it possible, with appropriate filesystem support, for a
      sysadmin to tell what is affected by the mismatch, and whether
      it should be ignored (if it's inside a swap partition, for
      instance).
      
      We ratelimit to prevent log flooding: if there are so many
      mismatches that ratelimiting is necessary, the individual messages
      are relatively unlikely to be important (either the machine is
      swapping like crazy or something is very wrong with the disk).
      Signed-off-by: NNick Alcock <nick.alcock@oracle.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e1539036
  26. 12 5月, 2017 2 次提交
    • S
      md/r5cache: handle sync with data in write back cache · 5ddf0440
      Song Liu 提交于
      Currently, sync of raid456 array cannot make progress when hitting
      data in writeback r5cache.
      
      This patch fixes this issue by flushing cached data of the stripe
      before processing the sync request. This is achived by:
      
      1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is
         in write back cache;
      2. In r5c_try_caching_write(), handle the stripe in sync with write
         through;
      3. In do_release_stripe(), make stripe in sync write out and send
         it to the state machine.
      
      Shaohua: explictly set STRIPE_HANDLE after write out completed
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5ddf0440
    • S
      md/r5cache: gracefully handle journal device errors for writeback mode · 70d466f7
      Song Liu 提交于
      For the raid456 with writeback cache, when journal device failed during
      normal operation, it is still possible to persist all data, as all
      pending data is still in stripe cache. However, it is necessary to handle
      journal failure gracefully.
      
      During journal failures, the following logic handles the graceful shutdown
      of journal:
      1. raid5_error() marks the device as Faulty and schedules async work
         log->disable_writeback_work;
      2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
         suspended, set to write through, and then resumed. mddev_suspend()
         flushes all cached stripes;
      3. All cached stripes need to be flushed carefully to the RAID array.
      
      This patch fixes issues within the process above:
      1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
         journal failures;
      2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
         since raid5_error() updates superblock.
      3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
         to make progress during log_failed;
      4. In delay_towrite(), if log failed only process data in the cache (skip
         new writes in dev->towrite);
      5. In __get_priority_stripe(), process loprio_list during journal device
         failures.
      6. In raid5_remove_disk(), wait for all cached stripes are flushed before
         calling log_exit().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      70d466f7
  27. 09 5月, 2017 1 次提交
  28. 05 5月, 2017 1 次提交
    • J
      md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock · 3d05f3ae
      Julia Cartwright 提交于
      On mainline, there is no functional difference, just less code, and
      symmetric lock/unlock paths.
      
      On PREEMPT_RT builds, this fixes the following warning, seen by
      Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
      
         BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
         in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
         CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G        W       4.9.20-rt16-stand6-686 #1
         Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
         Workqueue: writeback wb_workfn (flush-253:0)
         Call Trace:
          dump_stack+0x47/0x68
          ? migrate_enable+0x4a/0xf0
          ___might_sleep+0x101/0x180
          rt_spin_lock+0x17/0x40
          add_stripe_bio+0x4e3/0x6c0 [raid456]
          ? preempt_count_add+0x42/0xb0
          raid5_make_request+0x737/0xdd0 [raid456]
      Reported-by: NAlexander GQ Gerasiov <gq@redlab-i.ru>
      Tested-by: NAlexander GQ Gerasiov <gq@redlab-i.ru>
      Signed-off-by: NJulia Cartwright <julia@ni.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3d05f3ae
  29. 26 4月, 2017 1 次提交
  30. 12 4月, 2017 1 次提交
    • N
      md/raid5: make chunk_aligned_read() split bios more cleanly. · dd7a8f5d
      NeilBrown 提交于
      chunk_aligned_read() currently uses fs_bio_set - which is meant for
      filesystems to use - and loops if multiple splits are needed, which is
      not best practice.
      As this is only used for READ requests, not writes, it is unlikely
      to cause a problem.  However it is best to be consistent in how
      we split bios, and to follow the pattern used in raid1/raid10.
      
      So create a private bioset, bio_split, and use it to perform a single
      split, submitting the remainder to generic_make_request() for later
      processing.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      dd7a8f5d
  31. 11 4月, 2017 3 次提交
    • A
      raid5-ppl: partial parity calculation optimization · ae1713e2
      Artur Paszkiewicz 提交于
      In case of read-modify-write, partial partity is the same as the result
      of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
      sh->ppl_page instead of calculating it again.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ae1713e2
    • A
      raid5-ppl: use resize_stripes() when enabling or disabling ppl · 845b9e22
      Artur Paszkiewicz 提交于
      Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
      or free sh->ppl_page at runtime for all stripes in the stripe cache.
      raid5_reset_stripe_cache() required suspending the mddev and could
      deadlock because of GFP_KERNEL allocations.
      
      Move the 'newsize' check to check_reshape() to allow reallocating the
      stripes with the same number of disks. Allocate sh->ppl_page in
      alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
      a parameter to alloc_stripe() because it is needed to check whether to
      allocate ppl_page. Add free_stripe() and use it to free stripes rather
      than directly call kmem_cache_free(). Also free sh->ppl_page in
      free_stripe().
      
      Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
      setting it in advance and add another parameter to log_init() to allow
      calling ppl_init_log() without the bit set. Don't try to calculate
      partial parity or add a stripe to log if it does not have ppl_page set.
      
      Enabling ppl can now be performed without suspending the mddev, because
      the log won't be used until new stripes are allocated with ppl_page.
      Calling mddev_suspend/resume is still necessary when disabling ppl,
      because we want all stripes to finish before stopping the log, but
      resize_stripes() can be called after mddev_resume() when ppl is no
      longer active.
      Suggested-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      845b9e22
    • N
      md/raid6: Fix anomily when recovering a single device in RAID6. · 7471fb77
      NeilBrown 提交于
      When recoverying a single missing/failed device in a RAID6,
      those stripes where the Q block is on the missing device are
      handled a bit differently.  In these cases it is easy to
      check that the P block is correct, so we do.  This results
      in the P block be destroy.  Consequently the P block needs
      to be read a second time in order to compute Q.  This causes
      lots of seeks and hurts performance.
      
      It shouldn't be necessary to re-read P as it can be computed
      from the DATA.  But we only compute blocks on missing
      devices, since c337869d ("md: do not compute parity
      unless it is on a failed drive").
      
      So relax the change made in that commit to allow computing
      of the P block in a RAID6 which it is the only missing that
      block.
      
      This makes RAID6 recovery run much faster as the disk just
      "before" the recovering device is no longer seeking
      back-and-forth.
      Reported-by-tested-by: NBrad Campbell <lists2009@fnarfbargle.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      7471fb77