1. 18 December 2019 · 3 commits
    • md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · a12c768d
      Committed by David Jeffery
      commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream.
      
      If pers->make_request fails in md_flush_request(), the bio is lost. To
      fix this, pass back a bool to indicate whether the original
      make_request call should continue to handle the I/O, instead of
      assuming the flush logic will push it to completion.
      
      Convert md_flush_request() to return a bool and to no longer call the
      raid driver's make_request function.  If the return is true, the md
      flush logic has completed the bio, or will, and the md make_request
      call is done.  If false, the md make_request function needs to keep
      processing it like a normal bio, and the original call to
      md_handle_request() handles any retry of sending the bio to the raid
      driver's make_request function.
      
      Also mark md_flush_request and the make_request function pointer as
      __must_check to issue warnings should these critical return values be
      ignored.
      
      Fixes: 2bc13b83e629 ("md: batch flush requests.")
      Cc: stable@vger.kernel.org # v4.19+
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a12c768d
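      The new contract, as a minimal sketch (the personality and its body
      here are illustrative, not the actual diff):
      
          /* md_flush_request() now reports whether it consumed the bio. */
          bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
      
          static bool example_make_request(struct mddev *mddev, struct bio *bio)
          {
                  if (unlikely(bio->bi_opf & REQ_PREFLUSH) &&
                      md_flush_request(mddev, bio))
                          return true;    /* the flush logic owns the bio now */
      
                  /* ... handle the bio as a normal request ... */
                  return true;
          }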
    • dm zoned: reduce overhead of backing device checks · 56a83024
      Committed by Dmitry Fomichev
      commit e7fad909b68aa37470d9f2d2731b5bec355ee5d6 upstream.
      
      Commit 75d66ffb48efb3 added backing device health checks and, as part
      of these checks, the check_events() block ops template call is invoked
      in the dm-zoned mapping path as well as in the reclaim and flush
      paths. Calling check_events() with ATA or SCSI backing devices
      introduces a blocking scsi_test_unit_ready() call made in
      sd_check_events(). Even though the overhead of calling
      scsi_test_unit_ready() is small for ATA zoned devices, it is much
      larger for SCSI and it affects performance in a very negative way.
      
      Fix this performance regression by executing check_events() only in case
      of any I/O errors. The function dmz_bdev_is_dying() is modified to call
      only blk_queue_dying(), while calls to check_events() are made in a new
      helper function, dmz_check_bdev().
      Reported-by: zhangxiaoxu <zhangxiaoxu5@huawei.com>
      Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      56a83024
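      A simplified sketch of the split (field and helper shapes assumed;
      the real code lives in drivers/md/dm-zoned-target.c):
      
          /* Cheap liveness test, safe in the hot I/O path. */
          bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
          {
                  return blk_queue_dying(bdev_get_queue(dmz_dev->bdev));
          }
      
          /* Heavier probe, now run only after an I/O error is observed. */
          bool dmz_check_bdev(struct dmz_dev *dmz_dev)
          {
                  struct gendisk *disk = dmz_dev->bdev->bd_disk;
      
                  if (disk->fops->check_events)
                          disk->fops->check_events(disk, 0);
      
                  return !dmz_bdev_is_dying(dmz_dev);
          }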
    • dm writecache: handle REQ_FUA · 10b9bf59
      Committed by Maged Mokhtar
      commit c1005322ff02110a4df7f0033368ea015062b583 upstream.
      
      Call writecache_flush() on REQ_FUA in writecache_map().
      
      Cc: stable@vger.kernel.org # 4.18+
      Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      10b9bf59
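      The shape of the fix, sketched (simplified from writecache_map()):
      
          /* A FUA write must also commit, so the data is durable
           * when the bio completes: */
          if (unlikely(bio->bi_opf & REQ_FUA))
                  writecache_flush(wc);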
  2. 13 December 2019 · 1 commit
  3. 05 December 2019 · 4 commits
    • dm raid: fix false -EBUSY when handling check/repair message · fe8e594c
      Committed by Heinz Mauelshagen
      [ Upstream commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 ]
      
      Sending a check/repair message occasionally leads to -EBUSY instead of
      properly identifying an active resync.  This occurs because
      raid_message() is testing recovery bits in a racy way.
      
      Fix by calling decipher_sync_action() from raid_message() to properly
      identify the idle state of the RAID device.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fe8e594c
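      Roughly, the message handler now consults the central state decoder
      instead of testing raw recovery bits (a sketch; st_idle is dm-raid's
      idle sync state):
      
          /* In raid_message(): the array is busy unless decode says idle. */
          if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
                  return -EBUSY;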
    • dm flakey: Properly corrupt multi-page bios. · d4c637af
      Committed by Sweet Tea
      [ Upstream commit a00f5276e26636cbf72f24f79831026d2e2868e7 ]
      
      The flakey target is documented to be able to corrupt the Nth byte in
      a bio, but does not corrupt byte indices after the first biovec in the
      bio. Change the corrupting function to actually corrupt the Nth byte
      no matter in which biovec that index falls.
      
      A test device generating two-page bios, atop a flakey device configured
      to corrupt a byte index on the second page, verified both the failure
      to corrupt before this patch and the expected corruption after this
      change.
      Signed-off-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d4c637af
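      A sketch of the biovec-walking corruption (simplified; assumes the
      data sits in lowmem pages so page_address() suffices):
      
          static void corrupt_bio_data(struct bio *bio, struct flakey_c *fc)
          {
                  unsigned int remaining = fc->corrupt_bio_byte - 1;
                  struct bio_vec bvec;
                  struct bvec_iter iter;
      
                  /* Walk every segment so the Nth byte is found no matter
                   * which biovec (page) it falls in. */
                  bio_for_each_segment(bvec, bio, iter) {
                          if (remaining < bvec.bv_len) {
                                  char *seg = page_address(bvec.bv_page)
                                              + bvec.bv_offset;
      
                                  seg[remaining] = fc->corrupt_bio_value;
                                  break;
                          }
                          remaining -= bvec.bv_len;
                  }
          }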
    • bcache: do not mark writeback_running too early · b2d84967
      Committed by Shenghui Wang
      [ Upstream commit 79b791466e525c98f6aeee9acf5726e7b27f4231 ]
      
      A fresh backing device is not attached to any cache_set and has no
      writeback kthread created until it is first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init() runs
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get something like the
      following even though no writeback kthread has been created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The non-zero fields are misleading, as no writeback kthread is alive yet.
      
      Set dc->writeback_running to false in bch_cached_dev_writeback_init(),
      as no writeback thread has been created at that point.
      
      The writeback thread is created and woken up in
      bch_cached_dev_writeback_start(). Set dc->writeback_running to true
      before bch_writeback_queue() is called, as the writeback thread checks
      whether dc->writeback_running is true before writing back dirty data,
      and hangs if it is false.
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() is called.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b2d84967
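      The fix in outline (a sketch of the two call sites):
      
          /* bch_cached_dev_writeback_init(): no kthread exists yet. */
          dc->writeback_running = false;
      
          /* bch_cached_dev_writeback_start(): the kthread now exists,
           * so flip the flag before kicking the queue. */
          dc->writeback_running = true;
          bch_writeback_queue(dc);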
    • bcache: do not check if debug dentry is ERR or NULL explicitly on remove · 6f48e238
      Committed by Shenghui Wang
      [ Upstream commit ae17102316550b4b230a283febe31b2a9ff30084 ]
      
      debugfs_remove and debugfs_remove_recursive will check if the dentry
      pointer is NULL or ERR, and will do nothing in that case.
      
      Remove the check in cache_set_free and bch_debug_init.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6f48e238
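      Since the helpers are NULL/ERR-safe, the call sites shrink to
      something like (a sketch; the dentry field name follows bcache's
      cache_set):
      
          /* was: if (!IS_ERR_OR_NULL(c->debug)) debugfs_remove(c->debug); */
          debugfs_remove(c->debug);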
  4. 01 December 2019 · 2 commits
  5. 24 November 2019 · 3 commits
  6. 06 November 2019 · 3 commits
    • bcache: fix input overflow to writeback_rate_minimum · 437de041
      Committed by Coly Li
      [ Upstream commit dab71b2db98dcdd4657d151b01a7be88ce10f9d1 ]
      
      dc->writeback_rate_minimum is an unsigned integer variable. It is set
      via the sysfs interface and converted from the input string to an
      unsigned integer by d_strtoul_nonzero(). When the converted input
      value is larger than UINT_MAX, an unsigned integer overflow happens.
      
      This patch fixes the overflow by using sysfs_strtoul_clamp() to
      convert the input string and clamp the value to the range
      [1, UINT_MAX], so the overflow is avoided.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      437de041
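      The clamped conversion in the sysfs store path, roughly:
      
          /* parse the input and clamp it instead of letting it wrap */
          sysfs_strtoul_clamp(writeback_rate_minimum,
                              dc->writeback_rate_minimum,
                              1, UINT_MAX);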
    • dm snapshot: rework COW throttling to fix deadlock · a8afda77
      Committed by Mikulas Patocka
      [ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ]
      
      Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
      workqueue stalls") introduced a semaphore to limit the maximum number of
      in-flight kcopyd (COW) jobs.
      
      The implementation of this throttling mechanism is prone to a deadlock:
      
      1. One or more threads write to the origin device causing COW, which is
         performed by kcopyd.
      
      2. At some point some of these threads might reach the s->cow_count
         semaphore limit and block in down(&s->cow_count), holding a read lock
         on _origins_lock.
      
      3. Someone tries to acquire a write lock on _origins_lock, e.g.,
         snapshot_ctr(), which blocks because the threads at step (2) already
         hold a read lock on it.
      
      4. A COW operation completes and kcopyd runs dm-snapshot's completion
         callback, which ends up calling pending_complete().
         pending_complete() tries to resubmit any deferred origin bios. This
         requires acquiring a read lock on _origins_lock, which blocks.
      
         This happens because the read-write semaphore implementation gives
         priority to writers, meaning that as soon as a writer tries to enter
         the critical section, no readers will be allowed in, until all
         writers have completed their work.
      
         So, pending_complete() waits for the writer at step (3) to acquire
         and release the lock. This writer waits for the readers at step (2)
         to release the read lock and those readers wait for
         pending_complete() (the kcopyd thread) to signal the s->cow_count
         semaphore: DEADLOCK.
      
      The above was thoroughly analyzed and documented by Nikos Tsironis as
      part of his initial proposal for fixing this deadlock, see:
      https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
      
      Fix this deadlock by reworking COW throttling so that it waits without
      holding any locks. Add a variable 'in_progress' that counts how many
      kcopyd jobs are running. A function wait_for_in_progress() will sleep if
      'in_progress' is over the limit. It drops _origins_lock in order to
      avoid the deadlock.
      Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
      Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
      Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
      Cc: stable@vger.kernel.org # v5.0+
      Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a8afda77
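      The core idea, sketched with illustrative names (the real code uses a
      per-snapshot counter, spinlock and waitqueue; MAX_COW_JOBS stands in
      for the configured limit):
      
          static void wait_for_in_progress(struct dm_snapshot *s)
          {
                  while (unlikely(atomic_read(&s->in_progress) >= MAX_COW_JOBS)) {
                          /* Drop _origins_lock so kcopyd completions can
                           * run and decrement the counter. */
                          up_read(&_origins_lock);
                          wait_event(s->in_progress_wait,
                                     atomic_read(&s->in_progress) < MAX_COW_JOBS);
                          down_read(&_origins_lock);
                  }
          }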
    • dm snapshot: introduce account_start_copy() and account_end_copy() · 223f1af6
      Committed by Mikulas Patocka
      [ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ]
      
      This simple refactoring moves the code that modifies the semaphore
      cow_count into separate functions, to prepare for changes that will
      extend these methods to provide a more sophisticated COW throttling
      mechanism.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      223f1af6
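      The refactor itself is tiny; before the rework the helpers are just
      wrappers around the semaphore (sketch):
      
          static void account_start_copy(struct dm_snapshot *s)
          {
                  down(&s->cow_count);
          }
      
          static void account_end_copy(struct dm_snapshot *s)
          {
                  up(&s->cow_count);
          }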
  7. 29 October 2019 · 2 commits
    • dm cache: fix bugs when a GFP_NOWAIT allocation fails · e49c84c5
      Committed by Mikulas Patocka
      commit 13bd677a472d534bf100bab2713efc3f9e3f5978 upstream.
      
      A GFP_NOWAIT allocation can fail at any time: it doesn't wait for
      memory to become available, and it fails if the mempool is exhausted
      and there is not enough free memory.
      
      If we go down this path:
        map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
      we can see that map_bio() doesn't check the return value of mg_start(),
      and the bio is leaked.
      
      If we go down this path:
        map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
        dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
        mg_lock_writes -> mg_complete
      the bio is ended with an error - this is unacceptable because it could
      cause filesystem corruption if the machine temporarily runs out of
      memory.
      
      Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
      wait until memory becomes available. mempool_alloc with GFP_NOIO can't
      fail, so remove the code paths that deal with allocation failure.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e49c84c5
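      After the change the allocation path needs no failure handling
      (a sketch, simplified from dm-cache's alloc_migration()):
      
          static struct dm_cache_migration *alloc_migration(struct cache *cache)
          {
                  struct dm_cache_migration *mg;
      
                  /* GFP_NOIO may sleep, but a mempool_alloc() with it
                   * cannot fail. */
                  mg = mempool_alloc(&cache->migration_pool, GFP_NOIO);
                  memset(mg, 0, sizeof(*mg));
                  mg->cache = cache;
      
                  return mg;
          }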
    • md/raid0: fix warning message for parameter default_layout · 9457994a
      Committed by Song Liu
      [ Upstream commit 3874d73e06c9b9dc15de0b7382fc223986d75571 ]
      
      The message should match the parameter, i.e. raid0.default_layout.
      
      Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
      Cc: NeilBrown <neilb@suse.de>
      Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9457994a
  8. 05 October 2019 · 12 commits
    • md/raid0: avoid RAID0 data corruption due to layout confusion. · bbe3e205
      Committed by NeilBrown
      [ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ]
      
      If the drives in a RAID0 are not all the same size, the array is
      divided into zones.
      The first zone covers all drives, to the size of the smallest.
      The second zone covers all drives larger than the smallest, up to
      the size of the second smallest - etc.
      
      A change in Linux 3.14 unintentionally changed the layout for the
      second and subsequent zones.  All the correct data is still stored, but
      each chunk may be assigned to a different device than in pre-3.14 kernels.
      This can lead to data corruption.
      
      It is not possible to determine automatically which layout to use - it
      depends on which kernel the data was written by. So we add a module
      parameter to allow the old (0) or new (1) layout to be specified, and
      refuse to assemble an affected array if that parameter is not set.
      
      Fixes: 20d0189b ("block: Introduce new bio_split()")
      Cc: stable@vger.kernel.org (3.14+)
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bbe3e205
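      The parameter in outline (a sketch; values per the message above, with
      the default meaning "unset, refuse affected arrays"):
      
          static int default_layout = -1;    /* unset */
          module_param(default_layout, int, 0644);
          MODULE_PARM_DESC(default_layout,
                           "zone layout for multi-zone arrays (0=old/pre-3.14, 1=new)");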
    • md: only call set_in_sync() when it is expected to succeed. · 5dc86e95
      Committed by NeilBrown
      commit 480523feae581ab714ba6610388a3b4619a2f695 upstream.
      
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can noticeably
      affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been no
      recent IO.  So we save this "safemode was non-zero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5dc86e95
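      The gist of the change in md_check_recovery(), sketched:
      
          /* Remember whether safemode was set on entry ... */
          bool try_set_sync = mddev->safemode != 0;
      
          /* ... and only pay the set_in_sync() cost when it was. */
          if (try_set_sync && !mddev->external && !mddev->in_sync) {
                  spin_lock(&mddev->lock);
                  set_in_sync(mddev);
                  spin_unlock(&mddev->lock);
          }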
    • md: don't report active array_state until after revalidate_disk() completes. · 598a2cda
      Committed by NeilBrown
      commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream.
      
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon as
      it appears to be, fsck can be run.  If fsck finds the size to be zero,
      it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL, so the array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensures the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY is cleared twice.  This is deliberate, to cover
      both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      598a2cda
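      The interlock in outline (a sketch; array_active_state() stands in
      for the existing active/clean/write-pending logic):
      
          /* do_md_run(): keep the flag set until the array is usable. */
          set_bit(MD_NOT_READY, &mddev->flags);
          err = md_run(mddev);
          if (!err)
                  revalidate_disk(mddev->gendisk);
          clear_bit(MD_NOT_READY, &mddev->flags);  /* cleared on all paths */
      
          /* array_state_show(): don't claim an active state too early. */
          if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags))
                  st = array_active_state(mddev);  /* as before */
          else
                  st = inactive;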
    • md/raid6: Set R5_ReadError when there is read failure on parity disk · e8323e0d
      Committed by Xiao Ni
      commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream.
      
      7471fb77 ("md/raid6: Fix anomily when recovering a single device in
      RAID6.") avoids rereading P when it can be computed from other members.
      However, this misses the chance to re-write the right data to P. This
      patch sets R5_ReadError if the re-read fails.
      
      Also, when the re-read is skipped, we miss the chance to reset
      rdev->read_errors to 0, which can fail the disk when there are many
      read errors on the P member disk (while other disks have no read
      errors).
      
      V2: upper-layer read requests don't read parity/Q data, so there is
      no need to consider that situation.
      
      Reported-by: kbuild test robot <lkp@intel.com>
      
      Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.")
      Cc: <stable@vger.kernel.org> #4.4+
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8323e0d
    • blk-mq: add callback of .cleanup_rq · 4ec3ca27
      Committed by Ming Lei
      [ Upstream commit 226b4fc75c78f9c497c5182d939101b260cfb9f3 ]
      
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4ec3ca27
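      The new hook boils down to (a sketch matching the description above):
      
          /* blk-mq side: let the driver free its per-request data. */
          static inline void blk_mq_cleanup_rq(struct request *rq)
          {
                  if (rq->q->mq_ops->cleanup_rq)
                          rq->q->mq_ops->cleanup_rq(rq);
          }
      
          /* dm-rq side: before freeing a clone that will be retried. */
          blk_mq_cleanup_rq(clone);
          blk_put_request(clone);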
    • raid5: don't increment read_errors on EILSEQ return · 0a43d5d4
      Committed by Nigel Croxon
      [ Upstream commit b76b4715eba0d0ed574f58918b29c1b2f0fa37a8 ]
      
      MD continues to count read errors returned by the lower layer; if
      those errors are -EILSEQ, instead of -EIO, it should NOT increase
      the read_errors count.
      
      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array even if the issue
      is correctable with a sector re-write and the array has the necessary
      redundancy to correct it.
      
      The leg is ejected because it runs rdev->read_errors up beyond
      conf->max_nr_stripes.  The return status in dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0a43d5d4
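      The condition in the read completion path, sketched:
      
          /* Integrity errors come back as BLK_STS_PROTECTION (-EILSEQ);
           * only real media errors should count toward ejecting the leg. */
          if (bi->bi_status != BLK_STS_PROTECTION)
                  atomic_inc(&rdev->read_errors);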
    • raid5: don't set STRIPE_HANDLE to stripe which is in batch list · a5443cd2
      Committed by Guoqing Jiang
      [ Upstream commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ]
      
      If a stripe in a batch list has the STRIPE_HANDLE flag set, the stripe
      could then be set with STRIPE_ACTIVE by the handle_stripe function. If
      an error happens to the batch_head at the same time,
      break_stripe_batch_list is called, and the warning below can be seen
      (the same report as in [1]); it means a member of the batch list was
      set with STRIPE_ACTIVE.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe which is already in a batch
      list; otherwise the stripe could be put on the handle_list and run
      through handle_stripe, and the warning above could be triggered.
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a5443cd2
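      The idea of the fix, with the condition simplified (a sketch):
      
          /* A stripe that sits in a batch list, and is not the head,
           * must not be queued for handling. */
          if (!sh->batch_head || sh->batch_head == sh)
                  set_bit(STRIPE_HANDLE, &sh->state);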
    • md/raid1: fail run raid1 array when active disk less than one · f1db7562
      Committed by Yufen Yu
      [ Upstream commit 07f1a6850c5d5a65c917c3165692b5179ac4cb6b ]
      
      When run test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      mdadm fails to run, with kernel messages as follows:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when the number of active disks in the raid1 array is less
      than one, we need to return failure from raid1_run().
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f1db7562
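      The added check, roughly (a sketch of the raid1_run() change):
      
          /* Refuse to start an array with no working mirrors. */
          if (conf->raid_disks - mddev->degraded < 1) {
                  ret = -EINVAL;
                  goto abort;
          }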
    • closures: fix a race on wakeup from closure_sync · f0956418
      Committed by Kent Overstreet
      [ Upstream commit a22a9602b88fabf10847f238ff81fde5f906fef7 ]
      
      The race occurs when a thread using closure_sync() notices
      cl->s->done == 1 before the thread calling closure_put() calls
      wake_up_process(). Then it's possible for that thread to return and
      exit just before wake_up_process() is called - so we're trying to
      wake up a process that no longer exists.
      
      rcu_read_lock() is sufficient to protect against this, as there's an rcu
      barrier somewhere in the process teardown path.
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f0956418
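      The fix is essentially this (simplified from closure_sync_fn()):
      
          static void closure_sync_fn(struct closure *cl)
          {
                  struct closure_syncer *s = cl->s;
                  struct task_struct *p;
      
                  /* The RCU read lock keeps the sleeping task's task_struct
                   * valid until after wake_up_process() returns. */
                  rcu_read_lock();
                  p = READ_ONCE(s->task);
                  s->done = 1;
                  wake_up_process(p);
                  rcu_read_unlock();
          }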
    • md: don't set In_sync if array is frozen · 37153845
      Committed by Guoqing Jiang
      [ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ]
      
      When a disk is added to an array, the following path is called in mdadm.
      
      Manage_subdevs -> sysfs_freeze_array
                     -> Manage_add
                     -> sysfs_set_str(&info, NULL, "sync_action","idle")
      
      Then, from the kernel side, Manage_add invokes the path (add_new_disk ->
      validate_super = super_1_validate) to set the In_sync flag.
      
      Since In_sync means "device is in_sync with rest of array", the newly
      added disk needs the resync thread to help synchronize its data, and
      md_reap_sync_thread() will finally call spare_active() to set In_sync
      for the new disk. So don't set In_sync if the array is frozen.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      37153845
    • md: don't call spare_active in md_reap_sync_thread if all member devices can't work · d38aff20
      Committed by Guoqing Jiang
      [ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ]
      
      When one disk is added to an array, md_reap_sync_thread() is
      responsible for activating the spare and setting the In_sync flag
      for the new member in spare_active().
      
      But suppose raid1 has one member disk A, and disk B is added to the
      array. If we offline A before all the data has been synchronized from
      A to B, then obviously B doesn't have the latest data that A has, yet
      B is still marked with the In_sync flag.
      
      So let's not call spare_active() under this condition; otherwise B is
      still shown in the 'U' state, which is not correct.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d38aff20
    • md/raid1: end bio when the device faulty · 1cd972e0
      Committed by Yufen Yu
      [ Upstream commit eeba6809d8d58908b5ed1b5ceb5fcb09a98a7cad ]
      
      When a write bio returns an error, it is added to conf->retry_list
      to wait for the raid1d thread to retry the write and acknowledge
      badblocks.
      
      In narrow_write_error(), the error bio is split in units of the
      badblock shift (such as one sector) and the raid1d thread issues the
      pieces one by one. Only after all of the split bios have finished can
      the raid1d thread go on to process other things, which is time
      consuming.
      
      But there is a case where this error handling is unnecessary. When
      the device has been set faulty, flush_bio_list() may end bios in
      pending_bio_list with an error status. Since these bios were never
      actually issued to the device, retrying the write and acknowledging
      badblocks makes no sense.
      
      Even apart from that case, when the device is faulty, badblocks info
      cannot be written out to the device, so there is no need to handle
      the error IO either.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1cd972e0
  9. 01 October 2019 · 2 commits
    • dm zoned: fix invalid memory access · dc9118fe
      Committed by Mikulas Patocka
      [ Upstream commit 0c8e9c2d668278652af028c3cc068c65f66342f4 ]
      
      Commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 ("dm zoned: properly
      handle backing device failure") triggers a coverity warning:
      
      *** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      /drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
      131             clone->bi_private = bioctx;
      132
      133             bio_advance(bio, clone->bi_iter.bi_size);
      134
      135             refcount_inc(&bioctx->ref);
      136             generic_make_request(clone);
      >>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      >>>     Dereferencing freed pointer "clone".
      137             if (clone->bi_status == BLK_STS_IOERR)
      138                     return -EIO;
      139
      140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
      141                     zone->wp_block += nr_blocks;
      142
      
      The "clone" bio may be processed and freed before the check
      "clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
      memory.
      
      Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dc9118fe
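      The principle of the fix, sketched (names illustrative): decide
      failure before submission and never touch the clone afterwards.
      
          if (dmz_bdev_is_dying(dmz->dev))
                  return -EIO;
      
          refcount_inc(&bioctx->ref);
          generic_make_request(clone);
          /* no access to *clone past this point */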
    • bcache: remove redundant LIST_HEAD(journal) from run_cache_set() · ad16dfef
      Committed by Coly Li
      [ Upstream commit cdca22bcbc64fc83dadb8d927df400a8d86ddabb ]
      
      Commit 95f18c9d1310 ("bcache: avoid potential memleak of list of
      journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgot
      to remove the original definition of LIST_HEAD(journal), which made
      the change take no effect. This patch removes the redundant variable
      LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
      work.
      
      Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
      Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ad16dfef
  10. 16 September 2019 · 8 commits
    • bcache: fix race in btree_flush_write() · b113f984
      Committed by Coly Li
      [ Upstream commit 50a260e859964002dab162513a10f91ae9d3bcd3 ]
      
      There is a race between mca_reap(), btree_node_free() and the journal
      code in btree_flush_write(), which results in very rare and strange
      deadlocks or panics that are very hard to reproduce.
      
      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to
      the cache device; the select-and-flush is a two-step operation.
      Between these two steps, things can happen inside the race window:
      - The selected btree node was reaped by mca_reap() and its memory
        allocated to another requester for another btree node.
      - The selected btree node was selected, flushed and released by the
        mca shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first holds b->write_lock via mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is that the memory of the selected btree
      node has been released; then all references to this btree node (e.g.
      b->write_lock) will trigger a NULL pointer dereference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one by one over quite a long time period.
      
      Such a race was not easy to reproduce before. On a Lenovo SR650 server
      with 48 Xeon cores, with one NVMe SSD configured as the cache device
      and an MD raid0 device assembled from 3 NVMe SSDs as the backing
      device, this race can be observed around once every 10,000 calls to
      btree_flush_write(). Both deadlocks and kernel panics happened as
      aftermath of the race.
      
      The idea of the fix is to add a btree flag, BTREE_NODE_journal_flush.
      It is set when selecting btree nodes, and cleared after the btree
      nodes are flushed. When mca_reap() encounters a btree node with this
      bit set, the node is skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, the race is avoided.
      
      One corner case should be noticed, namely btree_node_free(). It might
      be called in some error-handling code paths. For example the
      following code piece from btree_split(),
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released without
      mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such an error-handling
      path, and the selected btree node has the BTREE_NODE_journal_flush bit
      set, just delay for 1 us and retry. In this case the btree node won't
      be skipped; the free path simply retries until the
      BTREE_NODE_journal_flush bit is cleared, and then frees the btree node
      memory.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b113f984
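      In outline, the two sides of the new flag (a sketch; error handling
      elided, helper name per bcache's BTREE_FLAG macros):
      
          /* btree_flush_write(): mark the node across the two-step flush. */
          set_bit(BTREE_NODE_journal_flush, &b->flags);
          /* ... write the selected node out ... */
          clear_bit(BTREE_NODE_journal_flush, &b->flags);
      
          /* mca_reap(): never reap a node the journal code is flushing. */
          if (btree_node_journal_flush(b))
                  return -EBUSY;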
    • bcache: add comments for mutex_lock(&b->write_lock) · f73c35d9
      Committed by Coly Li
      [ Upstream commit 41508bb7d46b74dba631017e5a702a86caf1db8c ]
      
      When accessing or modifying the BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing the BTREE_NODE_dirty bit there.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f73c35d9
    • bcache: only clear BTREE_NODE_dirty bit when it is set · 7989a502
      Committed by Coly Li
      [ Upstream commit e5ec5f4765ada9c75fb3eee93a6e72f0e50599d5 ]
      
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always cleared no matter whether the btree node is dirty or not. The
      code looks like
      this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed, if btree_node_dirty(b) returns false, the BTREE_NODE_dirty
      bit is already clear and it is unnecessary to clear it again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7989a502
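      After the change, the clear is scoped to the dirty case (per the code
      quoted above):
      
          if (btree_node_dirty(b)) {
                  btree_complete_write(b, btree_current_write(b));
                  clear_bit(BTREE_NODE_dirty, &b->flags);
          }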
    • dm thin metadata: check if in fail_io mode when setting needs_check · ecf99cde
      Committed by Mike Snitzer
      [ Upstream commit 54fa16ee532705985e6c946da455856f18f63ee1 ]
      
      Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
      Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
      crash in dm_bm_write_lock() while accessing the block manager object
      that was previously destroyed as part of a failed
      dm_pool_abort_metadata() that ultimately set fail_io to begin with.
      
      Also, update DMERR() message to more accurately describe
      superblock_lock() failure.
      
      Cc: stable@vger.kernel.org
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ecf99cde
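      The early-out in outline (a sketch; lock and flag names per
      dm-thin-metadata):
      
          down_write(&pmd->root_lock);
          if (pmd->fail_io)
                  goto out;        /* block manager is already destroyed */
      
          pmd->flags |= THIN_METADATA_NEEDS_CHECK_FLAG;
          /* ... superblock_lock() and commit as before ... */
          out:
          up_write(&pmd->root_lock);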
    • dm crypt: move detailed message into debug level · fcb2f1e2
      Committed by Milan Broz
      [ Upstream commit 7a1cd7238fde6ab367384a4a2998cba48330c398 ]
      
      The information about tag size should not be printed without debug
      info set. Also print the device major:minor in the error message to
      identify the device instance.
      
      Also use rate limiting and the debug level for info about the used
      crypto API implementation.  This is important because during online
      reencryption the existing message saturates syslog (because we are
      moving the hotzone across the whole device).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fcb2f1e2
    • dm mpath: fix missing call of path selector type->end_io · 69409854
      Committed by Yufen Yu
      [ Upstream commit 5de719e3d01b4abe0de0d7b857148a880ff2a90b ]
      
      After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
      blk_insert_cloned_request feedback"), map_request() will requeue the tio
      when the issued clone request returns BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE.
      
      Thus, if the device driver status is error, a tio may be requeued
      multiple times until the return value is not DM_MAPIO_REQUEUE.  That
      means type->start_io may be called multiple times, while type->end_io
      is only called when the IO completes.
      
      In fact, even without commit 396eaf21, a setup_clone() failure can
      also cause a tio requeue and the associated missed call to
      type->end_io.
      
      The service-time path selector selects a path based on in_flight_size,
      which is increased by st_start_io() and decreased by st_end_io().
      Missed calls to st_end_io() can lead to an in_flight_size count error
      and cause the selector to make the wrong choice.  The queue-length
      path selector is affected in the same way.
      
      To fix the problem, call type->end_io in ->release_clone_rq() before
      the tio is requeued.  map_info is passed to ->release_clone_rq() for
      the map_request() error path that results in a requeue.
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Cc: stable@vger.kernel.org
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      69409854
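      The interface change in outline (a sketch):
      
          /* dm target type: the map context now reaches the release hook,
           * so the path selector's end_io accounting can run on requeue. */
          void (*release_clone_rq)(struct request *clone,
                                   union map_info *map_context);
      
          /* dm-rq, before requeueing the tio: */
          tio->ti->type->release_clone_rq(clone, &tio->info);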
    • bcache: treat stale && dirty keys as bad keys · 687e470e
      Committed by Tang Junhui
      [ Upstream commit 58ac323084ebf44f8470eeb8b82660f9d0ee3689 ]
      
      Stale && dirty keys can be produced in the following way:
      After writeback in write_dirty_finish(), dirty key k1 will be
      replaced by clean key k2
      ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
      ==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
      ==>static int bch_btree_insert_node(struct btree *b,
             struct btree_op *op,
             struct keylist *insert_keys,
             atomic_t *journal_ref,
      Then two steps:
      A) update k1 to k2 in the btree node in memory;
         bch_btree_insert_keys(b, op, insert_keys, replace_key)
      B) write the bset (containing k2) to the cache disk via a 30s
         delayed work
         bch_btree_leaf_dirty(b, journal_ref).
      But before the 30s delayed work writes the bset to the cache device,
      these things happen:
      A) GC runs and reclaims the bucket k2 points to;
      B) the allocator runs, invalidates the bucket k2 points to,
         increases the gen of the bucket, and places it into the free_inc
         fifo;
      C) up to now the 30s delayed work still has not finished, so on disk
         the key is still k1; it is dirty and stale (its gen is smaller
         than the gen of the bucket). Then the machine powers off
         suddenly;
      D) when the machine powers on again, after the btree reconstruction,
         the stale dirty key appears.
      
      In bch_extent_bad(), when expensive_debug_checks is off, a dirty key
      is treated as good even if it is stale, which causes the problems
      below:
      A) in read_dirty() it can crash the machine:
         BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
      B) it can be worse when a read hits a stale dirty key: old,
         incorrect data is returned.
      
      This patch tolerates the existence of these stale && dirty keys and
      treats them as bad keys in bch_extent_bad().
      
      (Coly Li: fix indent which was modified by sender's email client)
      Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      687e470e
    • bcache: replace hard coded number with BUCKET_GC_GEN_MAX · d1cec665
      Committed by Coly Li
      [ Upstream commit 149d0efada7777ad5a5242b095692af142f533d8 ]
      
      In extents.c:bch_extent_bad(), the number 96 is used as a parameter to
      call btree_bug_on(). The purpose is to check whether the stale gen
      value exceeds BUCKET_GC_GEN_MAX, so it is better to use the macro
      BUCKET_GC_GEN_MAX to make the code more understandable.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d1cec665