1. 05 Oct, 2019 9 commits
    • md/raid6: Set R5_ReadError when there is read failure on parity disk · e8323e0d
      Committed by Xiao Ni
      commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream.
      
      7471fb77 ("md/raid6: Fix anomily when recovering a single device in
      RAID6.") avoids rereading P when it can be computed from other members.
      However, this misses the chance to re-write the right data to P. This
      patch sets R5_ReadError if the re-read fails.
      
      Also, when the re-read is skipped, we miss the chance to reset
      rdev->read_errors to 0. This can fail the disk when there are many read
      errors on the P member disk (while the other disks have no read errors).

      V2: upper layer read requests don't read parity/Q data, so there is no
      need to consider that situation.

      Reported-by: kbuild test robot <lkp@intel.com>
      
      Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.")
      Cc: <stable@vger.kernel.org> #4.4+
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • blk-mq: add callback of .cleanup_rq · 4ec3ca27
      Committed by Ming Lei
      [ Upstream commit 226b4fc75c78f9c497c5182d939101b260cfb9f3 ]
      
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding a
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
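
The optional-callback pattern described above can be sketched in userspace C. This is only an illustration: the structs are simplified stand-ins rather than the kernel's definitions, and `cleanup_rq_if_present()`, `example_cleanup()`, `driver_data` and `cleaned` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

struct request;

/* Simplified stand-in for an ops table; not the kernel's blk_mq_ops. */
struct mq_ops {
	/* Optional hook: free driver-private data attached to a request. */
	void (*cleanup_rq)(struct request *rq);
};

struct request {
	const struct mq_ops *ops;
	void *driver_data;	/* hypothetical per-request private data */
	int cleaned;		/* set by the example hook below */
};

/* Call the hook only when the driver provides one. */
static inline void cleanup_rq_if_present(struct request *rq)
{
	assert(rq != NULL);
	if (rq->ops && rq->ops->cleanup_rq)
		rq->ops->cleanup_rq(rq);
}

/* Example hook that a lower-level driver might register. */
static void example_cleanup(struct request *rq)
{
	rq->driver_data = NULL;
	rq->cleaned = 1;
}
```

Drivers that leave the hook unset pay only for one pointer check, which matches the commit's stated goal of not adding a special case to the common free path.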
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • raid5: don't increment read_errors on EILSEQ return · 0a43d5d4
      Committed by Nigel Croxon
      [ Upstream commit b76b4715eba0d0ed574f58918b29c1b2f0fa37a8 ]
      
      MD continues to count read errors returned by the lower layer.
      If those errors are -EILSEQ instead of -EIO, it should NOT increase
      the read_errors count.

      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array, even if the
      issue is correctable with a sector re-write and the array has the
      necessary redundancy to correct it.

      The leg is ejected because it runs up rdev->read_errors beyond
      conf->max_nr_stripes.  The return status in dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
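
The accounting rule described above can be sketched as follows. This is a simplified userspace illustration, not the raid5 code: `struct member_dev` and `account_read_failure()` are hypothetical names, and plain negative errno values stand in for the block layer status codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified per-device state; not the kernel's md_rdev. */
struct member_dev {
	int read_errors;
};

/*
 * Account for a failed read; err is a negative errno value.
 * Integrity failures (-EILSEQ) are not counted, so on their own they
 * cannot push the device over the eviction threshold.
 */
static void account_read_failure(struct member_dev *dev, int err)
{
	assert(dev != NULL);
	if (err != -EILSEQ)
		dev->read_errors++;
}
```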

      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • raid5: don't set STRIPE_HANDLE to stripe which is in batch list · a5443cd2
      Committed by Guoqing Jiang
      [ Upstream commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ]
      
      If a stripe in a batch list is set with the STRIPE_HANDLE flag, then the
      stripe could be set with STRIPE_ACTIVE by the handle_stripe function. If
      an error happens to the batch_head at the same time,
      break_stripe_batch_list is called and the warning below can be triggered
      (the same report as in [1]); it means a member of the batch list was set
      with STRIPE_ACTIVE.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe that is already in a batch list;
      otherwise the stripe could be put on handle_list and run handle_stripe,
      and the above warning could be triggered.
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html

      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • md/raid1: fail run raid1 array when active disk less than one · f1db7562
      Committed by Yufen Yu
      [ Upstream commit 07f1a6850c5d5a65c917c3165692b5179ac4cb6b ]
      
      When running this test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      mdadm fails to run, with kernel messages as follows:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when a raid1 array has fewer than one active disk,
      raid1_run() needs to return failure.

      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • closures: fix a race on wakeup from closure_sync · f0956418
      Committed by Kent Overstreet
      [ Upstream commit a22a9602b88fabf10847f238ff81fde5f906fef7 ]
      
      The race was when a thread using closure_sync() notices cl->s->done == 1
      before the thread calling closure_put() calls wake_up_process(). Then,
      it's possible for that thread to return and exit just before
      wake_up_process() is called - so we're trying to wake up a process that
      no longer exists.
      
      rcu_read_lock() is sufficient to protect against this, as there's an rcu
      barrier somewhere in the process teardown path.

      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • md: don't set In_sync if array is frozen · 37153845
      Committed by Guoqing Jiang
      [ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ]
      
      When a disk is added to array, the following path is called in mdadm.
      
      Manage_subdevs -> sysfs_freeze_array
                     -> Manage_add
                     -> sysfs_set_str(&info, NULL, "sync_action","idle")
      
      Then on the kernel side, Manage_add invokes the path (add_new_disk ->
      validate_super = super_1_validate) to set the In_sync flag.

      In_sync means "device is in_sync with rest of array", but the newly
      added disk still needs the resync thread to synchronize its data, and
      md_reap_sync_thread will eventually call spare_active to set In_sync
      for the newly added disk. So don't set In_sync if the array is frozen.

      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • md: don't call spare_active in md_reap_sync_thread if all member devices can't work · d38aff20
      Committed by Guoqing Jiang
      [ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ]
      
      When a disk is added to an array, md_reap_sync_thread is responsible
      for activating the spare and setting the In_sync flag for the new
      member in spare_active().

      But suppose raid1 has one member disk A, and disk B is added to the
      array. If we then offline A before all the data is synchronized from A
      to B, B obviously doesn't have the latest data that A has, yet B is
      still marked with the In_sync flag.

      So let's not call spare_active under that condition; otherwise B is
      still shown in the 'U' state, which is not correct.

      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • md/raid1: end bio when the device faulty · 1cd972e0
      Committed by Yufen Yu
      [ Upstream commit eeba6809d8d58908b5ed1b5ceb5fcb09a98a7cad ]
      
      When a write bio returns an error, it is added to conf->retry_list
      and waits for the raid1d thread to retry the write and acknowledge
      badblocks.

      In narrow_write_error(), the error bio is split in units of the
      badblock shift (such as one sector) and the raid1d thread issues them
      one by one. Only after all of the split bios have finished can the
      raid1d thread go on to process other things, which is time consuming.

      But there is a scenario in which this error handling is unnecessary.
      When the device has been set faulty, flush_bio_list() may end
      bios in pending_bio_list with error status. Since these bios
      have not actually been issued to the device, error handling to
      retry the write and acknowledge badblocks makes no sense.

      Even without that scenario, when the device is faulty, badblocks info
      cannot be written out to the device. Thus, there is also no need to
      handle the error IO.

      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 01 Oct, 2019 2 commits
    • dm zoned: fix invalid memory access · dc9118fe
      Committed by Mikulas Patocka
      [ Upstream commit 0c8e9c2d668278652af028c3cc068c65f66342f4 ]
      
      Commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 ("dm zoned: properly
      handle backing device failure") triggers a coverity warning:
      
      *** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      /drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
      131             clone->bi_private = bioctx;
      132
      133             bio_advance(bio, clone->bi_iter.bi_size);
      134
      135             refcount_inc(&bioctx->ref);
      136             generic_make_request(clone);
      >>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      >>>     Dereferencing freed pointer "clone".
      137             if (clone->bi_status == BLK_STS_IOERR)
      138                     return -EIO;
      139
      140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
      141                     zone->wp_block += nr_blocks;
      142
      
      The "clone" bio may be processed and freed before the check
      "clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
      memory.
      
      Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: remove redundant LIST_HEAD(journal) from run_cache_set() · ad16dfef
      Committed by Coly Li
      [ Upstream commit cdca22bcbc64fc83dadb8d927df400a8d86ddabb ]
      
      Commit 95f18c9d1310 ("bcache: avoid potential memleak of list of
      journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgot
      to remove the original definition of LIST_HEAD(journal), so
      the change does not take effect. This patch removes the redundant
      LIST_HEAD(journal) from run_cache_set() so that Shenghui's fix
      works.
      
      Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
      Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 16 Sep, 2019 8 commits
    • bcache: fix race in btree_flush_write() · b113f984
      Committed by Coly Li
      [ Upstream commit 50a260e859964002dab162513a10f91ae9d3bcd3 ]
      
      There is a race between mca_reap(), btree_node_free() and the journal
      code btree_flush_write(), which results in very rare and strange
      deadlocks or panics that are very hard to reproduce.

      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to the
      cache device; the select-and-flush is a two-step operation. Between
      these two steps, several things may happen inside the race window:
      - The selected btree node was reaped by mca_reap() and allocated to
        other requesters for another btree node.
      - The selected btree node was selected, flushed and released by the mca
        shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and the
      memory of the selected btree node has been allocated to another btree
      node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is that the memory of the selected btree node
      has been released; then any reference to this btree node (e.g.
      b->write_lock) will trigger a NULL pointer dereference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one-by-one in a quite long time period.
      
      Such a race was not easy to reproduce before. On a Lenovo SR650 server
      with 48 Xeon cores, with 1 NVMe SSD configured as the cache device and an
      MD raid0 device assembled from 3 NVMe SSDs as the backing device, this
      race can be observed about once in every 10,000 calls to
      btree_flush_write(). Both deadlock and kernel panic happened as
      aftermath of the race.

      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
      is set when selecting btree nodes, and cleared after the btree nodes are
      flushed. Then when mca_reap() selects a btree node with this bit set,
      this btree node will be skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, such a race is avoided.

      One corner case should be noted, that is btree_node_free(). It might
      be called in some error handling code path. For example the following
      code piece from btree_split(),
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released without
      mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such an error handling path,
      and the selected btree node has the BTREE_NODE_journal_flush bit set, just
      delay for 1 us and retry. In this case the btree node won't
      be skipped; just retry until the BTREE_NODE_journal_flush bit is cleared,
      and then free the btree node memory.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: add comments for mutex_lock(&b->write_lock) · f73c35d9
      Committed by Coly Li
      [ Upstream commit 41508bb7d46b74dba631017e5a702a86caf1db8c ]
      
      When accessing or modifying BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing BTREE_NODE_dirty bit there.

      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: only clear BTREE_NODE_dirty bit when it is set · 7989a502
      Committed by Coly Li
      [ Upstream commit e5ec5f4765ada9c75fb3eee93a6e72f0e50599d5 ]
      
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always cleared no matter whether the btree node is dirty or not. The
      code looks like this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
      bit is cleared, then it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
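
The shape of the change described above can be sketched in userspace C. This is an illustration under stated assumptions: `struct node`, `node_dirty()` and `release_node()` are hypothetical stand-ins, the counter replaces the call to btree_complete_write(), and a plain mask replaces the kernel's atomic clear_bit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NODE_DIRTY = 0 };	/* bit index, stand-in for BTREE_NODE_dirty */

struct node {
	unsigned long flags;
	int writes_completed;	/* counts runs of the completion step */
};

static bool node_dirty(const struct node *b)
{
	return b->flags & (1UL << NODE_DIRTY);
}

/* Complete the pending write and clear the dirty bit only when it is set. */
static void release_node(struct node *b)
{
	assert(b != NULL);
	if (node_dirty(b)) {
		b->writes_completed++;	/* stands in for btree_complete_write() */
		b->flags &= ~(1UL << NODE_DIRTY);
	}
}
```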

      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm thin metadata: check if in fail_io mode when setting needs_check · ecf99cde
      Committed by Mike Snitzer
      [ Upstream commit 54fa16ee532705985e6c946da455856f18f63ee1 ]
      
      Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
      Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
      crash in dm_bm_write_lock() while accessing the block manager object
      that was previously destroyed as part of a failed
      dm_pool_abort_metadata() that ultimately set fail_io to begin with.
      
      Also, update DMERR() message to more accurately describe
      superblock_lock() failure.
      
      Cc: stable@vger.kernel.org
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm crypt: move detailed message into debug level · fcb2f1e2
      Committed by Milan Broz
      [ Upstream commit 7a1cd7238fde6ab367384a4a2998cba48330c398 ]
      
      The information about tag size should not be printed without debug info
      set. Also print device major:minor in the error message to identify the
      device instance.
      
      Also use rate limiting and debug level for info about used crypto API
      implementation.  This is important because during online reencryption
      the existing message saturates syslog (because we are moving hotzone
      across the whole device).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm mpath: fix missing call of path selector type->end_io · 69409854
      Committed by Yufen Yu
      [ Upstream commit 5de719e3d01b4abe0de0d7b857148a880ff2a90b ]
      
      After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
      blk_insert_cloned_request feedback"), map_request() will requeue the tio
      when the issued clone request returns BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE.

      Thus, if the device driver status is error, a tio may be requeued multiple
      times until the return value is not DM_MAPIO_REQUEUE.  That means
      type->start_io may be called multiple times, while type->end_io is only
      called when the IO completes.
      
      In fact, even without commit 396eaf21, setup_clone() failure can
      also cause tio requeue and associated missed call to type->end_io.
      
      The service-time path selector selects path based on in_flight_size,
      which is increased by st_start_io() and decreased by st_end_io().
      Missed calls to st_end_io() can lead to in_flight_size count error and
      will cause the selector to make the wrong choice.  In addition,
      queue-length path selector will also be affected.
      
      To fix the problem, call type->end_io in ->release_clone_rq before tio
      requeue.  map_info is passed to ->release_clone_rq() for the map_request()
      error paths that result in requeue.
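
The accounting invariant behind this fix can be illustrated with a small userspace sketch. It is not the dm-mpath code: `struct path`, `start_io()`, `end_io()` and `dispatch()` are hypothetical names, and `in_flight` only mirrors the role of in_flight_size in the service-time selector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical path state; in_flight mirrors the selector's in_flight_size. */
struct path {
	size_t in_flight;
};

static void start_io(struct path *p, size_t nr_bytes)
{
	p->in_flight += nr_bytes;
}

static void end_io(struct path *p, size_t nr_bytes)
{
	assert(p->in_flight >= nr_bytes);
	p->in_flight -= nr_bytes;
}

/*
 * Dispatch one request of nr_bytes that is requeued `requeues` times
 * before it finally completes. Every start_io() is paired with an
 * end_io(), including on the requeue path.
 */
static void dispatch(struct path *p, size_t nr_bytes, int requeues)
{
	int i;

	for (i = 0; i < requeues; i++) {
		start_io(p, nr_bytes);
		end_io(p, nr_bytes);	/* released before the requeue */
	}
	start_io(p, nr_bytes);
	end_io(p, nr_bytes);		/* normal completion */
}
```

Dropping the end_io() call inside the loop would leave `in_flight` inflated by `requeues * nr_bytes`, which is the kind of drift the commit message describes.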
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Cc: stable@vger.kernel.org
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: treat stale && dirty keys as bad keys · 687e470e
      Committed by Tang Junhui
      [ Upstream commit 58ac323084ebf44f8470eeb8b82660f9d0ee3689 ]
      
      Stale && dirty keys can be produced in the following way:
      After writeback in write_dirty_finish(), dirty keys k1 will be
      replaced by clean keys k2
      ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
      ==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
      ==>static int bch_btree_insert_node(struct btree *b,
             struct btree_op *op,
             struct keylist *insert_keys,
             atomic_t *journal_ref,
      Then two steps:
      A) update k1 to k2 in btree node memory;
         bch_btree_insert_keys(b, op, insert_keys, replace_key)
      B) Write the bset(contains k2) to cache disk by a 30s delay work
         bch_btree_leaf_dirty(b, journal_ref).
      But before the 30s delayed work writes the bset to the cache device,
      these things happen:
      A) GC works, and reclaims the bucket k2 points to;
      B) The allocator works, and invalidates the bucket k2 points to,
         increases the gen of the bucket, and places it into the free_inc
         fifo;
      C) Up to now, the 30s delayed work still has not finished,
         so on the disk the key is still k1; it is dirty and stale
         (its gen is smaller than the gen of the bucket). Then the
         machine suddenly powers off;
      D) When the machine powers on again, after the btree reconstruction,
         the stale dirty key appears.
      
      In bch_extent_bad(), when expensive_debug_checks is off, it would
      treat the dirty key as good even if it is a stale key, and this would
      cause the problems below:
      A) In read_dirty() it would cause a machine crash:
         BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
      B) It could be worse when a read hits stale dirty keys: it would
         read old, incorrect data.

      This patch tolerates the existence of these stale && dirty keys,
      and treats them as bad keys in bch_extent_bad().
      
      (Coly Li: fix indent which was modified by sender's email client)
      Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: replace hard coded number with BUCKET_GC_GEN_MAX · d1cec665
      Committed by Coly Li
      
      In extents.c:bch_extent_bad(), number 96 is used as parameter to call
      btree_bug_on(). The purpose is to check whether stale gen value exceeds
      BUCKET_GC_GEN_MAX, so it is better to use macro BUCKET_GC_GEN_MAX to
      make the code more understandable.

      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 29 Aug, 2019 11 commits
    • dm zoned: fix potential NULL dereference in dmz_do_reclaim() · 0d5e34c1
      Committed by Dan Carpenter
      [ Upstream commit e0702d90b79d430b0ccc276ead4f88440bb51352 ]
      
      This function is supposed to return error pointers so it matches the
      dmz_get_rnd_zone_for_reclaim() function.  The current code could lead to
      a NULL dereference in dmz_do_reclaim()
      
      Fixes: b234c6d7a703 ("dm zoned: improve error handling in reclaim")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm zoned: properly handle backing device failure · c14fe4e8
      Committed by Dmitry Fomichev
      commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 upstream.
      
      dm-zoned is observed to lock up or livelock in case of hardware
      failure or some misconfiguration of the backing zoned device.
      
      This patch adds a new dm-zoned target function that checks the status of
      the backing device. If the request queue of the backing device is found
      to be in dying state or the SCSI backing device enters offline state,
      the health check code sets a dm-zoned target flag prompting all further
      incoming I/O to be rejected. In order to detect backing device failures
      timely, this new function is called in the request mapping path, at the
      beginning of every reclaim run and before performing any metadata I/O.
      
      The proper way out of this situation is to do
      
      dmsetup remove <dm-zoned target>
      
      and recreate the target when the problem with the backing device
      is resolved.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm zoned: improve error handling in i/o map code · 4530f2f1
      Committed by Dmitry Fomichev
      commit d7428c50118e739e672656c28d2b26b09375d4e0 upstream.
      
      Some errors are ignored in the I/O path during queueing chunks
      for processing by chunk works. Since at least these errors are
      transient in nature, it should be possible to retry the failed
      incoming commands.
      
      The fix -
      
      Errors that can happen while queueing chunks are carried upwards
      to the main mapping function and it now returns DM_MAPIO_REQUEUE
      for any incoming requests that can not be properly queued.
      
      Error logging/debug messages are added where needed.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm zoned: improve error handling in reclaim · 8b7c17bb
      Committed by Dmitry Fomichev
      commit b234c6d7a703661b5045c5bf569b7c99d2edbf88 upstream.
      
      There are several places in reclaim code where errors are not
      propagated to the main function, dmz_reclaim(). This function
      is responsible for unlocking zones that might be still locked
      at the end of any failed reclaim iterations. As a result,
      some device zones may be left permanently locked for reclaim,
      degrading the target's capability to reclaim zones.
      
      This patch fixes these issues as follows -
      
      Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and
      dmz_reclaim_rnd_data() return error codes to the caller.
      
      dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid
      clashing with "struct dmz_reclaim" and is modified to return the
      error to the caller.
      
      dmz_get_zone_for_reclaim() now returns an error instead of NULL
      pointer and reclaim code checks for that error.
      
      Error logging/debug messages are added where necessary.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b7c17bb
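A minimal user-space sketch of the error-propagation pattern described in this commit message: the copy helper returns an error code, and the reclaim iteration unlocks the zone before passing that error up. All `toy_*` names and the struct layout are hypothetical stand-ins, not the dm-zoned code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a zone descriptor. */
struct toy_zone {
	bool reclaim_locked;	/* zone is locked for reclaim */
	bool reclaimed;		/* zone was successfully emptied */
};

/* Stand-in for a copy helper: returns 0 on success or a negative errno. */
static int toy_copy_zone(bool fail)
{
	return fail ? -EIO : 0;
}

/* One reclaim iteration: no exit path leaves the zone locked. */
static int toy_do_reclaim(struct toy_zone *zone, bool fail_copy)
{
	int ret;

	zone->reclaim_locked = true;

	ret = toy_copy_zone(fail_copy);
	if (ret) {
		/* Error path: unlock so the zone can be selected again later. */
		zone->reclaim_locked = false;
		return ret;
	}

	/* Success path: the zone has been emptied and is released. */
	zone->reclaimed = true;
	zone->reclaim_locked = false;
	return 0;
}
```

A caller would check the return value of `toy_do_reclaim()` and log the failure instead of silently continuing.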
    • M
      dm table: fix invalid memory accesses with too high sector number · ded8e524
      Mikulas Patocka committed
      commit 1cfd5d3399e87167b7f9157ef99daa0e959f395d upstream.
      
      If the sector number is too high, dm_table_find_target() should return a
      pointer to a zeroed dm_target structure (the caller should test it with
      dm_target_is_valid).
      
      However, for some table sizes, the code in dm_table_find_target() that
      performs the btree lookup will access out-of-bounds memory.
      
      Fix this bug by testing the sector number at the beginning of
      dm_table_find_target(). Also, add an "inline" keyword to the function
      dm_table_get_size() because this is a hot path.
      
      Fixes: 512875bd ("dm: table detect io beyond device")
      Cc: stable@vger.kernel.org
      Reported-by: Zhang Tao <kontais@zoho.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ded8e524
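A minimal sketch of the "test the sector first" idea from this commit message. It assumes a simplified table that is a flat sorted array of last-sector values searched with a binary search, not the kernel's btree; the `toy_*` names, the struct layout, and the validity test based on `len` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct toy_target {
	sector_t begin;
	sector_t len;		/* 0 marks the zeroed "invalid" target */
};

struct toy_table {
	size_t num_targets;
	const sector_t *highs;	/* highs[i] is the last sector of target i */
	const struct toy_target *targets;
};

/* Zeroed target returned for out-of-range sectors. */
static const struct toy_target toy_invalid_target = { 0, 0 };

static inline sector_t toy_table_get_size(const struct toy_table *t)
{
	return t->num_targets ? t->highs[t->num_targets - 1] + 1 : 0;
}

static int toy_target_is_valid(const struct toy_target *ti)
{
	return ti->len != 0;
}

static const struct toy_target *toy_find_target(const struct toy_table *t,
						sector_t sector)
{
	size_t lo, hi;

	/* Reject too-high sectors before touching any lookup structure. */
	if (sector >= toy_table_get_size(t))
		return &toy_invalid_target;

	/* Find the first target whose last sector is >= sector. */
	lo = 0;
	hi = t->num_targets - 1;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->highs[mid] >= sector)
			hi = mid;
		else
			lo = mid + 1;
	}
	return &t->targets[lo];
}
```

Because the range check comes first, the search loop only ever runs with a sector that some target is known to cover, so it cannot index past the end of `highs`.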
    • Z
      dm space map metadata: fix missing store of apply_bops() return value · 53e73d10
      ZhangXiaoxu committed
      commit ae148243d3f0816b37477106c05a2ec7d5f32614 upstream.
      
      In commit 6096d91a ("dm space map metadata: fix occasional leak
      of a metadata block on resize"), the commit logic was refactored into a
      new function 'apply_bops'.  But when that logic was replaced in out() the
      return value was not stored.  This may lead to out() returning a wrong
      value to the caller.
      
      Fixes: 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      53e73d10
    • W
      dm raid: add missing cleanup in raid_ctr() · 2cff6c87
      Wenwen Wang committed
      commit dc1a3e8e0cc6b2293b48c044710e63395aeb4fb4 upstream.
      
      If rs_prepare_reshape() fails, no cleanup is executed, leading to a
      leak of the raid_set structure allocated at the beginning of
      raid_ctr(). To fix this issue, go to the label 'bad' if the error
      occurs.
      
      Fixes: 11e47232 ("dm raid: stop keeping raid set frozen altogether")
      Cc: stable@vger.kernel.org
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2cff6c87
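A minimal user-space sketch of the "jump to a common cleanup label on error" pattern this commit message refers to. The `toy_*` functions and the live-object counter are hypothetical and only illustrate the shape of the fix, not the real raid_ctr().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Number of allocated-but-not-freed sets; lets a caller observe leaks. */
static int toy_live_sets;

struct toy_raid_set {
	int unused;
};

static struct toy_raid_set *toy_set_alloc(void)
{
	struct toy_raid_set *rs = calloc(1, sizeof(*rs));

	if (rs)
		toy_live_sets++;
	return rs;
}

static void toy_set_free(struct toy_raid_set *rs)
{
	free(rs);
	toy_live_sets--;
}

/* Stand-in for a preparation step that can fail. */
static int toy_prepare_reshape(int fail)
{
	return fail ? -EINVAL : 0;
}

static int toy_ctr(int fail_reshape, struct toy_raid_set **out)
{
	struct toy_raid_set *rs;
	int r;

	rs = toy_set_alloc();
	if (!rs)
		return -ENOMEM;

	r = toy_prepare_reshape(fail_reshape);
	if (r)
		goto bad;	/* a plain "return r" here would leak rs */

	*out = rs;
	return 0;

bad:
	toy_set_free(rs);
	return r;
}
```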
    • M
      dm integrity: fix a crash due to BUG_ON in __journal_read_write() · 795b0572
      Mikulas Patocka committed
      commit 5729b6e5a1bcb0bbc28abe82d749c7392f66d2c7 upstream.
      
      Fix a crash that was introduced by the commit 724376a0. The crash is
      reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468
      
      When reading from the integrity device, the function
      dm_integrity_map_continue calls find_journal_node to find out if the
      location to read is present in the journal. Then, it calculates how many
      sectors are consecutively stored in the journal. Then, it locks the range
      with add_new_range and wait_and_add_new_range.
      
      The problem is that during wait_and_add_new_range, we hold no locks (we
      don't hold ic->endio_wait.lock and we don't hold a range lock), so the
      journal may change arbitrarily while wait_and_add_new_range sleeps.
      
      The code then goes to __journal_read_write and hits
      BUG_ON(journal_entry_get_sector(je) != logical_sector); because the
      journal has changed.
      
      In order to fix this bug, we need to re-check the journal location after
      wait_and_add_new_range. We restrict the length to one block in order to
      not complicate the code too much.
      
      Fixes: 724376a0 ("dm integrity: implement fair range locks")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      795b0572
    • Z
      dm btree: fix order of block initialization in btree_split_beneath · 8114012d
      ZhangXiaoxu committed
      commit e4f9d6013820d1eba1432d51dd1c5795759aa77f upstream.
      
      When btree_split_beneath() splits a node into two new children, it will
      allocate two blocks: left and right.  If the right block's allocation
      fails, the left block will be unlocked and marked dirty.  If this
      happens, the left block's content is zero, because it wasn't
      initialized with the btree struct before the attempt to allocate the
      right block.  Upon return, when flushing the left block to disk, the
      validator will fail when checking this block.  Then a BUG_ON is raised.
      
      Fix this by completely initializing the left block before allocating and
      initializing the right block.
      
      Fixes: 4dcb8b57 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8114012d
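A minimal sketch of the ordering this commit message describes: the left block is fully initialized before the right block is allocated, so a failed second allocation never leaves a zero-filled block for the validator to see. The `toy_*` names, the magic value, and the allocation-failure knob are hypothetical; this is not the dm-btree code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_NODE_MAGIC 0x6e6f6465u	/* arbitrary marker the toy validator checks */

struct toy_block {
	unsigned int magic;
	unsigned int nr_entries;
};

/* Test knobs: the allocation whose index equals toy_fail_at returns NULL. */
static int toy_alloc_count;
static int toy_fail_at = -1;

static struct toy_block *toy_new_block(void)
{
	if (toy_alloc_count++ == toy_fail_at)
		return NULL;
	return calloc(1, sizeof(struct toy_block));	/* zero-filled */
}

static void toy_init_node(struct toy_block *b, unsigned int nr_entries)
{
	b->magic = TOY_NODE_MAGIC;
	b->nr_entries = nr_entries;
}

/* Stand-in for the validator that runs when a block is written out. */
static int toy_validate(const struct toy_block *b)
{
	return b->magic == TOY_NODE_MAGIC;
}

/*
 * On failure of the second allocation the left block is still handed
 * back through left_out, modelling a dirty block that will be flushed.
 */
static int toy_split_beneath(struct toy_block **left_out,
			     struct toy_block **right_out)
{
	struct toy_block *left, *right;

	left = toy_new_block();
	if (!left)
		return -ENOMEM;
	toy_init_node(left, 1);		/* initialize before allocating right */
	*left_out = left;

	right = toy_new_block();
	if (!right)
		return -ENOMEM;
	toy_init_node(right, 1);
	*right_out = right;
	return 0;
}
```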
    • D
      dm kcopyd: always complete failed jobs · e0fb8135
      Dmitry Fomichev committed
      commit d1fef41465f0e8cae0693fb184caa6bfafb6cd16 upstream.
      
      This patch fixes a problem in dm-kcopyd that may leave jobs in the
      complete queue indefinitely in the event of backing storage failure.
      
      This behavior has been observed while running 100% write file fio
      workload against an XFS volume created on top of a dm-zoned target
      device. If the underlying storage of dm-zoned goes to offline state
      under I/O, kcopyd sometimes never issues the end copy callback and
      dm-zoned reclaim work hangs indefinitely waiting for that completion.
      
      This behavior was traced down to the error handling code in the
      process_jobs() function, which places the failed job on the complete_jobs
      queue but doesn't wake up the job handler. In case of backing device
      failure, all outstanding jobs may end up going to the complete_jobs queue
      via this code path and then stay there forever because there are no
      more successful I/O jobs to wake up the job handler.
      
      This patch adds a wake() call to always wake up kcopyd job wait queue
      for all I/O jobs that fail before dm_io() gets called for that job.
      
      The patch also sets the write error status in all sub jobs that are
      failed because their master job has failed.
      
      Fixes: b73c67c2 ("dm kcopyd: add sequential write feature")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0fb8135
    • M
      Revert "dm bufio: fix deadlock with loop device" · b608a5a2
      Mikulas Patocka committed
      commit cf3591ef832915892f2499b7e54b51d4c578b28c upstream.
      
      Revert the commit bd293d071ffe65e645b4d8104f9d8fe15ea13862. The proper
      fix has been made available with commit d0a255e795ab ("loop: set
      PF_MEMALLOC_NOIO for the worker thread").
      
      Note that the fix offered by commit bd293d071ffe doesn't really prevent
      the deadlock from occurring - if we look at the stacktrace reported by
      Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
      i.e. it has already successfully taken the mutex. Changing the mutex
      from mutex_lock to mutex_trylock won't help with deadlocks that happen
      afterwards.
      
      PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
         #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
         #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
         #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
         #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
         #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
         #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
         #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
         #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
         #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
         #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
        #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
        #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
        #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
        #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
        #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: bd293d071ffe ("dm bufio: fix deadlock with loop device")
      Depends-on: d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b608a5a2
  5. 25 Aug, 2019 1 commit
    • M
      dm: disable DISCARD if the underlying storage no longer supports it · a1cd2f70
      Mike Snitzer committed
      commit bcb44433bba5eaff293888ef22ffa07f1f0347d6 upstream.
      
      Some storage devices report support for discard commands like
      WRITE_SAME_16 with unmap, but reject discard commands sent to the
      storage device.  This is a clear storage firmware bug but it doesn't
      change the fact that should a program cause discards to be sent to a
      multipath device layered on this buggy storage, all paths can end up
      failed at the same time from the discards, causing possible I/O loss.
      
      The first discard to a path will fail with Illegal Request, Invalid
      field in cdb, e.g.:
       kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
       kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
       kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
       kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
       kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
      
      The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
      device disables its support for discard on this path, and because of the
      BLK_STS_TARGET error multipath fails the discard without failing any
      path or retrying down a different path.  But subsequent discards can
      cause path failures.  Any discard sent to a path which already failed
      a discard ends up failing with EIO from blk_cloned_rq_check_limits with
      an "over max size limit" error since the discard limit was set to 0 by
      the sd driver for the path.  As the error is EIO, this now fails the
      path and multipath tries to send the discard down the next path.  This
      cycle continues as discards are sent until all paths fail.
      
      Fix this by training DM core to disable DISCARD if the underlying
      storage already did so.
      
      Also, fix branching in dm_done() and clone_endio() to reflect the
      mutually exclusive nature of the IO operations in question.
      
      Cc: stable@vger.kernel.org
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      [Salvatore Bonaccorso: backported to 4.19: Adjust for context changes in
      drivers/md/dm-core.h]
      Signed-off-by: Salvatore Bonaccorso <carnil@debian.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1cd2f70
  6. 26 Jul, 2019 9 commits
    • J
      dm bufio: fix deadlock with loop device · 025eb12b
      Junxiao Bi committed
      commit bd293d071ffe65e645b4d8104f9d8fe15ea13862 upstream.
      
      When a thin volume is built on a loop device, if available memory is low,
      the following deadlock can be triggered:
      
      One process P1 allocates memory with GFP_FS flag, direct alloc fails,
      memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
      IO to complete in __try_evict_buffer().
      
      But this IO may never complete if issued to an underlying loop device
      that forwards it using direct-IO, which allocates memory using
      GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
      reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      will be invoked, and since the mutex is already held by P1 the loop
      thread will hang, and IO will never complete, resulting in an ABBA
      deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      025eb12b
    • D
      dm zoned: fix zone state management race · e380170b
      Damien Le Moal committed
      commit 3b8cafdd5436f9298b3bf6eb831df5eef5ee82b6 upstream.
      
      dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
      backend device is being actively read or written and so cannot be
      reclaimed. This flag is set as long as the zone atomic reference
      counter is not 0. When this atomic is decremented and reaches 0 (e.g.
      on BIO completion), the active flag is cleared and set again whenever
      the zone is reused and BIO issued with the atomic counter incremented.
      These 2 operations (atomic inc/dec and flag set/clear) are however not
      always executed atomically under the target metadata mutex lock and
      this causes the warning:
      
      WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
      
      in dmz_deactivate_zone() to be displayed. This problem is regularly
      triggered with xfstests generic/209, generic/300, generic/451 and
      xfs/077 with XFS being used as the file system on the dm-zoned target
      device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
      generic/300 trigger the warning with ext4 use.
      
      This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
      and managing the "ACTIVE" state by directly looking at the reference
      counter value. To do so, the functions dmz_activate_zone() and
      dmz_deactivate_zone() are changed to inline functions respectively
      calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
      is changed to an inline function calling atomic_read().
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e380170b
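A minimal sketch of deriving the "active" state directly from the reference counter, as this commit message describes, so there is no separate flag that can fall out of step with the count. It uses C11 `<stdatomic.h>` in user space rather than the kernel's atomic API, and the `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_zone {
	atomic_int refcount;	/* number of in-flight users of the zone */
};

static inline void toy_activate_zone(struct toy_zone *zone)
{
	atomic_fetch_add(&zone->refcount, 1);
}

static inline void toy_deactivate_zone(struct toy_zone *zone)
{
	atomic_fetch_sub(&zone->refcount, 1);
}

/* "Active" is simply "refcount is nonzero"; there is no flag to keep in sync. */
static inline bool toy_is_active(struct toy_zone *zone)
{
	return atomic_load(&zone->refcount) != 0;
}
```

With a single atomic variable as the only source of truth, there is no window in which the counter and a flag disagree.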
    • X
      raid5-cache: Need to do start() part job after adding journal device · eb6c84e4
      Xiao Ni committed
      commit d9771f5ec46c282d518b453c793635dbdc3a2a94 upstream.
      
      commit d5d885fd ("md: introduce new personality funciton start()")
      splits the init job into two parts. The first part run() does the jobs that
      do not require the md threads. The second part start() does the jobs that
      require the md threads.
      
      Currently only run() is called when a new journal device is added. The
      second part, start(), needs to be called too.
      
      Fixes: d5d885fd ("md: introduce new personality funciton start()")
      Cc: stable@vger.kernel.org #v4.9+
      Reported-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb6c84e4
    • C
      bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread · f11ba9df
      Coly Li committed
      commit f54d801dda14942dbefa00541d10603015b7859c upstream.
      
      Commit 9baf3097 ("bcache: fix for gc and write-back race") added a
      new work queue dc->writeback_write_wq, but forgot to destroy it in the
      error condition when creating dc->writeback_thread failed.
      
      This patch destroys dc->writeback_write_wq if kthread_create() returns
      an error pointer for dc->writeback_thread, so that a memory leak is avoided.
      
      Fixes: 9baf3097 ("bcache: fix for gc and write-back race")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f11ba9df
    • C
      bcache: fix mistaken sysfs entry for io_error counter · 2ab14861
      Coly Li committed
      commit 5461999848e0462c14f306a62923d22de820a59c upstream.
      
      In bch_cached_dev_files[] from drivers/md/bcache/sysfs.c, sysfs_errors is
      incorrectly listed. The correct entry should be sysfs_io_errors.
      
      This patch fixes the problem and now I/O errors of the cached device can be
      read from /sys/block/bcache<N>/bcache/io_errors.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ab14861
    • C
      bcache: ignore read-ahead request failure on backing device · 3c466df8
      Coly Li committed
      commit 578df99b1b0531d19af956530fe4da63d01a1604 upstream.
      
      When an md raid device (e.g. raid456) is used as the backing device,
      read-ahead requests on a degrading and recovering md raid device might be
      failed immediately by the md raid code, even though the md raid array can
      still be read or written by normal I/O requests. Therefore such failed
      read-ahead requests are not real hardware failures. Furthermore, after
      degrading and recovering is accomplished, read-ahead requests will be
      handled by the md raid array again.
      
      In this condition, I/O failures of read-ahead requests don't indicate
      the real health status (because normal I/O is still served), so they should
      not be counted in the I/O error counter dc->io_errors.
      
      Since there is no simple way to detect whether the backing device is an
      md raid device, this patch simply ignores I/O failures for read-ahead
      bios on the backing device, to avoid a bogus backing device failure on a
      degrading md raid array.
      Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c466df8
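A minimal sketch of the accounting rule this commit message describes: a failed read-ahead bio is ignored, while any other failed bio increments the error counter. The flag value, struct layouts, and `toy_*` names are assumptions for illustration; only the idea of testing a read-ahead flag before counting comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_REQ_RAHEAD (1u << 0)	/* hypothetical bit standing in for a read-ahead flag */

struct toy_bio {
	unsigned int opf;	/* operation flags */
	int status;		/* 0 on success, nonzero on failure */
};

struct toy_cached_dev {
	unsigned int io_errors;
};

/* Returns true if the failure was counted against the backing device. */
static bool toy_count_backing_io_error(struct toy_cached_dev *dc,
				       const struct toy_bio *bio)
{
	if (!bio->status)
		return false;		/* nothing failed */

	if (bio->opf & TOY_REQ_RAHEAD)
		return false;		/* read-ahead failure: not a health signal */

	dc->io_errors++;
	return true;
}
```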
    • C
      bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · 4fc48cd2
      Coly Li committed
      commit ba82c1ac1667d6efb91a268edb13fc9cdaecec9b upstream.
      
      This reverts commit 6268dc2c.
      
      This patch depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal") which is reverted in the previous patch. So
      revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fc48cd2
    • C
      bcache: Revert "bcache: fix high CPU occupancy during journal" · ab966241
      Coly Li committed
      commit 249a5f6da57c28a903c75d81505d58ec8c10030d upstream.
      
      This reverts commit c4dc2497.
      
      This patch enlarges a race between the normal btree flush code path and
      flush_btree_write(), which causes a deadlock when journal space is
      exhausted. Reverting this patch shrinks the race window from 128 btree
      nodes to only 1 btree node.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab966241
    • C
      Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" · 58169c18
      Coly Li committed
      commit 695277f16b3a102fcc22c97fdf2de77c7b19f0b3 upstream.
      
      This reverts commit 6147305c.
      
      Although this patch helps the failed bcache device to stop faster when
      too many I/O errors are detected on the corresponding cached device, setting
      the CACHE_SET_IO_DISABLE bit in cache set c->flags was not a good idea. This
      operation will disable all I/Os on the cache set, which means other attached
      bcache devices won't work either.
      
      Without this patch, the failed bcache device can also be stopped
      eventually once internal I/O (e.g. writeback) is accomplished. Therefore
      I revert it here.
      
      Fixes: 6147305c ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
      Reported-by: Yong Li <mr.liyong@qq.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      58169c18