1. 17 8月, 2019 5 次提交
  2. 26 7月, 2019 14 次提交
    • J
      dm bufio: fix deadlock with loop device · 025eb12b
      Junxiao Bi 提交于
      commit bd293d071ffe65e645b4d8104f9d8fe15ea13862 upstream.
      
      When thin-volume is built on loop device, if available memory is low,
      the following deadlock can be triggered:
      
      One process P1 allocates memory with GFP_FS flag, direct alloc fails,
      memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
      IO to complete in __try_evict_buffer().
      
      But this IO may never complete if issued to an underlying loop device
      that forwards it using direct-IO, which allocates memory using
      GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
      reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      will be invoked, and since the mutex is already held by P1 the loop
      thread will hang, and IO will never complete.  Resulting in ABBA
      deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      025eb12b
    • D
      dm zoned: fix zone state management race · e380170b
      Damien Le Moal 提交于
      commit 3b8cafdd5436f9298b3bf6eb831df5eef5ee82b6 upstream.
      
      dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
      backend device is being actively read or written and so cannot be
      reclaimed. This flag is set as long as the zone atomic reference
      counter is not 0. When this atomic is decremented and reaches 0 (e.g.
      on BIO completion), the active flag is cleared and set again whenever
      the zone is reused and BIO issued with the atomic counter incremented.
      These 2 operations (atomic inc/dec and flag set/clear) are however not
      always executed atomically under the target metadata mutex lock and
      this causes the warning:
      
      WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
      
      in dmz_deactivate_zone() to be displayed. This problem is regularly
      triggered with xfstests generic/209, generic/300, generic/451 and
      xfs/077 with XFS being used as the file system on the dm-zoned target
      device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
      generic/300 trigger the warning with ext4 use.
      
      This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
      and managing the "ACTIVE" state by directly looking at the reference
      counter value. To do so, the functions dmz_activate_zone() and
      dmz_deactivate_zone() are changed to inline functions respectively
      calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
      is changed to an inline function calling atomic_read().
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Reported-by: NMasato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e380170b
    • X
      raid5-cache: Need to do start() part job after adding journal device · eb6c84e4
      Xiao Ni 提交于
      commit d9771f5ec46c282d518b453c793635dbdc3a2a94 upstream.
      
      commit d5d885fd ("md: introduce new personality funciton start()")
      splits the init job to two parts. The first part run() does the jobs that
      do not require the md threads. The second part start() does the jobs that
      require the md threads.
      
      Now it just does run() in adding new journal device. It needs to do the
      second part start() too.
      
      Fixes: d5d885fd ("md: introduce new personality funciton start()")
      Cc: stable@vger.kernel.org #v4.9+
      Reported-by: NMichal Soltys <soltys@ziu.info>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb6c84e4
    • C
      bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread · f11ba9df
      Coly Li 提交于
      commit f54d801dda14942dbefa00541d10603015b7859c upstream.
      
      Commit 9baf3097 ("bcache: fix for gc and write-back race") added a
      new work queue dc->writeback_write_wq, but forgot to destroy it in the
      error condition when creating dc->writeback_thread failed.
      
      This patch destroys dc->writeback_write_wq if kthread_create() returns
      error pointer to dc->writeback_thread, then a memory leak is avoided.
      
      Fixes: 9baf3097 ("bcache: fix for gc and write-back race")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f11ba9df
    • C
      bcache: fix mistaken sysfs entry for io_error counter · 2ab14861
      Coly Li 提交于
      commit 5461999848e0462c14f306a62923d22de820a59c upstream.
      
      In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
      incorrectly inserted in. The correct entry should be sysfs_io_errors.
      
      This patch fixes the problem and now I/O errors of cached device can be
      read from /sys/block/bcache<N>/bcache/io_errors.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ab14861
    • C
      bcache: ignore read-ahead request failure on backing device · 3c466df8
      Coly Li 提交于
      commit 578df99b1b0531d19af956530fe4da63d01a1604 upstream.
      
      When md raid device (e.g. raid456) is used as backing device, read-ahead
      requests on a degrading and recovering md raid device might be failured
      immediately by md raid code, but indeed this md raid array can still be
      read or write for normal I/O requests. Therefore such failed read-ahead
      request are not real hardware failure. Further more, after degrading and
      recovering accomplished, read-ahead requests will be handled by md raid
      array again.
      
      For such condition, I/O failures of read-ahead requests don't indicate
      real health status (because normal I/O still be served), they should not
      be counted into I/O error counter dc->io_errors.
      
      Since there is no simple way to detect whether the backing divice is a
      md raid device, this patch simply ignores I/O failures for read-ahead
      bios on backing device, to avoid bogus backing device failure on a
      degrading md raid array.
      Suggested-and-tested-by: NThorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c466df8
    • C
      bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · 4fc48cd2
      Coly Li 提交于
      commit ba82c1ac1667d6efb91a268edb13fc9cdaecec9b upstream.
      
      This reverts commit 6268dc2c.
      
      This patch depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal") which is reverted in previous patch. So
      revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fc48cd2
    • C
      bcache: Revert "bcache: fix high CPU occupancy during journal" · ab966241
      Coly Li 提交于
      commit 249a5f6da57c28a903c75d81505d58ec8c10030d upstream.
      
      This reverts commit c4dc2497.
      
      This patch enlarges a race between normal btree flush code path and
      flush_btree_write(), which causes deadlock when journal space is
      exhausted. Reverts this patch makes the race window from 128 btree
      nodes to only 1 btree nodes.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab966241
    • C
      Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" · 58169c18
      Coly Li 提交于
      commit 695277f16b3a102fcc22c97fdf2de77c7b19f0b3 upstream.
      
      This reverts commit 6147305c.
      
      Although this patch helps the failed bcache device to stop faster when
      too many I/O errors detected on corresponding cached device, setting
      CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
      operation will disable all I/Os on cache set, which means other attached
      bcache devices won't work neither.
      
      Without this patch, the failed bcache device can also be stopped
      eventually if internal I/O accomplished (e.g. writeback). Therefore here
      I revert it.
      
      Fixes: 6147305c ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
      Reported-by: NYong Li <mr.liyong@qq.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58169c18
    • C
      bcache: fix potential deadlock in cached_def_free() · 95d08480
      Coly Li 提交于
      [ Upstream commit 7e865eba00a3df2dc8c4746173a8ca1c1c7f042e ]
      
      When enable lockdep and reboot system with a writeback mode bcache
      device, the following potential deadlock warning is reported by lockdep
      engine.
      
      [  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
      [  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [  101.542054][  T401]
      [  101.542054][  T401] but task is already holding lock:
      [  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.548386][  T401]
      [  101.548386][  T401] which lock already depends on the new lock.
      [  101.548386][  T401]
      [  101.551874][  T401]
      [  101.551874][  T401] the existing dependency chain (in reverse order) is:
      [  101.555000][  T401]
      [  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [  101.557860][  T401]        process_one_work+0x277/0x640
      [  101.559661][  T401]        worker_thread+0x39/0x3f0
      [  101.561340][  T401]        kthread+0x125/0x140
      [  101.562963][  T401]        ret_from_fork+0x3a/0x50
      [  101.564718][  T401]
      [  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [  101.567701][  T401]        lock_acquire+0xb4/0x1c0
      [  101.569651][  T401]        flush_workqueue+0xae/0x4c0
      [  101.571494][  T401]        drain_workqueue+0xa9/0x180
      [  101.573234][  T401]        destroy_workqueue+0x17/0x250
      [  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
      [  101.577304][  T401]        process_one_work+0x2a4/0x640
      [  101.579357][  T401]        worker_thread+0x39/0x3f0
      [  101.581055][  T401]        kthread+0x125/0x140
      [  101.582709][  T401]        ret_from_fork+0x3a/0x50
      [  101.584592][  T401]
      [  101.584592][  T401] other info that might help us debug this:
      [  101.584592][  T401]
      [  101.588355][  T401]  Possible unsafe locking scenario:
      [  101.588355][  T401]
      [  101.590974][  T401]        CPU0                    CPU1
      [  101.592889][  T401]        ----                    ----
      [  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
      [  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
      [  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
      [  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
      [  101.605255][  T401]
      [  101.605255][  T401]  *** DEADLOCK ***
      [  101.605255][  T401]
      [  101.608310][  T401] 2 locks held by kworker/2:2/401:
      [  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.617480][  T401]
      [  101.617480][  T401] stack backtrace:
      [  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
      [  101.629239][  T401] Call Trace:
      [  101.630360][  T401]  dump_stack+0x85/0xcb
      [  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
      [  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
      [  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
      [  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
      [  101.638421][  T401]  ? find_held_lock+0x34/0xa0
      [  101.640015][  T401]  lock_acquire+0xb4/0x1c0
      [  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.643248][  T401]  flush_workqueue+0xae/0x4c0
      [  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
      [  101.648303][  T401]  drain_workqueue+0xa9/0x180
      [  101.649867][  T401]  destroy_workqueue+0x17/0x250
      [  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
      [  101.653328][  T401]  process_one_work+0x2a4/0x640
      [  101.655029][  T401]  worker_thread+0x39/0x3f0
      [  101.656693][  T401]  ? process_one_work+0x640/0x640
      [  101.658501][  T401]  kthread+0x125/0x140
      [  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  101.661985][  T401]  ret_from_fork+0x3a/0x50
      [  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the above potential deadlock may happen in reboot/shutdown
      code path,
      1) bcache_reboot() is called firstly in the reboot/shutdown code path,
         then in bcache_reboot(), bcache_device_stop() is called.
      2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
         closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
         cached_dev_flush() calls cached_dev_free() via closure_at()
      3) In cached_dev_free(), after stopped writebach kthread
         dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
         destroy_workqueue().
      4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
         drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
         is acquired by lock_map_acquire() in flush_workqueue(). After the
         lock acquired the rest part of flush_workqueue() just wait for the
         workqueue to complete.
      5) Now we look back at writeback thread routine bch_writeback_thread(),
         in the main while-loop, write_dirty() is called via continue_at() in
         read_dirty_submit(), which is called via continue_at() in while-loop
         level called function read_dirty(). Inside write_dirty() it may be
         re-called on workqueeu dc->writeback_write_wq via continue_at().
         It means when the writeback kthread is stopped in cached_dev_free()
         there might be still one kworker queued on dc->writeback_write_wq
         to execute write_dirty() again.
      6) Now this kworker is scheduled on dc->writeback_write_wq to run by
         process_one_work() (which is called by worker_thread()). Before
         calling the kwork routine, wq->lockdep_map is acquired.
      7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
         (lockdep terminology) scenario happens.
      
      Indeed on multiple cores syatem, the above deadlock is very rare to
      happen, just as the code comments in process_one_work() says,
      2263     * AFAICT there is no possible deadlock scenario between the
      2264     * flush_work() and complete() primitives (except for
      	   single-threaded
      2265     * workqueues), so hiding them isn't a problem.
      
      But it is still good to fix such lockdep warning, even no one running
      bcache on single core system.
      
      The fix is simple. This patch solves the above potential deadlock by,
      - Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
      - Flush and destroy dc->writeback_write_wq in writebach kthread routine
        bch_writeback_thread(), where after quit the thread main while-loop
        and before cached_dev_put() is called.
      
      By this fix, dc->writeback_write_wq will be stopped and destroy before
      the writeback kthread stopped, so the chance for a A-A locking on
      wq->lockdep_map is disappeared, such A-A deadlock won't happen
      any more.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      95d08480
    • C
      bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush() · 4b7758e9
      Coly Li 提交于
      [ Upstream commit b387e9b58679c60f5b1e4313939bd4878204fc37 ]
      
      When system memory is in heavy pressure, bch_gc_thread_start() from
      run_cache_set() may fail due to out of memory. In such condition,
      c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
      failure code path bch_cache_set_error(), when cache_set_flush() gets
      called, the code piece to stop c->gc_thread is broken,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      
      And KASAN catches such NULL pointer deference problem, with the warning
      information:
      
      [  561.207881] ==================================================================
      [  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
      [  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
      
      [  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
      [  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.207935] Workqueue: events cache_set_flush [bcache]
      [  561.207940] Call Trace:
      [  561.207948]  dump_stack+0x9a/0xeb
      [  561.207955]  ? kthread_stop+0x3b/0x440
      [  561.207960]  ? kthread_stop+0x3b/0x440
      [  561.207965]  kasan_report+0x176/0x192
      [  561.207973]  ? kthread_stop+0x3b/0x440
      [  561.207981]  kthread_stop+0x3b/0x440
      [  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  561.208008]  process_one_work+0x856/0x1620
      [  561.208015]  ? find_held_lock+0x39/0x1d0
      [  561.208028]  ? drain_workqueue+0x380/0x380
      [  561.208048]  worker_thread+0x87/0xb80
      [  561.208058]  ? __kthread_parkme+0xb6/0x180
      [  561.208067]  ? process_one_work+0x1620/0x1620
      [  561.208072]  kthread+0x326/0x3e0
      [  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  561.208090]  ret_from_fork+0x3a/0x50
      [  561.208110] ==================================================================
      [  561.208113] Disabling lock debugging due to kernel taint
      [  561.208115] irq event stamp: 11800231
      [  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
      [  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [  561.208129] #PF error: [WRITE]
      [  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
      [  561.405975] PGD 0 P4D 0
      [  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
      [  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
      [  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
      [  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.791371] Workqueue: events cache_set_flush [bcache]
      [  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
      [  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
      [  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
      [  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
      [  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
      [  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
      [  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
      [  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
      [  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
      [  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
      [  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.915796] PKRU: 55555554
      [  563.915797] Call Trace:
      [  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  563.915812]  process_one_work+0x856/0x1620
      [  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.033563]  ? find_held_lock+0x39/0x1d0
      [  564.033567]  ? drain_workqueue+0x380/0x380
      [  564.033574]  worker_thread+0x87/0xb80
      [  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.118042]  ? __kthread_parkme+0xb6/0x180
      [  564.118046]  ? process_one_work+0x1620/0x1620
      [  564.118048]  kthread+0x326/0x3e0
      [  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.252441]  ret_from_fork+0x3a/0x50
      [  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
      [  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.348360] CR2: 000000000000001c
      [  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
      
      Therefore, it is not enough to only check whether c->gc_thread is NULL,
      we should use IS_ERR_OR_NULL() to check both NULL pointer and error
      value.
      
      This patch changes the above buggy code piece in this way,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4b7758e9
    • C
      bcache: acquire bch_register_lock later in cached_dev_free() · 81b88c05
      Coly Li 提交于
      [ Upstream commit 80265d8dfd77792e133793cef44a21323aac2908 ]
      
      When enable lockdep engine, a lockdep warning can be observed when
      reboot or shutdown system,
      
      [ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
      [ 3142.776265][ T2649]
      [ 3142.777159][ T2649] ======================================================
      [ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
      [ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
      [ 3142.785684][ T2649] ------------------------------------------------------
      [ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
      [ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [ 3142.794678][ T2649]
      [ 3142.794678][ T2649] but task is already holding lock:
      [ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.801462][ T2649]
      [ 3142.801462][ T2649] which lock already depends on the new lock.
      [ 3142.801462][ T2649]
      [ 3142.805277][ T2649]
      [ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
      [ 3142.808902][ T2649]
      [ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
      [ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
      [ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
      [ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.820276][ T2649]        kthread+0x125/0x140
      [ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.823965][ T2649]
      [ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [ 3142.827244][ T2649]        process_one_work+0x277/0x640
      [ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.830958][ T2649]        kthread+0x125/0x140
      [ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.834915][ T2649]
      [ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
      [ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
      [ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
      [ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
      [ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
      [ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.852464][ T2649]        kthread+0x125/0x140
      [ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.855880][ T2649]
      [ 3142.855880][ T2649] other info that might help us debug this:
      [ 3142.855880][ T2649]
      [ 3142.859663][ T2649] Chain exists of:
      [ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
      [ 3142.859663][ T2649]
      [ 3142.865424][ T2649]  Possible unsafe locking scenario:
      [ 3142.865424][ T2649]
      [ 3142.868022][ T2649]        CPU0                    CPU1
      [ 3142.869885][ T2649]        ----                    ----
      [ 3142.871751][ T2649]   lock(&bch_register_lock);
      [ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
      [ 3142.876399][ T2649]                                lock(&bch_register_lock);
      [ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
      [ 3142.882064][ T2649]
      [ 3142.882064][ T2649]  *** DEADLOCK ***
      [ 3142.882064][ T2649]
      [ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
      [ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.898797][ T2649]
      [ 3142.898797][ T2649] stack backtrace:
      [ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
      [ 3142.911422][ T2649] Call Trace:
      [ 3142.912656][ T2649]  dump_stack+0x85/0xcb
      [ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
      [ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
      [ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
      [ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
      [ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
      [ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
      [ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
      [ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
      [ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
      [ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
      [ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
      [ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
      [ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
      [ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
      [ 3142.943389][ T2649]  kthread+0x125/0x140
      [ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
      [ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the deadlock happens.
      1) bcache_reboot() calls bcache_device_stop(), then inside
         bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
         Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
      2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
      3) In cached_dev_free(), when stopping the writeback kthread of the
         cached device by kthread_stop(), dc->writeback_thread will be waken
         up to quite the kthread while-loop, then cached_dev_put() is called
         in bch_writeback_thread().
      4) Calling cached_dev_put() in writeback kthread may drop dc->count to
         0, then dc->detach kworker is scheduled, which is initialized as
         cached_dev_detach_finish().
      5) Inside cached_dev_detach_finish(), the last line of code is to call
         closure_put(&dc->disk.cl), which drops the last reference counter of
         closrure dc->disk.cl, then the callback cached_dev_flush() gets
         called.
      Now cached_dev_flush() is called for second time in the code path, the
      first time is in step 2). And again bch_register_lock will be acquired
      again, and a A-A lock (lockdep terminology) is happening.
      
      The root cause of the above A-A lock is in cached_dev_free(), mutex
      bch_register_lock is held before stopping writeback kthread and other
      kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
      prevent device registration or unregistration during reboot/shutdown
      time, so it is unncessary to hold bch_register_lock such early now.
      
      This is how this patch fixes the reboot/shutdown time A-A lock issue:
      After moving mutex_lock(&bch_register_lock) to a later location where
      before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
      problem can be solved without any reboot time registration race.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      81b88c05
    • C
      bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() · d81080a0
      Coly Li 提交于
      [ Upstream commit 383ff2183ad16a8842d1fbd9dd3e1cbd66813e64 ]
      
      When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
      bit is set, bch_journal() may continue to work because the journaling
      bkey might be still in write set yet. The caller of bch_journal() may
      believe the journal still work but the truth is in-memory journal write
      set won't be written into cache device any more. This behavior may
      introduce potential inconsistent metadata status.
      
      This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
      if the bit is set, bch_journal() returns NULL immediately to notice
      caller to know journal does not work.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d81080a0
    • C
      bcache: check CACHE_SET_IO_DISABLE in allocator code · 57cfb755
      Coly Li 提交于
      [ Upstream commit e775339e1ae1205b47d94881db124c11385e597c ]
      
      If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
      errors, currently allocator routines can still continue allocate
      space which may introduce inconsistent metadata state.
      
      This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
      routines,
      - bch_bucket_alloc()
      - __bch_bucket_alloc_set()
      Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
      may reject allocation request earlier to avoid potential inconsistent
      metadata.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      57cfb755
  3. 21 7月, 2019 2 次提交
    • M
      dm verity: use message limit for data block corruption message · 13684714
      Milan Broz 提交于
      [ Upstream commit 2eba4e640b2c4161e31ae20090a53ee02a518657 ]
      
      DM verity should also use DMERR_LIMIT to limit repeat data block
      corruption messages.
      Signed-off-by: NMilan Broz <gmazyland@gmail.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      13684714
    • J
      dm table: don't copy from a NULL pointer in realloc_argv() · 042be786
      Jerome Marchand 提交于
      [ Upstream commit a0651926553cfe7992166432e418987760882652 ]
      
      For the first call to realloc_argv() in dm_split_args(), old_argv is
      NULL and size is zero. Then memcpy is called, with the NULL old_argv
      as the source argument and a zero size argument. AFAIK, this is
      undefined behavior and generates the following warning when compiled
      with UBSAN on ppc64le:
      
      In file included from ./arch/powerpc/include/asm/paca.h:19,
                       from ./arch/powerpc/include/asm/current.h:16,
                       from ./include/linux/sched.h:12,
                       from ./include/linux/kthread.h:6,
                       from drivers/md/dm-core.h:12,
                       from drivers/md/dm-table.c:8:
      In function 'memcpy',
          inlined from 'realloc_argv' at drivers/md/dm-table.c:565:3,
          inlined from 'dm_split_args' at drivers/md/dm-table.c:588:9:
      ./include/linux/string.h:345:9: error: argument 2 null where non-null expected [-Werror=nonnull]
        return __builtin_memcpy(p, q, size);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/md/dm-table.c: In function 'dm_split_args':
      ./include/linux/string.h:345:9: note: in a call to built-in function '__builtin_memcpy'
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      042be786
  4. 14 7月, 2019 1 次提交
    • M
      md: fix for divide error in status_resync · 270ae00a
      Mariusz Tkaczyk 提交于
      [ Upstream commit 9642fa73d073527b0cbc337cc17a47d545d82cd2 ]
      
      Stopping external metadata arrays during resync/recovery causes
      retries, loop of interrupting and starting reconstruction, until it
      hit at good moment to stop completely. While these retries
      curr_mark_cnt can be small- especially on HDD drives, so subtraction
      result can be smaller than 0. However it is casted to uint without
      checking. As a result of it the status bar in /proc/mdstat while stopping
      is strange (it jumps between 0% and 99%).
      
      The real problem occurs here after commit 72deb455b5ec ("block: remove
      CONFIG_LBDAF"). Sector_div() macro has been changed, now the
      divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
      
      Check if db value can be really counted and replace these macro by
      div64_u64() inline.
      Signed-off-by: NMariusz Tkaczyk <mariusz.tkaczyk@intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      270ae00a
  5. 10 7月, 2019 1 次提交
    • G
      md/raid0: Do not bypass blocking queue entered for raid0 bios · 869eec89
      Guilherme G. Piccoli 提交于
      ```--------------------------------------------------------------
      This patch is not on mainline and is meant to 4.19 stable *only*.
      After the patch description there's a reasoning about that.
      ```
      
      --------------------------------------------------------------
      
      Commit cd4a4ae4 ("block: don't use blocking queue entered for
      recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order
      split bios bypass the blocking queue entering routine and use the live
      non-blocking version. It was a result of an extensive discussion in
      a linux-block thread[0], and the purpose of this change was to prevent
      a hung task waiting on a reference to drop.
      
      Happens that md raid0 split bios all the time, and more important,
      it changes their underlying device to the raid member. After the change
      introduced by this flag's usage, we experience various crashes if a raid0
      member is removed during a large write. This happens because the bio
      reaches the live queue entering function when the queue of the raid0
      member is dying.
      
      A simple reproducer of this behavior is presented below:
      a) Build kernel v4.19.56-stable with CONFIG_BLK_DEV_THROTTLING=y.
      
      b) Create a raid0 md array with 2 NVMe devices as members, and mount
      it with an ext4 filesystem.
      
      c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
      (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
      echo 1 > /sys/block/nvme1n1/device/device/remove
      (whereas nvme1n1 is the 2nd array member)
      
      This will trigger the following warning/oops:
      
      ------------[ cut here ]------------
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000155
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      RIP: 0010:blk_throtl_bio+0x45/0x970
      [...]
      Call Trace:
       generic_make_request_checks+0x1bf/0x690
       generic_make_request+0x64/0x3f0
       raid0_make_request+0x184/0x620 [raid0]
       ? raid0_make_request+0x184/0x620 [raid0]
       md_handle_request+0x126/0x1a0
       md_make_request+0x7b/0x180
       generic_make_request+0x19e/0x3f0
       submit_bio+0x73/0x140
      [...]
      
      This patch changes raid0 driver to fallback to the "old" blocking queue
      entering procedure, by clearing the BIO_QUEUE_ENTERED from raid0 bios.
      This prevents the crashes and restores the regular behavior of raid0
      arrays when a member is removed during a large write.
      
      [0] lore.kernel.org/linux-block/343bbbf6-64eb-879e-d19e-96aebb037d47@I-love.SAKURA.ne.jp
      
      ----------------------------
      Why this is not on mainline?
      ----------------------------
      
      The patch was originally submitted upstream in linux-raid and
      linux-block mailing-lists - it was initially accepted by Song Liu,
      but Christoph Hellwig[1] observed that there was a clean-up series
      ready to be accepted from Ming Lei[2] that fixed the same issue.
      
      The accepted patches from Ming's series in upstream are: commit
      47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and
      commit fe2008640ae3 ("block: don't protect generic_make_request_checks
      with blk_queue_enter"). Those patches basically do a clean-up in the
      block layer involving:
      
      1) Putting back blk_exit_queue() logic into __blk_release_queue(); that
      path was changed in the past and the logic from blk_exit_queue() was
      added to blk_cleanup_queue().
      
      2) Removing the guard/protection in generic_make_request_checks() with
      blk_queue_enter().
      
      The problem with Ming's series for -stable is that it relies in the
      legacy request IO path removal. So it's "backport-able" to v5.0+,
      but doing that for early versions (like 4.19) would incur in complex
      code changes. Hence, it was suggested by Christoph and Song Liu that
      this patch was submitted to stable only; otherwise merging it upstream
      would add code to fix a path removed in a subsequent commit.
      
      [1] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org
      [2] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Fixes: cd4a4ae4 ("block: don't use blocking queue entered for recursive bio submits")
      Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      869eec89
  6. 03 7月, 2019 1 次提交
  7. 19 6月, 2019 2 次提交
    • C
      bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached · e599bfe5
      Coly Li 提交于
      commit 1f0ffa67349c56ea54c03ccfd1e073c990e7411e upstream.
      
      When people set a writeback percent via sysfs file,
        /sys/block/bcache<N>/bcache/writeback_percent
      current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
      and schedules kworker dc->writeback_rate_update.
      
      If there is no cache set attached to, the writeback kernel thread is
      not running indeed, running dc->writeback_rate_update does not make
      sense and may cause NULL pointer deference when reference cache set
      pointer inside update_writeback_rate().
      
      This patch checks whether the cache set point (dc->disk.c) is NULL in
      sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
      schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
      means the cache device is attached to a cache set).
      
      This problem might be introduced from initial bcache commit, but
      commit 3fd47bfe ("bcache: stop dc->writeback_rate_update properly")
      changes part of the original code piece, so I add 'Fixes: 3fd47bfe'
      to indicate from which commit this patch can be applied.
      
      Fixes: 3fd47bfe ("bcache: stop dc->writeback_rate_update properly")
      Reported-by: NBjørn Forsman <bjorn.forsman@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NBjørn Forsman <bjorn.forsman@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e599bfe5
    • C
      bcache: fix stack corruption by PRECEDING_KEY() · 973fc2b3
      Coly Li 提交于
      commit 31b90956b124240aa8c63250243ae1a53585c5e2 upstream.
      
      Recently people report bcache code compiled with gcc9 is broken, one of
      the buggy behavior I observe is that two adjacent 4KB I/Os should merge
      into one but they don't. Finally it turns out to be a stack corruption
      caused by macro PRECEDING_KEY().
      
      See how PRECEDING_KEY() is defined in bset.h,
      437 #define PRECEDING_KEY(_k)                                       \
      438 ({                                                              \
      439         struct bkey *_ret = NULL;                               \
      440                                                                 \
      441         if (KEY_INODE(_k) || KEY_OFFSET(_k)) {                  \
      442                 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0);  \
      443                                                                 \
      444                 if (!_ret->low)                                 \
      445                         _ret->high--;                           \
      446                 _ret->low--;                                    \
      447         }                                                       \
      448                                                                 \
      449         _ret;                                                   \
      450 })
      
      At line 442, _ret points to address of a on-stack variable combined by
      KEY(), the life range of this on-stack variable is in line 442-446,
      once _ret is returned to bch_btree_insert_key(), the returned address
      points to an invalid stack address and this address is overwritten in
      the following called bch_btree_iter_init(). Then argument 'search' of
      bch_btree_iter_init() points to some address inside stackframe of
      bch_btree_iter_init(), exact address depends on how the compiler
      allocates stack space. Now the stack is corrupted.
      
      Fixes: 0eacac22 ("bcache: PRECEDING_KEY()")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NRolf Fokkens <rolf@rolffokkens.nl>
      Reviewed-by: NPierre JUHEN <pierre.juhen@orange.fr>
      Tested-by: NShenghui Wang <shhuiw@foxmail.com>
      Tested-by: NPierre JUHEN <pierre.juhen@orange.fr>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      973fc2b3
  8. 31 5月, 2019 5 次提交
    • A
      bcache: avoid clang -Wunintialized warning · 06740892
      Arnd Bergmann 提交于
      [ Upstream commit 78d4eb8ad9e1d413449d1b7a060f50b6efa81ebd ]
      
      clang has identified a code path in which it thinks a
      variable may be unused:
      
      drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
            [-Werror,-Wsometimes-uninitialized]
                              fifo_pop(&ca->free_inc, bucket);
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
       #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                      ^~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
              if (_r) {                                                       \
                  ^~
      drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
                              allocator_wait(ca, bch_allocator_push(ca, bucket));
                                                                        ^~~~~~
      drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
                      if (cond)                                               \
                          ^~~~
      drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
                              fifo_pop(&ca->free_inc, bucket);
                              ^
      drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
       #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                      ^
      drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
              if (_r) {                                                       \
              ^
      drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
                              long bucket;
                                         ^
      
      This cannot happen in practice because we only enter the loop
      if there is at least one element in the list.
      
      Slightly rearranging the code makes this clearer to both the
      reader and the compiler, which avoids the warning.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      06740892
    • C
      bcache: add failure check to run_cache_set() for journal replay · 330b6798
      Coly Li 提交于
      [ Upstream commit ce3e4cfb59cb382f8e5ce359238aa580d4ae7778 ]
      
      Currently run_cache_set() has no return value, if there is failure in
      bch_journal_replay(), the caller of run_cache_set() has no idea about
      such failure and just continue to execute following code after
      run_cache_set().  The internal failure is triggered inside
      bch_journal_replay() and being handled in async way. This behavior is
      inefficient, while failure handling inside bch_journal_replay(), cache
      register code is still running to start the cache set. Registering and
      unregistering code running as same time may introduce some rare race
      condition, and make the code to be more hard to be understood.
      
      This patch adds return value to run_cache_set(), and returns -EIO if
      bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
      such failure and stop registering code flow immedidately inside
      register_cache_set().
      
      If journal replay fails, run_cache_set() can report error immediately
      to register_cache_set(). This patch makes the failure handling for
      bch_journal_replay() be in synchronized way, easier to understand and
      debug, and avoid poetential race condition for register-and-unregister
      in same time.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      330b6798
    • T
      bcache: fix failure in journal relplay · cd83c788
      Tang Junhui 提交于
      [ Upstream commit 631207314d88e9091be02fbdd1fdadb1ae2ed79a ]
      
      journal replay failed with messages:
      Sep 10 19:10:43 ceph kernel: bcache: error on
      bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
      2057493-2057567 missing! (replaying 2057493-20766016), disabling
      caching
      
      The reason is in journal_reclaim(), when discard is enabled, we send
      discard command and reclaim those journal buckets whose seq is old
      than the last_seq_now, but before we write a journal with last_seq_now,
      the machine is restarted, so the journal with the last_seq_now is not
      written to the journal bucket, and the last_seq_wrote in the newest
      journal is old than last_seq_now which we expect to be, so when we doing
      replay, journals from last_seq_wrote to last_seq_now are missing.
      
      It's hard to write a journal immediately after journal_reclaim(),
      and it harmless if those missed journal are caused by discarding
      since those journals are already wrote to btree node. So, if miss
      seqs are started from the beginning journal, we treat it as normal,
      and only print a message to show the miss journal, and point out
      it maybe caused by discarding.
      
      Patch v2 add a judgement condition to ignore the missed journal
      only when discard enabled as Coly suggested.
      
      (Coly Li: rebase the patch with other changes in bch_journal_replay())
      Signed-off-by: NTang Junhui <tang.junhui.linux@gmail.com>
      Tested-by: NDennis Schridde <devurandom@gmx.net>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      cd83c788
    • C
      bcache: return error immediately in bch_journal_replay() · 29b166da
      Coly Li 提交于
      [ Upstream commit 68d10e6979a3b59e3cd2e90bfcafed79c4cf180a ]
      
      When failure happens inside bch_journal_replay(), calling
      cache_set_err_on() and handling the failure in async way is not a good
      idea. Because after bch_journal_replay() returns, registering code will
      continue to execute following steps, and unregistering code triggered
      by cache_set_err_on() is running in same time. First it is unnecessary
      to handle failure and unregister cache set in an async way, second there
      might be potential race condition to run register and unregister code
      for same cache set.
      
      So in this patch, if failure happens in bch_journal_replay(), we don't
      call cache_set_err_on(), and just print out the same error message to
      kernel message buffer, then return -EIO immediately caller. Then caller
      can detect such failure and handle it in synchrnozied way.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      29b166da
    • S
      bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC... · 8034a6b8
      Shenghui Wang 提交于
      bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
      
      [ Upstream commit 95f18c9d1310730d075499a75aaf13bcd60405a7 ]
      
      In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
      to collect journal_replay(s) and filled by bch_journal_read().
      
      If all goes well, bch_journal_replay() will release the list of
      jounal_replay(s) at the end of the branch.
      
      If something goes wrong, code flow will jump to the label "err:" and leave
      the list unreleased.
      
      This patch will release the list of journal_replay(s) in the case of
      error detected.
      
      v1 -> v2:
      * Move the release code to the location after label 'err:' to
        simply the change.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8034a6b8
  9. 26 5月, 2019 9 次提交
    • N
      md/raid: raid5 preserve the writeback action after the parity check · 43090805
      Nigel Croxon 提交于
      commit b2176a1dfb518d870ee073445d27055fea64dfb8 upstream.
      
      The problem is that any 'uptodate' vs 'disks' check is not precise
      in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
      device that might try to kick off writes and then skip the action.
      Better to prevent the raid driver from taking unexpected action *and* keep
      the system alive vs killing the machine with BUG_ON.
      
      Note: fixed warning reported by kbuild test robot <lkp@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNigel Croxon <ncroxon@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43090805
    • S
      Revert "Don't jump to compute_result state from check_result state" · 3d25b7f5
      Song Liu 提交于
      commit a25d8c327bb41742dbd59f8c545f59f3b9c39983 upstream.
      
      This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Nigel Croxon <ncroxon@redhat.com>
      Cc: Xiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d25b7f5
    • M
      dm mpath: always free attached_handler_name in parse_path() · f9eccf6c
      Martin Wilck 提交于
      commit 940bc471780b004a5277c1931f52af363c2fc9da upstream.
      
      Commit b592211c ("dm mpath: fix attached_handler_name leak and
      dangling hw_handler_name pointer") fixed a memory leak for the case
      where setup_scsi_dh() returns failure. But setup_scsi_dh may return
      success and not "use" attached_handler_name if the
      retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
      properly "steals" the pointer by nullifying it, freeing it
      unconditionally in parse_path() is safe.
      
      Fixes: b592211c ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
      Cc: stable@vger.kernel.org
      Reported-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NMartin Wilck <mwilck@suse.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f9eccf6c
    • M
      dm integrity: correctly calculate the size of metadata area · 9407680a
      Mikulas Patocka 提交于
      commit 30bba430ddf737978e40561198693ba91386dac1 upstream.
      
      When we use separate devices for data and metadata, dm-integrity would
      incorrectly calculate the size of the metadata device as if it had
      512-byte block size - and it would refuse activation with larger block
      size and smaller metadata device.
      
      Fix this so that it takes actual block size into account, which fixes
      the following reported issue:
      https://gitlab.com/cryptsetup/cryptsetup/issues/450
      
      Fixes: 356d9d52 ("dm integrity: allow separate metadata device")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9407680a
    • M
      dm delay: fix a crash when invalid device is specified · 3b92ff72
      Mikulas Patocka 提交于
      commit 81bc6d150ace6250503b825d9d0c10f7bbd24095 upstream.
      
      When the target line contains an invalid device, delay_ctr() will call
      delay_dtr() with NULL workqueue.  Attempting to destroy the NULL
      workqueue causes a crash.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b92ff72
    • D
      dm zoned: Fix zone report handling · 90cc7112
      Damien Le Moal 提交于
      commit 7aedf75ff740a98f3683439449cd91c8662d03b2 upstream.
      
      The function blkdev_report_zones() returns success even if no zone
      information is reported (empty report). Empty zone reports can only
      happen if the report start sector passed exceeds the device capacity.
      The conditions for this to happen are either a bug in the caller code,
      or, a change in the device that forced the low level driver to change
      the device capacity to a value that is lower than the report start
      sector. This situation includes a failed disk revalidation resulting in
      the disk capacity being changed to 0.
      
      If this change happens while dm-zoned is in its initialization phase
      executing dmz_init_zones(), this function may enter an infinite loop
      and hang the system. To avoid this, add a check to disallow empty zone
      reports and bail out early. Also fix the function dmz_update_zone() to
      make sure that the report for the requested zone was correctly obtained.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NShaun Tancheff <shaun@tancheff.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      90cc7112
    • N
      dm cache metadata: Fix loading discard bitset · ff0699a5
      Nikos Tsironis 提交于
      commit e28adc3bf34e434b30e8d063df4823ba0f3e0529 upstream.
      
      Add missing dm_bitset_cursor_next() to properly advance the bitset
      cursor.
      
      Otherwise, the discarded state of all blocks is set according to the
      discarded state of the first block.
      
      Fixes: ae4a46a1 ("dm cache metadata: use bitset cursor api to load discard bitset")
      Cc: stable@vger.kernel.org
      Signed-off-by: NNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff0699a5
    • Y
      md: add mddev->pers to avoid potential NULL pointer dereference · 10cb519c
      Yufen Yu 提交于
      commit ee37e62191a59d253fc916b9fc763deb777211e2 upstream.
      
      When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
      which can avoid potential NULL pointer derefence in fallowing
      add_bound_rdev().
      
      Fixes: a6da4ef8 ("md: re-add a failed disk")
      Cc: Xiao Ni <xni@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: <stable@vger.kernel.org> # 4.4+
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      10cb519c
    • N
      md: batch flush requests. · 3deaa1dc
      NeilBrown 提交于
      commit 2bc13b83e6298486371761de503faeffd15b7534 upstream.
      
      Currently if many flush requests are submitted to an md device is quick
      succession, they are serialized and can take a long to process them all.
      We don't really need to call flush all those times - a single flush call
      can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      allow any pending flush that was requested before the flush started
      to complete without waiting any more.
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409b     0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 m
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3deaa1dc