1. 27 May 2020, 3 commits
    • C
      bcache: asynchronous devices registration · 9e23ccf8
      Authored by Coly Li
      When there is a lot of data cached on the cache device, the bcache
      internal btree can take a very long time to validate during backing
      device and cache device registration. In my test, it may take 55+
      minutes to check all the internal btree nodes.
      
      The problem is that the registration is invoked by udev rules, and
      udevd has a 180-second timeout by default. If the btree node checking
      time is longer than the udevd timeout, the registering process will be
      killed by udevd with SIGKILL. If the registering process has a pending
      signal, creating a kthread for bcache will fail and the device
      registration will fail. The result is that for a bcache device with a
      lot of data cached on the cache device, the device node /dev/bcache<N>
      may never be created, due to the very long btree checking time.
      
      A solution that avoids the 180-second udevd timeout is to register
      devices asynchronously. That is, after the cache or backing device path
      is written into /sys/fs/bcache/register_async, the kernel creates a
      kworker and moves all the btree node checking (for a cache device) or
      dirty data counting (for a backing device) into the kworker context.
      The kworker is then scheduled on system_wq and the registration code
      returns immediately to the user-space udev rule task. This way, the
      udev task for the bcache rule completes in seconds; no matter how long
      the kworker takes, it won't be killed by udevd on a timeout.
      
      After all the checking and counting are done asynchronously in the
      kworker, the bcache device will eventually be created successfully.
      
      This patch makes the above change and adds a sysfs file
      /sys/fs/bcache/register_async. Writing the path of the device being
      registered into this sysfs file performs the asynchronous registration.
      
      The register_async interface is for rare conditions and is not meant
      for common users. In the future I plan to make asynchronous
      registration the default behavior, depending on the feedback for this
      patch.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e23ccf8
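
      The shape of the asynchronous path described above is a plain workqueue
      hand-off: queue the slow part on system_wq and return from the sysfs
      write immediately. A minimal hedged sketch; the struct, helper and store
      signature here are illustrative assumptions, not the actual patch:

        struct async_reg_args {                     /* hypothetical container */
                struct work_struct reg_work;
                char *path;                         /* device path from udev */
        };

        static void register_device_async(struct work_struct *work)
        {
                struct async_reg_args *args =
                        container_of(work, struct async_reg_args, reg_work);

                /* slow part: btree node checking or dirty data counting,
                 * now running in kworker context with no udevd timeout */
                do_slow_registration(args->path);   /* assumed helper */

                kfree(args->path);
                kfree(args);
        }

        static ssize_t register_async_store(struct kobject *k,
                                            struct kobj_attribute *attr,
                                            const char *buf, size_t size)
        {
                struct async_reg_args *args = kzalloc(sizeof(*args), GFP_KERNEL);

                if (!args)
                        return -ENOMEM;
                args->path = kstrndup(buf, size, GFP_KERNEL);
                if (!args->path) {
                        kfree(args);
                        return -ENOMEM;
                }

                INIT_WORK(&args->reg_work, register_device_async);
                queue_work(system_wq, &args->reg_work);

                /* return to the udev rule task right away */
                return size;
        }
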
    • C
      bcache: fix refcount underflow in bcache_device_free() · 86da9f73
      Authored by Coly Li
      The problematic code piece in bcache_device_free() is,
      
       785 static void bcache_device_free(struct bcache_device *d)
       786 {
       787     struct gendisk *disk = d->disk;
       [snipped]
       799     if (disk) {
       800             if (disk->flags & GENHD_FL_UP)
       801                     del_gendisk(disk);
       802
       803             if (disk->queue)
       804                     blk_cleanup_queue(disk->queue);
       805
       806             ida_simple_remove(&bcache_device_idx,
       807                               first_minor_to_idx(disk->first_minor));
       808             put_disk(disk);
       809         }
       [snipped]
       816 }
      
      At line 808, put_disk(disk) may underflow the kobject refcount of
      'disk'.
      
      Here is how to reproduce the issue,
      - Attach the backing device to a cache device and do random writes to
        make the cache dirty.
      - Stop the bcache device while the cache device still holds dirty data
        of the backing device.
      - Register only the backing device back, NOT the cache device.
      - The bcache device node /dev/bcache0 won't show up, because the
        backing device is waiting for the cache device that holds its
        missing dirty data to show up.
      - Now echo 1 into /sys/fs/bcache/pendings_cleanup to stop the pending
        backing device.
      - After the pending backing device is stopped, check the kernel
        messages with 'dmesg'; a use-after-free warning from KASAN reports
        that the refcount of the kobject linked to 'disk' has underflowed.
      
      The refcount dropped at line 808 in the code above is the one taken by
      add_disk(d->disk) in bch_cached_dev_run(). But in the above condition
      the cache device is not registered, so bch_cached_dev_run() never gets
      called and that refcount is never taken. Calling put_disk() on a
      gendisk kobject whose reference was never added triggers the underflow
      warning.
      
      This patch checks whether GENHD_FL_UP is set in disk->flags; if it is
      not set, the bcache device was never added, so put_disk() is skipped
      and the underflow issue is avoided.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      86da9f73
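
      In short, put_disk() must only be called when add_disk() actually took
      the reference. A hedged sketch of the fixed cleanup logic (not the
      verbatim patch):

        if (disk) {
                bool disk_added = (disk->flags & GENHD_FL_UP) != 0;

                if (disk_added)
                        del_gendisk(disk);

                if (disk->queue)
                        blk_cleanup_queue(disk->queue);

                ida_simple_remove(&bcache_device_idx,
                                  first_minor_to_idx(disk->first_minor));
                /* only drop the reference that add_disk() took */
                if (disk_added)
                        put_disk(disk);
        }
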
    • J
      bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Authored by Joe Perches
      Remove the trailing newline from the define of pr_fmt and add newlines
      to the uses.
      
      Miscellanea:
      
      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont,
        as the earlier conversion was inappropriately done, causing multiple
        lines to be emitted where only a single output line was desired
      o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple
        line output where only a single line output was desired
      o Coalesce formats
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46f5aa88
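
      The mechanical shape of the style change, shown on bcache's pr_fmt()
      (illustrative excerpt, not the full diff):

        /* before: every message got its newline from pr_fmt() */
        #define pr_fmt(fmt) "bcache: %s() " fmt "\n", __func__

        /* after: pr_fmt() carries no newline ... */
        #define pr_fmt(fmt) "bcache: %s() " fmt, __func__

        /* ... so each call site supplies its own */
        pr_err("couldn't allocate memory\n");
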
  2. 28 March 2020, 2 commits
  3. 13 February 2020, 1 commit
    • C
      bcache: Revert "bcache: shrink btree node cache after bch_btree_check()" · 309cc719
      Authored by Coly Li
      This reverts commit 1df3877f.
      
      In my testing, sometimes even when all the cached btree nodes are
      freed, creating the gc and allocator kernel threads may still fail. It
      turns out that kthread_run() may fail if there is a pending signal for
      the current task, and that pending signal is sent by the OOM killer,
      which is triggered by the memory consumption in bch_btree_check().
      
      Therefore explicitly shrinking the bcache btree node cache here does
      not help. Now that the shrinker callback has been improved and pending
      signals are ignored before creating kernel threads, this operation is
      unnecessary.
      
      This patch reverts commit 1df3877f ("bcache: shrink btree node
      cache after bch_btree_check()") because we have better improvements in
      place now.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      309cc719
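
      The "pending signals are ignored before creating kernel threads"
      improvement referenced above boils down to something like the following
      sketch; the placement and exact form are assumptions, and the change
      lives in the related patches, not in this revert:

        /* kthread_run()/kthread_create() bails out with -EINTR if the caller
         * has a fatal signal pending, e.g. a SIGKILL from the OOM killer,
         * so drop any pending signal before creating the gc thread */
        if (signal_pending(current))
                flush_signals(current);

        c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
        if (IS_ERR(c->gc_thread))
                return PTR_ERR(c->gc_thread);
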
  4. 01 February 2020, 1 commit
  5. 24 January 2020, 9 commits
  6. 14 November 2019, 5 commits
    • C
      bcache: add idle_max_writeback_rate sysfs interface · c5fcdedc
      Authored by Coly Li
      In writeback mode, if there is no regular I/O request for a while, the
      writeback rate is set to the maximum value (1 TB/s for now). This is
      good for most storage workloads, but some people still do not want the
      maximum writeback rate during I/O idle time.
      
      This patch adds a sysfs interface file idle_max_writeback_rate that
      permits people to disable the maximum writeback rate. The minimum
      writeback rate can then be advised via writeback_rate_minimum in the
      bcache device's sysfs interface.
      Reported-by: Christian Balzer <chibi@gol.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5fcdedc
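
      Conceptually the new knob is a boolean that gates the "jump to maximum
      rate when idle" branch of the writeback rate update path. A rough
      fragment, where the field, threshold and helper names are assumptions:

        /* sketch only, inside a set-at-max-rate style helper returning bool;
         * idle_max_writeback_rate_enabled mirrors the new sysfs file and
         * idle_threshold stands in for the existing idle heuristic */
        if (!atomic_read(&c->idle_max_writeback_rate_enabled))
                return false;       /* feature switched off via sysfs */

        if (atomic_read(&c->idle_counter) < idle_threshold)
                return false;       /* regular I/O seen recently */

        /* idle long enough and allowed: write back as fast as possible */
        set_writeback_rate_to_max(dc);  /* hypothetical helper */
        return true;
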
    • A
      bcache: fix deadlock in bcache_allocator · 84c529ae
      Authored by Andrea Righi
      bcache_allocator can call the following:
      
       bch_allocator_thread()
        -> bch_prio_write()
           -> bch_bucket_alloc()
              -> wait on &ca->set->bucket_wait
      
      But the wake up event on bucket_wait is supposed to come from
      bch_allocator_thread() itself => deadlock:
      
      [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
      [ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
      [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
      [ 1158.504419] Call Trace:
      [ 1158.504429]  __schedule+0x2a8/0x670
      [ 1158.504432]  schedule+0x2d/0x90
      [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
      [ 1158.504453]  ? wait_woken+0x80/0x80
      [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
      [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
      [ 1158.504491]  kthread+0x121/0x140
      [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
      [ 1158.504506]  ? kthread_park+0xb0/0xb0
      [ 1158.504510]  ret_from_fork+0x35/0x40
      
      Fix by making the call to bch_prio_write() non-blocking, so that
      bch_allocator_thread() never waits on itself.
      
      Moreover, make sure to wake up the garbage collector thread when
      bch_prio_write() is failing to allocate buckets.
      
      BugLink: https://bugs.launchpad.net/bugs/1784665
      BugLink: https://bugs.launchpad.net/bugs/1796292
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      84c529ae
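
      A hedged sketch of the reworked call site in bch_allocator_thread(); it
      assumes, per the description above, that bch_prio_write() now takes a
      "wait" flag and returns an error code:

        if (bch_prio_write(ca, false) < 0) {
                /* Allocation failed: wake gc so buckets can be reclaimed,
                 * and yield instead of sleeping on bucket_wait, which only
                 * this very thread would ever wake up. */
                set_current_state(TASK_INTERRUPTIBLE);
                wake_up_gc(ca->set);
                schedule();
        }
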
    • C
      bcache: add more accurate error messages in read_super() · aaf8dbea
      Authored by Coly Li
      The previous code returns only "Not a bcache superblock" for both a bad
      super block offset and a bad magic number. This patch adds more
      accurate error messages:
      - for super block unmatched offset:
        "Not a bcache superblock (bad offset)"
      - for super block unmatched magic number:
        "Not a bcache superblock (bad magic)"
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aaf8dbea
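
      The change in read_super() is simply two separate checks with two
      distinct error strings, roughly:

        err = "Not a bcache superblock (bad offset)";
        if (sb->offset != SB_SECTOR)
                goto err;

        err = "Not a bcache superblock (bad magic)";
        if (memcmp(sb->magic, bcache_magic, 16))
                goto err;
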
    • C
      bcache: fix static checker warning in bcache_device_free() · 2d886951
      Authored by Coly Li
      Commit cafe5635 ("bcache: A block layer cache") leads to the
      following static checker warning:
      
          ./drivers/md/bcache/super.c:770 bcache_device_free()
          warn: variable dereferenced before check 'd->disk' (see line 766)
      
      drivers/md/bcache/super.c
         762  static void bcache_device_free(struct bcache_device *d)
         763  {
         764          lockdep_assert_held(&bch_register_lock);
         765
         766          pr_info("%s stopped", d->disk->disk_name);
                                            ^^^^^^^^^
      Unchecked dereference.
      
         767
         768          if (d->c)
         769                  bcache_device_detach(d);
         770          if (d->disk && d->disk->flags & GENHD_FL_UP)
                          ^^^^^^^
      Check too late.
      
         771                  del_gendisk(d->disk);
         772          if (d->disk && d->disk->queue)
         773                  blk_cleanup_queue(d->disk->queue);
         774          if (d->disk) {
         775                  ida_simple_remove(&bcache_device_idx,
         776                                    first_minor_to_idx(d->disk->first_minor));
         777                  put_disk(d->disk);
         778          }
         779
      
      It is not 100% certain that the gendisk struct of the bcache device
      will always be there; the warning makes sense for the case where
      something goes wrong in the block core.
      
      This patch removes the static checker warning by checking d->disk
      before dereferencing it, to avoid a NULL pointer dereference.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d886951
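
      The fix amounts to moving every dereference of d->disk behind a NULL
      check, including the pr_info() the checker flagged. A sketch of the
      reworked function (remaining cleanup omitted, fallback message text is
      an assumption):

        static void bcache_device_free(struct bcache_device *d)
        {
                struct gendisk *disk = d->disk;

                lockdep_assert_held(&bch_register_lock);

                if (disk)
                        pr_info("%s stopped", disk->disk_name);
                else
                        pr_err("bcache device (NULL gendisk) stopped");

                if (d->c)
                        bcache_device_detach(d);

                if (disk) {
                        if (disk->flags & GENHD_FL_UP)
                                del_gendisk(disk);
                        if (disk->queue)
                                blk_cleanup_queue(disk->queue);
                        ida_simple_remove(&bcache_device_idx,
                                          first_minor_to_idx(disk->first_minor));
                        put_disk(disk);
                }

                /* ... bioset and stripe cleanup unchanged ... */
        }
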
    • G
      bcache: fix a lost wake-up problem caused by mca_cannibalize_lock · 34cf78bf
      Authored by Guoju Fang
      This patch fixes a lost wake-up problem caused by the race between
      mca_cannibalize_lock and bch_cannibalize_unlock.
      
      Consider two processes, A and B. Process A is executing
      mca_cannibalize_lock, while process B holds c->btree_cache_alloc_lock
      and is executing bch_cannibalize_unlock. The problem occurs in the
      window after process A has executed cmpxchg but before it executes
      prepare_to_wait. In this timeslice process B executes wake_up; after
      that, process A executes prepare_to_wait and sets its state to
      TASK_INTERRUPTIBLE. Process A then goes to sleep and nobody will ever
      wake it up. This problem can leave the bcache device hung.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34cf78bf
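
      The classic cure for this class of lost wake-up is to get onto the wait
      queue with prepare_to_wait() before testing the condition, so a
      wake_up() issued in between cannot be missed. A generic sketch of that
      ordering, illustrating the pattern rather than the literal bcache patch:

        DEFINE_WAIT(w);
        struct task_struct *old;

        for (;;) {
                /* enqueue ourselves BEFORE checking the condition, so a
                 * concurrent wake_up() between the check and the sleep
                 * cannot be lost */
                prepare_to_wait(&c->btree_cache_wait, &w, TASK_UNINTERRUPTIBLE);

                old = cmpxchg(&c->btree_cache_alloc_lock, NULL, current);
                if (!old || old == current)
                        break;                  /* cannibalize lock acquired */

                schedule();
        }
        finish_wait(&c->btree_cache_wait, &w);
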
  7. 22 July 2019, 1 commit
  8. 28 June 2019, 16 commits
    • C
      bcache: shrink btree node cache after bch_btree_check() · 1df3877f
      Authored by Coly Li
      When a cache set starts, bch_btree_check() checks all bkeys on the
      cache device by calculating their checksums. This operation consumes a
      huge amount of system memory if a lot of data is cached. Since bcache
      uses its own mca cache to maintain all the btree nodes it reads in, and
      only releases that cache space when the memory management code starts
      to shrink caches, the bcache mca cache competes for memory with user
      space applications until the memory manager calls the mca cache
      shrinker callback. This may negatively affect the performance of user
      space workloads (e.g. databases, or the I/O service of a distributed
      storage node).
      
      This patch calls the bcache mca shrinker routine to proactively release
      mca cache memory, to decrease system memory pressure and avoid a
      negative effect on overall system I/O performance.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1df3877f
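
      The proactive shrink can be expressed as a direct call into the
      already-registered mca shrinker once bch_btree_check() returns, along
      these lines (a sketch, assuming the usual shrinker fields):

        struct shrink_control sc = {
                .gfp_mask   = GFP_KERNEL,
                /* ask to scan roughly everything the btree check pulled in */
                .nr_to_scan = c->btree_cache_used * c->btree_pages,
        };

        /* c->shrink is the shrinker registered for the mca btree node cache */
        c->shrink.scan_objects(&c->shrink, &sc);
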
    • C
      bcache: fix potential deadlock in cached_def_free() · 7e865eba
      Authored by Coly Li
      When lockdep is enabled and the system is rebooted with a writeback
      mode bcache device, the following potential deadlock warning is
      reported by the lockdep engine.
      
      [  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
      [  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [  101.542054][  T401]
      [  101.542054][  T401] but task is already holding lock:
      [  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.548386][  T401]
      [  101.548386][  T401] which lock already depends on the new lock.
      [  101.548386][  T401]
      [  101.551874][  T401]
      [  101.551874][  T401] the existing dependency chain (in reverse order) is:
      [  101.555000][  T401]
      [  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [  101.557860][  T401]        process_one_work+0x277/0x640
      [  101.559661][  T401]        worker_thread+0x39/0x3f0
      [  101.561340][  T401]        kthread+0x125/0x140
      [  101.562963][  T401]        ret_from_fork+0x3a/0x50
      [  101.564718][  T401]
      [  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [  101.567701][  T401]        lock_acquire+0xb4/0x1c0
      [  101.569651][  T401]        flush_workqueue+0xae/0x4c0
      [  101.571494][  T401]        drain_workqueue+0xa9/0x180
      [  101.573234][  T401]        destroy_workqueue+0x17/0x250
      [  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
      [  101.577304][  T401]        process_one_work+0x2a4/0x640
      [  101.579357][  T401]        worker_thread+0x39/0x3f0
      [  101.581055][  T401]        kthread+0x125/0x140
      [  101.582709][  T401]        ret_from_fork+0x3a/0x50
      [  101.584592][  T401]
      [  101.584592][  T401] other info that might help us debug this:
      [  101.584592][  T401]
      [  101.588355][  T401]  Possible unsafe locking scenario:
      [  101.588355][  T401]
      [  101.590974][  T401]        CPU0                    CPU1
      [  101.592889][  T401]        ----                    ----
      [  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
      [  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
      [  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
      [  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
      [  101.605255][  T401]
      [  101.605255][  T401]  *** DEADLOCK ***
      [  101.605255][  T401]
      [  101.608310][  T401] 2 locks held by kworker/2:2/401:
      [  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.617480][  T401]
      [  101.617480][  T401] stack backtrace:
      [  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
      [  101.629239][  T401] Call Trace:
      [  101.630360][  T401]  dump_stack+0x85/0xcb
      [  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
      [  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
      [  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
      [  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
      [  101.638421][  T401]  ? find_held_lock+0x34/0xa0
      [  101.640015][  T401]  lock_acquire+0xb4/0x1c0
      [  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.643248][  T401]  flush_workqueue+0xae/0x4c0
      [  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
      [  101.648303][  T401]  drain_workqueue+0xa9/0x180
      [  101.649867][  T401]  destroy_workqueue+0x17/0x250
      [  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
      [  101.653328][  T401]  process_one_work+0x2a4/0x640
      [  101.655029][  T401]  worker_thread+0x39/0x3f0
      [  101.656693][  T401]  ? process_one_work+0x640/0x640
      [  101.658501][  T401]  kthread+0x125/0x140
      [  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  101.661985][  T401]  ret_from_fork+0x3a/0x50
      [  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the above potential deadlock may happen in the
      reboot/shutdown code path,
      1) bcache_reboot() is called first in the reboot/shutdown code path,
         and bcache_reboot() calls bcache_device_stop().
      2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->flags, then calls
         closure_queue(&d->cl) to invoke cached_dev_flush(). In turn,
         cached_dev_flush() calls cached_dev_free() via continue_at().
      3) In cached_dev_free(), after the writeback kthread
         dc->writeback_thread is stopped, the workqueue
         dc->writeback_write_wq is stopped by destroy_workqueue().
      4) Inside destroy_workqueue(), drain_workqueue() is called, which in
         turn calls flush_workqueue(). Then wq->lockdep_map is acquired by
         lock_map_acquire() in flush_workqueue(). After the lock is acquired,
         the rest of flush_workqueue() just waits for the workqueue to
         complete.
      5) Now look back at the writeback thread routine
         bch_writeback_thread(). In its main while-loop, write_dirty() is
         called via continue_at() in read_dirty_submit(), which is itself
         called via continue_at() from read_dirty(), invoked from the
         while-loop. Inside write_dirty() the work may be re-queued on
         workqueue dc->writeback_write_wq via continue_at(). This means that
         when the writeback kthread is stopped in cached_dev_free(), there
         may still be one work item queued on dc->writeback_write_wq to
         execute write_dirty() again.
      6) Now this work item is scheduled on dc->writeback_write_wq and run by
         process_one_work() (which is called by worker_thread()). Before the
         work routine is invoked, wq->lockdep_map is acquired.
      7) But wq->lockdep_map was already acquired in step 4), so an A-A lock
         (lockdep terminology) scenario happens.
      
      Indeed, on multi-core systems the above deadlock is very unlikely to
      happen, just as the code comment in process_one_work() says,
      2263     * AFAICT there is no possible deadlock scenario between the
      2264     * flush_work() and complete() primitives (except for
      2265     * single-threaded workqueues), so hiding them isn't a problem.
      
      But it is still good to fix such a lockdep warning, even if nobody runs
      bcache on a single-core system.
      
      The fix is simple. This patch solves the above potential deadlock by,
      - not destroying workqueue dc->writeback_write_wq in cached_dev_free();
      - flushing and destroying dc->writeback_write_wq in the writeback
        kthread routine bch_writeback_thread(), after the thread quits its
        main while-loop and before cached_dev_put() is called.
      
      With this fix, dc->writeback_write_wq is stopped and destroyed before
      the writeback kthread stops, so there is no longer any chance of an
      A-A locking on wq->lockdep_map, and such an A-A deadlock cannot happen
      any more.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7e865eba
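
      In code terms the teardown moves out of cached_dev_free() and into the
      tail of the writeback kthread, roughly like this sketch (main loop
      elided):

        static int bch_writeback_thread(void *arg)
        {
                struct cached_dev *dc = arg;

                while (!kthread_should_stop()) {
                        /* ... main writeback loop unchanged ... */
                }

                /* Tear the workqueue down here, after the main loop has quit
                 * but before dropping our reference, so no work item can
                 * still be queued on it by the time cached_dev_free() runs. */
                if (dc->writeback_write_wq) {
                        flush_workqueue(dc->writeback_write_wq);
                        destroy_workqueue(dc->writeback_write_wq);
                }

                cached_dev_put(dc);

                return 0;
        }
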
    • C
      bcache: acquire bch_register_lock later in cached_dev_free() · 80265d8d
      Authored by Coly Li
      When the lockdep engine is enabled, a lockdep warning can be observed
      when rebooting or shutting down the system,
      
      [ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
      [ 3142.776265][ T2649]
      [ 3142.777159][ T2649] ======================================================
      [ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
      [ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
      [ 3142.785684][ T2649] ------------------------------------------------------
      [ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
      [ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [ 3142.794678][ T2649]
      [ 3142.794678][ T2649] but task is already holding lock:
      [ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.801462][ T2649]
      [ 3142.801462][ T2649] which lock already depends on the new lock.
      [ 3142.801462][ T2649]
      [ 3142.805277][ T2649]
      [ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
      [ 3142.808902][ T2649]
      [ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
      [ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
      [ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
      [ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.820276][ T2649]        kthread+0x125/0x140
      [ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.823965][ T2649]
      [ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [ 3142.827244][ T2649]        process_one_work+0x277/0x640
      [ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.830958][ T2649]        kthread+0x125/0x140
      [ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.834915][ T2649]
      [ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
      [ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
      [ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
      [ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
      [ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
      [ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.852464][ T2649]        kthread+0x125/0x140
      [ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.855880][ T2649]
      [ 3142.855880][ T2649] other info that might help us debug this:
      [ 3142.855880][ T2649]
      [ 3142.859663][ T2649] Chain exists of:
      [ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
      [ 3142.859663][ T2649]
      [ 3142.865424][ T2649]  Possible unsafe locking scenario:
      [ 3142.865424][ T2649]
      [ 3142.868022][ T2649]        CPU0                    CPU1
      [ 3142.869885][ T2649]        ----                    ----
      [ 3142.871751][ T2649]   lock(&bch_register_lock);
      [ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
      [ 3142.876399][ T2649]                                lock(&bch_register_lock);
      [ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
      [ 3142.882064][ T2649]
      [ 3142.882064][ T2649]  *** DEADLOCK ***
      [ 3142.882064][ T2649]
      [ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
      [ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.898797][ T2649]
      [ 3142.898797][ T2649] stack backtrace:
      [ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
      [ 3142.911422][ T2649] Call Trace:
      [ 3142.912656][ T2649]  dump_stack+0x85/0xcb
      [ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
      [ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
      [ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
      [ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
      [ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
      [ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
      [ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
      [ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
      [ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
      [ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
      [ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
      [ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
      [ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
      [ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
      [ 3142.943389][ T2649]  kthread+0x125/0x140
      [ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
      [ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the deadlock happens.
      1) bcache_reboot() calls bcache_device_stop(), and inside
         bcache_device_stop() the BCACHE_DEV_CLOSING bit is set on d->flags.
         Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
      2) In cached_dev_flush(), cached_dev_free() is called via continue_at().
      3) In cached_dev_free(), when the writeback kthread of the cached
         device is stopped by kthread_stop(), dc->writeback_thread is woken
         up to quit its while-loop, and cached_dev_put() is called in
         bch_writeback_thread().
      4) Calling cached_dev_put() in the writeback kthread may drop dc->count
         to 0, so the dc->detach kworker is scheduled, which is initialized
         as cached_dev_detach_finish().
      5) Inside cached_dev_detach_finish(), the last line of code calls
         closure_put(&dc->disk.cl), which drops the last reference counter of
         closure dc->disk.cl, so the callback cached_dev_flush() gets called.
      Now cached_dev_flush() is called for the second time in this code path
      (the first time was in step 2). bch_register_lock is therefore acquired
      again, and an A-A lock (lockdep terminology) occurs.
      
      The root cause of the above A-A lock is that in cached_dev_free() the
      mutex bch_register_lock is held before stopping the writeback kthread
      and the other kworkers. Fortunately we now have the variable
      'bcache_is_reboot', which prevents device registration and
      unregistration during reboot/shutdown, so it is unnecessary to hold
      bch_register_lock that early any more.
      
      This is how this patch fixes the reboot/shutdown time A-A lock issue:
      by moving mutex_lock(&bch_register_lock) to a later location, just
      before atomic_read(&dc->running) in cached_dev_free(), the A-A lock
      problem is solved without introducing any reboot-time registration
      race.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80265d8d
    • C
      bcache: acquire bch_register_lock later in cached_dev_detach_finish() · 97ba3b81
      Authored by Coly Li
      Now that the variable bcache_is_reboot prevents device registration and
      unregistration during reboot, it is unnecessary to hold the mutex
      bch_register_lock before stopping the writeback_rate_update kworker and
      the writeback kthread. And if the kworker or kthread being stopped
      holds bch_register_lock inside its own routine (we used to have such a
      problem in the writeback thread, which Junhui Wang fixed), it is very
      easy to introduce a deadlock during the reboot/shutdown procedure.
      
      Therefore in this patch, bch_register_lock is acquired just before
      calling calc_cached_dev_sectors(), which is later than its original
      location in cached_dev_detach_finish().
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97ba3b81
    • C
      bcache: avoid a deadlock in bcache_reboot() · a59ff6cc
      Authored by Coly Li
      It is quite common to observe a deadlock in bcache_reboot() that hangs
      the system reboot process. The reason is that in bcache_reboot(), the
      mutex bch_register_lock is held while calling bch_cache_set_stop() and
      bcache_device_stop(). But in the process of stopping the cache set and
      the bcache device, bch_register_lock is acquired again. If the mutex is
      already held here, a deadlock happens inside the stopping process, and
      the whole system reboot hangs.
      
      The fix is to avoid holding bch_register_lock for the following loops
      in bcache_reboot(),
             list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                      bch_cache_set_stop(c);
      
              list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                      bcache_device_stop(&dc->disk);
      
      A module-level variable 'bcache_is_reboot' is added; it is set to true
      in bcache_reboot(). In register_bcache(), if bcache_is_reboot is true,
      the registration is rejected by returning -EBUSY immediately.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a59ff6cc
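
      A minimal sketch of the guard described above (simplified; the real
      patch touches more code paths):

        /* module-level flag, flipped once reboot/shutdown begins */
        static bool bcache_is_reboot;

        /* in bcache_reboot(), before walking bch_cache_sets and
         * uncached_devices without holding bch_register_lock: */
        bcache_is_reboot = true;

        /* in register_bcache(), reject registrations that arrive later: */
        if (bcache_is_reboot)
                return -EBUSY;
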
    • C
      bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed · 5c2a634c
      Authored by Coly Li
      In bch_cached_dev_attach(), after bch_cached_dev_writeback_start() is
      called, the writeback kthread and the writeback rate update kworker of
      the cached device are created. If the subsequent bch_cached_dev_run()
      fails, bch_cached_dev_attach() returns -ENOMEM without stopping the
      writeback-related kthread and kworker.
      
      This patch stops the writeback kthread and the writeback rate update
      kworker before returning -ENOMEM if bch_cached_dev_run() returns an
      error.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5c2a634c
    • C
      bcache: add pendings_cleanup to stop pending bcache device · 0c277e21
      Authored by Coly Li
      If a bcache device is in dirty state and its cache set is not
      registered, this bcache device will not appear in /dev/bcache<N>,
      and there is no way to stop it or remove the bcache kernel module.
      
      This is as-designed behavior, but sometimes people have to reboot the
      whole system to release or stop the pending backing device.
      
      This sysfs interface removes such pending bcache devices when anything
      is written into the sysfs file manually.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0c277e21
    • C
      bcache: remove "XXX:" comment line from run_cache_set() · 68a53c95
      Authored by Coly Li
      In previous bcache patches for Linux v5.2, the failure code path of
      run_cache_set() was tested and fixed. So now the following comment line
      can be removed from run_cache_set(),
      	/* XXX: test this, it's broken */
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      68a53c95
    • C
      bcache: improve error message in bch_cached_dev_run() · e0faa3d7
      Authored by Coly Li
      This patch adds more error messages in bch_cached_dev_run() to indicate
      the exact reason why an error value is returned. Note that when the
      "is running already" message is printed, pr_info() is used: although
      -EBUSY is returned in this case, the bcache device can continue to
      attach to the cache device and run, so it should not be an error-level
      message in the kernel log.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e0faa3d7
    • C
      bcache: add more error message in bch_cached_dev_attach() · 633bb2ce
      Authored by Coly Li
      This patch adds more error messages for attaching a cached device; this
      is helpful for debugging failures during bcache device start up.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      633bb2ce
    • C
      bcache: more detailed error message to bcache_device_link() · 4b6efb4b
      Authored by Coly Li
      This patch adds more accurate error messages for the specific
      sysfs_create_link() calls, to help debug failures during bcache device
      start up.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4b6efb4b
    • C
      bcache: add return value check to bch_cached_dev_run() · 0b13efec
      Authored by Coly Li
      This patch adds a return value check to bch_cached_dev_run(); now if an
      error happens inside bch_cached_dev_run(), it can be caught.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b13efec
    • C
      bcache: add io error counting in write_bdev_super_endio() · 08ec1e62
      Authored by Coly Li
      When the backing device super block is written by bch_write_bdev_super(),
      the bio completion callback write_bdev_super_endio() simply ignores the
      I/O status. But such a write request should also count against the
      backing device's health status if it fails.
      
      This patch checks bio->bi_status in write_bdev_super_endio(); if there
      is an error, bch_count_backing_io_errors() is called to count an I/O
      error in dc->io_errors.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08ec1e62
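
      The change boils down to inspecting bio->bi_status in the completion
      callback, roughly:

        static void write_bdev_super_endio(struct bio *bio)
        {
                struct cached_dev *dc = bio->bi_private;

                /* new: a failed super block write counts against the
                 * backing device's health */
                if (bio->bi_status)
                        bch_count_backing_io_errors(dc, bio);

                closure_put(&dc->sb_write);
        }
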
    • C
      bcache: avoid flushing btree node in cache_set_flush() if io disabled · e6dcbd3e
      Authored by Coly Li
      When cache_set_flush() is called because too many I/O errors were
      detected on the cache device and the cache set is retiring, it makes no
      sense to flush the cached btree nodes from c->btree_cache, because
      CACHE_SET_IO_DISABLE is already set in c->flags and all I/O to the
      cache device will be rejected.
      
      This patch checks in cache_set_flush() whether CACHE_SET_IO_DISABLE is
      set. If so, it skips flushing the cached btree nodes, which saves time
      and makes cache set retiring faster.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e6dcbd3e
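
      In cache_set_flush() the existing btree flush loop is simply wrapped in
      a CACHE_SET_IO_DISABLE test, roughly:

        /* don't bother flushing btree nodes if every write to the cache
         * device would be rejected anyway */
        if (!test_bit(CACHE_SET_IO_DISABLE, &c->flags)) {
                list_for_each_entry(b, &c->btree_cache, list) {
                        mutex_lock(&b->write_lock);
                        if (btree_node_dirty(b))
                                __bch_btree_node_write(b, NULL);
                        mutex_unlock(&b->write_lock);
                }
        }
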
    • C
      Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" · 695277f1
      Authored by Coly Li
      This reverts commit 6147305c.
      
      Although this patch helps a failed bcache device stop faster when too
      many I/O errors are detected on the corresponding cached device,
      setting the CACHE_SET_IO_DISABLE bit in the cache set's c->flags was
      not a good idea. It disables all I/O on the cache set, which means
      other attached bcache devices won't work either.
      
      Without this patch, the failed bcache device can still be stopped
      eventually once its internal I/O (e.g. writeback) completes. Therefore
      I revert it here.
      
      Fixes: 6147305c ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
      Reported-by: Yong Li <mr.liyong@qq.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      695277f1
    • C
      bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush() · b387e9b5
      Authored by Coly Li
      When system memory is under heavy pressure, bch_gc_thread_start()
      called from run_cache_set() may fail due to out of memory. In that
      condition, c->gc_thread is assigned ERR_PTR(-ENOMEM), not a NULL
      pointer. Then in the following failure code path bch_cache_set_error(),
      when cache_set_flush() gets called, the code piece that stops
      c->gc_thread is broken,
               if (c->gc_thread)
                       kthread_stop(c->gc_thread);
      
      And KASAN catches such a NULL pointer dereference problem, with the
      following warning information:
      
      [  561.207881] ==================================================================
      [  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
      [  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
      
      [  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
      [  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.207935] Workqueue: events cache_set_flush [bcache]
      [  561.207940] Call Trace:
      [  561.207948]  dump_stack+0x9a/0xeb
      [  561.207955]  ? kthread_stop+0x3b/0x440
      [  561.207960]  ? kthread_stop+0x3b/0x440
      [  561.207965]  kasan_report+0x176/0x192
      [  561.207973]  ? kthread_stop+0x3b/0x440
      [  561.207981]  kthread_stop+0x3b/0x440
      [  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  561.208008]  process_one_work+0x856/0x1620
      [  561.208015]  ? find_held_lock+0x39/0x1d0
      [  561.208028]  ? drain_workqueue+0x380/0x380
      [  561.208048]  worker_thread+0x87/0xb80
      [  561.208058]  ? __kthread_parkme+0xb6/0x180
      [  561.208067]  ? process_one_work+0x1620/0x1620
      [  561.208072]  kthread+0x326/0x3e0
      [  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  561.208090]  ret_from_fork+0x3a/0x50
      [  561.208110] ==================================================================
      [  561.208113] Disabling lock debugging due to kernel taint
      [  561.208115] irq event stamp: 11800231
      [  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
      [  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [  561.208129] #PF error: [WRITE]
      [  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
      [  561.405975] PGD 0 P4D 0
      [  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
      [  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
      [  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
      [  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
      [  561.791371] Workqueue: events cache_set_flush [bcache]
      [  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
      [  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
      [  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
      [  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
      [  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
      [  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
      [  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
      [  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
      [  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
      [  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
      [  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  563.915796] PKRU: 55555554
      [  563.915797] Call Trace:
      [  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
      [  563.915812]  process_one_work+0x856/0x1620
      [  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.033563]  ? find_held_lock+0x39/0x1d0
      [  564.033567]  ? drain_workqueue+0x380/0x380
      [  564.033574]  worker_thread+0x87/0xb80
      [  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.118042]  ? __kthread_parkme+0xb6/0x180
      [  564.118046]  ? process_one_work+0x1620/0x1620
      [  564.118048]  kthread+0x326/0x3e0
      [  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
      [  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.252441]  ret_from_fork+0x3a/0x50
      [  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
      [  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
      [  564.348360] CR2: 000000000000001c
      [  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
      
      Therefore it is not enough to only check whether c->gc_thread is NULL;
      we should use IS_ERR_OR_NULL() to check for both a NULL pointer and an
      error value.
      
      This patch changes the above buggy code piece in this way,
               if (!IS_ERR_OR_NULL(c->gc_thread))
                       kthread_stop(c->gc_thread);
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b387e9b5
  9. 30 April 2019, 1 commit
  10. 25 April 2019, 1 commit
    • S
      bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set · 95f18c9d
      Authored by Shenghui Wang
      
      In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
      to collect journal_replay(s) and filled by bch_journal_read().
      
      If all goes well, bch_journal_replay() will release the list of
      journal_replay(s) at the end of the branch.
      
      If something goes wrong, code flow will jump to the label "err:" and leave
      the list unreleased.
      
      This patch releases the list of journal_replay(s) when an error is
      detected.
      
      v1 -> v2:
      * Move the release code to the location after the label 'err:' to
        simplify the change.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      95f18c9d
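
      The release added after the 'err:' label can look roughly like this
      sketch (variable names are assumptions):

        err:
                /* free any journal_replay entries still sitting on the list */
                if (!list_empty(&journal)) {
                        struct journal_replay *l, *n;

                        list_for_each_entry_safe(l, n, &journal, list) {
                                list_del(&l->list);
                                kfree(l);
                        }
                }
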