1. 25 Jul 2020, 4 commits
  2. 01 Jul 2020, 2 commits
  3. 15 Jun 2020, 3 commits
    • bcache: pr_info() format clean up in bcache_device_init() · 4b25bbf5
      Committed by Coly Li
      scripts/checkpatch.pl reports the following warning for the patch
      ("bcache: check and adjust logical block size for backing devices"):
          WARNING: quoted string split across lines
          #146: FILE: drivers/md/bcache/super.c:896:
          +  pr_info("%s: sb/logical block size (%u) greater than page size "
          +	       "(%lu) falling back to device logical block size (%u)",
      
      There are two things to fix up:
      - The kernel message should be printed on a single line.
      - Since v5.8, pr_info() no longer automatically adds a new line, so a
        '\n' should be added explicitly.
      
      This patch just does the above cleanup in bcache_device_init().
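
      Roughly, the cleaned-up call becomes the following (a sketch; the
      variable names are taken from the surrounding context and may differ
      slightly from the actual code):

          pr_info("%s: sb/logical block size (%u) greater than page size (%lu) falling back to device logical block size (%u)\n",
                  d->disk->disk_name, q->limits.logical_block_size,
                  PAGE_SIZE, bdev_logical_block_size(cached_bdev));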
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4b25bbf5
    • bcache: use delayed kworker for asynchronous devices registration · ee4a36f4
      Committed by Coly Li
      This patch changes the asynchronous registration kworker to a delayed
      kworker. There is a small probability that queue_work() queues the
      async registration kworker onto the same CPU, in which case the
      process writing the sysfs interface to register the bcache device may
      not return immediately. queue_delayed_work() in this patch delays 10
      jiffies before inserting the kworker into the run queue, which makes
      sure the registering process always returns to user space in time.
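
      A minimal sketch of the change (the work item and callback names are
      illustrative, not the exact ones in super.c):

          /* before: immediate queueing could land on the same CPU as the
           * sysfs writer and delay its return to user space */
          INIT_WORK(&args->reg_work, register_bcache_worker);
          queue_work(system_wq, &args->reg_work);

          /* after: delay 10 jiffies so the registering process always gets
           * back to user space first */
          INIT_DELAYED_WORK(&args->reg_dwork, register_bcache_worker);
          queue_delayed_work(system_wq, &args->reg_dwork, 10);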
      
      Fixes: 9e23ccf8 ("bcache: asynchronous devices registration")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ee4a36f4
    • bcache: check and adjust logical block size for backing devices · dcacbc12
      Committed by Mauricio Faria de Oliveira
      It's possible for a block driver to incorrectly set the logical block
      size to a value greater than the page size; e.g. bcache takes the
      value from the superblock, set by the user with make-bcache.
      
      This causes a BUG/NULL pointer dereference in the path:
      
        __blkdev_get()
        -> set_init_blocksize() // set i_blkbits based on ...
           -> bdev_logical_block_size()
              -> queue_logical_block_size() // ... this value
        -> bdev_disk_changed()
           ...
           -> blkdev_readpage()
              -> block_read_full_page()
                 -> create_page_buffers() // size = 1 << i_blkbits
                    -> create_empty_buffers() // give size/take pointer
                       -> alloc_page_buffers() // return NULL
                       .. BUG!
      
      Because alloc_page_buffers() is called with size > PAGE_SIZE, it
      initializes head = NULL, skips the loop, and returns head; then
      create_empty_buffers() gets (and uses) the NULL pointer.
      
      This has been around longer than commit ad6bf88a ("block:
      fix an integer overflow in logical block size"); however, it
      increased the range of values that can trigger the issue.
      
      Previously only 8k/16k/32k (on x86 with 4k page size) would trigger
      it, as greater values overflow the unsigned short to zero, and
      queue_logical_block_size() would then use the default of 512.
      
      Now the range with unsigned int is much larger, and users with the
      512k value, which happened to be zeroed previously and worked fine,
      started to hit this issue -- as the zero is gone,
      queue_logical_block_size() does return 512k (> PAGE_SIZE).
      
      Fix this by checking the bcache device's logical block size, and if
      it's greater than the page size, fall back to the backing/cached
      device's logical block size.
      
      This doesn't affect cache devices as those are still checked
      for block/page size in read_super(); only the backing/cached
      devices are not.
      
      Apparently it's a regression from commit 2903381f ("bcache:
      Take data offset from the bdev superblock."), moving the check
      into BCACHE_SB_VERSION_CDEV only. Now that we have superblocks
      of backing devices out there with this larger value, we cannot
      refuse to load them (i.e., have a similar check in _BDEV.)
      
      Ideally perhaps bcache should use all values from the backing
      device (physical/logical/io_min block size)? But for now just
      fix the problematic case.
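
      A sketch of the check in bcache_device_init(), assuming 'q' is the
      bcache device's request queue and 'cached_bdev' the backing
      block_device (names are illustrative):

          if (q->limits.logical_block_size > PAGE_SIZE && cached_bdev) {
                  /* warn (see the log line under "After:" below), then fall
                   * back to the backing device's logical block size */
                  blk_queue_logical_block_size(q,
                                  bdev_logical_block_size(cached_bdev));
          }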
      
      Test-case:
      
          # IMG=/root/disk.img
          # dd if=/dev/zero of=$IMG bs=1 count=0 seek=1G
          # DEV=$(losetup --find --show $IMG)
          # make-bcache --bdev $DEV --block 8k
            < see dmesg >
      
      Before:
      
          # uname -r
          5.7.0-rc7
      
          [   55.944046] BUG: kernel NULL pointer dereference, address: 0000000000000000
          ...
          [   55.949742] CPU: 3 PID: 610 Comm: bcache-register Not tainted 5.7.0-rc7 #4
          ...
          [   55.952281] RIP: 0010:create_empty_buffers+0x1a/0x100
          ...
          [   55.966434] Call Trace:
          [   55.967021]  create_page_buffers+0x48/0x50
          [   55.967834]  block_read_full_page+0x49/0x380
          [   55.972181]  do_read_cache_page+0x494/0x610
          [   55.974780]  read_part_sector+0x2d/0xaa
          [   55.975558]  read_lba+0x10e/0x1e0
          [   55.977904]  efi_partition+0x120/0x5a6
          [   55.980227]  blk_add_partitions+0x161/0x390
          [   55.982177]  bdev_disk_changed+0x61/0xd0
          [   55.982961]  __blkdev_get+0x350/0x490
          [   55.983715]  __device_add_disk+0x318/0x480
          [   55.984539]  bch_cached_dev_run+0xc5/0x270
          [   55.986010]  register_bcache.cold+0x122/0x179
          [   55.987628]  kernfs_fop_write+0xbc/0x1a0
          [   55.988416]  vfs_write+0xb1/0x1a0
          [   55.989134]  ksys_write+0x5a/0xd0
          [   55.989825]  do_syscall_64+0x43/0x140
          [   55.990563]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [   55.991519] RIP: 0033:0x7f7d60ba3154
          ...
      
      After:
      
          # uname -r
          5.7.0.bcachelbspgsz
      
          [   31.672460] bcache: bcache_device_init() bcache0: sb/logical block size (8192) greater than page size (4096) falling back to device logical block size (512)
          [   31.675133] bcache: register_bdev() registered backing device loop0
      
          # grep ^ /sys/block/bcache0/queue/*_block_size
          /sys/block/bcache0/queue/logical_block_size:512
          /sys/block/bcache0/queue/physical_block_size:8192
      Reported-by: Ryan Finnie <ryan@finnie.org>
      Reported-by: Sebastian Marsching <sebastian@marsching.com>
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dcacbc12
  4. 27 May 2020, 4 commits
    • bcache: configure the asynchronous registration to be experimental · 0c8d3fce
      Committed by Coly Li
      To avoid the experimental async registration interface being treated
      as a new kernel ABI by common users, this patch puts it behind an
      experimental kernel config option, BCACHE_ASYNC_REGISTRAION.
      
      This interface is for situations with extremely large amounts of
      cached data, to make sure the bcache device can always be created
      without hitting the udev timeout issue. For normal users, async or
      sync registration makes no difference.
      
      In the future, when we decide to make asynchronous registration the
      default behavior, this experimental interface may be removed.
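
      One plausible shape for the gating (the attribute array and names are
      illustrative; only the #ifdef around the register_async attribute is
      the point):

          static const struct attribute *files[] = {
                  &ksysfs_register.attr,
                  &ksysfs_register_quiet.attr,
          #ifdef CONFIG_BCACHE_ASYNC_REGISTRAION
                  /* only exposed when the experimental option is enabled */
                  &ksysfs_register_async.attr,
          #endif
                  NULL
          };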
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0c8d3fce
    • bcache: asynchronous devices registration · 9e23ccf8
      Committed by Coly Li
      When there is a lot of data cached on the cache device, the bcache
      internal btree can take a very long time to validate during backing
      device and cache device registration. In my test, it may take 55+
      minutes to check all the internal btree nodes.
      
      The problem is that the registration is invoked by udev rules, and
      udevd has a 180 second timeout by default. If the btree node checking
      time is longer than the udevd timeout, the registering process will
      be killed by udevd with SIGKILL. If the registering process has a
      pending signal, creating the kthread for bcache will fail and the
      device registration will fail. The result is that for a bcache device
      which caches a lot of data on the cache device, the bcache device
      node /dev/bcache<N> may never be created, due to the very long btree
      checking time.
      
      A solution to avoid the 180 second udevd timeout is to register
      devices in an asynchronous way: after writing the cache or backing
      device path into /sys/fs/bcache/register_async, the kernel code
      creates a kworker and moves all the btree node checking (for a cache
      device) or dirty data counting (for a cached device) into the kworker
      context. The kworker is then scheduled on system_wq and the
      registration code simply returns to the user space udev rule task.
      With this asynchronous approach, the udev task for the bcache rule
      completes in seconds; no matter how long the kworker context runs, it
      won't be killed by udevd for a timeout.
      
      After all the checking and counting are done asynchronously in the
      kworker, the bcache device will eventually be created successfully.
      
      This patch does the above change and adds a register sysfs file,
      /sys/fs/bcache/register_async. Writing the registering device path
      into this sysfs file will do the asynchronous registration.
      
      The register_async interface is for very rare conditions and won't be
      used by common users. In the future I plan to make asynchronous
      registration the default behavior, depending on the feedback for this
      patch.
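
      A condensed sketch of the register_async store path described above
      (struct and helper names are illustrative, not the exact ones in
      super.c):

          struct async_reg_args {
                  struct work_struct reg_work;
                  char *path;            /* device path written by the user */
          };

          static void register_bcache_worker(struct work_struct *work)
          {
                  struct async_reg_args *args =
                          container_of(work, struct async_reg_args, reg_work);

                  /* the long btree node checking / dirty data counting runs
                   * here, outside the udev task's 180 second window */
                  do_register_device(args->path);   /* illustrative helper */
                  kfree(args->path);
                  kfree(args);
          }

          /* in the register_async sysfs store routine */
          INIT_WORK(&args->reg_work, register_bcache_worker);
          queue_work(system_wq, &args->reg_work);
          return size;    /* the udev rule task returns within seconds */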
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e23ccf8
    • bcache: fix refcount underflow in bcache_device_free() · 86da9f73
      Committed by Coly Li
      The problematic code piece in bcache_device_free() is,
      
       785 static void bcache_device_free(struct bcache_device *d)
       786 {
       787     struct gendisk *disk = d->disk;
       [snipped]
       799     if (disk) {
       800             if (disk->flags & GENHD_FL_UP)
       801                     del_gendisk(disk);
       802
       803             if (disk->queue)
       804                     blk_cleanup_queue(disk->queue);
       805
       806             ida_simple_remove(&bcache_device_idx,
       807                               first_minor_to_idx(disk->first_minor));
       808             put_disk(disk);
       809         }
       [snipped]
       816 }
      
      At line 808, put_disk(disk) may encounter the kobject refcount of
      'disk' underflowing.
      
      Here is how to reproduce the issue:
      - Attach the backing device to a cache device and do random writes to
        make the cache dirty.
      - Stop the bcache device while the cache device still holds dirty
        data for the backing device.
      - Register only the backing device back, NOT the cache device.
      - The bcache device node /dev/bcache0 won't show up, because the
        backing device waits for the cache device to show up for the
        missing dirty data.
      - Now echo 1 into /sys/fs/bcache/pendings_cleanup to stop the pending
        backing device.
      - After the pending backing device is stopped, check the kernel
        messages with 'dmesg': a use-after-free warning from KASAN reports
        that the refcount of the kobject linked to 'disk' underflows.
      
      The refcount dropped at line 808 in the above code piece is the one
      added by add_disk(d->disk) in bch_cached_dev_run(). But in the above
      scenario the cache device is not registered, bch_cached_dev_run() has
      no chance to be called, and the refcount is never added. Calling
      put_disk() for a gendisk kobject whose refcount was never added
      triggers an underflow warning.
      
      This patch checks whether GENHD_FL_UP is set in disk->flags; if it is
      not set then the bcache device was never added, so put_disk() is not
      called and the underflow issue is avoided.
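
      Roughly, the teardown becomes the following (a sketch of the logic
      described above, not a verbatim diff):

          if (disk) {
                  bool disk_added = (disk->flags & GENHD_FL_UP) != 0;

                  if (disk_added)
                          del_gendisk(disk);
                  if (disk->queue)
                          blk_cleanup_queue(disk->queue);
                  ida_simple_remove(&bcache_device_idx,
                                    first_minor_to_idx(disk->first_minor));
                  /* only drop the reference taken by add_disk() if the disk
                   * was actually added */
                  if (disk_added)
                          put_disk(disk);
          }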
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      86da9f73
    • bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Committed by Joe Perches
      Remove the trailing newline from the define of pr_fmt and add newlines
      to the uses.
      
      Miscellanea:
      
      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont,
        as the earlier conversion was inappropriately done, causing multiple
        lines to be emitted where only a single output line was desired
      o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple
        line output where only a single line output was desired
      o Coalesce formats
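
      In other words (a sketch; the pr_fmt shape follows the message
      prefixes visible in the log lines elsewhere on this page):

          /* before: a trailing newline baked into every message */
          #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt "\n", __func__

          /* after: no newline in pr_fmt(); each call site supplies its own */
          #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt, __func__

          pr_info("%s stopped\n", d->disk->disk_name);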
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46f5aa88
  5. 28 Mar 2020, 2 commits
  6. 13 Feb 2020, 1 commit
    • bcache: Revert "bcache: shrink btree node cache after bch_btree_check()" · 309cc719
      Committed by Coly Li
      This reverts commit 1df3877f.
      
      In my testing, sometimes even when all the cached btree nodes are
      freed, creating the gc and allocator kernel threads may still fail.
      Finally it turns out that kthread_run() may fail if there is a
      pending signal for the current task. And the pending signal is sent
      by the OOM killer, which is triggered by the memory consumption in
      bch_btree_check().
      
      Therefore explicitly shrinking the bcache btree node cache here does
      not help, and after the shrinker callback was improved, and pending
      signals are now ignored before creating kernel threads, such an
      operation is unnecessary.
      
      This patch reverts commit 1df3877f ("bcache: shrink btree node cache
      after bch_btree_check()") because we now have a better approach.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      309cc719
  7. 01 Feb 2020, 1 commit
  8. 24 Jan 2020, 9 commits
  9. 14 Nov 2019, 5 commits
    • bcache: add idle_max_writeback_rate sysfs interface · c5fcdedc
      Committed by Coly Li
      For writeback mode, if there is no regular I/O request for a while,
      the writeback rate will be set to the maximum value (1TB/s for now).
      This is good for most storage workloads, but there are still people
      who don't want the maximum writeback rate during I/O idle time.
      
      This patch adds a sysfs interface file, idle_max_writeback_rate, to
      permit people to disable the maximum writeback rate. The minimum
      writeback rate can then be advised via writeback_rate_minimum in the
      bcache device's sysfs interface.
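
      A sketch of the intended effect (the flag name is illustrative, not
      necessarily the exact field added by the patch):

          /* in the path that raises the writeback rate after idle time */
          if (!c->idle_max_writeback_rate_enabled)
                  return false;   /* knob off: never jump to the maximum rate */

          /* otherwise the existing idle detection applies and the rate is
           * raised to its maximum as before */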
      Reported-by: Christian Balzer <chibi@gol.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5fcdedc
    • bcache: fix deadlock in bcache_allocator · 84c529ae
      Committed by Andrea Righi
      bcache_allocator can call the following:
      
       bch_allocator_thread()
        -> bch_prio_write()
           -> bch_bucket_alloc()
              -> wait on &ca->set->bucket_wait
      
      But the wake up event on bucket_wait is supposed to come from
      bch_allocator_thread() itself => deadlock:
      
      [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
      [ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
      [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
      [ 1158.504419] Call Trace:
      [ 1158.504429]  __schedule+0x2a8/0x670
      [ 1158.504432]  schedule+0x2d/0x90
      [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
      [ 1158.504453]  ? wait_woken+0x80/0x80
      [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
      [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
      [ 1158.504491]  kthread+0x121/0x140
      [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
      [ 1158.504506]  ? kthread_park+0xb0/0xb0
      [ 1158.504510]  ret_from_fork+0x35/0x40
      
      Fix by making the call to bch_prio_write() non-blocking, so that
      bch_allocator_thread() never waits on itself.
      
      Moreover, make sure to wake up the garbage collector thread when
      bch_prio_write() fails to allocate buckets.
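
      Conceptually the change looks like this in the allocator thread (a
      simplified sketch; the new 'wait' argument is the point):

          /* bch_prio_write() gains a 'wait' argument; the allocator thread
           * passes false so bch_bucket_alloc() never sleeps on bucket_wait
           * from inside the thread that is supposed to wake it */
          if (bch_prio_write(ca, false) < 0) {
                  /* no bucket available right now: poke the GC so buckets
                   * get reclaimed, and retry on a later iteration */
                  set_gc_sectors(ca->set);
                  wake_up_gc(ca->set);
          }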
      
      BugLink: https://bugs.launchpad.net/bugs/1784665
      BugLink: https://bugs.launchpad.net/bugs/1796292
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      84c529ae
    • bcache: add more accurate error messages in read_super() · aaf8dbea
      Committed by Coly Li
      The previous code only returned "Not a bcache superblock" for both
      the bcache superblock offset error and the magic number error. This
      patch adds more accurate error messages:
      - for super block unmatched offset:
        "Not a bcache superblock (bad offset)"
      - for super block unmatched magic number:
        "Not a bcache superblock (bad magic)"
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aaf8dbea
    • bcache: fix static checker warning in bcache_device_free() · 2d886951
      Committed by Coly Li
      Commit cafe5635 ("bcache: A block layer cache") leads to the
      following static checker warning:
      
          ./drivers/md/bcache/super.c:770 bcache_device_free()
          warn: variable dereferenced before check 'd->disk' (see line 766)
      
      drivers/md/bcache/super.c
         762  static void bcache_device_free(struct bcache_device *d)
         763  {
         764          lockdep_assert_held(&bch_register_lock);
         765
         766          pr_info("%s stopped", d->disk->disk_name);
                                            ^^^^^^^^^
      Unchecked dereference.
      
         767
         768          if (d->c)
         769                  bcache_device_detach(d);
         770          if (d->disk && d->disk->flags & GENHD_FL_UP)
                          ^^^^^^^
      Check too late.
      
         771                  del_gendisk(d->disk);
         772          if (d->disk && d->disk->queue)
         773                  blk_cleanup_queue(d->disk->queue);
         774          if (d->disk) {
         775                  ida_simple_remove(&bcache_device_idx,
         776                                    first_minor_to_idx(d->disk->first_minor));
         777                  put_disk(d->disk);
         778          }
         779
      
      It is not 100% certain that the gendisk struct of a bcache device
      will always be there; the warning makes sense when there is a problem
      in the block core.
      
      This patch removes the static checker warning by checking d->disk
      before dereferencing it, to avoid a NULL pointer dereference.
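
      A sketch of the reordered checks, based on the snippet quoted above:

          static void bcache_device_free(struct bcache_device *d)
          {
                  struct gendisk *disk = d->disk;

                  lockdep_assert_held(&bch_register_lock);

                  if (disk)
                          pr_info("%s stopped\n", disk->disk_name);
                  else
                          pr_err("bcache device (NULL gendisk) stopped\n");

                  if (d->c)
                          bcache_device_detach(d);

                  if (disk) {
                          if (disk->flags & GENHD_FL_UP)
                                  del_gendisk(disk);
                          if (disk->queue)
                                  blk_cleanup_queue(disk->queue);
                          ida_simple_remove(&bcache_device_idx,
                                            first_minor_to_idx(disk->first_minor));
                          put_disk(disk);
                  }
                  /* rest of the teardown unchanged */
          }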
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d886951
    • bcache: fix a lost wake-up problem caused by mca_cannibalize_lock · 34cf78bf
      Committed by Guoju Fang
      This patch fixes a lost wake-up problem caused by the race between
      mca_cannibalize_lock and bch_cannibalize_unlock.
      
      Consider two processes, A and B. Process A is executing
      mca_cannibalize_lock, while process B has taken
      c->btree_cache_alloc_lock and is executing bch_cannibalize_unlock.
      The problem happens in the window after process A executes cmpxchg
      but before it executes prepare_to_wait: in this timeslice process B
      executes wake_up, and only afterwards does process A execute
      prepare_to_wait and set its state to TASK_INTERRUPTIBLE. Process A
      then goes to sleep and no one will ever wake it up, which may cause
      the bcache device to hang.
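
      The fix serializes the lock-owner check and the wait-queue
      registration, so the wake_up cannot slip into the gap (a simplified
      sketch; the spinlock and field names are illustrative):

          static int mca_cannibalize_lock(struct cache_set *c, struct btree_op *op)
          {
                  spin_lock(&c->btree_cannibalize_lock);
                  if (likely(c->btree_cache_alloc_lock == NULL)) {
                          c->btree_cache_alloc_lock = current;
                  } else if (c->btree_cache_alloc_lock != current) {
                          if (op)
                                  prepare_to_wait(&c->btree_cache_wait, &op->wait,
                                                  TASK_INTERRUPTIBLE);
                          /* we are already on the waitqueue when the spinlock
                           * is dropped, so a concurrent wake_up from
                           * bch_cannibalize_unlock() can no longer be lost */
                          spin_unlock(&c->btree_cannibalize_lock);
                          return -EINTR;
                  }
                  spin_unlock(&c->btree_cannibalize_lock);
                  return 0;
          }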
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34cf78bf
  10. 22 Jul 2019, 1 commit
  11. 28 Jun 2019, 8 commits
    • bcache: shrink btree node cache after bch_btree_check() · 1df3877f
      Committed by Coly Li
      When a cache set starts, bch_btree_check() checks all bkeys on the
      cache device by calculating their checksums. This operation consumes
      a huge amount of system memory when a lot of data is cached, because
      bcache uses its own mca cache to hold all the read-in btree nodes and
      only releases that cache space when the memory management code starts
      to shrink caches. Before the memory manager calls the mca cache
      shrinker callback, the bcache mca cache competes for memory with user
      space applications, which may have a negative effect on the
      performance of user space workloads (e.g. databases, or the I/O
      service of a distributed storage node).
      
      This patch calls the bcache mca shrinker routine to proactively
      release mca cache memory, to decrease the memory pressure on the
      system and avoid a negative effect on overall system I/O performance.
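
      A sketch of proactively invoking bcache's own registered shrinker
      right after bch_btree_check() (field names are illustrative):

          struct shrink_control sc;

          sc.gfp_mask = GFP_KERNEL;
          sc.nr_to_scan = c->btree_cache_used * c->btree_pages;
          /* ask our own mca shrinker to drop read-in btree nodes right away,
           * instead of waiting for system memory pressure to trigger it */
          c->shrink.scan_objects(&c->shrink, &sc);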
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1df3877f
    • bcache: fix potential deadlock in cached_dev_free() · 7e865eba
      Committed by Coly Li
      When lockdep is enabled and the system is rebooted with a writeback
      mode bcache device attached, the following potential deadlock warning
      is reported by the lockdep engine.
      
      [  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
      [  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [  101.542054][  T401]
      [  101.542054][  T401] but task is already holding lock:
      [  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.548386][  T401]
      [  101.548386][  T401] which lock already depends on the new lock.
      [  101.548386][  T401]
      [  101.551874][  T401]
      [  101.551874][  T401] the existing dependency chain (in reverse order) is:
      [  101.555000][  T401]
      [  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [  101.557860][  T401]        process_one_work+0x277/0x640
      [  101.559661][  T401]        worker_thread+0x39/0x3f0
      [  101.561340][  T401]        kthread+0x125/0x140
      [  101.562963][  T401]        ret_from_fork+0x3a/0x50
      [  101.564718][  T401]
      [  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [  101.567701][  T401]        lock_acquire+0xb4/0x1c0
      [  101.569651][  T401]        flush_workqueue+0xae/0x4c0
      [  101.571494][  T401]        drain_workqueue+0xa9/0x180
      [  101.573234][  T401]        destroy_workqueue+0x17/0x250
      [  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
      [  101.577304][  T401]        process_one_work+0x2a4/0x640
      [  101.579357][  T401]        worker_thread+0x39/0x3f0
      [  101.581055][  T401]        kthread+0x125/0x140
      [  101.582709][  T401]        ret_from_fork+0x3a/0x50
      [  101.584592][  T401]
      [  101.584592][  T401] other info that might help us debug this:
      [  101.584592][  T401]
      [  101.588355][  T401]  Possible unsafe locking scenario:
      [  101.588355][  T401]
      [  101.590974][  T401]        CPU0                    CPU1
      [  101.592889][  T401]        ----                    ----
      [  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
      [  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
      [  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
      [  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
      [  101.605255][  T401]
      [  101.605255][  T401]  *** DEADLOCK ***
      [  101.605255][  T401]
      [  101.608310][  T401] 2 locks held by kworker/2:2/401:
      [  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.617480][  T401]
      [  101.617480][  T401] stack backtrace:
      [  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
      [  101.629239][  T401] Call Trace:
      [  101.630360][  T401]  dump_stack+0x85/0xcb
      [  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
      [  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
      [  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
      [  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
      [  101.638421][  T401]  ? find_held_lock+0x34/0xa0
      [  101.640015][  T401]  lock_acquire+0xb4/0x1c0
      [  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.643248][  T401]  flush_workqueue+0xae/0x4c0
      [  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
      [  101.648303][  T401]  drain_workqueue+0xa9/0x180
      [  101.649867][  T401]  destroy_workqueue+0x17/0x250
      [  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
      [  101.653328][  T401]  process_one_work+0x2a4/0x640
      [  101.655029][  T401]  worker_thread+0x39/0x3f0
      [  101.656693][  T401]  ? process_one_work+0x640/0x640
      [  101.658501][  T401]  kthread+0x125/0x140
      [  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  101.661985][  T401]  ret_from_fork+0x3a/0x50
      [  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the above potential deadlock may happen in the
      reboot/shutdown code path:
      1) bcache_reboot() is called first in the reboot/shutdown code path,
         and bcache_reboot() calls bcache_device_stop().
      2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->flags, then
         calls closure_queue(&d->cl) to invoke cached_dev_flush(). In turn,
         cached_dev_flush() calls cached_dev_free() via continue_at().
      3) In cached_dev_free(), after the writeback kthread
         dc->writeback_thread is stopped, the workqueue
         dc->writeback_write_wq is stopped by destroy_workqueue().
      4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
         drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
         is acquired by lock_map_acquire() in flush_workqueue(). After the
         lock is acquired, the rest of flush_workqueue() just waits for the
         workqueue to complete.
      5) Now look back at the writeback thread routine
         bch_writeback_thread(): in the main while-loop, write_dirty() is
         called via continue_at() in read_dirty_submit(), which is called
         via continue_at() in the while-loop level function read_dirty().
         Inside write_dirty() it may be re-queued on the workqueue
         dc->writeback_write_wq via continue_at(). This means that when the
         writeback kthread is stopped in cached_dev_free(), there might
         still be one kworker queued on dc->writeback_write_wq to execute
         write_dirty() again.
      6) Now this kworker is scheduled on dc->writeback_write_wq to run by
         process_one_work() (which is called by worker_thread()). Before
         calling the work routine, wq->lockdep_map is acquired.
      7) But wq->lockdep_map was already acquired in step 4), so an A-A
         lock (lockdep terminology) scenario happens.
      
      Indeed, on multi-core systems the above deadlock is very rare, just
      as the code comment in process_one_work() says:
      2263     * AFAICT there is no possible deadlock scenario between the
      2264     * flush_work() and complete() primitives (except for single-threaded
      2265     * workqueues), so hiding them isn't a problem.
      
      But it is still good to fix such a lockdep warning, even though no
      one runs bcache on a single-core system.
      
      The fix is simple. This patch solves the above potential deadlock by:
      - Not destroying the workqueue dc->writeback_write_wq in
        cached_dev_free().
      - Flushing and destroying dc->writeback_write_wq in the writeback
        kthread routine bch_writeback_thread(), after the thread quits its
        main while-loop and before cached_dev_put() is called.
      
      With this fix, dc->writeback_write_wq is stopped and destroyed before
      the writeback kthread stops, so the chance of an A-A lock on
      wq->lockdep_map disappears and such an A-A deadlock won't happen any
      more.
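
      Sketched, the tail of bch_writeback_thread() becomes (simplified):

          /* ... the main while-loop has exited ... */

          if (dc->writeback_write_wq) {
                  /* wait for any still-queued write_dirty() work, then tear
                   * the workqueue down here, not in cached_dev_free() */
                  flush_workqueue(dc->writeback_write_wq);
                  destroy_workqueue(dc->writeback_write_wq);
          }
          cached_dev_put(dc);
          /* remainder of the thread-exit path unchanged */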
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7e865eba
    • bcache: acquire bch_register_lock later in cached_dev_free() · 80265d8d
      Committed by Coly Li
      When the lockdep engine is enabled, a lockdep warning can be observed
      when rebooting or shutting down the system:
      
      [ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
      [ 3142.776265][ T2649]
      [ 3142.777159][ T2649] ======================================================
      [ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
      [ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
      [ 3142.785684][ T2649] ------------------------------------------------------
      [ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
      [ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [ 3142.794678][ T2649]
      [ 3142.794678][ T2649] but task is already holding lock:
      [ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.801462][ T2649]
      [ 3142.801462][ T2649] which lock already depends on the new lock.
      [ 3142.801462][ T2649]
      [ 3142.805277][ T2649]
      [ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
      [ 3142.808902][ T2649]
      [ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
      [ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
      [ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
      [ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.820276][ T2649]        kthread+0x125/0x140
      [ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.823965][ T2649]
      [ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [ 3142.827244][ T2649]        process_one_work+0x277/0x640
      [ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.830958][ T2649]        kthread+0x125/0x140
      [ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.834915][ T2649]
      [ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
      [ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
      [ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
      [ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
      [ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
      [ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
      [ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
      [ 3142.852464][ T2649]        kthread+0x125/0x140
      [ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
      [ 3142.855880][ T2649]
      [ 3142.855880][ T2649] other info that might help us debug this:
      [ 3142.855880][ T2649]
      [ 3142.859663][ T2649] Chain exists of:
      [ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
      [ 3142.859663][ T2649]
      [ 3142.865424][ T2649]  Possible unsafe locking scenario:
      [ 3142.865424][ T2649]
      [ 3142.868022][ T2649]        CPU0                    CPU1
      [ 3142.869885][ T2649]        ----                    ----
      [ 3142.871751][ T2649]   lock(&bch_register_lock);
      [ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
      [ 3142.876399][ T2649]                                lock(&bch_register_lock);
      [ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
      [ 3142.882064][ T2649]
      [ 3142.882064][ T2649]  *** DEADLOCK ***
      [ 3142.882064][ T2649]
      [ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
      [ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
      [ 3142.898797][ T2649]
      [ 3142.898797][ T2649] stack backtrace:
      [ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
      [ 3142.911422][ T2649] Call Trace:
      [ 3142.912656][ T2649]  dump_stack+0x85/0xcb
      [ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
      [ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
      [ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
      [ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
      [ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
      [ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
      [ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
      [ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
      [ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
      [ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
      [ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
      [ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
      [ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
      [ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
      [ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
      [ 3142.943389][ T2649]  kthread+0x125/0x140
      [ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
      [ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the deadlock happens.
      1) bcache_reboot() calls bcache_device_stop(), then inside
         bcache_device_stop() the BCACHE_DEV_CLOSING bit is set on
         d->flags. Then closure_queue(&d->cl) is called to invoke
         cached_dev_flush().
      2) In cached_dev_flush(), cached_dev_free() is called by
         continue_at().
      3) In cached_dev_free(), when stopping the writeback kthread of the
         cached device by kthread_stop(), dc->writeback_thread will be
         woken up to quit the kthread while-loop, then cached_dev_put() is
         called in bch_writeback_thread().
      4) Calling cached_dev_put() in the writeback kthread may drop
         dc->count to 0, then the dc->detach kworker is scheduled, which is
         initialized as cached_dev_detach_finish().
      5) Inside cached_dev_detach_finish(), the last line of code calls
         closure_put(&dc->disk.cl), which drops the last reference counter
         of the closure dc->disk.cl, so the callback cached_dev_flush()
         gets called.
      Now cached_dev_flush() is called for the second time in this code
      path, the first time being in step 2). bch_register_lock will be
      acquired again, and an A-A lock (lockdep terminology) happens.
      
      The root cause of the above A-A lock is that in cached_dev_free() the
      mutex bch_register_lock is held before stopping the writeback kthread
      and the other kworkers. Fortunately we now have the variable
      'bcache_is_reboot', which prevents device registration and
      unregistration during reboot/shutdown time, so it is unnecessary to
      hold bch_register_lock that early now.
      
      This is how this patch fixes the reboot/shutdown time A-A lock issue:
      mutex_lock(&bch_register_lock) is moved to a later location, just
      before atomic_read(&dc->running) in cached_dev_free(), so the A-A
      lock problem is solved without any reboot-time registration race.
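
      Schematically, the reordering looks like this (a simplified sketch;
      helper and field names follow existing bcache code as best recalled
      and should be treated as illustrative):

          static void cached_dev_free(struct closure *cl)
          {
                  struct cached_dev *dc = container_of(cl, struct cached_dev,
                                                       disk.cl);

                  /* 1) stop the rate-update kworker and the writeback kthread
                   *    with bch_register_lock NOT held */
                  if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
                          cancel_writeback_rate_update_dwork(dc);
                  if (!IS_ERR_OR_NULL(dc->writeback_thread))
                          kthread_stop(dc->writeback_thread);

                  /* 2) take bch_register_lock only for the list/disk teardown */
                  mutex_lock(&bch_register_lock);
                  if (atomic_read(&dc->running))
                          bd_unlink_disk_holder(dc->bdev, dc->disk.disk);
                  bcache_device_free(&dc->disk);
                  list_del(&dc->list);
                  mutex_unlock(&bch_register_lock);
          }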
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80265d8d
    • bcache: acquire bch_register_lock later in cached_dev_detach_finish() · 97ba3b81
      Committed by Coly Li
      Now that there is the variable bcache_is_reboot to prevent device
      registration or unregistration during reboot, it is unnecessary to
      hold the mutex bch_register_lock before stopping the
      writeback_rate_update kworker and the writeback kthread. And if the
      kworker or kthread being stopped takes bch_register_lock inside its
      own routine (we used to have such a problem in the writeback thread;
      thanks to Junhui Wang for fixing it), it is very easy to introduce a
      deadlock during the reboot/shutdown procedure.
      
      Therefore in this patch the location where bch_register_lock is
      acquired is moved to just before the call to
      calc_cached_dev_sectors(), which is later than its original location
      in cached_dev_detach_finish().
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97ba3b81
    • bcache: avoid a deadlock in bcache_reboot() · a59ff6cc
      Committed by Coly Li
      It is quite common to observe a deadlock in bcache_reboot() that
      hangs the system reboot process. The reason is that in
      bcache_reboot() the mutex bch_register_lock is held while calling
      bch_cache_set_stop() and bcache_device_stop(). But in the process of
      stopping the cache set and the bcache device, bch_register_lock will
      be acquired again. If the mutex is already held here, a deadlock
      happens inside the stopping process, and the whole system reboot
      hangs.
      
      The fix is to avoid holding bch_register_lock for the following loops
      in bcache_reboot(),
             list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                      bch_cache_set_stop(c);
      
              list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                      bcache_device_stop(&dc->disk);
      
      A module-scope variable 'bcache_is_reboot' is added; it is set to
      true in bcache_reboot(). In register_bcache(), if bcache_is_reboot is
      true, the registration is rejected by returning -EBUSY immediately.
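
      Sketch of the flag and the check (the register_bcache() signature
      follows the usual kobj_attribute store convention):

          /* set early in bcache_reboot(), checked in register_bcache() */
          static bool bcache_is_reboot;

          static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
                                         const char *buffer, size_t size)
          {
                  if (bcache_is_reboot)
                          return -EBUSY;  /* reboot in progress: refuse */

                  /* ... normal registration path continues here ... */
          }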
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a59ff6cc
    • bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed · 5c2a634c
      Committed by Coly Li
      In bch_cached_dev_attach(), after bch_cached_dev_writeback_start() is
      called, the writeback kthread and the writeback rate update kworker
      of the cached device are created. If the following
      bch_cached_dev_run() fails, bch_cached_dev_attach() returns -ENOMEM
      without stopping the writeback-related kthread and kworker.
      
      This patch stops the writeback kthread and the writeback rate update
      kworker before returning -ENOMEM when bch_cached_dev_run() returns an
      error.
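
      A sketch of the idea (the exact cleanup calls in the real patch may
      differ; names here are illustrative):

          if (bch_cached_dev_run(dc)) {
                  /* the writeback machinery was already started above;
                   * stop it again before bailing out */
                  if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
                          cancel_writeback_rate_update_dwork(dc);
                  if (!IS_ERR_OR_NULL(dc->writeback_thread))
                          kthread_stop(dc->writeback_thread);
                  return -ENOMEM;
          }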
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5c2a634c
    • bcache: add pendings_cleanup to stop pending bcache device · 0c277e21
      Committed by Coly Li
      If a bcache device is in a dirty state and its cache set is not
      registered, this bcache device will not appear in /dev/bcache<N>,
      and there is no way to stop it or remove the bcache kernel module.
      
      This is as-designed behavior, but sometimes people have to reboot the
      whole system to release or stop the pending backing device.
      
      This patch adds a sysfs interface, pendings_cleanup, which removes
      such pending bcache devices when anything is written into the sysfs
      file manually.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0c277e21
    • bcache: remove "XXX:" comment line from run_cache_set() · 68a53c95
      Committed by Coly Li
      In previous bcache patches for Linux v5.2, the failure code path of
      run_cache_set() was tested and fixed, so the following comment line
      can now be removed from run_cache_set():
      	/* XXX: test this, it's broken */
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      68a53c95