1. 24 5月, 2022 1 次提交
    • C
      bcache: improve multithreaded bch_sectors_dirty_init() · 4dc34ae1
      Coly Li 提交于
      Commit b144e45f ("bcache: make bch_sectors_dirty_init() to be
      multithreaded") makes bch_sectors_dirty_init() to be much faster
      when counting dirty sectors by iterating all dirty keys in the btree.
      But it isn't in ideal shape yet, still can be improved.
      
      This patch does the following changes to improve current parallel dirty
      keys iteration on the btree,
      - Add read lock to root node when multiple threads iterating the btree,
        to prevent the root node gets split by I/Os from other registered
        bcache devices.
      - Remove local variable "char name[32]" and generate kernel thread name
        string directly when calling kthread_run().
      - Allocate "struct bch_dirty_init_state state" directly on stack and
        avoid the unnecessary dynamic memory allocation for it.
      - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed.
      - Increase &state->started to count created kernel thread after it
        succeeds to create.
      - When wait for all dirty key counting threads to finish, use
        wait_event() to replace wait_event_interruptible().
      
      With the above changes, the code is more clear, and some potential error
      conditions are avoided.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
      4dc34ae1
  2. 06 3月, 2022 2 次提交
    • M
      bcache: fixup multiple threads crash · 887554ab
      Mingzhe Zou 提交于
      When multiple threads to check btree nodes in parallel, the main
      thread wait for all threads to stop or CACHE_SET_IO_DISABLE flag:
      
      wait_event_interruptible(check_state->wait,
                               atomic_read(&check_state->started) == 0 ||
                               test_bit(CACHE_SET_IO_DISABLE, &c->flags));
      
      However, the bch_btree_node_read and bch_btree_node_read_done
      maybe call bch_cache_set_error, then the CACHE_SET_IO_DISABLE
      will be set. If the flag already set, the main thread return
      error. At the same time, maybe some threads still running and
      read NULL pointer, the kernel will crash.
      
      This patch change the event wait condition, the main thread must
      wait for all threads to stop.
      
      Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
      Signed-off-by: NMingzhe Zou <mingzhe.zou@easystack.cn>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: NColy Li <colyli@suse.de>
      887554ab
    • M
      bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing · 7b1002f7
      Mingzhe Zou 提交于
      When attaching a cached device (a.k.a backing device) to a cache
      device, bch_sectors_dirty_init() is called to count dirty sectors
      and stripes (see what bcache_dev_sectors_dirty_add() does) on the
      cache device.
      
      When bcache_dev_sectors_dirty_add() is called, set_bit(stripe,
      d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes)
      operation will always be performed. In full_dirty_stripes, each 1bit
      represents stripe_size (8192) sectors (512B), so 1bit=4MB (8192*512),
      and each CPU cache line=64B=512bit=2048MB. When 20 threads process
      a cached disk with 100G dirty data, a single thread processes about
      23M at a time, and 20 threads total 460M. These full_dirty_stripes
      bits corresponding to the 460M data is likely to fall in the same CPU
      cache line. When one of these threads performs a set_bit or clear_bit
      operation, the same CPU cache line of other threads will become invalid
      and must read the full_dirty_stripes from the main memory again. Compared
      with single thread, the time of a bcache_dev_sectors_dirty_add()
      call is increased by about 50 times in our test (100G dirty data,
      20 threads, bcache_dev_sectors_dirty_add() is called more than
      20 million times).
      
      This patch tries to test_bit before set_bit or clear_bit operation.
      Therefore, a lot of force set and clear operations will be avoided,
      and most of bcache_dev_sectors_dirty_add() calls will only read CPU
      cache line.
      Signed-off-by: NMingzhe Zou <mingzhe.zou@easystack.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      7b1002f7
  3. 02 2月, 2022 1 次提交
  4. 19 10月, 2021 1 次提交
  5. 11 4月, 2021 2 次提交
  6. 10 2月, 2021 1 次提交
    • D
      bcache: consider the fragmentation when update the writeback rate · 71dda2a5
      dongdong tao 提交于
      Current way to calculate the writeback rate only considered the
      dirty sectors, this usually works fine when the fragmentation
      is not high, but it will give us unreasonable small rate when
      we are under a situation that very few dirty sectors consumed
      a lot dirty buckets. In some case, the dirty bucekts can reached
      to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) not even
      reached the writeback_percent, the writeback rate will still
      be the minimum value (4k), thus it will cause all the writes to be
      stucked in a non-writeback mode because of the slow writeback.
      
      We accelerate the rate in 3 stages with different aggressiveness,
      the first stage starts when dirty buckets percent reach above
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
      the first stage tries to writeback the amount of dirty data
      in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
      the second stage tries to writeback the amount of dirty data in one bucket
      in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
      stage tries to writeback the amount of dirty data in one bucket in
      (1 / (dirty_buckets_percent - 64)) millisecond.
      
      the initial rate at each stage can be controlled by 3 configurable
      parameters writeback_rate_fp_term_{low|mid|high}, they are by default
      1, 10, 1000, the hint of IO throughput that these values are trying
      to achieve is described by above paragraph, the reason that
      I choose those value as default is based on the testing and the
      production data, below is some details:
      
      A. When it comes to the low stage, there is still a bit far from the 70
         threshold, so we only want to give it a little bit push by setting the
         term to 1, it means the initial rate will be 170 if the fragment is 6,
         it is calculated by bucket_size/fragment, this rate is very small,
         but still much reasonable than the minimum 8.
         For a production bcache with unheavy workload, if the cache device
         is bigger than 1 TB, it may take hours to consume 1% buckets,
         so it is very possible to reclaim enough dirty buckets in this stage,
         thus to avoid entering the next stage.
      
      B. If the dirty buckets ratio didn't turn around during the first stage,
         it comes to the mid stage, then it is necessary for mid stage
         to be more aggressive than low stage, so i choose the initial rate
         to be 10 times more than low stage, that means 1700 as the initial
         rate if the fragment is 6. This is some normal rate
         we usually see for a normal workload when writeback happens
         because of writeback_percent.
      
      C. If the dirty buckets ratio didn't turn around during the low and mid
         stages, it comes to the third stage, and it is the last chance that
         we can turn around to avoid the horrible cutoff writeback sync issue,
         then we choose 100 times more aggressive than the mid stage, that
         means 170000 as the initial rate if the fragment is 6. This is also
         inferred from a production bcache, I've got one week's writeback rate
         data from a production bcache which has quite heavy workloads,
         again, the writeback is triggered by the writeback percent,
         the highest rate area is around 100000 to 240000, so I believe this
         kind aggressiveness at this stage is reasonable for production.
         And it should be mostly enough because the hint is trying to reclaim
         1000 bucket per second, and from that heavy production env,
         it is consuming 50 bucket per second on average in one week's data.
      
      Option writeback_consider_fragment is to control whether we want
      this feature to be on or off, it's on by default.
      
      Lastly, below is the performance data for all the testing result,
      including the data from production env:
      https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharingSigned-off-by: Ndongdong tao <dongdong.tao@canonical.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      71dda2a5
  7. 08 12月, 2020 1 次提交
    • D
      bcache: fix race between setting bdev state to none and new write request direct to backing · df4ad532
      Dongsheng Yang 提交于
      There is a race condition in detaching as below:
      A. detaching			B. Write request
      (1) writing back
      (2) write back done, set bdev
          state to clean.
      (3) cached_dev_put() and
          schedule_work(&dc->detach);
      				(4) write data [0 - 4K] directly
      				    into backing and ack to user.
      (5) power-failure...
      
      When we restart this bcache device, this bdev is clean but not detached,
      and read [0 - 4K], we will get unexpected old data from cache device.
      
      To fix this problem, set the bdev state to none when we writeback done
      in detaching, and then if power-failure happened as above, the data in
      cache will not be used in next bcache device starting, it's detached, we
      will read the correct data from backing derectly.
      Signed-off-by: NDongsheng Yang <dongsheng.yang@easystack.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      df4ad532
  8. 03 10月, 2020 1 次提交
    • C
      bcache: remove embedded struct cache_sb from struct cache_set · 4a784266
      Coly Li 提交于
      Since bcache code was merged into mainline kerrnel, each cache set only
      as one single cache in it. The multiple caches framework is here but the
      code is far from completed. Considering the multiple copies of cached
      data can also be stored on e.g. md raid1 devices, it is unnecessary to
      support multiple caches in one cache set indeed.
      
      The previous preparation patches fix the dependencies of explicitly
      making a cache set only have single cache. Now we don't have to maintain
      an embedded partial super block in struct cache_set, the in-memory super
      block can be directly referenced from struct cache.
      
      This patch removes the embedded struct cache_sb from struct cache_set,
      and fixes all locations where the superb lock was referenced from this
      removed super block by referencing the in-memory super block of struct
      cache.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4a784266
  9. 25 7月, 2020 3 次提交
    • C
      bcache: fix overflow in offset_to_stripe() · 7a148126
      Coly Li 提交于
      offset_to_stripe() returns the stripe number (in type unsigned int) from
      an offset (in type uint64_t) by the following calculation,
      	do_div(offset, d->stripe_size);
      For large capacity backing device (e.g. 18TB) with small stripe size
      (e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
      returned value which caller receives is 536870912, due to the overflow.
      
      Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
      range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
      in range [0, bcache_dev->nr_stripes - 1].
      
      This patch adds a upper limition check in offset_to_stripe(): the max
      valid stripe number should be less than bcache_device->nr_stripes. If
      the calculated stripe number from do_div() is equal to or larger than
      bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
      is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
      therefore -EOVERFLOW is not used as error code.)
      
      This patch also changes nr_stripes' type of struct bcache_device from
      'unsigned int' to 'int', and return value type of offset_to_stripe()
      from 'unsigned int' to 'int', to match their exact data ranges.
      
      All locations where bcache_device->nr_stripes and offset_to_stripe() are
      referenced also get updated for the above type change.
      Reported-and-tested-by: NKen Raeburn <raeburn@redhat.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7a148126
    • G
      bcache: Use struct_size() in kzalloc() · 29f1d5ca
      Gustavo A. R. Silva 提交于
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      This code was detected with the help of Coccinelle and, audited and
      fixed manually.
      Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      29f1d5ca
    • X
      bcache: writeback: Remove unneeded variable i · 7236657c
      Xu Wang 提交于
      Remove unneeded variable i in bch_dirty_init_thread().
      Signed-off-by: NXu Wang <vulab@iscas.ac.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7236657c
  10. 27 5月, 2020 1 次提交
    • J
      bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Joe Perches 提交于
      Remove the trailing newline from the define of pr_fmt and add newlines
      to the uses.
      
      Miscellanea:
      
      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont
        as the earlier conversion was inappropriate done causing multiple
        lines to be emitted where only a single output line was desired
      o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple
        line output where only a single line output was desired
      o Coalesce formats
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      46f5aa88
  11. 23 3月, 2020 3 次提交
    • C
      bcache: optimize barrier usage for atomic operations · eb9b6666
      Coly Li 提交于
      The idea of this patch is from Davidlohr Bueso, he posts a patch
      for bcache to optimize barrier usage for read-modify-write atomic
      bitops. Indeed such optimization can also apply on other locations
      where smp_mb() is used before or after an atomic operation.
      
      This patch replaces smp_mb() with smp_mb__before_atomic() or
      smp_mb__after_atomic() in btree.c and writeback.c,  where it is used
      to synchronize memory cache just earlier on other cores. Although
      the locations are not on hot code path, it is always not bad to mkae
      things a little better.
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eb9b6666
    • D
      bcache: optimize barrier usage for Rmw atomic bitops · b004aa86
      Davidlohr Bueso 提交于
      We can avoid the unnecessary barrier on non LL/SC architectures,
      such as x86. Instead, use the smp_mb__after_atomic().
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b004aa86
    • C
      bcache: make bch_sectors_dirty_init() to be multithreaded · b144e45f
      Coly Li 提交于
      When attaching a cached device (a.k.a backing device) to a cache
      device, bch_sectors_dirty_init() is called to count dirty sectors
      and stripes (see what bcache_dev_sectors_dirty_add() does) on the
      cache device.
      
      The counting is done by a single thread recursive function
      bch_btree_map_keys() to iterate all the bcache btree nodes.
      If the btree has huge number of nodes, bch_sectors_dirty_init() will
      take quite long time. In my testing, if the registering cache set has
      a existed UUID which matches a already registered cached device, the
      automatical attachment during the registration may take more than
      55 minutes. This is too long for waiting the bcache to work in real
      deployment.
      
      Fortunately when bch_sectors_dirty_init() is called, no other thread
      will access the btree yet, it is safe to do a read-only parallelized
      dirty sectors counting by multiple threads.
      
      This patch tries to create multiple threads, and each thread tries to
      one-by-one count dirty sectors from the sub-tree indexed by a root
      node key which the thread fetched. After the sub-tree is counted, the
      counting thread will continue to fetch another root node key, until
      the fetched key is NULL. How many threads in parallel depends on
      the number of keys from the btree root node, and the number of online
      CPU core. The thread number will be the less number but no more than
      BCH_DIRTY_INIT_THRD_MAX. If there are only 2 keys in root node, it
      can only be 2x times faster by this patch. But if there are 10 keys
      in the root node, with this patch it can be 10x times faster.
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b144e45f
  12. 14 11月, 2019 1 次提交
    • C
      bcache: add idle_max_writeback_rate sysfs interface · c5fcdedc
      Coly Li 提交于
      For writeback mode, if there is no regular I/O request for a while,
      the writeback rate will be set to the maximum value (1TB/s for now).
      This is good for most of the storage workload, but there are still
      people don't what the maximum writeback rate in I/O idle time.
      
      This patch adds a sysfs interface file idle_max_writeback_rate to
      permit people to disable maximum writeback rate. Then the minimum
      writeback rate can be advised by writeback_rate_minimum in the
      bcache device's sysfs interface.
      Reported-by: NChristian Balzer <chibi@gol.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c5fcdedc
  13. 28 6月, 2019 3 次提交
    • C
      bcache: fix potential deadlock in cached_def_free() · 7e865eba
      Coly Li 提交于
      When enable lockdep and reboot system with a writeback mode bcache
      device, the following potential deadlock warning is reported by lockdep
      engine.
      
      [  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
      [  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
      [  101.542054][  T401]
      [  101.542054][  T401] but task is already holding lock:
      [  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.548386][  T401]
      [  101.548386][  T401] which lock already depends on the new lock.
      [  101.548386][  T401]
      [  101.551874][  T401]
      [  101.551874][  T401] the existing dependency chain (in reverse order) is:
      [  101.555000][  T401]
      [  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
      [  101.557860][  T401]        process_one_work+0x277/0x640
      [  101.559661][  T401]        worker_thread+0x39/0x3f0
      [  101.561340][  T401]        kthread+0x125/0x140
      [  101.562963][  T401]        ret_from_fork+0x3a/0x50
      [  101.564718][  T401]
      [  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
      [  101.567701][  T401]        lock_acquire+0xb4/0x1c0
      [  101.569651][  T401]        flush_workqueue+0xae/0x4c0
      [  101.571494][  T401]        drain_workqueue+0xa9/0x180
      [  101.573234][  T401]        destroy_workqueue+0x17/0x250
      [  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
      [  101.577304][  T401]        process_one_work+0x2a4/0x640
      [  101.579357][  T401]        worker_thread+0x39/0x3f0
      [  101.581055][  T401]        kthread+0x125/0x140
      [  101.582709][  T401]        ret_from_fork+0x3a/0x50
      [  101.584592][  T401]
      [  101.584592][  T401] other info that might help us debug this:
      [  101.584592][  T401]
      [  101.588355][  T401]  Possible unsafe locking scenario:
      [  101.588355][  T401]
      [  101.590974][  T401]        CPU0                    CPU1
      [  101.592889][  T401]        ----                    ----
      [  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
      [  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
      [  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
      [  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
      [  101.605255][  T401]
      [  101.605255][  T401]  *** DEADLOCK ***
      [  101.605255][  T401]
      [  101.608310][  T401] 2 locks held by kworker/2:2/401:
      [  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
      [  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
      [  101.617480][  T401]
      [  101.617480][  T401] stack backtrace:
      [  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
      [  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
      [  101.629239][  T401] Call Trace:
      [  101.630360][  T401]  dump_stack+0x85/0xcb
      [  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
      [  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
      [  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
      [  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
      [  101.638421][  T401]  ? find_held_lock+0x34/0xa0
      [  101.640015][  T401]  lock_acquire+0xb4/0x1c0
      [  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.643248][  T401]  flush_workqueue+0xae/0x4c0
      [  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
      [  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
      [  101.648303][  T401]  drain_workqueue+0xa9/0x180
      [  101.649867][  T401]  destroy_workqueue+0x17/0x250
      [  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
      [  101.653328][  T401]  process_one_work+0x2a4/0x640
      [  101.655029][  T401]  worker_thread+0x39/0x3f0
      [  101.656693][  T401]  ? process_one_work+0x640/0x640
      [  101.658501][  T401]  kthread+0x125/0x140
      [  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  101.661985][  T401]  ret_from_fork+0x3a/0x50
      [  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped
      
      Here is how the above potential deadlock may happen in reboot/shutdown
      code path,
      1) bcache_reboot() is called firstly in the reboot/shutdown code path,
         then in bcache_reboot(), bcache_device_stop() is called.
      2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
         closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
         cached_dev_flush() calls cached_dev_free() via closure_at()
      3) In cached_dev_free(), after stopped writebach kthread
         dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
         destroy_workqueue().
      4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
         drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
         is acquired by lock_map_acquire() in flush_workqueue(). After the
         lock acquired the rest part of flush_workqueue() just wait for the
         workqueue to complete.
      5) Now we look back at writeback thread routine bch_writeback_thread(),
         in the main while-loop, write_dirty() is called via continue_at() in
         read_dirty_submit(), which is called via continue_at() in while-loop
         level called function read_dirty(). Inside write_dirty() it may be
         re-called on workqueeu dc->writeback_write_wq via continue_at().
         It means when the writeback kthread is stopped in cached_dev_free()
         there might be still one kworker queued on dc->writeback_write_wq
         to execute write_dirty() again.
      6) Now this kworker is scheduled on dc->writeback_write_wq to run by
         process_one_work() (which is called by worker_thread()). Before
         calling the kwork routine, wq->lockdep_map is acquired.
      7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
         (lockdep terminology) scenario happens.
      
      Indeed on multiple cores syatem, the above deadlock is very rare to
      happen, just as the code comments in process_one_work() says,
      2263     * AFAICT there is no possible deadlock scenario between the
      2264     * flush_work() and complete() primitives (except for
      	   single-threaded
      2265     * workqueues), so hiding them isn't a problem.
      
      But it is still good to fix such lockdep warning, even no one running
      bcache on single core system.
      
      The fix is simple. This patch solves the above potential deadlock by,
      - Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
      - Flush and destroy dc->writeback_write_wq in writebach kthread routine
        bch_writeback_thread(), where after quit the thread main while-loop
        and before cached_dev_put() is called.
      
      By this fix, dc->writeback_write_wq will be stopped and destroy before
      the writeback kthread stopped, so the chance for a A-A locking on
      wq->lockdep_map is disappeared, such A-A deadlock won't happen
      any more.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7e865eba
    • C
      bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread · f54d801d
      Coly Li 提交于
      Commit 9baf3097 ("bcache: fix for gc and write-back race") added a
      new work queue dc->writeback_write_wq, but forgot to destroy it in the
      error condition when creating dc->writeback_thread failed.
      
      This patch destroys dc->writeback_write_wq if kthread_create() returns
      error pointer to dc->writeback_thread, then a memory leak is avoided.
      
      Fixes: 9baf3097 ("bcache: fix for gc and write-back race")
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f54d801d
    • C
      bcache: don't set max writeback rate if gc is running · 141df8bb
      Coly Li 提交于
      When gc is running, user space I/O processes may wait inside
      bcache code, so no new I/O coming. Indeed this is not a real idle
      time, maximum writeback rate should not be set in such situation.
      Otherwise a faster writeback thread may compete locks with gc thread
      and makes garbage collection slower, which results a longer I/O
      freeze period.
      
      This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
      c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
      false, then update_writeback_rate() will not set writeback rate to
      maximum value even c->idle_counter reaches an idle threshold.
      
      Now writeback thread won't interfere gc thread performance.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      141df8bb
  14. 13 12月, 2018 2 次提交
    • C
      bcache: option to automatically run gc thread after writeback · 7a671d8e
      Coly Li 提交于
      The option gc_after_writeback is disabled by default, because garbage
      collection will discard SSD data which drops cached data.
      
      Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
      enable this option, which wakes up gc thread when writeback accomplished
      and all cached data is clean.
      
      This option is helpful for people who cares writing performance more. In
      heavy writing workload, all cached data can be clean only happens when
      writeback thread cleans all cached data in I/O idle time. In such
      situation a following gc running may help to shrink bcache B+ tree and
      discard more clean data, which may be helpful for future writing
      requests.
      
      If you are not sure whether this is helpful for your own workload,
      please leave it as disabled by default.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7a671d8e
    • S
      bcache: do not mark writeback_running too early · 79b79146
      Shenghui Wang 提交于
      A fresh backing device is not attached to any cache_set, and
      has no writeback kthread created until first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init run
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get something like
      following even if no writeback kthread created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The none ZERO fields are misleading as no alive writeback kthread yet.
      
      Set dc->writeback_running false as no writeback thread created in
      bch_cached_dev_writeback_init().
      
      We have writeback thread created and woken up in bch_cached_dev_writeback
      _start(). Set dc->writeback_running true before bch_writeback_queue()
      called, as a writeback thread will check if dc->writeback_running is true
      before writing back dirty data, and hung if false detected.
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() called,
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      79b79146
  15. 23 8月, 2018 1 次提交
  16. 12 8月, 2018 4 次提交
  17. 09 8月, 2018 1 次提交
    • C
      bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li 提交于
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce performance regression because multiple faster
      writeback threads of the idle bcache devices will compete the btree level
      locks with the bcache device who have I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writebac when
      all bcache devices attached on the cache set are idle. And if one of the
      bcache devices has new I/O request coming, minimized all writeback
      throughput immediately and let PI controller __update_writeback_rate()
      to decide the upcoming writeback rate for each bcache device.
      
      Also when all bcache devices are idle, limited wrieback rate to a small
      number is wast of thoughput, especially when backing devices are slower
      non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
      rate for each backing device if the whole cache set is idle. A faster
      writeback rate in idle time means new I/Os may have more available space
      for dirty data, and people may observe a better write performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: NColy Li <colyli@suse.de>
      Tested-by: NKai Krakow <kai@kaishome.de>
      Tested-by: NStefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ea8c5356
  18. 27 7月, 2018 2 次提交
    • T
      bcache: fix I/O significant decline while backend devices registering · 94f71c16
      Tang Junhui 提交于
      I attached several backend devices in the same cache set, and produced lots
      of dirty data by running small rand I/O writes in a long time, then I
      continue run I/O in the others cached devices, and stopped a cached device,
      after a mean while, I register the stopped device again, I see the running
      I/O in the others cached devices dropped significantly, sometimes even
      jumps to zero.
      
      In currently code, bcache would traverse each keys and btree node to count
      the dirty data under read locker, and the writes threads can not get the
      btree write locker, and when there is a lot of keys and btree node in the
      registering device, it would last several seconds, so the write I/Os in
      others cached device are blocked and declined significantly.
      
      In this patch, when a device registering to a ache set, which exist others
      cached devices with running I/Os, we get the amount of dirty data of the
      device in an incremental way, and do not block other cached devices all the
      time.
      
      Patch v2: Rename some variables and macros name as Coly suggested.
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      94f71c16
    • T
      bcache: simplify the calculation of the total amount of flash dirty data · 99a27d59
      Tang Junhui 提交于
      Currently we calculate the total amount of flash only devices dirty data
      by adding the dirty data of each flash only device under registering
      locker. It is very inefficient.
      
      In this patch, we add a member flash_dev_dirty_sectors in struct cache_set
      to record the total amount of flash only devices dirty data in real time,
      so we didn't need to calculate the total amount of dirty data any more.
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      99a27d59
  19. 03 5月, 2018 1 次提交
    • C
      bcache: count backing device I/O error for writeback I/O · bf78980f
      Coly Li 提交于
      Commit c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      counts backing device I/O requets and set dc->io_disable to true if error
      counters exceeds dc->io_error_limit. But it only counts I/O errors for
      regular I/O request, neglects errors of write back I/Os when backing device
      is offline.
      
      This patch counts the errors of writeback I/Os, in dirty_endio() if
      bio->bi_status is  not 0, it means error happens when writing dirty keys
      to backing device, then bch_count_backing_io_errors() is called.
      
      By this fix, even there is no reqular I/O request coming, if writeback I/O
      errors exceed dc->io_error_limit, the bcache device may still be stopped
      for the broken backing device.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bf78980f
  20. 19 3月, 2018 5 次提交
    • C
      bcache: add backing_request_endio() for bi_end_io · 27a40ab9
      Coly Li 提交于
      In order to catch I/O error of backing device, a separate bi_end_io
      call back is required. Then a per backing device counter can record I/O
      errors number and retire the backing device if the counter reaches a
      per backing device I/O error limit.
      
      This patch adds backing_request_endio() to bcache backing device I/O code
      path, this is a preparation for further complicated backing device failure
      handling. So far there is no real code logic change, I make this change a
      separate patch to make sure it is stable and reliable for further work.
      
      Changelog:
      v2: Fix code comments typo, remove a redundant bch_writeback_add() line
          added in v4 patch set.
      v1: indeed this is new added in this patch set.
      
      [mlyle: truncated commit subject]
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27a40ab9
    • C
      bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags · 771f393e
      Coly Li 提交于
      When too many I/Os failed on cache device, bch_cache_set_error() is called
      in the error handling code path to retire whole problematic cache set. If
      new I/O requests continue to come and take refcount dc->count, the cache
      set won't be retired immediately, this is a problem.
      
      Further more, there are several kernel thread and self-armed kernel work
      may still running after bch_cache_set_error() is called. It needs to wait
      quite a while for them to stop, or they won't stop at all. They also
      prevent the cache set from being retired.
      
      The solution in this patch is, to add per cache set flag to disable I/O
      request on this cache and all attached backing devices. Then new coming I/O
      requests can be rejected in *_make_request() before taking refcount, kernel
      threads and self-armed kernel worker can stop very fast when flags bit
      CACHE_SET_IO_DISABLE is set.
      
      Because bcache also do internal I/Os for writeback, garbage collection,
      bucket allocation, journaling, this kind of I/O should be disabled after
      bch_cache_set_error() is called. So closure_bio_submit() is modified to
      check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
      closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
      return, generic_make_request() won't be called.
      
      A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit
      from cache_set->flags, to disable or enable cache set I/O for debugging. It
      is helpful to trigger more corner case issues for failed cache device.
      
      Changelog
      v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
          kernel threads.
      v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
          remove "bcache: " prefix when printing out kernel message.
      v2, more changes by previous review,
      - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
      - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
        is reported and inspired from origal patch of Pavel Vazharov.
      v1, initial version.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Pavel Vazharov <freakpv@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      771f393e
    • C
      bcache: stop dc->writeback_rate_update properly · 3fd47bfe
      Coly Li 提交于
      struct delayed_work writeback_rate_update in struct cache_dev is a delayed
      worker to call function update_writeback_rate() in period (the interval is
      defined by dc->writeback_rate_update_seconds).
      
      When a metadate I/O error happens on cache device, bcache error handling
      routine bch_cache_set_error() will call bch_cache_set_unregister() to
      retire whole cache set. On the unregister code path, this delayed work is
      stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update).
      
      dc->writeback_rate_update is a special delayed work from others in bcache.
      In its routine update_writeback_rate(), this delayed work is re-armed
      itself. That means when cancel_delayed_work_sync() returns, this delayed
      work can still be executed after several seconds defined by
      dc->writeback_rate_update_seconds.
      
      The problem is, after cancel_delayed_work_sync() returns, the cache set
      unregister code path will continue and release memory of struct cache set.
      Then the delayed work is scheduled to run, __update_writeback_rate()
      will reference the already released cache_set memory, and trigger a NULL
      pointer deference fault.
      
      This patch introduces two more bcache device flags,
      - BCACHE_DEV_WB_RUNNING
        bit set:  bcache device is in writeback mode and running, it is OK for
                  dc->writeback_rate_update to re-arm itself.
        bit clear:bcache device is trying to stop dc->writeback_rate_update,
                  this delayed work should not re-arm itself and quit.
      - BCACHE_DEV_RATE_DW_RUNNING
        bit set:  routine update_writeback_rate() is executing.
        bit clear: routine update_writeback_rate() quits.
      
      This patch also adds a function cancel_writeback_rate_update_dwork() to
      wait for dc->writeback_rate_update quits before cancel it by calling
      cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected
      quit dc->writeback_rate_update, after time_out seconds this function will
      give up and continue to call cancel_delayed_work_sync().
      
      And here I explain how this patch stops self re-armed delayed work properly
      with the above stuffs.
      
      update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
      and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
      cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING.
      
      Before calling cancel_delayed_work_sync() wait utill flag
      BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
      cancel_delayed_work_sync(), dc->writeback_rate_update must be already re-
      armed, or quite by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases
      delayed work routine update_writeback_rate() won't be executed after
      cancel_delayed_work_sync() returns.
      
      Inside update_writeback_rate() before calling schedule_delayed_work(), flag
      BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means
      someone is about to stop the delayed work. Because flag
      BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync()
      has to wait for this flag to be cleared, we don't need to worry about race
      condition here.
      
      If update_writeback_rate() is scheduled to run after checking
      BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
      in cancel_writeback_rate_update_dwork(), it is also safe. Because at this
      moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned
      previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear
      and quit immediately.
      
      Because there are more dependences inside update_writeback_rate() to struct
      cache_set memory, dc->writeback_rate_update is not a simple self re-arm
      delayed work. After trying many different methods (e.g. hold dc->count, or
      use locks), this is the only way I can find which works to properly stop
      dc->writeback_rate_update delayed work.
      
      Changelog:
      v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
          to bit index, for test_bit().
      v2: Try to fix the race issue which is pointed out by Junhui.
      v1: The initial version for review
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NJunhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3fd47bfe
    • C
      bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set · fadd94e0
      Coly Li 提交于
      In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
      cached_dev_get() is called when creating dc->writeback_thread, and
      cached_dev_put() is called when exiting dc->writeback_thread. This
      modification works well unless people detach the bcache device manually by
          'echo 1 > /sys/block/bcache<N>/bcache/detach'
      Because this sysfs interface only calls bch_cached_dev_detach() which wakes
      up dc->writeback_thread but does not stop it. The reason is, before patch
      "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
      bch_writeback_thread(), if cache is not dirty after writeback,
      cached_dev_put() will be called here. And in cached_dev_make_request() when
      a new write request makes cache from clean to dirty, cached_dev_get() will
      be called there. Since we don't operate dc->count in these locations,
      refcount d->count cannot be dropped after cache becomes clean, and
      cached_dev_detach_finish() won't be called to detach bcache device.
      
      This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
      set inside bch_writeback_thread(). If this bit is set and cache is clean
      (no existing writeback_keys), break the while-loop, call cached_dev_put()
      and quit the writeback thread.
      
      Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
      writeback thread should continue to perform writeback, this is the original
      design of manually detach.
      
      It is safe to do the following check without locking, let me explain why,
      +	if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
      +	    (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {
      
      If the kenrel thread does not sleep and continue to run due to conditions
      are not updated in time on the running CPU core, it just consumes more CPU
      cycles and has no hurt. This should-sleep-but-run is safe here. We just
      focus on the should-run-but-sleep condition, which means the writeback
      thread goes to sleep in mistake while it should continue to run.
      1, First of all, no matter the writeback thread is hung or not,
         kthread_stop() from cached_dev_detach_finish() will wake up it and
         terminate by making kthread_should_stop() return true. And in normal
         run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
         condition
      	!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
         is always true and can be ignored as constant value.
      2, If one of the following conditions is true, the writeback thread should
         go to sleep,
         "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)"
         each of them independently controls the writeback thread should sleep or
         not, let's analyse them one by one.
      2.1 condition "!atomic_read(&dc->has_dirty)"
         If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
         call bch_writeback_queue() immediately or call bch_writeback_add() which
         indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
         wake_up_process(dc->writeback_thread) is called. It sets writeback
         thread's task state to TASK_RUNNING and following an implicit memory
         barrier, then tries to wake up the writeback thread.
         In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
         doing the condition check. If other CPU core sets the TASK_RUNNING state
         after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
         will be scheduled to run very soon because its state is not
         TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
         writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
         of wake_up_process() will make sure modification of dc->has_dirty on
         other CPU core is updated and observed on the CPU core of writeback
         thread. Therefore the condition check will correctly be false, and
         continue writeback code without sleeping.
      2.2 condition "!dc->writeback_running)"
         dc->writeback_running can be changed via sysfs file, every time it is
         modified, a following bch_writeback_queue() is alwasy called. So the
         change is always observed on the CPU core of writeback thread. If
         dc->writeback_running is changed from 0 to 1 on other CPU core, this
         condition check will observe the modification and allow writeback
         thread to continue to run without sleeping.
      Now we can see, even without a locking protection, multiple conditions
      check is safe here, no deadlock or process hang up will happen.
      
      I compose a separte patch because that patch "bcache: fix cached_dev->count
      usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes
      Reinecke. Also this fix is not trivial and good for a separate patch.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Huijun Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fadd94e0
    • C
      bcache: fix cached_dev->count usage for bch_cache_set_error() · 804f3c69
      Coly Li 提交于
      When bcache metadata I/O fails, bcache will call bch_cache_set_error()
      to retire the whole cache set. The expected behavior to retire a cache
      set is to unregister the cache set, and unregister all backing device
      attached to this cache set, then remove sysfs entries of the cache set
      and all attached backing devices, finally release memory of structs
      cache_set, cache, cached_dev and bcache_device.
      
      In my testing when journal I/O failure triggered by disconnected cache
      device, sometimes the cache set cannot be retired, and its sysfs
      entry /sys/fs/bcache/<uuid> still exits and the backing device also
      references it. This is not expected behavior.
      
      When metadata I/O failes, the call senquence to retire whole cache set is,
              bch_cache_set_error()
              bch_cache_set_unregister()
              bch_cache_set_stop()
              __cache_set_unregister()     <- called as callback by calling
                                              clousre_queue(&c->caching)
              cache_set_flush()            <- called as a callback when refcount
                                              of cache_set->caching is 0
              cache_set_free()             <- called as a callback when refcount
                                              of catch_set->cl is 0
              bch_cache_set_release()      <- called as a callback when refcount
                                              of catch_set->kobj is 0
      
      I find if kernel thread bch_writeback_thread() quits while-loop when
      kthread_should_stop() is true and searched_full_index is false, clousre
      callback cache_set_flush() set by continue_at() will never be called. The
      result is, bcache fails to retire whole cache set.
      
      cache_set_flush() will be called when refcount of closure c->caching is 0,
      and in function bcache_device_detach() refcount of closure c->caching is
      released to 0 by clousre_put(). In metadata error code path, function
      bcache_device_detach() is called by cached_dev_detach_finish(). This is a
      callback routine being called when cached_dev->count is 0. This refcount
      is decreased by cached_dev_put().
      
      The above dependence indicates, cache_set_flush() will be called when
      refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0
      when refcount of cache_dev->count is 0.
      
      The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails
      and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount
      of cache_dev is not decreased properly.
      
      In bch_writeback_thread(), cached_dev_put() is called only when
      searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
      there is no dirty data on cache. In most of run time it is correct, but
      when bch_writeback_thread() quits the while-loop while cache is still
      dirty, current code forget to call cached_dev_put() before this kernel
      thread exits. This is why sometimes cache_set_flush() is not executed and
      cache set fails to be retired.
      
      The reason to call cached_dev_put() in bch_writeback_rate() is, when the
      cache device changes from clean to dirty, cached_dev_get() is called, to
      make sure during writeback operatiions both backing and cache devices
      won't be released.
      
      Adding following code in bch_writeback_thread() does not work,
         static int bch_writeback_thread(void *arg)
              }
      
      +       if (atomic_read(&dc->has_dirty))
      +               cached_dev_put()
      +
              return 0;
       }
      because writeback kernel thread can be waken up and start via sysfs entry:
              echo 1 > /sys/block/bcache<N>/bcache/writeback_running
      It is difficult to check whether backing device is dirty without race and
      extra lock. So the above modification will introduce potential refcount
      underflow in some conditions.
      
      The correct fix is, to take cached dev refcount when creating the kernel
      thread, and put it before the kernel thread exits. Then bcache does not
      need to take a cached dev refcount when cache turns from clean to dirty,
      or to put a cached dev refcount when cache turns from ditry to clean. The
      writeback kernel thread is alwasy safe to reference data structure from
      cache set, cache and cached device (because a refcount of cache device is
      taken for it already), and no matter the kernel thread is stopped by I/O
      errors or system reboot, cached_dev->count can always be used correctly.
      
      The patch is simple, but understanding how it works is quite complicated.
      
      Changelog:
      v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
      v1: initial version for review.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      804f3c69
  21. 08 2月, 2018 2 次提交
    • C
      bcache: set writeback_rate_update_seconds in range [1, 60] seconds · 7a5e3ecb
      Coly Li 提交于
      dc->writeback_rate_update_seconds can be set via sysfs and its value can
      be set to [1, ULONG_MAX].  It does not make sense to set such a large
      value, 60 seconds is long enough value considering the default 5 seconds
      works well for long time.
      
      Because dc->writeback_rate_update is a special delayed work, it re-arms
      itself inside the delayed work routine update_writeback_rate(). When
      stopping it by cancel_delayed_work_sync(), there should be a timeout to
      wait and make sure the re-armed delayed work is stopped too. A small max
      value of dc->writeback_rate_update_seconds is also helpful to decide a
      reasonable small timeout.
      
      This patch limits sysfs interface to set dc->writeback_rate_update_seconds
      in range of [1, 60] seconds, and replaces the hand-coded number by macros.
      
      Changelog:
      v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
      v1: initial version.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7a5e3ecb
    • C
      bcache: properly set task state in bch_writeback_thread() · 99361bbf
      Coly Li 提交于
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If condition check is true, its task state is set to TASK_INTERRUPTIBLE
      and call schedule() to wait for others to wake up it.
      
      There are 2 issues in current code,
      1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
         another process changes the condition and call wake_up_process(dc->
         writeback_thread), then at line 452 task state is set back to
         TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
         waken up.
      2, At line 454 if kthread_should_stop() is true, writeback kernel thread
         will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
         call do_exit(). It is not good to enter do_exit() with task state
         TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
         warning message is reported by __might_sleep(): "WARNING: do not call
         blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
      
      For the first issue, task state should be set before condition checks.
      Ineed because dc->writeback_lock is required when modifying all the
      conditions, calling set_current_state() inside code block where dc->
      writeback_lock is hold is safe. But this is quite implicit, so I still move
      set_current_state() before all the condition checks.
      
      For the second issue, frankley speaking it does not hurt when kernel thread
      exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
      makes them feel there might be something risky with bcache and hurt their
      data.  Setting task state to TASK_RUNNING before returning fixes this
      problem.
      
      In alloc.c:allocator_wait(), there is also a similar issue, and is also
      fixed in this patch.
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      99361bbf
  22. 09 1月, 2018 1 次提交
    • M
      bcache: fix writeback target calc on large devices · 616486ab
      Michael Lyle 提交于
      Bcache needs to scale the dirty data in the cache over the multiple
      backing disks in order to calculate writeback rates for each.
      The previous code did this by multiplying the target number of dirty
      sectors by the backing device size, and expected it to fit into a
      uint64_t; this blows up on relatively small backing devices.
      
      The new approach figures out the bdev's share in 16384ths of the overall
      cached data.  This is chosen to cope well when bdevs drastically vary in
      size and to ensure that bcache can cross the petabyte boundary for each
      backing device.
      
      This has been improved based on Tang Junhui's feedback to ensure that
      every device gets a share of dirty data, no matter how small it is
      compared to the total backing pool.
      
      The existing mechanism is very limited; this is purely a bug fix to
      remove limits on volume size.  However, there still needs to be change
      to make this "fair" over many volumes where some are idle.
      Reported-by: NJack Douglas <jack@douglastechnology.co.uk>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      616486ab