1. 03 October 2020, 1 commit
    • bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve() · 7e59c506
      Authored by Dongsheng Yang
      In the mca_reserve(c) macro, we only check whether c->root is NULL.
      That is not enough: when the root node is read in run_cache_set(),
      an error from bch_btree_node_read_done() leaves ERR_PTR(-EIO) in
      c->root.

      We then continue on to unregister, but before calling
      unregister_shrinker(&c->shrink) there is a window in which
      bch_mca_count() may be called, producing a crash with a call trace
      like this:
      
      [ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5
      ... ...
      [ 2150.598931] Call trace:
      [ 2150.606439]  bch_mca_count+0x58/0x98 [escache]
      [ 2150.615866]  do_shrink_slab+0x54/0x310
      [ 2150.624429]  shrink_slab+0x248/0x2d0
      [ 2150.632633]  drop_slab_node+0x54/0x88
      [ 2150.640746]  drop_slab+0x50/0x88
      [ 2150.648228]  drop_caches_sysctl_handler+0xf0/0x118
      [ 2150.657219]  proc_sys_call_handler.isra.18+0xb8/0x110
      [ 2150.666342]  proc_sys_write+0x40/0x50
      [ 2150.673889]  __vfs_write+0x48/0x90
      [ 2150.681095]  vfs_write+0xac/0x1b8
      [ 2150.688145]  ksys_write+0x6c/0xd0
      [ 2150.695127]  __arm64_sys_write+0x24/0x30
      [ 2150.702749]  el0_svc_handler+0xa0/0x128
      [ 2150.710296]  el0_svc+0x8/0xc
      Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7e59c506
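      Below is a minimal sketch of the check this fix relies on; root_reserve()
      is a hypothetical helper for illustration, and the cache_set/root fields
      are assumed from the bcache context above. The point is that after a
      failed bch_btree_node_read_done(), c->root may hold ERR_PTR(-EIO) rather
      than NULL, so a plain NULL test lets the shrinker dereference an error
      pointer.

      	#include <linux/err.h>
      	#include <linux/kernel.h>

      	/* Hypothetical helper: how much extra reserve the root contributes. */
      	static unsigned int root_reserve(struct cache_set *c)
      	{
      		/* IS_ERR_OR_NULL() covers both NULL and ERR_PTR() values. */
      		if (IS_ERR_OR_NULL(c->root))
      			return 0;

      		return min_t(unsigned int, c->root->level, 4);
      	}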
  2. 25 July 2020, 2 commits
    • bcache: handle cache set verify_ondisk properly for bucket size > 8MB · bf6af170
      Authored by Coly Li
      In bch_btree_cache_alloc(), when CONFIG_BCACHE_DEBUG is configured,
      allocating memory for c->verify_ondisk may fail if the bucket size is
      larger than 8MB, because __get_free_pages() would then be asked for
      continuous pages with order > 11 (the default MAX_ORDER of the Linux
      buddy allocator). Such an oversized allocation fails and causes two
      problems:
      - When CONFIG_BCACHE_DEBUG is configured, bch_btree_verify() does not
        work, because c->verify_ondisk is NULL and bch_btree_verify() returns
        immediately.
      - bch_btree_cache_alloc() fails because the c->verify_ondisk allocation
        failed, so the whole cache device registration fails. Because of this
        failure, the first problem with bch_btree_verify() never even gets a
        chance to be triggered.

      This patch fixes the above problems in two ways:
      1) If page allocation for c->verify_ondisk fails, set it to NULL and
         return -ENOMEM from bch_btree_cache_alloc().
      2) When calling __get_free_pages() to allocate the c->verify_ondisk
         pages, use ilog2(meta_bucket_pages(&c->sb)) so that the resulting
         page order is always <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER).
         Then the buddy system won't directly reject the allocation request.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bf6af170
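      A sketch of the allocation pattern the fix describes (not the literal
      patch); meta_bucket_pages() is the bcache helper named above, and the
      surrounding function is illustrative only:

      	#include <linux/gfp.h>
      	#include <linux/log2.h>

      	static int alloc_verify_ondisk(struct cache_set *c)
      	{
      		/* Order derived from the meta bucket page count stays <= MAX_ORDER. */
      		c->verify_ondisk = (void *)
      			__get_free_pages(GFP_KERNEL | __GFP_COMP,
      					 ilog2(meta_bucket_pages(&c->sb)));
      		if (!c->verify_ondisk)
      			return -ENOMEM;	/* leave it NULL and fail registration */

      		return 0;
      	}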
    • bcache: allocate meta data pages as compound pages · 5fe48867
      Authored by Coly Li
      Some bcache metadata is allocated as multiple pages and used as bio
      bv_page entries for I/O to the cache device, for example
      cache_set->uuids, cache->disk_buckets, journal_write->data and
      bset_tree->data.

      For such metadata memory, all the allocated pages should be treated
      as a single memory block, so that memory management and the underlying
      I/O code can handle them more cleanly.

      This patch adds the __GFP_COMP flag to every location that allocates
      order > 0 pages for the above metadata, so those pages are now treated
      as compound pages.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5fe48867
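      An illustration of the change (function name is illustrative): for
      order > 0 metadata allocations that later end up as bio bv_page
      entries, request a compound page so the whole block is managed as one
      unit.

      	#include <linux/gfp.h>

      	static void *alloc_meta_block(unsigned int order)
      	{
      		/* __GFP_COMP turns the multi-page allocation into a compound page. */
      		return (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP, order);
      	}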
  3. 01 July 2020, 1 commit
  4. 15 June 2020, 1 commit
    • bcache: fix potential deadlock problem in btree_gc_coalesce · be23e837
      Authored by Zhiqiang Liu
      coccicheck reports:
        drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

      In btree_gc_coalesce(), if the coalescing process fails we go to the
      out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
      That causes a deadlock when we then try to acquire
      new_nodes[i]->write_lock in order to free new_nodes[i] before
      returning.

      btree_gc_coalesce() looks like this:
      	if alloc new_nodes[i] fails:
      		goto out_nocoalesce;
      	// obtain new_nodes[i]->write_lock
      	mutex_lock(&new_nodes[i]->write_lock)
      	// main coalescing process
      	for (i = nodes - 1; i > 0; --i)
      		[snipped]
      		if coalescing process fails:
      			// here, going directly to the out_nocoalesce
      			// tag causes a deadlock
      			goto out_nocoalesce;
      		[snipped]
      	// release new_nodes[i]->write_lock
      	mutex_unlock(&new_nodes[i]->write_lock)
      	// coalescing succeeded, return
      	return;
      out_nocoalesce:
      	btree_node_free(new_nodes[i])	// free new_nodes[i]
      	// obtain new_nodes[i]->write_lock
      	mutex_lock(&new_nodes[i]->write_lock);
      	// set flag for reuse
      	clear_bit(BTREE_NODE_dirty, &new_nodes[i]->flags);
      	// release new_nodes[i]->write_lock
      	mutex_unlock(&new_nodes[i]->write_lock);

      To fix the problem, we add a new tag, out_unlock_nocoalesce, which
      releases new_nodes[i]->write_lock before the out_nocoalesce tag. If the
      coalescing process fails, we go to out_unlock_nocoalesce to release
      new_nodes[i]->write_lock before new_nodes[i] is freed under
      out_nocoalesce.

      (Coly Li helped to clean up the commit log format.)
      
      Fixes: 2a285686 ("bcache: btree locking rework")
      Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be23e837
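      The control flow after the fix, sketched in the same style as the
      pseudocode above (simplified; coalescing_failed stands in for the real
      failure checks):

      	/* all write locks taken */
      	for (i = 0; i < nodes; i++)
      		mutex_lock(&new_nodes[i]->write_lock);

      	/* ... main coalescing work ... */
      	if (coalescing_failed)
      		goto out_unlock_nocoalesce;

      	for (i = 0; i < nodes; i++)
      		mutex_unlock(&new_nodes[i]->write_lock);
      	return;

      out_unlock_nocoalesce:
      	/* new: drop the write locks before the generic cleanup path */
      	for (i = 0; i < nodes; i++)
      		mutex_unlock(&new_nodes[i]->write_lock);

      out_nocoalesce:
      	for (i = 0; i < nodes; i++) {
      		btree_node_free(new_nodes[i]);
      		mutex_lock(&new_nodes[i]->write_lock);
      		clear_bit(BTREE_NODE_dirty, &new_nodes[i]->flags);
      		mutex_unlock(&new_nodes[i]->write_lock);
      	}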
  5. 27 May 2020, 2 commits
  6. 23 March 2020, 4 commits
    • bcache: optimize barrier usage for atomic operations · eb9b6666
      Authored by Coly Li
      The idea of this patch is from Davidlohr Bueso, who posted a patch for
      bcache to optimize barrier usage around read-modify-write atomic
      bitops. The same optimization also applies to other locations where
      smp_mb() is used before or after an atomic operation.

      This patch replaces smp_mb() with smp_mb__before_atomic() or
      smp_mb__after_atomic() in btree.c and writeback.c, where the barrier is
      only needed to order memory accesses around the atomic operation for
      other cores. Although these locations are not on a hot code path, it
      never hurts to make things a little better.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb9b6666
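      A small illustration of the substitution (the flag and function names
      are illustrative, not taken from btree.c): set_bit()/clear_bit() do not
      imply a barrier, so the lighter __before_atomic/__after_atomic variants
      pair with them instead of a full smp_mb().

      	#include <linux/atomic.h>
      	#include <linux/bitops.h>

      	static void publish_then_mark(unsigned long *flags)
      	{
      		/* was: smp_mb(); */
      		smp_mb__before_atomic();
      		set_bit(0, flags);
      	}

      	static void mark_then_publish(unsigned long *flags)
      	{
      		clear_bit(0, flags);
      		smp_mb__after_atomic();
      		/* was: smp_mb(); */
      	}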
    • bcache: make bch_btree_check() to be multithreaded · 8e710227
      Authored by Coly Li
      When registering a cache device, bch_btree_check() is called to check
      all btree nodes, to make sure the btree is consistent and not
      corrupted.

      bch_btree_check() is executed recursively in a single thread; when a
      lot of data is cached and the btree is huge, it may take a very long
      time to check all the btree nodes. In my testing, I observed it took
      around 50 minutes to finish bch_btree_check().

      When checking the bcache btree nodes, the cache set is not running yet
      and the whole tree is effectively read-only, so it is safe to create
      multiple threads to check the btree in parallel.

      This patch creates multiple threads, and each thread checks, one by
      one, the sub-trees indexed by keys from the btree root node. The number
      of parallel threads depends on how many keys are in the btree root
      node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but in
      practice it should be min(cpu-number/2, root-node-keys-number).
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8e710227
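      A sketch of how the thread count described above could be derived (the
      helper is assumed, not the literal bcache code; the caller would
      additionally cap it by the number of keys in the root node):

      	#include <linux/cpumask.h>

      	#define BCH_BTR_CHKTHREAD_MAX	64

      	static int btree_check_thread_nr(void)
      	{
      		int n = num_online_cpus() / 2;

      		if (n == 0)
      			n = 1;
      		else if (n > BCH_BTR_CHKTHREAD_MAX)
      			n = BCH_BTR_CHKTHREAD_MAX;

      		return n;
      	}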
    • bcache: add bcache_ prefix to btree_root() and btree() macros · feac1a70
      Authored by Coly Li
      This patch renames the macros btree_root() and btree() to
      bcache_btree_root() and bcache_btree(), to avoid a potential generic
      name clash in the future.

      NOTE: for product kernel maintainers, this patch can be skipped if you
      feel the renaming introduces inconvenience for patch backporting.
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      feac1a70
    • bcache: move macro btree() and btree_root() into btree.h · 253a99d9
      Authored by Coly Li
      In order to accelerate bcache registration, the macros btree() and
      btree_root() will be referenced outside of btree.c. This patch moves
      them from btree.c into btree.h, together with the related function
      declarations, in preparation for the following changes.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      253a99d9
  7. 03 March 2020, 1 commit
  8. 13 February 2020, 1 commit
    • bcache: ignore pending signals when creating gc and allocator thread · 0b96da63
      Authored by Coly Li
      When running a cache set, all the btree nodes of the cache set are
      checked by bch_btree_check(). If the bcache btree is very large,
      iterating over all the btree nodes occupies a lot of system memory, and
      the bcache registering process might be selected and killed by the
      system OOM killer. kthread_run() fails if the current process has a
      pending signal, so creating the gc and allocator kernel threads in
      run_cache_set() is very likely to fail for a very large bcache btree.

      Indeed such an OOM-induced signal is safe to ignore here, and the
      registering process will exit after the registration is done. Therefore
      this patch flushes pending signals during cache set start up,
      specifically in bch_cache_allocator_start() and bch_gc_thread_start(),
      to make sure run_cache_set() won't fail for a large cached data set.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b96da63
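      A sketch of the approach (the wrapper is illustrative, not the literal
      bch_cache_allocator_start()/bch_gc_thread_start() code): drop any
      pending signal before kthread_run(), so thread creation during
      registration does not fail spuriously.

      	#include <linux/kthread.h>
      	#include <linux/sched/signal.h>

      	static struct task_struct *start_helper_thread(int (*fn)(void *),
      						       void *arg, const char *name)
      	{
      		/* Registration will finish and exit soon; ignore pending signals. */
      		if (signal_pending(current))
      			flush_signals(current);

      		return kthread_run(fn, arg, "%s", name);
      	}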
  9. 24 January 2020, 3 commits
    • bcache: reap from tail of c->btree_cache in bch_mca_scan() · e3de0446
      Authored by Coly Li
      When shrinking the btree node cache from c->btree_cache in
      bch_mca_scan(), a selected node is rotated from the head to the tail of
      the c->btree_cache list whether it is reaped or not. But in the bcache
      journal code, when flushing the btree nodes with the oldest journal
      entry, btree nodes are iterated and selected from the tail of the
      c->btree_cache list in btree_flush_write(). The list_rotate_left() in
      bch_mca_scan() therefore makes btree_flush_write() iterate over more
      nodes of c->btree_cache in reverse order.

      This patch just reaps the selected btree node cache and does not move
      it from the head to the tail of the c->btree_cache list, so
      bch_mca_scan() no longer disturbs the c->btree_cache ordering that
      btree_flush_write() relies on.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e3de0446
    • bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() · d5c9c470
      Authored by Coly Li
      In order to skip the most recently freed btree node caches, currently
      bch_mca_scan() skips the first 3 caches on the c->btree_cache_freeable
      list when shrinking bcache node caches. The related code in
      bch_mca_scan() is:

      	list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
      		if (nr <= 0)
      			goto out;

      		if (++i > 3 &&
      		    !mca_reap(b, 0, false)) {
      			/* ... lines that free the cache memory ... */
      		}
      		nr--;
      	}

      The problem is, if the virtual memory code calls bch_mca_scan() and the
      calculated 'nr' is 1 or 2, then nothing is shrunk in the above loop. In
      such a case, if the slub/slab manager calls bch_mca_scan() many times
      with a small scan number, it does not help to shrink cache memory and
      just wastes CPU cycles.

      This patch selects btree node caches from the tail of the
      c->btree_cache_freeable list instead, so the most recently freed caches
      at the head can still be reused by mca_alloc(), and at least one node
      can be shrunk.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5c9c470
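      A sketch of the idea (not the literal loop): walk the freeable list
      from the tail, so the most recently freed nodes at the head are kept
      for mca_alloc() while at least one node can still be reaped even when
      'nr' is very small.

      	list_for_each_entry_safe_reverse(b, t, &c->btree_cache_freeable, list) {
      		if (nr <= 0)
      			goto out;

      		if (!mca_reap(b, 0, false)) {
      			/* ... lines that free the cache memory ... */
      		}
      		nr--;
      	}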
    • bcache: remove member accessed from struct btree · 125d98ed
      Authored by Coly Li
      The member 'accessed' of struct btree is used in bch_mca_scan() when
      shrinking btree node caches. The original idea is: if b->accessed is
      set, clear it and look at the next btree node cache on the
      c->btree_cache list, and only shrink the caches whose b->accessed is
      already cleared. Then only cold btree node caches are shrunk.

      But when I/O pressure is high, it is very probable that b->accessed of
      a btree node cache is set again in bch_btree_node_get() before
      bch_mca_scan() selects it again. Then bch_mca_scan() has no chance to
      shrink enough memory back to the slub or slab system.

      This patch removes the member accessed from struct btree; once a btree
      node cache is selected, it is shrunk immediately. With this change,
      bch_mca_scan() may release btree node caches more efficiently.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      125d98ed
  10. 18 November 2019, 1 commit
    • Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()" · 00b89892
      Authored by Jens Axboe
      Coly says:

      "Guoju Fang talked to me today; he told me this change was unnecessary
      and I had over-thought it.

      Then I realized fifo_idx() uses a mask to handle the array index
      overflow condition, so the index swap in journal_pin_cmp() won't
      happen. And yes, Guoju and Kent are correct.

      Since you already applied this patch, can you please remove it from
      your for-next branch? This single patch does not break anything, but it
      is unnecessary at this moment."

      This reverts commit c0e0954e.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      00b89892
  11. 14 November 2019, 4 commits
    • bcache: at least try to shrink 1 node in bch_mca_scan() · 9fcc34b1
      Authored by Coly Li
      In bch_mca_scan(), the number of btree nodes to shrink is calculated by
      code like this:
      	unsigned long nr = sc->nr_to_scan;

      	nr /= c->btree_pages;
      	nr = min_t(unsigned long, nr, mca_can_free(c));

      The variable sc->nr_to_scan is the number of objects (here, bcache
      B+tree nodes) to shrink, and the pointer sc is passed in by the memory
      management code as the parameter of a callback.

      If sc->nr_to_scan is smaller than c->btree_pages, then after the above
      calculation 'nr' is 0 and nothing is shrunk. It is frequently observed
      that sc->nr_to_scan is only 1 or 2, which makes nr zero. Then
      bch_mca_scan() does nothing more than acquire and release the mutex
      c->bucket_lock.

      This patch checks whether nr is 0 after the above calculation; if so,
      nr is set to 1, so that bch_mca_scan() will at least try to shrink a
      single B+tree node.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9fcc34b1
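      The quoted calculation with the described adjustment added (a sketch,
      extending the snippet above):

      	unsigned long nr = sc->nr_to_scan;

      	nr /= c->btree_pages;
      	nr = min_t(unsigned long, nr, mca_can_free(c));
      	if (nr == 0)
      		nr = 1;	/* always attempt to shrink at least one node */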
    • bcache: add code comments in bch_btree_leaf_dirty() · 5dccefd3
      Authored by Coly Li
      This patch adds code comments in bch_btree_leaf_dirty() to explain why
      w->journal should always reference the oldest journal pin of all the
      bkeys being written into the btree node, making the bcache journal code
      easier to understand.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5dccefd3
    • bcache: fix a lost wake-up problem caused by mca_cannibalize_lock · 34cf78bf
      Authored by Guoju Fang
      This patch fixes a lost wake-up problem caused by the race between
      mca_cannibalize_lock and bch_cannibalize_unlock.

      Consider two processes, A and B. Process A is executing
      mca_cannibalize_lock, while process B holds c->btree_cache_alloc_lock
      and is executing bch_cannibalize_unlock. The problem is that after
      process A executes cmpxchg, but before it executes prepare_to_wait,
      process B executes wake_up in that time slice. Process A then executes
      prepare_to_wait, sets its state to TASK_INTERRUPTIBLE and goes to
      sleep, but no one will ever wake it up. This problem may cause the
      bcache device to hang.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34cf78bf
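      A minimal sketch of one way to close this window (an assumption for
      illustration, not necessarily the literal upstream fix; the
      btree_cannibalize_lock spinlock and btree_cache_wait waitqueue names
      are assumed): serialize the ownership check and prepare_to_wait()
      against the release-and-wake path, so wake_up() can no longer slip in
      between the failed cmpxchg and the sleep.

      	#include <linux/sched.h>
      	#include <linux/spinlock.h>
      	#include <linux/wait.h>

      	static int cannibalize_lock_sketch(struct cache_set *c,
      					   struct wait_queue_entry *wait)
      	{
      		spin_lock(&c->btree_cannibalize_lock);
      		if (!c->btree_cache_alloc_lock) {
      			c->btree_cache_alloc_lock = current;
      		} else if (c->btree_cache_alloc_lock != current) {
      			/* Queue ourselves before dropping the lock: no lost wake-up. */
      			prepare_to_wait(&c->btree_cache_wait, wait,
      					TASK_INTERRUPTIBLE);
      			spin_unlock(&c->btree_cannibalize_lock);
      			return -EINTR;
      		}
      		spin_unlock(&c->btree_cannibalize_lock);
      		return 0;
      	}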
    • bcache: fix fifo index swapping condition in journal_pin_cmp() · c0e0954e
      Authored by Coly Li
      The fifo structure journal.pin is implemented as a circular buffer; when
      the back index reaches the highest location of the buffer, it wraps
      around to 0. Once that wrap happens, a smaller fifo index might be
      associated with a newer journal entry, so the btree node with the
      oldest journal entry won't be selected in bch_btree_leaf_dirty() to
      reference the dirty B+tree leaf node. This may cause the bcache journal
      to fail to protect the oldest unflushed dirty B+tree leaf node across a
      power failure, and that leaf node could be inconsistent after rebooting
      from the power failure.

      This patch fixes the fifo index comparison logic in journal_pin_cmp()
      to avoid a potentially corrupted B+tree leaf node when the back index
      of the journal pin wraps around.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c0e0954e
  12. 28 June 2019, 4 commits
    • bcache: fix race in btree_flush_write() · 50a260e8
      Authored by Coly Li
      There is a race between mca_reap(), btree_node_free() and the journal
      code in btree_flush_write() which results in a very rare and strange
      deadlock or panic that is very hard to reproduce.

      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to the
      cache device; the select-and-flush is a two-step operation. Between
      these two steps, several things may happen inside the race window:
      - The selected btree node was reaped by mca_reap() and allocated to
        another requester for another btree node.
      - The selected btree node was flushed and released by the mca shrink
        callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is that the memory of the selected btree
      node has been released; then any reference to this btree node (e.g.
      b->write_lock) triggers a NULL pointer dereference panic.

      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one by one over quite a long time period.

      Such a race was not easy to reproduce before. On a Lenovo SR650 server
      with 48 Xeon cores, with 1 NVMe SSD configured as cache device and an
      MD raid0 device assembled from 3 NVMe SSDs as backing device, this race
      can be observed roughly once per 10,000 calls of btree_flush_write().
      Both deadlocks and kernel panics happened as aftermath of the race.

      The idea of the fix is to add a btree flag, BTREE_NODE_journal_flush.
      It is set when selecting btree nodes and cleared after the btree nodes
      are flushed. When mca_reap() sees a btree node with this bit set, the
      node is skipped. Since mca_reap() only reaps btree nodes without the
      BTREE_NODE_journal_flush flag, the race is avoided.
      
      One corner case should be noticed: btree_node_free(). It might be
      called in some error handling code path, for example the following
      code piece from btree_split(),
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At line 2151 and 2155, the btree nodes n2 and n1 are released without
      going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
      checked here. If btree_node_free() is called directly in such an error
      handling path and the selected btree node has the
      BTREE_NODE_journal_flush bit set, just delay for 1 us and retry. In
      this case the btree node won't be skipped; we simply retry until the
      BTREE_NODE_journal_flush bit is cleared, then free the btree node
      memory.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50a260e8
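      A sketch of the retry described above for the btree_node_free() corner
      case (a simplified fragment, not the full function): if the node is
      currently flagged by the journal-flush path, back off briefly and retry
      instead of freeing it underneath the flusher.

      	retry:
      		mutex_lock(&b->write_lock);
      		if (test_bit(BTREE_NODE_journal_flush, &b->flags)) {
      			mutex_unlock(&b->write_lock);
      			udelay(1);	/* 1 us delay, as the commit describes */
      			goto retry;
      		}
      		/* ... safe to go on and free the btree node memory ... */
      		mutex_unlock(&b->write_lock);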
    • bcache: add comments for mutex_lock(&b->write_lock) · 41508bb7
      Authored by Coly Li
      When accessing or modifying the BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments explaining why mutex_lock(&b->write_lock) is needed there for
      checking or clearing the BTREE_NODE_dirty bit.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      41508bb7
    • bcache: only clear BTREE_NODE_dirty bit when it is set · e5ec5f47
      Authored by Coly Li
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always cleared no matter whether the btree node is dirty or not. The
      code looks like this:
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);

      If btree_node_dirty(b) returns false, the BTREE_NODE_dirty bit is
      already clear, so clearing it again is unnecessary.

      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5ec5f47
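      The resulting shape of the code, as the description implies (a sketch):

      	if (btree_node_dirty(b)) {
      		btree_complete_write(b, btree_current_write(b));
      		clear_bit(BTREE_NODE_dirty, &b->flags);
      	}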
    • bcache: remove unncessary code in bch_btree_keys_init() · bd9026c8
      Authored by Coly Li
      Function bch_btree_keys_init() initializes b->set[].size and
      b->set[].data to zero. As the code comments indicate, this code is
      indeed unnecessary: both struct btree_keys and struct bset_tree are
      embedded in struct btree, and when struct btree is zero-filled by
      kzalloc() in mca_bucket_alloc(), b->set[].size and b->set[].data are
      already initialized to 0 (a.k.a NULL).

      This patch removes the redundant code and adds comments in
      bch_btree_keys_init() and mca_bucket_alloc() to explain why it is safe.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bd9026c8
  13. 30 April 2019, 2 commits
  14. 25 April 2019, 1 commit
  15. 15 February 2019, 1 commit
  16. 13 December 2018, 1 commit
    • bcache: add comment for cache_set->fill_iter · d2f96f48
      Authored by Shenghui Wang
      We have the following definition for the btree iterator:
      	struct btree_iter {
      		size_t size, used;
      	#ifdef CONFIG_BCACHE_DEBUG
      		struct btree_keys *b;
      	#endif
      		struct btree_iter_set {
      			struct bkey *k, *end;
      		} data[MAX_BSETS];
      	};

      We can see that the length of the data[] field is the static MAX_BSETS,
      which is currently defined as 4.

      But a btree node on disk could have too many bsets for an iterator to
      fit on the stack - maybe far more than MAX_BSETS. In that case space
      has to be allocated dynamically to hold more btree_iter_sets.

      bch_cache_set_alloc() makes sure the pool cache_set->fill_iter can
      allocate an iterator with enough room to hold
      	(sb.bucket_size / sb.block_size)
      btree_iter_sets, which is more than the static MAX_BSETS.

      bch_btree_node_read_done() uses that pool to allocate one iterator that
      can hold the many bsets of one btree node.

      Add more comments around cache_set->fill_iter to make the code less
      confusing.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d2f96f48
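      A sketch of the sizing logic the comment documents (illustrative, not
      the literal bch_cache_set_alloc() code; the exact formula in the kernel
      may differ slightly):

      	size_t iter_size = sizeof(struct btree_iter) +
      		(sb->bucket_size / sb->block_size) *
      		sizeof(struct btree_iter_set);

      	if (mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
      		goto err;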
  17. 08 October 2018, 1 commit
    • bcache: fix miss key refill->end in writeback · 2d6cb6ed
      Authored by Tang Junhui
      refill->end records the last key of writeback. For example, the first
      time, keys (1,128K) to (1,1024K) are flushed to the backing device, but
      the end key (1,1024K) is not included, because of the code below:
      	if (bkey_cmp(k, refill->end) >= 0) {
      		ret = MAP_DONE;
      		goto out;
      	}
      The next time we refill the writeback keybuf, the search starts from
      (1,1024K) and returns a key bigger than it, so the key (1,1024K) is
      missed.
      This patch modifies the above code so that the end key is included in
      the writeback key buffer.
      Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d6cb6ed
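      The quoted check with the adjustment the description implies (a
      sketch): an exclusive comparison lets the end key itself still be added
      to the writeback keybuf.

      	if (bkey_cmp(k, refill->end) > 0) {	/* was: >= 0 */
      		ret = MAP_DONE;
      		goto out;
      	}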
  18. 12 August 2018, 5 commits
  19. 09 August 2018, 1 commit
  20. 27 July 2018, 2 commits
    • bcache: calculate the number of incremental GC nodes according to the total of btree nodes · 7f4a59de
      Authored by Tang Junhui
      This patch is based on "[PATCH] bcache: finish incremental GC".

      Incremental GC stops for 100ms when front-side I/O arrives, so when
      there are many btree nodes and GC only processes a constant number
      (100) of nodes each time, GC lasts a long time, the front-side I/Os run
      out of buckets (since no new bucket can be allocated during GC), and
      I/Os get blocked again.

      So GC should not process a constant number of nodes, but a number that
      varies with the total number of btree nodes. In this patch, GC is
      divided into a constant number (100) of passes, so when there are many
      btree nodes GC processes more nodes in each pass, and otherwise fewer
      nodes in each pass (but no fewer than MIN_GC_NODES).
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7f4a59de
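      A sketch of the scaling described above (the helper, MAX_GC_TIMES, the
      gc_stats.nodes field, and the constants' values are assumed for
      illustration): divide the total btree node count by the fixed number of
      GC passes, but never go below MIN_GC_NODES per pass.

      	#define MAX_GC_TIMES	100
      	#define MIN_GC_NODES	100

      	static size_t btree_gc_min_nodes(struct cache_set *c)
      	{
      		size_t min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;

      		if (min_nodes < MIN_GC_NODES)
      			min_nodes = MIN_GC_NODES;

      		return min_nodes;
      	}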
    • bcache: finish incremental GC · 5c25c4fc
      Authored by Tang Junhui
      In the GC thread, we record the latest GC key in gc_done, which is
      intended to be used for incremental GC, but the current code does not
      actually implement that. When GC runs, front-side I/O is blocked until
      GC finishes, which can take a long time if there are many btree nodes.

      This patch implements incremental GC. The main idea is that, when there
      are front-side I/Os, after GC processes some nodes (100) we stop GC,
      release the lock of the btree node, and go process the front-side I/Os
      for some time (100 ms), then go back to GC again.

      With this patch, I/Os are not blocked for the whole duration of GC, and
      the obvious drops of I/O to zero no longer occur.

      Patch v2: rename some variables and macros as Coly suggested.
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5c25c4fc
  21. 16 June 2018, 1 commit