1. 09 Apr, 2021 (1 commit)
  2. 03 Oct, 2020 (4 commits)
    • C
      bcache: remove embedded struct cache_sb from struct cache_set · 4a784266
      Coly Li committed

      Since the bcache code was merged into the mainline kernel, each cache
      set has only ever had a single cache in it. The framework for multiple
      caches is present, but the code is far from complete. Considering that
      multiple copies of cached data can also be stored on e.g. md raid1
      devices, there is no real need to support multiple caches in one cache
      set.

      The previous preparation patches fixed the dependencies on a cache set
      explicitly having a single cache. Now we no longer have to maintain an
      embedded partial super block in struct cache_set; the in-memory super
      block can be referenced directly from struct cache.

      This patch removes the embedded struct cache_sb from struct cache_set,
      and updates all locations that referenced the super block through the
      removed copy to reference the in-memory super block of struct cache
      instead.
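
      For illustration, a hedged sketch of the kind of change this involves
      at call sites (field names follow the bcache source; the exact sites
      vary per location):

        /* Before: partial super block embedded in struct cache_set */
        unsigned int block_size = c->sb.block_size;

        /* After: the in-memory super block of the single struct cache */
        unsigned int block_size = c->cache->sb.block_size;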
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: check and set sync status on cache's in-memory super block · 6f9414e0
      Coly Li committed

      Currently the cache's sync status is checked and set on the cache
      set's in-memory partial super block. After removing the embedded
      struct cache_sb from struct cache_set and referencing the cache's
      in-memory super block from struct cache_set, the sync status can be
      set and checked directly on the cache's super block.

      This patch checks and sets the cache sync status directly on the
      cache's in-memory super block. This is a preparation for the later
      removal of the embedded struct cache_sb from struct cache_set.
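
      A hedged sketch of the pattern (CACHE_SYNC()/SET_CACHE_SYNC() are
      bcache's super-block flag accessors; the surrounding logic is
      illustrative):

        struct cache *ca = c->cache;

        /* check and set sync status on the cache's own super block */
        if (!CACHE_SYNC(&ca->sb)) {
                SET_CACHE_SYNC(&ca->sb, true);
                bcache_write_super(c);
        }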
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: only use block_bytes() on struct cache · 4e1ebae3
      Coly Li committed

      Because struct cache_set and struct cache both contain a struct
      cache_sb, the macro block_bytes() could be used on either of them.
      Once the embedded struct cache_sb is removed from struct cache_set,
      this macro can no longer be used on struct cache_set.

      This patch unifies all block_bytes() usage to operate only on struct
      cache, as one of the preparations for removing the embedded struct
      cache_sb from struct cache_set.
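
      A minimal sketch of the unified usage (the macro body is paraphrased;
      the sector-to-byte shift is an assumption):

        /* block size in bytes, taken from the cache's super block */
        #define block_bytes(ca)  ((ca)->sb.block_size << 9)

        /* call sites now always pass a struct cache, never a cache set */
        memset(i, 0, block_bytes(b->c->cache));  /* was: block_bytes(b->c) */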
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: remove for_each_cache() · 08fdb2cd
      Coly Li committed

      Since each cache_set now explicitly has a single cache,
      for_each_cache() is unnecessary. This patch removes the macro,
      updates all locations where it was used, and makes sure the code
      logic remains consistent.
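
      A hedged sketch of the transformation (the loop body shown is
      illustrative, not a specific call site):

        /* Before: iterate the (at most one) cache in the set */
        for_each_cache(ca, c, iter)
                total_buckets += ca->sb.nbuckets;

        /* After: reference the single cache directly */
        struct cache *ca = c->cache;
        total_buckets = ca->sb.nbuckets;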
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 24 Aug, 2020 (1 commit)
  4. 25 Jul, 2020 (2 commits)
  5. 27 May, 2020 (1 commit)
    • J
      bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Joe Perches committed

      Remove the trailing newline from the pr_fmt define and add newlines
      to the uses.

      Miscellanea:

      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont,
        as the earlier conversion was done inappropriately, causing
        multiple lines to be emitted where only a single output line was
        desired
      o Use the vsprintf extension %pV in bch_cache_set_error to avoid
        multiple-line output where only a single output line was desired
      o Coalesce formats
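
      A minimal sketch of the style change (the exact pr_fmt body in
      bcache.h may differ slightly):

        /* Before: newline baked into pr_fmt, callers omit it */
        #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt "\n", __func__
        pr_err("cannot allocate memory");

        /* After: the more typical style, callers supply the newline */
        #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt, __func__
        pr_err("cannot allocate memory\n");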
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 13 Feb, 2020 (1 commit)
  7. 01 Feb, 2020 (1 commit)
    • C
      bcache: fix incorrect data type usage in btree_flush_write() · d1c3cc34
      Coly Li committed

      Dan Carpenter points out that commit 2aa8c529 ("bcache: avoid
      unnecessary btree nodes flushing in btree_flush_write()") introduced
      an incorrect data type usage which leads to the following static
      checker warning:
      	drivers/md/bcache/journal.c:444 btree_flush_write()
      	warn: 'ref_nr' unsigned <= 0
      
      drivers/md/bcache/journal.c
         422  static void btree_flush_write(struct cache_set *c)
         423  {
         424          struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
         425          unsigned int i, nr, ref_nr;
                                          ^^^^^^
      
         426          atomic_t *fifo_front_p, *now_fifo_front_p;
         427          size_t mask;
         428
         429          if (c->journal.btree_flushing)
         430                  return;
         431
         432          spin_lock(&c->journal.flush_write_lock);
         433          if (c->journal.btree_flushing) {
         434                  spin_unlock(&c->journal.flush_write_lock);
         435                  return;
         436          }
         437          c->journal.btree_flushing = true;
         438          spin_unlock(&c->journal.flush_write_lock);
         439
         440          /* get the oldest journal entry and check its refcount */
         441          spin_lock(&c->journal.lock);
         442          fifo_front_p = &fifo_front(&c->journal.pin);
         443          ref_nr = atomic_read(fifo_front_p);
         444          if (ref_nr <= 0) {
                          ^^^^^^^^^^^
      Unsigned can't be less than zero.
      
         445                  /*
         446                   * do nothing if no btree node references
         447                   * the oldest journal entry
         448                   */
         449                  spin_unlock(&c->journal.lock);
         450                  goto out;
         451          }
         452          spin_unlock(&c->journal.lock);
      
      As the warning indicates, declaring the local variable ref_nr as
      unsigned int is wrong: it matches neither the return type of
      atomic_read() nor the "<= 0" check.

      This patch fixes the above error by defining the local variable
      ref_nr as int.
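
      A minimal sketch of the fix and the pitfall it avoids:

        int ref_nr;  /* was: unsigned int ref_nr; */

        ref_nr = atomic_read(fifo_front_p);  /* atomic_read() returns int */
        if (ref_nr <= 0)  /* with unsigned, a negative count would wrap to
                             a huge positive value and never take this
                             branch */
                goto out;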
      
      Fixes: 2aa8c529 ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 24 Jan, 2020 (1 commit)
    • C
      bcache: avoid unnecessary btree nodes flushing in btree_flush_write() · 2aa8c529
      Coly Li committed

      Commit 91be66e1 ("bcache: performance improvement for
      btree_flush_write()") was an effort to flush the btree node holding
      the oldest journal entry faster, using the following methods:
      - Only iterate dirty btree nodes in c->btree_cache, avoiding a scan
        over a lot of clean btree nodes.
      - Treat c->btree_cache as an LRU-like list, aggressively flushing all
        dirty nodes from the tail of c->btree_cache until the btree node
        with the oldest journal entry is flushed. This reduces the time
        c->bucket_lock is held.
      
      Guoju Fang and Shuang Li reported that they observed unexpected extra
      write I/Os on the cache device after applying the above patch. Guoju
      Fang provided more detailed diagnostic information: the aggressive
      btree node flushing may cause 10x more btree nodes to be flushed in
      his workload. He points out that when system memory is large enough
      to hold all btree nodes, c->btree_cache is not an LRU-like list any
      more, so the btree node with the oldest journal entry is very
      probably not close to the tail of the c->btree_cache list. In that
      situation many more dirty btree nodes are aggressively flushed before
      the target node is flushed. When a slow SATA SSD is used as the cache
      device, such over-aggressive flushing causes a performance
      regression.

      After spending a lot of time on debugging and diagnosis, I find the
      real situation is more complicated, and aggressively flushing dirty
      btree nodes from the tail of the c->btree_cache list is not a good
      solution.
      - When all btree nodes are cached in memory, c->btree_cache is not an
        LRU-like list, and the btree node with the oldest journal entry
        won't be close to the tail of the list.
      - There can be hundreds of dirty btree nodes referencing the oldest
        journal entry; the oldest journal entry cannot be reclaimed before
        all of those nodes are flushed.
      When these two conditions combine, simply flushing from the tail of
      the c->btree_cache list is really NOT a good idea.
      
      Fortunately there is still a chance to make btree_flush_write() work
      better. Here is how this patch avoids unnecessary btree node flushing
      (a sketch of the lockless check follows further below):
      - Only acquire c->journal.lock when getting the oldest journal entry
        of the fifo c->journal.pin. In all other locations, check the
        journal entries locklessly, accepting that their values may be
        changed by other cores in parallel.
      - Inside the list_for_each_entry_safe_reverse() loop, check the
        latest front point of the fifo c->journal.pin. If it differs from
        the original point fetched while holding c->journal.lock, the
        oldest journal entry has been reclaimed on another core. At that
        moment all the selected dirty nodes recorded in the array
        btree_nodes[] have already been flushed and are clean on other CPU
        cores, so it is unnecessary to iterate c->btree_cache any longer.
        Just quit the list_for_each_entry_safe_reverse() loop, and the
        following for-loop will skip all the selected, now-clean nodes.
      - Find a proper time to quit the list_for_each_entry_safe_reverse()
        loop. Check the refcount of the original fifo front point: if it is
        larger than the number of nodes selected into btree_nodes[], more
        matching btree nodes should be scanned; otherwise there are no more
        matching btree nodes in the rest of the c->btree_cache list and the
        loop can quit. If the original oldest journal entry is reclaimed
        and the fifo front point updated, the refcount of the original
        front point will be 0, and the loop quits in that case too.
      - Do not hold c->bucket_lock for too long. c->bucket_lock is also
        required for space allocation for cached data, and holding it too
        long will block regular I/O requests. When iterating the list
        c->btree_cache, even if there are many matching btree nodes, only
        BTREE_FLUSH_NR nodes are selected and flushed in the following
        for-loop, so that c->bucket_lock is not held for too long.

      With this patch, only btree nodes referencing the oldest journal
      entry are flushed to the cache device; there is no more aggressive
      flushing of unnecessary btree nodes. And in order to avoid blocking
      regular I/O requests, each time btree_flush_write() is called, at
      most BTREE_FLUSH_NR btree nodes are selected to flush, even if there
      are more matching btree nodes in the list c->btree_cache.
      
      Finally, one more thing to explain: why is it safe to read the front
      point of c->journal.pin without holding c->journal.lock inside the
      list_for_each_entry_safe_reverse() loop?

      Here is my answer: when reading the front point of the fifo
      c->journal.pin, we don't need to know its exact value; we just want
      to check whether it differs from the original front point (which is
      accurate because it was fetched while c->journal.lock was held). For
      that purpose it works as expected without holding c->journal.lock.
      Even if the front point is changed on another CPU core and not yet
      visible on the local core, so that the btree node currently being
      iterated still appears locally to reference the originally fetched
      fifo front point, it is still safe: after taking the mutex
      b->write_lock (with its memory barrier), this btree node will be
      found clean and skipped, and the loop will quit later when it
      iterates the next node of the c->btree_cache list.
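
      A hedged sketch of the lockless check described above (shaped after
      btree_flush_write(); node selection is reduced to a comment):

        spin_lock(&c->journal.lock);
        fifo_front_p = &fifo_front(&c->journal.pin);
        ref_nr = atomic_read(fifo_front_p);
        spin_unlock(&c->journal.lock);

        list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
                /* lockless re-read: has the oldest entry been reclaimed? */
                now_fifo_front_p = &fifo_front(&c->journal.pin);
                if (now_fifo_front_p != fifo_front_p)
                        break;  /* selected nodes already flushed elsewhere */
                if (nr >= ref_nr)
                        break;  /* all nodes referencing the entry found */
                /* otherwise: record matching dirty nodes in btree_nodes[],
                 * at most BTREE_FLUSH_NR of them */
        }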
      
      Fixes: 91be66e1 ("bcache: performance improvement for btree_flush_write()")
      Reported-by: Guoju Fang <fangguoju@gmail.com>
      Reported-by: Shuang Li <psymon@bonuscloud.io>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 28 Jun, 2019 (10 commits)
    • C
      bcache: add reclaimed_journal_buckets to struct cache_set · dff90d58
      Coly Li committed

      Now we have counters for how many times the journal is reclaimed and
      how many times cached dirty btree nodes are flushed, but we don't
      know how many journal buckets are really reclaimed.

      This patch adds reclaimed_journal_buckets to struct cache_set. It is
      a monotonically increasing counter telling how many journal buckets
      have been reclaimed since the cache set started running. From these
      three counters (reclaim, reclaimed_journal_buckets, flush_write), we
      can get an idea of how well the current journal space reclaim code
      works.
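
      A minimal sketch of the counter (type and placement follow the
      commit text; the sysfs hookup is omitted):

        /* in struct cache_set: increase-only, never reset */
        atomic_long_t reclaimed_journal_buckets;

        /* in journal_reclaim(), once per bucket actually reclaimed */
        atomic_long_inc(&c->reclaimed_journal_buckets);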
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: performance improvement for btree_flush_write() · 91be66e1
      Coly Li committed

      This patch improves the performance of btree_flush_write() in the
      following ways (a sketch follows the list):
      - Use another spinlock, journal.flush_write_lock, to replace the very
        hot journal.lock. We don't have to use journal.lock here: selecting
        candidate btree nodes takes a lot of time, and holding journal.lock
        would block other journaling threads and drop the overall I/O
        performance.
      - Only select btree nodes to flush from the c->btree_cache list. When
        the machine has a large system memory, the mca cache may hold a
        huge number of cached btree nodes. Iterating all the cached nodes
        takes a lot of CPU time, and most of the nodes on the
        c->btree_cache_freeable and c->btree_cache_freed lists are cleared
        and do not need to be flushed. So traversing only the mca list
        c->btree_cache to select btree nodes to flush should be enough for
        most cases.
      - Don't iterate the whole c->btree_cache list; only select the first
        BTREE_FLUSH_NR btree nodes in reverse order to flush. Iterating all
        btree nodes of c->btree_cache to select those with the oldest
        journal pin consumes a huge number of CPU cycles if the list is
        huge (pushing and popping a node into/out of a heap is expensive).
        The last several dirty btree nodes at the tail of the
        c->btree_cache list are the earliest allocated and cached btree
        nodes, which correlate with the oldest journal pin btree nodes.
        Therefore flushing just BTREE_FLUSH_NR btree nodes from the tail of
        c->btree_cache probably includes the oldest journal pin btree
        nodes.
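
      A hedged sketch of the bounded reverse selection described in the
      last bullet (selection predicate simplified):

        struct btree *btree_nodes[BTREE_FLUSH_NR];
        unsigned int nr = 0;
        struct btree *b, *t;

        mutex_lock(&c->bucket_lock);
        list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
                if (nr >= BTREE_FLUSH_NR)
                        break;  /* bound both the scan and the flush batch */
                if (btree_node_dirty(b))
                        btree_nodes[nr++] = b;
        }
        mutex_unlock(&c->bucket_lock);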
      
      In my testing, the above changes decrease CPU consumption by 50%+
      when journal space is full. Sometimes IOPS drops to 0 for 5-8
      seconds; compared with the previous code blocking I/O for 120+
      seconds, this is much better. Maybe there is room for improvement in
      the future, but at this moment the fix looks fine and performs well
      in my testing.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix race in btree_flush_write() · 50a260e8
      Coly Li committed

      There is a race between mca_reap(), btree_node_free() and the journal
      code btree_flush_write() which results in a very rare and strange
      deadlock or panic, and is very hard to reproduce.

      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to
      the cache device; the select-and-flush is a two-step operation.
      Between these two steps, the following may happen inside the race
      window:
      - The selected btree node is reaped by mca_reap() and its memory
        allocated to another requester for another btree node.
      - The selected btree node is selected, flushed and released by the
        mca shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is that the memory of the selected btree
      node has been released; then every reference to this btree node
      (e.g. b->write_lock) will trigger a NULL pointer dereference panic.

      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and
      flushed them one by one over quite a long time period.

      Such a race was not easy to reproduce before. On a Lenovo SR650
      server with 48 Xeon cores, with 1 NVMe SSD configured as the cache
      device and an MD raid0 device assembled from 3 NVMe SSDs as the
      backing device, this race can be observed roughly once every 10,000
      calls of btree_flush_write(). Both deadlocks and kernel panics
      happened as aftermath of the race.

      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
      It is set when selecting btree nodes and cleared after the btree
      nodes are flushed. When mca_reap() encounters a btree node with this
      bit set, the node is skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, the race is avoided.
      
      One corner case should be noticed, and that is btree_node_free(). It
      might be called in some error handling code paths. For example, the
      following code piece is from btree_split():
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released
      without mca_reap(), so BTREE_NODE_journal_flush also needs to be
      checked here. If btree_node_free() is called directly in such an
      error handling path and the selected btree node has the
      BTREE_NODE_journal_flush bit set, just delay for 1 us and retry. In
      this case the btree node won't be skipped; we simply retry until the
      BTREE_NODE_journal_flush bit is cleared, and then free the btree node
      memory.
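
      A hedged sketch of the two checks (the helper name follows bcache's
      BTREE_FLAG() convention; control flow is simplified):

        /* in mca_reap(): never reap a node the journal code is flushing */
        if (btree_node_journal_flush(b))
                goto out_unlock;

        /* in btree_node_free(): wait for the flag instead of skipping */
        mutex_lock(&b->write_lock);
        while (btree_node_journal_flush(b)) {
                mutex_unlock(&b->write_lock);
                udelay(1);  /* delay 1 us, then re-check */
                mutex_lock(&b->write_lock);
        }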
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: remove retry_flush_write from struct cache_set · d91ce757
      Coly Li committed

      In struct cache_set, retry_flush_write was added by commit c4dc2497
      ("bcache: fix high CPU occupancy during journal"), which was reverted
      in the previous patch.

      It is now useless, and this patch removes it from the bcache code.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: Revert "bcache: fix high CPU occupancy during journal" · 249a5f6d
      Coly Li committed

      This reverts commit c4dc2497.

      That patch enlarges a race between the normal btree flush code path
      and btree_flush_write(), which causes a deadlock when journal space
      is exhausted. Reverting it shrinks the race window from 128 btree
      nodes down to only 1 btree node.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · ba82c1ac
      Coly Li committed

      This reverts commit 6268dc2c.

      That patch depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which was reverted in the previous patch,
      so revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() · a231f07a
      Coly Li committed

      In journal_read_bucket(), when setting ja->seq[bucket_index], there
      is a potential case where a later, non-maximum sequence number
      overwrites a larger one already stored in ja->seq[bucket_index]. This
      patch adds a check to make sure that ja->seq[bucket_index] is only
      set to a new value if that value is bigger than the current one.
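
      A minimal sketch of the guarded update (j is the jset just read;
      names follow the surrounding code):

        /* keep only the largest sequence number seen for this bucket */
        if (j->seq > ja->seq[bucket_index])
                ja->seq[bucket_index] = j->seq;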
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add code comments for journal_read_bucket() · 2464b693
      Coly Li committed

      This patch adds more code comments to journal_read_bucket(), as an
      effort to make the code more understandable.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() · 383ff218
      Coly Li committed

      When too many I/O errors happen on a cache set and the
      CACHE_SET_IO_DISABLE bit is set, bch_journal() may continue to work
      because the journaling bkey might still be in the write set. The
      caller of bch_journal() may believe the journal still works, but the
      truth is that the in-memory journal write set won't be written to the
      cache device any more. This behavior may introduce a potentially
      inconsistent metadata state.

      This patch checks the CACHE_SET_IO_DISABLE bit at the head of
      bch_journal(); if the bit is set, bch_journal() returns NULL
      immediately so the caller knows the journal does not work.
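
      A minimal sketch of the early return (signature follows journal.c;
      the body is reduced to the new check):

        atomic_t *bch_journal(struct cache_set *c,
                              struct keylist *keys,
                              struct closure *parent)
        {
                /* no journaling once cache set I/O is disabled */
                if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
                        return NULL;

                /* ... normal journaling path ... */
        }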
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix return value error in bch_journal_read() · 0ae49cb7
      Coly Li committed

      When everything is OK in bch_journal_read(), the return value is
      finally produced by
      	return ret;
      which assumes ret will be 0 here. This assumption is wrong when all
      journal buckets are full and filled with valid journal entries: in
      that case the last call of read_bucket() sets 'ret' to 1, which means
      a new jset was added to the jset list. The jset list is the list
      'journal' in the caller run_cache_set().

      Returning 1 to run_cache_set() means something went wrong and the
      cache set won't start, but in fact everything is OK.

      This patch changes the line at the end of bch_journal_read() to
      return 0 directly when everything is good, fixing the bogus error.
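
      A minimal sketch of the change:

        /* Before: 'ret' may still be 1 from the last read_bucket() call */
        return ret;

        /* After: reaching the end of bch_journal_read() means success */
        return 0;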
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 01 May, 2019 (1 commit)
  11. 25 Apr, 2019 (4 commits)
    • T
      bcache: fix failure in journal replay · 63120731
      Tang Junhui committed

      Journal replay failed with messages:
      Sep 10 19:10:43 ceph kernel: bcache: error on
      bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
      2057493-2057567 missing! (replaying 2057493-20766016), disabling
      caching
      
      The reason is that in journal_reclaim(), when discard is enabled, we
      send the discard command and reclaim those journal buckets whose seq
      is older than last_seq_now. But if the machine is restarted before we
      write a journal with last_seq_now, the journal with last_seq_now is
      never written to a journal bucket, and the last_seq_wrote in the
      newest journal is older than the last_seq_now we expect it to be. So
      when we do the replay, journals from last_seq_wrote to last_seq_now
      are missing.
      
      It's hard to write a journal immediately after journal_reclaim(), and
      it is harmless if the missed journals were caused by discarding,
      since those journals were already written into btree nodes. So, if
      the missing seqs start from the beginning of the journal, we treat it
      as normal, only print a message showing the missing journals, and
      point out that they may have been caused by discarding.

      Patch v2 adds a judgement condition, as Coly suggested, to ignore the
      missed journals only when discard is enabled.
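
      A hedged sketch of the judgement (variable names are illustrative;
      the real check lives in bch_journal_replay()):

        if (n != i->j.seq) {
                if (n == start && ca->discard)
                        /* gap at the start of the journal: benign,
                         * caused by discarding reclaimed buckets */
                        pr_info("journal entries %llu-%llu may be discarded\n",
                                n, i->j.seq - 1);
                else
                        /* a real gap in the middle: fail the replay */
                        return -EIO;
        }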
      
      (Coly Li: rebase the patch with other changes in bch_journal_replay())
      Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
      Tested-by: Dennis Schridde <devurandom@gmx.net>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: return error immediately in bch_journal_replay() · 68d10e69
      Coly Li committed

      When a failure happens inside bch_journal_replay(), calling
      cache_set_err_on() and handling the failure asynchronously is not a
      good idea: after bch_journal_replay() returns, the registering code
      continues to execute the following steps while the unregistering code
      triggered by cache_set_err_on() runs at the same time. First, it is
      unnecessary to handle the failure and unregister the cache set
      asynchronously; second, there is a potential race condition between
      the register and unregister code running on the same cache set.

      So in this patch, if a failure happens in bch_journal_replay(), we
      don't call cache_set_err_on(); we just print the same error message
      to the kernel message buffer and return -EIO to the caller
      immediately. The caller can then detect the failure and handle it in
      a synchronized way.
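
      A minimal sketch of the synchronous error path and its caller
      (message text illustrative):

        /* bch_journal_replay(): report and return, no async teardown */
        pr_err("journal entries %llu-%llu missing! (replaying %llu-%llu)",
               n, i->j.seq - 1, start, end);
        ret = -EIO;
        goto err;

        /* run_cache_set(): the caller now sees the failure directly */
        if (bch_journal_replay(c, &journal))
                goto err;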
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() · 1bee2add
      Coly Li committed

      In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
      available journal buckets. The variable 'int n' is used to count how
      many caches were successfully reclaimed, and n is then set on
      c->journal.key by SET_KEY_PTRS(). Later, in journal_write_unlocked(),
      a for_each_cache() loop writes the jset data onto each cache.

      The problem is, if all journal buckets on each cache are full, the
      following code in journal_reclaim(),
      
      529 for_each_cache(ca, c, iter) {
      530       struct journal_device *ja = &ca->journal;
      531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
      532
      533       /* No space available on this device */
      534       if (next == ja->discard_idx)
      535               continue;
      536
      537       ja->cur_idx = next;
      538       k->ptr[n++] = MAKE_PTR(0,
      539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
      540                         ca->sb.nr_this_dev);
      541 }
      542
      543 bkey_init(k);
      544 SET_KEY_PTRS(k, n);
      
      If there is no available bucket to reclaim, the if() condition at
      line 534 is always true and n remains 0. Then at line 544,
      SET_KEY_PTRS() sets the KEY_PTRS field of c->journal.key to 0.

      Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because
      in journal_write_unlocked() the journal data is written in the
      following loop:
      
      649	for (i = 0; i < KEY_PTRS(k); i++) {
      650-671		submit journal data to cache device
      672	}
      
      If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal
      data won't be written to the cache device here. If the system crashes
      or reboots before the bkeys of the lost journal entries are written
      into btree nodes, data corruption will be reported during bcache
      reload after rebooting the system.

      Indeed there is only one cache in a cache set, so there is no need to
      set the KEY_PTRS field in journal_reclaim() at all. But in order to
      keep the for_each_cache() logic consistent for now, this patch fixes
      the above problem by not setting the journal key's KEY_PTRS to 0 when
      there is no bucket available to reclaim.
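
      A minimal sketch of the guard around the code quoted above:

        /* only reinitialize the journal key when at least one bucket was
         * reclaimed; otherwise keep the previous KEY_PTRS value intact */
        if (n) {
                bkey_init(k);
                SET_KEY_PTRS(k, n);
        }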
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: move definition of 'int ret' out of macro read_bucket() · 14215ee0
      Coly Li committed

      'int ret' is defined as a local variable inside the macro
      read_bucket(). Since this macro is called multiple times, and the
      following patches will use an 'int ret' variable in
      bch_journal_read(), this patch moves the definition of 'int ret' from
      the macro read_bucket() to the scope of the function
      bch_journal_read().
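
      A hedged sketch of the change (the macro body is abbreviated):

        /* Before: every expansion of the macro declared its own 'ret' */
        #define read_bucket(b)                                  \
        ({                                                      \
                int ret = journal_read_bucket(ca, list, b);     \
                /* seq bookkeeping elided */                    \
                ret;                                            \
        })

        /* After: one 'ret' at function scope in bch_journal_read(),
         * visible to the code added by the following patches */
        int ret = 0;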
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 13 Dec, 2018 (1 commit)
  13. 27 Sep, 2018 (1 commit)
    • G
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang committed

      After a write to the SSD completes, bcache schedules journal_write
      work on system_wq, which is a public workqueue in the system, without
      the WQ_MEM_RECLAIM flag. system_wq is also a bound workqueue, and
      there may be no idle kworker on the current processor. Creating a new
      kworker may unfortunately need to reclaim memory first, by shrinking
      caches and slabs used by the vfs, which depends on the bcache device.
      That's a deadlock.

      This patch creates a new workqueue for journal_write with the
      WQ_MEM_RECLAIM flag. Its rescuer thread will work to avoid the
      deadlock.
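
      A minimal sketch of the dedicated workqueue (queue name illustrative;
      bcache actually dispatches the work via its closure machinery):

        struct workqueue_struct *bch_journal_wq;

        /* WQ_MEM_RECLAIM guarantees a rescuer thread, so journal writes
         * can make progress even when kworker creation needs memory */
        bch_journal_wq = alloc_workqueue("bch_journal", WQ_MEM_RECLAIM, 0);
        if (!bch_journal_wq)
                return -ENOMEM;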
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 12 Aug, 2018 (4 commits)
  15. 27 Jul, 2018 (1 commit)
  16. 19 Mar, 2018 (3 commits)
    • B
      bcache: Reduce the number of sparse complaints about lock imbalances · 20d3a518
      Bart Van Assche committed
      Add more annotations for sparse to inform it about which functions do
      not have the same number of spin_lock() and spin_unlock() calls.
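
      A hedged example of the annotation style (function and lock names
      illustrative):

        /* tell sparse this function intentionally returns with the
         * spin lock released, despite not having acquired it */
        static void example_unlock_path(struct cache_set *c)
                __releases(&c->journal.lock)
        {
                spin_unlock(&c->journal.lock);
        }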
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • B
      bcache: Suppress more warnings about set-but-not-used variables · 42361469
      Bart Van Assche committed
      This patch does not change any functionality.
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags · 771f393e
      Coly Li committed

      When too many I/Os fail on the cache device, bch_cache_set_error() is
      called in the error handling code path to retire the whole
      problematic cache set. If new I/O requests continue to arrive and
      take the refcount dc->count, the cache set won't be retired
      immediately, which is a problem.

      Furthermore, several kernel threads and self-armed kernel works may
      still be running after bch_cache_set_error() is called. It takes
      quite a while for them to stop, or they won't stop at all. They also
      prevent the cache set from being retired.

      The solution in this patch is to add a per-cache-set flag to disable
      I/O requests on this cache and all attached backing devices. New
      incoming I/O requests can then be rejected in *_make_request() before
      taking the refcount, and kernel threads and self-armed kernel workers
      can stop very quickly once the flag bit CACHE_SET_IO_DISABLE is set.

      Because bcache also does internal I/O for writeback, garbage
      collection, bucket allocation and journaling, this kind of I/O should
      also be disabled after bch_cache_set_error() is called. So
      closure_bio_submit() is modified to check whether
      CACHE_SET_IO_DISABLE is set on cache_set->flags; if set,
      closure_bio_submit() sets bio->bi_status to BLK_STS_IOERR and
      returns, and generic_make_request() is not called.

      A sysfs interface is also added to set or clear the
      CACHE_SET_IO_DISABLE bit in cache_set->flags, to disable or enable
      cache set I/O for debugging. It is helpful for triggering more
      corner-case issues on a failed cache device.
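
      A hedged sketch of the modified closure_bio_submit() (parameter order
      and surrounding details simplified):

        static inline void closure_bio_submit(struct cache_set *c,
                                              struct bio *bio,
                                              struct closure *cl)
        {
                closure_get(cl);
                if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags))) {
                        bio->bi_status = BLK_STS_IOERR;
                        bio_endio(bio);  /* fail without submitting */
                        return;
                }
                generic_make_request(bio);
        }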
      
      Changelog
      v4, add wait_for_kthread_stop(), and call it before the writeback and
          gc kernel threads exit.
      v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is a bit index.
          remove the "bcache: " prefix when printing out kernel messages.
      v2, more changes per the previous review,
      - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
      - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop;
        this is reported and inspired from the original patch of Pavel
        Vazharov.
      v1, initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Pavel Vazharov <freakpv@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 08 Feb, 2018 (2 commits)
    • T
      bcache: fix high CPU occupancy during journal · c4dc2497
      Tang Junhui committed

      After running small-write I/O for a long time, we found CPU occupancy
      was very high and I/O performance had been reduced by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were
      running journal work: the journal was reclaiming resources and
      flushing btree nodes with surprising frequency.

      Through further analysis, we found that in btree_flush_write() we try
      to get the btree node with the smallest fifo index to flush by
      traversing all the btree nodes in c->bucket_hash. After we get it,
      since no lock protects it, this btree node may already have been
      written to the cache device by other work; if this occurs, we
      re-traverse c->bucket_hash to get another btree node. When the
      problem occurred, the retry count was very high, and we consumed a
      lot of CPU looking for an appropriate btree node.

      In this patch, we record the 128 btree nodes with the smallest fifo
      indexes in a heap, and pop them one by one when we need to flush a
      btree node. This greatly reduces the time the loop spends finding an
      appropriate btree node, and also reduces CPU occupancy.
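
      A hedged sketch of the idea (heap helpers follow bcache's util.h
      macros; the comparator name is illustrative):

        /* added to struct cache_set: up to 128 flush candidates */
        DECLARE_HEAP(struct btree *, flush_btree);

        /* btree_flush_write(): pop the node with the smallest journal
         * fifo index; refill the heap from c->bucket_hash only when it
         * runs empty, instead of rescanning on every flush */
        if (heap_pop(&c->flush_btree, b, journal_min_cmp))
                __bch_btree_node_write(b, NULL);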
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • T
      bcache: add journal statistic · a728eacb
      Tang Junhui committed

      Sometimes the journal takes up a lot of CPU, and we need statistics
      to know what the journal is doing. So this patch provides some
      journal statistics:
      1) reclaim: how many times the journal tried to reclaim resources,
         usually because the journal buckets and/or the pins were
         exhausted.
      2) flush_write: how many times the journal tried to flush a btree
         node to the cache device, usually because the journal buckets were
         exhausted.
      3) retry_flush_write: how many times the journal retried flushing the
         next btree node, usually because the previous btree node had
         already been flushed by another thread.
      These statistics are shown through a sysfs interface. Through them we
      can fully see the status of the journal module when CPU usage is too
      high.
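
      A minimal sketch of the counters and their sysfs read-out (types and
      wiring are illustrative):

        /* in struct cache_set */
        atomic_long_t reclaim;
        atomic_long_t flush_write;
        atomic_long_t retry_flush_write;

        /* bumped at the matching points in journal.c */
        atomic_long_inc(&c->reclaim);      /* journal_reclaim() */
        atomic_long_inc(&c->flush_write);  /* btree_flush_write() */

        /* shown via sysfs */
        sysfs_print(reclaim, atomic_long_read(&c->reclaim));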
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 25 Nov, 2017 (1 commit)