1. 09 Apr, 2021 (1 commit)
  2. 03 Oct, 2020 (4 commits)
    • C
      bcache: remove embedded struct cache_sb from struct cache_set · 4a784266
      Coly Li committed

      Since the bcache code was merged into the mainline kernel, each cache
      set has only ever had a single cache in it. The framework for multiple
      caches is present, but the code is far from complete. Considering that
      multiple copies of cached data can also be stored on e.g. md raid1
      devices, there is no real need to support multiple caches in one cache
      set.

      The previous preparation patches fixed the dependencies on a cache set
      explicitly having a single cache. Now we no longer have to maintain an
      embedded partial super block in struct cache_set; the in-memory super
      block can be referenced directly from struct cache.

      This patch removes the embedded struct cache_sb from struct cache_set,
      and updates all locations that referenced the super block through the
      removed copy to reference the in-memory super block of struct cache
      instead.
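
      For illustration, a hedged sketch of the kind of change this involves
      at call sites (field names follow the bcache source; the exact sites
      vary per location):

        /* Before: partial super block embedded in struct cache_set */
        unsigned int block_size = c->sb.block_size;

        /* After: the in-memory super block of the single struct cache */
        unsigned int block_size = c->cache->sb.block_size;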
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: check and set sync status on cache's in-memory super block · 6f9414e0
      Coly Li committed

      Currently the cache's sync status is checked and set on the cache
      set's in-memory partial super block. After removing the embedded
      struct cache_sb from struct cache_set and referencing the cache's
      in-memory super block from struct cache_set, the sync status can be
      set and checked directly on the cache's super block.

      This patch checks and sets the cache sync status directly on the
      cache's in-memory super block. This is a preparation for the later
      removal of the embedded struct cache_sb from struct cache_set.
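
      A hedged sketch of the pattern (CACHE_SYNC()/SET_CACHE_SYNC() are
      bcache's super-block flag accessors; the surrounding logic is
      illustrative):

        struct cache *ca = c->cache;

        /* check and set sync status on the cache's own super block */
        if (!CACHE_SYNC(&ca->sb)) {
                SET_CACHE_SYNC(&ca->sb, true);
                bcache_write_super(c);
        }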
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: only use block_bytes() on struct cache · 4e1ebae3
      Coly Li committed

      Because struct cache_set and struct cache both contain a struct
      cache_sb, the macro block_bytes() could be used on either of them.
      Once the embedded struct cache_sb is removed from struct cache_set,
      this macro can no longer be used on struct cache_set.

      This patch unifies all block_bytes() usage to operate only on struct
      cache, as one of the preparations for removing the embedded struct
      cache_sb from struct cache_set.
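
      A minimal sketch of the unified usage (the macro body is paraphrased;
      the sector-to-byte shift is an assumption):

        /* block size in bytes, taken from the cache's super block */
        #define block_bytes(ca)  ((ca)->sb.block_size << 9)

        /* call sites now always pass a struct cache, never a cache set */
        memset(i, 0, block_bytes(b->c->cache));  /* was: block_bytes(b->c) */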
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: remove for_each_cache() · 08fdb2cd
      Coly Li committed

      Since each cache_set now explicitly has a single cache,
      for_each_cache() is unnecessary. This patch removes the macro,
      updates all locations where it was used, and makes sure the code
      logic remains consistent.
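
      A hedged sketch of the transformation (the loop body shown is
      illustrative, not a specific call site):

        /* Before: iterate the (at most one) cache in the set */
        for_each_cache(ca, c, iter)
                total_buckets += ca->sb.nbuckets;

        /* After: reference the single cache directly */
        struct cache *ca = c->cache;
        total_buckets = ca->sb.nbuckets;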
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 24 Aug, 2020 (1 commit)
  4. 25 Jul, 2020 (2 commits)
  5. 27 May, 2020 (1 commit)
    • J
      bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Joe Perches committed

      Remove the trailing newline from the pr_fmt define and add newlines
      to the uses.

      Miscellanea:

      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont,
        as the earlier conversion was done inappropriately, causing
        multiple lines to be emitted where only a single output line was
        desired
      o Use the vsprintf extension %pV in bch_cache_set_error to avoid
        multiple-line output where only a single output line was desired
      o Coalesce formats
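
      A minimal sketch of the style change (the exact pr_fmt body in
      bcache.h may differ slightly):

        /* Before: newline baked into pr_fmt, callers omit it */
        #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt "\n", __func__
        pr_err("cannot allocate memory");

        /* After: the more typical style, callers supply the newline */
        #define pr_fmt(fmt) KBUILD_MODNAME ": %s() " fmt, __func__
        pr_err("cannot allocate memory\n");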
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 13 Feb, 2020 (1 commit)
  7. 01 Feb, 2020 (1 commit)
    • C
      bcache: fix incorrect data type usage in btree_flush_write() · d1c3cc34
      Coly Li committed

      Dan Carpenter points out that commit 2aa8c529 ("bcache: avoid
      unnecessary btree nodes flushing in btree_flush_write()") introduced
      an incorrect data type usage which leads to the following static
      checker warning:
      	drivers/md/bcache/journal.c:444 btree_flush_write()
      	warn: 'ref_nr' unsigned <= 0
      
      drivers/md/bcache/journal.c
         422  static void btree_flush_write(struct cache_set *c)
         423  {
         424          struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
         425          unsigned int i, nr, ref_nr;
                                          ^^^^^^
      
         426          atomic_t *fifo_front_p, *now_fifo_front_p;
         427          size_t mask;
         428
         429          if (c->journal.btree_flushing)
         430                  return;
         431
         432          spin_lock(&c->journal.flush_write_lock);
         433          if (c->journal.btree_flushing) {
         434                  spin_unlock(&c->journal.flush_write_lock);
         435                  return;
         436          }
         437          c->journal.btree_flushing = true;
         438          spin_unlock(&c->journal.flush_write_lock);
         439
         440          /* get the oldest journal entry and check its refcount */
         441          spin_lock(&c->journal.lock);
         442          fifo_front_p = &fifo_front(&c->journal.pin);
         443          ref_nr = atomic_read(fifo_front_p);
         444          if (ref_nr <= 0) {
                          ^^^^^^^^^^^
      Unsigned can't be less than zero.
      
         445                  /*
         446                   * do nothing if no btree node references
         447                   * the oldest journal entry
         448                   */
         449                  spin_unlock(&c->journal.lock);
         450                  goto out;
         451          }
         452          spin_unlock(&c->journal.lock);
      
      As the warning indicates, declaring the local variable ref_nr as
      unsigned int is wrong: it matches neither the return type of
      atomic_read() nor the "<= 0" check.

      This patch fixes the above error by defining the local variable
      ref_nr as int.
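
      A minimal sketch of the fix and the pitfall it avoids:

        int ref_nr;  /* was: unsigned int ref_nr; */

        ref_nr = atomic_read(fifo_front_p);  /* atomic_read() returns int */
        if (ref_nr <= 0)  /* with unsigned, a negative count would wrap to
                             a huge positive value and never take this
                             branch */
                goto out;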
      
      Fixes: 2aa8c529 ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 24 Jan, 2020 (1 commit)
    • C
      bcache: avoid unnecessary btree nodes flushing in btree_flush_write() · 2aa8c529
      Coly Li committed

      Commit 91be66e1 ("bcache: performance improvement for
      btree_flush_write()") was an effort to flush the btree node holding
      the oldest journal entry faster, using the following methods:
      - Only iterate dirty btree nodes in c->btree_cache, avoiding a scan
        over a lot of clean btree nodes.
      - Treat c->btree_cache as an LRU-like list, aggressively flushing all
        dirty nodes from the tail of c->btree_cache until the btree node
        with the oldest journal entry is flushed. This reduces the time
        c->bucket_lock is held.
      
      Guoju Fang and Shuang Li reported that they observed unexpected extra
      write I/Os on the cache device after applying the above patch. Guoju
      Fang provided more detailed diagnostic information: the aggressive
      btree node flushing may cause 10x more btree nodes to be flushed in
      his workload. He points out that when system memory is large enough
      to hold all btree nodes, c->btree_cache is not an LRU-like list any
      more, so the btree node with the oldest journal entry is very
      probably not close to the tail of the c->btree_cache list. In that
      situation many more dirty btree nodes are aggressively flushed before
      the target node is flushed. When a slow SATA SSD is used as the cache
      device, such over-aggressive flushing causes a performance
      regression.

      After spending a lot of time on debugging and diagnosis, I find the
      real situation is more complicated, and aggressively flushing dirty
      btree nodes from the tail of the c->btree_cache list is not a good
      solution.
      - When all btree nodes are cached in memory, c->btree_cache is not an
        LRU-like list, and the btree node with the oldest journal entry
        won't be close to the tail of the list.
      - There can be hundreds of dirty btree nodes referencing the oldest
        journal entry; the oldest journal entry cannot be reclaimed before
        all of those nodes are flushed.
      When these two conditions combine, simply flushing from the tail of
      the c->btree_cache list is really NOT a good idea.
      
      Fortunately there is still a chance to make btree_flush_write() work
      better. Here is how this patch avoids unnecessary btree node flushing
      (a sketch of the lockless check follows further below):
      - Only acquire c->journal.lock when getting the oldest journal entry
        of the fifo c->journal.pin. In all other locations, check the
        journal entries locklessly, accepting that their values may be
        changed by other cores in parallel.
      - Inside the list_for_each_entry_safe_reverse() loop, check the
        latest front point of the fifo c->journal.pin. If it differs from
        the original point fetched while holding c->journal.lock, the
        oldest journal entry has been reclaimed on another core. At that
        moment all the selected dirty nodes recorded in the array
        btree_nodes[] have already been flushed and are clean on other CPU
        cores, so it is unnecessary to iterate c->btree_cache any longer.
        Just quit the list_for_each_entry_safe_reverse() loop, and the
        following for-loop will skip all the selected, now-clean nodes.
      - Find a proper time to quit the list_for_each_entry_safe_reverse()
        loop. Check the refcount of the original fifo front point: if it is
        larger than the number of nodes selected into btree_nodes[], more
        matching btree nodes should be scanned; otherwise there are no more
        matching btree nodes in the rest of the c->btree_cache list and the
        loop can quit. If the original oldest journal entry is reclaimed
        and the fifo front point updated, the refcount of the original
        front point will be 0, and the loop quits in that case too.
      - Do not hold c->bucket_lock for too long. c->bucket_lock is also
        required for space allocation for cached data, and holding it too
        long will block regular I/O requests. When iterating the list
        c->btree_cache, even if there are many matching btree nodes, only
        BTREE_FLUSH_NR nodes are selected and flushed in the following
        for-loop, so that c->bucket_lock is not held for too long.

      With this patch, only btree nodes referencing the oldest journal
      entry are flushed to the cache device; there is no more aggressive
      flushing of unnecessary btree nodes. And in order to avoid blocking
      regular I/O requests, each time btree_flush_write() is called, at
      most BTREE_FLUSH_NR btree nodes are selected to flush, even if there
      are more matching btree nodes in the list c->btree_cache.
      
      Finally, one more thing to explain: why is it safe to read the front
      point of c->journal.pin without holding c->journal.lock inside the
      list_for_each_entry_safe_reverse() loop?

      Here is my answer: when reading the front point of the fifo
      c->journal.pin, we don't need to know its exact value; we just want
      to check whether it differs from the original front point (which is
      accurate because it was fetched while c->journal.lock was held). For
      that purpose it works as expected without holding c->journal.lock.
      Even if the front point is changed on another CPU core and not yet
      visible on the local core, so that the btree node currently being
      iterated still appears locally to reference the originally fetched
      fifo front point, it is still safe: after taking the mutex
      b->write_lock (with its memory barrier), this btree node will be
      found clean and skipped, and the loop will quit later when it
      iterates the next node of the c->btree_cache list.
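
      A hedged sketch of the lockless check described above (shaped after
      btree_flush_write(); node selection is reduced to a comment):

        spin_lock(&c->journal.lock);
        fifo_front_p = &fifo_front(&c->journal.pin);
        ref_nr = atomic_read(fifo_front_p);
        spin_unlock(&c->journal.lock);

        list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
                /* lockless re-read: has the oldest entry been reclaimed? */
                now_fifo_front_p = &fifo_front(&c->journal.pin);
                if (now_fifo_front_p != fifo_front_p)
                        break;  /* selected nodes already flushed elsewhere */
                if (nr >= ref_nr)
                        break;  /* all nodes referencing the entry found */
                /* otherwise: record matching dirty nodes in btree_nodes[],
                 * at most BTREE_FLUSH_NR of them */
        }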
      
      Fixes: 91be66e1 ("bcache: performance improvement for btree_flush_write()")
      Reported-by: Guoju Fang <fangguoju@gmail.com>
      Reported-by: Shuang Li <psymon@bonuscloud.io>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 28 Jun, 2019 (10 commits)
    • C
      bcache: add reclaimed_journal_buckets to struct cache_set · dff90d58
      Coly Li committed

      Now we have counters for how many times the journal is reclaimed and
      how many times cached dirty btree nodes are flushed, but we don't
      know how many journal buckets are really reclaimed.

      This patch adds reclaimed_journal_buckets to struct cache_set. It is
      a monotonically increasing counter telling how many journal buckets
      have been reclaimed since the cache set started running. From these
      three counters (reclaim, reclaimed_journal_buckets, flush_write), we
      can get an idea of how well the current journal space reclaim code
      works.
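
      A minimal sketch of the counter (type and placement follow the
      commit text; the sysfs hookup is omitted):

        /* in struct cache_set: increase-only, never reset */
        atomic_long_t reclaimed_journal_buckets;

        /* in journal_reclaim(), once per bucket actually reclaimed */
        atomic_long_inc(&c->reclaimed_journal_buckets);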
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: performance improvement for btree_flush_write() · 91be66e1
      Coly Li committed

      This patch improves the performance of btree_flush_write() in the
      following ways (a sketch follows the list):
      - Use another spinlock, journal.flush_write_lock, to replace the very
        hot journal.lock. We don't have to use journal.lock here: selecting
        candidate btree nodes takes a lot of time, and holding journal.lock
        would block other journaling threads and drop the overall I/O
        performance.
      - Only select btree nodes to flush from the c->btree_cache list. When
        the machine has a large system memory, the mca cache may hold a
        huge number of cached btree nodes. Iterating all the cached nodes
        takes a lot of CPU time, and most of the nodes on the
        c->btree_cache_freeable and c->btree_cache_freed lists are cleared
        and do not need to be flushed. So traversing only the mca list
        c->btree_cache to select btree nodes to flush should be enough for
        most cases.
      - Don't iterate the whole c->btree_cache list; only select the first
        BTREE_FLUSH_NR btree nodes in reverse order to flush. Iterating all
        btree nodes of c->btree_cache to select those with the oldest
        journal pin consumes a huge number of CPU cycles if the list is
        huge (pushing and popping a node into/out of a heap is expensive).
        The last several dirty btree nodes at the tail of the
        c->btree_cache list are the earliest allocated and cached btree
        nodes, which correlate with the oldest journal pin btree nodes.
        Therefore flushing just BTREE_FLUSH_NR btree nodes from the tail of
        c->btree_cache probably includes the oldest journal pin btree
        nodes.
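
      A hedged sketch of the bounded reverse selection described in the
      last bullet (selection predicate simplified):

        struct btree *btree_nodes[BTREE_FLUSH_NR];
        unsigned int nr = 0;
        struct btree *b, *t;

        mutex_lock(&c->bucket_lock);
        list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
                if (nr >= BTREE_FLUSH_NR)
                        break;  /* bound both the scan and the flush batch */
                if (btree_node_dirty(b))
                        btree_nodes[nr++] = b;
        }
        mutex_unlock(&c->bucket_lock);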
      
      In my testing, the above changes decrease CPU consumption by 50%+
      when journal space is full. Sometimes IOPS drops to 0 for 5-8
      seconds; compared with the previous code blocking I/O for 120+
      seconds, this is much better. Maybe there is room for improvement in
      the future, but at this moment the fix looks fine and performs well
      in my testing.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix race in btree_flush_write() · 50a260e8
      Coly Li committed

      There is a race between mca_reap(), btree_node_free() and the journal
      code btree_flush_write() which results in a very rare and strange
      deadlock or panic, and is very hard to reproduce.

      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to
      the cache device; the select-and-flush is a two-step operation.
      Between these two steps, the following may happen inside the race
      window:
      - The selected btree node is reaped by mca_reap() and its memory
        allocated to another requester for another btree node.
      - The selected btree node is selected, flushed and released by the
        mca shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is that the memory of the selected btree
      node has been released; then every reference to this btree node
      (e.g. b->write_lock) will trigger a NULL pointer dereference panic.

      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and
      flushed them one by one over quite a long time period.

      Such a race was not easy to reproduce before. On a Lenovo SR650
      server with 48 Xeon cores, with 1 NVMe SSD configured as the cache
      device and an MD raid0 device assembled from 3 NVMe SSDs as the
      backing device, this race can be observed roughly once every 10,000
      calls of btree_flush_write(). Both deadlocks and kernel panics
      happened as aftermath of the race.

      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
      It is set when selecting btree nodes and cleared after the btree
      nodes are flushed. When mca_reap() encounters a btree node with this
      bit set, the node is skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, the race is avoided.
      
      One corner case should be noticed, and that is btree_node_free(). It
      might be called in some error handling code paths. For example, the
      following code piece is from btree_split():
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released
      without mca_reap(), so BTREE_NODE_journal_flush also needs to be
      checked here. If btree_node_free() is called directly in such an
      error handling path and the selected btree node has the
      BTREE_NODE_journal_flush bit set, just delay for 1 us and retry. In
      this case the btree node won't be skipped; we simply retry until the
      BTREE_NODE_journal_flush bit is cleared, and then free the btree node
      memory.
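
      A hedged sketch of the two checks (the helper name follows bcache's
      BTREE_FLAG() convention; control flow is simplified):

        /* in mca_reap(): never reap a node the journal code is flushing */
        if (btree_node_journal_flush(b))
                goto out_unlock;

        /* in btree_node_free(): wait for the flag instead of skipping */
        mutex_lock(&b->write_lock);
        while (btree_node_journal_flush(b)) {
                mutex_unlock(&b->write_lock);
                udelay(1);  /* delay 1 us, then re-check */
                mutex_lock(&b->write_lock);
        }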
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: remove retry_flush_write from struct cache_set · d91ce757
      Coly Li committed

      In struct cache_set, retry_flush_write was added by commit c4dc2497
      ("bcache: fix high CPU occupancy during journal"), which was reverted
      in the previous patch.

      It is now useless, and this patch removes it from the bcache code.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: Revert "bcache: fix high CPU occupancy during journal" · 249a5f6d
      Coly Li committed

      This reverts commit c4dc2497.

      That patch enlarges a race between the normal btree flush code path
      and btree_flush_write(), which causes a deadlock when journal space
      is exhausted. Reverting it shrinks the race window from 128 btree
      nodes down to only 1 btree node.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · ba82c1ac
      Coly Li committed

      This reverts commit 6268dc2c.

      That patch depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which was reverted in the previous patch,
      so revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() · a231f07a
      Coly Li committed

      In journal_read_bucket(), when setting ja->seq[bucket_index], there
      is a potential case where a later, non-maximum sequence number
      overwrites a larger one already stored in ja->seq[bucket_index]. This
      patch adds a check to make sure that ja->seq[bucket_index] is only
      set to a new value if that value is bigger than the current one.
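
      A minimal sketch of the guarded update (j is the jset just read;
      names follow the surrounding code):

        /* keep only the largest sequence number seen for this bucket */
        if (j->seq > ja->seq[bucket_index])
                ja->seq[bucket_index] = j->seq;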
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add code comments for journal_read_bucket() · 2464b693
      Coly Li committed

      This patch adds more code comments to journal_read_bucket(), as an
      effort to make the code more understandable.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() · 383ff218
      Coly Li committed

      When too many I/O errors happen on a cache set and the
      CACHE_SET_IO_DISABLE bit is set, bch_journal() may continue to work
      because the journaling bkey might still be in the write set. The
      caller of bch_journal() may believe the journal still works, but the
      truth is that the in-memory journal write set won't be written to the
      cache device any more. This behavior may introduce a potentially
      inconsistent metadata state.

      This patch checks the CACHE_SET_IO_DISABLE bit at the head of
      bch_journal(); if the bit is set, bch_journal() returns NULL
      immediately so the caller knows the journal does not work.
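
      A minimal sketch of the early return (signature follows journal.c;
      the body is reduced to the new check):

        atomic_t *bch_journal(struct cache_set *c,
                              struct keylist *keys,
                              struct closure *parent)
        {
                /* no journaling once cache set I/O is disabled */
                if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
                        return NULL;

                /* ... normal journaling path ... */
        }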
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix return value error in bch_journal_read() · 0ae49cb7
      Coly Li committed

      When everything is OK in bch_journal_read(), the return value is
      finally produced by
      	return ret;
      which assumes ret will be 0 here. This assumption is wrong when all
      journal buckets are full and filled with valid journal entries: in
      that case the last call of read_bucket() sets 'ret' to 1, which means
      a new jset was added to the jset list. The jset list is the list
      'journal' in the caller run_cache_set().

      Returning 1 to run_cache_set() means something went wrong and the
      cache set won't start, but in fact everything is OK.

      This patch changes the line at the end of bch_journal_read() to
      return 0 directly when everything is good, fixing the bogus error.
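
      A minimal sketch of the change:

        /* Before: 'ret' may still be 1 from the last read_bucket() call */
        return ret;

        /* After: reaching the end of bch_journal_read() means success */
        return 0;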
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 01 May, 2019 (1 commit)
  11. 25 Apr, 2019 (4 commits)
    • T
      bcache: fix failure in journal replay · 63120731
      Tang Junhui committed

      Journal replay failed with messages:
      Sep 10 19:10:43 ceph kernel: bcache: error on
      bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
      2057493-2057567 missing! (replaying 2057493-20766016), disabling
      caching
      
      The reason is that in journal_reclaim(), when discard is enabled, we
      send the discard command and reclaim those journal buckets whose seq
      is older than last_seq_now. But if the machine is restarted before we
      write a journal with last_seq_now, the journal with last_seq_now is
      never written to a journal bucket, and the last_seq_wrote in the
      newest journal is older than the last_seq_now we expect it to be. So
      when we do the replay, journals from last_seq_wrote to last_seq_now
      are missing.
      
      It's hard to write a journal immediately after journal_reclaim(), and
      it is harmless if the missed journals were caused by discarding,
      since those journals were already written into btree nodes. So, if
      the missing seqs start from the beginning of the journal, we treat it
      as normal, only print a message showing the missing journals, and
      point out that they may have been caused by discarding.

      Patch v2 adds a judgement condition, as Coly suggested, to ignore the
      missed journals only when discard is enabled.
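
      A hedged sketch of the judgement (variable names are illustrative;
      the real check lives in bch_journal_replay()):

        if (n != i->j.seq) {
                if (n == start && ca->discard)
                        /* gap at the start of the journal: benign,
                         * caused by discarding reclaimed buckets */
                        pr_info("journal entries %llu-%llu may be discarded\n",
                                n, i->j.seq - 1);
                else
                        /* a real gap in the middle: fail the replay */
                        return -EIO;
        }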
      
      (Coly Li: rebase the patch with other changes in bch_journal_replay())
      Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
      Tested-by: Dennis Schridde <devurandom@gmx.net>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: return error immediately in bch_journal_replay() · 68d10e69
      Coly Li committed

      When a failure happens inside bch_journal_replay(), calling
      cache_set_err_on() and handling the failure asynchronously is not a
      good idea: after bch_journal_replay() returns, the registering code
      continues to execute the following steps while the unregistering code
      triggered by cache_set_err_on() runs at the same time. First, it is
      unnecessary to handle the failure and unregister the cache set
      asynchronously; second, there is a potential race condition between
      the register and unregister code running on the same cache set.

      So in this patch, if a failure happens in bch_journal_replay(), we
      don't call cache_set_err_on(); we just print the same error message
      to the kernel message buffer and return -EIO to the caller
      immediately. The caller can then detect the failure and handle it in
      a synchronized way.
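
      A minimal sketch of the synchronous error path and its caller
      (message text illustrative):

        /* bch_journal_replay(): report and return, no async teardown */
        pr_err("journal entries %llu-%llu missing! (replaying %llu-%llu)",
               n, i->j.seq - 1, start, end);
        ret = -EIO;
        goto err;

        /* run_cache_set(): the caller now sees the failure directly */
        if (bch_journal_replay(c, &journal))
                goto err;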
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() · 1bee2add
      Coly Li committed

      In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
      available journal buckets. The variable 'int n' is used to count how
      many caches were successfully reclaimed, and n is then set on
      c->journal.key by SET_KEY_PTRS(). Later, in journal_write_unlocked(),
      a for_each_cache() loop writes the jset data onto each cache.

      The problem is, if all journal buckets on each cache are full, the
      following code in journal_reclaim(),
      
      529 for_each_cache(ca, c, iter) {
      530       struct journal_device *ja = &ca->journal;
      531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
      532
      533       /* No space available on this device */
      534       if (next == ja->discard_idx)
      535               continue;
      536
      537       ja->cur_idx = next;
      538       k->ptr[n++] = MAKE_PTR(0,
      539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
      540                         ca->sb.nr_this_dev);
      541 }
      542
      543 bkey_init(k);
      544 SET_KEY_PTRS(k, n);
      
      If there is no available bucket to reclaim, the if() condition at
      line 534 is always true and n remains 0. Then at line 544,
      SET_KEY_PTRS() sets the KEY_PTRS field of c->journal.key to 0.

      Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because
      in journal_write_unlocked() the journal data is written in the
      following loop:
      
      649	for (i = 0; i < KEY_PTRS(k); i++) {
      650-671		submit journal data to cache device
      672	}
      
      If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal
      data won't be written to the cache device here. If the system crashes
      or reboots before the bkeys of the lost journal entries are written
      into btree nodes, data corruption will be reported during bcache
      reload after rebooting the system.

      Indeed there is only one cache in a cache set, so there is no need to
      set the KEY_PTRS field in journal_reclaim() at all. But in order to
      keep the for_each_cache() logic consistent for now, this patch fixes
      the above problem by not setting the journal key's KEY_PTRS to 0 when
      there is no bucket available to reclaim.
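
      A minimal sketch of the guard around the code quoted above:

        /* only reinitialize the journal key when at least one bucket was
         * reclaimed; otherwise keep the previous KEY_PTRS value intact */
        if (n) {
                bkey_init(k);
                SET_KEY_PTRS(k, n);
        }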
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: move definition of 'int ret' out of macro read_bucket() · 14215ee0
      Coly Li committed

      'int ret' is defined as a local variable inside the macro
      read_bucket(). Since this macro is called multiple times, and the
      following patches will use an 'int ret' variable in
      bch_journal_read(), this patch moves the definition of 'int ret' from
      the macro read_bucket() to the scope of the function
      bch_journal_read().
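
      A hedged sketch of the change (the macro body is abbreviated):

        /* Before: every expansion of the macro declared its own 'ret' */
        #define read_bucket(b)                                  \
        ({                                                      \
                int ret = journal_read_bucket(ca, list, b);     \
                /* seq bookkeeping elided */                    \
                ret;                                            \
        })

        /* After: one 'ret' at function scope in bch_journal_read(),
         * visible to the code added by the following patches */
        int ret = 0;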
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 13 Dec, 2018 (1 commit)
  13. 27 Sep, 2018 (1 commit)
    • G
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang committed

      After a write to the SSD completes, bcache schedules journal_write
      work on system_wq, which is a public workqueue in the system, without
      the WQ_MEM_RECLAIM flag. system_wq is also a bound workqueue, and
      there may be no idle kworker on the current processor. Creating a new
      kworker may unfortunately need to reclaim memory first, by shrinking
      caches and slabs used by the vfs, which depends on the bcache device.
      That's a deadlock.

      This patch creates a new workqueue for journal_write with the
      WQ_MEM_RECLAIM flag. Its rescuer thread will work to avoid the
      deadlock.
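
      A minimal sketch of the dedicated workqueue (queue name illustrative;
      bcache actually dispatches the work via its closure machinery):

        struct workqueue_struct *bch_journal_wq;

        /* WQ_MEM_RECLAIM guarantees a rescuer thread, so journal writes
         * can make progress even when kworker creation needs memory */
        bch_journal_wq = alloc_workqueue("bch_journal", WQ_MEM_RECLAIM, 0);
        if (!bch_journal_wq)
                return -ENOMEM;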
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 12 Aug, 2018 (4 commits)
  15. 27 Jul, 2018 (1 commit)
  16. 19 Mar, 2018 (3 commits)
    • B
      bcache: Reduce the number of sparse complaints about lock imbalances · 20d3a518
      Bart Van Assche committed
      Add more annotations for sparse to inform it about which functions do
      not have the same number of spin_lock() and spin_unlock() calls.
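
      A hedged example of the annotation style (function and lock names
      illustrative):

        /* tell sparse this function intentionally returns with the
         * spin lock released, despite not having acquired it */
        static void example_unlock_path(struct cache_set *c)
                __releases(&c->journal.lock)
        {
                spin_unlock(&c->journal.lock);
        }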
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • B
      bcache: Suppress more warnings about set-but-not-used variables · 42361469
      Bart Van Assche committed
      This patch does not change any functionality.
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags · 771f393e
      Coly Li committed

      When too many I/Os fail on the cache device, bch_cache_set_error() is
      called in the error handling code path to retire the whole
      problematic cache set. If new I/O requests continue to arrive and
      take the refcount dc->count, the cache set won't be retired
      immediately, which is a problem.

      Furthermore, several kernel threads and self-armed kernel works may
      still be running after bch_cache_set_error() is called. It takes
      quite a while for them to stop, or they won't stop at all. They also
      prevent the cache set from being retired.

      The solution in this patch is to add a per-cache-set flag to disable
      I/O requests on this cache and all attached backing devices. New
      incoming I/O requests can then be rejected in *_make_request() before
      taking the refcount, and kernel threads and self-armed kernel workers
      can stop very quickly once the flag bit CACHE_SET_IO_DISABLE is set.

      Because bcache also does internal I/O for writeback, garbage
      collection, bucket allocation and journaling, this kind of I/O should
      also be disabled after bch_cache_set_error() is called. So
      closure_bio_submit() is modified to check whether
      CACHE_SET_IO_DISABLE is set on cache_set->flags; if set,
      closure_bio_submit() sets bio->bi_status to BLK_STS_IOERR and
      returns, and generic_make_request() is not called.

      A sysfs interface is also added to set or clear the
      CACHE_SET_IO_DISABLE bit in cache_set->flags, to disable or enable
      cache set I/O for debugging. It is helpful for triggering more
      corner-case issues on a failed cache device.
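
      A hedged sketch of the modified closure_bio_submit() (parameter order
      and surrounding details simplified):

        static inline void closure_bio_submit(struct cache_set *c,
                                              struct bio *bio,
                                              struct closure *cl)
        {
                closure_get(cl);
                if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags))) {
                        bio->bi_status = BLK_STS_IOERR;
                        bio_endio(bio);  /* fail without submitting */
                        return;
                }
                generic_make_request(bio);
        }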
      
      Changelog
      v4, add wait_for_kthread_stop(), and call it before the writeback and
          gc kernel threads exit.
      v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is a bit index.
          remove the "bcache: " prefix when printing out kernel messages.
      v2, more changes per the previous review,
      - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
      - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop;
        this is reported and inspired from the original patch of Pavel
        Vazharov.
      v1, initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Pavel Vazharov <freakpv@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 08 Feb, 2018 (2 commits)
    • T
      bcache: fix high CPU occupancy during journal · c4dc2497
      Tang Junhui committed

      After running small-write I/O for a long time, we found CPU occupancy
      was very high and I/O performance had been reduced by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were
      running journal work: the journal was reclaiming resources and
      flushing btree nodes with surprising frequency.

      Through further analysis, we found that in btree_flush_write() we try
      to get the btree node with the smallest fifo index to flush by
      traversing all the btree nodes in c->bucket_hash. After we get it,
      since no lock protects it, this btree node may already have been
      written to the cache device by other work; if this occurs, we
      re-traverse c->bucket_hash to get another btree node. When the
      problem occurred, the retry count was very high, and we consumed a
      lot of CPU looking for an appropriate btree node.

      In this patch, we record the 128 btree nodes with the smallest fifo
      indexes in a heap, and pop them one by one when we need to flush a
      btree node. This greatly reduces the time the loop spends finding an
      appropriate btree node, and also reduces CPU occupancy.
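
      A hedged sketch of the idea (heap helpers follow bcache's util.h
      macros; the comparator name is illustrative):

        /* added to struct cache_set: up to 128 flush candidates */
        DECLARE_HEAP(struct btree *, flush_btree);

        /* btree_flush_write(): pop the node with the smallest journal
         * fifo index; refill the heap from c->bucket_hash only when it
         * runs empty, instead of rescanning on every flush */
        if (heap_pop(&c->flush_btree, b, journal_min_cmp))
                __bch_btree_node_write(b, NULL);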
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • T
      bcache: add journal statistic · a728eacb
      Tang Junhui committed

      Sometimes the journal takes up a lot of CPU, and we need statistics
      to know what the journal is doing. So this patch provides some
      journal statistics:
      1) reclaim: how many times the journal tried to reclaim resources,
         usually because the journal buckets and/or the pins were
         exhausted.
      2) flush_write: how many times the journal tried to flush a btree
         node to the cache device, usually because the journal buckets were
         exhausted.
      3) retry_flush_write: how many times the journal retried flushing the
         next btree node, usually because the previous btree node had
         already been flushed by another thread.
      These statistics are shown through a sysfs interface. Through them we
      can fully see the status of the journal module when CPU usage is too
      high.
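
      A minimal sketch of the counters and their sysfs read-out (types and
      wiring are illustrative):

        /* in struct cache_set */
        atomic_long_t reclaim;
        atomic_long_t flush_write;
        atomic_long_t retry_flush_write;

        /* bumped at the matching points in journal.c */
        atomic_long_inc(&c->reclaim);      /* journal_reclaim() */
        atomic_long_inc(&c->flush_write);  /* btree_flush_write() */

        /* shown via sysfs */
        sysfs_print(reclaim, atomic_long_read(&c->reclaim));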
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 25 Nov, 2017 (1 commit)