1. 08 Aug 2019, 5 commits
  2. 05 Aug 2019, 1 commit
    • blk-mq: add callback of .cleanup_rq · 226b4fc7
      Authored by Ming Lei
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding a
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
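
      A minimal sketch of what the new helper can look like (the callback
      placement in struct blk_mq_ops and the helper body below are assumptions
      consistent with the description above, not a quote of the final code):

          /* Invoke the optional per-driver cleanup hook, if the driver set one. */
          static inline void blk_mq_cleanup_rq(struct request *rq)
          {
                  if (rq->q->mq_ops->cleanup_rq)
                          rq->q->mq_ops->cleanup_rq(rq);
          }

      dm-rq would call this right before freeing a cloned request it is about
      to re-queue, so SCSI gets a chance to release its per-request data.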
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      226b4fc7
  3. 31 Jul 2019, 2 commits
  4. 22 Jul 2019, 1 commit
  5. 17 Jul 2019, 3 commits
    • dm kcopyd: Increase default sub-job size to 512KB · c663e040
      Authored by Nikos Tsironis
      Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
      sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
      I/O in flight.
      
      This upper limit to the amount of in-flight I/O under-utilizes fast
      devices and results in decreased throughput, e.g., when writing to a
      snapshotted thin LV with I/O size less than the pool's block size (so
      COW is performed using kcopyd).
      
      Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
      4MB of I/O in flight for each kcopyd job. This results in an up to 96%
      improvement of bandwidth when writing to a snapshotted thin LV, with I/O
      sizes less than the pool's block size.
      
      Also, add a dm_mod.kcopyd_subjob_size_kb module parameter to allow users
      to fine-tune the sub-job size of kcopyd. The default value of this
      parameter is 512KB and the maximum allowed value is 1024KB.
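
      A minimal sketch of how a bounded module parameter like this can be
      wired up (the helper name and clamping policy are illustrative
      assumptions; only the parameter name and its 512KB/1024KB bounds come
      from the description above):

          #include <linux/kernel.h>
          #include <linux/module.h>

          #define DEFAULT_SUB_JOB_SIZE_KB 512
          #define MAX_SUB_JOB_SIZE_KB     1024

          static unsigned int kcopyd_subjob_size_kb = DEFAULT_SUB_JOB_SIZE_KB;
          module_param(kcopyd_subjob_size_kb, uint, S_IRUGO | S_IWUSR);
          MODULE_PARM_DESC(kcopyd_subjob_size_kb, "Sub-job size for dm-kcopyd clients");

          /* Clamp the user-supplied value (in KB) and convert it to bytes. */
          static unsigned int kcopyd_subjob_size_bytes(void)
          {
                  return clamp_val(kcopyd_subjob_size_kb, 1u, MAX_SUB_JOB_SIZE_KB) << 10;
          }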
      
      We evaluate the performance impact of the change by running the
      snap_breaking_throughput benchmark, from the device mapper test suite
      [1].
      
      The benchmark:
      
        1. Creates a 1G thin LV
        2. Provisions the thin LV
        3. Takes a snapshot of the thin LV
        4. Writes to the thin LV with:
      
            dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>
      
      Running this benchmark with various thin pool block sizes and dd I/O
      sizes (all combinations triggering the use of kcopyd) we get the
      following results:
      
      +-----------------+-------------+------------------+-----------------+
      | Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
      +-----------------+-------------+------------------+-----------------+
      |       1 MB      |      256 KB |       242        |       280       |
      |       1 MB      |      512 KB |       238        |       295       |
      |                 |             |                  |                 |
      |       2 MB      |      256 KB |       238        |       354       |
      |       2 MB      |      512 KB |       241        |       380       |
      |       2 MB      |        1 MB |       245        |       394       |
      |                 |             |                  |                 |
      |       4 MB      |      256 KB |       248        |       412       |
      |       4 MB      |      512 KB |       234        |       432       |
      |       4 MB      |        1 MB |       251        |       474       |
      |       4 MB      |        2 MB |       257        |       504       |
      |                 |             |                  |                 |
      |       8 MB      |      256 KB |       239        |       420       |
      |       8 MB      |      512 KB |       256        |       431       |
      |       8 MB      |        1 MB |       264        |       467       |
      |       8 MB      |        2 MB |       264        |       502       |
      |       8 MB      |        4 MB |       281        |       537       |
      +-----------------+-------------+------------------+-----------------+
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c663e040
    • dm snapshot: fix oversights in optional discard support · 3ee25485
      Authored by Mike Snitzer
      __find_snapshots_sharing_cow() should always be used with _origins_lock
      held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
      being merged discards must not be allowed -- otherwise incorrect or
      duplicate work will be performed.
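
      A minimal sketch of the locking part of the fix in snapshot_io_hints(),
      assuming the usual dm-snap.c conventions (_origins_lock as a rwsem and
      the __find_snapshots_sharing_cow() argument list as used by its other
      callers):

          down_read(&_origins_lock);
          (void) __find_snapshots_sharing_cow(snap, &snap_src, &snap_dest, NULL);
          /* ... adjust discard limits based on snap_src/snap_dest ... */
          up_read(&_origins_lock);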
      
      Fixes: 2e602385 ("dm snapshot: add optional discard support features")
      Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3ee25485
    • dm zoned: fix zone state management race · 3b8cafdd
      Authored by Damien Le Moal
      dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
      backend device is being actively read or written and so cannot be
      reclaimed. This flag is set as long as the zone atomic reference
      counter is not 0. When this atomic is decremented and reaches 0 (e.g.
      on BIO completion), the active flag is cleared; it is set again whenever
      the zone is reused and a BIO is issued, with the atomic counter incremented.
      These 2 operations (atomic inc/dec and flag set/clear) are however not
      always executed atomically under the target metadata mutex lock and
      this causes the warning:
      
      WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
      
      in dmz_deactivate_zone() to be displayed. This problem is regularly
      triggered with xfstests generic/209, generic/300, generic/451 and
      xfs/077 with XFS being used as the file system on the dm-zoned target
      device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
      generic/300 trigger the warning with ext4 use.
      
      This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
      and managing the "ACTIVE" state by directly looking at the reference
      counter value. To do so, the functions dmz_activate_zone() and
      dmz_deactivate_zone() are changed to inline functions respectively
      calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
      is changed to an inline function calling atomic_read().
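
      A minimal sketch of the resulting helpers, assuming the zone reference
      counter is an atomic_t field named refcount in struct dm_zone:

          static inline void dmz_activate_zone(struct dm_zone *zone)
          {
                  atomic_inc(&zone->refcount);
          }

          static inline void dmz_deactivate_zone(struct dm_zone *zone)
          {
                  atomic_dec(&zone->refcount);
          }

          static inline bool dmz_is_active(struct dm_zone *zone)
          {
                  /* "Active" now simply means the refcount is non-zero. */
                  return atomic_read(&zone->refcount) != 0;
          }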
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3b8cafdd
  6. 15 Jul 2019, 1 commit
  7. 12 Jul 2019, 3 commits
    • dm bufio: fix deadlock with loop device · bd293d07
      Authored by Junxiao Bi
      When a thin volume is built on a loop device, if available memory is low,
      the following deadlock can be triggered:
      
      One process P1 allocates memory with GFP_FS flag, direct alloc fails,
      memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
      IO to complete in __try_evict_buffer().
      
      But this IO may never complete if issued to an underlying loop device
      that forwards it using direct-IO, which allocates memory using
      GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
      reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      will be invoked, and since the mutex is already held by P1 the loop
      thread will hang and IO will never complete, resulting in an ABBA
      deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      bd293d07
    • dm snapshot: add optional discard support features · 2e602385
      Authored by Mike Snitzer
      discard_zeroes_cow - a discard issued to the snapshot device that maps
      to entire chunks will zero the corresponding exception(s) in the
      snapshot's exception store.
      
      discard_passdown_origin - a discard to the snapshot device is passed down
      to the snapshot-origin's underlying device.  This doesn't cause copy-out
      to the snapshot exception store because the snapshot-origin target is
      bypassed.
      
      The discard_passdown_origin feature depends on the discard_zeroes_cow
      feature being enabled.
      
      When these 2 features are enabled they allow a temporarily read-only
      device that has completely exhausted its free space to recover space.
      To do so dm-snapshot provides a temporary buffer to accommodate writes
      that the temporarily read-only device cannot handle yet.  Once the upper
      layer frees space (e.g. fstrim to XFS) the discards issued to the
      dm-snapshot target will be issued to the underlying read-only device whose
      free space was exhausted.  In addition those discards will also cause
      zeroes to be written to the snapshot exception store if corresponding
      exceptions exist.  If the underlying origin device provides
      deduplication for zero blocks then if/when the snapshot is merged back
      to the origin those blocks will become unused.  Once the origin has
      gained adequate space, merging the snapshot back to the thinly
      provisioned device will permit continued use of that device without the
      temporary space provided by the snapshot.
      Requested-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      2e602385
    • block: Kill gfp_t argument of blkdev_report_zones() · bd976e52
      Authored by Damien Le Moal
      Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
      preparation for using vmalloc() for large report buffer and zone array
      allocations used by this function, remove its "gfp_t gfp_mask" argument
      and rely on the caller context to use memalloc_noio_save/restore() where
      necessary (block layer zone revalidation and dm-zoned I/O error path).
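
      A minimal sketch of the caller-side pattern, assuming the post-change
      blkdev_report_zones() prototype takes just the device, start sector,
      zone array and in/out zone count:

          unsigned int noio_flag;
          int ret;

          /* Forbid I/O-triggering allocations for the duration of the report. */
          noio_flag = memalloc_noio_save();
          ret = blkdev_report_zones(bdev, sector, zones, &nr_zones);
          memalloc_noio_restore(noio_flag);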
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bd976e52
  8. 11 Jul 2019, 1 commit
  9. 10 Jul 2019, 9 commits
  10. 06 Jul 2019, 2 commits
  11. 03 Jul 2019, 1 commit
  12. 28 Jun 2019, 11 commits
    • bcache: add reclaimed_journal_buckets to struct cache_set · dff90d58
      Authored by Coly Li
      Now we have counters for how many times the journal is reclaimed and how
      many times cached dirty btree nodes are flushed, but we don't know how
      many journal buckets are really reclaimed.
      
      This patch adds reclaimed_journal_buckets into struct cache_set; it is
      an increase-only counter that tells how many journal buckets have been
      reclaimed since the cache set started running. From these three counters
      (reclaim, reclaimed_journal_buckets, flush_write), we can get an idea of
      how well the current journal space reclaim code works.
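
      A minimal sketch of the counter, assuming it is kept as an atomic_long_t
      in struct cache_set and bumped once per bucket in the journal reclaim
      path (field placement and exact call site are assumptions):

          /* In struct cache_set: */
          atomic_long_t           reclaimed_journal_buckets;

          /* In the journal reclaim path, once per bucket handed back: */
          atomic_long_inc(&c->reclaimed_journal_buckets);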
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dff90d58
    • bcache: performance improvement for btree_flush_write() · 91be66e1
      Authored by Coly Li
      This patch improves the performance of btree_flush_write() in the
      following ways:
      - Use another spinlock journal.flush_write_lock to replace the very
        hot journal.lock. We don't have to use journal.lock here: selecting
        candidate btree nodes takes a lot of time, and holding journal.lock
        would block other journaling threads and drop the overall I/O
        performance.
      - Only select btree nodes to flush from the c->btree_cache list. When
        the machine has a large amount of system memory, the mca cache may
        hold a huge number of cached btree nodes. Iterating all the cached
        nodes takes a lot of CPU time, and most of the nodes on the
        c->btree_cache_freeable and c->btree_cache_freed lists are cleared
        and have no need to be flushed. So traversing only the mca list
        c->btree_cache to select btree nodes to flush should be enough for
        most cases.
      - Don't iterate the whole c->btree_cache list; only select, in reverse
        order, the first BTREE_FLUSH_NR btree nodes to flush (see the sketch
        after this list). Iterating all btree nodes from c->btree_cache and
        selecting those with the oldest journal pins consumes a huge number
        of CPU cycles if the list is huge (pushing and popping a node
        into/out of a heap is expensive). The last several dirty btree nodes
        on the tail of the c->btree_cache list are the earliest allocated and
        cached btree nodes, so they tend to hold the oldest journal pins.
        Therefore flushing only BTREE_FLUSH_NR btree nodes from the tail of
        c->btree_cache probably covers the btree nodes with the oldest
        journal pins.
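
      A minimal sketch of the tail selection, assuming btree nodes are linked
      on c->btree_cache through their list member and that BTREE_FLUSH_NR,
      the btree_nodes[] array and journal.flush_write_lock are the names this
      patch introduces (illustrative, not the exact upstream code):

          struct btree *b;
          unsigned int nr = 0;

          spin_lock(&c->journal.flush_write_lock);
          /* Walk from the tail: the oldest cached, likely oldest-pinned, nodes. */
          list_for_each_entry_reverse(b, &c->btree_cache, list) {
                  if (!btree_node_dirty(b))
                          continue;
                  btree_nodes[nr++] = b;
                  if (nr == BTREE_FLUSH_NR)
                          break;
          }
          spin_unlock(&c->journal.flush_write_lock);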
      
      In my testing, the above change decreases CPU consumption by 50%+ when
      journal space is full. Sometimes IOPS drops to 0 for 5-8 seconds, but
      compared with the 120+ seconds of blocked I/O with the previous code,
      this is much better. Maybe there is room to improve in the future, but
      at this moment the fix looks fine and performs well in my testing.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91be66e1
    • bcache: fix race in btree_flush_write() · 50a260e8
      Authored by Coly Li
      There is a race between mca_reap(), btree_node_free() and the journal
      code btree_flush_write(), which results in a very rare and strange
      deadlock or panic that is very hard to reproduce.
      
      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to the
      cache device; the select-and-flush is a two-step operation. Between
      these two steps, the following may happen inside the race window:
      - The selected btree node is reaped by mca_reap() and its memory is
        allocated to another btree node for other requesters.
      - The selected btree node is selected, flushed and released by the mca
        shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is when the memory of the selected btree
      node has been released; then any reference to this btree node (e.g.
      b->write_lock) will trigger a NULL pointer dereference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one-by-one over quite a long time period.
      
      Such a race was not easy to reproduce before. On a Lenovo SR650 server
      with 48 Xeon cores, with 1 NVMe SSD configured as the cache device and
      an MD raid0 device assembled from 3 NVMe SSDs as the backing device,
      this race can be observed around once every 10,000 calls to
      btree_flush_write(). Both deadlock and kernel panic have happened as
      aftermath of the race.
      
      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
      is set when selecting btree nodes, and cleared after the btree nodes are
      flushed. Then when mca_reap() encounters a btree node with this bit set,
      that btree node will be skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, such a race is avoided.
      
      One corner case should be noticed: btree_node_free(). It might be
      called in some error handling code paths. For example, the following
      code piece from btree_split():
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released without
      mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such an error handling path
      and the selected btree node has the BTREE_NODE_journal_flush bit set,
      just delay for 1 us and retry. In this case the btree node won't be
      skipped; btree_node_free() retries until the BTREE_NODE_journal_flush
      bit is cleared, and then frees the btree node memory.
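
      A minimal sketch of that retry loop in btree_node_free(), assuming a
      btree_node_journal_flush() test macro paired with the new
      BTREE_NODE_journal_flush bit (names follow the description above):

          retry:
                  mutex_lock(&b->write_lock);
                  if (btree_node_journal_flush(b)) {
                          /* Still being flushed by btree_flush_write(); back off briefly. */
                          mutex_unlock(&b->write_lock);
                          udelay(1);
                          goto retry;
                  }
                  /* Now it is safe to complete the write and free the node. */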
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50a260e8
    • bcache: remove retry_flush_write from struct cache_set · d91ce757
      Authored by Coly Li
      In struct cache_set, retry_flush_write was added by commit c4dc2497
      ("bcache: fix high CPU occupancy during journal"), which is reverted in
      the previous patch.
      
      Now it is useless, and this patch removes it from the bcache code.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d91ce757
    • bcache: add comments for mutex_lock(&b->write_lock) · 41508bb7
      Authored by Coly Li
      When accessing or modifying the BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, so this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing the BTREE_NODE_dirty bit there.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      41508bb7
    • bcache: only clear BTREE_NODE_dirty bit when it is set · e5ec5f47
      Authored by Coly Li
      In bch_btree_cache_free() and btree_node_free(), the BTREE_NODE_dirty
      bit is always cleared no matter whether the btree node is dirty or not.
      The code looks like this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed, if btree_node_dirty(b) returns false, the BTREE_NODE_dirty bit
      is already cleared, so it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
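
      A sketch of the intended shape after the fix, with the clear_bit() moved
      under the same dirty check:

      	if (btree_node_dirty(b)) {
      		btree_complete_write(b, btree_current_write(b));
      		clear_bit(BTREE_NODE_dirty, &b->flags);
      	}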
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5ec5f47
    • bcache: Revert "bcache: fix high CPU occupancy during journal" · 249a5f6d
      Authored by Coly Li
      This reverts commit c4dc2497.
      
      The reverted patch enlarged a race between the normal btree flush code
      path and btree_flush_write(), which causes a deadlock when journal
      space is exhausted. Reverting it shrinks the race window from 128 btree
      nodes to only 1 btree node.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      249a5f6d
    • bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · ba82c1ac
      Authored by Coly Li
      This reverts commit 6268dc2c.
      
      The reverted commit depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which is reverted in the previous patch. So
      revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ba82c1ac
    • bcache: shrink btree node cache after bch_btree_check() · 1df3877f
      Authored by Coly Li
      When a cache set starts, bch_btree_check() will check all bkeys on the
      cache device by calculating their checksums. This operation will consume
      a huge amount of system memory if there is a lot of data cached. Since
      bcache uses its own mca cache to maintain all its read-in btree nodes,
      it only releases the cache space when the system memory management code
      starts to shrink caches. So before the memory manager calls the mca
      cache shrinker callback, the bcache mca cache will compete for memory
      with user space applications, which may have a negative effect on the
      performance of user space workloads (e.g. databases, or the I/O service
      of a distributed storage node).
      
      This patch calls the bcache mca shrinker routine to proactively release
      mca cache memory, to decrease the memory pressure on the system and
      avoid a negative effect on the overall system I/O performance.
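
      A minimal sketch of proactively invoking the registered mca shrinker
      once bch_btree_check() returns, assuming cache_set keeps its shrinker in
      c->shrink and counts cached nodes in c->btree_cache_used (names are
      assumptions about the mca code, not guaranteed to match exactly):

          struct shrink_control sc;

          sc.gfp_mask = GFP_KERNEL;
          sc.nr_to_scan = c->btree_cache_used * c->btree_pages;
          /* First pass clears the accessed bits, second pass actually reaps nodes. */
          c->shrink.scan_objects(&c->shrink, &sc);
          c->shrink.scan_objects(&c->shrink, &sc);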
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1df3877f
    • bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() · a231f07a
      Authored by Coly Li
      In journal_read_bucket(), when setting ja->seq[bucket_index], there is
      a potential case where a later, non-maximum sequence number overwrites
      a better sequence number in ja->seq[bucket_index]. This patch adds a
      check to make sure that ja->seq[bucket_index] is only set to a new
      value if it is bigger than the current value.
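
      A minimal sketch of the added check, assuming j points to the journal
      set (struct jset) just decoded from this bucket:

          /* Only record a newly seen sequence number if it is larger. */
          if (j->seq > ja->seq[bucket_index])
                  ja->seq[bucket_index] = j->seq;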
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a231f07a
    • bcache: add code comments for journal_read_bucket() · 2464b693
      Authored by Coly Li
      This patch adds more code comments in journal_read_bucket(); this is an
      effort to make the code more understandable.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2464b693