提交 · 50b4aecfbbb09869db967e4a26212a47e10c0088 · openeuler / Kernel

13 8月, 2021 3 次提交

由 Christoph Hellwig 提交于 8月 09, 2021

Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

50b4aecf

bcache: move the del_gendisk call out of bcache_device_free · b75f4aed

由 Christoph Hellwig 提交于 8月 09, 2021

Let the callers call del_gendisk so that we can check if add_disk
has been called properly for the cached device case instead of relying
on the block layer internal GENHD_FL_UP flag.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NColy Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

b75f4aed

bcache: add proper error unwinding in bcache_device_init · 224b0683

由 Christoph Hellwig 提交于 8月 09, 2021

Except for the IDA none of the allocations in bcache_device_init is
unwound on error, fix that.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NColy Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

224b0683

12 8月, 2021 1 次提交

block: move some macros to blkdev.h · 018eca45

由 Guoqing Jiang 提交于 7月 21, 2021

Move them (PAGE_SECTORS_SHIFT, PAGE_SECTORS and SECTOR_MASK) to the
generic header file to remove redundancy.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Link: https://lore.kernel.org/r/20210721025315.1729118-1-guoqing.jiang@linux.devSigned-off-by: NJens Axboe <axboe@kernel.dk>

018eca45

10 8月, 2021 5 次提交

block: pass a gendisk to blk_queue_update_readahead · 471aa704

由 Christoph Hellwig 提交于 8月 09, 2021

.. and rename the function to disk_update_readahead.  This is in
preparation for moving the BDI from the request_queue to the gendisk.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

471aa704

dm: delay registering the gendisk · 89f871af

由 Christoph Hellwig 提交于 8月 04, 2021

device mapper is currently the only outlier that tries to call
register_disk after add_disk, leading to fairly inconsistent state
of these block layer data structures.  Instead change device-mapper
to just register the gendisk later now that the holder mechanism
can cope with that.

Note that this introduces a user visible change: the dm kobject is
now only visible after the initial table has been loaded.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

89f871af

dm: move setting md->type into dm_setup_md_queue · ba305859

由 Christoph Hellwig 提交于 8月 04, 2021

Move setting md->type from both callers into dm_setup_md_queue.
This ensures that md->type is only set to a valid value after the queue
has been fully setup, something we'll rely on future changes.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

ba305859

dm: cleanup cleanup_mapped_device · 74a2b6ec

由 Christoph Hellwig 提交于 8月 04, 2021

md->queue is now always set when md->disk is set, so simplify the
conditionals a bit.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-6-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

74a2b6ec

block: make the block holder code optional · c66fd019

由 Christoph Hellwig 提交于 8月 04, 2021

Move the block holder code into a separate file as it is not in any way
related to the other block_dev.c code, and add a new selectable config
option for it so that we don't have to build it without any remapped
drivers selected.

The Kconfig symbol contains a _DEPRECATED suffix to match the comments
added in commit 49731baa
("block: restore multiple bd_link_disk_holder() support").
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

c66fd019

03 8月, 2021 1 次提交

dm-writecache: use bvec_kmap_local instead of bvec_kmap_irq · 18a6234c

由 Christoph Hellwig 提交于 7月 27, 2021

There is no need to disable interrupts in bio_copy_block, and the local
only mappings helps to avoid any sort of problems with stray writes
into the bio data.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NIra Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210727055646.118787-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

18a6234c

29 6月, 2021 1 次提交

dm writecache: make writeback pause configurable · 5c0de3d7

由 Mikulas Patocka 提交于 6月 28, 2021

Commit 95b88f4d ("dm writecache: pause
writeback if cache full and origin being written directly") introduced a
code that pauses cache flushing if we are issuing writes directly to the
origin.

Improve that initial commit by making the timeout code configurable
(via the option "pause_writeback"). Also change the default from 1s to
3s because it performed better.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

5c0de3d7

26 6月, 2021 6 次提交

dm writecache: pause writeback if cache full and origin being written directly · 95b88f4d

由 Mikulas Patocka 提交于 6月 25, 2021

Implementation reuses dm_io_tracker, that until now was only used by
dm-cache, to track if any writes were issued directly to the origin
(due to cache being full) within the last second. If so writeback is
paused for a second.

This change improves performance for when the cache is full and IO is
issued directly to the origin device (rather than through the cache).

Depends-on: d53f1faf ("dm writecache: do direct write if the cache is full")
Suggested-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

95b88f4d

M
dm io tracker: factor out IO tracker · dc4fa29f
由 Mike Snitzer 提交于 6月 25, 2021
```
Allow other code to use dm_io_tracker.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
dc4fa29f

dm btree remove: assign new_root only when removal succeeds · b6e58b54

由 Hou Tao 提交于 6月 17, 2021

remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().

For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:

  general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
  CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
  Hardware name: QEMU Standard PC
  RIP: 0010:metadata_ll_load_ie+0x14/0x30
  Call Trace:
   sm_metadata_count_is_more_than_one+0xb9/0xe0
   dm_tm_shadow_block+0x52/0x1c0
   shadow_step+0x59/0xf0
   remove_raw+0xb2/0x170
   dm_btree_remove+0xf4/0x1c0
   dm_pool_delete_thin_device+0xc3/0x140
   pool_message+0x218/0x2b0
   target_message+0x251/0x290
   ctl_ioctl+0x1c4/0x4d0
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0x7b/0xb0
   do_syscall_64+0x40/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixing it by only assign new_root when removal succeeds
Signed-off-by: NHou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

b6e58b54

dm zone: fix dm_revalidate_zones() memory allocation · 28436ba3

由 Damien Le Moal 提交于 6月 19, 2021

Make sure that the zone write pointer offset array is allocated with a
vmalloc in dm_zone_revalidate_cb() by passing GFP_KERNEL gfp flag to
kvcalloc(). However, since we do not want to trigger IOs while
revalidating zones, change dm_revalidate_zones() to have the zone scan
done in GFP_NOIO context using memalloc_noio_save/restore calls.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Fixes: bb37d772 ("dm: introduce zone append emulation")
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

28436ba3

dm ps io affinity: remove redundant continue statement · 326dbde2

由 Colin Ian King 提交于 6月 16, 2021

The continue statement at the end of a for-loop has no effect,
remove it.

Addresses-Coverity: ("Continue has no effect")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

326dbde2

dm writecache: add optional "metadata_only" parameter · 611c3e16

由 Mikulas Patocka 提交于 6月 21, 2021

Add a "metadata_only" parameter that when present: only metadata is
promoted to the cache. This option improves performance for heavier
REQ_META workloads (e.g. device-mapper-test-suite's "git clone and
checkout" benchmark improves from 341s to 312s).
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

611c3e16

22 6月, 2021 1 次提交

dm writecache: write at least 4k when committing · 867de40c

由 Mikulas Patocka 提交于 6月 21, 2021

SSDs perform badly with sub-4k writes (because they perfrorm
read-modify-write internally), so make sure writecache writes at least
4k when committing.

Fixes: 991bd8d7 ("dm writecache: commit just one block, not a full page")
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

867de40c

18 6月, 2021 1 次提交

sched: Change task_struct::state · 2f064a59

由 Peter Zijlstra 提交于 6月 11, 2021

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: NWill Deacon <will@kernel.org>
Acked-by: NDaniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

2f064a59

17 6月, 2021 1 次提交

dm writecache: flush origin device when writing and cache is full · ee55b92a

由 Mikulas Patocka 提交于 6月 15, 2021

Commit d53f1faf ("dm writecache: do
direct write if the cache is full") changed dm-writecache, so that it
writes directly to the origin device if the cache is full.
Unfortunately, it doesn't forward flush requests to the origin device,
so that there is a bug where flushes are being ignored.

Fix this by adding missing flush forwarding.

For PMEM mode, we fix this bug by disabling direct writes to the origin
device, because it performs better.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Fixes: d53f1faf ("dm writecache: do direct write if the cache is full")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

ee55b92a

16 6月, 2021 1 次提交

dm writecache: have ssd writeback wait if the kcopyd workqueue is busy · 293128b1

由 Mikulas Patocka 提交于 6月 15, 2021

Make dm-writecache wait if the kcopyd workqueue is busy (as will
happen if waiting for page allocation or inside submit_bio).

This change improves performance of "mkfs.ext2" by approximately 20%
on one testbed.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

293128b1

15 6月, 2021 12 次提交

md/raid5: avoid device_lock in read_one_chunk() · 97ae2725

由 Gal Ofri 提交于 6月 07, 2021

There is a lock contention on device_lock in read_one_chunk().
device_lock is taken to sync conf->active_aligned_reads and
conf->quiesce.
read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
before incrementing active_aligned_reads.
raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
for active_aligned_reads to be zero before setting quiesce=1
(suspended).

Introduce a fast (lockless) path in read_one_chunk(): activate aligned
read without taking device_lock.  In case quiesce starts while
activating the aligned-read in fast path, deactivate it and revert to
old behavior (take device_lock and wait for quiesce to finish).

Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
gaurantee that read_one_chunk() does not miss an ongoing quiesce.

My setups:
1. 8 local nvme drives (each up to 250k iops).
2. 8 ram disks (brd).

Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per
socket) system. Record both iops and cpu spent on this contention with
rand-read-4k. Record bw with sequential-read-128k.  Note: in most cases
cpu is still busy but due to "new" bottlenecks.

nvme:
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 1.6M           | ~50% | 5.5GB/s
with patch    | 2M (throttled) | 0%   | 16GB/s (throttled)

ram (brd):
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 2M             | ~80% | 24GB/s
with patch    | 4M             | 0%   | 55GB/s

CC: Song Liu <song@kernel.org>
CC: Neil Brown <neilb@suse.de>
Reviewed-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NGal Ofri <gal.ofri@storing.io>
Signed-off-by: NSong Liu <song@kernel.org>

97ae2725

md: add comments in md_integrity_register · de3ea66e

由 Guoqing Jiang 提交于 6月 03, 2021

Given it is not obvious for the error handling, let's try to add some
comments here to make it clear.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

de3ea66e

md: check level before create and exit io_acct_set · daee2024

由 Guoqing Jiang 提交于 6月 03, 2021

The bio_set (io_acct_set) is used by personalities to clone bio and
trace the timestamp of bio. Some personalities such as raid1/10 don't
need the bio_set, so add check to not create it unconditionally.

Also update the comment for md_account_bio to make it more clear.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

daee2024

md: Constify attribute_group structs · c32dc040

由 Rikard Falkeborn 提交于 5月 29, 2021

The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.
Signed-off-by: NRikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: NSong Liu <song@kernel.org>

c32dc040

md: mark some personalities as deprecated · 608f52e3

由 Guoqing Jiang 提交于 5月 25, 2021

Mark the three personalities (linear, fault and multipath) as deprecated
because:

1. people can use dm multipath or nvme multipath.
2. linear is already deprecated in MODULE_ALIAS.
3. no one actively using fault.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

608f52e3

md/raid10: enable io accounting · 528bc2cf

由 Guoqing Jiang 提交于 5月 25, 2021

For raid10, we record the start time between split bio and clone bio,
and finish the accounting in the final endio.

Also introduce start_time in r10bio accordingly.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

528bc2cf

md/raid1: enable io accounting · a0159832

由 Guoqing Jiang 提交于 5月 25, 2021

For raid1, we record the start time between split bio and clone bio,
and finish the accounting in the final endio.

Also introduce start_time in r1bio accordingly.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

a0159832

md/raid1: rename print_msg with r1bio_existed · 9b8ae7b9

由 Guoqing Jiang 提交于 5月 25, 2021

The caller of raid1_read_request could pass NULL or a valid pointer for
"struct r1bio *r1_bio", so it actually means whether r1_bio is existed
or not.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

9b8ae7b9

md/raid5: avoid redundant bio clone in raid5_read_one_chunk · 1147f58e

由 Guoqing Jiang 提交于 5月 25, 2021

After enable io accounting, chunk read bio could be cloned twice which
is not good. To avoid such inefficiency, let's clone align_bio from
io_acct_set too, then we need only call md_account_bio in make_request
unconditionally.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

1147f58e

md/raid5: move checking badblock before clone bio in raid5_read_one_chunk · c82aa1b7

由 Guoqing Jiang 提交于 5月 25, 2021

We don't need to clone bio if the relevant region has badblock.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

c82aa1b7

md: add io accounting for raid0 and raid5 · 10764815

由 Guoqing Jiang 提交于 5月 25, 2021

We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
don't own clone infrastructure to accounting io. And the bioset is added
to mddev instead of to raid0 and raid5 layer, because with this way, we
can put common functions to md.h and reuse them in raid0 and raid5.

Also struct md_io_acct is added accordingly which includes io start_time,
the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct
to get related io status.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

10764815

md: revert io stats accounting · ad3fc798

由 Guoqing Jiang 提交于 5月 25, 2021

The commit 41d2d848 ("md: improve io stats accounting") could cause
double fault problem per the report [1], and also it is not correct to
change ->bi_end_io if md don't own it, so let's revert it.

And io stats accounting will be replemented in later commits.

[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t

Fixes: 41d2d848 ("md: improve io stats accounting")
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

ad3fc798

14 6月, 2021 3 次提交

dm writecache: use list_move instead of list_del/list_add in writecache_writeback() · 8c77f1cb

由 Baokun Li 提交于 6月 09, 2021

Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NBaokun Li <libaokun1@huawei.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

8c77f1cb

dm writecache: commit just one block, not a full page · 991bd8d7

由 Mikulas Patocka 提交于 6月 06, 2021

Some architectures have pages larger than 4k and committing a full
page causes needless overhead.

Fix this by writing a single block when committing the superblock.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

991bd8d7

M
dm writecache: remove unused gfp_t argument from wc_add_block() · 620cbe40
由 Mikulas Patocka 提交于 6月 06, 2021
```
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
620cbe40

12 6月, 2021 1 次提交

blk-mq: improve the blk_mq_init_allocated_queue interface · 26a9750a

由 Christoph Hellwig 提交于 6月 02, 2021

Don't return the passed in request_queue but a normal error code, and
drop the elevator_init argument in favor of just calling elevator_init_mq
directly from dm-rq.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

26a9750a

09 6月, 2021 2 次提交

bcache: avoid oversized read request in cache missing code path · 41fe8d08

由 Coly Li 提交于 6月 07, 2021

In the cache missing code path of cached device, if a proper location
from the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the
following code block,
[code block 1]
  526         unsigned int sectors = KEY_INODE(k) == s->iop.inode
  527                 ? min_t(uint64_t, INT_MAX,
  528                         KEY_START(k) - bio->bi_iter.bi_sector)
  529                 : INT_MAX;
  530         int ret = s->d->cache_miss(b, s, bio, sectors);

Here s->d->cache_miss() is the call backfunction pointer initialized as
cached_dev_cache_miss(), the last parameter 'sectors' is an important
hint to calculate the size of read request to backing device of the
missing cache data.

Current calculation in above code block may generate oversized value of
'sectors', which consequently may trigger 2 different potential kernel
panics by BUG() or BUG_ON() as listed below,

1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
   886         BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
   51         default:
   52                 BUG();
   53                 return NULL;

All the above panics are original from cached_dev_cache_miss() by the
oversized parameter 'sectors'.

Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
the size of data read from backing device for the cache missing. This
size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
  909    s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);

Then the actual key inserting to the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,
[code block 5]
  911   s->iop.replace_key = KEY(s->iop.inode,
  912                    bio->bi_iter.bi_sector + s->insert_bio_sectors,
  913                    s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
the above code block.

And the bio sending to backing device for the missing data is allocated
with hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
  926    cache_bio = bio_alloc_bioset(GFP_NOWAIT,
  927                 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
  928                 &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
agove code block.

Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
  71 #define KEY(inode, offset, size)                                  \
  72 ((struct bkey) {                                                  \
  73      .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),     \
  74      .low = (offset)                                              \
  75 })

Here 'size' is 16bits width embedded in 64bits member 'high' of struct
bkey. But in code block 1, if "KEY_START(k) - bio->bi_iter.bi_sector" is
very probably to be larger than (1<<16) - 1, which makes the bkey size
calculation in code block 5 is overflowed. In one bug report the value
of parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors'
results the overflowed s->insert_bio_sectors in code block 4, then makes
size field of s->iop.replace_key to be 0 in code block 5. Then the 0-
sized s->iop.replace_key is inserted into the internal B+ tree as cache
missing check key (a special key to detect and avoid a racing between
normal write request and cache missing read request) as,
[code block 8]
  915   ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);

Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
size check BUG_ON() in code block 2, and causes the kernel panic 1).

Another kernel panic is from code block 6, is by the bvecs number
oversized value s->insert_bio_sectors from code block 4,
        min(sectors, bio_sectors(bio) + reada)
There are two possibility for oversized reresult,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.

From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
other values which larther than BIO_MAX_VECS (a.k.a 256). When calling
bio_alloc_bioset() with such larger-than-256 value as the 2nd parameter,
this value will eventually be sent to biovec_slab() as parameter
'nr_vecs' in following code path,
   bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG()
in code block 3 is triggered inside biovec_slab().

From the above analysis, we know that the 4th parameter 'sector' sent
into cached_dev_cache_miss() may cause overflow in code block 5 and 6,
and finally cause kernel panic in code block 2 and 3. And if result of
bio_sectors(bio) + reada exceeds valid bvecs number, it may also trigger
kernel panic in code block 3 from code block 6.

Now the almost-useless readahead size for cache missing request back to
backing device is removed, this patch can fix the oversized issue with
more simpler method.
- add a local variable size_limit,  set it by the minimum value from
  the max bkey size and max bio bvecs number.
- set s->insert_bio_sectors by the minimum value from size_limit,
  sectors, and the sectors size of bio.
- replace sectors by s->insert_bio_sectors to do bio_next_split.

By the above method with size_limit, s->insert_bio_sectors will never
result oversized replace_key size or bio bvecs number. And split bio
'miss' from bio_next_split() will always match the size of 'cache_bio',
that is the current maximum bio size we can sent to backing device for
fetching the cache missing data.

Current problmatic code can be partially found since Linux v3.13-rc1,
therefore all maintained stable kernels should try to apply this fix.
Reported-by: NAlexander Ullrich <ealex1979@gmail.com>
Reported-by: NDiego Ercolani <diego.ercolani@gmail.com>
Reported-by: NJan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: NMarco Rebhan <me@dblsaiko.net>
Reported-by: NMatthias Ferdinand <bcache@mfedv.net>
Reported-by: NVictor Westerhuis <victor@westerhu.is>
Reported-by: NVojtech Pavlik <vojtech@suse.cz>
Reported-and-tested-by: NRolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: NThorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: NColy Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
Link: https://lore.kernel.org/r/20210607125052.21277-3-colyli@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

41fe8d08

bcache: remove bcache device self-defined readahead · 1616a4c2

由 Coly Li 提交于 6月 07, 2021

For read cache missing, bcache defines a readahead size for the read I/O
request to the backing device for the missing data. This readahead size
is initialized to 0, and almost no one uses it to avoid unnecessary read
amplifying onto backing device and write amplifying onto cache device.
Considering upper layer file system code has readahead logic allready
and works fine with readahead_cache_policy sysfile interface, we don't
have to keep bcache self-defined readahead anymore.

This patch removes the bcache self-defined readahead for cache missing
request for backing device, and the readahead sysfs file interfaces are
removed as well.

This is the preparation for next patch to fix potential kernel panic due
to oversized request in a simpler method.
Reported-by: NAlexander Ullrich <ealex1979@gmail.com>
Reported-by: NDiego Ercolani <diego.ercolani@gmail.com>
Reported-by: NJan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: NMarco Rebhan <me@dblsaiko.net>
Reported-by: NMatthias Ferdinand <bcache@mfedv.net>
Reported-by: NVictor Westerhuis <victor@westerhu.is>
Reported-by: NVojtech Pavlik <vojtech@suse.cz>
Reported-and-tested-by: NRolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: NThorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: NColy Li <colyli@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
Link: https://lore.kernel.org/r/20210607125052.21277-2-colyli@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

1616a4c2

05 6月, 2021 1 次提交

dm crypt: Fix zoned block device support · f34ee1dc

由 Damien Le Moal 提交于 5月 26, 2021

Zone append BIOs (REQ_OP_ZONE_APPEND) always specify the start sector
of the zone to be written instead of the actual sector location to
write. The write location is determined by the device and returned to
the host upon completion of the operation. This interface, while simple
and efficient for writing into sequential zones of a zoned block
device, is incompatible with the use of sector values to calculate a
cypher block IV. All data written in a zone end up using the same IV
values corresponding to the first sectors of the zone, but read
operation will specify any sector within the zone resulting in an IV
mismatch between encryption and decryption.

To solve this problem, report to DM core that zone append operations are
not supported. This result in the zone append operations being emulated
using regular write operations.
Reported-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

f34ee1dc

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功