Commits · dbae70d452a0858d62915166d93650c98fe6639c · openeuler / Kernel

May 14, 2021: 2 commits

dm integrity: revert to not using discard filler when recalculating · dbae70d4

Submitted by Mikulas Patocka on May 12, 2021

Revert the commit 7a5b96b4 ("dm integrity:
use discard support when recalculating").

There is a bug: when data is written beyond the current recalculate
boundary, its checksum is later overwritten with the discard filler,
and the data no longer has integrity protection. There is no easy fix
for this case.

Another problematic case is when dm-integrity is used to detect bitrot
(random device errors, bit flips, etc.): dm-integrity should detect
such corruption even in unused sectors. With commit 7a5b96b4, such a
change can go undetected (because the discard filler is not a valid
checksum).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm snapshot: fix crash with transient storage and zero chunk size · c699a0db

Submitted by Mikulas Patocka on May 10, 2021

The following commands will crash the kernel:

modprobe brd rd_size=1048576
dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"

The reason is that when we test for zero chunk size, we jump to the label
bad_read_metadata without setting the "r" variable. The function
snapshot_ctr destroys all the structures and then exits with "r == 0". The
kernel then crashes because it falsely believes that snapshot_ctr
succeeded.

In order to fix the bug, we set the variable "r" to -EINVAL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
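
The error-path pattern described above can be sketched in plain C. This is a minimal userspace model, not the dm-snapshot code: struct snap_sketch and snapshot_ctr_sketch() are hypothetical names, and only the label bad_read_metadata and the variable r come from the text. Every jump to the shared cleanup label must leave r holding a negative error code; otherwise the caller sees 0 and treats the already-freed object as constructed.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the snapshot target's private state. */
struct snap_sketch {
	unsigned chunk_size;
	void *metadata;
};

static int snapshot_ctr_sketch(struct snap_sketch **out, unsigned chunk_size)
{
	struct snap_sketch *s;
	int r;

	s = calloc(1, sizeof(*s));
	if (!s)
		return -ENOMEM;

	s->metadata = malloc(64);	/* pretend the metadata was read */
	if (!s->metadata) {
		r = -ENOMEM;
		goto bad_read_metadata;
	}

	if (chunk_size == 0) {
		/*
		 * Without this assignment r would not hold an error code
		 * (in the bug above it was still 0), so the caller would
		 * treat the freed object as successfully constructed.
		 */
		r = -EINVAL;
		goto bad_read_metadata;
	}

	s->chunk_size = chunk_size;
	*out = s;
	return 0;

bad_read_metadata:
	free(s->metadata);	/* free(NULL) is a no-op */
	free(s);
	return r;
}
```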

May 10, 2021: 1 commit

dm snapshot: fix a crash when an origin has no snapshots · 7ee06ddc

Submitted by Mikulas Patocka on May 7, 2021

If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().

Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX), the largest power of two that fits
into the "unsigned" type.

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
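
As a quick check of the value involved: with a 32-bit unsigned int, rounddown_pow_of_two(UINT_MAX) is 2^31 = 2147483648. The helper below is a hypothetical userspace stand-in for the kernel macro (which is not available outside the kernel); it keeps only the highest set bit of its argument and assumes a nonzero input.

```c
#include <limits.h>

/* Keep only the highest set bit of n; assumes n > 0. */
static unsigned long rounddown_pow_of_two_sketch(unsigned long n)
{
	unsigned long p = 1;

	while (n >>= 1)		/* one doubling per bit below the top bit */
		p <<= 1;
	return p;
}
```

For example, rounddown_pow_of_two_sketch(12) is 8, and rounddown_pow_of_two_sketch(UINT_MAX) is 2147483648, a nonzero split boundary.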

May 7, 2021: 1 commit

include: remove pagemap.h from blkdev.h · 4ee60ec1

Submitted by Matthew Wilcox (Oracle) on May 6, 2021

My UEK-derived config has 1030 files depending on pagemap.h before this
change. Afterwards, just 326 files need to be rebuilt when I touch
pagemap.h. I think blkdev.h is probably included too widely, but
untangling that dependency is harder and this solves my problem. x86
allmodconfig builds, but there may be implicit include problems on other
architectures.

Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm]
Acked-by: Jens Axboe <axboe@kernel.dk> [block]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi]
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

May 1, 2021: 5 commits

dm raid: remove unnecessary discard limits for raid0 and raid10 · ca4a4e9a

Submitted by Mike Snitzer on April 30, 2021

Commit 29efc390 ("md/md0: optimize raid0 discard handling") and
commit d30588b2 ("md/raid10: improve raid10 discard request")
removed MD raid0's and raid10's inability to properly handle large
discards, so eliminate the associated constraints from dm-raid's support.

Depends-on: 29efc390 ("md/md0: optimize raid0 discard handling")
Depends-on: d30588b2 ("md/raid10: improve raid10 discard request")
Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails · 8e947c8f

Submitted by Benjamin Block on April 29, 2021

When loading a device-mapper table for a request-based mapped device,
if the allocation/initialization of the blk_mq_tag_set for the device
fails, a subsequent device remove causes a double free.

E.g. (dmesg):
  device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
  device-mapper: ioctl: unable to set up device queue for new table.
  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address: 0305e098835de000 TEID: 0305e098835de803
  Fault in home space mode while using kernel ASCE.
  AS:000000025efe0007 R3:0000000000000024
  Oops: 0038 ilc:3 [#1] SMP
  Modules linked in: ... lots of modules ...
  Supported: Yes, External
  CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G        W      X    5.3.18-53-default #1 SLE15-SP3
  Hardware name: IBM 8561 T01 7I2 (LPAR)
  Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
             000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
             000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
             00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
  Krnl Code: 000000025e368eb8: c4180041e100       lgrl    %r1,25eba50b8
             000000025e368ebe: ecba06b93a55       risbg   %r11,%r10,6,185,58
            #000000025e368ec4: e3b010000008       ag      %r11,0(%r1)
            >000000025e368eca: e310b0080004       lg      %r1,8(%r11)
             000000025e368ed0: a7110001           tmll    %r1,1
             000000025e368ed4: a7740129           brc     7,25e369126
             000000025e368ed8: e320b0080004       lg      %r2,8(%r11)
             000000025e368ede: b904001b           lgr     %r1,%r11
  Call Trace:
   [<000000025e368eca>] kfree+0x42/0x330
   [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
   [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
   [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
   [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
   [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
   [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
   [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
   [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
   [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
   [<000000025e8c15ac>] system_call+0xd8/0x2c8
  Last Breaking-Event-Address:
   [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
  Kernel panic - not syncing: Fatal exception: panic_on_oops

When allocation/initialization of the blk_mq_tag_set fails in
dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
is not reset to NULL; so when dev_remove() later gets into
dm_mq_cleanup_mapped_device() it sees the pointer and tries to
uninitialize and free it again.

Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().

Cc: <stable@vger.kernel.org> # 4.6+
Fixes: 1c357a1e ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
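
The fix follows a common C idiom: after freeing an object that is still reachable from a longer-lived structure, reset the pointer so that later cleanup can test it. A minimal userspace sketch, assuming hypothetical md_sketch/tag_set_sketch types in place of struct mapped_device and struct blk_mq_tag_set (the real allocation and teardown calls are not modelled):

```c
#include <errno.h>
#include <stdlib.h>

struct tag_set_sketch { int nr_hw_queues; };
struct md_sketch { struct tag_set_sketch *tag_set; };

/* Allocates the tag set, then optionally fails as in the error path above. */
static int init_queue_sketch(struct md_sketch *md, int simulate_failure)
{
	md->tag_set = malloc(sizeof(*md->tag_set));
	if (!md->tag_set)
		return -ENOMEM;
	if (simulate_failure) {
		free(md->tag_set);
		md->tag_set = NULL;	/* later cleanup must not see a stale pointer */
		return -EINVAL;
	}
	md->tag_set->nr_hw_queues = 1;
	return 0;
}

/* Safe after a failed init, and safe to call twice. */
static void cleanup_sketch(struct md_sketch *md)
{
	if (md->tag_set) {
		free(md->tag_set);
		md->tag_set = NULL;
	}
}
```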

dm integrity: use discard support when recalculating · 7a5b96b4

Submitted by Mikulas Patocka on April 28, 2021

If we have discard support, we don't have to recalculate the hash; we
can just fill the metadata with the discard pattern.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: increase RECALC_SECTORS to improve recalculate speed · b1a2b933

Submitted by Mikulas Patocka on April 27, 2021

Increase RECALC_SECTORS because it improves recalculate speed slightly
(from 390 KiB/s to 410 KiB/s).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: don't re-write metadata if discarding same blocks · a9c0fda4

Submitted by Mikulas Patocka on April 27, 2021

If we discard already discarded blocks, we do not need to write the
discard pattern to the metadata, because it is already there.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
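
A minimal sketch of the "skip the write if the pattern is already there" check, assuming the metadata for the discarded range is a plain byte buffer and the filler is a single repeated byte. The function name and the filler value used below are hypothetical; the commit text does not say how dm-integrity performs the comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the discard filler over a run of metadata tags only if at least
 * one byte differs from it. Returns true when a write (and the metadata
 * I/O that would follow) is actually needed.
 */
static bool fill_discard_if_needed(unsigned char *tags, size_t len,
				   unsigned char filler)
{
	for (size_t i = 0; i < len; i++) {
		if (tags[i] != filler) {
			memset(tags, filler, len);
			return true;
		}
	}
	return false;	/* already discarded: nothing to write */
}
```

Calling it twice on the same range returns true the first time and false the second, which is the saving the commit describes.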

April 24, 2021: 2 commits

md/raid1: properly indicate failure when ending a failed write request · 2417b986

Submitted by Paul Clements on April 15, 2021

This patch addresses a data corruption bug in raid1 arrays using bitmaps.
Without this fix, the bitmap bits for the failed I/O end up being cleared.

Since we are in the failure leg of raid1_end_write_request, the request
either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).

Fixes: eeba6809 ("md/raid1: end bio when the device faulty")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Paul Clements <paul.clements@us.sios.com>
Signed-off-by: Song Liu <song@kernel.org>
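
The invariant described above can be modelled with two state flags. This is a simplified, hypothetical model, not the raid1 code: the SK_* flags stand in for R1BIO_WriteError and R1BIO_Degraded, and can_retry stands in for the real checks that choose between retrying and failing. In this model a failed write always ends up with one of the two flags set, and bitmap bits are cleared only when neither is set, so a failure cannot be mistaken for a completed write.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for R1BIO_WriteError / R1BIO_Degraded. */
enum {
	SK_WRITE_ERROR = 1u << 0,	/* failed; retry later */
	SK_DEGRADED    = 1u << 1,	/* failed; do not retry */
};

/* Failure leg of a simplified end-of-write handler. */
static unsigned end_failed_write_sketch(unsigned state, bool can_retry)
{
	if (can_retry)
		state |= SK_WRITE_ERROR;
	else
		state |= SK_DEGRADED;	/* leaving both clear would look like success */
	return state;
}

/* In this model, bitmap bits may be cleared only when neither flag is set. */
static bool may_clear_bitmap_bits(unsigned state)
{
	return !(state & (SK_WRITE_ERROR | SK_DEGRADED));
}
```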

md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9

Submitted by Heming Zhao on April 8, 2021

md_kick_rdev_from_array will remove the rdev from the list, so the
list must be walked with rdev_for_each_safe.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss && ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He <ghe@suse.com>
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
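
The difference between the plain and the _safe list iterators is that the safe variant reads the successor before the loop body runs, so the body may unlink and free the current entry. A minimal userspace sketch with a hand-rolled circular list; struct node and the helpers are hypothetical and only mimic the kernel's list_head style.

```c
#include <stdlib.h>

struct node {
	struct node *prev, *next;
	int faulty;
};

static void list_init(struct node *head) { head->prev = head->next = head; }

static void list_add_tail_sk(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_free(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	free(n);	/* after this, n->next must not be read */
}

/*
 * Remove every faulty entry. 'tmp' caches the successor before the
 * current entry may be freed, as the _safe iterators do; reading
 * pos->next after list_del_free(pos) would be a use-after-free.
 */
static int kick_faulty(struct node *head)
{
	struct node *pos, *tmp;
	int removed = 0;

	for (pos = head->next, tmp = pos->next; pos != head;
	     pos = tmp, tmp = pos->next) {
		if (pos->faulty) {
			list_del_free(pos);
			removed++;
		}
	}
	return removed;
}
```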

April 22, 2021: 1 commit

dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences · f99a8e43

Submitted by Heinz Mauelshagen on April 21, 2021

If fast table reloads occur during an ongoing reshape of raid4/5/6
devices, the target may race reading a superblock vs the MD resync
thread, causing an inconclusive reshape state to be read in its
constructor.

lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause
BUG_ON() to trigger in md_run(), e.g.:
"kernel BUG at drivers/md/raid5.c:7567!".

Scenario triggering the bug:

1. The MD sync thread calls end_reshape() from raid5_sync_request()
   when done reshaping. However, end_reshape() _only_ updates the
   reshape position to MaxSector, keeping the changed layout
   configuration (i.e. any delta disks, chunk sector or RAID
   algorithm changes). That inconclusive configuration is stored in
   the superblock.

2. dm-raid constructs a mapping, loading the inconsistent superblock
   from step 1 before step 3 is able to finish resetting the reshape
   state completely, and calls md_run(), which leads to the mentioned
   bug in raid5.c.

3. The MD RAID personality's finish_reshape() is called, which resets
   the reshape information on chunk sectors, delta disks, etc. This
   explains why the bug is rarely seen on multi-core machines, as MD's
   finish_reshape() superblock update races with the dm-raid
   constructor's superblock load in step 2.

The fix identifies inconclusive superblock content in the dm-raid
constructor and resets it before calling md_run(), factoring out the
identifying checks into rs_is_layout_change() so they can be shared by
the existing rs_reshape_requested() and the new
rs_reset_inclonclusive_reshape(). Also enhance a comment and remove an
empty line.

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

April 21, 2021: 1 commit

dm raid: fix fall-through warning in rs_check_takeover() for Clang · be962b2f

Submitted by Gustavo A. R. Silva on April 20, 2021

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
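
A sketch of the kind of change described above, using a hypothetical switch statement rather than the actual rs_check_takeover() code: the case body ends with an explicit break instead of running into the next label, which is what Clang's -Wimplicit-fallthrough reports when neither a break nor a fallthrough annotation is present.

```c
/*
 * Hypothetical example, not the dm-raid code. Without the break marked
 * below, control would run from "case 5" into "default", and Clang with
 * -Wimplicit-fallthrough would warn about the unannotated fall-through.
 */
static int classify_sketch(int level)
{
	int r = -1;

	switch (level) {
	case 0:
		r = 0;
		break;
	case 5:
		r = 5;
		break;	/* explicit break instead of falling through */
	default:
		break;
	}
	return r;
}
```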

April 20, 2021: 7 commits

dm clone metadata: remove unused function · 87d5742b

Submitted by Jiapeng Chong on April 14, 2021

Fix the following clang warning:

drivers/md/dm-clone-metadata.c:279:19: warning: unused function
'superblock_write_lock' [-Wunused-function].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: fix missing goto in bitmap_flush_interval error handling · 17e9e134

Submitted by Tian Tao on April 14, 2021

Fixes: 468dfca3 ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: replace dm_vcalloc() · 7a35693a

Submitted by Matthew Wilcox (Oracle) on April 7, 2021

Use kvcalloc or kvmalloc_array instead (depending on whether zeroing is
useful).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
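
In the kernel, kvmalloc_array() checks n * size for overflow before allocating, and kvcalloc() is the same allocation with zeroing added. The sketch below is a userspace analogue of that distinction only: array_alloc_sketch() is hypothetical, it does not model the vmalloc fallback, and it relies on the GCC/Clang builtin __builtin_mul_overflow().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Refuse n * size overflow like kvmalloc_array(); zero the memory only
 * when asked, like kvcalloc().
 */
static void *array_alloc_sketch(size_t n, size_t size, int zero)
{
	size_t bytes;
	void *p;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;		/* n * size does not fit in size_t */
	p = malloc(bytes ? bytes : 1);	/* avoid the malloc(0) corner case */
	if (p && zero)
		memset(p, 0, bytes);
	return p;
}
```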

dm space map common: fix division bug in sm_ll_find_free_block() · 5208692e

Submitted by Joe Thornber on April 13, 2021

This division bug meant the search for free metadata space could skip
the final allocation bitmap's worth of entries. Fix affects DM thinp,
cache and era targets.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
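
The text does not show the division involved, so the following is only an illustration of one way such a bug can skip exactly the final bitmap's worth of entries. A [begin, end) range is split into per-bitmap pieces, and the end offset within the last bitmap is taken as end % per. When end is an exact multiple of per, that remainder is 0 and the last bitmap is searched over zero entries. entries_searched() is hypothetical and simply counts the entries that would be scanned.

```c
#include <stdint.h>

/*
 * Count the entries a search over [begin, end) would scan when the range
 * is split into bitmaps of 'per' entries each. With fix == 0 the end
 * offset inside the last bitmap is end % per, which is 0 whenever end is
 * an exact multiple of per, so that bitmap is skipped.
 */
static uint64_t entries_searched(uint64_t begin, uint64_t end, uint64_t per,
				 int fix)
{
	uint64_t index_begin = begin / per;
	uint64_t index_end = (end + per - 1) / per;	/* round up */
	uint64_t b = begin % per, e = end % per, total = 0;

	if (fix && e == 0)
		e = per;	/* exact multiple: search the whole last bitmap */

	for (uint64_t i = index_begin; i < index_end; i++) {
		uint64_t lo = (i == index_begin) ? b : 0;
		uint64_t hi = (i == index_end - 1) ? e : per;

		if (hi > lo)
			total += hi - lo;
	}
	return total;
}
```

With 100 entries per bitmap and the range [0, 300), the unfixed count is 200 rather than 300; a range ending at 250 is unaffected.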

dm persistent data: packed struct should have an aligned() attribute too · a88b2358

Submitted by Joe Thornber on March 29, 2021

Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
byte-by-byte access, because the packed attribute on its own lowers the
struct's alignment to 1 byte.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
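
The effect of the attribute can be seen with _Alignof. In this sketch the field names are hypothetical and the layout is chosen so that every field is naturally aligned, as on-disk formats usually are. Both structs are 16 bytes with no padding, but only the second one tells the compiler that its 64-bit field starts on an 8-byte boundary.

```c
#include <stdint.h>

/* packed only: alignment drops to 1, so no field is assumed aligned */
struct __attribute__((packed)) rec_packed {
	uint64_t blocknr;
	uint32_t flags;
	uint32_t nr_entries;
};

/*
 * packed + aligned(8): still no padding, but the struct starts on an
 * 8-byte boundary, so its naturally aligned fields can be accessed
 * with ordinary word loads.
 */
struct __attribute__((packed, aligned(8))) rec_packed_aligned {
	uint64_t blocknr;
	uint32_t flags;
	uint32_t nr_entries;
};
```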

dm btree spine: remove paranoid node_check call in node_prep_for_write() · f73e2e70

Submitted by Joe Thornber on March 29, 2021

Remove this extra BUG_ON() that calls node_check(), which avoids a
redundant CRC check.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm space map disk: remove redundant calls to sm_disk_get_nr_free() · d6db294f

Submitted by Joe Thornber on March 9, 2021

Both sm_disk_new_block and sm_disk_commit needlessly call
sm_disk_get_nr_free(). These look like leftover queries from old
debugging.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

April 16, 2021: 4 commits

md/bitmap: wait for external bitmap writes to complete during tear down · 404a8ef5

Submitted by Sudhakar Panneerselvam on April 13, 2021

A NULL pointer dereference was observed in super_written() when it
tried to access the mddev structure.

[The below stack trace is from an older kernel, but the problem described
in this patch applies to the mainline kernel.]

[ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
[ 1194.488000] RIP: 0010:super_written+0x29/0xe1
[ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
[ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
[ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
[ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
[ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
[ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1194.588117] FS:  0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
[ 1194.604264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
[ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1194.663316] PKRU: 55555554
[ 1194.674090] Call Trace:
[ 1194.683735]  <IRQ>
[ 1194.692948]  bio_endio+0xae/0x135
[ 1194.703580]  blk_update_request+0xad/0x2fa
[ 1194.714990]  blk_update_bidi_request+0x20/0x72
[ 1194.726578]  __blk_end_bidi_request+0x2c/0x4d
[ 1194.738373]  __blk_end_request_all+0x31/0x49
[ 1194.749344]  blk_flush_complete_seq+0x377/0x383
[ 1194.761550]  flush_end_io+0x1dd/0x2a7
[ 1194.772910]  blk_finish_request+0x9f/0x13c
[ 1194.784544]  scsi_end_request+0x180/0x25c
[ 1194.796149]  scsi_io_completion+0xc8/0x610
[ 1194.807503]  scsi_finish_command+0xdc/0x125
[ 1194.818897]  scsi_softirq_done+0x81/0xde
[ 1194.830062]  blk_done_softirq+0xa4/0xcc
[ 1194.841008]  __do_softirq+0xd9/0x29f
[ 1194.851257]  irq_exit+0xe6/0xeb
[ 1194.861290]  do_IRQ+0x59/0xe3
[ 1194.871060]  common_interrupt+0x1c6/0x382
[ 1194.881988]  </IRQ>
[ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
[ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
[ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
[ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
[ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
[ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
[ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
[ 1194.988607]  ? cpuidle_enter_state+0xcc/0x2a5
[ 1194.999077]  cpuidle_enter+0x17/0x19
[ 1195.008395]  call_cpuidle+0x23/0x3a
[ 1195.017718]  do_idle+0x172/0x1d5
[ 1195.026358]  cpu_startup_entry+0x73/0x75
[ 1195.035769]  start_secondary+0x1b9/0x20b
[ 1195.044894]  secondary_startup_64+0xa5/0xa5
[ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
[ 1195.096354] CR2: 00000000000002b8

The bio in the above stack is a bitmap write whose completion is invoked
after the teardown sequence has set the mddev pointer in the rdev to NULL.

During teardown, there is an attempt to flush the bitmap writes, but for
external bitmaps there is no explicit wait for all the bitmap writes to
complete. For instance, md_bitmap_flush() is called to flush the bitmap
writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
can generate new bitmap writes, and nothing explicitly waits for those
writes to complete. The call to md_bitmap_update_sb() simply returns for
external bitmaps, and the follow-up call to md_update_sb() is conditional
and may not be made for external bitmaps. This results in a kernel panic
when the completion routine, super_written(), is called and tries to
reference mddev in the rdev, which has been set to NULL (in
unbind_rdev_from_array() by the teardown sequence).

The solution is to call md_super_wait() for external bitmaps after the
last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
are no pending bitmap writes before proceeding with the teardown.

Cc: stable@vger.kernel.org
Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
Reviewed-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>

md: do not return existing mddevs from mddev_find_or_alloc · 0d809b38

Submitted by Christoph Hellwig on April 12, 2021

Instead of returning an existing mddev only for it to be discarded
later, directly return -EEXIST. Rename the function to mddev_alloc now
that it no longer finds an existing mddev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
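
Together with d144fe6f ("md: refactor mddev_find_or_alloc"), the resulting flow is: allocate first, then look for an existing entry, and if one exists, free the new allocation and return -EEXIST. A single-threaded sketch under hypothetical names (struct dev_sketch, dev_alloc_sketch()); the real code also holds a lock around the lookup and insertion, which is only marked in comments here.

```c
#include <errno.h>
#include <stdlib.h>

struct dev_sketch {
	int unit;
	struct dev_sketch *next;
};

static struct dev_sketch *all_devs;	/* stand-in for the all_mddevs list */

static struct dev_sketch *find_locked(int unit)
{
	for (struct dev_sketch *d = all_devs; d; d = d->next)
		if (d->unit == unit)
			return d;
	return NULL;
}

/* Allocate speculatively; report -EEXIST instead of returning a match. */
static int dev_alloc_sketch(int unit, struct dev_sketch **out)
{
	struct dev_sketch *fresh = calloc(1, sizeof(*fresh));

	if (!fresh)
		return -ENOMEM;
	/* lock would be taken here */
	if (find_locked(unit)) {
		/* unlock */
		free(fresh);
		return -EEXIST;
	}
	fresh->unit = unit;
	fresh->next = all_devs;
	all_devs = fresh;
	/* unlock */
	*out = fresh;
	return 0;
}
```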

md: refactor mddev_find_or_alloc · d144fe6f

Submitted by Christoph Hellwig on April 12, 2021

Allocate the new mddev first speculatively, which greatly simplifies
the code flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

md: factor out a mddev_alloc_unit helper from mddev_find · 85c8c3c1

Submitted by Christoph Hellwig on April 12, 2021

Split out a self-contained helper to find a free minor for the md
"unit" number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

April 15, 2021: 1 commit

dm verity fec: fix misaligned RS roots IO · 8ca7cab8

Submitted by Jaegeuk Kim on April 14, 2021

Commit df7b59ba ("dm verity: fix FEC for RS roots unaligned to
block size") introduced the possibility of misaligned roots I/O
relative to the underlying device's logical block size. E.g. Android's
default RS roots=2 results in dm_bufio->block_size=1024, which causes
the following EIO if the logical block size of the device is 4096,
given v->data_dev_block_bits=12:

E sd 0 : 0:0:0: [sda] tag#30 request not aligned to the logical block size
E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5

Fix this by only using f->roots for the dm_bufio block size IFF it is
aligned to v->data_dev_block_bits.

Fixes: df7b59ba ("dm verity: fix FEC for RS roots unaligned to block size")
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
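
The size arithmetic from the text: roots=2 gives 2 << 9 = 1024 bytes (512-byte sectors), which is not a multiple of the 4096-byte block implied by data_dev_block_bits=12. A sketch of the selection rule follows. fec_io_size_sketch() is hypothetical, and falling back to one data block when the roots size is misaligned is an assumption: the text only says the roots size is used when it is aligned.

```c
#define SK_SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Use roots << 9 bytes as the buffer block size only when it is a
 * multiple of the data device block size; otherwise (assumed) use one
 * data block.
 */
static unsigned fec_io_size_sketch(unsigned roots, unsigned data_dev_block_bits)
{
	unsigned roots_bytes = roots << SK_SECTOR_SHIFT;
	unsigned block_mask = (1u << data_dev_block_bits) - 1;

	if (roots_bytes & block_mask)	/* misaligned, e.g. 1024 vs 4096 */
		return 1u << data_dev_block_bits;
	return roots_bytes;
}
```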

April 11, 2021: 7 commits

bcache: fix a regression of code compiling failure in debug.c · 33ec5dfe

Submitted by Coly Li on April 11, 2021

The patch "bcache: remove PTR_CACHE" introduces a compilation failure in
debug.c with the following error message:
  In file included from drivers/md/bcache/bcache.h:182:0,
                   from drivers/md/bcache/debug.c:9:
  drivers/md/bcache/debug.c: In function 'bch_btree_verify':
  drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in
  this function)
    bio_set_dev(bio, c->cache->bdev);
                     ^

This patch fixes the regression by replacing c->cache->bdev with
b->c->cache->bdev.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Use 64-bit arithmetic instead of 32-bit · 62594f18

Submitted by Gustavo A. R. Silva on April 11, 2021

Cast multiple variables to (int64_t) in order to give the compiler
complete information about the proper arithmetic to use. Notice that
these variables are being used in contexts that expect expressions of
type int64_t (64 bit, signed). And currently, such expressions are
being evaluated using 32-bit arithmetic.

Fixes: d0cf9503 ("octeontx2-pf: ethtool fec mode support")
Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
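
A minimal illustration of the problem Coverity reports: when both operands are 32 bits wide, the multiplication is done in 32 bits and only the already-wrapped result is widened to 64 bits. The functions are hypothetical, and unsigned operands are used so that the wrap-around is well defined; with signed int operands the overflow would be undefined behaviour.

```c
#include <stdint.h>

/* Multiply in 32 bits, then widen: the product wraps first. */
static int64_t product_32(uint32_t a, uint32_t b)
{
	return a * b;
}

/* Cast one operand first, so the whole multiply is done in 64 bits. */
static int64_t product_64(uint32_t a, uint32_t b)
{
	return (int64_t)a * b;
}
```

For a = b = 100000, product_32() yields 1410065408 (10^10 modulo 2^32), while product_64() yields 10000000000.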

md: bcache: Trivial typo fixes in the file journal.c · 9c9b81c4

Submitted by Bhaskar Chowdhury on April 11, 2021

s/condidate/candidate/
s/folowing/following/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

md: bcache: avoid -Wempty-body warnings · be3bacec

Submitted by Arnd Bergmann on April 11, 2021

Building with 'make W=1' shows a harmless warning for each user of the
EBUG_ON() macro:

drivers/md/bcache/bset.c: In function 'bch_btree_sort_partial':
drivers/md/bcache/util.h:30:55: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
   30 | #define EBUG_ON(cond)                   do { if (cond); } while (0)
      |                                                       ^
drivers/md/bcache/bset.c:1312:9: note: in expansion of macro 'EBUG_ON'
 1312 |         EBUG_ON(oldsize >= 0 && bch_count_data(b) != oldsize);
      |         ^~~~~~~

Reword the macro slightly to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
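
A sketch of one way to reword such a macro, assuming the goal is to keep evaluating cond (as the original do { if (cond); } while (0) does) while avoiding an empty if body. EBUG_ON_SKETCH and the helper functions are hypothetical; the exact wording adopted in bcache is not shown in the text.

```c
/* Evaluates cond, but the if statement no longer has an empty body. */
#define EBUG_ON_SKETCH(cond) do { if (cond) do {} while (0); } while (0)

static int calls;

static int check_something(void)
{
	calls++;	/* side effect shows that cond is still evaluated */
	return 0;
}

static void use_ebug_on(void)
{
	EBUG_ON_SKETCH(check_something() != 0);
}
```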

bcache: use NULL instead of using plain integer as pointer · f9a018e8

Submitted by Yang Li on April 11, 2021

This fixes the following sparse warning:
drivers/md/bcache/features.c:22:16: warning: Using plain integer as NULL pointer

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: remove PTR_CACHE · 11e9560e

Submitted by Christoph Hellwig on April 11, 2021

Remove the PTR_CACHE inline and replace it with a direct dereference
of c->cache.

(Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: reduce redundant code in bch_cached_dev_run() · 13e1db65

Submitted by Zhiqiang Liu on April 11, 2021

In bch_cached_dev_run(), free(env[1])|free(env[2])|free(buf) show up
three times. This patch introduces an out label in which
free(env[1])|free(env[2])|free(buf) are called only once. When an error
occurs and these need to be freed, we set the error code in ret and
jump directly to the out label.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
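
The single-exit cleanup pattern described above, as a hypothetical reduction: run_sketch() stands in for bch_cached_dev_run(), the three buffers stand in for env[1], env[2] and buf, and fail_step simulates an error partway through. Every error path sets ret and jumps to out, so each free() appears exactly once.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static int run_sketch(const char *name, int fail_step)
{
	char *env1 = NULL, *env2 = NULL, *buf = NULL;
	int ret = 0;

	env1 = malloc(64);
	env2 = malloc(64);
	buf = malloc(64);
	if (!env1 || !env2 || !buf) {
		ret = -ENOMEM;
		goto out;
	}
	snprintf(env1, 64, "KEY1=%s", name);
	snprintf(env2, 64, "KEY2=%s", name);

	if (fail_step == 1) {	/* simulated failure after setup */
		ret = -EBUSY;
		goto out;
	}
	snprintf(buf, 64, "%s %s", env1, env2);

out:
	/* single cleanup site; free(NULL) is a no-op */
	free(env1);
	free(env2);
	free(buf);
	return ret;
}
```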

April 9, 2021: 1 commit

treewide: Change list_sort to use const pointers · 4f0f586b

Submitted by Sami Tolvanen on April 8, 2021

list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
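
A userspace sketch of the idea: the comparator is declared with exactly the parameter types of the typedef (void *, const struct list_head *, const struct list_head *), so it can be passed without a cast and an indirect call through the typedef matches the callee's real type. The list helpers and list_sort_sketch() (an insertion sort rather than the kernel's merge sort) are hypothetical stand-ins; the kernel's typedef may also carry attributes, which are omitted here.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same parameter types as the typedef described above. */
typedef int (*list_cmp_func_t)(void *priv, const struct list_head *a,
			       const struct list_head *b);

/* const-preserving container_of for this sketch */
#define container_of_const(ptr, type, member) \
	((const type *)((const char *)(ptr) - offsetof(type, member)))

struct item {
	int key;
	struct list_head node;
};

/* Insert e just before pos (before the head means at the tail). */
static void list_insert_before(struct list_head *e, struct list_head *pos)
{
	e->prev = pos->prev;
	e->next = pos;
	pos->prev->next = e;
	pos->prev = e;
}

static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Written against the const-qualified signature: no cast is needed. */
static int item_cmp(void *priv, const struct list_head *a,
		    const struct list_head *b)
{
	const struct item *ia = container_of_const(a, struct item, node);
	const struct item *ib = container_of_const(b, struct item, node);

	(void)priv;
	return (ia->key > ib->key) - (ia->key < ib->key);
}

/* Stable insertion sort; a stand-in for the kernel's list_sort(). */
static void list_sort_sketch(void *priv, struct list_head *head,
			     list_cmp_func_t cmp)
{
	struct list_head sorted = { &sorted, &sorted };

	while (head->next != head) {
		struct list_head *e = head->next, *pos;

		list_unlink(e);
		/* Skip entries <= e so that equal keys keep their order. */
		for (pos = sorted.next; pos != &sorted && cmp(priv, pos, e) <= 0;
		     pos = pos->next)
			;
		list_insert_before(e, pos);
	}
	if (sorted.next != &sorted) {	/* splice the sorted list back */
		head->next = sorted.next;
		head->prev = sorted.prev;
		sorted.next->prev = head;
		sorted.prev->next = head;
	}
}
```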

April 8, 2021: 3 commits

md: split mddev_find · 65aa97c4

Submitted by Christoph Hellwig on April 3, 2021

Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find_or_alloc
that deals with finding or allocating an mddev.

This turns out to fix the following bug reported by Zhao Heming.

----------------------------- snip ------------------------------
commit d3374825 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing.
md_open shouldn't create an mddev when the all_mddevs list doesn't
contain it. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt environment.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

This triggers the bug about once in every 10 runs.

```
1  node1="15sp3-mdcluster1"
2  node2="15sp3-mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..100}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```
I used an md-cluster environment to trigger the soft lockup, but this
is not an md-cluster-specific bug. Stopping an md array in a cluster
environment does more work than stopping a non-clustered array, which
leaves a large enough window for the kernel to run md_open.

*** stack ***

```
ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
 #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
 #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
 #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
 #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
 #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
 #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
*** rootcause ***

"mdadm -A" (or other array assemble commands) starts an "mdadm
--monitor" daemon by default. When "mdadm -Ss" is running, the stop
action wakes up "mdadm --monitor". The "--monitor" daemon immediately
reads /proc/mdstat. At this point the mddev still exists in the kernel,
so /proc/mdstat still shows the md device, which makes "mdadm --monitor"
open /dev/md0.

The earlier "mdadm -Ss" is a removing action, while the "mdadm --monitor"
open triggers md_open, which is a creating action. The two race.

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```
In a non-preemptible kernel, <thread 2> keeps retrying on the current CPU,
and the mddev_delayed_delete work queued by <thread 1> cannot be scheduled,
so the retry never ends.

A preemptible kernel can hit the same race, but the kernel does not let one
thread run on a CPU indefinitely. After <thread 2> has run for some time, the
later "mdadm -A" (line 13 of the script above) calls md_alloc to allocate a
new gendisk for the mddev. That makes the md_open check
"if (mddev->gendisk != bdev->bd_disk)" false, so md_open returns 0 to its
caller and the soft lockup is broken.
------------------------------ snip ------------------------------

Cc: stable@vger.kernel.org
Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

65aa97c4

md: factor out a mddev_find_locked helper from mddev_find · 8b57251f

Committed by Christoph Hellwig on Apr 03, 2021

Factor out a self-contained helper that just looks up an mddev by its dev_t
"unit".

Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

8b57251f

md: md_open returns -EBUSY when entering racing area · 6a4db2a6

Committed by Zhao Heming on Apr 03, 2021

commit d3374825 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creation and removal: md_open
shouldn't create an mddev when the all_mddevs list doesn't contain it. With
the current code, it is very easy to trigger a soft lockup in a
non-preemptible kernel.

This patch changes md_open to return -EBUSY instead of -ERESTARTSYS, which
breaks the infinite retry when md_open enters the racing area.

This patch only partly fixes the soft lockup. The full fix requires splitting
mddev_find into two functions, mddev_find and mddev_find_or_alloc, with
md_open calling the new mddev_find (which only searches).

For more details, please refer to Christoph's "split mddev_find" patch in
later commits.

*** env ***
kvm-qemu VM (2C1G: 2 vCPUs, 1 GB RAM) with 2 iSCSI LUNs
the kernel should be non-preemptible

*** script ***

The soft lockup triggers almost every time with the script below:

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I used an mdcluster environment to trigger the soft lockup, but this isn't
an mdcluster-specific bug. Stopping an array in an mdcluster environment
does more work than stopping a non-cluster array, which leaves a wide enough
window for the kernel to run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or another array assembly command) starts a "mdadm --monitor"
daemon by default. When "mdadm -Ss" runs, the stop action wakes up
"mdadm --monitor", which immediately reads /proc/mdstat. At that moment the
mddev still exists in the kernel, so /proc/mdstat still lists the md device,
and "mdadm --monitor" opens /dev/md0.

So while the earlier "mdadm -Ss" is removing the device, the open from
"mdadm --monitor" triggers md_open, which creates a new mddev. The two
actions race:

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In a non-preemptible kernel, <thread 2> keeps retrying on the current CPU,
and the mddev_delayed_delete work queued by <thread 1> cannot be scheduled,
so the retry never ends.

A preemptible kernel can hit the same race, but the kernel does not let one
thread run on a CPU indefinitely. After <thread 2> has run for some time, the
later "mdadm -A" (line 13 of the script above) calls md_alloc to allocate a
new gendisk for the mddev. That makes the md_open check
"if (mddev->gendisk != bdev->bd_disk)" false, so md_open returns 0 to its
caller and the soft lockup is broken.

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>

6a4db2a6

27 Mar, 2021: 4 commits

dm integrity: add the "reset_recalculate" feature flag · db7b93e3

Committed by Mikulas Patocka on Mar 23, 2021

Add a new flag "reset_recalculate" that will restart recalculating
from the beginning of the device. It can be used if we want to change
the hash function. Example:

dmsetup remove_all
rmmod brd
set -e
modprobe brd rd_size=1048576
dmsetup create in --table '0 2000000 integrity /dev/ram0 0 16 J 2 internal_hash:sha256 recalculate'
sleep 10
dmsetup status
dmsetup remove in
dmsetup create in --table '0 2000000 integrity /dev/ram0 0 16 J 2 internal_hash:sha3-256 reset_recalculate'
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

db7b93e3

dm persistent data: remove unused return from exit_shadow_spine() · ece25773

Committed by Jiapeng Chong on Mar 19, 2021

Fix the following coccicheck warnings:

./drivers/md/persistent-data/dm-btree-spine.c:188:5-6: Unneeded
variable: "r". Return "0" on line 194.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ece25773

dm cache: remove needless request_queue NULL pointer checks · 63508e38

Committed by Xu Wang on Mar 19, 2021

Since commit ff9ea323 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

63508e38

dm thin: remove needless request_queue NULL pointer check · 695902bb

Committed by Xu Wang on Mar 19, 2021

Since commit ff9ea323 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

695902bb

openeuler / Kernel: synced successfully nearly 2 years ago
