Commits · 8b57251f9a91f5e5a599de7549915d2d226cc3af · openeuler / Kernel

08 Apr, 2021 2 commits

md: factor out a mddev_find_locked helper from mddev_find · 8b57251f

Authored by Christoph Hellwig on Apr 03, 2021

Factor out a self-contained helper that just looks up an mddev by the
dev_t "unit".

Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

md: md_open returns -EBUSY when entering racing area · 6a4db2a6

Authored by Zhao Heming on Apr 03, 2021

Commit d3374825 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creation and removal.
md_open shouldn't create an mddev when the all_mddevs list doesn't
contain it. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt environment.

This patch changes the md_open return value from -ERESTARTSYS to -EBUSY,
which breaks the infinite retry when md_open enters the racing area.

This patch is only a partial fix for the soft lockup issue. The full fix
needs mddev_find to be split into two functions: mddev_find &
mddev_find_or_alloc. md_open should then call the new mddev_find (which
only does the lookup).

For more detail, please refer to Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

The script below triggers the lockup almost every time.

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use an mdcluster environment to trigger the soft lockup, but it isn't
an mdcluster-specific bug. Stopping an md array in an mdcluster
environment does more work than stopping a non-cluster array, which
leaves enough of a time gap for the kernel to run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assembly commands) starts an "mdadm
--monitor" daemon by default. When "mdadm -Ss" runs, the stop action
wakes up "mdadm --monitor". The "--monitor" daemon immediately reads
/proc/mdstat. At this point the mddev still exists in the kernel, so
/proc/mdstat still shows the md device, which makes "mdadm --monitor"
open /dev/md0.

The earlier "mdadm -Ss" is a removing action, while the open issued by
"mdadm --monitor" triggers md_open, which is a creating action. The two
race.

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In a non-preempt kernel, <thread 2> keeps occupying the current CPU, so
mddev_delayed_delete, which was queued by <thread 1>, cannot be
scheduled.

In a preempt kernel, the same race can be triggered, but the kernel
doesn't allow one thread to run on a CPU all the time. After <thread 2>
has run for some time, the later "mdadm -A" (see line 13 of the script
above) calls md_alloc to allocate a new gendisk for the mddev. The
md_open check "if (mddev->gendisk != bdev->bd_disk)" then no longer
triggers, md_open returns 0 to the caller, and the soft lockup is broken.
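
The retry loop and the effect of the return-code change can be sketched with a small Python model. This is an illustration only: `AllMddevs`, `Mddev`, `md_open`, and `open_with_retry` are simplified stand-ins, not the kernel code, and the loop assumes the caller reopens the device for as long as it sees -ERESTARTSYS.

```python
import errno

ERESTARTSYS = 512  # kernel-internal restart code, not exposed in Python's errno


class Mddev:
    """Toy stand-in for struct mddev; `gendisk` stays None until md_alloc runs."""
    def __init__(self, unit):
        self.unit = unit
        self.gendisk = None


class AllMddevs:
    """Toy stand-in for the all_mddevs list, keyed by the dev_t unit."""
    def __init__(self):
        self._by_unit = {}

    def find(self, unit):
        # Lookup only, in the spirit of a find-only helper.
        return self._by_unit.get(unit)

    def find_or_alloc(self, unit):
        # Old behaviour: create a fresh mddev when the lookup misses.
        mddev = self.find(unit)
        if mddev is None:
            mddev = Mddev(unit)
            self._by_unit[unit] = mddev
        return mddev

    def put(self, unit):
        # Drop the entry from the list again.
        self._by_unit.pop(unit, None)


def md_open(mddevs, unit, bdev_disk, racing_errno):
    """Return 0 on success or a negative errno value."""
    mddev = mddevs.find_or_alloc(unit)
    if mddev.gendisk is not bdev_disk:
        # The mddev we got does not belong to the disk being opened.
        mddevs.put(unit)
        return -racing_errno
    return 0


def open_with_retry(mddevs, unit, bdev_disk, racing_errno, max_spins=1000):
    """Retry the open while it reports -ERESTARTSYS, up to `max_spins` times."""
    spins = 0
    while spins < max_spins:
        spins += 1
        ret = md_open(mddevs, unit, bdev_disk, racing_errno)
        if ret != -ERESTARTSYS:
            return ret, spins
    return -ERESTARTSYS, spins
```

With `racing_errno=ERESTARTSYS` the model spins until `max_spins` is exhausted, since nothing else gets a chance to change the state; with `errno.EBUSY` it gives up after the first attempt.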

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>

25 Mar, 2021 6 commits

md: Fix missing unused status line of /proc/mdstat · 7abfabaf

Authored by Jan Glauber on Mar 17, 2021

Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.

So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
like: unused devices: <none>

Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.
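
A user-space check equivalent to the dd command above could look like the following sketch. The helper names (`read_in_chunks`, `has_unused_line`, `check_mdstat`) are made up for illustration; only the /proc/mdstat path and the 64-byte read size come from the description.

```python
import io


def read_in_chunks(stream, bufsize=64):
    """Read a binary stream to EOF using reads of at most `bufsize` bytes."""
    chunks = []
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def has_unused_line(mdstat_bytes):
    """True if some line of the output starts with 'unused devices:'."""
    return any(line.startswith(b"unused devices:")
               for line in mdstat_bytes.splitlines())


def check_mdstat(path="/proc/mdstat", bufsize=64):
    # buffering=0 turns each read() into its own small read on the file.
    with open(path, "rb", buffering=0) as f:
        return has_unused_line(read_in_chunks(f, bufsize))


# Self-contained demonstration on an in-memory sample instead of /proc/mdstat.
sample = io.BytesIO(b"Personalities : \nunused devices: <none>\n")
print(has_unused_line(read_in_chunks(sample, bufsize=8)))
```

On a kernel with the bug, `check_mdstat()` would be expected to return False for small buffer sizes even though a single large read shows the line.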

Cc: stable@vger.kernel.org
Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

md/raid10: improve discard request for far layout · 254c271d

Authored by Xiao Ni on Feb 04, 2021

For the far layout, the discard region is not contiguous on the disks, so
it needs as many r10bios as there are far copies to cover all regions. It
also needs a way to know whether all r10bios have finished. Similar to
raid10_sync_request, only the first r10bio's master_bio records the discard
bio; the master_bio of every other r10bio records the first r10bio. The
first r10bio can finish after the other r10bios finish and then return the
discard bio.

Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

md/raid10: improve raid10 discard request · d30588b2

Authored by Xiao Ni on Feb 04, 2021

Currently the discard request is split by chunk size, so it takes a long
time to finish mkfs on disks which support the discard function. This patch
improves the handling of raid10 discard requests. It uses an approach
similar to patch 29efc390 (md/md0: optimize raid0 discard handling).

But it is a little more complex than raid0, because raid10 has different
layouts. If raid10 uses the offset layout and the discard request is smaller
than the stripe size, there are some holes when we submit the discard bio to
the underlying disks.

For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is
a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
D03 is a chunk, raid10_write_request can handle one chunk perfectly. So
the part that is not aligned with stripe size is still handled by
raid10_write_request.
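
The chunk table in the example can be reproduced with a few lines of Python. This is only a sketch of the layout as drawn above: it assumes two copies, that each stripe is followed by the same stripe rotated right by one disk, and that the chunk count is a multiple of the disk count. `offset_layout` is a name made up for this illustration.

```python
def offset_layout(num_disks, num_chunks):
    """Rows of chunk labels for a two-copy offset layout, one column per disk.

    Each stripe of `num_disks` chunks is followed by a copy of the same
    stripe rotated right by one disk.
    """
    rows = []
    for first in range(1, num_chunks + 1, num_disks):
        stripe = [f"D{n:02d}" for n in range(first, first + num_disks)]
        rows.append(stripe)
        rows.append(stripe[-1:] + stripe[:-1])  # rotated copy
    return rows


for row in offset_layout(num_disks=5, num_chunks=10):
    print(" ".join(row))
```

Reading one column of the printed table top to bottom gives the on-disk order of chunks for that disk, which is where the holes described above show up.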

If a reshape is running when the discard bio arrives and the discard bio
spans the reshape position, raid10_write_request is responsible for handling
this discard bio.

I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real    4m39.775s
user    0m0.000s
sys     0m0.298s

With patch:
time mkfs.xfs /dev/md0
real    0m0.105s
user    0m0.000s
sys     0m0.007s

nvme3n1           259:1    0   477G  0 disk
└─nvme3n1p1       259:10   0    50G  0 part
nvme4n1           259:2    0   477G  0 disk
└─nvme4n1p1       259:11   0    50G  0 part
nvme5n1           259:6    0   477G  0 disk
└─nvme5n1p1       259:12   0    50G  0 part
nvme2n1           259:9    0   477G  0 disk
└─nvme2n1p1       259:15   0    50G  0 part
nvme0n1           259:13   0   477G  0 disk
└─nvme0n1p1       259:14   0    50G  0 part

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

md/raid10: pull the code that wait for blocked dev into one function · f2e7e269

Authored by Xiao Ni on Feb 04, 2021

The following patch will reuse this logic, so pull the duplicated code into
one function.

Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

md/raid10: extend r10bio devs to raid disks · c2968285

Authored by Xiao Ni on Feb 04, 2021

Currently r10bio->devs[conf->copies] is allocated. A discard bio needs to be
submitted to all member disks, and it needs to use an r10bio, so extend the
allocation to r10bio->devs[geo.raid_disks].

Reviewed-by: Coly Li <colyli@suse.de>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

md: add md_submit_discard_bio() for submitting discard bio · cf78408f

Authored by Xiao Ni on Feb 04, 2021

Move this logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

11 Mar, 2021 1 commit

block: rename BIO_MAX_PAGES to BIO_MAX_VECS · a8affc03

Authored by Christoph Hellwig on Mar 11, 2021

Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 Mar, 2021 2 commits

dm verity: fix FEC for RS roots unaligned to block size · df7b59ba

Authored by Milan Broz on Feb 23, 2021

Optional Forward Error Correction (FEC) code in dm-verity uses
Reed-Solomon code and should support roots from 2 to 24.

The error correction parity bytes (of roots lengths per RS block) are
stored on a separate device in sequence without any padding.

Currently, to access FEC device, the dm-verity-fec code uses dm-bufio
client with block size set to verity data block (usually 4096 or 512
bytes).

Because this block size is not divisible by some (most!) of the roots
supported lengths, data repair cannot work for partially stored parity
bytes.

This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
where we can be sure that the full parity data is always available.
(There cannot be partial FEC blocks because parity must cover whole
sectors.)
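
The divisibility argument can be checked with a short calculation. The sketch below only restates the arithmetic from the description (roots from 2 to 24, a 4096-byte data block, and a block size of roots << SECTOR_SHIFT); the function names are invented for the illustration.

```python
SECTOR_SHIFT = 9  # 512-byte sectors


def divides_block(roots, block_size):
    """True if a block holds a whole number of `roots`-byte parity groups."""
    return block_size % roots == 0


def fec_bufio_block_size(roots):
    """Block size chosen by the fix: roots << SECTOR_SHIFT bytes."""
    return roots << SECTOR_SHIFT


all_roots = range(2, 25)  # supported roots: 2 to 24
aligned_4096 = [r for r in all_roots if divides_block(r, 4096)]
aligned_fixed = [r for r in all_roots
                 if divides_block(r, fec_bufio_block_size(r))]
print(aligned_4096)
print(len(aligned_fixed) == len(all_roots))
```

Only the power-of-two roots divide a 4096-byte block evenly, while a block of roots << SECTOR_SHIFT bytes is a multiple of roots by construction.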

Because the optional FEC starting offset could be unaligned to this
new block size, we have to use dm_bufio_set_sector_offset() to
configure it.

The problem is easily reproduced using veritysetup, e.g. for roots=13:

  # create verity device with RS FEC
  dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
  veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash

  # create an erasure that should be always repairable with this roots setting
  dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none

  # try to read it through dm-verity
  veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
  dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
  # wait for possible recursive recovery in kernel
  udevadm settle
  veritysetup close test

With this fix, errors are properly repaired.
  device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
  ...

Without it, FEC code usually ends on unrecoverable failure in RS decoder:
  device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
  ...

This problem is present in all kernels since the FEC code's
introduction (kernel 4.5).

It is thought that this problem is not visible in Android ecosystem
because it always uses a default RS roots=2.

Depends-on: a14e5ec6 ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size · a14e5ec6

Authored by Mikulas Patocka on Feb 23, 2021

dm_bufio_get_device_size returns the device size in blocks. Before
returning the value, we must subtract the number of starting
sectors. The number of starting sectors may not be divisible by the
block size.

Note that currently, no target is using dm_bufio_set_sector_offset and
dm_bufio_get_device_size simultaneously, so this change has no effect.
However, an upcoming dm-verity-fec fix needs this change.
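
The order of operations matters here: the starting sectors are subtracted first and only then is the remainder converted to blocks. A minimal Python sketch of that calculation follows; it is not the kernel implementation, and `device_size_in_blocks` and its parameters are names chosen for the illustration.

```python
SECTOR_SHIFT = 9  # 512-byte sectors


def device_size_in_blocks(device_bytes, start_sector, block_size):
    """Number of whole blocks available after skipping `start_sector` sectors."""
    sectors = device_bytes >> SECTOR_SHIFT
    # Subtract the initial offset first; clamp at zero for a too-small device.
    sectors = max(sectors - start_sector, 0)
    sectors_per_block = block_size >> SECTOR_SHIFT
    # Integer division drops a trailing partial block.
    return sectors // sectors_per_block
```

For a 1 MiB device with 4096-byte blocks, a start offset of 3 sectors leaves 2045 sectors, which is 255 whole blocks rather than 256.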

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

27 Feb, 2021 1 commit

block: Add bio_max_segs · 5f7136db

Authored by Matthew Wilcox (Oracle) on Jan 29, 2021

It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same.  Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Feb, 2021 12 commits

dm: fix deadlock when swapping to encrypted device · a666e5c0

Authored by Mikulas Patocka on Feb 10, 2021

The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.

This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target sets
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.

Non-swap bios are not affected because taking the semaphore would cause
performance degradation.

This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
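
The limiting scheme can be sketched in user space with a counting semaphore. The Python model below is an analogy only: `SwapBioLimiter` and its methods are invented for the illustration, and the `is_swap` argument stands in for a bio carrying REQ_SWAP.

```python
import threading


class SwapBioLimiter:
    """Caps in-flight swap bios; other bios bypass the semaphore entirely."""

    def __init__(self, limit, limit_swap_bios=True):
        self.limit_swap_bios = limit_swap_bios
        self._sem = threading.BoundedSemaphore(limit)

    def _limited(self, is_swap):
        # Only swap bios on a target that opted in are subject to the limit.
        return self.limit_swap_bios and is_swap

    def submit(self, is_swap, blocking=True):
        """Take a slot for a limited bio.

        Returns False only when called non-blocking and no slot is free.
        """
        if not self._limited(is_swap):
            return True
        return self._sem.acquire(blocking=blocking)

    def complete(self, is_swap):
        """Release the slot when a limited bio finishes."""
        if self._limited(is_swap):
            self._sem.release()
```

With `blocking=True` a submitter waits once the limit is reached, which bounds the memory held for in-flight swap writes; non-swap submissions never touch the semaphore.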

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED · e3290b94

Authored by Mike Snitzer on Feb 10, 2021

Allow removal of CONFIG_BLK_DEV_ZONED conditionals in target_type
definition of various targets.

Suggested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: set DM_TARGET_PASSES_CRYPTO feature for some targets · 3db564b4

Authored by Satya Tangirala on Feb 01, 2021

dm-linear and dm-flakey obviously can pass through inline crypto support.

Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: support key eviction from keyslot managers of underlying devices · 9355a9eb

Authored by Satya Tangirala on Feb 01, 2021

Now that device mapper supports inline encryption, add the ability to
evict keys from all underlying devices. When an upper layer requests
a key eviction, we simply iterate through all underlying devices
and evict that key from each device.

Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: add support for passing through inline crypto support · aa6ce87a

Authored by Satya Tangirala on Feb 01, 2021

Update the device-mapper core to support exposing the inline crypto
support of the underlying device(s) through the device-mapper device.

This works by creating a "passthrough keyslot manager" for the dm
device, which declares support for encryption settings which all
underlying devices support.  When a supported setting is used, the bio
cloning code handles cloning the crypto context to the bios for all the
underlying devices.  When an unsupported setting is used, the blk-crypto
fallback is used as usual.

Crypto support on each underlying device is ignored unless the
corresponding dm target opts into exposing it.  This is needed because
for inline crypto to semantically operate on the original bio, the data
must not be transformed by the dm target.  Thus, targets like dm-linear
can expose crypto support of the underlying device, but targets like
dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)

A DM device's table can only be changed if the "new" inline encryption
capabilities are a (*not* necessarily strict) superset of the "old" inline
encryption capabilities.  Attempts to make changes to the table that result
in some inline encryption capability becoming no longer supported will be
rejected.
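
The two rules above (declare only what every underlying device supports, and never let a table change drop a capability) map naturally onto set operations. The following Python sketch uses plain sets of setting names; the helper names and the example setting labels are assumptions for illustration, not dm-core identifiers.

```python
def passthrough_capabilities(per_device_caps):
    """Settings the dm device can declare: those every underlying device supports."""
    caps = [frozenset(c) for c in per_device_caps]
    if not caps:
        return frozenset()
    return frozenset.intersection(*caps)


def table_change_allowed(old_caps, new_caps):
    """A new table is acceptable only if it keeps every old capability.

    The superset does not have to be strict, so equal sets are allowed.
    """
    return frozenset(new_caps) >= frozenset(old_caps)
```

For example, devices supporting {"mode_a", "mode_b"} and {"mode_a"} yield {"mode_a"} for the dm device; a later table offering {"mode_a", "mode_b"} would be accepted, while one offering nothing would be rejected.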

For the sake of clarity, key eviction from underlying devices will be
handled in a future patch.

Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: only resize metadata in preresume · cca2c6ae

Authored by Nikos Tsironis on Feb 11, 2021

Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.

Also, perform the resize directly in preresume, instead of using the
worker to do it.

The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.

This could lead to errors, like:

  device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
  device-mapper: era: process_old_eras: digest step failed, stopping digestion

The reason for the above error is that the worker started the digestion
of the archived writeset using the old, larger size.

As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Use correct value size in equality function of writeset tree · 64f2d15a

Authored by Nikos Tsironis on Jan 22, 2021

Fix the writeset tree equality test function to use the right value size
when comparing two btree values.

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Fix bitset memory leaks · 904e6b26

Authored by Nikos Tsironis on Jan 22, 2021

Deallocate the memory allocated for the in-core bitsets when destroying
the target and in error paths.

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Verify the data block size hasn't changed · c8e846ff

Authored by Nikos Tsironis on Jan 22, 2021

dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Reinitialize bitset cache before digesting a new writeset · 25249333

Authored by Nikos Tsironis on Jan 22, 2021

In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.

The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.

This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.

But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.

For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.

But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.

As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.

Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.
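
The stale-cache effect can be reproduced with a toy model of a bitset reader that caches one 64-bit word. This Python sketch is a simplification written for illustration: `CachedBitset` and `digest` are invented names, and each writeset is represented as a plain list of 64-bit words rather than an on-disk structure.

```python
WORD_BITS = 64


class CachedBitset:
    """Toy reader that keeps one 64-bit word cached between lookups."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Invalidate the cache, as reinitializing the structure would.
        self.cached_index = None
        self.cached_word = 0

    def test_bit(self, words, bit):
        index = bit // WORD_BITS
        if index != self.cached_index:  # reload only when the word index changes
            self.cached_index = index
            self.cached_word = words[index]
        return bool((self.cached_word >> (bit % WORD_BITS)) & 1)


def digest(writesets, nr_blocks, reinit_per_writeset):
    """Return, per writeset, the set of blocks seen as written."""
    reader = CachedBitset()
    result = []
    for words in writesets:
        if reinit_per_writeset:
            reader.reset()
        result.append({b for b in range(nr_blocks)
                       if reader.test_bit(words, b)})
    return result
```

With at most 64 blocks every lookup hits word index 0, so without the per-writeset reset the second writeset is read through the first one's cached word. With more than 64 blocks the scan ends on a different word index, so the next writeset forces a reload and the model returns the right answer even without the reset.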

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Update in-core bitset after committing the metadata · 2099b145

Authored by Nikos Tsironis on Jan 22, 2021

In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.

Consider the following sequence of events:

1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
   that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
   updated and committed.
4. The worker thread processes the deferred write
   (process_deferred_bios()) and marks the block as written in the
   in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
   as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
   device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
   (5) finishes.

When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.

The issue is that era_map() decides whether or not to defer a write
based on uncommitted information. The root cause of the bug is that we
update the in-core bitmap **before** committing the metadata.

Fix this by updating the in-core bitmap **after** successfully
committing the metadata.
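
The sequence of events can be replayed with a small state model. The Python sketch below is illustrative only: `EraModel` and `write_lost_on_crash` are invented names, the metadata is reduced to three sets, and the crash is modelled simply by never calling `commit()`.

```python
class EraModel:
    """Toy model of the in-core bitmap versus the committed metadata."""

    def __init__(self, mark_before_commit):
        self.mark_before_commit = mark_before_commit
        self.in_core = set()    # what era_map() consults
        self.pending = set()    # blocks waiting for a metadata commit
        self.committed = set()  # what survives a crash

    def era_map(self, block):
        """Pass the write through if the block is marked, else defer it."""
        return "passthrough" if block in self.in_core else "deferred"

    def process_deferred(self, block):
        self.pending.add(block)
        if self.mark_before_commit:      # buggy ordering
            self.in_core.add(block)

    def commit(self):
        self.committed |= self.pending
        if not self.mark_before_commit:  # fixed ordering
            self.in_core |= self.pending
        self.pending.clear()


def write_lost_on_crash(mark_before_commit, block=7):
    """Replay the steps above: True if a completed write is missing after the crash."""
    era = EraModel(mark_before_commit)
    first = era.era_map(block)     # steps 1-3: the first write is deferred
    era.process_deferred(block)    # step 4
    second = era.era_map(block)    # steps 6-7, while the commit is still running
    # Step 9: crash before commit() finishes, so `committed` stays empty.
    return (first == "deferred" and second == "passthrough"
            and block not in era.committed)
```

With the buggy ordering the second write is passed through and completes although nothing was committed; with the fixed ordering it is deferred, so no completed write can be missing from the metadata.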

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm era: Recover committed writeset after crash · de89afc1

Authored by Nikos Tsironis on Jan 22, 2021

Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.

dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.

This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.

There are three issues that cause the committed writeset to get lost:

1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
   committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
   the current era might not have been archived, due to a system crash.

To fix this:

1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
   the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
   before starting a new era.

Fixes: eec40579 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

10 Feb, 2021 8 commits

bcache: Avoid comma separated statements · 6751c1e3

Authored by Joe Perches on Feb 10, 2021

Use semicolons and braces.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Move journal work to new flush wq · afe78ab4

Authored by Kai Krakow on Feb 10, 2021

This is potentially long running and not latency sensitive, let's get
it out of the way of other latency sensitive events.

As observed in the previous commit, the `system_wq` easily becomes
congested by bcache, and this fixes a few more stalls I was observing
every once in a while.

Let's not make this `WQ_MEM_RECLAIM`, as it was shown to reduce the
performance of boot and file system operations in my tests. Also, without
`WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches the
previous behavior, as `system_wq` also does no memory reclaim:

> // workqueue.c:
> system_wq = alloc_workqueue("events", 0, 0);

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Give btree_io_wq correct semantics again · d797bd98

Authored by Kai Krakow on Feb 10, 2021

Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After
killing it, it no longer had this property but `system_wq` is not
single threaded.

Let's combine both worlds and make it multi threaded but able to
reclaim memory.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Revert "bcache: Kill btree_io_wq" · 9f233ffe

Authored by Kai Krakow on Feb 10, 2021

This reverts commit 56b30770.

With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.

After some more investigation, it looks like the original assumption
of 56b30770 no longer is true, and bcache has a very high potential of
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and input events may be
delayed.

So let's revert this. It's important to note that the semantics of
using `system_wq` previously mean that `btree_io_wq` should be created
before and destroyed after other bcache wqs to keep the same
assumptions.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Fix register_device_aync typo · d7fae7b4

Authored by Kai Krakow on Feb 10, 2021

Should be `register_device_async`.

Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: consider the fragmentation when update the writeback rate · 71dda2a5

Authored by dongdong tao on Feb 10, 2021

The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when fragmentation is not high,
but it gives an unreasonably small rate when very few dirty sectors
consume a lot of dirty buckets. In some cases, the dirty buckets can
reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
reached writeback_percent; the writeback rate will still be the minimum
value (4k), so all writes get stuck in a non-writeback mode because of
the slow writeback.

We accelerate the rate in 3 stages with different aggressiveness. The
first stage starts when the dirty buckets percentage goes above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to write back the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
the second stage tries to write back the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, and the third
stage tries to write back the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) millisecond.

The initial rate at each stage can be controlled by 3 configurable
parameters, writeback_rate_fp_term_{low|mid|high}, which default to
1, 10, 1000. The IO throughput these values are trying to achieve is
described in the paragraph above. I chose these defaults based on
testing and production data; some details follow:

A. When it comes to the low stage, we are still a bit far from the 70
threshold, so we only want to give it a little push by setting the
term to 1. This means the initial rate will be 170 if the fragment is 6;
it is calculated as bucket_size/fragment. This rate is very small,
but still much more reasonable than the minimum 8.
For a production bcache with a light workload, if the cache device
is bigger than 1 TB, it may take hours to consume 1% of the buckets,
so it is very possible to reclaim enough dirty buckets in this stage
and thus avoid entering the next stage.

B. If the dirty buckets ratio didn't turn around during the first stage,
it comes to the mid stage. The mid stage needs to be more aggressive
than the low stage, so I chose an initial rate 10 times that of the
low stage, which means 1700 as the initial rate if the fragment is 6.
This is a normal rate we usually see for a normal workload when
writeback happens because of writeback_percent.

C. If the dirty buckets ratio didn't turn around during the low and mid
stages, it comes to the third stage, which is the last chance to turn
around and avoid the horrible cutoff writeback sync issue. Here we
choose 100 times more aggressive than the mid stage, which means
170000 as the initial rate if the fragment is 6. This is also
inferred from a production bcache: I've got one week's writeback rate
data from a production bcache which has quite heavy workloads.
Again, the writeback is triggered by the writeback percent, and
the highest rate area is around 100000 to 240000, so I believe this
kind of aggressiveness at this stage is reasonable for production.
It should be mostly enough because the hint is trying to reclaim
1000 buckets per second, and that heavy production environment
consumes 50 buckets per second on average in one week's data.
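
The quoted initial rates follow from the thresholds and default terms above. The Python sketch below reproduces that arithmetic; it is not the bcache code. The function name is invented, and the bucket size of 1024 sectors is an assumption chosen because bucket_size/fragment then gives the 170 quoted for a fragment of 6.

```python
THRESHOLD_LOW, THRESHOLD_MID, THRESHOLD_HIGH = 50, 57, 64
FP_TERM_LOW, FP_TERM_MID, FP_TERM_HIGH = 1, 10, 1000  # defaults from the text


def fragment_rate(dirty_buckets_percent, fragment, bucket_size=1024):
    """Rate hint in sectors for a given dirty-bucket percentage.

    `bucket_size` is in sectors; 1024 is an assumed value (see the lead-in).
    """
    if dirty_buckets_percent <= THRESHOLD_LOW:
        return 0  # below the first stage: no fragmentation boost
    if dirty_buckets_percent <= THRESHOLD_MID:
        term = FP_TERM_LOW * (dirty_buckets_percent - THRESHOLD_LOW)
    elif dirty_buckets_percent <= THRESHOLD_HIGH:
        term = FP_TERM_MID * (dirty_buckets_percent - THRESHOLD_MID)
    else:
        term = FP_TERM_HIGH * (dirty_buckets_percent - THRESHOLD_HIGH)
    # Dirty data per bucket (bucket_size / fragment) scaled by the stage term.
    return (bucket_size // fragment) * term
```

One percentage point into each stage, with a fragment of 6, this yields 170, 1700 and 170000, matching the initial rates described in A, B and C.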

The writeback_consider_fragment option controls whether this feature is
on or off; it is on by default.

Lastly, below is the performance data for all the testing results,
including the data from the production env:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing

Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dm writecache: use bdev_nr_sectors() instead of open-coded equivalent · d9928ac5

Committed by Mike Snitzer on Feb 09, 2021

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d9928ac5

dm writecache: fix writing beyond end of underlying device when shrinking · 4134455f

Committed by Mikulas Patocka on Feb 09, 2021

Do not attempt to write any data beyond the end of the underlying data
device while shrinking it.

The DM writecache device must be suspended when the underlying data
device is shrunk.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4134455f

09 Feb, 2021 5 commits

dm table: remove needless request_queue NULL pointer checks · cccb493c

Committed by Jeffle Xu on Feb 08, 2021

Since commit ff9ea323 ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

cccb493c

dm table: fix zoned iterate_devices based device capability checks · 24f6b603

Committed by Jeffle Xu on Feb 08, 2021

Fix dm_table_supports_zoned_model() and invert the logic of both
iterate_devices_callout_fn callbacks so that all devices' zoned
capabilities are properly checked.

Add one more parameter to dm_table_any_dev_attr(), which is actually
used as the @data parameter of iterate_devices_callout_fn, so that
dm_table_matches_zone_sectors() can be replaced by
dm_table_any_dev_attr().

Fixes: dd88d313 ("dm table: add zoned block devices validation")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

24f6b603

dm table: fix DAX iterate_devices based device capability checks · 5b0fab50

Committed by Jeffle Xu on Feb 08, 2021

Fix dm_table_supports_dax() and invert the logic of both
iterate_devices_callout_fn callbacks so that all devices' DAX
capabilities are properly checked.

Fixes: 545ed20e ("dm: add infrastructure for DAX support")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

5b0fab50

dm table: fix iterate_devices based device capability checks · a4c8dd9c

Committed by Jeffle Xu on Feb 02, 2021

According to the definition of dm_iterate_devices_fn:
 * This function must iterate through each section of device used by the
 * target until it encounters a non-zero return code, which it then returns.
 * Returns zero if no callout returned non-zero.

For some target types (e.g. dm-stripe), one call of iterate_devices() may
iterate over multiple underlying devices internally, in which case a non-zero
return code from iterate_devices_callout_fn will stop the iteration
early. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.

Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.

Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.
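
The "any device" semantics described above can be modeled outside the
kernel. The sketch below is a simplified userspace model, not the dm-table
code: struct model_dev, struct model_target, model_iterate_devices(),
model_table_any_dev_attr() and model_table_is_nonrot() are hypothetical
stand-ins. Only the early-stop contract and the inverted
device_is_rotational() predicate are taken from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one underlying device. */
struct model_dev {
	bool rotational;
};

/* Hypothetical stand-in for a target that may span several devices. */
struct model_target {
	const struct model_dev *devs;
	size_t nr_devs;
};

typedef int (*model_callout_fn)(const struct model_dev *dev, void *data);

/*
 * Follows the quoted contract: call fn on each device until one call
 * returns non-zero, then return that value. Returns zero otherwise.
 */
static int model_iterate_devices(const struct model_target *ti,
				 model_callout_fn fn, void *data)
{
	for (size_t i = 0; i < ti->nr_devs; i++) {
		int ret = fn(&ti->devs[i], data);

		if (ret)
			return ret;
	}
	return 0;
}

/* True if fn returns non-zero for any device of any target. */
static bool model_table_any_dev_attr(const struct model_target *targets,
				     size_t nr_targets,
				     model_callout_fn fn, void *data)
{
	for (size_t i = 0; i < nr_targets; i++) {
		if (model_iterate_devices(&targets[i], fn, data))
			return true;
	}
	return false;
}

/* Inverted predicate: non-zero means "stop, this device is rotational". */
static int device_is_rotational(const struct model_dev *dev, void *data)
{
	(void)data;
	return dev->rotational;
}

/* The table is non-rotational only if no device is rotational. */
static bool model_table_is_nonrot(const struct model_target *targets,
				  size_t nr_targets)
{
	return !model_table_any_dev_attr(targets, nr_targets,
					 device_is_rotational, NULL);
}
```

In this model, a predicate that returned non-zero for a non-rotational
device would stop at the first such device in a multi-device target and
never look at a rotational device behind it. Asking "is any device
rotational?" and negating the answer avoids that.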

Fixes: c3c4555e ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c966 ("dm table: propagate non rotational flag")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

a4c8dd9c

dm writecache: return the exact table values that were set · 054bee16

Committed by Mikulas Patocka on Feb 04, 2021

LVM doesn't like it when the target returns different values from what
was set in the constructor. Fix dm-writecache so that the returned
table values are exactly the same as requested values.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

054bee16

08 Feb, 2021 1 commit

md/raid10: remove dead code in reshape_request · 72b04365

Committed by Christoph Hellwig on Feb 02, 2021

A bio allocated by bio_alloc_bioset comes pre-zeroed, no need to
clear random fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

72b04365

04 Feb, 2021 1 commit

md/raid5: cast chunk_sectors to sector_t value · c5eec74f

Committed by Guoqing Jiang on Dec 16, 2020

Currently, raid5 calculates dev_sectors from chunk_sectors without
proper cast, which is problematic.
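
The kind of problem a missing cast can cause is sketched below. This is a
userspace illustration, not the raid5 code: it assumes an unsigned 32-bit
chunk size and a 64-bit sector count, and round_down_no_cast() and
round_down_with_cast() are hypothetical helpers. The field types in the
kernel may differ.

```c
#include <stdint.h>

/*
 * Round dev_sectors down to a multiple of chunk_sectors (a power of two)
 * without widening first. The mask is computed in 32 bits and then
 * zero-extended, so the upper 32 bits of dev_sectors are cleared as well.
 */
static uint64_t round_down_no_cast(uint64_t dev_sectors,
				   uint32_t chunk_sectors)
{
	return dev_sectors & ~(chunk_sectors - 1);
}

/*
 * Same rounding with the chunk size widened to 64 bits before the mask
 * is built, so only the low bits are cleared.
 */
static uint64_t round_down_with_cast(uint64_t dev_sectors,
				     uint32_t chunk_sectors)
{
	return dev_sectors & ~((uint64_t)chunk_sectors - 1);
}
```

For a device of 0x100000405 sectors and a chunk of 1024 sectors, the first
helper returns 0x400 while the second returns 0x100000400.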

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

c5eec74f

03 Feb, 2021 1 commit

dm crypt: support using trusted keys · 363880c4

Committed by Ahmad Fatoum on Jan 22, 2021

Commit 27f5411a ("dm crypt: support using encrypted keys") extended
dm-crypt to allow use of "encrypted" keys along with "user" and "logon".

Along the same lines, teach dm-crypt to support "trusted" keys as well.

Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

363880c4

openeuler / Kernel synced successfully more than 1 year ago
