fs/xfs/xfs_log.c · 3e5da0375f9e311ed2338bd035cab0a930377115 · openeuler / Kernel

xfs: limit iclog tail updates · 3e5da037

Committed by Dave Chinner on April 26, 2023

mainline inclusion
from mainline-v5.14-rc1
commit 9d110014
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d110014205cb1129fa570d8de83d486fa199354

--------------------------------

From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:

iclog:
        S1                      C1
        +-----------------------+-----------------------+
                                 S2                     EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1:      Cache flush, log tail = X
At S1:          Metadata stable, write start record and checkpoint
At C1:          Write commit record, set NEED_FUA
                Single iclog checkpoint, so no need for NEED_FLUSH
                Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:      Cache flush, log tail = X
At S2:          Metadata stable, write start record and checkpoint
After S2:       Log tail moves to X+1
At EOIC:        End of iclog, more journal data to write
                Releases iclog
                Not a commit iclog, so no need for NEED_FLUSH
                Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. This is sufficient
to violate on-disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and
reading the code would not make it obvious what underlying problem
it is addressing.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
first commit record in the log expects it to be.
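
As a rough illustration only, the difference between "tail at last
release" and "tail at first commit record" can be sketched with a toy
model of the timeline above. This is not the kernel code: the struct,
enum and function names are hypothetical, an unset tail is modelled
as 0, and an iclog that never receives a commit record is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an iclog; every name in this sketch is hypothetical. */
struct toy_iclog {
	uint64_t header_tail;	/* tail LSN that would go to disk, 0 = unset */
	bool	 need_fua;	/* set once a commit record is in the iclog */
};

enum tail_policy {
	TAIL_AT_LAST_RELEASE,	/* behaviour described in the timeline */
	TAIL_AT_FIRST_COMMIT,	/* the second option */
};

/* Model one release of the iclog while the log tail is cur_tail. */
static void toy_release(struct toy_iclog *ic, uint64_t cur_tail,
			bool commit_record, enum tail_policy policy)
{
	assert(cur_tail != 0);	/* 0 is reserved for "not yet recorded" */

	if (commit_record)
		ic->need_fua = true;

	if (policy == TAIL_AT_LAST_RELEASE)
		ic->header_tail = cur_tail;	/* every release overwrites */
	else if (commit_record && ic->header_tail == 0)
		ic->header_tail = cur_tail;	/* only the first commit sets it */
}

/* Replay C1 then EOIC for one iclog; return the tail it would carry. */
static uint64_t toy_replay_timeline(enum tail_policy policy, uint64_t x)
{
	struct toy_iclog ic = { 0 };
	uint64_t tail = x;	/* cache flush before S1 covers tail == x */

	toy_release(&ic, tail, true, policy);	/* C1: commit record */
	tail = x + 1;				/* after S2 the tail moves */
	toy_release(&ic, tail, false, policy);	/* EOIC: not a commit iclog */
	return ic.header_tail;
}
```

With X = 100, the first policy leaves the iclog carrying tail 101
even though no cache flush covered the move from 100 to 101, while
the second keeps the tail at 100, which is what C1 was ordered
against.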

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
