1. 27 8月, 2021 3 次提交
  2. 21 8月, 2021 2 次提交
  3. 19 8月, 2021 1 次提交
    • L
      pipe: avoid unnecessary EPOLLET wakeups under normal loads · 3b844826
      Linus Torvalds 提交于
      I had forgotten just how sensitive hackbench is to extra pipe wakeups,
      and commit 3a34b13a ("pipe: make pipe writes always wake up
      readers") ended up causing a quite noticeable regression on larger
      machines.
      
      Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
      not clear that this matters in real life all that much, but as Mel
      points out, it's used often enough when comparing kernels and so the
      performance regression shows up like a sore thumb.
      
      It's easy enough to fix at least for the common cases where pipes are
      used purely for data transfer, and you never have any exciting poll
      usage at all.  So set a special 'poll_usage' flag when there is polling
      activity, and make the ugly "EPOLLET has crazy legacy expectations"
      semantics explicit to only that case.
      
      I would love to limit it to just the broken EPOLLET case, but the pipe
      code can't see the difference between epoll and regular select/poll, so
      any non-read/write waiting will trigger the extra wakeup behavior.  That
      is sufficient for at least the hackbench case.
      
      Apart from making the odd extra wakeup cases more explicitly about
      EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
      write, not at the first write chunk.  That is actually much saner
      semantics (as much as you can call any of the legacy edge-triggered
      expectations for EPOLLET "sane") since it means that you know the wakeup
      will happen once the write is done, rather than possibly in the middle
      of one.
      
      [ For stable people: I'm putting a "Fixes" tag on this, but I leave it
        up to you to decide whether you actually want to backport it or not.
        It likely has no impact outside of synthetic benchmarks  - Linus ]
      
      Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
      Fixes: 3a34b13a ("pipe: make pipe writes always wake up readers")
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Tested-by: NSandeep Patil <sspatil@android.com>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b844826
  4. 18 8月, 2021 1 次提交
  5. 16 8月, 2021 1 次提交
    • N
      btrfs: prevent rename2 from exchanging a subvol with a directory from different parents · 3f79f6f6
      NeilBrown 提交于
      Cross-rename lacks a check when that would prevent exchanging a
      directory and subvolume from different parent subvolume. This causes
      data inconsistencies and is caught before commit by tree-checker,
      turning the filesystem to read-only.
      
      Calling the renameat2 with RENAME_EXCHANGE flags like
      
        renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))
      
      on two paths:
      
        namesrc = dir1/subvol1/dir2
       namedest = subvol2/subvol3
      
      will cause key order problem with following write time tree-checker
      report:
      
        [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
        [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
        [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
        [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
        [1194842.338793]                inode generation 3 size 16 mode 40755
        [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
        [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
        [1194842.338817]                dir oid 258 type 2
        [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
        [1194842.338830]                dir oid 257 type 2
        [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
        [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
        [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
        [1194842.338863]                inode generation 6 size 8 mode 40755
        [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
        [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
        [1194842.338883]                dir oid 256 type 2
        [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
        [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
        [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
        [1194842.339245] ------------[ cut here ]------------
        [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
        [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
        [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
        [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
        [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
        [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
        [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
        [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
        [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
        [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
        [1194842.512103] Call Trace:
        [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
        [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
        [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
        [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
        [1194842.647482]  process_one_work+0x262/0x5e0
        [1194842.647520]  worker_thread+0x4c/0x320
        [1194842.655935]  ? process_one_work+0x5e0/0x5e0
        [1194842.655946]  kthread+0x135/0x160
        [1194842.655953]  ? set_kthread_struct+0x40/0x40
        [1194842.655965]  ret_from_fork+0x1f/0x30
        [1194842.672465] irq event stamp: 1729
        [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
        [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
        [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
        [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
      
      The corrupted data will not be written, and filesystem can be unmounted
      and mounted again (all changes since the last commit will be lost).
      
      Add the missing check for new_ino so that all non-subvolumes must reside
      under the same parent subvolume. There's an exception allowing to
      exchange two subvolumes from any parents as the directory representing a
      subvolume is only a logical link and does not have any other structures
      related to the parent subvolume, unlike files, directories etc, that
      are always in the inode namespace of the parent subvolume.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.7+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3f79f6f6
  6. 15 8月, 2021 1 次提交
    • J
      io_uring: only assign io_uring_enter() SQPOLL error in actual error case · 21f96522
      Jens Axboe 提交于
      If an SQPOLL based ring is newly created and an application issues an
      io_uring_enter(2) system call on it, then we can return a spurious
      -EOWNERDEAD error. This happens because there's nothing to submit, and
      if the caller doesn't specify any other action, the initial error
      assignment of -EOWNERDEAD never gets overwritten. This causes us to
      return it directly, even if it isn't valid.
      
      Move the error assignment into the actual failure case instead.
      
      Cc: stable@vger.kernel.org
      Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
      Reported-by: Sherlock Holo sherlockya@gmail.com
      Link: https://github.com/axboe/liburing/issues/413Signed-off-by: NJens Axboe <axboe@kernel.dk>
      21f96522
  7. 13 8月, 2021 2 次提交
  8. 10 8月, 2021 11 次提交
  9. 09 8月, 2021 3 次提交
  10. 07 8月, 2021 1 次提交
  11. 06 8月, 2021 4 次提交
  12. 05 8月, 2021 3 次提交
    • J
      io-wq: fix race between worker exiting and activating free worker · 83d6c393
      Jens Axboe 提交于
      Nadav correctly reports that we have a race between a worker exiting,
      and new work being queued. This can lead to work being queued behind
      an existing worker that could be sleeping on an event before it can
      run to completion, and hence introducing potential big latency gaps
      if we hit this race condition:
      
      cpu0					cpu1
      ----					----
      					io_wqe_worker()
      					schedule_timeout()
      					 // timed out
      io_wqe_enqueue()
      io_wqe_wake_worker()
      // work_flags & IO_WQ_WORK_CONCURRENT
      io_wqe_activate_free_worker()
      					 io_worker_exit()
      
      Fix this by having the exiting worker go through the normal decrement
      of a running worker, which will spawn a new one if needed.
      
      The free worker activation is modified to only return success if we
      were able to find a sleeping worker - if not, we keep looking through
      the list. If we fail, we create a new worker as per usual.
      
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/Reported-by: NNadav Amit <nadav.amit@gmail.com>
      Tested-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      83d6c393
    • J
      ceph: take snap_empty_lock atomically with snaprealm refcount change · 8434ffe7
      Jeff Layton 提交于
      There is a race in ceph_put_snap_realm. The change to the nref and the
      spinlock acquisition are not done atomically, so you could decrement
      nref, and before you take the spinlock, the nref is incremented again.
      At that point, you end up putting it on the empty list when it
      shouldn't be there. Eventually __cleanup_empty_realms runs and frees
      it when it's still in-use.
      
      Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
      and just drop the spinlock if we can get the rwsem.
      
      Because these objects can also undergo a 0->1 refcount transition, we
      must protect that change as well with the spinlock. Increment locklessly
      unless the value is at 0, in which case we take the spinlock, increment
      and then take it off the empty list if it did the 0->1 transition.
      
      With these changes, I'm removing the dout() messages from these
      functions, as well as in __put_snap_realm. They've always been racy, and
      it's better to not print values that may be misleading.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/46419Reported-by: NMark Nelson <mnelson@redhat.com>
      Signed-off-by: NJeff Layton <jlayton@kernel.org>
      Reviewed-by: NLuis Henriques <lhenriques@suse.de>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      8434ffe7
    • L
      ceph: reduce contention in ceph_check_delayed_caps() · bf2ba432
      Luis Henriques 提交于
      Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
      workqueue and it can be kept looping for quite some time if caps keep
      being added back to the mdsc->cap_delay_list.  This may result in the
      watchdog tainting the kernel with the softlockup flag.
      
      This patch breaks this loop if the caps have been recently (i.e. during
      the loop execution).  Any new caps added to the list will be handled in
      the next run.
      
      Also, allow schedule_delayed() callers to explicitly set the delay value
      instead of defaulting to 5s, so we can ensure that it runs soon
      afterward if it looks like there is more work.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/46284Signed-off-by: NLuis Henriques <lhenriques@suse.de>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      bf2ba432
  13. 31 7月, 2021 3 次提交
    • L
      pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds 提交于
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" be to "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michal Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/Reported-by: NSandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a34b13a
    • J
      ocfs2: issue zeroout to EOF blocks · 9449ad33
      Junxiao Bi 提交于
      For punch holes in EOF blocks, fallocate used buffer write to zero the
      EOF blocks in last cluster.  But since ->writepage will ignore EOF
      pages, those zeros will not be flushed.
      
      This "looks" ok as commit 6bba4471 ("ocfs2: fix data corruption by
      fallocate") will zero the EOF blocks when extend the file size, but it
      isn't.  The problem happened on those EOF pages, before writeback, those
      pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
      set, when writeback run by write_cache_pages(), DIRTY flag on the page
      was cleared, but DIRTY flag on the buffer_head not.
      
      When next write happened to those EOF pages, since buffer_head already
      had DIRTY flag set, it would not mark page DIRTY again.  That made
      writeback ignore them forever.  That will cause data corruption.  Even
      directio write can't work because it will fail when trying to drop pages
      caches before direct io, as it found the buffer_head for those pages
      still had DIRTY flag set, then it will fall back to buffer io mode.
      
      To make a summary of the issue, as writeback ingores EOF pages, once any
      EOF page is generated, any write to it will only go to the page cache,
      it will never be flushed to disk even file size extends and that page is
      not EOF page any more.  The fix is to avoid zero EOF blocks with buffer
      write.
      
      The following code snippet from qemu-img could trigger the corruption.
      
        656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
        ...
        660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
        660   fallocate(11, 0, 2275868672, 327680) = 0
        658   pwrite64(11, "
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9449ad33
    • J
      ocfs2: fix zero out valid data · f267aeb6
      Junxiao Bi 提交于
      If append-dio feature is enabled, direct-io write and fallocate could
      run in parallel to extend file size, fallocate used "orig_isize" to
      record i_size before taking "ip_alloc_sem", when
      ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
      extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
      out.
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
      Fixes: 6bba4471 ("ocfs2: fix data corruption by fallocate")
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f267aeb6
  14. 30 7月, 2021 4 次提交
    • D
      xfs: prevent spoofing of rtbitmap blocks when recovering buffers · 81a448d7
      Darrick J. Wong 提交于
      While reviewing the buffer item recovery code, the thought occurred to
      me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
      replaying older metadata updates against newer log items.  However, we
      use the magic number of the ondisk buffer to find the LSN of the ondisk
      metadata, which means that if an attacker can control the layout of the
      realtime device precisely enough that the start of an rt bitmap block
      matches the magic and UUID of some other kind of block, they can control
      the purported LSN of that spoofed block and thereby break log replay.
      
      Since realtime bitmap and summary blocks don't have headers at all, we
      have no way to tell if a block really should be replayed.  The best we
      can do is replay unconditionally and hope for the best.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      81a448d7
    • D
      xfs: limit iclog tail updates · 9d110014
      Dave Chinner 提交于
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set. There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log. THis is sufficient
      to violate on disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where the
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      9d110014
    • D
      xfs: need to see iclog flags in tracing · b2ae3a9e
      Dave Chinner 提交于
      Because I cannot tell if the NEED_FLUSH flag is being set correctly
      by the log force and CIL push machinery without it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      b2ae3a9e
    • D
      xfs: Enforce attr3 buffer recovery order · d8f4c2d0
      Dave Chinner 提交于
      From the department of "WTAF? How did we miss that!?"...
      
      When we are recovering a buffer, the first thing we do is check the
      buffer magic number and extract the LSN from the buffer. If the LSN
      is older than the current LSN, we replay the modification to it. If
      the metadata on disk is newer than the transaction in the log, we
      skip it. This is a fundamental v5 filesystem metadata recovery
      behaviour.
      
      generic/482 failed with an attribute writeback failure during log
      recovery. The write verifier caught the corruption before it got
      written to disk, and the attr buffer dump looked like:
      
      XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
      00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                        ^^^^^^^^^^^^^^^^^^^^^^^
      00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
      00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
      00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
      00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
      00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      .....
      
      The highlighted bytes are the LSN that was replayed into the
      buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
      that block on disk looks like this:
      
      $ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
      hdr.info.hdr.forw = 0
      hdr.info.hdr.back = 0
      hdr.info.hdr.magic = 0x3bee
      hdr.info.crc = 0xb5af0bc6 (correct)
      hdr.info.bno = 105448
      hdr.info.lsn = 0x100000900
                     ^^^^^^^^^^^
      hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
      hdr.info.owner = 131203
      hdr.count = 2
      hdr.usedbytes = 120
      hdr.firstused = 3796
      hdr.holes = 1
      hdr.freemap[0-2] = [base,size]
      
      Note the LSN stamped into the buffer on disk: 1/0x900. The version
      on disk is much newer than the log transaction that was being
      replayed. That's a bug, and should -never- happen.
      
      So I immediately went to look at xlog_recover_get_buf_lsn() to check
      that we handled the LSN correctly. I was wondering if there was a
      similar "two commits with the same start LSN skips the second
      replay" problem with buffers. I didn't get that far, because I found
      a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
      doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!
      
      IOWs, attr3 leaf buffers fall through the magic number checks
      unrecognised, so trigger the "recover immediately" behaviour instead
      of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
      buffers and that causes silent on disk corruption of inode attribute
      forks and potentially other things....
      
      Git history shows this is *another* zero day bug, this time
      introduced in commit 50d5c8d8 ("xfs: check LSN ordering for v5
      superblocks during recovery") which failed to handle the attr3 leaf
      buffers in recovery. And we've failed to handle them ever since...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      d8f4c2d0