1. 07 Aug 2021, 4 commits
  2. 31 Jul 2021, 3 commits
    • pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds authored
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" to be "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michael Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
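      
      As a rough illustration of the semantics involved (a hypothetical
      userspace demo, not part of the commit): with EPOLLET on a pipe that is
      never drained, only the empty->nonempty transition is a true edge, so
      the second epoll_wait() below relies on the extraneous wakeup this
      change restores.
      
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/epoll.h>
      
        int main(void)
        {
                int pfd[2];
                struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
                int epfd = epoll_create1(0);
      
                pipe(pfd);
                ev.data.fd = pfd[0];
                epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);
      
                write(pfd[1], "x", 1);            /* empty->nonempty: a real edge */
                epoll_wait(epfd, &ev, 1, 1000);   /* consume the first event */
      
                write(pfd[1], "x", 1);            /* pipe already nonempty: no edge */
                /* Strict edge semantics say this may time out; the broken
                 * library pattern depends on it returning an event. */
                printf("second event: %d\n", epoll_wait(epfd, &ev, 1, 1000));
                return 0;
        }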
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
      Reported-by: Sandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: issue zeroout to EOF blocks · 9449ad33
      Junxiao Bi authored
      When punching holes over EOF blocks, fallocate used buffer writes to
      zero the EOF blocks in the last cluster.  But since ->writepage will
      ignore EOF pages, those zeros will not be flushed.
      
      This "looks" ok as commit 6bba4471 ("ocfs2: fix data corruption by
      fallocate") will zero the EOF blocks when extend the file size, but it
      isn't.  The problem happened on those EOF pages, before writeback, those
      pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
      set, when writeback run by write_cache_pages(), DIRTY flag on the page
      was cleared, but DIRTY flag on the buffer_head not.
      
      When next write happened to those EOF pages, since buffer_head already
      had DIRTY flag set, it would not mark page DIRTY again.  That made
      writeback ignore them forever.  That will cause data corruption.  Even
      directio write can't work because it will fail when trying to drop pages
      caches before direct io, as it found the buffer_head for those pages
      still had DIRTY flag set, then it will fall back to buffer io mode.
      
      To summarize the issue: because writeback ignores EOF pages, once any
      EOF page is generated, any write to it only goes to the page cache and
      is never flushed to disk, even after the file size is extended and the
      page is no longer an EOF page.  The fix is to avoid zeroing EOF blocks
      with buffer writes.
      
      The following code snippet from qemu-img could trigger the corruption.
      
        656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
        ...
        660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
        660   fallocate(11, 0, 2275868672, 327680) = 0
        658   pwrite64(11, "
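      
      As a hedged sketch of the fix direction (an assumption; the actual
      ocfs2 helper and its arguments differ): zero the blocks directly at the
      block layer instead of dirtying EOF pages in the page cache.
      
        /* Sketch only: issue zeroes straight to disk so correctness never
         * depends on writeback picking up EOF pages. */
        ret = blkdev_issue_zeroout(bdev, start_sector, nr_sectors,
                                   GFP_NOFS, 0);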
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix zero out valid data · f267aeb6
      Junxiao Bi authored
      If the append-dio feature is enabled, direct-io writes and fallocate
      can run in parallel to extend the file size.  fallocate used
      "orig_isize" to record i_size before taking "ip_alloc_sem"; by the time
      ocfs2_zeroout_partial_cluster() zeroes out the EOF blocks, i_size may
      already have been extended by ocfs2_dio_end_io_write(), causing valid
      data to be zeroed out.
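      
      A minimal sketch of the fix shape (an assumption, simplified): sample
      i_size only after ip_alloc_sem is held, rather than before.
      
        down_write(&oi->ip_alloc_sem);
        /* Not a stale orig_isize sampled before the lock: a racing
         * ocfs2_dio_end_io_write() extension is now visible. */
        isize = i_size_read(inode);
        /* ... zero out only the range that is really beyond isize ... */
        up_write(&oi->ip_alloc_sem);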
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
      Fixes: 6bba4471 ("ocfs2: fix data corruption by fallocate")
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 Jul 2021, 12 commits
    • xfs: prevent spoofing of rtbitmap blocks when recovering buffers · 81a448d7
      Darrick J. Wong authored
      While reviewing the buffer item recovery code, the thought occurred to
      me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
      replaying older metadata updates against newer log items.  However, we
      use the magic number of the ondisk buffer to find the LSN of the ondisk
      metadata, which means that if an attacker can control the layout of the
      realtime device precisely enough that the start of an rt bitmap block
      matches the magic and UUID of some other kind of block, they can control
      the purported LSN of that spoofed block and thereby break log replay.
      
      Since realtime bitmap and summary blocks don't have headers at all, we
      have no way to tell if a block really should be replayed.  The best we
      can do is replay unconditionally and hope for the best.
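      
      The shape of the change is presumably along these lines (an assumption
      based on the description, not the verbatim patch):
      
        /* In xlog_recover_get_buf_lsn() (sketch): rt bitmap/summary blocks
         * have no self-describing header, so any magic/UUID/LSN found in
         * them could be attacker-controlled and must not be trusted. */
        if (blft == XFS_BLFT_RTBITMAP_BUF || blft == XFS_BLFT_RTSUMMARY_BUF)
                return NULLCOMMITLSN;   /* i.e. always replay */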
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: limit iclog tail updates · 9d110014
      Dave Chinner authored
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set. There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log. This is sufficient
      to violate on-disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...
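      
      A hedged fragment of what the second option implies (assumed form,
      simplified from the description):
      
        /* Only the first commit record written to this iclog stamps the
         * tail; later releases must not overwrite it with a newer tail. */
        if (!iclog->ic_header.h_tail_lsn)
                iclog->ic_header.h_tail_lsn =
                        cpu_to_be64(atomic64_read(&log->l_tail_lsn));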
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: need to see iclog flags in tracing · b2ae3a9e
      Dave Chinner authored
      Because I cannot tell if the NEED_FLUSH flag is being set correctly
      by the log force and CIL push machinery without it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Enforce attr3 buffer recovery order · d8f4c2d0
      Dave Chinner authored
      From the department of "WTAF? How did we miss that!?"...
      
      When we are recovering a buffer, the first thing we do is check the
      buffer magic number and extract the LSN from the buffer. If the LSN
      is older than the current LSN, we replay the modification to it. If
      the metadata on disk is newer than the transaction in the log, we
      skip it. This is a fundamental v5 filesystem metadata recovery
      behaviour.
      
      generic/482 failed with an attribute writeback failure during log
      recovery. The write verifier caught the corruption before it got
      written to disk, and the attr buffer dump looked like:
      
      XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
      00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                        ^^^^^^^^^^^^^^^^^^^^^^^
      00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
      00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
      00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
      00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
      00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      .....
      
      The highlighted bytes are the LSN that was replayed into the
      buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
      that block on disk looks like this:
      
      $ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
      hdr.info.hdr.forw = 0
      hdr.info.hdr.back = 0
      hdr.info.hdr.magic = 0x3bee
      hdr.info.crc = 0xb5af0bc6 (correct)
      hdr.info.bno = 105448
      hdr.info.lsn = 0x100000900
                     ^^^^^^^^^^^
      hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
      hdr.info.owner = 131203
      hdr.count = 2
      hdr.usedbytes = 120
      hdr.firstused = 3796
      hdr.holes = 1
      hdr.freemap[0-2] = [base,size]
      
      Note the LSN stamped into the buffer on disk: 1/0x900. The version
      on disk is much newer than the log transaction that was being
      replayed. That's a bug, and should -never- happen.
      
      So I immediately went to look at xlog_recover_get_buf_lsn() to check
      that we handled the LSN correctly. I was wondering if there was a
      similar "two commits with the same start LSN skips the second
      replay" problem with buffers. I didn't get that far, because I found
      a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
      doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!
      
      IOWs, attr3 leaf buffers fall through the magic number checks
      unrecognised, so trigger the "recover immediately" behaviour instead
      of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
      buffers and that causes silent on disk corruption of inode attribute
      forks and potentially other things....
      
      Git history shows this is *another* zero day bug, this time
      introduced in commit 50d5c8d8 ("xfs: check LSN ordering for v5
      superblocks during recovery") which failed to handle the attr3 leaf
      buffers in recovery. And we've failed to handle them ever since...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: logging the on disk inode LSN can make it go backwards · 32baa63d
      Dave Chinner authored
      When we log an inode, we format the "log inode" core and set an LSN
      in that inode core. We do that via xfs_inode_item_format_core(),
      which calls:
      
      	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
      
      to format the log inode. It writes the LSN from the inode item into
      the log inode, and if recovery decides the inode item needs to be
      replayed, it recovers the log inode LSN field and writes it into the
      on disk inode LSN field.
      
      Now this might seem like a reasonable thing to do, but it is wrong
      on multiple levels. Firstly, if the item is not yet in the AIL,
      item->li_lsn is zero. i.e. the first time the inode is logged and
      formatted, the LSN we write into the log inode will be zero. If we
      only log it once, recovery will run and can write this zero LSN into
      the inode.
      
      This means that the next time the inode is logged and log recovery
      runs, it will *always* replay changes to the inode regardless of
      whether the inode is newer on disk than the version in the log and
      that violates the entire purpose of recording the LSN in the inode
      at writeback time (i.e. to stop it going backwards in time on disk
      during recovery).
      
      Secondly, if we commit the CIL to the journal so the inode item
      moves to the AIL, and then relog the inode, the LSN that gets
      stamped into the log inode will be the LSN of the inode's current
      location in the AIL, not its age on disk. And it's not the LSN that
      will be associated with the current change. That means when log
      recovery replays this inode item, the LSN that ends up on disk is
      the LSN for the previous changes in the log, not the current
      changes being replayed. IOWs, after recovery the LSN on disk is not
      in sync with the LSN of the modifications that were replayed into
      the inode. This, again, violates the recovery ordering semantics
      that on-disk writeback LSNs provide.
      
      Hence the inode LSN in the log dinode is -always- invalid.
      
      Thirdly, recovery actually has the LSN of the log transaction it is
      replaying right at hand - it uses it to determine if it should
      replay the inode by comparing it to the on-disk inode's LSN. But it
      doesn't use that LSN to stamp the LSN into the inode which will be
      written back when the transaction is fully replayed. It uses the one
      in the log dinode, which we know is always going to be incorrect.
      
      Looking back at the change history, the inode logging was broken by
      commit 93f958f9 ("xfs: cull unnecessary icdinode fields") way
      back in 2016 by a stupid idiot who thought he knew how this code
      worked. i.e. me. That commit replaced an in memory di_lsn field that
      was updated only at inode writeback time from the inode item.li_lsn
      value - and hence always contained the same LSN that appeared in the
      on-disk inode - with a read of the inode item LSN at inode format
      time. Clearly these are not the same thing.
      
      Before 93f958f9, the log recovery behaviour was irrelevant,
      because the LSN in the log inode always matched the on-disk LSN at
      the time the inode was logged, hence recovery of the transaction
      would never make the on-disk LSN in the inode go backwards or get
      out of sync.
      
      A symptom of the problem is this, caught from a failure of
      generic/482. Before log recovery, the inode has been allocated but
      never used:
      
      xfs_db> inode 393388
      xfs_db> p
      core.magic = 0x494e
      core.mode = 0
      ....
      v3.crc = 0x99126961 (correct)
      v3.change_count = 0
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jan  1 10:00:00 1970
      v3.crtime.nsec = 0
      
      After log recovery:
      
      xfs_db> p
      core.magic = 0x494e
      core.mode = 020444
      ....
      v3.crc = 0x23e68f23 (correct)
      v3.change_count = 2
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jul 22 17:03:03 2021
      v3.crtime.nsec = 751000000
      ...
      
      You can see that the LSN of the on-disk inode is 0, even though it
      clearly has been written to disk. I point out this inode, because
      the generic/482 failure occurred because several adjacent inodes in
      this specific inode cluster were not replayed correctly and still
      appeared to be zero on disk when all the other metadata (inobt,
      finobt, directories, etc) indicated they should be allocated and
      written back.
      
      The fix for this is two-fold. The first is that we need to either
      revert the LSN changes in 93f958f9 or stop logging the inode LSN
      altogether. If we do the former, log recovery does not need to
      change but we add 8 bytes of memory per inode to store what is
      largely a write-only inode field. If we do the latter, log recovery
      needs to stamp the on-disk inode in the same manner that inode
      writeback does.
      
      I prefer the latter, because we shouldn't really be trying to log
      and replay changes to the on disk LSN as the on-disk value is the
      canonical source of the on-disk version of the inode. It also
      matches the way we recover buffer items - we create a buf_log_item
      that carries the current recovery transaction LSN that gets stamped
      into the buffer by the write verifier when it gets written back
      when the transaction is fully recovered.
      
      However, this might break log recovery on older kernels even more,
      so I'm going to simply ignore the logged value in recovery and stamp
      the on-disk inode with the LSN of the transaction being recovered
      that will trigger writeback on transaction recovery completion. This
      will ensure that the on-disk inode LSN always reflects the LSN of
      the last change that was written to disk, regardless of whether it
      comes from log recovery or runtime writeback.
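      
      In recovery this plausibly reduces to a single stamping step (sketch;
      field names assumed):
      
        /* When replaying an inode item (sketch): the LSN of the recovery
         * transaction, not the logged di_lsn, goes into the on-disk inode. */
        to->di_lsn = cpu_to_be64(current_lsn);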
      
      Fixes: 93f958f9 ("xfs: cull unnecessary icdinode fields")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: avoid unnecessary waits in xfs_log_force_lsn() · 8191d822
      Dave Chinner authored
      Before waiting on an iclog in xfs_log_force_lsn(), we don't check to
      see if the iclog has already been completed and the contents on
      stable storage. We check for completed iclogs in xfs_log_force(), so
      we should do the same thing for xfs_log_force_lsn().
      
      This fixed some random up-to-30s pauses seen when unmounting
      filesystems in some tests. A log force ends up waiting on a completed
      iclog, and that doesn't then get flushed (and hence the log force
      doesn't complete) until the background log worker issues a log force
      that flushes the iclog in question. Then the unmount unblocks and
      continues.
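      
      The added check is presumably of this shape (an assumption, mirroring
      what xfs_log_force() already does):
      
        /* Sketch: if the iclog has already been written back, its contents
         * are on stable storage and there is nothing to wait for. */
        if (iclog->ic_state == XLOG_STATE_ACTIVE ||
            iclog->ic_state == XLOG_STATE_DIRTY)
                goto out_unlock;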
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: log forces imply data device cache flushes · 2bf1ec0f
      Dave Chinner authored
      After fixing the tail_lsn vs cache flush race, generic/482 continued
      to fail in a similar way where cache flushes were missing before
      iclog FUA writes. Tracing of iclog state changes during the fsstress
      workload portion of the test (via xlog_iclog* events) indicated that
      iclog writes were coming from two sources - CIL pushes and log
      forces (due to fsync/O_SYNC operations). All of the cases where a
      recovery problem was triggered indicated that the log force was the
      source of the iclog write that was not preceded by a cache flush.
      
      This was an oversight in the modifications made in commit
      eef983ff ("xfs: journal IO cache flush reductions"). Log forces
      for fsync imply a data device cache flush has been issued if an
      iclog was flushed to disk and is indicated to the caller via the
      log_flushed parameter so they can elide the device cache flush if
      the journal issued one.
      
      The change in eef983ff results in iclogs only issuing a cache
      flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
      added to the iclogs that the log force code flushes to disk. Hence
      log forces are no longer guaranteeing that a cache flush is issued,
      hence opening up a potential on-disk ordering failure.
      
      Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
      the actual iclogs it forces to the journal are also on stable
      storage before it returns to the caller.
      
      This patch introduces the xlog_force_iclog() helper function to
      encapsulate the process of taking a reference to an iclog, switching
      its state if WANT_SYNC and flushing it to stable storage correctly.
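      
      A condensed sketch of what such a helper plausibly looks like (assumed
      shape; the real signature and locking may differ):
      
        static int xlog_force_iclog(struct xlog_in_core *iclog)
        {
                atomic_inc(&iclog->ic_refcnt);
                iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
                if (iclog->ic_state == XLOG_STATE_ACTIVE)
                        xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
                return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
        }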
      
      Both xfs_log_force() and xfs_log_force_lsn() are converted to use
      it, as is xlog_unmount_write() which has an elaborate method of
      doing exactly the same "write this iclog to stable storage"
      operation.
      
      Further, if the log force code needs to wait on an iclog in the
      WANT_SYNC state, it needs to ensure that iclog also results in a
      cache flush being issued. This covers the case where the iclog
      contains the commit record of the CIL flush that the log force
      triggered, but it hasn't been written yet because there is still an
      active reference to the iclog.
      
      Note: this whole cache flush whack-a-mole patch is a result of log
      forces still being iclog state centric rather than being CIL
      sequence centric. Most of this nasty code will go away in future
      when log forces are converted to wait on CIL sequence push
      completion rather than iclog completion. With the CIL push algorithm
      guaranteeing that the CIL checkpoint is fully on stable storage when
      it completes, we no longer need to iterate iclogs and push them to
      ensure a CIL sequence push has completed and so all this nasty iclog
      iteration and flushing code will go away.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: factor out forced iclog flushes · 45eddb41
      Dave Chinner authored
      We force iclogs in several places - we need them all to have the
      same cache flush semantics, so start by factoring out the iclog
      force into a common helper.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
      Dave Chinner authored
      There is a race between the new CIL async data device metadata IO
      completion cache flush and the log tail in the iclog the flush
      covers being updated. This can be seen by repeating generic/482 in a
      loop and eventually log recovery fails with a failures such as this:
      
      XFS (dm-3): Starting recovery (logdev: internal)
      XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
      XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
      
      Analysis of the logwrite replay shows that there were no writes to
      the data device between the FUA @ write 124 and the FUA at write @
      125, but log recovery @ 125 failed. The difference was the one log
      write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
      and so the inode create intent in (1,8) was not replayed and so the
      inode cluster was zero on disk when replay of the first inode item
      in (1,32) was attempted.
      
      What this meant was that the journal write that occurred at @ 125
      did not ensure that metadata whose IO completed before the iclog was
      written was correctly on stable storage. The tail of the log moved forward,
      so IO must have been completed between the two iclog writes. This
      means that there is a race condition between the unconditional async
      cache flush in the CIL push work and the tail LSN that is written to
      the iclog. This happens like so:
      
      CIL push work				AIL push work
      -------------				-------------
      Add to committing list
      start async data dev cache flush
      .....
      <flush completes>
      <all writes to old tail lsn are stable>
      xlog_write
        ....					push inode create buffer
      					<start IO>
      					.....
      xlog_write(commit record)
        ....					<IO completes>
        					log tail moves
        					  xlog_assign_tail_lsn()
      start_lsn == commit_lsn
        <no iclog preflush!>
      xlog_state_release_iclog
        __xlog_state_release_iclog()
          <writes *new* tail_lsn into iclog>
        xlog_sync()
          ....
          submit_bio()
      <tail in log moves forward without flushing written metadata>
      
      Essentially, this can only occur if the commit iclog is issued
      without a cache flush. If the iclog bio is submitted with
      REQ_PREFLUSH, then it will guarantee that all the completed IO is
      on stable storage before the iclog bio with the new tail LSN in it
      is written to the log.
      
      IOWs, the tail lsn that is written to the iclog needs to be sampled
      *before* we issue the cache flush that guarantees all IO up to that
      LSN has been completed.
      
      To fix this without giving up the performance advantage of the
      flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
      compared to 5.13), we need to ensure that we always issue a cache
      flush if the tail LSN changes between the initial async flush and
      the commit record being written. This requires sampling the tail_lsn
      before we start the flush, and then passing the sampled tail LSN to
      xlog_state_release_iclog() so it can determine if the tail LSN
      has changed while writing the checkpoint. If the tail LSN has
      changed, then it needs to set the NEED_FLUSH flag on the iclog and
      we'll issue another cache flush before writing the iclog.
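      
      As a hedged fragment (assumed form) of the check described above:
      
        /* Sketch: old_tail_lsn was sampled before the async cache flush
         * was issued; if the tail moved since, this iclog must flush. */
        if (old_tail_lsn &&
            old_tail_lsn != atomic64_read(&log->l_tail_lsn))
                iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;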
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog · 9d392064
      Dave Chinner authored
      Fold __xlog_state_release_iclog into its only caller to make an
      upcoming fix easier.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: split from a larger patch]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: external logs need to flush data device · b5d721ea
      Dave Chinner authored
      The recent journal flush/FUA changes replaced the flushing of the
      data device on every iclog write with an up-front async data device
      cache flush. Unfortunately, the assumption this was based on has
      been proven incorrect by the flush vs log tail update ordering
      issue. As the fix for that issue uses the XLOG_ICL_NEED_FLUSH flag
      to indicate that the data device needs a cache flush, we now need
      to (once again) ensure that an iclog write to an external log that
      needs a cache flush actually issues that cache flush to the data
      device as well as the log device.
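      
      The shape of the change is plausibly (sketch; the exact call site is an
      assumption):
      
        /* Sketch: for an external log, a NEED_FLUSH iclog must also flush
         * the data device, not just the log device. */
        if (log->l_targ != log->l_mp->m_ddev_targp &&
            (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))
                blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);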
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: flush data dev on external log write · b1e27239
      Dave Chinner authored
      We incorrectly flush the log device instead of the data device when
      trying to ensure metadata is correctly on disk before writing the
      unmount record.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  4. 29 Jul 2021, 5 commits
    • btrfs: calculate number of eb pages properly in csum_tree_block · 7280305e
      David Sterba authored
      Building with -Warray-bounds on systems with 64K pages there's a
      warning:
      
        fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
        fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds]
          226 |   kaddr = page_address(buf->pages[i]);
              |                        ~~~~~~~~~~^~~
        ./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’
         1630 | #define page_address(page) lowmem_page_address(page)
              |                                                ^~~~
        In file included from fs/btrfs/ctree.h:32,
                         from fs/btrfs/disk-io.c:23:
        fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’
           98 |  struct page *pages[1];
              |               ^~~~~
      
      The compiler has no way to know that in that case the nodesize is exactly
      PAGE_SIZE, so the resulting number of pages will be correct (1).
      
      Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE
      explicitly 1.
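      
      Roughly, the loop then becomes (sketch):
      
        int i, num_pages = num_extent_pages(buf);
      
        for (i = 0; i < num_pages; i++) {
                kaddr = page_address(buf->pages[i]);
                /* feed this page of the extent buffer into the checksum */
        }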
      Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • cifs: add missing parsing of backupuid · b946dbcf
      Ronnie Sahlberg authored
      We lost parsing of backupuid in the switch to new mount API.
      Add it back.
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
      Cc: <stable@vger.kernel.org> # v5.11+
      Reported-by: Xiaoli Feng <xifeng@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • btrfs: fix rw device counting in __btrfs_free_extra_devids · b2a61667
      Desmond Cheong Zhi Xi authored
      When removing a writeable device in __btrfs_free_extra_devids, the rw
      device count should be decremented.
      
      This error was caught by Syzbot which reported a warning in
      close_fs_devices:
      
        WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        Modules linked in:
        CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
        RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
        RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
        R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
        R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
        FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
         open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
         btrfs_fill_super fs/btrfs/super.c:1382 [inline]
         btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         fc_mount fs/namespace.c:993 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
         btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         do_new_mount fs/namespace.c:2905 [inline]
         path_mount+0x196f/0x2be0 fs/namespace.c:3235
         do_mount fs/namespace.c:3248 [inline]
         __do_sys_mount fs/namespace.c:3456 [inline]
         __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The warning fired because fs_devices->rw_devices was not 0 after
      closing all devices. Here is the call trace that was observed:
      
        btrfs_mount_root():
          btrfs_scan_one_device():
            device_list_add();   <---------------- device added
          btrfs_open_devices():
            open_fs_devices():
              btrfs_open_one_device();   <-------- writable device opened,
      	                                     rw device count ++
          btrfs_fill_super():
            open_ctree():
              btrfs_free_extra_devids():
      	  __btrfs_free_extra_devids();  <--- writable device removed,
      	                              rw device count not decremented
      	  fail_tree_roots:
      	    btrfs_close_devices():
      	      close_fs_devices();   <------- rw device count off by 1
      
      As a note, prior to commit cf89af14 ("btrfs: dev-replace: fail
      mount if we don't have replace item with target device"), rw_devices
      was decremented on removing a writable device in
      __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
      was not set for the device. However, this check does not need to be
      reinstated as it is now redundant and incorrect.
      
      In __btrfs_free_extra_devids, we skip removing the device if it is the
      target for replacement. This is done by checking whether device->devid
      == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
      only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
      should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
      and so it's redundant to test for that bit.
      
      Additionally, following commit 82372bc8 ("Btrfs: make
      the logic of source device removing more clear"), rw_devices is
      incremented whenever a writeable device is added to the alloc
      list (including the target device in btrfs_dev_replace_finishing), so
      all removals of writable devices from the alloc list should also be
      accompanied by a decrement to rw_devices.
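      
      The fix then plausibly reduces to pairing the removal with the
      decrement (sketch; the surrounding code is assumed):
      
        /* __btrfs_free_extra_devids() (sketch): dropping a writeable
         * device from the alloc list must also drop the rw count. */
        if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
                list_del_init(&device->dev_alloc_list);
                clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
                fs_devices->rw_devices--;
        }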
      
      Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Fixes: cf89af14 ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
      CC: stable@vger.kernel.org # 5.10+
      Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction · ecc64fab
      Filipe Manana authored
      When checking if we need to log the new name of a renamed inode, we are
      checking if the inode and its parent inode have been logged before, and if
      not we don't log the new name. The check however is buggy, as it directly
      compares the logged_trans field of the inodes versus the ID of the current
      transaction. The problem is that logged_trans is a transient field, only
      stored in memory and never persisted in the inode item, so if an inode
      was logged before, evicted and reloaded, its logged_trans field is set to
      a value of 0, meaning the check will return false and the new name of the
      renamed inode is not logged. If the old parent directory was previously
      fsynced and we deleted the logged directory entries corresponding to the
      old name, we end up with a log that when replayed will delete the renamed
      inode.
      
      The following example triggers the problem:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ echo -n "hello world" > /mnt/A/foo
      
        $ sync
      
        # Add some new file to A and fsync directory A.
        $ touch /mnt/A/bar
        $ xfs_io -c "fsync" /mnt/A
      
        # Now trigger inode eviction. We are only interested in triggering
        # eviction for the inode of directory A.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Move foo from directory A to directory B.
        # This deletes the directory entries for foo in A from the log, and
        # does not add the new name for foo in directory B to the log, because
        # logged_trans of A is 0, which is less than the current transaction ID.
        $ mv /mnt/A/foo /mnt/B/foo
      
        # Now make an fsync to anything except A, B or any file inside them,
        # like for example create a file at the root directory and fsync this
        # new file. This syncs the log that contains all the changes done by
        # previous rename operation.
        $ touch /mnt/baz
        $ xfs_io -c "fsync" /mnt/baz
      
        <power fail>
      
        # Mount the filesystem and replay the log.
        $ mount /dev/sdc /mnt
      
        # Check the filesystem content.
        $ ls -1R /mnt
        /mnt/:
        A
        B
        baz
      
        /mnt/A:
        bar
      
        /mnt/B:
        $
      
        # File foo is gone, it's neither in A/ nor in B/.
      
      Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
      safely checks if an inode was logged before in the current transaction.
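      
      The check in btrfs_log_new_name() then takes roughly this shape
      (sketch, simplified from the description):
      
        /* logged_trans is transient and zeroed on eviction; inode_logged()
         * can also consult the on-disk log tree, so it survives reloads. */
        if (!inode_logged(trans, inode) &&
            (!old_dir || !inode_logged(trans, old_dir)))
                return;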
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: mark compressed range uptodate only if all bio succeed · 240246f6
      Goldwyn Rodrigues authored
      In the compression write endio sequence, the range which the
      compressed_bio writes is marked as uptodate if the last bio of the
      compressed (sub)bios completed successfully. But a previous bio may
      have failed, which is recorded in cb->errors.
      
      Set the writeback range as uptodate only if cb->errors is zero, as opposed
      to checking only the last bio's status.
      
      Backporting notes: in all versions up to 4.4 the last argument is always
      replaced by "!cb->errors".
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 28 Jul 2021, 4 commits
  6. 27 Jul 2021, 3 commits
  7. 26 Jul 2021, 2 commits
  8. 24 Jul 2021, 5 commits
    • hugetlbfs: fix mount mode command line processing · e0f7e2b2
      Mike Kravetz authored
      In commit 32021982 ("hugetlbfs: Convert to fs_context") processing
      of the mount mode string was changed from match_octal() to fsparam_u32.
      
      This changed existing behavior as match_octal does not require octal
      values to have a '0' prefix, but fsparam_u32 does.
      
      Use fsparam_u32oct which provides the same behavior as match_octal.
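      
      In fs_parameter_spec terms the option ends up as (sketch; the table
      name is assumed to match current kernels):
      
        static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
                fsparam_u32oct("mode", Opt_mode), /* parses "755" and "0755" as octal */
                /* ... other parameters ... */
                {}
        };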
      
      Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
      Fixes: 32021982 ("hugetlbfs: Convert to fs_context")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: do not reparent dax inodes · 593311e8
      Roman Gushchin authored
      The inode switching code is not suited for dax inodes.  An attempt to
      switch a dax inode to a parent writeback structure (as a part of a
      writeback cleanup procedure) results in a panic like this:
      
        run fstests generic/270 at 2021-07-15 05:54:02
        XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
        XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
        XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
        XFS (pmem0p2): Mounting V5 Filesystem
        XFS (pmem0p2): Ending clean mount
        XFS (pmem0p2): Quotacheck needed: Please wait.
        XFS (pmem0p2): Quotacheck: Done.
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        BUG: unable to handle page fault for address: 0000000005b0f669
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7+ #8
        Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
        Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Call Trace:
         inode_switch_wbs_work_fn+0xb6/0x2a0
         process_one_work+0x1e6/0x380
         worker_thread+0x53/0x3d0
         kthread+0x10f/0x130
         ret_from_fork+0x22/0x30
        Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
        CR2: 0000000005b0f669
        ---[ end trace ed2105faff8384f3 ]---
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      The crash happens on an attempt to iterate over attached pagecache pages
      and check the dirty flag: a dax inode's xarray contains pfn's instead of
      generic struct page pointers.
      
      This happens for DAX and not for other kinds of non-page entries in the
      inodes because it's a tagged iteration, and shadow/swap entries are
      never tagged; only DAX entries get tagged.
      
      Fix the problem by bailing out (with the false return value) of
      inode_prepare_wbs_switch() if a dax inode is passed.
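      
      The bail-out is presumably a one-liner of this shape (sketch):
      
        /* inode_prepare_wbs_switch() (sketch): a dax inode's xarray holds
         * pfn entries, not struct page pointers, so never switch its wb. */
        if (IS_DAX(inode))
                return false;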
      
      [willy@infradead.org: changelog addition]
      
      Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
      Fixes: c22d70a1 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: do not untag user pointers · e71e2ace
      Peter Collingbourne authored
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
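      
      Conceptually, validate_range() goes back to operating on the raw,
      still-tagged pointer (sketch; field names and the exact checks are
      assumptions):
      
        /* Sketch: with no untagged_addr() call, a tagged start address
         * simply fails these checks and the ioctl returns -EINVAL. */
        if (start & ~PAGE_MASK)
                return -EINVAL;
        if (start < mmap_min_addr || start >= mm->task_size)
                return -EINVAL;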
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • io_uring: explicitly catch any illegal async queue attempt · 991468dc
      Jens Axboe authored
      Catch an illegal case to queue async from an unrelated task that got
      the ring fd passed to it. This should not be possible to hit, but
      better be proactive and catch it explicitly. io-wq is extended to
      check for early IO_WQ_WORK_CANCEL being set on a work item as well,
      so it can run the request through the normal cancelation path.
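      
      The catch is plausibly of this shape (sketch; assumed form):
      
        /* Sketch: async work queued from a task unrelated to the ring is
         * flagged for cancelation instead of being executed. */
        if (WARN_ON_ONCE(!same_thread_group(req->task, current)))
                req->work.flags |= IO_WQ_WORK_CANCEL;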
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: never attempt iopoll reissue from release path · 3c30ef0f
      Jens Axboe authored
      There are two reasons why this shouldn't be done:
      
      1) Ring is exiting, and we're canceling requests anyway. Any request
         should be canceled anyway. In theory, this could iterate for a
         number of times if someone else is also driving the target block
         queue into request starvation; however, the likelihood of this
         happening is minuscule.
      
      2) If the original task decided to pass the ring to another task, then
         we don't want to be reissuing from this context as it may be an
         unrelated task or context. No assumptions should be made about
         the context in which ->release() is run. This can only happen for pure
         read/write, and we'll get -EFAULT on them anyway.
      
      Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 23 Jul 2021, 2 commits