1. 18 3月, 2019 1 次提交
    • B
      xfs: don't trip over uninitialized buffer on extent read of corrupted inode · 6958d11f
      Brian Foster 提交于
      We've had rather rare reports of bmap btree block corruption where
      the bmap root block has a level count of zero. The root cause of the
      corruption is so far unknown. We do have verifier checks to detect
      this form of on-disk corruption, but this doesn't cover a memory
      corruption variant of the problem. The latter is a reasonable
      possibility because the root block is part of the inode fork and can
      reside in-core for some time before inode extents are read.
      
      If this occurs, it leads to a system crash such as the following:
      
       BUG: unable to handle kernel paging request at ffffffff00000221
       PF error: [normal kernel read fault]
       ...
       RIP: 0010:xfs_trans_brelse+0xf/0x200 [xfs]
       ...
       Call Trace:
        xfs_iread_extents+0x379/0x540 [xfs]
        xfs_file_iomap_begin_delay+0x11a/0xb40 [xfs]
        ? xfs_attr_get+0xd1/0x120 [xfs]
        ? iomap_write_begin.constprop.40+0x2d0/0x2d0
        xfs_file_iomap_begin+0x4c4/0x6d0 [xfs]
        ? __vfs_getxattr+0x53/0x70
        ? iomap_write_begin.constprop.40+0x2d0/0x2d0
        iomap_apply+0x63/0x130
        ? iomap_write_begin.constprop.40+0x2d0/0x2d0
        iomap_file_buffered_write+0x62/0x90
        ? iomap_write_begin.constprop.40+0x2d0/0x2d0
        xfs_file_buffered_aio_write+0xe4/0x3b0 [xfs]
        __vfs_write+0x150/0x1b0
        vfs_write+0xba/0x1c0
        ksys_pwrite64+0x64/0xa0
        do_syscall_64+0x5a/0x1d0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The crash occurs because xfs_iread_extents() attempts to release an
      uninitialized buffer pointer as the level == 0 value prevented the
      buffer from ever being allocated or read. Change the level > 0
      assert to an explicit error check in xfs_iread_extents() to avoid
      crashing the kernel in the event of localized, in-core inode
      corruption.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      6958d11f
  2. 21 2月, 2019 1 次提交
    • C
      xfs: make COW fork unwritten extent conversions more robust · 26b91c72
      Christoph Hellwig 提交于
      If we have racing buffered and direct I/O COW fork extents under
      writeback can have been moved to the data fork by the time we call
      xfs_reflink_convert_cow from xfs_submit_ioend.  This would be mostly
      harmless as the block numbers don't change by this move, except for
      the fact that xfs_bmapi_write will crash or trigger asserts when
      not finding existing extents, even despite trying to paper over this
      with the XFS_BMAPI_CONVERT_ONLY flag.
      
      Instead of special casing non-transaction conversions in the already
      way too complicated xfs_bmapi_write just add a new helper for the much
      simpler non-transactional COW fork case, which simplify ignores not
      found extents.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      26b91c72
  3. 18 2月, 2019 5 次提交
  4. 12 2月, 2019 3 次提交
    • B
      xfs: use the latest extent at writeback delalloc conversion time · c2b31643
      Brian Foster 提交于
      The writeback delalloc conversion code is racy with respect to
      changes in the currently cached file mapping outside of the current
      page. This is because the ilock is cycled between the time the
      caller originally looked up the mapping and across each real
      allocation of the provided file range. This code has collected
      various hacks over the years to help combat the symptoms of these
      races (i.e., truncate race detection, allocation into hole
      detection, etc.), but none address the fundamental problem that the
      imap may not be valid at allocation time.
      
      Rather than continue to use race detection hacks, update writeback
      delalloc conversion to a model that explicitly converts the delalloc
      extent backing the current file offset being processed. The current
      file offset is the only block we can trust to remain once the ilock
      is dropped because any operation that can remove the block
      (truncate, hole punch, etc.) must flush and discard pagecache pages
      first.
      
      Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
      mechanism to request allocation of the entire delalloc extent
      backing the current offset instead of assuming the extent passed by
      the caller is unchanged. Record the range specified by the caller
      and apply it to the resulting allocated extent so previous checks by
      the caller for COW fork overlap are not lost. Finally, overload the
      bmapi delalloc flag with the range reval flag behavior since this is
      the only use case for both.
      
      This ensures that writeback always picks up the correct
      and current extent associated with the page, regardless of races
      with other extent modifying operations. If operating on a data fork
      and the COW overlap state has changed since the ilock was cycled,
      the caller revalidates against the COW fork sequence number before
      using the imap for the next block.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c2b31643
    • B
      xfs: create delalloc bmapi wrapper for full extent allocation · 627209fb
      Brian Foster 提交于
      The writeback delalloc conversion code is racy with respect to
      changes in the currently cached file mapping. This stems from the
      fact that the bmapi allocation code requires a file range to
      allocate and the writeback conversion code assumes the range of the
      currently cached mapping is still valid with respect to the fork. It
      may not be valid, however, because the ilock is cycled (potentially
      multiple times) between the time the cached mapping was populated
      and the delalloc conversion occurs.
      
      To facilitate a solution to this problem, create a new
      xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file
      (FSB) offset and attempts to allocate whatever delalloc extent backs
      the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set
      the range based on the extent backing the bno parameter unless bno
      lands in a hole. If bno does land in a hole, fall back to the
      current behavior (which may result in an error or quietly skipping
      holes in the specified range depending on other parameters). This
      patch does not change behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      627209fb
    • B
      xfs: remove superfluous writeback mapping eof trimming · 3b350898
      Brian Foster 提交于
      Now that the cached writeback mapping is explicitly invalidated on
      data fork changes, the EOF trimming band-aid is no longer necessary.
      Remove xfs_trim_extent_eof() as well since it has no other users.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      3b350898
  5. 13 12月, 2018 2 次提交
  6. 22 11月, 2018 1 次提交
    • D
      xfs: delalloc -> unwritten COW fork allocation can go wrong · 9230a0b6
      Dave Chinner 提交于
      Long saga. There have been days spent following this through dead end
      after dead end in multi-GB event traces. This morning, after writing
      a trace-cmd wrapper that enabled me to be more selective about XFS
      trace points, I discovered that I could get just enough essential
      tracepoints enabled that there was a 50:50 chance the fsx config
      would fail at ~115k ops. If it didn't fail at op 115547, I stopped
      fsx at op 115548 anyway.
      
      That gave me two traces - one where the problem manifested, and one
      where it didn't. After refining the traces to have the necessary
      information, I found that in the failing case there was a real
      extent in the COW fork compared to an unwritten extent in the
      working case.
      
      Walking back through the two traces to the point where the CWO fork
      extents actually diverged, I found that the bad case had an extra
      unwritten extent in it. This is likely because the bug it led me to
      had triggered multiple times in those 115k ops, leaving stray
      COW extents around. What I saw was a COW delalloc conversion to an
      unwritten extent (as they should always be through
      xfs_iomap_write_allocate()) resulted in a /written extent/:
      
      xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
      xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
      
      Basically, Cow fork before:
      
      	0 1            32          52
      	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
      	   PREV		RIGHT
      
      COW delalloc conversion allocates:
      
      	  1	       32
      	  +uuuuuuuuuuuu+
      	  NEW
      
      And the result according to the xfs_bmap_post_update trace was:
      
      	0 1            32          52
      	+H+wwwwwwwwwwwwwwwwwwwwwwww+
      	   PREV
      
      Which is clearly wrong - it should be a merged unwritten extent,
      not an unwritten extent.
      
      That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
      case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
      the bug.
      
      It takes the old delalloc extent (PREV) and adds the length of the
      RIGHT extent to it, takes the start block from NEW, removes the
      RIGHT extent and then updates PREV with the new extent.
      
      What it fails to do is update PREV.br_state. For delalloc, this is
      always XFS_EXT_NORM, while in this case we are converting the
      delayed allocation to unwritten, so it needs to be updated to
      XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
      the resultant extent is always written.
      
      And that's the bug I've been chasing for a week - a bmap btree bug,
      not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
      with the recent in core extent tree scalability enhancements.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9230a0b6
  7. 18 10月, 2018 2 次提交
  8. 01 10月, 2018 1 次提交
    • D
      xfs: fix error handling in xfs_bmap_extents_to_btree · e55ec4dd
      Dave Chinner 提交于
      Commit 01239d77 ("xfs: fix a null pointer dereference in
      xfs_bmap_extents_to_btree") attempted to fix a null pointer
      dreference when a fuzzing corruption of some kind was found.
      This fix was flawed, resulting in assert failures like:
      
      XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
      .....
      Call Trace:
        xfs_bmap_extents_to_btree+0x6b9/0x7b0
        __xfs_bunmapi+0xae7/0xf00
        ? xfs_log_reserve+0x1c8/0x290
        xfs_reflink_remap_extent+0x20b/0x620
        xfs_reflink_remap_blocks+0x7e/0x290
        xfs_reflink_remap_range+0x311/0x530
        vfs_dedupe_file_range_one+0xd7/0xe0
        vfs_dedupe_file_range+0x15b/0x1a0
        do_vfs_ioctl+0x267/0x6c0
      
      The problem is that the error handling code now asserts that the
      inode fork is not in btree format before the error handling code
      undoes the modifications that put the fork back in extent format.
      Fix this by moving the assert back to after the xfs_iroot_realloc()
      call that returns the fork to extent format, and clean up the jump
      labels to be meaningful.
      
      Also, returning ENOSPC when xfs_btree_get_bufl() fails to
      instantiate the buffer that was allocated (the actual fix in the
      commit mentioned above) is incorrect. This is a fatal error - only
      an invalid block address or a filesystem shutdown can result in
      failing to get a buffer here.
      
      Hence change this to EFSCORRUPTED so that the higher layer knows
      this was a corruption related failure and should not treat it as an
      ENOSPC error.  This should result in a shutdown (via cancelling a
      dirty transaction) which is necessary as we do not attempt to clean
      up the (invalid) block that we have already allocated.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      e55ec4dd
  9. 12 8月, 2018 1 次提交
  10. 03 8月, 2018 6 次提交
    • B
      xfs: fold dfops into the transaction · 9d9e6233
      Brian Foster 提交于
      struct xfs_defer_ops has now been reduced to a single list_head. The
      external dfops mechanism is unused and thus everywhere a (permanent)
      transaction is accessible the associated dfops structure is as well.
      
      Remove the xfs_defer_ops structure and fold the list_head into the
      transaction. Also remove the last remnant of external dfops in
      xfs_trans_dup().
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9d9e6233
    • B
      xfs: pass transaction to xfs_defer_add() · 0f37d178
      Brian Foster 提交于
      The majority of remaining references to struct xfs_defer_ops in XFS
      are associated with xfs_defer_add(). At this point, there are no
      more external xfs_defer_ops users left. All instances of
      xfs_defer_ops are embedded in the transaction, which means we can
      safely pass the transaction down to the dfops add interface.
      
      Update xfs_defer_add() to receive the transaction as a parameter.
      Various subsystems implement wrappers to allocate and construct the
      context specific data structures for the associated deferred
      operation type. Update these to also carry the transaction down as
      needed and clean up unused dfops parameters along the way.
      
      This removes most of the remaining references to struct
      xfs_defer_ops throughout the code and facilitates removal of the
      structure.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      [darrick: fix unused variable warnings with ftrace disabled]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0f37d178
    • B
      xfs: drop dop param from xfs_defer_op_type ->finish_item() callback · 7dbddbac
      Brian Foster 提交于
      The dfops infrastructure ->finish_item() callback passes the
      transaction and dfops as separate parameters. Since dfops is always
      part of a transaction, the latter parameter is no longer necessary.
      Remove it from the various callbacks.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7dbddbac
    • B
      xfs: automatic dfops inode relogging · a8198666
      Brian Foster 提交于
      Inodes that are held across deferred operations are explicitly
      joined to the dfops structure to ensure appropriate relogging.
      While inodes are currently joined explicitly, we can detect the
      conditions that require relogging at dfops finish time by inspecting
      the transaction item list for inodes with ili_lock_flags == 0.
      
      Replace the xfs_defer_ijoin() infrastructure with such detection and
      automatic relogging of held inodes. This eliminates the need for the
      per-dfops inode list, replaced by an on-stack variant in
      xfs_defer_trans_roll().
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a8198666
    • B
      xfs: add missing defer ijoins for held inodes · 488c919a
      Brian Foster 提交于
      Log items that require relogging during deferred operations
      processing are explicitly joined to the associated dfops via the
      xfs_defer_*join() helpers. These calls imply that the associated
      object is "held" by the transaction such that when rolled, the item
      can be immediately joined to a follow up transaction. For buffers,
      this means the buffer remains locked and held after each roll. For
      inodes, this means that the inode remains locked.
      
      Failure to join a held item to the dfops structure means the
      associated object pins the tail of the log while dfops processing
      completes, because the item never relogs and is not unlocked or
      released until deferred processing completes.
      
      Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
      with deferred operations are explicitly joined to the dfops. This is
      not the case for inodes, however, as various contexts defer
      operations to transactions with held inodes without explicit joins
      to the associated dfops (and thus not relogging).
      
      While this is not a catastrophic problem, it is not ideal. Given
      that we want to eventually relog such items automatically during
      dfops processing, start by explicitly adding these missing
      xfs_defer_ijoin() calls. A call is added everywhere an inode is
      joined to a transaction without transferring lock ownership and
      said transaction runs deferred operations.
      
      All xfs_defer_ijoin() calls will eventually be replaced by automatic
      dfops inode relogging. This patch essentially implements the
      behavior change that would otherwise occur due to automatic inode
      dfops relogging.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      488c919a
    • B
      xfs: replace dop_low with transaction flag · 1214f1cf
      Brian Foster 提交于
      The dop_low field enables the low free space allocation mode when a
      previous allocation has detected difficulty allocating blocks. It
      has historically been part of the xfs_defer_ops structure, which
      means if enabled, it remains enabled across a set of transactions
      until the deferred operations have completed and the dfops is reset.
      
      Now that the dfops is embedded in the transaction, we can save a bit
      more space by using a transaction flag rather than a standalone
      boolean. Drop the ->dop_low field and replace it with a transaction
      flag that is set at the same points, carried across rolling
      transactions and cleared on completion of deferred operations. This
      essentially emulates the behavior of ->dop_low and so should not
      change behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      1214f1cf
  11. 30 7月, 2018 1 次提交
  12. 27 7月, 2018 1 次提交
    • B
      xfs: remove all boilerplate defer init/finish code · c8eac49e
      Brian Foster 提交于
      At this point, the transaction subsystem completely manages deferred
      items internally such that the common and boilerplate
      xfs_trans_alloc() -> xfs_defer_init() -> xfs_defer_finish() ->
      xfs_trans_commit() sequence can be replaced with a simple
      transaction allocation and commit.
      
      Remove all such boilerplate deferred ops code. In doing so, we
      change each case over to use the dfops in the transaction and
      specifically eliminate:
      
      - The on-stack dfops and associated xfs_defer_init() call, as the
        internal dfops is initialized on transaction allocation.
      - xfs_bmap_finish() calls that precede a final xfs_trans_commit() of
        a transaction.
      - xfs_defer_cancel() calls in error handlers that precede a
        transaction cancel.
      
      The only deferred ops calls that remain are those that are
      non-deterministic with respect to the final commit of the associated
      transaction or are open-coded due to special handling.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c8eac49e
  13. 24 7月, 2018 1 次提交
  14. 12 7月, 2018 14 次提交