1. 27 Apr 2020, 1 commit
    • btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3
      Authored by Filipe Manana
      When we have an inode with a prealloc extent that starts at an offset
      lower than the i_size and there is another prealloc extent that starts at
      an offset beyond i_size, we can end up losing part of the first prealloc
      extent (the part that starts at i_size) and have an implicit hole if we
      fsync the file and then have a power failure.
      
      Consider the following example with comments explaining how and why it
      happens.
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our test file with 2 consecutive prealloc extents, each with a
        # size of 128Kb, and covering the range from 0 to 256Kb, with a file
        # size of 0.
        $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
        $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
      
        # Fsync the file to record both extents in the log tree.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now do a redundant extent allocation for the range from 0 to 64Kb.
        # This will merely increase the file size from 0 to 64Kb. Instead we
        # could also do a truncate to set the file size to 64Kb.
        $ xfs_io -c "falloc 0 64K" /mnt/foo
      
        # Fsync the file, so we update the inode item in the log tree with the
        # new file size (64Kb). This also ends up setting the number of bytes
        # for the first prealloc extent to 64Kb. This is done by the truncation
        # at btrfs_log_prealloc_extents().
        # This means that if a power failure happens after this, a write into
        # the file range 64Kb to 128Kb will not use the prealloc extent and
        # will result in allocation of a new extent.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now set the file size to 256K with a truncate and then fsync the file.
        # Since no changes happened to the extents, the fsync only updates the
        # i_size in the inode item at the log tree. This results in an implicit
        # hole for the file range from 64Kb to 128Kb, something which fsck will
        # complain about when not using the NO_HOLES feature if we replay the log
        # after a power failure.
        $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
      
      So instead of always truncating the log to the inode's current i_size at
      btrfs_log_prealloc_extents(), check first if there's a prealloc extent
      that starts at an offset lower than the i_size and with a length that
      crosses the i_size - if there is one, just make sure we truncate to a
      size that corresponds to the end offset of that prealloc extent, so
      that we don't lose the part of that extent that starts at i_size if a
      power failure happens.
      
      A test case for fstests follows soon.
      
      Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 09 Apr 2020, 1 commit
    • btrfs: make full fsyncs always operate on the entire file again · 7af59743
      Authored by Filipe Manana
      This is a revert of commit 0a8068a3 ("btrfs: make ranged full
      fsyncs more efficient"), with updated comment in btrfs_sync_file.
      
      Commit 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      made full fsyncs operate on the given range only as it assumed it was safe
      when using the NO_HOLES feature, since the hole detection was simplified
      some time ago and no longer was a source for races with ordered extent
      completion of adjacent file ranges.
      
      However it's still not safe to have a full fsync only operate on the given
      range, because extent maps for new extents might not be present in memory
      due to inode eviction or extent cloning. Consider the following example:
      
      1) We are currently at transaction N;
      
      2) We write to the file range [0, 1MiB);
      
      3) Writeback finishes for the whole range and ordered extents complete,
         while we are still at transaction N;
      
      4) The inode is evicted;
      
      5) We open the file for writing, causing the inode to be loaded to
         memory again, which sets the 'full sync' bit on its flags. At this
         point the inode's list of modified extent maps is empty (figuring
         out which extents were created in the current transaction and were
         not yet logged by an fsync is expensive, that's why we set the
         'full sync' bit when loading an inode);
      
      6) We write to the file range [512KiB, 768KiB);
      
      7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
         This correctly flushes this range and logs its extent into the log
         tree. When the writeback started an extent map for range [512KiB, 768KiB)
         was added to the inode's list of modified extents, and when the fsync()
         finishes logging it removes that extent map from the list of modified
         extent maps. This fsync also clears the 'full sync' bit;
      
      8) We do a regular fsync() (full ranged). This fsync() ends up doing
         nothing because the inode's list of modified extents is empty and
         no other changes happened since the previous ranged fsync(), so
         it just returns success (0) and we end up never logging extents for
         the file ranges [0, 512KiB) and [768KiB, 1MiB).
      
      Another scenario where this can happen is if we replace steps 2 to 4 with
      cloning from another file into our test file, as that sets the 'full sync'
      bit in our inode's flags and does not populate its list of modified extent
      maps.
      
      This was causing test case generic/457 to fail sporadically when using the
      NO_HOLES feature, as it exercised this latter case where the inode has the
      'full sync' bit set and has no extent maps in memory to represent the new
      extents due to extent cloning.
      
      Fix this by reverting commit 0a8068a3 ("btrfs: make ranged full fsyncs
      more efficient") since there is no easy way to work around it.
      
      Fixes: 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 24 Mar 2020, 15 commits
  4. 24 Jan 2020, 1 commit
    • Btrfs: fix infinite loop during fsync after rename operations · b5e4ff9d
      Authored by Filipe Manana
      Recently fsstress (from fstests) sporadically started to trigger an
      infinite loop during fsync operations. This turned out to be because
      support for the rename exchange and whiteout operations was added to
      fsstress in fstests. These operations, unlike any others in fsstress,
      cause file names to be reused, hence triggering this issue. However
      it's not necessary to use rename exchange and rename whiteout operations
      to trigger this issue; simple rename operations and file creations are
      enough.
      
      The issue boils down to this: when we are logging inodes that conflict
      (i.e. that had the name of any inode we need to log during the fsync
      operation), we keep logging them even if they were already logged before,
      and after that we check if there's any other inode that conflicts with
      them and then add it again to the list of inodes to log. Skipping inodes
      that were already logged fixes the issue.
      
      Consider the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir                           # inode 257
      
        $ touch /mnt/testdir/zz                        # inode 258
        $ ln /mnt/testdir/zz /mnt/testdir/zz_link
      
        $ touch /mnt/testdir/a                         # inode 259
      
        $ sync
      
        # The following 3 renames achieve the same result as a rename exchange
        # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
      
        $ mv /mnt/testdir/a /mnt/testdir/a/tmp
        $ mv /mnt/testdir/zz_link /mnt/testdir/a
        $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
      
        # The following rename and file creation give the same result as a
        # rename whiteout operation (<rename_whiteout> zz to a2).
      
        $ mv /mnt/testdir/zz /mnt/testdir/a2
        $ touch /mnt/testdir/zz                        # inode 260
      
        $ xfs_io -c fsync /mnt/testdir/zz
          --> results in the infinite loop
      
      The following steps happen:
      
      1) When logging inode 260, we find that its reference named "zz" was
         used by inode 258 in the previous transaction (through the commit
         root), so inode 258 is added to the list of conflicting inodes that
         need to be logged;
      
      2) After logging inode 258, we find that its reference named "a" was
         used by inode 259 in the previous transaction, and therefore we add
         inode 259 to the list of conflicting inodes to be logged;
      
      3) After logging inode 259, we find that its reference named "zz_link"
         was used by inode 258 in the previous transaction - so we add inode
         258 to the list of conflicting inodes to log again, even though we
         had already logged it before at step 2. After logging it again, we
         find again that inode 259 conflicts with it, and we add 259 to the
         list again, etc - we end up repeating all the previous steps.
      
      So fix this by skipping logging of conflicting inodes that were already
      logged.
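
      The following toy C program (purely illustrative, not the kernel code)
      models the fix: remember which inodes were already logged while draining
      the list of conflicting inodes, so the 260 -> 258 -> 259 -> 258 -> ...
      chain from the example above terminates instead of looping forever.

        #include <stdbool.h>
        #include <stdio.h>

        #define MAX_INO 512

        /* Conflict chain from the example: 260 -> 258 -> 259 -> 258 -> ... */
        static int next_conflict(int ino)
        {
                switch (ino) {
                case 260: return 258;
                case 258: return 259;
                case 259: return 258;
                default:  return 0;   /* no conflicting inode */
                }
        }

        int main(void)
        {
                bool logged[MAX_INO] = { false };
                int ino = 260;

                while (ino) {
                        if (logged[ino]) {
                                printf("inode %d already logged, skipping\n", ino);
                                break;
                        }
                        printf("logging inode %d\n", ino);
                        logged[ino] = true;
                        ino = next_conflict(ino);
                }
                return 0;
        }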
      
      Fixes: 6b5fc433 ("Btrfs: fix fsync after succession of renames of different files")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 20 Jan 2020, 4 commits
    • btrfs: Remove redundant WARN_ON in walk_down_log_tree · 36ee0b44
      Authored by Nikolay Borisov
      Checks for 'level < 0' and 'level >= BTRFS_MAX_LEVEL' are already
      performed upon extent buffer read by the tree checker in
      btrfs_check_node. As far as 'level <= 0' goes, we are guaranteed that
      level is '> 0' because the value of level _before_ reading 'next' is
      larger than 1 (otherwise we wouldn't have executed that code at all);
      this in turn guarantees that 'level' after btrfs_read_buffer is
      'level - 1', since we verify this invariant in:
      
          btrfs_read_buffer
           btree_read_extent_buffer_pages
            btrfs_verify_level_key
      
      This guarantees that level can never be '<= 0', so the WARN_ON is
      never triggered.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove WARN_ON in walk_log_tree · 5c4b691e
      Authored by Nikolay Borisov
      The log_root passed to walk_log_tree is guaranteed to have its
      root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is because
      all log roots of an ordinary root are allocated in alloc_log_tree,
      which hard-codes the objectid to be BTRFS_TREE_LOG_OBJECTID.
      
      In case walk_log_tree is called for a log tree found by btrfs_read_fs_root
      in btrfs_recover_log_trees, that function already ensures
      found_key.objectid is BTRFS_TREE_LOG_OBJECTID.
      
      No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent · a0fbf736
      Authored by Nikolay Borisov
      __btrfs_free_reserved_extent now performs the actions of
      btrfs_free_and_pin_reserved_extent. But this name is a bit of a
      misnomer, since the extent is not really freed but just pinned. Reflect
      this in the new name. No semantic changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES · 0e56315c
      Authored by Filipe Manana
      When using the NO_HOLES feature, if we punch a hole into a file and then
      fsync it, there are cases where a subsequent fsync will miss the fact that
      a hole was punched, resulting in the holes not existing after replaying
      the log tree.
      
      Essentially these cases all imply that tree-log.c:copy_items() is not
      invoked for the leafs that delimit holes, because nothing changed those
      leafs in the current transaction. And it's precisely copy_items() where
      we currently detect and log holes, which works as long as the holes are
      between file extent items in the input leaf or between the beginning of
      input leaf and the previous leaf or between the last item in the leaf
      and the next leaf.
      
      First example where we miss a hole:
      
        *) The extent items of the inode span multiple leafs;
      
        *) The punched hole covers a range that affects only the extent items of
           the first leaf;
      
        *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
           is set in the inode's runtime flags).
      
        That results in the hole not existing after replaying the log tree.
      
        For example, if the fs/subvolume tree has the following layout for a
        particular inode:
      
            Leaf N, generation 10:
      
            [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
      
            Leaf N + 1, generation 10:
      
            [ EXTENT_ITEM (128K 64K) ... ]
      
        If at transaction 11 we punch a hole covering the range [0, 128K[, we end
        up dropping the two extent items from leaf N, but we don't touch the other
        leaf, so we end up in the following state:
      
            Leaf N, generation 11:
      
            [ ... INODE_ITEM INODE_REF ]
      
            Leaf N + 1, generation 10:
      
            [ EXTENT_ITEM (128K 64K) ... ]
      
        A full fsync after punching the hole will only process leaf N because it
        was modified in the current transaction, but not leaf N + 1, since it
        was not modified in the current transaction (generation 10 and not 11).
        As a result the fsync will not log any holes, because it didn't process
        any leaf with extent items.
      
      Second example where we will miss a hole:
      
        *) An inode has its items spanning 5 (or more) leafs;
      
        *) A hole is punched and it covers only the extent items of the 3rd
           leaf. This results in deleting the entire leaf and not touching any
           of the other leafs.
      
        So the only leaf that is modified in the current transaction, when
        punching the hole, is the first leaf, which contains the inode item.
        During the full fsync, the only leaf that is passed to copy_items()
        is that first leaf, and that's not enough for the hole detection
        code in copy_items() to determine there's a hole between the last
        file extent item in the 2nd leaf and the first file extent item in
        the 3rd leaf (which was the 4th leaf before punching the hole).
      
      Fix this by scanning all leafs and punching holes as necessary when doing a
      full fsync (less common than a non-full fsync) when the NO_HOLES feature
      is enabled. The lack of explicit file extent items to mark holes makes it
      necessary to scan existing extents to determine if holes exist.
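
      As a rough illustration of the idea (a standalone sketch, not the kernel
      implementation), the hole detection during a full fsync boils down to
      walking the inode's extents in file offset order and logging a hole for
      every gap found, up to i_size:

        #include <stdint.h>
        #include <stdio.h>

        struct extent {
                uint64_t offset;
                uint64_t len;
        };

        /* Walk extents sorted by offset and report every gap up to i_size. */
        static void log_holes(const struct extent *ext, int nr, uint64_t i_size)
        {
                uint64_t next = 0;   /* next expected file offset */

                for (int i = 0; i < nr; i++) {
                        if (ext[i].offset > next)
                                printf("hole: [%llu, %llu)\n",
                                       (unsigned long long)next,
                                       (unsigned long long)ext[i].offset);
                        next = ext[i].offset + ext[i].len;
                }
                if (next < i_size)
                        printf("hole: [%llu, %llu)\n",
                               (unsigned long long)next,
                               (unsigned long long)i_size);
        }

        int main(void)
        {
                /* Layout after punching [0, 128K): only the 128K extent left. */
                struct extent ext[] = { { 128 * 1024, 64 * 1024 } };

                log_holes(ext, 1, 192 * 1024);
                return 0;
        }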
      
      A test case for fstests follows soon.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 13 Dec 2019, 2 commits
    • btrfs: skip log replay on orphaned roots · 9bc574de
      Authored by Josef Bacik
      My fsstress modifications coupled with generic/475 uncovered a failure
      to mount and replay the log if we hit an orphaned root.  We do not want
      to replay the log for an orphan root, but it's completely legitimate to
      have an orphaned root with a log attached.  Fix this by simply skipping
      replaying the log.  We still need to pin its root node so that we do
      not overwrite it while replaying other logs, as we re-read the log root
      at every stage of the replay.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Authored by Filipe Manana
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
      
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
      
         (...)
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      
      The first one covers the range of the second one; they overlap.
      
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for a checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
      
         [ bytenr 13893632, offset 64K, len 64Kb   ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
      
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      
      1) When replaying the file extent item for file offset 320Kb, we
         lookup for the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      
      5) After replaying the log userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing and
         resulting in an -EIO error.
      
      The following steps reproduce this scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
      
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        <power failure>
      
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
      
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      
      Fix this simply by first deleting from the log tree any checksums for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
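
      A toy interval model of the fix (illustrative structures, not the kernel
      checksum code): before copying the checksums for an extent's byte range
      into the log, any previously logged checksum range that overlaps it is
      deleted, so the log can never end up with overlapping items such as
      items 6 and 7 shown above.

        #include <stdint.h>
        #include <stdio.h>

        struct csum_range {
                uint64_t start;
                uint64_t end;    /* exclusive */
                int      valid;  /* 1 = present in the log */
        };

        static void log_csums(struct csum_range *log, int nr,
                              uint64_t start, uint64_t end)
        {
                /* 1) Delete any overlapping range already in the log. */
                for (int i = 0; i < nr; i++)
                        if (log[i].valid && log[i].start < end && log[i].end > start)
                                log[i].valid = 0;

                /* 2) Insert the new range into the first free slot. */
                for (int i = 0; i < nr; i++) {
                        if (!log[i].valid) {
                                log[i] = (struct csum_range){ start, end, 1 };
                                return;
                        }
                }
        }

        int main(void)
        {
                /* Item 7 from the log tree dump above. */
                struct csum_range log[4] = { { 13959168, 14024704, 1 } };

                /* Log the checksums of the 256Kb extent at bytenr 13893632. */
                log_csums(log, 4, 13893632, 13893632 + 256 * 1024);

                for (int i = 0; i < 4; i++)
                        if (log[i].valid)
                                printf("csum range [%llu, %llu)\n",
                                       (unsigned long long)log[i].start,
                                       (unsigned long long)log[i].end);
                return 0;
        }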
      
      This is a long-standing issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 18 Nov 2019, 5 commits
  8. 02 Oct 2019, 1 commit
    • btrfs: fix incorrect updating of log root tree · 4203e968
      Authored by Josef Bacik
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modifications of the individual log root items are serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
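
      The locking pattern described above can be sketched in plain C with
      pthreads (a model of the pattern only, not the btrfs code): the root
      item is copied into a local variable first, and the shared copy in the
      log root tree is only updated while holding the mutex, so a concurrent
      writeout never observes a half-updated item.

        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        struct root_item {
                uint64_t bytenr;
                uint64_t generation;
        };

        static struct root_item log_root_tree_item;   /* shared state */
        static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

        static void update_log_root(const struct root_item *cur)
        {
                /* 1) Snapshot the log root's item outside the lock. */
                struct root_item local = *cur;

                /* 2) Publish it only while holding the mutex. */
                pthread_mutex_lock(&log_mutex);
                log_root_tree_item = local;
                pthread_mutex_unlock(&log_mutex);
        }

        int main(void)
        {
                struct root_item cur = { .bytenr = 30523392, .generation = 7 };

                update_log_root(&cur);
                printf("log root item: bytenr %llu, gen %llu\n",
                       (unsigned long long)log_root_tree_item.bytenr,
                       (unsigned long long)log_root_tree_item.generation);
                return 0;
        }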
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Chris Mason <clm@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 12 Sep 2019, 1 commit
    • Btrfs: fix assertion failure during fsync and use of stale transaction · 410f954c
      Authored by Filipe Manana
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
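
      A userspace sketch of the delayed-iput idea (names are illustrative, not
      the kernel API): the logging path never performs what might be the final
      reference drop itself, it only queues the inode on a list that a safe
      background context processes later.

        #include <stdio.h>
        #include <stdlib.h>

        struct fake_inode {
                int refcount;
                struct fake_inode *next_delayed;  /* delayed-put list link */
        };

        static struct fake_inode *delayed_head;

        /* Called from the logging path: defer instead of releasing inline. */
        static void add_delayed_put(struct fake_inode *inode)
        {
                inode->next_delayed = delayed_head;
                delayed_head = inode;
        }

        /* Called later from a safe context (the "cleaner"). */
        static void run_delayed_puts(void)
        {
                while (delayed_head) {
                        struct fake_inode *inode = delayed_head;

                        delayed_head = inode->next_delayed;
                        if (--inode->refcount == 0) {
                                printf("evicting inode %p in safe context\n",
                                       (void *)inode);
                                free(inode);
                        }
                }
        }

        int main(void)
        {
                struct fake_inode *inode = calloc(1, sizeof(*inode));

                inode->refcount = 1;     /* logging path holds the last ref */
                add_delayed_put(inode);  /* instead of an inline final put */
                run_delayed_puts();      /* eviction happens off the fsync path */
                return 0;
        }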
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which led to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 09 Sep 2019, 7 commits
  11. 02 Jul 2019, 2 commits
    • Btrfs: fix fsync not persisting dentry deletions due to inode evictions · 803f0f64
      Authored by Filipe Manana
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When any of the inodes are logged,
      we proceed to delete directory items and inode reference items from the
      log, to ensure that a subsequent fsync of only the inode being unlinked,
      or only of the parent directory when the other is not fsync'ed as well,
      does not result in the entry still existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and loaded into memory again (which is
      set to zero on load). As a consequence the checks currently being done by
      btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
      return true if the inode was evicted before, regardless of the inode
      having been logged before or not (and in the current transaction). This
      results in the unlinked dentry still existing after a log replay if,
      after the unlink operation, only one of the inodes involved is fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set every time an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_sync
      flag set).
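
      A standalone model of that check (field and flag names follow the commit
      text, this is not the kernel code): an inode is considered possibly
      logged if it was logged in the current transaction, or if it may have
      been logged and then evicted, i.e. it was last modified in the current
      transaction and still carries the full_sync flag that gets set whenever
      it is loaded from disk.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct model_inode {
                uint64_t logged_trans;  /* transient, reset to 0 on eviction */
                uint64_t last_trans;    /* transaction of last modification */
                bool     full_sync;     /* set when loaded from disk */
        };

        static bool inode_possibly_logged(const struct model_inode *inode,
                                          uint64_t cur_trans)
        {
                if (inode->logged_trans == cur_trans)
                        return true;
                /* Evicted and reloaded: logged_trans is 0, be conservative. */
                return inode->full_sync && inode->last_trans == cur_trans;
        }

        int main(void)
        {
                /* Logged, then evicted and reloaded within transaction 7:
                 * logged_trans was lost, but the unlink path must still go
                 * and delete the entry from the log. */
                struct model_inode foo = { .logged_trans = 0, .last_trans = 7,
                                           .full_sync = true };

                printf("treat as possibly logged: %d\n",
                       inode_possibly_logged(&foo, 7));
                return 0;
        }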
      
      Fixes: 3a5f1d45 ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix data loss after inode eviction, renaming it, and fsync it · d1d832a0
      Authored by Filipe Manana
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      delalloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure delalloc gets flushed and
        # completes before we do the inode eviction later, and we can not do that
        # from user space with call to things such as sync(2) since that results
        # in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit).
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set); this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always set, since that alone is used to track if
      names need to be deleted as part of unlink operations.
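
      As a rough sketch of the mechanism (field names are illustrative, this
      is not the kernel code): logging only the inode's existence must not
      advance the marker that a later fsync compares against, so a
      btrfs_inode_in_log() style check still reports that the inode has
      changes left to log.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct model_inode {
                uint64_t logged_trans;     /* used for unlink decisions */
                uint64_t last_log_commit;  /* last change persisted in the log */
                uint64_t last_sub_trans;   /* last change made to the inode */
                bool     full_sync;        /* set when loaded from disk */
        };

        /* exists_only mirrors the "log only that the inode exists" mode. */
        static void log_inode(struct model_inode *inode, uint64_t trans,
                              bool exists_only)
        {
                inode->logged_trans = trans;  /* always recorded */

                /* The fix: don't mark the inode as up to date in the log when
                 * only its existence was logged and it was never fully logged
                 * since being loaded from disk. */
                if (exists_only && inode->full_sync)
                        return;
                inode->last_log_commit = inode->last_sub_trans;
        }

        static bool inode_in_log(const struct model_inode *inode, uint64_t trans)
        {
                return inode->logged_trans == trans &&
                       inode->last_log_commit == inode->last_sub_trans;
        }

        int main(void)
        {
                /* Freshly loaded (previously evicted) inode that was written
                 * to and then renamed, which logs only that it exists. */
                struct model_inode bar = { .last_sub_trans = 3, .full_sync = true };

                log_inode(&bar, 7, true);
                printf("later fsync can be skipped: %d\n", inode_in_log(&bar, 7));
                return 0;
        }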
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>