1. 22 Nov 2014, 2 commits
    • Btrfs: don't ignore log btree writeback errors · 5ab5e44a
      Authored by Filipe Manana
      If an error happens during writeback of log btree extents, make sure the
      error is returned to the caller (fsync), so that it takes proper action
      (committing the current transaction) instead of writing a superblock that
      points to log btrees with some or all nodes that weren't durably persisted.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Authored by Josef Bacik
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that: we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following sequence
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are safe, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction, make sure we wait for all of those ordered
      extents to complete before proceeding.  This makes sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation (a generic sketch of this
      wait-at-commit pattern follows this entry).  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
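
      The fix described above boils down to "remember every extent logged in this
      transaction and make the commit wait for its ordered extent to complete".
      Below is a minimal, self-contained analogy in userspace C (pthreads), not
      btrfs code; the function names are purely illustrative stand-ins for the
      kernel-side mechanism.

          /* Analogy only, not btrfs code: a "transaction commit" must wait for
           * every piece of work logged during the transaction before the
           * transaction can be considered durable.  Build with: cc demo.c -lpthread */
          #include <pthread.h>
          #include <stdbool.h>
          #include <stdio.h>
          #include <unistd.h>

          static bool ordered_done;                    /* the logged extent's state */
          static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

          /* "ordered extent completes": final metadata for the extent is ready. */
          static void *complete_ordered_extent(void *arg)
          {
                  (void)arg;
                  usleep(1000);                        /* the I/O finishes late */
                  pthread_mutex_lock(&lock);
                  ordered_done = true;
                  pthread_cond_signal(&cond);
                  pthread_mutex_unlock(&lock);
                  return NULL;
          }

          /* "commit transaction": wait for everything logged in this transaction
           * instead of racing with its completion, as in the power-fail scenario. */
          static void commit_transaction(void)
          {
                  pthread_mutex_lock(&lock);
                  while (!ordered_done)
                          pthread_cond_wait(&cond, &lock);
                  pthread_mutex_unlock(&lock);
                  printf("commit done: logged extent completed first\n");
          }

          int main(void)
          {
                  pthread_t t;
                  /* write extent 0-4k and log it; its ordered extent is in flight */
                  pthread_create(&t, NULL, complete_ordered_extent, NULL);
                  commit_transaction();
                  pthread_join(&t, NULL);
                  return 0;
          }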
  2. 21 Nov 2014, 1 commit
    • Btrfs: make sure we wait on logged extents when fsyncing two subvols · 9dba8cf1
      Authored by Josef Bacik
      If two fsync()'s race on different subvols, one will do all of its work
      to get into the log_tree, wait on its outstanding IO, and then allow the
      log_tree to finish its commit.  The problem is we were just freeing that
      subvol's logged extents instead of waiting on them, so whoever lost the race
      wouldn't really have their data on disk.  Fix this by waiting properly instead
      of freeing the logged extents (a userspace sketch of the scenario follows this
      entry).  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
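
      To make the user-visible contract concrete, here is a small POSIX C sketch
      of the scenario (not btrfs code): two tasks fsync files that live in
      different subvolumes at the same time, and each fsync must not return before
      that task's own data is durable.  The paths /mnt/subvol1 and /mnt/subvol2
      are assumptions for illustration.

          /* Two concurrent fsyncs on different subvolumes; each must guarantee
           * durability of its own file.  Build with: cc demo.c -lpthread */
          #include <fcntl.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static void *write_and_fsync(void *arg)
          {
                  const char *path = arg;
                  int fd = open(path, O_WRONLY | O_CREAT, 0644);
                  if (fd < 0) {
                          perror(path);
                          return NULL;
                  }
                  ssize_t n = write(fd, "data\n", 5);
                  if (n != 5 || fsync(fd) != 0)   /* fsync must not return before
                                                     this file's data is on disk */
                          perror(path);
                  close(fd);
                  return NULL;
          }

          int main(void)
          {
                  pthread_t a, b;
                  /* assumed paths: two files in two different btrfs subvolumes */
                  pthread_create(&a, NULL, write_and_fsync, "/mnt/subvol1/foo");
                  pthread_create(&b, NULL, write_and_fsync, "/mnt/subvol2/bar");
                  pthread_join(a, NULL);
                  pthread_join(b, NULL);
                  return 0;
          }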
  3. 28 Oct 2014, 1 commit
    • Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd
      Authored by Filipe Manana
      If we couldn't find our extent item, we accessed the current slot
      (path->slots[0]) to check if it corresponds to an equivalent skinny
      metadata item. However this slot could be beyond our last item in the
      leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
      we shouldn't process it.
      
      Since btrfs_lookup_extent() is only used to find extent items for data
      extents, fix this by completely removing the logic that looks up an
      equivalent skinny metadata item, since one cannot exist (a minimal sketch of
      the out-of-bounds slot hazard follows this entry).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
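
      The underlying hazard is generic to any sorted search that can miss: the
      returned slot is an insertion position, which may point one past the last
      valid item.  The sketch below is plain, self-contained C over a sorted array
      (it does not use btrfs structures); the final check mirrors the
      path->slots[0] >= btrfs_header_nritems(leaf) test described above.

          /* Sketch: a search miss can return an insertion slot past the last item. */
          #include <stdio.h>

          /* Lower-bound search: returns the slot where 'key' is or would be inserted. */
          static int search_slot(const int *items, int nritems, int key)
          {
                  int lo = 0, hi = nritems;
                  while (lo < hi) {
                          int mid = lo + (hi - lo) / 2;
                          if (items[mid] < key)
                                  lo = mid + 1;
                          else
                                  hi = mid;
                  }
                  return lo;
          }

          int main(void)
          {
                  int leaf[] = { 10, 20, 30 };        /* stand-in for items in a leaf */
                  int nritems = 3;
                  int slot = search_slot(leaf, nritems, 40);  /* miss: slot == 3 */

                  /* The bug pattern: inspecting leaf[slot] without this check reads
                   * past the last item. */
                  if (slot >= nritems)
                          printf("slot %d is beyond the last item; nothing to inspect\n", slot);
                  else
                          printf("found candidate item %d at slot %d\n", leaf[slot], slot);
                  return 0;
          }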
  4. 19 Sep 2014, 1 commit
    • Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Authored by Filipe Manana
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO.  When such an error happens, it means we have parts
      of the on-disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This is in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree.  In that path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return an -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (such as
      redoing the writes) - and definitely not leave any file extent items in the
      log that refer to extents which weren't fully written (a userspace sketch of
      such error handling follows this entry).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
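
      From the application's side, the behaviour this commit restores is exactly
      why fsync() return values must be checked.  The sketch below is ordinary
      userspace POSIX C, not btrfs code; the path /tmp/fsync-demo.txt and the
      helper name write_durably() are illustrative assumptions.

          /* Minimal sketch (not btrfs code): check fsync() and react to errors. */
          #include <errno.h>
          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          static int write_durably(const char *path, const void *buf, size_t len)
          {
                  int fd = open(path, O_WRONLY | O_CREAT, 0644);
                  if (fd < 0)
                          return -errno;

                  ssize_t written = write(fd, buf, len);
                  if (written < 0 || (size_t)written != len) {
                          int err = written < 0 ? -errno : -EIO;
                          close(fd);
                          return err;
                  }

                  /*
                   * With this commit, errors hit by the ordered (extent write)
                   * operations surface here instead of being silently dropped,
                   * so the application knows the data may not be on disk.
                   */
                  if (fsync(fd) < 0) {
                          int err = -errno;
                          close(fd);
                          return err;   /* caller may redo the writes */
                  }

                  return close(fd) < 0 ? -errno : 0;
          }

          int main(void)
          {
                  const char msg[] = "important data\n";
                  int ret = write_durably("/tmp/fsync-demo.txt", msg, sizeof(msg) - 1);
                  if (ret)
                          fprintf(stderr, "not durable: %s\n", strerror(-ret));
                  return ret ? 1 : 0;
          }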
  5. 18 Sep 2014, 4 commits
    • Btrfs: fix directory recovery from fsync log · a2cc11db
      Authored by Filipe Manana
      When replaying a directory from the fsync log, if a directory entry
      exists both in the fs/subvol tree and in the log, the directory's inode
      got its i_size updated incorrectly, accounting for the dentry's name
      twice.
      
      Reproducer, from a test for xfstests:
      
          _scratch_mkfs >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          touch $SCRATCH_MNT/foo
          sync
      
          touch $SCRATCH_MNT/bar
          xfs_io -c "fsync" $SCRATCH_MNT
          xfs_io -c "fsync" $SCRATCH_MNT/bar
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
          [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"
      
          _unmount_flakey
          _check_scratch_fs $FLAKEY_DEV
      
      The filesystem check at the end failed with the message:
      "root 5 root dir 256 error".
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make btrfs_search_forward return with nodes unlocked · f98de9b9
      Authored by Filipe Manana
      None of the uses of btrfs_search_forward() need to have the path
      nodes (level >= 1) read locked, only the leaf needs to be locked
      while the caller processes it. Therefore make it return a path
      with all nodes unlocked, except for the leaf.
      
      This change is motivated by the observation that during a file
      fsync we repeatedly call btrfs_search_forward() and process the
      returned leaf while upper nodes of the returned path (level >= 1)
      are read locked, which unnecessarily blocks other tasks that want
      to write to the same fs/subvol btree.
      Therefore instead of modifying the fsync code to unlock all nodes
      with level >= 1 immediately after calling btrfs_search_forward(),
      change btrfs_search_forward() to do it, so that it benefits all
      callers.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use nodesize everywhere, kill leafsize · 707e8a07
      Authored by David Sterba
      The nodesize and leafsize never had different values.  Unify the
      usage and make nodesize the only one.  Clean up the redundant checks and
      helpers.
      
      Shaves a few bytes from .text:
      
        text    data     bss     dec     hex filename
      852418   24560   23112  900090   dbbfa btrfs.ko.before
      851074   24584   23112  898770   db6d2 btrfs.ko.after
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: kill the key type accessor helpers · 962a298f
      Authored by David Sterba
      btrfs_set_key_type and btrfs_key_type are used inconsistently along with
      open coded variants.  Other members of btrfs_key are accessed directly
      without any helpers anyway, so drop the type helpers too (a before/after
      sketch follows this entry).
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
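
      A hedged before/after illustration of what this cleanup amounts to.  The
      struct below is a simplified stand-in for btrfs_key (same field names, but
      not the kernel definition), and set_key_type()/DEMO_EXTENT_DATA_KEY are
      illustrative stand-ins for the removed helper and a key type constant.

          /* Simplified stand-in for struct btrfs_key; not the kernel definition. */
          #include <stdint.h>
          #include <stdio.h>

          struct demo_key {
                  uint64_t objectid;
                  uint8_t  type;
                  uint64_t offset;
          };

          #define DEMO_EXTENT_DATA_KEY 108   /* illustrative value only */

          /* The kind of trivial wrapper this commit removes. */
          static void set_key_type(struct demo_key *key, uint8_t type)
          {
                  key->type = type;
          }

          int main(void)
          {
                  struct demo_key key = { .objectid = 257, .offset = 0 };

                  set_key_type(&key, DEMO_EXTENT_DATA_KEY);   /* before: via helper */
                  key.type = DEMO_EXTENT_DATA_KEY;            /* after: direct access */

                  printf("key (%llu %u %llu)\n",
                         (unsigned long long)key.objectid, (unsigned)key.type,
                         (unsigned long long)key.offset);
                  return 0;
          }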
  6. 17 Sep 2014, 1 commit
    • Btrfs: set inode's logged_trans/last_log_commit after ranged fsync · 125c4cf9
      Authored by Filipe Manana
      When a ranged fsync finishes, if there are still extent maps in the modified
      list, still set the inode's logged_trans and last_log_commit.  This is important
      in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
      inode ref gets deleted from the log and the respective dentries in its parent
      directory are deleted from the log too (if the parent directory was fsync'ed in
      the same transaction).
      
      Instead make btrfs_inode_in_log() return false if the list of modified extent
      maps isn't empty.
      
      This is an incremental change on top of the v4 version of the patch:
      
          "Btrfs: fix fsync data loss after a ranged fsync"
      
      which was added to its v5, but didn't make it on time.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 09 Sep 2014, 1 commit
    • Btrfs: fix fsync data loss after a ranged fsync · 49dae1bc
      Authored by Filipe Manana
      While we're doing a full fsync (when the inode has the flag
      BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
      portion of the file), we might have ordered operations that are started
      before or while we're logging the inode and that fall outside the fsync
      range.
      
      Therefore when a full ranged fsync finishes, don't remove every extent
      map from the list of modified extent maps - for some of them, which fall
      outside our fsync range, their respective ordered operation hasn't
      finished yet, meaning the corresponding file extent item wasn't inserted
      into the fs/subvol tree yet and therefore we didn't log it.  We must
      let the next fast fsync (one that checks only the modified list) see this
      extent map, log a matching file extent item to the log btree and wait
      for its ordered operation to finish (if it's still ongoing).
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 26 Aug 2014, 1 commit
  9. 21 Aug 2014, 1 commit
    • Btrfs: fix hole detection during file fsync · 74121f7c
      Authored by Filipe Manana
      The file hole detection logic during a file fsync wasn't correct,
      because it didn't look back (in a previous leaf) for the last file
      extent item that can be in a leaf to the left of our leaf and that
      has a generation lower than the current transaction id. This made it
      assume that a hole exists when it really doesn't exist in the file.
      
      Such false positive hole detection happens in the following scenario:
      
      * We have a file that has many file extent items, covering 3 or more
        btree leafs (the first leaf must contain non file extent items too).
      
      * Two ranges of the file are modified, with their extent items being
        located at 2 different leafs and those leafs aren't consecutive.
      
      * When processing the second modified leaf, we weren't checking if
        some file extent item exists that is located in some leaf that is
        between our 2 modified leafs, and therefore assumed the range defined
        between the last file extent item in the first leaf and the first file
        extent item in the second leaf matched a hole.
      
      Fortunately this didn't result in overwriting the log with wrong data;
      instead it made the last loop in copy_items() attempt to insert a
      duplicated key (for a hole file extent item), which makes the file
      fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
      turn ends up doing a full transaction commit, which is much more
      expensive than writing only to the log tree and waiting for it to be
      durably persisted (as well as the file's modified extents/pages).
      Therefore fix the hole detection logic, so that we don't pay the
      cost of doing full transaction commits.
      
      I could trigger this issue with the following test for xfstests (which
      never fails, with or without this patch).  The last fsync call
      results in a full transaction commit, due to the -EEXIST error mentioned
      above.  I could also observe this behaviour happening frequently when
      running xfstests/generic/075 in a loop.
      
      Test:
      
          _cleanup()
          {
              _cleanup_flakey
              rm -fr $tmp
          }
      
          # get standard environment, filters and checks
          . ./common/rc
          . ./common/filter
          . ./common/dmflakey
      
          # real QA test starts here
          _supported_fs btrfs
          _supported_os Linux
          _require_scratch
          _require_dm_flakey
          _need_to_be_root
      
          rm -f $seqres.full
      
          # Create a file with many file extent items, each representing a 4Kb extent.
          # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
          # as of btrfs-progs 3.12).
          _scratch_mkfs -l 16384 >/dev/null 2>&1
          _init_flakey
          SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
          MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
          _mount_flakey
      
          # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
          $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
      
          # For any of the following fsync calls, inode doesn't have the flag
          # BTRFS_INODE_NEEDS_FULL_SYNC set.
          for ((i = 1; i <= 500; i++)); do
              OFFSET=$((4096 * i))
              LEN=4096
              $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                      $SCRATCH_MNT/foo | _filter_xfs_io
          done
      
          # Commit transaction and bump next transaction's id (to 7).
          sync
      
          # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
          # inode runtime flags.
          $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo
      
          # Commit transaction and bump next transaction's id (to 8).
          sync
      
          # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
          # in the middle, containing only file extent items, isn't touched. So the
          # next fsync, when calling btrfs_search_forward(), won't visit that middle
          # leaf. First and 3rd leaf have now a generation with value 8, while the
          # middle leaf remains with a generation with value 6.
          $XFS_IO_PROG \
              -c "pwrite -S 0xee -b 4096 0 4096" \
              -c "pwrite -S 0xff -b 4096 2043904 4096" \
              -c "fsync" \
              $SCRATCH_MNT/foo | _filter_xfs_io
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          # During mount, we'll replay the log created by the fsync above, and the file's
          # md5 digest should be the same we got before the unmount.
          _mount_flakey
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
          MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
      
          status=0
          exit
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 10 Jun 2014, 2 commits
  11. 11 Mar 2014, 9 commits
  12. 29 Jan 2014, 5 commits
    • Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Authored by Chris Mason
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fix uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
    • Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum · 23c671a5
      Authored by Miao Xie
      The performance of fsync sometimes dropped suddenly.  The main reason
      for this was that we might flush only part of the dirty pages in an ordered
      extent, then get that ordered extent and wait for the csum calculation.  But if
      no task flushed the remaining part, we would wait until the flusher flushed them,
      sometimes for several seconds, which made the performance drop
      suddenly.  (On my box, it dropped from 56MB/s to 4-10MB/s.)
      
      This patch addresses the problem by flushing the remaining dirty pages aggressively.
      
      Test Environment:
      CPU:		2CPU * 2Cores
      Memory:		4GB
      Partition:	20GB(HDD)
      
      Test Command:
       # sysbench --num-threads=8 --test=fileio --file-num=1 \
       > --file-total-size=8G --file-block-size=32768 \
       > --file-io-mode=sync --file-fsync-freq=100 \
       > --file-fsync-end=no --max-requests=10000 \
       > --file-test-mode=rndwr run
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: faster file extent item replace operations · 1acae57b
      Authored by Filipe David Borba Manana
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up.  Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides file writes, this is an operation that happens for file fsyncs
      as well.  However, log btrees are much less likely to be as big as regular
      fs btrees, therefore the impact of this change is smaller there.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: expand btrfs_find_item() to include find_orphan_item functionality · 3f870c28
      Authored by Kelley Nielsen
      This is the third step in bootstrapping the btrfs_find_item interface.
      The function find_orphan_item(), in orphan.c, is similar to the two
      functions already replaced by the new interface. It uses two parameters,
      which are already present in the interface, and is nearly identical to
      the function brought in in the previous patch.
      
      Replace the two calls to find_orphan_item() with calls to
      btrfs_find_item(), with the defined objectid and type that were used
      internally by find_orphan_item(), a null path, and a null key.  Add a
      check for a null path to btrfs_find_item(); if the path is null, allocate
      and free one internally.  Finally, remove find_orphan_item() (a simplified
      sketch of this null-path convention follows this entry).
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
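
      The "null path" convention added to btrfs_find_item() can be sketched in
      isolation: if the caller passes no path, allocate a temporary one and free
      it before returning.  Everything below (demo_find_item, demo_path,
      do_search) is an illustrative stand-in, not the kernel API.

          /* Sketch of the "NULL path means allocate a temporary one" convention.
           * All names are illustrative, not the btrfs kernel API. */
          #include <errno.h>
          #include <stdbool.h>
          #include <stdio.h>
          #include <stdlib.h>

          struct demo_path {
                  int slot;                     /* placeholder for real search state */
          };

          static int do_search(struct demo_path *path, unsigned long long objectid)
          {
                  path->slot = 0;               /* pretend the item was found */
                  return objectid ? 0 : -ENOENT;
          }

          static int demo_find_item(struct demo_path *path, unsigned long long objectid)
          {
                  struct demo_path *tmp = path;
                  bool free_path = false;
                  int ret;

                  if (!tmp) {                   /* caller doesn't care about the path */
                          tmp = calloc(1, sizeof(*tmp));
                          if (!tmp)
                                  return -ENOMEM;
                          free_path = true;
                  }

                  ret = do_search(tmp, objectid);

                  if (free_path)
                          free(tmp);            /* drop the temporary path */
                  return ret;
          }

          int main(void)
          {
                  /* orphan-item style caller: only cares whether the item exists */
                  printf("lookup: %s\n",
                         demo_find_item(NULL, 256) == 0 ? "found" : "missing");
                  return 0;
          }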
    • Btrfs: incompatible format change to remove hole extents · 16e7549f
      Authored by Josef Bacik
      Btrfs has always had these filler extent data items for holes in inodes.  This
      has made some things very easy, like logging hole punches and sending hole
      punches.  However, for large holey files these extent data items are pure
      overhead.  So add an incompatible feature to no longer add hole extents, to
      reduce the amount of metadata used by this sort of file.  This requires a few
      changes for logging and send, obviously, since they will need to detect holes
      and log/send the holes if there are any.  I've tested this thoroughly with
      xfstests and it doesn't cause any issues with or without the incompat format
      set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 21 Nov 2013, 2 commits
  14. 12 Nov 2013, 9 commits