1. 11 4月, 2015 1 次提交
  2. 27 3月, 2015 1 次提交
    • F
      Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Filipe Manana 提交于
      We can get into inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      the new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller then the
      correct value). This later leads to stale file handle errors when
      accessing (including attempt to delete) some of the links if all the
      other ones are removed, which also implies impossibility to delete the
      parent directories, since the dentries can not be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but, the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework everytime a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2f2ff0ee
  3. 15 2月, 2015 1 次提交
    • Z
      btrfs: Fix out-of-space bug · 13212b54
      Zhao Lei 提交于
      Btrfs will report NO_SPACE when we create and remove files for several times,
      and we can't write to filesystem until mount it again.
      
      Steps to reproduce:
       1: Create a single-dev btrfs fs with default option
       2: Write a file into it to take up most fs space
       3: Delete above file
       4: Wait about 100s to let chunk removed
       5: goto 2
      
      Script is like following:
       #!/bin/bash
      
       # Recommend 1.2G space, too large disk will make test slow
       DEV="/dev/sda16"
       MNT="/mnt/tmp"
      
       dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
       file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
      
       echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
      
       for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
       echo "mkfs $DEV"
       mkfs.btrfs -f "$DEV" >/dev/null || exit 2
       echo "mount $DEV $MNT"
       mount "$DEV" "$MNT" || exit 2
      
       for ((loop_i = 0; loop_i < 20; loop_i++)); do
           echo
           echo "loop $loop_i"
      
           echo "dd file..."
           cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
           "${cmd[@]}" 2>/dev/null || {
               # NO_SPACE error triggered
               echo "dd failed: ${cmd[*]}"
               exit 1
           }
      
           echo "rm file..."
           rm -f "$MNT"/file0 || exit 2
      
           for ((i = 0; i < 10; i++)); do
               df "$MNT" | tail -1
               sleep 10
           done
       done
      
      Reason:
       It is triggered by commit: 47ab2a6c
       which is used to remove empty block groups automatically, but the
       reason is not in that patch. Code before works well because btrfs
       don't need to create and delete chunks so many times with high
       complexity.
       Above bug is caused by many reason, any of them can trigger it.
      
      Reason1:
       When we remove some continuous chunks but leave other chunks after,
       these disk space should be used by chunk-recreating, but in current
       code, only first create will successed.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason2:
       contains_pending_extent() return wrong value in calculation.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason3:
       btrfs_check_data_free_space() try to commit transaction and retry
       allocating chunk when the first allocating failed, but space_info->full
       is set in first allocating, and prevent second allocating in retry.
       Fixed in this patch by clear space_info->full in commit transaction.
      
       Tested for severial times by above script.
      
      Changelog v3->v4:
       use light weight int instead of atomic_t to record have_remove_bgs in
       transaction, suggested by:
       Josef Bacik <jbacik@fb.com>
      
      Changelog v2->v3:
       v2 fixed the bug by adding more commit-transaction, but we
       only need to reclaim space when we are really have no space for
       new chunk, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
      
       Actually, our code already have this type of commit-and-retry,
       we only need to make it working with removed-bgs.
       v3 fixed the bug with above way.
      
      Changelog v1->v2:
       v1 will introduce a new bug when delete and create chunk in same disk
       space in same transaction, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
       V2 fix this bug by commit transaction after remove block grops.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Suggested-by: NFilipe David Manana <fdmanana@gmail.com>
      Suggested-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      13212b54
  4. 22 1月, 2015 1 次提交
    • J
      Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik 提交于
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list for the block groups that we get added to
      when we dirty the block group for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ce93ec54
  5. 22 11月, 2014 1 次提交
    • J
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik 提交于
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50d9aa99
  6. 21 11月, 2014 1 次提交
    • F
      Btrfs: deal with convert_extent_bit errors to avoid fs corruption · 663dfbb0
      Filipe Manana 提交于
      When committing a transaction or a log, we look for btree extents that
      need to be durably persisted by searching for ranges in a io tree that
      have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
      those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
      convert_extent_bit, and then start writeback for the extents.
      
      That function however can return an error (at the moment only -ENOMEM
      is possible, specially when it does GFP_ATOMIC allocation requests
      through alloc_extent_state_atomic) - that means the ranges didn't got
      the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
      which in turn means a call to btrfs_wait_marked_extents() won't find
      those ranges for which we started writeback, causing a transaction
      commit or a log commit to persist a new superblock without waiting
      for the writeback of extents in that range to finish first.
      
      Therefore if a crash happens after persisting the new superblock and
      before writeback finishes, we have a superblock pointing to roots that
      weren't fully persisted or roots that point to nodes or leafs that weren't
      fully persisted, causing all sorts of unexpected/bad behaviour as we endup
      reading garbage from disk or the content of some node/leaf from a past
      generation that got cowed or deleted and is no longer valid (for this later
      case we end up getting error messages like "parent transid verify failed on
      X wanted Y found Z" when reading btree nodes/leafs from disk).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      663dfbb0
  7. 12 11月, 2014 1 次提交
    • D
      btrfs: add support for processing pending changes · 572d9ab7
      David Sterba 提交于
      There are some actions that modify global filesystem state but cannot be
      performed at the time of request, but later at the transaction commit
      time when the filesystem is in a known state.
      
      For example enabling new incompat features on-the-fly or issuing
      transaction commit from unsafe contexts (sysfs handlers).
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      572d9ab7
  8. 02 10月, 2014 1 次提交
  9. 15 8月, 2014 1 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
  10. 10 6月, 2014 1 次提交
    • J
      Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Josef Bacik 提交于
      This exercises the various parts of the new qgroup accounting code.  We do some
      basic stuff and do some things with the shared refs to make sure all that code
      works.  I had to add a bunch of infrastructure because I needed to be able to
      insert items into a fake tree without having to do all the hard work myself,
      hopefully this will be usefull in the future.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      faa2dbf0
  11. 07 4月, 2014 2 次提交
    • J
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik 提交于
      Lets try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reportedy-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e351cc8
    • J
      Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik 提交于
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't old tree locks theres nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidently read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a26e8c9f
  12. 29 1月, 2014 2 次提交
    • J
      Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik 提交于
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5039eddc
    • M
      Btrfs: remove btrfs_end_transaction_dmeta() · a56dbd89
      Miao Xie 提交于
      Two reasons:
      - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
        so it is unnecessary.
      - All the delayed items should be dealt in the current transaction, so the
        workers should not commit the transaction, instead, deal with the delayed
        items as many as possible.
      
      So we can remove btrfs_end_transaction_dmeta()
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a56dbd89
  13. 12 11月, 2013 2 次提交
    • M
      Btrfs: fix BUG_ON() casued by the reserved space migration · 20dd2cbf
      Miao Xie 提交于
      When we did space balance and snapshot creation at the same time, we might
      meet the following oops:
       kernel BUG at fs/btrfs/inode.c:3038!
       [SNIP]
       Call Trace:
       [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
       [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
       [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
       [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
       [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
       [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
       [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
       [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
       [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
       [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
       [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
       [SNIP]
       RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
      
      The reason of the problem is that the relocation root creation stole
      the reserved space, which was reserved for orphan item deletion.
      
      There are several ways to fix this problem, one is to increasing
      the reserved space size of the space balace, and then we can use
      that space to create the relocation tree for each fs/file trees.
      But it is hard to calculate the suitable size because we doesn't
      know how many fs/file trees we need relocate.
      
      We fixed this problem by reserving the space for relocation root creation
      actively since the space it need is very small (one tree block, used for
      root node copy), then we use that reserved space to create the
      relocation tree. If we don't reserve space for relocation tree creation,
      we will use the reserved space of the balance.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      20dd2cbf
    • J
      Btrfs: fix two use-after-free bugs with transaction cleanup · 724e2315
      Josef Bacik 提交于
      I was noticing the slab redzone stuff going off every once and a while during
      transaction aborts.  This was caused by two things
      
      1) We would walk the pending snapshots and set their error to -ECANCELED.  We
      don't need to do this, the snapshot stuff waits for a transaction commit and if
      there is a problem we just free our pending snapshot object and exit.  Doing
      this was causing us to touch the pending snapshot object after the thing had
      already been freed.
      
      2) We were freeing the transaction manually with wanton disregard for it's
      use_count reference counter.  To fix this I cleaned up the transaction freeing
      loop to either wait for the transaction commit to finish if it was in the middle
      of that (since it will be cleaned and freed up there) or to do the cleanup
      oursevles.
      
      I also moved the global "kill all things dirty everywhere" stuff outside of the
      transaction cleanup loop since that only needs to be done once.  With this patch
      I'm no longer seeing slab corruption because of use after frees.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      724e2315
  14. 01 9月, 2013 1 次提交
  15. 10 8月, 2013 1 次提交
  16. 02 7月, 2013 1 次提交
    • J
      Btrfs: make the chunk allocator completely tree lockless · 6df9a95e
      Josef Bacik 提交于
      When adjusting the enospc rules for relocation I ran into a deadlock because we
      were relocating the only system chunk and that forced us to try and allocate a
      new system chunk while holding locks in the chunk tree, which caused us to
      deadlock.  To fix this I've moved all of the dev extent addition and chunk
      addition out to the delayed chunk completion stuff.  We still keep the in-memory
      stuff which makes sure everything is consistent.
      
      One change I had to make was to search the commit root of the device tree to
      find a free dev extent, and hold onto any chunk em's that we allocated in that
      transaction so we do not allocate the same dev extent twice.  This has the side
      effect of fixing a bug with balance that has been there ever since balance
      existed.  Basically you can free a block group and it's dev extent and then
      immediately allocate that dev extent for a new block group and write stuff to
      that dev extent, all within the same transaction.  So if you happen to crash
      during a balance you could come back to a completely broken file system.  This
      patch should keep these sort of things from happening in the future since we
      won't be able to allocate free'd dev extents until after the transaction
      commits.  This has passed all of the xfstests and my super annoying stress test
      followed by a balance.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6df9a95e
  17. 14 6月, 2013 3 次提交
    • M
      Btrfs: make the state of the transaction more readable · 4a9d8bde
      Miao Xie 提交于
      We used 3 variants to track the state of the transaction, it was complex
      and wasted the memory space. Besides that, it was hard to understand that
      which types of the transaction handles should be blocked in each transaction
      state, so the developers often made mistakes.
      
      This patch improved the above problem. In this patch, we define 6 states
      for the transaction,
        enum btrfs_trans_state {
      	TRANS_STATE_RUNNING		= 0,
      	TRANS_STATE_BLOCKED		= 1,
      	TRANS_STATE_COMMIT_START	= 2,
      	TRANS_STATE_COMMIT_DOING	= 3,
      	TRANS_STATE_UNBLOCKED		= 4,
      	TRANS_STATE_COMPLETED		= 5,
      	TRANS_STATE_MAX			= 6,
        }
      and just use 1 variant to track those state.
      
      In order to make the blocked handle types for each state more clear,
      we introduce a array:
        unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
      	[TRANS_STATE_RUNNING]		= 0U,
      	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START),
      	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH),
      	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN),
      	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
      	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
        }
      it is very intuitionistic.
      
      Besides that, because we remove ->in_commit in transaction structure, so
      the lock ->commit_lock which was used to protect it is unnecessary, remove
      ->commit_lock.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      4a9d8bde
    • M
      Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure · 3f1e3fa6
      Miao Xie 提交于
      We used ->num_joined track if there were some writers which join the current
      transaction when the committer was sleeping. If some writers joined the current
      transaction, we has to continue the while loop to do some necessary stuff, such
      as flush the ordered operations. But it is unnecessary because we will do it
      after the while loop.
      
      Besides that, tracking ->num_joined would make the committer drop into the while
      loop when there are lots of internal writers(TRANS_JOIN).
      
      So we remove ->num_joined and don't track if there are some writers which join
      the current transaction when the committer is sleeping.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3f1e3fa6
    • M
      Btrfs: don't wait for all the writers circularly during the transaction commit · 0860adfd
      Miao Xie 提交于
      btrfs_commit_transaction has the following loop before we commit the
      transaction.
      
      do {
          // attempt to do some useful stuff and/or sleep
      } while (atomic_read(&cur_trans->num_writers) > 1 ||
      	 (should_grow && cur_trans->num_joined != joined));
      
      This is used to prevent from the TRANS_START to get in the way of a
      committing transaction. But it does not prevent from TRANS_JOIN, that
      is we would do this loop for a long time if some writers JOIN the
      current transaction endlessly.
      
      Because we need join the current transaction to do some useful stuff,
      we can not block TRANS_JOIN here. So we introduce a external writer
      counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
      If the external writer counter is zero, we can break the above loop.
      
      In order to make the code more clear, we don't use enum variant
      to define the type of the transaction handle, use bitmask instead.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0860adfd
  18. 07 5月, 2013 2 次提交
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen 提交于
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      48a3b636
    • D
      btrfs: clean snapshots one by one · 9d1a2a3a
      David Sterba 提交于
      Each time pick one dead root from the list and let the caller know if
      it's needed to continue. This should improve responsiveness during
      umount and balance which at some point waits for cleaning all currently
      queued dead roots.
      
      A new dead root is added to the end of the list, so the snapshots
      disappear in the order of deletion.
      
      The snapshot cleaning work is now done only from the cleaner thread and the
      others wake it if needed.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      9d1a2a3a
  19. 01 3月, 2013 2 次提交
  20. 21 2月, 2013 3 次提交
    • M
      Btrfs: fix uncompleted transaction · d4edf39b
      Miao Xie 提交于
      In some cases, we need commit the current transaction, but don't want
      to start a new one if there is no running transaction, so we introduce
      the function - btrfs_attach_transaction(), which can catch the current
      transaction, and return -ENOENT if there is no running transaction.
      
      But no running transaction doesn't mean the current transction completely,
      because we removed the running transaction before it completes. In some
      cases, it doesn't matter. But in some special cases, such as freeze fs, we
      hope the transaction is fully on disk, it will introduce some bugs, for
      example, we may feeze the fs and dump the data in the disk, if the transction
      doesn't complete, we would dump inconsistent data. So we need fix the above
      problem for those cases.
      
      We fixes this problem by introducing a function:
      	btrfs_attach_transaction_barrier()
      if we hope all the transaction is fully on the disk, even they are not
      running, we can use this function.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      d4edf39b
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
    • E
      btrfs: remove cache only arguments from defrag path · de78b51a
      Eric Sandeen 提交于
      The entry point at the defrag ioctl always sets "cache only" to 0;
      the codepaths haven't run for a long time as far as I can
      tell.  Chris says they're dead code, so remove them.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      de78b51a
  21. 20 2月, 2013 1 次提交
    • J
      Btrfs: don't re-enter when allocating a chunk · c6b305a8
      Josef Bacik 提交于
      If we start running low on metadata space we will try to allocate a chunk,
      which could then try to allocate a chunk to add the device entry.  The thing
      is we allocate a chunk before we try really hard to make the allocation, so
      we should be able to find space for the device entry.  Add a flag to the
      trans handle so we know we're currently allocating a chunk so we can just
      bail out if we try to allocate another chunk.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c6b305a8
  22. 12 12月, 2012 1 次提交
    • M
      Btrfs: improve the noflush reservation · 08e007d2
      Miao Xie 提交于
      In some places(such as: evicting inode), we just can not flush the reserved
      space of delalloc, flushing the delayed directory index and delayed inode
      is OK, but we don't try to flush those things and just go back when there is
      no enough space to be reserved. This patch fixes this problem.
      
      We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
      If we can in the transaction, we should not flush anything, or the deadlock
      would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
      would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
      and we will flush all things.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      08e007d2
  23. 09 10月, 2012 2 次提交
    • M
      Btrfs: fix orphan transaction on the freezed filesystem · 354aa0fb
      Miao Xie 提交于
      With the following debug patch:
      
       static int btrfs_freeze(struct super_block *sb)
       {
      + 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
      +	struct btrfs_transaction *trans;
      +
      +	spin_lock(&fs_info->trans_lock);
      +	trans = fs_info->running_transaction;
      +	if (trans) {
      +		printk("Transid %llu, use_count %d, num_writer %d\n",
      +			trans->transid, atomic_read(&trans->use_count),
      +			atomic_read(&trans->num_writers));
      +	}
      +	spin_unlock(&fs_info->trans_lock);
       	return 0;
       }
      
      I found there was a orphan transaction after the freeze operation was done.
      
      It is because the transaction may not be committed when the transaction handle
      end even though it is the last handle of the current transaction. This design
      avoid committing the transaction frequently, but also introduce the above
      problem.
      
      So I add btrfs_attach_transaction() which can catch the current transaction
      and commit it. If there is no transaction, it will return ENOENT, and do not
      anything.
      
      This function also can be used to instead of btrfs_join_transaction_freeze()
      because it don't increase the writer counter and don't start a new transaction,
      so it also can fix the deadlock between sync and freeze.
      
      Besides that, it is used to instead of btrfs_join_transaction() in
      transaction_kthread(), because if there is no transaction, the transaction
      kthread needn't anything.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      354aa0fb
    • M
      Btrfs: add a type field for the transaction handle · a698d075
      Miao Xie 提交于
      This patch add a type field into the transaction handle structure,
      in this way, we needn't implement various end-transaction functions
      and can make the code more simple and readable.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      a698d075
  24. 04 10月, 2012 1 次提交
    • J
      Btrfs: fix race in sync and freeze again · 60376ce4
      Josef Bacik 提交于
      I screwed this up, there is a race between checking if there is a running
      transaction and actually starting a transaction in sync where we could race
      with a freezer and get ourselves into trouble.  To fix this we need to make
      a new join type to only do the try lock on the freeze stuff.  If it fails
      we'll return EPERM and just return from sync.  This fixes a hang Liu Bo
      reported when running xfstest 68 in a loop.  Thanks,
      Reported-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      60376ce4
  25. 02 10月, 2012 3 次提交
    • J
      Btrfs: delay block group item insertion · ea658bad
      Josef Bacik 提交于
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ea658bad
    • M
      Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie 提交于
      When we delete a inode, we will remove all the delayed items including delayed
      inode update, and then truncate all the relative metadata. If there is lots of
      metadata, we will end the current transaction, and start a new transaction to
      truncate the left metadata. In this way, we will leave a inode item that its
      link counter is > 0, and also may leave some directory index items in fs/file tree
      after the current transaction ends. In other words, the metadata in this fs/file tree
      is inconsistent. If we create a snapshot for this tree now, we will find a inode with
      corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
      because its link counter is not 0.
      
      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      8407aa46
    • L
      Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34
      Liu Bo 提交于
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      The current btrfs checks if an inode is in log by comparing
      root's last_log_commit to inode's last_sub_trans[2].
      
      But the problem is that this root->last_log_commit is shared among
      inodes.
      
      Say we have N inodes to be logged, after the first inode,
      root's last_log_commit is updated and the N-1 remained files will
      be skipped.
      
      This fixes the bug by keeping a local copy of root's last_log_commit
      inside each inode and this local copy will be maintained itself.
      
      [1]: we regard each log transaction as a subset of btrfs's transaction,
      i.e. sub_trans
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      46d8bc34
  26. 24 7月, 2012 1 次提交
    • J
      Btrfs: change how we indicate we're adding csums · 0e721106
      Josef Bacik 提交于
      There is weird logic I had to put in place to make sure that when we were
      adding csums that we'd used the delalloc block rsv instead of the global
      block rsv.  Part of this meant that we had to free up our transaction
      reservation before we ran the delayed refs since csum deletion happens
      during the delayed ref work.  The problem with this is that when we release
      a reservation we will add it to the global reserve if it is not full in
      order to keep us going along longer before we have to force a transaction
      commit.  By releasing our reservation before we run delayed refs we don't
      get the opportunity to drain down the global reserve for the work we did, so
      we won't refill it as often.  This isn't a problem per-se, it just results
      in us possibly committing transactions more and more often, and in rare
      cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
      out of space in our block rsv.
      
      This also helps us by holding onto space while the delayed refs run so we
      don't end up with as many people trying to do things at the same time, which
      again will help us not force commits or hit the use_block_rsv warnings.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0e721106
  27. 12 7月, 2012 2 次提交
    • A
      Btrfs: add qgroup inheritance · 6f72c7e2
      Arne Jansen 提交于
      When creating a subvolume or snapshot, it is necessary
      to initialize the qgroup account with a copy of some
      other (tracking) qgroup. This patch adds parameters
      to the ioctls to pass the information from which qgroup
      to inherit.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      6f72c7e2
    • A
      Btrfs: hooks to reserve qgroup space · c5567237
      Arne Jansen 提交于
      Like block reserves, reserve a small piece of space on each
      transaction start and for delalloc. These are the hooks that
      can actually return EDQUOT to the user.
      The amount of space reserved is tracked in the transaction
      handle.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      c5567237