1. 07 5月, 2013 1 次提交
    • J
      Btrfs: compare relevant parts of delayed tree refs · 41b0fc42
      Josef Bacik 提交于
      A user reported a panic while running a balance.  What was happening was he was
      relocating a block, which added the reference to the relocation tree.  Then
      relocation would walk through the relocation tree and drop that reference and
      free that block, and then it would walk down a snapshot which referenced the
      same block and add another ref to the block.  The problem is this was all
      happening in the same transaction, so the parent block was free'ed up when we
      drop our reference which was immediately available for allocation, and then it
      was used _again_ to add a reference for the same block from a different
      snapshot.  This resulted in something like this in the delayed ref tree
      
      add ref to 90234880, parent=2067398656, ref_root 1766, level 1
      del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
      add ref to 90234880, parent=2067398656, ref_root 1767, level 1
      
      as you can see the ref_root's don't match, because when we inc the ref we use
      the header owner, which is the original tree the block belonged to, instead of
      the data reloc tree.  Then when we remove the extent we use the reloc tree
      objectid.  But none of this matters, since it is a shared reference which means
      only the parent matters.  When the delayed ref stuff runs it adds all the
      increments first, and then does all the drops, to make sure that we don't delete
      the ref if we net a positive ref count.  But tree blocks aren't allowed to have
      multiple refs from the same block, so this panics when it tries to add the
      second ref.  We need the add and the drop to cancel each other out in memory so
      we only do the final add.
      
      So to fix this we need to adjust how the delayed refs are added to the tree.
      Only the ref_root matters when it is a normal backref, and only the parent
      matters when it is a shared backref.  So make our decision based on what ref
      type we have.  This allows us to keep the ref_root in memory in case anybody
      wants to use it for something else, and it allows the delayed refs to be merged
      properly so we don't end up with this panic.
      
      With this patch the users image no longer panics on mount, and it has a clean
      fsck after a normal mount/umount cycle.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reported-by: NRoman Mamedov <rm@romanrm.ru>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      41b0fc42
  2. 20 2月, 2013 2 次提交
  3. 29 8月, 2012 2 次提交
    • J
      Btrfs: allow delayed refs to be merged · ae1e206b
      Josef Bacik 提交于
      Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
      Basically what happens is the balance relocates a tree block which will drop
      the implicit refs for all of its children and adds a full backref.  Once the
      block is relocated we have to add the implicit refs back, so when we cow the
      block again we add the implicit refs for its children back.  The problem
      comes when the original drop ref doesn't get run before we add the implicit
      refs back.  The delayed ref stuff will specifically prefer ADD operations
      over DROP to keep us from freeing up an extent that will have references to
      it, so we try to add the implicit ref before it is actually removed and we
      panic.  This worked fine before because the add would have just canceled the
      drop out and we would have been fine.  But the backref walking work needs to
      be able to freeze the delayed ref stuff in time so we have this ever
      increasing sequence number that gets attached to all new delayed ref updates
      which makes us not merge refs and we run into this issue.
      
      So to fix this we need to merge delayed refs.  So everytime we run a
      clustered ref we need to try and merge all of its delayed refs.  The backref
      walking stuff locks the delayed ref head before processing, so if we have it
      locked we are safe to merge any refs inside of the sequence number.  If
      there is no sequence number we can merge all refs.  Doing this not only
      fixes our bug but keeps the delayed ref code from adding and removing
      useless refs and batching together multiple refs into one search instead of
      one search per delayed ref, which will really help our commit times.  I ran
      this with Daniels test and 276 and I haven't seen any problems.  Thanks,
      Reported-by: NDaniel J Blueman <daniel@quora.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ae1e206b
    • A
      Btrfs: fix deadlock in wait_for_more_refs · 1fa11e26
      Arne Jansen 提交于
      Commit a168650c introduced a waiting mechanism to prevent busy waiting in
      btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
      where a tree_mod_seq is held while waiting for the io to complete, while
      the end_io calls btrfs_run_delayed_refs.
      This whole mechanism is unnecessary. If not enough runnable refs are
      available to satisfy count, just return as count is more like a guideline
      than a strict requirement.
      In case we have to run all refs, commit transaction makes sure that no
      other threads are working in the transaction anymore, so we just assert
      here that no refs are blocked.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      1fa11e26
  4. 12 7月, 2012 1 次提交
  5. 10 7月, 2012 1 次提交
  6. 31 5月, 2012 1 次提交
    • J
      Btrfs: use delayed ref sequence numbers for all fs-tree updates · 95a06077
      Jan Schmidt 提交于
      The sequence number for delayed refs is needed to postpone certain delayed
      refs for a very short period while walking backrefs. Before the tree
      modification log, we thought we'd only have to hold back those references
      that don't have a counter operation.
      
      While now we've the tree mod log, we're rewinding fs tree blocks to a
      defined consistent state. We cannot know in advance for which tree block
      we'll be doing rewind operations later. Therefore, we must postpone all the
      delayed refs for fs-tree blocks, even those having a counter operation.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      95a06077
  7. 22 3月, 2012 2 次提交
  8. 04 1月, 2012 3 次提交
    • J
      Btrfs: add waitqueue instead of doing busy waiting for more delayed refs · a168650c
      Jan Schmidt 提交于
      Now that we may be holding back delayed refs for a limited period, we
      might end up having no runnable delayed refs. Without this commit, we'd
      do busy waiting in that thread until another (runnable) ref arives.
      Instead, we're detecting this situation and use a waitqueue, such that
      we only try to run more refs after
      	a) another runnable ref was added  or
      	b) delayed refs are no longer held back
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      a168650c
    • A
      Btrfs: put back delayed refs that are too new · d1270cd9
      Arne Jansen 提交于
      When processing a delayed ref, first check if there are still old refs in
      the process of being added. If so, put this ref back to the tree. To avoid
      looping on this ref, choose a newer one in the next loop.
      btrfs_find_ref_cluster has to take care of that.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      d1270cd9
    • A
      Btrfs: add sequence numbers to delayed refs · 00f04b88
      Arne Jansen 提交于
      Sequence numbers are needed to reconstruct the backrefs of a given extent to
      a certain point in time. The total set of backrefs consist of the set of
      backrefs recorded on disk plus the enqueued delayed refs for it that existed
      at that moment.
      
      This patch also adds a list that records all delayed refs which are
      currently in the process of being added.
      
      When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
      current state of delayed refs, honor anythinh up to this point and prevent
      processing newer delayed refs to assert consistency.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      00f04b88
  9. 22 12月, 2011 2 次提交
    • A
      Btrfs: always save ref_root in delayed refs · eebe063b
      Arne Jansen 提交于
      For consistent backref walking and (later) qgroup calculation the
      information to which root a delayed ref belongs is useful even for shared
      refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      eebe063b
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen 提交于
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      66d7e7f0
  10. 06 5月, 2011 2 次提交
  11. 28 3月, 2011 1 次提交
    • L
      Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      liubo 提交于
      Tracepoints can provide insight into why btrfs hits bugs and be greatly
      helpful for debugging, e.g
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordere_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both read and write cases, and it is useful for tracking
      how btrfs specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are cirtical resourses and produce a lot of corner cases during writeback,
      so it is valuable to know how page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These can show where and when a inode is created, when a inode is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In transaction based filesystem, it will be useful to know the generation and
      who does commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references, these tracepoints are helpful on
      understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      Chunk is a link between physical offset and logical offset, and stands for space
      infomation in btrfs, and these are helpful on tracing space things.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1abe9b8a
  12. 25 5月, 2010 1 次提交
  13. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  14. 10 6月, 2009 1 次提交
    • Y
      Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2
      Yan Zheng 提交于
      This commit introduces a new kind of back reference for btrfs metadata.
      Once a filesystem has been mounted with this commit, IT WILL NO LONGER
      BE MOUNTABLE BY OLDER KERNELS.
      
      When a tree block in subvolume tree is cow'd, the reference counts of all
      extents it points to are increased by one.  At transaction commit time,
      the old root of the subvolume is recorded in a "dead root" data structure,
      and the btree it points to is later walked, dropping reference counts
      and freeing any blocks where the reference count goes to 0.
      
      The increments done during cow and decrements done after commit cancel out,
      and the walk is a very expensive way to go about freeing the blocks that
      are no longer referenced by the new btree root.  This commit reduces the
      transaction overhead by avoiding the need for dead root records.
      
      When a non-shared tree block is cow'd, we free the old block at once, and the
      new block inherits old block's references. When a tree block with reference
      count > 1 is cow'd, we increase the reference counts of all extents
      the new block points to by one, and decrease the old block's reference count by
      one.
      
      This dead tree avoidance code removes the need to modify the reference
      counts of lower level extents when a non-shared tree block is cow'd.
      But we still need to update back ref for all pointers in the block.
      This is because the location of the block is recorded in the back ref
      item.
      
      We can solve this by introducing a new type of back ref. The new
      back ref provides information about pointer's key, level and in which
      tree the pointer lives. This information allow us to find the pointer
      by searching the tree. The shortcoming of the new back ref is that it
      only works for pointers in tree blocks referenced by their owner trees.
      
      This is mostly a problem for snapshots, where resolving one of these
      fuzzy back references would be O(number_of_snapshots) and quite slow.
      The solution used here is to use the fuzzy back references in the common
      case where a given tree block is only referenced by one root,
      and use the full back references when multiple roots have a reference
      on a given block.
      
      This commit adds per subvolume red-black tree to keep trace of cached
      inodes. The red-black tree helps the balancing code to find cached
      inodes whose inode numbers within a given range.
      
      This commit improves the balancing code by introducing several data
      structures to keep the state of balancing. The most important one
      is the back ref cache. It caches how the upper level tree blocks are
      referenced. This greatly reduce the overhead of checking back ref.
      
      The improved balancing code scales significantly better with a large
      number of snapshots.
      
      This is a very large commit and was written in a number of
      pieces.  But, they depend heavily on the disk format change and were
      squashed together to make sure git bisect didn't end up in a
      bad state wrt space balancing or the format change.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d4f98a2
  15. 03 4月, 2009 1 次提交
  16. 25 3月, 2009 4 次提交
    • C
      Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod · 1a81af4d
      Chris Mason 提交于
      btrfs_update_delayed_ref is optimized to add and remove different
      references in one pass through the delayed ref tree.  It is a zero
      sum on the total number of refs on a given extent.
      
      But, the code was recording an extra ref in the head node.  This
      never made it down to the disk but was used when deciding if it was
      safe to free the extent while dropping snapshots.
      
      The fix used here is to make sure the ref_mod count is unchanged
      on the head ref when btrfs_update_delayed_ref is called.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1a81af4d
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason 提交于
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates area left it
      immediately processes the delayed ref.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1887be66
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294