1. 17 2月, 2015 2 次提交
    • D
      Btrfs: disk-io: replace root args iff only fs_info used · 01d58472
      Daniel Dressler 提交于
      This is the 3rd independent patch of a larger project to cleanup btrfs's
      internal usage of btrfs_root. Many functions take btrfs_root only to
      grab the fs_info struct.
      
      By requiring a root these functions cause programmer overhead. That
      these functions can accept any valid root is not obvious until
      inspection.
      
      This patch reduces the specificity of such functions to accept the
      fs_info directly.
      
      These patches can be applied independently and thus are not being
      submitted as a patch series. There should be about 26 patches by the
      project's completion. Each patch will cleanup between 1 and 34 functions
      apiece.  Each patch covers a single file's functions.
      
      This patch affects the following function(s):
        1) csum_tree_block
        2) csum_dirty_buffer
        3) check_tree_block_fsid
        4) btrfs_find_tree_block
        5) clean_tree_block
      Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      01d58472
    • D
      Btrfs: ctree: reduce args where only fs_info used · b7a0365e
      Daniel Dressler 提交于
      This patch is part of a larger project to cleanup btrfs's internal usage
      of struct btrfs_root. Many functions take btrfs_root only to grab a
      pointer to fs_info.
      
      This causes programmers to ponder which root can be passed. Since only
      the fs_info is read affected functions can accept any root, except this
      is only obvious upon inspection.
      
      This patch reduces the specificty of such functions to accept the
      fs_info directly.
      
      This patch does not address the two functions in ctree.c (insert_ptr,
      and split_item) which only use root for BUG_ONs in ctree.c
      
      This patch affects the following functions:
        1) fixup_low_keys
        2) btrfs_set_item_key_safe
      Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      b7a0365e
  2. 22 1月, 2015 3 次提交
    • C
      Btrfs: insert_new_root: Fix lock type of the extent buffer. · 95449a16
      chandan 提交于
      btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has
      been taken. Hence assign the appropriate value to path->locks[level].
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      95449a16
    • F
      Btrfs: fix setup_leaf_for_split() to avoid leaf corruption · a8df6fe6
      Filipe Manana 提交于
      We were incorrectly detecting when the target key didn't exist anymore
      after releasing the path and re-searching the tree. This could make
      us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
      are its only callers at the moment) an item when we should not.
      
      For the case of duplicating an item, we currently only duplicate
      checksum items (csum tree) and file extent items (fs/subvol trees).
      For the checksum items we end up overriding the item completely,
      but for file extent items we update only some of their fields in
      the copy (done in __btrfs_drop_extents), which means we can end up
      having a logical corruption for some values.
      
      Also for the case where we duplicate a file extent item it will make
      us produce a leaf with a wrong key order, as btrfs_duplicate_item()
      advances us to the next slot and then its caller sets a smaller key
      on the new item at that slot (like in __btrfs_drop_extents() e.g.).
      Alternatively if the tree search in setup_leaf_for_split() leaves
      with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
      up accessing beyond the leaf's end (when we check if the item's size
      has changed) and make our caller insert an item at the invalid slot
      btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
      access if the leaf is full or nearly full.
      
      This issue has been present since the introduction of this function
      in 2009:
      
          Btrfs: Add btrfs_duplicate_item
          commit ad48fd75Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a8df6fe6
    • J
      Btrfs: change how we track dirty roots · e7070be1
      Josef Bacik 提交于
      I've been overloading root->dirty_list to keep track of dirty roots and which
      roots need to have their commit roots switched at transaction commit time.  This
      could cause us to lose an update to the root which could corrupt the file
      system.  To fix this use a state bit to know if the root is dirty, and if it
      isn't set we go ahead and move the root to the dirty list.  This way if we
      re-dirty the root after adding it to the switch_commit list we make sure to
      update it.  This also makes it so that the extent root is always the last root
      on the dirty list to try and keep the amount of churn down at this point in the
      commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e7070be1
  3. 15 1月, 2015 2 次提交
  4. 13 12月, 2014 2 次提交
  5. 21 11月, 2014 1 次提交
    • F
      Btrfs: make xattr replace operations atomic · 5f5bc6b1
      Filipe Manana 提交于
      Replacing a xattr consists of doing a lookup for its existing value, delete
      the current value from the respective leaf, release the search path and then
      finally insert the new value. This leaves a time window where readers (getxattr,
      listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
      so this has security implications.
      
      This change also fixes 2 other existing issues which were:
      
      *) Deleting the old xattr value without verifying first if the new xattr will
         fit in the existing leaf item (in case multiple xattrs are packed in the
         same item due to name hash collision);
      
      *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
         exist but we have have an existing item that packs muliple xattrs with
         the same name hash as the input xattr. In this case we should return ENOSPC.
      
      A test case for xfstests follows soon.
      
      Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
      implementation.
      Reported-by: NAlexandre Oliva <oliva@gnu.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5f5bc6b1
  6. 20 11月, 2014 1 次提交
    • C
      btrfs: fix lockups from btrfs_clear_path_blocking · f82c458a
      Chris Mason 提交于
      The fair reader/writer locks mean that btrfs_clear_path_blocking needs
      to strictly follow lock ordering rules even when we already have
      blocking locks on a given path.
      
      Before we can clear a blocking lock on the path, we need to make sure
      all of the locks have been converted to blocking.  This will remove lock
      inversions against anyone spinning in write_lock() against the buffers
      we're trying to get read locks on.  These inversions didn't exist before
      the fair read/writer locks, but now we need to be more careful.
      
      We papered over this deadlock in the past by changing
      btrfs_try_read_lock() to be a true trylock against both the spinlock and
      the blocking lock.  This was slower, and not sufficient to fix all the
      deadlocks.  This patch adds a btrfs_tree_read_lock_atomic(), which
      basically means get the spinlock but trylock on the blocking lock.
      Signed-off-by: NChris Mason <clm@fb.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reported-by: NPatrick Schmid <schmid@phys.ethz.ch>
      cc: stable@vger.kernel.org #v3.15+
      f82c458a
  7. 04 10月, 2014 1 次提交
  8. 02 10月, 2014 6 次提交
  9. 18 9月, 2014 4 次提交
    • F
      Btrfs: make btrfs_search_forward return with nodes unlocked · f98de9b9
      Filipe Manana 提交于
      None of the uses of btrfs_search_forward() need to have the path
      nodes (level >= 1) read locked, only the leaf needs to be locked
      while the caller processes it. Therefore make it return a path
      with all nodes unlocked, except for the leaf.
      
      This change is motivated by the observation that during a file
      fsync we repeatdly call btrfs_search_forward() and process the
      returned leaf while upper nodes of the returned path (level >= 1)
      are read locked, which unnecessarily blocks other tasks that want
      to write to the same fs/subvol btree.
      Therefore instead of modifying the fsync code to unlock all nodes
      with level >= 1 immediately after calling btrfs_search_forward(),
      change btrfs_search_forward() to do it, so that it benefits all
      callers.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f98de9b9
    • F
      Btrfs: avoid unnecessary switch of path locks to blocking mode · 160f4089
      Filipe Manana 提交于
      If we need to cow a node, increase the write lock level and retry the
      tree search, there's no point of changing the node locks in our path
      to blocking mode, as we only waste time and unnecessarily wake up other
      tasks waiting on the spinning locks (just to block them again shortly
      after) because we release our path before repeating the tree search.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      160f4089
    • F
      Btrfs: unlock nodes earlier when inserting items in a btree · 24cdc847
      Filipe Manana 提交于
      In ctree.c:setup_items_for_insert(), we can unlock all nodes in our
      path before we process the leaf (shift items and data, adjust data
      offsets, etc). This allows for better btree concurrency, as we're
      often holding a write lock on at least the node at level 1.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      24cdc847
    • D
      btrfs: use nodesize everywhere, kill leafsize · 707e8a07
      David Sterba 提交于
      The nodesize and leafsize were never of different values. Unify the
      usage and make nodesize the one. Cleanup the redundant checks and
      helpers.
      
      Shaves a few bytes from .text:
      
        text    data     bss     dec     hex filename
      852418   24560   23112  900090   dbbfa btrfs.ko.before
      851074   24584   23112  898770   db6d2 btrfs.ko.after
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      707e8a07
  10. 15 8月, 2014 1 次提交
    • J
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik 提交于
      Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out to be not the
      case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call it's process function with no_quota set
      always.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e339a6b0
  11. 10 6月, 2014 5 次提交
    • L
      Btrfs: fix leaf corruption after __btrfs_drop_extents · 0b43e04f
      Liu Bo 提交于
      Several reports about leaf corruption has been floating on the list, one of them
      points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
      after __btrfs_drop_extents(), it's really a rare case but it does exist.
      
      The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
      
      So in btrfs_next_leaf(), we release the current path to re-search the last key of
      the leaf for locating next leaf, and we've taken it into account that there might
      be balance operations between leafs during this 'unlock and re-lock' dance, so
      we check the path again and advance it if there are now more items available.
      But things are a bit different if that last key happens to be removed and balance
      gets a bigger key as the last one, and btrfs_search_slot will return it with
      ret > 0, IOW, nothing change in this leaf except the new last key, then we think
      we're okay because there is no more item balanced in, fine, we thinks we can
      go to the next leaf.
      
      However, we should return that bigger key, otherwise we deserve leaf corruption,
      for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
      it has dropped all extent matched the required range and finish_ordered_io can
      safely insert a new extent, but it actually doesn't and ends up a leaf
      corruption.
      
      One may be asking that why our locking on extent io tree doesn't work as
      expected, ie. it should avoid this kind of race situation.  But in
      __btrfs_drop_extents(), we don't always find extents which are included within
      our locking range, IOW, extents can start before our searching start, in this
      case locking on extent io tree doesn't protect us from the race.
      
      This takes the special case into account.
      Reviewed-by: NFilipe Manana <fdmanana@gmail.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0b43e04f
    • F
      Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item · 337c6f68
      Filipe Manana 提交于
      We might have had an item with the previous key in the tree right
      before we released our path. And after we released our path, that
      item might have been pushed to the first slot (0) of the leaf we
      were holding due to a tree balance. Alternatively, an item with the
      previous key can exist as the only element of a leaf (big fat item).
      Therefore account for these 2 cases, so that our callers (like
      btrfs_previous_item) don't miss an existing item with a key matching
      the previous key we computed above.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      337c6f68
    • J
      Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Josef Bacik 提交于
      This exercises the various parts of the new qgroup accounting code.  We do some
      basic stuff and do some things with the shared refs to make sure all that code
      works.  I had to add a bunch of infrastructure because I needed to be able to
      insert items into a fake tree without having to do all the hard work myself,
      hopefully this will be usefull in the future.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      faa2dbf0
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456
    • M
  12. 08 4月, 2014 1 次提交
  13. 07 4月, 2014 1 次提交
    • J
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik 提交于
      Lets try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reportedy-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e351cc8
  14. 11 3月, 2014 1 次提交
    • F
      Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293
      Filipe Manana 提交于
      Just comparing the pointers (logical disk addresses) of the btree nodes is
      not completely bullet proof, we have to check if their generation numbers
      match too.
      
      It is guaranteed that a COW operation will result in a block with a different
      logical disk address than the original block's address, but over time we can
      reuse that former logical disk address.
      
      For example, creating a 2Gb filesystem on a loop device, and having a script
      running in a loop always updating the access timestamp of a file, resulted in
      the same logical disk address being reused for the same fs btree block in about
      only 4 minutes.
      
      This could make us skip entire subtrees when doing an incremental send (which
      is currently the only user of btrfs_compare_trees). However the odds of getting
      2 blocks at the same tree level, with the same logical disk address, equal first
      slot keys and different generations, should hopefully be very low.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6baa4293
  15. 29 1月, 2014 9 次提交
    • F
      Btrfs: fix btrfs_search_slot_for_read backwards iteration · 23c6bf6a
      Filipe David Borba Manana 提交于
      If the current path's leaf slot is 0, we do search for the previous
      leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a
      value corresponding to the number of items - 1 of the former leaf.
      Fix this by using the slot set by btrfs_prev_leaf, decrementing it
      by 1 if it's equal to the leaf's number of items.
      
      Use of btrfs_search_slot_for_read() for backward iteration is used in
      particular by the send feature, which could miss items when the input
      leaf has less items than its previous leaf.
      
      This could be reproduced by running btrfs/007 from xfstests in a loop.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      23c6bf6a
    • W
      Btrfs: fix to search previous metadata extent item since skinny metadata · ade2e0b3
      Wang Shilong 提交于
      There is a bug that using btrfs_previous_item() to search metadata extent item.
      This is because in btrfs_previous_item(), we need type match, however, since
      skinny metada was introduced by josef, we may mix this two types. So just
      use btrfs_previous_item() is not working right.
      
      To keep btrfs_previous_item() like normal tree search, i introduce another
      function btrfs_previous_extent_item().
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ade2e0b3
    • F
      Btrfs: reduce btree node locking duration on item update · eb653de1
      Filipe David Borba Manana 提交于
      If we do a btree search with the goal of updating an existing item
      without changing its size (ins_len == 0 and cow == 1), then we never
      need to hold locks on upper level nodes (even when slot == 0) after we
      COW their child nodes/leaves, as we won't have node splits or merges
      in this scenario (that is, no key additions, removals or shifts on any
      nodes or leaves).
      
      Therefore release the locks immediately after COWing the child nodes/leaves
      while navigating the btree, even if their parent slot is 0, instead of
      returning a path to the caller with those nodes locked, which would get
      released only when the caller releases or frees the path (or if it calls
      btrfs_unlock_up_safe).
      
      This is a common scenario, for example when updating inode items in fs
      trees and block group items in the extent tree.
      
      The following benchmarks were performed on a quad core machine with 32Gb
      of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
      quickly).
      
        sysbench --test=fileio --file-num=131072 --file-total-size=8G \
          --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Before this change:  49.85Mb/s (average of 5 runs)
      After this change:   50.38Mb/s (average of 5 runs)
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      eb653de1
    • F
      Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton 提交于
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      efe120a0
    • F
      Btrfs: fix tree mod logging · 5de865ee
      Filipe David Borba Manana 提交于
      While running the test btrfs/004 from xfstests in a loop, it failed
      about 1 time out of 20 runs in my desktop. The failure happened in
      the backref walking part of the test, and the test's error message was
      like this:
      
        btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
            --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
            +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
            @@ -1,3 +1,8 @@
             QA output created by 004
             *** test backref walking
            -*** done
            +unexpected output from
            +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
            +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
            +
             ...
             (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
        Ran: btrfs/004
        Failures: btrfs/004
        Failed 1 of 1 tests
      
      But immediately after the test finished, the btrfs inspect-internal command
      returned the expected output:
      
        $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
        inode 405 offset 454656 root 258
        inode 405 offset 454656 root 5
      
      It turned out this was because the btrfs_search_old_slot() calls performed
      during backref walking (backref.c:__resolve_indirect_ref) were not finding
      anything. The reason for this turned out to be that the tree mod logging
      code was not logging some node multi-step operations atomically, therefore
      btrfs_search_old_slot() callers iterated often over an incomplete tree that
      wasn't fully consistent with any tree state from the past. Besides missing
      items, this often (but not always) resulted in -EIO errors during old slot
      searches, reported in dmesg like this:
      
      [ 4299.933936] ------------[ cut here ]------------
      [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
      [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
      [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
      [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
      [ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
      [ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
      [ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
      [ 4299.933987] Call Trace:
      [ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
      [ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
      [ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
      [ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
      [ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
      [ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
      [ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
      [ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
      [ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
      [ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
      [ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
      [ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
      [ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
      [ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
      [ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
      [ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
      [ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
      [ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
      [ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
      [ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
      [ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
      [ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
      [ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
      [ 4299.934378] btrfs bad fsid on block 0
      
      These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
      tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
      be performed atomically before the following commit:
      
        c8cc6341
        (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
      
      That change removed the atomicity of such operations. This patch restores the
      atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
      structures, so it has to do the allocations using GFP_NOFS before acquiring
      the mod log lock.
      
      This issue has been experienced by several users recently, such as for example:
      
        http://www.spinics.net/lists/linux-btrfs/msg28574.html
      
      After running the btrfs/004 test for 679 consecutive iterations with this
      patch applied, I didn't ran into the issue anymore.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5de865ee
    • F
      Btrfs: return immediately if tree log mod is not necessary · 78357766
      Filipe David Borba Manana 提交于
      In ctree.c:tree_mod_log_set_node_key() we were calling
      __tree_mod_log_insert_key() even when the modification doesn't need
      to be logged. This would allocate a tree_mod_elem structure, fill it
      and pass it to  __tree_mod_log_insert(), which would just acquire
      the tree mod log write lock and then free the tree_mod_elem structure
      and return (that is, a no-op).
      
      Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
      which just returns immediately if the modification doesn't need to be
      logged (without allocating the structure, fill it, acquire write lock,
      free structure).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      78357766
    • F
      Btrfs: more efficient push_leaf_right · 2ef1fed2
      Filipe David Borba Manana 提交于
      Currently when finding the leaf to insert a key into a btree, if the
      leaf doesn't have enough space to store the item we attempt to move
      off some items from our leaf to its right neighbor leaf, and if this
      fails to create enough free space in our leaf, we try to move off more
      items to the left neighbor leaf as well.
      
      When trying to move off items to the right neighbor leaf, if it has
      enough room to store the new key but not not enough room to move off
      at least one item from our target leaf, __push_leaf_right returns 1 and
      we have to attempt to move items to the left neighbor (push_leaf_left
      function) without touching the right neighbor leaf.
      For the case where the right leaf has enough room to store at least 1
      item from our leaf, we end up modifying (and dirtying) both our leaf
      and the right leaf. This is non-optimal for the case where the new key
      is greater than any key in our target leaf because it can be inserted at
      slot 0 of the right neighbor leaf and we don't need to touch our leaf
      at all nor to attempt to move off items to the left neighbor leaf.
      
      Therefore this change just selects the right neighbor leaf as our new
      target leaf if it has enough room for the new key without modifying our
      initial target leaf - we do this only if the new key is higher than any
      key in the initial target leaf.
      
      While running the following test, push_leaf_right was called by split_leaf
      4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
      special case (right leaf has enough room and new key is higher than any key
      in the initial target leaf).
      
      Test:
      
        sysbench --test=fileio --file-num=512 --file-total-size=5G \
          --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Results:
      
      sequential writes
      
      Throughput before this change: 65.71Mb/sec (average of 10 runs)
      Throughput after this change:  66.58Mb/sec (average of 10 runs)
      
      random writes
      
      Throughput before this change: 10.75Mb/sec (average of 10 runs)
      Throughput after this change:  11.56Mb/sec (average of 10 runs)
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2ef1fed2
    • F
      Btrfs: try harder to avoid btree node splits · 5a4267ca
      Filipe David Borba Manana 提交于
      When attempting to move items from our target leaf to its neighbor
      leaves (right and left), we only need to free data_size - free_space
      bytes from our leaf in order to add the new item (which has size of
      data_size bytes). Therefore attempt to move items to the right and
      left leaves if they have at least data_size - free_space bytes free,
      instead of data_size bytes free.
      
      After 5 runs of the following test, I got a smaller number of btree
      node splits overall:
      
      sysbench --test=fileio --file-num=512 --file-total-size=5G \
        --file-test-mode=seqwr --num-threads=512 \
         --file-block-size=8192 --max-requests=100000 --file-io-mode=sync
      
      Before this change:
      * 6171 splits (average of 5 test runs)
      * 61.508Mb/sec of throughput (average of 5 test runs)
      
      After this change:
      * 6036 splits (average of 5 test runs)
      * 63.533Mb/sec of throughput (average of 5 test runs)
      
      An ideal test would not just have multiple threads/processes writing
      to a file (insertion of file extent items) but also do other operations
      that result in insertion of items with varied sizes, like file/directory
      creations, creation of links, symlinks, xattrs, etc.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5a4267ca
    • K
      btrfs: expand btrfs_find_item() to include find_orphan_item functionality · 3f870c28
      Kelley Nielsen 提交于
      This is the third step in bootstrapping the btrfs_find_item interface.
      The function find_orphan_item(), in orphan.c, is similar to the two
      functions already replaced by the new interface. It uses two parameters,
      which are already present in the interface, and is nearly identical to
      the function brought in in the previous patch.
      
      Replace the two calls to find_orphan_item() with calls to
      btrfs_find_item(), with the defined objectid and type that was used
      internally by find_orphan_item(), a null path, and a null key. Add a
      test for a null path to btrfs_find_item, and if it passes, allocate and
      free the path. Finally, remove find_orphan_item().
      Signed-off-by: NKelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3f870c28