1. 29 Jan, 2014 6 commits
    • Btrfs: more efficient push_leaf_right · 2ef1fed2
      Committed by Filipe David Borba Manana
      Currently when finding the leaf to insert a key into a btree, if the
      leaf doesn't have enough space to store the item we attempt to move
      off some items from our leaf to its right neighbor leaf, and if this
      fails to create enough free space in our leaf, we try to move off more
      items to the left neighbor leaf as well.
      
      When trying to move off items to the right neighbor leaf, if it has
      enough room to store the new key but not enough room to move off
      at least one item from our target leaf, __push_leaf_right returns 1 and
      we have to attempt to move items to the left neighbor (push_leaf_left
      function) without touching the right neighbor leaf.
      For the case where the right leaf has enough room to store at least 1
      item from our leaf, we end up modifying (and dirtying) both our leaf
      and the right leaf. This is suboptimal when the new key is greater than
      any key in our target leaf, because the key can be inserted at slot 0 of
      the right neighbor leaf without touching our leaf at all and without
      attempting to move items to the left neighbor leaf.
      
      Therefore this change just selects the right neighbor leaf as our new
      target leaf if it has enough room for the new key without modifying our
      initial target leaf - we do this only if the new key is higher than any
      key in the initial target leaf.
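
The shortcut described above can be illustrated with a toy model. This is only a sketch, not the kernel code: leaves are plain sorted Python lists, sizes are byte counts, and `choose_target` with its return labels is a hypothetical helper invented for illustration.

```python
import bisect


def choose_target(leaf_keys, leaf_free, right_free, new_key, item_size):
    """Decide where a new item should go (toy model, not kernel code).

    Returns a (target, slot) tuple:
      ("leaf", slot)  - the original leaf has room; insert at `slot`
      ("right", 0)    - shortcut: use slot 0 of the right neighbor
      ("push", None)  - fall back to moving items to the neighbors
    """
    if leaf_free >= item_size:
        return ("leaf", bisect.bisect_left(leaf_keys, new_key))

    # The search already landed on this leaf, so new_key sorts before every
    # key in the right neighbor. It only remains to check that it also sorts
    # after every key in our own leaf.
    new_key_is_highest = not leaf_keys or new_key > leaf_keys[-1]
    if new_key_is_highest and right_free >= item_size:
        return ("right", 0)

    return ("push", None)
```

With a full leaf holding keys 1 to 3 and a right neighbor with 100 free bytes, a 50-byte item with key 5 goes straight to slot 0 of the right leaf, while the same item with key 2 still takes the regular push path.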
      
      While running the following test, push_leaf_right was called by split_leaf
      4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
      special case (right leaf has enough room and new key is higher than any key
      in the initial target leaf).
      
      Test:
      
        sysbench --test=fileio --file-num=512 --file-total-size=5G \
          --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Results:
      
      sequential writes
      
      Throughput before this change: 65.71Mb/sec (average of 10 runs)
      Throughput after this change:  66.58Mb/sec (average of 10 runs)
      
      random writes
      
      Throughput before this change: 10.75Mb/sec (average of 10 runs)
      Throughput after this change:  11.56Mb/sec (average of 10 runs)
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: try harder to avoid btree node splits · 5a4267ca
      Committed by Filipe David Borba Manana
      When attempting to move items from our target leaf to its neighbor
      leaves (right and left), we only need to free data_size - free_space
      bytes from our leaf in order to add the new item (which has size of
      data_size bytes). Therefore attempt to move items to the right and
      left leaves if they have at least data_size - free_space bytes free,
      instead of data_size bytes free.
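
The arithmetic behind the relaxed threshold can be sketched as follows. The function names are hypothetical and the numbers are made up. Passing the check only means a push is worth attempting, since whether it succeeds still depends on the sizes of the individual items.

```python
def bytes_to_free(data_size, free_space):
    """Bytes that must leave our leaf before the new item fits."""
    return max(0, data_size - free_space)


def worth_trying_old(neighbor_free, data_size, free_space):
    # Old rule: the neighbor needed room for the whole new item.
    return neighbor_free >= data_size


def worth_trying_new(neighbor_free, data_size, free_space):
    # New rule: the neighbor only needs room for the shortfall.
    return neighbor_free >= bytes_to_free(data_size, free_space)
```

For example, with a 300-byte item and 120 bytes already free in our leaf, only 180 bytes have to be moved out. A neighbor with 200 free bytes is skipped under the old rule but tried under the new one.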
      
      After 5 runs of the following test, I got a smaller number of btree
      node splits overall:
      
      sysbench --test=fileio --file-num=512 --file-total-size=5G \
        --file-test-mode=seqwr --num-threads=512 \
         --file-block-size=8192 --max-requests=100000 --file-io-mode=sync
      
      Before this change:
      * 6171 splits (average of 5 test runs)
      * 61.508Mb/sec of throughput (average of 5 test runs)
      
      After this change:
      * 6036 splits (average of 5 test runs)
      * 63.533Mb/sec of throughput (average of 5 test runs)
      
      An ideal test would not just have multiple threads/processes writing
      to a file (insertion of file extent items) but also do other operations
      that result in insertion of items with varied sizes, like file/directory
      creations, creation of links, symlinks, xattrs, etc.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: expand btrfs_find_item() to include find_orphan_item functionality · 3f870c28
      Committed by Kelley Nielsen
      This is the third step in bootstrapping the btrfs_find_item interface.
      The function find_orphan_item(), in orphan.c, is similar to the two
      functions already replaced by the new interface. It uses two parameters,
      which are already present in the interface, and is nearly identical to
      the function introduced in the previous patch.

      Replace the two calls to find_orphan_item() with calls to
      btrfs_find_item(), with the defined objectid and type that were used
      internally by find_orphan_item(), a null path, and a null key. Add a
      test for a null path to btrfs_find_item(); when the path is null,
      btrfs_find_item() allocates and frees the path itself. Finally, remove
      find_orphan_item().
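
The shape of the resulting interface (optional path, optional found key) can be sketched with a toy model. This is not the kernel implementation: the tree is a flat sorted list, `Key`, `Path`, `find_item` and the `want_key` flag are hypothetical stand-ins, and the return convention (0 for found, 1 for not found) is an assumption made for the example.

```python
import bisect
from collections import namedtuple

# Keys sort by objectid, then type, then offset.
Key = namedtuple("Key", ["objectid", "type", "offset"])


class Path:
    """Hypothetical stand-in for a search path: remembers the slot reached."""

    def __init__(self):
        self.slot = None


def find_item(sorted_keys, path, objectid, offset, key_type, want_key=True):
    """Return (ret, found_key); ret is 0 when found and 1 otherwise."""
    search = Key(objectid, key_type, offset)
    # A caller that does not care about the path passes None, and a
    # temporary path is used internally instead.
    local_path = path if path is not None else Path()

    slot = bisect.bisect_left(sorted_keys, search)
    local_path.slot = slot
    exact = slot < len(sorted_keys) and sorted_keys[slot] == search

    if not want_key:
        # "Null key" mode: only report whether the exact key exists.
        return (0 if exact else 1), None

    if slot >= len(sorted_keys):
        return 1, None
    found = sorted_keys[slot]
    # Accept the next item at or after the search key, as long as it still
    # belongs to the same objectid and type.
    if found.objectid != objectid or found.type != key_type:
        return 1, None
    return 0, found
```

Making both the path and the key optional lets one function serve callers that only need an existence check as well as callers that want the located key.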
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: expand btrfs_find_item() to include find_root_ref functionality · 75ac2dd9
      Committed by Kelley Nielsen
      This patch is the second step in bootstrapping the btrfs_find_item
      interface. btrfs_find_root_ref() is similar to the former
      __inode_info(); it accepts four of its parameters, and duplicates the
      first half of its functionality.
      
      Replace the one former call to btrfs_find_root_ref() with a call to
      btrfs_find_item(), along with the defined key type that was used
      internally by btrfs_find_root_ref(), and a null found key. In
      btrfs_find_item(), add a test for the null key at the place where
      the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
      then returns if the test passes. Finally, remove btrfs_find_root_ref().
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Suggested-by: Zach Brown <zab@redhat.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: bootstrap generic btrfs_find_item interface · e33d5c3d
      Committed by Kelley Nielsen
      There are many btrfs functions that manually search the tree for an
      item. They all reimplement the same mechanism and differ in the
      conditions that they use to find the item. __inode_info() is one such
      example. Zach Brown proposed creating a new interface to take the place
      of these functions.
      
      This patch is the first step to creating the interface. A new function,
      btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
      It is identical to __inode_info, except that the order of the parameters
      has been rearranged to more closely match those of similar functions elsewhere
      in the code (now, root and path come first, then the objectid, offset
      and type, and the key to be filled in last). __inode_info's callers have
      been set to call this new function instead, and __inode_info itself has
      been removed.
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Suggested-by: Zach Brown <zab@redhat.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: incompatible format change to remove hole extents · 16e7549f
      Committed by Josef Bacik
      Btrfs has always had these filler extent data items for holes in inodes.  This
      has made some things very easy, like logging hole punches and sending hole
      punches.  However, for large holey files these extent data items are pure
      overhead.  So add an incompatible feature to no longer add hole extents, which
      reduces the amount of metadata used by these sorts of files.  This requires a
      few changes to logging and send, since they now need to detect holes and
      log/send them if there are any.  I've tested this thoroughly with xfstests
      and it doesn't cause any issues with or without the incompat format set.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 12 Nov, 2013 8 commits
  3. 21 Sep, 2013 1 commit
  4. 01 Sep, 2013 9 commits
    • Btrfs: optimize key searches in btrfs_search_slot · d7396f07
      Committed by Filipe David Borba Manana
      When the binary search returns 0 (exact match), the target key
      will necessarily be at slot 0 of all nodes below the current one,
      so in this case the binary search is not needed because it will
      always return 0, and we waste time doing it, holding node locks
      for longer than necessary, etc.
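
The property relied on here can be demonstrated with a small in-memory tree. The sketch assumes the layout in which each key in an internal node equals the first key of the child it points to. `Node` and `search` are hypothetical names, and the code only counts binary searches; it does not model locking.

```python
import bisect


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted; keys[i] is the first key of children[i]
        self.children = children  # None for a leaf


def search(root, key):
    """Return (leaf, slot, found, number_of_binary_searches)."""
    node = root
    exact = False
    searches = 0
    while True:
        if exact:
            # An exact match higher up guarantees the key sits at slot 0 of
            # every node below, so the binary search is skipped.
            slot = 0
        else:
            searches += 1
            slot = bisect.bisect_left(node.keys, key)
            exact = slot < len(node.keys) and node.keys[slot] == key
            if node.children is not None and not exact:
                # Descend into the child whose first key is <= key.
                slot = max(slot - 1, 0)
        if node.children is None:
            return node, slot, exact, searches
        node = node.children[slot]


# Three-level example tree.
leaf_a, leaf_b = Node([10, 20, 30]), Node([40, 50, 60])
leaf_c, leaf_d = Node([70, 80]), Node([90, 95])
mid1, mid2 = Node([10, 40], [leaf_a, leaf_b]), Node([70, 90], [leaf_c, leaf_d])
root = Node([10, 70], [mid1, mid2])
```

Searching for 70 matches exactly at the root, so only one binary search runs for the whole descent. Searching for 50 never matches in an internal node and needs one binary search per level.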
      
      Below follow histograms with the times spent on the current approach of
      doing a binary search when the previous binary search returned 0, and
      times for the new approach, which directly picks the first item/child
      node in the leaf/node.
      
      Current approach:
      
      Count: 6682
      Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
      Percentiles:  90th: 124.000; 95th: 145.000; 99th: 206.000
        35.000 -   61.080:  1235 ################
        61.080 -  106.053:  4207 #####################################################
       106.053 -  183.606:  1122 ##############
       183.606 -  317.341:   111 #
       317.341 -  547.959:     6 |
       547.959 - 8370.000:     1 |
      
      Approach proposed by this patch:
      
      Count: 6682
      Range:  6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev:  7.160
      Percentiles:  90th: 23.000; 95th: 27.000; 99th: 40.000
         6.000 -    8.418:    58 #
         8.418 -   11.670:  1149 #########################
        11.670 -   16.046:  2418 #####################################################
        16.046 -   21.934:  2098 ##############################################
        21.934 -   29.854:   744 ################
        29.854 -   40.511:   154 ###
        40.511 -   54.848:    41 #
        54.848 -   74.136:     5 |
        74.136 -  100.087:     9 |
       100.087 -  135.000:     6 |
      
      These samples were captured during a run of the btrfs tests 001, 002 and
      004 in the xfstests, with a leaf/node size of 4Kb.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long · b308bc2f
      Committed by Geert Uytterhoeven
      Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
      casts it to a pointer, while all callers cast it to unsigned long again.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: Make btrfs_header_fsid() return unsigned long · fba6aa75
      Committed by Geert Uytterhoeven
      Internally, btrfs_header_fsid() calculates an unsigned long, but casts
      it to a pointer, while all callers cast it to unsigned long again.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: Remove superfluous casts from u64 to unsigned long long · c1c9ff7c
      Committed by Geert Uytterhoeven
      u64 is "unsigned long long" on all architectures now, so there's no need to
      cast it when formatting it using the "ll" length modifier.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: get rid of sparse warnings · 35a3621b
      Committed by Stefan Behrens
      make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
      
      I tried to filter out the warnings for which patches have already
      been sent to the mailing list, pending for inclusion in btrfs-next.
      
      All these changes should be obviously safe.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix send issues related to inode number reuse · ba5e8f2e
      Committed by Josef Bacik
      If you are sending a snapshot and specifying a parent snapshot we will walk the
      trees and figure out where they differ and send the differences only.  We
      check for differences by testing whether the leaves are the same and whether
      the keys within the leaves are the same.  So if the leaves differ (i.e. the leaf
      has been cow'ed from the parent snapshot) we walk each item in the send root and
      check it against the parent root.  If the items match exactly then we don't do
      anything.  This doesn't quite work for inode refs, since they will just have the
      name and the parent objectid.  If you move the file from a directory and then
      remove that directory and re-create a directory with the same inode number as
      the old directory and then move that file back into that directory we will
      assume that nothing changed and you will get errors when you try to receive.
      
      In order to fix this we need to do extra checking to see if the inode ref really
      is the same or not.  So do this by passing down BTRFS_COMPARE_TREE_SAME if the
      items match.  Then if the key type is an inode ref we can do some extra
      checking, otherwise we just keep processing.  The extra checking is to look up
      the generation of the directory in the parent volume and compare it to the
      generation of the send volume.  If they match then they are the same directory
      and we are good to go.  If they don't we have to add them to the changed refs
      list.
      
      This means we have to track the generation of the ref we're trying to lookup
      when we iterate all the refs for a particular inode.  So in the case of looking
      for new refs we have to get the generation from the parent volume, and in the
      case of looking for deleted refs we have to get the generation from the send
      volume to compare with.
      
      There was also the issue of using a ulist to keep track of the directories we
      needed to check.  Because we can get a deleted ref and a new ref for the same
      inode number the ulist won't work since it indexes based on the value.  So
      instead just dup any directory ref we find and add it to a local list, and then
      process that list as normal and do away with using a ulist for this altogether.
      
      Before we would fail all of the tests in the far-progs that related to moving
      directories (test group 32).  With this patch we now pass these tests, and all
      of the tests in the far-progs send testing suite.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: stop using GFP_ATOMIC when allocating rewind ebs · 9ec72677
      Committed by Josef Bacik
      There is no reason we can't just set the path to blocking and then do normal
      GFP_NOFS allocations for these extent buffers.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: deal with enomem in the rewind path · db7f3436
      Committed by Josef Bacik
      We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
      tree mod log.  Instead of BUG_ON()'ing in this case pass up ENOMEM.  I looked
      back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
      in this path.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: stop using GFP_ATOMIC for the tree mod log allocations · c8cc6341
      Committed by Josef Bacik
      Previously we held the tree mod lock when adding stuff because we use it to
      check and see if we truly do want to track tree modifications.  This is
      admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
      hard and often is not nice.  So instead do our basic checks to see if we don't
      need to track modifications, and if those pass then do our allocation, and then
      when we go to insert the new modification check if we still care, and if we
      don't just free up our mod and return.  Otherwise we're good to go and we can
      carry on.  Thanks,
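
The check, allocate, re-check sequence described above is a general pattern and can be sketched outside the kernel. The following is an illustration only: `ModLog`, `trackers` and `make_entry` are hypothetical names, a `threading.Lock` stands in for the tree mod log lock, and building a dict stands in for an allocation that may sleep.

```python
import threading


class ModLog:
    def __init__(self):
        self.lock = threading.Lock()
        self.trackers = 0   # hypothetical count of users needing the log
        self.entries = []

    def record(self, payload, make_entry=lambda p: {"payload": p}):
        """Return True if the modification was logged, False if dropped."""
        # 1. Cheap check without the lock: bail out when nobody is tracking.
        if self.trackers == 0:
            return False

        # 2. Allocate outside the lock, so the allocation is free to block.
        entry = make_entry(payload)

        # 3. Re-check under the lock, because the answer may have changed
        #    while we were allocating.
        with self.lock:
            if self.trackers == 0:
                return False   # nobody cares any more; drop the entry
            self.entries.append(entry)
            return True
```

The price of this ordering is an occasional wasted allocation when tracking stops between steps 1 and 3. In exchange, the lock is never held across the allocation.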
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  5. 10 Aug, 2013 1 commit
  6. 02 Jul, 2013 2 commits
    • Btrfs: only do the tree_mod_log_free_eb if this is our last ref · 7fb7d76f
      Committed by Josef Bacik
      There is another bug in the tree mod log stuff in that we're calling
      tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
      is that if this block is shared by multiple snapshots we will call this multiple
      times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
      in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
      call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
      I no longer hit the panic in __tree_mod_log_rewind.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: hold the tree mod lock in __tree_mod_log_rewind · f1ca7e98
      Committed by Josef Bacik
      We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
      forward in the tree mod entries, otherwise we'll end up with random entries and
      trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
      people were seeing when running
      
      find /whatever -type f -exec btrfs fi defrag {} \;
      
      Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  7. 01 Jul, 2013 2 commits
    • Btrfs: optimize reada_for_balance · 0b08851f
      Committed by Josef Bacik
      This patch does two things.  First we no longer explicitly read in the blocks
      we're trying to readahead.  For things like balance_level we may never actually
      use the blocks so this just adds unneeded latency, and balance_level and
      split_node will both read in the blocks they care about explicitly so if the
      blocks need to be waited on it will be done there.  Secondly we no longer drop
      the path if we do readahead, we just set the path blocking before we call
      reada_for_balance() and then we're good to go.  Hopefully this will cut down on
      the number of re-searches.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: optimize read_block_for_search · bdf7c00e
      Committed by Josef Bacik
      This patch does two things.  First, it only does one call to
      btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
      again with gen specified.  The other thing is to call btrfs_read_buffer() on the
      buffer we've found instead of dropping it and then calling read_tree_block().
      This will keep us from doing yet another radix tree lookup for a buffer we've
      already found.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  8. 14 Jun, 2013 3 commits
  9. 28 May, 2013 1 commit
  10. 18 May, 2013 1 commit
    • Btrfs: handle running extent ops with skinny metadata · b1c79e09
      Committed by Josef Bacik
      Chris hit a bug where we weren't finding extent records when running extent ops.
      This is because we use the delayed_ref_head when running the extent op, which
      means we can't use the ->type checks to see if we are metadata.  We also lose
      the level of the metadata we are working on.  So to fix this we can just check
      the ->is_data section of the extent_op, and we can store the level of the buffer
      we were modifying in the extent_op.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  11. 07 May, 2013 6 commits
    • btrfs: make static code static & remove dead code · 48a3b636
      Committed by Eric Sandeen
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: separate sequence numbers for delayed ref tracking and tree mod log · fc36ed7e
      Committed by Jan Schmidt
      Sequence numbers for delayed refs have been introduced in the first version
      of the qgroup patch set. To solve the problem of find_all_roots on a busy
      file system, the tree mod log was introduced. The sequence numbers for that
      were simply shared between those two users.
      
      However, at one point in qgroup's quota accounting, there's a statement
      accessing the previous sequence number, that's still just doing (seq - 1)
      just as it would have to in the very first version.
      
      To satisfy that requirement, this patch makes the sequence number counter 64
      bit and splits it into a major part (used for qgroup sequence number
      counting) and a minor part (incremented for each tree modification in the
      log). This enables us to go exactly one major step backwards, as required
      for qgroups, while still incrementing the sequence counter for tree mod log
      insertions to keep track of their order. Keeping them in a single variable
      means there's no need to change all the code dealing with comparisons of two
      sequence numbers.
      
      The sequence number is reset to 0 on commit (not new in this patch), which
      ensures we won't overflow the two 32 bit counters.
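
A minimal sketch of such a split counter follows. It assumes the major part occupies the upper 32 bits and the minor part the lower 32 bits of one 64-bit value; the helper names are hypothetical and atomicity is not modelled.

```python
MINOR_BITS = 32
MINOR_MASK = (1 << MINOR_BITS) - 1
U64_MASK = (1 << 64) - 1


def inc_minor(seq):
    """One more tree modification logged: bump the low half."""
    return (seq + 1) & U64_MASK


def inc_major(seq):
    """New major step: bump the high half and clear the low half."""
    return ((seq & ~MINOR_MASK) + (1 << MINOR_BITS)) & U64_MASK


def prev_major(seq):
    """Go exactly one major step back, ignoring the minor count."""
    return (seq & ~MINOR_MASK) - (1 << MINOR_BITS)


def split(seq):
    """Return (major, minor)."""
    return seq >> MINOR_BITS, seq & MINOR_MASK
```

Because both parts live in one integer, comparing two sequence numbers with ordinary `<` and `>` still orders them correctly, first by major and then by minor.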
      
      Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
      from the tree mod log code may happen.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix all callers of read_tree_block · 416bc658
      Committed by Josef Bacik
      We kept leaking extent buffers when mounting a broken file system and it turns
      out it's because not everybody uses read_tree_block properly.  You need to check
      and make sure the extent_buffer is uptodate before you use it.  This patch fixes
      everybody who calls read_tree_block directly to make sure they check that it is
      uptodate and free it and return an error if it is not.  With this we no longer
      leak EB's when things go horribly wrong.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused argument of btrfs_extend_item() · 4b90c680
      Committed by Tsutomu Itoh
      Argument 'trans' is not used in btrfs_extend_item().
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup of function where fixup_low_keys() is called · afe5fea7
      Committed by Tsutomu Itoh
      Remove the 'trans' argument from the functions that call
      fixup_low_keys() wherever it is no longer needed.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused argument of fixup_low_keys() · d6a0a126
      Committed by Tsutomu Itoh
      Argument 'trans' is not used in fixup_low_keys(). So, remove it.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>