1. 30 5月, 2012 4 次提交
  2. 26 5月, 2012 1 次提交
  3. 06 5月, 2012 1 次提交
    • C
      Btrfs: avoid sleeping in verify_parent_transid while atomic · b9fab919
      Chris Mason 提交于
      verify_parent_transid needs to lock the extent range to make
      sure no IO is underway, and so it can safely clear the
      uptodate bits if our checks fail.
      
      But, a few callers are using it with spinlocks held.  Most
      of the time, the generation numbers are going to match, and
      we don't want to switch to a blocking lock just for the error
      case.  This adds an atomic flag to verify_parent_transid,
      and changes it to return EAGAIN if it needs to block to
      properly verifiy things.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9fab919
  4. 05 5月, 2012 1 次提交
  5. 27 3月, 2012 3 次提交
    • C
      Btrfs: adjust the write_lock_level as we unlock · f7c79f30
      Chris Mason 提交于
      btrfs_search_slot sometimes needs write locks on high levels of
      the tree.  It remembers the highest level that needs a write lock
      and will use that for all future searches through the tree in a given
      call.
      
      But, very often we'll just cow the top level or the level below and we
      won't really need write locks on the root again after that.  This patch
      changes things to adjust the write lock requirement as it unlocks
      levels.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f7c79f30
    • C
      Btrfs: add the ability to cache a pointer into the eb · cfed81a0
      Chris Mason 提交于
      This cuts down on the CPU time used by map_private_extent_buffer
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cfed81a0
    • J
      Btrfs: introduce free_extent_buffer_stale · 3083ee2e
      Josef Bacik 提交于
      Because btrfs cow's we can end up with extent buffers that are no longer
      necessary just sitting around in memory.  So instead of evicting these pages, we
      could end up evicting things we actually care about.  Thus we have
      free_extent_buffer_stale for use when we are freeing tree blocks.  This will
      make it so that the ref for the eb being in the radix tree is dropped as soon as
      possible and then is freed when the refcount hits 0 instead of waiting to be
      released by releasepage.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3083ee2e
  6. 22 3月, 2012 6 次提交
  7. 22 12月, 2011 1 次提交
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen 提交于
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      66d7e7f0
  8. 15 11月, 2011 1 次提交
    • L
      Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush · f1ebcc74
      Liu Bo 提交于
      The btrfs snapshotting code requires that once a root has been
      snapshotted, we don't change it during a commit.
      
      But there are two cases to lead to tree corruptions:
      
      1) multi-thread snapshots can commit serveral snapshots in a transaction,
         and this may change the src root when processing the following pending
         snapshots, which lead to the former snapshots corruptions;
      
      2) the free inode cache was changing the roots when it root the cache,
         which lead to corruptions.
      
      This fixes things by making sure we force COW the block after we create a
      snapshot during commiting a transaction, then any changes to the roots
      will result in COW, and we get all the fs roots and snapshot roots to be
      consistent.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f1ebcc74
  9. 21 10月, 2011 1 次提交
  10. 28 7月, 2011 3 次提交
    • C
      Btrfs: remove lockdep magic from btrfs_next_leaf · 31533fb2
      Chris Mason 提交于
      Before the reader/writer locks, btrfs_next_leaf needed to keep
      the path blocking to avoid making lockdep upset.
      
      Now that btrfs_next_leaf only takes read locks, this isn't required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      31533fb2
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason 提交于
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd681513
    • C
      Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason 提交于
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6591715
  11. 10 6月, 2011 1 次提交
    • J
      Btrfs: don't map extent buffer if path->skip_locking is set · ad3e34bb
      Josef Bacik 提交于
      Arne's scrub stuff exposed a problem with mapping the extent buffer in
      reada_for_search.  He searches the commit root with multiple threads and with
      skip_locking set, so we can race and overwrite node->map_token since node isn't
      locked.  So fix this so that we only map the extent buffer if we don't already
      have a map_token and skip_locking isn't set.  Without this patch scrub would
      panic almost immediately, with the patch it doesn't panic anymore.  Thanks,
      Reported-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      ad3e34bb
  12. 09 6月, 2011 1 次提交
    • J
      Btrfs: don't map extent buffer if path->skip_locking is set · 25b8b936
      Josef Bacik 提交于
      Arne's scrub stuff exposed a problem with mapping the extent buffer in
      reada_for_search.  He searches the commit root with multiple threads and with
      skip_locking set, so we can race and overwrite node->map_token since node isn't
      locked.  So fix this so that we only map the extent buffer if we don't already
      have a map_token and skip_locking isn't set.  Without this patch scrub would
      panic almost immediately, with the patch it doesn't panic anymore.  Thanks,
      Reported-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      25b8b936
  13. 24 5月, 2011 4 次提交
  14. 21 5月, 2011 1 次提交
    • M
      btrfs: implement delayed inode items operation · 16cdcec7
      Miao Xie 提交于
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which is spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix deadlock between readdir() and memory fault, which is reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
        inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task which does delayed items
        balance, which is reported by Tsutomu Itoh.
      - Modify the patch address David Sterba's comment.
      - Fix the bug of the cpu recursion spinlock, reported by Chris Mason
      
      Changelog V1 -> V2:
      - break up the global rb-tree, use a list to manage the delayed nodes,
        which is created for every directory and file, and used to manage the
        delayed directory name index items and the delayed inode item.
      - introduce a worker to deal with the delayed nodes.
      
      Compare with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
      such as inode item, directory name item, directory name index and so on.
      
      If we can do some delayed b+ tree insertion or deletion, we can improve the
      performance, so we made this patch which implemented delayed directory name
      index insertion/deletion and delayed inode update.
      
      Implementation:
      - introduce a delayed root object into the filesystem, that use two lists to
        manage the delayed nodes which are created for every file/directory.
        One is used to manage all the delayed nodes that have delayed items. And the
        other is used to manage the delayed nodes which is waiting to be dealt with
        by the work thread.
      - Every delayed node has two rb-tree, one is used to manage the directory name
        index which is going to be inserted into b+ tree, and the other is used to
        manage the directory name index which is going to be deleted from b+ tree.
      - introduce a worker to deal with the delayed operation. This worker is used
        to deal with the works of the delayed directory name index items insertion
        and deletion and the delayed inode update.
        When the delayed items is beyond the lower limit, we create works for some
        delayed nodes and insert them into the work queue of the worker, and then
        go back.
        When the delayed items is beyond the upper bound, we create works for all
        the delayed nodes that haven't been dealt with, and insert them into the work
        queue of the worker, and then wait for that the untreated items is below some
        threshold value.
      - When we want to insert a directory name index into b+ tree, we just add the
        information into the delayed inserting rb-tree.
        And then we check the number of the delayed items and do delayed items
        balance. (The balance policy is above.)
      - When we want to delete a directory name index from the b+ tree, we search it
        in the inserting rb-tree at first. If we look it up, just drop it. If not,
        add the key of it into the delayed deleting rb-tree.
        Similar to the delayed inserting rb-tree, we also check the number of the
        delayed items and do delayed items balance.
        (The same to inserting manipulation)
      - When we want to update the metadata of some inode, we cached the data of the
        inode into the delayed node. the worker will flush it into the b+ tree after
        dealing with the delayed insertion and deletion.
      - We will move the delayed node to the tail of the list after we access the
        delayed node, By this way, we can cache more delayed items and merge more
        inode updates.
      - If we want to commit transaction, we will deal with all the delayed node.
      - the delayed node will be freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
      
      I did a quick test by the benchmark tool[1] and found we can improve the
      performance of file creation by ~15%, and file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dave@jikos.cz>
      Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      16cdcec7
  15. 02 5月, 2011 3 次提交
  16. 28 3月, 2011 4 次提交
    • T
      Btrfs: check return value of read_tree_block() · 97d9a8a4
      Tsutomu Itoh 提交于
      This patch is checking return value of read_tree_block(),
      and if it is NULL, error processing.
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      97d9a8a4
    • T
      Btrfs: cleanup some BUG_ON() · db5b493a
      Tsutomu Itoh 提交于
      This patch changes some BUG_ON() to the error return.
      (but, most callers still use BUG_ON())
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      db5b493a
    • L
      Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      liubo 提交于
      Tracepoints can provide insight into why btrfs hits bugs and be greatly
      helpful for debugging, e.g
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordere_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both read and write cases, and it is useful for tracking
      how btrfs specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are cirtical resourses and produce a lot of corner cases during writeback,
      so it is valuable to know how page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These can show where and when a inode is created, when a inode is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In transaction based filesystem, it will be useful to know the generation and
      who does commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references, these tracepoints are helpful on
      understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      Chunk is a link between physical offset and logical offset, and stands for space
      infomation in btrfs, and these are helpful on tracing space things.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1abe9b8a
    • C
      Btrfs: use RCU instead of a spinlock to protect the root node · 240f62c8
      Chris Mason 提交于
      The pointer to the extent buffer for the root of each tree
      is protected by a spinlock so that we can safely read the pointer
      and take a reference on the extent buffer.
      
      But now that the extent buffers are freed via RCU, we can safely
      use rcu_read_lock instead.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      240f62c8
  17. 18 3月, 2011 1 次提交
    • J
      Btrfs: check items for correctness as we search · a826d6dc
      Josef Bacik 提交于
      Currently if we have corrupted items things will blow up in spectacular ways.
      So as we read in blocks and they are leaves, check the entire leaf to make sure
      all of the items are correct and point to valid parts in the leaf for the item
      data the are responsible for.  If the item is corrupt we will kick back EIO and
      not read any of the copies since they are likely to not be correct either.  This
      will catch generic corruptions, it will be up to the individual callers of
      btrfs_search_slot to make sure their items are right.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      a826d6dc
  18. 17 1月, 2011 2 次提交
  19. 30 10月, 2010 1 次提交