1. 06 Jan, 2009 1 commit
    • L
      Btrfs: Fix free block discard calls down to the block layer · 1f3c79a2
      Liu Hui committed
      This patch fixes the discard semantics so that Btrfs works with FTLs and SSDs.
      We can improve an FTL's performance by telling it which sectors the file
      system has freed. But if we do not tell the FTL about free sectors at the
      proper time, the transaction mechanism of Btrfs breaks and Btrfs cannot
      roll back to the previous transaction after a power loss.
      
      There are some problems in the old implementation:
      1, In __free_extent(), the pinned down extents should not be discarded.
      2, In free_extents(), the free extents are all pinned, so they need to
      be discarded at transaction commit time instead of in free_extents().
      3, The reserved extent used by the log tree should be discarded too.
      
      This patch changes the discard behavior as follows:
      1, For the extents which need to be freed at once,
         we discard them in update_block_group().
      2, Delay discarding the pinned extents until btrfs_finish_extent_commit()
         when committing the transaction.
      3, Remove discarding from free_extents() and __free_extent().
      4, Add a discard interface to btrfs_free_reserved_extent().
      5, Discard sectors before updating the free space cache; otherwise,
         the FTL will destroy file system data.
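A minimal sketch of the ordering this commit describes, written in Python rather than kernel C. It is a toy model, not the Btrfs code: `ToyAllocator`, `free_now`, `free_pinned` and `commit` are hypothetical names, and the `log` list stands in for calls down to the block layer.

```python
class ToyAllocator:
    """Toy model: a discard is always issued before space is reusable."""

    def __init__(self):
        self.free_cache = []  # extents available for reallocation
        self.pinned = []      # extents freed in the running transaction
        self.log = []         # ordered record of side effects

    def _discard(self, extent):
        # Stand-in for telling the device that the sectors are unused.
        self.log.append(("discard", extent))

    def free_now(self, extent):
        # Extent may be reused at once: discard first, then publish it.
        self._discard(extent)
        self.free_cache.append(extent)
        self.log.append(("free", extent))

    def free_pinned(self, extent):
        # Extent may still be needed for rollback: no discard yet.
        self.pinned.append(extent)
        self.log.append(("pin", extent))

    def commit(self):
        # The transaction is durable, so pinned extents can be discarded.
        for extent in self.pinned:
            self._discard(extent)
            self.free_cache.append(extent)
            self.log.append(("free", extent))
        self.pinned.clear()


alloc = ToyAllocator()
alloc.free_pinned((0, 4096))
alloc.free_now((8192, 4096))
alloc.commit()
```

In this model a pinned extent is never discarded before `commit()`, which is the property that lets the previous transaction survive a power loss.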
  2. 19 Dec, 2008 2 commits
  3. 17 Dec, 2008 1 commit
    • C
      Btrfs: delete checksum items before marking blocks free · dcbdd4dc
      Chris Mason committed
      Btrfs maintains a cache of blocks available for allocation in ram.  The
      code that frees extents was marking the extents free and then deleting
      the checksum items.
      
      This meant it was possible the extent would be reallocated before the
      checksum item was actually deleted, leading to races and other
      problems as the checksums were updated for the newly allocated extent.
      
      The fix is to delete the checksum before marking the extent free.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 16 Dec, 2008 1 commit
  5. 12 Dec, 2008 3 commits
    • Y
      Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng committed
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch makes the following changes so that data relocation
      works.
      
      1) Make the nodatasum/nodatacow mount options affect only new
      files. Checksums and COW on data are only controlled by the
      inode flags.
      
      2) Check for the existence of checksums in the nodatacow checker.
      If checksums exist, force COW of the data extent. This ensures that
      the checksum for a given block is either valid or does not exist.
      
      3) Update the data relocation code to properly handle
      missing checksums.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Y
      Btrfs: shared seed device · e4404d6e
      Yan Zheng committed
      This patch allows a seed device to be shared by
      multiple mounted file systems. The sharing is achieved
      by cloning the seed device's btrfs_fs_devices structure.
      Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Y
      Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng committed
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
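The idea in this commit (an inode stores a block group's starting offset instead of a pointer, and the block group struct is reference counted) can be sketched as follows. This is an illustration under assumptions, not the kernel implementation: `BlockGroup`, `BlockGroupCache`, `lookup`, `put` and `remove` are hypothetical names.

```python
class BlockGroup:
    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.refs = 1  # reference held by the cache itself


class BlockGroupCache:
    def __init__(self):
        self.groups = {}  # starting byte offset -> BlockGroup

    def add(self, group):
        self.groups[group.start] = group

    def lookup(self, start):
        """Return the group with a reference taken, or None if it is gone."""
        group = self.groups.get(start)
        if group is not None:
            group.refs += 1
        return group

    def put(self, group):
        """Drop one reference; True means the struct may now be freed."""
        group.refs -= 1
        return group.refs == 0

    def remove(self, start):
        """Drop the cache's own reference, e.g. once balancing is done."""
        return self.put(self.groups.pop(start))


class Inode:
    def __init__(self, block_group_start):
        # An offset cannot dangle the way a raw pointer can.
        self.block_group = block_group_start


cache = BlockGroupCache()
cache.add(BlockGroup(0, 1 << 30))
inode = Inode(0)

held = cache.lookup(inode.block_group)    # a user takes a reference
freed_on_remove = cache.remove(0)         # balance removes the group
stale = cache.lookup(inode.block_group)   # the inode's hint is now stale
freed_on_put = cache.put(held)            # the last user lets go
```

The group is only reported as freeable once the last holder calls `put`, so it is neither freed while in use nor leaked.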
  6. 11 Dec, 2008 1 commit
  7. 10 Dec, 2008 1 commit
    • C
      Btrfs: Delete csum items when freeing extents · 459931ec
      Chris Mason committed
      This finishes off the new checksumming code by removing csum items
      for extents that are no longer in use.
      
      The trick is doing it without racing because a single csum item may
      hold csums for more than one extent.  Extra checks are added to
      btrfs_csum_file_blocks to make sure that we are using the correct
      csum item after dropping locks.
      
      A new btrfs_split_item is added to split a single csum item so it
      can be split without dropping the leaf lock.  This is used to
      remove csum bytes from the middle of an item.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
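Removing csum bytes from the middle of an item leaves two items behind, which is why a split operation is needed. A small sketch of that effect, assuming one checksum per block and block-aligned ranges; `remove_csum_range` and the `(start, csums)` tuple layout are hypothetical, not the on-disk format.

```python
def remove_csum_range(item_start, csums, del_start, del_len, blocksize=4096):
    """Drop the csums covering [del_start, del_start + del_len).

    Returns the list of (start, csums) items that remain: zero, one,
    or two items (two when the deleted range is in the middle).
    """
    count = len(csums)
    first = (del_start - item_start) // blocksize
    last = (del_start + del_len - item_start) // blocksize
    # Clamp to the item so non-overlapping ranges leave it untouched.
    first = min(count, max(0, first))
    last = min(count, max(first, last))

    remaining = []
    if first > 0:
        remaining.append((item_start, csums[:first]))
    if last < count:
        remaining.append((item_start + last * blocksize, csums[last:]))
    return remaining


item = (0, [10, 11, 12, 13, 14])  # five blocks starting at byte 0
after = remove_csum_range(item[0], item[1], del_start=4096, del_len=8192)
```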
  8. 09 Dec, 2008 1 commit
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng committed
      This patch implements superblock duplication. Superblocks
      are stored at offsets 16K, 64M and 256G on every device.
      Space used by superblocks is reserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
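The three offsets named in this commit message are each 4096 times the previous one, so they can be generated by shifting 16K left by 12 bits per copy. The sketch below works on physical device offsets only; the reverse mapping to logical addresses is not modelled. All names and the 4096-byte superblock size are assumptions made for illustration.

```python
SUPER_COPIES = 3         # three copies, per the commit message
SUPER_BASE = 16 * 1024   # first copy at 16K
SUPER_SHIFT = 12         # each further copy is 4096 times further out
SUPER_SIZE = 4096        # assumed size of one superblock (illustrative)


def sb_offset(copy):
    """Physical byte offset of superblock copy 0, 1 or 2."""
    return SUPER_BASE << (SUPER_SHIFT * copy)


def sb_offsets_for_device(device_bytes):
    """Copies that fit entirely on a device of the given size."""
    return [sb_offset(c) for c in range(SUPER_COPIES)
            if sb_offset(c) + SUPER_SIZE <= device_bytes]


def overlaps_superblock(start, length):
    """True if [start, start + length) touches any superblock copy."""
    return any(start < sb_offset(c) + SUPER_SIZE and sb_offset(c) < start + length
               for c in range(SUPER_COPIES))


offsets = [sb_offset(c) for c in range(SUPER_COPIES)]
```

An allocator could use a check like `overlaps_superblock` to skip ranges that would overwrite a copy.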
  9. 02 Dec, 2008 1 commit
  10. 21 Nov, 2008 1 commit
  11. 20 Nov, 2008 3 commits
    • C
      Btrfs: compat code fixes · 4b4e25f2
      Chris Mason committed
      The btrfs git kernel tree is used to build a standalone tree for
      compiling against older kernels.  This commit makes the standalone tree
      work with 2.6.27.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: Fixes for 2.6.28-rc API changes · 15916de8
      Chris Mason committed
      * open/close_bdev_excl -> open/close_bdev_exclusive
      * blkdev_issue_discard takes a GFP mask now
      * Fix blkdev_issue_discard usage now that it is enabled
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: fix free space accounting when unpinning extents · 07103a3c
      Josef Bacik committed
      This patch fixes what I hope is the last early ENOSPC bug left.  I did not know
      that pinned extents would merge into one big extent when inserted into the
      pinned extent tree, so I was adding free space that could span multiple
      block groups to a single block group.
      
      This is a big issue because first that space doesn't exist in that block group,
      and second we won't actually use that space because there are a bunch of other
      checks to make sure we're allocating within the constraints of the block group.
      
      This patch fixes the problem by adding the btrfs_add_free_space to
      btrfs_update_pinned_extents which makes sure we are adding the appropriate
      amount of free space to the appropriate block group.  Thanks much to Lee Trager
      for running my myriad of debug patches to help me track this problem down.
      Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
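The bug described here comes from crediting a merged extent to one block group even when it crosses block group boundaries. A sketch of the per-group accounting, assuming block groups are given as a sorted, non-overlapping list of `(start, length)` pairs; `split_by_block_group` is a hypothetical helper, not a kernel function.

```python
def split_by_block_group(start, length, block_groups):
    """Split [start, start + length) at block group boundaries.

    Returns a list of (group_start, piece_start, piece_len), one entry
    per block group that the range touches.
    """
    end = start + length
    pieces = []
    for group_start, group_len in block_groups:
        lo = max(start, group_start)
        hi = min(end, group_start + group_len)
        if lo < hi:
            pieces.append((group_start, lo, hi - lo))
    return pieces


groups = [(0, 100), (100, 100), (200, 100)]
pieces = split_by_block_group(50, 200, groups)  # a merged pinned extent
```

Each piece is then added as free space to its own group, so no group is credited with space it does not contain.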
  12. 19 Nov, 2008 1 commit
    • L
      Btrfs: Some fixes for batching extent insert. · b4eec2ca
      Liu Hui committed
      In insert_extents(), when ret==1 and last is not zero, it should
      check whether the currently inserted item is the last item in this batch
      of inserts. If so, it should just break out of the loop. If not, 'cur =
      insert_list->next' will make no sense because the list is empty now,
      and 'op' will point to an unpredictable place.
      
      There are also some trivial fixes in this patch, including one comment
      typo and the deletion of two redundant lines.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  13. 18 Nov, 2008 3 commits
    • J
      Btrfs: Add some debugging around the ENOSPC bugs · 4ce4cb52
      Josef Bacik committed
      Some people are still reporting problems with early enospc.  This
      will help narrow down the cause.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: fix free space leak · e3e469f8
      Josef Bacik committed
      In my batch delete/update/insert patch I introduced a free space leak.  The
      extent that we do the original search on in free_extents is never pinned, so we
      always update the block saying that it has free space, but the free space never
      actually gets added to the free space tree, since op->del will always be 0 and
      it's never actually added to the pinned extents tree.
      
      This patch fixes this problem by making sure we call pin_down_bytes on the
      pending extent op and set op->del to the return value of pin_down_bytes so
      update_block_group is called with the right value.  This seems to fix the case
      where we were getting ENOSPC when there was plenty of space available.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Y
      Btrfs: Seed device support · 2b82032c
      Yan Zheng committed
      A seed device is a special btrfs file system that has the SEEDING
      super flag set and can only be mounted in read-only mode. Seed
      devices allow people to create new btrfs file systems on top of them.
      
      The new FS contains the same contents as the seed device,
      but it can be mounted in read-write mode.
      
      This patch does the following:
      
      1) Split the code in btrfs_alloc_chunk into two parts. The first part makes
      the newly allocated chunk usable, but does not do any operation that modifies
      the chunk tree. The second part does the chunk tree modifications. This
      division is for the bootstrap step of adding storage to the seed device.
      
      2) Update the device management code to handle seed devices.
      The basic idea is: for an FS grown from seed devices, its
      seed devices are put into a list. Seed devices are
      opened on demand at mount time. If any seed device is
      missing or has been changed, the btrfs kernel module will
      refuse to mount the FS.
      
      3) Make btrfs_find_block_group not return NULL when all
      block groups are read-only.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  14. 13 Nov, 2008 3 commits
    • Y
      Btrfs: mount ro and remount support · c146afad
      Yan Zheng committed
      This patch adds mount ro and remount support. The main
      changes in this patch are: adding btrfs_remount and related
      helper functions; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating the
      allocator to properly handle read-only block groups.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • J
      Btrfs: batch extent inserts/updates/deletions on the extent root · f3465ca4
      Josef Bacik committed
      While profiling the allocator I noticed a good amount of time was being spent in
      finish_current_insert and del_pending_extents, and as the filesystem filled up
      more and more time was being spent in those functions.  This patch aims to
      reduce that problem in two ways:
      
      1) track if we tried to delete an extent that we are going to update or insert.
      Once we get into finish_current_insert we discard any of the extents that were
      marked for deletion.  This saves us from doing unnecessary work almost every
      time finish_current_insert runs.
      
      2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
      each individual extent and doing the needed operation, we instead keep the leaf
      around and see if there is anything else we can do on that leaf.  On the insert
      case I introduced a btrfs_insert_some_items, which will take an array of keys
      with an array of data_sizes and try and squeeze in as many of those keys as
      possible, and then return how many keys it was able to insert.  In the update
      case we search for an extent ref, update the ref and then loop through the leaf
      to see if any of the other refs we are looking to update are on that leaf, and
      then once we are done we release the path and search for the next ref we need to
      update.  And finally for the deletion we try and delete the extent+ref in pairs,
      so we will try to find extent+ref pairs next to the extent we are trying to free
      and free them in bulk if possible.
      
      This along with the other cluster fix that Chris pushed out a bit ago helps make
      the allocator perform more uniformly as it fills up the disk.  There is still a
      slight drop as we fill up the disk since we start having to stick new blocks in
      odd places which results in more COWs than on an empty fs, but the drop is not
      nearly as severe as it was before.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
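The contract described for btrfs_insert_some_items (take arrays of keys and data sizes, insert as many as fit, return the count) can be illustrated with a toy leaf. This is a sketch under assumptions: the functions below only model space accounting, the 25-byte per-item overhead and the leaf sizes are illustrative values, and every visited leaf is treated as empty.

```python
def insert_some_items(leaf_free_space, data_sizes, item_overhead=25):
    """Return how many leading items fit into the leaf's free space."""
    used = 0
    inserted = 0
    for size in data_sizes:
        need = size + item_overhead
        if used + need > leaf_free_space:
            break
        used += need
        inserted += 1
    return inserted


def insert_all(data_sizes, leaf_size=4096):
    """Insert every item in batches; return the number of leaf searches."""
    searches = 0
    pos = 0
    while pos < len(data_sizes):
        searches += 1  # one tree search per batch, not per item
        count = insert_some_items(leaf_size, data_sizes[pos:])
        if count == 0:
            raise ValueError("item does not fit in an empty leaf")
        pos += count
    return searches


fit = insert_some_items(100, [25, 25, 25])
batched_searches = insert_all([75] * 10, leaf_size=400)
```

With one search per item the second call would need ten searches; batching reduces that to three in this toy setup.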
    • C
      Btrfs: Fix handling of space info full during allocations · 2ed6d664
      Chris Mason committed
      When we fail to allocate a new block group, we should still do the
      checks to make sure allocations try again with the minimum requested
      allocation size.
      
      This also fixes a deadlock that comes from a missed down_read in
      the chunk allocation failure handling.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  15. 11 Nov, 2008 2 commits
    • C
      Btrfs: empty_size allocation fixes again · 8a1413a2
      Chris Mason committed
      The allocator wasn't catching all of the cases where it needed to do
      extra loops because the check to enforce them wasn't happening early
      enough.
      
      When the allocator decided to increase the size of the allocation
      for metadata clustering, it wasn't always setting the empty_size to
      include the extra (optional) bytes.  This also fixes the empty_size field
      to be correct.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: Try harder while searching for free space · f5a31e16
      Chris Mason committed
      The loop searching for free space would exit out too soon when
      metadata clustering was trying to allocate a large extent.  This makes
      sure a full scan of the free space is done searching for only the
      minimum extent size requested by the higher layers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  16. 10 Nov, 2008 1 commit
  17. 08 Nov, 2008 1 commit
  18. 07 Nov, 2008 3 commits
    • C
      Btfs: More metadata allocator optimizations · 4366211c
      Chris Mason committed
      This lowers the empty cluster target for metadata allocations.  The lower
      target makes it easier to do allocations and still seems to perform well.
      
      It also fixes the allocator loop to drop the empty cluster when things
      start getting difficult, avoiding false enospc warnings.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: enforce metadata allocation clustering · 3b7885bf
      Chris Mason committed
      The allocator uses the last allocation as a starting point for metadata
      allocations, and tries to allocate in clusters of at least 256k.
      
      If the search for a free block fails to find the expected block, this patch
      forces a new cluster to be found in the free list.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: Optimize compressed writeback and reads · 771ed689
      Chris Mason committed
      When reading compressed extents, try to put pages into the page cache
      for any pages covered by the compressed extent that readpages didn't already
      preload.
      
      Add an async work queue to handle transformations at delayed allocation processing
      time.  Right now this is just compression.  The workflow is:
      
      1) Find offsets in the file marked for delayed allocation
      2) Lock the pages
      3) Lock the state bits
      4) Call the async delalloc code
      
      The async delalloc code clears the state lock bits and delalloc bits.  It is
      important this happens before the range goes into the work queue because
      otherwise it might deadlock with other work queue items that try to lock
      those extent bits.
      
      The file pages are compressed, and if the compression doesn't work the
      pages are written back directly.
      
      An ordered work queue is used to make sure the inodes are written in the same
      order that pdflush or writepages sent them down.
      
      This changes extent_write_cache_pages to let the writepage function
      update the wbc nr_written count.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
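The ordered work queue idea (do the expensive transformation in parallel, but hand results on in the order the work was submitted) can be shown in user space with a thread pool. This is only an analogy for the kernel mechanism: `compress_ranges_in_order` is a hypothetical helper, and zlib stands in for the delalloc-time transformation.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_ranges_in_order(ranges, workers=4):
    """Compress byte ranges in parallel; return results in submit order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(zlib.compress, data) for data in ranges]
        # Waiting on the futures in submission order gives the ordering
        # guarantee, regardless of which worker finishes first.
        return [future.result() for future in futures]


ranges = [b"a" * 100000, b"b" * 10, b"c" * 5000]
compressed = compress_ranges_in_order(ranges)
```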
  19. 31 Oct, 2008 3 commits
    • Y
      Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng committed
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Y
      Btrfs: update nodatacow code v2 · 80ff3856
      Yan Zheng committed
      This patch simplifies the nodatacow checker. If all references
      were created after the latest snapshot, then we can avoid COW
      safely. This patch also updates run_delalloc_nocow to do more
      fine-grained checking.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
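The rule stated in this commit, that COW can be skipped when every reference to an extent was created after the latest snapshot, reduces to a comparison of generation numbers. A sketch under assumptions: `can_nocow` is a hypothetical function, "created after" is read as a strictly greater generation, and the `has_csum` parameter reflects the later nodatasum fix in this log (17d217fe), which forces COW when checksums exist.

```python
def can_nocow(ref_generations, last_snapshot, has_csum=False):
    """Decide whether an extent may be overwritten in place.

    ref_generations: generation in which each reference was created.
    last_snapshot:   generation of the most recent snapshot.
    has_csum:        True if checksums exist for the extent.
    """
    if has_csum:
        # Overwriting in place would leave a stale checksum behind.
        return False
    # Any reference at or before the snapshot may be shared with it.
    return all(gen > last_snapshot for gen in ref_generations)


only_new_refs = can_nocow([12, 15], last_snapshot=10)
shared_with_snapshot = can_nocow([9, 15], last_snapshot=10)
has_checksums = can_nocow([12], last_snapshot=10, has_csum=True)
```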
    • Y
      Btrfs: Fix bookend extent race v2 · 6643558d
      Yan Zheng committed
      When dropping the middle part of an extent, btrfs_drop_extents first
      truncates the extent, then inserts a bookend extent.
      
      Since truncation and insertion can't be done atomically, there is a short
      window during which the bookend extent isn't in the tree. This causes problems
      for functions that search the tree for file extent items. The fix is to
      lock the range of the bookend extent before truncation.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  20. 30 Oct, 2008 6 commits
    • C
      Btrfs: prevent looping forever in finish_current_insert and del_pending_extents · 87ef2bb4
      Chris Mason committed
      finish_current_insert and del_pending_extents process extent tree modifications
      that build up while we are changing the extent tree.  It is a confusing
      bit of code that prevents recursion.
      
      Both functions run through a list of pending operations and both funcs
      add to the list of pending operations.  If you have two procs in either
      one of them, they can end up looping forever making more work for each other.
      
      This patch makes them walk forward through the list of pending changes instead
      of always trying to process the entire list.  At transaction commit
      time, we catch any changes that were left over.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Y
      Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng committed
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
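The check described here compares the transaction ID stored in a tree pointer with the generation recorded in the header of the block it points to. A minimal sketch, assuming a dictionary as the "disk" and a `(bytenr, generation)` tuple as the pointer; `read_root_block` and `StaleBlockError` are hypothetical names.

```python
class StaleBlockError(IOError):
    """The block on disk is not the version the pointer refers to."""


def read_root_block(disk, pointer):
    """Read a root block and verify its generation against the pointer.

    disk:    dict mapping byte number -> block header dict
    pointer: (bytenr, expected_generation)
    """
    bytenr, expected_generation = pointer
    header = disk[bytenr]
    if header["generation"] != expected_generation:
        raise StaleBlockError(
            "block %d has generation %d, expected %d"
            % (bytenr, header["generation"], expected_generation))
    return header


disk = {4096: {"generation": 7}, 8192: {"generation": 5}}
good = read_root_block(disk, (4096, 7))

try:
    read_root_block(disk, (8192, 6))  # e.g. a write that never hit the disk
    mismatch_detected = False
except StaleBlockError:
    mismatch_detected = True
```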
    • J
      Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik committed
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • J
      Btrfs: fix enospc when there is plenty of space · 80eb234a
      Josef Bacik committed
      So there is an odd case where we can possibly return -ENOSPC when there is in
      fact space to be had.  It only happens with Metadata writes, and happens _very_
      infrequently.  What has to happen is that we have allocated out of
      the first logical byte on the disk, which would set last_alloc to
      first_logical_byte(root, 0), so search_start == orig_search_start.  We then
      need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
      BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
      block_group_bits() won't match and we'll go to choose another block group.
      However because search_start matches orig_search_start we go to see if we can
      allocate a chunk.
      
      If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
      This is kind of a big flaw of the way find_free_extent works, as it along with
      find_free_space loop through _all_ of the block groups, not just the ones that
      we want to allocate out of.  This patch completely kills find_free_space and
      rolls it into find_free_extent.  I've introduced a sort of state machine into
      this, which will make it easier to get cache miss information out of the
      allocator, and will work well with my locking changes.
      
      The basic flow is this:  We have the variable loop which is 0, meaning we are
      in the hint phase.  We lookup the block group for the hint, and lookup the
      space_info for what we want to allocate out of.  If the block group we were
      pointed at by the hint either isn't of the correct type, or just doesn't have
      the space we need, we set head to space_info->block_groups, so we start at the
      beginning of the block groups for this particular space info, and loop through.
      
      This is also where we add the empty_cluster to total_needed.  At this point
      loop is set to 1 and we just loop through all of the block groups for this
      particular space_info looking for the space we need, just as find_free_space
      would have done, except we only hit the block groups we want and not _all_ of
      the block groups.  If we come full circle we see if we can allocate a chunk.
      If we cannot, we exit with -ENOSPC.  Otherwise we start
      over at space_info->block_groups and loop through again, with loop == 2.  If we
      come full circle and haven't found what we need then we exit with -ENOSPC.
      I've been running this for a couple of days now and it seems stable, and I
      haven't yet hit a -ENOSPC when there was plenty of space left.
      
      Also I've made a groups_sem to handle the group list for the space_info.  This
      is part of my locking changes, but is relatively safe and seems better than
      holding the space_info spinlock over that entire search time.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
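The flow described in this commit message can be sketched as a small state machine. This is one reading of the prose, not the kernel's find_free_extent: block groups and the space_info are plain dictionaries, the function returns a block group rather than an extent, and `alloc_chunk` is a hypothetical callback that returns a new group or None.

```python
import errno


def find_free_extent(space_info, hint_group, num_bytes, empty_cluster, alloc_chunk):
    """Pick a block group with enough free space, or raise ENOSPC."""
    total_needed = num_bytes

    # loop == 0: hint phase. Use the hinted group if it has the right
    # type and enough space.
    if (hint_group is not None
            and hint_group["flags"] == space_info["flags"]
            and hint_group["free"] >= total_needed):
        return hint_group

    # Leaving the hint phase: ask for the empty cluster as well, and
    # search only the block groups of this space_info.
    total_needed += empty_cluster
    for loop in (1, 2):
        for group in space_info["block_groups"]:
            if group["free"] >= total_needed:
                return group
        if loop == 1:
            # Came full circle once: try to allocate a new chunk.
            new_group = alloc_chunk()
            if new_group is None:
                raise OSError(errno.ENOSPC, "cannot allocate a chunk")
            space_info["block_groups"].append(new_group)
    # Came full circle a second time without finding space.
    raise OSError(errno.ENOSPC, "no space after second pass")


space_info = {"flags": "metadata", "block_groups": [{"flags": "metadata", "free": 10}]}
wrong_type_hint = {"flags": "data", "free": 1000}

found = find_free_extent(space_info, wrong_type_hint, 100, 20,
                         alloc_chunk=lambda: {"flags": "metadata", "free": 1024})

try:
    full = {"flags": "metadata", "block_groups": [{"flags": "metadata", "free": 10}]}
    find_free_extent(full, None, 100, 20, alloc_chunk=lambda: None)
    enospc = False
except OSError as exc:
    enospc = exc.errno == errno.ENOSPC
```

Because the search only visits the groups of one space_info, a hint of the wrong type no longer leads to an early -ENOSPC.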
    • Y
      Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng committed
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • C
      Btrfs: Add zlib compression support · c8b97818
      Chris Mason committed
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
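Two rules from this commit message lend themselves to a short user-space sketch: the size limits on a compressed extent (128k compressed, 256k uncompressed) and the fallback when compression does not make the data smaller. The code below is an illustration with Python's zlib module, not the kernel write path; `build_extents` and the tuple layout are hypothetical.

```python
import random
import zlib

MAX_UNCOMPRESSED = 256 * 1024  # input consumed per compressed extent
MAX_COMPRESSED = 128 * 1024    # on-disk size of a compressed extent


def build_extents(data):
    """Split data into extents of (is_compressed, ram_bytes, payload).

    Also returns a nocompress flag that is set once compression fails
    to shrink a chunk, so later chunks skip the attempt.
    """
    extents = []
    nocompress = False
    pos = 0
    while pos < len(data):
        chunk = data[pos:pos + MAX_UNCOMPRESSED]
        packed = None if nocompress else zlib.compress(chunk)
        if (packed is not None and len(packed) < len(chunk)
                and len(packed) <= MAX_COMPRESSED):
            extents.append((True, len(chunk), packed))
        else:
            nocompress = True  # avoid future compression attempts
            extents.append((False, len(chunk), chunk))
        pos += len(chunk)
    return extents, nocompress


text_extents, text_flag = build_extents(b"a" * 300000)

rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))
noise_extents, noise_flag = build_extents(noise)
```

Each extent records both its in-memory size and its payload, mirroring the commit's note that extents track an in-memory size and an on-disk size.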
  21. 09 Oct, 2008 1 commit
    • Y
      Btrfs: Fix leaf reference cache miss · 5b84e8d6
      Yan Zheng committed
      Due to the optimization for truncate, tree leaves only containing
      checksum items can be deleted without being COW'ed first. This causes
      reference cache misses. The fix is to create cache
      entries for tree leaves that only contain checksum items.
      
      This patch also fixes a -EEXIST issue in shared reference cache.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>