  1. 07 Nov, 2008 2 commits
    • Btrfs: Optimize compressed writeback and reads · 771ed689
      Committed by Chris Mason
      When reading compressed extents, try to put pages into the page cache
      for any pages covered by the compressed extent that readpages didn't already
      preload.
      
      Add an async work queue to handle transformations at delayed allocation processing
      time.  Right now this is just compression.  The workflow is:
      
      1) Find offsets in the file marked for delayed allocation
      2) Lock the pages
      3) Lock the state bits
      4) Call the async delalloc code
      
      The async delalloc code clears the state lock bits and delalloc bits.  It is
      important this happens before the range goes into the work queue because
      otherwise it might deadlock with other work queue items that try to lock
      those extent bits.
      
      The file pages are compressed, and if the compression doesn't work the
      pages are written back directly.
      
      An ordered work queue is used to make sure the inodes are written in the same
      order that pdflush or writepages sent them down.
      
      This changes extent_write_cache_pages to let the writepage function
      update the wbc nr_written count.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add ordered async work queues · 4a69a410
      Committed by Chris Mason
      Btrfs uses kernel threads to create async work queues for cpu intensive
      operations such as checksumming and decompression.  These work well,
      but they make it difficult to keep IO order intact.
      
      A single writepages call from pdflush or fsync will turn into a number
      of bios, and each bio is checksummed in parallel.  Once the checksum is
      computed, the bio is sent down to the disk, and since we don't control
      the order in which the parallel operations happen, they might go down to
      the disk in almost any order.
      
      The code deals with this somewhat by having deep work queues for a single
      kernel thread, making it very likely that a single thread will process all
      the bios for a single inode.
      
      This patch introduces an explicitly ordered work queue.  As work structs
      are placed into the queue they are put onto the tail of a list.  They have
      three callbacks:
      
      ->func (cpu intensive processing here)
      ->ordered_func (order sensitive processing here)
      ->ordered_free (free the work struct, all processing is done)
      
      The func callback does the cpu intensive work, and when it completes the
      work struct is marked as done.
      
      Every time a work struct completes, the list is checked to see if the head
      is marked as done.  If so the ordered_func callback is used to do the
      order sensitive processing and the ordered_free callback is used to do
      any cleanup.  Then we loop back and check the head of the list again.
      
      This patch also changes the checksumming code to use the ordered workqueues.
      On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
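The ordering rule described in 4a69a410 can be sketched as a small single-threaded model. The `Work` and `OrderedQueue` classes below are hypothetical names for illustration and are not taken from the Btrfs source; the kernel version runs `func` on worker threads and protects the list with a lock, which this sketch leaves out.

```python
from collections import deque


class Work:
    """Hypothetical work item carrying the three callbacks named in the commit."""

    def __init__(self, func, ordered_func, ordered_free):
        self.func = func
        self.ordered_func = ordered_func
        self.ordered_free = ordered_free
        self.done = False


class OrderedQueue:
    def __init__(self):
        self.pending = deque()  # submission order, head on the left

    def submit(self, work):
        self.pending.append(work)  # new work goes onto the tail

    def run(self, work):
        """Do the CPU-heavy step (in any order), then drain finished heads."""
        work.func()
        work.done = True
        while self.pending and self.pending[0].done:
            head = self.pending.popleft()
            head.ordered_func()   # order-sensitive step, in submission order
            head.ordered_free()   # cleanup once all processing is done


log = []
q = OrderedQueue()
works = []
for name in ("a", "b", "c"):
    w = Work(lambda: None,
             lambda n=name: log.append("ordered " + n),
             lambda n=name: log.append("free " + n))
    q.submit(w)
    works.append(w)

# Finish the CPU-heavy step out of order: c first, then a, then b.
for w in (works[2], works[0], works[1]):
    q.run(w)
```

Even though `c` finishes first, its `ordered_func` is held back until `a` and `b` at the head of the list are done.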
  2. 01 Nov, 2008 2 commits
    • Btrfs: rev the disk format for fallocate · 537fb067
      Committed by Chris Mason
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Compression corner fixes · 70b99e69
      Committed by Chris Mason
      Make sure we keep page->mapping NULL on the pages we're getting
      via alloc_page.  It gets set so a few of the callbacks can do the right
      thing, but in general these pages don't have a mapping.
      
      Don't try to truncate compressed inline items in btrfs_drop_extents.
      The whole compressed item must be preserved.
      
      Don't try to create multipage inline compressed items.  When we try to
      overwrite just the first page of the file, we would have to read in and recow
      all the pages after it in the same compressed inline items.  For now, only
      create single page inline items.
      
      Make sure we lock pages in the correct order during delalloc.  The
      search into the state tree for delalloc bytes can return bytes before
      the page we already have locked.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 31 Oct, 2008 6 commits
  4. 30 Oct, 2008 7 commits
    • Btrfs: prevent looping forever in finish_current_insert and del_pending_extents · 87ef2bb4
      Committed by Chris Mason
      finish_current_insert and del_pending_extents process extent tree modifications
      that build up while we are changing the extent tree.  It is a confusing
      bit of code that prevents recursion.
      
      Both functions run through a list of pending operations and both funcs
      add to the list of pending operations.  If you have two procs in either
      one of them, they can end up looping forever making more work for each other.
      
      This patch makes them walk forward through the list of pending changes instead
      of always trying to process the entire list.  At transaction commit
      time, we catch any changes that were left over.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Committed by Yan Zheng
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
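The check described in 84234f3a compares the transaction ID stored in a tree pointer with the generation number in the block header. A minimal sketch, assuming a hypothetical `disk` dictionary that maps byte offsets to block headers (none of these names come from the Btrfs source):

```python
class GenerationMismatch(Exception):
    """Raised when a block header does not match the pointer that led to it."""


def read_root_block(disk, bytenr, expected_generation):
    # disk: hypothetical dict mapping byte offset -> block header dict
    block = disk[bytenr]
    if block["generation"] != expected_generation:
        raise GenerationMismatch(
            "block %d: header generation %d, pointer expects %d"
            % (bytenr, block["generation"], expected_generation))
    return block


disk = {4096: {"generation": 7, "data": b"root"}}
fresh = read_root_block(disk, 4096, 7)
```

A pointer that expects generation 8 would raise `GenerationMismatch` for the same block, which is how a stale or misdirected write can be noticed at read time.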
    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Committed by Josef Bacik
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
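The try-lock idea in 25179201 (skip an extent that is busy, move on, and come back to it later) can be sketched in user space with non-blocking lock acquisition. `process_extents` and its arguments are hypothetical names for illustration; the loop assumes every lock is eventually released by its holder, otherwise it would spin.

```python
import threading
from collections import deque


def process_extents(extents, locks, handle):
    """Handle each extent once, revisiting the ones that were locked.

    extents: iterable of extent keys
    locks:   dict mapping extent key -> threading.Lock
    Returns how many times a busy extent was skipped.
    """
    todo = deque(extents)
    skipped = 0
    while todo:
        ext = todo.popleft()
        lock = locks[ext]
        if not lock.acquire(blocking=False):
            todo.append(ext)  # busy: move on, come back to this one later
            skipped += 1
            continue
        try:
            handle(ext)
        finally:
            lock.release()
    return skipped


locks = {1: threading.Lock(), 2: threading.Lock()}
locks[1].acquire()  # extent 1 is held by "someone else" at the start
order = []


def handle(ext):
    order.append(ext)
    if ext == 2:
        locks[1].release()  # simulate the other holder finishing


skipped = process_extents([1, 2], locks, handle)
```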
    • Btrfs: fix enospc when there is plenty of space · 80eb234a
      Committed by Josef Bacik
      So there is an odd case where we can possibly return -ENOSPC when there is in
      fact space to be had.  It only happens with Metadata writes, and happens _very_
      infrequently.  What has to happen is we have to have allocated out of
      the first logical byte on the disk, which would set last_alloc to
      first_logical_byte(root, 0), so search_start == orig_search_start.  We then
      need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
      BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
      block_group_bits() won't match and we'll go to choose another block group.
      However because search_start matches orig_search_start we go to see if we can
      allocate a chunk.
      
      If we are in the situation that we cannot allocate a chunk, we fail with -ENOSPC.
      This is kind of a big flaw of the way find_free_extent works, as it along with
      find_free_space loop through _all_ of the block groups, not just the ones that
      we want to allocate out of.  This patch completely kills find_free_space and
      rolls it into find_free_extent.  I've introduced a sort of state machine into
      this, which will make it easier to get cache miss information out of the
      allocator, and will work well with my locking changes.
      
      The basic flow is this:  We have the variable loop which is 0, meaning we are
      in the hint phase.  We lookup the block group for the hint, and lookup the
      space_info for what we want to allocate out of.  If the block group we were
      pointed at by the hint either isn't of the correct type, or just doesn't have
      the space we need, we set head to space_info->block_groups, so we start at the
      beginning of the block groups for this particular space info, and loop through.
      
      This is also where we add the empty_cluster to total_needed.  At this point
      loop is set to 1 and we just loop through all of the block groups for this
      particular space_info looking for the space we need, just as find_free_space
      would have done, except we only hit the block groups we want and not _all_ of
      the block groups.  If we come full circle we see if we can allocate a chunk.
      If we cannot, we exit with -ENOSPC and we are done.  If we can, we start
      over at space_info->block_groups and loop through again, with loop == 2.  If we
      come full circle and haven't found what we need then we exit with -ENOSPC.
      I've been running this for a couple of days now and it seems stable, and I
      haven't yet hit a -ENOSPC when there was plenty of space left.
      
      Also I've made a groups_sem to handle the group list for the space_info.  This
      is part of my locking changes, but is relatively safe and seems better than
      holding the space_info spinlock over that entire search time.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
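The three phases described in 80eb234a (hint, scan, scan again after a chunk allocation) can be modelled roughly as follows. This is a sketch under stated assumptions: block groups are plain dictionaries with a `free` field, `space_infos` maps a type name to its list of groups, and `alloc_chunk` is a hypothetical callback that returns True if it managed to add a group. None of this mirrors the real function signature.

```python
import errno
import os


def find_free_extent(space_infos, want, hint, needed, empty_cluster, alloc_chunk):
    """Return a block group of type `want` with enough free space."""
    hint_type, hint_group = hint

    # loop == 0: hint phase. Use the hinted group if it has the right type
    # and enough room.
    if hint_type == want and hint_group["free"] >= needed:
        return hint_group

    # The hint failed: add empty_cluster and walk only this space_info's groups.
    total_needed = needed + empty_cluster
    block_groups = space_infos[want]
    for loop in (1, 2):
        for group in block_groups:
            if group["free"] >= total_needed:
                return group
        # Came full circle. After the first pass, try to allocate a chunk;
        # if that is not possible, give up.
        if loop == 1 and not alloc_chunk(block_groups):
            break
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


metadata = [{"free": 0}]
data = [{"free": 500}]
space_infos = {"metadata": metadata, "data": data}


def grow(groups):
    groups.append({"free": 100})  # pretend a new chunk was allocated
    return True


found = find_free_extent(space_infos, "metadata", ("data", data[0]), 16, 8, grow)
```

Here the hint points at a data group, so the search falls through to the metadata groups, allocates a chunk after the first full pass, and succeeds on the second.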
    • Btrfs: Improve space balancing code · f82d02d9
      Committed by Yan Zheng
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Add zlib compression support · c8b97818
      Committed by Chris Mason
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
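Two rules from c8b97818 can be illustrated with Python's `zlib` module: at most 256k of uncompressed data goes into one compressed extent, and a file whose data does not shrink is flagged so later writes skip compression. The `compress_range` helper and the `"nocompress"` flag name are assumptions made for this sketch; the 128k limit on the compressed size is not modelled.

```python
import zlib

MAX_UNCOMPRESSED = 256 * 1024  # software limit quoted in the commit message


def compress_range(data, inode_flags):
    """Return (payload, is_compressed) for the start of `data`.

    inode_flags: a set standing in for per-file flags (hypothetical).
    """
    chunk = data[:MAX_UNCOMPRESSED]
    if "nocompress" in inode_flags:
        return chunk, False
    packed = zlib.compress(chunk)
    if len(packed) >= len(chunk):
        # Compression did not help: remember that and write the data as is.
        inode_flags.add("nocompress")
        return chunk, False
    return packed, True


flags = set()
payload, compressed = compress_range(b"a" * 4096, flags)   # shrinks well
raw, raw_compressed = compress_range(b"x", flags)          # too small to shrink
later, later_compressed = compress_range(b"a" * 4096, flags)
```

After the one-byte write fails to shrink, the flag is set and the third call returns the data uncompressed even though it would compress well.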
  5. 10 Oct, 2008 3 commits
    • Btrfs: make tree_search_offset more flexible in its searching · 37d3cddd
      Committed by Josef Bacik
      Sometimes we end up freeing a reserved extent because we don't need it, however
      this means that it's possible for transaction->last_alloc to point to the middle
      of a free area.
      
      When we search for free space in find_free_space we do a tree_search_offset
      with contains set to 0, because we want it to find the next best free area if
      we do not have an offset starting on the given offset.
      
      Unfortunately that currently means that if the offset we were given as a hint
      points to the middle of a free area, we won't find anything.  This is especially
      bad if we happened to last allocate from the big huge chunk of a newly formed
      block group, since we won't find anything and have to go back and search the
      long way around.
      
      This fixes this problem by making it so that we return the free space area
      regardless of the contains variable.  This made cache misses happen _a lot_
      less, and speeds things up considerably.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
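The lookup behaviour described in 37d3cddd (return the free area that contains the offset, otherwise the next one after it) can be sketched with a sorted list and `bisect`. The real code searches a tree; `search_free_space` and the `(start, length)` tuples are illustrative assumptions.

```python
import bisect


def search_free_space(free, offset):
    """free: sorted, non-overlapping (start, length) tuples.

    Returns the area containing `offset`, else the first area that starts
    after it, else None.
    """
    starts = [start for start, _ in free]
    i = bisect.bisect_right(starts, offset)
    if i > 0:
        start, length = free[i - 1]
        if offset < start + length:
            return free[i - 1]  # the hint points into the middle of this area
    return free[i] if i < len(free) else None


free = [(0, 10), (100, 50)]
inside = search_free_space(free, 105)   # hint lands inside the second area
between = search_free_space(free, 20)   # hint lands in a gap
```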
    • Btrfs: Don't call security_inode_mkdir during subvol creation · a3dddf3f
      Committed by Chris Mason
      Subvol creation already requires privs, and security_inode_mkdir isn't
      exported.  For now we don't need it.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix subvolume creation locking rules · cb8e7090
      Committed by Christoph Hellwig
      Creating a subvolume is in many ways like a normal VFS ->mkdir, and we
      really need to play with the VFS topology locking rules.  So instead of
      just creating the snapshot on disk and then later getting rid of
      conflicting aliases do it correctly from the start.  This will become
      especially important once we allow for subvolumes anywhere in the tree,
      and not just below a hidden root.
      
      Note that snapshots will need the same treatment, but due to the delay
      in creating them we can't do it currently.  Chris promised to fix that
      issue, so I'll wait on that.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  6. 09 Oct, 2008 5 commits
  7. 04 Oct, 2008 3 commits
  8. 02 Oct, 2008 4 commits
    • Btrfs: don't read leaf blocks containing only checksums during truncate · 323ac95b
      Committed by Chris Mason
      Checksum items take up a significant portion of the metadata for large files.
      It is possible to avoid reading them during truncates by checking the keys in
      the higher level nodes.
      
      If a given leaf is followed by another leaf where the lowest key is a checksum
      item from the same file, we know we can safely delete the leaf without
      reading it.
      
      For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
      the file with a cold cache.  It is read bound during the run.
      
      With this change, Btrfs is able to delete the file in 0.5s.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
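The key check in 323ac95b can be sketched by looking only at the lowest key of each leaf, as recorded in the parent node. The tuple layout `(objectid, type, offset)` and the `"csum"` type tag are simplifications, and the sketch assumes that a leaf qualifies when both its own lowest key and the next leaf's lowest key are checksum items of the file being truncated.

```python
CSUM = "csum"  # stand-in for the checksum item type


def leaves_deletable_unread(node_keys, objectid):
    """Return indexes of leaves that hold only checksums for `objectid`.

    node_keys: lowest key of each child leaf, in tree order, as
               (objectid, type, offset) tuples taken from the parent node.
    """
    deletable = []
    for i in range(len(node_keys) - 1):
        cur, nxt = node_keys[i], node_keys[i + 1]
        # Everything in leaf i sorts between cur and nxt, so if both are
        # checksum keys of this file, leaf i contains nothing else.
        if cur[:2] == (objectid, CSUM) and nxt[:2] == (objectid, CSUM):
            deletable.append(i)
    return deletable


keys = [
    (256, "extent", 0),
    (256, CSUM, 0),
    (256, CSUM, 4096),
    (256, CSUM, 8192),
    (257, "inode", 0),
]
skip = leaves_deletable_unread(keys, 256)
```

The leaf at index 3 still has to be read, because the next leaf starts with a key from another file and leaf 3 may contain more than checksums.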
    • Btrfs: fix deadlock between alloc_mutex/chunk_mutex · cf749823
      Committed by Josef Bacik
      This fixes a deadlock that happens between the alloc_mutex and chunk_mutex.
      Process A comes in, decides to do a do_chunk_alloc, which takes the
      chunk_mutex, and is holding the alloc_mutex because the only way you get to
      do_chunk_alloc is by holding the alloc_mutex.  btrfs_alloc_chunk does its thing
      and goes to insert a new item, which results in a cow of the block.
      
      We get into del_pending_extents from there, where if we need to be rescheduled
      we drop the alloc_mutex and schedule.  At this point process B comes in to do
      an allocation and gets the alloc_mutex, and because process A did not do the
      chunk allocation completely it thinks it's a good time to do a chunk allocation
      as well, and hangs on the chunk_mutex.
      
      Process A wakes up and tries to take the alloc_mutex and cannot.  The way to
      fix this is do a mutex_trylock() on chunk_mutex.  If we return 0 we didn't get
      the lock, and if this is just a "hey it may be a good time to allocate a chunk"
      then we just exit.  If we are trying to force an allocation then we reschedule
      and keep trying to acquire the chunk_mutex.  If once we acquire it the space is
      already full then we can just exit, otherwise we can continue with the chunk
      allocation.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
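The trylock pattern from cf749823 can be sketched in user space with `threading.Lock`. The function name matches the one mentioned in the commit message, but the signature, the `space_info` dictionary, and the `allocate` callback are assumptions for illustration only.

```python
import threading
import time


def do_chunk_alloc(chunk_mutex, space_info, allocate, force=False):
    """Return True if a chunk was allocated, False if the call backed off."""
    while not chunk_mutex.acquire(blocking=False):
        if not force:
            return False   # only a hint: someone else is already allocating
        time.sleep(0)      # forced: yield and keep trying for the lock
    try:
        if space_info["full"]:
            return False   # nothing left to allocate from
        allocate(space_info)
        return True
    finally:
        chunk_mutex.release()


calls = []
mutex = threading.Lock()
info = {"full": False}

mutex.acquire()                                   # another allocation is in flight
backed_off = do_chunk_alloc(mutex, info, calls.append)
mutex.release()
allocated = do_chunk_alloc(mutex, info, calls.append)
```

The non-forced caller never blocks on the chunk lock, so it cannot end up waiting on it while holding the lock the other process needs.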
  9. 01 Oct, 2008 2 commits
    • Btrfs: fix multi-device code to use raid policies set by mkfs · 75ccf47d
      Committed by Chris Mason
      When reading in block groups, a global mask of the available raid policies
      should be adjusted based on the types of block groups found on disk.  This
      global mask is then used to decide which raid policy to use for new
      block groups.
      
      The recent allocator changes dropped the call that updated the global
      mask, making all the block groups allocated at run time single striped
      onto a single drive.
      
      This also fixes the async worker threads to set any thread that uses
      the requeue mechanism as busy.  This allows us to avoid blocking
      on get_request_wait for the async bio submission threads.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix seekiness due to finding the wrong block group · 45b8c9a8
      Committed by Josef Bacik
      This patch fixes a problem where we end up seeking too much when *last_ptr is
      valid.  This happens because btrfs_lookup_first_block_group only returns a
      block group that starts on or after the given search start, so if the
      search_start is in the middle of a block group it will return the block group
      after the given search_start, which is suboptimal.
      
      This patch fixes that by doing a btrfs_lookup_block_group, which will return
      the block group that contains the given search start.  If we fail to find a
      block group, we fall back on btrfs_lookup_first_block_group so we can find the
      next block group, not sure if this is absolutely needed, but better safe than
      sorry.
      
      Also if we can't find the block group that we need, or it happens to not be of
      the right type, we need to add empty_cluster since *last_ptr could point to a
      mismatched block group, which means we need to start over with empty_cluster
      added to total needed.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. 30 Sep, 2008 1 commit
    • Btrfs: add and improve comments · d352ac68
      Committed by Chris Mason
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work to do in cleaning and commenting the code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 29 Sep, 2008 2 commits
    • Btrfs: drop WARN_ON from btrfs_add_leaf_ref · 9a5e1ea1
      Committed by Chris Mason
      btrfs_add_leaf_ref was doing checks on the objects it found in the
      rbtree to make sure they were properly linked into the tree.  But, the field
      it was checking can be safely changed outside of the tree spin lock.
      
      The WARN_ON was for debugging the initial implementation and can be
      safely removed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Wait for IO on the block device inodes of newly added devices · 8c8bee1d
      Committed by Chris Mason
      btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
      The kernel code needs to wait for this IO to finish before it adds
      the device.
      
      btrfs metadata IO does not happen through the block device inode.  A
      separate address space is used, allowing the zero filled buffer heads in
      the block device inode to be written to disk after FS metadata starts
      going down to the disk via the btrfs metadata inode.
      
      The end result is zero filled metadata blocks after adding new devices
      into the filesystem.
      
      The fix is a simple filemap_write_and_wait on the block device inode
      before actually inserting it into the pool of available devices.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 26 Sep, 2008 3 commits
    • Btrfs: update space balancing code · 1a40e23b
      Committed by Zheng Yan
      This patch updates the space balancing code to utilize the new
      backref format.  Before, btrfs-vol -b would break any COW links
      on data blocks or metadata.  This was slow and caused the amount
      of space used to explode if a large number of snapshots were present.
      
      The new code can keep the sharing of all data extents and
      most of the tree blocks.
      
      To maintain the sharing of data extents, the space balance code uses
      a separate inode to hold data extent pointers, then updates the references
      to point to the new location.
      
      To maintain the sharing of tree blocks, the space balance code uses
      reloc trees to relocate tree blocks in reference counted roots.
      There is one reloc tree for each subvol, and all reloc trees share
      the same root key objectid. Reloc trees are snapshots of the latest
      committed roots of subvols (root->commit_root).
      
      To relocate a tree block referenced by a subvol, there are two steps.
      COW the block through subvol's reloc tree, then update block pointer in
      the subvol to point to the new block. Since all reloc trees share
      the same root key objectid, doing special handling for tree blocks
      owned by them is easy. Once a tree block has been COWed in one
      reloc tree, we can use the resulting new block directly when the
      same block is required to COW again through other reloc trees.
      In this way, relocated tree blocks are shared between reloc trees,
      so they are also shared between subvols.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Committed by Zheng Yan
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents independent of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add shared reference cache · e4657689
      Committed by Zheng Yan
      Btrfs has a cache of reference counts in leaves, allowing it to
      avoid reading tree leaves while deleting snapshots.  To reduce
      contention with multiple subvolumes, this cache is private to each
      subvolume.
      
      This patch adds shared reference cache support. The new space
      balancing code plays with multiple subvols at the same time, so
      the old per-subvol reference cache is not well suited.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>