1. 22 Sep, 2009 4 commits
    • Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng committed
      This patch adds a snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contain links to other subvolumes can be destroyed.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng committed
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to another subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating a hard link to
      a directory, and has very similar problems.
      
      The aim of this patch is to enforce that there is only one access point to
      each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as a valid access point.
      The first directory entry is distinguished by checking the root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support; the code
      allows renaming a subvolume link across subvolumes.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng committed
      The new back reference format does not allow reusing the objectid of
      a deleted snapshot/subvol. So we use ++highest_objectid to allocate
      the objectid for a new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate the objectid for both new inodes
      and new snapshots/subvolumes, so this patch removes the 'find hole' code in
      btrfs_find_free_objectid.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
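The allocation rule described in the commit above can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the class name `ObjectidAllocator` and the starting value 255 are hypothetical, chosen for illustration. It shows that bumping a high-water mark never hands out a previously used id, even after the object that held it is deleted.

```python
class ObjectidAllocator:
    """Toy model of ++highest_objectid allocation (names are hypothetical)."""

    def __init__(self, highest_objectid: int = 255):
        # 255 is an arbitrary starting point chosen for this sketch.
        self.highest_objectid = highest_objectid

    def alloc(self) -> int:
        # Equivalent of ++highest_objectid: bump first, then return.
        self.highest_objectid += 1
        return self.highest_objectid


allocator = ObjectidAllocator()
first = allocator.alloc()
second = allocator.alloc()
# Even if the object holding `first` were deleted here, there is no
# "find hole" step, so its id is never handed out again.
third = allocator.alloc()
```

The trade-off is that ids are never recycled, which is what the new back reference format requires according to the commit message.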
    • Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng committed
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check the tree block's reference count and flags before reading the tree
      block. If the reference count is > 1 and there is no need to update backrefs, we
      can avoid reading the tree block.
      
      Second, save when the snapshot was created in root_key.offset. We can compare the
      block pointer's generation with the snapshot's creation generation while updating
      backrefs. If a given block was created before the snapshot was created, the
      snapshot can't be the tree block's owner. So we can avoid reading the block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
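The two checks in the commit above amount to a predicate that decides whether a tree block read can be skipped. The following Python sketch is an illustration only: the function name and parameters are hypothetical, and treating "created before" as a strictly smaller generation is an assumption, since the commit message does not spell out the boundary case.

```python
def can_skip_block_read(refs: int, need_backref_update: bool,
                        block_generation: int,
                        snapshot_generation: int) -> bool:
    """Return True when the tree block need not be read (toy model)."""
    # Check 1: the block is shared and no backref update is pending.
    if refs > 1 and not need_backref_update:
        return True
    # Check 2: while updating backrefs, a block created before the
    # snapshot cannot be owned by the snapshot.
    # Assumption: "before" is modelled as a strictly smaller generation.
    if need_backref_update and block_generation < snapshot_generation:
        return True
    return False


shared_block = can_skip_block_read(2, False, 10, 5)
sole_owner = can_skip_block_read(1, False, 10, 5)
older_than_snapshot = can_skip_block_read(1, True, 3, 5)
newer_than_snapshot = can_skip_block_read(2, True, 7, 5)
```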
  2. 19 Sep, 2009 2 commits
    • Btrfs: search for an allocation hint while filling file COW · b917b7c3
      Chris Mason committed
      The allocator has some nice knobs for sending hints about where
      to try and allocate new blocks, but when we're doing file allocations
      we're not sending any hint at all.
      
      This commit adds a simple extent map search to see if we can
      quickly and easily find a hint for the allocator.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason committed
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 18 Sep, 2009 1 commit
    • Btrfs: improve async block group caching · 11833d66
      Yan Zheng committed
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents while a block group is
      being cached. To allocate logged file extents, the old code needs to wait
      until the block group is fully cached. To get rid of these limitations,
      this patch introduces a data structure to track the progress of
      caching. Based on the caching progress, we know which extents should
      be added to the free space cache when handling the pinned extents.
      The logged file extents are also handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copies the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
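The two-tree scheme for pinned extents described above can be modelled in a few lines. This Python sketch is not the kernel implementation: the class and method names are hypothetical, Python sets stand in for the trees, and draining the swapped-out tree inside `commit` is a simplification made so the example stays self-contained.

```python
class PinnedExtents:
    """Toy model: two trees for pinned extents, swapped at commit time."""

    def __init__(self):
        self.trees = [set(), set()]
        self.running = 0  # index of the tree used by the running transaction

    def pin(self, start: int, length: int) -> None:
        # New pins always go to the running transaction's tree.
        self.trees[self.running].add((start, length))

    def commit(self):
        """Swap the trees and return the extents that can now be unpinned."""
        done = self.running
        self.running ^= 1  # swap instead of copying the whole tree
        unpinnable = sorted(self.trees[done])
        self.trees[done].clear()  # simplification: unpin right away
        return unpinnable


pinned = PinnedExtents()
pinned.pin(0, 4096)
first_commit = pinned.commit()
pinned.pin(8192, 4096)
second_commit = pinned.commit()
```

Swapping an index is constant-time, whereas the old approach described in the commit message copied the pinned extents tree at every transaction commit.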
  4. 16 Sep, 2009 3 commits
  5. 12 Sep, 2009 16 commits
    • Btrfs: zero page past end of inline file items · 93c82d57
      Chris Mason committed
      When btrfs_get_extent is reading inline file items for readpage,
      it needs to copy the inline extent into the page.  If the
      inline extent doesn't cover all of the page, that means there
      is a hole in the file, or that our file is smaller than one
      page.
      
      readpage does zeroing for the case where the file is smaller than one
      page, but nobody is currently zeroing for the case where there is
      a hole after the inline item.
      
      This commit changes btrfs_get_extent to zero fill the page past
      the end of the inline item.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
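The zero-fill behaviour described in the commit above can be sketched as follows. This is a Python illustration under stated assumptions, not the btrfs_get_extent code: the function name is hypothetical, a 4096-byte page is assumed, and the page is pre-filled with 0xff to stand in for stale data.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def fill_page_from_inline(inline_data: bytes,
                          page_size: int = PAGE_SIZE) -> bytearray:
    """Copy an inline extent into a page and zero everything after it."""
    page = bytearray(b"\xff" * page_size)  # pretend the page holds stale data
    copied = min(len(inline_data), page_size)
    page[:copied] = inline_data[:copied]
    # Zero fill the page past the end of the inline item.
    page[copied:] = bytes(page_size - copied)
    return page


page = fill_page_from_inline(b"hello")
```

Without the final zero-fill step, the bytes after the inline data would still hold whatever was in the page before, which is the hole-after-inline-item case the commit message describes.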
    • Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason committed
      This closes a hole where the page may be written before
      the page_mkwrite caller has a chance to dirty it.
      
      (thanks to Nick Piggin)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix extent replacment race · a1ed835e
      Chris Mason committed
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly set up with the new
      extent pointers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason committed
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page became
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accounting is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason committed
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason committed
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason committed
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason committed
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix new state initialization order · e48c465b
      Chris Mason committed
      As the extent state tree is manipulated, there are callbacks
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly set up the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing callbacks
      don't care about the lock bit.
      
      This patch makes sure the state is properly set up before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason committed
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: tweak congestion backoff · 57fd5a5f
      Chris Mason committed
      The btrfs io submission thread tries to back off congested devices in
      favor of rotating off to another disk.
      
      But, it tries to make sure it submits at least some IO before rotating
      on (the others may be congested too), and so it has a magic number of
      requests it tries to write before it hops.
      
      This makes the magic number smaller.  Testing shows that we're spending
      too much time on congested devices and leaving the other devices idle.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason committed
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84
      Chris Mason committed
      This changes the btrfs worker threads to batch work items
      into a local list.  It allows us to pull work items in
      large chunks and significantly reduces the number of times we
      need to take the worker thread spinlock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
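The batching idea in the commit above (pull many work items onto a local list under one lock acquisition, then process them with the lock released) can be sketched like this. It is a Python model only: `Worker`, `queue`, and `run_batch` are hypothetical names, and `threading.Lock` stands in for the kernel spinlock.

```python
import threading
from collections import deque


class Worker:
    """Toy model of a worker that drains its queue in batches."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pending = deque()
        self.consumer_lock_acquisitions = 0

    def queue(self, item) -> None:
        with self.lock:
            self.pending.append(item)

    def run_batch(self, handler) -> int:
        # Take the lock once and move everything to a local list.
        with self.lock:
            self.consumer_lock_acquisitions += 1
            batch = list(self.pending)
            self.pending.clear()
        # Process the batch without holding the lock.
        for item in batch:
            handler(item)
        return len(batch)


worker = Worker()
for i in range(5):
    worker.queue(i)
results = []
processed = worker.run_batch(results.append)
```

In this model five items cost the consumer a single lock acquisition instead of five, which is the effect the commit message describes.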
    • Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason committed
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: optimize set extent bit · 40431d6c
      Chris Mason committed
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
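The adjacency test described in the commit above can be modelled with a sorted list in place of the rbtree. This Python sketch is an illustration under assumptions: `StateTree` and its methods are hypothetical, extent states are inclusive `(start, end)` pairs, and `bisect` stands in for the tree search while "index + 1" stands in for `rb_next`.

```python
import bisect


class StateTree:
    """Toy model: sorted, non-overlapping (start, end) extent states."""

    def __init__(self, states):
        self.states = sorted(states)
        self.starts = [start for start, _ in self.states]
        self.searches = 0  # counts full lookups (the expensive path)

    def search(self, offset: int) -> int:
        """Full lookup: index of the first state whose end >= offset."""
        self.searches += 1
        i = bisect.bisect_right(self.starts, offset) - 1
        if i >= 0 and self.states[i][1] >= offset:
            return i
        return i + 1

    def walk(self, start: int, end: int):
        """Visit every state overlapping [start, end]."""
        visited = []
        i = self.search(start)
        while i < len(self.states) and self.states[i][0] <= end:
            visited.append(self.states[i])
            last_end = self.states[i][1]
            if last_end >= end:
                break
            nxt = i + 1  # stand-in for rb_next
            if nxt < len(self.states) and self.states[nxt][0] == last_end + 1:
                i = nxt  # adjacent neighbour: skip the search
            else:
                i = self.search(last_end + 1)
        return visited


tree = StateTree([(0, 9), (10, 19), (20, 29), (40, 49)])
visited = tree.walk(0, 49)
```

In this example the walk touches four states but performs only two full searches, because the second and third states are adjacent to their predecessors.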
    • Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason committed
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io handlers).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 21 Aug, 2009 1 commit
  7. 08 Aug, 2009 3 commits
  8. 01 Aug, 2009 1 commit
    • Btrfs: make sure the async caching thread advances the key · 013f1b12
      Chris Mason committed
      The async caching thread can end up looping forever if a given
      search puts it at the last key in a leaf.  It will end up calling
      btrfs_next_leaf and then checking if it needs to politely drop
      the read semaphore.
      
      Most of the time this looping isn't noticed because it is able to
      make progress the next time around.  But, during log replay,
      we wait on the async caching thread to finish, and the async thread
      is waiting on the commit, and no progress is really made.
      
      The fix used here is to copy the key out of the next leaf,
      that way our search lands there properly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  9. 31 Jul, 2009 1 commit
    • Btrfs: fix btrfs_remove_from_free_space corner case · 6606bb97
      Josef Bacik committed
      Yan Zheng hit a problem where we tried to remove some free space but failed
      because we couldn't find the free space entry.  This is because the free space
      was held within a bitmap that had a starting offset well before the actual
      offset of the free space, and there were free space extents that were in the
      same range as that offset, so tree_search_offset returned with NULL because we
      couldn't find a free space extent that had that offset.  This is fixed by
      making sure that if we fail to find the entry, we re-search again with
      bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate
      bitmap.  A similar problem happens in btrfs_alloc_from_bitmap for the
      clustering code, but that is not as bad since we will just go and redo our
      cluster allocation.
      
      Also this adds some debugging checks to make sure that the free space we are
      trying to remove from the bitmap is in fact there.  This can probably go away
      after a while, but since this code is only used by the tree-logging stuff it
      would be nice to run with it for a while to make sure there are no problems.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. 30 Jul, 2009 2 commits
    • Btrfs: be more polite in the async caching threads · f36f3042
      Chris Mason committed
      The semaphore used by the async caching threads can prevent a
      transaction commit, which can make the FS appear to stall.  This
      releases the semaphore more often when a transaction commit is
      in progress.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: preserve commit_root for async caching · 276e680d
      Yan Zheng committed
      The async block group caching code uses the commit_root pointer
      to get a stable version of the extent allocation tree for scanning.
      This copy of the tree root isn't going to change and it significantly
      reduces the complexity of the scanning code.
      
      During a commit, we have a loop where we update the extent allocation
      tree root.  We need to loop because updating the root pointer in
      the tree of tree roots may allocate blocks which may change the
      extent allocation tree.
      
      Right now the commit_root pointer is changed inside this loop.  It
      is more correct to change the commit_root pointer only after all the
      looping is done.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 28 Jul, 2009 3 commits
    • Btrfs: Fix async caching interaction with unmount · f25784b3
      Yan Zheng committed
      - don't stop the caching thread until btrfs_commit_super returns.
      
      - if caching is interrupted by umount, set last to (u64)-1.
        Otherwise the un-scanned range of the block group will be considered
        a free extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: change how we unpin extents · 68b38550
      Josef Bacik committed
      We are racy with async block caching and unpinning extents.  This patch makes
      things much less complicated by only unpinning the extent if the block group is
      cached.  We check the block_group->cached var under the block_group->lock spin
      lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
      and unpin the extent and add the free space back.  If it is not set to this, we
      start the caching of the block group so the next time we unpin extents we can
      unpin the extent.  This keeps us from racing with the async caching threads,
      lets us kill the fs wide async thread counter, and keeps us from having to set
      DELALLOC bits for every extent we hit if there are caching kthreads going.
      
      One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
      instead of just looking for LOCKED extents, we also look for DIRTY extents,
      since we could have left some extents pinned in the previous transaction that
      will never get freed now that we are unmounting, which would cause us to leak
      memory.  So btrfs_free_super_mirror_extents has been changed to
      btrfs_free_pinned_extents, and it will clear the extents locked for the super
      mirror, and any remaining pinned extents that may be present.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Correct redundant test in add_inode_ref · 631c07c8
      Julia Lawall committed
      dir has already been tested.  It seems that this test should be on the
      recently returned value inode.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://www.emn.fr/x-info/coccinelle/)
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 25 Jul, 2009 3 commits
    • Btrfs: find smallest available device extent during chunk allocation · 9779b72f
      Chris Mason committed
      Allocating a new block group is easy when the disk has plenty of space.
      But things get difficult as the disk fills up, especially if
      the FS has been run through btrfs-vol -b.  The balance operation
      is likely to make the total bytes available on the device greater
      than the largest extent we'll actually be able to allocate.
      
      But the device extent allocation code incorrectly assumes that a device
      with 5G free will be able to allocate a 5G extent.  It isn't normally a
      problem because device extents don't get freed unless btrfs-vol -b
      is run.
      
      This fixes the device extent allocator to remember the largest free
      extent it can find, and then uses that value as a fallback.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
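The fallback described in the commit above (remember the largest free extent seen, and use it when nothing is big enough) can be sketched as a simple scan. This Python version is a model only: the function signature and the `(start, length)` hole representation are assumptions made for illustration, and how the caller uses the fallback value is not covered.

```python
GIB = 1 << 30


def find_free_dev_extent(holes, num_bytes: int):
    """Return (start, length) for an allocation, or None if no free space.

    holes: iterable of (start, length) free ranges on one device.
    If no hole is large enough, fall back to the largest hole seen.
    """
    largest = None
    for start, length in holes:
        if length >= num_bytes:
            return (start, num_bytes)  # first fit that satisfies the request
        if largest is None or length > largest[1]:
            largest = (start, length)  # remember the best fallback so far
    return largest


# 5 GiB free in total, but the largest contiguous hole is only 3 GiB.
holes = [(0, 1 * GIB), (4 * GIB, 3 * GIB), (10 * GIB, 1 * GIB)]
wanted_5g = find_free_dev_extent(holes, 5 * GIB)
wanted_2g = find_free_dev_extent(holes, 2 * GIB)
```

The example mirrors the situation in the commit message: a device with 5G free cannot necessarily provide a 5G extent, so the scan reports the 3 GiB hole instead of failing outright.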
    • Btrfs: clear all space_info->full after removing a block group · 283bb197
      Chris Mason committed
      Btrfs allocates individual extents from block groups, and each
      block group has a specific type.  It may hold metadata, data
      mirrored or striped etc.
      
      When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
      we free block groups.  Once a block group is freed, the space it was
      using on the device may be available for use by new block groups.
      
      btrfs_remove_block_group was clearing the flag that said
      'our devices are full, don't even try to allocate new block groups',
      but it was only clearing that flag for a specific type of block group.
      
      This commit clears the full flag for all of the types of block groups,
      making it much more likely that we'll be able to balance space when
      the drive is close to full.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make flushoncommit mount option correctly wait on ordered_extents · ebecd3d9
      Sage Weil committed
      The commit_transaction call to wait_ordered_extents when snap_pending
      passes nocow_only=1 to process only NOCOW or PREALLOC extents.  This isn't
      correct for the 'flushoncommit' mode, as it skips extents we just started
      IO on in start_delalloc_inodes.
      
      So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
      only pass the nocow_only flag to wait_ordered_extents if snap_pending.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>