1. 24 Sep, 2009 2 commits
    • C
      Btrfs: Fix test_range_bit for whole file extents · 46562cec
      Chris Mason committed
      If test_range_bit finds an extent that goes all the way to (u64)-1, it
      can incorrectly wrap the u64 instead of treating it like the end of
      the address space.
      
      This just adds a check for the highest possible offset so we don't wrap.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      46562cec
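The wrap described above can be illustrated with a small sketch. This is not the kernel's code: `next_scan_start` is a hypothetical helper, and `UINT64_MAX` stands in for `(u64)-1`. It only shows the idea of checking for the highest possible offset before adding one.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the kernel): compute where a scan
 * should resume after an extent whose last byte is `end`. Returns false
 * when the extent already reaches the top of the 64-bit address space,
 * so the caller stops instead of wrapping around to offset 0.
 */
static bool next_scan_start(uint64_t end, uint64_t *next)
{
	if (end == UINT64_MAX)	/* same value as (u64)-1 */
		return false;
	*next = end + 1;
	return true;
}
```

Without the check, `end + 1` would evaluate to 0 for a whole-file extent and the scan would restart from the beginning of the range.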
    • C
      Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29
      Chris Mason committed
      Both set and clear_extent_bit allow passing a cached
      state struct to reduce rbtree search times.  clear_extent_bit
      was improperly bypassing some of the checks around making sure
      the extent state fields were correct for a given operation.
      
      The fix used here (from Yan Zheng) is to use the hit_next
      goto target instead of jumping all the way down to start clearing
      bits without making sure the cached state was exactly correct
      for the operation we were doing.
      
      This also fixes up the setting of the start variable for both
      ops in the case where we find an overlapping extent that
      begins before the range we want to change.  In both cases
      we were incorrectly going backwards from the original
      requested change.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      42daec29
  2. 23 Sep, 2009 2 commits
    • C
      Btrfs: fix early enospc during balancing · 7ce618db
      Chris Mason committed
      We now do extra checks before a balance to make sure
      there is room for the balance to take place.  One of
      the checks was testing to see if we were trying to
      balance away the last block group of a given type.
      
      If there is no space available for new chunks, we
      should not try and balance away the last block group
      of a given type.  But, the code wasn't checking for
      available chunk space, and so it was exiting too soon.
      
      The fix here is to combine some of the checks and make
      sure we try to allocate new chunks when we're balancing
      the last block group.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7ce618db
    • C
      Btrfs: deal with NULL space info · 33b4d47f
      Chris Mason committed
      After a balance it is briefly possible for the space info
      field in the inode to be NULL.  This adds some checks
      to make sure things properly deal with the NULL value.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      33b4d47f
  3. 22 Sep, 2009 11 commits
    • J
      Btrfs: account for space used by the super mirrors · 1b2da372
      Josef Bacik committed
      As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
      space accounting for the space info's.  Currently we exclude the free space for
      the super mirrors, but the space they take up isn't accounted for in any of the
      counters.  This patch introduces bytes_super, which keeps track of the amount
      of bytes used for a super mirror in the block group cache and space info.  This
      makes sure that our free space calculations will be completely accurate.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1b2da372
    • J
      Btrfs: fix extent entry threshold calculation · 25891f79
      Josef Bacik committed
      There is a slight problem with the extent entry threshold calculation for the
      free space cache.  We only adjust the threshold down as we add bitmaps, but
      never actually adjust the threshold up as we free bitmaps.  This means we could
      fragment the free space so badly that we end up using all bitmaps to describe
      the free space, use all the free space which would result in the bitmaps being
      freed, but then go to add free space again as we delete things and immediately
      add bitmaps since the extent threshold would still be 0.  Now as we free
      bitmaps the extent threshold will be ratcheted up to allow more extent entries
      to be added.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      25891f79
    • J
      Btrfs: remove dead code · f61408b8
      Josef Bacik committed
      This patch removes a bunch of dead code from the snapshot removal stuff.  It
      was confusing me when doing the metadata ENOSPC stuff so I killed it.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f61408b8
    • J
      Btrfs: fix bitmap size tracking · f019f426
      Josef Bacik committed
      When we first go to add free space, we allocate a new info and set the offset
      and bytes to the space we are adding.  This is fine, except we actually set the
      size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
      end up counting the same space twice.  This isn't a huge deal, it just makes
      the allocator behave weirdly since it will think that a bitmap entry has more
      space than it ends up actually having.  I used a BUG_ON() to catch when this
      problem happened, and with this patch I no longer get the BUG_ON().
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f019f426
    • J
      Btrfs: don't keep retrying a block group if we fail to allocate a cluster · 0a24325e
      Josef Bacik committed
      The box can get locked up in the allocator if we happen upon a block group
      under these conditions:
      
      1) During a commit, so caching threads cannot make progress
      2) Our block group currently is in the middle of being cached
      3) Our block group currently has plenty of free space in it
      4) Our block group is so fragmented that it ends up having no free space chunks
      larger than min_bytes calculated by btrfs_find_space_cluster.
      
      What happens is we try and do btrfs_find_space_cluster, which fails because it
      is unable to find enough free space chunks that are larger than min_bytes and
      are close enough together.  Since the block group is not cached we do a
      wait_block_group_cache_progress, which waits for the number of bytes we need,
      except the block group already has _plenty_ of free space, it's just severely
      fragmented, so we loop and try again, ad infinitum.  This patch keeps us from
      waiting on the block group to finish caching if we failed to find a free space
      cluster before.  It also makes sure that we don't even try to find a free space
      cluster if we are on our last loop in the allocator, since we will have tried
      everything at this point and it is futile.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      0a24325e
    • J
      Btrfs: make balance code choose more wisely when relocating · ba1bf481
      Josef Bacik committed
      Currently, we can panic the box if the first block group we go to move is of a
      type where there is no space left to move those extents.  For example, if we
      fill the disk up with data, and then we try to balance and we have no room to
      move the data nor room to allocate new chunks, we will panic.  Change this by
      checking to see if we have room to move this chunk around, and if not, return
      -ENOSPC and move on to the next chunk.  This will make sure we remove block
      groups that are moveable, like if we have a lot of empty metadata block groups,
      and then that way we make room to be able to balance our data chunks as well.
      Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
      panics with this patch.
      
      V1->V2:
      -actually search for a free extent on the device to make sure we can allocate a
      chunk if need be.
      
      -fix btrfs_shrink_device to make sure we actually try to relocate all the
      chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
      we don't remove the device with data still on it.
      
      -check to make sure the block group we are going to relocate isn't the last one
      in that particular space
      
      -fix a bug in btrfs_shrink_device where we would change the device's size and
      not fix it if we fail to do our relocate
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ba1bf481
    • S
      Btrfs: fix arithmetic error in clone ioctl · 1fb58a60
      Sage Weil committed
      Fix an arithmetic error that was breaking extents cloned via the clone
      ioctl starting in the second half of a file.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1fb58a60
    • Y
      Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng committed
      This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contain links to other subvolumes can be destroyed.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      76dda93c
    • Y
      Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng committed
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to another subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating a hard link to
      a directory, and has very similar problems.
      
      The aim of this patch is to enforce that there is only one access point
      to each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as a valid access point.
      The first directory entry is distinguished by checking the root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support; the code
      allows renaming a subvolume link across subvolumes.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4df27c4d
    • Y
      Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng committed
      The new back reference format does not allow reusing objectid of
      deleted snapshot/subvol. So we use ++highest_objectid to allocate
      objectid for new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate objectid for both new inode
      and new snapshot/subvolume, so this patch removes 'find hole' code in
      btrfs_find_free_objectid.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      13a8a7c8
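The `++highest_objectid` allocation described above can be sketched as follows. The struct and function names are hypothetical, and using `UINT64_MAX` as the exhaustion limit is an assumption made for illustration; the real code has its own limit and locking.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a tree root. */
struct root_sketch {
	uint64_t highest_objectid;
};

/*
 * Hand out the next objectid by pre-incrementing the high-water mark.
 * Ids of deleted items are never returned again because the counter
 * only moves forward. Returns 0 on success, -1 if the id space is
 * exhausted (UINT64_MAX is an assumed limit for this sketch).
 */
static int alloc_objectid(struct root_sketch *root, uint64_t *objectid)
{
	if (root->highest_objectid == UINT64_MAX)
		return -1;
	*objectid = ++root->highest_objectid;
	return 0;
}
```

The trade-off is that holes left by deleted inodes or subvolumes are never refilled, which is what lets the 'find hole' search go away.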
    • Y
      Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng committed
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check the tree block's reference count and flags before reading the
      tree block. If the reference count is > 1 and there is no need to update
      backrefs, we can avoid reading the tree block.
      
      Second, save when the snapshot was created in root_key.offset. We can compare
      the block pointer's generation with the snapshot's creation generation while
      updating backrefs. If a given block was created before the snapshot was created,
      the snapshot can't be the tree block's owner. So we can avoid reading the block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1c4850e2
  4. 19 Sep, 2009 2 commits
    • C
      Btrfs: search for an allocation hint while filling file COW · b917b7c3
      Chris Mason committed
      The allocator has some nice knobs for sending hints about where
      to try and allocate new blocks, but when we're doing file allocations
      we're not sending any hint at all.
      
      This commit adds a simple extent map search to see if we can
      quickly and easily find a hint for the allocator.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b917b7c3
    • C
      Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason committed
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f85d7d6c
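The per-page recheck of nr_to_write can be sketched roughly as below. Everything here is a simplified assumption: `wbc_sketch`, `write_batch` and the callback are hypothetical names, and a plain loop stands in for the pagevec walk in extent_write_cache_pages.

```c
#include <stddef.h>

/* Hypothetical, pared-down writeback control: only the page budget. */
struct wbc_sketch {
	long nr_to_write;
};

/* Callback that "writes" one page and may raise wbc->nr_to_write. */
typedef void (*write_page_fn)(struct wbc_sketch *wbc, size_t page_index);

/*
 * Walk a batch of pages and re-read the budget before every page, so an
 * increase made by the callback while handling page i is honored for
 * page i + 1. Returns the number of pages written.
 */
static size_t write_batch(struct wbc_sketch *wbc, size_t nr_pages,
			  write_page_fn write_page)
{
	size_t written = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (wbc->nr_to_write <= 0)
			break;
		wbc->nr_to_write--;
		write_page(wbc, i);
		written++;
	}
	return written;
}

/* Example callback: the first page starts a large extent and bumps the budget. */
static void bump_on_first_page(struct wbc_sketch *wbc, size_t page_index)
{
	if (page_index == 0)
		wbc->nr_to_write += 3;
}
```

Starting with a budget of one page, calling `write_batch` with `bump_on_first_page` continues past the first page because the raised budget is seen on the next iteration; a loop that captured the budget once up front would have stopped after a single page.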
  5. 18 Sep, 2009 1 commit
    • Y
      Btrfs: improve async block group caching · 11833d66
      Yan Zheng committed
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents while a block group is
      being cached. To allocate logged file extents, the old code needs to
      wait until the block group is fully cached. To get rid of these
      limitations, this patch introduces a data structure to track the
      progress of caching. Based on the caching progress, we know which
      extents should be added to the free space cache when handling the
      pinned extents. The logged file extents are also handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copies the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents. One tree for extents that are
      pinned in the running transaction, one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      11833d66
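The two-tree scheme for pinned extents can be pictured with the sketch below. It is only an illustration under loud assumptions: flat arrays stand in for the extent trees, all names (`pinned_set`, `fs_sketch`, `commit_swap`, and so on) are hypothetical, and there is no locking.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PINNED 16

/* Hypothetical stand-in for an extent tree: a flat list of start offsets. */
struct pinned_set {
	uint64_t start[MAX_PINNED];
	size_t count;
};

struct fs_sketch {
	struct pinned_set sets[2];
	struct pinned_set *pinned;	/* filled by the running transaction */
	struct pinned_set *unpin;	/* extents that can now be unpinned */
};

static void fs_sketch_init(struct fs_sketch *fs)
{
	fs->sets[0].count = 0;
	fs->sets[1].count = 0;
	fs->pinned = &fs->sets[0];
	fs->unpin = &fs->sets[1];
}

/* Record an extent pinned by the running transaction. */
static int pin_extent(struct fs_sketch *fs, uint64_t start)
{
	if (fs->pinned->count == MAX_PINNED)
		return -1;
	fs->pinned->start[fs->pinned->count++] = start;
	return 0;
}

/* At commit time, swap the roles of the two sets instead of copying one. */
static void commit_swap(struct fs_sketch *fs)
{
	struct pinned_set *tmp = fs->pinned;

	fs->pinned = fs->unpin;
	fs->unpin = tmp;
}

/* Release everything in the unpin set; returns how many extents were released. */
static size_t finish_unpin(struct fs_sketch *fs)
{
	size_t released = fs->unpin->count;

	fs->unpin->count = 0;
	return released;
}
```

The swap is a pointer exchange, so its cost does not depend on how many extents were pinned. The sketch assumes the unpin set has been drained before the next swap.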
  6. 16 Sep, 2009 3 commits
  7. 12 Sep, 2009 17 commits
    • C
    • C
      Btrfs: zero page past end of inline file items · 93c82d57
      Chris Mason committed
      When btrfs_get_extent is reading inline file items for readpage,
      it needs to copy the inline extent into the page.  If the
      inline extent doesn't cover all of the page, that means there
      is a hole in the file, or that our file is smaller than one
      page.
      
      readpage does zeroing for the case where the file is smaller than one
      page, but nobody is currently zeroing for the case where there is
      a hole after the inline item.
      
      This commit changes btrfs_get_extent to zero fill the page past
      the end of the inline item.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      93c82d57
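Zero filling past the end of an inline item amounts to a copy followed by a memset. The sketch below is a userspace illustration, not the btrfs_get_extent code: the function name is hypothetical and the 4096-byte page size is an assumption.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096	/* assumed page size for this sketch */

/*
 * Copy an inline extent into a page buffer and zero whatever part of the
 * page the inline data does not cover, so stale bytes are never exposed.
 */
static void fill_page_from_inline(unsigned char *page,
				  const unsigned char *inline_data,
				  size_t inline_len)
{
	size_t copy = inline_len < PAGE_SIZE_SKETCH ? inline_len
						    : PAGE_SIZE_SKETCH;

	memcpy(page, inline_data, copy);
	if (copy < PAGE_SIZE_SKETCH)
		memset(page + copy, 0, PAGE_SIZE_SKETCH - copy);
}
```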
    • C
      Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason committed
      This closes a hole where the page may be written before
      the page_mkwrite caller has a chance to dirty it.
      
      (thanks to Nick Piggin)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      50a9b214
    • C
      Btrfs: Fix extent replacement race · a1ed835e
      Chris Mason committed
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in ram in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a1ed835e
    • C
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason committed
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page became
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accounting is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      8b62b72b
    • C
      Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason committed
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9655d298
    • C
      Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason committed
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      d5550c63
    • C
      Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason committed
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2c64c53d
    • C
      Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason committed
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1edbb734
    • C
      Btrfs: Fix new state initialization order · e48c465b
      Chris Mason committed
      As the extent state tree is manipulated, there are call backs
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly setup the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing call backs
      don't care about the lock bit.
      
      This patch makes sure the state is properly setup before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e48c465b
    • C
      Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason committed
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      890871be
    • C
      Btrfs: tweak congestion backoff · 57fd5a5f
      Chris Mason committed
      The btrfs io submission thread tries to back off congested devices in
      favor of rotating off to another disk.
      
      But, it tries to make sure it submits at least some IO before rotating
      on (the others may be congested too), and so it has a magic number of
      requests it tries to write before it hops.
      
      This makes the magic number smaller.  Testing shows that we're spending
      too much time on congested devices and leaving the other devices idle.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      57fd5a5f
    • C
      Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason committed
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a97adc9f
    • C
      Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84
      Chris Mason committed
      This changes the btrfs worker threads to batch work items
      into a local list.  It allows us to pull work items in
      large chunks and significantly reduces the number of times we
      need to take the worker thread spinlock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4f878e84
    • C
      Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason committed
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4e3f9c50
    • C
      Btrfs: optimize set extent bit · 40431d6c
      Chris Mason committed
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      40431d6c
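The adjacency test can be sketched with a sorted array standing in for the rbtree, so that "the next object" is simply the next array element (the role rb_next plays in the commit). The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent_state: an inclusive byte range. */
struct range_sketch {
	uint64_t start;
	uint64_t end;
};

/*
 * Given sorted, non-overlapping ranges, return the successor of ranges[i]
 * if it begins exactly one byte after ranges[i] ends. Returns NULL when
 * there is no successor or there is a gap, in which case a caller would
 * fall back to a full search.
 */
static const struct range_sketch *
next_if_adjacent(const struct range_sketch *ranges, size_t n, size_t i)
{
	if (i + 1 >= n)
		return NULL;
	if (ranges[i].end == UINT64_MAX)	/* nothing can follow */
		return NULL;
	if (ranges[i + 1].start != ranges[i].end + 1)
		return NULL;
	return &ranges[i + 1];
}
```

When the states covering a request are contiguous, each step costs one successor lookup rather than a search from the root.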
    • C
      Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason committed
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io handlers).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9042846b
  8. 10 Sep, 2009 1 commit
  9. 09 Sep, 2009 1 commit
    • E
      aoe: allocate unused request_queue for sysfs · 7135a71b
      Ed Cashin committed
      Andy Whitcroft reported an oops in aoe triggered by use of an
      incorrectly initialised request_queue object:
      
        [ 2645.959090] kobject '<NULL>' (ffff880059ca22c0): tried to add
      		an uninitialized object, something is seriously wrong.
        [ 2645.959104] Pid: 6, comm: events/0 Not tainted 2.6.31-5-generic #24-Ubuntu
        [ 2645.959107] Call Trace:
        [ 2645.959139] [<ffffffff8126ca2f>] kobject_add+0x5f/0x70
        [ 2645.959151] [<ffffffff8125b4ab>] blk_register_queue+0x8b/0xf0
        [ 2645.959155] [<ffffffff8126043f>] add_disk+0x8f/0x160
        [ 2645.959161] [<ffffffffa01673c4>] aoeblk_gdalloc+0x164/0x1c0 [aoe]
      
      The request queue of an aoe device is not used but can be allocated in
      code that does not sleep.
      
      Bruno bisected this regression down to
      
        cd43e26f
      
        block: Expose stacked device queues in sysfs
      
      "This seems to generate /sys/block/$device/queue and its contents for
       everyone who is using queues, not just for those queues that have a
       non-NULL queue->request_fn."
      
      Addresses http://bugs.launchpad.net/bugs/410198
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13942
      
      Note that embedding a queue inside another object has always been
      an illegal construct, since the queues are reference counted and
      must persist until the last reference is dropped. So aoe was
      always buggy in this respect (Jens).
      Signed-off-by: Ed Cashin <ecashin@coraid.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Bruno Premont <bonbons@linux-vserver.org>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      7135a71b