1. 24 September 2009, 3 commits
    • Btrfs: fix releasepage to avoid unlocking extents we haven't locked · 11ef160f
      Authored by Chris Mason
      During releasepage, we try to drop any extent_state structs for the
      byte offsets of the page we're releasing.  But the code was incorrectly
      telling clear_extent_bit to delete the state struct unconditionally.
      
      Normally this would be fine because we have the page locked, but other
      parts of btrfs will lock down an entire extent, the most common place
      being IO completion.
      
      releasepage was deleting the extent state without first locking the extent,
      which may result in removing a state struct that another process had
      locked down.  The fix here is to leave the NODATASUM and EXTENT_LOCKED
      bits alone in releasepage.
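
      A minimal sketch of the fixed call, assuming the clear_extent_bit
      signature of that era (bits to clear, wake, delete, gfp mask); the
      exact arguments are paraphrased:

          /* drop our private bits for this page's byte range, but leave
           * EXTENT_LOCKED and EXTENT_NODATASUM alone, and never force the
           * state struct to be deleted -- another task may hold the lock */
          u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
          u64 end = start + PAGE_CACHE_SIZE - 1;

          clear_extent_bit(tree, start, end,
                           ~(EXTENT_LOCKED | EXTENT_NODATASUM),
                           0 /* wake */, 0 /* delete */, mask);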
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix test_range_bit for whole file extents · 46562cec
      Authored by Chris Mason
      If test_range_bit finds an extent that goes all the way to (u64)-1, it
      can incorrectly wrap the u64 instead of treating it like the end of
      the address space.
      
      This just adds a check for the highest possible offset so we don't wrap.
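
      A minimal sketch of the check, with names following extent_io.c (the
      surrounding search loop is assumed):

          /* stop cleanly at the top of the address space instead of
           * wrapping the u64 */
          if (state->end == (u64)-1)
                  break;
          start = state->end + 1;  /* cannot overflow after the check */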
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29
      Authored by Chris Mason
      Both set and clear_extent_bit allow passing a cached
      state struct to reduce rbtree search times.  clear_extent_bit
      was improperly bypassing some of the checks around making sure
      the extent state fields were correct for a given operation.
      
      The fix used here (from Yan Zheng) is to use the hit_next
      goto target instead of jumping all the way down to start clearing
      bits without making sure the cached state was exactly correct
      for the operation we were doing.
      
      This also fixes up the setting of the start variable for both
      ops in the case where we find an overlapping extent that
      begins before the range we want to change.  In both cases
      we were incorrectly going backwards from the original
      requested change.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. 19 September 2009, 1 commit
    • Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Authored by Chris Mason
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of the allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
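
      A minimal sketch of the per-page recheck inside the pagevec loop of
      extent_write_cache_pages (loop variables assumed):

          for (i = 0; i < nr_pages && !done; i++) {
                  struct page *page = pvec.pages[i];

                  /* nr_to_write may have been raised by the delalloc
                   * filler after it already hit zero, so re-read it for
                   * every page instead of once per pagevec */
                  if (wbc->nr_to_write <= 0)
                          done = 1;
          }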
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 12 September 2009, 9 commits
    • Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Authored by Chris Mason
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, the page was
      dirtied without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accounting is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
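
      A minimal sketch of the page-flag usage (the PagePrivate2 helpers are
      the stock page-flags macros; writepage_fixup is a hypothetical name
      for the fixup path):

          /* as pages enter the ordered code: they are already locked,
           * so tagging them is cheap */
          SetPagePrivate2(page);

          /* at writepage time: a missing tag means the page was dirtied
           * without page_mkwrite, so route it to the fixup code */
          if (!PagePrivate2(page))
                  return writepage_fixup(page);

          /* as IO finishes: clear the tag and update the ordered
           * accounting for this page */
          TestClearPagePrivate2(page);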
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Authored by Chris Mason
      This changes the btrfs code that finds delalloc ranges in the extent
      state tree to use the new state caching code from set/test bit.  It
      reduces one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Authored by Chris Mason
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: cache values for locking extents · 2c64c53d
      Authored by Chris Mason
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
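
      A minimal sketch of the pattern, using the lock/unlock helpers added
      around this series (signatures paraphrased):

          struct extent_state *cached = NULL;

          /* the lock fills in 'cached' so the unlock can skip the
           * rbtree search entirely */
          lock_extent_bits(tree, start, end, 0, &cached, GFP_NOFS);
          /* ... operate on the locked range ... */
          unlock_extent_cached(tree, start, end, &cached, GFP_NOFS);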
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Authored by Chris Mason
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix new state initialization order · e48c465b
      Authored by Chris Mason
      As the extent state tree is manipulated, there are callbacks
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly setup the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing callbacks
      don't care about the lock bit.
      
      This patch makes sure the state is properly setup before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
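
      A minimal sketch of the corrected ordering in the insert path
      (set_state_cb is the internal callback hook in extent_io.c):

          /* fill in the new state before invoking the callback, so that
           * e.g. the per-inode delalloc byte counter sees valid values */
          state->start = start;
          state->end = end;
          set_state_cb(tree, state, bits);
          state->state |= bits;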
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch extent_map to a rw lock · 890871be
      Authored by Chris Mason
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
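
      A minimal sketch of the resulting usage (lookup_extent_mapping and
      add_extent_mapping are the existing extent_map helpers):

          /* lookups, the overwhelmingly common case, now run in parallel */
          read_lock(&em_tree->lock);
          em = lookup_extent_mapping(em_tree, start, len);
          read_unlock(&em_tree->lock);

          /* insertions still take the lock exclusively */
          write_lock(&em_tree->lock);
          ret = add_extent_mapping(em_tree, em);
          write_unlock(&em_tree->lock);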
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Authored by Chris Mason
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: optimize set extent bit · 40431d6c
      Authored by Chris Mason
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
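
      A minimal sketch of the adjacency test (rb_next and rb_entry are the
      stock rbtree helpers; the goto target follows the existing code):

          next_node = rb_next(&state->rb_node);
          if (next_node && start < end) {
                  state = rb_entry(next_node, struct extent_state, rb_node);
                  if (state->start == start)
                          goto hit_next;  /* adjacent: skip the search */
          }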
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 10 June 2009, 1 commit
  5. 27 April 2009, 1 commit
  6. 25 April 2009, 1 commit
  7. 21 April 2009, 3 commits
    • Btrfs: fix oops on page->mapping->host during writepage · 11c8349b
      Authored by Chris Mason
      The extent_io writepage call updates the writepage index in the inode
      as it makes progress.  But, it was doing the update after unlocking the page,
      which isn't legal because page->mapping can't be trusted once the page
      is unlocked.
      
      This led to an oops, especially common with compression turned on.  The
      fix here is to update the writeback index before unlocking the page.
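
      A minimal sketch of the ordering; which index field gets updated is
      an assumption here, the point is that it happens before the unlock:

          /* page->mapping is only trustworthy while the page is locked */
          page->mapping->writeback_index = page->index + 1;
          unlock_page(page);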
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Authored by Chris Mason
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization, it is not involved in data
      or metadata integrity.
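
      A minimal sketch of how a caller marks work as high priority
      (btrfs_set_work_high_prio is from this patch; the call site shown is
      an assumption):

          /* checksumming work for a synchronous bio jumps the queue */
          btrfs_set_work_high_prio(&async->work);
          btrfs_queue_worker(&fs_info->workers, &async->work);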
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use WRITE_SYNC for synchronous writes · ffbd517d
      Authored by Chris Mason
      Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
      writes we plan on waiting on in the near future.  This patch
      mirrors recent changes in other filesystems and the generic code to
      use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
      other latency critical writes.
      
      Btrfs uses async worker threads for checksumming before the write is done,
      and then again to actually submit the bios.  The bio submission code just
      runs a per-device list of bios that need to be sent down the pipe.
      
      This list is split into low priority and high priority lists so the
      WRITE_SYNC IO happens first.
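
      A minimal sketch of the flag selection (submit_bio and WRITE_SYNC are
      the stock block-layer interfaces of that era):

          int rw = WRITE;

          /* writes we plan to wait on shortly get the synchronous hint */
          if (wbc->sync_mode == WB_SYNC_ALL)
                  rw = WRITE_SYNC;
          submit_bio(rw, bio);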
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 03 April 2009, 1 commit
  9. 25 March 2009, 1 commit
    • Btrfs: leave btree locks spinning more often · b9473439
      Authored by Chris Mason
      btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
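
      A minimal sketch of a caller opting in (error handling omitted):

          struct btrfs_path *path = btrfs_alloc_path();

          path->leave_spinning = 1;
          ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
          /* the locks held by 'path' are spinning locks; don't do
           * anything that schedules before dropping them */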
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. 13 February 2009, 1 commit
  11. 04 February 2009, 3 commits
    • Btrfs: don't return congestion in write_cache_pages as often · 9b0d3ace
      Authored by Chris Mason
      On fast devices that go from congested to uncongested very quickly, pdflush
      is waiting too often in congestion_wait, and the FS is backing off too
      easily in write_cache_pages.
      
      For now, fix this on the btrfs side by only checking congestion after
      some bios have already gone down.  Longer term a real fix is needed
      for pdflush, but that is a larger project.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9b0d3ace
    • Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Authored by Chris Mason
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop;
      most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks,
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
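
      A minimal sketch of the lifecycle described above, using the helpers
      this commit introduces (the work that may schedule is elided):

          btrfs_tree_lock(eb);           /* returns with the spinlock held */
          btrfs_set_lock_blocking(eb);   /* set EXTENT_BUFFER_BLOCKING and
                                          * drop the spinlock; the buffer
                                          * is still considered locked */
          /* ... work that may schedule ... */
          btrfs_clear_lock_blocking(eb); /* re-take the spinlock */
          btrfs_tree_unlock(eb);         /* handles either mode */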
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: disable leak debugging checks in extent_io.c · 3935127c
      Authored by Chris Mason
      extent_io.c has debugging code to report and free leaked extent_state
      and extent_buffer objects at rmmod time.  This helps track down
      leaks and it saves you from rebooting just to properly remove the
      kmem_cache object.
      
      But, the code runs under a fairly expensive spinlock and the checks to
      see if it is currently enabled are not entirely consistent.  Some use
      #ifdef and some #if.
      
      This changes everything to #if and disables the leak checking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 22 January 2009, 1 commit
  13. 21 January 2009, 1 commit
  14. 06 January 2009, 3 commits
  15. 18 December 2008, 1 commit
    • Btrfs: shift all end_io work to thread pools · cad321ad
      Authored by Chris Mason
      bio_end_io processing for reads without checksumming, and for btree
      writes, was happening without using the async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delalloc
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.
      
      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  16. 09 December 2008, 2 commits
    • Btrfs: Use map_private_extent_buffer during generic_bin_search · 934d375b
      Authored by Chris Mason
      It is possible that generic_bin_search will be called on a tree block
      that has not been locked.  This happens because cache_block_group skips
      locking on the tree blocks.
      
      Since the tree block isn't locked, we aren't allowed to change
      the extent_buffer->map_token field.  Using map_private_extent_buffer
      avoids any changes to the internal extent buffer fields.
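
      A minimal sketch of the call; the argument list follows the
      extent_buffer mapping API of that era and is an assumption:

          /* read-only mapping: leaves eb->map_token untouched, so it is
           * safe on a tree block we don't hold the lock on */
          err = map_private_extent_buffer(eb, offset,
                                          sizeof(struct btrfs_disk_key),
                                          &map_token, &kaddr,
                                          &map_start, &map_len, KM_USER0);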
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: move data checksumming into a dedicated tree · d20f7043
      Authored by Chris Mason
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have to touch
      the subvolume and inodes as we cache extents.
      
      * There is potentially one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by physical extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
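
      A minimal sketch of the key layout (the objectid and key type are
      the values this commit adds to ctree.h):

          struct btrfs_key key;

          key.objectid = BTRFS_EXTENT_CSUM_OBJECTID; /* magic objectid */
          key.type = BTRFS_EXTENT_CSUM_KEY;
          key.offset = disk_bytenr;  /* starting byte of the extent */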
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  17. 02 December 2008, 2 commits
  18. 20 November 2008, 3 commits
    • Btrfs: only flush down bios for writeback pages · 0e6bd956
      Authored by Chris Mason
      The btrfs write_cache_pages call has a flush function so that it submits
      the bio it has been building before it waits on any writeback pages.
      
      This adds a check so that flush only happens on writeback pages.
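
      A minimal sketch of the check (flush_fn is the flush callback that
      write_cache_pages is already given):

          if (wbc->sync_mode != WB_SYNC_NONE) {
                  /* only kick the pending bio when we are actually
                   * about to wait on a page under writeback */
                  if (PageWriteback(page))
                          flush_fn(data);
                  wait_on_page_writeback(page);
          }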
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fixes for 2.6.28-rc API changes · 15916de8
      Authored by Chris Mason
      * open/close_bdev_excl -> open/close_bdev_exclusive
      * blkdev_issue_discard takes a GFP mask now (see the sketch below)
      * Fix blkdev_issue_discard usage now that it is enabled
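
      A minimal sketch of the updated discard call (the 2.6.28 signature
      takes sectors plus a GFP mask; the byte-to-sector shifts are
      assumptions about the call site):

          ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL);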
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Avoid writeback stalls · d2c3f4f6
      Authored by Chris Mason
      While building large bios in writepages, btrfs may end up waiting
      for other page writeback to finish if WB_SYNC_ALL is used.
      
      While it is waiting, the bio it is building has a number of pages with the
      writeback bit set and they aren't getting to the disk any time soon.  This
      lowers the latencies of writeback in general by sending down the bio being
      built before waiting for other pages.
      
      The bio submission code tries to limit the total number of async bios in
      flight by waiting when we're over a certain number of async bios.  But,
      the waits are happening while writepages is building bios, and this can easily
      lead to stalls and other problems for people calling wait_on_page_writeback.
      
      The current fix is to let the congestion tests take care of waiting.
      
      sync() and others make sure to drain the current async requests to make
      sure that everything that was pending when the sync was started really gets
      to disk.  The code would drain pending requests both before and after
      submitting a new request.
      
      But, if one of the requests is waiting for page writeback to finish,
      the draining waits might block that page writeback.  This changes the
      draining code to only wait after submitting the bio being processed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  19. 11 November 2008, 2 commits