1. 15 2月, 2011 2 次提交
    • C
      Btrfs: don't release pages when we can't clear the uptodate bits · e3f24cc5
      Chris Mason 提交于
      Btrfs tracks uptodate state in an rbtree as well as in the
      page bits.  This is supposed to enable us to use block sizes other than
      the page size, but there are a few parts still missing before that
      completely works.
      
      But, our readpage routine trusts this additional range based tracking
      of uptodateness, much in the same way the buffer head up to date bits
      are trusted for the other filesystems.
      
      The problem is that sometimes we need to allocate memory in order to
      split records in the rbtree, even when we are just clearing bits.  This
      can be difficult when our clearing function is called GFP_ATOMIC, which
      can happen in the releasepage path.
      
      So, what happens today looks like this:
      
      releasepage called with GFP_ATOMIC
      btrfs_releasepage calls clear_extent_bit
      clear_extent_bit fails to allocate ram, leaving the up to date bit set
      btrfs_releasepage returns success
      
      The end result is the page being gone, but btrfs thinking the range is
      up to date.   Later on if someone tries to read that same page, the
      btrfs readpage code will return immediately thinking the page is already
      up to date.
      
      This commit fixes things to fail the releasepage when we can't clear the
      extent state bits.  It covers both data pages and metadata tree blocks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e3f24cc5
    • C
      Btrfs: fix page->private races · eb14ab8e
      Chris Mason 提交于
      There is a race where btrfs_releasepage can drop the
      page->private contents just as alloc_extent_buffer is setting
      up pages for metadata.  Because of how the Btrfs page flags work,
      this results in us skipping the crc on the page during IO.
      
      This patch sovles the race by waiting until after the extent buffer
      is inserted into the radix tree before it sets page private.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      eb14ab8e
  2. 01 2月, 2011 1 次提交
  3. 29 1月, 2011 1 次提交
  4. 17 1月, 2011 1 次提交
  5. 22 12月, 2010 1 次提交
  6. 28 11月, 2010 1 次提交
    • J
      Btrfs: fix fiemap · 975f84fe
      Josef Bacik 提交于
      There are two big problems currently with FIEMAP
      
      1) We return extents for holes.  This isn't supposed to happen, we just don't
      return extents for holes and then userspace interprets the lack of an extent as
      a hole.
      
      2) We sometimes don't set FIEMAP_EXTENT_LAST properly.  This is because we wait
      to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
      fiemap to map up to the last extent in a file, and there is nothing but holes up
      to the i_size.  To fix this we need to lookup the last extent in this file and
      save the logical offset, so if we happen to try and map that extent we can be
      sure to set FIEMAP_EXTENT_LAST.
      
      With this patch we now pass xfstest 225, which we never have before.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      975f84fe
  7. 22 11月, 2010 2 次提交
  8. 30 10月, 2010 2 次提交
  9. 29 10月, 2010 2 次提交
  10. 06 7月, 2010 1 次提交
  11. 26 5月, 2010 1 次提交
    • C
      Btrfs: rework O_DIRECT enospc handling · 4845e44f
      Chris Mason 提交于
      This changes O_DIRECT write code to mark extents as delalloc
      while it is processing them.  Yan Zheng has reworked the
      enospc accounting based on tracking delalloc extents and
      this makes it much easier to track enospc in the O_DIRECT code.
      
      There are a few space cases with the O_DIRECT code though,
      it only sets the EXTENT_DELALLOC bits, instead of doing
      EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
      we don't want to mess with clearing the dirty and uptodate
      bits when things go wrong.  This is important because there
      are no pages in the page cache, so any extent state structs
      that we put in the tree won't get freed by releasepage.  We have
      to clear them ourselves as the DIO ends.
      
      With this commit, we reserve space at in btrfs_file_aio_write,
      and then as each btrfs_direct_IO call progresses it sets
      EXTENT_DELALLOC on the range.
      
      btrfs_get_blocks_direct is responsible for clearing the delalloc
      at the same time it drops the extent lock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4845e44f
  12. 25 5月, 2010 3 次提交
  13. 06 4月, 2010 1 次提交
    • N
      Btrfs: use add_to_page_cache_lru, use __page_cache_alloc · 28ecb609
      Nick Piggin 提交于
      Pagecache pages should be allocated with __page_cache_alloc, so they
      obey pagecache memory policies.
      
      add_to_page_cache_lru is exported, so it should be used. Benefits over
      using a private pagevec: neater code, 128 bytes fewer stack used, percpu
      lru ordering is preserved, and finally don't need to flush pagevec
      before returning so batching may be shared with other LRU insertions.
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>:
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      28ecb609
  14. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  15. 15 3月, 2010 3 次提交
    • J
      Btrfs: cache the extent state everywhere we possibly can V2 · 2ac55d41
      Josef Bacik 提交于
      This patch just goes through and fixes everybody that does
      
      lock_extent()
      blah
      unlock_extent()
      
      to use
      
      lock_extent_bits()
      blah
      unlock_extent_cached()
      
      and pass around a extent_state so we only have to do the searches once per
      function.  This gives me about a 3 mb/s boots on my random write test.  I have
      not converted some things, like the relocation and ioctl's, since they aren't
      heavily used and the relocation stuff is in the middle of being re-written.  I
      also changed the clear_extent_bit() to only unset the cached state if we are
      clearing EXTENT_LOCKED and related stuff, so we can do things like this
      
      lock_extent_bits()
      clear delalloc bits
      unlock_extent_cached()
      
      without losing our cached state.  I tested this thoroughly and turned on
      LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
      fine.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2ac55d41
    • J
      Btrfs: cache extent state in find_delalloc_range · c2a128d2
      Josef Bacik 提交于
      This patch makes us cache the extent state we find in find_delalloc_range since
      we'll have to lock the extent later on in the function.  This will keep us from
      re-searching for the rang when we try to lock the extent.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c2a128d2
    • C
      Btrfs: finish read pages in the order they are submitted · 4125bf76
      Chris Mason 提交于
      The endio is done at reverse order of bio vectors.
      
      That means for a sequential read, the page first submitted will finish
      last in a bio. Considering we will do checksum (making cache hot) for
      every page, this does introduce delay (and chance to squeeze cache used
      soon) for pages submitted at the begining.
      
      I don't observe obvious performance difference with below patch at my
      simple test, but seems more natural to finish read in the order they are
      submitted.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4125bf76
  16. 09 3月, 2010 1 次提交
  17. 05 2月, 2010 1 次提交
  18. 09 10月, 2009 2 次提交
    • J
      Btrfs: release delalloc reservations on extent item insertion · 32c00aff
      Josef Bacik 提交于
      This patch fixes an issue with the delalloc metadata space reservation
      code.  The problem is we used to free the reservation as soon as we
      allocated the delalloc region.  The problem with this is if we are not
      inserting an inline extent, we don't actually insert the extent item until
      after the ordered extent is written out.  This patch does 3 things,
      
      1) It moves the reservation clearing stuff into the ordered code, so when
      we remove the ordered extent we remove the reservation.
      2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
      delalloc bits in the cases where we want to clear the metadata reservation
      when we clear the delalloc extent, in the case that we do an inline extent
      or we invalidate the page.
      3) It adds another waitqueue to the space info so that when we start a fs
      wide delalloc flush, anybody else who also hits that area will simply wait
      for the flush to finish and then try to make their allocation.
      
      This has been tested thoroughly to make sure we did not regress on
      performance.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      32c00aff
    • C
      Btrfs: cleanup extent_clear_unlock_delalloc flags · a791e35e
      Chris Mason 提交于
      extent_clear_unlock_delalloc has a growing set of ugly parameters
      that is very difficult to read and maintain.
      
      This switches to a flag field and well named flag defines.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a791e35e
  19. 29 9月, 2009 1 次提交
    • J
      Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik 提交于
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then once we've done our
      modifications and such, just call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in a ENOSPC
      related panic.  This patch also turns off the metadata_ratio stuff in order to
      allow users to more efficiently use their disk space.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io tree's, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are handled prior to any actually dirty'ing occurs,
      and then we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of doing any of the actual dirty'ing lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9ed74f2d
  20. 24 9月, 2009 3 次提交
    • C
      Btrfs: fix releasepage to avoid unlocking extents we haven't locked · 11ef160f
      Chris Mason 提交于
      During releasepage, we try to drop any extent_state structs for the
      bye offsets of the page we're releaseing.  But the code was incorrectly
      telling clear_extent_bit to delete the state struct unconditionallly.
      
      Normally this would be fine because we have the page locked, but other
      parts of btrfs will lock down an entire extent, the most common place
      being IO completion.
      
      releasepage was deleting the extent state without first locking the extent,
      which may result in removing a state struct that another process had
      locked down.  The fix here is to leave the NODATASUM and EXTENT_LOCKED
      bits alone in releasepage.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      11ef160f
    • C
      Btrfs: Fix test_range_bit for whole file extents · 46562cec
      Chris Mason 提交于
      If test_range_bit finds an extent that goes all the way to (u64)-1, it
      can incorrectly wrap the u64 instead of treaing it like the end of
      the address space.
      
      This just adds a check for the highest possible offset so we don't wrap.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      46562cec
    • C
      Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29
      Chris Mason 提交于
      Both set and clear_extent_bit allow passing a cached
      state struct to reduce rbtree search times.  clear_extent_bit
      was improperly bypassing some of the checks around making sure
      the extent state fields were correct for a given operation.
      
      The fix used here (from Yan Zheng) is to use the hit_next
      goto target instead of jumping all the way down to start clearing
      bits without making sure the cached state was exactly correct
      for the operation we were doing.
      
      This also fixes up the setting of the start variable for both
      ops in the case where we find an overlapping extent that
      begins before the range we want to change.  In both cases
      we were incorrectly going backwards from the original
      requested change.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      42daec29
  21. 19 9月, 2009 1 次提交
    • C
      Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason 提交于
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f85d7d6c
  22. 12 9月, 2009 8 次提交
    • C
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason 提交于
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8b62b72b
    • C
      Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason 提交于
      This changes the btrfs code to find delalloc ranges in the extent state
      tree to use the new state caching code from set/test bit.  It reduces
      one of the biggest causes of rbtree searches in the writeback path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9655d298
    • C
      Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason 提交于
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d5550c63
    • C
      Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason 提交于
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2c64c53d
    • C
      Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason 提交于
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1edbb734
    • C
      Btrfs: Fix new state initialization order · e48c465b
      Chris Mason 提交于
      As the extent state tree is manipulated, there are call backs
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.
      
      When new states are inserted, this callback is being done before we
      properly setup the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing call backs
      don't care about the lock bit.
      
      This patch makes sure the state is properly setup before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e48c465b
    • C
      Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason 提交于
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to phyiscal ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      890871be
    • C
      Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason 提交于
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a97adc9f