1. 27 3月, 2012 3 次提交
    • J
      Btrfs: only use the existing eb if it's count isn't 0 · 115391d2
      Josef Bacik 提交于
      We can run into a problem where we find an eb for our existing page already on
      the radix tree but it has a ref count of 0.  It hasn't yet been removed by RCU
      yet so this can cause issues where we will use the EB after free.  So do
      atomic_inc_not_zero on the exists->refs and if it is zero just do
      synchronize_rcu() and try again.  We won't have to worry about new allocators
      coming in since they will block on the page lock at this point.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      115391d2
    • J
      Btrfs: set page->private to the eb · 4f2de97a
      Josef Bacik 提交于
      We spend a lot of time looking up extent buffers from pages when we could just
      store the pointer to the eb the page is associated with in page->private.  This
      patch does just that, and it makes things a little simpler and reduces a bit of
      CPU overhead involved with doing metadata IO.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4f2de97a
    • C
      Btrfs: allow metadata blocks larger than the page size · 727011e0
      Chris Mason 提交于
      A few years ago the btrfs code to support blocks lager than
      the page size was disabled to fix a few corner cases in the
      page cache handling.  This fixes the code to properly support
      large metadata blocks again.
      
      Since current kernels will crash early and often with larger
      metadata blocks, this adds an incompat bit so that older kernels
      can't mount it.
      
      This also does away with different blocksizes for nodes and leaves.
      You get a single block size for all tree blocks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      727011e0
  2. 23 2月, 2012 1 次提交
    • C
      Btrfs: clear the extent uptodate bits during parent transid failures · 50653190
      Chris Mason 提交于
      If btrfs reads a block and finds a parent transid mismatch, it clears
      the uptodate flags on the extent buffer, and the pages inside it.  But
      we only clear the uptodate bits in the state tree if the block straddles
      more than one page.
      
      This is from an old optimization from to reduce contention on the extent
      state tree.  But it is buggy because the code that retries a read from
      a different copy of the block is going to find the uptodate state bits
      set and skip the IO.
      
      The end result of the bug is that we'll never actually read the good
      copy (if there is one).
      
      The fix here is to always clear the uptodate state bits, which is safe
      because this code is only called when the parent transid fails.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      50653190
  3. 21 2月, 2012 1 次提交
  4. 17 2月, 2012 4 次提交
  5. 15 2月, 2012 1 次提交
    • J
      btrfs: delalloc for page dirtied out-of-band in fixup worker · 87826df0
      Jeff Mahoney 提交于
       We encountered an issue that was easily observable on s/390 systems but
       could really happen anywhere. The timing just seemed to hit reliably
       on s/390 with limited memory.
      
       The gist is that when an unexpected set_page_dirty() happened, we'd
       run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
       properly set up for delalloc.
      
       This patch does the following:
       - Performs the missing delalloc in the fixup worker
       - Allow the start hook to return -EBUSY which informs __extent_writepage
         that it should mark the page skipped and not to redirty it. This is
         required since the fixup worker can fail with -ENOSPC and the page
         will have already been redirtied. That causes an Oops in
         drop_outstanding_extents later. Retrying the fixup worker could
         lead to an infinite loop. Deferring the page redirty also saves us
         some cycles since the page would be stuck in a resubmit-redirty loop
         until the fixup worker completes. It's not harmful, just wasteful.
       - If the fixup worker fails, we mark the page and mapping as errored,
         and end the writeback, similar to what we would do had the page
         actually been submitted to writeback.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      87826df0
  6. 27 1月, 2012 1 次提交
    • M
      Btrfs: Check for NULL page in extent_range_uptodate · 8bedd51b
      Mitch Harder 提交于
      A user has encountered a NULL pointer kernel oops in btrfs when
      encountering media errors.  The problem has been identified
      as an unhandled NULL pointer returned from find_get_page().
      This modification simply checks for a NULL page, and returns
      with an error if found (the extent_range_uptodate() function
      returns 1 on errors).
      
      After testing this patch, the user reported that the error with
      the NULL pointer oops was solved.  However, there is still a
      remaining problem with a thread becoming stuck in
      wait_on_page_locked(page) in the read_extent_buffer_pages(...)
      function in extent_io.c
      
             for (i = start_i; i < num_pages; i++) {
                     page = extent_buffer_page(eb, i);
                     wait_on_page_locked(page);
                     if (!PageUptodate(page))
                             ret = -EIO;
             }
      
      This patch leaves the issue with the locked page yet to be resolved.
      Signed-off-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8bedd51b
  7. 04 1月, 2012 1 次提交
  8. 22 12月, 2011 1 次提交
    • S
      Btrfs: integrate integrity check module into btrfs · 21adbd5c
      Stefan Behrens 提交于
      This is the last part of the patch series. It modifies the btrfs
      code to use the integrity check module if configured to do so
      with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
      the only effective change is that code is added that handles the
      mount option to activate the integrity check. If the mount option is
      set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
      complains in the log and the mount fails with EINVAL.
      
      Add the mount option to activate the usage of the integrity check
      code.
      Add invocation of btrfs integrity check code init and cleanup
      function on mount and umount, respectively.
      Add hook to call btrfs integrity check code version of
      submit_bh/submit_bio.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      21adbd5c
  9. 08 12月, 2011 1 次提交
  10. 01 12月, 2011 1 次提交
  11. 20 11月, 2011 3 次提交
    • J
      Btrfs: sectorsize align offsets in fiemap · 4d479cf0
      Josef Bacik 提交于
      We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
      else that calls btrfs_get_extent while running xfstests 13 in a loop.  This is
      because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
      which will end up adding mappings that are not sectorsize aligned, which will
      cause problems in some cases for subsequent calls to btrfs_get_extent for
      similar areas that are sectorsize aligned.  With this patch I ran xfstests 13 in
      a loop for a couple of hours and didn't hit the problem that I could previously
      hit in at most 20 minutes.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4d479cf0
    • J
      btrfs: mirror_num should be int, not u64 · 32240a91
      Jan Schmidt 提交于
      My previous patch introduced some u64 for failed_mirror variables, this one
      makes it consistent again.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      32240a91
    • J
      btrfs: Fix up 32/64-bit compatibility for new ioctls · 745c4d8e
      Jeff Mahoney 提交于
       This patch casts to unsigned long before casting to a pointer and fixes
       the following warnings:
      fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      745c4d8e
  12. 06 11月, 2011 2 次提交
    • C
      Btrfs: ClearPageError during writepage and clean_tree_block · bf0da8c1
      Chris Mason 提交于
      Failure testing was tripping up over stale PageError bits in
      metadata pages.  If we have an io error on a block, and later on
      end up reusing it, nobody ever clears PageError on those pages.
      
      During commit, we'll find PageError and think we had trouble writing
      the block, which will lead to aborts and other problems.
      
      This changes clean_tree_block and the btrfs writepage code to
      clear the PageError bit.  In both cases we're either completely
      done with the page or the page has good stuff and the error bit
      is no longer valid.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bf0da8c1
    • C
      Btrfs: make sure to flush queued bios if write_cache_pages waits · 01d658f2
      Chris Mason 提交于
      write_cache_pages tries to build up a large bio to stuff down the pipe.
      But if it needs to wait for a page lock, it needs to make sure and send
      down any pending writes so we don't deadlock with anyone who has the
      page lock and is waiting for writeback of things inside the bio.
      
      Dave Sterba triggered this as a deadlock between the autodefrag code and
      the extent write_cache_pages
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      01d658f2
  13. 21 10月, 2011 1 次提交
  14. 20 10月, 2011 2 次提交
    • J
      Btrfs: introduce convert_extent_bit · 462d6fac
      Josef Bacik 提交于
      If I have a range where I know a certain bit is and I want to set it to another
      bit the only option I have is to call set and then clear bit, which will result
      in 2 tree searches.  This is inefficient, so introduce convert_extent_bit which
      will go through and set the bit I want and clear the old bit I don't want.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      462d6fac
    • J
      Btrfs: skip looking for delalloc if we don't have ->fill_delalloc · 9e487107
      Josef Bacik 提交于
      We always look for delalloc bytes in our io_tree so we can fill in delalloc.
      This is fine in most cases, but if we're writing out the btree_inode this is
      just a superfluous tree search on the io_tree, and if we have a lot of metadata
      dirty this could be an expensive check.  So instead check to see if our io_tree
      has a ->fill_delalloc op, and if not don't even bother doing the lookup.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      9e487107
  15. 02 10月, 2011 2 次提交
    • A
      btrfs: hooks for readahead · 4bb31e92
      Arne Jansen 提交于
      This adds the hooks needed for readahead. In the readpage_end_io_hook,
      the extent state is checked for the EXTENT_READAHEAD flag. Only in this
      case the readahead hook is called, to keep the impact on non-ra as low
      as possible.
      Additionally, a hook for a failed IO is added, otherwise readahead would
      wait indefinitely for the extent to finish.
      
      Changes for v2:
       - eliminate race condition
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      4bb31e92
    • A
      btrfs: add an extra wait mode to read_extent_buffer_pages · bb82ab88
      Arne Jansen 提交于
      read_extent_buffer_pages currently has two modes, either trigger a read
      without waiting for anything, or wait for the I/O to finish. The former
      also bails when it's unable to lock the page. This patch now adds an
      additional parameter to allow it to block on page lock, but don't wait
      for completion.
      
      Changes v5:
       - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
         WAIT_PAGE_LOCK
      
      Change v6:
       - fix bug introduced in v5
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      bb82ab88
  16. 29 9月, 2011 2 次提交
    • J
      btrfs: Moved repair code from inode.c to extent_io.c · 4a54c8c1
      Jan Schmidt 提交于
      The raid-retry code in inode.c can be generalized so that it works for
      metadata as well. Thus, this patch moves it to extent_io.c and makes the
      raid-retry code a raid-repair code.
      
      Repair works that way: Whenever a read error occurs and we have more
      mirrors to try, note the failed mirror, and retry another. If we find a
      good one, check if we did note a failure earlier and if so, do not allow
      the read to complete until after the bad sector was written with the good
      data we just fetched. As we have the extent locked while reading, no one
      can change the data in between.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      4a54c8c1
    • J
      btrfs: add mirror_num to extent_read_full_page · 8ddc7d9c
      Jan Schmidt 提交于
      Currently, extent_read_full_page always assumes we are trying to read mirror
      0, which generally is the best we can do. To add flexibility, pass it as a
      parameter. This will be needed by scrub fixup code.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      8ddc7d9c
  17. 02 8月, 2011 5 次提交
  18. 28 7月, 2011 4 次提交
    • C
      Btrfs: reduce extent_state lock contention for metadata · 19b6caf4
      Chris Mason 提交于
      For metadata buffers that don't straddle pages (all of them), btrfs
      can safely use the page uptodate bits and extent_buffer uptodate bit
      instead of needing to use the extent_state tree.
      
      This greatly reduces contention on the state tree lock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      19b6caf4
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason 提交于
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd681513
    • C
      Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason 提交于
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6591715
    • J
      Btrfs: tag pages for writeback in sync · f7aaa06b
      Josef Bacik 提交于
      Everybody else does this, we need to do it too.  If we're syncing, we need to
      tag the pages we're going to write for writeback so we don't end up writing the
      same stuff over and over again if somebody is constantly redirtying our file.
      This will keep us from having latencies with heavy sync workloads.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f7aaa06b
  19. 11 7月, 2011 1 次提交
    • J
      Btrfs: fix how we merge extent states and deal with cached states · df98b6e2
      Josef Bacik 提交于
      First, we can sometimes free the state we're merging, which means anybody who
      calls merge_state() may have the state it passed in free'ed.  This is
      problematic because we could end up caching the state, which makes caching
      useless as the state will no longer be part of the tree.  So instead of free'ing
      the state we passed into merge_state(), set it's end to the other->end and free
      the other state.  This way we are sure to cache the correct state.  Also because
      we can merge states together, instead of only using the cache'd state if it's
      start == the start we are looking for, go ahead and use it if the start we are
      looking for is within the range of the cached state.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      df98b6e2
  20. 10 7月, 2011 1 次提交
    • W
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Wu Fengguang 提交于
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abuse it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      remained quota in wbc->nr_to_write.
      Acked-by: NJan Kara <jack@suse.cz>
      Proposed-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  21. 27 5月, 2011 2 次提交
    • C
      Btrfs: return -ENOMEM in clear_extent_bit · c309df07
      Chris Mason 提交于
      The btrfs releasepage function depends on ENOMEM coming
      back when it is called atomic.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c309df07
    • D
      btrfs: add cleancache support · 90a887c9
      Dan Magenheimer 提交于
      This sixth patch of eight in this cleancache series "opts-in"
      cleancache for btrfs.  Filesystems must explicitly enable
      cleancache by calling cleancache_init_fs anytime an instance
      of the filesystem is mounted.  Btrfs uses its own readpage
      which must be hooked, but all other cleancache hooks are in
      the VFS layer including the matching cleancache_flush_fs hook
      which must be called on unmount.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v6-v8: no changes]
      [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reviewed-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      90a887c9