1. 20 11月, 2011 1 次提交
    • J
      btrfs: Fix up 32/64-bit compatibility for new ioctls · 745c4d8e
      Jeff Mahoney 提交于
       This patch casts to unsigned long before casting to a pointer and fixes
       the following warnings:
      fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      745c4d8e
  2. 06 11月, 2011 2 次提交
    • C
      Btrfs: ClearPageError during writepage and clean_tree_block · bf0da8c1
      Chris Mason 提交于
      Failure testing was tripping up over stale PageError bits in
      metadata pages.  If we have an io error on a block, and later on
      end up reusing it, nobody ever clears PageError on those pages.
      
      During commit, we'll find PageError and think we had trouble writing
      the block, which will lead to aborts and other problems.
      
      This changes clean_tree_block and the btrfs writepage code to
      clear the PageError bit.  In both cases we're either completely
      done with the page or the page has good stuff and the error bit
      is no longer valid.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bf0da8c1
    • C
      Btrfs: make sure to flush queued bios if write_cache_pages waits · 01d658f2
      Chris Mason 提交于
      write_cache_pages tries to build up a large bio to stuff down the pipe.
      But if it needs to wait for a page lock, it needs to make sure and send
      down any pending writes so we don't deadlock with anyone who has the
      page lock and is waiting for writeback of things inside the bio.
      
      Dave Sterba triggered this as a deadlock between the autodefrag code and
      the extent write_cache_pages
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      01d658f2
  3. 21 10月, 2011 1 次提交
  4. 20 10月, 2011 2 次提交
    • J
      Btrfs: introduce convert_extent_bit · 462d6fac
      Josef Bacik 提交于
      If I have a range where I know a certain bit is and I want to set it to another
      bit the only option I have is to call set and then clear bit, which will result
      in 2 tree searches.  This is inefficient, so introduce convert_extent_bit which
      will go through and set the bit I want and clear the old bit I don't want.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      462d6fac
    • J
      Btrfs: skip looking for delalloc if we don't have ->fill_delalloc · 9e487107
      Josef Bacik 提交于
      We always look for delalloc bytes in our io_tree so we can fill in delalloc.
      This is fine in most cases, but if we're writing out the btree_inode this is
      just a superfluous tree search on the io_tree, and if we have a lot of metadata
      dirty this could be an expensive check.  So instead check to see if our io_tree
      has a ->fill_delalloc op, and if not don't even bother doing the lookup.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      9e487107
  5. 02 10月, 2011 2 次提交
    • A
      btrfs: hooks for readahead · 4bb31e92
      Arne Jansen 提交于
      This adds the hooks needed for readahead. In the readpage_end_io_hook,
      the extent state is checked for the EXTENT_READAHEAD flag. Only in this
      case the readahead hook is called, to keep the impact on non-ra as low
      as possible.
      Additionally, a hook for a failed IO is added, otherwise readahead would
      wait indefinitely for the extent to finish.
      
      Changes for v2:
       - eliminate race condition
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      4bb31e92
    • A
      btrfs: add an extra wait mode to read_extent_buffer_pages · bb82ab88
      Arne Jansen 提交于
      read_extent_buffer_pages currently has two modes, either trigger a read
      without waiting for anything, or wait for the I/O to finish. The former
      also bails when it's unable to lock the page. This patch now adds an
      additional parameter to allow it to block on page lock, but don't wait
      for completion.
      
      Changes v5:
       - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
         WAIT_PAGE_LOCK
      
      Change v6:
       - fix bug introduced in v5
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      bb82ab88
  6. 29 9月, 2011 2 次提交
    • J
      btrfs: Moved repair code from inode.c to extent_io.c · 4a54c8c1
      Jan Schmidt 提交于
      The raid-retry code in inode.c can be generalized so that it works for
      metadata as well. Thus, this patch moves it to extent_io.c and makes the
      raid-retry code a raid-repair code.
      
      Repair works that way: Whenever a read error occurs and we have more
      mirrors to try, note the failed mirror, and retry another. If we find a
      good one, check if we did note a failure earlier and if so, do not allow
      the read to complete until after the bad sector was written with the good
      data we just fetched. As we have the extent locked while reading, no one
      can change the data in between.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      4a54c8c1
    • J
      btrfs: add mirror_num to extent_read_full_page · 8ddc7d9c
      Jan Schmidt 提交于
      Currently, extent_read_full_page always assumes we are trying to read mirror
      0, which generally is the best we can do. To add flexibility, pass it as a
      parameter. This will be needed by scrub fixup code.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      8ddc7d9c
  7. 02 8月, 2011 5 次提交
  8. 28 7月, 2011 4 次提交
    • C
      Btrfs: reduce extent_state lock contention for metadata · 19b6caf4
      Chris Mason 提交于
      For metadata buffers that don't straddle pages (all of them), btrfs
      can safely use the page uptodate bits and extent_buffer uptodate bit
      instead of needing to use the extent_state tree.
      
      This greatly reduces contention on the state tree lock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      19b6caf4
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason 提交于
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd681513
    • C
      Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason 提交于
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6591715
    • J
      Btrfs: tag pages for writeback in sync · f7aaa06b
      Josef Bacik 提交于
      Everybody else does this, we need to do it too.  If we're syncing, we need to
      tag the pages we're going to write for writeback so we don't end up writing the
      same stuff over and over again if somebody is constantly redirtying our file.
      This will keep us from having latencies with heavy sync workloads.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f7aaa06b
  9. 11 7月, 2011 1 次提交
    • J
      Btrfs: fix how we merge extent states and deal with cached states · df98b6e2
      Josef Bacik 提交于
      First, we can sometimes free the state we're merging, which means anybody who
      calls merge_state() may have the state it passed in free'ed.  This is
      problematic because we could end up caching the state, which makes caching
      useless as the state will no longer be part of the tree.  So instead of free'ing
      the state we passed into merge_state(), set it's end to the other->end and free
      the other state.  This way we are sure to cache the correct state.  Also because
      we can merge states together, instead of only using the cache'd state if it's
      start == the start we are looking for, go ahead and use it if the start we are
      looking for is within the range of the cached state.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      df98b6e2
  10. 10 7月, 2011 1 次提交
    • W
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Wu Fengguang 提交于
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abuse it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      remained quota in wbc->nr_to_write.
      Acked-by: NJan Kara <jack@suse.cz>
      Proposed-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  11. 27 5月, 2011 2 次提交
    • C
      Btrfs: return -ENOMEM in clear_extent_bit · c309df07
      Chris Mason 提交于
      The btrfs releasepage function depends on ENOMEM coming
      back when it is called atomic.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c309df07
    • D
      btrfs: add cleancache support · 90a887c9
      Dan Magenheimer 提交于
      This sixth patch of eight in this cleancache series "opts-in"
      cleancache for btrfs.  Filesystems must explicitly enable
      cleancache by calling cleancache_init_fs anytime an instance
      of the filesystem is mounted.  Btrfs uses its own readpage
      which must be hooked, but all other cleancache hooks are in
      the VFS layer including the matching cleancache_flush_fs hook
      which must be called on unmount.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v6-v8: no changes]
      [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reviewed-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      90a887c9
  12. 24 5月, 2011 3 次提交
  13. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  14. 06 5月, 2011 1 次提交
  15. 02 5月, 2011 5 次提交
  16. 26 4月, 2011 1 次提交
  17. 25 4月, 2011 1 次提交
    • L
      Btrfs: Always use 64bit inode number · 33345d01
      Li Zefan 提交于
      There's a potential problem in 32bit system when we exhaust 32bit inode
      numbers and start to allocate big inode numbers, because btrfs uses
      inode->i_ino in many places.
      
      So here we always use BTRFS_I(inode)->location.objectid, which is an
      u64 variable.
      
      There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
      inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
      and inode->i_ino will be used in those cases.
      
      Another reason to make this change is I'm going to use a special inode
      to save free ino cache, and the inode number must be > (u64)-256.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      33345d01
  18. 16 4月, 2011 1 次提交
  19. 13 4月, 2011 1 次提交
    • C
      Btrfs: make uncache_state unconditional · 109b36a2
      Chris Mason 提交于
      The extent_io code can take cached pointers into the extent state trees,
      and these can make lookups much faster in common operations.  The
      caching only happens when specific bits are set that prevent merging
      and splitting of the extent state.
      
      A help function was added to uncache the state, and it was testing
      the same set of conditionals.  This can leak in very strange corner
      cases where the lock bit goes away unexpectedly.
      
      The uncaching should be unconditional.  Once we have a ref on the
      extent we should always give it up.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      109b36a2
  20. 12 4月, 2011 2 次提交
    • A
      btrfs: using cached extent_state in set/unlock combinations · 507903b8
      Arne Jansen 提交于
      In several places the sequence (set_extent_uptodate, unlock_extent) is used.
      This leads to a duplicate lookup of the extent state. This patch lets
      set_extent_uptodate return a cached extent_state which can be passed to
      unlock_extent_cached.
      The occurences of the above sequences are updated to use the cache. Only
      end_bio_extent_readpage is updated that it first gets a cached state to
      pass it to the readpage_end_io_hook as the prototype requested and is later
      on being used for set/unlock.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      507903b8
    • S
      btrfs: properly handle overlapping areas in memmove_extent_buffer · 3387206f
      Sergei Trofimovich 提交于
      Fix data corruption caused by memcpy() usage on overlapping data.
      I've observed it first when found out usermode linux crash on btrfs.
      
      ?all chain is the following:
      ------------[ cut here ]------------
      WARNING: at /home/slyfox/linux-2.6/fs/btrfs/extent_io.c:3900 memcpy_extent_buffer+0x1a5/0x219()
      Call Trace:
      6fa39a58:  [<601b495e>] _raw_spin_unlock_irqrestore+0x18/0x1c
      6fa39a68:  [<60029ad9>] warn_slowpath_common+0x59/0x70
      6fa39aa8:  [<60029b05>] warn_slowpath_null+0x15/0x17
      6fa39ab8:  [<600efc97>] memcpy_extent_buffer+0x1a5/0x219
      6fa39b48:  [<600efd9f>] memmove_extent_buffer+0x94/0x208
      6fa39bc8:  [<600becbf>] btrfs_del_items+0x214/0x473
      6fa39c78:  [<600ce1b0>] btrfs_delete_one_dir_name+0x7c/0xda
      6fa39cc8:  [<600dad6b>] __btrfs_unlink_inode+0xad/0x25d
      6fa39d08:  [<600d7864>] btrfs_start_transaction+0xe/0x10
      6fa39d48:  [<600dc9ff>] btrfs_unlink_inode+0x1b/0x3b
      6fa39d78:  [<600e04bc>] btrfs_unlink+0x70/0xef
      6fa39dc8:  [<6007f0d0>] vfs_unlink+0x58/0xa3
      6fa39df8:  [<60080278>] do_unlinkat+0xd4/0x162
      6fa39e48:  [<600517db>] call_rcu_sched+0xe/0x10
      6fa39e58:  [<600452a8>] __put_cred+0x58/0x5a
      6fa39e78:  [<6007446c>] sys_faccessat+0x154/0x166
      6fa39ed8:  [<60080317>] sys_unlink+0x11/0x13
      6fa39ee8:  [<60016b80>] handle_syscall+0x58/0x70
      6fa39f08:  [<60021377>] userspace+0x2d4/0x381
      6fa39fc8:  [<60014507>] fork_handler+0x62/0x69
      ---[ end trace 70b0ca2ef0266b93 ]---
      
      http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg09302.htmlSigned-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3387206f
  21. 28 3月, 2011 1 次提交
    • L
      Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      liubo 提交于
      Tracepoints can provide insight into why btrfs hits bugs and be greatly
      helpful for debugging, e.g
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordere_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both read and write cases, and it is useful for tracking
      how btrfs specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are cirtical resourses and produce a lot of corner cases during writeback,
      so it is valuable to know how page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These can show where and when a inode is created, when a inode is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In transaction based filesystem, it will be useful to know the generation and
      who does commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references, these tracepoints are helpful on
      understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      Chunk is a link between physical offset and logical offset, and stands for space
      infomation in btrfs, and these are helpful on tracing space things.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1abe9b8a