1. 02 Aug, 2011 10 commits
  2. 28 Jul, 2011 14 commits
    • Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors · 75c195a2
      Chris Mason committed
      The btrfs transaction code will return any errors that come from
      reserve_metadata_bytes.  We need to make sure we don't return funny
      things like 1 or EAGAIN.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
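The idea of not leaking odd return values can be sketched as a small filter. This is only an illustration: `normalize_reserve_error` is a hypothetical helper, not a kernel function, and mapping positive values and `-EAGAIN` to `-ENOSPC` is an assumption about what a caller would want, not something the commit message states.

```c
#include <errno.h>

/*
 * Hypothetical helper (not part of btrfs): collapse "funny" results
 * into values a transaction caller can act on.
 */
static int normalize_reserve_error(int ret)
{
    if (ret == 0)
        return 0;               /* success passes through */
    if (ret > 0 || ret == -EAGAIN)
        return -ENOSPC;         /* assumption: report these as out of space */
    return ret;                 /* other negative errnos pass through */
}
```

With this filter, a caller only ever sees 0 or a negative errno other than `-EAGAIN`.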
    • Btrfs: use the commit_root for reading free_space_inode crcs · 2cf8572d
      Chris Mason committed
      Now that we are using regular file crcs for the free space cache,
      we can deadlock if we try to read the free_space_inode while we are
      updating the crc tree.
      
      This commit fixes things by using the commit_root to read the crcs.  This is
      safe because the free space cache file would already be loaded if
      that block group had been changed in the current transaction.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce extent_state lock contention for metadata · 19b6caf4
      Chris Mason committed
      For metadata buffers that don't straddle pages (all of them), btrfs
      can safely use the page uptodate bits and extent_buffer uptodate bit
      instead of needing to use the extent_state tree.
      
      This greatly reduces contention on the state tree lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: remove lockdep magic from btrfs_next_leaf · 31533fb2
      Chris Mason committed
      Before the reader/writer locks, btrfs_next_leaf needed to keep
      the path blocking to avoid making lockdep upset.
      
      Now that btrfs_next_leaf only takes read locks, this isn't required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason committed
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason committed
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix deadlock when throttling transactions · 81317fde
      Josef Bacik committed
      Hit this nice little deadlock.  What happens is this
      
      __btrfs_end_transaction with throttle set, --use_count so it equals 0
        btrfs_commit_transaction
          <somebody else actually manages to start the commit>
          btrfs_end_transaction --use_count so now it's -1 <== BAD
            we just return and wait on the transaction
      
      This is bad because we just return after our use_count is -1 and don't let go
      of our num_writer count on the transaction, so the guy committing the
      transaction just sits there forever.  Fix this by inc'ing our use_count if we're
      going to call commit_transaction so that if we call btrfs_end_transaction it's
      valid.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
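The use_count sequence described in this commit can be replayed with a toy model. This is a sketch under simplifying assumptions, not the kernel code: the structs, the `fixed` switch, and `writers_left` are hypothetical, and `commit_transaction` models only the case where somebody else has already started the commit.

```c
#include <stdbool.h>

struct transaction { int num_writers; };
struct handle { struct transaction *trans; int use_count; };

static void end_transaction(struct handle *h, bool throttle_commit, bool fixed);

/*
 * Models the case where somebody else already started the commit:
 * we only end our handle and would then wait for that commit.
 */
static void commit_transaction(struct handle *h, bool fixed)
{
    end_transaction(h, false, fixed);
}

static void end_transaction(struct handle *h, bool throttle_commit, bool fixed)
{
    if (--h->use_count != 0)
        return;                 /* nested user (or, in the bug, -1): writer slot is kept */

    if (throttle_commit) {
        if (fixed)
            h->use_count++;     /* keep the handle valid for the nested end */
        commit_transaction(h, fixed);
        return;
    }
    h->trans->num_writers--;    /* last user: release the writer slot */
}

/* Returns how many writers are still registered after ending one handle. */
static int writers_left(bool fixed)
{
    struct transaction t = { .num_writers = 1 };
    struct handle h = { .trans = &t, .use_count = 1 };

    end_transaction(&h, true, fixed);
    return t.num_writers;
}
```

In this model `writers_left(false)` leaves one writer registered, which is what keeps the committer waiting, while `writers_left(true)` drops the count to zero.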
    • Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason committed
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix BUG_ON() caused by ENOSPC when relocating space · 199c36ea
      Miao Xie committed
      When we balanced the chunks across the devices, BUG_ON() in
      __finish_chunk_alloc() was triggered.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:2568!
      [SNIP]
      Call Trace:
       [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
       [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
       [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
       [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
       [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
       [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
       [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
       [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
       [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
       [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
       [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
       [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
       [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
       [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
       [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
       [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
       [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
       [<ffffffff81159c63>] ? generic_permission+0x23/0xb0
       [<ffffffff8115b415>] vfs_create+0xa5/0xc0
       [<ffffffff8115ce6e>] do_last+0x5fe/0x880
       [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
       [<ffffffff8115e029>] do_filp_open+0x49/0xa0
       [<ffffffff8116a965>] ? alloc_fd+0x95/0x160
       [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
       [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
       [<ffffffff8114f1e0>] sys_open+0x20/0x30
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
      
      The reason is:
      Task1					Space balance task
      do_chunk_alloc()
        __finish_chunk_alloc()
          update device info
          in the chunk tree
            alloc system metadata block
      					relocate system metadata block group
      					  set system metadata block group
      					  readonly. This block group is the
      					  only one that can allocate space. So
      					  there is no free space that can be
      					  allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
        BUG_ON() in __finish_chunk_alloc()
        was triggered.
      
      Fix this bug by allocating a new system metadata chunk before relocating the
      old one if we find there is no free space which can be allocated after setting
      the old block group to be read-only.
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: tag pages for writeback in sync · f7aaa06b
      Josef Bacik committed
      Everybody else does this, we need to do it too.  If we're syncing, we need to
      tag the pages we're going to write for writeback so we don't end up writing the
      same stuff over and over again if somebody is constantly redirtying our file.
      This will keep us from having latencies with heavy sync workloads.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik committed
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do its allocation and you
      get those lovely warnings.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
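The interleaving in the table above can be replayed deterministically in a single thread. This is a sketch using C11 atomics rather than the kernel's `atomic_t` helpers; `struct counters` and `replay_race` are hypothetical names, and the code only reproduces the race, it does not show the fix.

```c
#include <stdatomic.h>

struct counters {
    atomic_int outstanding;
    atomic_int reserved;
};

/*
 * Replays the Reserver/Releaser steps in the order given in the commit
 * message. Returns how many extents the reserver decided to reserve.
 */
static int replay_race(struct counters *c)
{
    int expected = 1;
    int nr, have, to_reserve;

    atomic_init(&c->outstanding, 1);
    atomic_init(&c->reserved, 1);

    atomic_fetch_sub(&c->outstanding, 1);           /* Releaser: outstanding is now 0 */
    nr = atomic_load(&c->outstanding) + 1;          /* Reserver: reads 0, adds 1 */
    have = atomic_load(&c->reserved);               /* Reserver: reads 1 */
    to_reserve = nr > have ? nr - have : 0;         /* Reserver: equal, so nothing to reserve */
    atomic_compare_exchange_strong(&c->reserved, &expected, 0);  /* Releaser: 1 -> 0 */
    atomic_fetch_add(&c->outstanding, 1);           /* Reserver: outstanding is now 1 */
    atomic_fetch_add(&c->reserved, to_reserve);     /* Reserver: adds 0 */

    return to_reserve;
}
```

After the replay there is one outstanding extent but zero reserved extents, even though every individual step was atomic.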
    • Btrfs: don't flush delalloc arbitrarily · a5991428
      Josef Bacik committed
      Kill the check to see if we have 512MB of reserved space in delalloc and
      shrink_delalloc if we do.  This causes unexpected latencies and we have other
      logic to see if we need to throttle.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use find_or_create_page instead of grab_cache_page · a94733d0
      Josef Bacik committed
      grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
      GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
      need GFP_NOFS so we don't deadlock.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: use a worker thread to do caching · bab39bf9
      Josef Bacik committed
      A user reported a deadlock when copying a bunch of files.  This is because they
      were low on memory and kthreadd got hung up trying to migrate pages for an
      allocation when starting the caching kthread.  The page was locked by the person
      starting the caching kthread.  To fix this we just need to use the async thread
      stuff so that the threads are already created and we don't have to worry about
      deadlocks.  Thanks,
      Reported-by: Roman Mamedov <rm@romanrm.ru>
      Signed-off-by: Josef Bacik <josef@redhat.com>
  3. 26 Jul, 2011 2 commits
    • btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot · 38a1a919
      Mark Fasheh committed
      In addition to properly handling allocation failure from btrfs_alloc_path, I
      also fixed up the kzalloc error handling code immediately below it.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • btrfs: Don't BUG_ON alloc_path errors in find_next_chunk · 92b8e897
      Mark Fasheh committed
      I also removed the BUG_ON from error return of find_next_chunk in
      init_first_rw_device(). It turns out that the only caller of
      init_first_rw_device() also BUGS on any nonzero return so no actual behavior
      change has occurred here.
      
      do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
      which can now return -ENOMEM. Instead of setting space_info->full on any
      error from btrfs_alloc_chunk() I catch and return every error value _except_
      -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
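The error handling described for do_chunk_alloc() can be sketched as follows. This is a hedged illustration, not the kernel code: `struct space_info` is reduced to one field, `finish_chunk_alloc_attempt` is a hypothetical name, and returning 0 after marking the space as full is an assumption made for the sketch.

```c
#include <errno.h>
#include <stdbool.h>

struct space_info {
    bool full;      /* stand-in for space_info->full */
};

/*
 * alloc_ret stands in for the result of btrfs_alloc_chunk().
 * Every error except -ENOSPC is returned to the caller; only
 * -ENOSPC marks the space as full.
 */
static int finish_chunk_alloc_attempt(struct space_info *info, int alloc_ret)
{
    if (alloc_ret < 0 && alloc_ret != -ENOSPC)
        return alloc_ret;       /* e.g. -ENOMEM: propagate, leave 'full' alone */

    if (alloc_ret == -ENOSPC)
        info->full = true;      /* genuinely out of space */

    return 0;                   /* assumption: -ENOSPC is absorbed here */
}
```

The point of the split is that a transient failure such as `-ENOMEM` no longer causes the space to be treated as permanently full.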
  4. 15 Jul, 2011 5 commits
  5. 11 Jul, 2011 5 commits
    • Btrfs: fix how we merge extent states and deal with cached states · df98b6e2
      Josef Bacik committed
      First, we can sometimes free the state we're merging, which means anybody who
      calls merge_state() may have the state it passed in freed.  This is
      problematic because we could end up caching the state, which makes caching
      useless as the state will no longer be part of the tree.  So instead of freeing
      the state we passed into merge_state(), set its end to the other->end and free
      the other state.  This way we are sure to cache the correct state.  Also because
      we can merge states together, instead of only using the cached state if its
      start == the start we are looking for, go ahead and use it if the start we are
      looking for is within the range of the cached state.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
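Both halves of this change can be sketched with a plain range struct. This is a minimal illustration, not the kernel's extent_state code: `struct extent_state_sketch`, `merge_following`, and `cached_state_covers` are hypothetical names, and tree removal and freeing of the other state are left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent_state_sketch {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/*
 * Merge 'other' into 'state' when 'other' starts right after 'state'.
 * The passed-in state survives and grows; the caller would then drop
 * and free 'other'.
 */
static bool merge_following(struct extent_state_sketch *state,
                            const struct extent_state_sketch *other)
{
    if (other->start != state->end + 1)
        return false;
    state->end = other->end;
    return true;
}

/* A cached state is usable if 'start' falls anywhere inside its range. */
static bool cached_state_covers(const struct extent_state_sketch *cached,
                                uint64_t start)
{
    return cached != NULL && cached->start <= start && start <= cached->end;
}
```

Because the surviving state is the one the caller passed in, a pointer cached before the merge still refers to a live range, and the range check lets a lookup at an offset inside the merged range reuse it.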
    • Btrfs: use the normal checksumming infrastructure for free space cache · 2f356126
      Josef Bacik committed
      We used to store the checksums of the space cache directly in the space cache,
      however that doesn't work out too well if we have more space than we can fit the
      checksums into the first page.  So instead use the normal checksumming
      infrastructure.  There were problems with doing this originally but those
      problems don't exist now so this works out fine.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: serialize flushers in reserve_metadata_bytes · fdb5effd
      Josef Bacik committed
      We keep having problems with early enospc, and that's because our method of
      making space is inherently racy.  The problem is we can have one guy trying to
      make space for himself, and in the meantime people come in and steal his
      reservation.  In order to stop this we make a waitqueue and put anybody who
      comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
      make more space.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: do transaction space reservation before joining the transaction · b5009945
      Josef Bacik committed
      We have to do weird things when handling enospc in the transaction joining code.
      Because we've already joined the transaction we cannot commit the transaction
      within the reservation code since it will deadlock, so we have to return EAGAIN
      and then make sure we don't retry too many times.  Instead of doing this, just
      do the reservation the normal way before we join the transaction, that way we
      can do whatever we want to try and reclaim space, and then if it fails we know
      for sure we are out of space and we can return ENOSPC.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: try to only do one btrfs_search_slot in do_setxattr · fa09200b
      Josef Bacik committed
      I've been watching how many btrfs_search_slot()'s we do and I noticed that when
      we create a file with selinux enabled we were doing 2 each time we initialize
      the security context.  That's because we look up the xattr first so we can delete
      it if we're setting a new value to an existing xattr.  But in the create case we
      don't have any xattrs, so it is completely useless to have the extra lookup.  So
      re-arrange things so that we only look up first if we specifically have
      XATTR_REPLACE.  That way in the basic case we only do 1 search, and in the more
      complicated case we do the normal 2 lookups.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
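The reordering can be illustrated with a toy in-memory store that counts searches. Everything here is hypothetical: `struct xattr_store`, `search`, `insert`, `set_xattr`, and the `SK_XATTR_*` flags are stand-ins for the sketch, not the btrfs functions or the real XATTR_CREATE/XATTR_REPLACE definitions, and each `search` call stands in for one btrfs_search_slot().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum { SK_XATTR_CREATE = 1, SK_XATTR_REPLACE = 2 };

#define MAX_XATTRS 8

struct xattr_store {
    const char *names[MAX_XATTRS];
    const char *values[MAX_XATTRS];
    int searches;               /* number of simulated tree searches */
};

/* One simulated tree search: returns the slot holding 'name', or -1. */
static int search(struct xattr_store *s, const char *name)
{
    s->searches++;
    for (int i = 0; i < MAX_XATTRS; i++)
        if (s->names[i] && strcmp(s->names[i], name) == 0)
            return i;
    return -1;
}

/* Insert costs one search and fails with -EEXIST if the name is present. */
static int insert(struct xattr_store *s, const char *name, const char *value)
{
    if (search(s, name) >= 0)
        return -EEXIST;
    for (int i = 0; i < MAX_XATTRS; i++) {
        if (!s->names[i]) {
            s->names[i] = name;
            s->values[i] = value;
            return 0;
        }
    }
    return -ENOSPC;
}

static int set_xattr(struct xattr_store *s, const char *name,
                     const char *value, int flags)
{
    int slot, ret;

    /* Only look up first when the caller explicitly asked to replace. */
    if (flags & SK_XATTR_REPLACE) {
        slot = search(s, name);
        if (slot < 0)
            return -ENODATA;
        s->names[slot] = NULL;  /* delete the old value */
    }

    ret = insert(s, name, value);
    if (ret == -EEXIST && !(flags & SK_XATTR_CREATE)) {
        /* Plain set on an existing name: delete it and insert again. */
        slot = search(s, name);
        assert(slot >= 0);
        s->names[slot] = NULL;
        ret = insert(s, name, value);
    }
    return ret;
}
```

In this sketch, setting a new attribute on an empty store costs a single search, while a call with `SK_XATTR_REPLACE` on an existing name costs two.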
  6. 07 Jul, 2011 3 commits
    • btrfs: fix oops when doing space balance · 149e2d76
      Miao Xie committed
      We need to make sure the data relocation inode doesn't go through
      the delayed metadata updates, otherwise we get an oops during balance:
      
      kernel BUG at fs/btrfs/relocation.c:4303!
      [SNIP]
      Call Trace:
       [<ffffffffa03143fd>] ? update_ref_for_cow+0x22d/0x330 [btrfs]
       [<ffffffffa0314951>] __btrfs_cow_block+0x451/0x5e0 [btrfs]
       [<ffffffffa031355d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0314beb>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa031acae>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa032d8af>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
       [<ffffffff8147bf0e>] ? mutex_lock+0x1e/0x50
       [<ffffffffa0380cf1>] btrfs_update_delayed_inode+0x71/0x160 [btrfs]
       [<ffffffffa037ff27>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs]
       [<ffffffffa0381cf8>] btrfs_run_delayed_items+0xe8/0x120 [btrfs]
       [<ffffffffa03365e0>] btrfs_commit_transaction+0x250/0x850 [btrfs]
       [<ffffffff810f91d9>] ? find_get_pages+0x39/0x130
       [<ffffffffa0336cd5>] ? join_transaction+0x25/0x250 [btrfs]
       [<ffffffff81081de0>] ? wake_up_bit+0x40/0x40
       [<ffffffffa03785fa>] prepare_to_relocate+0xda/0xf0 [btrfs]
       [<ffffffffa037f2bb>] relocate_block_group+0x4b/0x620 [btrfs]
       [<ffffffffa0334cf5>] ? btrfs_clean_old_snapshots+0x35/0x150 [btrfs]
       [<ffffffffa037fa43>] btrfs_relocate_block_group+0x1b3/0x2e0 [btrfs]
       [<ffffffffa0368ec0>] ? btrfs_tree_unlock+0x50/0x50 [btrfs]
       [<ffffffffa035e39b>] btrfs_relocate_chunk+0x8b/0x670 [btrfs]
       [<ffffffffa031303d>] ? btrfs_set_path_blocking+0x3d/0x50 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa031bea1>] ? btrfs_previous_item+0xb1/0x150 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa035f5aa>] btrfs_balance+0x21a/0x2b0 [btrfs]
       [<ffffffffa0368898>] btrfs_ioctl+0x798/0xd20 [btrfs]
       [<ffffffff8111e358>] ? handle_mm_fault+0x148/0x270
       [<ffffffff814809e8>] ? do_page_fault+0x1d8/0x4b0
       [<ffffffff81160d6a>] do_vfs_ioctl+0x9a/0x540
       [<ffffffff811612b1>] sys_ioctl+0xa1/0xb0
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa037c1cc>] btrfs_reloc_cow_block+0x22c/0x270 [btrfs]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't panic if we get an error while balancing V2 · 508794eb
      Josef Bacik committed
      A user reported an error where if we try to balance an fs after a device has
      been removed it will blow up.  This is because we get an EIO back and this is
      where BUG_ON(ret) bites us in the ass.  To fix we just exit.  Thanks,
      Reported-by: Anand Jain <Anand.Jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: add missing options displayed in mount output · 0942caa3
      David Sterba committed
      There are three user-settable mount options which are not
      currently displayed in mount output.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 27 Jun, 2011 1 commit
    • btrfs: fix inconsonant inode information · 2f7e33d4
      Miao Xie committed
      When iputting the inode, we may leave the delayed node in place if it still has
      delayed items that have not been dealt with. So when the inode is read again,
      we must look up the related delayed node and use the information in it to
      initialize the inode. Otherwise we get inconsistent inode information, which
      can cause the same directory index number to be allocated again and trigger
      the following oops:
      
      [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
      insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
      [ 5447.569766] ------------[ cut here ]------------
      [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
      [SNIP]
      [ 5447.790721] Call Trace:
      [ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
      [ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
      [ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
      [ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
      [ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
      [ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
      [ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
      [ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
      [ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
      [ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
      [ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
      [ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
      [ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
      [ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
      [ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
      [ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
      [ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
      [ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
      
      Fix it by reusing the old delayed node.
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>