1. 20 Oct, 2011 8 commits
  2. 21 Aug, 2011 1 commit
  3. 17 Aug, 2011 3 commits
  4. 02 Aug, 2011 4 commits
  5. 28 Jul, 2011 7 commits
    • C
      Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors · 75c195a2
      Chris Mason committed
      The btrfs transaction code will return any errors that come from
      reserve_metadata_bytes.  We need to make sure we don't return funny
      things like 1 or EAGAIN.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      75c195a2
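A minimal userspace sketch of the idea in this commit: whatever the reservation code produces internally, callers should only ever see 0 or a negative errno. The helper name `sanitize_reserve_result` is hypothetical, and mapping stray values to `-ENOSPC` is an assumption made for illustration, not something the commit message states.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper, not the kernel function: collapse internal
 * signals into a value that is safe to return to transaction callers. */
static int sanitize_reserve_result(int ret)
{
    /* Positive values (such as 1) and retry hints (-EAGAIN) are internal
     * signals; in this sketch they are reported as "no space". */
    if (ret > 0 || ret == -EAGAIN)
        ret = -ENOSPC;
    assert(ret <= 0); /* callers only ever see 0 or a negative errno */
    return ret;
}
```

Real errors such as `-ENOMEM` pass through unchanged, so callers can still tell them apart from an out-of-space condition.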
    • C
      Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason committed
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      85d4e461
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason committed
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bd681513
    • M
      Btrfs: fix BUG_ON() caused by ENOSPC when relocating space · 199c36ea
      Miao Xie committed
      When we balanced the chunks across the devices, BUG_ON() in
      __finish_chunk_alloc() was triggered.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:2568!
      [SNIP]
      Call Trace:
       [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
       [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
       [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
       [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
       [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
       [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
       [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
       [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
       [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
       [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
       [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
       [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
       [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
       [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
       [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
       [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
       [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
       [<ffffffff81159c63>] ? generic_permission+0x23/0xb0
       [<ffffffff8115b415>] vfs_create+0xa5/0xc0
       [<ffffffff8115ce6e>] do_last+0x5fe/0x880
       [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
       [<ffffffff8115e029>] do_filp_open+0x49/0xa0
       [<ffffffff8116a965>] ? alloc_fd+0x95/0x160
       [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
       [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
       [<ffffffff8114f1e0>] sys_open+0x20/0x30
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
      
      The reason is:
      Task1					Space balance task
      do_chunk_alloc()
        __finish_chunk_alloc()
          update device info
          in the chunk tree
            alloc system metadata block
      					relocate system metadata block group
      					  set system metadata block group
      					  readonly, This block group is the
      					  only one that can allocate space. So
      					  there is no free space that can be
      					  allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
        BUG_ON() in __finish_chunk_alloc()
        was triggered.
      
      Fix this bug by allocating a new system metadata chunk before relocating the
      old one if we find there is no free space which can be allocated after setting
      the old block group to be read-only.
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      199c36ea
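The fix described in this commit can be sketched in userspace as follows: when marking a block group read-only leaves no writable free space of that type, allocate a replacement chunk before relocation begins. All names here (`space_sketch`, `alloc_chunk_sketch`, `set_block_group_ro_sketch`) are hypothetical stand-ins, and the accounting is far simpler than the real btrfs structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of one kind of space (e.g. system metadata). */
struct space_sketch {
    uint64_t writable_free;    /* free bytes in block groups still read-write */
    unsigned chunks_allocated; /* how many replacement chunks were created */
};

/* Stand-in for allocating a fresh chunk of the same type. */
static void alloc_chunk_sketch(struct space_sketch *s, uint64_t chunk_size)
{
    s->writable_free += chunk_size;
    s->chunks_allocated++;
}

/* Mark a block group holding bg_free free bytes read-only before relocating
 * it. If that leaves nothing writable, allocate a replacement chunk first. */
static void set_block_group_ro_sketch(struct space_sketch *s,
                                      uint64_t bg_free, uint64_t chunk_size)
{
    assert(bg_free <= s->writable_free);
    s->writable_free -= bg_free;
    if (s->writable_free == 0)
        alloc_chunk_sketch(s, chunk_size);
}
```

With the replacement chunk in place, a concurrent allocation in the chunk tree still finds space instead of hitting ENOSPC.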
    • J
      Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik committed
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do its allocation and you
      get those lovely warnings.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9e0baf60
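The interleaving in this commit message can be replayed deterministically. The sketch below uses plain ints in place of the atomics because the steps run in a fixed order; `struct counters` and `replay_race` are hypothetical names, and the "reserve the difference" rule is inferred from the message rather than taken from the kernel source.

```c
#include <assert.h>

struct counters {
    int outstanding; /* outstanding extents */
    int reserved;    /* extents that have space reserved */
};

/* Replays the race one step at a time, in the order shown in the message. */
static struct counters replay_race(void)
{
    struct counters c = { .outstanding = 1, .reserved = 1 };

    c.outstanding--;                 /* releaser: atomic_dec(outstanding) -> 0 */
    int wanted = c.outstanding + 1;  /* reserver: reads outstanding + 1 = 1 */
    int have = c.reserved;           /* reserver: reads reserved = 1 */
    int to_reserve = wanted > have ? wanted - have : 0;
    assert(to_reserve == 0);         /* equal counts, so nothing is reserved */
    if (c.reserved == 1)             /* releaser: atomic_cmpxchg(reserved, 1, 0) */
        c.reserved = 0;
    c.outstanding++;                 /* reserver: atomic_inc(outstanding) */
    c.reserved += to_reserve;        /* reserver: atomic_add(0, reserved) */
    /* releaser then frees the reserved space for one extent */
    return c;
}
```

The result has one outstanding extent and zero reserved extents, which is the state that later produces the warnings.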
    • J
      Btrfs: don't flush delalloc arbitrarily · a5991428
      Josef Bacik committed
      Kill the check to see if we have 512MB of reserved space in delalloc and
      shrink_delalloc if we do.  This causes unexpected latencies and we have other
      logic to see if we need to throttle.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a5991428
    • J
      Btrfs: use a worker thread to do caching · bab39bf9
      Josef Bacik committed
      A user reported a deadlock when copying a bunch of files.  This is because they
      were low on memory and kthreadd got hung up trying to migrate pages for an
      allocation when starting the caching kthread.  The page was locked by the person
      starting the caching kthread.  To fix this we just need to use the async thread
      stuff so that the threads are already created and we don't have to worry about
      deadlocks.  Thanks,
      Reported-by: Roman Mamedov <rm@romanrm.ru>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      bab39bf9
  6. 26 Jul, 2011 2 commits
    • M
      btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot · 38a1a919
      Mark Fasheh committed
      In addition to properly handling allocation failure from btrfs_alloc_path, I
      also fixed up the kzalloc error handling code immediately below it.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      38a1a919
    • M
      btrfs: Don't BUG_ON alloc_path errors in find_next_chunk · 92b8e897
      Mark Fasheh committed
      I also removed the BUG_ON from error return of find_next_chunk in
      init_first_rw_device(). It turns out that the only caller of
      init_first_rw_device() also BUGS on any nonzero return so no actual behavior
      change has occurred here.
      
      do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
      which can now return -ENOMEM. Instead of setting space_info->full on any
      error from btrfs_alloc_chunk() I catch and return every error value _except_
      -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      92b8e897
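A small sketch of the error handling described for do_chunk_alloc(): only -ENOSPC marks the space as full, and every other error is passed back to the caller. The names `space_info_sketch` and `handle_chunk_alloc_result` are hypothetical, and returning 0 after recording the full state is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct space_info_sketch {
    bool full; /* set once we know no more chunks can be allocated */
};

/* ret is the result of a (hypothetical) chunk allocation call. */
static int handle_chunk_alloc_result(struct space_info_sketch *info, int ret)
{
    assert(ret <= 0);
    if (ret == -ENOSPC) {
        info->full = true; /* genuinely out of space: remember it */
        return 0;          /* assumption: not reported as an error here */
    }
    return ret;            /* 0 on success, or e.g. -ENOMEM passed up */
}
```

The point of the distinction is that a transient failure such as -ENOMEM no longer marks the space as full.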
  7. 15 Jul, 2011 1 commit
    • M
      btrfs: don't BUG_ON btrfs_alloc_path() errors · d8926bb3
      Mark Fasheh committed
      This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
      failure. All the sites that are fixed in this patch were checked by me to
      be fairly trivial to fix because of at least one of two criteria:
      
       - Callers of the function catch errors from it already so bubbling the
         error up will be handled.
       - Callers of the function might BUG_ON any nonzero return code in which
         case there is no behavior change (but we still get to remove a BUG_ON)
      
      The following functions were updated:
      
      btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
      btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
      btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
      insert_reserved_file_extent, and run_delalloc_nocow
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      d8926bb3
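The pattern applied across these functions can be sketched in userspace like this: an allocation failure becomes a returned -ENOMEM instead of a crash. `path_sketch`, `alloc_path_sketch` and `lookup_sketch` are hypothetical stand-ins, and the failure is injected through a parameter purely for demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct path_sketch {
    int slots[8];
};

/* Hypothetical allocator stand-in; returns NULL when fail is nonzero. */
static struct path_sketch *alloc_path_sketch(int fail)
{
    return fail ? NULL : calloc(1, sizeof(struct path_sketch));
}

static int lookup_sketch(int simulate_failure)
{
    struct path_sketch *path = alloc_path_sketch(simulate_failure);

    if (!path)
        return -ENOMEM; /* previously this would have been BUG_ON(!path) */
    /* ... the tree search would go here ... */
    free(path);
    return 0;
}
```

This is only safe where callers already handle a nonzero return, which is the criterion the commit message gives for choosing these sites.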
  8. 11 Jul, 2011 2 commits
    • J
      Btrfs: serialize flushers in reserve_metadata_bytes · fdb5effd
      Josef Bacik committed
      We keep having problems with early enospc, and that's because our method of
      making space is inherently racy.  The problem is we can have one guy trying to
      make space for himself, and in the meantime people come in and steal his
      reservation.  In order to stop this we make a waitqueue and put anybody who
      comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
      make more space.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      fdb5effd
    • J
      Btrfs: do transaction space reservation before joining the transaction · b5009945
      Josef Bacik committed
      We have to do weird things when handling enospc in the transaction joining code.
      Because we've already joined the transaction we cannot commit the transaction
      within the reservation code since it will deadlock, so we have to return EAGAIN
      and then make sure we don't retry too many times.  Instead of doing this, just
      do the reservation the normal way before we join the transaction, that way we
      can do whatever we want to try and reclaim space, and then if it fails we know
      for sure we are out of space and we can return ENOSPC.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      b5009945
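The ordering change in this commit can be sketched as follows: reserve first, while reclaiming space is still safe, and join the transaction only once the reservation is held. `fs_sketch`, `reserve_sketch` and `start_transaction_sketch` are hypothetical names, and the single `reclaimable` counter is a stand-in for whatever a commit or flush could free.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct fs_sketch {
    uint64_t free_bytes;  /* space available right now */
    uint64_t reclaimable; /* space a commit or flush could free */
    int joined;           /* nonzero once we are inside a transaction */
};

static int reserve_sketch(struct fs_sketch *fs, uint64_t bytes)
{
    assert(!fs->joined); /* reclaiming is only safe before joining */
    if (fs->free_bytes < bytes) {
        fs->free_bytes += fs->reclaimable;
        fs->reclaimable = 0;
    }
    if (fs->free_bytes < bytes)
        return -ENOSPC;  /* reclaim was tried, so this is definitive */
    fs->free_bytes -= bytes;
    return 0;
}

static int start_transaction_sketch(struct fs_sketch *fs, uint64_t bytes)
{
    int ret = reserve_sketch(fs, bytes); /* reserve first ... */

    if (ret)
        return ret;
    fs->joined = 1;                      /* ... then join the transaction */
    return 0;
}
```

Because reclaim happens before joining, there is no need for an EAGAIN retry path in this sketch.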
  9. 25 Jun, 2011 1 commit
  10. 13 Jun, 2011 1 commit
  11. 09 Jun, 2011 2 commits
    • J
      Btrfs: fix the allocator loop logic · 723bda20
      Josef Bacik committed
      I was testing with empty_cluster = 0 to try and reproduce a problem and kept
      hitting early enospc panics.  This was because our loop logic was a little
      confused.  So this is what I did
      
      1) Make the loop variable the ultimate decider on whether we should loop again
      instead of checking to see if we had an uncached bg, empty size or empty
      cluster.
      
      2) Increment loop before checking to see what we are on to make the loop
      definitions make more sense.
      
      3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
      unless we didn't actually allocate a chunk.  If we did allocate a chunk we
      should be able to easily setup a new cluster so clearing
      empty_size/empty_cluster makes us less efficient.
      
      This kept me from hitting panics while trying to reproduce the other problem.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      723bda20
    • J
      Btrfs: don't commit the transaction if we dont have enough pinned bytes · f2bb8f5c
      Josef Bacik committed
      I noticed when running an enospc test that we would get stuck committing the
      transaction in check_data_space even though we truly didn't have enough space.
      So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
      commit the transaction.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      f2bb8f5c
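The check described here reduces to a single comparison. In the sketch below, `should_commit_for_space` is a hypothetical name, and treating an exact match (`bytes_pinned == num_bytes`) as sufficient is an assumption, since the message only says "bigger than".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Committing the transaction only helps if the pinned bytes it would
 * release can cover the request. */
static bool should_commit_for_space(uint64_t bytes_pinned, uint64_t num_bytes)
{
    assert(num_bytes > 0);
    return bytes_pinned >= num_bytes;
}
```

When this returns false, committing cannot free enough space, so the caller can report ENOSPC without the extra commit.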
  12. 04 Jun, 2011 1 commit
  13. 24 May, 2011 7 commits
    • T
      Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item · 1cd30799
      Tsutomu Itoh committed
      Currently, btrfs_truncate_item and btrfs_extend_item return only 0.
      So, the check by BUG_ON in the caller is unnecessary.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1cd30799
    • S
      btrfs: don't spin in shrink_delalloc if there is nothing to free · c4f675cd
      Sergei Trofimovich committed
      Observed as a large delay when --mixed filesystem is filled up.
      Test example:
      1. create tiny --mixed FS:
         $ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
         $ mkfs.btrfs --mixed 2G.img
         $ mount -oloop 2G.img /mnt/ut/
      2. Try to fill it up:
         $ dd if=/dev/urandom of=10M.file bs=10240 count=1024
         $ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done
      
      Up to '200.copy' it goes fast, but once the disk fills up, every -ENOSPC
      message takes 3 seconds to appear (and in user-mode Linux it's even more:
      30-60 seconds!). (The delay may depend on the kernel's timer resolution.)
      
      No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
      in shrink_delalloc.
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c4f675cd
    • J
      Btrfs: don't try to allocate from a block group that doesn't have enough space · cca1c81f
      Josef Bacik committed
      If we have a very large filesystem, we can spend a lot of time in
      find_free_extent just trying to allocate from empty block groups.  So instead
      check to see if the block group even has enough space for the allocation, and if
      not go on to the next block group.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      cca1c81f
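A minimal sketch of the early check this commit describes: before doing any expensive search inside a block group, skip it if its free space cannot cover the request. `bg_sketch` and `pick_block_group` are hypothetical names, and a flat array stands in for the real block group list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bg_sketch {
    uint64_t free_bytes; /* free space tracked for this block group */
};

/* Returns the index of the first group worth searching, or -1 if none. */
static int pick_block_group(const struct bg_sketch *groups, size_t n,
                            uint64_t num_bytes)
{
    assert(groups != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (groups[i].free_bytes < num_bytes)
            continue; /* cannot satisfy the request: skip the search */
        return (int)i;
    }
    return -1;
}
```

The comparison is cheap, so on a large filesystem with many groups too full for the request it avoids most of the per-group search cost.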
    • J
      Btrfs: don't always do readahead · 026fd317
      Josef Bacik committed
      Our readahead is sort of sloppy, and really isn't always needed.  For example if
      ls is doing a stating ls (which is the default) it's going to stat in non-disk
      order, so if say you have a directory with a stupid amount of files, readahead
      is going to do nothing but waste time in the case of doing the stat.  Taking the
      unconditional readahead out made my test go from 57 minutes to 36 minutes.  This
      means that everywhere we do loop through the tree we want to make sure we do set
      path->reada properly, so I went through and found all of the places where we
      loop through the path and set reada to 1.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      026fd317
    • J
      Btrfs: try not to sleep as much when doing slow caching · 589d8ade
      Josef Bacik committed
      When the fs is super full and we unmount the fs, we could get stuck in this
      thing where unmount is waiting for the caching kthread to make progress and the
      caching kthread keeps scheduling because we're in the middle of a commit.  So
      instead just let the caching kthread keep going and only yield if
      need_resched().  This makes my horrible umount case go from taking up to 10
      minutes to taking less than 20 seconds.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      589d8ade
    • J
      Btrfs: kill BTRFS_I(inode)->block_group · d82a6f1d
      Josef Bacik committed
      Originally this was going to be used as a way to give hints to the allocator,
      but frankly we can get much better hints elsewhere and it's not even used at all
      for anything useful.  In addition to being completely useless, when we initialize
      an inode we try and find a freeish block group to set as the inode's block group,
      and with a completely full 40GB fs this takes _forever_, so I imagine with say
      1TB fs this is just unbearable.  So just axe the thing altogether, we don't
      need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
      inode lookup in my testcase.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      d82a6f1d
    • J
      Btrfs: fix how we do space reservation for truncate · fcb80c2a
      Josef Bacik committed
      The ceph guys keep running into problems where we have space reserved in our
      orphan block rsv when freeing it up.  This is because they tend to do snapshots
      a lot, so their truncates tend to use a bunch of space, so when we go to do
      things like update the inode we have to steal reservation space in order to make
      the reservation happen.  This happens because truncate can use as much space as
      it freaking feels like, but we still have to hold space for removing the orphan
      item and updating the inode, which will definitely always happen.  So in order
      to fix this we need to split all of the reservation stuff up.  So with this patch
      we have
      
      1) The orphan block reserve which only holds the space for deleting our orphan
      item when everything is over.
      
      2) The truncate block reserve which gets allocated and used specifically for the
      space that the truncate will use on a per truncate basis.
      
      3) The transaction will always have 1 item's worth of data reserved so we can
      update the inode normally.
      
      Hopefully this will make the ceph problem go away.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      fcb80c2a