1. 20 10月, 2011 25 次提交
    • J
      Btrfs: release trans metadata bytes before flushing delayed refs · b24e03db
      Josef Bacik 提交于
      We started setting trans->block_rsv = NULL to allow the delayed refs flushing
      stuff to use the right block_rsv and then just made
      btrfs_trans_release_metadata() unconditionally use the trans block rsv.  The
      problem with this is we need to reserve some space in the transaction and then
      migrate it to the global block rsv, so we need to be able to free that out
      properly.  So instead just move btrfs_trans_release_metadata() before the
      delayed ref flushing and use trans->block_rsv for the freeing.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      b24e03db
    • J
      Btrfs: allow shrink_delalloc flush the needed reclaimed pages · 877da174
      Josef Bacik 提交于
      Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
      time.  This was ok before, but now we have overcommit which will screw us in a
      heartbeat if we are quickly filling the disk.  So instead pick either 2
      megabytes or the number of pages we need to reclaim to be safe again, which ever
      is larger.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      877da174
    • J
      Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc · f104d044
      Josef Bacik 提交于
      The only way we actually reclaim delalloc space is waiting for the IO to
      completely finish.  Usually we kick off a bunch of IO and wait for a little bit
      and hope we can make our reservation, and usually this works out pretty well.
      With overcommit however we can get seriously underwater if we're filling up the
      disk quickly, so we need to be able to force the delalloc shrinker to wait for
      the ordered IO to finish to give us a better chance of actually reclaiming
      enough space to get our reservation.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      f104d044
    • J
      Btrfs: don't check bytes_pinned to determine if we should commit the transaction · bbb495c2
      Josef Bacik 提交于
      Before the only reason to commit the transaction to recover space in
      reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our
      reservation.  But now we have the delayed inode stuff which will hold it's
      reservations until we commit the transaction.  So say we max out our reservation
      by creating a bunch of files but don't have any pinned bytes we will ENOSPC out
      early even though we could commit the transaction and get that space back.  So
      now just unconditionally commit the transaction since currently there is no way
      to know how much metadata space is being reserved by delayed inode stuff.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      bbb495c2
    • J
      Btrfs: wait for ordered extents if we didn't reclaim enough · 4b91c14f
      Josef Bacik 提交于
      I noticed recently that my overcommit patch was causing one of my enospc tests
      to fail 25% of the time with early ENOSPC.  This is because my overcommit patch
      was letting us go way over board, but it wasn't waiting long enough to let the
      delalloc shrinker do it's job.  The problem is we just start writeback and wait
      a little bit hoping we flush enough, but we only free up delalloc space by
      having the writes complete all the way.  We do this by waiting for ordered
      extents, which we do but only if we already free'd enough for the reservation,
      which isn't right, we should flush ordered extents if we didn't reclaim enough
      in case that will push us over the edge.  With this patch I've not seen a
      failure in this enospc test after running it in a loop for an hour.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4b91c14f
    • J
      Btrfs: inline checksums into the disk free space cache · 5b0e95bf
      Josef Bacik 提交于
      Yeah yeah I know this is how we used to do it and then I changed it, but damnit
      I'm changing it back.  The fact is that writing out checksums will modify
      metadata, which could cause us to dirty a block group we've already written out,
      so we have to truncate it and all of it's checksums and re-write it which will
      write new checksums which could dirty a blockg roup that has already been
      written and you see where I'm going with this?  This can cause unmount or really
      anything that depends on a transaction to commit to take it's sweet damned time
      to happen.  So go back to the way it was, only this time we're specifically
      setting NODATACOW because we can't go through the COW pathway anyway and we're
      doing our own built-in cow'ing by truncating the free space cache.  The other
      new thing is once we truncate the old cache and preallocate the new space, we
      don't need to do that song and dance at all for the rest of the transaction, we
      can just overwrite the existing space with the new cache if the block group
      changes for whatever reason, and the NODATACOW will let us do this fine.  So
      keep track of which transaction we last cleared our cache in and if we cleared
      it in this transaction just say we're all setup and carry on.  This survives
      xfstests and stress.sh.
      
      The inode cache will continue to use the normal csum infrastructure since it
      only gets written once and there will be no more modifications to the fs tree in
      a transaction commit.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5b0e95bf
    • J
      Btrfs: take overflow into account in reserving space · 9a82ca65
      Josef Bacik 提交于
      My overcommit stuff can be a little racy when we're filling up the disk with
      fs_mark and we overcommit into things that quickly get used up for data.  So use
      num_bytes to see if we have enough available space so we're less likely to
      overcommit ourselves out of the ability to make reservations.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      9a82ca65
    • J
      Btrfs: introduce mount option no_space_cache · 73bc1876
      Josef Bacik 提交于
      Some users have requested this and I've found I needed a way to disable cache
      loading without actually clearing the cache, so introduce the no_space_cache
      option.  Before we check the super blocks cache generation field and if it was
      populated we always turned space caching on.  Now we check this and set the
      space cache option on, and then parse the mount options so that if we want it
      off it get's turned off.  Then we check the mount option all the places we do
      the caching work instead of checking the super's cache generation.  This makes
      things more consistent and lets us turn space caching off.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      73bc1876
    • J
      Btrfs: allow us to overcommit our enospc reservations · 2bf64758
      Josef Bacik 提交于
      One of the things that kills us is the fact that our ENOSPC reservations are
      horribly over the top in most normal cases.  There isn't too much that can be
      done about this because when we are completely full we really need them to work
      like this so we don't under reserve.  However if there is plenty of unallocated
      chunks on the disk we can use that to gauge how much we can overcommit.  So this
      patch adds chunk free space accounting so we always know how much unallocated
      space we have.  Then if we fail to make a reservation within our allocated
      space, check to see if we can overcommit.  In the normal flushing case (like
      with delalloc metadata reservations) we'll take the free space and divide it by
      2 if our metadata profile is setup for DUP or any of those, and then divide it
      by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
      case (we really need this reservation now!) we only limit ourselves to half of
      the free space.  This makes this fio test
      
      [torrent]
      filename=torrent-test
      rw=randwrite
      size=4g
      ioengine=sync
      directory=/mnt/btrfs-test
      
      go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
      file system.  This doesn't seem to break my other enospc tests, but could really
      use some more testing as this is a super scary change.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      2bf64758
    • J
      Btrfs: check unused against how much space we actually want · ef3be457
      Josef Bacik 提交于
      There is a bug that may lead to early ENOSPC in our reservation code.  We've
      been checking against num_bytes which may be above and beyond what we want to
      actually reserve, which could give us a false ENOSPC.  Fix this by making sure
      the unused space is above how much we want to reserve and not how much we're
      trying to flush.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      ef3be457
    • J
      Btrfs: delay iput when deleting a block group · 455757c3
      Josef Bacik 提交于
      I kept getting warnings from evict because we were calling
      btrfs_start_transaction() with a transaction already started when doing a
      balance.  This is because we remove a block group which requires a transaction,
      and the put the last reference on the cache inode.  Instead of doing this we
      need to delay the iput so it is done not within a transaction having started.
      This gets rid of our warnings.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      455757c3
    • J
      Btrfs: stop passing a trans handle all around the reservation code · 4a92b1b8
      Josef Bacik 提交于
      The only thing that we need to have a trans handle for is in
      reserve_metadata_bytes and thats to know how much flushing we can do.  So
      instead of passing it around, just check current->journal_info for a
      trans_handle so we know if we can commit a transaction to try and free up space
      or not.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4a92b1b8
    • J
      Btrfs: don't get the block_rsv in btrfs_free_tree_block · d02c9955
      Josef Bacik 提交于
      Since the durable block rsv stuff has been killed there is no need to get the
      block_rsv in btrfs_free_tree_block anymore.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      d02c9955
    • J
      Btrfs: use the transactions block_rsv for the csum root · 4c13d758
      Josef Bacik 提交于
      The alloc warnings everybody has been seeing is because we have been reserving
      space for csums, but we weren't actually using that space.  So make
      get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
      Also set the trans->block_rsv to NULL so that if we modify the csum root when
      running delayed ref's that comes out of the global reserve like it's supposed
      to.  With this patch I'm not seeing those alloc warnings anymore.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4c13d758
    • J
      Btrfs: handle enospc accounting for free space inodes · c09544e0
      Josef Bacik 提交于
      Since free space inodes now use normal checksumming we need to make sure to
      account for their metadata use.  So reserve metadata space, and then if we fail
      to write out the metadata we can just release it, otherwise it will be freed up
      when the io completes.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c09544e0
    • J
      Btrfs: don't increase the block_rsv's size when emergency allocating space · 7f701508
      Josef Bacik 提交于
      If we have to emergency reserve space we need to not increase the block_rsv
      size, otherwise we'll leak space.  Take for instance delalloc, say we reserve
      4k, and we use that 4k, and then we have to emergency allocate another 4k, we
      bump the size up to 8k, however we've only accounted for 4k in reservations in
      all of our supporting logic, so we'll go to free the 4k and end up having a size
      of 4k, which will cause us to later not free as much space.  I saw this doing
      testing where I wasn't reserving enough space for something but was still
      leaking space, very frustrating.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7f701508
    • J
      Btrfs: fix space leak when we fail to make an allocation · 7ed49f18
      Josef Bacik 提交于
      When changing back to using a spin_lock to protect the extent counters I decided
      that since we would only be dropping our original extent, it was ok to just drop
      the extent and return.  However since somebody else could have come in and done
      a reservation, we need to do the normal song and dance to clear the reservation
      out properly.  So calculate how much space we need to free, and then subtract
      what we just attempted to reserve.  If it's more then we know we need to drop
      those bytes from the delalloc block rsv.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7ed49f18
    • J
      Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check · 482e6dc5
      Josef Bacik 提交于
      If you run xfstest 224 it you will get lots of messages about not being able to
      delete inodes and that they will be cleaned up next mount.  This is because
      btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
      flush, so if there was not enough space, it simply failed.  But in truncate and
      evict case we could easily flush space to try and get enough space to do our
      work, so make btrfs_block_rsv_check take a flush argument to pass down to
      reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
      complaints.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      482e6dc5
    • J
      Btrfs: kill btrfs_truncate_reserve_metadata · 5e962c78
      Josef Bacik 提交于
      Since we've optimized the truncate path, we no longer require this function.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5e962c78
    • J
      Btrfs: don't try to commit in btrfs_block_rsv_check · 13553e52
      Josef Bacik 提交于
      We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
      because we have a transaction open it will return EAGAIN, so we do not need to
      try and commit the transaction again.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      13553e52
    • J
      Btrfs: kill unused parts of block_rsv · dabdb640
      Josef Bacik 提交于
      The priority and refill_used flags are not used anymore, and neither is the
      usage counter, so just remove them from btrfs_block_rsv.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      dabdb640
    • J
      Btrfs: kill the durable block rsv stuff · 37be25bc
      Josef Bacik 提交于
      This is confusing code and isn't used by anything anymore, so delete it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      37be25bc
    • J
      Btrfs: calculate checksum space correctly · 7709cde3
      Josef Bacik 提交于
      We have not been reserving enough space for checksums.  We were just reserving
      bytes for the checksum items themselves, we were not taking into account having
      to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
      keeping track of the number of bytes outstanding we have for checksums.  Then we
      calculate how many leaves would be required for the checksums we are given and
      use that to reserve space.  This adds a significant amount of bytes to our
      reservations, but we will handle this later.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7709cde3
    • J
      Btrfs: use bytes_may_use for all ENOSPC reservations · fb25e914
      Josef Bacik 提交于
      We have been using bytes_reserved for metadata reservations, which is wrong
      since we use that to keep track of outstanding reservations from the allocator.
      This resulted in us doing a lot of silly things to make sure we don't allocate a
      bunch of metadata chunks since we never had a real view of how much space was
      actually in use by metadata.
      
      This passes Arne's enospc test and xfstests as well as my own enospc tests.
      Hopefully this will get us moving in the right direction.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      fb25e914
    • J
      Btrfs: kill reserved_bytes in inode · 0cbbdf7c
      Josef Bacik 提交于
      reserved_bytes is not used for anything in the inode, remove it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0cbbdf7c
  2. 21 8月, 2011 1 次提交
  3. 17 8月, 2011 3 次提交
  4. 02 8月, 2011 4 次提交
  5. 28 7月, 2011 7 次提交
    • C
      Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors · 75c195a2
      Chris Mason 提交于
      The btrfs transaction code will return any errors that come from
      reserve_metadata_bytes.  We need to make sure we don't return funny
      things like 1 or EAGAIN.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      75c195a2
    • C
      Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason 提交于
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      85d4e461
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason 提交于
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd681513
    • M
      Btrfs: fix BUG_ON() caused by ENOSPC when relocating space · 199c36ea
      Miao Xie 提交于
      When we balanced the chunks across the devices, BUG_ON() in
      __finish_chunk_alloc() was triggered.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:2568!
      [SNIP]
      Call Trace:
       [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
       [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
       [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
       [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
       [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
       [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
       [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
       [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
       [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
       [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
       [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
       [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
       [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
       [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
       [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
       [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
       [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
       [<ffffffff81159c63>] ? generic_permission+0x23/0xb0
       [<ffffffff8115b415>] vfs_create+0xa5/0xc0
       [<ffffffff8115ce6e>] do_last+0x5fe/0x880
       [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
       [<ffffffff8115e029>] do_filp_open+0x49/0xa0
       [<ffffffff8116a965>] ? alloc_fd+0x95/0x160
       [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
       [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
       [<ffffffff8114f1e0>] sys_open+0x20/0x30
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
      
      The reason is:
      Task1					Space balance task
      do_chunk_alloc()
        __finish_chunk_alloc()
          update device info
          in the chunk tree
            alloc system metadata block
      					relocate system metadata block group
      					  set system metadata block group
      					  readonly, This block group is the
      					  only one that can allocate space. So
      					  there is no free space that can be
      					  allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
        BUG_ON() in __finish_chunk_alloc()
        was triggered.
      
      Fix this bug by allocating a new system metadata chunk before relocating the
      old one if we find there is no free space which can be allocated after setting
      the old block group to be read-only.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      199c36ea
    • J
      Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik 提交于
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do it's allocation and you
      get those lovely warnings.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9e0baf60
    • J
      Btrfs: don't flush delalloc arbitrarily · a5991428
      Josef Bacik 提交于
      Kill the check to see if we have 512mb of reserved space in delalloc and
      shrink_delalloc if we do.  This causes unexpected latencies and we have other
      logic to see if we need to throttle.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a5991428
    • J
      Btrfs: use a worker thread to do caching · bab39bf9
      Josef Bacik 提交于
      A user reported a deadlock when copying a bunch of files.  This is because they
      were low on memory and kthreadd got hung up trying to migrate pages for an
      allocation when starting the caching kthread.  The page was locked by the person
      starting the caching kthread.  To fix this we just need to use the async thread
      stuff so that the threads are already created and we don't have to worry about
      deadlocks.  Thanks,
      Reported-by: NRoman Mamedov <rm@romanrm.ru>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      bab39bf9