1. 12 11月, 2014 1 次提交
    • D
      btrfs: add support for processing pending changes · 572d9ab7
      David Sterba 提交于
      There are some actions that modify global filesystem state but cannot be
      performed at the time of request, but later at the transaction commit
      time when the filesystem is in a known state.
      
      For example enabling new incompat features on-the-fly or issuing
      transaction commit from unsafe contexts (sysfs handlers).
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      572d9ab7
  2. 04 10月, 2014 2 次提交
    • S
      Btrfs: fix race in WAIT_SYNC ioctl · 42383020
      Sage Weil 提交于
      We check whether transid is already committed via last_trans_committed and
      then search through trans_list for pending transactions.  If
      last_trans_committed is updated by btrfs_commit_transaction after we check
      it (there is no locking), we will fail to find the committed transaction
      and return EINVAL to the caller.  This has been observed occasionally by
      ceph-osd (which uses this ioctl heavily).
      
      Fix by rechecking whether the provided transid <= last_trans_committed
      after the search fails, and if so return 0.
      Signed-off-by: NSage Weil <sage@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      42383020
    • F
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana 提交于
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also makes us achieve the
      goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
      writeback for all dirty pages and before filemap_fdatawait_range() is
      called, the writeback for all dirty pages had already finished with
      errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      656f30db
  3. 02 10月, 2014 2 次提交
  4. 18 9月, 2014 4 次提交
  5. 15 8月, 2014 1 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
  6. 03 7月, 2014 2 次提交
    • F
      Btrfs: fix crash when starting transaction · abdd2e80
      Filipe Manana 提交于
      Often when starting a transaction we commit the currently running transaction,
      which can end up writing block group caches when the current process has its
      journal_info set to NULL (and not to a transaction). This makes our assertion
      at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting
      in a crash/hang. Therefore fix it by setting journal_info.
      
      Two different traces of this issue follow below.
      
      1)
      
          [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
          [51502.242213] ------------[ cut here ]------------
          [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
          [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
          (...)
          [51502.244010] Call Trace:
          [51502.244010]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
          [51502.244010]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
          [51502.244010]  [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs]
          [51502.244010]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
          [51502.244010]  [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40
          [51502.244010]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
          [51502.244010]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
          [51502.244010]  [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs]
          [51502.244010]  [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs]
          [51502.244010]  [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0
          [51502.244010]  [<ffffffff811baebc>] vfs_unlink+0xcc/0x150
          [51502.244010]  [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0
          [51502.244010]  [<ffffffff811a9ef4>] ? filp_close+0x64/0x90
          [51502.244010]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
          [51502.244010]  [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
          [51502.244010]  [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40
          [51502.244010]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
          [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
          [51502.244010] RIP  [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs]
      
      2)
      
          [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
          [25405.097488] ------------[ cut here ]------------
          [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
          [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
          (...)
          [25405.100008] Call Trace:
          [25405.100008]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
          [25405.100008]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
          [25405.100008]  [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs]
          [25405.100008]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
          [25405.100008]  [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0
          [25405.100008]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
          [25405.100008]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
          [25405.100008]  [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs]
          [25405.100008]  [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs]
          [25405.100008]  [<ffffffff811bc63b>] vfs_create+0x9b/0x130
          [25405.100008]  [<ffffffff811bcf19>] do_last+0x849/0xe20
          [25405.100008]  [<ffffffff811b9409>] ? link_path_walk+0x79/0x820
          [25405.100008]  [<ffffffff811bd5b5>] path_openat+0xc5/0x690
          [25405.100008]  [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10
          [25405.100008]  [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0
          [25405.100008]  [<ffffffff811be2a3>] do_filp_open+0x43/0xa0
          [25405.100008]  [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0
          [25405.100008]  [<ffffffff811abcfc>] do_sys_open+0x13c/0x230
          [25405.100008]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
          [25405.100008]  [<ffffffff811abe12>] SyS_open+0x22/0x30
          [25405.100008]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
          [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
          [25405.100008] RIP  [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs]
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      abdd2e80
    • D
      btrfs: remove stale comment from btrfs_flush_all_pending_stuffs · 0a4eaea8
      David Sterba 提交于
      Commit fcebe456 (Btrfs: rework qgroup
      accounting) removed the qgroup accounting after delayed refs.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      0a4eaea8
  7. 29 6月, 2014 1 次提交
  8. 14 6月, 2014 1 次提交
  9. 11 6月, 2014 1 次提交
  10. 10 6月, 2014 4 次提交
    • C
      Btrfs: async delayed refs · a79b7d4b
      Chris Mason 提交于
      Delayed extent operations are triggered during transaction commits.
      The goal is to queue up a healthly batch of changes to the extent
      allocation tree and run through them in bulk.
      
      This farms them off to async helper threads.  The goal is to have the
      bulk of the delayed operations being done in the background, but this is
      also important to limit our stack footprint.
      Signed-off-by: NChris Mason <clm@fb.com>
      a79b7d4b
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456
    • M
    • D
      btrfs: assert that send is not in progres before root deletion · 61155aa0
      David Sterba 提交于
      CC: Miao Xie <miaox@cn.fujitsu.com>
      CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      61155aa0
  11. 07 4月, 2014 2 次提交
    • J
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik 提交于
      Lets try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reportedy-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e351cc8
    • J
      Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik 提交于
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't old tree locks theres nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidently read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a26e8c9f
  12. 21 3月, 2014 1 次提交
    • J
      Btrfs: fix deadlock with nested trans handles · 3bbb24b2
      Josef Bacik 提交于
      Zach found this deadlock that would happen like this
      
      btrfs_end_transaction <- reduce trans->use_count to 0
        btrfs_run_delayed_refs
          btrfs_cow_block
            find_free_extent
      	btrfs_start_transaction <- increase trans->use_count to 1
                allocate chunk
      	btrfs_end_transaction <- decrease trans->use_count to 0
      	  btrfs_run_delayed_refs
      	    lock tree block we are cowing above ^^
      
      We need to only decrease trans->use_count if it is above 1, otherwise leave it
      alone.  This will make nested trans be the only ones who decrease their added
      ref, and will let us get rid of the trans->use_count++ hack if we have to commit
      the transaction.  Thanks,
      
      cc: stable@vger.kernel.org
      Reported-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Tested-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3bbb24b2
  13. 11 3月, 2014 3 次提交
  14. 29 1月, 2014 9 次提交
    • Q
      btrfs: Add noinode_cache mount option · 3818aea2
      Qu Wenruo 提交于
      Add noinode_cache mount option for btrfs.
      
      Since inode map cache involves all the btrfs_find_free_ino/return_ino
      things and if just trigger the mount_opt,
      an inode number get from inode map cache will not returned to inode map
      cache.
      
      To keep the find and return inode both in the same behavior,
      a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
      CHANGE_INODE_CACHE is set/cleared in remounting, and the original
      INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
      success transaction.
      Since find/return inode is all done between btrfs_start_transaction and
      btrfs_commit_transaction, this will keep consistent behavior.
      
      Also noinode_cache mount option will not stop the caching_kthread.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3818aea2
    • J
      Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik 提交于
      On one of our gluster clusters we noticed some pretty big lag spikes.  This
      turned out to be because our transaction commit was taking like 3 minutes to
      complete.  This is because we have like 30 gigs of metadata, so our global
      reserve would end up being the max which is like 512 mb.  So our throttling code
      would allow a ridiculous amount of delayed refs to build up and then they'd all
      get run at transaction commit time, and for a cold mounted file system that
      could take up to 3 minutes to run.  So fix the throttling to be based on both
      the size of the global reserve and how long it takes us to run delayed refs.
      This patch tracks the time it takes to run delayed refs and then only allows 1
      seconds worth of outstanding delayed refs at a time.  This way it will auto-tune
      itself from cold cache up to when everything is in memory and it no longer has
      to go to disk.  This makes our transaction commits take much less time to run.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a2b2a84
    • J
      Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik 提交于
      Currently we have two rb-trees, one for delayed ref heads and one for all of the
      delayed refs, including the delayed ref heads.  When we process the delayed refs
      we have to hold onto the delayed ref lock for all of the selecting and merging
      and such, which results in quite a bit of lock contention.  This was solved by
      having a waitqueue and only one flusher at a time, however this hurts if we get
      a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7df2c79
    • J
      Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik 提交于
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5039eddc
    • W
      Btrfs: fix protection between send and root deletion · 18f687d5
      Wang Shilong 提交于
      We should gurantee that parent and clone roots can not be destroyed
      during send, for this we have two ideas.
      
      1.by holding @subvol_sem, this might be a nightmare, because it will
      block all subvolumes deletion for a long time.
      
      2.Miao pointed out we can reuse @send_in_progress, that mean we will
      skip snapshot deletion if root sending is in progress.
      
      Here we adopt the second approach since it won't block other subvolumes
      deletion for a long time.
      
      Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
      , if this root is involved in send, we return directly rather than
      continue to check.There are several reasons about it:
      
      1.this case happen seldomly.
      2.after sending,cleaner thread can continue to drop that root.
      3.make code simple
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      18f687d5
    • M
      Btrfs: remove btrfs_end_transaction_dmeta() · a56dbd89
      Miao Xie 提交于
      Two reasons:
      - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
        so it is unnecessary.
      - All the delayed items should be dealt in the current transaction, so the
        workers should not commit the transaction, instead, deal with the delayed
        items as many as possible.
      
      So we can remove btrfs_end_transaction_dmeta()
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a56dbd89
    • F
      Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton 提交于
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      efe120a0
    • W
      Btrfs: wrap repeated code into scrub_blocked_if_needed() · cb7ab021
      Wang Shilong 提交于
      Just wrap same code into one function scrub_blocked_if_needed().
      
      This make a change that we will move waiting (@workers_pending = 0)
      before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
      we must take carefully to not deadlock here.
      
      Thread 1			Thread 2
      				|->btrfs_commit_transaction()
      					|->set trans type(COMMIT_DOING)
      					|->btrfs_scrub_paused()(blocked)
      |->join_transaction(blocked)
      
      Move btrfs_scrub_paused() before setting trans type which means we can
      still join a transaction when commiting_transaction is blocked.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Suggested-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cb7ab021
    • L
      Btrfs: introduce a head ref rbtree · c46effa6
      Liu Bo 提交于
      The way how we process delayed refs is
      1) get a bunch of head refs,
      2) pick up one head ref,
      3) go one node back for any delayed ref updates.
      
      The head ref is also linked in the same rbtree as the delayed ref is,
      so in 1) stage, we have to walk one by one including not only head refs, but
      delayed refs.
      
      When we have a great number of delayed refs pending to process,
      this'll cost time a lot.
      
      Here we introduce a head ref specific rbtree, it only has head refs, so troubles
      go away.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c46effa6
  15. 21 11月, 2013 1 次提交
    • L
      Btrfs: fix lockdep error in async commit · b1a06a4b
      Liu Bo 提交于
      Lockdep complains about btrfs's async commit:
      
      [ 2372.462171] [ BUG: bad unlock balance detected! ]
      [ 2372.462191] 3.12.0+ #32 Tainted: G        W
      [ 2372.462209] -------------------------------------
      [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
      [ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
      [ 2372.462305] but there are no more locks to release!
      [ 2372.462324]
      [ 2372.462324] other info that might help us debug this:
      [ 2372.462349] no locks held by ceph-osd/14048.
      [ 2372.462367]
      [ 2372.462367] stack backtrace:
      [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G        W    3.12.0+ #32
      [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [ 2372.462455]  ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
      [ 2372.462491]  ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
      [ 2372.462526]  ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
      [ 2372.462562] Call Trace:
      [ 2372.462584]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
      [ 2372.462619]  [<ffffffff816f094a>] dump_stack+0x45/0x56
      [ 2372.462642]  [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100
      [ 2372.462677]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
      [ 2372.462710]  [<ffffffff810b01ee>] lock_release+0x18e/0x210
      [ 2372.462742]  [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
      [ 2372.462783]  [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
      [ 2372.462822]  [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
      [ 2372.462849]  [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0
      [ 2372.462873]  [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0
      [ 2372.462897]  [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100
      [ 2372.462922]  [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0
      [ 2372.462946]  [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0
      [ 2372.462969]  [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0
      [ 2372.462991]  [<ffffffff817045a4>] tracesys+0xdd/0xe2
      
      ====================================================
      
      It's because that we don't do the right thing when checking if it's ok to
      tell lockdep that we're trying to release the rwsem.
      
      If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
      as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
      rwsem, which makes lockdep complains a lot.
      Reported-by: NMa Jianpeng <majianpeng@gmail.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      b1a06a4b
  16. 12 11月, 2013 5 次提交
    • M
      Btrfs: rename btrfs_start_all_delalloc_inodes · 91aef86f
      Miao Xie 提交于
      rename the function -- btrfs_start_all_delalloc_inodes(), and make its
      name be compatible to btrfs_wait_ordered_roots(), since they are always
      used at the same place.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      91aef86f
    • M
      Btrfs: don't wait for the completion of all the ordered extents · b0244199
      Miao Xie 提交于
      It is very likely that there are lots of ordered extents in the filesytem,
      if we wait for the completion of all of them when we want to reclaim some
      space for the metadata space reservation, we would be blocked for a long
      time. The performance would drop down suddenly for a long time.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      b0244199
    • F
      Btrfs: fix memory leaks on transaction commit failure · 80d94fb3
      Filipe David Borba Manana 提交于
      Structures of the types tree_mod_elem and qgroup_update are allocated
      during transaction commit but were not being released if the call to
      btrfs_run_delayed_items() returned an error.
      
      Stack trace reported by kmemleak:
      
      unreferenced object 0xffff880679f0b398 (size 128):
        comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
        hex dump (first 32 bytes):
          60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00  `..y............
          00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00  ........P.......
        backtrace:
          [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
          [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
          [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs]
          [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs]
          [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
          [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
          [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
          [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
          [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
          [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
          [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
          [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
          [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
          [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
          [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0
          [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30
      unreferenced object 0xffff880677e0dd88 (size 32):
        comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
        hex dump (first 32 bytes):
          78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff  xu.........w....
          40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00  @..p............
        backtrace:
          [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
          [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
          [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs]
          [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs]
          [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs]
          [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs]
          [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
          [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
          [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
          [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
          [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
          [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
          [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
          [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
          [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
          [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      80d94fb3
    • M
      Btrfs: fix BUG_ON() casued by the reserved space migration · 20dd2cbf
      Miao Xie 提交于
      When we did space balance and snapshot creation at the same time, we might
      meet the following oops:
       kernel BUG at fs/btrfs/inode.c:3038!
       [SNIP]
       Call Trace:
       [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
       [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
       [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
       [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
       [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
       [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
       [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
       [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
       [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
       [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
       [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
       [SNIP]
       RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
      
      The reason of the problem is that the relocation root creation stole
      the reserved space, which was reserved for orphan item deletion.
      
      There are several ways to fix this problem, one is to increasing
      the reserved space size of the space balace, and then we can use
      that space to create the relocation tree for each fs/file trees.
      But it is hard to calculate the suitable size because we doesn't
      know how many fs/file trees we need relocate.
      
      We fixed this problem by reserving the space for relocation root creation
      actively since the space it need is very small (one tree block, used for
      root node copy), then we use that reserved space to create the
      relocation tree. If we don't reserve space for relocation tree creation,
      we will use the reserved space of the balance.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      20dd2cbf
    • J
      Btrfs: fix two use-after-free bugs with transaction cleanup · 724e2315
      Josef Bacik 提交于
      I was noticing the slab redzone stuff going off every once and a while during
      transaction aborts.  This was caused by two things
      
      1) We would walk the pending snapshots and set their error to -ECANCELED.  We
      don't need to do this, the snapshot stuff waits for a transaction commit and if
      there is a problem we just free our pending snapshot object and exit.  Doing
      this was causing us to touch the pending snapshot object after the thing had
      already been freed.
      
      2) We were freeing the transaction manually with wanton disregard for it's
      use_count reference counter.  To fix this I cleaned up the transaction freeing
      loop to either wait for the transaction commit to finish if it was in the middle
      of that (since it will be cleaned and freed up there) or to do the cleanup
      oursevles.
      
      I also moved the global "kill all things dirty everywhere" stuff outside of the
      transaction cleanup loop since that only needs to be done once.  With this patch
      I'm no longer seeing slab corruption because of use after frees.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      724e2315