1. 25 1月, 2013 2 次提交
  2. 18 12月, 2012 1 次提交
    • C
      Btrfs: fix hash overflow handling · 9c52057c
      Chris Mason 提交于
      The handling for directory crc hash overflows was fairly obscure,
      split_leaf returns EOVERFLOW when we try to extend the item and that is
      supposed to bubble up to userland.  For a while it did so, but along the
      way we added better handling of errors and forced the FS readonly if we
      hit IO errors during the directory insertion.
      
      Along the way, we started testing only for EEXIST and the EOVERFLOW case
      was dropped.  The end result is that we may force the FS readonly if we
      catch a directory hash bucket overflow.
      
      This fixes a few problem spots.  First I add tests for EOVERFLOW in the
      places where we can safely just return the error up the chain.
      
      btrfs_rename is harder though, because it tries to insert the new
      directory item only after it has already unlinked anything the rename
      was going to overwrite.  Rather than adding very complex logic, I added
      a helper to test for the hash overflow case early while it is still safe
      to bail out.
      
      Snapshot and subvolume creation had a similar problem, so they are using
      the new helper now too.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NPascal Junod <pascal@junod.info>
      9c52057c
  3. 17 12月, 2012 2 次提交
  4. 13 12月, 2012 5 次提交
  5. 12 12月, 2012 3 次提交
  6. 26 10月, 2012 1 次提交
  7. 09 10月, 2012 5 次提交
    • J
      Btrfs: cache extent state when writing out dirty metadata pages · e6138876
      Josef Bacik 提交于
      Everytime we write out dirty pages we search for an offset in the tree,
      convert the bits in the state, and then when we wait we search for the
      offset again and clear the bits.  So for every dirty range in the io tree we
      are doing 4 rb searches, which is suboptimal.  With this patch we are only
      doing 2 searches for every cycle (modulo weird things happening).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      e6138876
    • M
      Btrfs: fix orphan transaction on the freezed filesystem · 354aa0fb
      Miao Xie 提交于
      With the following debug patch:
      
       static int btrfs_freeze(struct super_block *sb)
       {
      + 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
      +	struct btrfs_transaction *trans;
      +
      +	spin_lock(&fs_info->trans_lock);
      +	trans = fs_info->running_transaction;
      +	if (trans) {
      +		printk("Transid %llu, use_count %d, num_writer %d\n",
      +			trans->transid, atomic_read(&trans->use_count),
      +			atomic_read(&trans->num_writers));
      +	}
      +	spin_unlock(&fs_info->trans_lock);
       	return 0;
       }
      
      I found there was a orphan transaction after the freeze operation was done.
      
      It is because the transaction may not be committed when the transaction handle
      end even though it is the last handle of the current transaction. This design
      avoid committing the transaction frequently, but also introduce the above
      problem.
      
      So I add btrfs_attach_transaction() which can catch the current transaction
      and commit it. If there is no transaction, it will return ENOENT, and do not
      anything.
      
      This function also can be used to instead of btrfs_join_transaction_freeze()
      because it don't increase the writer counter and don't start a new transaction,
      so it also can fix the deadlock between sync and freeze.
      
      Besides that, it is used to instead of btrfs_join_transaction() in
      transaction_kthread(), because if there is no transaction, the transaction
      kthread needn't anything.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      354aa0fb
    • M
      Btrfs: add a type field for the transaction handle · a698d075
      Miao Xie 提交于
      This patch add a type field into the transaction handle structure,
      in this way, we needn't implement various end-transaction functions
      and can make the code more simple and readable.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      a698d075
    • M
      Btrfs: fix memory leak in start_transaction() · e8830e60
      Miao Xie 提交于
      This patch fixes memory leak of the transaction handle which happened
      when starting transaction failed on a freezed fs.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      e8830e60
    • M
      Btrfs: fix the missing error information in create_pending_snapshot() · 8732d44f
      Miao Xie 提交于
      The macro btrfs_abort_transaction() can get the line number of the code
      where the problem happens, so we should invoke it in the place that the
      error occurs, or we will lose the line number.
      Reported-by: NDavid Sterba <dave@jikos.cz>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      8732d44f
  8. 04 10月, 2012 3 次提交
    • J
      Btrfs: fix race with freeze and free space inodes · 98114659
      Josef Bacik 提交于
      So we start our freeze, somebody comes in and does an fsync() on a file
      where we have to commit a transaction for whatever reason, and we will
      deadlock because the freeze is waiting on FS_FREEZE people to stop writing
      to the file system, but the transaction is waiting for its free space inodes
      to be written out, which are in turn waiting on sb_start_intwrite while
      trying to write the file extents.  To fix this we'll just skip the
      sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a
      transaction commit so we're safe wrt to freeze and this will keep us from
      deadlocking.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      98114659
    • L
      Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents · 6bbe3a9c
      Liu Bo 提交于
      nocow_only is now an obsolete argument.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      6bbe3a9c
    • J
      Btrfs: fix race in sync and freeze again · 60376ce4
      Josef Bacik 提交于
      I screwed this up, there is a race between checking if there is a running
      transaction and actually starting a transaction in sync where we could race
      with a freezer and get ourselves into trouble.  To fix this we need to make
      a new join type to only do the try lock on the freeze stuff.  If it fails
      we'll return EPERM and just return from sync.  This fixes a hang Liu Bo
      reported when running xfstest 68 in a loop.  Thanks,
      Reported-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      60376ce4
  9. 02 10月, 2012 8 次提交
    • J
      Btrfs: delay block group item insertion · ea658bad
      Josef Bacik 提交于
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ea658bad
    • J
      Btrfs: move the sb_end_intwrite until after the throttle logic · 6df7881a
      Josef Bacik 提交于
      Sage reported the following lockdep backtrace
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
      -------------------------------------
      btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
      [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
      but there are no more locks to release!
      
      other info that might help us debug this:
      1 lock held by btrfs-cleaner/7607:
       #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
      
      stack backtrace:
      Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
      Call Trace:
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
       [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
       [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
       [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810b2a95>] lock_release+0xd5/0x220
       [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
       [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
       [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
       [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
       [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
       [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
       [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
       [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
       [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
       [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
       [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
       [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
       [<ffffffff810791ee>] kthread+0xae/0xc0
       [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
       [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
       [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
       [<ffffffff8163e740>] ? gs_change+0x13/0x13
      
      This is because the throttle stuff can commit the transaction, which expects to
      be the one stopping the intwrite stuff, but we've already done it in the
      __btrfs_end_transaction.  Moving the sb_end_intewrite after this logic makes the
      lockdep go away.  Thanks,
      Tested-by: NSage Weil <sage@inktank.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6df7881a
    • M
      Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie 提交于
      When we delete a inode, we will remove all the delayed items including delayed
      inode update, and then truncate all the relative metadata. If there is lots of
      metadata, we will end the current transaction, and start a new transaction to
      truncate the left metadata. In this way, we will leave a inode item that its
      link counter is > 0, and also may leave some directory index items in fs/file tree
      after the current transaction ends. In other words, the metadata in this fs/file tree
      is inconsistent. If we create a snapshot for this tree now, we will find a inode with
      corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
      because its link counter is not 0.
      
      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      8407aa46
    • M
      Btrfs: fix the snapshot that should not exist · 42874b3d
      Miao Xie 提交于
      The snapshot should be the image of the fs tree before it was created,
      so the metadata of the snapshot should not exist in the its tree. But now, we
      found the directory item and directory name index is in both the snapshot tree
      and the fs tree. It introduces some problems and makes the users feel strange:
      
       # mkfs.btrfs /dev/sda1
       # mount /dev/sda1 /mnt
       # mkdir /mnt/1
       # cd /mnt/1
       # btrfs subvolume snapshot /mnt snap0
       # ls -a /mnt/1/snap0/1
       .	..	[no other file/dir]
      
       # ll /mnt/1/snap0/
       total 0
       drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
      			^^^
      			There is no file/dir in it, but it's size is 10
      
       # cd /mnt/1/snap0/1/snap0
       [Enter a unexisted directory successfully...]
      
      There is nothing in the directory 1 in snap0, but btrfs told the length of
      this directory is 10. Beside that, we can enter an unexisted directory, it is
      very strange to the users.
      
       # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
       # ll /mnt/1/snap0/1/
       total 0
       [None]
       # ll /mnt/snap1/1/
       total 0
       drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
      
      And the source of snap1 did have any directory in Directory 1, but snap1 have
      a snap0, it is different between the source and the snapshot.
      
      So I think we should insert directory item and directory name index and update
      the parent inode as the last step of snapshot creation, and do not leave the
      useless metadata in the file tree.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      42874b3d
    • M
      Btrfs: fix full backref problem when inserting shared block reference · 361048f5
      Miao Xie 提交于
      If we create several snapshots at the same time, the following BUG_ON() will be
      triggered.
      
      	kernel BUG at fs/btrfs/extent-tree.c:6047!
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
       # for ((i=0; i<4; i++))
       > do
       > mkdir $i
       > for ((j=0; j<200; j++))
       > do
       > btrfs sub snap . $i/$j
       > done &
       > done
      
      The reason is:
      Before transaction commit, some operations changed the fs tree and new tree
      blocks were allocated because of COW. We used the implicit non-shared back
      reference for those newly allocated tree blocks because they were not shared by
      two or more trees.
      
      And then we created the first snapshot for the fs tree, according to the back
      reference rules, we also used implicit back refs for the child tree blocks of
      the root node of the fs tree, now those child nodes/leaves were shared by two
      trees.
      
      Then We didn't deal with the delayed references, and continued to change the fs
      tree(created the second snapshot and inserted the dir item of the new snapshot
      into the fs tree). According to the rules of the back reference, we added full
      back refs for those tree blocks whose parents have be shared by two trees.
      Now some newly allocated tree blocks had two types of the references.
      
      As we know, the delayed reference system handles these delayed references from
      back to front, and the full delayed reference is inserted after the implicit
      ones. So when we dealt with the back references of those newly allocated tree
      blocks, the full references was dealt with at first. And if the first reference
      is a shared back reference and the tree block that the reference points to is
      newly allocated, It would be considered as a tree block which is shared by two
      or more trees when it is allocated and should be a full back reference not a
      implicit one, the flag of its reference also should be set to FULL_BACKREF.
      But in fact, it was a non-shared tree block with a implicit reference at
      beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
      was triggered.
      
      We have several methods to fix this bug:
      1. deal with delayed references after the snapshot is created and before we
         change the source tree of the snapshot. This is the easiest and safest way.
      2. modify the sort method of the delayed reference tree, make the full delayed
         references be inserted before the implicit ones. It is also very easy, but
         I don't know if it will introduce some problems or not.
      3. modify select_delayed_ref() and make it select the implicit delayed reference
         at first. This way is not so good because it may wastes CPU time if we have
         lots of delayed references.
      4. set the flags to FULL_BACKREF, this method is a little complex comparing with
         the 1st way.
      
      I chose the 1st way to fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      361048f5
    • M
      Btrfs: fix error path in create_pending_snapshot() · 6fa9700e
      Miao Xie 提交于
      This patch fixes the following problem:
      - If we failed to deal with the delayed dir items, we should abort transaction,
        just as its comment said. Fix it.
      - If root reference or root back reference insertion failed, we should
        abort transaction. Fix it.
      - Fix the double free problem of pending->inherit.
      - Do not restore the trans->rsv if we doesn't change it.
      - make the error path more clearly.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      6fa9700e
    • S
      Btrfs: set journal_info in async trans commit worker · e209db7a
      Sage Weil 提交于
      We expect current->journal_info to point to the trans handle we are
      committing.
      Signed-off-by: NSage Weil <sage@inktank.com>
      e209db7a
    • S
      Btrfs: pass lockdep rwsem metadata to async commit transaction · 6fc4e354
      Sage Weil 提交于
      The freeze rwsem is taken by sb_start_intwrite() and dropped during the
      commit_ or end_transaction().  In the async case, that happens in a worker
      thread.  Tell lockdep the calling thread is releasing ownership of the
      rwsem and the async thread is picking it up.
      
      XFS plays the same trick in fs/xfs/xfs_aops.c.
      Signed-off-by: NSage Weil <sage@inktank.com>
      6fc4e354
  10. 29 8月, 2012 2 次提交
  11. 31 7月, 2012 1 次提交
    • J
      btrfs: Convert to new freezing mechanism · b2b5ef5c
      Jan Kara 提交于
      We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
      freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
      the transaction mechanism to avoid starting transactions on frozen filesystem.
      At minimum this is necessary to stop iput() of unlinked file to change frozen
      filesystem during truncation.
      
      Checks in cleaner_kthread() and transaction_kthread() can be safely removed
      since btrfs_freeze() will lock the mutexes and thus block the threads (and they
      shouldn't have anything to do anyway).
      
      CC: linux-btrfs@vger.kernel.org
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2b5ef5c
  12. 26 7月, 2012 1 次提交
    • A
      Btrfs: introduce subvol uuids and times · 8ea05e3a
      Alexander Block 提交于
      This patch introduces uuids for subvolumes. Each
      subvolume has it's own uuid. In case it was snapshotted,
      it also contains parent_uuid. In case it was received,
      it also contains received_uuid.
      
      It also introduces subvolume ctime/otime/stime/rtime. The
      first two are comparable to the times found in inodes. otime
      is the origin/creation time and ctime is the change time.
      stime/rtime are only valid on received subvolumes.
      stime is the time of the subvolume when it was
      sent. rtime is the time of the subvolume when it was
      received.
      
      Additionally to the times, we have a transid for each
      time. They are updated at the same place as the times.
      
      btrfs receive uses stransid and rtransid to find out
      if a received subvolume changed in the meantime.
      
      If an older kernel mounts a filesystem with the
      extented fields, all fields become invalid. The next
      mount with a new kernel will detect this and reset the
      fields.
      Signed-off-by: NAlexander Block <ablock84@googlemail.com>
      Reviewed-by: NDavid Sterba <dave@jikos.cz>
      Reviewed-by: NArne Jansen <sensille@gmx.net>
      Reviewed-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Reviewed-by: NAlex Lyakas <alex.bolshoy.btrfs@gmail.com>
      8ea05e3a
  13. 24 7月, 2012 3 次提交
    • J
      Btrfs: change how we indicate we're adding csums · 0e721106
      Josef Bacik 提交于
      There is weird logic I had to put in place to make sure that when we were
      adding csums that we'd used the delalloc block rsv instead of the global
      block rsv.  Part of this meant that we had to free up our transaction
      reservation before we ran the delayed refs since csum deletion happens
      during the delayed ref work.  The problem with this is that when we release
      a reservation we will add it to the global reserve if it is not full in
      order to keep us going along longer before we have to force a transaction
      commit.  By releasing our reservation before we run delayed refs we don't
      get the opportunity to drain down the global reserve for the work we did, so
      we won't refill it as often.  This isn't a problem per-se, it just results
      in us possibly committing transactions more and more often, and in rare
      cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
      out of space in our block rsv.
      
      This also helps us by holding onto space while the delayed refs run so we
      don't end up with as many people trying to do things at the same time, which
      again will help us not force commits or hit the use_block_rsv warnings.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0e721106
    • D
      Btrfs: small naming cleanup in join_transaction() · e4b50e14
      Dan Carpenter 提交于
      "root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
      because it is shorter and that's what is used in the rest of the
      function.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      e4b50e14
    • C
      Btrfs: don't wait around for new log writers on an SSD · e39e64ac
      Chris Mason 提交于
      Waiting on spindles improves performance, but ssds want all the
      IO as quickly as we can push it down.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      e39e64ac
  14. 12 7月, 2012 3 次提交