1. 14 6月, 2013 11 次提交
    • M
      Btrfs: merge pending IO for tree log write back · c6adc9cc
      Miao Xie 提交于
      Before applying this patch, we flushed the log tree of the fs/file
      tree firstly, and then flushed the log root tree. It is ineffective,
      especially on the hard disk. This patch improved this problem by wrapping
      the above two flushes by the same blk_plug.
      
      By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
      on my scsi disk whose disk buffer was enabled.
      
      Test step:
       # mkfs.btrfs -f -m single <disk>
       # mount <disk> <mnt>
       # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c6adc9cc
    • M
      Btrfs: make the state of the transaction more readable · 4a9d8bde
      Miao Xie 提交于
      We used 3 variants to track the state of the transaction, it was complex
      and wasted the memory space. Besides that, it was hard to understand that
      which types of the transaction handles should be blocked in each transaction
      state, so the developers often made mistakes.
      
      This patch improved the above problem. In this patch, we define 6 states
      for the transaction,
        enum btrfs_trans_state {
      	TRANS_STATE_RUNNING		= 0,
      	TRANS_STATE_BLOCKED		= 1,
      	TRANS_STATE_COMMIT_START	= 2,
      	TRANS_STATE_COMMIT_DOING	= 3,
      	TRANS_STATE_UNBLOCKED		= 4,
      	TRANS_STATE_COMPLETED		= 5,
      	TRANS_STATE_MAX			= 6,
        }
      and just use 1 variant to track those state.
      
      In order to make the blocked handle types for each state more clear,
      we introduce a array:
        unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
      	[TRANS_STATE_RUNNING]		= 0U,
      	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START),
      	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH),
      	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN),
      	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
      	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
        }
      it is very intuitionistic.
      
      Besides that, because we remove ->in_commit in transaction structure, so
      the lock ->commit_lock which was used to protect it is unnecessary, remove
      ->commit_lock.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      4a9d8bde
    • M
      Btrfs: remove the time check in btrfs_commit_transaction() · 581227d0
      Miao Xie 提交于
      We checked the commit time to avoid committing the transaction
      frequently, but it is unnecessary because:
      - It made the transaction commit spend more time, and delayed the
        operation of the external writers(TRANS_START/TRANS_USERSPACE).
      - Except the space that we have to commit transaction, such as
        snapshot creation, btrfs doesn't commit the transaction on its
        own initiative.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      581227d0
    • M
      Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure · 3f1e3fa6
      Miao Xie 提交于
      We used ->num_joined track if there were some writers which join the current
      transaction when the committer was sleeping. If some writers joined the current
      transaction, we has to continue the while loop to do some necessary stuff, such
      as flush the ordered operations. But it is unnecessary because we will do it
      after the while loop.
      
      Besides that, tracking ->num_joined would make the committer drop into the while
      loop when there are lots of internal writers(TRANS_JOIN).
      
      So we remove ->num_joined and don't track if there are some writers which join
      the current transaction when the committer is sleeping.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3f1e3fa6
    • M
      Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set · 82436617
      Miao Xie 提交于
      It is unnecessary to flush the delalloc inodes again and again because
      we don't care the dirty pages which are introduced after the flush, and
      they will be flush in the transaction commit.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      82436617
    • M
      Btrfs: don't wait for all the writers circularly during the transaction commit · 0860adfd
      Miao Xie 提交于
      btrfs_commit_transaction has the following loop before we commit the
      transaction.
      
      do {
          // attempt to do some useful stuff and/or sleep
      } while (atomic_read(&cur_trans->num_writers) > 1 ||
      	 (should_grow && cur_trans->num_joined != joined));
      
      This is used to prevent from the TRANS_START to get in the way of a
      committing transaction. But it does not prevent from TRANS_JOIN, that
      is we would do this loop for a long time if some writers JOIN the
      current transaction endlessly.
      
      Because we need join the current transaction to do some useful stuff,
      we can not block TRANS_JOIN here. So we introduce a external writer
      counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
      If the external writer counter is zero, we can break the above loop.
      
      In order to make the code more clear, we don't use enum variant
      to define the type of the transaction handle, use bitmask instead.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0860adfd
    • M
      Btrfs: remove the code for the impossible case in cleanup_transaction() · 25d8c284
      Miao Xie 提交于
      If the transaction is removed from the transaction list, it means the
      transaction has been committed successfully. So it is impossible to
      call cleanup_transaction(), otherwise there is something wrong with
      the code logic. Thus, we use BUG_ON() instead of the original handle.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      25d8c284
    • M
      Btrfs: just flush the delalloc inodes in the source tree before snapshot creation · 6a03843d
      Miao Xie 提交于
      Before applying this patch, we need flush all the delalloc inodes in
      the fs when we want to create a snapshot, it wastes time, and make
      the transaction commit be blocked for a long time. It means some other
      user operation would also be blocked for a long time.
      
      This patch improves this problem, we just flush the delalloc inodes that
      in the source trees before snapshot creation, so the transaction commit
      will complete quickly.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6a03843d
    • M
      Btrfs: introduce per-subvolume ordered extent list · 199c2a9c
      Miao Xie 提交于
      The reason we introduce per-subvolume ordered extent list is the same
      as the per-subvolume delalloc inode list.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      199c2a9c
    • M
      Btrfs: introduce per-subvolume delalloc inode list · eb73c1b7
      Miao Xie 提交于
      When we create a snapshot, we need flush all delalloc inodes in the
      fs, just flushing the inodes in the source tree is OK. So we introduce
      per-subvolume delalloc inode list.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      eb73c1b7
    • M
      Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot() · dc7f370c
      Miao Xie 提交于
      If the fs is remounted to be R/O, it is unnecessary to call
      btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
      this function. And besides that, it can make the check logic in the
      caller more clear.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      dc7f370c
  2. 07 5月, 2013 8 次提交
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen 提交于
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      48a3b636
    • J
      Btrfs: separate sequence numbers for delayed ref tracking and tree mod log · fc36ed7e
      Jan Schmidt 提交于
      Sequence numbers for delayed refs have been introduced in the first version
      of the qgroup patch set. To solve the problem of find_all_roots on a busy
      file system, the tree mod log was introduced. The sequence numbers for that
      were simply shared between those two users.
      
      However, at one point in qgroup's quota accounting, there's a statement
      accessing the previous sequence number, that's still just doing (seq - 1)
      just as it would have to in the very first version.
      
      To satisfy that requirement, this patch makes the sequence number counter 64
      bit and splits it into a major part (used for qgroup sequence number
      counting) and a minor part (incremented for each tree modification in the
      log). This enables us to go exactly one major step backwards, as required
      for qgroups, while still incrementing the sequence counter for tree mod log
      insertions to keep track of their order. Keeping them in a single variable
      means there's no need to change all the code dealing with comparisons of two
      sequence numbers.
      
      The sequence number is reset to 0 on commit (not new in this patch), which
      ensures we won't overflow the two 32 bit counters.
      
      Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
      from the tree mod log code may happen.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fc36ed7e
    • S
      Btrfs: clear received_uuid field for new writable snapshots · 70023da2
      Stefan Behrens 提交于
      For created snapshots, the full root_item is copied from the source
      root and afterwards selectively modified. The current code forgets
      to clear the field received_uuid. The only problem is that it is
      confusing when you look at it with 'btrfs subv list', since for
      writable snapshots, the contents of the snapshot can be completely
      unrelated to the previously received snapshot.
      The receiver ignores such snapshots anyway because he also checks
      the field stransid in the root_item and that value used to be reset
      to zero for all created snapshots.
      
      This commit changes two things:
      - clear the received_uuid field for new writable snapshots.
      - don't clear the send/receive related information like the stransid
        for read-only snapshots (which makes them useable as a parent for
        the automatic selection of parents in the receive code).
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      70023da2
    • W
    • J
      Btrfs: fix infinite loop when we abort on mount · cf79ffb5
      Josef Bacik 提交于
      Testing my enospc log code I managed to abort a transaction during mount, which
      put me into an infinite loop.  This is because of two things, first we don't
      reset trans_no_join if we abort during transaction commit, which will force
      anybody trying to start a transaction to just loop endlessly waiting for it to
      be set to 0.  But this is still just a symptom, the second issue is we don't set
      the fs state to error during errors on mount.  This is because we don't want to
      do the flip read only thing during mount, but we still really want to set the fs
      state to an error to keep us from even getting to the trans_no_join check.  So
      fix both of these things, make sure to reset trans_no_join if we abort during a
      commit, and make sure we set the fs state to error no matter if we're mounting
      or not.  This should keep us from getting into this infinite loop again.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      cf79ffb5
    • S
      Btrfs: Include the device in most error printk()s · c2cf52eb
      Simon Kirby 提交于
      With more than one btrfs volume mounted, it can be very difficult to find
      out which volume is hitting an error. btrfs_error() will print this, but
      it is currently rigged as more of a fatal error handler, while many of
      the printk()s are currently for debugging and yet-unhandled cases.
      
      This patch just changes the functions where the device information is
      already available. Some cases remain where the root or fs_info is not
      passed to the function emitting the error.
      
      This may introduce some confusion with volumes backed by multiple devices
      emitting errors referring to the primary device in the set instead of the
      one on which the error occurred.
      
      Use btrfs_printk(fs_info, format, ...) rather than writing the device
      string every time, and introduce macro wrappers ala XFS for brevity.
      Since the function already cannot be used for continuations, print a
      newline as part of the btrfs_printk() message rather than at each caller.
      Signed-off-by: NSimon Kirby <sim@hostway.ca>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c2cf52eb
    • D
      btrfs: clean snapshots one by one · 9d1a2a3a
      David Sterba 提交于
      Each time pick one dead root from the list and let the caller know if
      it's needed to continue. This should improve responsiveness during
      umount and balance which at some point waits for cleaning all currently
      queued dead roots.
      
      A new dead root is added to the end of the list, so the snapshots
      disappear in the order of deletion.
      
      The snapshot cleaning work is now done only from the cleaner thread and the
      others wake it if needed.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      9d1a2a3a
    • D
      btrfs: clean up transaction abort messages · 08748810
      David Sterba 提交于
      The transaction abort stacktrace is printed only once per module
      lifetime, but we'd like to see it each time it happens per mounted
      filesystem.  Introduce a fs_state flag that records it.
      
      Tweak the messages around abort:
      * add error number to the first abort
      * print the exact negative errno from btrfs_decode_error
      * clean up btrfs_decode_error and callers
      * no dots at the end of the messages
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      08748810
  3. 15 3月, 2013 1 次提交
  4. 05 3月, 2013 2 次提交
    • L
      Btrfs: avoid deadlock on transaction waiting list · 66b6135b
      Liu Bo 提交于
      Only let one trans handle to wait for other handles, otherwise we
      will get ABBA issues.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      66b6135b
    • M
      Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails · aec8030a
      Miao Xie 提交于
      There are several bugs at error path of create_snapshot() when the
      transaction commitment failed.
      - access the freed transaction handler. At the end of the
        transaction commitment, the transaction handler was freed, so we
        should not access it after the transaction commitment.
      - we were not aware of the error which happened during the snapshot
        creation if we submitted a async transaction commitment.
      - pending snapshot access vs pending snapshot free. when something
        wrong happened after we submitted a async transaction commitment,
        the transaction committer would cleanup the pending snapshots and
        free them. But the snapshot creators were not aware of it, they
        would access the freed pending snapshots.
      
      This patch fixes the above problems by:
      - remove the dangerous code that accessed the freed handler
      - assign ->error if the error happens during the snapshot creation
      - the transaction committer doesn't free the pending snapshots,
        just assigns the error number and evicts them before we unblock
        the transaction.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      aec8030a
  5. 01 3月, 2013 3 次提交
  6. 27 2月, 2013 1 次提交
    • L
      Btrfs: use reserved space for creating a snapshot · 2382c5cc
      Liu Bo 提交于
      While inserting dir index and updating inode for a snapshot, we'd
      add delayed items which consume trans->block_rsv, if we don't have
      any space reserved in this trans handle, we either just return or
      reserve space again.
      
      But before creating pending snapshots during committing transaction,
      we've done a release on this trans handle, so we don't have space reserved
      in it at this stage.
      
      What we're using is block_rsv of pending snapshots which has already
      reserved well enough space for both inserting dir index and updating
      inode, so we need to set trans handle to indicate that we have space
      now.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2382c5cc
  7. 21 2月, 2013 9 次提交
    • M
      Btrfs: fix missing release of qgroup reservation in commit_transaction() · 272d26d0
      Miao Xie 提交于
      We forget to free qgroup reservation in commit_transaction(),fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl-fnst@cn.fujitsu.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      272d26d0
    • M
      Btrfs: fix uncompleted transaction · d4edf39b
      Miao Xie 提交于
      In some cases, we need commit the current transaction, but don't want
      to start a new one if there is no running transaction, so we introduce
      the function - btrfs_attach_transaction(), which can catch the current
      transaction, and return -ENOENT if there is no running transaction.
      
      But no running transaction doesn't mean the current transction completely,
      because we removed the running transaction before it completes. In some
      cases, it doesn't matter. But in some special cases, such as freeze fs, we
      hope the transaction is fully on disk, it will introduce some bugs, for
      example, we may feeze the fs and dump the data in the disk, if the transction
      doesn't complete, we would dump inconsistent data. So we need fix the above
      problem for those cases.
      
      We fixes this problem by introducing a function:
      	btrfs_attach_transaction_barrier()
      if we hope all the transaction is fully on the disk, even they are not
      running, we can use this function.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      d4edf39b
    • M
      Btrfs: fix the deadlock between the transaction start/attach and commit · 178260b2
      Miao Xie 提交于
      Now btrfs_commit_transaction() does this
      
      ret = btrfs_run_ordered_operations(root, 0)
      
      which async flushes all inodes on the ordered operations list, it introduced
      a deadlock that transaction-start task, transaction-commit task and the flush
      workers waited for each other.
      (See the following URL to get the detail
       http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
      
      As we know, if ->in_commit is set, it means someone is committing the
      current transaction, we should not try to join it if we are not JOIN
      or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
      the above problem. In this way, there is another benefit: there is no new
      transaction handle to block the transaction which is on the way of commit,
      once we set ->in_commit.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      178260b2
    • M
      Btrfs: fix the qgroup reserved space is released prematurely · 4b824906
      Miao Xie 提交于
      In start_transactio(), we will try to join the transaction again after
      the current transaction is committed, so we should not release the
      reserved space of the qgroup. Fix it.
      
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      4b824906
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
    • D
      btrfs: add cancellation points to defrag · 210549eb
      David Sterba 提交于
      The defrag operation can take very long, we want to have a way how to
      cancel it. The code checks for a pending signal at safe points in the
      defrag loops and returns EAGAIN. This means a user can press ^C after
      running 'btrfs fi defrag', woks for both defrag modes, files and root.
      
      Returning from the command was instant in my light tests, but may take
      longer depending on the aging factor of the filesystem.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      210549eb
    • E
      btrfs: remove cache only arguments from defrag path · de78b51a
      Eric Sandeen 提交于
      The entry point at the defrag ioctl always sets "cache only" to 0;
      the codepaths haven't run for a long time as far as I can
      tell.  Chris says they're dead code, so remove them.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      de78b51a
    • J
      Btrfs: if we aren't committing just end the transaction if we error out · e4a2bcac
      Josef Bacik 提交于
      I hit a deadlock where transaction commit was waiting on num_writers to be
      0.  This happened because somebody came into btrfs_commit_transaction and
      noticed we had aborted and it went to cleanup_transaction.  This shouldn't
      happen because cleanup_transaction is really to fixup a bad commit, it
      doesn't do the normal trans handle cleanup things.  So if we have an error
      just do the normal btrfs_end_transaction dance and return.  Once we are in
      the actual commit path we can use cleanup_transaction and be good to go.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      e4a2bcac
    • M
      Btrfs: use bit operation for ->fs_state · 87533c47
      Miao Xie 提交于
      There is no lock to protect fs_info->fs_state, it will introduce
      some problems, such as the value may be covered by the other task
      when several tasks modify it. For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact, it is 2.
      
      Though this problem doesn't happen now (because there is only one
      flag currently), the code is error prone, if we add other flags,
      the above problem will happen to a certainty.
      
      Now we use bit operation for it to fix the above problem.
      In this way, we can make the code more robust and be easy to
      add new flags.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      87533c47
  8. 20 2月, 2013 5 次提交