1. 01 Sep, 2013 3 commits
  2. 14 Jun, 2013 15 commits
    • J
      Btrfs: do not pin while under spin lock · e78417d1
      Josef Bacik authored
      When testing a corrupted fs I noticed I was getting sleep-while-atomic errors
      when the transaction aborted.  This is because btrfs_pin_extent may need to
      allocate memory and we were calling it under the spin lock.  Fix this by moving
      it out and doing the pin after dropping the spin lock but before dropping the
      mutex, the same way it works when delayed refs run normally.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      e78417d1
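The locking pattern described in this commit can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: `fake_spin_lock()`, `pin_extent()`, `struct ref_head` and `cleanup_head()` are hypothetical stand-ins, not the kernel's code, and the "spin lock" is just a flag that lets the sketch detect an allocation attempted in atomic context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a spin lock: records whether we are "atomic". */
static bool in_atomic;

static void fake_spin_lock(void)   { in_atomic = true; }
static void fake_spin_unlock(void) { in_atomic = false; }

/*
 * Hypothetical stand-in for btrfs_pin_extent(): it may allocate memory,
 * so calling it while "atomic" is treated as an error here.
 */
static int pin_extent(unsigned long long bytenr, unsigned long long *pinned_out)
{
	unsigned long long *rec;

	if (in_atomic)
		return -1;	/* would be a sleep-while-atomic bug */
	rec = malloc(sizeof(*rec));
	if (!rec)
		return -1;
	*rec = bytenr;
	*pinned_out = *rec;
	free(rec);
	return 0;
}

/* Hypothetical, heavily simplified delayed ref head. */
struct ref_head {
	unsigned long long bytenr;
	bool needs_pin;
};

/* Copy what we need under the lock, drop the lock, and only then pin. */
static int cleanup_head(struct ref_head *head, unsigned long long *pinned_out)
{
	bool needs_pin;
	unsigned long long bytenr;

	fake_spin_lock();
	needs_pin = head->needs_pin;
	bytenr = head->bytenr;
	fake_spin_unlock();

	if (needs_pin)
		return pin_extent(bytenr, pinned_out);
	return 0;
}
```

The point of the sketch is only the ordering: the decision is made under the lock, while the call that may allocate happens after the unlock.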
    • J
      Btrfs: fix qgroup rescan resume on mount · b382a324
      Jan Schmidt authored
      When called during mount, we cannot start the rescan worker thread until
      open_ctree is done. This commit restructures the qgroup rescan internals to
      enable a clean deferral of the rescan resume operation.
      
      First of all, the struct qgroup_rescan is removed, saving us a malloc and
      some initialization synchronization problems. Its only element (the worker
      struct) now lives within fs_info just as the rest of the rescan code.
      
      Then setting up a rescan worker is split into several reusable stages.
      Currently we have three different rescan startup scenarios:
      	(A) rescan ioctl
      	(B) rescan resume by mount
      	(C) rescan by quota enable
      
      Each case needs its own combination of the four following steps:
      	(1) set the progress [A, C: zero; B: state of umount]
      	(2) commit the transaction [A]
      	(3) set the counters [A, C: zero; B: state of umount]
      	(4) start worker [A, B, C]
      
      qgroup_rescan_init does step (1). There's no extra function added to commit
      a transaction, we've got that already. qgroup_rescan_zero_tracking does
      step (3). Step (4) is nothing more than a call to the generic
      btrfs_queue_worker.
      
      We also get rid of a double check for the rescan progress during
      btrfs_qgroup_account_ref, which is no longer required due to having step 2
      from the list above.
      
      As a side effect, this commit prepares to move the rescan start code from
      btrfs_run_qgroups (which is run during commit) to a less time critical
      section.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b382a324
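The scenario/step matrix from this commit message (scenarios A, B, C and steps 1 to 4) can be written down as a small lookup. This is a sketch only: the enum and function names are hypothetical and do not appear in the kernel; the mapping itself is taken from the lists in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* The three startup scenarios (A), (B), (C) from the commit message. */
enum rescan_scenario {
	RESCAN_IOCTL,		/* (A) rescan ioctl */
	RESCAN_RESUME_MOUNT,	/* (B) rescan resume by mount */
	RESCAN_QUOTA_ENABLE,	/* (C) rescan by quota enable */
};

/* The four steps (1)..(4), encoded as bits so they can be combined. */
enum rescan_step {
	STEP_SET_PROGRESS = 1 << 0,	/* (1) */
	STEP_COMMIT_TRANS = 1 << 1,	/* (2) */
	STEP_SET_COUNTERS = 1 << 2,	/* (3) */
	STEP_START_WORKER = 1 << 3,	/* (4) */
};

/* Which steps a scenario needs: only the ioctl case commits a transaction. */
static unsigned int rescan_steps(enum rescan_scenario s)
{
	unsigned int steps = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
			     STEP_START_WORKER;

	if (s == RESCAN_IOCTL)
		steps |= STEP_COMMIT_TRANS;
	return steps;
}

/*
 * Progress and counters start from zero for (A) and (C); for (B) they are
 * restored from the state saved at umount.
 */
static bool rescan_starts_from_zero(enum rescan_scenario s)
{
	return s != RESCAN_RESUME_MOUNT;
}
```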
    • M
      Btrfs: make the state of the transaction more readable · 4a9d8bde
      Miao Xie authored
      We used 3 variables to track the state of the transaction, which was complex
      and wasted memory. Besides that, it was hard to understand which types of
      transaction handles should be blocked in each transaction state, so
      developers often made mistakes.
      
      This patch addresses the above problem. In this patch, we define 6 states
      for the transaction,
        enum btrfs_trans_state {
      	TRANS_STATE_RUNNING		= 0,
      	TRANS_STATE_BLOCKED		= 1,
      	TRANS_STATE_COMMIT_START	= 2,
      	TRANS_STATE_COMMIT_DOING	= 3,
      	TRANS_STATE_UNBLOCKED		= 4,
      	TRANS_STATE_COMPLETED		= 5,
      	TRANS_STATE_MAX			= 6,
        }
      and use just one variable to track the state.
      
      In order to make the blocked handle types for each state clearer,
      we introduce an array:
        unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
      	[TRANS_STATE_RUNNING]		= 0U,
      	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START),
      	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH),
      	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN),
      	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
      	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
      					   __TRANS_START |
      					   __TRANS_ATTACH |
      					   __TRANS_JOIN |
      					   __TRANS_JOIN_NOLOCK),
        }
      which is quite intuitive.
      
      Besides that, because we remove ->in_commit from the transaction structure,
      the lock ->commit_lock which was used to protect it is unnecessary, so remove
      ->commit_lock as well.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      4a9d8bde
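To illustrate how a per-state table like the one above might be consulted, here is a minimal user-space sketch. The `TYPE_*` flags stand in for the `__TRANS_*` flags and their bit positions are assumptions; `handle_type_is_blocked()` is a hypothetical helper, not a function named in the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the __TRANS_* handle types; bit positions are assumed. */
#define TYPE_USERSPACE   (1U << 0)
#define TYPE_START       (1U << 1)
#define TYPE_ATTACH      (1U << 2)
#define TYPE_JOIN        (1U << 3)
#define TYPE_JOIN_NOLOCK (1U << 4)

enum trans_state {
	STATE_RUNNING = 0,
	STATE_BLOCKED,
	STATE_COMMIT_START,
	STATE_COMMIT_DOING,
	STATE_UNBLOCKED,
	STATE_COMPLETED,
	STATE_MAX,
};

/* Same shape as the array in the commit message: blocked types per state. */
static const unsigned int blocked_types[STATE_MAX] = {
	[STATE_RUNNING]      = 0U,
	[STATE_BLOCKED]      = TYPE_USERSPACE | TYPE_START,
	[STATE_COMMIT_START] = TYPE_USERSPACE | TYPE_START | TYPE_ATTACH,
	[STATE_COMMIT_DOING] = TYPE_USERSPACE | TYPE_START | TYPE_ATTACH |
			       TYPE_JOIN,
	[STATE_UNBLOCKED]    = TYPE_USERSPACE | TYPE_START | TYPE_ATTACH |
			       TYPE_JOIN | TYPE_JOIN_NOLOCK,
	[STATE_COMPLETED]    = TYPE_USERSPACE | TYPE_START | TYPE_ATTACH |
			       TYPE_JOIN | TYPE_JOIN_NOLOCK,
};

/* A handle of the given type must wait if its bit is set for this state. */
static bool handle_type_is_blocked(enum trans_state state, unsigned int type)
{
	return (blocked_types[state] & type) != 0;
}
```

With a single state variable, deciding whether a handle has to wait becomes one table lookup and one bitwise AND.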
    • M
      Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction · ac673879
      Miao Xie authored
      When we umount a fs with serious errors, we invoke btrfs_cleanup_transactions()
      to clean up the residual transactions. At this time, it is impossible to start a new
      transaction, so we needn't set trans_no_join to 1, and we also needn't clear the running
      transaction every time we destroy a residual transaction.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      ac673879
    • M
      Btrfs: introduce per-subvolume ordered extent list · 199c2a9c
      Miao Xie authored
      The reason we introduce the per-subvolume ordered extent list is the same
      as for the per-subvolume delalloc inode list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      199c2a9c
    • M
      Btrfs: introduce per-subvolume delalloc inode list · eb73c1b7
      Miao Xie authored
      When we create a snapshot, we do not need to flush all delalloc inodes in the
      fs; flushing just the inodes in the source tree is enough. So we introduce
      a per-subvolume delalloc inode list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      eb73c1b7
    • M
      Btrfs: introduce grab/put functions for the root of the fs/file tree · b0feb9d9
      Miao Xie authored
      The grab/put functions will be used in the next patch, which needs to grab
      the root object and ensure it is not freed. We use a reference counter
      instead of the srcu lock to avoid blocking the memory reclaim task,
      which invokes synchronize_srcu().
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b0feb9d9
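A reference-counted grab/put pair of the kind this commit describes could look roughly like the following in user space. This is a sketch using C11 atomics; `struct root`, `grab_root()` and `put_root()` are hypothetical names, and the "free" is only simulated with a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified root object with a reference counter. */
struct root {
	atomic_int refs;
	bool freed;	/* set instead of actually freeing, for illustration */
};

/*
 * Take a reference only if the object is still alive (refs != 0).
 * Returns the root on success, NULL if it is already being torn down.
 */
static struct root *grab_root(struct root *r)
{
	int old = atomic_load(&r->refs);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return r;
	}
	return NULL;
}

/* Drop a reference; the last put releases the object. */
static void put_root(struct root *r)
{
	if (atomic_fetch_sub(&r->refs, 1) == 1)
		r->freed = true;	/* real code would free the root here */
}
```

Neither operation waits for other users of the object, which fits the stated goal of not blocking the memory reclaim task.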
    • M
      Btrfs: cleanup the similar code of the fs root read · cb517eab
      Miao Xie authored
      There are several functions whose code is similar, such as
        btrfs_find_last_root()
        btrfs_read_fs_root_no_radix()
      
      Besides that, some functions are invoked twice, which is unnecessary.
      For example, we are sure that all roots which are found in
        btrfs_find_orphan_roots()
      have their orphan items, so it is unnecessary to check the orphan
      item again.
      
      So clean it up.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      cb517eab
    • M
      Btrfs: make the snap/subv deletion end more early when the fs is R/O · babbf170
      Miao Xie authored
      The snapshot/subvolume deletion might take a lot of time, which would make
      the remount task wait for a long time. This patch addresses this problem:
      we break out of the deletion if the fs is remounted R/O. It will make
      the users happy.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      babbf170
    • M
      Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot() · dc7f370c
      Miao Xie authored
      If the fs is remounted R/O, it is unnecessary to call
      btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
      this function. Besides that, it makes the check logic in the
      caller clearer.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      dc7f370c
    • M
      Btrfs: make the cleaner complete early when the fs is going to be umounted · 05323cd1
      Miao Xie authored
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      05323cd1
    • M
      Btrfs: remove unnecessary ->s_umount in cleaner_kthread() · d0278245
      Miao Xie authored
      In order to avoid the R/O remount, we acquired the ->s_umount lock while
      we deleted the dead snapshots and subvolumes. But it is unnecessary,
      because we have cleaner_mutex.
      
      We use cleaner_mutex to protect the deletion of the dead snapshots/subvolumes.
      And when we remount the fs R/O, we also acquire this mutex to
      do cleanup after we change the status of the fs. That is, this lock serializes
      the above operations: the cleaner can be aware of the status of the fs, and if
      the cleaner is deleting the dead snapshots/subvolumes, the remount task will
      wait for it. So it is safe to remove ->s_umount in cleaner_kthread().
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      d0278245
    • S
      Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL · b1b19596
      Stefan Behrens authored
      No need to check for NULL in send.c and disk-io.c.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b1b19596
    • W
      Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist · 1e8f9158
      Wang Shilong authored
      When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
      we want to walk the qgroup tree.
      
      By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
      once. This reduces the sys time spent allocating memory; see the measurements below.
      
      fsstress -p 4 -n 10000 -d $dir
      
      With this patch:
      
      real    0m50.153s
      user    0m0.081s
      sys     0m6.294s
      
      real    0m51.113s
      user    0m0.092s
      sys     0m6.220s
      
      real    0m52.610s
      user    0m0.096s
      sys     0m6.125s	avg 6.213
      -----------------------------------------------------
      Without the patch:
      
      real    0m54.825s
      user    0m0.061s
      sys     0m10.665s
      
      real    1m6.401s
      user    0m0.089s
      sys     0m11.218s
      
      real    1m13.768s
      user    0m0.087s
      sys     0m10.665s       avg 10.849
      
      We can see that the sys time is reduced by ~43%.
      Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      1e8f9158
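The averages and the ~43% figure quoted in this commit message can be re-derived from the six `sys` samples. The helper names below are hypothetical; only the sample values come from the message.

```c
#include <assert.h>

/* Mean of three samples, in seconds. */
static double mean3(const double v[3])
{
	return (v[0] + v[1] + v[2]) / 3.0;
}

/* Relative reduction from 'before' to 'after', in percent. */
static double reduction_percent(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* sys times from the commit message, in seconds. */
static const double sys_with_patch[3]    = { 6.294, 6.220, 6.125 };
static const double sys_without_patch[3] = { 10.665, 11.218, 10.665 };
```

Here `mean3(sys_with_patch)` comes out at about 6.213 and `mean3(sys_without_patch)` at about 10.849, so the reduction is a little under 43%, in line with the message.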
    • H
      Btrfs: fix check on same raid type flag twice · 15b0a89d
      Henrik Nordvik authored
      The code checked for the raid 5 flag in two else-if branches, so the second branch could never be reached. Probably a copy-paste bug.
      Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      15b0a89d
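The class of bug described here can be reproduced in a few lines. Everything in this sketch is hypothetical: the flag names, their values and the return values are made up for illustration, and the assumption that the second branch was meant to test a RAID6 flag is not stated in the commit message.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define BG_RAID1 (1U << 0)
#define BG_RAID5 (1U << 1)
#define BG_RAID6 (1U << 2)

/* Buggy shape: the same flag is tested twice, so the last branch is dead. */
static int level_buggy(unsigned int flags)
{
	if (flags & BG_RAID1)
		return 1;
	else if (flags & BG_RAID5)
		return 5;
	else if (flags & BG_RAID5)	/* never reached: copy-paste error */
		return 6;
	return 0;
}

/* Fixed shape, assuming the second test was meant for the other flag. */
static int level_fixed(unsigned int flags)
{
	if (flags & BG_RAID1)
		return 1;
	else if (flags & BG_RAID5)
		return 5;
	else if (flags & BG_RAID6)
		return 6;
	return 0;
}
```

With the buggy version, an input carrying only the second flag falls through to the default case instead of reaching its own branch.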
  3. 09 Jun, 2013 3 commits
    • J
      Btrfs: stop all workers before cleaning up roots · 13e6c37b
      Josef Bacik authored
      Dave reported a panic because the extent_root->commit_root was NULL in the
      caching kthread.  That is because we just unset it in free_root_pointers, which
      is not the correct thing to do; we have to either wait for the caching kthread
      to complete or hold the extent_commit_sem lock so we know the thread has exited.
      This patch makes the kthreads all stop first and then we do our cleanup.  This
      should fix the race.  Thanks,
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      13e6c37b
    • L
      Btrfs: fix use-after-free bug during umount · 2932505a
      Liu Bo authored
      Commit be283b2e (Btrfs: use helper to cleanup tree roots) introduced the
      following bug:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
       IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
      [...]
       Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
       RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
       Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
      [...]
       Call Trace:
        [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
        [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
        [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
        [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
        [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
        [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
        [<ffffffff81068d3d>] kthread+0x8d/0x95
        [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
        [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
        [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
      RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
      
      We've freed commit_root before actually getting to free the block groups, where
      the caching thread needs a valid extent_root->commit_root.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      2932505a
    • J
      Btrfs: don't delete fs_roots until after we cleanup the transaction · 7b5ff90e
      Josef Bacik authored
      We get a use after free if we had a transaction to clean up, since there could be
      delayed inodes which refer to their respective fs_root.  Thanks
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      7b5ff90e
  4. 22 May, 2013 1 commit
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner authored
      Currently there is no way to truncate a partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the existing functionality was enough for the file system truncate
      operation to work properly. However, more file systems now support the punch
      hole feature, and they can benefit from mm supporting truncating a page just
      up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating a partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept the range to be invalidated and updates all the instances
      of it.
      
      We also change block_invalidatepage() in the same way and actually
      make use of the new length argument by implementing range invalidation.
      
      Actual file system implementations will follow, except for the file systems
      where the changes are really simple and should not change the behaviour
      in any way. An implementation of truncate_page_range(), which will be able
      to accept page-unaligned ranges, will follow as well.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
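Range invalidation within a page, as this commit describes for the new length argument, might be sketched in user space like this. The page and block sizes, `struct fake_page` and `invalidate_range()` are assumptions made for the example; the sketch discards only blocks that lie entirely inside `[offset, offset + length)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sizes for the sketch: a 4096-byte page made of 1024-byte blocks. */
#define PAGE_SZ         4096u
#define BLOCK_SZ        1024u
#define BLOCKS_PER_PAGE (PAGE_SZ / BLOCK_SZ)

/* Hypothetical page: one "has data" flag per block. */
struct fake_page {
	bool has_data[BLOCKS_PER_PAGE];
};

/*
 * Discard every block fully contained in [offset, offset + length).
 * Blocks only partially covered by the range are left untouched.
 * Returns the number of blocks discarded.
 */
static unsigned int invalidate_range(struct fake_page *p,
				     unsigned int offset, unsigned int length)
{
	unsigned int stop = offset + length;
	unsigned int curr = 0;
	unsigned int discarded = 0;
	unsigned int i;

	for (i = 0; i < BLOCKS_PER_PAGE; i++) {
		unsigned int next = curr + BLOCK_SZ;

		if (next > stop)	/* block ends past the range: done */
			break;
		if (offset <= curr) {	/* block starts inside the range */
			p->has_data[i] = false;
			discarded++;
		}
		curr = next;
	}
	return discarded;
}
```

Without a length argument, the range could only run from `offset` to the end of the page, which is the limitation the commit message describes.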
  5. 18 May, 2013 6 commits
  6. 07 May, 2013 12 commits
    • C
      Btrfs: allow superblock mismatch from older mkfs · 667e7d94
      Chris Mason authored
      We've added new checks to make sure the super block crc is correct
      during mount.  A fresh filesystem from an older mkfs won't have the
      crc set.  This adds a warning when it finds a newly created filesystem
      but doesn't fail the mount.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      667e7d94
    • D
      btrfs: enhance superblock checks · 1104a885
      David Sterba authored
      The superblock checksum is not verified upon mount. <awkward silence>
      
      Add that check and also reorder existing checks to a more logical
      order.
      
      Current mkfs.btrfs does not calculate the correct checksum of
      super_block and thus a freshly created filesystem will fail to mount when
      this patch is applied.
      
      First transaction commit calculates correct superblock checksum and
      saves it to disk.
      
      Reproducer:
      $ mkfs.btrfs /dev/sda
      $ mount /dev/sda /mnt
      $ btrfs scrub start /mnt
      $ sleep 5
      $ btrfs scrub status /mnt
      ... super:2 ...
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      1104a885
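A mount-time checksum check of the kind this commit adds could be sketched as below. This is not the kernel's code: the layout constants (a 4096-byte superblock whose first 32 bytes hold the checksum field), the little-endian storage of the value, and the helper names are assumptions made for the example. The CRC-32C routine itself is the standard bitwise form (reflected polynomial 0x82F63B78).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout for this sketch. */
#define SB_SIZE      4096
#define SB_CSUM_SIZE 32

/* Standard bitwise CRC-32C (Castagnoli), init and final XOR 0xFFFFFFFF. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Checksum covers everything after the checksum field. */
static uint32_t sb_compute_csum(const uint8_t *sb)
{
	return crc32c(sb + SB_CSUM_SIZE, SB_SIZE - SB_CSUM_SIZE);
}

/* Store the checksum little-endian in the first four bytes of the field. */
static void sb_store_csum(uint8_t *sb)
{
	uint32_t c = sb_compute_csum(sb);

	sb[0] = (uint8_t)(c & 0xff);
	sb[1] = (uint8_t)((c >> 8) & 0xff);
	sb[2] = (uint8_t)((c >> 16) & 0xff);
	sb[3] = (uint8_t)((c >> 24) & 0xff);
}

/* Returns 0 if the stored checksum matches the contents, -1 otherwise. */
static int sb_check_csum(const uint8_t *sb)
{
	uint32_t stored = (uint32_t)sb[0] | ((uint32_t)sb[1] << 8) |
			  ((uint32_t)sb[2] << 16) | ((uint32_t)sb[3] << 24);

	return stored == sb_compute_csum(sb) ? 0 : -1;
}
```

A caller could treat a mismatch as a hard mount failure, or downgrade it to a warning for freshly created filesystems as the follow-up commit 667e7d94 describes.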
    • D
      f7a52a40
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen authored
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      48a3b636
    • J
      Btrfs: deal with errors in write_dev_supers · 634554dc
      Josef Bacik authored
      If you try to mount -o loop a restored file system it will panic if the file
      ends up being smaller than the original disk.  This is because we try to
      get a block for a super that may be past the EOF, which makes __getblk return
      NULL for a buffer head when we aren't expecting it to.  Fix this by dealing with
      this case and just bumping the errors count.  With this patch we no longer
      panic when mounting a restored file system via loopback.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      634554dc
    • J
      Btrfs: rescan for qgroups · 2f232036
      Jan Schmidt authored
      If qgroup tracking is out of sync, a rescan operation can be started. It
      iterates the complete extent tree and recalculates all qgroup tracking data.
      This is an expensive operation and should not be used unless required.
      
      A filesystem under rescan can still be umounted. The rescan continues on the
      next mount.  Status information is provided with a separate ioctl while a
      rescan operation is in progress.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      2f232036
    • J
      Btrfs: separate sequence numbers for delayed ref tracking and tree mod log · fc36ed7e
      Jan Schmidt authored
      Sequence numbers for delayed refs have been introduced in the first version
      of the qgroup patch set. To solve the problem of find_all_roots on a busy
      file system, the tree mod log was introduced. The sequence numbers for that
      were simply shared between those two users.
      
      However, at one point in qgroup's quota accounting, there's a statement
      accessing the previous sequence number, that's still just doing (seq - 1)
      just as it would have to in the very first version.
      
      To satisfy that requirement, this patch makes the sequence number counter 64
      bit and splits it into a major part (used for qgroup sequence number
      counting) and a minor part (incremented for each tree modification in the
      log). This enables us to go exactly one major step backwards, as required
      for qgroups, while still incrementing the sequence counter for tree mod log
      insertions to keep track of their order. Keeping them in a single variable
      means there's no need to change all the code dealing with comparisons of two
      sequence numbers.
      
      The sequence number is reset to 0 on commit (not new in this patch), which
      ensures we won't overflow the two 32 bit counters.
      
      Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
      from the tree mod log code may happen.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fc36ed7e
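The major/minor split of the 64-bit sequence number can be sketched as follows. The helper names are hypothetical, and placing the major part in the upper 32 bits is an assumption of this sketch (the message only says the counter is split into two 32-bit parts); that placement is what keeps plain integer comparison consistent with ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: major counter in the upper 32 bits, minor in the lower. */
#define SEQ_MAJOR_MASK 0xffffffff00000000ULL
#define SEQ_MAJOR_STEP (1ULL << 32)

/* New major step (qgroup sequence counting): bump major, reset minor. */
static uint64_t seq_inc_major(uint64_t seq)
{
	return (seq & SEQ_MAJOR_MASK) + SEQ_MAJOR_STEP;
}

/* New minor step (one tree modification logged): plain increment. */
static uint64_t seq_inc_minor(uint64_t seq)
{
	return seq + 1;
}

/*
 * "Exactly one major step backwards": the largest value that still sorts
 * before the start of the current major step.
 */
static uint64_t seq_prev_major(uint64_t seq)
{
	return (seq & SEQ_MAJOR_MASK) - 1ULL;
}
```

Because both parts live in one variable, existing code that compares two sequence numbers with `<` or `>` keeps working unchanged, which is the design point the message highlights.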
    • S
      Btrfs: set UUID in root_item for created trees · 6463fe58
      Stefan Behrens authored
      It is a rare exception that a new tree is created, like the qgroups
      tree. So far these new trees have an all-zero UUID in their root
      items. All trees that mkfs.btrfs has created get a UUID during the
      first mount when btrfs_read_root_item() rewrites the root_item to
      the v2 structure style. These UUIDs have not been used so far, but
      since it is better to have them uniform for all trees, this
      commit adds some lines that generate and write a UUID for newly
      created trees.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      6463fe58
    • J
      Btrfs: various abort cleanups · 54067ae9
      Josef Bacik authored
      I have a broken file system that when it aborts leaves all sorts of accounting
      things wrong and gives you lots of WARN_ON()'s other than the abort.  This is
      because we're not cleaning up various parts of the file system when we abort.
      The first chunks are specific to mount failures: we weren't cleaning up the
      block group cached inodes and we weren't cleaning up any transactions that had
      been aborted, which leaves a bunch of things lying around.
      
      The second half of this is related to the cleanup parts.  First, we don't need
      to release space for the dirty pages from the trans_block_rsv; that's all
      handled by the trans handles so this is just plain wrong.  The other thing is we
      need to pin down extents that were set ->must_insert_reserved for delayed refs.
      This isn't so much for the pinning but more for cleaning up the
      cache->reserved counter since we are no longer going to use those reserved
      bytes.  With this patch I no longer see a bunch of WARN_ON()'s when I try to
      mount this broken file system, just the initial one from the abort.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      54067ae9
    • J
      Btrfs: cleanup destroy_marked_extents · fd8b2b61
      Josef Bacik authored
      We can just look up the extent_buffers for the range and free stuff that way.
      This makes the cleanup a bit cleaner and we can make sure to evict the
      extent_buffers pretty quickly by marking them as stale.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fd8b2b61
    • J
      Btrfs: cleanup fs roots if we fail to mount · 171f6537
      Josef Bacik authored
      We can run the tree logging recovery or the orphan cleanup on mount, so we'll
      end up looking up a random fs tree in the meantime.  So we need to clean this up
      so we don't leave extent buffers hanging around in the cache.  With this patch
      we no longer leak extent buffers on failure to mount.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      171f6537