1. 14 Jun, 2013 3 commits
  2. 09 Jun, 2013 3 commits
    • J
      Btrfs: stop all workers before cleaning up roots · 13e6c37b
      Josef Bacik committed
      Dave reported a panic because the extent_root->commit_root was NULL in the
      caching kthread.  That is because we just unset it in free_root_pointers, which
      is not the correct thing to do, we have to either wait for the caching kthread
      to complete or hold the extent_commit_sem lock so we know the thread has exited.
      This patch makes the kthreads all stop first and then we do our cleanup.  This
      should fix the race.  Thanks,
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      13e6c37b
    • L
      Btrfs: fix use-after-free bug during umount · 2932505a
      Liu Bo committed
      Commit be283b2e (Btrfs: use helper to cleanup tree roots) introduced the
      following bug:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
       IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
      [...]
       Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
       RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
       Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
      [...]
       Call Trace:
        [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
        [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
        [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
        [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
        [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
        [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
        [<ffffffff81068d3d>] kthread+0x8d/0x95
        [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
        [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
        [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
      RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
      
      We've freed commit_root before actually getting to free the block groups,
      where the caching thread needs a valid extent_root->commit_root.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      2932505a
    • J
      Btrfs: don't delete fs_roots until after we cleanup the transaction · 7b5ff90e
      Josef Bacik committed
      We get a use after free if we had a transaction to cleanup since there could be
      delayed inodes which refer to their respective fs_root.  Thanks
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      7b5ff90e
  3. 18 May, 2013 6 commits
  4. 07 May, 2013 24 commits
    • C
      Btrfs: allow superblock mismatch from older mkfs · 667e7d94
      Chris Mason committed
      We've added new checks to make sure the super block crc is correct
      during mount.  A fresh filesystem from an older mkfs won't have the
      crc set.  This adds a warning when it finds a newly created filesystem
      but doesn't fail the mount.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      667e7d94
    • D
      btrfs: enhance superblock checks · 1104a885
      David Sterba committed
      The superblock checksum is not verified upon mount. <awkward silence>
      
      Add that check and also reorder existing checks to a more logical
      order.
      
      Current mkfs.btrfs does not calculate the correct checksum of
      super_block and thus a freshly created filesystem will fail to mount when
      this patch is applied.
      
      First transaction commit calculates correct superblock checksum and
      saves it to disk.
      
      Reproducer:
      $ mkfs.btrfs /dev/sda
      $ mount /dev/sda /mnt
      $ btrfs scrub start /mnt
      $ sleep 5
      $ btrfs scrub status /mnt
      ... super:2 ...
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      1104a885
    • D
      f7a52a40
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen committed
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      48a3b636
    • J
      Btrfs: deal with errors in write_dev_supers · 634554dc
      Josef Bacik committed
      If you try to mount -o loop a restored file system it will panic if the file
      ends up being smaller than the original disk.  This is because we go to try and
      get a block for a super that may be past the EOF which makes __getblk return
      NULL for a buffer head when we aren't expecting it to.  Fix this by dealing with
      this case and just jacking up the errors count.  With this patch we no longer
      panic when mounting a restored file system loopback.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      634554dc
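The error handling described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `fake_getblk`, `struct fake_block`, and `write_supers` are hypothetical stand-ins rather than kernel functions, and the offsets passed in are illustrative values chosen by the caller. The point is that a NULL block lookup past the end of the device is counted as an error instead of being dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer head returned by a block lookup. */
struct fake_block {
    uint64_t offset;
};

/* Hypothetical lookup: returns NULL when the super copy would lie past EOF. */
struct fake_block *fake_getblk(struct fake_block *slot, uint64_t offset,
                               uint64_t dev_size, uint64_t super_size)
{
    if (offset + super_size > dev_size)
        return NULL;
    slot->offset = offset;
    return slot;
}

/*
 * Try every super block copy. A failed lookup bumps the error count and
 * the loop moves on, instead of using the NULL pointer.
 * Returns the number of copies that could not be written.
 */
int write_supers(const uint64_t *offsets, size_t n,
                 uint64_t dev_size, uint64_t super_size)
{
    int errors = 0;

    assert(offsets != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct fake_block slot;
        struct fake_block *bh = fake_getblk(&slot, offsets[i],
                                            dev_size, super_size);
        if (!bh) {
            errors++;
            continue;
        }
        /* A real implementation would fill and submit the block here. */
    }
    return errors;
}
```

With a small backing file, the copies that fall beyond its end are simply reported as errors, and the caller can decide how many failures are tolerable.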
    • J
      Btrfs: rescan for qgroups · 2f232036
      Jan Schmidt committed
      If qgroup tracking is out of sync, a rescan operation can be started. It
      iterates the complete extent tree and recalculates all qgroup tracking data.
      This is an expensive operation and should not be used unless required.
      
      A filesystem under rescan can still be umounted. The rescan continues on the
      next mount.  Status information is provided with a separate ioctl while a
      rescan operation is in progress.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      2f232036
    • J
      Btrfs: separate sequence numbers for delayed ref tracking and tree mod log · fc36ed7e
      Jan Schmidt committed
      Sequence numbers for delayed refs have been introduced in the first version
      of the qgroup patch set. To solve the problem of find_all_roots on a busy
      file system, the tree mod log was introduced. The sequence numbers for that
      were simply shared between those two users.
      
      However, at one point in qgroup's quota accounting, there's a statement
      accessing the previous sequence number, that's still just doing (seq - 1)
      just as it would have to in the very first version.
      
      To satisfy that requirement, this patch makes the sequence number counter 64
      bit and splits it into a major part (used for qgroup sequence number
      counting) and a minor part (incremented for each tree modification in the
      log). This enables us to go exactly one major step backwards, as required
      for qgroups, while still incrementing the sequence counter for tree mod log
      insertions to keep track of their order. Keeping them in a single variable
      means there's no need to change all the code dealing with comparisons of two
      sequence numbers.
      
      The sequence number is reset to 0 on commit (not new in this patch), which
      ensures we won't overflow the two 32 bit counters.
      
      Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
      from the tree mod log code may happen.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fc36ed7e
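The major/minor split described above can be sketched as follows. This is a minimal userspace illustration, not the kernel code: all function and macro names are hypothetical, and it assumes the major part occupies the upper 32 bits and the minor part the lower 32 bits of the 64-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: upper 32 bits = major (qgroup), lower 32 bits = minor. */
#define SEQ_MINOR_BITS 32
#define SEQ_MINOR_MASK ((UINT64_C(1) << SEQ_MINOR_BITS) - 1)

uint32_t seq_major(uint64_t seq)
{
    return (uint32_t)(seq >> SEQ_MINOR_BITS);
}

uint32_t seq_minor(uint64_t seq)
{
    return (uint32_t)(seq & SEQ_MINOR_MASK);
}

/* Start a new major step: clear the minor part, then bump the major part. */
uint64_t seq_inc_major(uint64_t seq)
{
    seq &= ~SEQ_MINOR_MASK;
    return seq + (UINT64_C(1) << SEQ_MINOR_BITS);
}

/* Record one tree modification: only the minor part advances. */
uint64_t seq_inc_minor(uint64_t seq)
{
    return seq + 1;
}

/* Go back exactly one major step, dropping any minor increments. */
uint64_t seq_prev_major(uint64_t seq)
{
    assert(seq_major(seq) >= 1);
    return (seq & ~SEQ_MINOR_MASK) - (UINT64_C(1) << SEQ_MINOR_BITS);
}
```

Because both parts live in one integer, an ordinary `<` or `>` comparison still orders two sequence numbers correctly, which is why the comparison code elsewhere does not need to change.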
    • S
      Btrfs: set UUID in root_item for created trees · 6463fe58
      Stefan Behrens committed
      It is a rare exception that a new tree is created, like the qgroups
      tree. So far these new trees have an all-zero UUID in their root
      items. All trees that mkfs.btrfs has created get a UUID during the
      first mount, when btrfs_read_root_item() rewrites the root_item to
      the v2 structure style. These UUIDs have not been used so far, but
      since it is better to keep things uniform for all trees, this
      commit adds some lines that generate and write a UUID for newly
      created trees.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      6463fe58
    • S
    • J
      Btrfs: various abort cleanups · 54067ae9
      Josef Bacik committed
      I have a broken file system that when it aborts leaves all sorts of accounting
      things wrong and gives you lots of WARN_ON()'s other than the abort.  This is
      because we're not cleaning up various parts of the file system when we abort.
      The first chunks are specific to mount failures, we weren't cleaning up the
      block group cached inodes and we weren't cleaning up any transactions that had
      been aborted, which leaves a bunch of things laying around.
      
      The second half of this are related to the cleanup parts.  First we don't need
      to release space for the dirty pages from the trans_block_rsv, that's all
      handled by the trans handles so this is just plain wrong.  The other thing is we
      need to pin down extents that were set ->must_insert_reserved for delayed refs.
      This isn't so much for the pinning but more for the cleaning up the
      cache->reserved counter since we are no longer going to use those reserved
      bytes.  With this patch I no longer see a bunch of WARN_ON()'s when I try to
      mount this broken file system, just the initial one from the abort.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      54067ae9
    • J
      Btrfs: cleanup destroy_marked_extents · fd8b2b61
      Josef Bacik committed
      We can just look up the extent_buffers for the range and free stuff that way.
      This makes the cleanup a bit cleaner and we can make sure to evict the
      extent_buffers pretty quickly by marking them as stale.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fd8b2b61
    • J
      Btrfs: cleanup fs roots if we fail to mount · 171f6537
      Josef Bacik committed
      We can run the tree logging recovery or the orphan cleanup on mount, so we'll
      end up looking up a random fs tree in the meantime.  So we need to clean this up
      so we don't leave extent buffers hanging around on the cache.  With this patch
      we no longer leak extent buffers on failure to mount.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      171f6537
    • J
      Btrfs: fix all callers of read_tree_block · 416bc658
      Josef Bacik committed
      We kept leaking extent buffers when mounting a broken file system and it turns
      out it's because not everybody uses read_tree_block properly.  You need to check
      and make sure the extent_buffer is uptodate before you use it.  This patch fixes
      everybody who calls read_tree_block directly to make sure they check that it is
      uptodate and free it and return an error if it is not.  With this we no longer
      leak EB's when things go horribly wrong.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      416bc658
    • J
      Btrfs: add tree block level sanity check · 1c24c3ce
      Josef Bacik committed
      With a user's corrupted fs I was getting weird behavior and panics and it turns
      out it was because one of his tree blocks had a bogus header level.  So add this
      to the sanity checks in the endio handler for tree blocks.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      1c24c3ce
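A check of this kind might look like the sketch below. It is an assumption-laden illustration: `check_tree_block_level` and `SKETCH_MAX_LEVEL` are hypothetical names, and the bound of 8 levels is assumed here (believed to correspond to BTRFS_MAX_LEVEL, but not confirmed by the commit message).

```c
#include <errno.h>
#include <stdint.h>

/* Assumed upper bound on tree height; valid levels are 0 .. 7. */
#define SKETCH_MAX_LEVEL 8

/*
 * Reject a tree block whose header level is out of range, so that later
 * code never walks a tree with a bogus height.
 * Returns 0 when the level is plausible, -EIO otherwise.
 */
int check_tree_block_level(uint8_t level)
{
    if (level >= SKETCH_MAX_LEVEL)
        return -EIO;
    return 0;
}
```

Running such a check in the read completion path means a corrupted block is turned into an I/O error at the earliest point, before any caller trusts its contents.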
    • J
      Btrfs: don't call readahead hook until we have read the entire eb · 79fb65a1
      Josef Bacik committed
      Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
      map.  Turns out he is using > PAGESIZE leafsizes.  The readahead stuff is called
      every time we do a completion, but we may not have finished reading in all the
      pages, so the bytenr we read off the node could be completely bogus.  Fix this
      by only calling the readahead hook once all pages have been read in.  Thanks,
      Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      79fb65a1
    • J
      Btrfs: don't force pages under writeback to finish when aborting · b8d7f3ac
      Josef Bacik committed
      Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
      This happened because we unconditionally call end_page_writeback() in the endio
      case, which is right.  However when we abort the transaction we will call
      end_page_writeback() on any writeback pages we find, which is wrong.  We need to
      lock the page and wait on page writeback to complete if it is.  There is nothing
      unsafe about this since we are discarding the transaction anyway.  Thanks,
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b8d7f3ac
    • M
      Btrfs: use a lock to protect incompat/compat flag of the super block · ceda0864
      Miao Xie committed
      The following case will cause an incompat/compat flag of the super block
      to be lost.
       Task1					|Task2
       flags = btrfs_super_incompat_flags();	|
      					|flags = btrfs_super_incompat_flags();
       flags |= new_flag1;			|
      					|flags |= new_flag2;
       btrfs_set_super_incompat_flags(flags);	|
      					|btrfs_set_super_incompat_flags(flags);
      new_flag1 is lost, because Task2 overwrites it with its stale copy.
      
      In order to avoid this problem, we introduce a lock named super_lock into
      the btrfs_fs_info structure. If we want to update incompat/compat flags
      of the super block, we must hold it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      ceda0864
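The interleaving above can be replayed deterministically in userspace to show the lost update, next to a locked read-modify-write that avoids it. This is a sketch under assumptions: `struct fake_fs_info` and all functions are hypothetical, and a pthread mutex stands in for whatever lock type the kernel patch actually uses for `super_lock`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fs info holding the super block flags. */
struct fake_fs_info {
    pthread_mutex_t super_lock;
    uint64_t incompat_flags;
};

/*
 * Replay the racy interleaving in a single thread: both tasks read the
 * old value before either one writes, so the second write drops flag1.
 */
uint64_t replay_lost_update(struct fake_fs_info *fs,
                            uint64_t flag1, uint64_t flag2)
{
    uint64_t t1 = fs->incompat_flags;  /* Task1 reads */
    uint64_t t2 = fs->incompat_flags;  /* Task2 reads the same old value */

    t1 |= flag1;
    t2 |= flag2;
    fs->incompat_flags = t1;           /* Task1 writes */
    fs->incompat_flags = t2;           /* Task2 overwrites, flag1 is gone */
    return fs->incompat_flags;
}

/* Fixed pattern: the whole read-modify-write happens under the lock. */
void set_incompat_flag(struct fake_fs_info *fs, uint64_t flag)
{
    assert(fs != NULL);
    pthread_mutex_lock(&fs->super_lock);
    fs->incompat_flags |= flag;
    pthread_mutex_unlock(&fs->super_lock);
}
```

With the lock held across the read and the write, a second updater cannot observe the flags between the two steps, so both flags end up set regardless of scheduling.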
    • W
      Btrfs: introduce a mutex lock for btrfs quota operations · f2f6ed3d
      Wang Shilong committed
      The original code has one spin_lock, 'qgroup_lock', to protect quota
      configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
      it is first added to the B-tree, and then the configurations in memory
      are updated. However, a race condition may happen between these operations.
      For example:
      	->add_qgroup_info_item()
      		->add_qgroup_rb()
      
      For the above case, del_qgroup_info_item() may happen just before
      add_qgroup_rb().
      
      What's worse, when we want to add a qgroup relation:
      	->add_qgroup_relation_item()
      		->add_qgroup_relations()
      
      We don't have any checks whether 'src' and 'dst' exist before
      add_qgroup_relation_item(), a race condition can also happen for
      the above case.
      
      To avoid race condition and have all the necessary checks, we introduce
      a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
      protected by the mutex lock.
      Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      f2f6ed3d
    • J
      Btrfs: fix bad extent logging · 09a2a8f9
      Josef Bacik committed
      A user sent me a btrfs-image of a file system that was panicking on mount during
      the log recovery.  I had originally thought these problems were from a bug in
      the free space cache code, but that was just a symptom of the problem.  The
      problem is if your application does something like this
      
      [prealloc][prealloc][prealloc]
      
      the internal extent maps will merge those all together into one extent map, even
      though on disk they are 3 separate extents.  So if you go to write into one of
      these ranges the extent map will be right since we use the physical extent when
      doing the write, but when we log the extents they will use the wrong sizes for
      the remainder prealloc space.  If this doesn't happen to trip up the free space
      cache (which it won't in a lot of cases) then you will get bogus entries in your
      extent tree which will screw stuff up later.  The data and such will still work,
      but everything else is broken.  This patch fixes this by not allowing extents
      that are on the modified list to be merged.  This has the side effect that we
      are no longer adding everything to the modified list all the time, which means
      we now have to call btrfs_drop_extents every time we log an extent into the
      tree.  So this allows me to drop all this speciality code I was using to get
      around calling btrfs_drop_extents.  With this patch the testcase I've created no
      longer creates a bogus file system after replaying the log.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      09a2a8f9
    • D
      btrfs: clean snapshots one by one · 9d1a2a3a
      David Sterba committed
      Each time pick one dead root from the list and let the caller know if
      it's needed to continue. This should improve responsiveness during
      umount and balance which at some point waits for cleaning all currently
      queued dead roots.
      
      A new dead root is added to the end of the list, so the snapshots
      disappear in the order of deletion.
      
      The snapshot cleaning work is now done only from the cleaner thread and the
      others wake it if needed.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      9d1a2a3a
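The one-at-a-time cleaning scheme can be sketched with a simple FIFO. Everything here is hypothetical (`struct dead_root`, `struct dead_list`, and both functions are illustrative names), and the return convention is an assumption: 1 means a root was cleaned and the caller may call again, 0 means the list was empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked FIFO of dead roots awaiting cleanup. */
struct dead_root {
    int id;
    struct dead_root *next;
};

struct dead_list {
    struct dead_root *head;
    struct dead_root *tail;
};

/* New dead roots go to the tail, so they are cleaned in deletion order. */
void dead_list_add(struct dead_list *l, struct dead_root *r)
{
    assert(l != NULL && r != NULL);
    r->next = NULL;
    if (l->tail)
        l->tail->next = r;
    else
        l->head = r;
    l->tail = r;
}

/*
 * Clean at most one dead root per call and report its id.
 * Returns 1 if a root was cleaned, 0 if there was nothing to do.
 */
int clean_one_dead_root(struct dead_list *l, int *cleaned_id)
{
    struct dead_root *r = l->head;

    if (!r)
        return 0;
    l->head = r->next;
    if (!l->head)
        l->tail = NULL;
    *cleaned_id = r->id;  /* real code would drop the snapshot here */
    return 1;
}
```

Returning after each root gives the caller a natural point to check whether it should stop, for example because an unmount is pending, rather than being stuck until the whole queue is drained.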
    • L
      Btrfs: share stop worker code · 7abadb64
      Liu Bo committed
      Share the identical code for stopping workers.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      7abadb64
    • J
      Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Josef Bacik committed
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3173a18f
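The 51 and 33 byte figures can be reproduced with a small size calculation. The packed structs below are assumptions about the on-disk layouts (an extent item header, a tree block info holding a key plus a level, and one inline ref); the struct and function names are illustrative, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed on-disk layouts, packed so that no padding is inserted. */
struct disk_key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
} __attribute__((packed));

struct extent_item {
    uint64_t refs;
    uint64_t generation;
    uint64_t flags;
} __attribute__((packed));

struct tree_block_info {
    struct disk_key key;  /* first key of the tree block */
    uint8_t level;
} __attribute__((packed));

struct extent_inline_ref {
    uint8_t type;
    uint64_t offset;
} __attribute__((packed));

static_assert(sizeof(struct disk_key) == 17, "key is 8 + 1 + 8 bytes");

/* Old format: header + tree block info + one inline ref. */
size_t old_ref_size(void)
{
    return sizeof(struct extent_item) + sizeof(struct tree_block_info) +
           sizeof(struct extent_inline_ref);
}

/* New format: the tree block info is gone; the level moves into the key. */
size_t new_ref_size(void)
{
    return sizeof(struct extent_item) + sizeof(struct extent_inline_ref);
}
```

Under these assumptions the old item is 24 + 18 + 9 = 51 bytes and the new one is 24 + 9 = 33 bytes, a saving of 18 bytes, or roughly 35% per tree block reference.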
    • L
      Btrfs: use helper to cleanup tree roots · be283b2e
      Liu Bo committed
      free_root_pointers() has been introduced to cleanup all of tree roots,
      so just use it instead.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      be283b2e
    • L
      Btrfs: cleanup unused arguments of btrfs_csum_data · b0496686
      Liu Bo committed
      Argument 'root' is no longer used in btrfs_csum_data().
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b0496686
  5. 22 Mar, 2013 2 commits
  6. 05 Mar, 2013 1 commit
    • M
      Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails · aec8030a
      Miao Xie committed
      There are several bugs at error path of create_snapshot() when the
      transaction commitment failed.
      - access the freed transaction handler. At the end of the
        transaction commitment, the transaction handler was freed, so we
        should not access it after the transaction commitment.
      - we were not aware of the error which happened during the snapshot
        creation if we submitted an async transaction commitment.
      - pending snapshot access vs pending snapshot free. when something
        wrong happened after we submitted an async transaction commitment,
        the transaction committer would cleanup the pending snapshots and
        free them. But the snapshot creators were not aware of it, they
        would access the freed pending snapshots.
      
      This patch fixes the above problems by:
      - remove the dangerous code that accessed the freed handler
      - assign ->error if the error happens during the snapshot creation
      - the transaction committer doesn't free the pending snapshots,
        just assigns the error number and evicts them before we unblock
        the transaction.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      aec8030a
  7. 01 Mar, 2013 1 commit
    • D
      btrfs: try harder to allocate raid56 stripe cache · 83c8266a
      David Sterba committed
      The stripe hash table is large, starting with allocation order 4 and can go as
      high as order 7 in case lock debugging is turned on and structure padding
      happens.
      
      Observed mount failure:
      
      mount: page allocation failure: order:7, mode:0x200050
      Pid: 8234, comm: mount Tainted: G        W    3.8.0-default+ #267
      Call Trace:
       [<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
       [<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
       [<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
       [<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
       [<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
       [<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
       [<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
       [<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
       [<ffffffff81397289>] ? string+0x49/0xe0
       [<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
       [<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
       [<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
       [<ffffffff81162b90>] mount_fs+0x20/0xe0
       [<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
       [<ffffffff811801b6>] do_mount+0x386/0x980
       [<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
       [<ffffffff81180840>] sys_mount+0x90/0xe0
       [<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      83c8266a