1. 07 5月, 2013 13 次提交
    • J
      Btrfs: cleanup fs roots if we fail to mount · 171f6537
      Josef Bacik 提交于
      We can run the tree logging recovery or the orphan cleanup on mount, so we'll
      end up looking up a random fs tree in the meantime.  So we need to clean this up
      so we don't leave extent buffers hanging around on the cache.  With this patch
      we no longer leak extent buffers on failure to mount.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      171f6537
    • J
      Btrfs: fix all callers of read_tree_block · 416bc658
      Josef Bacik 提交于
      We kept leaking extent buffers when mounting a broken file system and it turns
      out it's because not everybody uses read_tree_block properly.  You need to check
      and make sure the extent_buffer is uptodate before you use it.  This patch fixes
      everybody who calls read_tree_block directly to make sure they check that it is
      uptodate and free it and return an error if it is not.  With this we no longer
      leak EB's when things go horribly wrong.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      416bc658
    • J
      Btrfs: add tree block level sanity check · 1c24c3ce
      Josef Bacik 提交于
      With a users corrupted fs I was getting weird behavior and panics and it turns
      out it was because one of his tree blocks had a bogus header level.  So add this
      to the sanity checks in the endio handler for tree blocks.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      1c24c3ce
    • J
      Btrfs: don't call readahead hook until we have read the entire eb · 79fb65a1
      Josef Bacik 提交于
      Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
      map.  Turns out he is using > PAGESIZE leafsizes.  The readahead stuff is called
      every time we do a completion, but we may not have finished reading in all the
      pages, so the bytenr we read off the node could be completely bogus.  Fix this
      by only calling the readahead hook once all pages have been read in.  Thanks,
      Reported-by: NMartin Steigerwald <Martin@lichtvoll.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      79fb65a1
    • J
      Btrfs: don't force pages under writeback to finish when aborting · b8d7f3ac
      Josef Bacik 提交于
      Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
      This happened because we unconditionally call end_page_writeback() in the endio
      case, which is right.  However when we abort the transaction we will call
      end_page_writeback() on any writeback pages we find, which is wrong.  We need to
      lock the page and wait on page writeback to complete if it is.  There is nothing
      unsafe about this since we are discarding the transaction anyway.  Thanks,
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      b8d7f3ac
    • M
      Btrfs: use a lock to protect incompat/compat flag of the super block · ceda0864
      Miao Xie 提交于
      The following case will make the incompat/compat flag of the super block
      be recovered.
       Task1					|Task2
       flags = btrfs_super_incompat_flags();	|
      					|flags = btrfs_super_incompat_flags();
       flags |= new_flag1;			|
      					|flags |= new_flag2;
       btrfs_set_super_incompat_flags(flags);	|
      					|btrfs_set_super_incompat_flags(flags);
      the new_flag1 is recovered.
      
      In order to avoid this problem, we introduce a lock named super_lock into
      the btrfs_fs_info structure. If we want to update incompat/compat flags
      of the super block, we must hold it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ceda0864
    • W
      Btrfs: introduce a mutex lock for btrfs quota operations · f2f6ed3d
      Wang Shilong 提交于
      The original code has one spin_lock 'qgroup_lock' to protect quota
      configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
      it will be added to Btree firstly, and then update configurations in
      memory,however, a race condition may happen between these operations.
      For example:
      	->add_qgroup_info_item()
      		->add_qgroup_rb()
      
      For the above case, del_qgroup_info_item() may happen just before
      add_qgroup_rb().
      
      What's worse, when we want to add a qgroup relation:
      	->add_qgroup_relation_item()
      		->add_qgroup_relations()
      
      We don't have any checks whether 'src' and 'dst' exist before
      add_qgroup_relation_item(), a race condition can also happen for
      the above case.
      
      To avoid race condition and have all the necessary checks, we introduce
      a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
      protected by the mutex lock.
      Signed-off-by: NWang Shilong <wangsl-fnst@cn.fujitsu.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f2f6ed3d
    • J
      Btrfs: fix bad extent logging · 09a2a8f9
      Josef Bacik 提交于
      A user sent me a btrfs-image of a file system that was panicing on mount during
      the log recovery.  I had originally thought these problems were from a bug in
      the free space cache code, but that was just a symptom of the problem.  The
      problem is if your application does something like this
      
      [prealloc][prealloc][prealloc]
      
      the internal extent maps will merge those all together into one extent map, even
      though on disk they are 3 separate extents.  So if you go to write into one of
      these ranges the extent map will be right since we use the physical extent when
      doing the write, but when we log the extents they will use the wrong sizes for
      the remainder prealloc space.  If this doesn't happen to trip up the free space
      cache (which it won't in a lot of cases) then you will get bogus entries in your
      extent tree which will screw stuff up later.  The data and such will still work,
      but everything else is broken.  This patch fixes this by not allowing extents
      that are on the modified list to be merged.  This has the side effect that we
      are no longer adding everything to the modified list all the time, which means
      we now have to call btrfs_drop_extents every time we log an extent into the
      tree.  So this allows me to drop all this speciality code I was using to get
      around calling btrfs_drop_extents.  With this patch the testcase I've created no
      longer creates a bogus file system after replaying the log.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      09a2a8f9
    • D
      btrfs: clean snapshots one by one · 9d1a2a3a
      David Sterba 提交于
      Each time pick one dead root from the list and let the caller know if
      it's needed to continue. This should improve responsiveness during
      umount and balance which at some point waits for cleaning all currently
      queued dead roots.
      
      A new dead root is added to the end of the list, so the snapshots
      disappear in the order of deletion.
      
      The snapshot cleaning work is now done only from the cleaner thread and the
      others wake it if needed.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      9d1a2a3a
    • L
      Btrfs: share stop worker code · 7abadb64
      Liu Bo 提交于
      Share the exactly same code of stopping workers.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7abadb64
    • J
      Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Josef Bacik 提交于
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3173a18f
    • L
      Btrfs: use helper to cleanup tree roots · be283b2e
      Liu Bo 提交于
      free_root_pointers() has been introduced to cleanup all of tree roots,
      so just use it instead.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      be283b2e
    • L
      Btrfs: cleanup unused arguments of btrfs_csum_data · b0496686
      Liu Bo 提交于
      Argument 'root' is no more used in btrfs_csum_data().
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      b0496686
  2. 22 3月, 2013 2 次提交
  3. 05 3月, 2013 1 次提交
    • M
      Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails · aec8030a
      Miao Xie 提交于
      There are several bugs at error path of create_snapshot() when the
      transaction commitment failed.
      - access the freed transaction handler. At the end of the
        transaction commitment, the transaction handler was freed, so we
        should not access it after the transaction commitment.
      - we were not aware of the error which happened during the snapshot
        creation if we submitted a async transaction commitment.
      - pending snapshot access vs pending snapshot free. when something
        wrong happened after we submitted a async transaction commitment,
        the transaction committer would cleanup the pending snapshots and
        free them. But the snapshot creators were not aware of it, they
        would access the freed pending snapshots.
      
      This patch fixes the above problems by:
      - remove the dangerous code that accessed the freed handler
      - assign ->error if the error happens during the snapshot creation
      - the transaction committer doesn't free the pending snapshots,
        just assigns the error number and evicts them before we unblock
        the transaction.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      aec8030a
  4. 01 3月, 2013 2 次提交
    • D
      btrfs: try harder to allocate raid56 stripe cache · 83c8266a
      David Sterba 提交于
      The stripe hash table is large, starting with allocation order 4 and can go as
      high as order 7 in case lock debugging is turned on and structure padding
      happens.
      
      Observed mount failure:
      
      mount: page allocation failure: order:7, mode:0x200050
      Pid: 8234, comm: mount Tainted: G        W    3.8.0-default+ #267
      Call Trace:
       [<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
       [<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
       [<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
       [<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
       [<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
       [<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
       [<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
       [<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
       [<ffffffff81397289>] ? string+0x49/0xe0
       [<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
       [<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
       [<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
       [<ffffffff81162b90>] mount_fs+0x20/0xe0
       [<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
       [<ffffffff811801b6>] do_mount+0x386/0x980
       [<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
       [<ffffffff81180840>] sys_mount+0x90/0xe0
       [<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      83c8266a
    • L
      Btrfs: fix memory leak of log roots · 3321719e
      Liu Bo 提交于
      When we abort a transaction while fsyncing, we'll skip freeing log roots
      part of committing a transaction, which leads to memory leak.
      
      This adds a 'free log roots' in putting super when no more users hold
      references on log roots, so it's safe and clean.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3321719e
  5. 21 2月, 2013 11 次提交
    • Z
      btrfs: define BTRFS_MAGIC as a u64 value · cdb4c574
      Zach Brown 提交于
      super.magic is an le64 but it's treated as an unterminated string when
      compared against BTRFS_MAGIC which is defined as a string.  Instead
      define BTRFS_MAGIC as a normal hex value and use endian helpers to
      compare it to the super's magic.
      
      I tested this by mounting an fs made before the change and made sure
      that it didn't introduce sparse errors.  This matches a similar cleanup
      that is pending in btrfs-progs.  David Sterba pointed out that we should
      fix the kernel side as well :).
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      cdb4c574
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
    • M
      Btrfs: fix the race between bio and btrfs_stop_workers · 2b8195bb
      Miao Xie 提交于
      open_ctree() need read the metadata to initialize the global information
      of btrfs. But it may fail after it submit some bio, and then it will jump
      to the error path. Unfortunately, it doesn't check if there are some bios
      in flight, and just stop all the worker threads. As a result, when the
      submitted bios end, they can not find any worker thread which can deal with
      subsequent work, then oops happen.
      
      kernel BUG at fs/btrfs/async-thread.c:605!
      
      Fix this problem by invoking invalidate_inode_pages2() before we stop the
      worker threads. This function will wait until the bio end because it need
      lock the pages which are going to be invalidated, and if a page is under
      disk read IO, it must be locked. invalidate_inode_pages2() need wait until
      end bio handler to unlocked it.
      Reported-and-Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2b8195bb
    • J
      Btrfs: fix how we discard outstanding ordered extents on abort · 779880ef
      Josef Bacik 提交于
      When we abort we've been just free'ing up all the ordered extents and
      hoping for the best.  This results in lots of warnings from various places,
      warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't
      fixed.  It will also screw up lots of pages who have been set private but
      never get cleared because the ordered extents are never allowed to be
      submitted.  This patch fixes those warnings.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      779880ef
    • J
      Btrfs: fix freeing delayed ref head while still holding its mutex · eb12db69
      Josef Bacik 提交于
      I hit this error when reproducing a bug that would end in a transaction
      abort.  We take the delayed ref head's mutex to keep anybody from processing
      it while we're destroying it, but we fail to drop the mutex before we carry
      on and free the damned thing.  Fix this by doing the remove logic for the
      head ourselves and unlock the mutex, that way we can avoid use after free's
      or hung tasks waiting on that mutex to come back so they know the delayed
      ref completed.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      eb12db69
    • E
      btrfs: list_entry can't return NULL · d1d3cd27
      Eric Sandeen 提交于
      No need to test the result, we can't get a
      null pointer from list_entry()
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      d1d3cd27
    • M
      Btrfs: use bit operation for ->fs_state · 87533c47
      Miao Xie 提交于
      There is no lock to protect fs_info->fs_state, it will introduce
      some problems, such as the value may be covered by the other task
      when several tasks modify it. For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact, it is 2.
      
      Though this problem doesn't happen now (because there is only one
      flag currently), the code is error prone, if we add other flags,
      the above problem will happen to a certainty.
      
      Now we use bit operation for it to fix the above problem.
      In this way, we can make the code more robust and be easy to
      add new flags.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      87533c47
    • M
      Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits · de98ced9
      Miao Xie 提交于
      There is no lock to protect
        fs_info->avail_{data, metadata, system}_alloc_bits,
      it may introduce some problem, such as the wrong profile
      information, so we add a seqlock to protect them.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      de98ced9
    • M
      Btrfs: use the inode own lock to protect its delalloc_bytes · df0af1a5
      Miao Xie 提交于
      We need not use a global lock to protect the delalloc_bytes of the
      inode, just use its own lock. In this way, we can reduce the lock
      contention and ->delalloc_lock will just protect delalloc inode
      list.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      df0af1a5
    • M
      Btrfs: use percpu counter for fs_info->delalloc_bytes · 963d678b
      Miao Xie 提交于
      fs_info->delalloc_bytes is accessed very frequently, so use percpu
      counter instead of the u64 variant for it to reduce the lock
      contention.
      
      This patch also fixed the problem that we access the variant
      without the lock protection.At worst, we would not flush the
      delalloc inodes, and just return ENOSPC error when we still have
      some free space in the fs.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      963d678b
    • M
      Btrfs: use percpu counter for dirty metadata count · e2d84521
      Miao Xie 提交于
      ->dirty_metadata_bytes is accessed very frequently, so use percpu
      counter instead of the u64 variant to reduce the contention of
      the lock.
      
      This patch also fixed the problem that we access it without
      lock protection in __btrfs_btree_balance_dirty(), which may
      cause we skip the dirty pages flush.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      e2d84521
  6. 20 2月, 2013 4 次提交
  7. 02 2月, 2013 1 次提交
    • D
      Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse 提交于
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet)
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      53b381b3
  8. 17 12月, 2012 4 次提交
  9. 13 12月, 2012 2 次提交
    • S
      Btrfs: change core code of btrfs to support the device replace operations · 8dabb742
      Stefan Behrens 提交于
      This commit contains all the essential changes to the core code
      of Btrfs for support of the device replace procedure.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      8dabb742
    • S
      Btrfs: handle errors from btrfs_map_bio() everywhere · 61891923
      Stefan Behrens 提交于
      With the addition of the device replace procedure, it is possible
      for btrfs_map_bio(READ) to report an error. This happens when the
      specific mirror is requested which is located on the target disk,
      and the copy operation has not yet copied this block. Hence the
      block cannot be read and this error state is indicated by
      returning EIO.
      Some background information follows now. A new mirror is added
      while the device replace procedure is running.
      btrfs_get_num_copies() returns one more, and
      btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
      location is involved that was already handled by the device
      replace copy operation. The assigned mirror num is the highest
      mirror number, e.g. the value 3 in case of RAID1.
      If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
      any mirror), the copy on the target drive is never selected
      because that disk shall be able to perform the write requests as
      quickly as possible. The parallel execution of read requests would
      only slow down the disk copy procedure. Second case is that
      btrfs_map_bio() is called with mirror_num > 0. This is done from
      the repair code only. In this case, the highest mirror num is
      assigned to the target disk, since it is used last. And when this
      mirror is not available because the copy procedure has not yet
      handled this area, an error is returned. Everywhere in the code
      the handling of such errors is added now.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      61891923