1. 05 Mar 2013, 1 commit
    • Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails · aec8030a
      Committed by Miao Xie
      There are several bugs in the error path of create_snapshot() when the
      transaction commit fails:
      - The freed transaction handle is accessed. The handle is freed at the
        end of the transaction commit, so it must not be touched after the
        commit.
      - If an asynchronous transaction commit was submitted, we were not
        aware of errors that happened during the snapshot creation.
      - Pending snapshot access races with pending snapshot free. If
        something went wrong after we submitted an asynchronous transaction
        commit, the transaction committer would clean up the pending
        snapshots and free them, but the snapshot creators were not aware of
        that and would access the freed pending snapshots.
      
      This patch fixes the above problems by:
      - removing the dangerous code that accessed the freed handle
      - assigning ->error if an error happens during the snapshot creation
      - making the transaction committer stop freeing the pending snapshots;
        it only assigns the error number and evicts them before we unblock
        the transaction.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      aec8030a
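
      A minimal sketch of the third point, with purely hypothetical names (this
      is not the btrfs code): the committer only records the error on the pending
      item and wakes the creator; the creator still owns the item, reads the
      result, and frees it itself.

      #include <linux/list.h>
      #include <linux/completion.h>
      #include <linux/slab.h>

      struct pending_snapshot_sketch {
      	struct list_head list;
      	int error;			/* set by the committer on failure */
      	struct completion done;
      };

      /* Committer side: never frees the item, only reports the failure. */
      static void fail_pending(struct pending_snapshot_sketch *p, int err)
      {
      	p->error = err;
      	list_del_init(&p->list);
      	complete(&p->done);
      }

      /* Creator side: waits, then reads the result from memory it still owns. */
      static int wait_for_snapshot(struct pending_snapshot_sketch *p)
      {
      	int ret;

      	wait_for_completion(&p->done);
      	ret = p->error;
      	kfree(p);			/* the creator, not the committer, frees it */
      	return ret;
      }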
  2. 01 Mar 2013, 2 commits
    • btrfs: try harder to allocate raid56 stripe cache · 83c8266a
      Committed by David Sterba
      The stripe hash table is large, starting at allocation order 4, and it can go
      as high as order 7 when lock debugging is turned on and structure padding
      happens.
      
      Observed mount failure:
      
      mount: page allocation failure: order:7, mode:0x200050
      Pid: 8234, comm: mount Tainted: G        W    3.8.0-default+ #267
      Call Trace:
       [<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
       [<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
       [<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
       [<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
       [<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
       [<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
       [<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
       [<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
       [<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
       [<ffffffff81397289>] ? string+0x49/0xe0
       [<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
       [<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
       [<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
       [<ffffffff81162b90>] mount_fs+0x20/0xe0
       [<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
       [<ffffffff811801b6>] do_mount+0x386/0x980
       [<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
       [<ffffffff81180840>] sys_mount+0x90/0xe0
       [<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
       Signed-off-by: David Sterba <dsterba@suse.cz>
       Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      83c8266a
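
      A minimal sketch of the usual "try kmalloc, fall back to vmalloc" pattern
      that this kind of fix relies on; the function names below are illustrative,
      not the btrfs ones.

      #include <linux/slab.h>
      #include <linux/vmalloc.h>
      #include <linux/mm.h>

      /* Try the physically contiguous allocator first; a high-order failure is
       * expected to be possible, so suppress the warning and fall back to
       * vmalloc, which only needs order-0 pages. */
      static void *alloc_big_table(size_t size)
      {
      	void *table = kzalloc(size, GFP_NOFS | __GFP_NOWARN);

      	if (!table)
      		table = vzalloc(size);
      	return table;
      }

      static void free_big_table(void *table)
      {
      	if (is_vmalloc_addr(table))
      		vfree(table);
      	else
      		kfree(table);
      }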
    • Btrfs: fix memory leak of log roots · 3321719e
      Committed by Liu Bo
      When we abort a transaction while fsyncing, we skip the freeing of log
      roots that is normally done as part of committing a transaction, which
      leads to a memory leak.
      
      This adds a 'free log roots' step when putting the super, run once no
      more users hold references on the log roots, so it is safe and clean.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3321719e
  3. 21 Feb 2013, 11 commits
    • btrfs: define BTRFS_MAGIC as a u64 value · cdb4c574
      Committed by Zach Brown
      super.magic is an le64, but it is treated as an unterminated string when
      compared against BTRFS_MAGIC, which is defined as a string.  Instead,
      define BTRFS_MAGIC as a normal hex value and use the endian helpers to
      compare it with the super's magic.
      
      I tested this by mounting an fs made before the change and made sure
      that it didn't introduce sparse errors.  This matches a similar cleanup
      that is pending in btrfs-progs.  David Sterba pointed out that we should
      fix the kernel side as well :).
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      cdb4c574
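
      A minimal sketch of the idea: the on-disk magic is a little-endian u64, so
      it is compared through an endian helper rather than as a string.  The
      constant below is my recollection of the upstream value (the ASCII bytes of
      "_BHRfS_M" read as a little-endian integer) and should be treated as an
      assumption; the helper name is illustrative.

      #include <linux/types.h>
      #include <asm/byteorder.h>

      /* ASCII "_BHRfS_M" read as a little-endian 64-bit integer (assumed value). */
      #define BTRFS_MAGIC_SKETCH	0x4D5F53665248425FULL

      /* Compare the raw on-disk field in CPU byte order via the endian helper
       * instead of memcmp() against a string constant. */
      static bool magic_matches(__le64 disk_magic)
      {
      	return le64_to_cpu(disk_magic) == BTRFS_MAGIC_SKETCH;
      }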
    • Btrfs: place ordered operations on a per transaction list · 569e0f35
      Committed by Josef Bacik
      Miao made the ordered operations stuff run async, which introduced a
      deadlock: somebody (sync) could race in and commit the transaction while
      a commit was already happening.  The new committer would try to flush
      ordered operations, which would hang waiting for the commit to finish,
      because the flush is done asynchronously and no longer inherits the
      caller's trans handle.  To fix this we need to make the ordered
      operations list a per-transaction list.  We can get new inodes added to
      the ordered operation list by truncating them and then having another
      process write to them, so this makes it so that anybody trying to add an
      ordered operation _must_ start a transaction in order to add itself to
      the list, which keeps new inodes from getting added to the ordered
      operations list after we start committing.  This should fix the deadlock
      and also keeps us from doing a lot more work than we need to during
      commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      569e0f35
    • Btrfs: fix the race between bio and btrfs_stop_workers · 2b8195bb
      Committed by Miao Xie
      open_ctree() needs to read the metadata to initialize the global
      information of btrfs.  But it may fail after it has submitted some bios,
      and then it jumps to the error path.  Unfortunately, it does not check
      whether any bios are still in flight and just stops all the worker
      threads.  As a result, when the submitted bios complete, they cannot
      find any worker thread to handle the subsequent work, and an oops
      happens:
      
      kernel BUG at fs/btrfs/async-thread.c:605!
      
      Fix this problem by invoking invalidate_inode_pages2() before we stop
      the worker threads.  This function waits until the bios complete: it
      needs to lock the pages that are going to be invalidated, a page under
      disk read I/O stays locked, and invalidate_inode_pages2() therefore has
      to wait for the end-bio handler to unlock it.
      Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      2b8195bb
    • Btrfs: fix how we discard outstanding ordered extents on abort · 779880ef
      Committed by Josef Bacik
      When we abort, we have just been freeing up all the ordered extents and
      hoping for the best.  This results in lots of warnings from various
      places: warnings from btrfs_destroy_inode() because its ENOSPC
      accounting isn't fixed up, and it also screws up lots of pages that have
      been set private but never get cleared, because the ordered extents are
      never allowed to be submitted.  This patch fixes those warnings.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      779880ef
    • Btrfs: fix freeing delayed ref head while still holding its mutex · eb12db69
      Committed by Josef Bacik
      I hit this error while reproducing a bug that would end in a transaction
      abort.  We take the delayed ref head's mutex to keep anybody from
      processing it while we're destroying it, but we fail to drop the mutex
      before we carry on and free the damned thing.  Fix this by doing the
      remove logic for the head ourselves and unlocking the mutex; that way we
      avoid use-after-frees and hung tasks waiting on that mutex to come back
      so they know the delayed ref completed.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      eb12db69
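
      A minimal, hypothetical sketch of the general pattern (not the btrfs code):
      do the teardown that needs the mutex, drop the mutex, and let reference
      counting free the object, so a waiter that was sleeping on the mutex can
      still wake up safely.

      #include <linux/kernel.h>
      #include <linux/mutex.h>
      #include <linux/kref.h>
      #include <linux/slab.h>

      struct ref_head_sketch {
      	struct mutex mutex;
      	struct kref refs;
      	/* ... bookkeeping removed under the mutex ... */
      };

      static void free_head(struct kref *kref)
      {
      	kfree(container_of(kref, struct ref_head_sketch, refs));
      }

      static void destroy_head(struct ref_head_sketch *head)
      {
      	mutex_lock(&head->mutex);
      	/* ... unlink the head from its trees/lists here ... */
      	mutex_unlock(&head->mutex);	/* never free while still holding it */
      	kref_put(&head->refs, free_head);	/* last reference does the kfree */
      }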
    • btrfs: list_entry can't return NULL · d1d3cd27
      Committed by Eric Sandeen
      No need to test the result; we can't get a NULL pointer back from
      list_entry().
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      d1d3cd27
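
      For context, a small illustrative example (hypothetical struct):
      list_entry() is just container_of(), i.e. pointer arithmetic on an embedded
      list_head, so for a valid node it can never evaluate to NULL and a NULL
      check on it is dead code.

      #include <linux/list.h>

      struct item_sketch {
      	int value;
      	struct list_head link;
      };

      /* Subtracts offsetof(struct item_sketch, link) from a known-valid
       * list_head pointer; there is no failure path that could yield NULL. */
      static struct item_sketch *first_item(struct list_head *head)
      {
      	return list_entry(head->next, struct item_sketch, link);
      }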
    • Btrfs: use bit operation for ->fs_state · 87533c47
      Committed by Miao Xie
      There is no lock protecting fs_info->fs_state, which can introduce
      problems such as one task's update being overwritten when several tasks
      modify it concurrently.  For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact it is 2.
      
      Though this problem doesn't happen now (because there is only one flag
      currently), the code is error prone; if we add other flags, the above
      problem is bound to happen.
      
      Now we use bit operations for it to fix the above problem.  This makes
      the code more robust and makes it easy to add new flags.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      87533c47
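
      A minimal sketch of the atomic bit-operation approach (the flag names below
      are illustrative, not the btrfs ones): set_bit() and test_bit() operate
      atomically on a word, so concurrent updates of different bits can no longer
      lose each other the way the plain OR above does.

      #include <linux/bitops.h>
      #include <linux/types.h>

      /* Illustrative flag bit numbers within one unsigned long of state. */
      #define STATE_ERROR		0
      #define STATE_REMOUNTING	1

      static unsigned long fs_state_sketch;

      static void mark_fs_error(void)
      {
      	/* atomic read-modify-write: a concurrent set of bit 1 cannot undo it */
      	set_bit(STATE_ERROR, &fs_state_sketch);
      }

      static bool fs_has_error(void)
      {
      	return test_bit(STATE_ERROR, &fs_state_sketch);
      }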
    • Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits · de98ced9
      Committed by Miao Xie
      There is no lock protecting
        fs_info->avail_{data, metadata, system}_alloc_bits,
      which may introduce problems such as readers seeing wrong profile
      information, so we add a seqlock to protect them.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      de98ced9
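
      A minimal sketch of the seqlock pattern (the variable names are
      illustrative): writers take the lock, readers retry if a write raced with
      them, which keeps the read side nearly free of contention.

      #include <linux/seqlock.h>
      #include <linux/types.h>

      static DEFINE_SEQLOCK(profile_lock);
      static u64 avail_alloc_bits_sketch;

      static void set_avail_bits(u64 bits)
      {
      	write_seqlock(&profile_lock);
      	avail_alloc_bits_sketch |= bits;
      	write_sequnlock(&profile_lock);
      }

      static u64 get_avail_bits(void)
      {
      	unsigned int seq;
      	u64 bits;

      	do {
      		seq = read_seqbegin(&profile_lock);
      		bits = avail_alloc_bits_sketch;
      	} while (read_seqretry(&profile_lock, seq));

      	return bits;
      }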
    • Btrfs: use the inode own lock to protect its delalloc_bytes · df0af1a5
      Committed by Miao Xie
      We need not use a global lock to protect the inode's delalloc_bytes;
      just use the inode's own lock.  In this way we reduce the lock
      contention, and ->delalloc_lock now only protects the delalloc inode
      list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      df0af1a5
    • Btrfs: use percpu counter for fs_info->delalloc_bytes · 963d678b
      Committed by Miao Xie
      fs_info->delalloc_bytes is accessed very frequently, so use a percpu
      counter instead of a plain u64 for it to reduce the lock contention.
      
      This patch also fixes the problem that we accessed the variable without
      lock protection: at worst, we would not flush the delalloc inodes and
      would return an ENOSPC error even though there was still free space in
      the fs.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      963d678b
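
      A minimal sketch of the percpu_counter pattern used here (and by the
      dirty-metadata counter in the next entry); the names are illustrative.
      Updates touch only a per-CPU slot, and only decision points pay for an
      exact sum.

      #include <linux/percpu_counter.h>

      /* Assumed to be initialised elsewhere with percpu_counter_init(). */
      static struct percpu_counter delalloc_bytes_sketch;

      /* Hot path: a cheap per-CPU add, no global lock or shared cacheline. */
      static void account_delalloc(s64 bytes)
      {
      	percpu_counter_add(&delalloc_bytes_sketch, bytes);
      }

      /* Decision point: percpu_counter_compare() reads the cheap approximate
       * value and only does the exact per-CPU sum when the result is
       * ambiguous, so we neither wrongly skip flushing nor report ENOSPC. */
      static bool over_threshold(s64 threshold)
      {
      	return percpu_counter_compare(&delalloc_bytes_sketch, threshold) >= 0;
      }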
    • Btrfs: use percpu counter for dirty metadata count · e2d84521
      Committed by Miao Xie
      ->dirty_metadata_bytes is accessed very frequently, so use a percpu
      counter instead of a plain u64 to reduce the contention on the lock.
      
      This patch also fixes the problem that we accessed it without lock
      protection in __btrfs_btree_balance_dirty(), which could cause us to
      skip the dirty page flush.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      e2d84521
  4. 20 Feb 2013, 4 commits
  5. 02 Feb 2013, 1 commit
    • Btrfs: RAID5 and RAID6 · 53b381b3
      Committed by David Woodhouse
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit; blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet).
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      53b381b3
  6. 17 Dec 2012, 4 commits
  7. 13 Dec 2012, 8 commits
  8. 12 Dec 2012, 2 commits
  9. 09 Oct 2012, 5 commits
    • Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer · 1037a5af
      Committed by Wang Sheng-Hui
      In csum_dirty_buffer, we first get the eb from page->private and then
      check whether the page is the first page of the eb.  Later we check it
      again.  Remove the repeated check.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      1037a5af
    • Btrfs: make filesystem read-only when submitting barrier fails · 5af3e8cc
      Committed by Stefan Behrens
      So far the return code of barrier_all_devices() is ignored, which means
      that barrier errors are ignored.  The result can be a corrupt,
      inconsistent filesystem.
      This commit adds code to evaluate the return code of
      barrier_all_devices().  The normal btrfs_error() mechanism is used to
      switch the filesystem into read-only mode when errors are detected.
      
      In order to decide whether barrier_all_devices() should return error or
      success, the number of disks that are allowed to fail the barrier
      submission is calculated.  This calculation accounts for the worst RAID
      level of metadata, system and data.  If single, dup or RAID0 is in use,
      a single disk error is already considered fatal.  Otherwise a single
      disk error is tolerated.
      
      The number of disks that are tolerated to fail the barrier operation is
      recalculated when the filesystem gets mounted, when a balance operation
      is started and finished, and when devices are added or removed.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      5af3e8cc
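
      A minimal, hypothetical sketch of the tolerance rule described above (the
      profile flags and function name are illustrative, not the btrfs symbols):
      single, dup and RAID0 tolerate no barrier failure, while the mirrored
      profiles tolerate one.

      #include <linux/types.h>

      /* Illustrative profile flags for the worst profile among data,
       * metadata and system chunks. */
      #define PROFILE_SINGLE	(1ULL << 0)
      #define PROFILE_DUP	(1ULL << 1)
      #define PROFILE_RAID0	(1ULL << 2)
      #define PROFILE_RAID1	(1ULL << 3)
      #define PROFILE_RAID10	(1ULL << 4)

      static int tolerated_barrier_failures(u64 worst_profile)
      {
      	if (worst_profile & (PROFILE_SINGLE | PROFILE_DUP | PROFILE_RAID0))
      		return 0;	/* any single disk failing the flush is fatal */
      	return 1;		/* raid1/raid10: one failed barrier is tolerated */
      }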
    • Btrfs: cache extent state when writing out dirty metadata pages · e6138876
      Committed by Josef Bacik
      Every time we write out dirty pages we search for an offset in the tree,
      convert the bits in the state, and then when we wait we search for the
      offset again and clear the bits.  So for every dirty range in the io
      tree we are doing 4 rb searches, which is suboptimal.  With this patch
      we are only doing 2 searches for every cycle (modulo weird things
      happening).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      e6138876
    • Btrfs: do not async metadata csumming in certain situations · de0022b9
      Committed by Josef Bacik
      There are a couple of scenarios where farming metadata csumming off to
      an async thread doesn't help.  The first is if our processor supports
      crc32c, in which case the csumming will be fast and the overhead of the
      async model is not worth the cost.  The other case is our tree log: we
      will be making that stuff dirty, writing it out, and waiting for it
      immediately.  Even with software crc32c this gives me a ~15% increase in
      speed with O_SYNC workloads.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      de0022b9
    • Btrfs: fix orphan transaction on the freezed filesystem · 354aa0fb
      Committed by Miao Xie
      With the following debug patch:
      
       static int btrfs_freeze(struct super_block *sb)
       {
      + 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
      +	struct btrfs_transaction *trans;
      +
      +	spin_lock(&fs_info->trans_lock);
      +	trans = fs_info->running_transaction;
      +	if (trans) {
      +		printk("Transid %llu, use_count %d, num_writer %d\n",
      +			trans->transid, atomic_read(&trans->use_count),
      +			atomic_read(&trans->num_writers));
      +	}
      +	spin_unlock(&fs_info->trans_lock);
       	return 0;
       }
      
      I found there was an orphan transaction after the freeze operation was
      done.
      
      This is because the transaction may not be committed when the
      transaction handle ends, even though it is the last handle of the
      current transaction.  This design avoids committing the transaction too
      frequently, but it also introduces the above problem.
      
      So I add btrfs_attach_transaction(), which can catch the current
      transaction and commit it.  If there is no transaction, it returns
      ENOENT and does nothing.
      
      This function can also be used instead of btrfs_join_transaction_freeze(),
      because it does not increase the writer counter and does not start a new
      transaction, so it also fixes the deadlock between sync and freeze.
      
      Besides that, it is used instead of btrfs_join_transaction() in
      transaction_kthread(), because if there is no transaction, the
      transaction kthread doesn't need to do anything.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      354aa0fb
  10. 04 Oct 2012, 2 commits