1. 17 Dec, 2012 6 commits
  2. 13 Dec, 2012 6 commits
  3. 12 Dec, 2012 2 commits
    • Btrfs: make delalloc inodes be flushed by multi-task · 8ccf6f19
      Miao Xie committed
      This patch introduces a new worker pool named "flush_workers". When we
      want to force all inodes with pending delalloc to disk, we can queue
      those inodes on the work queue of this worker pool so that they are
      flushed by multiple tasks.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: improve the noflush reservation · 08e007d2
      Miao Xie committed
      In some places (such as when evicting an inode), we cannot flush the reserved
      space of delalloc, but flushing the delayed directory index and delayed inode
      items is OK. Currently we do not try to flush those either and simply return
      when there is not enough space to reserve. This patch fixes this problem.

      We define 3 types of flush operation: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
      If we are in a transaction, we must not flush anything or a deadlock
      would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
      would cause a deadlock, use FLUSH_LIMIT. In all other cases, FLUSH_ALL is
      used and we flush everything.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
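The three flush types described above can be sketched as a small decision function. This is only an illustration of the rule stated in the commit message, not the patch itself: the enum and function names are hypothetical, and the two boolean inputs stand in for whatever the real callers know about their context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names mirroring the three flush types in the commit message. */
enum reserve_flush {
    RESERVE_NO_FLUSH,    /* inside a transaction: any flush could deadlock */
    RESERVE_FLUSH_LIMIT, /* delayed items may be flushed, delalloc space may not */
    RESERVE_FLUSH_ALL,   /* no restriction: flush everything */
};

/* Choose a flush type from two caller-supplied facts (illustrative only). */
enum reserve_flush pick_flush_type(bool in_transaction,
                                   bool delalloc_flush_may_deadlock)
{
    if (in_transaction)
        return RESERVE_NO_FLUSH;
    if (delalloc_flush_may_deadlock)
        return RESERVE_FLUSH_LIMIT;
    return RESERVE_FLUSH_ALL;
}

/* Only FLUSH_ALL is allowed to flush the reserved space of delalloc. */
bool may_flush_delalloc(enum reserve_flush type)
{
    switch (type) {
    case RESERVE_NO_FLUSH:
    case RESERVE_FLUSH_LIMIT:
        return false;
    case RESERVE_FLUSH_ALL:
        return true;
    }
    assert(!"unknown flush type");
    return false;
}
```

Under this sketch, an inode eviction path would pass `delalloc_flush_may_deadlock = true` and end up with the limited flush, which is the case the commit message singles out.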
  4. 26 Oct, 2012 1 commit
  5. 09 Oct, 2012 4 commits
  6. 04 Oct, 2012 1 commit
  7. 03 Oct, 2012 1 commit
  8. 02 Oct, 2012 15 commits
    • Btrfs: fix unnecessary warning when the fragments make the space alloc fail · 962197ba
      Miao Xie committed
      When we wrote some data in compress mode into a btrfs filesystem that was
      full of fragments, the kernel reported:
      	BTRFS warning (device xxx): Aborting unused transaction.

      The reason is:
      We cannot find a contiguous free space large enough to store the compressed
      data because the free space is fragmented, and the compressed data cannot be
      split, so the kernel output the above message.

      In fact, btrfs can deal with this problem very well: it falls back to
      uncompressed IO, splits the uncompressed data into smaller pieces, and then
      stores them in the fragmented free space. So we shouldn't output the
      above warning message.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: create a pinned em when writing to a prealloc range in DIO · 69ffb543
      Josef Bacik committed
      Wade Cline reported a problem where he was getting garbage and warnings when
      writing to a preallocated range via O_DIRECT.  This is because we weren't
      creating our normal pinned extent_map for the range we were writing to,
      which was causing all sorts of issues.  This patch fixes the problem and
      makes his testcase much happier.  Thanks,
      Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie committed
      When we delete an inode, we remove all the delayed items, including the
      delayed inode update, and then truncate all the related metadata. If there is
      a lot of metadata, we end the current transaction and start a new one to
      truncate the remaining metadata. In this way, we leave an inode item whose
      link count is > 0, and may also leave some directory index items in the
      fs/file tree after the current transaction ends. In other words, the metadata
      in this fs/file tree is inconsistent. If we create a snapshot of this tree now,
      we will find an inode with corrupted metadata in the new snapshot, and we won't
      continue to drop the remaining metadata because its link count is not 0.

      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • btrfs: polish names of kmem caches · 837e1972
      David Sterba committed
      Usecase:
      
        watch 'grep btrfs < /proc/slabinfo'
      
      easy to watch all caches in one go.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo committed
      We're going to use the EXTENT_DEFRAG flag to indicate which ranges
      belong to defragmentation so that we can implement snapshot-aware defrag:

      We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
      defragmented, so that later on the writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
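A minimal sketch of the flag idea described above, assuming a per-range bitmask. The bit values, the struct, and both helper names are made up for illustration; only the flag name EXTENT_DEFRAG comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are invented for this sketch; only the flag names follow the text. */
#define EXTENT_DIRTY  (1u << 0)
#define EXTENT_DEFRAG (1u << 1)

/* Hypothetical stand-in for a tracked byte range and its state bits. */
struct extent_range {
    unsigned long long start;
    unsigned long long len;
    unsigned int flags;
};

/* Dirty a range; defragmentation additionally tags it with EXTENT_DEFRAG. */
void dirty_range(struct extent_range *range, bool for_defrag)
{
    assert(range);
    range->flags |= EXTENT_DIRTY;
    if (for_defrag)
        range->flags |= EXTENT_DEFRAG;
}

/* Writeback checks the tag to tell defrag writeback from normal writeback. */
bool writeback_is_defrag(const struct extent_range *range)
{
    assert(range);
    return (range->flags & EXTENT_DEFRAG) != 0;
}
```

The point of the design is that the decision is made once, at dirtying time, and travels with the range, so the writeback side needs no extra context.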
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie committed
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Currently we identify the type just by comparing the addresses of the
      reservation variables, which is very ugly for a temporary reservation because
      we need to compare it with all the common reservation variables. So we add a
      new "type" field to record the type of each reservation variable.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
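The change described above can be sketched as follows. The struct layout, the enum values, and the helper names are assumptions for illustration; the commit message only states that a "type" field is added so that callers no longer compare addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reservation kinds; the real set is not listed in the message. */
enum rsv_type { RSV_GLOBAL, RSV_DELALLOC, RSV_TRANS, RSV_DELOPS, RSV_TEMP };

struct block_rsv {
    uint64_t size;
    uint64_t reserved;
    enum rsv_type type; /* the new field: set once at initialisation */
};

/* Initialise a reservation and record what kind it is. */
void block_rsv_init(struct block_rsv *rsv, enum rsv_type type)
{
    assert(rsv);
    rsv->size = 0;
    rsv->reserved = 0;
    rsv->type = type;
}

/* One comparison identifies the kind, wherever the struct happens to live. */
int is_delayed_ops_rsv(const struct block_rsv *rsv)
{
    assert(rsv);
    return rsv->type == RSV_DELOPS;
}
```

A temporary reservation on the stack has no well-known address to compare against, which is why a field carried inside the structure is the simpler test.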
    • Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() · ac14aed6
      Sage Weil committed
      Josef has suggested that this is not necessary.  Removing it also avoids
      this lockdep splat (after the new sb_internal locking stuff was added):
      
      [  604.090449] ======================================================
      [  604.114819] [ INFO: possible circular locking dependency detected ]
      [  604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted
      [  604.162193] -------------------------------------------------------
      [  604.186139] btrfs-cleaner/6669 is trying to acquire lock:
      [  604.209555]  (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  604.257100]
      [  604.257100] but task is already holding lock:
      [  604.300366]  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
      [  604.352989]
      [  604.352989] which lock already depends on the new lock.
      [  604.352989]
      [  604.427104]
      [  604.427104] the existing dependency chain (in reverse order) is:
      [  604.478493]
      [  604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}:
      [  604.529313]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  604.559621]        [<ffffffff81632b69>] down_read+0x39/0x4e
      [  604.589382]        [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs]
      [  604.596161] btrfs: unlinked 1 orphans
      [  604.675002]        [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs]
      [  604.708859]        [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs]
      [  604.772466]        [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs]
      [  604.842245]        [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
      [  604.912852]        [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs]
      [  604.951888]        [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560
      [  604.989961]        [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0
      [  605.026628]        [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b
      [  605.064404]
      [  605.064404] -> #0 (sb_internal#2){.+.+..}:
      [  605.126832]        [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
      [  605.163671]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  605.200228]        [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
      [  605.236818]        [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  605.274029]        [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
      [  605.340520]        [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
      [  605.378720]        [<ffffffff811972c8>] evict+0xb8/0x1c0
      [  605.416057]        [<ffffffff811974d5>] iput+0x105/0x210
      [  605.452373]        [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
      [  605.521627]        [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
      [  605.560520]        [<ffffffff810791ee>] kthread+0xae/0xc0
      [  605.598094]        [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
      [  605.636499]
      [  605.636499] other info that might help us debug this:
      [  605.636499]
      [  605.736504]  Possible unsafe locking scenario:
      [  605.736504]
      [  605.801931]        CPU0                    CPU1
      [  605.835126]        ----                    ----
      [  605.867093]   lock(&fs_info->cleanup_work_sem);
      [  605.898594]                                lock(sb_internal#2);
      [  605.931954]                                lock(&fs_info->cleanup_work_sem);
      [  605.965359]   lock(sb_internal#2);
      [  605.994758]
      [  605.994758]  *** DEADLOCK ***
      [  605.994758]
      [  606.075281] 2 locks held by btrfs-cleaner/6669:
      [  606.104528]  #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs]
      [  606.165626]  #1:  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
      [  606.231297]
      [  606.231297] stack backtrace:
      [  606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1
      [  606.347823] Call Trace:
      [  606.376184]  [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c
      [  606.409243]  [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
      [  606.441343]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.474583]  [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  606.505934]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.539429]  [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
      [  606.571719]  [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
      [  606.603498]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.637405]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.670165]  [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160
      [  606.702144]  [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  606.735562]  [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs]
      [  606.769861]  [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
      [  606.804575]  [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
      [  606.838756]  [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
      [  606.872010]  [<ffffffff811972c8>] evict+0xb8/0x1c0
      [  606.903800]  [<ffffffff811974d5>] iput+0x105/0x210
      [  606.935416]  [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
      [  606.970510]  [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs]
      [  607.005648]  [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
      [  607.040724]  [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
      [  607.104740]  [<ffffffff810791ee>] kthread+0xae/0xc0
      [  607.137119]  [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
      [  607.169797]  [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
      [  607.202472]  [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
      [  607.235884]  [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
      [  607.268731]  [<ffffffff8163e740>] ? gs_change+0x13/0x13
      Signed-off-by: Sage Weil <sage@inktank.com>
    • Btrfs: add hole punching · 2aaa6655
      Josef Bacik committed
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik committed
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway, so let's just remove it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34
      Liu Bo committed
      This is based on Josef's "Btrfs: turbo charge fsync".

      Currently btrfs checks whether an inode is in the log by comparing the
      root's last_log_commit to the inode's last_sub_trans[1].

      But the problem is that root->last_log_commit is shared among
      inodes.

      Say we have N inodes to be logged: after the first inode is logged, the
      root's last_log_commit is updated and the remaining N-1 files will
      be skipped.

      This fixes the bug by keeping a local copy of the root's last_log_commit
      inside each inode, and this local copy is maintained independently.

      [1]: we regard each log transaction as a subset of btrfs's transaction,
      i.e. sub_trans
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
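The bug and the fix above can be reproduced in a toy model. The field names last_log_commit and last_sub_trans come from the commit message; the log_transid counter, the struct layouts, and all function names are assumptions made for this sketch and are not taken from the patch.

```c
#include <assert.h>

/* Toy model: only the counters relevant to the "is it in the log?" check. */
struct root  { int log_transid; int last_log_commit; };
struct inode { int last_sub_trans; int last_log_commit; };

/* Record that the inode changed in the root's current log transaction. */
void inode_modified(struct inode *ino, const struct root *root)
{
    ino->last_sub_trans = root->log_transid;
    ino->last_log_commit = root->last_log_commit; /* per-inode copy */
}

/* Log one inode and commit the log, advancing the shared root counters. */
void log_and_commit(struct inode *ino, struct root *root)
{
    assert(ino->last_sub_trans <= root->log_transid);
    root->last_log_commit = root->log_transid;
    root->log_transid++;
    ino->last_log_commit = ino->last_sub_trans; /* only this inode is logged */
}

/* Old check: compares against the counter shared by every inode. */
int in_log_shared(const struct inode *ino, const struct root *root)
{
    return ino->last_sub_trans <= root->last_log_commit;
}

/* New check: compares against the inode's own copy. */
int in_log_local(const struct inode *ino)
{
    return ino->last_sub_trans <= ino->last_log_commit;
}
```

With two inodes modified in the same log transaction, logging the first makes the shared check report the second as already logged, while the per-inode check still reports that it needs logging.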
    • Btrfs: fix wrong orphan count of the fs/file tree · 321f0e70
      Miao Xie committed
      If we add a new orphan item, we should increase the atomic counter,
      not decrease it. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: improve fsync by filtering extents that we want · 4e2f84e6
      Liu Bo committed
      This is based on Josef's "Btrfs: turbo charge fsync".

      Josef's patch above performs very well in a random sync write test,
      because we won't have too many extents to merge.

      However, it does not perform well on this test:
      dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

      The reason is that when we do sequential sync writes, we need to merge the
      current extent with the previous one, so we get ever larger accumulated
      extents to log:

      A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

      So we have to flush more and more checksums into the log tree, which is the
      bottleneck according to my tests.

      But we can avoid this by telling fsync which extents really need
      to be logged.

      With this, I ran the above dd sync write test (size=50m):

               w/o (orig)   w/ (josef's)   w/ (this)
      SATA      104KB/s       109KB/s       121KB/s
      ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
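A back-of-envelope model of the accumulation described above, counting 4k blocks whose checksums get logged. It assumes that with merging the i-th fsync re-logs an extent of i blocks, and that with filtering each fsync logs only the one new block; both function names are hypothetical and the model ignores everything else fsync does.

```c
#include <assert.h>

/* Merged case: fsync i logs an i-block extent, so the total is 1 + 2 + ... + n. */
unsigned long long blocks_logged_merged(unsigned long long n)
{
    assert(n < (1ULL << 32)); /* keeps n * (n + 1) from overflowing */
    return n * (n + 1) / 2;
}

/* Filtered case: each fsync logs only the block it just wrote. */
unsigned long long blocks_logged_filtered(unsigned long long n)
{
    return n;
}
```

For the dd test in the commit message (count=12500), the merged model logs 78,131,250 blocks' worth of checksums against 12,500 in the filtered model, which suggests why the checksum flushing dominates in the sequential case.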
    • Btrfs: do not needlessly restart the transaction for enospc · ca7e70f5
      Josef Bacik committed
      We will stop and restart a transaction every time we move to a different leaf
      when truncating a file.  This is for enospc reasons, but really we could
      probably get away with doing this a little better by actually working until we
      hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
      truncates which will fail as soon as the block rsv runs out of space, and then
      at that point we can stop and restart the transaction and refill the block rsv
      and carry on.  This will make rm'ing of a file with lots of extents a bit
      faster.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
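The control flow described above can be sketched with a toy reservation that fails as soon as it runs dry, which is the behaviour the message attributes to the ->failfast flag. The struct, the per-leaf cost, the refill amount, and both function names are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Toy reservation that behaves as if failfast were set. */
struct block_rsv { long reserved; };

/* Consume bytes, or fail immediately with -ENOSPC when not enough is left. */
int rsv_use(struct block_rsv *rsv, long bytes)
{
    if (rsv->reserved < bytes)
        return -ENOSPC;
    rsv->reserved -= bytes;
    return 0;
}

/*
 * Truncate nleaves leaves, restarting the "transaction" (and refilling the
 * reservation) only when rsv_use() reports -ENOSPC. Returns the restart count.
 */
int truncate_restarts(int nleaves, long cost_per_leaf, long refill)
{
    struct block_rsv rsv = { refill };
    int restarts = 0;
    int leaf = 0;

    assert(cost_per_leaf > 0 && cost_per_leaf <= refill); /* guarantees progress */
    while (leaf < nleaves) {
        if (rsv_use(&rsv, cost_per_leaf) == -ENOSPC) {
            restarts++;            /* stop and restart the transaction */
            rsv.reserved = refill; /* refill the block rsv and carry on */
            continue;
        }
        leaf++;
    }
    return restarts;
}
```

In this model, truncating 10 leaves with room for 4 per refill restarts twice, whereas restarting on every move to a different leaf would restart 9 times.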
    • Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik committed
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
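The selection step described above (find the extents modified in the current transaction, sort them, then log them) can be sketched in userspace C. The struct, its field names, and the function name are hypothetical; the sketch covers only the filter-and-sort idea, not the log tree or the truncate corner cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory extent record tagged with the transid that last modified it. */
struct extent {
    uint64_t start;
    uint64_t len;
    uint64_t generation;
};

/* qsort comparator: order extents by starting offset. */
static int cmp_start(const void *a, const void *b)
{
    const struct extent *x = a;
    const struct extent *y = b;

    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy the extents modified in cur_transid into out (which must have room
 * for n entries), sort them by offset, and return how many were selected.
 */
size_t collect_modified(const struct extent *all, size_t n,
                        uint64_t cur_transid, struct extent *out)
{
    size_t m = 0;

    assert(all && out);
    for (size_t i = 0; i < n; i++)
        if (all[i].generation == cur_transid)
            out[m++] = all[i];
    qsort(out, m, sizeof(*out), cmp_start);
    return m;
}
```

Only the selected extents would then be written to the log, so a file with many extents but few recent modifications does far less work per fsync.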
    • Btrfs: update last trans if we don't update the inode · 7c735313
      Josef Bacik committed
      There is a completely impossible situation to hit where you can preallocate
      a file, fsync it, write into the preallocated region, have the transaction
      commit twice and then fsync and then immediately lose power and lose all of
      the contents of the write.  This patch fixes this just so I feel better
      about the situation and because it is lightweight, we just update the
      last_trans when we finish an ordered IO and we don't update the inode
      itself.  This way we are completely safe and I feel better.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  9. 21 Sep, 2012 1 commit
  10. 01 Sep, 2012 1 commit
  11. 29 Aug, 2012 2 commits