1. 06 Nov, 2011 2 commits
    • C
      Btrfs: fix extent pinning bugs in the tree log · e688b725
      Chris Mason committed
      The tree log had two important bugs that could cause corruptions after a
      crash.  Sometimes we were allowing tree log blocks to be reused after
      the tree log was committed but before the transaction commit was done.
      
      This allowed a future metadata write to overwrite the tree log data.  It
      is fixed by adding a new variant of freeing reserved extents that always
      pins them.  Credit goes to Stefan Behrens and Arne Jansen for many many
      hours spent tracking this bug down.
      
      During tree log replay, we do a pass through the tree log and pin all
      the extents we find.  This makes sure the replay code won't go in and
      use any of those blocks for new allocations during replay.  The problem
      is the free space cache isn't honoring these pinned extents.  So the
      allocator can end up handing them out, leading to all kinds of problems
      during replay.
      
      The fix here is to force any free space cache to load while we pin the
      extents, and then to make sure we remove the pinned extents from the
      free space rbtree.
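
The second fix can be pictured with a small userspace model. This is only an illustrative sketch in Python, not the btrfs code: `FreeSpace`, `remove_pinned` and `alloc` are hypothetical names, and a sorted list of half-open ranges stands in for the free space rbtree.

```python
class FreeSpace:
    """Toy free-space map: sorted, non-overlapping half-open (start, end) ranges."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)

    def remove_pinned(self, start, length):
        """Carve [start, start + length) out of the free ranges."""
        end = start + length
        kept = []
        for s, e in self.ranges:
            if e <= start or s >= end:
                kept.append((s, e))      # no overlap with the pinned extent
                continue
            if s < start:
                kept.append((s, start))  # free space left of the pinned extent
            if end < e:
                kept.append((end, e))    # free space right of the pinned extent
        self.ranges = kept

    def alloc(self, length):
        """First-fit allocation; returns the start offset or None."""
        for i, (s, e) in enumerate(self.ranges):
            if e - s >= length:
                if s + length == e:
                    del self.ranges[i]
                else:
                    self.ranges[i] = (s + length, e)
                return s
        return None


free = FreeSpace([(0, 100)])
free.remove_pinned(10, 20)  # extent found in the tree log during replay
first = free.alloc(15)      # cannot land inside the pinned range [10, 30)
```

Once the pinned range has been removed from the map, the allocator in this model can no longer hand it out, which is the property the fix is after.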
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
      e688b725
    • C
      Btrfs: don't wait as long for more batches during SSD log commit · cd354ad6
      Chris Mason committed
      When we're doing log commits, we try to wait for more writers to come in
      and make the commit bigger.  This helps improve performance on rotating
      disks, but on SSDs it adds latencies.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      cd354ad6
  2. 17 Aug, 2011 1 commit
    • L
      Btrfs: fix an oops of log replay · 34f3e4f2
      liubo committed
      When btrfs recovers from a crash, it may hit the oops below:
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/inode.c:4580!
      [...]
      RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
      [...]
      Call Trace:
       [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
       [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
       [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
       [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
       [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
       [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
       [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
       [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
      [...]
      
      This happens because, while replaying an inode ref item, we forget to
      check for old conflicting DIR_ITEM and DIR_INDEX items in the fs/file tree,
      so we run into conflicting corner cases that trigger the BUG_ON().
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Tested-by: Andy Lutomirski <luto@mit.edu>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      34f3e4f2
  3. 28 Jul, 2011 1 commit
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason committed
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
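
As a rough illustration of the reader/writer idea, here is a minimal userspace sketch in Python. It is not the btrfs lock: the commit builds on rw spinlocks with atomic counters for blocking holders, while this sketch swaps in a `threading.Condition`, and the `RWLock` class and its method names are hypothetical.

```python
import threading


class RWLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of active readers
        self._writer = False   # True while a writer holds the lock

    def read_lock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def read_unlock(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # let a waiting writer in

    def write_lock(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def try_write_lock(self):
        """Non-blocking variant: returns False instead of waiting."""
        with self._cond:
            if self._writer or self._readers:
                return False
            self._writer = True
            return True

    def write_unlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = RWLock()
lock.read_lock()
lock.read_lock()                              # a second reader is admitted
blocked_by_readers = not lock.try_write_lock()
lock.read_unlock()
lock.read_unlock()
got_write = lock.try_write_lock()
lock.write_unlock()
```

Readers of a hot node such as the root no longer exclude each other in this model; only writers need exclusive access.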
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bd681513
  4. 15 Jul, 2011 1 commit
  5. 18 Jun, 2011 1 commit
  6. 24 May, 2011 4 commits
  7. 23 May, 2011 1 commit
    • L
      Btrfs: do not flush csum items of unchanged file data during treelog · 8e531cdf
      liubo committed
      The current code relogs the entire inode on every fsync, which suits
      small files much better than large ones.
      
      In my performance tests, fsync on large files is very slow. The cause is
      the huge number of csum items that a large file carries: we have to flush
      all of them into the log tree even when there is only _one_ change in the
      whole file data.  To optimize fsync, we need a filter that skips the
      unnecessary csum items, that is, those whose file data has remained
      unchanged since before this fsync.
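
A filter of this kind could be sketched as follows. This is a hedged illustration in Python, not the patch itself: it assumes each file extent records the id of the transaction that last wrote it (called `generation` here), and the function name and dictionary layout are hypothetical.

```python
def extents_needing_csum_log(extents, current_transid):
    """Keep only extents whose data was written in the running transaction."""
    return [e for e in extents if e["generation"] >= current_transid]


extents = [
    {"offset": 0,    "generation": 7},  # untouched since an older transaction
    {"offset": 4096, "generation": 9},  # rewritten in the current transaction
    {"offset": 8192, "generation": 7},
]
to_log = extents_needing_csum_log(extents, current_transid=9)
```

Only the csum items for the extents in `to_log` would need to be copied into the log tree in this model.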
      
      Here are some test results. I used sysbench to run "random write + fsync".
      
      ===
      sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags=  [prepare, run]
      ===
      
      Sysbench args:
        - Number of threads: 1
        - Extra file open flags: 0
        - 2 files, 4Gb each
        - Block size 4Kb
        - Number of random requests for random IO: 10000
        - Read/Write ratio for combined random IO test: 1.50
        - Periodic FSYNC enabled, calling fsync() each 100 requests.
        - Calling fsync() at the end of test, Enabled.
        - Using synchronous I/O mode
        - Doing random write test
      
      Sysbench results:
      ===
         Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
         Read 0b  Written 39.062Mb  Total transferred 39.062Mb
      ===
      a) without patch:  (*SPEED* : 451.01Kb/sec)
         112.75 Requests/sec executed
      
      b) with patch:     (*SPEED* : 4.7533Mb/sec)
         1216.84 Requests/sec executed
      
      PS: I've made a _sub transid_ patch, but it does not perform as well as this patch,
      and I'm wondering where the problem is and trying to improve it.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      8e531cdf
  8. 21 May, 2011 1 commit
    • M
      btrfs: implement delayed inode items operation · 16cdcec7
      Miao Xie committed
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which is spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix deadlock between readdir() and memory fault, which is reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
        inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task which does delayed items
        balance, which is reported by Tsutomu Itoh.
      - Modify the patch to address David Sterba's comment.
      - Fix the bug of the cpu recursion spinlock, reported by Chris Mason
      
      Changelog V1 -> V2:
      - break up the global rb-tree, use a list to manage the delayed nodes,
        which is created for every directory and file, and used to manage the
        delayed directory name index items and the delayed inode item.
      - introduce a worker to deal with the delayed nodes.
      
      Compared with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
      such as the inode item, directory name item, directory name index and so on.
      
      If we can delay some b+ tree insertions or deletions, we can improve the
      performance, so this patch implements delayed directory name index
      insertion/deletion and delayed inode update.
      
      Implementation:
      - Introduce a delayed root object into the filesystem, which uses two lists to
        manage the delayed nodes that are created for every file/directory.
        One list holds all the delayed nodes that have delayed items. The other
        holds the delayed nodes that are waiting to be dealt with by the worker
        thread.
      - Every delayed node has two rb-trees: one manages the directory name
        indexes that are going to be inserted into the b+ tree, and the other
        manages the directory name indexes that are going to be deleted from the
        b+ tree.
      - Introduce a worker to deal with the delayed operations, that is, the
        insertion and deletion of delayed directory name index items and the
        delayed inode update.
        When the number of delayed items exceeds the lower limit, we create works
        for some delayed nodes, insert them into the work queue of the worker, and
        then go back.
        When the number of delayed items exceeds the upper bound, we create works
        for all the delayed nodes that haven't been dealt with, insert them into the
        work queue of the worker, and then wait until the number of untreated items
        drops below some threshold value.
      - When we want to insert a directory name index into the b+ tree, we just add
        the information into the delayed inserting rb-tree.
        Then we check the number of delayed items and do delayed items
        balance. (The balance policy is described above.)
      - When we want to delete a directory name index from the b+ tree, we first
        search for it in the inserting rb-tree. If we find it, we just drop it. If
        not, we add its key into the delayed deleting rb-tree.
        As with the delayed inserting rb-tree, we also check the number of
        delayed items and do delayed items balance.
        (The same as for insertion.)
      - When we want to update the metadata of an inode, we cache the inode data
        in the delayed node. The worker will flush it into the b+ tree after
        dealing with the delayed insertions and deletions.
      - We move the delayed node to the tail of the list after we access it.
        This way, we can cache more delayed items and merge more inode updates.
      - When we want to commit a transaction, we deal with all the delayed nodes.
      - The delayed node is freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
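
The insert/delete bookkeeping in the list above can be sketched in a few lines. This is a simplified model in Python and not the kernel implementation: `DelayedNode` and its methods are hypothetical names, a dict and a set stand in for the two rb-trees, a plain dict stands in for the b+ tree, and the worker, balancing limits and inode update are left out.

```python
class DelayedNode:
    """Per-inode container for delayed directory index operations."""

    def __init__(self):
        self.pending_inserts = {}     # key -> name (the inserting rb-tree)
        self.pending_deletes = set()  # keys (the deleting rb-tree)

    def insert_index(self, key, name):
        # Only record the insertion; the b+ tree is not touched yet.
        self.pending_inserts[key] = name

    def delete_index(self, key):
        if key in self.pending_inserts:
            # The item never reached the b+ tree, so just drop it.
            del self.pending_inserts[key]
        else:
            self.pending_deletes.add(key)

    def flush(self, btree):
        """Apply everything that is still pending to the b+ tree."""
        for key in self.pending_deletes:
            btree.pop(key, None)
        btree.update(self.pending_inserts)
        self.pending_deletes.clear()
        self.pending_inserts.clear()


btree = {1: "old_file"}
node = DelayedNode()
node.insert_index(2, "tmp_file")
node.delete_index(2)   # cancels the pending insert, no b+ tree work at all
node.delete_index(1)   # queued for deletion
node.insert_index(3, "new_file")
node.flush(btree)
```

A file that is created and removed before the flush costs no b+ tree operation in this model, which is where the batching gain comes from.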
      
      I ran a quick test with the benchmark tool[1] and found we can improve the
      performance of file creation by ~15%, and file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dave@jikos.cz>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      16cdcec7
  9. 12 May, 2011 1 commit
    • A
      btrfs: scrub · a2de733c
      Arne Jansen committed
      This adds an initial implementation for scrub. It works in a quite
      straightforward way. Userspace issues an ioctl for each device in the
      fs. For each device, it enumerates the allocated device chunks. For
      each chunk, the contained extents are enumerated and the data checksums
      fetched. The extents are read sequentially and the checksums verified.
      If an error occurs (checksum or EIO), a good copy is searched for. If
      one is found, the bad copy will be rewritten.
      All enumerations happen from the commit roots. During a transaction
      commit, the scrubs get paused and afterwards continue from the new
      roots.
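
The verify-and-repair loop can be modelled roughly as below. This is a sketch under stated assumptions, not the scrub code: `scrub_device` is a hypothetical name, lists of byte strings stand in for device blocks, and `zlib.crc32` is used as a stand-in for the crc32c checksum that btrfs stores.

```python
import zlib


def scrub_device(device, mirrors, checksums):
    """Verify each block; rewrite a bad block from the first good mirror copy.

    Returns (repaired, unrecoverable) block counts.
    """
    repaired = unrecoverable = 0
    for i, expected in enumerate(checksums):
        if zlib.crc32(device[i]) == expected:
            continue                      # block is fine
        for mirror in mirrors:
            if zlib.crc32(mirror[i]) == expected:
                device[i] = mirror[i]     # rewrite the bad copy
                repaired += 1
                break
        else:
            unrecoverable += 1            # no good copy anywhere
    return repaired, unrecoverable


good = [b"aaaa", b"bbbb"]
sums = [zlib.crc32(block) for block in good]
damaged = [b"aaaa", b"bXbb"]              # second block is corrupted
result = scrub_device(damaged, [list(good)], sums)
```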
      
      This commit is based on the series originally posted to linux-btrfs
      with some improvements that resulted from comments from David Sterba,
      Ilya Dryomov and Jan Schmidt.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      a2de733c
  10. 02 May, 2011 2 commits
  11. 26 Apr, 2011 1 commit
  12. 25 Apr, 2011 1 commit
    • L
      Btrfs: Always use 64bit inode number · 33345d01
      Li Zefan committed
      There's a potential problem on 32-bit systems when we exhaust 32-bit inode
      numbers and start to allocate big inode numbers, because btrfs uses
      inode->i_ino in many places.
      
      So here we always use BTRFS_I(inode)->location.objectid, which is a
      u64 variable.
      
      There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
      inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
      and inode->i_ino will be used in those cases.
      
      Another reason to make this change is I'm going to use a special inode
      to save free ino cache, and the inode number must be > (u64)-256.
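
The underlying problem is plain truncation: `inode->i_ino` is an `unsigned long`, which is only 32 bits wide on 32-bit systems. The small Python sketch below illustrates the effect; the helper name is hypothetical and a mask stands in for the narrowing assignment.

```python
U32_MASK = 0xFFFFFFFF


def truncate_to_32bit(objectid):
    """What a 32-bit unsigned long keeps of a 64-bit object id."""
    return objectid & U32_MASK


big = (1 << 32) + 257            # first object id range past 32 bits
special = (1 << 64) - 256        # (u64)-256, the bound mentioned above

collides = truncate_to_32bit(big) == truncate_to_32bit(257)
```

Two distinct 64-bit object ids map to the same truncated value here, which is why the message prefers the u64 objectid over `i_ino`.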
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      33345d01
  13. 28 Mar, 2011 2 commits
    • L
      btrfs: make inode ref log recovery faster · c622ae60
      liubo committed
      When we recover from crash via write-ahead log tree and process
      the inode refs, for each btrfs_inode_ref item, we will
      1) check if we already have a perfect match in fs/file tree, if
         we have, then we're done.
      2) search the corresponding back reference in fs/file tree, and
         check all the names in this back reference to see if they are
         also in the log to avoid conflict corners.
      3) recover the logged inode refs to fs/file tree.
      
      In current btrfs, however,
      - the check in 2) only needs to be done once, since the checked back
        reference remains unchanged while all the inode refs belonging to
        the key are processed.
      - there is no need to do another 1) between 2) and 3).
      
      I ran a small test to show the improvement:
      
      $dd if=/dev/zero of=foobar bs=4K count=1
      $sync
      $make 100 hard links continuously, like ln foobar link_i
      $fsync foobar
      $echo b > /proc/sysrq-trigger
      after reboot
      $time mount DEV PATH
      
      without patch:
      real    0m0.285s
      user    0m0.001s
      sys     0m0.009s
      
      with patch:
      real    0m0.123s
      user    0m0.000s
      sys     0m0.010s
      
      Changelog v1->v2:
      - fix double free - pointed out by David Sterba
      Changelog v2->v3:
      - adjust free order
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c622ae60
    • T
      Btrfs: cleanup some BUG_ON() · db5b493a
      Tsutomu Itoh committed
      This patch changes some BUG_ON() calls to error returns.
      (However, most callers still use BUG_ON().)
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      db5b493a
  14. 18 Mar, 2011 1 commit
  15. 01 Feb, 2011 3 commits
  16. 29 Jan, 2011 1 commit
  17. 22 Nov, 2010 1 commit
  18. 30 Oct, 2010 2 commits
  19. 25 May, 2010 1 commit
  20. 30 Mar, 2010 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
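
The first bullet, deciding which header a file needs, might look roughly like this. The sketch is an assumption-laden illustration in Python and is not taken from slabh-sweep.py: the regular expressions cover only a handful of well-known slab and gfp identifiers, and `needed_include` is a hypothetical name.

```python
import re

# A few common slab allocator entry points (far from a complete list).
SLAB_RE = re.compile(r"\b(?:k[mzc]alloc|kfree|kmem_cache_\w+)\s*\(")
# GFP flags and the gfp_t type.
GFP_RE = re.compile(r"\b(?:__)?GFP_[A-Z_]+\b|\bgfp_t\b")


def needed_include(source):
    """Return the single header a C file needs, or None.

    slab.h wins when slab is used, because it already pulls in gfp.h.
    """
    if SLAB_RE.search(source):
        return "linux/slab.h"
    if GFP_RE.search(source):
        return "linux/gfp.h"
    return None


uses_slab = needed_include("buf = kmalloc(len, GFP_KERNEL);")
uses_gfp_only = needed_include("static gfp_t flags = GFP_ATOMIC;")
uses_neither = needed_include("int counter;")
```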
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  21. 15 Mar, 2010 1 commit
    • J
      Btrfs: change how we mount subvolumes · 73f73415
      Josef Bacik committed
      This work is in preparation for being able to set a different root as the
      default mounting root.
      
      There is currently a problem with how we mount subvolumes.  We cannot currently
      mount a subvolume of a subvolume; you can only mount subvolumes/snapshots of the
      default subvolume.  So say you take a snapshot of the default subvolume and call
      it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
      
      /
      /snap1
      /snap1/snap2
      
      as your available volumes.  Currently you can only mount / and /snap1,
      you cannot mount /snap1/snap2.  To fix this problem instead of passing
      subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
      the tree id that gets spit out via the subvolume listing you get from
      the subvolume listing patches (btrfs filesystem list).  This allows us
      to mount /, /snap1 and /snap1/snap2 as the root volume.
      
      In addition to the above, we also now read the default dir item in the
      tree root to get the root key that it points to.  For now this just
      points at what has always been the default subvolume, but later on I plan
      to change it to point at whatever root you want to be the new default
      root, so you can just set the default mount and not have to mount with
      -o subvolid=<treeid>.  I tested this out with the above scenario and it
      worked perfectly.  Thanks,
      
      mount -o subvol operates inside the selected subvolid.  For example:
      
      mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
      
      /mnt will have the snap1 directory for the subvolume with id
      256.
      
      mount -o subvol=snap /dev/xxx /mnt
      
      /mnt will be the snap directory of whatever the default subvolume
      is.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      73f73415
  22. 18 Dec, 2009 1 commit
  23. 16 Dec, 2009 2 commits
  24. 14 Oct, 2009 4 commits
    • Y
      Btrfs: properly wait log writers during log sync · 86df7eb9
      Yan, Zheng committed
      A recent fsync optimization makes btrfs_sync_log skip calling
      wait_for_writer in the single log writer case. This is incorrect
      since the writer count can also be increased by btrfs_pin_log.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      86df7eb9
    • C
      Btrfs: streamline tree-log btree block writeout · 690587d1
      Chris Mason committed
      Syncing the tree log is a 3 phase operation.
      
      1) write and wait for all the tree log blocks for a given root.
      
      2) write and wait for all the tree log blocks for the
      tree of tree log roots.
      
      3) write and wait for the super blocks (barriers here)
      
      This isn't as efficient as it could be because there is
      no requirement to wait for the blocks from step one to hit the disk
      before we start writing the blocks from step two.  This commit
      changes the sequence so that we don't start waiting until
      all the tree blocks from both steps one and two have been sent
      to disk.
      
      We do this by breaking up btrfs_write_wait_marked_extents into
      two functions, which is trivial because it was already broken
      up into two parts.
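
The reordering amounts to "submit everything, then wait once". Below is a hedged userspace sketch in Python: `write_marked`, `wait_marked` and `sync_log` are hypothetical names, a thread pool stands in for asynchronous block I/O, and a list stands in for the disk.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def write_marked(pool, blocks, disk):
    """Start writing the blocks without waiting; return the pending writes."""
    return [pool.submit(disk.append, block) for block in blocks]


def wait_marked(pending):
    """Wait for previously started writes to finish."""
    wait(pending)


def sync_log(pool, log_blocks, log_root_blocks, disk):
    # Steps 1 and 2: send both sets of tree blocks before waiting on either.
    pending = write_marked(pool, log_blocks, disk)
    pending += write_marked(pool, log_root_blocks, disk)
    wait_marked(pending)
    # Step 3: the super block goes out only after all tree blocks are on disk.
    disk.append("super")


disk = []
with ThreadPoolExecutor(max_workers=2) as pool:
    sync_log(pool, ["log-a", "log-b"], ["log-root"], disk)
```

The super block is still written last; only the wait between steps one and two is gone.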
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      690587d1
    • C
      Btrfs: avoid tree log commit when there are no changes · 257c62e1
      Chris Mason committed
      rpm has a habit of running fdatasync when the file hasn't
      changed.  We already detect if a file hasn't been changed
      in the current transaction but it might have been sent to
      the tree-log in this transaction and not changed since
      the last call to fsync.
      
      In this case, we want to avoid a tree log sync, which includes
      a number of synchronous writes and barriers.  This commit
      extends the existing tracking of the last transaction to change
      a file to also track the last sub-transaction.
      
      The end result is that rpm -ivh and -Uvh are roughly twice as fast,
      and on par with ext3.
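
One way to picture the extra tracking is the check below. It is a sketch in Python under stated assumptions and not the kernel logic: the `Inode` fields and the `needs_log_sync` helper are hypothetical names chosen to mirror the description, with a sub-transaction counter that advances on every log commit.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    last_trans: int = 0      # last transaction that changed the file
    last_sub_trans: int = 0  # last sub-transaction that changed the file
    logged_trans: int = 0    # transaction in which the file was last logged


def needs_log_sync(inode, current_trans, last_trans_committed, last_log_commit):
    """Decide whether fsync/fdatasync has to sync the tree log."""
    if inode.last_trans <= last_trans_committed:
        return False  # unchanged since the last full transaction commit
    if (inode.logged_trans == current_trans
            and inode.last_sub_trans <= last_log_commit):
        return False  # already in the log and unchanged since that log commit
    return True


# File changed in transaction 5, was logged, and has not changed since.
already_logged = Inode(last_trans=5, last_sub_trans=2, logged_trans=5)
skip = not needs_log_sync(already_logged, current_trans=5,
                          last_trans_committed=4, last_log_commit=2)

# Same file modified again after the last log commit.
dirty_again = Inode(last_trans=5, last_sub_trans=3, logged_trans=5)
must_sync = needs_log_sync(dirty_again, current_trans=5,
                           last_trans_committed=4, last_log_commit=2)
```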
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      257c62e1
    • C
      Btrfs: only write one super copy during fsync · 4722607d
      Chris Mason committed
      During a tree-log commit for fsync, we've been writing at least
      two copies of the super block and forcing them to disk.
      
      The other filesystems write only one, and this change brings us on
      par with them.  A full transaction commit will write all the super
      copies, so we still have redundant info written on a regular
      basis.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4722607d
  25. 09 Oct, 2009 1 commit
  26. 22 Sep, 2009 2 commits