提交 · 1eae31e918972bbeefc119d23c1d67674f49a301 · openeuler / raspberrypi-kernel

06 11月, 2011 1 次提交

Btrfs: don't wait as long for more batches during SSD log commit · cd354ad6

由 Chris Mason 提交于 10月 20, 2011

When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger.  This helps improve performance on rotating
disks, but on SSDs it adds latencies.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

cd354ad6

17 8月, 2011 1 次提交

Btrfs: fix an oops of log replay · 34f3e4f2

由 liubo 提交于 8月 06, 2011

When btrfs recovers from a crash, it may hit the oops below:

------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
 [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
 [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
 [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
 [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
 [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
 [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
 [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
 [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]

This comes from that while replaying an inode ref item, we forget to
check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
then we will come to conflict corners which lead to BUG_ON().
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Tested-by: NAndy Lutomirski <luto@mit.edu>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

34f3e4f2

28 7月, 2011 1 次提交

Btrfs: switch the btrfs tree locks to reader/writer · bd681513

由 Chris Mason 提交于 7月 16, 2011

The btrfs metadata btree is the source of significant
lock contention, especially in the root node.   This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking.  Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster.  In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

bd681513

15 7月, 2011 1 次提交

btrfs: Don't BUG_ON alloc_path errors in replay_one_buffer() · 1e5063d0

由 Mark Fasheh 提交于 7月 12, 2011

The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.
Signed-off-by: NMark Fasheh <mfasheh@suse.com>

1e5063d0

18 6月, 2011 1 次提交

btrfs: fix dereference of ERR_PTR value · 3ed4498c

由 David Sterba 提交于 6月 13, 2011

smatch reports:

btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing
possible ERR_PTR()
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

3ed4498c

24 5月, 2011 4 次提交

Btrfs: check return value of btrfs_inc_extent_ref() · 37daa4f9

由 Tsutomu Itoh 提交于 4月 28, 2011

If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

37daa4f9

Btrfs: return error to caller if read_one_inode() fails · c00e9493

由 Tsutomu Itoh 提交于 4月 28, 2011

When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

c00e9493

Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item · 1cd30799

由 Tsutomu Itoh 提交于 5月 19, 2011

Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

1cd30799

Btrfs: return error code to caller when btrfs_del_item fails · 65a246c5

由 Tsutomu Itoh 提交于 5月 19, 2011

The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

65a246c5

23 5月, 2011 1 次提交

Btrfs: do not flush csum items of unchanged file data during treelog · 8e531cdf

由 liubo 提交于 5月 06, 2011

The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.

During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data.  Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.

Here I have some test results to show, I use sysbench to do "random write + fsync".

===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags=  [prepare, run]
===

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 10000
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch:     (*SPEED* : 4.7533Mb/sec)
   1216.84 Requests/sec executed

PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

8e531cdf

21 5月, 2011 1 次提交

btrfs: implement delayed inode items operation · 16cdcec7

由 Miao Xie 提交于 4月 22, 2011

Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dave@jikos.cz>
Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

16cdcec7

12 5月, 2011 1 次提交

btrfs: scrub · a2de733c

由 Arne Jansen 提交于 3月 08, 2011

This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.
Signed-off-by: NArne Jansen <sensille@gmx.net>

a2de733c

02 5月, 2011 2 次提交

btrfs: drop unused parameter from btrfs_release_path · b3b4aa74

由 David Sterba 提交于 4月 21, 2011

parameter tree root it's not used since commit
5f39d397 ("Btrfs: Create extent_buffer
interface for large blocksizes")
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

b3b4aa74

btrfs: unify checking of IS_ERR and null · c704005d

由 David Sterba 提交于 4月 19, 2011

use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

c704005d

26 4月, 2011 1 次提交

Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log() · a62f44a5

由 Tsutomu Itoh 提交于 4月 25, 2011

It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

a62f44a5

25 4月, 2011 1 次提交

Btrfs: Always use 64bit inode number · 33345d01

由 Li Zefan 提交于 4月 20, 2011

There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>

33345d01

28 3月, 2011 2 次提交

btrfs: make inode ref log recovery faster · c622ae60

由 liubo 提交于 3月 26, 2011

When we recover from crash via write-ahead log tree and process
the inode refs, for each btrfs_inode_ref item, we will
1) check if we already have a perfect match in fs/file tree, if
   we have, then we're done.
2) search the corresponding back reference in fs/file tree, and
   check all the names in this back reference to see if they are
   also in the log to avoid conflict corners.
3) recover the logged inode refs to fs/file tree.

In current btrfs, however,
- for 2)'s check, once is enough, since the checked back reference
  will remain unchanged after processing all the inode refs belonged
  to the key.
- it has no need to do another 1) between 2) and 3).

I've made a small test to show how it improves,

$dd if=/dev/zero of=foobar bs=4K count=1
$sync
$make 100 hard links continuously, like ln foobar link_i
$fsync foobar
$echo b > /proc/sysrq-trigger
after reboot
$time mount DEV PATH

without patch:
real    0m0.285s
user    0m0.001s
sys     0m0.009s

with patch:
real    0m0.123s
user    0m0.000s
sys     0m0.010s

Changelog v1->v2:
- fix double free - pointed by David Sterba
Changelog v2->v3:
- adjust free order
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

c622ae60

Btrfs: cleanup some BUG_ON() · db5b493a

由 Tsutomu Itoh 提交于 3月 23, 2011

This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

db5b493a

18 3月, 2011 1 次提交

Btrfs: add checks to verify dir items are correct · 22a94d44

由 Josef Bacik 提交于 3月 16, 2011

We need to make sure the dir items we get are valid dir items.  So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane.  Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

22a94d44

01 2月, 2011 3 次提交

btrfs: fix return value check of btrfs_start_transaction() · 98d5dc13

由 Tsutomu Itoh 提交于 1月 20, 2011

The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

98d5dc13

btrfs: checking NULL or not in some functions · 5df67083

由 Tsutomu Itoh 提交于 2月 01, 2011

Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

5df67083

Btrfs: catch errors from btrfs_sync_log · b31eabd8

由 Chris Mason 提交于 1月 31, 2011

btrfs_sync_log returns -EAGAIN when we need full transaction commits
instead of small log commits, but sometimes we were dropping the return
value.

In practice, we check for this a few different ways, but this is still a
bug that can leave off full log commits when we really need them.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

b31eabd8

29 1月, 2011 1 次提交

btrfs: fix several uncheck memory allocations · 2a29edc6

由 liubo 提交于 1月 26, 2011

To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.

We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

2a29edc6

22 11月, 2010 1 次提交

Btrfs: use dget_parent where we can UPDATED · 6a912213

由 Josef Bacik 提交于 11月 20, 2010

There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock.  This could cause problems with rename.  So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: NJosef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: NChris Mason <chris.mason@oracle.com>

6a912213

30 10月, 2010 2 次提交

Btrfs: cleanup warnings from gcc 4.6 (nonbugs) · 559af821

由 Andi Kleen 提交于 10月 29, 2010

These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

559af821

Btrfs: Fix variables set but not read (bugs found by gcc 4.6) · 411fc6bc

由 Andi Kleen 提交于 10月 29, 2010

These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings.

[akpm@linux-foundation.org: fix build.  Might have been bitrot]
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

411fc6bc

25 5月, 2010 1 次提交

Btrfs: Metadata ENOSPC handling for tree log · 4a500fd1

由 Yan, Zheng 提交于 5月 16, 2010

Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

4a500fd1

30 3月, 2010 1 次提交

include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6

由 Tejun Heo 提交于 3月 24, 2010

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: NTejun Heo <tj@kernel.org>
Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

5a0e3ad6

15 3月, 2010 1 次提交

Btrfs: change how we mount subvolumes · 73f73415

由 Josef Bacik 提交于 12月 04, 2009

This work is in preperation for being able to set a different root as the
default mounting root.

There is currently a problem with how we mount subvolumes.  We cannot currently
mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
default subvolume.  So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have

/
/snap1
/snap1/snap2

as your available volumes.  Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2.  To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list).  This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.

In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to.  For now this just
points at what has always been the default subvolme, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>.  I tested this out with the above scenario and it
worked perfectly.  Thanks,

mount -o subvol operates inside the selected subvolid.  For example:

mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt

/mnt will have the snap1 directory for the subvolume with id
256.

mount -o subvol=snap /dev/xxx /mnt

/mnt will be the snap directory of whatever the default subvolume
is.
Signed-off-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

73f73415

18 12月, 2009 1 次提交

Btrfs: Avoid orphan inodes cleanup while replaying log · c71bf099

由 Yan, Zheng 提交于 11月 12, 2009

We do log replay in a single transaction, so it's not good to do unbound
operations. This patch cleans up orphan inodes cleanup after replaying
the log. It also avoids doing other unbound operations such as truncating
a file during replaying log. These unbound operations are postponed to
the orphan inode cleanup stage.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

c71bf099

16 12月, 2009 2 次提交

Btrfs: Rewrite btrfs_drop_extents · 920bbbfb

由 Yan, Zheng 提交于 11月 12, 2009

Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
avoid calling lock_extent within transaction.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

920bbbfb

Btrfs: Avoid superfluous tree-log writeout · 8cef4e16

由 Yan, Zheng 提交于 11月 12, 2009

We allow two log transactions at a time, but use same flag
to mark dirty tree-log btree blocks. So we may flush dirty
blocks belonging to newer log transaction when committing a
log transaction. This patch fixes the issue by using two
flags to mark dirty tree-log btree blocks.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

8cef4e16

14 10月, 2009 4 次提交

Btrfs: properly wait log writers during log sync · 86df7eb9

由 Yan, Zheng 提交于 10月 14, 2009

A recently fsync optimization make btrfs_sync_log skip calling
wait_for_writer in the single log writer case. This is incorrect
since the writer count can also be increased by btrfs_pin_log.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

86df7eb9

Btrfs: streamline tree-log btree block writeout · 690587d1

由 Chris Mason 提交于 10月 13, 2009

Syncing the tree log is a 3 phase operation.

1) write and wait for all the tree log blocks for a given root.

2) write and wait for all the tree log blocks for the
tree of tree log roots.

3) write and wait for the super blocks (barriers here)

This isn't as efficient as it could be because there is
no requirement to wait for the blocks from step one to hit the disk
before we start writing the blocks from step two.  This commit
changes the sequence so that we don't start waiting until
all the tree blocks from both steps one and two have been sent
to disk.

We do this by breaking up btrfs_write_wait_marked_extents into
two functions, which is trivial because it was already broken
up into two parts.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

690587d1

Btrfs: avoid tree log commit when there are no changes · 257c62e1

由 Chris Mason 提交于 10月 13, 2009

rpm has a habit of running fdatasync when the file hasn't
changed.  We already detect if a file hasn't been changed
in the current transaction but it might have been sent to
the tree-log in this transaction and not changed since
the last call to fsync.

In this case, we want to avoid a tree log sync, which includes
a number of synchronous writes and barriers.  This commit
extends the existing tracking of the last transaction to change
a file to also track the last sub-transaction.

The end result is that rpm -ivh and -Uvh are roughly twice as fast,
and on par with ext3.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

257c62e1

Btrfs: only write one super copy during fsync · 4722607d

由 Chris Mason 提交于 10月 13, 2009

During a tree-log commit for fsync, we've been writing at least
two copies of the super block and forcing them to disk.

The other filesystems write only one, and this change brings us on
par with them.  A full transaction commit will write all the super
copies, so we still have redundant info written on a regular
basis.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

4722607d

09 10月, 2009 1 次提交

Btrfs: optimize fsync for the single writer case · ff782e0a

由 Josef Bacik 提交于 10月 08, 2009

This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie
for new people to join the logging transaction if there is only ever 1 writer.
This helps a little bit with latency where we have something like RPM where it
will fdatasync every file it writes, and so waiting the 1 jiffie for every
fdatasync really starts to add up.
Signed-off-by: NJosef Bacik <jbacik@redhat.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

ff782e0a

22 9月, 2009 2 次提交

Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c

由 Yan, Zheng 提交于 9月 21, 2009

This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

76dda93c

Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8

由 Yan, Zheng 提交于 9月 21, 2009

The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.

Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

13a8a7c8

21 9月, 2009 1 次提交

trivial: remove unnecessary semicolons · a419aef8

由 Joe Perches 提交于 8月 18, 2009

Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

a419aef8