Commits · 261507a02ccba9afda919852263b6bc1581ce1ef · openeuler / Kernel

10 Jun, 2009 1 commit

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2

Committed by Yan Zheng on Jun 10, 2009

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
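The two cases above can be sketched with a toy model. This is only an illustration under stated assumptions: `Block`, `cow_block`, and the `freed` list are hypothetical names, not btrfs structures, and reference counts are plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy tree block: a reference count plus pointers to child extents."""
    name: str
    refs: int = 1
    children: list = field(default_factory=list)


def cow_block(old: Block, freed: list) -> Block:
    """Copy-on-write `old`, following the two cases described above."""
    new = Block(name=old.name + "'", children=list(old.children))
    if old.refs == 1:
        # Non-shared: free the old block at once. The new block inherits
        # the old block's references, so child counts stay untouched.
        old.refs = 0
        freed.append(old)
    else:
        # Shared: every extent the new block points to gains a reference,
        # and the old block loses the reference held by this tree.
        for child in new.children:
            child.refs += 1
        old.refs -= 1
    return new
```

In the non-shared case no child reference count changes at all, which is the saving the message describes.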

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
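A minimal sketch of that choice, assuming a back reference can be modelled as a small dictionary. The function name and field names are hypothetical and do not reflect the on-disk format.

```python
def make_backref(roots, parent_bytenr, key, level):
    """Pick the back reference form for a tree block (illustrative only).

    roots: ids of the trees that reference the block.
    """
    if len(roots) == 1:
        # Common case: one owner tree. Record the tree, key and level so
        # the pointer can be found again by searching that tree.
        return {"kind": "fuzzy", "root": roots[0], "key": key, "level": level}
    # Shared by several roots: record where the referencing block lives,
    # which avoids a search per snapshot.
    return {"kind": "full", "parent": parent_bytenr}
```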

This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
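The range lookup can be illustrated with a sorted list and binary search standing in for the red-black tree. `InodeCache` and its methods are hypothetical names used only for this sketch.

```python
import bisect


class InodeCache:
    """Sorted set of cached inode numbers (stand-in for an rbtree)."""

    def __init__(self):
        self._inos = []  # kept sorted, no duplicates

    def add(self, ino):
        i = bisect.bisect_left(self._inos, ino)
        if i == len(self._inos) or self._inos[i] != ino:
            self._inos.insert(i, ino)

    def in_range(self, lo, hi):
        """Return cached inode numbers with lo <= ino <= hi."""
        i = bisect.bisect_left(self._inos, lo)
        j = bisect.bisect_right(self._inos, hi)
        return self._inos[i:j]
```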

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

25 Apr, 2009 1 commit

Btrfs: simplify makefile · 2ea2544e

Committed by Christoph Hellwig on Apr 13, 2009

Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

25 Mar, 2009 1 commit

Btrfs: do extent allocation and reference count updates in the background · 56bec294

Committed by Chris Mason on Mar 13, 2009

The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem.  For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot.  btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations.  The tree is ordered by byte number of the extent and byte number
of the parent for the back reference.  The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.
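The three benefits listed above can be sketched as a queue keyed by (extent byte number, parent byte number). This is a simplified model, not the kernel's delayed reference code: `DelayedRefs` is a hypothetical name, a dictionary plus `sorted()` stands in for the rbtree, and cancelling opposite updates is an assumption about how modifications get reduced.

```python
class DelayedRefs:
    """Pending reference count updates, applied later in sorted order."""

    def __init__(self):
        self._pending = {}  # (bytenr, parent) -> net reference delta

    def queue(self, bytenr, parent, delta):
        """Record an update without touching the extent tree yet."""
        key = (bytenr, parent)
        total = self._pending.get(key, 0) + delta
        if total == 0:
            # An increment and a decrement on the same back reference
            # cancel out, so nothing needs to be written for it.
            self._pending.pop(key, None)
        else:
            self._pending[key] = total

    def run(self):
        """Return the updates ordered by bytenr, then parent, and reset."""
        ordered = sorted(self._pending.items())
        self._pending.clear()
        return ordered
```

Callers on the search path only call `queue()`. The expensive work happens in `run()`, which must be called before the transaction commits.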

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction.  Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

30 Oct, 2008 1 commit

Btrfs: Add zlib compression support · c8b97818

Committed by Chris Mason on Oct 29, 2008

This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.
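A minimal sketch of that fallback using Python's `zlib` module. The `state` dictionary and its `nocompress` key are hypothetical stand-ins for the per-file flag.

```python
import zlib


def maybe_compress(data: bytes, state: dict) -> bytes:
    """Compress `data` unless the file was already flagged as incompressible."""
    if state.get("nocompress"):
        return data
    packed = zlib.compress(data)
    if len(packed) >= len(data):
        # Compression did not help: remember that and store the data as-is.
        state["nocompress"] = True
        return data
    return packed
```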

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.
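As a worked example of the 256k limit, a range of file data can be split into pieces before compression. The function name is hypothetical, and the 128k limit on the compressed result is not modelled here.

```python
MAX_UNCOMPRESSED = 256 * 1024  # software limit quoted above


def split_ranges(start, length, limit=MAX_UNCOMPRESSED):
    """Yield (offset, size) pieces no larger than `limit` bytes."""
    end = start + length
    while start < end:
        size = min(limit, end - start)
        yield (start, size)
        start += size
```

A 600k range, for instance, becomes two 256k pieces followed by one 88k piece.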

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

09 Oct, 2008 1 commit

Btrfs: Fix makefile for building btrfs static · 61f8c86e

Committed by Sage Weil on Oct 09, 2008

This fixes the btrfs makefile for building in the tree and out of the tree
both as a module and static.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

30 Sep, 2008 1 commit

Btrfs: add and improve comments · d352ac68

Committed by Chris Mason on Sep 29, 2008

This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work to do in cleaning and commenting the code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

26 Sep, 2008 1 commit

Update Btrfs files for in-kernel usage · b4f6c45d

Committed by Chris Mason on Sep 24, 2008

btrfs had magic to put the changeset id into a printk on module load.
This removes that from the Makefile and hardcodes the printk to print
"Btrfs"
Signed-off-by: Chris Mason <chris.mason@oracle.com>

25 Sep, 2008 19 commits

Btrfs: free space accounting redo · 0f9dd46c

Committed by Josef Bacik on Sep 23, 2008

1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replace it with an rb-tree that tracks the block group cache via offset. also
add a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.

4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
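The two-index lookup from point 1 can be sketched as follows. This is a toy model under stated assumptions: sorted lists stand in for the two rb-trees, `FreeSpace` and `find` are hypothetical names, only extents starting at or after the hint are considered in the first pass, and the tie-breaking order in the fallback (closest size first, then closest offset) is a guess at the intent.

```python
import bisect


class FreeSpace:
    """Free extents of one block group, indexed by offset and by size."""

    def __init__(self, extents):
        # extents: iterable of (offset, size) pairs
        self.by_offset = sorted(extents)
        self.by_size = sorted((size, off) for off, size in self.by_offset)

    def find(self, hint, size):
        """Return the offset of a free extent for `size` bytes, or None."""
        # Pass 1: first extent at or after the hint that is large enough.
        i = bisect.bisect_left(self.by_offset, (hint, 0))
        for off, sz in self.by_offset[i:]:
            if sz >= size:
                return off
        # Pass 2: fall back on the size index, preferring the tightest fit
        # and then the extent closest to the hint.
        j = bisect.bisect_left(self.by_size, (size, 0))
        fits = self.by_size[j:]
        if not fits:
            return None
        sz, off = min(fits, key=lambda e: (e[0] - size, abs(e[1] - hint)))
        return off
```

Preferring the tightest fit in the fallback avoids breaking small chunks off of huge free areas.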

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and makes a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5

Committed by Chris Mason on Sep 05, 2008

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: compile when posix acl's are disabled · eab922ec

Committed by Josef Bacik on Aug 28, 2008

This patch makes btrfs so it will compile properly when acls are disabled. I
tested this and it worked with CONFIG_FS_POSIX_ACL off and on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Switch btrfs_name_hash() to crc32c · 615f996f

Committed by David Woodhouse on Aug 19, 2008

Date: Tue, 19 Aug 2008 19:21:57 +0100
Using a 64-bit hash as the readdir cookie is just asking for trouble.
And gets it, when we try to export the file system by NFS.
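For illustration, a bitwise CRC-32C (Castagnoli polynomial) produces a 32-bit value that fits comfortably in a readdir cookie. This is a generic sketch of the checksum only. The seed and finalization that `btrfs_name_hash()` actually uses are not stated in this message, so the conventional all-ones initial value and final inversion used here are an assumption.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C using the reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift one bit out; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```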
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

NFS support for btrfs - v3 · be6e8dc0

Committed by Balaji Rao on Jul 21, 2008

Date: Mon, 21 Jul 2008 02:01:56 +0530
Here's an implementation of NFS support for btrfs. It relies on the
fixes which are going in to 2.6.28 for the NFS readdir/lookup deadlock.

This uses the btrfs_iget helper introduced previously.

[dwmw2: Tidy up a little, switch to d_obtain_alias() w/compat routine,
	change fh_type,	store parent's root object ID where needed,
	fix some get_parent() and fs_to_dentry() bugs]
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add a leaf reference cache · 31153d81

Committed by Yan Zheng on Jul 28, 2008

Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Create orphan inode records to prevent lost files after a crash · 7b128766

Committed by Josef Bacik on Jul 24, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add version strings on module load · b3c3da71

Committed by Chris Mason on Jul 23, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Start btree concurrency work. · 925baedd

Committed by Chris Mason on Jun 25, 2008

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.
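The top-down locking described above can be sketched with one lock per node. `Node` and `search` are hypothetical names, and the tree here is a simplified in-memory structure rather than the btrfs btree.

```python
import bisect
import threading


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # separator keys, or item keys in a leaf
        self.children = children  # None for a leaf
        self.lock = threading.Lock()


def search(root, key):
    """Descend to the leaf for `key`; only that leaf is locked on return."""
    node = root
    node.lock.acquire()
    while node.children is not None:
        child = node.children[bisect.bisect_right(node.keys, key)]
        child.lock.acquire()  # lock the lower level first ...
        node.lock.release()   # ... then release the level above
        node = child
    return node
```

The caller is responsible for releasing the returned leaf's lock, which mirrors releasing or freeing the path.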
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: split out ioctl.c · f46b5a66

Committed by Christoph Hellwig on Jun 11, 2008

Split the ioctl handling out of inode.c into a file of its own.
Also fix up checkpatch.pl warnings for the moved code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add async worker threads for pre and post IO checksumming · 8b712842

Committed by Chris Mason on Jun 11, 2008

Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system.  But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up.  The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines.  Two worker pools are created,
one for writes and one for endio processing.  The two could deadlock if
we tried to service both from a single pool.
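A rough user-space analogue of the two pools, using `concurrent.futures`. The pool sizes and function names are hypothetical, and `zlib.crc32` is only a stand-in for the checksum btrfs computes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Separate pools, so completion work never waits behind queued writes.
write_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="write")
endio_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="endio")


def checksum(data: bytes) -> int:
    return zlib.crc32(data)


def submit_write(data: bytes):
    """Queue pre-IO checksumming of outgoing data on the write pool."""
    return write_pool.submit(checksum, data)


def complete_read(data: bytes, expected: int):
    """Queue post-IO verification of incoming data on the endio pool."""
    return endio_pool.submit(lambda: checksum(data) == expected)
```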
Signed-off-by: Chris Mason <chris.mason@oracle.com>

btrfs: tiny makefile cleanup · 95c9eb17

Committed by Christoph Hellwig on Jun 10, 2008

Use normal kbuild syntax to build acl.o conditionally and remove commented-out
lines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for multiple devices per filesystem · 0b86a832

Committed by Chris Mason on Mar 24, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Split the extent_map code into two parts · d1310b2e

Committed by Chris Mason on Jan 24, 2008

There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_buffers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.
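The relationship between the two representations is simple arithmetic. The helpers below are hypothetical and assume the old `end` was inclusive, which this message does not state.

```python
def to_start_len(start: int, end: int) -> tuple:
    """Convert an inclusive [start, end] pair to [start, len]."""
    return start, end - start + 1


def to_start_end(start: int, length: int) -> tuple:
    """Convert [start, len] back to an inclusive [start, end] pair."""
    return start, start + length - 1
```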

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix compile on kernel without ACLs enabled · caaca38b

Committed by Yan on Jan 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add data=ordered support · dc17ff8f

Committed by Chris Mason on Jan 08, 2008

This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

xattr support for btrfs · 5103e947

Committed by Josef Bacik on Nov 16, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big. · 0f82731f

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Create extent_buffer interface for large blocksizes · 5f39d397

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

14 Sep, 2007 2 commits

Btrfs: Simplify makefile · 432eba08

Committed by Jan Engelhardt on Sep 14, 2007

Single-colons will do here.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: add modules_install target · 84a5d5ee

Committed by Chris Mason on Sep 14, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

30 Aug, 2007 1 commit

Btrfs: Add per-root block accounting and sysfs entries · 58176a96

Committed by Josef Bacik on Aug 29, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

28 Aug, 2007 1 commit

Btrfs: Extent based page cache code. This uses an rbtree of extents and tests · a52d9a80

Committed by Chris Mason on Aug 27, 2007

instead of buffer heads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

08 Aug, 2007 1 commit

Btrfs: Add run time btree defrag, and an ioctl to force btree defrag · 6702ed49

Committed by Chris Mason on Aug 07, 2007

This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

26 Jul, 2007 1 commit

Btrfs: cleaner make clean · 8578f0f1

Committed by Joel Becker on Jul 25, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

12 Jun, 2007 1 commit

Btrfs: split up super.c · 39279cc3

Committed by Chris Mason on Jun 12, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

26 Mar, 2007 1 commit

Btrfs: add a radix back bit tree · 8ef97622

Committed by Chris Mason on Mar 26, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

23 Mar, 2007 2 commits

Btrfs: transaction rework · 79154b1b

Committed by Chris Mason on Mar 22, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Mountable btrfs, with readdir · e20d96d6

Committed by Chris Mason on Mar 22, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

21 Mar, 2007 3 commits

Btrfs: initial move to kernel module land · 2e635a27

Committed by Chris Mason on Mar 21, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Better block record keeping, real mkfs · 1261ec42

Committed by Chris Mason on Mar 20, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add inode map, and the start of file extent items · 9f5fae2f

Committed by Chris Mason on Mar 20, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

20 Mar, 2007 1 commit

Btrfs: add transaction.h to the Makefile · 631d7d95

Committed by Chris Mason on Mar 20, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>
