Commits · 934d375bacf9ea8a37fbfff5f3cf1c093f324095 · openanolis / cloud-kernel

09 Dec, 2008 1 commit

Btrfs: Use map_private_extent_buffer during generic_bin_search · 934d375b

Committed by Chris Mason on Dec 08, 2008

It is possible that generic_bin_search will be called on a tree block
that has not been locked.  This happens because cache_block_group skips
locking on the tree blocks.

Since the tree block isn't locked, we aren't allowed to change
the extent_buffer->map_token field.  Using map_private_extent_buffer
avoids any changes to the internal extent buffer fields.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

934d375b

02 Dec, 2008 1 commit

Btrfs: make things static and include the right headers · b2950863

Committed by Christoph Hellwig on Dec 02, 2008

Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.
Signed-off-by: Christoph Hellwig <hch@lst.de>

b2950863

19 Nov, 2008 1 commit

Btrfs: Some fixes for batching extent insert. · b4eec2ca

Committed by Liu Hui on Nov 18, 2008

In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batch of
inserts. If so, it should just break from the loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpected place.

There are also some trivial fixes in this patch, including one comment
typo and the deletion of two redundant lines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b4eec2ca

18 Nov, 2008 1 commit

Btrfs: Seed device support · 2b82032c

Committed by Yan Zheng on Nov 17, 2008

A seed device is a special btrfs with the SEEDING super flag
set; it can only be mounted in read-only mode. Seed
devices allow people to create a new btrfs on top of them.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

2b82032c

13 Nov, 2008 2 commits

Btrfs: batch extent inserts/updates/deletions on the extent root · f3465ca4

Committed by Josef Bacik on Nov 12, 2008

While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions. This patch aims to try
and reduce that problem. This happens in two ways:

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion. This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf. On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert. In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update. And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator perform more uniformly as it fills up the disk. There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COWs than on an empty fs, but the drop is not
nearly as severe as it was before.
Signed-off-by: Josef Bacik <jbacik@redhat.com>

f3465ca4
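
The batching idea behind btrfs_insert_some_items (take arrays of keys and data sizes, squeeze in as many as fit, return the count) can be sketched roughly as below. This is only an illustration under stated assumptions, not the btrfs implementation: `count_items_that_fit` and `ITEM_HEADER_SIZE` are hypothetical names, the header size is an illustrative constant, and a real leaf insert would also shift item data and keys.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed per-item header cost; illustrative value only. */
#define ITEM_HEADER_SIZE 25u

/*
 * Return how many leading entries of data_sizes[] fit into free_space
 * bytes of a leaf, charging a fixed header per item. The scan stops at
 * the first item that does not fit, so the caller can retry the rest
 * on another leaf.
 */
static size_t count_items_that_fit(size_t free_space,
                                   const unsigned int *data_sizes,
                                   size_t nr)
{
    size_t used = 0;
    size_t i;

    assert(data_sizes != NULL || nr == 0);

    for (i = 0; i < nr; i++) {
        size_t need = (size_t)data_sizes[i] + ITEM_HEADER_SIZE;

        /* used never exceeds free_space, so this cannot underflow. */
        if (need > free_space - used)
            break;
        used += need;
    }
    return i;
}
```

A caller would insert the first `count_items_that_fit(...)` items into the current leaf and search again for the remainder, which is the saving the commit message describes.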

Btrfs: Improve metadata read latencies · 6f3577bd

Committed by Chris Mason on Nov 13, 2008

This fixes latency problems on metadata reads by making sure they
don't go through the async submit queue, and by tuning down the amount
of readahead done during btree searches.

Also, the btrfs bdi congestion function is tuned to ignore the
number of pending async bios and checksums pending.  There is additional
code that throttles new async bios now and the congestion function
doesn't need to worry about it anymore.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6f3577bd

30 Oct, 2008 2 commits

Btrfs: nuke fs wide allocation mutex V2 · 25179201

Committed by Josef Bacik on Oct 29, 2008

This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space rebalancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>

25179201

Btrfs: Improve space balancing code · f82d02d9

Committed by Yan Zheng on Oct 29, 2008

This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
when data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

f82d02d9

09 Oct, 2008 1 commit

Btrfs: Remove offset field from struct btrfs_extent_ref · 3bb1a1bc

Committed by Yan Zheng on Oct 09, 2008

The offset field in struct btrfs_extent_ref records the position
inside the file at which the file extent is referenced. In the new back
reference system, tree leaves holding references to file extents
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are being deleted.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

3bb1a1bc

02 Oct, 2008 1 commit

Btrfs: don't read leaf blocks containing only checksums during truncate · 323ac95b

Committed by Chris Mason on Oct 01, 2008

Checksum items take up a significant portion of the metadata for large files.
It is possible to avoid reading them during truncates by checking the keys in
the higher level nodes.

If a given leaf is followed by another leaf where the lowest key is a checksum
item from the same file, we know we can safely delete the leaf without
reading it.

For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
the file with a cold cache.  It is read bound during the run.

With this change, Btrfs is able to delete the file in 0.5s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

323ac95b

30 Sep, 2008 1 commit

Btrfs: add and improve comments · d352ac68

Committed by Chris Mason on Sep 29, 2008

This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work to do in cleaning and commenting the code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d352ac68

26 Sep, 2008 2 commits

Btrfs: update space balancing code · 1a40e23b

Committed by Zheng Yan on Sep 26, 2008

This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1a40e23b

Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed

Committed by Zheng Yan on Sep 26, 2008

* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated.  This allows us to do accounting for them properly.

* The balancing code relocates data extents independent of the underlying
inode.  The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl.  It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules.  Transaction start goes outside
the drop_mutex.  This allows btrfs_commit_transaction to directly
drop the relocation trees.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5b21f2ed

25 Sep, 2008 27 commits

Btrfs: Full back reference support · 31840ae1

Committed by Zheng Yan on Sep 23, 2008

This patch makes the back reference system explicitly record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

31840ae1

Btrfs: free space accounting redo · 0f9dd46c

Committed by Josef Bacik on Sep 23, 2008

1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.

4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0f9dd46c
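
The two-step lookup described in point 1 (try at or after the hint offset first, then fall back to a size-based search that prefers the tightest fit) might look roughly like this. It is a simplified sketch, not the btrfs code: a sorted array stands in for the two rb-trees, the fallback ignores distance from the hint, and `struct free_extent` and `find_free_space` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One free-space region; the array passed in is assumed sorted by offset. */
struct free_extent {
    unsigned long long offset;
    unsigned long long bytes;
};

/*
 * Return the index of the extent chosen for an allocation of `bytes`,
 * or -1 if nothing is large enough.
 */
static int find_free_space(const struct free_extent *ext, int nr,
                           unsigned long long hint,
                           unsigned long long bytes)
{
    int best = -1;
    int i;

    assert(ext != NULL || nr == 0);

    /* Pass 1: first extent at or after the hint that is large enough. */
    for (i = 0; i < nr; i++) {
        if (ext[i].offset >= hint && ext[i].bytes >= bytes)
            return i;
    }

    /*
     * Pass 2: size-based fallback. Pick the smallest extent that still
     * fits, so small requests do not carve pieces off huge regions.
     */
    for (i = 0; i < nr; i++) {
        if (ext[i].bytes >= bytes &&
            (best < 0 || ext[i].bytes < ext[best].bytes))
            best = i;
    }
    return best;
}
```

With rb-trees keyed by offset and by size, each pass becomes a tree lookup instead of a linear scan; the array version only shows the selection order.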

Fix leaf overflow check in btrfs_insert_empty_items · f25956cc

Committed by Chris Mason on Sep 12, 2008

It was incorrectly adding an extra sizeof(struct btrfs_item) and causing
false positives (oops)
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f25956cc

Btrfs: trivial sparse fixes · b214107e

Committed by Christoph Hellwig on Sep 05, 2008

Fix a bunch of trivial sparse complaints.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b214107e

Btrfs: missing endianess conversion in insert_new_root · ad3d81ba

Committed by Christoph Hellwig on Sep 05, 2008

Add two missing endianess conversions in this function, found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ad3d81ba

Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5

Committed by Chris Mason on Sep 05, 2008

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e02119d5

btrfs_search_slot: reduce lock contention by cowing in two stages · 65b51a00

Committed by Chris Mason on Aug 01, 2008

A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old block over.

The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for the same contended block, but they
cache their prealloc for later use on lower levels in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

65b51a00

Btrfs: implement memory reclaim for leaf reference cache · bcc63abb

Committed by Yan on Jul 30, 2008

The memory reclaiming issue happens when snapshots exist. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After an old snapshot is completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bcc63abb

Btrfs: Add a leaf reference cache · 31153d81

Committed by Yan Zheng on Jul 28, 2008

Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

31153d81

Fix path slots selection in btrfs_search_forward · 9652480b

Committed by Yan on Jul 24, 2008

We should decrease the found slot by one, as btrfs_search_slot does,
when bin_search returns 1 and node level > 0.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9652480b
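
The slot adjustment described in this commit can be illustrated with a small stand-alone sketch. It assumes a bin_search that returns 0 on an exact match and 1 otherwise, leaving the slot at the first key greater than the target; `bin_search` and `pick_slot` here are simplified stand-ins written for illustration, not the kernel functions.

```c
#include <assert.h>

/*
 * Binary search over sorted keys. Returns 0 with *slot at the match,
 * or 1 with *slot at the index of the first key greater than target.
 */
static int bin_search(const unsigned long long *keys, int nr,
                      unsigned long long target, int *slot)
{
    int low = 0;
    int high = nr;

    while (low < high) {
        int mid = low + (high - low) / 2;

        if (keys[mid] < target) {
            low = mid + 1;
        } else if (keys[mid] > target) {
            high = mid;
        } else {
            *slot = mid;
            return 0;
        }
    }
    *slot = low;
    return 1;
}

/*
 * Pick the slot to follow. In an interior node (level > 0) each key is
 * the lowest key of its child, so on a miss the target lives in the
 * child just before the insertion point.
 */
static int pick_slot(const unsigned long long *keys, int nr, int level,
                     unsigned long long target)
{
    int slot;
    int ret;

    assert(nr > 0);
    ret = bin_search(keys, nr, target, &slot);
    if (ret == 1 && level > 0 && slot > 0)
        slot -= 1;
    return slot;
}
```

For interior keys {10, 20, 30} and target 25, the raw search lands on slot 2, and the adjustment selects slot 1, the child whose range starts at 20.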

Btrfs: Create orphan inode records to prevent lost files after a crash · 7b128766

Committed by Josef Bacik on Jul 24, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

7b128766

btrfs_next_leaf: do readahead when skip_locking is turned on · 0bd40a71

Committed by Chris Mason on Jul 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

0bd40a71

Btrfs: Add locking around volume management (device add/remove/balance) · 7d9eb12c

Committed by Chris Mason on Jul 08, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

7d9eb12c

Btrfs: Reduce contention on the root node · f9efa9c7

Committed by Chris Mason on Jun 25, 2008

This calls unlock_up sooner in btrfs_search_slot in order to decrease the
amount of work done with the higher level tree locks held.

Also, it changes btrfs_tree_lock to spin for a bit against the page lock
before scheduling. This makes a big difference in context switch rate under
highly contended workloads.

Longer term, a better locking structure is needed than the page lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f9efa9c7

Btrfs: Online btree defragmentation fixes · 3f157a2f

Committed by Chris Mason on Jun 25, 2008

The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3f157a2f

Btrfs: Add btree locking to the tree defragmentation code · e7a84565

Committed by Chris Mason on Jun 25, 2008

The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e7a84565

Btrfs: Replace the transaction work queue with kthreads · a74a4b97

Committed by Chris Mason on Jun 25, 2008

This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a74a4b97

Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. · 333db94c

Committed by Chris Mason on Jun 25, 2008

This lowers the impact of snapshot deletion on the rest of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

333db94c

Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it · 5cd57b2c

Committed by Chris Mason on Jun 25, 2008

Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5cd57b2c

Fix btrfs_next_leaf to check for new items after dropping locks · 168fd7d2

Committed by Chris Mason on Jun 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

168fd7d2

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Committed by Chris Mason on Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

594a24eb

Drop locks in btrfs_search_slot when reading a tree block. · 051e1b9f

Committed by Chris Mason on Jun 25, 2008

One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

051e1b9f

Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011

Committed by Chris Mason on Jun 25, 2008

Extent allocations are still protected by a large alloc_mutex.
Objectid allocations are covered by an objectid mutex.
Other btree operations are protected by a lock on individual btree nodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a2135011

Btrfs: Start btree concurrency work. · 925baedd

Committed by Chris Mason on Jun 25, 2008

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

925baedd
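
The top-down locking order described in this commit (lock the lower level, then release the level above, so that only the lowest level stays locked) is commonly called lock coupling. A minimal single-threaded sketch of the ordering follows; it is an illustration only. A boolean flag stands in for a real per-block lock, each node has a single child to keep it short, and `struct node`, `node_lock`, `node_unlock`, and `descend` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    bool locked;          /* stand-in for a real per-block lock */
    struct node *child;   /* single child keeps the sketch short */
};

static void node_lock(struct node *n)
{
    assert(!n->locked);
    n->locked = true;
}

static void node_unlock(struct node *n)
{
    assert(n->locked);
    n->locked = false;
}

/*
 * Walk from the root to the lowest level. The child is locked before
 * the parent is released, so the walker always holds at least one lock
 * on the path, and ends up holding only the lowest one.
 */
static struct node *descend(struct node *root)
{
    struct node *cur = root;

    node_lock(cur);
    while (cur->child != NULL) {
        struct node *next = cur->child;

        node_lock(next);    /* take the lower level first */
        node_unlock(cur);   /* then drop the level above */
        cur = next;
    }
    return cur;             /* caller must unlock this node */
}
```

Releasing the path in this model is just unlocking the node that `descend` returned.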

Btrfs: Allocator fix variety pack · 0ef3e66b

Committed by Chris Mason on May 24, 2008

* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
  next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
  twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
  try to allocate from devices that don't exist
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0ef3e66b

Btrfs: Handle write errors on raid1 and raid10 · 1259ab75

Committed by Chris Mason on May 12, 2008

When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes are detected.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1259ab75
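
The parent generation check mentioned above can be pictured as a comparison between the generation recorded in the parent's pointer and the one found in the block that was read. The sketch below is only a model under stated assumptions: `struct block_ptr`, `struct block_header`, and `generation_mismatch` are hypothetical names, and the real on-disk structures carry many more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer stored in a parent node: where the child is and when it was written. */
struct block_ptr {
    unsigned long long blocknr;
    unsigned long long generation;
};

/* Header of the block actually read from disk. */
struct block_header {
    unsigned long long generation;
};

/*
 * Return 0 if the block matches what the parent expects, 1 otherwise.
 * A mismatch means the copy on disk is from a different transaction,
 * for example because a write to this mirror was missed.
 */
static int generation_mismatch(const struct block_ptr *ptr,
                               const struct block_header *hdr)
{
    assert(ptr != NULL && hdr != NULL);
    return ptr->generation == hdr->generation ? 0 : 1;
}
```

On a mismatch, a filesystem with duplicate copies can try another mirror instead of trusting the stale block.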

Btrfs: Pass down the expected generation number when reading tree blocks · ca7a79ad

Committed by Chris Mason on May 12, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ca7a79ad

openanolis / cloud-kernel synced successfully about 1 year ago
